Discussion about the structure of the new Search API

drunken monkey's picture

Since I've virtually flooded this group with my topics over the last few weeks, most regular Search group readers will probably be aware that I'm planning (in the course of GSoC) to create a new Search API. Now, since an API is pretty much useless without anyone actually using it, there should of course be a discussion involving all "stakeholders" on where this should go, prior to any coding.
This is said discussion.

I have of course done some research in the past weeks, so I already have a bunch of ideas. Therefore I'll first present those and then follow up with open questions for each particular area. Of course, comments and criticism outside of these questions are also allowed and encouraged. ;)
My plan naturally incorporates many ideas from the D8 Core Search discussion here on g.d.o. Another main source of inspiration is Young Hahn's emerging searchlight module, whose goal is quite similar to mine; searchlight is just more specialized, concentrating on Views instead of even more flexibility. Lastly, my previous Apachesolr RDF project (even though it turned out to be pretty useless) contains quite a number of elements which I'd find suitable for this project, too. It also provides some clear "DON'T!"s, the true measure of experience. ;)

Overall procedure for creating searches

(Terms are as used in my apachesolr_rdf module, but see open questions.)
The user (i.e., a site admin) first creates a new search server. He selects the backend to use (e.g. Solr, Sphinx, simple DB implementation, …), configures it (e.g. host/port/path for Solr, maybe different modes available, etc.) and sets its name.
He can then create one or more indexes for that server. He would have to select the data type (e.g., nodes or users, provided by the entity data source) and a previously created server. There will of course be a lot of additional configuration possibilities. Then, when cron is run, items of that type (provided by the data source) will be added to that index and be returned in all subsequent searches.
Finally, so that any of this has a point, he will have to create a search. He selects the index to use, the path where the search should be displayed and other options (name, advanced fields to use, filter queries to always be appended, necessary permissions, …). (It would also be possible to create a search without a path, which could then only be executed programmatically, by other modules providing such functionality.)
For indexes and searches, the plugins to use at different points in the indexing / search process can also be defined, configured and ordered.
End users (i.e., non-admins) only ever see the searches, without knowing the server, index or backend behind it.
(Internal detail: Creating a search just lets the searchapi module add a menu item for the given path, which then asks for and processes user input, redirects to an appropriate search URL, extracts the search parameters from the URL and then calls a function that actually executes the search. If this is implemented in a clean way, other modules can easily provide other means of searching and execute searches programmatically. Strong coupling, as currently in search_get_keys(), should be avoided at all costs.)
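
As a rough sketch of that internal detail, the menu items could be generated about like this (all the searchapi_* names are hypothetical, purely for illustration):

  function searchapi_menu() {
    $items = array();
    // One menu item per user-created search, at its configured path.
    foreach (searchapi_search_load_all() as $search) {
      $items[$search->path] = array(
        'title' => $search->name,
        'page callback' => 'searchapi_search_page',
        'page arguments' => array($search->id),
        'access arguments' => array('access content'),
      );
    }
    return $items;
  }

The page callback would then display the search form, extract the parameters from the URL and call the same search execution function that other modules can call programmatically.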

Open questions

What should the objects (server, index, backend, search, filter, …) be called to make their meaning as clear as possible to the user? I got some negative feedback regarding their usage in the apachesolr_rdf module. (And especially "server" is probably unsuitable for some backends.)
Are searches created through other modules (i.e. not using the default "menu item" implementation) also managed by searchapi? Or, asked another way: Is the "menu item" implementation also just a plugin, and when creating a search one could also select a different one?
Should it be possible to create a search that uses more than one index and combines the results? Should this be possible by default, or should it be possible for modules to somehow add this functionality? Should this only be possible for indexes on a single server or at least using the same backend, or for just any indexes?

Additional ideas

The search itself could use a plugin to parse the user's input, thereby allowing different search syntaxes to be defined by modules and used by sites (maybe even let the user decide which one he wants to use).
Servers, indexes and searches can be deactivated separately (although deactivating a server also deactivates associated indexes and searches). This would, e.g., allow stopping indexing data for some time while still using searches on that index.

Object orientation

Functionality that individual backends have to provide will be specified as interfaces; one or more abstract classes implementing most of these methods in a generic way will also be provided. This is almost identical to searchlight's approach, but using interfaces where appropriate.
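
A minimal sketch of how that could look (all names here are invented for illustration):

  interface SearchApiBackendInterface {
    // Add the given items to the index.
    public function indexItems($index, array $items);
    // Remove the given items from the index.
    public function deleteItems($index, array $ids);
    // Execute a search on the index and return the results.
    public function search($index, $query);
  }

  abstract class SearchApiAbstractBackend implements SearchApiBackendInterface {
    protected $options = array();

    // Generic configuration handling that most backends can share;
    // concrete backends only override what really differs.
    public function setOptions(array $options) {
      $this->options = $options;
    }
  }

Concrete backends (Solr, Sphinx, the database implementation, …) would then only have to implement or override the parts that really differ.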

Open questions

Should there be separate interfaces/objects for indexing and searching, or is there no real point?
Where/how should these classes be defined? hook_searchapi_backends() returning information on files and classes (so a module would have to be created for new backends)? Or collect as many backends as possible in the searchapi module itself?
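For illustration, the first variant could look roughly like this (the array keys and the Sphinx example are made up):

  function mymodule_searchapi_backends() {
    return array(
      'sphinx' => array(
        'name' => t('Sphinx'),
        'description' => t('Uses a Sphinx server for indexing and searching.'),
        'file' => drupal_get_path('module', 'mymodule') . '/sphinx.backend.inc',
        'class' => 'SphinxBackend',
      ),
    );
  }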

Extension points

Extension/Plugin points for
- data source (entities are default, but also any other source possible, like views, pages, RDF data, ...)
- data alteration (e.g., add data for comments or attachments to a node, or profile fields to a user; see the sketch after this list)
- pre-processing (stemming, word-splitting, …) of indexed items and search queries (should also be configurable per backend; e.g., Solr handles pre-processing itself and might be confused by being served already processed data)
- alteration of search queries (introduce custom sort, facets, etc.)
- post-processing/ranking/sorting (might use/alter ranking data provided by backend)
- displaying search results (maybe this should use additional plugins for creating excerpts and highlighting)
Defaults are not hard-coded; there is just searchapi's implementation of those plugins, which will be activated by default. Everything should be configurable by users (i.e., site admins, although e.g. the ranking mode could even be selected by normal users) and by modules. Also, the active data source and backend might want to switch individual plugins on or off.
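
As an example of the data alteration point above, a callback that adds comment data to indexed nodes could look about like this (the hook name and the item structure are hypothetical):

  function mymodule_searchapi_alter_items($index, array &$items) {
    foreach ($items as $item) {
      if ($item->entity_type == 'node') {
        // Make published comment subjects searchable along with the node.
        $subjects = db_query('SELECT subject FROM {comment} WHERE nid = :nid AND status = :status',
          array(':nid' => $item->entity->nid, ':status' => COMMENT_PUBLISHED))->fetchCol();
        $item->comments = implode(' ', $subjects);
      }
    }
  }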

Open questions

Should these be hooks or plugin objects?
Should the plugins be handed single data items individually or the whole array at once? (And: does that matter much?)
Priority of options set by users, modules and other plugins? Who can override whom?

Additional ideas

When executing a search programmatically, an array of custom plugins could optionally be passed to it. (At least if they are objects; I don't know how this could work with hooks.) At each extension point, the search mechanism would then determine whether one of those objects is suitable (using instanceof) and call its corresponding methods, instead of using the options set by the user (at extension points where only one plugin is used, e.g. the data source).
Where there are several plugins executed sequentially (pre- and post-processing, …), a weight should be defined by the plugin and be alterable by the user, to determine order.

Data source

The default implementation uses entity_metadata. A data source would have to provide information on what kinds of data are available and their respective attributes (metadata that is helpful for the user or other parts of the search framework). Especially, information on the fields of each entity and their datatypes would be needed. The data source is also responsible for maintaining information on what items still need to be indexed for each created index. (Something like $dataSource->getItemsToIndex($index, $numItems) would be used to retrieve them at index time; see the sketch below.)
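
A rough sketch of the resulting interface (the method names are of course only suggestions):

  interface SearchApiDataSourceInterface {
    // Metadata: which item types are available?
    public function getTypes();
    // Metadata: which fields does this type have, and of which datatypes?
    public function getPropertyInfo($type);
    // Index status: retrieve up to $numItems items that still have to be
    // indexed for the given index.
    public function getItemsToIndex($index, $numItems);
    // Index status: mark items as changed, so they will be re-indexed.
    public function trackItemChange($type, array $ids);
  }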

Open questions

Can encapsulating general information (what is available, how it can be retrieved) and maintaining index status be separated into two different objects for greater flexibility? Would this be reasonable?
Should index information (at least in the default implementation) be kept in a single table, or in individual ones for each index?

Language awareness

Language awareness is very important, but how should it be implemented? With different indexes for different languages? Should they be created manually by the user (if more than one language is available, ask the user when she creates the index) or automatically (when an index is created by the user, "secretly" create one for each language)? Does this make sense for all data, or just some (probably "decided" by the data source)?
Should plugins define what languages they can/should be used in? If yes: how? (Both whitelist and blacklist approaches would not really suffice if plugin developers are not to be forced to examine every language in existence.) Or should we let users decide and just urge plugin developers to provide some verbal clues in the plugin's description?
And: Should this be planned in from the beginning, or would it be possible to add it later, when the core functionality is basically done? (The latter would allow us to think about how to implement this, when we already can clearly see what is possible / necessary, and where.)

OK, this turned out to be a lot more material than I thought it would be, but at least that makes for a good base for discussion. So please, comment on the ideas, add your own ideas and thoughts, and discuss the hell out of them. ;)

I'll probably later include results of this discussion in the project's Wiki page, so others can also easily add or correct things. But other ideas on how to present this better and more comfortably would also be appreciated.

Comments

@search-API: I'm not sure I'd

fago's picture

@search-API:
I'm not sure I'd expect a search API to provide the possibility to create searches. Perhaps just leave that up to views or maybe put it in another module?
Apart from that, I'd suggest making the server and index definitions exportable. As you already use the metadata stuff, you could use the entity CRUD API for that (you can define entities to be exportable).

Where there are several plugins executed sequentially (pre- and post-processing, …),

hm, are concrete pre+post processing plugins really necessary? I don't know, but I'd avoid introducing concepts if there is no good cause to introduce them, as it could quickly overcomplicate things. E.g., if separate plugins aren't necessary, what about giving the indexer the possibility to provide a configuration form and just serializing its settings for storage? Then those settings are available to the implementation and it may use them whenever needed, so it can obey the pre-processing settings.

@data-sources:
Sounds interesting. Well, the key point here is the metadata, I think, as you'll need it for views. If possible, I'd suggest you stick with the metadata format of entity-metadata. It should contain everything you need, but if necessary we could look into extending it.

I also use the same way of specifying metadata in Rules 7.x-2.x to allow describing further data structures that are not entities. Together with the associated getter/setter callbacks it should be possible to deal with any data, and you can use the wrappers provided by entity metadata for it.

  • displaying search results (maybe this should use additional plugins for creating excerpts and highlighting)

Why not stick to views for displaying search results? I'd have thought of your work as the "indexer abstraction" part only and leave creating + displaying searches to views.

Views vs. simple display

jhodgdon's picture

One small point: The objective here is to create something that could eventually go into core. So it will need to have a simple way to display search results. It could also allow using Views, but it shouldn't depend on views.

One small point: The

fago's picture

One small point: The objective here is to create something that could eventually go into core. So

That's news to me! I thought of the project as an abstraction for different search indexers/backends, not a replacement of the core search API. So what's the goal? A good point to clarify ;)

@pre-post-plugins:
For a complete search API that makes sense. For the indexing abstraction it's imho out of scope (but of course a nice addition). However, different indexers like Solr have their own (Java) preprocessors which can't be implemented as such plugins anyway, so there should be the possibility of per-indexer settings.

@views
I never said it should depend on views; I just said it should leverage views for stuff it already provides. We have views for creating multiple searches; we don't need another module doing that.

Change in scope

drunken monkey's picture

Oh, yeah, I somewhat changed the concrete scope / goal of the project. Since major improvements (especially in modularity/flexibility, pretty much this project's main goal, too) were planned for search in D8 anyway, and my project coincided with a lot of them, joining these efforts seemed the way to go.
I'm not really trying to develop a module straight for core, but if the project is a success large parts of it could replace core search in D8, or help in developing these parts for core search.
In effect, this just means the project will be a bit more general, also provide said search creation capabilities and should minimize external dependencies.

pre-processing on server

jhodgdon's picture

One other comment on what you've written here: It's best, in my opinion, if the pre- and post-processing is totally separate from the server. That way you can write the English-language linguistic stemming module once, and use it no matter if you are using Solr, Lucene, or some simple search back end server. And you could swap out your back end and still have searches behaving the same way.

Pre-processing is VERY language dependent, and I don't think we can expect or hope that all back ends would implement it for all languages.

General comments

jhodgdon's picture

I have a few miscellaneous comments, some of which may be obvious to you already:

1) You have indicated you want to use object orientation here, and didn't mention Drupal's hook system. Probably/hopefully you plan to use the hook system to let a module say "Here, I'm giving you a class that implements a search back end", right? Also, some of the steps are better suited (in my opinion) to simply defining functions (i.e. hooks) rather than implementing a class. If it's just one function, a class seems to be overkill. So I personally wouldn't go overboard on requiring everything to be a class.

2) Pre- and post-processing: It has been suggested that a system like Drupal's filtering system could be used for pre- and post-processing. I think this is a good idea. It must be language-dependent -- e.g., if you're processing Spanish, run it through the Spanish stemmer, and if you're doing English, use the English stemmer. So maybe, to generalize, you could make a system where you would define the order of the pre-processing filters, and have several contingencies in the configuration? Or it might make more sense to have some kind of a hook_preprocessor_info() where a processor could say "I operate on English", "I operate on nodes", etc. (a sketch follows after this list).

3) Another step that you didn't really elaborate on is how the items are rendered to send the information to the server for indexing. There are several possibilities, including passing them through the theming system somehow, or calling a function like node_view() (or whatever the equivalent would be for other data types). It would be useful if this rendering choice were also pluggable, though possibly only certain rendering mechanisms would work with certain data types.

4) Are you aware that currently the user search in the core Search module doesn't do any search indexing at all? It's a straight query on the user table. Just a comment. No provision has been made in D7 core to search fields on users, either.

5) Definitely I think it would be useful to be able to select one or more data types to be indexed together in one search setup. I don't really see a problem with doing this, except that the "advanced" fields would probably not apply to all data types (but that is true even if you restrict searching to nodes -- for instance not all content types have the same taxonomies or fields).

6) Pre-processing (before indexing), search querying, and post-processing of results are all inter-related. For instance, if you stem all words down to their linguistic roots before indexing some content, then you have to do the same pre-processing step on the search keywords, or you won't find a match (i.e. you have to run preprocessing steps on your keywords before passing them to the search engine). And then when you are making your search excerpt that is highlighting matches, you would need to match on pre-processed text or you wouldn't find matches. So the preprocessor has to be allowed the ability to highlight during post-processing.

7) Languages: I think the best approach is to mark each item stored in the index as to its language (or mark it "language-neutral" if it isn't language-specific or if the language is unknown). Then when someone does a search, you can use Drupal's standard language detection to figure out what language they are currently using (or let them switch), so that someone is always searching in a particular language. As I mentioned above, pre-processing is definitely language-specific, and you have to pre-process the keywords as well as the text being indexed. So you really have to build that awareness into the system at a low level, I think.
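
To illustrate the hook_preprocessor_info() idea from point 2, an implementation could look roughly like this (the hook name, keys and callbacks are all hypothetical, just following the suggestion above):

  function porterstemmer_preprocessor_info() {
    return array(
      'porter_stemmer' => array(
        'name' => t('Porter stemmer'),
        // Declarations the framework could match against the content:
        'languages' => array('en'),
        'entity types' => array('node'),
        'callback' => 'porterstemmer_stem',
      ),
    );
  }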

Thanks for your detailed feedback!

drunken monkey's picture

Thanks for your detailed feedback! I hope this doesn't get too chaotic, but I'll directly answer most of your points.

ad 1) Oh, yes, of course. I completely forgot to mention this as I basically treated it as implied. So yes, there would be one or more hooks for defining search plugins. A good question here would be whether we use a single one, where you'd also have to specify the types of plugins in the returned array, or one hook for each plugin type.
Of course, if we decide that some plugins would be best as plain functions, those would probably each be a hook of their own, e.g. hook_searchapi_index_alter() or something like that. (I think this could be a single function, right? What else?) I'm also definitely against forced object-orientation where it's just overkill.

ad 2) I haven't worked with the filter system before, but it looks interesting. The only bad thing would be if search filters then cluttered the admin page for input formats, where they (typically) wouldn't be used…
The language-dependence is really complicated. I think the only feasible way will be to let the plugin object either tell the system what kind of data it can handle, or itself check if it can handle it once data is passed to it. The user herself would then be responsible for enabling stemmers, etc., for every language (and data type) that will show up on that site.
I'm not sure if this is what you were suggesting but I don't think other ways can easily be implemented.

ad 3) Sorry, this has puzzled me before: Why render the node/item? For indexing, a representation with separate fields is usually what we want, isn't it? The "data alteration" plugins would take care of adding all data not stored directly with the node/whatever, so I don't get this. (Not trying to argue, I'm really clueless.)

ad 4) Yes, I was aware, and it's probably a good thing to change.

ad 5) OK, we'll see how and where this fits in.

ad 6) Right. Possibly the best way would even be using a single plugin, with methods for all three of these related steps. When a plugin e.g. only affects indexing it would just not implement the other method(s).
However, highlighting seems to be quite a pita to implement, the more I think about it. It has to know about the preprocessors, but then again, if they are coupled, there will probably be trouble when either no or several preprocessors "want" to highlight. I'll probably have to read up a bit on how other sophisticated search systems handle this. I also don't know how well current search modules handle this.

ad 7) Yes, language will probably also be quite a challenge. Your approach seems best, but the pre-/post-processing and other plugins worry me… So yes, I should definitely imagine multi-language content right from the beginning and see how and where this affects things.

@ fago: Thanks for your comments, too! I'm with Jennifer on the built-in search capability and independence from views, as well as on the plugins for pre-/post-processing. But we should definitely talk about the entity_metadata capabilities.
And yes, I'll of course stick to one format for the data source, so other providers would probably have to use the one that entity_metadata defines.

Replies to your replies...

jhodgdon's picture

1) Multiple hooks are better than 1 hook, as of Drupal 7, I would say. Many of the multi-function hooks in D6 were split into pieces in D7. So I would make each hook have a particular purpose. See also: the Field API system of D7, which has zillions of hooks for modules to say "I do this part of the work".

2) I would suggest using a separate system for search pre/post process -- just model it on the existing (misnamed) "input filter" system, so you can for instance drag around to change the order of processing.

3) Sometimes the theme adds things that are not represented by fields. For instance, if you have a node reference field, the theme might load that node and display selected information from the referenced node. Or you might embed an entire view (of nodes that are never indexed by themselves) in the node display. In these cases, indexing the node and its fields doesn't give you the same text that, say, Google would get by indexing the rendered page. On many sites, the theme does this extra rendering, not modules, so I'm not sure how the "data alteration" plugins would work in this case? Or maybe one of them would allow you to render the content of the node according to the theme?

In any case, yes sometimes you want fields so you can do faceted/field-wise/advanced searching, but then again sometimes you just want to type in keywords and make sure they're found if they would appear on the page. I'm not advocating removing the fields, just saying that for indexing the main "body" of the node/page/whatever, for some applications it's useful to be able to pass it through the theme.

I can point you to some specific examples if you're interested...

6) You are correct, highlighting is difficult. The current core search module does not handle this at all well with preprocessing -- there's an issue on it that got pushed out to Drupal 8 (unfortunately). I did have a working patch on that issue, but ... well, it didn't make it in. Sigh. I do have a variation of that patch working in my Search by Page module (which adds a hook for preprocessing modules to affect search excerpts) and Porter Stemmer (which implements the hook to highlight with English stemming). It's handy being the maintainer of two search-related modules. :)

drunken monkey's picture

(SCNR)
ad 2) OK, so you mean just recycling some of its code? That's reasonable, of course. All by myself I probably couldn't program such a drag-and-drop thing anyways …

ad 3) Ah, didn't think of that. In this case, yeah, rendering does make sense. But since this isn't always necessary, maybe a plugin that renders the item and uses that information for the fulltext field would really be the best approach. Doesn't sound too complicated, either.

ad 6) Could you post/edit some links to these patches/functions? Sounds like a good starting point.

Hi all. Great discussion. If

cpliakas's picture

Hi all. Great discussion.

If I could weigh in on the pre/post processing functionality, one of the biggest performance-killers inside of Search Lucene API is the analyzer, which performs pre-processing functions such as stemming, stripping out stopwords, etc. As a direct result the highlighting mechanism is also incredibly slow, where sometimes it takes longer to highlight the 10 search result snippets than it does to execute the actual search. Although I agree with Jennifer that this would have the advantage of using one code base for all search backends, the point I am trying to make is that I would hesitate to move all of the pre/post process and highlight functionality into Drupal without fully exploring the performance implications of doing so.

As an example, the 7.x-3.x version of Search Lucene API integrates with the Elastic Search project, which utilizes the Lucene fast-vector-highlighter for incredibly efficient highlighting. To me, it would be better for the Search API to leverage the search backend's strengths than force it to do something else. If we are to go the pre/post processing inside of Drupal route, I would strongly recommend making it an option which will at least give admins the choice of using something native to the backend.

Thanks,
Chris

The point I am trying to make

Scott Reynolds's picture

The point I am trying to make is that I would hesitate to move all of the pre/post process and highlight functionality into Drupal without fully exploring the performance implications of doing so.

I was thinking that, from this discussion, the pre/post process plugins would produce configuration files for the various backends. So for Solr, it would produce the XML configuration for the analyzers per field. It wasn't my impression that they would "do" anything within Drupal.

That said, this

That way you can write the English-language linguistic stemming module once, and use it no matter if you are using Solr, Lucene, or some simple search back end server.

Seems to indicate that my assumption is wrong. The problem with this approach is that we are assuming that Drupal does this best. I just highly doubt that; Solr, Lucene, Sphinx and Xapian will most likely implement this better than we ever could. These plugins should produce 'configurations' that instruct the backend how to process each field. Configurations can be a file or a database save (variable_get/set).

/me is trying HARD not to make this a bike shed....

Really?

jhodgdon's picture

Are you saying that Lucene, Solr, Sphinx, and Xapian all have linguistic stemming modules for a wide variety of languages, and they all do them well? Hmmm.

Links...

jhodgdon's picture

The D8 planning page has links to issues I mentioned:
http://drupal.org/node/717654

In particular, the preprocessing section has two issues related to the preceding discussion.

2) I would suggest using a

robertDouglass's picture

2) I would suggest using a separate system for search pre/post process -- just model it on the existing (misnamed) "input filter" system, so you can for instance drag around to change the order of processing.

This is a great idea, one that I've long endorsed. If you look at the Lucene model, it's important to be able to process texts going into the index, and then to process search queries being issued. Often, the success of a search depends on these two processing phases being coordinated. For example, if you stem your input texts during indexing you have to stem your search queries before issuing the query on the index.

In light of this, a system similar to input formats would be applied to both indexed texts and to search queries.

This must be optional, though - as systems like Solr / Lucene will handle all of this natively anyway.

@ Chris, Scott

drunken monkey's picture

(Sorry to break comment threading, but the boxes were getting smaller and smaller…;))

OK, this needs some clearing up. I can only state my own intentions (not what e.g. Jennifer might have had in mind), but I strongly agree with both of you. Solr and most other backends will do a much better job at stemming, tokenizing and what-not than we could ever hope to implement.
However, there will be some backends (at least the drupal-implemented default backend) which have to rely on drupal for these tasks. Also, there may be additional pre-/post-processing ("PPP", from now on) steps not implemented in a certain backend where the user might therefore want to use a drupal filter even if she is using a "professional" search backend.

So I think we still have to implement some flexible framework for PPP and just let the user (or, probably, the backend automatically) turn all filters off when the search engine will itself take care of that.

Producing configuration files for the user imo isn't a good idea, since this would wholly depend on the backend and therefore be impossible to implement generically. I think, when a custom backend is used, the necessary configuration should be the responsibility of the user and maybe explained somehow by the backend implementation.

Regarding highlighting: Well, somehow this will also have to be possible, but if you can e.g. use that fast-vector-highlighter you just turn off all of drupal's highlighting plugins – or implement one that uses that external highlighter (depending on the highlighter's implementation). I'll make sure to take a look at luceneapi's highlighter and see to it that integrating such external ones will go smoothly.
Again, I agree that implementing this ourselves will be difficult, probably less exact and much slower, so exploiting existing solutions wherever possible is highly desirable. In principle, this is what this whole project is about. ;)

I hope I could allay your fears. This project really isn't about not letting search engines do their thing and building drupal into a Swiss army search engine that generously still leaves some bits to third-party software. On the contrary, it's about making it easier for developers to integrate third-party search solutions into drupal, and of course all their capabilities with them. We just have to make sure that all necessary tasks are taken care of in the event that no dedicated search software is used, or that it doesn't offer a certain desired feature.

I'm in favor of a lightweight

robertDouglass's picture

I'm in favor of a lightweight pre/post processing system for texts and search queries. I see it as necessary.

rendering

douggreen's picture

You might want to just call the entity_view mode=search, which will allow modules to alter what gets rendered.
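
For nodes in D7, that would look something like the lines below; 'search_index' is the view mode core search uses for nodes, and as far as I know a generic entity_view() would come from the entity API in contrib:

  // Render the node in a search-specific view mode and flatten it to text.
  $build = node_view($node, 'search_index');
  $text = drupal_render($build);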

Core, clarification

douggreen's picture

We've wanted to re-architect core search since the Minnesota search sprint. As noted above, some discussion already took place here. I don't expect to get a new core search module out of this, that would be a lot to ask. But it is my hope that this project can be both the proving ground and the foundation for this change. That's why so many of us are watching it closely and trying to give some guidance.

There will of course be a lot

robertDouglass's picture

There will of course be a lot of additional configuration possibilities. Then, when cron is run, items of that type (provided by the data source) will be added to that index and be returned in all subsequent searches.

Let's abstract this more. Let's say items get put in a queue, and then processed (whether this is by cron or not should be open). Cron can stay the default implementation, but a lot of systems can support instantaneous indexing of new content, or have other queue/index management strategies. We're a bit too tied to cron right now.

Good idea!

drunken monkey's picture

Good idea! This shouldn't be too hard to abstract and would still be a nice feature.

But abstracting away from the database as the place where the unindexed items are stored wouldn't make sense, I guess?

Some progress

drunken monkey's picture

I've made considerable progress over the last couple of weeks, so I think it's time to report back here.

What works
The whole framework for indexing data and searching should be working. One can index a variable number of items for a defined index, which is done automatically at cron time. Indexed items are passed through various hooks/plugins, one of which calls entity_prepare_view(). Track of updated/inserted entities is also kept.
Searches can be executed programmatically by calling search_api_search(). This will automatically create a SearchApiQuery object with a parsed query (the query mode can be selected and new ones can be added), pass it through appropriate hooks/plugins and then feed it to a SearchApiServiceInterface object which encapsulates the functionality of a specific search server.
At the moment, everything is specific to entities; I haven't abstracted any of that. However, the more I think about it, the best way to implement search for other data based on this API might in any case be to define this data (pages, or whatever) as entities. But if my opinion on this changes, the process could probably be abstracted relatively easily.
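
A programmatic search currently looks roughly like this (just a sketch; the exact parameter structure may still change):

  // Search the given index for some keywords, with paging.
  $results = search_api_search($index_id, array(
    'keys' => 'drupal search api',
    'offset' => 0,
    'limit' => 10,
  ));

Internally, this creates the SearchApiQuery object, runs it through the hooks/plugins and hands it to the server's SearchApiServiceInterface implementation.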

Hooks exposed by this module (so far)
- hook_search_api_service_info (+ _alter): Lets modules register SearchApiServiceInterface implementations.
- hook_search_api_alter_callback_info: Lets modules register available data alter callbacks, which are used at index time to add (or, possibly, remove) data from indexed items. These are just functions, not plugin objects.
- hook_search_api_processor_info: Register processor classes, which can be used to preprocess indexed items and search queries, and postprocess search results (see the example below).
- hook_search_api_query_alter: Lets modules alter search queries before they are executed.
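
As an example of the general shape of these hooks, registering a processor might look like this (the exact array keys are still in flux):

  function mymodule_search_api_processor_info() {
    return array(
      'ignorecase' => array(
        'name' => t('Ignore case'),
        'description' => t('Makes all indexed text and search keys lowercase.'),
        'class' => 'SearchApiIgnoreCase',
      ),
    );
  }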

Open (current) questions

  • Should hook_search_api_alter_callback_info() and hook_search_api_processor_info() also have corresponding alter() hooks?
  • Should there be a possibility for alter callbacks and processors to specify option forms which are used to configure them for individual indexes?
  • Currently, preprocessors and postprocessors can be enabled separately (although preprocessing at index and search time cannot). Does this make sense, or will we always want preprocessors to also be called for postprocessing (and vice versa)? Processors not wanting to do one of them can just ignore the respective call, after all. Or should even all three be enabled separately?
  • Should we implement hook_entity_info() to export definitions on indexes and servers?
  • Current directory structure is as follows:
    search_api.api.php
    search_api.info
    search_api.install
    search_api.module
    includes/processor.inc
    includes/query.inc
    includes/service.inc
    Is this OK, up to now? Should all of the rest of the (framework/API) code also go into the .module file, or should additional .inc files be used? And, most importantly (although it'll be a while till this becomes pressing): Where should implementations for individual backends go? Also into includes/, into an includes/ subfolder, into contrib/ subfolders, or into completely new projects?

In any case, the next steps will be to write an admin interface to configure all of this, create servers and indexes, etc., and then to implement one or two service classes.

I don't know what I'll do then. Implement some way to really offer this search functionality to users, probably, although I haven't yet decided how. Either with the traditional search tabs in search/* (although considerably more configurable), or by writing a Views integration (probably in a separate module). Or maybe both. Let's wait and see. ;)

@ Doug:

You might want to just call the entity_view mode=search, which will allow modules to alter what gets rendered.
I found the entity_info documentation on view modes, but couldn't find out where I can set the mode when loading/preparing an entity. How can this be specified?

Little problem with UI

drunken monkey's picture

I've run into a bit of a problem in a detail of the server creation process. Advice from Form API gurus or just opinions on the usability of the different solutions would be appreciated.

Right now (in the version I committed yesterday) I've managed to immediately display the service configuration form of whichever service the user chose, with some Javascript magic stolen from searchlight. I had some problems because invisible fields (of service classes that were not selected) would still generate "Foo field is required!" errors, even though the values weren't even used. I managed to circumvent this by not including the service config forms of unselected services when building a submitted form.

However, today I realized that this leads to the config forms also not being shown if validation errors occur (as you can see yourself in the current version in the repo).
I found a way to fix this again, by using a #pre_render hook (which only gets called if validation really fails), but this is for one thing rather complicated (duplicating some parts of the Form API), and for another circumvents things like hook_form_alter(). So I'd like your opinion on my options:

1) Fix it nonetheless via the #pre_render hook.
2) Forbid the use of required fields in the services' configuration forms.
3) Deactivate #required status for fields of unselected service classes when building a submitted form¹.
4) Drop the Javascript magic and do a two-step form, where the first step is just selecting the service class to use.
5) Some brilliant solution I haven't thought of but you have.

¹ If I'm not mistaken, this would mean that, if validation fails, the forms for unselected service classes won't show their required fields in the re-displayed form. However, if this form is then submitted, the required fields of the now selected service class would still be checked, so this wouldn't be too bad (and could even be circumvented, again with a #pre_render hook). Biggest disadvantage here would imo be the even more complicated code.

Edit: Thanks to fago, the above problem is solved – I'll just use the Form API's AJAX functionality, which after all is made for exactly this sort of thing. Should have remembered that myself…
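
For reference, the relevant Form API bit looks roughly like this (the wrapper ID and callback name are placeholders):

  $form['class'] = array(
    '#type' => 'select',
    '#title' => t('Service class'),
    '#options' => $class_options,
    '#ajax' => array(
      'callback' => 'searchapi_server_form_ajax',
      'wrapper' => 'service-options-wrapper',
    ),
  );
  // This container gets replaced with the selected class's options form.
  $form['options'] = array(
    '#type' => 'container',
    '#attributes' => array('id' => 'service-options-wrapper'),
  );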

However, there are two other problems I could use some help with, if one of you is eager to help. ;)

  • When displaying single servers (admin/config/search/search_api/server/%), the breadcrumb doesn't work. It just displays the "Home" link and nothing else. This is probably related to this issue. Do I just have to wait for it to be fixed (or fix it myself, of course ;)) or could there be something else wrong with my menu code? I already had some problems there, so it seems my knowledge of the menu system is not quite flawless.
  • The overview table at admin/config/search/search_api is always sorted ascending by status (disabled first, then enabled) by default, no matter how I specify the header I pass to the orderByHeader() method. What do I have to do to make it sort descending by status by default?

Breadcrumbs, ordering

jhodgdon's picture

Breadcrumbs in admin are broken in core. That's not your fault.

For ordering, I haven't looked at your code, but did you use ->extend('TableSort') in your query? If so, you should be able to specify any kind of ordering you want in your orderByHeader().
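
For reference, the usual D7 pattern is roughly this (table and field names invented); the column carrying the 'sort' key is supposed to act as the default sort:

  $header = array(
    array('data' => t('Status'), 'field' => 'enabled', 'sort' => 'desc'),
    array('data' => t('Name'), 'field' => 'name'),
  );
  $result = db_select('search_api_server', 's')
    ->extend('TableSort')
    ->fields('s')
    ->orderByHeader($header)
    ->execute();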

Also, in general... Just a note that if you are heavily using JavaScript/AJAX on admin forms, it's a good idea to make sure there is also a nicely-degrading way to use the form without JavaScript enabled.

{EDIT/ADDED} And without a mouse, etc. There can be a lot of accessibility issues...

TableSort bug?

drunken monkey's picture

Breadcrumbs in admin are broken in core. That's not your fault.

OK, good to know.

For ordering, I haven't looked at your code, but did you use ->extend('TableSort') in your query? If so, you should be able to specify any kind of ordering you want in your orderByHeader().

Yes, I did; this just doesn't seem to work as I thought (or "as intended"?). I mean, if no "order" and "sort" GET parameters are given, I thought this wouldn't do anything, but it still seems to sort ascending by status. I have now dug a bit into the TableSort code and produced this issue as a result. That behaviour was a bit too counterintuitive to anticipate.

Also, in general... Just a note that if you are heavily using JavaScript/AJAX on admin forms, it's a good idea to make sure there is also a nicely-degrading way to use the form without JavaScript enabled.

You are right, I really should not forget about this. But as I'm generally rather pedantic about standards and accessibility, you probably don't have to worry. ;)

update.php and "hook_entity_delete"

drunken monkey's picture

Speaking of accessibility: Is it possible that update.php doesn't work without Javascript? But of course that's a bit outside the scope of this project… What I really wanted to ask:

There are hook_entity_insert() and hook_entity_update() to keep informed about new or changed entities, which of course is needed for indexing. But there doesn't seem to be a hook_entity_delete() that lets one react to entities being deleted. Without knowing when entities are deleted, there would of course be stale data left in the indexes. Is there any way around this? Or is hook_entity_update() also invoked when entities are deleted?

That's interesting...

jhodgdon's picture

You seem to be correct that there is no hook_entity_delete(). There's a hook_node_delete() and a hook_user_delete(), and some taxonomy-related hooks as well, but no over-arching hook_entity_delete(). There is also no entity_delete() or entity_delete_multiple() function.

It's probably not a huge deal for searches done in the core Search module (or in the main Drupal database), because they can just join to the entity table of interest, so they wouldn't pick up deleted content in the results. Also, they would probably need to check access permissions, so that part of the query would also likely ban deleted content.

I don't actually know how Solr and other external search engines handle that issue though...

Whoops

jhodgdon's picture

Sorry, this got in twice by mistake and now I can't delete it...

Progress update

drunken monkey's picture

For my current status / the progress made so far, see these two comments:
Comment 1
Comment 2

Also, I could use some advice: Would it be advisable to use a Cron Queue instead of hook_cron() directly for indexing? I haven't really looked into queues yet, but it seems they are rather suited for such things. However, afaik core search currently doesn't use them, right?

No queues currently

jhodgdon's picture

I don't know anything about cron queues vs. hook_cron() -- are cron queues in core? If not, I would say don't use them...

Anyway, I can answer your question: the core search module currently does not use cron queues.

Anyway, I can answer your

Scott Reynolds's picture

Anyway, I can answer your question: the core search module currently does not use cron queues.

We should fix that. Cron queues are documented at http://api.drupal.org/api/group/queue/7.
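
For reference, declaring and filling such a queue in D7 looks like this (the module and callback names are placeholders):

  function mymodule_cron_queue_info() {
    $queues['mymodule_index'] = array(
      // Called once for each item in the queue.
      'worker callback' => 'mymodule_index_item',
      // Maximum time (in seconds) to spend on this queue per cron run.
      'time' => 15,
    );
    return $queues;
  }

  function mymodule_cron() {
    $queue = DrupalQueue::get('mymodule_index');
    foreach (mymodule_get_changed_items() as $item) {
      $queue->createItem($item);
    }
  }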

File an issue...

jhodgdon's picture

Preferably with a patch...

The Language Problem

drunken monkey's picture

OK, the basic framework is almost done, the only thing still missing (apart from a few details) is the language thing. Suggestions would be more than welcome.

First off, I haven't really worked with Drupal's l10n/i18n features (apart from using t()) and I also couldn't find the documentation I was searching for. How is this generally handled in content, at the database level?
It seems that new translations of nodes are completely separate nodes, only linked via tnid. I don't think that would be possible e.g. for users, so is there some common scheme for entities, at least for determining which language an entity's data is in, or whether it is language-independent?
If there is no such way for entities in general, then I don't think my module will be able to do much in that respect. Implementing a nodes-only solution would be against the whole concept of this project (or, at least, against half of it). And I can't think of anything else we might do. The only possibility that comes to my mind is that we could let an index specify which language(s) it stores, so searches could select between multiple indexes based on language settings. However, the user would be responsible for somehow ensuring that the indexes actually only store those language(s).
Otherwise, some links to documentation or the necessary functions/parameters would be helpful.

If determining an entity's language is possible, the currently most probable scheme for making search indexes language-aware is as follows: When a user creates an index, she chooses the languages that index should be used for. Then, all language-dependent information (e.g. enabled processors and their configuration) is stored one level deeper in the options array than currently, being additionally keyed by language. Settings made for the undefined/neutral language are the "root settings", specific languages may override those. The search server should internally create one index for each language (including "undefined") and then return results for the requested language(s) at search time. At index time, the classification into languages would probably be done by the framework, so the service implementations don't have to bother.

Should the first part be solved by then, I'll probably start implementing this (or whatever approach appears to be the best one, then) next Monday.

Languages are going to be difficult...

jhodgdon's picture

I don't know what to tell you for how translations are going to be handled "in general" for entities. As you have gathered (at least as far as I know) there isn't any standard way.
- D7 Drupal core really only provides a way to translate nodes.
- Taxonomy vocabularies and terms don't have a language field on their entity objects, and the i18n module (which allows you to translate taxonomy terms) basically assumes they are in the site-default language, and stores translations completely separately in that module.
- Users do have a language field, but I don't know of any facility for translating them.
- Fields in core have language/translation ability built into the db, but I'm not sure it's been totally fleshed out.

But is translation relevant? I think it is reasonable to assume in your module that entities should either have a language field, or they can be assumed to be in the default site language. Keep in mind that "undetermined" is a valid language choice too for language fields - http://api.drupal.org/api/constant/LANGUAGE_NONE/7

The thing is, as I'm sure you know, a lot of search preprocessing needs to know the language the content is in (such as stemming modules and CJK tokenizing). Which means you need to know the language at indexing time, as well as the language the user typed keywords in. IMO, all searches should be against content in one language -- that goes along with the philosophy of the i18n module, of generally only showing one language at a time on the site. I don't think the indexes should be language-specific, but they do need to store the language as a field, and at search time, my recommendation would be to restrict the search to one language's content.

As a note, I actually tried to add language awareness to search in D7, but ... well, it didn't make it... http://drupal.org/node/511594

Anyway, I think it would be reasonable to assume that an entity you are indexing/searching should tell you what language it is in, and your module could default to the site's default language if there is no language provided. Does that make sense?

OK, that makes sense

drunken monkey's picture

So, you're saying I should just check at index time whether isset($entity->language) and, if so, store the language in the index? Yeah, that would make sense, I guess.
How should the language "und" be handled, then? Saved as the site's default language, or saved separately and always included in searches (something like "language:(X OR und)" for language X)?

translatable fields

fago's picture

translatable fields are in core - probably http://drupal.org/project/translation will make it "usable". So an entity might have values in multiple languages.

Fields - yes; entities though?

jhodgdon's picture

You are correct that fields being translatable is in core. But the parts that are on the entity (e.g. node->title, which I think is still on the node and not a field) are not necessarily translatable. I think it's still a little bit hacked together, and I am not convinced it will work very well...

Regarding the other question about UND, I think you should probably store UND as the language and do an OR like you suggested. The reason is that, say the default language of a site is Spanish, someone searching in Italian would also want to see the language-neutral content (and that is technically what UND is supposed to represent).

And of course you will want to use the Drupal constant and not the string (which I believe is 'und')...

Tokenizing / Boosting

drunken monkey's picture

I have added a few processors (one for tokenizing/stripping syntax, one for ignoring case and a yet unimplemented one for filtering HTML) and encountered a small problem with the current API layout, regarding tokenizing. My implementation simply separated the individual tokens by spaces, and the server I'm currently starting to implement would then do a simple preg_split('/[ \t\n\r]+/'). This would also work when there is no tokenizer and result in a simple fall-back, where everything other than pure whitespace is indexed. Tokenizing strings into an array of tokens would of course be more direct and cleaner, but then subsequent preprocessors (we can't force the user to always set the tokenizer as the last one) would suddenly get an array as the field value. So just using a string made sense.

However, I have now considered the possibility that preprocessors might want to boost certain words/phrases in texts (the best example would be the HTML filter, which might want to boost headings, etc.), and this seems almost impossible without letting preprocessors transform text fields into token arrays. The problems here would be a) that these arrays would have to be very well defined to make any sense for a generic API, and b) that all processors would then have to distinguish four cases when examining values: a literal, an array of literals ("list<*>" types), an array of tokens, or an array of arrays of tokens.
I can't help but feel that this makes the process unnecessarily complex. If this is really needed for enabling boosting of index terms, it would of course have to be done, but I'd appreciate it very much if anyone would come up with a better solution.
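
To make the problem more concrete, such a well-defined token array could look something like this (purely a suggestion):

  // A text field's value after tokenizing, instead of a plain string:
  $value = array(
    array('value' => 'drupal', 'score' => 1.0),
    array('value' => 'search', 'score' => 5.0), // boosted, e.g. from an <h2>
  );

Every subsequent processor would then have to check whether it is handed a plain string or such an array (and, for list fields, arrays thereof), which is exactly the complexity described above.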

Specifying tokenizers as an altogether different entity (or as a special preprocessor) isn't a good solution either, I think, since there can well be cases where preprocessors should work after a tokenizer, or where several tokenizers should work on fields.

Er, I don't think I'm making much sense here, but I tried my best. I hope someone at least understands the problem well enough to ask the right questions to help me clarify.

Good question...

jhodgdon's picture

You are making sense... I'm glad to see you thinking about this project in such depth -- these are serious questions and issues you are bringing up!

So I'm looking at how the current D7 search.module does this -- it's in http://api.drupal.org/api/function/search_index/7

Basically the process is:
1) Get rid of ignored HTML tags.
2) Split the text into tags and text within tags.
3) Go through the split array.
a) For each tag, calculate the boost for this tag (e.g. 10 * for H1, 5 * for H2, or whatever). This is multiplied or added or something when there is a tag within the tag. If it's an ending tag, decrease the boost.
b) For each non-tag text portion, do whatever other preprocessing is desired on the text within the element, find the words to index, and index them with the boost. This way the preprocessing is done on a string of text, with no HTML tags in it.

This process makes sense to me... definitely each chunk of text is going to need a "boost" value associated with it, so you couldn't just paste them back together into a big string after doing the HTML processing... I can't think of another method that would work.

That was not really an answer...

jhodgdon's picture

I guess the reply above didn't really address your question...

It seems to me that maybe a preprocessor should be able to specify what "type" of input and output it expects:
- Text with HTML tags
- Text without HTML tags
- Individual words

And for each case, the output could be "with boost/ranking" or "without boost/ranking", and could be an array or not.

For example:
- The HTML tokenizer could say "I take in a string of text with HTML tags, and I return an array of text without HTML tags, with boost scores".
- A stemming processor could say "I take in an individual word and I return an individual word"
- A split-into-words processor could say "I take in a string of text without HTML tags and I return an array of individual words".
- A remove-all-tags processor could say "I take in a string of text with HTML tags and I return a string of text without HTML tags"
- A highlight-favorite-words processor could say "I take in an individual word and return an individual word with a boost score".

Then maybe you could have phases of processing. Start with "text with HTML tags" and progress until you have an array of individual words. In each phase, allow processors that take that as input, and if they return an array, then you pass each item of the array to the next processor. And if the processor returns a different type of output, then the next processing steps have to take that type of input.

So I (as a site admin) could decide to chain together:
a) Remove unwanted tags (html string -> html string)
b) HTML booster (html string -> array of plain text strings)
c) Remove punctuation (plain text to plain text, applied to each string in b)
d) Split into words (plain text -> array of words, which your module would call repeatedly on each array element of (c) output; the boost from (b) would be applied to each word in the output)
e) Acronym expansion (word -> array of words; boost values from d would be applied to each word in each output, and then the array would be flattened afterwards)
f) stemming (word -> word, applied to each element of the array)

Does that make any kind of sense?

And maybe...

jhodgdon's picture

It might also make sense to get more into detail on the text types, like
- lower-case text words
- words with diacriticals/accents removed
?

I dunno…

drunken monkey's picture

I guess the reply above didn't really address your question...

No, but it highlights well the issue I'm addressing. search_index() does things we want to do here, too, but it's a lot easier there since everything is in one huge function, instead of separate processors that can be chained arbitrarily.

Regarding your approach: It does make sense, but I think it is both too complex and too restrictive to be practical.
- A "lowercase" filter does not care where in the whole process it is applied.
- There is more than one way to go from "text with HTML" to "tokens without HTML". How do we decide on one?
- What if you just want to apply a processor that processes tokens, but none that produces them?

Also, now that I think about it: In my system, text in different fields can be processed differently, which means that each field could potentially be in a different processing "stage". This wouldn't be possible anymore in your setup. I don't know whether it is of any practical impact, but it's a restriction nonetheless.

Also, juggling with all these orthogonal types would be a big DX contra, I think.
And things like "lowercase" and "accents removed" would be even more confusing and specific. I guess, you could set such "tags" on the data to prevent other processors from doing the same, but there is no way to standardize possible tags comprehensively, and without standardization one processor would use "lower case" and one "lowercase", and the whole effect would be lost.

I think the approach I've currently come up with (see comment below, in reply to fago) works well enough. I don't let processors specify any such characteristics explicitly, but let them decide themselves whether (and how) they handle particular data. It's just the responsibility of the user not to add two processors that e.g. both remove HTML tags. Text or tokens is about the only distinction needed here, imo.

Hmmm...

jhodgdon's picture

I guess my thought was that if each processor specified what type of input/output it could deal with (which could potentially be several possibilities), then you could avoid having two tokenizers in the preprocess chain because you would require the output of processor N in the chain to be one of the possible input types of processor N+1.

Anyway, it was just a thought about how to organize things. The order of processing is relevant (at least in some cases), so some way to give a clue to admins as to the correct ordering seems like a good idea.

Reasoning understood

drunken monkey's picture

I understand the reasoning, and you've certainly got a point: the order really is important. But the same goes e.g. for filters, and they also let you order them completely freely – maybe just because your idea would be a lot harder to apply to filters, but it works there nonetheless.
As I see it, there is nothing wrong with having two tokenizers – the second one could split some tokens further, or even duplicate or re-fuse them, if that's sensible (or it thinks it is). I could imagine that in some cases this would be exactly what's needed to achieve a certain result, if you don't want to program a combined tokenizer by hand. (PS: I just came up with an example that will even soon actually be used in the framework: the HTML filter and the tokenizer. While the HTML filter only tokenizes the text according to the different HTML elements it is nested in, the tokenizer is responsible for afterwards also splitting on whitespace.)
In your scheme, tokenizers could of course specify tokenized data as a possible input type – in my scheme, potentially every tokenizer has this ability, giving more power to the user. And if a certain tokenizer really can't deal with tokenized text, it can simply choose to ignore it.

More detailed descriptions, which also give the user order-related hints, seem to me to be the better solution here.
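
To make the HTML filter / tokenizer example concrete, here is a simplified illustration (not the module's actual code) of two tokenizers chained together:

<?php
// Step 1 (HTML filter) has already split the text into one token per HTML
// element, assigning a boost per tag; step 2 (the tokenizer) then splits
// each of those tokens further on whitespace, keeping the original boost.
$tokens = array(
  array('value' => 'Search API', 'score' => 5.0),       // from an <h2>
  array('value' => 'A new framework.', 'score' => 1.0), // surrounding text
);

$words = array();
foreach ($tokens as $token) {
  foreach (preg_split('/\s+/', $token['value']) as $word) {
    $words[] = array('value' => $word, 'score' => $token['score']);
  }
}
?>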

a literal, an array of

fago's picture

a literal, an array of literals ("list<*>" types), an array of tokens or an array of arrays of tokens.

Hm, maybe you could pass everything as a list/array? Or just apply the processor to each list item yourself.

In Rules I also have data processors; I just register them per data type. Thus the processor has to declare what types of data it supports – couldn't your tokenized text just be an additional type?

Good idea!

drunken monkey's picture

Ah, specifying data explicitly as tokenized is a good idea, thanks! The problem remains that processors will have to be able to deal with both types of data if they want to preprocess text fields, but at least this is an easier way to distinguish the cases. And since I already have an abstract preprocessor class with kind of a framework for just overriding specific bits, I'll just extend it a bit to ease concrete implementations.

Letting the processor explicitly specify what types of data it can handle is unnecessary, though, in my opinion (if you meant to suggest that, too).
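
A rough sketch of how that abstract class could branch on the two kinds of data – class and method names here are illustrative, not the module's actual API:

<?php
// Illustrative sketch: the processor itself decides how to handle plain
// text vs. already-tokenized data, instead of declaring supported types.
abstract class ExampleProcessor {
  public function process($value) {
    if (is_array($value)) {
      // Already tokenized: process each token's value, keeping its score.
      foreach ($value as &$token) {
        $token['value'] = $this->processToken($token['value']);
      }
      return $value;
    }
    // Otherwise it's a plain text value.
    return $this->processText($value);
  }

  // Concrete implementations override only the specific bits they need.
  protected function processToken($text) {
    return $this->processText($text);
  }

  abstract protected function processText($text);
}

// E.g., a processor for case-independent searches:
class ExampleLowercaseProcessor extends ExampleProcessor {
  protected function processText($text) {
    return drupal_strtolower($text);
  }
}
?>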

Boosting fields

drunken monkey's picture

Yup, the approach worked very well, I think this was the right decision. The abstract processor is a bit complicated now, but the concrete implementations are as easy as can be.

Another question, or rather a point to discuss: How about an option for setting the boost level of individual fields? That way, e.g., terms found in the header would have more impact on the item's relevancy than ones in the body, etc. I thought about just adding this as an extra column to the fields settings table (see screenshot 13/14 here if you are not familiar with my admin UI) and handing those values to the server. Data alter callbacks and preprocessors could also access those, of course.
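
To illustrate the idea (just a sketch, not the eventual implementation): the boost would act as a simple per-field multiplier on the token scores handed to the server:

<?php
// Sketch: a per-field boost multiplies each token's score before indexing.
function example_apply_field_boost(array $tokens, $boost) {
  foreach ($tokens as &$token) {
    $token['score'] *= $boost;
  }
  return $tokens;
}

// E.g., terms in the header count five times as much as terms in the body:
$header_tokens = array(array('value' => 'search', 'score' => 1.0));
$header_tokens = example_apply_field_boost($header_tokens, 5.0);
// => array(array('value' => 'search', 'score' => 5.0))
?>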

Good plan?

SQL help, anyone?

drunken monkey's picture

I'm currently almost done with the database search implementation. The downside is, I was "almost done" two days ago, when I started trying to implement the actual searching (the only method still missing; everything else is in place). The SQL queries are becoming insanely complex, and since I'm usually able to solve any problem about half an hour after I post it to this group, I thought I might try my luck. ;) In the off chance that the magic doesn't work, I'd be very grateful for any SQL experts reading this and lending a helping hand, or head. The details are posted in the issue queue, so as not to clutter this discussion with a long description that most would ignore anyway.

@ Field boosts: Just did it, forget I asked. ;) Not such a huge API change, after all…

Database search basically working

drunken monkey's picture

Good news: The implementation of the database search is as good as done; in most cases, correct results are returned.
However, the bad news: For more complicated searches I suddenly get a PDO exception I don't understand.
Help from any SQL/PDO experts out there would be appreciated!
I've tried to fix this (amongst other things, but still) for the better part of two days now, and still can't make any sense of it.

Since GSoC is almost over, I think I'll now turn to the Views integration and to writing some tests, and try to fix that problem later, once someone has perhaps enlightened me as to what the problem even is.

Final status update

drunken monkey's picture

I'm back from a week without internet and since the GSoC is about to end, I thought I'd give you a quick status update. The following things are done:

  • The basic API / framework with all functionality
  • A UI for the framework
  • A few data alter callbacks and processors:

    • Aggregated fulltext field
    • URL field
    • Case-independent searches
    • HTML filter
    • Tokenizer
  • A simple database-based backend
  • A load of tests

Until Monday evening I'll probably just write some more tests and more documentation (like a README for the base module, explaining the whole thing). Tasks for the future:

  • Having another go at improving the UI.
  • Writing a search_api_pages module for creating simple search pages (and blocks), like with the core search module.
  • Doing a Views integration for displaying search results as a view.
  • Writing another backend.
  • Figuring out all the @todos I left all over the place during the project (this might take more time than the previous four combined ;) ).
  • And probably also other things I currently forgot.

Regarding the Views integration I'm hesitating a bit, since Views 7.x-3.x seems far from stable, and I of course want to avoid having to constantly adapt my code to the latest Views developments. Any opinions on that? I also couldn't really find good resources on Views 3, but reading some code should probably suffice, I guess.

Update

drunken monkey's picture

I haven't posted here in a while, but since there is currently a lot going on in the module, I figured I'd keep everyone still following this discussion updated. I already announced this a week ago over at the Solr Next Gen discussion. So, basically, these three are (rudimentarily) done:

  • Writing a search_api_pages module for creating simple search pages (and blocks), like with the core search module.
  • [Doing a Views integration](http://drupal.org/node/872904) for displaying search results as a view.
  • Writing another backend.

The search_api_page module (I stuck with the singular ;)) really is rather simple, but it's nice for quickly testing out functionality. It is currently broken, though, apparently due to some change in core … The same one seems to have taken out Views pages, so at least I'm in good company. (Or it has something to do with my local installation (even though I reinstalled it to check) – I still have to figure that out.)

The Views integration was rather easy and now works completely. For every defined search index, a new base table is created. You can select all known entity properties as fields (and properties of entity properties, like an author's name – technically, all the way down), add filters and arguments for any indexed fields, and sort by any indexed field that permits sorting – all entirely through Views and with no dependency on the search server.
Currently, just a few configuration forms and custom data type handling are missing, but those will come in the next few days …

The second backend I wrote was, as expected, the one for Solr. It, too, now works completely; I'll just add a few more configuration options and maybe some admin information (indexed fields and terms, like in the apachesolr module). So, yeah, now you can index and search any entity easily and flexibly via Solr. :)

As mentioned, I'll keep working hard on this at least until my semester starts in October. I'll also add a search_api_facets module (or, rather, implement it – a folder and one or two functions already exist) for creating facets for any Search API query. I'll try to make that as flexible as possible, so wherever you might use a Search API query, you should also be able to create facets specifically for that case, or for the index as a whole.

When all of this has matured a bit and there is no imminent work (probably some time next week), I'll let you know again. I'll also try to come up with some demo or such, so you can see for yourself how awesome all of this is. ;)

Major feature batch completed

drunken monkey's picture

The new features mentioned in the previous post are all done now. Want to do a facetted Solr search from a view, sorting by comment count? You got it! ;)

But why bore you here with verbose descriptions, when I can bore you far better with video?
I created a screencast, summarizing the most important new and old features – so if you haven't had time yet to check out the module, or not for a while, watch the video and be amazed!

Also, as if this had been cunningly planned beforehand, a new beta release containing all the new features is now available. I even went so far as to admit some still-existing shortcomings. ;)

For those not wanting to download 150MB

drunken monkey's picture

The screencast can now also be viewed directly on the web.

search api, taxonomy and entity

quazardous's picture

Hello,

I'm trying to figure out how to put Search API, taxonomy and entities together...

First of all, the Entity module "alone" does not provide a relation between nodes and taxonomy, am I right?

I'm not sure what to do (or what to think)... When I add a new node type (say, "article") with a reference to taxonomy terms (say, "tags"): do I have to declare a new entity (or group?) "article" with a relation to the taxonomy entity? Or what? What is your point of view?

PS: In the search page UI there is a "Searched fields" field that seems not to be actually used (Beta 3). My question: is Search API able to provide per-field search access?
PS EDIT: OK, at line 121 of search_api_page.pages.inc I added ->fields($page->options['fields']) to pass the search page configuration to the query, and it works :p

first of all, entity "alone"

drunken monkey's picture

First of all, the Entity module "alone" does not provide a relation between nodes and taxonomy, am I right?

No, this works out of the box – in the Entity API module always, and in the Search API at least in the development version. Since you are using Beta 3, though – I just noticed that this was indeed still broken back then. I'll probably do a Beta 4 this week; then the "recommended" release will support this, too.

filter factory

quazardous's picture

Hi again!

In the search_api_page module, building a query is quite simple: search_api_query().

But when I want to add some filters, it's not obvious what to do...

OK, I'll add some:

<?php
  $filter = new SearchApiQueryFilter();
  $filter->condition('status', TRUE);
  $filter->condition('created', 1289976947, '<=');
  // ...

  return search_api_query($page->index_id, array('search id' => 'search_api_page:' . $page->path))
    ->keys($keys)
    ->fields($page->options['fields'])
    ->filter($filter)
    ->range($offset, $limit)
    ->execute();
?>

And it works fine, but it feels dirty to me, since I use a hard-coded SearchApiQueryFilter that can no longer be extended (by someone who needs to do so)...

How can I add filters in a Search API "factory" way?

That depends

drunken monkey's picture

If you are just talking about search_api_page, then you can't. That module only supports the most basic use cases; Views should be used when anything more is needed. However, you could rather easily extend the module to either provide filters for search pages (probably just rows with three inputs for field, value and operator), or use hook_search_api_query_alter() to add/remove filters on search_api_page queries (which also makes your additions extendable, even if they are hardcoded).
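
For the hook variant, a minimal sketch (the "mymodule" prefix and the added condition are placeholders; the hook itself is real):

<?php
/**
 * Implements hook_search_api_query_alter().
 */
function mymodule_search_api_query_alter(SearchApiQueryInterface $query) {
  // Only act on queries coming from search pages.
  $search_id = $query->getOption('search id');
  if (strpos($search_id, 'search_api_page:') === 0) {
    // E.g., restrict all search pages to published items.
    $query->condition('status', TRUE);
  }
}
?>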

no i m not talking about

quazardous's picture

No, I'm not talking about search_api_page :o

I'm writing a new module on top of search_api_page: the module will provide filter widgets corresponding to the non-fulltext indexed fields of the search page's index (like Drupal's default content browser).

... and in fact I just need to add filters to the query in a proper way!

I want to replace $filter = new SearchApiQueryFilter(); with a search_api_create_filter()-like function, to avoid using hard-coded class names.

$query->createFilter($conjunc

drunken monkey's picture

$query->createFilter($conjunction);

And if you're just adding additional AND conditions, $query->condition($field, $value, $operator) will work as well, without the need to create an extra filter object.
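
Put together, that might look roughly like this ($index_id, $keys, etc. are placeholders):

<?php
// Let the query object create the filter, so no class name is hardcoded.
$query = search_api_query($index_id, array('search id' => 'example'));
$query->keys($keys);

// An OR group of conditions:
$filter = $query->createFilter('OR');
$filter->condition('status', TRUE);
$filter->condition('created', 1289976947, '<=');
$query->filter($filter);

// Simple AND conditions don't need a filter object at all:
$query->condition('type', 'article');

$results = $query->execute();
?>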

ext_search_page module

quazardous's picture

Hi,

I have coded the biggest part of an "Extended search page" module.

It features:
- non-fulltext indexed fields can be used as filters (like on the content page) in the search page form
- you can choose a form widget for each filter
- those widgets are shown on the search page and managed via $_GET
- supported fields: boolean, date (with date_range), node_reference (with a single-term select)
- widget support can be extended via a hook

I'll apply for a CVS account ASAP.

Next up will be a "Use as default content manager" option and Field/CCK widget integration for entity_reference...

Nice! :D

drunken monkey's picture

Wow, sounds awesome. :D Can't wait to take a look at it!
Good luck with your CVS account!

ty, but was easy on top of ur

quazardous's picture

Thanks, but it was easy on top of your work :p

And getting the CVS account seems to be the hardest part ;p
http://drupal.org/node/986920

Have you planned to add node-level index updates (i.e., in hook_nodeapi()) to keep the index "live" – up to date whenever nodes are modified?

Yes, I heard that CVS account

drunken monkey's picture

Yes, I heard that CVS account granting is pretty complicated these days … They plan to change it, but of course that's only a small consolation for you. ^^"

Have you planned to add node-level index updates (i.e., in hook_nodeapi()) to keep the index "live" – up to date whenever nodes are modified?

There is no hook_nodeapi() anymore in Drupal 7. But I implemented hook_entity_update(), which I think does what you mean – all updated entities (not just nodes) are automatically marked as "dirty", i.e., needing to be reindexed.
The only (minor) problem right now is that when, e.g., a user's name is changed, nodes with that user as author aren't marked as dirty, even if "author:name" is indexed.
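
For reference, a simplified sketch of how such an implementation could look – the function body is illustrative, not the module's exact code:

<?php
/**
 * Implements hook_entity_update().
 *
 * Marks every updated entity as "dirty" so it gets reindexed on the next
 * cron run.
 */
function search_api_entity_update($entity, $type) {
  list($id) = entity_extract_ids($type, $entity);
  search_api_track_item_change($type, array($id));
}
?>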

kk, If search api becomes

quazardous's picture

kk,

If Search API becomes "central" (and it should be ;p) for node management, it will be mandatory to index "dirty" nodes in hook_entity_update() itself (or at least in a very frequent cron run) for the main index...

In that case …

drunken monkey's picture

If Search API becomes central for node management, we can easily talk about an "instant indexing" option that would reindex right at entity update time.