Since I've virtually flooded this group with my topics in the last weeks, most regular Search group readers will probably be aware that I'm (in the course of the GSoC) planning to create a new Search API. Now, since an API is pretty much useless without anyone actually using it, there of course should be a discussion involving all "stakeholders" on where this should go, prior to any coding.
This is said discussion.
I have of course done some research in the past weeks, so I already have a bunch of ideas. I'll therefore present those first and then always follow up with open questions for that particular area. Of course, comments and criticism outside of these questions are also allowed and encouraged. ;)
My plan naturally incorporates many ideas from the D8 Core Search discussion here on g.d.o. Another main source of inspiration is Young Hahn's emerging searchlight module whose goal is quite similar to mine, with searchlight just being more specialized, concentrating on Views in lieu of even more flexibility. Lastly, my previous Apachesolr RDF project — even though it turned out to be pretty useless — contains quite a number of elements which I'd find suitable for this project, too. Also it provides some clear "DON'T!"s — the true measure of experience. ;)
Overall procedure for creating searches
(Terms are as used in my apachesolr_rdf module, but see open questions.)
The user (i.e., a site admin) first creates a new search server. He selects the backend to use (e.g. Solr, Sphinx, simple DB implementation, …), configures it (e.g. host/port/path for Solr, maybe different modes available, etc.) and sets its name.
He can then create one or more indexes for that server. He would have to select the data type (e.g., nodes or users, provided by the entity data source) and a previously created server. There will of course be a lot of additional configuration possibilities. Then, when cron is run, items of that type (provided by the data source) will be added to that index and be returned in all subsequent searches.
Finally, so that any of this has a point, he will have to create a search. He selects the index to use, the path where the search should be displayed and other options (name, advanced fields to use, filter queries to always be appended, necessary permissions, …). (It would also be possible to create a search without using a path, which could then only be executed programmatically, by other modules providing such functionality.)
For indexes and searches, the plugins to use at different points in the indexing / search process can also be defined, configured and ordered.
End users (i.e., non-admins) only ever see the searches, without knowing the server, index or backend behind it.
(Internal detail: Creating a search just lets the searchapi module add a menu item for the given path which then asks for and processes user input, redirects to an appropriate search URL, extracts the search parameters from the URL and then calls a function that actually executes the search. If this is implemented in a clean way, other modules can easily provide other means of searching and execute searches programmatically. Strong coupling, as currently in search_get_keys(), should be avoided at any cost.)
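To make the "internal detail" a bit more concrete, here is a minimal sketch of how the URL handling and the actual search execution could be kept apart. All function names, the array layout and the toy backend are assumptions made up for illustration, not an existing searchapi API.

```php
<?php
// Hypothetical names throughout; none of this is an existing searchapi API.

/** Turns a search URL's query string into a plain parameter array. */
function sketch_extract_search_params(string $url): array {
  $query = (string) parse_url($url, PHP_URL_QUERY);
  parse_str($query, $params);
  return [
    'keys' => isset($params['keys']) ? trim($params['keys']) : '',
    'page' => isset($params['page']) ? (int) $params['page'] : 0,
  ];
}

/**
 * Executes a search from explicit parameters only. It never reads the
 * current request, so other modules can call it directly.
 */
function sketch_execute_search(array $search, array $params, callable $backend): array {
  if (empty($search['enabled'])) {
    return [];
  }
  return $backend($search['index'], $params['keys'], $params['page']);
}

// A stand-in backend: matches the keys against an in-memory "index".
$backend = function (string $index, string $keys, int $page): array {
  $items = ['node_index' => [1 => 'Drupal search', 2 => 'Solr backend']];
  $hits = [];
  foreach ($items[$index] ?? [] as $id => $text) {
    if ($keys !== '' && stripos($text, $keys) !== false) {
      $hits[] = $id;
    }
  }
  return $hits;
};

$search = ['index' => 'node_index', 'path' => 'search/content', 'enabled' => true];
$params = sketch_extract_search_params('http://example.com/search/content?keys=solr&page=0');
$results = sketch_execute_search($search, $params, $backend);
```

The menu callback would only do the first step; anything else that wants to search calls the second function with its own parameters.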
How should the objects (server, index, backend, search, filter, …) be named to make their meaning as clear as possible to the user? I got some negative feedback regarding the terms used in the apachesolr_rdf module. (And especially "server" is probably unsuitable for some backends.)
Are searches created through other modules (i.e. not using the default "menu item" implementation) also managed by searchapi? Or, asked another way: Is the "menu item" implementation also just a plugin, and when creating a search one could also select a different one?
Should it be possible to create a search that uses more than one index and combines the results? Should this be possible by default, or should it be possible for modules to somehow add this functionality? Should this only be possible for indexes on a single server or at least using the same backend, or for just any indexes?
The search itself could use a plugin to parse the user's input, thereby allowing different search syntaxes to be defined by modules and used by sites (maybe even let the user decide which one he wants to use).
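As a rough illustration of what such a parser plugin could look like, here are two interchangeable parsers behind one interface. The interface and class names are hypothetical, and the two syntaxes are just examples I picked.

```php
<?php
// Hypothetical interface and classes, for illustration only.
interface SketchQueryParser {
  /** Returns an array of search terms (phrases stay together). */
  public function parse(string $input): array;
}

/** Splits on whitespace; every word is a term. */
class SketchSimpleParser implements SketchQueryParser {
  public function parse(string $input): array {
    return preg_split('/\s+/', trim($input), -1, PREG_SPLIT_NO_EMPTY);
  }
}

/** Like the simple parser, but keeps "quoted phrases" as one term. */
class SketchPhraseParser implements SketchQueryParser {
  public function parse(string $input): array {
    preg_match_all('/"([^"]+)"|(\S+)/', $input, $matches, PREG_SET_ORDER);
    $terms = [];
    foreach ($matches as $m) {
      // Group 1 is filled for quoted phrases, group 2 for bare words.
      $terms[] = $m[1] !== '' ? $m[1] : $m[2];
    }
    return $terms;
  }
}

$input = 'drupal "search api" solr';
$simple = (new SketchSimpleParser())->parse($input);
$phrase = (new SketchPhraseParser())->parse($input);
```

A search would then only hold a reference to whichever parser was selected for it.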
Servers, indexes and searches can be deactivated separately (although deactivating a server also deactivates associated indexes and searches). This would e.g. allow indexing to be stopped for some time while searches on that index can still be used.
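The cascade could be sketched roughly like this, assuming (purely for illustration) that the configuration lives in one nested array; the layout and function names are made up.

```php
<?php
// Hypothetical registry; the array layout is an assumption.
$config = [
  'servers'  => ['solr1' => ['enabled' => true]],
  'indexes'  => ['nodes' => ['server' => 'solr1', 'enabled' => true]],
  'searches' => ['content' => ['index' => 'nodes', 'enabled' => true]],
];

/** Disables a server together with its indexes and their searches. */
function sketch_disable_server(array $config, string $server): array {
  $config['servers'][$server]['enabled'] = false;
  foreach ($config['indexes'] as $name => $index) {
    if ($index['server'] === $server) {
      $config['indexes'][$name]['enabled'] = false;
      foreach ($config['searches'] as $sname => $search) {
        if ($search['index'] === $name) {
          $config['searches'][$sname]['enabled'] = false;
        }
      }
    }
  }
  return $config;
}

/** Disabling only the index stops indexing; searches stay untouched. */
function sketch_disable_index(array $config, string $index): array {
  $config['indexes'][$index]['enabled'] = false;
  return $config;
}

$paused  = sketch_disable_index($config, 'nodes');
$offline = sketch_disable_server($config, 'solr1');
```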
Functionality that individual backends have to provide will be specified as interfaces. One or more abstract classes implementing most of these methods in a generic way will also be provided. This is almost identical to searchlight's way, but using interfaces as appropriate.
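A minimal sketch of that split, assuming one combined interface for indexing and searching (which is itself one of the open questions). The names are hypothetical and are neither searchlight's nor searchapi's actual classes; the in-memory backend only stands in for a real one.

```php
<?php
// Hypothetical names; not searchlight's or searchapi's actual classes.
interface SketchBackendInterface {
  public function configure(array $options): void;
  public function indexItems(string $index, array $items): int;
  public function search(string $index, string $keys): array;
}

/** Generic defaults; concrete backends override only what differs. */
abstract class SketchAbstractBackend implements SketchBackendInterface {
  protected $options = [];

  public function configure(array $options): void {
    // Newly passed options win over previously stored ones.
    $this->options = $options + $this->options;
  }

  /** Indexes items one by one and returns how many succeeded. */
  public function indexItems(string $index, array $items): int {
    $count = 0;
    foreach ($items as $id => $item) {
      if ($this->indexItem($index, $id, $item)) {
        $count++;
      }
    }
    return $count;
  }

  abstract protected function indexItem(string $index, $id, array $item): bool;
}

/** Toy in-memory backend standing in for a simple DB implementation. */
class SketchMemoryBackend extends SketchAbstractBackend {
  private $data = [];

  protected function indexItem(string $index, $id, array $item): bool {
    $this->data[$index][$id] = strtolower(implode(' ', $item));
    return true;
  }

  public function search(string $index, string $keys): array {
    $hits = [];
    foreach ($this->data[$index] ?? [] as $id => $text) {
      if (strpos($text, strtolower($keys)) !== false) {
        $hits[] = $id;
      }
    }
    return $hits;
  }
}

$memory = new SketchMemoryBackend();
$indexed = $memory->indexItems('nodes', [1 => ['title' => 'Hello World'], 2 => ['title' => 'Other']]);
$found = $memory->search('nodes', 'world');
```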
Should there be separate interfaces/objects for indexing and searching, or is there no real point?
Where/how should these classes be defined? hook_searchapi_backends() returning information on files and classes (so a module would have to be created for new backends)? Or collect as many backends as possible in the searchapi module itself?
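For the hook variant, this is roughly what I have in mind. The hook does not exist yet, the two "modules" and their backend classes are invented, and the collector function is only a stand-in for Drupal's hook invocation so that the sketch runs on its own.

```php
<?php
// Hypothetical hook implementations for two made-up modules.
function sketch_solr_searchapi_backends(): array {
  return [
    'solr' => ['name' => 'Solr', 'class' => 'SketchSolrBackend', 'file' => 'solr.backend.inc'],
  ];
}

function sketch_db_searchapi_backends(): array {
  return [
    'db' => ['name' => 'Database', 'class' => 'SketchDbBackend', 'file' => 'db.backend.inc'],
  ];
}

/** Stand-in for hook invocation: merges what each module returns. */
function sketch_collect_backends(array $modules): array {
  $backends = [];
  foreach ($modules as $module) {
    $function = $module . '_searchapi_backends';
    if (function_exists($function)) {
      foreach ($function() as $id => $info) {
        // Remember which module provided the backend.
        $backends[$id] = $info + ['module' => $module];
      }
    }
  }
  return $backends;
}

$backends = sketch_collect_backends(['sketch_solr', 'sketch_db', 'sketch_none']);
```

Modules that do not implement the hook (like the third one here) are simply skipped.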
Extension/Plugin points for
- data source (entities are default, but also any other source possible, like views, pages, RDF data, ...)
- data alteration (e.g., add data for comments or attachments to node, or profile fields to user)
- pre-processing (stemming, word-splitting, …) of indexed items and search queries (should also be configurable by backend – e.g., Solr handles pre-processing itself and might be irritated by being served already processed data)
- alteration of search queries (introduce custom sort, facets, etc.)
- post-processing/ranking/sorting (might use/alter ranking data provided by backend)
- displaying search results (maybe this should use additional plugins for creating excerpts and highlighting)
Defaults are not hard-coded; there is just searchapi's own implementation of those plugins, which will be activated by default. Everything should be configurable by users (i.e., site admins – although e.g. ranking mode could even be selected by normal users) and by modules. Also, the active data source and backend might want to switch individual plugins on or off.
Should these be hooks or plugin objects?
Should the plugins be handed single data items individually or the whole array at once? (And: does that matter much?)
Priority of options set by users, modules and other plugins? Who can override whom?
When executing a search programmatically, an array of custom plugins could also be optionally passed to it. (At least if they are objects; I don't know how this could work with hooks.) At each extension point, the search mechanism would then determine if one of those objects is suitable (using instanceof) and call the corresponding methods (instead of the options set by the user, where only one plugin is used (e.g. data source)).
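The instanceof check could look roughly like this. The two interfaces stand for two arbitrary extension points; all names are hypothetical.

```php
<?php
// Hypothetical extension-point interfaces and plugins.
interface SketchPreprocessor {
  public function preprocess(string $text): string;
}

interface SketchResultSorter {
  public function sort(array $results): array;
}

class SketchLowercase implements SketchPreprocessor {
  public function preprocess(string $text): string {
    return strtolower($text);
  }
}

class SketchReverseSorter implements SketchResultSorter {
  public function sort(array $results): array {
    return array_reverse($results);
  }
}

/** Returns the first custom plugin matching the interface, else the default. */
function sketch_pick_plugin(array $custom, string $interface, $default) {
  foreach ($custom as $plugin) {
    if ($plugin instanceof $interface) {
      return $plugin;
    }
  }
  return $default;
}

// Only a sorter is passed in, so pre-processing falls back to the default.
$custom = [new SketchReverseSorter()];
$preprocessor = sketch_pick_plugin($custom, 'SketchPreprocessor', new SketchLowercase());
$sorter = sketch_pick_plugin($custom, 'SketchResultSorter', null);
```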
Where there are several plugins executed sequentially (pre- and post-processing, …), a weight should be defined by the plugin and be alterable by the user, to determine order.
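A small sketch of weight-based ordering, assuming plugins are described by plain arrays and user overrides come in as a separate id-to-weight map. Both plugins and the function name are made up. The example also shows why the order matters.

```php
<?php
// Hypothetical plugin list: each entry has a default weight and a callable.
$plugins = [
  'lowercase' => [
    'weight' => 0,
    'run' => function (array $words) { return array_map('strtolower', $words); },
  ],
  'stopwords' => [
    'weight' => 10,
    'run' => function (array $words) { return array_values(array_diff($words, ['the', 'a'])); },
  ],
];

/** Applies user weight overrides, sorts by weight, runs plugins in order. */
function sketch_run_plugins(array $plugins, array $overrides, array $data): array {
  foreach ($overrides as $id => $weight) {
    if (isset($plugins[$id])) {
      $plugins[$id]['weight'] = $weight;
    }
  }
  uasort($plugins, function ($a, $b) { return $a['weight'] <=> $b['weight']; });
  foreach ($plugins as $plugin) {
    $data = $plugin['run']($data);
  }
  return $data;
}

$words = ['The', 'Quick', 'Fox'];
// Default order: lowercase first, so "the" is recognised as a stop word.
$default = sketch_run_plugins($plugins, [], $words);
// User moves stop word removal to the front: "The" slips through.
$reordered = sketch_run_plugins($plugins, ['stopwords' => -5], $words);
```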
The default data source implementation uses entity_metadata. A data source would have to provide information on what kinds of data are available and their respective attributes (metadata that is helpful for the user or other parts of the search framework). In particular, information on the fields of each entity and their datatypes would be needed. The data source is also responsible for maintaining information on what items still need to be indexed for each created index. (Something like $dataSource->getItemsToIndex($index, $numItems) would be used to retrieve them at index time.)
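Here is a minimal sketch of the index-status side of a data source. Only getItemsToIndex() is taken from the text above; the class name, the other two methods and the use of a plain string as index identifier are assumptions, and a real implementation would keep this in the database rather than in memory.

```php
<?php
// Hypothetical class; only getItemsToIndex() is named in the text above.
class SketchDataSource {
  /** @var array index name => list of item IDs still waiting to be indexed */
  private $pending = [];

  /** Registers new or changed items as needing (re-)indexing. */
  public function trackItems(string $index, array $ids): void {
    $current = $this->pending[$index] ?? [];
    $this->pending[$index] = array_values(array_unique(array_merge($current, $ids)));
  }

  /** Returns up to $numItems IDs that still need indexing for $index. */
  public function getItemsToIndex(string $index, int $numItems): array {
    return array_slice($this->pending[$index] ?? [], 0, $numItems);
  }

  /** Removes successfully indexed items from the pending list. */
  public function markIndexed(string $index, array $ids): void {
    $this->pending[$index] = array_values(array_diff($this->pending[$index] ?? [], $ids));
  }
}

$dataSource = new SketchDataSource();
$dataSource->trackItems('nodes', [1, 2, 3, 4, 5]);
$batch = $dataSource->getItemsToIndex('nodes', 2);  // one cron run's batch
$dataSource->markIndexed('nodes', $batch);
$next = $dataSource->getItemsToIndex('nodes', 10);
```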
Can encapsulating general information (what is available, how can it be retrieved) and maintaining index status be separated into two different objects for greater flexibility? Would this be reasonable?
Should index information (at least in the default implementation) be kept in a single table, or in individual ones for each index?
Multilingual support is very important, but how should it be implemented? Different indexes for different languages? Should they be created manually by the user (if more than one language is available, ask user when she creates the index) or automatically (when an index is created by the user, "secretly" create one for each language)? Does this make sense for all data, or just some (probably "decided" by data source)?
Let plugins define what languages they can/should be used in? If yes: how? (Both whitelist and blacklist approaches would not really suffice if plugin developers should not be forced to examine every language in existence.) Or let users decide and just urge plugin developers to provide some verbal clues in the plugin's description?
And: Should this be planned in from the beginning, or would it be possible to add it later, when the core functionality is basically done? (The latter would allow us to think about how to implement this, when we already can clearly see what is possible / necessary, and where.)
OK, this turned out to be a lot more material than I thought it would be, but at least that makes for a good base for discussion. So please, comment on the ideas, add your own ideas and thoughts, and discuss the hell out of them. ;)
I'll probably later include results of this discussion in the project's Wiki page, so others can also easily add or correct things. But other ideas on how to present this better and more comfortably would also be appreciated.