Refactoring core search
Drupal's search APIs received some good attention at the recent Boston Drupalcon. Following up on discussions there, here is an attempt to draw together ideas on directions for refactoring core search. Please wade in and add your ideas and observations.
Existing core search
Drupal core search is implemented in an integrated way, providing a powerful working solution but little flexibility. Core search integrates several distinct pieces, among them:
- For nodes, a custom SQL-based indexing solution.
- For nodes, an SQL-based search algorithm.
- For nodes, a set of search operands (e.g., type).
Many common needs are difficult or impossible on the basis of core search, for example:
- Substitute an external indexing solution, e.g., SOLR, or rely on, e.g., MySQL full-text indexing.
- Add new search operands, e.g., author.
- Search for all object types at once (e.g., users and nodes).
- Extend indexing or other features to non-node objects.
- Add new parameters for ranking search results.
Because of these limitations, contrib search development often bypasses or overrides core search. An example is the Core Searches module, which includes a core patch to remove user and node search implementations.
Where development does build in part on core search, as with the promising Faceted Search module, it is forced to resort to complex overrides and becomes incompatible with search implementations other than core Drupal search.
Needs
There's a need, therefore, to decouple the components of core search such that they are extensible and implementation agnostic. Possible aims include:
- We have a search indexing and searching API that accepts multiple implementations, of which core search's custom indexing and SQL algorithm is one. Core search's indexing and searching can be turned off, or can coexist with other backends.
- Indexing can be applied to multiple object types (e.g., users as well as nodes).
- Search implementations are optional (e.g., it's possible to enable content but not user searching).
- Operands are defined in an extensible way that includes information needed for indexing.
Steps
First steps in these directions include:
- Define a data format for modules to describe search operands.
- Define a method for modules to register themselves as implementing indexing and/or searching for particular object types.
Details, sketches
Here we can sketch out some possible approaches to the directions listed in Steps.
hook_search_operands() ?
This hook would allow modules to describe search operands in a way that provides search implementations with the data they need to index and implement the operands. (This hook might partially deprecate hook_nodeapi() op = 'update index'.)
Obviously we need a lot more than the following. Just what?
<?php
/**
 * Implementation of hook_search_operands().
 */
function node_search_operands($type) {
  $operands = array();
  switch ($type) {
    case 'node':
      $operands['type'] = array(
        '#type' => 'checkboxes',
        '#title' => t('Only of the type(s)'),
        '#options' => array_map('check_plain', node_get_types('names')),
        // If the value is available as a property of the item being indexed,
        // give the property name.
        '#object_property' => 'type',
      );
      break;
  }
  return $operands;
}
?>
See this relevant issue: http://drupal.org/node/69595
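As one possible reading of the '#object_property' key, an indexing backend could walk the operand definitions and pull the matching values off the item being indexed. This is only a sketch under that assumption: example_search_index_values() is a hypothetical helper, not an existing Drupal function, and the operand list is hard-coded here rather than collected from modules.

<?php
/**
 * Hypothetical helper: collect the values a backend would index for one item.
 *
 * @param $operands
 *   Operand definitions in the shape returned by hook_search_operands().
 * @param $item
 *   The object being indexed (e.g., a node).
 * @return
 *   An array of indexable values keyed by operand name.
 */
function example_search_index_values($operands, $item) {
  $values = array();
  foreach ($operands as $name => $operand) {
    // Only operands that map directly to an object property are handled here.
    if (isset($operand['#object_property'])) {
      $property = $operand['#object_property'];
      if (isset($item->$property)) {
        $values[$name] = $item->$property;
      }
    }
  }
  return $values;
}

// Usage with a stand-in node object.
$operands = array(
  'type' => array('#type' => 'checkboxes', '#object_property' => 'type'),
);
$node = new stdClass();
$node->nid = 1;
$node->type = 'story';
$values = example_search_index_values($operands, $node);
// $values is array('type' => 'story').
?>

Operands whose values are not simple properties (computed values, joined data) would need something more, perhaps a callback key. That part is left open here.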
hook_search_backends() ?
This hook would allow modules to register callbacks for search indexing and execution.
<?php
/**
 * Implementation of hook_search_backends().
 */
function sql_search_search_backends() {
  return array(
    'sql_search' => array(
      // Callbacks that index content and execute a search for this backend.
      'index_callback' => 'sql_search_index',
      'search_callback' => 'sql_search_do_search',
    ),
  );
}
?>
Refactor ranking
See this relevant issue: http://drupal.org/node/145242

Some structural changes I'd like to see
Improvements
I'm concerned with external search engine integration. This is partly a list of thoughts, and partly a response, from notes that I've been making while thinking about it today. Hopefully it's valuable input.
It's useful to have a Drupal framework for the different phases of search. These are the ones that I think need abstractions: tracking index changes, indexing, search form and form validation, query parse, query build, performing the search, displaying the results, displaying further results. I think that these abstractions should be agnostic interfaces which don't directly reflect the Drupal internal search mechanism.
Core Search module heads in the right direction: separate modules for the search framework and the Drupal search implementation. This makes it much easier to use an external engine.
Tracking index changes: the xapian module uses this, building a search queue. My mnogosearch work is probably going to use the same. It should also be possible to make lightweight external calls to add to the queue. We use vbulletin, and I can imagine a vbulletin call to add data to the queue.
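To make the queue idea a bit more concrete, here is a rough sketch. Everything in it is hypothetical: the function names are made up, and the queue is a plain in-memory array, where a real implementation would presumably persist entries in a database table.

<?php
/**
 * Hypothetical queue: record that an item needs (re)indexing.
 */
function example_search_queue_add(&$queue, $source, $id, $op = 'update') {
  // Key by source and id so repeated changes to one item collapse to one entry.
  $queue[$source . ':' . $id] = array(
    'source' => $source,
    'id' => $id,
    'op' => $op,
  );
}

/**
 * Remove and return up to $limit queued entries for a backend to index.
 */
function example_search_queue_claim(&$queue, $limit) {
  return array_splice($queue, 0, $limit);
}

$queue = array();
example_search_queue_add($queue, 'node', 12);
// Same item again: still a single entry.
example_search_queue_add($queue, 'node', 12);
// An external system could make the same lightweight call.
example_search_queue_add($queue, 'vbulletin', 345);
$batch = example_search_queue_claim($queue, 10);
?>

The point is that the caller only needs to name a source and an id. What gets indexed and how stays with whichever backend claims the entry.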
It would be useful to be able to integrate results from different search methods. And it might be interesting to allow search decoration. So for instance, if I start at the standard search box and type 'robertDouglas' then the normal search module might show all the pages where Robert has posted. But the user module should be able to say that the most important result is Robert's profile, so that this is ranked at or near the top of the results list.
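A sketch of how merging and decoration might fit together, assuming each result is an array with 'source', 'title' and 'score' keys. Those keys and all of the function names are assumptions made for illustration, not an existing API.

<?php
/**
 * Hypothetical merge step: combine result lists and let "decorator"
 * callbacks adjust scores before sorting.
 */
function example_search_merge($keys, $result_sets, $decorators) {
  $merged = array();
  foreach ($result_sets as $results) {
    $merged = array_merge($merged, $results);
  }
  foreach ($decorators as $decorator) {
    $merged = call_user_func($decorator, $keys, $merged);
  }
  usort($merged, 'example_search_compare_scores');
  return $merged;
}

function example_search_compare_scores($a, $b) {
  if ($a['score'] == $b['score']) {
    return 0;
  }
  // Highest score first.
  return ($a['score'] > $b['score']) ? -1 : 1;
}

/**
 * Hypothetical user-module decorator: push a matching profile to the top.
 */
function example_user_search_decorate($keys, $results) {
  $top = 0;
  foreach ($results as $result) {
    $top = max($top, $result['score']);
  }
  foreach ($results as $i => $result) {
    if ($result['source'] == 'user' && strcasecmp($result['title'], $keys) == 0) {
      $results[$i]['score'] = $top + 1;
    }
  }
  return $results;
}

$nodes = array(
  array('source' => 'node', 'title' => 'A post by Robert', 'score' => 0.9),
  array('source' => 'node', 'title' => 'Another post', 'score' => 0.7),
);
$users = array(
  array('source' => 'user', 'title' => 'robertDouglas', 'score' => 0.4),
);
$merged = example_search_merge('robertDouglas', array($nodes, $users), array('example_user_search_decorate'));
// The profile result now sorts ahead of both node results.
?>

Scores from different engines are unlikely to be comparable without some normalisation, which this sketch ignores.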
Query parsing: when passing the query out to an external engine, query parsing is probably limited. It might be useful to be able to replace particular terms. For instance, mnogosearch uses & and ~ instead of AND and NOT, but to present a consistent interface, AND and NOT should be used.
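The term replacement could be as small as a per-engine map applied to the tokens of the query. A minimal sketch, with a hypothetical function name, and deliberately ignoring quoted phrases:

<?php
/**
 * Hypothetical translation step: rewrite the framework's boolean keywords
 * into the tokens a particular engine expects.
 */
function example_search_translate_operators($query, $map) {
  // Split on whitespace; quoted phrases are not treated specially here.
  $tokens = preg_split('/\s+/', trim($query));
  foreach ($tokens as $i => $token) {
    if (isset($map[$token])) {
      $tokens[$i] = $map[$token];
    }
  }
  return implode(' ', $tokens);
}

// Operator map for mnogosearch, as described above.
$mnogosearch_map = array('AND' => '&', 'NOT' => '~');
$translated = example_search_translate_operators('drupal AND search NOT solr', $mnogosearch_map);
// $translated is 'drupal & search ~ solr'.
?>

Only exact upper-case tokens are replaced, so an ordinary lower-case "and" in the keywords passes through untouched.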
Integration of external engines with content types. Most of the industrial strength search engines allow data partition on tags, or sections or something similar to provide faceted search. But it's not just CCK content types, it's modules too that implement a content type: for instance search on book content. To make this work, it helps to have special knowledge when indexing - for instance by adding tags, or weights, or facet info.
"do_search needs refactoring. Possibly needs to be broken into two or more phases that include callbacks to the caller, or a query object that has defined query building setters. Sending in snippets of raw SQL for multiple phases is a bit confusing." do_search needs to be an abstraction, calling the relevant registered search engines. The query should be an object, probably decorated by the query parser, because different search engines have different interfaces. The method of calling an engine should be embodied in a function because different search engines have different mechanisms. At least 3 exist already in the wild: through a URL, through a PHP extension call, and through an external process, e.g., a perl call.
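Roughly, the abstraction could look like the following, reusing the 'search_callback' key from the hook_search_backends() sketch in the post above. The dispatcher and the stand-in callbacks are hypothetical. Each real callback would hide its own mechanism (SQL, URL request, PHP extension, external process) behind the same signature.

<?php
/**
 * Hypothetical dispatcher: hand the query to every registered backend and
 * collect results keyed by backend name.
 */
function example_search_dispatch($backends, $query) {
  $results = array();
  foreach ($backends as $name => $backend) {
    // Skip registrations that only index, or whose callback is missing.
    if (empty($backend['search_callback']) || !is_callable($backend['search_callback'])) {
      continue;
    }
    $results[$name] = call_user_func($backend['search_callback'], $query);
  }
  return $results;
}

// Stand-in callbacks that only echo the keywords back.
function example_sql_do_search($query) {
  return array('sql result for ' . $query['keys']);
}

function example_remote_do_search($query) {
  return array('remote result for ' . $query['keys']);
}

$backends = array(
  'sql_search' => array('search_callback' => 'example_sql_do_search'),
  'remote_search' => array('search_callback' => 'example_remote_do_search'),
  'index_only' => array('index_callback' => 'example_index'),
);
$results = example_search_dispatch($backends, array('keys' => 'drupal'));
// $results has entries for 'sql_search' and 'remote_search' only.
?>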
"The keywords should not persist throughout the request in the form of a string, but rather an object that handles adding fields, removing fields, cloning etc." See above: sometimes the keywords are the most useful. Other times the keywords will need to be combined with special knowledge when using an external search engine - e.g. we are trying to search users. The framework should be agnostic and definitely shouldn't throw away or hide information.
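One way to get both is a query object that carries the structured fields but never discards the original string. A sketch only: the class and method names are invented for illustration.

<?php
/**
 * Hypothetical query object: keeps the raw keywords alongside any
 * structured fields, so no information is hidden from a backend.
 */
class ExampleSearchQuery {
  private $keys;
  private $fields = array();

  public function __construct($keys) {
    $this->keys = $keys;
  }

  // The untouched string the user typed.
  public function getKeys() {
    return $this->keys;
  }

  public function setField($name, $value) {
    $this->fields[$name] = $value;
    return $this;
  }

  public function removeField($name) {
    unset($this->fields[$name]);
    return $this;
  }

  public function getFields() {
    return $this->fields;
  }
}

$query = new ExampleSearchQuery('robertDouglas');
$query->setField('type', array('story'));
// A clone can be narrowed for one backend without touching the original.
$user_query = clone $query;
$user_query->removeField('type')->setField('object', 'user');
?>

A backend that only understands keywords can call getKeys() and ignore the rest, while one that supports fields can use both.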
Different search engines WILL return results in different ways, so perhaps the framework should have multiple levels of customisation. For instance: return a page without breaking it into individual results and theming them; or return themed individual results.