Some brainstorming notes from the sprint
public
group: Search
robertDouglass - Fri, 2008-05-09 21:49
Totally disorderly and mostly here for our own reference =)
- Refactor node/user search implementations into own modules.
- Control over the search interface.
- Moving stuff between adv. search form and main search form and/or block.
- Full search building interfaces. Create search environments; Each env. has own settings. eg. What content types are in search? What does the interface look like? Analog to building a view w/ fastsearch.
Brainstorm: if you store a view (from a fastsearch view, for example), and it is capable of returning a list of all of the nids in the view, we could build facet lists from the view. This would have general utility to views.
Note: said brainstorm highlights the need to isolate the indexer as a standalone component.
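A rough sketch of the facet idea, assuming the view can hand back its list of nids. The NODES table, its field names, and build_facets are all made up for illustration; a real version would read from Drupal's node and taxonomy tables.

```python
from collections import Counter

# Hypothetical stand-in for node data, keyed by nid.
NODES = {
    1: {"type": "story", "author": "alice"},
    2: {"type": "page", "author": "bob"},
    3: {"type": "story", "author": "bob"},
    4: {"type": "story", "author": "alice"},
}


def build_facets(nids, fields):
    """Count how often each value of each field occurs among the given nids."""
    facets = {field: Counter() for field in fields}
    for nid in nids:
        node = NODES.get(nid)
        if node is None:
            continue  # nid returned by the view but unknown here
        for field in fields:
            if field in node:
                facets[field][node[field]] += 1
    return facets


# Pretend the stored view returned these nids.
view_nids = [1, 3, 4]
facets = build_facets(view_nids, ["type", "author"])
```

The facet builder only needs the nid list, not the query that produced it, which is why it could be useful to any view and not just search.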
Federated search: Do similar search on multiple Drupal sites at once. Existing implementation is S. Witten's Search RSS aggregator.
Problems:
- Authentication
- (Node) access
- Result weaving (score/relevance)
- Asynchronous nature (blocking on slow sites?)
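One possible take on the result weaving problem, as a sketch only: rescale each site's scores to a common range before merging, since raw scores from different sites are not comparable. The site names, titles, and scores are invented, and min-max scaling is just one assumption among several options. Authentication, node access, and slow sites are not covered.

```python
def normalize(results):
    """Rescale one site's (title, score) pairs to scores in [0, 1] (min-max)."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    span = high - low
    # If every score is identical, treat all results as equally good.
    return [(title, (score - low) / span if span else 1.0)
            for title, score in results]


def weave(per_site):
    """Merge per-site results into one list, best normalized score first."""
    merged = []
    for site, results in per_site.items():
        for title, score in normalize(results):
            merged.append((score, site, title))
    # Sort by score descending; break ties by site and title for stable output.
    merged.sort(key=lambda row: (-row[0], row[1], row[2]))
    return merged


per_site = {
    "site_a": [("A1", 10.0), ("A2", 5.0)],
    "site_b": [("B1", 90), ("B2", 30), ("B3", 60)],
}
woven = weave(per_site)
titles = [title for _, _, title in woven]
```

Min-max scaling always puts each site's best hit at the top, even if that site has nothing truly relevant, so it is a starting point for discussion and not a fix.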
do_search: Gets its SQL from:
- arguments to the function
- search_parse_query
- Builds some of its own.
do_search tasks include:
- Documenting it.
- Code comments need work.
- The variable name needs work.
- Refactoring.
Search query handling (currently search_get_keys, search_query_insert, search_query_extract).
- uid:2, nid:5
- term_name:"Ronald Reagan" (solution to this in ApacheSolr)
- Tracking the query at various parts of the request. For example, with the adv. search form, the string for the query gets passed along in the $form. Ugly. Solution is a global singleton (or a static, passed-by-reference singleton...). ApacheSolr solves this.
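To make the key:value query syntax concrete, here is a small sketch of pulling filters like uid:2 or term_name:"Ronald Reagan" out of a query string. It is not how search_query_extract works; the regex and the extract_filters name are assumptions for illustration, and filters are expected to be separated by whitespace.

```python
import re

# key:value or key:"quoted value"; the key is word characters only.
FILTER_RE = re.compile(r'(\w+):(?:"([^"]*)"|(\S+))')


def extract_filters(query):
    """Split a query into plain keywords and a dict of key -> list of values."""
    filters = {}

    def take(match):
        key = match.group(1)
        # group(2) is set for quoted values, group(3) for bare ones.
        value = match.group(2) if match.group(2) is not None else match.group(3)
        filters.setdefault(key, []).append(value)
        return " "  # remove the filter from the keyword text

    remaining = FILTER_RE.sub(take, query)
    keywords = " ".join(remaining.split())
    return keywords, filters


keywords, filters = extract_filters(
    'reagan speech uid:2 nid:5 term_name:"Ronald Reagan"'
)
```

Returning the keywords and filters as one parsed structure would also give the rest of the request a single object to pass around instead of re-parsing the string.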



Where's the best place for feedback? A few random thoughts. ;)
Hi there!
This is all incredibly exciting. It's great to see core search getting some much needed love and attention.
It's not clear from the search group or this post how the sprinters want to get input from those of us who care but can't be there in person. So, I took the liberty of enabling comments on this post to add a few thoughts that wouldn't get lost inline with the brainstorming. Please let the rest of us know how we can best add our input. Thanks!
The most frustrating part of core search for me and the people using the sites I've built is the fact that the indexer only indexes full words. Among the many reasons that Google kicks our ass, I think substring searching is near (but not at) the top of the list. I'm no indexing wizard, so I have no idea how to do this in a way that still performs/scales well. All I can think of is a RLIKE query on the search index tables instead of the exact match it currently does. Perhaps an option to toggle if the query against the index is exact or RLIKE would be a good step -- it'd let smaller sites that don't have to worry about the DB load as much turn on this additional functionality, and still let bigger sites avoid the hit.
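A toy sketch of the exact-versus-substring toggle, using an in-memory word index instead of the real search index tables. INDEX, its contents, and lookup are hypothetical; the substring branch stands in for what an RLIKE query would do.

```python
# Hypothetical word index: indexed word -> set of sids containing it.
INDEX = {
    "earthquake": {1},
    "quake": {2},
    "performance": {3},
    "form": {4},
}


def lookup(word, substring=False):
    """Return sids matching the word exactly, or as a substring if enabled."""
    word = word.lower()
    if not substring:
        # Exact match: a single keyed lookup.
        return set(INDEX.get(word, set()))
    hits = set()
    # Substring match: every indexed word has to be examined.
    for indexed_word, sids in INDEX.items():
        if word in indexed_word:
            hits |= sids
    return hits


exact_hits = lookup("quake")
substring_hits = lookup("quake", substring=True)
```

The exact branch is one lookup while the substring branch walks the whole index, which is the cost a per-site toggle would let bigger sites avoid.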
The other idea I had would be the ability to search in the revision histories of nodes, not just the current revision. Again, no immediate idea how to do this and maintain performance. ;) But, as above, at least providing an option for this for people who want it and can afford the cost? This would probably be a drastic change to the schema so that you could record the revision ID and the nid as sids. Or, maybe you just store the vid as the sid, and in places where you really need the nid for something, you join on {node} (since vids are unique).
Anyway, best of luck to the sprinters, and let the rest of us know how we can help participate in the discussions remotely.
Thanks!
-Derek
Found the issue for substring searches ;)
http://drupal.org/node/103548#comment-838963
I still don't see anything about searching node revision histories. Should I just make an issue about that?
Should I add both issues to http://groups.drupal.org/node/10569 ?
Thanks for your feedback,
Thanks for your feedback, Derek. Here's a fine place for it. The feedback actually eclipses the original post (which I can't believe got promoted to the front page... lol) in value.
I've got your partial search concerns in mind but I'm still not sure how to reconcile stemming with RLIKE and performance. One problem is that partial string searches, in addition to matching some results that currently don't get matched, as in the case of "quake" -> "earthquake", will match a whole bunch of other stuff that might not be relevant. Just looking back at these few sentences, if I search for form I don't want to see performance, and if I search for arches I don't want results for searches. The more results there are the harder it is to provide relevance and ranking. So it isn't an easy problem.
Searching revisions isn't something we had time to address, so yes, please add the issue to our long list and keep brainstorming ways to implement it. Using $node->vid as sid is a way to start, but I begin to worry about growing the index needlessly large. If we have to index every revision of a 100 word node every time someone updates punctuation we're in big performance trouble, I think.
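One way to keep the index from growing on punctuation-only edits, sketched under the assumption that only the sequence of words matters to the index: fingerprint each revision's words and index a revision only when the fingerprint changes. The function names and the sample revisions are invented, and nothing here comes from the current indexer.

```python
import hashlib
import re


def word_fingerprint(text):
    """Hash the lowercased word sequence, ignoring punctuation and spacing."""
    words = re.findall(r"\w+", text.lower())
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


def revisions_to_index(revisions):
    """Given (vid, text) pairs, oldest first, return the vids worth indexing."""
    to_index = []
    previous = None
    for vid, text in revisions:
        fingerprint = word_fingerprint(text)
        if fingerprint != previous:
            to_index.append(vid)
            previous = fingerprint
    return to_index


revisions = [
    (10, "Hello world"),
    (11, "Hello, world!"),      # punctuation-only edit
    (12, "Hello brave world"),  # real change
]
vids = revisions_to_index(revisions)
```

A skipped revision would still need to be mapped to the indexed revision that carries the same words, so this trims the index but does not remove the bookkeeping.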
A couple more ideas...
Please don't forget those who are on shared hosting, and therefore can't run separate search services.
Also, perhaps it would be useful to implement >1 backend for any search API - just to stop us making too many assumptions based on ApacheSolr. e.g. http://drupal.org/project/xapian (which, while it requires a PEAR module, can run without a search service).
Hmm, does click-tracking have a place in search? Some engines/algorithms use click tracking (vs rank) to improve search result ordering based on click popularity.
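For what it's worth, a sketch of one way click popularity could be blended into ordering: add a damped click bonus to the relevance score. The log damping, the weight value, and the rerank name are all assumptions, not something an existing engine is known to use in this form.

```python
import math


def rerank(results, clicks, weight=0.2):
    """Order (sid, score) pairs by score plus a log-damped click bonus."""
    def blended(item):
        sid, score = item
        # log1p keeps heavily clicked results from swamping relevance.
        return score + weight * math.log1p(clicks.get(sid, 0))

    return sorted(results, key=blended, reverse=True)


results = [("a", 1.0), ("b", 0.9)]
clicks = {"b": 100}
by_clicks = [sid for sid, _ in rerank(results, clicks)]
by_score = [sid for sid, _ in rerank(results, clicks, weight=0)]
```

Setting weight to 0 gives back plain score ordering, so the click signal can be switched off per site.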
Sorry I can't be there, but I'm definitely interested in Drupal search.
Not Solr-specific
Although it is often cited as an example, be assured that no work in the sprint has been oriented specifically towards Solr. Actually, most of the work was focused on improving Drupal's built-in search while at the same time making small steps towards the longer-term goal of providing a richer framework for accommodating third-party search engines, whatever they are.
Thanks for pointing out Xapian. It definitely looks worth checking!
Useful
Thanks, it's very useful to know that there's work on improving the integration with 3rd party engines.
At the moment I'm working on a module to use mnogosearch as the search engine. We chose mnogosearch because it has a number of useful characteristics:
- it's an industrial-strength search engine: for instance, it's what MySQL use as the basis of search on their site.
- it can run an ongoing indexing task
- it's fast - especially in the binary version of the index
- it's open source.
From my point of view, the core Drupal search and the UI and framework for search need to be split into 2 modules: it's a waste of database and processing power for Drupal to keep an index when there's another index. But the Drupal hooks are useful because they let us notify the spider engine of changes, and they are able to provide extra context info that's not easily available on a straight spidering.
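A sketch of what "hooks notify the external engine" could look like, assuming the hook side only records which nodes changed and something else pushes the batch to the engine later. ChangeQueue, its method names, and the operation strings are hypothetical; this is not the mnogosearch module's design.

```python
class ChangeQueue:
    """Collects node changes and hands them to an external engine in batches."""

    def __init__(self, send):
        self._send = send   # callable that delivers a batch to the engine
        self._pending = {}  # nid -> most recent operation

    def node_changed(self, nid, op):
        """Record a change; a later operation on the same nid replaces the earlier one."""
        self._pending[nid] = op

    def flush(self):
        """Send all pending changes as one batch and return how many were sent."""
        batch = sorted(self._pending.items())
        self._pending.clear()
        if batch:
            self._send(batch)
        return len(batch)


sent = []
queue = ChangeQueue(sent.append)
queue.node_changed(5, "update")
queue.node_changed(5, "delete")
queue.node_changed(2, "insert")
count = queue.flush()
```

Drupal keeps no index of its own in this arrangement; it only tells the engine what to re-fetch or drop.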