Some brainstorming notes from the sprint

Posted by robertdouglass on May 9, 2008 at 9:49pm
Last updated by dww on Sat, 2008-05-10 16:11

Totally disorderly and mostly here for our own reference =)

Refactor node/user search implementations into own modules.
Control over the search interface.
Moving stuff between adv. search form and main search form and/or block.
Full search-building interfaces. Create search environments, each with its own settings, e.g. which content types are in the search and what the interface looks like. Analogous to building a view w/ fastsearch.

Brainstorm: if you store a view (from a fastsearch view, for example), and it is capable of returning a list of all of the nids in the view, we could build facet lists from the view. This would have general utility to views.
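A rough sketch of that idea, in Python rather than Drupal code: given the nids a view returns, count how many of those nodes carry each term. The names `build_facets` and `terms_by_nid` are made up for illustration; in practice the nid-to-term mapping would have to come from the taxonomy tables.

```python
from collections import Counter

def build_facets(nids, terms_by_nid):
    """Count how many nodes in the result set carry each term."""
    counts = Counter()
    for nid in nids:
        # set() so a term listed twice on one node is only counted once
        counts.update(set(terms_by_nid.get(nid, ())))
    # Most frequent terms first, as (term, count) pairs
    return counts.most_common()

# Hypothetical data: the view returned nids 1, 2 and 5
view_nids = [1, 2, 5]
terms_by_nid = {
    1: ["search", "views"],
    2: ["search"],
    3: ["solr"],            # not in the view, so never counted
    5: ["views", "search"],
}
facets = build_facets(view_nids, terms_by_nid)
```

The point is that the facet builder only needs a list of nids, so anything that can produce one (a view, a search backend) could feed it.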

Note: said brainstorm highlights the need to isolate the indexer as a standalone component.

Federated search: Run the same search on multiple Drupal sites at once. An existing implementation is S. Witten's Search RSS aggregator.
Problems:

Authentication
(Node) access
Result weaving (score/relevance)
Asynchronous nature (blocking on slow sites?)
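To make the result weaving problem concrete, here is one possible approach sketched in Python: rescale each site's scores to the 0..1 range before merging, since raw scores from different sites are not comparable. Min-max scaling is just an assumption for the sake of the example, not something decided at the sprint, and `weave_results` is a made-up name. Authentication, node access and slow sites are not handled here.

```python
def weave_results(results_by_site):
    """Merge per-site result lists into one list ordered by normalized score.

    results_by_site maps a site name to a list of (title, raw_score) pairs.
    """
    merged = []
    for site, results in results_by_site.items():
        if not results:
            continue
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        span = high - low
        for title, score in results:
            # Min-max scaling; if all scores on a site are equal, treat them as 1.0
            normalized = (score - low) / span if span else 1.0
            merged.append((normalized, site, title))
    # Highest normalized score first; ties keep their original order
    merged.sort(key=lambda row: row[0], reverse=True)
    return merged

# Hypothetical results from two sites that score on different scales
woven = weave_results({
    "site_a": [("A1", 100.0), ("A2", 50.0)],
    "site_b": [("B1", 8.0), ("B2", 6.0), ("B3", 4.0)],
})
```

One obvious weakness: the worst hit on every site always ends up with a score of 0, no matter how good it really is, which is part of why this is listed as a problem.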

do_search gets its SQL from:

arguments to the function
search_parse_query
Builds some of its own.

do_search tasks include:

Documenting it.
Code comments need work.
The variable name needs work.
Refactoring.

Search query handling (currently search_get_keys, search_query_insert, search_query_extract). Examples: uid:2, nid:5, term_name:"Ronald Reagan"
(solution to this in ApacheSolr)
Tracking the query at various points of the request. For example, in the adv. search form, the query string gets passed along in the $form. Ugly. Solution is a global singleton (or a static, passed-by-reference singleton...). ApacheSolr solves this.
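For reference, a minimal sketch of what the key:value handling has to cope with, including quoted values with spaces. This is Python, not the actual search_query_extract code, and `extract_fields` is a hypothetical name used only for this example.

```python
import re

# key:value or key:"quoted value"; the key is a run of word characters
FIELD_RE = re.compile(r'(\w+):(?:"([^"]*)"|(\S+))')

def extract_fields(query):
    """Split a query string into free-text keywords and key:value fields."""
    fields = {}

    def _collect(match):
        quoted, bare = match.group(2), match.group(3)
        fields[match.group(1)] = quoted if quoted is not None else bare
        return ""  # drop the field from the keyword text

    remaining = FIELD_RE.sub(_collect, query)
    # Collapse the whitespace left behind by removed fields
    return " ".join(remaining.split()), fields

keywords, fields = extract_fields('cold war uid:2 term_name:"Ronald Reagan" nid:5')
```

Parsing the query once into a structure like this, and passing that structure around, is roughly the alternative to carrying the raw string through the $form.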

Comments

Where's the best place for feedback? A few random thoughts. ;)

Posted by dww on May 10, 2008 at 4:30pm

Hi there!

This is all incredibly exciting. It's great to see core search getting some much needed love and attention.

It's not clear from the search group or this post how the sprinters want to get input from those of us who care but can't be there in person. So, I took the liberty of enabling comments on this post to add a few thoughts that wouldn't get lost inline with the brainstorming. Please let the rest of us know how we can best add our input. Thanks!

The most frustrating part of core search for me and the people using the sites I've built is the fact that the indexer only indexes full words. Among the many reasons that Google kicks our ass, I think substring searching is near (but not at) the top of the list. I'm no indexing wizard, so I have no idea how to do this in a way that still performs/scales well. All I can think of is an RLIKE query on the search index tables instead of the exact match it currently does. Perhaps an option to toggle if the query against the index is exact or RLIKE would be a good step -- it'd let smaller sites that don't have to worry about the DB load as much turn on this additional functionality, and still let bigger sites avoid the hit.

The other idea I had would be the ability to search in the revision histories of nodes, not just the current revision. Again, no immediate idea how to do this and maintain performance. ;) But, as above, at least providing an option for this for people who want it and can afford the cost? This would probably be a drastic change to the schema so that you could record the revision ID and the nid as sids. Or, maybe you just store the vid as the sid, and in places where you really need the nid for something, you join on {node} (since vids are unique).

Anyway, best of luck to the sprinters, and let the rest of us know how we can help participate in the discussions remotely.

Thanks!
-Derek

Found the issue for substring searches ;)

Posted by dww on May 10, 2008 at 4:46pm

http://drupal.org/node/103548#comment-838963

I still don't see anything about searching node revision histories. Should I just make an issue about that?

Should I add both issues to http://groups.drupal.org/node/10569 ?

Thanks for your feedback,

Posted by robertdouglass on May 19, 2008 at 7:48am

Thanks for your feedback, Derek. Here's a fine place for it. The feedback actually eclipses the original post (which I can't believe got promoted to the front page... lol) in value.

I've got your partial search concerns in mind but I'm still not sure how to reconcile stemming with RLIKE and performance. One problem is that partial string searches, in addition to matching some results that currently don't get matched, as in the case of "quake" -> "earthquake", will match a whole bunch of other stuff that might not be relevant. Just looking back at these few sentences, if I search for form I don't want to see performance, and if I search for arches I don't want results for searches. The more results there are the harder it is to provide relevance and ranking. So it isn't an easy problem.
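The trade-off is easy to see in a toy example. This is plain Python over a list of indexed words, not the real index tables or an actual RLIKE query, and both function names are invented for the illustration.

```python
def whole_word_matches(term, indexed_words):
    """Exact match against the index, like the current behaviour."""
    return [word for word in indexed_words if word == term]

def substring_matches(term, indexed_words):
    """Partial match, roughly what an RLIKE on the index would allow."""
    return [word for word in indexed_words if term in word]

# A hypothetical, tiny word index
indexed_words = ["earthquake", "performance", "form", "searches", "arches"]

wanted = substring_matches("quake", indexed_words)        # finds "earthquake"
noisy_form = substring_matches("form", indexed_words)     # also finds "performance"
noisy_arches = substring_matches("arches", indexed_words) # also finds "searches"
exact_form = whole_word_matches("form", indexed_words)    # only "form"
```

Partial matching gains the "quake" hit but pays for it with the "performance" and "searches" noise, and every extra match is one more result to rank.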

Searching revisions isn't something we had time to address, so yes, please add the issue to our long list and keep brainstorming ways to implement it. Using $node->vid as sid is a way to start, but I begin to worry about growing the index needlessly large. If we have to index every revision of a 100 word node every time someone updates punctuation we're in big performance trouble, I think.

A couple more ideas...

Posted by lyricnz on May 12, 2008 at 2:15am

Please don't forget those who are on shared hosting, and therefore can't run separate search services.

Also, perhaps it would be useful to implement >1 backend for any search API - just to stop us making too many assumptions based on ApacheSolr. e.g. http://drupal.org/project/xapian (which, while it requires a PEAR module, can run without a search service).

Hmm, does click-tracking have a place in search? Some engines/algorithms use click tracking (vs rank) to improve search result ordering based on click popularity.

Sorry I can't be there, but I'm definitely interested in Drupal search.

Simon Roberts
Taniwha Solutions

Not Solr-specific

Posted by David Lesieur on May 15, 2008 at 3:44pm

Although it is often cited as an example, be assured that no work in the sprint has been oriented specifically towards Solr. Actually, most of the work was focused on improving Drupal's built-in search while at the same time making small steps towards the longer-term goal of providing a richer framework for accommodating third-party search engines, whatever they are.

Thanks for pointing out Xapian. It definitely looks worth checking!

// David Lesieur // Associé // Whisky Echo Bravo // Développement Web, experts Drupal // Montréal //

Useful

Posted by jeff veit on May 27, 2008 at 1:12pm

Thanks, it's very useful to know that there's work on improving the integration with 3rd party engines.

At the moment I'm working on a module to use mnogosearch as the search engine. We chose mnogosearch because it has a number of useful characteristics:
- it's an industrial strength search engine : for instance, it's what MySQL use as the basis of search on their site.
- it can run an ongoing indexing task
- it's fast - especially in the binary version of the index
- it's open source.

From my point of view, the core Drupal search and the UI and framework for search need to be split into 2 modules: it's a waste of database and processing power for Drupal to keep an index when there's another index. But the Drupal hooks are useful because they let us notify the spider engine of changes, and they are able to provide extra context info that's not easily available on a straight spidering.

This may be what you are

Posted by aufumy on September 17, 2008 at 5:05pm

This may be what you are looking for: http://drupal.org/node/282192

Nedjo's patch seems to be going in the right direction, I hope his patch will get some community love, so that other search issues can be tackled.

Federated Search

Posted by aufumy on October 15, 2008 at 10:52pm

Am working on a federated search module at http://drupal.org/project/distributed_search. It depends on the services module being implemented on client sites.
