Search Scoring Improvements

BlakeLucchesi

Background
Recently at DrupalCon Boston there was a Birds of a Feather search discussion attended by 20 members of the community interested in expanding the core search engine's features and its flexibility for integrating external search engines. During the discussion David Lesieur presented his Faceted Search module, which many agreed had features we should look to integrate with core search. His module provides a great interface for site administrators to create different search ‘interfaces’ for their site users. These interfaces can be configured to let users search only specific content types, include content from only specific taxonomies, etc.

As work commences to improve the core search module, I would like to propose building a search scoring interface for it. The scoring interface would allow modules to influence the way nodes are ranked during result retrieval.

Further discussion regarding the search API in general:
http://boldsource.com/articles/advancing-drupal-search
http://groups.drupal.org/node/10128
http://groups.drupal.org/node/9795#comment-32886
http://robshouse.net/2008/03/05/event/drupalcon-boston-solr-bof

Project Details
The proposed project is an addition to the core search module that will enable dynamic scoring. The additions will let core and contributed modules tell the search module how to change the scoring of results on a per search ‘interface’ basis.

The scoring API will provide features similar to those the core search module already offers, which let site administrators change the way nodes are ranked based on keyword relevance, date posted, and number of comments. The problem with the current implementation is that each module that wishes to offer customized scoring also has to implement its own search routines. This means that different modules cannot work with one another on the same search form to modify the scoring of results. I would like to extend the current capability to make this possible.
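
As a rough illustration of the idea, here is a minimal sketch in plain PHP (written so it runs outside Drupal, using closures from newer PHP versions). All of the names in it, including search_score_register(), search_score_compute(), the factor names, and the weight array, are hypothetical and not part of the existing search module. It only shows how several modules might contribute score factors to the same search, with administrator-chosen weights per interface combining them.

```php
<?php
// Hypothetical sketch, not the core search API. Modules register named
// scoring factors; each factor maps a node (array) to a value in [0, 1].
function &search_score_factors() {
  static $factors = array();
  return $factors;
}

function search_score_register($name, $callback) {
  $factors = &search_score_factors();
  $factors[$name] = $callback;
}

// Combine the registered factors using per-interface weights chosen by
// the site administrator. Unknown factors and non-positive weights are
// skipped.
function search_score_compute(array $node, array $weights) {
  $factors = &search_score_factors();
  $score = 0.0;
  $total = 0.0;
  foreach ($weights as $name => $weight) {
    if ($weight <= 0 || !isset($factors[$name])) {
      continue;
    }
    $score += $weight * call_user_func($factors[$name], $node);
    $total += $weight;
  }
  // Normalize by the total weight so scores stay within [0, 1].
  return $total > 0 ? $score / $total : 0.0;
}

// Two "modules" contribute factors to the same search.
search_score_register('relevance', function (array $node) {
  return $node['relevance'];
});
search_score_register('comments', function (array $node) {
  // Saturating transform: many comments approach, but never reach, 1.
  return 1.0 - 1.0 / (1.0 + $node['comment_count']);
});

// Weights for one hypothetical search interface.
$weights = array('relevance' => 5, 'comments' => 2);
$node = array('relevance' => 0.8, 'comment_count' => 3);
$result = search_score_compute($node, $weights);
printf("%.4f\n", $result);
```

A second interface could pass a different weight array to the same function, which is the per-interface behaviour the proposal is after. The actual hook names and data structures would be decided during the project.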

Benefits to the Community
Increasing the flexibility of the Drupal search engine to allow non-programmers to modify content ranking is something that many users can benefit from. Some of the use cases that were specifically brought up during DrupalCon were:

* A Drupal site has many different areas, including an ecommerce area with products. The site admin wants separate search pages: one for products only, and one for products and content. The search API would allow this, and the search scoring API addition would let the admin make sure that product matches rank higher than non-product matches, or that products that have sold more often rank higher than products that don’t sell as well. (rank content differently based on content type)

* A Drupal site has community and editorial content. The site admin wants to make sure that editorial content is given search precedence over normal user-contributed content. (rank content differently based on user role)

* A Drupal site wants its content ranked with priority given to particular taxonomy terms. (rank content differently based on taxonomy term)

Current search score modifier settings: see the attachment "Search Score Modifier Settings".

Deliverables
The final product will be a patch to the core search module. It will provide hooks for other modules that will allow site administrators to change the scoring metrics on a per ‘interface’ basis. Included with the contribution to the search module will be Simpletest coverage to ensure the additions work as expected with future development.

Project Schedule
The following is a proposed schedule for the project that will also include weekly updates to the community and my mentor.
May 8-11: Drupal Search Sprint in Minnesota. I have arranged to participate in the search sprint at the University of Minnesota where work to improve the core search API will be done. During the code sprint I plan to discuss specifics of the interface with others working to extend the search API.
May 26: SoC Official Coding Starts. By the start of coding I will have an outline of hooks and functions that will need to be implemented in the core API so that other search modules can interface with it.
June 5: Simpletest patterns will be developed to represent a typical interaction from a contributed module to the search API scoring hooks and also from the search API to the search engine module.
July 14: Midterm evaluation. By the midterm review I should be able to put together a working implementation of the search scoring API so that further testing and community feedback can be given on a working module.
August 11: Last week for code completion. Between the midterm and the final evaluation I will work to complete testing on the code by gathering and comparing result scoring. During the final weeks of the project I’d also like to review any documentation to ensure that the community has the resources they need to utilize the scoring API.

Bio
My name is Blake Lucchesi and I participated with Drupal in SoC 2007, where I was responsible for creating the fuzzy search module. While doing my project last year I realized that the core Drupal search implementation allowed little flexibility in controlling the core search indexing procedure. I had to write my own search implementation, which overlapped heavily with core in basic functions such as word tokenizing and user interface handling.

After further discussion at this past DrupalCon, many agreed that developing a better core API for search is a good route to take because of the wide variety of uses that each website will have. A better search API will allow programmers and non-programmers to extend their search functionality with ease and without the risk of breaking the core Drupal search functionality.

My other involvements in the Drupal community include contributing the Ubercart Coupons module, a module that connects the Wordfilter and Workflow-ng modules, contributing a few patches to allow user names to be searchable by the core search module, presenting at local group meetups and writing various mini tutorials on my blog.

Attachment: Search Score Modifier Settings (25.49 KB)

Comments

catch

This looks great to me, all kinds of possibilities once it's in place.

Are you planning for the search index to pull ranking information in from modules when it indexes, or would all the ranking logic be done at search time? The first causes issues when the data changes after indexing (a node could have 3 votes when first indexed, and Voting API would then probably need a way to tell the index that it has 300 a week later). Additional joins to get that sort of information directly while searching sound a bit scary as well.

BlakeLucchesi

Thanks for the comments. This is something I've considered, and the current search module actually uses joins to do the current score modifications. I see an issue with doing joins through such an open interface: allowing 10 different modules to modify scores, each adding its own columns, would wreak havoc. We're also looking to support external engines, which makes table joins impossible.

I would like to see the score modifiers indexed, and possibly let site administrators determine how often the score modifiers are updated (every cron run, every x days, etc.). I realize that it may be a burden to run these score updates every night (or 10 times a night for really large websites), but I think it's a minor cost for the features and flexibility that the score modifier will provide.

I think that in adding features to the search module we should also look at a way to queue nodes for re-indexing. As of 5.x the indexing of nodes was based on last-updated timestamps. I think this has been changed in 6.x to allow a node to be flagged for indexing or re-indexing. I believe a system that works off flags will be better, because we could let the Voting API flag a node for 'score indexing' after each vote, and an ecommerce module could do the same. Perhaps we could arrange it so that score indexing and node content indexing can run in isolation. This would save cron overhead for nodes whose content doesn't change: there is no need to re-index the whole node when only the scoring statistics need updating.
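
The flag-based idea could be sketched roughly like this. It is plain PHP with an in-memory array standing in for a database table. The function names (search_flag_node(), search_cron_run()) and the two flag names are invented for illustration and do not exist in core. The sketch also assumes that a full content re-index refreshes the score data too.

```php
<?php
// Hypothetical sketch. $queue maps node ID => set of pending flags.
// 'content' means the node text must be re-indexed; 'score' means only
// its scoring statistics (votes, sales, ...) need refreshing.
function search_flag_node(array &$queue, $nid, $flag) {
  $queue[$nid][$flag] = TRUE;
}

// Process pending flags, as a cron run might. Returns the work done:
// which nodes were fully re-indexed and which were only re-scored.
function search_cron_run(array &$queue, $limit = 100) {
  $done = array('content' => array(), 'score' => array());
  foreach (array_slice($queue, 0, $limit, TRUE) as $nid => $flags) {
    if (isset($flags['content'])) {
      // Assumption: a full re-index also refreshes the scores.
      $done['content'][] = $nid;
    }
    else {
      // Cheap path: only the scoring statistics are updated.
      $done['score'][] = $nid;
    }
    unset($queue[$nid]);
  }
  return $done;
}

$queue = array();
search_flag_node($queue, 12, 'score');   // a vote was cast on node 12
search_flag_node($queue, 7, 'content');  // node 7 was edited
search_flag_node($queue, 7, 'score');    // and then voted on
$done = search_cron_run($queue);
print_r($done);
```

Here node 12 takes the cheap score-only path while node 7 gets a single full re-index, so a vote on unchanged content never triggers a content re-index.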

-Blake

Custom Scoring Each node

jaffarcheckout

We are building a custom scoring module for Apache Solr results.

The custom function reads:

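// In a real module, "hook" is replaced by the module's machine name.
function hook_apachesolr_modify_query(&$query, &$params, $caller) {
  // Add a boost function (bf) based on the is_field_mytestval field.
  $params['bf'][] = "recip(rord(is_field_mytestval),1000,1000,1000)^200";
}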

In the above function we assign a score based on the is_field_mytestval field ....

There is no difference in the results; the above query does not give the expected result.
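
For reference, the arithmetic behind that boost function can be worked through in plain PHP. Solr defines recip(x, m, a, b) as a / (m*x + b), and rord() returns the reverse ordinal of a field value (1 for the largest indexed value). The helper names below are illustrative only, and the sketch assumes that the ^200 suffix simply multiplies the function value.

```php
<?php
// Solr's recip(x, m, a, b) function: a / (m * x + b).
function recip($x, $m, $a, $b) {
  return $a / ($m * $x + $b);
}

// Boost contributed for a document whose field value has the given
// reverse ordinal (1 = largest value in the index). Assumes the ^200
// suffix multiplies the function value.
function mytestval_boost($rord, $boost = 200) {
  return $boost * recip($rord, 1000, 1000, 1000);
}

foreach (array(1, 2, 10, 1000) as $rord) {
  printf("rord=%d boost=%.2f\n", $rord, mytestval_boost($rord));
}
```

With m, a, and b all set to 1000 the constants cancel, so the expression reduces to 200 / (rord + 1) and the boost falls off quickly after the top few values.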