Summer of Code 2008: Search Scoring Improvements

Project Information:

Current Status: Working to complete the tasks outlined below and get them ready for acceptance into D7 core. The Drupal project page will contain a downloadable package of all code contributed from this project.


My project is to add value to the core Drupal search module by enabling modules to modify node scoring during the search process. Use cases include giving modules such as VotingAPI and e-Commerce/Ubercart the ability to influence node score rankings during search.


Project Schedule:

[X] May 8 - 12: Attending Search Code sprint in Minnesota
[X] May 28: Official Start of SoC Coding.
[X] June: Begin working on Search Ranking module that overrides core node search but provides hook_ranking so that other modules can modify search scoring results.
[X] June 25: Complete SimpleTest coverage for new Search Ranking module.
[X] July 15: Complete fivestar implementation of hook_ranking() and provide test coverage.


Status Updates:

Live from the Minnesota Search Sprint! On day one, Chad Fennel and I wrote test coverage to get Doug Green's ranking hook put into core. Doug's patch currently includes rank modifiers for:

*Sticky (nodes marked as sticky receive a higher score)
*Promoted (nodes marked as promoted receive a higher score)
*Relevancy (number of keywords present in the node/comments)
*Recency (how recent the node is)
*Comments (how many comments the node has)
*Statistics (how many visits the node has)

Once this patch is approved it will be possible to write other ranking modifiers for contributed modules such as Ubercart and VotingAPI.

Live from the Minnesota Search Sprint! Day two included creating more test coverage and working with Robert to allow input filters to be used to process text before it is inserted into the database. This has big implications because it opens up this essential search task to additional filter modules. One of the most pressing use cases is that site administrators will be able to configure language processors based on the types of text they are indexing. As of D6, this process was fixed, with three set filters and a hook_search_preprocess, which meant modules could hook into the process but their execution order was determined by their order in the system table.

Live from the Minnesota Search Sprint! Day three involved working on an additional patch to the hook_rankings patch which allows nodes with many links pointing to them to be ranked higher than nodes with few.
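As a rough illustration of a link-based factor, the sketch below counts how many distinct nodes link to each node and scales the counts so the most-linked node scores 1. The data structure (a mapping from a node id to the ids it links to) and the function name are assumptions made for this example; the actual patch works against the database and may count links differently.

```python
from collections import Counter
from typing import Dict, List


def inbound_link_scores(links: Dict[int, List[int]]) -> Dict[int, float]:
    """Score each node by its share of the highest inbound link count.

    Each linking node is counted once per target, and self-links are ignored.
    """
    counts: Counter = Counter()
    all_nodes = set(links)
    for source, targets in links.items():
        for target in set(targets):
            all_nodes.add(target)
            if target != source:
                counts[target] += 1
    max_count = max(counts.values(), default=0)
    if max_count == 0:
        return {node: 0.0 for node in all_nodes}
    return {node: counts[node] / max_count for node in all_nodes}


# Hypothetical site: node id -> node ids it links to.
site_links = {1: [2, 3], 2: [3], 3: [], 4: [3, 2, 2]}
link_scores = inbound_link_scores(site_links)
print(link_scores)
```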

Found an issue with scoring factor normalization that causes some factors to be scored abnormally higher than the others; it is posted in the tasks above.

Project page created and initial contribution of test coverage for the text processing function in search.module has been made.

Task Summary: Discussed current scoring factors and their normalizations with mentor Charlie Gordon. Also worked on a patch that provides a score factor for internal links.
Difficulties: Found issues with the current normalization functions; they will need to be reworked, and the changes proposed, to ensure validity.
Next Week: This upcoming week is going to be difficult for me to work on the project because I will be finishing my school quarter with finals next Tuesday. I hope to at least work on the normalization functions between now and then but I will have to see how things progress.

Task Summary: Created a patch that adds a ranking factor based on the number of links to a page from the rest of the site (the node links ranking patch).
Difficulties: None found.
Next Week: Put together an analysis of the normalization of scoring so that each factor is guaranteed to produce a score between 0 and 1.
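The goal of keeping every factor between 0 and 1 can be sketched as follows. These two helpers are illustrative assumptions, not the normalization functions in search.module: one scales a count (comments, visits) against the site-wide maximum, the other decays with the age of a node.

```python
def normalize_count(value: float, max_value: float) -> float:
    """Scale a count linearly against the site-wide maximum, clamped to [0, 1]."""
    if max_value <= 0:
        return 0.0
    return min(max(value / max_value, 0.0), 1.0)


def normalize_recency(age_seconds: float, half_life_seconds: float) -> float:
    """Exponential decay: 1.0 for a brand-new node, 0.5 at one half-life."""
    if age_seconds <= 0:
        return 1.0
    return 0.5 ** (age_seconds / half_life_seconds)


ONE_DAY = 86400  # seconds; the half-life used here is an arbitrary choice

print(normalize_count(5, 10))
print(normalize_count(15, 10))  # a stale maximum is clamped instead of exceeding 1
print(normalize_recency(ONE_DAY, ONE_DAY))
```

The clamp matters when the stored maximum is out of date: without it, a node that has passed the recorded maximum would contribute more than its share, which is the kind of abnormal score the analysis is meant to catch.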

Task Summary: Backported hook_ranking from Drupal HEAD to 6.x as a module with its own project page and issue queue. This module cleanly overrides the core content search implemented by node.module.
Difficulties: No major hurdles.
Next Week: By Friday the plan is to have simpletest coverage for the ranking modifiers so that we can ensure the core ranking factors operate properly. Once this is completed, I will be able to create new ranking factors and easily test that they don't adversely affect the core of this module.

Task Summary: Made sure that installation and removal of the module did not create any conflict with the core search module, and also ensured that current search paths would remain the same. This meant removing the new tab that search_ranking creates on the search page, and modifying the menu callback for node.module's search implementation. The benefit of doing this is that URLs stay the same and the advanced search form remains available and functional.
Difficulties: Modifying the file to ensure proper search overriding. Also ran into problems when trying to create a hook_ranking() implementation for fivestar. As it is, hook_ranking doesn't allow WHERE clauses, and the votingapi_cache table holds two values per node: one is the total votes, the other is the average score. A possible route would be to store the data used for search in its own table. (Adding a WHERE to the search SQL would work, but it would remove any results which don't have matching rows in the votingapi_cache table, which is not desirable.)
Next Week: Complete simpletest coverage and determine a route by which we can create a clean integration with votingapi/fivestar.
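The votingapi_cache difficulty described above can be reproduced with a small in-memory SQLite database. The schema here is a heavily simplified assumption (the column name kind is invented for the example and the real table has more columns), and the third query only shows how plain SQL can avoid the problem; whether hook_ranking's join definition can carry such a condition is not something this page confirms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT);
    -- Simplified, hypothetical stand-in for votingapi_cache.
    CREATE TABLE votingapi_cache (content_id INTEGER, kind TEXT, value REAL);
    INSERT INTO node VALUES (1, 'rated node'), (2, 'unrated node');
    -- Two cached values for the same node: total votes and average score.
    INSERT INTO votingapi_cache VALUES (1, 'count', 12), (1, 'average', 80);
    """
)

# 1. Joining without a filter returns the rated node twice.
plain_rows = conn.execute(
    "SELECT n.nid FROM node n "
    "LEFT JOIN votingapi_cache v ON v.content_id = n.nid ORDER BY n.nid"
).fetchall()

# 2. Filtering in WHERE removes the duplicate but also drops the unrated node.
where_rows = conn.execute(
    "SELECT n.nid, v.value FROM node n "
    "LEFT JOIN votingapi_cache v ON v.content_id = n.nid "
    "WHERE v.kind = 'average' ORDER BY n.nid"
).fetchall()

# 3. Filtering inside the join condition keeps every node; unrated nodes score 0.
on_rows = conn.execute(
    "SELECT n.nid, COALESCE(v.value, 0) FROM node n "
    "LEFT JOIN votingapi_cache v "
    "ON v.content_id = n.nid AND v.kind = 'average' ORDER BY n.nid"
).fetchall()

print(plain_rows, where_rows, on_rows)
conn.close()
```

A dedicated table with one row per node, as proposed above, would give the same one-row-per-node shape as the third query without needing any extra condition.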

Task Summary: This summary update covers the work I have completed over the last two weeks, during which time I've familiarized myself quite well with the simpletest module. I have successfully completed a core search_ranking test class that allows module maintainers to easily write their own simpletest coverage for their implementations of hook_ranking. I think this will be a great addition to core because hook_ranking is now part of the core search module.

In addition I completed an implementation for fivestar.module and wrote test coverage for it as well. You can take a look and hopefully help test it here:

Difficulties: Figuring out how to work with SimpleTest and seeing what is and isn't possible. Not so much a difficulty, but something I spent some time thinking about, was what types of tests needed to be written to verify that results were accurate, and how to write a clean base testing class for hook_ranking that provides easy-to-use functions for testing other modules' implementations.

Next Week: Go back and review search scoring normalization to ensure that each node ranking factor contributes only the number of points expected. If this isn't the case, then results will not be ranked as expected.