Fuzzy Search Engine Major Update

Posted by BlakeLucchesi

Over the past week and a half I have made a lot of progress on my project. The main accomplishments are listed below.

Completed Items

Scoring Hooks

  • Scoring hooks give other module developers the opportunity to insert a score multiplier at indexing time (see the Scoring Hook API).
  • Scoring hooks modify the score of a node during indexing, which affects the score of every word indexed for that node. Doing this at indexing time is the most efficient approach because it adds no extra work at search time. It also works well with this new search module because nodes can be tagged for re-indexing, whereas search.module only re-indexes a node if its updated timestamp has changed.
  • Site administrators control how much of an effect each score modifier has through the control panel (screenshot below).
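
To make the idea concrete, here is a minimal, language-agnostic sketch in Python of how index-time score multipliers could be combined. It is not the module's actual API: the function name apply_scoring_hooks, the hook names, and the formula that blends each hook's multiplier with an admin weight are all assumptions made for illustration.

```python
from typing import Callable, Dict

# A hypothetical "scoring hook": takes a node (a plain dict here) and
# returns a score multiplier, where 1.0 means "no change".
ScoreHook = Callable[[dict], float]


def apply_scoring_hooks(
    node: dict,
    base_scores: Dict[str, float],
    hooks: Dict[str, ScoreHook],
    admin_weights: Dict[str, float],
) -> Dict[str, float]:
    """Scale every word score of a node by the registered hooks.

    Assumed weighting: an admin weight of 0 switches a modifier off,
    1 applies it fully, and values in between apply it partially.
    """
    multiplier = 1.0
    for name, hook in hooks.items():
        raw = hook(node)
        weight = admin_weights.get(name, 1.0)
        multiplier *= 1.0 + weight * (raw - 1.0)
    # The multiplier is baked into the stored scores at indexing time,
    # so nothing extra has to be computed when a search is run.
    return {word: score * multiplier for word, score in base_scores.items()}


# Two made-up hooks: one boosts sticky nodes, one boosts commented nodes.
hooks = {
    "sticky": lambda n: 2.0 if n.get("sticky") else 1.0,
    "comments": lambda n: 1.0 + min(n.get("comment_count", 0), 10) / 10,
}
admin_weights = {"sticky": 0.5, "comments": 1.0}

node = {"sticky": True, "comment_count": 5}
scores = apply_scoring_hooks(node, {"drupal": 2.0, "search": 1.0}, hooks, admin_weights)
print(scores)  # {'drupal': 4.5, 'search': 2.25}
```

Because the multiplier is stored with the index, changing an admin weight only takes effect once the affected nodes are re-indexed, which is why being able to tag nodes for re-indexing matters.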

[Screenshot: score modifier settings in the control panel (image not available)]

Indexing Improvements

  • HTML tag scoring, just like that used in search.module, but with a different regular expression. In search.module the regex splits the content at each tag, which needs a fix to catch unclosed tags. My approach pulls out the text inside a tag, and only if that tag has a definite beginning and end.
  • More efficient indexing! Each word is now indexed only once per node. This takes a bit more processing during the indexing phase because I have to loop again and collect each word into an array before indexing, but the search index is smaller and the completeness metric is more accurate, since a word no longer contributes to completeness once for every occurrence within a node.
  • hook_nodeapi()'s $op = 'update index' is now invoked during the indexing phase, so modules that currently send information to the indexer will work properly.
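
The first two points can be sketched roughly as follows. This is a Python illustration, not the module's PHP code: the tag weights, the regular expression, and the choice to sum scores into a single entry per word are assumptions. The regex uses a backreference so that text only earns a tag bonus when the same tag both opens and closes around it.

```python
import re
from collections import defaultdict

# Assumed tag weights, for illustration only.
TAG_WEIGHTS = {"h1": 5.0, "h2": 4.0, "strong": 2.0, "em": 2.0}

# Matches <tag ...>text</tag> only when the same tag name closes it.
CLOSED_TAG = re.compile(r"<(\w+)\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")
WORD = re.compile(r"\w+")


def _add_tag_bonus(html, scores):
    """Add a bonus for words inside properly closed, weighted tags."""
    for match in CLOSED_TAG.finditer(html):
        tag, inner = match.group(1).lower(), match.group(2)
        bonus = TAG_WEIGHTS.get(tag, 1.0) - 1.0
        if bonus > 0:
            for word in WORD.findall(ANY_TAG.sub(" ", inner).lower()):
                scores[word] += bonus
        # Look for closed tags nested inside this one as well.
        _add_tag_bonus(inner, scores)


def index_node(html):
    """Return one score per unique word, i.e. one index row per word."""
    scores = defaultdict(float)
    # Base score: 1 per occurrence, collected before anything is stored.
    for word in WORD.findall(ANY_TAG.sub(" ", html).lower()):
        scores[word] += 1.0
    _add_tag_bonus(html, scores)
    return dict(scores)


html = "<h1>Fuzzy search</h1><p>Search is <strong>fuzzy</strong> and <em>fast"
index = index_node(html)
print(index)
# {'fuzzy': 7.0, 'search': 6.0, 'is': 1.0, 'and': 1.0, 'fast': 1.0}
```

In the sample, the unclosed <p> and <em> tags earn no bonus, yet their text is still indexed at the base score. Each word appears once in the result no matter how often it occurs in the node.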

Administrable N-Gram Length

  • Variable-length q-grams: the administrator can easily change the q-gram length. I've allowed 3, 4, or 5 for now.
  • Indexing and searching now also work for words shorter than the q-gram length. This matters when the admin sets the length to 5, since it ensures that shorter words are still indexed and searchable.
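
As a rough illustration of splitting words into q-grams with a configurable length, here is a short Python sketch. The function name qgrams is hypothetical, and so is the way short words are handled: the sketch keeps a word that is no longer than the q-gram length as a single gram. Padding the word would be another common option, and the post does not say which one the module uses.

```python
ALLOWED_LENGTHS = (3, 4, 5)  # the lengths the admin can choose from


def qgrams(word, q=3):
    """Split a word into overlapping q-grams of length q."""
    if q not in ALLOWED_LENGTHS:
        raise ValueError("q must be one of %r" % (ALLOWED_LENGTHS,))
    word = word.lower()
    if not word:
        return []
    if len(word) <= q:
        # Assumption: a short word is kept whole so it stays searchable.
        return [word]
    return [word[i:i + q] for i in range(len(word) - q + 1)]


print(qgrams("search", 3))  # ['sea', 'ear', 'arc', 'rch']
print(qgrams("drupal", 5))  # ['drupa', 'rupal']
print(qgrams("node", 5))    # ['node']
```

The same function has to be used on both the indexed text and the search terms, with the same q, so that the grams on each side line up.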

To Do List

This Week

  • Finalize the re-index API functions that will allow other modules to easily tag a node as needing to be re-indexed.
  • Work on the front-end appearance of the search engine. The results need to be displayed with teasers below them. I currently have a theme function for outputting the search form, and I'd like to do the same for the results page. I also want to make it easy for someone to modify the search form and print it on any part of their site, which includes providing a block for the search form.
  • Enable a stop words function. The site admin will be able to turn it on or off and choose the words they do not want indexed.
  • Allow people to search for exact phrases and use OR and AND to filter results (just like the current search does).

By Next Week

  • Get volunteers to run tests on the results being returned and to report on performance.
  • Use the results of that testing to fine-tune the search query so that the best matches are returned.

SoC 2007
