Fuzzy Search Engine Updates

BlakeLucchesi

Fuzzy Search Module (search_fuzzy.module)

This week I've spent a number of hours going through the current Drupal search code to find out whether I can build my module around the existing search module. I have come to the conclusion, however, that this will not be possible because of the way words are indexed in search_index(). At first I thought I could use hook_preprocess to break the words into trigrams for indexing, but I hit a shortcoming: I could not then store, for each trigram, the ratio of the trigram's length to the length of the full word.

This matters because I am using that ratio as a major metric for the completeness of the returned results. Multiple trigram matches should be weighted more heavily than a match on a single trigram. Thus each trigram is given a score based not only on the tags it is found within but also on the length of the original word. This allows me to sort results by how much of the original query was actually found in the dataset.
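A minimal sketch of the completeness idea, written in Python rather than as Drupal code. The function names, the in-memory index, and the rule of keeping words of three characters or fewer whole are all assumptions made for illustration; tag-based scoring and punctuation handling are left out.

```python
from collections import defaultdict


def trigrams(word):
    """Return the overlapping 3-character substrings of a word."""
    word = word.lower()
    if len(word) <= 3:
        return [word]  # assumption: short words are kept whole
    return [word[i:i + 3] for i in range(len(word) - 2)]


def build_index(docs):
    """Map each trigram to a list of (doc_id, completeness) entries.

    completeness = len(trigram) / len(original word).
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word in text.split():
            for gram in trigrams(word):
                index[gram].append((doc_id, len(gram) / len(word)))
    return index


def search(index, query):
    """Sum the completeness of every matching trigram, per document."""
    scores = defaultdict(float)
    for word in query.split():
        for gram in trigrams(word):
            for doc_id, completeness in index.get(gram, []):
                scores[doc_id] += completeness
    # Highest total first: more of the query was found in that document.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With this sketch, a misspelled query such as "serch" still finds a document containing "search" through the shared trigram "rch", but with a lower score than the correctly spelled query, which matches all four trigrams.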

My solution has been to recreate the search index in a new table, 'search_fuzzy_index'. Because I am going this route, I also decided to act on something I had discussed with my mentor, Robert Douglass: a flag that indicates whether a node needs reindexing. This differs from search.module, which uses timestamps from node creation, last update, last comment, and last cron run to figure out which nodes to reindex. I have accomplished this by adding a column to the node table that indicates whether or not the node needs reindexing. I plan on providing either a hook or a function that other modules can use to set this flag as needed.
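The flag approach might look roughly like the following, sketched in Python with SQLite instead of Drupal's database layer. The column name `needs_reindex`, the helper names, and the simplified node table are hypothetical, not the module's actual schema.

```python
import sqlite3


def create_node_table(conn):
    # Simplified stand-in for the node table; new nodes start out flagged.
    conn.execute(
        "CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT, "
        "needs_reindex INTEGER NOT NULL DEFAULT 1)"
    )


def mark_for_reindex(conn, nid):
    """Helper another module could call after changing a node."""
    conn.execute("UPDATE node SET needs_reindex = 1 WHERE nid = ?", (nid,))


def run_cron(conn, index_node, limit=100):
    """Index up to `limit` flagged nodes and clear each flag afterwards."""
    rows = conn.execute(
        "SELECT nid, title FROM node WHERE needs_reindex = 1 "
        "ORDER BY nid LIMIT ?",
        (limit,),
    ).fetchall()
    for nid, title in rows:
        index_node(nid, title)  # caller-supplied indexing callback
        conn.execute("UPDATE node SET needs_reindex = 0 WHERE nid = ?", (nid,))
    return [nid for nid, _ in rows]
```

Unlike the timestamp comparison, nothing here depends on when cron last ran: a node is picked up exactly when some module has flagged it.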

Furthermore, I realize that in creating a completely separate search module, I must provide themes with a simple variable to print out a search form, since the usual hook_search will not be available to me.

Current Progress:
-Basic node indexing is complete with trigrams and completeness. Scoring based on tags is not complete yet, but it is my next step.
-Basic filtering of the query string into trigrams is complete; however, I still need to expand the filtering of search terms to include support for boolean search.
-Cron indexing calls are working properly, and an administration page with a checkbox for telling the index to be recreated is also complete.

As I stated before, I will need volunteers for testing. So far I have just 1 person...

-Blake
www.boldsource.com

Comments

Blake, pending a code review

robertdouglass

Blake, pending a code review of the new module, I wonder if it wouldn't be better to just have a hook for the search index update and let modules return lists of nodes as they wish. Why do you need a column in the node table? It might be that other data storage is necessary, but it seems that the modules that implement the hook should be responsible for that storage. Or you could provide a separate table for nodes that need indexing and an API for adding nodes to it. Maybe that's the best.

table search_index_queue
nid
module
timestamp
primary key (nid)

select nid from search_index_queue order by timestamp asc limit n

When a username gets updated, the user module would put all of that user's nodes into this table. When a taxonomy term gets updated, the taxonomy module would put all nodes categorized with that term into this table. When a node gets updated or created, the node module puts it into this table. The search indexer consumes and deletes from this table.
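The queue could be sketched as follows, in Python with SQLite for illustration only. The function names and column types are assumptions, as is the choice to ignore a second request for a node that is already queued (the primary key on nid allows only one row per node).

```python
import sqlite3
import time


def create_queue(conn):
    conn.execute(
        "CREATE TABLE search_index_queue ("
        "nid INTEGER PRIMARY KEY, module TEXT NOT NULL, "
        "timestamp INTEGER NOT NULL)"
    )


def enqueue(conn, nid, module, timestamp=None):
    """Called by any module whose change affects a node's indexed text."""
    if timestamp is None:
        timestamp = int(time.time())
    # Assumption: an already-queued node keeps its original entry.
    conn.execute(
        "INSERT OR IGNORE INTO search_index_queue (nid, module, timestamp) "
        "VALUES (?, ?, ?)",
        (nid, module, timestamp),
    )


def consume(conn, n):
    """Return up to n queued nids, oldest first, and delete them."""
    rows = conn.execute(
        "SELECT nid FROM search_index_queue "
        "ORDER BY timestamp ASC, nid ASC LIMIT ?",
        (n,),
    ).fetchall()
    nids = [row[0] for row in rows]
    conn.executemany(
        "DELETE FROM search_index_queue WHERE nid = ?",
        [(nid,) for nid in nids],
    )
    return nids
```

The module column is not needed for consuming the queue, but it records which module requested each reindex.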

I like the idea of adding

BlakeLucchesi

I like the idea of adding the table. It also makes it easier to see exactly which modules are requesting that node indexes be updated.

SoC 2007
