Part 1 of my project (implementing synonym matching in the search index) is nearly complete; I am waiting for the patches to be accepted into core for Drupal 6. In addition to synonym matching, I also submitted a patch to index usernames with the nodes, as requested in the Search group on drupal.org. The patches can be reviewed here; all comments welcome.
http://drupal.org/node/155262 - Taxonomy synonym search indexing
http://drupal.org/node/155254 - Username search indexing
For part 2 of my project I am to implement a fuzzy search engine in Drupal.
I would like to produce a module that implements n-gram based fuzzy search. [For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth.] One of the main reasons I want to implement this specific type of fuzzy algorithm is its language independence. One of the major downsides is that the size of the search index increases to:
SUM [length(word(i)) - length(n_gram) + 1]
instead of
SUM [word(i)]
Also, the current index only has a score based on the text's place in an assortment of tags. An additional column will be needed to score each trigram based on the length of the full word; without it, longer words would inherently receive a larger score in the results. Thus a normalization factor needs to be added to the scores of the results, so that an exact word match scores 1, and so forth. This can be done with the following simple equation:
trigram score = 3/length(word)
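As a rough illustration of the two formulas above, here is a minimal Python sketch (Python is used only for illustration; it is not the module's code). The function names `word_trigrams`, `trigram_score` and `index_rows` are hypothetical, and the sketch assumes trigrams are taken per word, as the index-size formula implies.

```python
def word_trigrams(word, n=3):
    """Return the n-grams of one word, in order (empty if the word is shorter than n)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def trigram_score(word, n=3):
    """Per-trigram score from the post: n / length(word), i.e. 3 / length(word)."""
    return n / len(word)


def index_rows(words, n=3):
    """Index size: SUM [length(word(i)) - n + 1]; words shorter than n add no rows."""
    return sum(max(len(w) - n + 1, 0) for w in words)


words = ["good", "morning"]
for w in words:
    # Each trigram of a word would be stored with the same per-word score.
    print(w, word_trigrams(w), round(trigram_score(w), 3))
print("index rows:", index_rows(words))
```

Taken literally, the trigrams of an L-character word sum to 3(L-2)/L rather than exactly 1, so the normalization may need some tuning to reach the exact-match score described above.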
Using SQL we can sum the trigram scores of the results that share the same nid (grouping by nid). The benefit of doing so is that exact matches return the highest scores, while results in which a spelling mistake has occurred return a fairly high score, but not as high as one that was spelled correctly. This helps in cases where changing a single character results in a completely different word or meaning.
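To show what summing per nid could look like, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table name `search_trigram`, its columns, and the helper functions are assumptions made up for this example; the real module would use Drupal's database layer and its own schema.

```python
import sqlite3


def word_trigrams(word):
    """Trigrams of a single word; words shorter than 3 characters yield none."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def build_index(conn, nodes):
    """Store one row per trigram, scored 3 / length(word) as in the equation above."""
    # Hypothetical table, separate from the existing search index.
    conn.execute("CREATE TABLE search_trigram (nid INTEGER, trigram TEXT, score REAL)")
    for nid, text in nodes.items():
        for word in text.lower().split():
            for gram in word_trigrams(word):
                conn.execute(
                    "INSERT INTO search_trigram VALUES (?, ?, ?)",
                    (nid, gram, 3 / len(word)),
                )


def fuzzy_search(conn, query):
    """Sum the scores of matching trigrams per nid, best match first."""
    grams = sorted({g for w in query.lower().split() for g in word_trigrams(w)})
    if not grams:
        return []
    placeholders = ", ".join("?" for _ in grams)
    sql = (
        "SELECT nid, SUM(score) AS total FROM search_trigram "
        f"WHERE trigram IN ({placeholders}) "
        "GROUP BY nid ORDER BY total DESC"
    )
    return conn.execute(sql, grams).fetchall()


conn = sqlite3.connect(":memory:")
build_index(conn, {1: "good morning", 2: "food warning"})
print(fuzzy_search(conn, "morning"))   # correctly spelled query
print(fuzzy_search(conn, "mornign"))   # query with a typo
```

In this toy data set the misspelled query still ranks node 1 first, but with a lower total than the correctly spelled query, which is the behaviour described above.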
I will follow up more this week as my work progresses, but my initial plans call for a separate search index so as not to interfere with the current search index.
One last thing: I'll be needing volunteers with some decent-sized sites to test my module, so if you are interested please send me an email. I would like to start gathering a beta group of testers to work with as I tune my algorithm to return the best results.
Thanks,
Blake
www.boldsource.com

Comments
nice work so far!
Hi Blake,
Thanks for this update. Looks like it's finally shaping up. I have a "decent-sized" intranet here which may be used for testing. Can you also consider using another open source indexer and just building a bridge to it?
"Work smarter, not harder."
http://digitalsolutions.ph
User Experience Design
A Podcast for Mac Switchers
That's out of scope for Blake's project. Do check out the Lucene/Nutch/Solr suite of goodies, though.
Lucene
Thanks for the correction, Robert. I was in fact thinking of Lucene on fuzzy searches.
"Work smarter, not harder."
http://digitalsolutions.ph
User Experience Design
A Podcast for Mac Switchers