Fuzzy Techniques and Implementation

Posted by BlakeLucchesi on May 18, 2007 at 8:52am

I've been doing my best to wrap my head around ways to make Drupal's search more capable of fuzzy matching. Below are some goals for fuzzy search, along with a few comments. I'm not sure yet how all of this will fit together, but I feel that, along with improving the search engine's speed, we should look at ways to provide more relevant results.

Fuzzy matching should handle the following:

Synonym Matching

words with hyphenations
numbers typed as words or numerals (five or 5)
general synonyms (car or automobile, same thing)
same word used in different tense/plural (I traveled, he travels, you travel)

Misspellings

Missing Characters
Transposed Characters (mostly happens when someone knows the spelling but doesn't type it correctly, e.g. tpyo for typo)
Additional Characters (triggger vs trigger)
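
To make the three misspelling cases concrete, here is a minimal sketch of plain Levenshtein edit distance. It is written in Python for illustration only (an assumption on my part; a Drupal module would be PHP), and the example word pairs other than tpyo and triggger are made up. Note that plain Levenshtein counts a transposition as two edits, since it only knows insertions, deletions, and substitutions.

```python
def levenshtein(a, b):
    """Edit distance where insert, delete, and substitute each cost 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


# One example per misspelling class from the list above.
missing = levenshtein("triger", "trigger")       # missing character
transposed = levenshtein("tpyo", "typo")         # transposed characters
additional = levenshtein("triggger", "trigger")  # additional character
print(missing, transposed, additional)
```

A threshold of one or two edits would catch all three classes, though the right cutoff probably depends on word length.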

All of these lead to solutions which are both language dependent and independent.

Use of the levenshtein or similar_text function (Language Independent) (levenshtein is quicker and returns the number of edits between two words; similar_text returns the number of matching characters and can also report a similarity percentage)
Stemming (Language Dependent) (this solves the issue of different tense/plural)
Q or N-gram (Language Independent) (Breaks each word into smaller overlapping strings and indexes each one, e.g. "apples" becomes app, ppl, ple, les. This is a rather good language-independent solution for fuzzy matching, but it has drawbacks: it can bloat the index and requires a significantly larger number of search queries, and because it changes the search index, large sites adopting it would need to reindex their existing content.)
Suffix strings (Language Independent) (Along the same lines as the Q or N-gram approach; the difference is the difficulty of the recursion needed to come up with the suffix strings. Same drawback as q-grams: the site must be re-indexed.)
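
As a rough illustration of the two index-based approaches in the list above, the following Python sketch splits a word into 3-grams and into suffix strings. The function names, the minimum suffix length, and the Jaccard overlap used to compare two words are all assumptions made for this example, not anything Drupal's search does today.

```python
def ngrams(word, n=3):
    """Return the overlapping n-character substrings of a word."""
    word = word.lower()
    if len(word) < n:
        return [word] if word else []
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def suffixes(word, min_len=3):
    """Return every suffix of a word that is at least min_len characters."""
    word = word.lower()
    return [word[i:] for i in range(len(word) - min_len + 1)]


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two words' n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = set(ngrams(a, n)), set(ngrams(b, n))
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(ngrams("Apples"))    # ['app', 'ppl', 'ple', 'les']
print(suffixes("apples"))  # ['apples', 'pples', 'ples', 'les']
print(ngram_similarity("trigger", "triggger"))
```

Every extra gram or suffix becomes another index row, which is where the index bloat mentioned above comes from.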

One possible way to cut down on search processing time is to provide fuzzy results only if regular matching finds no results. In that case we would likely want to refrain from using a language-dependent algorithm.
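
A minimal sketch of that fallback idea, again in Python and purely illustrative: the in-memory index, the search function, and the cutoff value are hypothetical, and difflib's get_close_matches from the Python standard library stands in for whichever language-independent matcher (levenshtein, n-grams) would actually be used.

```python
from difflib import get_close_matches


def search(term, index, cutoff=0.8):
    """Return (doc_ids, was_fuzzy). Fuzzy matching runs only when the
    exact lookup finds nothing, so the common case stays cheap."""
    term = term.lower()
    if term in index:
        return sorted(index[term]), False

    # Fallback: compare the term against the indexed vocabulary.
    close_words = get_close_matches(term, list(index), n=3, cutoff=cutoff)
    doc_ids = set()
    for word in close_words:
        doc_ids |= index[word]
    return sorted(doc_ids), True


# Hypothetical index: word -> set of node ids containing it.
index = {"trigger": {1, 4}, "travel": {2}, "apples": {3}}

print(search("trigger", index))   # exact hit, no fuzzy pass
print(search("triggger", index))  # falls back to fuzzy matching
```

Returning a flag alongside the results lets the caller tell the user that the results are approximate.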

I have to go for now, but I just wanted to put up some information. I'm sure those reading this are already familiar with the above, but hopefully it's useful for someone. I'd really like to get more discussion on these ideas as well, because it will help me with my SoC project, and I want to be sure that the work I put in this summer is of value to the community.

I'll be back with more later.

Comments

existing capabilities, hooks, and prototyping new features

Posted by douggreen on May 20, 2007 at 1:09pm

There is already a porterstemmer module that uses hook_search_preprocess. I believe that all your other items could be built using this hook. It would be nice to have a search supplemental pack that included a bundle of modules like porterstemmer and the ones you are contemplating.

It's probably already too late for 6.x, but it would be nice to scope and prototype this for the future. I recommend trying to implement a couple of these and, if necessary, suggesting new hooks for core search that allow them to happen. You seem to have some specific ideas about algorithms. I presume these would require different ways of storing and retrieving data. Since faceted search (see nina) also has different storage and retrieval requirements, one thing we need to consider is a better abstraction of this. Can we use the current hook_search and hook_search_preprocess to accomplish this?

Doug Green
www.douggreenconsulting.com
www.dougjgreen.com

scoring fuzzy matches

Posted by douggreen on June 11, 2007 at 5:31pm

I'd really like to get hook_node_rank #145242 into core, which would allow you to use this to alter the score ranking based on "fuzzy" matches. Everything else being equal, a fuzzy match should have a lower score than an exact match.
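
To illustrate the scoring rule only (this is not the hook_node_rank API, and the function name, penalty factor, and sample scores are hypothetical), a Python sketch might scale a result's base score down whenever the match is not exact:

```python
from difflib import SequenceMatcher


def fuzzy_score(base_score, query, matched_word, penalty=0.5):
    """Keep the base score for exact matches; scale fuzzy matches down
    by a fixed penalty and by how similar the two words are."""
    if query == matched_word:
        return base_score
    similarity = SequenceMatcher(None, query, matched_word).ratio()
    return base_score * penalty * similarity


# Two hypothetical results with the same base score.
exact = fuzzy_score(10.0, "trigger", "trigger")
fuzzy = fuzzy_score(10.0, "triggger", "trigger")
print(exact, fuzzy)
```

With a penalty below 1, a fuzzy match can never outrank an exact match that has the same positive base score.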

Doug Green
www.douggreenconsulting.com
www.dougjgreen.com
