So I have been tasked with solving the issue of stemming, mispellings, multilingual and partial matching for a site. I decided to create a quick prototype for n-gram indexing. (http://en.wikipedia.org/wiki/N-gram among other resources) . It is a relatively proven technique for these issues, but creates a larger index. So of course not sure if it is the best route in terms of performance, but I plan on running tests. I also plan on using this indexing technique to draw in similar items, using weighted cosine similarity against the search index would generate a 0 - 1 similarity score based on n-gram matching. So would work with say "Dog" and "Dogs".
Has large advantage over stemming as it isn't language dependent (see stemming issue in Drupal project issue queue). I am thinking about releasing it to the community at large as it is awfully neat. Just not sure if it is the best route. I know fuzzy search has explored this, but without using the standard Drupal indexing system. My module, unlike fuzzy search, builds its n-grams in the search index. Is that a bad thing? Why?
What answers have you all found? What makes Solr/Lucrene so special. Can we make Solr respect node_access?
Comments
Talk is cheep code is
Talk is cheep code is golden: here is the module. Its D6 but just change the form alter and form validation routines and your done.
<?php
// search_gram.module
/<strong>
* implementation hook_form_alter
*
* add in the setting to the search admin settings
*/
function search_gram_form_alter(&$form, $form_state, $form_id) {
if ($form_id == 'search_admin_settings') {
$form['indexing_settings']['search_gram_size'] = array(
'#type' => 'textfield',
'#title' => t('Size of N-Gram'),
'#description' => t('Adjusts the size of the gram that is indexed. If it is set to 3 it will break down a word into 3 character strings'),
'#size' => 3,
'#default_value' => variable_get('search_gram_size', 3),
);
$form['#validate'][] = 'search_gram_settings_validate';
}
}
/</strong>
* validate the search settings form. Make sure the min word size is <= n-gram
*/
function search_gram_settings_validate($form, &$form_state) {
if ($form_state['minimum_word_size'] < $form_state['search_gram_size']) {
form_set_error('search_gram_size', t('N-Gram size must be less then or equal to the minimum word size'));
}
}
/<strong>
* Return the grams
*
* @param $text
* The text to split. This is a single piece
* of plain-text that was extracted
* from between two HTML tags.
* Will not contain any HTML entities.
*/
function search_gram_search_preprocess($text) {
// get the n-gram
$gram_size = variable_get('search_gram_size', 3);
// elminate stop words
// elminate all spaces
$text = _stop_words($text);
// step through each of the text
// and add it modified text as space deliminated
$words = array();
for($start = 0; $start + $gram_size <= strlen($text); $start++) {
$words[] = substr($text, $start, $gram_size);
}
// implode the text so its 'words'
return implode(" ", $words);
}
/</strong>
* filter out the stop words from the $text and eliminate spaces
*
* @param $text
* the search text to remove the stop words from
*/
function _stop_words($text) {
$stop_words = array(
'a',
'able',
'about',
'above',
'abroad',
'according',
'accordingly',
'across',
'actually',
'adj',
'after',
'afterwards',
'again',
'against',
'ago',
'ahead',
'aint',
'all',
'allow',
'allows',
'almost',
'alone',
'along',
'alongside',
'already',
'also',
'although',
'always',
'am',
'amid',
'amidst',
'among',
'amongst',
'an',
'and',
'another',
'any',
'anybody',
'anyhow',
'anyone',
'anything',
'anyway',
'anyways',
'anywhere',
'apart',
'appear',
'appreciate',
'appropriate',
'are',
'arent',
'around',
'as',
'as',
'aside',
'ask',
'asking',
'associated',
'at',
'available',
'away',
'awfully',
'b',
'back',
'backward',
'backwards',
'be',
'became',
'because',
'become',
'becomes',
'becoming',
'been',
'before',
'beforehand',
'begin',
'behind',
'being',
'believe',
'below',
'beside',
'besides',
'best',
'better',
'between',
'beyond',
'both',
'brief',
'but',
'by',
'c',
'came',
'can',
'cannot',
'cant',
'cant',
'caption',
'cause',
'causes',
'certain',
'certainly',
'changes',
'clearly',
'cmon',
'co',
'co.',
'com',
'come',
'comes',
'concerning',
'consequently',
'consider',
'considering',
'contain',
'containing',
'contains',
'corresponding',
'could',
'couldnt',
'course',
'cs',
'currently',
'd',
'dare',
'darent',
'definitely',
'described',
'despite',
'did',
'didnt',
'different',
'directly',
'do',
'does',
'doesnt',
'doing',
'done',
'dont',
'down',
'downwards',
'during',
'e',
'each',
'edu',
'eg',
'eight',
'eighty',
'either',
'else',
'elsewhere',
'end',
'ending',
'enough',
'entirely',
'especially',
'et',
'etc',
'even',
'ever',
'evermore',
'every',
'everybody',
'everyone',
'everything',
'everywhere',
'ex',
'exactly',
'example',
'except',
'f',
'fairly',
'far',
'farther',
'few',
'fewer',
'fifth',
'first',
'five',
'followed',
'following',
'follows',
'for',
'forever',
'former',
'formerly',
'forth',
'forward',
'found',
'four',
'from',
'further',
'furthermore',
'g',
'get',
'gets',
'getting',
'given',
'gives',
'go',
'goes',
'going',
'gone',
'got',
'gotten',
'greetings',
'h',
'had',
'hadnt',
'half',
'happens',
'hardly',
'has',
'hasnt',
'have',
'havent',
'having',
'he',
'hed',
'hell',
'hello',
'help',
'hence',
'her',
'here',
'hereafter',
'hereby',
'herein',
'heres',
'hereupon',
'hers',
'herself',
'hes',
'hi',
'him',
'himself',
'his',
'hither',
'hopefully',
'how',
'howbeit',
'however',
'hundred',
'i',
'id',
'ie',
'if',
'ignored',
'ill',
'im',
'immediate',
'in',
'inasmuch',
'inc',
'inc.',
'indeed',
'indicate',
'indicated',
'indicates',
'inner',
'inside',
'insofar',
'instead',
'into',
'inward',
'is',
'isnt',
'it',
'itd',
'itll',
'its',
'its',
'itself',
'ive',
'j',
'just',
'k',
'keep',
'keeps',
'kept',
'know',
'known',
'knows',
'l',
'last',
'lately',
'later',
'latter',
'latterly',
'least',
'less',
'lest',
'let',
'lets',
'like',
'liked',
'likely',
'likewise',
'little',
'look',
'looking',
'looks',
'low',
'lower',
'ltd',
'm',
'made',
'mainly',
'make',
'makes',
'many',
'may',
'maybe',
'maynt',
'me',
'mean',
'meantime',
'meanwhile',
'merely',
'might',
'mightnt',
'mine',
'minus',
'miss',
'more',
'moreover',
'most',
'mostly',
'mr',
'mrs',
'much',
'must',
'mustnt',
'my',
'myself',
'n',
'name',
'namely',
'nd',
'near',
'nearly',
'necessary',
'need',
'neednt',
'needs',
'neither',
'never',
'neverf',
'neverless',
'nevertheless',
'new',
'next',
'nine',
'ninety',
'no',
'nobody',
'non',
'none',
'nonetheless',
'noone',
'no-one',
'nor',
'normally',
'not',
'nothing',
'notwithstanding',
'novel',
'now',
'nowhere',
'o',
'obviously',
'of',
'off',
'often',
'oh',
'ok',
'okay',
'old',
'on',
'once',
'one',
'ones',
'ones',
'only',
'onto',
'opposite',
'or',
'other',
'others',
'otherwise',
'ought',
'oughtnt',
'our',
'ours',
'ourselves',
'out',
'outside',
'over',
'overall',
'own',
'p',
'particular',
'particularly',
'past',
'per',
'perhaps',
'placed',
'please',
'plus',
'possible',
'presumably',
'probably',
'provided',
'provides',
'q',
'que',
'quite',
'qv',
'r',
'rather',
'rd',
're',
'really',
'reasonably',
'recent',
'recently',
'regarding',
'regardless',
'regards',
'relatively',
'respectively',
'right',
'round',
's',
'said',
'same',
'saw',
'say',
'saying',
'says',
'second',
'secondly',
'see',
'seeing',
'seem',
'seemed',
'seeming',
'seems',
'seen',
'self',
'selves',
'sensible',
'sent',
'serious',
'seriously',
'seven',
'several',
'shall',
'shant',
'she',
'shed',
'shell',
'shes',
'should',
'shouldnt',
'since',
'six',
'so',
'some',
'somebody',
'someday',
'somehow',
'someone',
'something',
'sometime',
'sometimes',
'somewhat',
'somewhere',
'soon',
'sorry',
'specified',
'specify',
'specifying',
'still',
'sub',
'such',
'sup',
'sure',
't',
'take',
'taken',
'taking',
'tell',
'tends',
'th',
'than',
'thank',
'thanks',
'thanx',
'that',
'thatll',
'thats',
'thats',
'thatve',
'the',
'their',
'theirs',
'them',
'themselves',
'then',
'thence',
'there',
'thereafter',
'thereby',
'thered',
'therefore',
'therein',
'therell',
'therere',
'theres',
'theres',
'thereupon',
'thereve',
'these',
'they',
'theyd',
'theyll',
'theyre',
'theyve',
'thing',
'things',
'think',
'third',
'thirty',
'this',
'thorough',
'thoroughly',
'those',
'though',
'three',
'through',
'throughout',
'thru',
'thus',
'till',
'to',
'together',
'too',
'took',
'toward',
'towards',
'tried',
'tries',
'truly',
'try',
'trying',
'ts',
'twice',
'two',
'u',
'un',
'under',
'underneath',
'undoing',
'unfortunately',
'unless',
'unlike',
'unlikely',
'until',
'unto',
'up',
'upon',
'upwards',
'us',
'use',
'used',
'useful',
'uses',
'using',
'usually',
'v',
'value',
'various',
'versus',
'very',
'via',
'viz',
'vs',
'w',
'want',
'wants',
'was',
'wasnt',
'way',
'we',
'wed',
'welcome',
'well',
'well',
'went',
'were',
'were',
'werent',
'weve',
'what',
'whatever',
'whatll',
'whats',
'whatve',
'when',
'whence',
'whenever',
'where',
'whereafter',
'whereas',
'whereby',
'wherein',
'wheres',
'whereupon',
'wherever',
'whether',
'which',
'whichever',
'while',
'whilst',
'whither',
'who',
'whod',
'whoever',
'whole',
'wholl',
'whom',
'whomever',
'whos',
'whose',
'why',
'will',
'willing',
'wish',
'with',
'within',
'without',
'wonder',
'wont',
'would',
'wouldnt',
'x',
'y',
'yes',
'yet',
'you',
'youd',
'youll',
'your',
'youre',
'yours',
'yourself',
'yourselves',
'youve',
'z',
'zero',
);
// remove apostrophes and split into words
$terms = preg_split('/([^a-zA-Z]+)/', str_replace("'", '', $text), -1, PREG_SPLIT_NO_EMPTY );
// check for stop words
for($i = 0; $i < count($terms); $i++) {
if (in_array(trim($terms[$i]), $stop_words)) {
unset($terms[$i]);
}
}
// combine and remove the spaces
return implode('', $terms);
}
?>
I have forgotten where i got
I have forgotten where i got the stop words. But its pretty decent