This is my first update on groups.drupal.org about my SoC project on enhancing Drupal's search engine. I still have three more days of finals, so I haven't gotten to any of the coding, but after much research and thought (and some kind help from my mentor, Robert Douglass), I've decided on the following two goals for my project.
Goals for Summer of Code Search Project:
- Implement synonym insertion into the search index. The goal is to let site administrators define synonyms for their site so that those synonyms are added to the index while the site is indexed. This should increase the number of relevant search results returned to site users. I plan to add an administrative area to the search settings page where administrators can define, edit, and delete synonyms for their site.
- Implement a fuzzy search module that lets users make minor spelling mistakes and still receive relevant search results. I will be working on a custom algorithm that incorporates tactics covered in the research I have done, and it should work on any moderately sized Drupal site. (I noticed that in the "search" group much of the focus has been on implementing a high-speed solution, and fuzzy search can't compete on speed with direct term matching. Because of this, I would like to see this offered as an extra module rather than incorporated into core.)
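To make the fuzzy search goal concrete, here is a minimal sketch of typo-tolerant matching. It is not the custom algorithm planned for the project; it simply uses PHP's built-in levenshtein() edit distance as a stand-in, and the function name fuzzy_candidates() and the sample word list are made up for illustration.

```php
<?php
/**
 * Hypothetical helper (not part of Drupal): return the indexed words that
 * are within $max_distance single-character edits of $word, closest first.
 */
function fuzzy_candidates($word, array $indexed_words, $max_distance = 1) {
  $word = strtolower($word);
  $matches = array();
  foreach ($indexed_words as $candidate) {
    // levenshtein() counts the insertions, deletions and substitutions
    // needed to turn one string into the other.
    $distance = levenshtein($word, strtolower($candidate));
    if ($distance <= $max_distance) {
      $matches[$candidate] = $distance;
    }
  }
  asort($matches); // smallest distance first
  return array_keys($matches);
}

// Assumed sample data standing in for words taken from a search index.
$indexed_words = array('search', 'synonym', 'taxonomy', 'module');
print_r(fuzzy_candidates('serch', $indexed_words)); // array('search')
```

Note that this sketch compares the query word against every indexed word, which illustrates why a fuzzy search is slower than a direct term lookup.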
I'd love to hear some feedback, especially from those who have sites with a sizable amount of content. (I'll be looking for test volunteers to give me feedback on the algorithm's performance and result relevancy.)
Thanks!
Blake

Comments
Great!
These improvements will be very much appreciated!
I'm sure that you already have good ideas about how to implement these, but regarding synonyms it could be interesting to integrate with the taxonomy system — and to improve it at the same time. Right now, taxonomy already supports synonyms, but a major flaw is that the data structure does not reflect the symmetry between synonyms. Would storing synonyms as terms make the taxonomy data structure good enough for your system?
I'd also like to point out to anyone interested that some existing work already allows synonyms to be indexed, but only for node terms (and not node content).
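As a rough illustration of the symmetry point, the sketch below stores synonyms as groups, so that every word in a group points to all the others, and then appends the synonyms of any matching word to the text before it is indexed. This is only one possible approach, not how taxonomy or the search module actually work; the function names synonym_lookup() and expand_for_index() and the sample words are assumptions made for the example.

```php
<?php
/**
 * Hypothetical helper: build a symmetric lookup from groups of synonyms.
 * Every word in a group maps to all the other words in that group.
 * A word that appears in several groups keeps only the last group.
 */
function synonym_lookup(array $groups) {
  $lookup = array();
  foreach ($groups as $group) {
    $group = array_map('strtolower', $group);
    foreach ($group as $word) {
      $lookup[$word] = array_values(array_diff($group, array($word)));
    }
  }
  return $lookup;
}

/**
 * Hypothetical helper: append the synonyms of each known word to the text
 * that would be handed to the indexer.
 */
function expand_for_index($text, array $lookup) {
  $extra = array();
  $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
  foreach ($words as $word) {
    if (isset($lookup[$word])) {
      $extra = array_merge($extra, $lookup[$word]);
    }
  }
  return trim($text . ' ' . implode(' ', array_unique($extra)));
}

// Assumed sample data: one synonym group defined by an administrator.
$lookup = synonym_lookup(array(array('car', 'automobile', 'auto')));
echo expand_for_index('Buy a car', $lookup); // Buy a car automobile auto
```

Because the lookup is built from groups rather than from one-directional pairs, searching for "auto" finds content that only mentions "car" and the reverse.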
// David Lesieur // Partner // Whisky Echo Bravo // Web development, Drupal experts // Montréal //
Thanks for the update,
Thanks for the update, Blake. Good luck on your finals. I agree with David that it is worth looking into reusing the taxonomy module for synonyms, but in the end you have to decide which approach will give you the best results. Interesting discussion points.
thanks!
Hi Blake, thanks for the update. Good luck on your exams!
Just a thought. Would it be out of scope for you to implement some form of reporting on keyword stats? Examples: most popular keywords, word grouping (based on synonyms). This would greatly help content managers understand user behavior: what users are looking for, what they actually click in a list of search results, etc. If it's beyond scope, that's okay. Just keep this in mind for future releases :)
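To show what the "most popular keywords" part could look like, here is a small sketch that counts keywords across a list of logged queries and folds synonyms into one label. It assumes the queries are already available as plain strings; the function name keyword_stats(), the synonym map, and the sample log are invented for the example, and click tracking is not covered.

```php
<?php
/**
 * Hypothetical helper: count how often each keyword appears in a list of
 * logged search queries, most frequent first. $canonical optionally maps
 * a synonym to the label its group should be counted under.
 */
function keyword_stats(array $queries, array $canonical = array()) {
  $counts = array();
  foreach ($queries as $query) {
    $words = preg_split('/\W+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
      $key = isset($canonical[$word]) ? $canonical[$word] : $word;
      $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
    }
  }
  arsort($counts); // highest count first
  return $counts;
}

// Assumed sample data standing in for a real query log.
$log = array('cheap car', 'automobile parts', 'car parts');
print_r(keyword_stats($log, array('automobile' => 'car')));
// car => 3, parts => 2, cheap => 1
```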
benc
"Work smarter, not harder."
http://digitalsolutions.ph
User Experience Design
A Podcast for Mac Switchers
indexing node author into the search index
Hi all, this seems like a good place to ask this question.
I was just asked by one of our staff why the Drupal search function cannot find a contributor's article by the contributor's name, even though that name is printed right there on the home page. I tested it and, sure enough, you cannot search for an author's name, even though it is right there. The user rightly pointed out that this is very confusing for someone expecting the search feature to "just work".
We'd like to implement this functionality, and I would be willing to create the patch given some clues about the best way to do it and where to start. If it's a bad idea, could you please explain why and whether there is some sort of alternative?
Thanks for any help you can provide, and good luck with everything you're working on Blake!
Fortunately this is pretty
Fortunately this is pretty straightforward. The author's name needs to be indexed along with the text of the content. To do this, you'd add a line to function node_update_index() that appends the author's name, probably in <h2> tags, to the text that gets sent to the indexer.
nodeapi 'update index'
I think the cleaner implementation is to implement a nodeapi 'update index'. Something like:
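function yourmodule_nodeapi($node, $op, $arg = 0) {
  switch ($op) {
    case 'update index':
      // Skip anonymous content (uid 0); there is no author name to index.
      if ($node->uid) {
        // Pass the uid as an array of conditions, the form user_load() accepts
        // in Drupal 4.7 through 6. $account avoids confusion with global $user.
        $account = user_load(array('uid' => $node->uid));
        return $account->name;
      }
  }
}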
Since all of the CCK fields are in $body, the user name might be the only thing not indexed, so I think it's a good idea and might be worth considering for core (Steven being the arbiter, not me). If you submit a patch, please link back here. One reason it might not be considered for core is that it's so easy to do in a contrib module (as I've shown above). You might consider writing a contrib module first as a proof of concept. But first, make sure nobody else has done this yet; I haven't checked.
Doug Green
www.douggreenconsulting.com
www.dougjgreen.com