A hook for search scoring factors?

Events happening in the community are now at Drupal community events on www.drupal.org.
robertdouglass's picture

do_search currently takes a bunch of parameters which are supposed to be sql snippets that get used in building the search queries. The only example of these being used is in node_search() where extra tables are joined to make scoring factors for things like comments count and page views, etc. My idea is to explore whether each of these scoring factors could be added in a more organized way, in do_search, by invoking a hook:

hook_scoring_factor($search, $keys)

The hook would build an array of queries that would come directly from the relevant modules (ie if $search is 'node', comment module would be responsible for returning a full SQL query that would calculate the comment count scoring factor). This array would then be converted into a series of subqueries in the main query. We could normalize the scores based on the number of subqueries returned and account for search-time weighing of the scoring factors (from the admin settings) right in the main query:

mysql> select 'field1', 'field2', 0.33 * (4 / 11) * (select 10) + 0.33 * (5 / 11) * (select 3) + 0.33 * (2 / 11) * (select 7) as score;
+--------+--------+----------+
| field1 | field2 | score    |
+--------+--------+----------+
| field1 | field2 | 2.070000 |
+--------+--------+----------+

In the above, 0.33 comes from the fact that there are 3 subqueries (1 / 3). The second calculation (4/11, 5/11, 2/11) would be the scoring factor weights (total weights = 11, 4, 5 and 2 are the individual scoring factors). Then select 10, select 3 and select 7 are simplified versions of the queries that would be returned by hook_scoring_factor.

Clearly this is an over-simplistic example and not even a proof-of-concept, but I want to throw the idea out there for the smart brains to chew on and help refine.

Oh, and why would we want to do this? So that new-fangled modules such as voting modules or other metrics systems can inject their own scoring factors. Wouldn't it be cool if the votes a node has received could improve its positioning in the search results?

Comments

Proof of Concept

douggreen's picture

I think it's a great idea. Not only is it extensible, it actually cleans up the node.module code. Here's a 6.x patch - http://drupal.org/node/145242

(I originally wrote the technical details about the patch here. Given that the 6.x code freeze fastly approaching, I was anxious to turn this into an issue. Since both comments said almost exactly the same thing, I removed this comment in favor of the d.o comment, so that discussion takes place in only one thread. My apologies if this breaks some protocol.)

Would this be any help?

BlakeLucchesi's picture

I'd like to see Drupal core get rid of the automatic ANDing of search terms. I feel like it punishes the user for inadvertently including too many search terms. What I would really like to see is automatic OR searching and then relying on the sum of scores from the found values to sort the more relevant terms to the top.

Below is a piece of pseudo sql that may help express what I'm suggesting:

SELECT sid, SUM( score )
FROM search_index
WHERE word LIKE '%term1%'
OR word LIKE '%term2%'
GROUP BY sid
ORDER BY sum( score )

I think by using the above type structure we can accomplish a basic search in one single query, bypass temporary tables, and increase the likelihood of relevant results.

The above query doesn't do anything like assess for -keyword or "key phrase", but using this hook feature we could possibly do the query rewriting

my blog (mostly drupal)

This is basically what the current search does

douggreen's picture

The current search does basically what you suggest, the only difference between the AND and OR terms is the use of the HAVING clause. As long as {search_index} is unique (which it isn't) this works. Please see my proposed patch for a faster core search - http://drupal.org/node/145560.

AND: SELECT sid, SUM(score) FROM search_index WHERE (word = 'term1' OR word = 'term2') GROUP BY sid HAVING COUNT(*) = 2 ORDER BY sum(score)

OR: SELECT sid, SUM(score) FROM search_index WHERE (word = 'term1' OR word = 'term2') GROUP BY sid HAVING COUNT(*) > 0 ORDER BY sum(score)

I believe that AND and OR searching have exactly the same performance. You can drop the HAVING clause with the OR (and I do), but I don't think this has any impact on performance.

Also, your example uses LIKE. I think this is a bad idea. A like here would find root words and would not use indexes and thus be much slower.

Search

Group organizers

Group notifications

This group offers an RSS feed. Or subscribe to these personalized, sitewide feeds: