Apache SOLR to index a HUGE forum

flexer's picture

I've been busy getting the latest (dev) Apache_Solr module to work with our complex multi-site setup. It all went fine until we had to import a huge phpBB2 forum (28k topics, nearly 1 million comments; some topics have thousands of comments).

Now I'm puzzled: which route should I take? Has anyone successfully implemented something like this?

Off the top of my head, I'd treat each comment as a separate document to feed SOLR, using a custom "indexer" script that extracts both the topics and the comments (topics first, then comments).

Everything else would then be handled in the template.php file and its associated tpl files: for each result I'd test whether it is a comment or a topic and, if it's a comment, load the relevant topic to show alongside the comment itself...

Issues with this approach:
1. How would one treat a comment as a "document", with a bogus $nid?
2. How could I get back the parent topic of a comment directly from the SOLR schema?
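
One possible way around both issues, sketched in Python rather than in the module's own PHP: give every document a composite unique key instead of a bogus $nid, and store the parent topic's nid on each comment document so the topic can be found from the search result itself. The field names (`id`, `type`, `nid`, `cid`) and the `node/…` / `comment/…` key format are assumptions for illustration, not the apachesolr module's actual schema.

```python
def topic_doc(nid, title, body):
    """Build a document for a forum topic (field names are hypothetical)."""
    return {
        "id": f"node/{nid}",   # composite unique key, no clash with comments
        "type": "topic",
        "nid": nid,
        "title": title,
        "body": body,
    }


def comment_doc(cid, nid, body):
    """Build a document for a single comment, keeping a link to its topic."""
    return {
        "id": f"comment/{cid}",  # unique key built from the comment id
        "type": "comment",
        "nid": nid,              # parent topic, stored on the comment itself
        "cid": cid,
        "body": body,
    }


def parent_topic_id(doc):
    """Return the unique key of the topic a result belongs to."""
    return f"node/{doc['nid']}"


topic = topic_doc(42, "Which search engine?", "Looking for advice.")
comment = comment_doc(1007, 42, "We use SOLR and it works well.")
print(comment["id"], "belongs to", parent_topic_id(comment))
```

With something like this, the theme layer would only need to check `type` on each result and, for comments, load the topic named by `nid`.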

Pondering... In the meantime, if anyone has an idea I'd be pleased to hear it :)


I tested something similar,

kajetan - Fri, 2009-01-30 22:55

I tested something similar, but have never (yet) run it live. I imported a forum that was slightly bigger than yours: about 5.8 million comments, 600,000+ topics and some 200,000+ users.

I used http://drupal.org/project/apachesolr basically out-of-the-box to see what would happen. It indexed every node and its comments as one document, so you ended up in the right topic when your search matched a comment.

Before I started I was, like you, thinking about indexing every comment as a separate document. But now I'm pretty sure that indexing nodes is better, because topics in which more comments match what you search for get a better score.
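
To illustrate the idea, here is a rough Python sketch of a topic and its comments merged into one document. The field names and both helper functions are hypothetical, and the plain term count is only a crude stand-in for relevance, not SOLR's actual scoring.

```python
def merged_topic_doc(nid, title, body, comments):
    """Merge a topic and all of its comments into one document (hypothetical fields)."""
    return {
        "id": f"node/{nid}",
        "nid": nid,
        "title": title,
        "body": body,
        "comments": " ".join(comments),  # all comment text in one field
    }


def term_count(doc, term):
    """Count whole-word occurrences of a term across the merged text."""
    text = " ".join([doc["title"], doc["body"], doc["comments"]]).lower()
    return text.split().count(term.lower())


busy = merged_topic_doc(1, "Search", "Which engine?", ["solr is fast", "I use solr"])
quiet = merged_topic_doc(2, "Search", "Which engine?", ["no idea"])
print(term_count(busy, "solr"), term_count(quiet, "solr"))
```

A topic whose comments mention the search term more often contains the term more times in its single document, which is the intuition behind the better score.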

I didn't get much further than that. Testing with this amount of data takes time :) Indexing the forum took 3 days on my development machine.


Using apache_solr with a comment bias

kcoop - Wed, 2009-04-29 14:49

I've built a system like this that is essentially a forum, where the topics are nodes with comments. I now want to give it a search facility. SOLR seems a great way to go, and apache_solr looks like it would get me much of the way there. In an email discussion with Robert Douglass, in which I was considering implementing comments as nodes, he suggested that small tweaks could be made to index comments rather than nodes.

I like kajetan's point about using the relevancy one gets from clustering under a node. But most importantly, I want the results to be individual comments, both for scanning the results and for navigation. I'd like clicking on a result to navigate specifically to that comment, perhaps using an anchor, though I'm not sure how that works with paging.
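
For the anchor-plus-paging part, a minimal sketch in Python, assuming comments are shown flat in a fixed order, that each comment's zero-based position in its topic is known at index time, and that links take the form `node/NID?page=N#comment-CID`. The function name, the URL shape and the default of 50 comments per page are assumptions for illustration.

```python
def comment_url(nid, cid, position, per_page=50):
    """Build a link to one comment, including the page it falls on.

    position is the comment's zero-based index within its topic.
    """
    page = position // per_page               # zero-based page number
    query = f"?page={page}" if page > 0 else ""  # first page needs no parameter
    return f"node/{nid}{query}#comment-{cid}"


print(comment_url(42, 1007, position=120))
print(comment_url(42, 1001, position=10))
```

This breaks down for threaded display or if the per-page setting changes after indexing, so the position would have to be recomputed or resolved at display time in those cases.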

Looking at the code, it appears fairly node-centric. I want to return specific comments rather than their nodes, and as such I'm considering building a completely new module based on this one. I'd hate to do this though, for the obvious drupally reasons. Before I dive in, I'm wondering if anyone has any thoughts on how best to extend/refactor the code to accommodate this scenario.


apachesolr with comment bias

Ariesto - Wed, 2009-04-29 15:34

@kcoop, it sounds like you and I have both been investigating a similar revised forum system.

I have been debating the same question: implement comments as nodes, or keep the original comment system. I've read lots of conflicting opinions on the subject, so I would love it if this thread could settle which approach provides the best search results.

Keep in mind that the solution will need to let the administrator choose which node types with comments return comment-biased results rather than the node, and there will need to be a way for the user to choose between the comment bias and the original node topic. The ability for a search result to anchor to the comment within the paging is a necessity (otherwise it wouldn't be a very helpful search bias). ApacheSolr is great, so we should definitely develop a solution that uses that module/platform.

I can't offer any code advice, because this is outside my area of expertise. However, I can offer to chip in some funds to help get this problem solved and the code implemented.