Multisite Search using ApacheSolr module

auxin's picture

Hi,

Can anyone let me know if it is possible to index and search multiple Drupal and non-Drupal websites using the ApacheSolr module?

If not please let me know of any other way that this could be achieved.

Thanks

Comments

Yes :) Kyle Mathews

kyle_mathews's picture

Yes :)

Kyle Mathews

Thanks

auxin's picture

Hi Kyle,

Thanks for the response. I've been looking for a way to search one Drupal site and several non-Drupal sites using the ApacheSolr module (running on the Drupal site, of course).

I'd be very grateful if you could let me know how this can be done.

Regards,
Auxin

I am far from an expert on

kyle_mathews's picture

I am far from an expert on Solr (ok, not at all) but I'll give a go at describing what you'd do and rely on people who actually understand how this is done to correct me where I stray.

First you set up Apache Solr on a server somewhere.

Then you install the ApacheSolr Drupal module on each of your Drupal sites. Make sure to install the apachesolr multisite search module as well. I've never configured this, so you'll have to figure it out, but basically each Drupal site needs to point at your Apache Solr server.

Finally, for each non-drupal site, you'll have to write a custom indexer. Solr is a web service. To index your other sites, basically you just need to create a big XML document with all the necessary information and send it to Solr. I haven't done this before but I'm sure the Solr documentation explains this very well. You could probably look at the ApacheSolr module code as well for some hints although most of that is probably Drupal specific.
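As a rough illustration of the "big XML document" idea, here is a minimal Python sketch that builds a Solr `<add>` update message and posts it to Solr's update handler. The field names in the usage note (`id`, `site`, `url`, `title`, `body`) and the update URL are assumptions for illustration only; they have to match whatever your schema.xml and Solr install actually define.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field: value} dicts as a Solr <add> update message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list or tuple is treated as a multi-valued field:
            # one <field> element is emitted per value.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(update_url, payload):
    """POST an XML update message (bytes) to Solr; return the HTTP status code."""
    request = urllib.request.Request(
        update_url,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, you might call `post_to_solr("http://localhost:8983/solr/update", build_add_xml([{"id": "http://example.edu/about", "site": "http://example.edu/", "title": "About"}]))` and then post `b"<commit/>"` to the same URL so the new documents become searchable. The host, port, and path here are placeholders for your own server.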

Once all of the sites are indexed, Solr will let you search just one of the sites or all of them together. The drupal apachesolr module is pre-configured to do all that.
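Searching one site versus all of them usually comes down to whether a filter query is added to the request. The sketch below builds a Solr select URL with an optional `fq` restriction. It assumes every indexed document carries a field identifying its site; the field name `site` and the select URL are assumptions here, so check them against your own schema.xml.

```python
from urllib.parse import urlencode


def build_search_url(select_url, keywords, site=None, site_field="site"):
    """Build a Solr select URL, optionally restricted to a single site."""
    params = [("q", keywords)]
    if site is not None:
        # Wrap the value in quotes so the colons and slashes of a URL
        # are not interpreted as Solr query syntax.
        escaped = site.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{site_field}:"{escaped}"'))
    return f"{select_url}?{urlencode(params)}"
```

Leaving `site` as `None` searches everything in the index; passing a value narrows the results to documents whose site field matches exactly. This only works if the non-Drupal documents were indexed with that field populated.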

Kyle Mathews

Thanks for your help Kyle.

auxin's picture

Thanks for your help Kyle. This sounds like what I am trying to do right now.

I am using Apache Nutch to crawl the non-Drupal sites, and I send the XML generated by Nutch to Solr for indexing. After doing this I could only search all the sites together (not just one site), and even then the results had malformed URLs. I thought this was probably because I had made an error in combining the schema.xml from the ApacheSolr module with the schema.xml from Nutch. Do you think this might be the problem, or is it something else? How would you suggest fixing it in either case?

Also, I don't have the apachesolr multisite search module that you mentioned in your post. From the looks of http://drupal.org/node/408942 and the module's CVS, they seem to have removed the multisite module for the time being. I guess I'll have to wait for them to incorporate it into the main module's code.

Thanks again for the quick reply, I'm very grateful for your advice.

Regards,
Auxin

Hmmm. . . your questions are

kyle_mathews's picture

Hmmm. . . your questions are definitely beyond my knowledge level now -- anyone else want to jump in?

I hadn't realized that the multi-search part for ApacheSolr is gone -- I guess you'll have to wait (or jump in and help out if you can).

Kyle Mathews

Multisite .... eventually

robertdouglass's picture

The issue for jumping in and helping on getting multisite working again is here: http://drupal.org/node/411262

Note that we need this feature by the time g.d.o. and d.o. relaunch with the redesign, so it will get done.

The 1.0 release of Nutch from last month has Solr support baked in, but nobody has yet investigated whether that makes it compatible with Drupal's ApacheSolr implementation. This is definitely an interesting research area that a lot of people are clamoring for.

That'd be really powerful,

kyle_mathews's picture

That'd be really powerful, to have Nutch and Drupal's ApacheSolr working closely together. I work at my university right now. Universities tend to be a hodgepodge of different website technologies. It'd be really powerful to be able to crawl all the sites using Nutch and then search across the sites from a Drupal install.

Kyle Mathews

I ran across this guide

kyle_mathews's picture

I ran across this guide tonight about integrating Nutch and Solr. A Drupal feller chimed in in the comments about using Nutch with Drupal/ApacheSolr and said they worked together as long as you're careful to reconcile the two schema.xml files:
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

Kyle Mathews

katbailey's picture

Hey all,

I found the above article extremely helpful and was able to get Nutch and Solr working together. I then tried to create a schema.xml that was an amalgamation of the one from Nutch and the one from the ApacheSolr Drupal module. This essentially just meant adding some of the Nutch fields to the Drupal schema, plus a copyField. Nutch didn't complain; it sent its docs to Solr no problem. However, the Drupal indexing no longer worked. The conflict was the copyField, which is as follows: <copyField source="url" dest="id"/>. Nutch uses the url as the unique key for documents, whereas Drupal uses a hash of the nid. When I removed that, I could send Drupal content to Solr again. So I'd basically have to keep changing this one line in the schema depending on which content I want to index.
I think there's probably a clever way around this that hopefully doesn't involve delving into the guts of the SolrIndexer plugin in Nutch. Probably all that's needed is a Nutch-specific id field (one that will always be empty for Drupal docs) that can be copied into the id field. The problem with the url field is that Drupal docs use it too.
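When merging two schemas by hand, a small check for this kind of conflict can save some trial and error. The following is only a sketch: it assumes the merged schema.xml declares its key with a top-level `<uniqueKey>` element and lists any copyField rules that write into that key field, like the url-to-id rule described above. It does not fix anything, it just reports the rules worth a second look.

```python
import xml.etree.ElementTree as ET


def copyfields_into_unique_key(schema_xml):
    """Return (source, dest) pairs for copyField rules that target the uniqueKey field."""
    root = ET.fromstring(schema_xml)
    # <uniqueKey> is expected as a direct child of <schema>.
    unique_key = (root.findtext("uniqueKey") or "").strip()
    if not unique_key:
        return []
    return [
        (rule.get("source"), rule.get("dest"))
        for rule in root.iter("copyField")
        if rule.get("dest") == unique_key
    ]
```

Run against a schema containing `<uniqueKey>id</uniqueKey>` and the Nutch copyField, this would report the `url` to `id` rule, which is the one that clashes with documents that already supply their own id.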

There are other problems with trying to use the two together though. I can't see how to get much control over the HTML parsing and field mapping of Nutch content, and I can't imagine how you could get facets covering your content from Nutch. Oh, and to get Drupal to even show the Nutch docs in search results I had to hack the ApacheSolr module in a couple of places.

We're probably not going to use Nutch after all anyway, so I'm currently investigating other options... just wanted to share my findings ;-)

Katherine

Oh, and I meant to also

katbailey's picture

Oh, and I meant to also mention that when I did get ApacheSolr search on Drupal to display Nutch docs in the results, I had to apply this patch, http://drupal.org/node/411262#comment-1615246, so that the documents linked to the correct URLs.

Hmm thanks for the post, I

auxin's picture

Hmm, thanks for the post. I definitely didn't know about the copyField issue. Facets were a concern, but from my Googling I'm led to believe that facets are not possible or viable for a multisite search.

That, plus the lack of control over parsing (as you mentioned) and the extra time and effort involved, ultimately made us go with a different solution. I'm still looking into this as and when I get time.

Any further developments or

Grayside's picture

Any further developments or experiments on this quest?

Further Developments

dstuart's picture

I will be presenting a solution to the Nutch and Drupal problem at Drupalcon SF: http://bit.ly/dftBpU. I am madly trying to finish the Nutch module by that date. On the multisite side of things, we are currently developing a module that should make this a little easier.

Regards,

David Stuart

Any news?

st4rdust's picture

Any further developments or experiments on this quest?

Lucene, Nutch and Solr
