Posted by auxin on April 30, 2009 at 4:40pm
Hi,
Can anyone let me know if it is possible to index and search multiple Drupal and non-Drupal websites using the ApacheSolr module?
If not, please let me know of any other way that this could be achieved.
Thanks
Comments
Yes :)
Kyle Mathews
Thanks
Hi Kyle,
Thanks for the response. I've been looking around for a way to search one Drupal site and several non-Drupal sites using the ApacheSolr module (running on the Drupal site, of course).
I'd be very grateful if you could please let me know how this can be done.
Regards,
Auxin
I am far from an expert on Solr (ok, not at all) but I'll give a go at describing what you'd do and rely on people who actually understand how this is done to correct me where I stray.
First you set up Apache Solr on a server somewhere.
Then you install the ApacheSolr Drupal module on each of your Drupal sites. Make sure to install the apachesolr multisite search module as well. I've never configured this, so you'll have to figure it out, but basically each Drupal site needs to point at your Apache Solr server.
Finally, for each non-Drupal site, you'll have to write a custom indexer. Solr is a web service. To index your other sites, basically you just need to create a big XML document with all the necessary information and send it to Solr. I haven't done this before, but I'm sure the Solr documentation explains this very well. You could probably look at the ApacheSolr module code as well for some hints, although most of that is probably Drupal-specific.
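To make the "big XML document" idea concrete, here is a minimal Python sketch of such an indexer. It is only an illustration, not the ApacheSolr module's method: the field names (id, site, url, title, body) and the update URL are assumptions and have to match whatever your schema.xml and Solr install actually use.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(pages):
    """Build a Solr <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for page in pages:
        doc = ET.SubElement(add, "doc")
        for name, value in page.items():
            # Each value becomes <field name="...">value</field>.
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(xml_bytes, update_url):
    """POST an XML update message to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        update_url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_pages(pages, update_url):
    """Send the documents, then a <commit/> so they become searchable."""
    post_to_solr(build_add_xml(pages), update_url)
    post_to_solr(b"<commit/>", update_url)
```

You would call it with something like index_pages([{"id": "legacy/about", "site": "http://legacy.example.edu/", "url": "http://legacy.example.edu/about.html", "title": "About", "body": "..."}], "http://localhost:8983/solr/update"), where the example.edu values and the localhost URL are placeholders.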
Once all of the sites are indexed, Solr will let you search just one of the sites or all of them together. The Drupal apachesolr module is pre-configured to do all that.
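As a rough sketch of what "one site or all of them" looks like at the Solr level, the Python below builds a select URL with an optional filter query. It assumes every document carries a field named site that identifies where it came from. That field name, the base URL, and the example site value are assumptions, not something confirmed in this thread.

```python
from urllib.parse import urlencode


def build_select_url(base_url, keywords, site=None):
    """Build a Solr select URL, optionally restricted to a single site."""
    params = [("q", keywords), ("wt", "json")]
    if site is not None:
        # Quote the value so characters such as ':' and '/' are treated
        # as part of the term rather than as query syntax.
        escaped = site.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", 'site:"%s"' % escaped))
    return base_url + "?" + urlencode(params)
```

Leaving site as None searches everything in the index, while passing a value such as "http://legacy.example.edu/" adds an fq parameter that limits results to documents whose site field matches.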
Kyle Mathews
Thanks for your help Kyle. This sounds like what I am trying to do right now.
I am using Apache Nutch to crawl the non-Drupal sites, and I send the XML generated by Nutch to Solr for indexing. After doing this I was only able to search all the sites together (I could not search just one site), and even then the results that were returned had ill-formed URLs. I thought it was probably because I had made some error in combining the schema.xml from the ApacheSolr module with the schema.xml from Nutch. Do you think this might be the problem? Or is it something else, and how would you suggest fixing it in either case?
Also, I don't have the apachesolr multisite search module that you mentioned in your post. From the looks of http://drupal.org/node/408942 and the module's CVS, they seem to have removed the multisite module for the time being. I guess I'll have to wait for them to incorporate it into the main module's code.
Thanks again for the quick reply, I'm very grateful for your advice.
Regards,
Auxin
Hmmm. . . your questions are definitely beyond my knowledge level now -- anyone else want to jump in?
I hadn't realized that the multi-search part for ApacheSolr is gone -- I guess you'll have to wait (or jump in and help out if you can).
Kyle Mathews
Multisite .... eventually
The issue for jumping in and helping on getting multisite working again is here: http://drupal.org/node/411262
Note that we need this feature by the time g.d.o. and d.o. relaunch with the redesign, so it will get done.
The 1.0 release of Nutch from last month has Solr support baked in, but nobody has yet investigated whether that makes it compatible with Drupal's ApacheSolr implementation. This is definitely an interesting research area that a lot of people are clamoring for.
That'd be really powerful, to have Nutch and Drupal's ApacheSolr working closely together. I work at my university right now. Universities tend to be a hodge podge of different website technologies. It'd be really powerful to be able to crawl all the sites using Nutch and then search across the sites from a Drupal install.
Kyle Mathews
I ran across this guide tonight about integrating Nutch and Solr. A Drupal feller chimed in in the comments about using Nutch with Drupal/ApacheSolr and said they worked together as long as you're careful to match up the two schema.xml files:
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
Kyle Mathews
Nutch and ApacheSolr working together will take some work...
Hey all,
I found the above article extremely helpful and was able to get nutch and solr working together. I then tried to create a schema.xml that was an amalgamation of the one from Nutch and the one from the ApacheSolr Drupal module - this essentially just meant adding some of the Nutch fields to the Drupal schema, and a copyField. Nutch didn't complain, it sent its docs to solr no problem. However the Drupal indexing no longer worked. The conflict was the copyField, which is as follows
<copyField source="url" dest="id"/>
Nutch uses the url as the unique key for documents, whereas Drupal uses a hash of the nid. When I removed that, I could send Drupal content to solr again. So I'd basically have to keep changing this one line in the schema depending on which content I want to index.
I think there's probably a clever way around this that hopefully doesn't involve delving into the guts of the SolrIndexer plugin in Nutch. Probably all that's needed is a nutch-specific id field (one that will always be empty for Drupal docs) that can be copied into the id field. The problem with the url field is that Drupal docs use it too.
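One plausible reading of the conflict is that a Drupal doc already supplies its own id, so copying url into id leaves the unique key with two values. The toy Python model below imitates copyField behaviour to show that, along with the proposed workaround. It is not Solr code, and the nutch_id field name is hypothetical.

```python
def apply_copy_fields(doc, copy_fields, single_valued):
    """Toy model of copyField: doc maps field name -> list of values."""
    result = {name: list(values) for name, values in doc.items()}
    for source, dest in copy_fields:
        # Every value of the source field is appended to the destination.
        for value in doc.get(source, []):
            result.setdefault(dest, []).append(value)
    for name in single_valued:
        if len(result.get(name, [])) > 1:
            raise ValueError("multiple values for single-valued field %r" % name)
    return result


# Example documents (all values are made up for illustration).
drupal_doc = {"id": ["abc123/node/1"], "url": ["http://example.com/node/1"]}
nutch_doc = {"url": ["http://legacy.example.com/a.html"]}

# The rule from the merged schema, and the hypothetical workaround rule.
url_rule = [("url", "id")]
workaround_rule = [("nutch_id", "id")]
```

With url_rule, nutch_doc gets its id from url, but drupal_doc raises because id ends up with two values. With workaround_rule, drupal_doc passes through untouched, though Nutch would then have to populate nutch_id itself, which is the part that may still need changes on the Nutch side.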
There are other problems with trying to use the two together though. I can't see how to get much control over the html parsing and field-mapping of nutch content and can't imagine how you could get facets covering your content from Nutch. Oh and to get Drupal to even show the nutch docs in search results I had to hack the ApacheSolr module in a couple of places.
We're probably not going to use nutch after all anyway, so I'm currently investigating other options... just wanted to share my findings ;-)
Katherine
Oh, and I meant to also mention that when I did get ApacheSolr search on drupal to display nutch docs in the results, I had to apply this patch http://drupal.org/node/411262#comment-1615246 so that the documents linked to the correct urls.
Hmm, thanks for the post, I definitely didn't know about the copyField issue. Facets were a concern, but ultimately from my Googling I'm led to believe that facets are not possible/viable in the case of a multi-site search.
That, plus the lack of control over parsing (like you mentioned) and the extra time and effort involved, made us ultimately go with a different solution. I'm still looking into this as and when I get time....
Any further developments or experiments on this quest?
Further Developments
I will be presenting a solution to the Nutch and Drupal problem at DrupalCon SF http://bit.ly/dftBpU. I am madly trying to finish the Nutch Module for this date. On the multisite side of things, we are currently developing a module that should make this a little easier.
Regards,
David Stuart
Any news?
Any further developments or experiments on this quest?