Architecture question re: huge indexes

Todd Young

I intend to build a Drupal 6 site with some custom node types indexed by ApacheSolr using the XML schema provided. However, I'd also like to add a number of very large indexes that reside only in the Lucene/Solr system (ha) and not physically in my Drupal MySQL database. For example, I may have a thousand or even a hundred thousand nodes in Drupal, but I might have ten different "external" indexes with millions of records each, and I'd like to run faceted search against the whole lot of 'em.

How do I go about appending multi-million-record indexes to my working Drupal 6 + ApacheSolr configuration so that I can search across the universe?

I anticipate (ultimately) hundreds of millions of very small records in these indexes, served by a sweatshop of Lucene servers, but only one Drupal site. So of course I can't just create hundreds of millions of nodes in my poor little MySQL server.

I know this is a complex and wide-ranging question, so I'm hoping for general pointers to appropriate reference material.

Comments

Huge indices

dstuart

Hi yountod,

Do you already have the non-Drupal indices? Is their schema the same as, or similar to, the Drupal schema? How are you currently getting the millions of records into Solr?

Todd Young

I have none yet, but I'm anticipating a variety of ongoing sources, such as indexing other databases and filesystems, and probably crawling stuff with Nutch. The Drupal user may even be able to specify new Nutch targets or new file locations on a continual basis right from inside the app. So I guess that's all on the Solr side, and Drupal has nothing to do with "seeing" that stuff. But once Lucene has indexed it, will my Drupal ApacheSolr search results include that non-Drupal content?

If it's in the same index format

dstuart

Hey,
The Apache Solr module can display it if it uses the same fields. I am taking over the Nutch module, and it will have this functionality in terms of mapping fields to the Apache Solr schema format. I would recommend downloading the Apache Solr module and looking at the schema provided, to ensure you are indexing your other sources using the appropriate fields.

Regards

Dave
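To illustrate the field-mapping idea above, here is a minimal Python sketch of reshaping an external record into a document that uses Drupal-style field names. The field names (`id`, `url`, `title`, `body`, `type`), the `source/sha1(url)` key format, and both helper functions are assumptions made for illustration only; the schema.xml shipped with the module is the authority on which fields actually exist and how the unique key is formed.

```python
import hashlib

# Assumed field names for illustration; verify against the module's schema.xml.
DRUPAL_STYLE_FIELDS = ("id", "url", "title", "body", "type")


def external_doc_id(source: str, url: str) -> str:
    """Build a stable unique key for a non-Drupal record from its source and URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{source}/{digest}"


def to_solr_doc(source: str, record: dict) -> dict:
    """Map an external record onto the assumed Drupal-style field names."""
    return {
        "id": external_doc_id(source, record["url"]),
        "url": record["url"],
        "title": record.get("title", ""),
        "body": record.get("content", ""),
        # Hypothetical choice: reuse the content-type field as a source label.
        "type": source,
    }


if __name__ == "__main__":
    crawled = {
        "url": "http://example.com/page",
        "title": "Example",
        "content": "Some crawled text",
    }
    print(to_solr_doc("nutch", crawled))
```

Keying on a hash of the URL is just one way to keep external records from colliding with node-based keys; any scheme works as long as the keys are unique across all sources.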

katbailey

As far as I'm aware, you will also need to hack the ApacheSolr module slightly to get it to show you non-Drupal results, because it modifies the query that's sent to Solr in order to add a node_access check. This was a problem I was coming up against a few months back anyway, when trying to display search results that included both Drupal nodes and Nutch-crawled content. The other problem was not being able to use the same schema for Nutch and Drupal, because Nutch uses the URL as the unique key, but it sounds like Dave is on that - nice one Dave! :-)
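As a rough sketch of what such a tweak amounts to on the Solr side, the access restriction can be OR-ed with a clause that lets external documents through. The Python below only builds the request parameters. `q`, `fq`, `facet` and `facet.field` are standard Solr parameters, but the function name and both clause strings are hypothetical placeholders, not the filter the module actually generates.

```python
from urllib.parse import urlencode


def build_search_params(keywords, access_clause, external_clause, facet_fields):
    """Build a Solr query string whose filter admits access-checked nodes OR external docs."""
    # Hypothetical filter: a document passes if it satisfies the access clause
    # or if it is marked as coming from an external (non-Drupal) source.
    fq = f"({access_clause}) OR ({external_clause})"
    params = [("q", keywords), ("fq", fq), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


if __name__ == "__main__":
    # Placeholder clauses; substitute whatever your schema and module really use.
    print(build_search_params("lucene", "access_ok:true", "type:nutch", ["type"]))
```

Without the second clause, documents that carry no access information would be filtered out before faceting, which matches the symptom described above.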

Lucene, Nutch and Solr
