To answer some of Robert's questions and open up discussion, I've implemented a non-Drupal search using the Zend port of Lucene. I used Lucene to index an archive of mbox mailing list files, and saved files, offsets, and search terms in the index. I then added a tab to the Drupal search and hooked in the Zend Search API to search the Lucene indexes. At first I used the Zend engine to build the index, but at 0.2 (or whatever version it was at the time) it couldn't build the full index without memory faults. I fell back to Java Lucene to build the index, and I currently use the PHP Zend Search to read the indexes and pull up the email messages and threads.
The index itself is about 100MB, and the mbox files are about 400MB.
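For anyone wondering how offsets might be collected from an mbox file before they go into an index, here is a minimal sketch in plain Java. It is not the code behind the site above. It assumes "offsets" means the byte position where each message starts, and that a message starts at a line beginning with "From " that is either the first line of the file or follows a blank line. The class and method names (MboxOffsets, messageOffsets) are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or Zend Search.
class MboxOffsets {

    // Returns the byte offset of every "From " separator line in the buffer.
    // A separator is only recognized at the start of the file or after a blank line.
    static List<Long> messageOffsets(byte[] mbox) {
        List<Long> offsets = new ArrayList<>();
        byte[] marker = "From ".getBytes(StandardCharsets.US_ASCII);
        boolean prevLineBlank = true; // the start of the file counts as a boundary
        int lineStart = 0;
        while (lineStart < mbox.length) {
            // Find the end of the current line (the position of '\n' or end of data).
            int lineEnd = lineStart;
            while (lineEnd < mbox.length && mbox[lineEnd] != '\n') {
                lineEnd++;
            }
            if (prevLineBlank && startsWith(mbox, lineStart, marker)) {
                offsets.add((long) lineStart);
            }
            // Treat a line holding only "\r" as blank too (CRLF line endings).
            int contentEnd = lineEnd;
            if (contentEnd > lineStart && mbox[contentEnd - 1] == '\r') {
                contentEnd--;
            }
            prevLineBlank = (contentEnd == lineStart);
            lineStart = lineEnd + 1;
        }
        return offsets;
    }

    // True if data contains prefix starting at position pos.
    static boolean startsWith(byte[] data, int pos, byte[] prefix) {
        if (pos + prefix.length > data.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[pos + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MboxOffsets <mbox file>");
            return;
        }
        // Reads the whole file into memory; fine for a sketch, not for huge archives.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (long offset : messageOffsets(data)) {
            System.out.println(offset);
        }
    }
}
```

Each offset, paired with the file name, could then be stored alongside the indexed terms so that a search hit can seek straight to the message. For an archive of several hundred MB, a streaming reader would be a better fit than loading the file in one go.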
The Zend Search has worked really well for searching, and I couldn't be happier with the Lucene indexer in general. It is faster and more generalized than the other indexers I've used.
I don't currently plan to start a Drupal project, as I see my implementation as pretty specialized.
Feel free to see it run:
http://www.bronco.com/cms/search/mybox (search for something automotive)
-John
Comments
Merging indexes
Have you had any experience merging distinct indexes into one? This would be beyond the segment merging that takes place internally.
My case would be multiple spiders, each feeding its own index, then merging them into one for searching. This link returned from a Google search says, "Yes, you can merge two indexes using Lucene's IndexWriter.addIndexes() methods." But not being a Java programmer, I am not able to test this.
~jim
I've used the IndexWriter's
I've used IndexWriter's addIndexes() in Java, but I'm not sure whether the Zend folks have implemented it.
It works as expected.
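To give non-Java readers a feel for what merging separate indexes involves, here is a toy sketch in plain Java. It is not Lucene code and does not use IndexWriter. It only illustrates the general idea, assuming each index maps terms to document IDs and that the documents from the second index are renumbered to follow those of the first. ToyIndex and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A toy in-memory inverted index; purely illustrative, not Lucene's data structure.
class ToyIndex {
    // term -> ascending list of IDs of the documents containing that term
    final Map<String, List<Integer>> postings = new TreeMap<>();
    int docCount = 0;

    // Indexes one document and returns the ID assigned to it.
    int addDocument(String text) {
        int docId = docCount++;
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    // Merges another index into this one, shifting its document IDs
    // so they come after the documents already present here.
    void addIndex(ToyIndex other) {
        int base = docCount;
        for (Map.Entry<String, List<Integer>> entry : other.postings.entrySet()) {
            List<Integer> docs = postings.computeIfAbsent(entry.getKey(), t -> new ArrayList<>());
            for (int id : entry.getValue()) {
                docs.add(base + id);
            }
        }
        docCount += other.docCount;
    }

    // Returns the IDs of documents containing the term (empty list if none).
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        ToyIndex spiderA = new ToyIndex();
        spiderA.addDocument("Ford Bronco engine");
        spiderA.addDocument("brake pads");

        ToyIndex spiderB = new ToyIndex();
        spiderB.addDocument("Bronco transmission");

        spiderA.addIndex(spiderB); // spiderA now covers all three documents
        System.out.println(spiderA.search("bronco"));
    }
}
```

In the multiple-spiders case, each spider would build its own index and a batch job would fold them together like this before searching. The real Lucene merge works on on-disk index directories, so treat this only as a picture of the concept.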
-John
I would not need the PHP
I would not need the PHP implementation, as I would use the Java implementation to aggregate the spider and Drupal site indexes into one large collection. I have Java programmers on staff to help me with that part, and this would be a back-end batch process. The search itself would take place via our search appliance (either Omnifind or a custom Zend interface). My goal was to get away from crawling the dynamic Drupal sites so frequently to update the index. Now I can let each Drupal site maintain its own index and then aggregate them for the enterprise search.
Thanks for the info!