Solr Nutch Search Sandbox Project Updated to Integrate with Common Schema

Hello all:

Based on our discussion last month on IRC, I reconfigured this sandbox project as a small set of Nutch settings that create an index compatible with the common schema of the apachesolr module.

The purpose is ad-hoc crawling and indexing with Nutch, while searching happens within Drupal, where the crawled results are integrated with the Drupal node results.

This is for Nutch 1.x only at this stage.


Thanks for sharing! Great work.

Search API and Solr integration

Hi Chris

Thanks for publishing this body of work.

Can you elaborate a little?

You changed the sandbox project to be compatible with the apachesolr module and also Drupal's native search?

Have you tested apachesolr module integration? Which version works?

Also, what about the Search API module suite? Sarnia?

Sorry to bug you, as you have given Nutch in Drupal its first update in 2 years...


Wow. Apologies. You actually documented the sandbox. % )
Will check it out.

The obvious questions are around the existing Nutch module: have you used it, etc.?

Nutch 1.6 works also

Congratulations on the fabulous work.

I have this working using the VirtualBox development environment DrupalPro, running the Drupal distribution Open Outreach rc8, and I am getting results in my site search! YAY!

I am running a bigger index and testing the latest Apache Solr Views module, Tika, etc.

I will also try it with Search API.

I wanted to test with Panopoly as well, and perhaps OpenAcademy.

There is also a Semantic Web / Linked Data distro, just released, which could be interesting.

The obvious thing is to try the D7 version of the long-forsaken Nutch module.

I'll shift across to the project issue queues. I'd suggest this is worthy of full project status.

What would be needed to use this with Nutch 2.x?

A co-maintainer

Are you interested?

I would be interested!

Solr 4.4.0
Nutch 2.2.1
Latest Drupal

Seems that I "kinda" got it to work, except that each result's title displays the actual URL, so it is duplicated: "URL" - "Content Description" - "URL" instead of "Title" - "Content Description" - "URL"
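
To make that symptom concrete, here is a minimal Python sketch of one way it could arise. It is only an illustration, not the sandbox's actual configuration: the field names (`title`, `label`, `url`, `content`), the mapping table, and the function `to_common_schema` are all assumptions made for this example. The idea is that if the crawler's title field is never mapped to the field Drupal displays as the result title, and something falls back to the URL, the URL ends up showing twice.

```python
# Hypothetical sketch: every field name below is an assumption for illustration,
# not taken from the sandbox project or from any real schema file.

# Assumed mapping from crawler field names to the search schema's field names.
FIELD_MAP = {"url": "url", "content": "content", "title": "label"}


def to_common_schema(crawl_doc, field_map=FIELD_MAP):
    """Rename crawler fields to schema fields, falling back to the URL as title."""
    out = {}
    for source, dest in field_map.items():
        value = crawl_doc.get(source)
        if value:  # skip missing or empty fields
            out[dest] = value
    # If no title was mapped, the URL is reused as the display title,
    # which reproduces the "URL - Description - URL" pattern.
    out.setdefault("label", out.get("url", ""))
    return out


with_title = to_common_schema(
    {"url": "http://example.com/a", "title": "Page A", "content": "Some text"}
)
without_title = to_common_schema(
    {"url": "http://example.com/b", "content": "Other text"}
)
```

In this sketch, `with_title["label"]` is the page title, while `without_title["label"]` is the URL again. If the real setup behaves similarly, checking whether the title field is present in the indexed documents and mapped to the expected schema field would be a reasonable first step.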