Big Data Drupal with Cloudera, Hadoop, MapReduce, Nutch and Solr

niccolox's picture

Thanks to the recent work of the Solr Nutch sandbox project, I've managed to get Nutch 1.6 jobs running on a 4-node Cloudera CDH3 cluster. The jobs send their results to Solr 3.6.2 (hosted within Tomcat on Aegir BOA), and the indexed content is then integrated through the Apache Solr 7.1.1 module (the release, not the dev version) into search results and Apache Solr Views.
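One quick way to confirm that the Nutch jobs really delivered documents to Solr is to hit Solr's standard `/select` handler with a match-all query. Here is a minimal sketch that only builds the URL. The host, port and path are assumptions (they are not given in this post), and the helper name `solr_select_url` is made up for illustration.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query="*:*", rows=10):
    """Build a URL for Solr's /select handler that asks for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical address: replace with wherever Tomcat serves Solr on your box.
    print(solr_select_url("http://localhost:8080/solr", rows=0))
```

Opening the printed URL in a browser (or with curl) should return a response whose `numFound` value tells you how many documents are in the index.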

I must say, I am pretty excited about Hadoop / Cloudera running Nutch and Solr and integrating with Drupal

For anyone interested in setting up a Cloudera cluster, I recommend masterschema (CentOS) and Gregory Grubbs on YouTube (Debian).

I'll post some notes ASAP.

Big thanks
As the man said, standing on the shoulders of giants: @dtsuart, @cliefen and the apachesolr team, not to mention the pretty amazing Cloudera Manager.


Cloudera CDH3 Hadoop Cluster
Cloudera Manager 4, CDH3 (Hadoop, MapReduce, ZooKeeper etc.)
Ubuntu 10.04 LTS Server * 4 (3 nodes operational, 1 node needs fixing)
Nutch 1.6 (running in Hadoop/MapReduce)

Aegir BOA

Aegir BOA 2.5 (with Tomcat running Solr 3.6.2; memory etc. still needs to be increased)
Solr 3.6.2
Nutch Solr sandbox project
ApacheSolr module 7.1.1 (NOT dev)
ApacheSolr Views module
Open Outreach rc9

I've blogged some scraps of notes and useful commands.


Well Done...

Anonymous's picture

Sweet writeup, thanks for sharing!

Michael Clendening

CDH4 and Ubuntu 12

niccolox's picture

Quick update: I got the Nutch job running on 25 threads across 4 nodes on CDH4 on Ubuntu 12.04.2 LTS.
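For readers wondering how a thread count gets passed to a Nutch 1.x job on Hadoop, here is a rough sketch that assembles a command line for the Nutch 1.x `Crawl` class. It is an assumption that the job was launched this way; the post does not say. The seed directory, Solr URL, depth and topN values are placeholders, and the helper name `nutch_crawl_command` is invented for this example.

```python
def nutch_crawl_command(seed_dir, solr_url, threads, depth, top_n,
                        job_file="apache-nutch-1.6.job", crawl_dir="crawl"):
    """Return the argument list for running the Nutch 1.x Crawl class via Hadoop."""
    return [
        "hadoop", "jar", job_file, "org.apache.nutch.crawl.Crawl",
        seed_dir,                    # HDFS directory holding the seed URL list
        "-solr", solr_url,           # Solr instance that receives the results
        "-dir", crawl_dir,           # HDFS directory for crawl data
        "-threads", str(threads),    # number of fetcher threads
        "-depth", str(depth),        # number of crawl rounds
        "-topN", str(top_n),         # max URLs fetched per round
    ]


if __name__ == "__main__":
    # Placeholder values, except the 25 threads mentioned above.
    cmd = nutch_crawl_command("urls", "http://localhost:8080/solr", 25, 3, 1000)
    print(" ".join(cmd))
```

Building the command as a list keeps it easy to hand to `subprocess.run` later without worrying about shell quoting.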


niccolox's picture

I've added an extra commercial open-source product to the tool-chain (Bonita Soft) and used the Nutch Multisite module from the ApacheSolr Examples project.

I created a slide-deck and a business document.

I hope to do a decent blog post about this soon.

Nutch 1.x On CDH4

bjaxelsen's picture

Thanks for sharing.

I am curious about a few things:

How come you ended up using Nutch 1.x rather than 2.x?

Did you come across any dependency issues? As far as I know, Nutch depends on older versions of Hadoop, HBase and ZooKeeper than the versions that ship with CDH4. I may be wrong.

I have looked at Cloudera Search (an extension to CDH4). Did you (or maybe someone else within this group) try it out?

good questions

niccolox's picture

thanks for the feedback.

I am doing a session at BADCamp in Berkeley in two weeks.

also, have a mini site

as to your specific question;

Nutch 1.6 is stable and feature complete, and it does work on CDH3/CDH4. Nutch 1.7 is a departure, with pluggable search backends, i.e. Solr and Elasticsearch.

Nutch 2 is a big change: a bunch of features are lost, and again there are various pluggable search engines. Nutch 2 doesn't work on CDH3/CDH4; there is a Nutch JIRA issue and a Cloudera JIRA issue about that.

my older blog post is at

I am preparing a new blog post and preparing a new presentation over the next couple of weeks


bjaxelsen's picture

Thanks for the reply, I will go with the 1.x version then.

personally, I would use 1.6.

niccolox's picture

Personally, I would use 1.6, not 1.7.

1.7 seems to be a bridge to 2, and I just wanted something that works.


chandra_drupal's picture

Thanks for sharing

Apache Spark / Hadoop / Big Data

darrell_ulm's picture

I'm looking for more Drupal tie-ins to Apache Spark / Hadoop / Big Data that are running somewhere, if anyone has any additional info.

Lucene, Nutch and Solr
