Big Data Drupal with Cloudera, Hadoop, MapReduce, Nutch and Solr

niccolo's picture

Thanks to the recent work of the Solr Nutch sandbox project, I've managed to get Nutch 1.6 jobs running on a 4-node Cloudera CDH3 cluster, sending results to Solr 3.6.2 (hosted within Tomcat on Aegir BOA). The results are then integrated, via the Apache Solr 7.1.1 module (not the dev release), into search results and Apache Solr Views.
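For anyone who wants to check that crawled pages really landed in Solr before wiring up the Drupal side, here is a minimal sketch in Python. It only builds a URL for Solr's standard /select handler. The host, port and path in SOLR_BASE are assumptions for illustration, not the actual BOA setup, and build_select_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed location of the Solr instance inside Tomcat; adjust to your own setup.
SOLR_BASE = "http://localhost:8080/solr"


def build_select_url(base, query, rows=10):
    """Return a URL for Solr's /select handler asking for JSON results."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Open the printed URL in a browser, or fetch it with curl, to see
    # whether documents indexed by the Nutch job come back.
    print(build_select_url(SOLR_BASE, "drupal"))
```

If the response lists documents with the crawled URLs, the Nutch to Solr leg is working and any remaining problems are on the Drupal module side.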

I must say, I am pretty excited about Hadoop / Cloudera running Nutch and Solr and integrating with Drupal.

For anyone interested in setting up a Cloudera cluster, I recommend masterschema (CentOS) and Gregory Grubbs on YouTube (Debian).

I'll post some notes ASAP.

BIG Thanks
As the man said, standing on the shoulders of giants: @dtsuart, @cliefen and the apachesolr team, not to mention the pretty amazing Cloudera Manager.

BIG DATA

Cloudera CDH3 Hadoop Cluster
Cloudera Manager 4, CDH3 (Hadoop, MapReduce, ZooKeeper etc.)
Ubuntu 10.04 LTS Server * 4 (3 nodes operational, 1 node needs fixing)
Nutch 1.6 (running in Hadoop/MapReduce)

Aegir BOA

Aegir BOA 2.5 (with Tomcat running Solr 3.6.2; memory still needs to be increased)
Solr 3.6.2
Nutch Solr sandbox project
ApacheSolr module 7.1.1 (NOT dev)
ApacheSolr Views module
Open Outreach rc9

I've blogged some scraps of notes and useful commands:
http://niccolox.org/blog/big-data-drupal-cloudera-cluster-boa-solr-nutch...

Comments

Well Done...

mclendening's picture

Sweet writeup, thanks for sharing!

Peace,
Michael Clendening

CDH4 and Ubuntu 12

niccolo's picture

Quick update: got the Nutch job running on 25 threads across 4 nodes on CDH4 on Ubuntu 12.04.2 LTS.
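As a rough sketch of what launching such a job can look like, the Python snippet below assembles a command line for the Nutch 1.x Crawl class run through hadoop jar. The job file name, seed directory, Solr URL, depth and topN values are assumptions for illustration, and mapping the "25 threads" above onto the -threads option is a guess on my part, not something confirmed by the post.

```python
import shlex


def build_crawl_command(seed_dir, solr_url, threads=25, depth=3, top_n=1000,
                        job_file="apache-nutch-1.6.job"):
    """Assemble a hadoop jar invocation of the Nutch 1.x Crawl class.

    All default values are illustrative assumptions, not the settings
    used on the cluster described in this thread.
    """
    return [
        "hadoop", "jar", job_file, "org.apache.nutch.crawl.Crawl",
        seed_dir,                    # HDFS directory holding the seed URL list
        "-solr", solr_url,           # index the crawl results into this Solr
        "-threads", str(threads),    # fetcher threads
        "-depth", str(depth),        # number of crawl rounds
        "-topN", str(top_n),         # max pages fetched per round
    ]


if __name__ == "__main__":
    cmd = build_crawl_command("urls", "http://localhost:8080/solr")
    # Print the command for review; run it by hand on a cluster node.
    print(shlex.join(cmd))
```

The snippet only prints the command rather than executing it, so it is safe to try on a machine without Hadoop installed.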

latest

niccolo's picture

I've added an extra commercial open-source product to the toolchain (Bonita Soft) and used the Nutch Multisite module from the ApacheSolr Examples project.

I created a slide deck http://www.scribd.com/doc/144088681/Big-Data-Drupal and a business document http://www.scribd.com/doc/144099411/Democratic-Big-Data-Drupal

I hope to do a decent blog post about this soon.

Nutch 1.x On CDH4

bjaxelsen's picture

Thanks for sharing.

I am curious about a few things:

How come you ended up using Nutch 1.x rather than 2.x?

Did you come across any dependency issues? As far as I know, Nutch depends on older versions of Hadoop, HBase and ZooKeeper than the versions that ship with CDH4. I may be wrong.

I have looked at Cloudera Search (an extension to CDH4). Did you (or maybe someone else in this group) try it out?

good questions

niccolo's picture

Hi,
thanks for the feedback.

I am doing a session at BADCamp in Berkeley in two weeks:
http://2013.badcamp.net/sessions/big-data-drupal-cloudera-hadoop-mapredu...

I also have a mini site, www.bigdatadrupal.com

As to your specific questions:

Nutch 1.6 is stable and feature complete, and it does work on CDH3/CDH4. Nutch 1.7 is a departure, with pluggable search backends, i.e. Solr and Elasticsearch.

Nutch 2 is a big change: a bunch of features are lost, and again there are various pluggable search engines. Nutch 2 doesn't work on CDH3/CDH4; there is a Nutch JIRA issue and a Cloudera JIRA issue about that.

My older blog post is at
http://niccolox.org/blog/big-data-drupal

I am preparing a new blog post and a new presentation over the next couple of weeks.

Thanks

bjaxelsen's picture

...for the reply, I will go with the 1.x version then.

personally, I would use 1.6.

niccolo's picture

Personally, I would use 1.6, not 1.7.

1.7 seems to be a bridge to 2, and I just wanted something that works.

thanks

chandoo_abi's picture

Thanks for sharing

Lucene, Nutch and Solr
