Thanks to the recent work of the Solr Nutch sandbox project, I've managed to get Nutch 1.6 jobs running on a 4-node Cloudera CDH3 cluster, sending results to Solr 3.6.2 (hosted within Tomcat on Aegir BOA). The results are then integrated through the Apache Solr 7.1.1 module (not the dev) into search results and Apache Solr Views.
I must say, I am pretty excited about Hadoop / Cloudera running Nutch and Solr and integrating with Drupal
for anyone interested in setting up a Cloudera cluster, I recommend masterschema (CentOS) and Gregory Grubbs on YouTube (Debian)
I'll post some notes etc ASAP
BIG Thanks
as the man said, standing on the shoulders of giants: @dtsuart, @cliefen and the apachesolr team, not to mention the pretty amazing Cloudera Manager
BIG DATA
Cloudera CDH3 Hadoop Cluster
Cloudera Manager 4, CDH3 (Hadoop, MapReduce, ZooKeeper etc.)
Ubuntu 10.04 LTS Server * 4 (3 nodes operational, 1 node needs fixing)
Nutch 1.6 (running in Hadoop/MapReduce)
Aegir BOA
Aegir BOA 2.5 (with Tomcat running Solr 3.6.2; still need to increase memory etc.)
Solr 3.6.2
Nutch Solr sandbox project
ApacheSolr module 7.1.1 (NOT dev)
ApacheSolr Views module
Open Outreach rc9
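For anyone who wants to check that the Nutch results actually landed in Solr before wiring up the Drupal side, here is a minimal sketch that builds a query URL for Solr's standard select handler. The host, port and the `title` field name are assumptions for illustration only, not taken from the setup above; adjust them to your own Tomcat/Solr install and schema.

```python
from urllib.parse import urlencode


def build_solr_select_url(base_url, query, rows=10, start=0):
    """Build a URL for Solr's /select handler that asks for JSON output.

    base_url is the Solr root (hypothetical here), e.g. http://localhost:8080/solr
    """
    params = {"q": query, "rows": rows, "start": start, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical host/port and field name; replace with your own values.
    print(build_solr_select_url("http://localhost:8080/solr", "title:drupal", rows=5))
```

Opening the printed URL in a browser (or with curl) should return the matching documents as JSON if the crawl was indexed.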
Blogged some scraps of notes and useful commands etc
http://niccolox.org/blog/big-data-drupal-cloudera-cluster-boa-solr-nutch...
Comments
Well Done...
Sweet writeup, thanks for sharing!
Peace,
Michael Clendening
CDH4 and Ubuntu 12
quick update: got the Nutch job running on 25 threads across 4 nodes on CDH4 on Ubuntu 12.04.2 LTS
latest
I've added an extra commercial open-source product to the tool-chain (Bonita Soft) and used the Nutch Multisite module from ApacheSolr Examples project
I created a slide-deck http://www.scribd.com/doc/144088681/Big-Data-Drupal and a business document http://www.scribd.com/doc/144099411/Democratic-Big-Data-Drupal
hope to do a decent blog post about this soon
Nutch 1.x On CDH4
Thanks for sharing.
I am curious about a few things:
How come you ended up using Nutch 1.x rather than 2.x?
Did you come across any dependency issues? As far as I know, Nutch depends on older versions of Hadoop, HBase and ZooKeeper than the versions that ship with CDH4. I may be wrong.
I have looked at Cloudera Search (an extension to CDH4). Did you (or maybe someone else within this group) try it out?
good questions
hi,
thanks for the feedback.
I am doing a session at BADCamp in Berkeley in two weeks
http://2013.badcamp.net/sessions/big-data-drupal-cloudera-hadoop-mapredu...
also, I have a mini site www.bigdatadrupal.com
as to your specific questions:
Nutch 1.6 is stable and feature complete and does work on CDH3/CDH4. Nutch 1.7 is a departure, with pluggable search backends, i.e. Solr and Elasticsearch
Nutch 2 is a big change: a bunch of features are lost and, again, there are various pluggable search engines. Nutch 2 doesn't work on CDH3/CDH4; there is a Nutch JIRA issue and a Cloudera JIRA issue about that
my older blog post is at
http://niccolox.org/blog/big-data-drupal
I am preparing a new blog post and preparing a new presentation over the next couple of weeks
Thanks
...for the reply, I will go with the 1.x version then
personally, I would use 1.6. Not 1.7
1.7 seems to be a bridge to 2 and I just wanted something that works
thanks
Thanks for sharing
Apache Spark / Hadoop / Big Data
Out of interest, I'm looking for more Drupal tie-ins to Apache Spark / Hadoop / Big Data that are actually running somewhere, if anyone has any additional info.
Drupal StackExchange
I know this thread is old but I am looking for information now
Anietie Ekanem
Social Niche Guru Inc.
http://SocialNicheGuru.com
http://facebook.com/SocialNicheGuru