Install and Configure Nutch in 5 minutes


Ok, here we go. This information is only relevant to those starting out with Nutch for the first time, or to developers who test various Nutch functions and have to tear down and set up repeatedly to confirm results. There are many ways to do this, but this works for me. Also, there are scripts here, so run them at your own risk. If you don't know what a command does, ask.

1.) Log in to your server using ssh and create a directory named /stuff
a.) mkdir /stuff
b.) touch nutch.sh
c.) vi nutch.sh
d.) Copy the contents below into the file. (Make sure you change XXX to match your instance. NOTE: in "chown XXX:XXX -R /lib/nutch" the XXX is your webserver user. Also note that if you put your ApacheSolr module in sites/all you will have to adjust that path. chmod 777 is just for demonstration purposes; you are responsible for your own security.)


cd /stuff
mkdir /lib/nutch
wget http://mirrors.kahuki.com/apache//nutch/apache-nutch-1.2-src.zip
unzip apache-nutch-1.2-src.zip
cp -rf apache-nutch-1.2/* /lib/nutch
cd /lib/nutch
mkdir -p crawl/crawldb crawl/segments crawl/linkdb seed logs
touch seed/urls
touch logs/hadoop.log
cp -rf /home/XXX/www/XXX/modules/apachesolr/*.xml /lib/nutch/conf
ant
chown XXX:XXX -R /lib/nutch
chmod 777 -R /lib/nutch
cd /stuff
rm -rf apache-nutch-1.2
rm -f apache-nutch-1.2-src.zip

e.) chmod 700 nutch.sh
f.) ./nutch.sh
2.) cd /lib/nutch/conf
3.) vi nutch-site.xml
a.) Insert the following between the <configuration> and </configuration> tags, and change the values that contain xxx:
<property>
  <name>http.agent.name</name>
  <value>xxx</value>
  <description>MUST NOT be empty</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>xxx</value>
  <description>Further description of our bot</description>
</property>

<property>
  <name>http.agent.url</name>
  <value>www.xxx.com</value>
  <description>A URL to advertise in the User-Agent header.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value>xxx at xxx dot xxx</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>xxx-2.2.1</value>
  <description>A version string to advertise in the User-Agent
   header.</description>
</property>

<property>
  <name>http.agent.host</name>
  <value>xxx.com</value>
  <description>Name or IP address of the host on which the Nutch crawler
  would be running.</description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.</description>
</property>

<property>
  <name>generate.max.count</name>
  <value>xxx</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>

<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>If true, fetcher will store content.</description>
</property>
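All of the property blocks above sit inside the single <configuration> element of nutch-site.xml. As a point of reference, a minimal skeleton of the file looks like this sketch (the xxx value is a placeholder, as above):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>xxx</value>
    <description>MUST NOT be empty</description>
  </property>
  <!-- ...the remaining property blocks from above go here... -->
</configuration>
```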

b.) Save it and exit (press Esc, then type :wq and press Enter)

4.) vi solrindex-mapping.xml
5.) Replace everything from <fields> to </fields> with:

          <fields>
                <field dest="site" source="site"/>
                <field dest="title" source="title"/>
                <field dest="host" source="host"/>
                <field dest="segment" source="segment"/>
                <field dest="boost" source="boost"/>
                <field dest="digest" source="digest"/>
                <field dest="tstamp" source="tstamp"/>
                <field dest="id" source="url"/>
                <field dest="body" source="content"/>
                <copyField source="url" dest="url"/>
        </fields>

a.) Save it and exit (press Esc, then type :wq and press Enter)
6.) vi schema.xml
a.) Add these field declarations in the <fields> section. (Note: schema.xml declares fields with name/type attributes; the dest/source syntax belongs only in solrindex-mapping.xml.)

  <field name="host" type="string" stored="false" indexed="true"/>
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="boost" type="float" stored="true" indexed="false"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="tstamp" type="date" stored="true" indexed="false"/>


b.) Save it and exit (press Esc, then type :wq and press Enter)
7.) Copy all of your important files from /lib/nutch/conf to /stuff (solrconfig.xml,schema.xml,solrindex-mapping.xml, nutch-site.xml)
a.) cd /lib/nutch/conf
b.) cp solrconfig.xml /stuff (do this for all other xml files)
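If you'd rather not type the cp command four times, step 7 can be done in one loop. This is just a sketch: CONF_DIR and BACKUP_DIR are stand-ins for /lib/nutch/conf and /stuff so the snippet can be tried anywhere; on a real install, set them to the actual paths and drop the touch loop that fakes the files.

```shell
# Stand-ins for /lib/nutch/conf and /stuff (override with real paths).
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
BACKUP_DIR="${BACKUP_DIR:-$(mktemp -d)}"

# Pretend the customized files exist (remove this loop on a real install).
for f in solrconfig.xml schema.xml solrindex-mapping.xml nutch-site.xml; do
    touch "$CONF_DIR/$f"
done

# Back up every config file you customized, in one pass.
for f in solrconfig.xml schema.xml solrindex-mapping.xml nutch-site.xml; do
    cp "$CONF_DIR/$f" "$BACKUP_DIR/"
done

echo "backed up: $(ls "$BACKUP_DIR" | wc -l) files"
```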

That's it. If you want to destroy your nutch instance just go back to ssh and:

rm -rf /lib/nutch/*
cd /stuff
./nutch.sh
cp *.xml /lib/nutch/conf

To get super lazy, just add this line to the end of the nutch.sh script after you have set up Nutch correctly the first time.

cp *.xml /lib/nutch/conf

After that, technically, you can just type ./nutch.sh and wham, your whole Nutch instance is re-installed and set up the way you need it. If I missed something, let me know and I will make sure to correct it.
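This page stops at setup, but the seed/urls file and crawl directories created above exist so you can run a crawl and push the results into Solr. A typical first run looks roughly like the commands below; the seed URL, the Solr URL (localhost:8983), and the -depth/-topN values are placeholders you should adjust for your own site:

```
cd /lib/nutch
echo "http://www.xxx.com/" >> seed/urls    # at least one seed URL is required
bin/nutch crawl seed -dir crawl -depth 3 -topN 50
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
```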
