Posted by broncomania on November 1, 2010 at 1:40am
I am trying to crawl user websites and build a relationship between them in Solr. My knowledge of Nutch and Solr is just at the beginning, but I think this is a really useful feature. Maybe someone has experience with this topic and can give me a clue or a hint for doing this with Nutch, Solr and Drupal.
Comments
I have a very similar need
I have a very similar need for crawling external sites based on URLs stored in nodes. I have a node type called "Company" that includes a brief company description along with an external link to the company's website. I would like Nutch to crawl each external URL and use the results on my site. Example...
Company A is a medical company that provides knee braces. The content in the "Company" node mentions knee braces but does not reference individual products. I was hoping Nutch would index content on the company website so Drupal could return "Company A" if a user searched for a product they sold. Can this be done with Solr and Nutch?
SOLUTION
Okay, I got it working. It's very easy if you know how, and I think I'm the first to post this solution. You have to add your user uid to the seed list of your Nutch installation.
For example.
http://www.meshle.com{a tab character goes here}uid=USER_UID
That's the first step. The second step is to extend your Nutch crawler with a plugin that grabs the information from the seed list. There is already a plugin which handles this, but I forgot which one ... I think it's the URL Meta Indexing Filter (urlmeta).
Of course you have to add the field to the Solr XML schema! From then on you can personalize your Solr content with facets or whatever. If you need more info, just contact me. Hope this helps a little bit.
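A minimal sketch of the first step in Python: writing the seed list so that every entry really contains a tab character between the URL and the uid. The function names, the file path and the `sites` mapping are hypothetical and only for illustration; the URL, tab, `uid=VALUE` layout is the one described above.

```python
from pathlib import Path


def seed_line(url, uid):
    """Return one seed-list entry: the URL, a real tab character, then uid=<value>."""
    return f"{url}\tuid={uid}"


def write_seed_list(path, sites):
    """Write one entry per line; `sites` is a hypothetical {url: uid} mapping."""
    lines = [seed_line(url, uid) for url, uid in sites.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical data: in practice the URLs and uids would come from Drupal.
    write_seed_list("seed.txt", {"http://www.meshle.com": 1})
```

Generating the file this way avoids a common pitfall: the two characters backslash and t typed into a text editor are not the same thing as a tab.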
Hi, thanks for your
Hi,
Thanks for your solution!! But I still don't get it!
What I did:
----modified nutch-site.xml-----
<property><name>plugin.includes</name>
<value>...index-(basic|anchor|urlmeta)...</value>
<description>...
</description>
</property>
<property>
<name>urlmeta.tags</name>
<value>newTag</value>
</property>
-----added the tag to urls.txt ------
http://www.url.net\tnewTag=1
---modified schema.xml in solr/conf/schema.xml ---
<field name="newTag" type="string" stored="true" indexed="true" />
Nothing happens... What am I doing wrong?
Cheers Jepse
Unable to crawl Drupal site with Nutch 1.12
I tried the above module, but it is not working. If anyone has found a solution, please guide me.