Create a relation between Nutch-crawled websites and the user

broncomania

I am trying to crawl user websites and build a relationship between each site and its user in Solr. I am just getting started with Nutch and Solr, but I think this would be a really useful feature. Maybe someone has experience with this topic and can give me a clue or a hint for doing this with Nutch, Solr and Drupal.

Comments

I have a very similar need

bee2b

I have a very similar need for crawling external sites based on URLs stored in nodes. I have a node type called "Company" that includes a brief company description along with an external link to the company's website. I would like Nutch to crawl each external URL and use the results on my site. Example...

Company A is a medical company that provides knee braces. The content in the "Company" node mentions knee braces but does not reference individual products. I was hoping Nutch would index the content on the company website so Drupal could return "Company A" if a user searched for a product the company sells. Can this be done with Solr and Nutch?

SOLUTION

broncomania

Okay, I got it working. It's very easy once you know how, and I think I'm the first to post this solution. You have to add the user's uid to the seed list of your Nutch installation, separated from the URL by a tab character.
For example:
http://www.meshle.com{a literal tab character goes here}uid=USER_UID

That's the first step. The second step is to extend your Nutch crawler with a plugin that picks up this information from the seed list. There is already a plugin that handles this, but I forgot which one... I think it's the URL Meta Indexing Filter (urlmeta).

Of course you also have to add the field to the Solr XML schema! From then on you can personalize your Solr content per user with facets or whatever you like. If you need more info, just contact me. Hope this helps a little bit.
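A minimal Python sketch of the two ends of this approach, in case it helps: building seed-list lines with a real tab between the URL and the uid, and building a Solr query that filters on that uid. The function names, the example uid and the Solr base URL are made up for illustration. It also assumes that the indexed Solr field is named uid, like the seed-list key, which you should verify against your own setup.

```python
from urllib.parse import urlencode


def seed_line(url: str, uid: int) -> str:
    """Return one seed-list line: URL, a real tab character, then uid=<uid>."""
    # "\t" inside a Python string literal is an actual tab character,
    # not a backslash followed by the letter t.
    return f"{url}\tuid={uid}"


def write_seed_file(path: str, sites) -> None:
    """Write (url, uid) pairs to a seed file, one tab-separated line each."""
    with open(path, "w", encoding="utf-8") as fh:
        for url, uid in sites:
            fh.write(seed_line(url, uid) + "\n")


def solr_user_query(base_url: str, uid: int, text: str = "*:*") -> str:
    """Build a Solr select URL restricted to documents of one user.

    Assumption: the Solr schema has an indexed field called "uid".
    """
    params = {"q": text, "fq": f"uid:{uid}", "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical example values, not taken from a real installation.
print(repr(seed_line("http://www.meshle.com", 7)))
print(solr_user_query("http://localhost:8983/solr", 7, "knee braces"))
```

The important detail is that the separator in the seed file has to be a real tab character. If an editor writes the two characters backslash and t instead, the metadata will most likely not be recognized.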

Hi, thanks for your

jepse

Hi,

Thanks for your solution! But I still don't get it.

What I did:

----modified nutch-site.xml-----

<property>
  <name>plugin.includes</name>
  <value>...index-(basic|anchor|urlmeta)...</value>
  <description>...
  </description>
</property>
<property>
<name>urlmeta.tags</name>
<value>newTag</value>
</property>

-----added the tag to urls.txt ------
http://www.url.net\tnewTag=1

---modified schema.xml in solr/conf/schema.xml ---

<field name="newTag" type="string" stored="true" indexed="true"  />

Nothing happens... What am I doing wrong?

Cheers Jepse

Naveen Balakrishnan

I tried the above module, but it is not working. If anyone has found a solution, please share it.
