Parsing Views Feeds

ebeyrent's picture

I am new to Nutch and am attempting to parse a site that has two Views blocks on the front page, both of which also provide feeds.

My first attempt to parse resulted in the following error:

parser not found for contentType=application/xhtml+xml

I attempted to fix this by editing conf/parse-plugins.xml, where I added:

Now, when I attempt to parse, I get the following:

Can't be handled as rss document. org.apache.commons.feedparser.FeedParserException: org.jdom.input.JDOMParseException: Error on line 102: The entity name must immediately follow the '&' in the entity reference.
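That message is the standard XML parser complaint about an `&` that does not start a valid entity reference, which would be consistent with the fetched document containing a bare ampersand around line 102. The sketch below is only an illustration in Python, not part of Nutch: it reproduces the same class of error on a made-up fragment and shows one possible pre-processing workaround. The names `find_xml_error`, `escape_bare_ampersands`, and the sample text are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches an '&' that does not begin a well-formed entity or character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def find_xml_error(text):
    """Return (line, message) for the first well-formedness error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, _column = err.position
        return line, str(err)
    return None


def escape_bare_ampersands(text):
    """Replace stray '&' characters with '&amp;', leaving existing references alone."""
    return BARE_AMP.sub("&amp;", text)


# Hypothetical feed fragment with an unescaped '&' on line 3.
sample = """<rss>
  <channel>
    <title>News & Events</title>
  </channel>
</rss>"""

print(find_xml_error(sample))
print(find_xml_error(escape_bare_ampersands(sample)))
```

Running the first function against the saved page or feed should point at the offending line. Escaping only covers bare ampersands; HTML-only entities such as `&nbsp;` would still be rejected by a strict XML parser.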

Has anyone encountered this before? If so, were you able to resolve the issue, and how?

Comments

Include Plugin nutch-site.xml

maxmmize's picture

Is the plugin included in your nutch-site.xml? Are you using a URL normalizer?

Also, try to put code tags around the code you were showing us: < code> < /code>, with no spaces after the <.

ebeyrent's picture

Ugh. Sorry about that. Here's what I added:

<mimeType name="application/xhtml+xml">
  <plugin id="parse-rss" />
  <plugin id="feed" />
</mimeType>

Here are the plugins in my nutch-site.xml file:

<property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|parse-rss|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
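One quick check, sketched here under the assumption that Nutch treats `plugin.includes` as a regular expression that must match a plugin's whole id, is to test each id mapped in parse-plugins.xml against that value. Python's `re` module stands in for Java's regex engine (the two agree for this simple pattern), and `is_included` is a hypothetical helper, not a Nutch API.

```python
import re

# The plugin.includes value quoted above, copied verbatim.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-html|parse-rss|index-(basic|anchor)"
    "|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


# Ids that appear in the parse-plugins.xml mappings in this thread.
for plugin_id in ("parse-html", "parse-rss", "feed", "parse-tika"):
    print(plugin_id, is_included(plugin_id))
```

If that assumption holds, `feed` and `parse-tika` are referenced in parse-plugins.xml but never activated by this `plugin.includes` value, so only `parse-rss` would actually be tried for the new mapping.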

Here's what the entire parse-plugins.xml file looks like.

<?xml version="1.0" encoding="UTF-8"?>
<!--
        Licensed to the Apache Software Foundation (ASF) under one or more
        contributor license agreements.  See the NOTICE file distributed with
        this work for additional information regarding copyright ownership.
        The ASF licenses this file to You under the Apache License, Version 2.0
        (the "License"); you may not use this file except in compliance with
        the License.  You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

        Unless required by applicable law or agreed to in writing, software
        distributed under the License is distributed on an "AS IS" BASIS,
        WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        See the License for the specific language governing permissions and
        limitations under the License.

        Author     : mattmann
        Description: This xml file represents a natural ordering for which parsing
        plugin should get called for a particular mimeType.
-->

<parse-plugins>

  <!--  by default if the mimeType is set to *, or
        if it can't be determined, use parse-tika -->
        <mimeType name="*">
          <plugin id="parse-tika" />
        </mimeType>

        <mimeType name="application/rss+xml">
            <plugin id="parse-rss" />
            <plugin id="feed" />
        </mimeType>

        <mimeType name="application/x-bzip2">
                <!--  try and parse it with the zip parser -->
                <plugin id="parse-zip" />
        </mimeType>

        <mimeType name="application/x-gzip">
                <!--  try and parse it with the zip parser -->
                <plugin id="parse-zip" />
        </mimeType>

        <mimeType name="application/x-javascript">
                <plugin id="parse-js" />
        </mimeType>

        <mimeType name="application/x-shockwave-flash">
                <plugin id="parse-swf" />
        </mimeType>

        <mimeType name="application/zip">
                <plugin id="parse-zip" />
        </mimeType>

        <mimeType name="text/xml">
                <plugin id="parse-tika" />
                <plugin id="parse-rss" />
                <plugin id="feed" />
        </mimeType>

        <mimeType name="application/xhtml+xml">
          <plugin id="parse-rss" />
          <plugin id="feed" />
        </mimeType>

       <!-- Types for parse-ext plugin: required for unit tests to pass. -->

        <mimeType name="application/vnd.nutch.example.cat">
                <plugin id="parse-ext" />
        </mimeType>

        <mimeType name="application/vnd.nutch.example.md5sum">
                <plugin id="parse-ext" />
        </mimeType>

        <!--  alias mappings for parse-xxx names to the actual extension implementation
        ids described in each plugin's plugin.xml file -->
        <aliases>
                <alias name="parse-tika"
                        extension-id="org.apache.nutch.parse.tika.TikaParser" />
                <alias name="parse-ext" extension-id="ExtParser" />
                <alias name="parse-js" extension-id="JSParser" />
                <alias name="parse-rss"
                        extension-id="org.apache.nutch.parse.rss.RSSParser" />
                <alias name="feed"
                        extension-id="org.apache.nutch.parse.feed.FeedParser" />
                <alias name="parse-swf"
                        extension-id="org.apache.nutch.parse.swf.SWFParser" />
                <alias name="parse-zip"
                        extension-id="org.apache.nutch.parse.zip.ZipParser" />
        </aliases>

</parse-plugins>

maxmmize's picture

Have you tried parsing with Tika?

ebeyrent's picture

Yeah, I tried that first, since that's what the text/xml mimeType definition had.

Lucene, Nutch and Solr
