Feed Scraper 2: ideas and future development

ademarco

After a good talk with alex_b at DrupalCon Paris about Feed Scraper, I'd like to share some ideas about what the next version of the module could look like.

Here are the main ones:

  • Pipes: in Feed Element Mapper you can set a list of parsers for each feed content type. Sometimes a single sequence of parsers for the entire XML feed is not enough to capture all the information a feed carries. It would be handy to have a specialized sequence of parsers (a "pipe") for each CCK field you want to map the result to. Projects like QueryPath could suit this nicely; quoting its "About the Name" section: "The path part is intended to convey the chain-like "path" that one creates through". An issue has already been raised on the Feed Scraper module page about how to transform a list of words into taxonomy terms (http://drupal.org/node/571358), and a pipe system could really help here: for example, after getting the list of comma-separated words with the RegEx Parser, you would apply a String Explosion Parser to get the final array-like result.

  • Multiple values: if a scraper returns more than one value, it must be possible to map them to a multiple-value CCK field (a list of links, for example). Since Feed Element Mapper already supports this, it should not be difficult to implement.

  • HTML Scraping: this is already a feature request (http://drupal.org/node/466414). I think it would be really interesting to have, and QueryPath fits this use case too thanks to its built-in support for CSS selectors (a CSS Selector Parser could easily be derived from it).

  • Features integration: the next version of Feed Scraper has to be integrated with the Features module to easily bundle scrapers (or possibly pipes) into stand-alone features.
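
As a rough illustration of the pipe idea from the first bullet, here is a minimal sketch, written in Python rather than Drupal PHP to keep it short. The names regex_parser, explode_parser and run_pipe are hypothetical and are not part of Feed Scraper or Feed Element Mapper, and the sample input is made up.

```python
import re
from functools import reduce
from typing import Any, Callable, List

# A parser is any callable that takes the previous stage's output
# and returns the input for the next stage.
Parser = Callable[[Any], Any]


def regex_parser(pattern: str) -> Parser:
    """Hypothetical RegEx Parser: return the first capture group, or ''."""
    compiled = re.compile(pattern)

    def parse(text: str) -> str:
        match = compiled.search(text)
        return match.group(1) if match else ""

    return parse


def explode_parser(separator: str = ",") -> Parser:
    """Hypothetical String Explosion Parser: split, trim, drop empty parts."""

    def parse(text: str) -> List[str]:
        return [part.strip() for part in text.split(separator) if part.strip()]

    return parse


def run_pipe(pipe: List[Parser], value: Any) -> Any:
    """Feed the value through each parser of the pipe, in order."""
    return reduce(lambda current, parser: parser(current), pipe, value)


# One pipe per target field, e.g. a taxonomy field fed by a "Tags: ..." line.
tags_pipe = [regex_parser(r"Tags:\s*(.*)"), explode_parser(",")]
terms = run_pipe(tags_pipe, "Tags: drupal, feeds , scraping")
print(terms)
```

Each target field would get its own list of parsers, so a title field could use a one-step pipe while a taxonomy field chains two or more.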

That's all for the moment; I look forward to your feedback.

Antonio

Comments

alex_b

Pipes:

When I first thought of pipes, I was thinking of rules that would be represented as a string, e.g.:

// Searches all titles in the XML document and grabs all NewTimes strings from it.
$rule = '###XPATH###//title###REGEX###/New(.?)Times/i';

Would this approach be too limited? Could you give an example where it falls short? If it is too limited, are you looking at using the FeedAPI infrastructure for organizing rules into pipes?
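
To make the string format concrete, here is a minimal sketch of how such a rule could be split into ordered stages. It is written in Python for brevity, and parse_rule is a hypothetical helper, not an existing Feed Scraper or FeedAPI function. It assumes stage names are upper-case letters wrapped in ###, as in the example rule above (with its line break removed).

```python
import re
from typing import List, Tuple


def parse_rule(rule: str) -> List[Tuple[str, str]]:
    """Split '###NAME###argument...' into ordered (name, argument) stages."""
    # re.split with a capture group keeps the stage names in the result:
    # ['', 'XPATH', '//title', 'REGEX', '/New(.?)Times/i']
    parts = re.split(r"###([A-Z]+)###", rule)
    if parts[0].strip():
        raise ValueError("rule must start with a ###NAME### marker")
    names, arguments = parts[1::2], parts[2::2]
    return list(zip(names, arguments))


stages = parse_rule("###XPATH###//title###REGEX###/New(.?)Times/i")
for name, argument in stages:
    print(name, "->", argument)
```

One limit of this encoding shows up right away: an argument that itself contains something like ###FOO### would be split in the wrong place, so the string form would need an escaping convention.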

Multiple values:

Short: you should be fine here.

Support for multiple value sets and for complex value sets (i.e. an array tree or a complex object) is entirely up to the Mapper. FeedAPI Mapper only assumes that arrays with numeric keys are always plain arrays of values, and that you would never use a specific key in such an array as a mapping source.

Features integration:

Use CTools/export

For an example, look at http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/data/data.m...

When using CTools export, you declare your configuration table to CTools and use the CTools/export load wrappers for all loading operations. Exportables, and with them Features integration, come for free.

Feeds

I'm making some progress these days on Feeds: http://github.com/lxbarth/Feeds - you may want to start porting your parser to Feeds about a week from now. By then I will have the project at a usable point.

Parsing free text

amir simantov

Hi guys,

Is free text related to this discussion or not? I was not sure... Please refer to this issue: http://drupal.org/node/649516

Thanks!

how to modify css in incoming rss feed

mhalpern

I set up an Arabic RSS feed using the feeds module that works great except it doesn't display the imported feed nodes right to left. Does the feeds module have a way to specify that the created nodes should be displayed right to left? In fact, is there any way for the feeds module to specify the "look" of the imported feed?

mhalpern

Well, I realized that trying to use the Feeds module is not the way to go. I just modified page.tpl.php to look for nodes where the language=ar. In that case, I apply the CSS style direction: rtl. Works great. It seems like in the future it would be a lot nicer to be able to do this when you set up the feed.
