A new aggregator for Drupal 7

Aron Novak's picture

 

 
Problems:

  • Drupal's core aggregator does not satisfy the requirements of
    state-of-the-art aggregation applications. For example, it lacks
    pluggable configurations and cannot create nodes from feed items.
  • There is an explosion of contrib modules that try to cover the
    deficiencies of Drupal's aggregator in one way or another. Previous
    attempts to unify aggregation functionality had limited success.

Suggestion:

  • Create a simple but extensible API for aggregation that ships in a
    configuration that covers the most common use case of aggregating
    feeds as nodes on a web site.
  • The basic architecture consists of three pieces:
    • An aggregator.module that defines the API, invokes callbacks, and
      manages settings and cron processing.
    • A parser_common module that downloads and parses XML data into a
      PHP array. This parser should be independent of aggregator.module.
    • An aggregator_node module that creates nodes from feed items. This
      is a standard add-on module that depends on aggregator.module.
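
A minimal sketch of how these three pieces could fit together. It is written in Python purely for illustration (the real modules would be PHP), and every name in it (`parse_feed`, `Aggregator`, `register_processor`, `node_processor`) is hypothetical rather than part of any Drupal API. It assumes a plain RSS 2.0 document.

```python
import xml.etree.ElementTree as ET


def parse_feed(xml_text):
    """Stand-alone parser (the parser_common role): XML text in, plain dict out."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return {"title": root.findtext("channel/title", default=""), "items": items}


class Aggregator:
    """The aggregator.module role: owns the API and invokes registered callbacks."""

    def __init__(self, parser):
        self.parser = parser      # any callable that returns the dict shape above
        self.processors = []      # add-on callbacks, called once per feed item

    def register_processor(self, callback):
        self.processors.append(callback)

    def refresh(self, xml_text):
        feed = self.parser(xml_text)
        for process in self.processors:
            for item in feed["items"]:
                process(feed, item)
        return feed


nodes = []  # stands in for node storage


def node_processor(feed, item):
    """The aggregator_node role: turns each feed item into a node-like record."""
    nodes.append({"type": "feed_item", "title": item["title"], "url": item["link"]})


sample = (
    "<rss><channel><title>Example</title>"
    "<item><title>A</title><link>http://example.com/a</link></item>"
    "</channel></rss>"
)
aggregator = Aggregator(parse_feed)
aggregator.register_processor(node_processor)
refreshed_feed = aggregator.refresh(sample)
```

The parser knows nothing about the aggregator, so it can be reused on its own, while the node processor is only one of possibly several add-ons.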


Design details:

  • All feeds are represented as nodes.
  • Feeds can be configured on a per-content-type basis. For example,
    create a feed content type without og group integration for pulling
    in news feeds and a feed content type with og group integration for
    pulling in iCal feeds.
  • The aggregator module handles the settings of add-on modules
    through a settings hook. This helps add-on module developers deal
    with per-content-type settings.
  • As opposed to FeedAPI, there won't be a possibility to override
    per-content-type settings on a per-feed level. This is a heavyweight
    feature that is rarely used and a source of errors and complications.
  • Aggregator module will detect the type of feed and automatically
    choose the right parser if more than one parser is enabled.
  • Downloading on cron will be limited by a time budget (e.g. 50% of
    cron time).
  • The parser will be independent so it can be used without aggregator
    module.
  • The standard configuration will be very basic and, unlike the
    current core aggregator, will come without category handling.
  • It is still to be determined how to build views of feed items -
    depends on whether a views style view builder will go into D7 core.
  • We should possibly split feed downloading and feed parsing into two
    different hook operations.
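
The time-based cron handling from the list above could look roughly like this. It is a Python sketch, not Drupal code: `refresh_feeds_on_cron`, its parameters and the 240-second cron limit are assumptions made for the example, and the clock is injectable only so the behaviour can be shown without waiting.

```python
import time


def refresh_feeds_on_cron(feeds, refresh, cron_limit_seconds, share=0.5,
                          clock=time.monotonic):
    """Refresh feeds until `share` of the cron time limit has been used.

    Returns the feeds that were not reached, so the next cron run can
    start with them.
    """
    budget = cron_limit_seconds * share
    started = clock()
    remaining = list(feeds)
    # The budget is checked before each feed, so a slow feed can overrun it.
    while remaining and clock() - started < budget:
        refresh(remaining.pop(0))
    return remaining


# Fake clock: the run starts at 0 s, then the checks see 0 s, 40 s and 130 s.
fake_times = iter([0, 0, 40, 130])
refreshed = []
left_over = refresh_feeds_on_cron(
    ["feed-a", "feed-b", "feed-c"],
    refreshed.append,
    cron_limit_seconds=240,   # assumed cron limit, for illustration only
    clock=lambda: next(fake_times),
)
```

With the fake clock, two feeds fit into the 120-second budget and the third is handed back for the next run.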

Tests and documentation:

  • Documentation on api.drupal.org
  • Unit tests and possibly UI tests for optimal test coverage.

Comments

FYI: if this planned stuff

Aron Novak's picture

FYI: if this plan becomes part of Drupal 7, all of the FeedAPI-dependent contrib modules will be ported (feedapi_mapper - FEMP, feedapi_node_views, etc.).

Categorization

chx's picture

While I can see how it's a waste to reimplement all we have so nodes are a good choice, we must not lose the default categorization feature -- that's what Drupal Planet uses. I can definitely see how the categorization interface is not needed (at worst, it can be a contrib) but a feed really should have a default term which it assigns to its nodes.

Default categorization (this

alex_b's picture

Default categorization (this functionality is called taxonomy inheritance in leech and feedapi, I don't know about other contrib modules) is very lightweight and can definitely be a feature of core - it would be part of aggregator_node.module, I assume.
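
To show how small this feature is, here is a sketch of taxonomy inheritance in Python. The dict fields (`nid`, `terms`, `feed`) and the function name `create_item_node` are made up for the example and do not mirror the leech or feedapi implementations.

```python
def create_item_node(feed_node, item):
    """Build a node-like dict for a feed item, inheriting the feed's terms."""
    return {
        "type": "feed_item",
        "title": item["title"],
        # Copy the list so later edits to the item do not change the feed.
        "terms": list(feed_node.get("terms", [])),
        "feed": feed_node["nid"],
    }


planet_feed = {"nid": 12, "title": "Some blog", "terms": ["Planet Drupal"]}
item_node = create_item_node(planet_feed, {"title": "New release"})
```

Every item created from `planet_feed` starts with the feed's default term, which is the behaviour Drupal Planet relies on.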

Timing?

agentrickard's picture

Aron did great work in SoC 2006, and replacing core aggregator -- without dropping features -- has always been part of the plan for FeedAPI.

Of course, if the code freeze is in May, then this is not feasible as an SoC project -- unless we're targeting Drupal 8.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

Fits perfectly into D7 timeline

alex_b's picture

Hi Ken,

This project works with the earliest code freeze date of July 15. Aron is planning to do the heavy lifting of this project before July 15 anyway - the plan is to have a tight patch with all functionality implemented rolled by then. After July 15, there will be time for bug fixes and minor tweaks.

This tight timeline also makes sure that there is going to be enough time to streamline the patch before August 11 (suggested pencils down date).

Alex

http://www.buytaert.net/drupal-7-timeline

Mentoring

alex_b's picture

I would love to do the mentoring of this project.

Aron and I have been working closely together on http://www.drupal.org/project/feedapi - I am looking forward to applying what we've learned from FeedAPI to this project.

On top of that, it would be great to have a more seasoned core Drupal developer and another aggregation module developer (simplefeed? aggregation?) as co-mentors on this.

Alex

I'd love to help

csevb10@drupal.org's picture

Hey Alex,
I'd love to help co-mentor the project. Anything I can do to help forward a more robust aggregation system in core, I will do.
--Bill

Sorry I didn't get back on

ominds's picture

Sorry I didn't get back on my comments of feedapi, but I did look through it, and I did find it quite impressive. I was pretty hard pressed with a number of projects, and actually, I still am. Apologies again.

Yes, I'd definitely be interested in co-mentoring this as well, and I don't mind dedicating some time to get a spectacular aggregation system in core.

Three contrib modules on board

alex_b's picture

That's awesome -

This means that we've got the players of three popular contrib aggregators on board for this endeavor: aggregation, simplefeed and feedapi.

-Alex

Some ideas

skiquel's picture

I think this is an important proposal. The question of what exactly the aggregator should do and how it should integrate into Drupal is of great importance to the 7.0 release.

What should an aggregator do? Maybe act as a content puller that creates and populates nodes -- with all the node/content-type jazz -- with the ability to mass-manipulate the pulled content (remove items (nodes in a feed), turn them into CCK fields, contemplate, etc.).

I would like to throw some ideas up in the air. Rather than naming specific things to code, I will list a few potential ideas to leave the API open to:

  • Detection of namespaces like <slash:>.
  • CCK/Views Integration with date/body and namespace elements?
  • Ability to use custom parser engines (for content that's not XML). For instance, a script that could pull out content between IDs in a page, between patterned/regex'd tags, or from SQL files. Although pages and files like this lack the uniformity of an XML file, someone may be brave enough to find the patterns of a consistent site and dig at it. This leaves the opportunity open for advanced, customized data collection without spoiling the fun for the RSS people.
  • Instead of "Allowed tags", give users the ability to run retrieved content through input filters.
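
As an illustration of the custom parser idea in the list above, here is a Python sketch of a regex-driven parser for a page that is not a feed at all. The helper `make_regex_parser`, the HTML structure and the pattern are all invented for the example. The only assumption about the API is that a parser is "text in, list of items out".

```python
import re


def make_regex_parser(item_pattern):
    """Return a parser that yields one item per regex match.

    `item_pattern` must define the named groups `link` and `title`.
    """
    compiled = re.compile(item_pattern, re.DOTALL)

    def parse(text):
        return [
            {"title": m.group("title").strip(), "link": m.group("link")}
            for m in compiled.finditer(text)
        ]

    return parse


# Hypothetical site markup: each headline is a link inside an <h2>.
headline_parser = make_regex_parser(
    r'<h2 class="headline">\s*<a href="(?P<link>[^"]+)">(?P<title>.*?)</a>\s*</h2>'
)

page = (
    '<div><h2 class="headline"><a href="/n/1"> First </a></h2><p>text</p>'
    '<h2 class="headline">\n<a href="/n/2">Second</a></h2></div>'
)
scraped_items = headline_parser(page)
```

Because the result has the same shape as what an XML parser would return, the rest of the pipeline would not need to know where the items came from.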

I'm in, as always

Boris Mann's picture

Some of the Achieve Internet guys helped get a few patches in for D6, but we didn't get very far. Since I've been pushing for improvements since......D4.5 :P, I'm in for mentoring and helping to architect which pieces SHOULD be in core.

Yep

csevb10@drupal.org's picture

Yeah, we tried to patch a bit for D6 but never got too far, so I'm going to try and lend a solid hand here and hopefully we can revamp aggregator so it's more universally usable. :-)

Yes, this is a terrific

moshe weitzman's picture

Yes, this is a terrific proposal. Aron and alex_b are a dream team. I have no doubts about the success of this proposal.

-1 unless you also ship the

Morbus Iff's picture

-1 unless you also ship the standard d6 aggregator capability: for feed items not to be nodes.
I'm fine if feeds are nodes, but not feed items.

Part of FeedAPI bundle

Boris Mann's picture

Duplicates the core aggregator functionality -- feeds are nodes, feed items are not.

Boris - where was this

Morbus Iff's picture

Where was this redefined? In the workflow graphic above, it specifically says "create nodes from feed items" and then later, in the text, it mentions "It is still to be determined how to build views of feed items" - which, presuming we're not talking about the views-without-nodes version - assumes that feed items are nodes too.

Layers

jpetso's picture

The graphic above also splits the functionality into two modules, with the one that creates nodes from feed items (aggregator_node.module) being an optional layer on top of the core aggregator functionality.

FeedAPI

alex_b's picture

Boris is referring to FeedAPI - FeedAPI for D5 ships with processors for aggregator-like feed item creation and for feed-item-as-node creation.

@jpetso: the core aggregator

Morbus Iff's picture

@jpetso: the core aggregator module proposed above only handles CRUD for feeds, not for feed items.

@alex_b: I'm not concerned with what other modules currently do. I'm concerned with what the current proposal, which I define as the original node body above (i.e., the graphics and following text), shows. And that says and implies that feed items are nodes.

Actually, you know, I'm A-OK

Morbus Iff's picture

Actually, you know, I'm A-OK with feed items as nodes, but only if there's no way to automatically delete old feed items. The crux of my dislike for feed items as nodes is the default capability (in most, if not all, of the items-as-nodes modules I've seen) to automatically delete items that are older than a certain date. This creates massive amounts of linkrot, and just isn't anything that I can accept being in core. If that automatic capability doesn't exist (ie., it's either a contrib module, or the user has to manually delete the nodes), I'd be OK with it.

Moved to official ideas list

alex_b's picture

Given the positive feedback on this thread, I moved this proposal to the official ideas list: http://drupal.org/google-summer-of-code/2008/ideas-list

If Aron is aiming to be the

catch's picture

If Aron is aiming to be the student on this (or did I misunderstand?), then it shouldn't go on the official ideas list - that's for stuff we think would be cool for students to bid for, not for things proposed by prospective students who we already know about. Unless Aron wants to compete for fun :)

Everything about this looks very lovely. +1

Misunderstanding

alex_b's picture

Hey catch,

Aron is aiming to be the student for this. My misunderstanding.

http://drupal.org/node/239039 needs to be removed from the list.

Alex

Extended / revised proposal for Google and for us

Aron Novak's picture

Yes, I would like to be the student of this project :)

About feed items being nodes or not: yes, in the default configuration, feed items will be nodes. About auto-expiring of feed items: FeedAPI, for example, defaults to feed items never expiring (and never being deleted).

skiquel: namespaces and input filters are both important ideas! About custom parsers: yes, it will be supported by design. Absolutely sure :)

The proposal is here: http://feedapi.novaak.net/aggregator-proposal.pdf
Please share all of your ideas and suggestions.

There's a couple of things I

catch's picture

There's a couple of things I don't see - not required for a better aggregator, but will need to be in place to get that better aggregator into core.

  • upgrade path for existing feeds, feed categories, feed sources etc.
  • upgrade path for users of the existing aggregator module who don't want to use nodes for storage (which could be done in contrib as a plugin, or marked "won't fix" - but the discussion needs to be had).

These wouldn't even have to be part of the GSoC project, but worth bringing up. All very exciting though.

Improve XML Parsing too

bensheldon's picture

I'm down for new features, but the core aggregator function, simply parsing XML, is slightly lacking right now. There seem to be many issues with the parser not properly reading valid feeds, especially Atom feeds.

There seem to be lots of work-arounds, but they all seem to require some additional information or expertise. I'd like to see the core-parser be something that just works so that we can add onto that foundation. I'd like to see the Aggregator be solid enough that we can trust users (not just administrators) with adding and managing their own feeds.
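
A rough Python sketch of a parser that reads both RSS 2.0 and Atom by detecting the format from the root element and normalizing items into one shape. The function names `detect_format` and `parse_items` are invented for the example. It is only an illustration of the approach, and it deliberately ignores RSS 1.0/RDF and many Atom details (such as multiple `link` elements).

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def detect_format(root):
    """Return "atom", "rss" or None based on the document's root element."""
    if root.tag == ATOM_NS + "feed":
        return "atom"
    if root.tag == "rss":
        return "rss"
    return None


def parse_items(xml_text):
    """Parse RSS 2.0 or Atom into a list of {"title", "link"} dicts."""
    root = ET.fromstring(xml_text)
    fmt = detect_format(root)
    if fmt == "rss":
        return [
            {"title": i.findtext("title", default=""),
             "link": i.findtext("link", default="")}
            for i in root.iter("item")
        ]
    if fmt == "atom":
        items = []
        for entry in root.iter(ATOM_NS + "entry"):
            # Atom keeps the URL in the href attribute, not in element text.
            link = entry.find(ATOM_NS + "link")
            items.append({
                "title": entry.findtext(ATOM_NS + "title", default=""),
                "link": link.get("href", "") if link is not None else "",
            })
        return items
    raise ValueError("unsupported feed format: %s" % root.tag)


rss_doc = ('<rss version="2.0"><channel><item><title>A</title>'
           '<link>http://example.com/a</link></item></channel></rss>')
atom_doc = ('<feed xmlns="http://www.w3.org/2005/Atom"><entry><title>B</title>'
            '<link href="http://example.com/b"/></entry></feed>')
rss_items = parse_items(rss_doc)
atom_items = parse_items(atom_doc)
```

Both documents end up in the same item structure, so whatever consumes the parser output does not have to care which format the feed used.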

I agree

alex_b's picture

I agree: a more robust parser. And: the focus of this project shouldn't be to add many new features. It should refactor aggregator so that it's more extensible. Higher-level features (i.e. what goes well beyond the current aggregator's feature set) should still be a matter for contrib. With this new aggregator, contrib add-ons should be much easier to build.

eagereyes's picture

I only just came across this, but I wanted to chime in with a feature request: please do not require the creation of nodes. I only use the aggregator for a sidebar block, and don't need or want proper nodes. All the other aggregation modules I've seen are complete overkill for what should be the basic function of the aggregator: provide a simple list of links to content elsewhere.

Not exactly a feature

giorgio79's picture

Not exactly a feature request, more of a vision.

What I have in mind when I think of the Drupal aggregator is the equivalent of Yahoo Pipes functionality, where we can slice and dice feeds, do our own conversions etc.

I would also love to see XMLReader as the core parser instead of SimpleXML, because with XMLReader you can process a 2GB XML feed on an 8MB shared host.
How? http://www.ibm.com/developerworks/library/x-pullparsingphp.html
(SimpleXML would try to load the whole document into memory first, good luck with that...)

I use this along with Boost to put up some nice sites that would usually require a dedicated server or more....
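
The pull-parsing idea can be sketched as follows. This is a Python analogue using `xml.etree.ElementTree.iterparse`, not PHP's XMLReader, and `iter_items` is a name made up for the example. The point is only that items are handed out one at a time instead of after the whole document has been read.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(stream):
    """Yield one dict per <item> while the document is still being read."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "item":
            yield {
                "title": elem.findtext("title", default=""),
                "link": elem.findtext("link", default=""),
            }
            # Release the item's text and children once it has been consumed.
            elem.clear()


feed_stream = io.BytesIO(
    b"<rss><channel>"
    b"<item><title>A</title><link>http://example.com/a</link></item>"
    b"<item><title>B</title><link>http://example.com/b</link></item>"
    b"</channel></rss>"
)
streamed_items = list(iter_items(feed_stream))
```

In this simple form the emptied `<item>` elements stay attached to their parent, so a truly huge feed would also need them detached. A real implementation would pass a file or HTTP response object instead of `BytesIO`.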

Review Critical
ClipGlobe - World Travel

****Me and Drupal :)****
Clickbank IPN - Sell online or create a membership site with the largest affiliate network!
Review Critical - One of my sites

Talking about Yahoo Pipes,

giorgio79's picture

Talking about Yahoo Pipes, check this one out!!

http://drupal.org/project/transformations

I just noticed how old the original post is, still :)

