A new aggregator for Drupal 7

Posted by aron novak on March 17, 2008 at 2:39pm

Problems:

Drupal's core aggregator does not meet the requirements of
state-of-the-art aggregation applications. For example, it lacks
pluggable configurations and cannot create nodes from feed items.
There is an explosion of contrib modules that try to cover the
deficiencies of Drupal's aggregator in one way or another. Previous
attempts to unify aggregation functionality had limited success.

Suggestion:

Create a simple but extensible API for aggregation that ships in a
configuration that covers the most common use case of aggregating
feeds as nodes on a web site.
The basic architecture consists of three pieces:
- An aggregator.module that defines the API, invokes callbacks, and
  manages settings and cron processing.
- A parser_common module that downloads and parses XML data into a
  PHP array. This parser should be independent of aggregator.module.
- An aggregator_node module that creates nodes from feed items. This
  is a standard add-on module that depends on aggregator module.
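
The three pieces above can be sketched roughly as follows. This is only an illustration in Python, not Drupal code: the names `parse_rss`, `Aggregator`, `register_processor` and `make_node_processor` are hypothetical stand-ins for parser_common, aggregator.module and aggregator_node, and the download step is left out so the example stays self-contained.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Stand-in for parser_common: turn RSS 2.0 text into a plain dict.

    It knows nothing about the aggregator, so it can be reused on its own.
    """
    channel = ET.fromstring(xml_text).find("channel")
    items = [
        {"title": item.findtext("title", ""), "link": item.findtext("link", "")}
        for item in channel.findall("item")
    ]
    return {"title": channel.findtext("title", ""), "items": items}


class Aggregator:
    """Stand-in for aggregator.module: owns the API and invokes callbacks."""

    def __init__(self, parser):
        self.parser = parser
        self.processors = []  # add-on callbacks, e.g. a node creator

    def register_processor(self, callback):
        self.processors.append(callback)

    def refresh(self, xml_text):
        # Parse once, then hand every item to every registered add-on.
        feed = self.parser(xml_text)
        for item in feed["items"]:
            for process in self.processors:
                process(feed, item)
        return feed


def make_node_processor(storage):
    """Stand-in for aggregator_node: records one 'node' per feed item."""
    def create_node(feed, item):
        storage.append({"type": "feed_item", "title": item["title"],
                        "link": item["link"], "feed": feed["title"]})
    return create_node


SAMPLE = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First</title><link>http://example.com/1</link></item>
<item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""

nodes = []
aggregator = Aggregator(parse_rss)
aggregator.register_processor(make_node_processor(nodes))
aggregator.refresh(SAMPLE)
```

The point of the split is that the parser can be called without the aggregator, and the node creator is just one callback among possibly several.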

Design details:

- All feeds are represented as nodes.
- Feeds can be configured on a per content type basis, e.g. create an
  og group agnostic feed content type for pulling in news feeds and an
  og group integrated feed content type for pulling in iCal feeds.
- Aggregator module handles the settings of add-on modules through a
  settings hook. This helps add-on module developers deal with per
  content type settings.
- As opposed to FeedAPI, there won't be a possibility to override per
  content type settings on a per feed level. This is a heavy feature
  that is not much used and a source of errors and complications.
- Aggregator module will detect the type of feed and automatically
  choose the right parser if more than one parser is enabled.
- Downloading time on cron will be managed on a time basis (e.g. 50%
  of cron time).
- The parser will be independent so it can be used without aggregator
  module.
- The standard configuration will be very basic and, as opposed to the
  current core aggregator, it will come without category handling.
- It is still to be determined how to build views of feed items. This
  depends on whether a views style view builder will go into D7 core.
- We should possibly split feed downloading and feed parsing into two
  different hook operations.
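
Two of the design points above, automatic parser selection and a time budget for downloads on cron, can be sketched like this. It is an illustration in Python under stated assumptions, not the proposed Drupal implementation: all function names are hypothetical, detection looks only at the root element, and only RSS 2.0 and Atom are covered.

```python
import time
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def detect_feed_type(xml_text):
    """Guess the feed format from the root element of the document."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        return "rss"
    if root.tag == ATOM_NS + "feed":
        return "atom"
    return None


def parse_rss_titles(xml_text):
    return [i.findtext("title", "") for i in ET.fromstring(xml_text).iter("item")]


def parse_atom_titles(xml_text):
    root = ET.fromstring(xml_text)
    return [e.findtext(ATOM_NS + "title", "") for e in root.iter(ATOM_NS + "entry")]


# The "enabled parsers", keyed by the feed type each one handles.
PARSERS = {"rss": parse_rss_titles, "atom": parse_atom_titles}


def choose_parser(xml_text):
    kind = detect_feed_type(xml_text)
    if kind not in PARSERS:
        raise ValueError("no enabled parser recognizes this feed")
    return PARSERS[kind]


def refresh_within_budget(feeds, refresh, cron_limit_seconds, share=0.5,
                          clock=time.monotonic):
    """Refresh feeds until `share` of the cron time limit is used up.

    Returns the feeds that were not reached, so a later cron run can
    pick them up.
    """
    deadline = clock() + cron_limit_seconds * share
    pending = list(feeds)
    while pending and clock() < deadline:
        refresh(pending.pop(0))
    return pending
```

The budget is checked before each feed rather than during a download, so a single slow feed can still overrun it. A real implementation would probably also need a download timeout.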

Tests and documentation:

Documentation on api.drupal.org
Unit tests and possibly UI tests for optimal test coverage.

Comments

FYI: if this planned stuff

Posted by aron novak on March 17, 2008 at 2:41pm

FYI: if this plan becomes part of Drupal 7, all of the FeedAPI-dependent contrib modules (feedapi_mapper - FEMP, feedapi_node_views, etc.) will be ported.

Categorization

Posted by chx on March 17, 2008 at 3:05pm

While I can see how it's a waste to reimplement all we have so nodes are a good choice, we must not lose the default categorization feature -- that's what Drupal Planet uses. I can definitely see how the categorization interface is not needed (at worst, it can be a contrib) but a feed really should have a default term which it assigns to its nodes.

Default categorization (this

Posted by alex_b on March 17, 2008 at 3:09pm

Default categorization (this functionality is called taxonomy inheritance in leech and feedapi, I don't know about other contrib modules) is very lightweight and can definitely be a feature of core - it would be part of aggregator_node.module, I assume.
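
A minimal sketch of what taxonomy inheritance amounts to, written in Python for illustration only. The dict layout and the name `create_item_node` are assumptions, not the leech or FeedAPI data model:

```python
def create_item_node(feed_node, item):
    """Build a node-like dict for a feed item, copying the feed's terms."""
    return {
        "type": "feed_item",
        "title": item["title"],
        # Copy the list so later edits to the feed's terms leave old items alone.
        "terms": list(feed_node.get("terms", [])),
    }


planet_feed = {"type": "feed", "title": "Example blog", "terms": ["Drupal Planet"]}
node = create_item_node(planet_feed, {"title": "Hello"})
```

Every item created from `planet_feed` carries the feed's default term without any categorization interface being involved.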

http://www.twitter.com/lxbarth

Timing?

Posted by agentrickard on March 23, 2008 at 10:12pm

Aron did great work in SoC 2006, and replacing core aggregator -- without dropping features -- has always been part of the plan for FeedAPI.

Of course, if the code freeze is in May, then this is not feasible as an SoC project -- unless we're targeting Drupal 8.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3


Fits perfectly into D7 timeline

Posted by alex_b on March 24, 2008 at 3:09pm

Hi Ken,

This project works with the earliest code freeze date of July 15. Aron is planning to do the heavy lifting of this project before July 15 anyway - the plan is to have a tight patch with all functionality implemented rolled by then. After July 15, there will be time for bug fixes and minor tweaks.

This tight time line also makes sure that there is going to be enough time to streamline the patch before August 11 (suggested pencils down date).

Alex

http://www.buytaert.net/drupal-7-timeline

http://www.twitter.com/lxbarth

Mentoring

Posted by alex_b on March 24, 2008 at 3:13pm

I would love to do the mentoring of this project.

Aron and I have been working closely together on http://www.drupal.org/project/feedapi - I am looking forward to applying what we've learned from FeedAPI to this project.

On top of that, it would be great to have a more seasoned core Drupal developer and another aggregation module developer (simplefeed? aggregation?) as co-mentors on this.

Alex

http://www.twitter.com/lxbarth

I'd love to help

Posted by csevb10@drupal.org on March 24, 2008 at 8:35pm

Hey Alex,
I'd love to help co-mentor the project. Anything I can do to help forward a more robust aggregation system in core, I will do.
--Bill

Sorry I didn't get back on

Posted by Ashraf Amayreh on March 25, 2008 at 10:24am

Sorry I didn't get back on my comments of feedapi, but I did look through it, and I did find it quite impressive. I was pretty hard pressed with a number of projects, and actually, I still am. Apologies again.

Yes, I'd definitely be interested in co-mentoring this as well, and I don't mind dedicating some time to get a spectacular aggregation system in core.

Three contrib modules on board

Posted by alex_b on March 25, 2008 at 8:13pm

That's awesome -

This means that we've got the players from three popular contrib aggregators on board for this endeavor: aggregation, simplefeed and feedapi.

-Alex

http://www.twitter.com/lxbarth

Some ideas

Posted by tonyn on March 25, 2008 at 7:34pm

I think this is an important proposal. The idea on what exactly that aggregator should do and how it should integrate into Drupal is of great importance to the 7.0 release.

What should an aggregator do? Maybe act as a content puller that creates and populates nodes -- with all the node/content-type jazz -- and then offer the ability to mass-manipulate the pulled content (remove items (nodes in a feed), turn them into CCK fields, contemplate, etc.).

I would like to throw some ideas up in the air. Rather than saying specifically what to code, I will list a few potential ideas to leave the API open to:

- Detection of namespaces like <slash:>.
- CCK/Views integration with date/body and namespace elements?
- Ability to use custom parser engines (for content that's not XML). For instance, a script that could pull out content between IDs in a page, between patterned/regex'd tags, or from SQL files. Although pages and files like this lack the uniformity of an XML file, someone may be brave enough to find the patterns of a consistent site and dig at it. This leaves the opportunity open for advanced, customized data collection without spoiling the fun for the RSS people.
- Instead of "Allowed tags", give users the ability to run retrieved content through input filters.

I'm in, as always

Posted by boris mann on March 25, 2008 at 10:05pm

Some of the Achieve Internet guys helped get a few patches in for D6, but we didn't get very far. Since I've been pushing for improvements since......D4.5 :P, I'm in for mentoring and helping to architect which pieces SHOULD be in core.

Yep

Posted by csevb10@drupal.org on March 26, 2008 at 7:09pm

Yeah, we tried to patch a bit for D6 but never got too far, so I'm going to try and lend a solid hand here and hopefully we can revamp aggregator so it's more universally usable. :-)

Yes, this is a terrific

Posted by moshe weitzman on March 25, 2008 at 11:01pm

Yes, this is a terrific proposal. Aron and alex_b are a dream team. I have no doubts about the success of this proposal.

-1 unless you also ship the

Posted by morbus iff on March 25, 2008 at 11:38pm

-1 unless you also ship the standard d6 aggregator capability: for feed items not to be nodes.
I'm fine if feeds are nodes, but not feed items.

Part of FeedAPI bundle

Posted by boris mann on March 27, 2008 at 5:50am

Duplicates the core aggregator functionality -- feeds are nodes, feed items are not.

Boris - where was this

Posted by morbus iff on March 27, 2008 at 11:30am

Where was this redefined? In the workflow graphic above, it specifically says "create nodes from feed items" and then later, in the text, it mentions "It is still to be determined how to build views of feed items" - which, presuming we're not talking about the views-without-nodes version - assumes that feed items are nodes too.

Layers

Posted by jpetso on March 27, 2008 at 1:02pm

The graphic above also splits the functionality into two modules, with the one that creates nodes from feed items (aggregator_node.module) being an optional layer on top of the core aggregator functionality.

FeedAPI

Posted by alex_b on March 27, 2008 at 1:40pm

Boris is referring to FeedAPI - FeedAPI for D5 ships with two processors: one for aggregator-like feed item creation and one that creates feed items as nodes.

http://www.twitter.com/lxbarth

@jpetso: the core aggregator

Posted by morbus iff on March 27, 2008 at 1:53pm

@jpetso: the core aggregator module proposed above only handles CRUD for feeds, not for feed items.

@alex_b: I'm not concerned with what other modules currently do. I'm concerned with what the current proposal, which I define as the original node body above (i.e., the graphics and following text), displays. And that both says and implies that feed items are nodes.

Actually, you know, I'm A-OK

Posted by morbus iff on March 25, 2008 at 11:51pm

Actually, you know, I'm A-OK with feed items as nodes, but only if there's no way to automatically delete old feed items. The crux of my dislike for feed items as nodes is the default capability (in most, if not all, of the items-as-nodes modules I've seen) to automatically delete items that are older than a certain date. This creates massive amounts of linkrot, and just isn't anything that I can accept being in core. If that automatic capability doesn't exist (ie., it's either a contrib module, or the user has to manually delete the nodes), I'd be OK with it.

Moved to official ideas list

Posted by alex_b on March 26, 2008 at 6:02pm

Given the positive feedback on this thread, I moved this proposal to the official ideas list: http://drupal.org/google-summer-of-code/2008/ideas-list

http://www.twitter.com/lxbarth

If Aron is aiming to be the

Posted by catch on March 27, 2008 at 1:11pm

If Aron is aiming to be the student on this (or did I misunderstand?), then it shouldn't go on the official ideas list - that's for stuff we think would be cool for students to bid for, not for things proposed by prospective students who we already know about. Unless Aron wants to compete for fun :)

Everything about this looks very lovely. +1

Misunderstanding

Posted by alex_b on March 27, 2008 at 1:45pm

Hey catch,

Aron is aiming to be the student for this. My misunderstanding.

http://drupal.org/node/239039 needs to be removed from list.

Alex

http://www.twitter.com/lxbarth

Extended / revised proposal for Google and for us

Posted by aron novak on March 28, 2008 at 3:34pm

Yes, I would like to be the student of this project :)

About feed items being nodes or not: yes, in the default configuration, feed items will be nodes. About auto-expiring of feed items: FeedAPI, for example, has a default value under which feed items never expire (and are never deleted).

skiquel: namespaces and input filters are both important ideas! About custom parsers: yes, it will be supported by design. Absolutely sure :)

The proposal is here: http://feedapi.novaak.net/aggregator-proposal.pdf
Please share all of your ideas and suggestions

There's a couple of things I

Posted by catch on March 28, 2008 at 4:07pm

There's a couple of things I don't see - not required for a better aggregator, but will need to be in place to get that better aggregator into core.

- An upgrade path for existing feeds, feed categories, feed sources etc.
- An upgrade path for users of the existing aggregator module who don't want to use nodes for storage (which could be done in contrib as a plugin, or marked won't fix - but the discussion needs to be had).

These wouldn't even have to be part of the GSoC project, but worth bringing up. All very exciting though.

Improve XML Parsing too

Posted by bensheldon on April 7, 2008 at 2:08am

I'm down for new features, but the core aggregator function, simply parsing XML, is slightly lacking right now. There seem to be many issues with the parser not properly reading valid feeds, especially Atom feeds.

There seem to be lots of work-arounds, but they all seem to require some additional information or expertise. I'd like to see the core-parser be something that just works so that we can add onto that foundation. I'd like to see the Aggregator be solid enough that we can trust users (not just administrators) with adding and managing their own feeds.

I agree

Posted by alex_b on April 7, 2008 at 12:40pm

I agree: we need a more robust parser. And the focus of this project shouldn't be to add many new features. It should refactor aggregator so that it's more extensible. Higher-level features (i.e. whatever goes well beyond the current aggregator's feature set) should still be a matter for contrib. With this new aggregator, contrib add-ons should be ~~much easier~~ possible.

http://www.twitter.com/lxbarth

Please keep basic aggregator that does not need nodes

Posted by eagereyes on July 15, 2008 at 10:11pm

I only just came across this, but I wanted to chime in with a feature request: please do not require the creation of nodes. I only use the aggregator for a sidebar block, and don't need or want proper nodes. All the other aggregation modules I've seen are complete overkill for what should be the basic function of the aggregator: provide a simple list of links to content elsewhere.

Not exactly a feature

Posted by giorgio79 on April 16, 2009 at 12:15pm

Not exactly a feature request, more of a vision.

What I have in mind when I think of the Drupal aggregator is the equivalent of Yahoo Pipes functionality, where we can slice and dice feeds, do our own conversions etc.

I would also love to see xmlreader as the core parser instead of simplexml, because with xmlreader you can process a 2GB xml feed with an 8MB shared host.
How? http://www.ibm.com/developerworks/library/x-pullparsingphp.html
(simplexml would try to parse the whole thing into an array first, good luck with that...)
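
To illustrate the pull-parsing idea, here is a rough Python analogue using the standard library's `iterparse`. It is not PHP's XMLReader and only shows the same principle of handling one item at a time; the function name `iter_item_titles` is made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_item_titles(source):
    """Yield item titles one at a time from a file-like XML source."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title", "")
            # Drop the item's children once it has been handled. The emptied
            # <item> element itself stays attached to its parent.
            elem.clear()


data = io.BytesIO(
    b"<rss><channel>"
    b"<item><title>A</title></item>"
    b"<item><title>B</title></item>"
    b"</channel></rss>"
)
titles = list(iter_item_titles(data))
```

Because each item is released after it is processed, memory use depends mostly on the size of one item rather than on the size of the whole feed.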

I use this along with Boost to put up some nice sites that would usually require a dedicated server or more....

Review Critical
ClipGlobe - World Travel

****Me and Drupal :)****
Clickbank IPN - Sell online or create a membership site with the largest affiliate network!
Review Critical - One of my sites

Talking about Yahoo Pipes,

Posted by giorgio79 on April 19, 2009 at 4:49am

Talking about Yahoo Pipes, check this one out!!

http://drupal.org/project/transformations

I just noticed how old the original post is, still :)

