Aggregator for D7 outline

Aron Novak

Have a look at the PHP skeleton code for a better understanding of this outline.

Def:
Parser: responsible for creating the feed data structure expected by the aggregator (hook_aggregator_parser). By design, additional parsers could consume anything (iCal, HTML pages, emails).
Processor: responsible for accepting items and saving them, showing them to the user, etc. (hook_aggregator_processor).
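
A minimal sketch of the division of labor between the two hooks (the signatures here are inferred from this outline, not a final API):

// Parser: build the feed data structure from the source behind $url.
function mymodule_aggregator_parser($url) {
  // Download and parse, then return the feed object.
}

// Processor: accept the parsed feed and do something with its items.
function mymodule_aggregator_processor(&$feed) {
  // Save the items, show them to the user, etc.
}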

Planned modules:

* aggregator
* aggregator_node
* aggregator_light
* syndication_parser (can be used independently from aggregator)

aggregator: handles feeds and feed configurations, the cron part, caching, and feed refreshing
aggregator_light: processor for handling non-node items (like the current aggregator does)
aggregator_node: processor for handling node items
syndication_parser: parser for RSS, Atom, and RDF feeds

Create a feed:
low-level:
1) $feed_node = aggregator_feed_create('http://example.com/feed', 'flatfeed');
$feed_node is a node object skeleton with some feed-specific items; see below.
2) aggregator_feed_retrieve($feed_node);
This populates the $feed_node object with the downloaded and parsed data [hook_parse()].
3) aggregator_feed_save($feed_node);
This calls the processor(s) and saves the feed.
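
For reference, the whole low-level flow in one place (function names as in the steps above; 'flatfeed' is just an example feed type):

// 1) Create a node object skeleton for the feed.
$feed_node = aggregator_feed_create('http://example.com/feed', 'flatfeed');
// 2) Download and parse the feed into the node object.
aggregator_feed_retrieve($feed_node);
// 3) Call the processor(s) and save the feed.
aggregator_feed_save($feed_node);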

In addition to the node form, a simplified proxy will be provided in a block. Aggregator will also support feed autodetection (for example, it's enough to supply http://www.drupal.org to create the feed).
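
Autodetection usually means scanning the HTML head for <link rel="alternate"> tags. A minimal sketch of such a helper (the function name is hypothetical, and the regex is simplified: it assumes rel comes before href in the tag):

function aggregator_autodetect_feed_url($html) {
  $matches = array();
  // Look for e.g. <link rel="alternate" type="application/rss+xml" href="...">.
  if (preg_match('/<link[^>]*rel=["\']alternate["\'][^>]*href=["\']([^"\']+)["\']/i', $html, $matches)) {
    return $matches[1];
  }
  return FALSE;
}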

Caching
All parsed data from the parsers is cached with the standard Drupal cache functions: cache_set() / cache_get().
Level 0: if the etag / last-modified headers indicate that the feed has not changed, the refresh can stop there.
Level 1: if the parsed data is the same as the previous time (hash comparison of the cache and the parser output), the processors won't be called.
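
A minimal sketch of the level 1 check with the standard cache functions (the cache ID scheme is an assumption):

$cid = 'aggregator:' . md5($feed_node->url);
$hash = md5(serialize($feed_node->feed));
$cached = cache_get($cid);
if ($cached && $cached->data == $hash) {
  // Parser output is identical to the previous run; skip the processors.
  return;
}
cache_set($cid, $hash);
// ... call the processors ...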

Refreshing
At cron time, all saved feeds are refreshed, and if a processor is assigned to a feed, it is also called to process the items.

* A feed can be refreshed manually.
* Cron works off feeds for a fixed time, not a fixed number of feeds (see the sketch after this list).
* On top of the time-driven approach, feed refreshing can be adjusted per feed: "as often as possible", "twice a day", "once a day", "never". But aggregator cannot guarantee precise timings, of course.
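
A minimal sketch of the fixed-time cron approach (the 30-second budget and aggregator_get_next_stale_feed() are assumptions for illustration):

function aggregator_cron() {
  // Work off feeds for a fixed time, not a fixed number of feeds.
  $end = time() + 30;
  while (time() < $end && ($feed_node = aggregator_get_next_stale_feed())) {
    aggregator_feed_retrieve($feed_node);
    aggregator_feed_save($feed_node);
  }
}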

Parsers
hook_aggregator_parser($url)
The parser gets the URL. This does not mean overlapping code between parsers, because parsers may use aggregator's function, aggregator_download(), to download the feed (it handles the download-related tasks: redirects, etag, last-modified, authentication).
This means that in most cases the parser doesn't have to take care of downloading; it can hand the downloaded data straight to SimpleXML!
One feed is assigned exactly one parser. Zero parsers would be meaningless, and more than one parser per feed adds unneeded complexity.
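
A minimal sketch of a parser built on top of aggregator_download(), assuming it returns the raw feed body as a string and assuming an RSS 2.0 source:

function syndication_parser_aggregator_parser($url) {
  // aggregator_download() handles redirects, etag, last-modified, authentication.
  $raw = aggregator_download($url);
  $xml = simplexml_load_string($raw);
  if ($xml === FALSE) {
    return FALSE;
  }
  $feed = new stdClass();
  $feed->title = (string) $xml->channel->title;
  $feed->link = (string) $xml->channel->link;
  $feed->items = array();
  foreach ($xml->channel->item as $entry) {
    $item = new stdClass();
    $item->title = (string) $entry->title;
    $item->link = (string) $entry->link;
    $item->description = (string) $entry->description;
    $feed->items[] = $item;
  }
  return $feed;
}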

Processors
One feed can be assigned more than one processor. The processors form a chain: each one sees the previous one's result (reference passing).
The processors get the whole feed object, including the items.
The default configuration of processors will support categorization similar to the existing aggregator. This categorization will be based on the taxonomy system.
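
A minimal sketch of how the chain might be run (aggregator_get_processors() is a hypothetical lookup for the processors assigned to the feed's configuration):

function aggregator_invoke_processors($feed_node) {
  foreach (aggregator_get_processors($feed_node) as $module) {
    $function = $module . '_aggregator_processor';
    if (function_exists($function)) {
      // The feed object is passed along, so each processor
      // sees the previous processor's modifications.
      $function($feed_node->feed);
    }
  }
}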

Feed configuration
Based on content types. It can be similar to FeedAPI: you can assign processors and a parser to a configuration. Only per-content-type configuration is allowed, no per-feed-node overrides.
Feed type handling is the responsibility of the user. If you write a specific parser and a specific processor, you need to ensure that the feed configuration is meaningful.

OPML import/export
I would suggest that OPML import/export should be an add-on module and not part of core.

Data structures
Outline: $node->feed->items
Detailed:
Feed properties: title, description, link to the site (link), feed URL (url)
Item properties: title, description, link to the original article, id, date (consisting of a UTC timestamp, a timezone string, and a local time string), namespaces, categories
Item: http://groups.drupal.org/files/item.png
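
As PHP, the structure could look like this (property names from the lists above; the values are made up for illustration):

$node->feed = new stdClass();
$node->feed->title = 'Example site';
$node->feed->description = 'What the site is about';
$node->feed->link = 'http://example.com';
$node->feed->url = 'http://example.com/feed';

$item = new stdClass();
$item->title = 'Example item';
$item->description = 'Body of the item';
$item->link = 'http://example.com/article/1';
$item->id = 'http://example.com/article/1';
$item->date = array(
  'timestamp' => 1220227200,        // UTC timestamp
  'timezone' => 'Europe/Budapest',  // timezone string
  'local' => '2008-09-01 02:00',    // local time string
);
$item->namespaces = array();
$item->categories = array();
$node->feed->items = array($item);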

An example of the namespace support:

<item>
  <dc:rights>foo</dc:rights>
</item>

is turned into:

namespaces => array('dc' => array('rights' => 'foo'))

Comments

The aggregator alpha release

Aron Novak

The aggregator alpha release is out: http://tinyurl.com/5gj4ca

OPML

Boris Mann

Looks like OPML import is getting some patch love for the old version of aggregator. Why should it be dropped? It seems simple enough functionality, and it's a regression to remove features... maybe you want to recruit Arancaytar to work on this as a separate module? Hmm... although it's such core functionality that making it part of aggregator might make sense... see http://drupal.org/node/16282

The rest of the plan looks good.

OPML import/export should be part of core

kyle_mathews

I'd second that OPML import/export should be part of core. Working on a memetracker, I'm constantly adding feeds to new memetrackers, and adding 50 feeds at a time is a serious pain. I think importing OPML files is a common enough use case that it should be kept in core.

Kyle Mathews
