Aggregation API requirements - SoC project

Previously I compared some feed parsers and aggregation-purpose modules to provide an outline of the current solutions; this will be helpful at the start of the Summer of Code project. At the project website you can find the devlog and the details of the work in progress. On this page I would like to collect the features that an aggregation module should support and decide what is worth including in the API.
The main tasks of an aggregation module:

Downloading the feed

The current modules use two basic functions: drupal_http_request() and cURL.

The aims of this layer: download the data, follow redirects, and handle URLs that require authorization. cURL meets all of these aims; drupal_http_request() needs some external logic (an additional header item in the HTTP request) for authorization.
For the API I prefer to avoid a dependency on cURL, so I can imagine a function that adds some more functionality on top of drupal_http_request().
From the discussion on the mailing lists I heard that cURL beats drupal_http_request() in performance. Maybe the API should offer cURL as a second option.
I assume this task should be implemented entirely in the API, so the modules don't have to take care of it.
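As a rough illustration of the idea (the function names and parameters below are only assumptions, not a finished interface), such a wrapper could add the Authorization header for drupal_http_request() and offer a cURL-based variant as a second option:

<?php
/**
 * Sketch of a fetch wrapper around drupal_http_request() (Drupal 5/6
 * signature). The function name and parameters are illustrative only.
 */
function aggregation_api_fetch($url, $user = NULL, $pass = NULL) {
  $headers = array();
  // drupal_http_request() already follows redirects; authorization is the
  // "external logic" mentioned above - one extra request header.
  if (isset($user)) {
    $headers['Authorization'] = 'Basic ' . base64_encode($user . ':' . $pass);
  }
  $result = drupal_http_request($url, $headers);
  return ($result->code == 200) ? $result->data : FALSE;
}

/**
 * Optional cURL-based variant for sites where the cURL extension is
 * available and performance matters.
 */
function aggregation_api_fetch_curl($url, $user = NULL, $pass = NULL) {
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
  if (isset($user)) {
    curl_setopt($ch, CURLOPT_USERPWD, $user . ':' . $pass);
  }
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
}
?>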

Parsing the feed

The current situation: almost all modules use their own parser and their own data structures. The API should provide the following:

  • Register the parsers and collect information such as supported feed types and PHP compatibility level (?) (a rough hook sketch follows this list)
  • Define an output format; every parser has to provide the data structure that the API requires
  • Let the modules choose among the parsers compatible with the given system configuration, otherwise fall back to the default compatible parser
  • Possibility of assigning multiple parsers to one feed (Morbus's feature request)
  • Handle non-XML "feeds" (e.g. a "latest comic" HTML page) and make it possible to write a processor/feed parser for them (Morbus's feature request)
  • API's part: provide a default parser, define a data structure, provide a pluggable parser system (the basic UI is part of the SoC project, but it is not part of the API). Modules' part: add their own parsers (with added functionality) if needed
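To make the pluggable parser system more tangible, here is a minimal sketch of how a module could register a parser with the API. The hook name, the array keys and the callback are assumptions for illustration, not a defined interface:

<?php
/**
 * Hypothetical registration hook (hook_aggregation_parsers()). A module
 * announces its parsers so the API can pick a compatible one or let the
 * feed administrator choose.
 */
function mymodule_aggregation_parsers() {
  return array(
    'mymodule_simplepie' => array(
      'title' => 'SimplePie parser',
      'feed types' => array('rss', 'atom'),
      'php' => '5.0',                           // minimum PHP version
      'callback' => 'mymodule_parse_simplepie', // must return the common structure
    ),
  );
}
?>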

alex_b - What would be the best data structure to use? Depending on the parser in place, you will end up with all kinds of different output formats. I wonder what a simple, compatible and extensible data structure would look like - are there already any ideas? What are the requirements for such a data structure?

Aron Novak - I assume the default data structure should be something like what the default parser produces (with some tweaks). Unfortunately the other parsers' output has to be transformed into the common structure. I cannot see a better solution now. If someone sees a better way, let's share it :)
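As a starting point for that discussion, one option is a nested array: one array per feed containing a list of item arrays, with the well-known elements as fixed keys and everything else (namespaced elements, extra data) in a free-form sub-array. This is only a proposal sketch, not a decided format; all keys below are illustrative:

<?php
// Proposed common output structure that every parse callback would return.
$feed = array(
  'title' => 'Example feed',
  'link' => 'http://example.com/',
  'description' => '...',
  'items' => array(
    array(
      'title' => 'Example item',
      'link' => 'http://example.com/node/1',
      'description' => '...',
      'timestamp' => 1180000000,
      // Anything beyond the core elements (Dublin Core, categories,
      // comment URLs, other namespaces) goes here, keyed by full
      // namespace URI + element name to avoid prefix ambiguity.
      'extra' => array(),
    ),
  ),
);
?>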

Storing the feed and the news items

Instead of describing the ideas, here is an example database scheme:
[Diagram: the database scheme]
Ideally, an external module with added functionality shouldn't have to add another table; it should just use the existing ones. With the additional values, the modules can extend the functionality. For specialized things the modules may want to create new tables.
Most of this task should ideally be behind the API.
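For reference, the key/value idea behind the scheme (as discussed below) looks roughly like this; the module, table and column names are reconstructed from the discussion, not taken from the actual diagram:

<?php
/**
 * Rough reconstruction for illustration only: items live in a narrow
 * feed_item table and every additional element becomes one name/value
 * row in feed_item_option.
 */
function aggregation_api_install() {
  db_query("CREATE TABLE {feed_item} (
    iid INT UNSIGNED NOT NULL AUTO_INCREMENT,
    fid INT UNSIGNED NOT NULL,
    PRIMARY KEY (iid)
  )");
  db_query("CREATE TABLE {feed_item_option} (
    iid INT UNSIGNED NOT NULL,
    name VARCHAR(255) NOT NULL,
    value TEXT,
    KEY iid (iid)
  )");
}
?>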

Morbus Iff, morbus@disobey.com -- I'm worried about how the feed_item_option table will scale. Even the smallest single item has three pieces of data (title, description, date) -- more full-fledged ones will include upwards of ten (if you consider Dublin Core, categories, comment URLs, or any of the many other namespaces). For a site with 1000 feeds (one of your use cases for promoting smart downloading dynamics), this table will get pretty gigantic pretty quickly (1000 feeds with 10 items each with 3 pieces of data each: 30,000 rows in this table for the first pull -- much more for my particular need, which is to archive 150+ feeds indefinitely). Or, is this just for "options" about each individual feed item? What's that mean anyways? Likewise, how does this handle multi-tree'd items, like the fictional [item][morbus:container][morbus:leaf1 /][morbus:leaf2 /][/item]? Since we can't store the namespace prefix (since namespace prefixes are nothing but shorthand and have no meaning themselves), how will you differentiate between "container:leaf1" and "someothercontainer:leaf1"?

Aron Novak - Thanks for the detailed comment! As we discussed on IRC, our common viewpoint on this question is that this database scheme is not optimal (mainly for performance reasons), and the modules on top of the API will create their own tables for their data. Then possibly the modules register their external tables with the API.
Another problem with the database scheme is the types of the columns. Both text? Too inefficient. Thanks for pointing out that this was a dumb idea.

Morbus Iff, morbus@disobey.com -- I'm also curious about the ability to store the entire "data" of the item itself. If I don't install any parsers that understand Dublin Core, and then two months later I do, I'd like to be able to reparse all the other items so I get that DC data. This would involve some sort of dump of the entire contents of an item (its raw XML, perhaps?) that I could then reparse at a later time. I suppose, however, that this would best be done as a new parser, which begs another question: can multiple parsers be selected for a single feed? And, is a parser (SimplePIE, SimpleRSS, etc.) different from the API that passes around parsed items? In a perfect world, I'd love to be able to have three feeds configured like: Feed #1 (parse with SimplePIE and SaveXML; process items with EmailItem and ArchiveForever), Feed #2 (parse with SimplePIE; process items with TurnIntoNodes and ArchiveForever), Feed #3 (parse with SaveXML, no processing).

Aron Novak - With the API you will have the ability to save the whole XML (or whatever) data.

alex_b: 1) Reparse snapshot of feed at a later time. I am not quite sure if I understand that right - I am thinking that if one would like to store the raw feed, it could easily be done by an add-on module outside of the actual aggregation pipe. (Yes, that's what I ended up believing as well, as per my SaveXML parser example above. --Morbus Iff).

alex_b: 2) A per-feed configuration of parser and modification/output filter, and the possibility to use more than one modification/output filter on a feed. I like this :) This would be a huge feature for the Aggregation API and something that sets it apart from existing modules (?) (Actually, both SimpleFeed and Feed Parser offer the basis of those features, but both miss the full functionality slightly. --Morbus Iff).
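Following Morbus' example above, such a per-feed configuration could be expressed as something like the following (the array layout is purely illustrative):

<?php
// Hypothetical per-feed configuration: several parsers and several
// processors (output filters) can be attached to each feed.
$feeds = array(
  1 => array(
    'parsers'    => array('SimplePIE', 'SaveXML'),
    'processors' => array('EmailItem', 'ArchiveForever'),
  ),
  2 => array(
    'parsers'    => array('SimplePIE'),
    'processors' => array('TurnIntoNodes', 'ArchiveForever'),
  ),
  3 => array(
    'parsers'    => array('SaveXML'),
    'processors' => array(),
  ),
);
?>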

Fetch/store/delete/update of news items regularly

This task is usually done at cron time. The API should implement hook_cron() and let the modules strictly control the flow of cron. For example, a frequent question is: when should a feed be updated? I think the API should provide a general rule, then let the modules override the refresh rule. The actual feed-refresh code can be overridden too. This way, module developers can decide what they want to do at cron time (see http://groups.drupal.org/node/4309).
Only the skeleton is in the API.
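A minimal sketch of that skeleton, assuming hypothetical hook names (hook_aggregation_needs_refresh(), hook_aggregation_refresh()) and hypothetical table and column names:

<?php
/**
 * Sketch of the cron skeleton. Everything except hook_cron() itself
 * (module, hook, table and column names) is an assumption.
 */
function aggregation_api_cron() {
  $result = db_query("SELECT * FROM {aggregation_feed}");
  while ($feed = db_fetch_object($result)) {
    // Default rule: refresh when the configured interval has passed.
    $refresh = (time() - $feed->checked) > $feed->refresh_interval;
    // Let modules overwrite the refresh decision...
    foreach (module_implements('aggregation_needs_refresh') as $module) {
      $refresh = (bool) module_invoke($module, 'aggregation_needs_refresh', $feed, $refresh);
    }
    if ($refresh) {
      // ...and optionally take over the actual refresh code as well.
      $overrides = module_implements('aggregation_refresh');
      if ($overrides) {
        foreach ($overrides as $module) {
          module_invoke($module, 'aggregation_refresh', $feed);
        }
      }
      else {
        aggregation_api_refresh($feed); // default: fetch + parse + store
      }
    }
  }
}
?>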

Transforming the news items into some type of content

There are two totally different concepts: turning a news item into a node, or into something else. This part should not be in the API, because each module has its own preference on this question. For example, some modules use a fixed node type, some use a variable node type, and others use blocks or something else to show the feed and its content.
Nothing or only some helper functionality in the API.
alex_b -- Shouldn't the Aggregation API provide some basic default functionality? I mean in a way that you can install it and it just works, and only if you need particular functionality do you go for an add-on module. If so, what would be a good default functionality? Imitate the current Drupal core aggregator with some obvious improvements, like using taxonomies for categorizing feeds?

Aron Novak - I call the part of the module that is responsible for serving other modules/code the "API". Of course a simple module will be built on top of the API; in my terms, that module is not part of the API. I used this terminology to show what a module's tasks are if someone wants to use the API. But the SoC project will produce a default, simple UI too.

UI


The default module on top of the Aggregation API will have to provide some basic UI for configuring the module, configuring feeds, and listing feeds and feed items. Strictly speaking, the API is only the interface that the other modules will reuse.
