Typical problems and questions while designing an aggregator

Aron Novak

As you may have heard, my application to write a new aggregator for Drupal core was accepted :) : http://code.google.com/soc/2008/drupal/appinfo.html?csaid=1222E090E875B36D
As a first step, I would like to collect as many potential problems in the aggregation area as possible. Please share your opinion on these questions to help Drupal 7 get as good an aggregator as possible :) :

The items are not necessarily ordered by relevance.

Node creation

In short: there is no perfect way to create a node programmatically. Why does the aggregator need to take care of this? Because, as stated in the proposal, the new aggregator creates nodes from feed items to reuse Drupal's existing CRUD capabilities.
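To make the trade-off concrete, here is a minimal sketch assuming a Drupal 6 style API and a hypothetical feed_item content type: node_save() skips form validation, while the drupal_execute() route (discussed in the comments below) runs the full node form and is slower.

<?php
// Sketch only: create a node from a parsed feed item via node_save().
// The content type name and the $item keys are assumptions for illustration.
$node = new stdClass();
$node->type = 'feed_item';
$node->title = $item['title'];
$node->body = $item['description'];
$node->uid = 1;
$node->status = 1;
node_save($node);
?>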

RSS, ATOM, RDF, BlogML, Dublin core, namespaces

Different capabilities, different versions.
Suggestion: support the common features of RSS, RDF and Atom, and provide a data structure to access everything else. Also add support for handling namespaces easily, as in the sketch below.
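For instance, namespaced elements such as Dublin Core fields can be read with SimpleXML's children() call; this is only an illustration, not the proposed API.

<?php
// Illustration: read dc:creator from each RSS item using SimpleXML.
$xml = simplexml_load_string($feed_source);
foreach ($xml->channel->item as $item) {
  $dc = $item->children('http://purl.org/dc/elements/1.1/');
  $creator = (string) $dc->creator;
}
?>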

Common data structure

  • Simple but flexible PHP data structure for feed items (see the sketch below).
  • Consistent and clean structure on $node objects.
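Just to make the idea concrete, here is one possible shape such a feed item structure could take; every key name here is an assumption, not a fixed API.

<?php
// Illustrative feed item array; field names are assumptions.
$item = array(
  'title' => 'Item title',
  'description' => 'Item body or summary',
  'options' => array(
    'guid' => 'unique-id-from-the-feed',
    'original_url' => 'http://example.com/post/1',
    'timestamp' => 1210000000,
    'tags' => array('drupal', 'aggregation'),
  ),
);
?>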

Cron execution

Problem: slowness, timeouts.
Suggestion: elapsed-time-based handling. Check the elapsed time between feeds, and between individual feed items too.
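A minimal sketch of the idea, with made-up helper names: stop before the PHP time limit is reached instead of trying to finish every feed in one cron run.

<?php
// Sketch only: process feeds until a time budget is used up, then stop.
// aggregator_refresh_feed() is a hypothetical function.
function aggregator_cron_time_limited($feeds, $time_limit = 30) {
  $start = time();
  foreach ($feeds as $feed) {
    if (time() - $start > $time_limit) {
      break; // the remaining feeds are picked up on the next cron run
    }
    aggregator_refresh_feed($feed);
  }
}
?>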

Duplicate item detection

Problem: judging whether two feed items are the same is not an easy task. The GUID should be perfect, but many feeds misuse it. The original URL is good (especially together with the GUID), but that is not what it was originally meant for.
Suggestion:

                 guid1 == guid2   guid1 != guid2
url1 == url2     same             same (some feeds do this when uploading revisions)
url1 != url2     different        different

But this basically means judging based on the original URL only. Duplicate checking should be configurable / alterable by add-on modules; a sketch of the idea follows.
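One possible reading of the table above, again with assumed field names:

<?php
// Sketch: an incoming item is a duplicate primarily when its original URL
// matches a stored item; the GUID is only a fallback. Field names assumed.
function aggregator_is_duplicate($new, $existing) {
  if (!empty($new['original_url']) && !empty($existing['original_url'])) {
    return $new['original_url'] == $existing['original_url'];
  }
  return !empty($new['guid']) && $new['guid'] == $existing['guid'];
}
?>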

Providing an upgrade path

An upgrade path should be provided, at least from the old core aggregator. This is not a hard problem, but it is still important.

Indexing

There are sites with thousands of feeds and even more feed items. Indexes for the aggregator-related tables should be added carefully.
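In Drupal 6 this would most likely go through the Schema API; the table and column names below are assumptions made for illustration only.

<?php
// Sketch of hook_schema() declaring indexes on the columns queried most often.
function aggregator_schema() {
  $schema['aggregator_item'] = array(
    'fields' => array(
      'iid' => array('type' => 'serial', 'not null' => TRUE),
      'fid' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
      'timestamp' => array('type' => 'int'),
    ),
    'primary key' => array('iid'),
    'indexes' => array(
      'fid' => array('fid'),
      'timestamp' => array('timestamp'),
    ),
  );
  return $schema;
}
?>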

OPML

OPML export / import is an essential feature for syndication software. Do we need this for the core aggregator?

Feed cache

A feed cache is important for performance reasons. But there are still open questions. How should cache entries be invalidated? Which part should do the caching: the aggregator module or the parsers?
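One option, sketched with Drupal's existing cache API; the cache ID scheme and the one-hour lifetime are arbitrary choices here, not a decision.

<?php
// Sketch: cache the downloaded feed source keyed by URL for an hour.
$cid = 'aggregator_source:' . md5($url);
if ($cached = cache_get($cid)) {
  $source = $cached->data;
}
else {
  $result = drupal_http_request($url);
  $source = $result->data;
  cache_set($cid, $source, 'cache', time() + 3600);
}
?>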

URL redirect

The aggregator should support HTTP (and HTML meta refresh) redirects too. What do you think? With SimpleXML, HTML redirects can now be handled easily as well.

RSS channel extracting

People like it when they only need to enter drupal.org and the aggregator software takes care of the dirty task: finding the RSS link. Also, SimpleXML's HTML parsing capability could help to do this elegantly; the current core aggregator does not support this. A possible approach is sketched below.
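A sketch of such auto-discovery, not the proposed implementation: strictly speaking SimpleXML cannot load sloppy HTML on its own, so this goes through DOMDocument::loadHTML() first and then walks the tree with SimpleXML.

<?php
// Sketch: find the <link rel="alternate"> feed URL on an HTML page.
$result = drupal_http_request('http://drupal.org');
$dom = new DOMDocument();
@$dom->loadHTML($result->data); // suppress warnings about invalid HTML
$html = simplexml_import_dom($dom);
foreach ($html->xpath('//link[@rel="alternate"]') as $link) {
  $type = (string) $link['type'];
  if ($type == 'application/rss+xml' || $type == 'application/atom+xml') {
    $feed_url = (string) $link['href'];
    break;
  }
}
?>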

Pipe or other structure

I imagine the pipe structure as the classic Unix pipe: output goes to input, so it's a chain where every stage can adjust the data.
The other possibility is that the processors are independent and don't know about each other; that looks like a tree.

Pipe's advantage: very flexible, and the parts can exploit each other's capabilities.
Tree's advantage: the parts can be totally independent, so they don't know about each other. It is more robust: one buggy component cannot break anything else. Both shapes are sketched in code below.
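To make the two shapes concrete (all function names here are made up for illustration):

<?php
// Pipe: each processor receives the previous processor's output.
$data = aggregator_fetch($url);
foreach (array('parser_rss', 'processor_node', 'processor_tag') as $stage) {
  $data = call_user_func($stage, $data); // output becomes the next input
}

// Tree: every processor independently receives the parser's output,
// so a failure in one of them stays isolated.
$parsed = parser_rss(aggregator_fetch($url));
foreach (array('processor_node', 'processor_tag') as $processor) {
  call_user_func($processor, $parsed);
}
?>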

Support multiple configurations

A feature that proved to be useful in FeedAPI is the capability of setting up different configurations of feed processors. A content type would define a configuration of a feed processor. Thus it is possible to, e.g., set up a feed content type for iCal feeds that produces event content types and another feed content type for RSS/Atom feeds that produces news articles.
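Roughly what such per-content-type configuration could look like; the keys and parser/processor names below are assumptions for illustration, not FeedAPI's actual storage format.

<?php
// Sketch: each feed content type maps to its own parser/processor setup.
$settings = array(
  'feed_ical' => array(
    'parsers' => array('parser_ical'),
    'processors' => array('feedapi_node'),
    'item_node_type' => 'event',
  ),
  'feed_rss' => array(
    'parsers' => array('parser_common_syndication'),
    'processors' => array('feedapi_node'),
    'item_node_type' => 'news_article',
  ),
);
?>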

Settings handling

Should the settings of add-on modules be handled by the aggregator module or not? Especially if you consider per-content-type configurations, per-node overrides (think of a default setting of a public flag on the story content type, overridden on a per-node basis) become very useful - and that in turn is easy to do if settings are handled by the aggregator. On the other hand, settings handling is a complex and rather long piece of code in feedapi.module.
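The per-node override idea could be as simple as the following lookup; the function name, property and variable names are hypothetical.

<?php
// Sketch: a per-node value wins over the content type default.
function aggregator_get_setting($name, $node) {
  if (isset($node->aggregator_settings[$name])) {
    return $node->aggregator_settings[$name];
  }
  $defaults = variable_get('aggregator_settings_' . $node->type, array());
  return isset($defaults[$name]) ? $defaults[$name] : NULL;
}
?>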

Types of feeds

To illustrate the multitude of feeds that could be handled by a new aggregator:

  • RSS/Atom
  • NewsML
  • iCal
  • Email
  • Other generic XML formats
  • HTML content through scraper
Attachments: pipe.png (10.14 KB), tree_struct.png (11.9 KB)

Comments

Types of feeds

Boris Mann

If we have a focus on RDF in Drupal 7, then this thinking should be included in the aggregator design. Add this to "types of feeds".

A focus on RDF? Where's this

Morbus Iff

A focus on RDF? Where's this coming from? RDF, in regards to feeds, tends to refer to RSS 1.0 and RSS 0.91 (as opposed to Winer's RSS 0.92 and 2.0 formats). Generally speaking, most of the folks that I know of in the semweb community have given up on RDF, per se.

RDF data...

Boris Mann

...not RDF 1.0 craziness. But rather, SemWeb data expressed as XML / RDF "feeds". Don't worry, I haven't jumped the Shirky :P

speed of drupal_execute

greggles

It seems like the speed problem with drupal_execute has less to do with drupal_execute itself and more to do with the subsequent node_load, right?

If so, perhaps
1) You could avoid the node load and query directly?
2) You could alter drupal_execute?

I'm concerned (particularly as a maintainer of a 3rd party module that alters the node form) that alternate solutions may never work well...

--
Open Prediction Markets | Drupal Dashboard
