Typical problems and questions while designing an aggregator

Aron Novak

As you may have heard, my application to write a new aggregator for Drupal core was accepted :) : http://code.google.com/soc/2008/drupal/appinfo.html?csaid=1222E090E875B36D
As a first step, I would like to collect as many potential problems in the aggregation area as possible. Please share your opinion on these questions to help Drupal 7 get as good an aggregator as possible :) :

The items are not necessarily ordered by relevance.

Node creation

In short: there is no perfect way to create a node programmatically. Why does the aggregator need to take care of this? Because, as stated in the proposal, the new aggregator creates nodes from feed items to reuse Drupal's existing CRUD capabilities.
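To make the trade-off concrete, here is a minimal sketch assuming a Drupal 6 style API and a hypothetical feed_item content type: node_save() skips form validation, while the drupal_execute() route (discussed in the comments below) runs the full node form and is slower.

<?php
// Sketch only: create a node from a parsed feed item via node_save().
// The content type name and the $item keys are assumptions for illustration.
$node = new stdClass();
$node->type = 'feed_item';
$node->title = $item['title'];
$node->body = $item['description'];
$node->uid = 1;
$node->status = 1;
node_save($node);
?>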

RSS, ATOM, RDF, BlogML, Dublin core, namespaces

Different capabilities, different versions.
Suggestion: support the common features of RSS, RDF and Atom, and provide a data structure to access everything else. Also add support for handling namespaces easily, as in the sketch below.
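For instance, namespaced elements such as Dublin Core fields can be read with SimpleXML's children() call; this is only an illustration, not the proposed API.

<?php
// Illustration: read dc:creator from each RSS item using SimpleXML.
$xml = simplexml_load_string($feed_source);
foreach ($xml->channel->item as $item) {
  $dc = $item->children('http://purl.org/dc/elements/1.1/');
  $creator = (string) $dc->creator;
}
?>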

Common data structure

  • Simple but flexible PHP data structure for feed items (see the sketch below).
  • Consistent and clean structure on $node objects.
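Just to make the idea concrete, here is one possible shape such a feed item structure could take; every key name here is an assumption, not a fixed API.

<?php
// Illustrative feed item array; field names are assumptions.
$item = array(
  'title' => 'Item title',
  'description' => 'Item body or summary',
  'options' => array(
    'guid' => 'unique-id-from-the-feed',
    'original_url' => 'http://example.com/post/1',
    'timestamp' => 1210000000,
    'tags' => array('drupal', 'aggregation'),
  ),
);
?>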

Cron execution

Problem: slowness, timeouts.
Suggestion: elapsed-time-based handling. Check the elapsed time between feeds, and between individual feed items too.
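A minimal sketch of the idea, with made-up helper names: stop before the PHP time limit is reached instead of trying to finish every feed in one cron run.

<?php
// Sketch only: process feeds until a time budget is used up, then stop.
// aggregator_refresh_feed() is a hypothetical function.
function aggregator_cron_time_limited($feeds, $time_limit = 30) {
  $start = time();
  foreach ($feeds as $feed) {
    if (time() - $start > $time_limit) {
      break; // the remaining feeds are picked up on the next cron run
    }
    aggregator_refresh_feed($feed);
  }
}
?>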

Duplicate item detection

Problem: judging whether two feed items are the same is not an easy task. The GUID should be perfect, but many feeds misuse it. The original URL is good (especially together with the GUID), but that is not what it was originally meant for.
Suggestion:

                 guid1 == guid2   guid1 != guid2
url1 == url2     same             same (some feeds do this when uploading revisions)
url1 != url2     different        different

But this basically means judging based on the original URL only. Duplicate checking should be configurable / alterable by add-on modules; a sketch of the idea follows.
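One possible reading of the table above, again with assumed field names:

<?php
// Sketch: an incoming item is a duplicate primarily when its original URL
// matches a stored item; the GUID is only a fallback. Field names assumed.
function aggregator_is_duplicate($new, $existing) {
  if (!empty($new['original_url']) && !empty($existing['original_url'])) {
    return $new['original_url'] == $existing['original_url'];
  }
  return !empty($new['guid']) && $new['guid'] == $existing['guid'];
}
?>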

Providing an upgrade path

An upgrade path should be provided, at least from the old core aggregator. This is not a hard problem, but it is still important.

Indexing

There are sites with thousands of feeds and even more feed items. Indexes for the aggregator-related tables should be added carefully.
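In Drupal 6 this would most likely go through the Schema API; the table and column names below are assumptions made for illustration only.

<?php
// Sketch of hook_schema() declaring indexes on the columns queried most often.
function aggregator_schema() {
  $schema['aggregator_item'] = array(
    'fields' => array(
      'iid' => array('type' => 'serial', 'not null' => TRUE),
      'fid' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
      'timestamp' => array('type' => 'int'),
    ),
    'primary key' => array('iid'),
    'indexes' => array(
      'fid' => array('fid'),
      'timestamp' => array('timestamp'),
    ),
  );
  return $schema;
}
?>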

OPML

OPML export / import is an essential feature for syndication software. Do we need this for the core aggregator?

Feed cache

A feed cache is important for performance reasons. But there are still open questions. How should cache entries be invalidated? Which part should do the caching: the aggregator module or the parsers?
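One option, sketched with Drupal's existing cache API; the cache ID scheme and the one-hour lifetime are arbitrary choices here, not a decision.

<?php
// Sketch: cache the downloaded feed source keyed by URL for an hour.
$cid = 'aggregator_source:' . md5($url);
if ($cached = cache_get($cid)) {
  $source = $cached->data;
}
else {
  $result = drupal_http_request($url);
  $source = $result->data;
  cache_set($cid, $source, 'cache', time() + 3600);
}
?>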

URL redirect

The aggregator should support HTTP (and HTML meta refresh) redirects too. What do you think? With SimpleXML, HTML redirects can now be handled easily as well.

RSS channel extracting

People like it when they only need to enter drupal.org and the aggregator software takes care of the dirty task: finding the RSS link. Also, SimpleXML's HTML parsing capability could help to do this elegantly; the current core aggregator does not support this. A possible approach is sketched below.
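A sketch of such auto-discovery, not the proposed implementation: strictly speaking SimpleXML cannot load sloppy HTML on its own, so this goes through DOMDocument::loadHTML() first and then walks the tree with SimpleXML.

<?php
// Sketch: find the <link rel="alternate"> feed URL on an HTML page.
$result = drupal_http_request('http://drupal.org');
$dom = new DOMDocument();
@$dom->loadHTML($result->data); // suppress warnings about invalid HTML
$html = simplexml_import_dom($dom);
foreach ($html->xpath('//link[@rel="alternate"]') as $link) {
  $type = (string) $link['type'];
  if ($type == 'application/rss+xml' || $type == 'application/atom+xml') {
    $feed_url = (string) $link['href'];
    break;
  }
}
?>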

Pipe or other structure

I imagine the pipe structure as the classic Unix pipe: output goes to input, so it's a chain where every stage can adjust the data.
The other possibility is that the processors are independent and don't know about each other; that looks like a tree.

Pipe's advantage: very flexible, and the parts can exploit each other's capabilities.
Tree's advantage: the parts can be totally independent, so they don't know about each other. It is more robust: one buggy component cannot break anything else. Both shapes are sketched in code below.
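To make the two shapes concrete (all function names here are made up for illustration):

<?php
// Pipe: each processor receives the previous processor's output.
$data = aggregator_fetch($url);
foreach (array('parser_rss', 'processor_node', 'processor_tag') as $stage) {
  $data = call_user_func($stage, $data); // output becomes the next input
}

// Tree: every processor independently receives the parser's output,
// so a failure in one of them stays isolated.
$parsed = parser_rss(aggregator_fetch($url));
foreach (array('processor_node', 'processor_tag') as $processor) {
  call_user_func($processor, $parsed);
}
?>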

Support multiple configurations

A feature that proved to be useful in FeedAPI is the capability of setting up different configurations of feed processors. A content type would define a configuration of a feed processor. Thus it is possible to, e.g., set up a feed content type for iCal feeds that produces event content types and another feed content type for RSS/Atom feeds that produces news articles.
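Roughly what such per-content-type configuration could look like; the keys and parser/processor names below are assumptions for illustration, not FeedAPI's actual storage format.

<?php
// Sketch: each feed content type maps to its own parser/processor setup.
$settings = array(
  'feed_ical' => array(
    'parsers' => array('parser_ical'),
    'processors' => array('feedapi_node'),
    'item_node_type' => 'event',
  ),
  'feed_rss' => array(
    'parsers' => array('parser_common_syndication'),
    'processors' => array('feedapi_node'),
    'item_node_type' => 'news_article',
  ),
);
?>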

Settings handling

Should the settings of add-on modules be handled by the aggregator module or not? Especially if you consider per-content-type configurations, per-node overrides (think of a default setting of a public flag on the story content type, overridden on a per-node basis) become very useful - and that in turn is easy to do if settings are handled by the aggregator. On the other hand, settings handling is a complex and rather long piece of code in feedapi.module.
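The per-node override idea could be as simple as the following lookup; the function name, property and variable names are hypothetical.

<?php
// Sketch: a per-node value wins over the content type default.
function aggregator_get_setting($name, $node) {
  if (isset($node->aggregator_settings[$name])) {
    return $node->aggregator_settings[$name];
  }
  $defaults = variable_get('aggregator_settings_' . $node->type, array());
  return isset($defaults[$name]) ? $defaults[$name] : NULL;
}
?>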

Types of feeds

To illustrate the multitude of feeds that could be handled by a new aggregator:

  • RSS/Atom
  • NewsML
  • iCal
  • Email
  • Other generic XML formats
  • HTML content through scraper
Attachments: pipe.png (10.14 KB), tree_struct.png (11.9 KB)

Comments

Types of feeds

Boris Mann

If we have a focus on RDF in Drupal 7, then this thinking should be included in the aggregator design. Add this to "types of feeds".

A focus on RDF? Where's this

Morbus Iff

A focus on RDF? Where's this coming from? RDF, in regards to feeds, tends to refer to RSS 1.0 and RSS 0.91 (as opposed to Winer's RSS 0.92 and 2.0 formats). Generally speaking, most of the folks that I know of in the semweb community have given up on RDF, per se.

RDF data...

Boris Mann

...not RDF 1.0 craziness. But rather, SemWeb data expressed as XML / RDF "feeds". Don't worry, I haven't jumped the Shirky :P

speed of drupal_execute

greggles

It seems like the speed problem with drupal_execute has less to do with drupal_execute itself and more to do with the subsequent node_load, right?

If so, perhaps
1) You could avoid the node load and query directly?
2) You could alter drupal_execute?

I'm concerned (particularly as a maintainer of a 3rd party module that alters the node form) that alternate solutions may never work well...

--
Open Prediction Markets | Drupal Dashboard
