Do not parse feeds on acquisition

Posted by agentrickard on May 31, 2007 at 1:33pm

This is a concept that hit me last night.

One of the issues we have with the current Aggregator is that past a certain number of feeds, it has trouble finishing its cron run.

Part of this is caused by the fact that Aggregator does three actions during cron:

Get feed
Validate feed
Parse feed

Really, aren't only the first two necessary during the cron run? Couldn't we save some cycles and boost performance by delaying the parsing stage until the data was requested by a user-initiated function (like an Aggregator block load)?

In this scenario, after feed validation, the raw XML would be stored temporarily in the database, then parsed the next time it is requested.
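A minimal sketch of that split, written in Python purely for illustration (Aggregator itself is PHP). The `raw_feed` table and all function names are hypothetical, and a plain well-formedness check stands in for feed validation.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_store():
    """Create an in-memory store for raw feed XML (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_feed (url TEXT PRIMARY KEY, xml TEXT NOT NULL)")
    return db


def cron_store(db, url, xml_text):
    """Cron step: check the XML is well-formed, then stash it unparsed."""
    ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    db.execute(
        "INSERT OR REPLACE INTO raw_feed (url, xml) VALUES (?, ?)",
        (url, xml_text),
    )


def titles_on_request(db, url):
    """User-initiated step: extract item titles only when someone asks."""
    row = db.execute("SELECT xml FROM raw_feed WHERE url = ?", (url,)).fetchone()
    if row is None:
        return []
    root = ET.fromstring(row[0])
    return [item.findtext("title", default="") for item in root.iter("item")]
```

Here the cron step only writes one row per feed; the item extraction happens in `titles_on_request`, which a block load would call.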

Comments

If the feed isn't parsed on

Posted by budda on May 31, 2007 at 1:40pm

If the feed isn't parsed on a regular basis and stored in the database, the site may miss some new feed items (if the RSS feed is very active, such as forum posts) because nobody visited the site for 12 hours to trigger a block view.

Or am I misunderstanding your idea?

I do agree, feed parsing is time-consuming in the cron loop, which is why I implemented the "X feeds per cron run" feature for FeedParser.

I was thinking about that

Posted by alex_b on June 4, 2007 at 5:24pm

I was thinking about that same thing the other day...

So, would you store the entire feed locally and then only parse if the current feed differs from the stored one?
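One way to picture the "only parse if it differs" check is to compare a digest of the downloaded feed against the digest stored on the previous run. This is a sketch in Python under that assumption; `needs_parse` and the `stored_digests` dict are hypothetical names, and a real site would keep the digests in the database.

```python
import hashlib


def needs_parse(stored_digests, url, xml_bytes):
    """Return True if the feed body changed since the last stored copy."""
    digest = hashlib.sha256(xml_bytes).hexdigest()
    if stored_digests.get(url) == digest:
        return False  # identical bytes: skip parsing entirely
    stored_digests[url] = digest  # remember the new version
    return True
```

A byte-level comparison treats any change as a new version, so a feed that rewrites a timestamp on every request would always be reparsed.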

Another cron-timeout problem I am looking at is the creation of nodes, which actually takes longer than parsing/duplicate identification and is the point where things tend to time out on our sites.

It could help to decouple the parsing/node creation process. Any ideas on this?

To complete the picture: Aron Novak implemented a multi-threaded cron patch for the feed download process in leech because of timeout issues.

http://www.twitter.com/lxbarth

The Feedparser package has

Posted by budda on June 7, 2007 at 10:58pm

The Feedparser package has already decoupled the acquisition of the feed, the XML parsing, and the actual creation of nodes etc.

I guess the next stage would be to only create XX nodes per cron run? But surely there will come a point where the number of feeds being updated generates too much new content to ever create all the nodes for.
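A capped batch like this can be sketched as below, in Python for illustration only. `run_batch`, the queue, and the `create_node` callback are hypothetical stand-ins, not part of Feedparser.

```python
from collections import deque


def run_batch(queue, create_node, limit):
    """Create at most `limit` nodes from the pending queue in one cron run."""
    done = 0
    while queue and done < limit:
        create_node(queue.popleft())  # oldest pending item first
        done += 1
    return done
```

The sketch also shows the worry raised above: if more items arrive between runs than `limit` allows, the queue only ever grows.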

One of the sites I run is running a cron process every 2 minutes to try and keep up with new content aggregation, and it still dies often in the cron process, giving lots of "Last cron run did not complete." in the watchdog log.

I looked at using threaded curl commands to download multiple feeds at once, but I don't think that's the real bottleneck, and multi-threaded curl requires PHP5 too.

Do you have a link to the patch you mentioned?

hi budda, here is the link:

Posted by alex_b on June 8, 2007 at 2:24pm

hi budda,

here is the link: http://drupal.org/node/87528

http://www.twitter.com/lxbarth

Node creation

Posted by agentrickard on June 5, 2007 at 5:39pm

Certainly I think node creation could be done lazily. The only question is a version of budda's: "What if no one asks for the node and it never gets created?"

Perhaps the answer is to separate the acquire / parse / create routines in the cron task and then stagger them. "Create" can only run if "Parse" is finished. "Acquire" can only run if "Parse" and "Create" are finished....
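The staggering rule can be written as a small decision function that picks one stage per cron run. This is a Python sketch for illustration; `next_stage` and its two counters are hypothetical, and it assumes the site can count feeds waiting to be parsed and items waiting to become nodes.

```python
def next_stage(raw_pending, parsed_pending):
    """Pick the single stage this cron run should execute.

    raw_pending:    downloaded feeds not yet parsed
    parsed_pending: parsed items not yet turned into nodes
    """
    if raw_pending > 0:
        return "parse"    # parsing must drain before anything else runs
    if parsed_pending > 0:
        return "create"   # "Create" runs only once "Parse" is finished
    return "acquire"      # "Acquire" runs only when both queues are empty
```

Each cron run then does one kind of work, which keeps any single run short at the cost of needing more runs to move an item from download to node.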

Aaron will be working on this for SoC.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

Slow node creation

Posted by budda on June 7, 2007 at 10:53pm

If the process is broken down into the 3 separate stages, then the host site is going to have to run the cron job pretty frequently to keep up with all the feeds in a site's list.

As mentioned in an earlier post, the actual node creation process is the killer (because it triggers any number of modules that implement hook_nodeapi), so this could still kill the cron processing with a timeout.

Or...

Posted by boris mann on June 8, 2007 at 1:40am

Have a daemon that does the feed fetching that is non-Drupal -- e.g. Python, Java. Could do things externally like filtering, talk to Drupal via XML-RPC (and insert nodes that way, or talk to the DB directly).

Probably out of scope for what we're talking about here, but it's a better architecture than misusing cron for this....
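To make the XML-RPC route concrete, here is a sketch of the request an external Python daemon might build for one feed item. The method name `example.insertFeedItem` and the item fields are assumptions for illustration, not a real Drupal API; the site would need a module that exposes something like it.

```python
import xmlrpc.client


def build_insert_call(title, link, body):
    """Serialize one feed item as an XML-RPC request body.

    The method name is hypothetical; nothing is sent over the network here.
    """
    item = {"title": title, "link": link, "body": body}
    return xmlrpc.client.dumps((item,), methodname="example.insertFeedItem")
```

To actually send it, the daemon would use `xmlrpc.client.ServerProxy` pointed at the site's XML-RPC endpoint instead of building the payload by hand.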

A daemon idea does sound

Posted by budda on June 28, 2007 at 5:36pm

A daemon idea does sound more and more appealing when people are talking about aggregating 4000+ feeds on a daily basis. That's over 166 feeds per hour :-(

I've never looked at Drupal's XML-RPC interface so I don't know what it's capable of. However, I'd be interested in any further ideas on how to implement this for larger-scale sites.

Same here. We are looking at

Posted by alex_b on June 28, 2007 at 10:21pm

Same here.

We are looking at a series of aggregators on various Drupal installations whose overlap in the feedbase increases from day to day. Just today the idea came up again to directly suck nodes from a sibling system instead of downloading them from the actual host of the feed.

We are not quite there yet. But we might be soon. I am definitely interested in ideas about using an external aggregator or another Drupal installation as "downloader".

http://www.twitter.com/lxbarth

An idea

Posted by budda on July 4, 2007 at 10:31am

I woke up with this idea, so it might not be well thought out, but I wanted to bounce it around here for people to pick at.

The aggregator module implements its own ?q=aggregator/cron.php path.
Sites would need to add an additional crontab entry for this dedicated aggregator cron call, maybe running it every few minutes to keep on top of their feeds.

The new aggregator cron call could then use libcurl multithread mode to call another URL X times, where X is defined in the feed manager settings. This would pull in the feeds and parse the XML into some interim format to be consumed later on.

An additional URL could also be called at the same time, which consumes the interim feed items waiting and processes them into nodes or simple items, or anything else.

Problem - multithreaded libcurl only works in PHP5.
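The parallel pull with X workers can be sketched as follows. This swaps in Python's thread pool for libcurl's multi mode, purely to illustrate the shape; `fetch_all` is a hypothetical name and the `fetch` callable is an assumption (in practice it would download one URL and return its body).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, workers):
    """Fetch every URL using at most `workers` concurrent calls.

    Returns (url, body) pairs in the same order as `urls`, ready to be
    written to an interim store for a later consumer to process.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = list(pool.map(fetch, urls))  # map() preserves input order
    return list(zip(urls, bodies))
```

In this sketch a single failing fetch raises out of `fetch_all`, so a real version would want to catch errors per feed and carry on with the rest.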

php5

Posted by agentrickard on July 4, 2007 at 7:12pm

High-volume sites need to run on PHP 5 anyway :-).

I like the idea of separate cron. The multi-threading is over my head, though.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

I think a

Posted by alex_b on July 5, 2007 at 6:46am

I think a ?q=aggregator/cron.php path makes a lot of sense. Just separating the aggregator cron run from the rest of the pack (potentially slow ones like search, ...) can buy enough time in a lot of cases.

There is an implementation of a separate cron trigger for leech in the 4.7.x version of the module.

http://www.twitter.com/lxbarth
