I plan to compare the different XML feed parsers here in terms of functionality, speed, and the interface that each API provides.
Please extend this page with any parsers or feed formats you think are worth covering.
I'll include the following parsers:
- leech_news_parser
- SimplePie
- Aggregation's parser
Example feeds
- RSS 0.91
- RSS 0.92
- RSS 2.0
  - http://static.userland.com/gems/backend/rssTwoExample2.xml
  - http://www.christiannewswire.com/rss/catfeed_2.xml (a troublemaker because of its size)
- Atom 0.3
- Atom 1.0
- RDF
Code size
leech_news_parser | SimplePie | Aggregation's parser |
20K | 130K | 6.3K |
Supported feed types
Parser | RSS 0.91 | RSS 0.92 | RSS 2.0 | Atom 0.3 | Atom 1.0 | RDF |
leech_news_parser | X | X | X | X | X | X |
SimplePie | X | X | X | X | X | X |
Aggregation | X | X | X | X |
I tested the Aggregation parser's compatibility with some feeds from Aggregator's issue queue. Results are here
Parsing speed
The times below are the processing times for the biggest example feed above: http://www.christiannewswire.com/rss/catfeed_2.xml .
leech_news_parser | SimplePie | Aggregation's parser |
2.41s | 10.99s (1s - if the sanitizing is off) | 0.1s |
Morbus Iff, morbus@disobey.com -- The above tests are "unfair" to SimplePie, see http://lists.drupal.org/pipermail/development/2007-June/024790.html.
Aron Novak - OK, thanks, I didn't know much about SimplePie's internal structure.
So the situation is that SimplePie provides a lot of things when it is allowed to do sanitizing:
- $this->enable_order_by_date(true);
- $this->remove_div(true);
- $this->strip_comments(true);
- $this->strip_htmltags(true);
- $this->strip_attributes(true);
- $this->set_image_handler(true);
The values are the averages of three executions. The deviation was negligible.
For SimplePie I had to make an estimate, because the feed download time was included in the measurement.
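The measurement approach described above (average of several runs, with download time kept out of the figure) could be sketched roughly as follows. This is only an illustration, not the test code used for this page: benchmark_parser() is a hypothetical helper, the inline feed is a made-up stand-in, simplexml_load_string merely takes the place of a real parser, and the type declarations assume a current PHP version.

```php
<?php
// Hypothetical helper (not from any of the parsers discussed here):
// runs a parser callable several times on an in-memory feed and
// returns the average wall-clock time in seconds.
function benchmark_parser(callable $parser, string $xml_data, int $runs = 3): float {
  $total = 0.0;
  for ($i = 0; $i < $runs; $i++) {
    $start = microtime(true);
    $parser($xml_data);                 // only the parsing itself is timed
    $total += microtime(true) - $start;
  }
  return $total / $runs;
}

// Stand-in feed; in a real test this would be the downloaded feed body,
// fetched once before the timing starts.
$xml_data = '<rss version="2.0"><channel><title>Example</title>'
          . '<item><title>One</title></item></channel></rss>';

// simplexml_load_string is used here only as a placeholder parser.
$average = benchmark_parser('simplexml_load_string', $xml_data);
printf("average over 3 runs: %.6fs\n", $average);
```

Passing the feed as a string means every parser is timed on identical input, so a parser that normally fetches the URL itself does not get the network wait added to its result.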
Behaviour, structure of the result / input of the parser
leech_news_parser
Input: the raw XML data
Example usage: $parsed = leech_news_parse_news_feed($xml_data);
Output: Mix of arrays and objects->StdClass
SimplePie
Input: URL of the feed; downloading and similar tasks are handled behind the API
Example usage:
$feed = new SimplePie();
$feed->feed_url($url);
$feed->init();
$parsed = $feed->get_items();
Output: Mix of arrays and objects->SimplePie_Item
The parsed data can also be accessed through SimplePie->function() methods. Let's see the API.
Aggregation
Input: Type of the feed, and the XML data processed by simplexml
Example usage: call_user_func_array('aggregation'.$handler_name.'_parse', array($feed_xml, $feed));
Output: Uses the PHP5-only simplexml_load_string, then mines the data from the result according to the specified feed format. The data is consumed immediately; there is no intermediate stage like in the parsers above.
Expanding the parsing capability is trivial. This is the most straightforward parser with the smallest codebase, because most of the work is done inside PHP itself. But feed-format auto-detection is a must.
Mentor's note: I agree. The parser-selection system for Aggregation is tied to Taxonomy, which is a very odd design choice. Automating feed detection == good user experience. - Ken
Aggregation author: I used the taxonomy to utilize its current add/edit/delete capability. With a good argument against this, I could do a move to an external table in an hour tops. As to auto-detection, I got no issues at all with users finding trouble in specifying their feed types. A patch for auto detection could be done in an hour's time. I currently have many more interesting feature requests on the issue queue. Any patches are welcome though!
Aron Novak - It is very likely that I'll send you patches like the ones you mentioned. Honestly, my current idea is to make Aggregation's parser the new API's default parser. At the top of the page, you can see that I plan some more tests with the parser.
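As a rough idea of what the auto-detection discussed above might look like, here is a minimal sketch that guesses the format from the root element of the feed. It is not taken from Aggregation or any other parser on this page: detect_feed_format() and the labels it returns are hypothetical, and real-world feeds will need more careful handling than this.

```php
<?php
// Hypothetical helper: guess the feed format from the root element,
// its version attribute, and its namespaces. Returns null when the
// data is not well-formed XML or the format is not recognised.
function detect_feed_format(string $xml_data): ?string {
  // Collect libxml errors internally instead of emitting warnings.
  $previous = libxml_use_internal_errors(true);
  $xml = simplexml_load_string($xml_data);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  if ($xml === false) {
    return null;
  }

  $root = $xml->getName();              // local name, e.g. "rss", "feed", "RDF"
  $version = (string) $xml['version'];  // empty string if the attribute is absent
  $namespaces = $xml->getNamespaces();  // namespaces used on the root element

  if ($root === 'rss') {
    return $version !== '' ? 'RSS ' . $version : 'RSS';
  }
  if ($root === 'feed') {
    if (in_array('http://www.w3.org/2005/Atom', $namespaces, true)) {
      return 'Atom 1.0';
    }
    if ($version === '0.3') {
      return 'Atom 0.3';
    }
  }
  if ($root === 'RDF') {
    return 'RDF';
  }
  return null;
}

echo detect_feed_format('<rss version="2.0"><channel/></rss>'), "\n"; // prints "RSS 2.0"
```

The returned label could then be mapped to a handler name and passed on to the existing format-specific parse functions, so the user would no longer have to pick the feed type by hand.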
What version of SimplePie did you use, and what exactly were you trying to test? It's important to know that SimplePie does a lot of data sanitization by default to protect users from malicious feeds, and generally make things more useful -- which, in turn, requires a bit more processing. Also, if you're going to test speed, you should test the post-cached speed instead of calculating the wait-for-the-remote-server-to-respond time into it. -- Ryan Parman
Aron Novak - You are absolutely right, I just neglected to publish the test method and the details. (By the way, the SimplePie question is discussed above.)
Ryan Parman - Aron, thanks for updating that. It should be noted that set_stupidly_fast() is only available in the development builds (soon to be 1.0), and we've also spent some time tuning SimplePie for performance. If possible, I'd also like to add an Atom 1.0 feed to your testing: http://www.tbray.org/ongoing/ongoing.atom . This feed would test the correct handling of a very, very complicated Atom 1.0 feed, which SimplePie does supremely well and most others fail at.
Details of the test method
I made the test on a PHP5 + Apache2 server with Zend Platform installed (for debugging reasons).
The versions:
leech_news_parser | SimplePie | Aggregation's parser |
1.2 | 1.0 Beta 3.2 | 3.0 |
Test code:
You can see the testing code here.