NITF/Atom/NewsML extensions for FeedAPI

yelvington@drupal.org - Thu, 2008-03-13 19:13

Status: Added to official ideas list http://drupal.org/node/234652

I'm crossposting this to the following groups, all of which have a dog in this hunt: SOC2008, Knight Foundation, Newspapers on Drupal, RSS & Aggregation.

News agencies, "legacy" newsroom management system implementors, publishers and archiving companies all support an XML standard called News Industry Text Format (NITF), developed by the International Press Telecommunications Council.

We need a robust, broadly supported common NITF feed handler that works with the Drupal FeedAPI framework, ultimately enabling loading of NITF data into CCK nodes with configurable entity mapping. This feed handler should expose its own API so that additional handlers can be added to process NewsML (primarily championed by Reuters) and Atom wrappers (used by the Associated Press in AP Exchange).

This is a foundational piece that's needed for Drupal to be a long-term solution to many news-handling requirements of news publishers, who increasingly are flocking to Drupal as a website solution.

I'm proposing this as a Google Summer of Code idea and offering to serve as a mentor to the project.

This project is straightforward in theory but not in practice, as the quality of NITF implementation varies more than a little bit among the various data providers. In addition, both AP and Reuters have come up with significantly differing approaches to extending the capabilities of NITF to encompass multiple media types and sets of information. The AP is moving to a search-based "wire" in which editors will be able to create ad hoc and stored queries, exporting either individual NITF items or Atom feeds containing additional metadata and embedded NITF. Reuters originated and obtained IPTC approval of a "wrapper" called NewsML that predates Atom significantly.

References:
International Press Telecommunications Council http://www.iptc.org/
http://www.nitf.org/
http://www.newsml.org/

IPTC also maintains standards for events, sports data, TV listings, et cetera. Any work on this project should provide a sound foundation for further projects in those areas.

Town News, a US-based Web service provider to a number of smaller newspapers, contributed an NITF PEAR class that may be of use.
http://pear.php.net/reference/PHP_Beautifier-0.1.1/XML/XML_NITF.html


Sounds like a

alex_b@drupal.org - Thu, 2008-03-13 20:44

Sounds like a straightforward task and an exciting opportunity for Drupal.

Is this a good example feed for NITF? http://www.nitf.org/IPTC/NITF/3.2/examples/nitf-fishing.xml Are there others?

I found this discussion about NITF here: http://groups.drupal.org/node/2667

Technically, this is about writing a FeedAPI-parser that implements this functionality.

I see the PEAR classes that you posted (PHP 4 based, PHP 5 compatible) - is anybody aware of any SimpleXML-based NITF parser?

Alex


Not quite straightforward

yelvington@drupal.org - Sat, 2008-03-15 15:36

It should be straightforward but there are some twists.

The NITF spec is huge and overly ambitious, while the implementations tend to be invalid in one way or another. Some of the legacy newsroom systems emit real trash. In the AP Exchange and NewsML examples, NITF is entangled with yet another layer of XML.

So you have to accommodate errors gracefully and provide an open interface for the Atom and NewsML integrations down the road if anybody is going to be able to use this.

Take the AP for example.

AP produces a general news service that's targeted for immediate online publication, called AP Online. It's minimally marked up, but generally clean. It's not hard to build a FeedAPI loader that works with it.

But a lot of us (particularly newspapers) don't want general newswires; everybody has them and there's no real value to being one of a thousand sites carrying AP Online. What we want is deep information about specific subjects of high interest to our geographic or other niche. Much of that information doesn't go on the AP Online wire and isn't available to competitors such as Google and Yahoo.

AP's proposed solution is AP Exchange, which gives editors global access to a huge AP database. But the output implementation is frustrating.

The NITF spec has a tag structure for compound headlines. AP sticks non-headline information into it.

  <hedline>
    <hl1>Martians land at White House, 27th-Ld-Writethru, NA</hl1>
  </hedline>

The proper headline in this example is "Martians land at White House" and everything else is metadata that might or might not be there and doesn't belong there. This is basically a 1970s ANPA-1392 slugline stuck into the headline field, probably for the benefit of some luddite newspaper that hasn't discovered the 21st century and is still using an editing system that runs on a PDP-8.
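A parser could try to recover the clean headline from such an hl1 with a heuristic. The following is only a sketch: it assumes the slugline fields follow the headline as a comma-separated tail and that a field shaped like "27th-Ld-..." marks where the tail begins. The function name and the pattern are hypothetical, derived solely from the single example above, and real feeds will need more cases.

```php
<?php
// Hypothetical helper: separate an AP-style <hl1> into the headline and the
// trailing slugline fields. The "Nth-Ld" marker pattern is an assumption
// based only on the one example in this thread.
function split_ap_headline($hl1) {
  $parts = array_map('trim', explode(',', $hl1));
  foreach ($parts as $i => $part) {
    // Never treat the first segment as metadata; a headline must remain.
    if ($i > 0 && preg_match('/^\d+(st|nd|rd|th)-Ld\b/i', $part)) {
      return array(
        'headline' => implode(', ', array_slice($parts, 0, $i)),
        'slug' => array_slice($parts, $i),
      );
    }
  }
  // No recognizable marker: leave the text alone rather than guess.
  return array('headline' => trim($hl1), 'slug' => array());
}

$hedline = simplexml_load_string(
  '<hedline><hl1>Martians land at White House, 27th-Ld-Writethru, NA</hl1></hedline>'
);
$result = split_ap_headline((string) $hedline->hl1);
echo $result['headline'], "\n";
```

A headline that merely contains a comma is returned whole, because nothing after the comma matches the marker pattern.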

I've been bitching about this a great deal, and the AP has agreed to provide a clean headline -- but external to NITF, in the Atom layer. So in that example the process might work like this:

  • Editor defines a persistent search in AP Exchange, creating in effect a custom newswire.
  • The search is exported as an Atom feed. AP has extended the Atom namespace with custom tags and AP can embed the full NITF for a story inside the Atom feed.
  • Drupal process fetches the Atom feed. For each valid item, it gets the headline from the Atom layer, then hands the embedded NITF to the NITF parser, getting back an object. We then can ignore the NITF object's headline and prefer the Atom version, but we use the body and perhaps other components of the NITF object.
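The three steps above can be sketched with SimpleXML. The feed layout here is an assumption made for illustration: the NITF document is shown as a child element of the Atom content element, and AP's custom tags are left out because their names are not given in this thread. The function name example_parse_nitf is hypothetical.

```php
<?php
// Sketch only. The sample feed (NITF nested inside <content>) is an assumed
// layout for illustration, not AP's documented format.
$feed_xml = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Martians land at White House</title>
    <content type="text/xml">
      <nitf xmlns="">
        <body>
          <body.head>
            <hedline><hl1>Martians land at White House, 27th-Ld-Writethru, NA</hl1></hedline>
          </body.head>
          <body.content><p>First paragraph.</p><p>Second paragraph.</p></body.content>
        </body>
      </nitf>
    </content>
  </entry>
</feed>
XML;

// Hypothetical NITF parser: returns an object with a title and a body.
function example_parse_nitf(SimpleXMLElement $nitf) {
  $item = new stdClass();
  $item->title = trim((string) $nitf->body->{'body.head'}->hedline->hl1);
  $paragraphs = array();
  foreach ($nitf->body->{'body.content'}->p as $p) {
    $paragraphs[] = trim((string) $p);
  }
  $item->description = implode("\n\n", $paragraphs);
  return $item;
}

$feed = simplexml_load_string($feed_xml);
$items = array();
foreach ($feed->entry as $entry) {
  // Hand the embedded NITF to the NITF parser and get an object back.
  $item = example_parse_nitf($entry->content->nitf);
  // Prefer the clean headline from the Atom layer when there is one.
  $atom_title = trim((string) $entry->title);
  if ($atom_title !== '') {
    $item->title = $atom_title;
  }
  $items[] = $item;
}
echo $items[0]->title, "\n";
```

The point of the split is that the Atom handler only decides which layer wins for each field; the NITF parser itself knows nothing about Atom.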

Along the way you have to make some decisions about how much information to keep and how much to discard. Presumably this ought to be part of the setup process for each feed, allowing the Drupal admin to map entities to CCK fields.
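A per-feed mapping of that kind might be no more than an array from parsed entities to node fields. This is a minimal sketch under assumptions: the entity names, the field name field_byline, and the function nitf_map_to_node are all hypothetical, and the nested array is meant to resemble how CCK stores field values rather than to reproduce it exactly.

```php
<?php
// Hypothetical per-feed mapping: parsed item property => node field.
// The field names are placeholders an admin would choose during setup.
$mapping = array(
  'title' => 'title',
  'description' => 'body',
  'byline' => 'field_byline',
);

function nitf_map_to_node($item, array $mapping) {
  $node = new stdClass();
  foreach ($mapping as $source => $target) {
    if (!isset($item->$source)) {
      continue;  // Entity not present in this item: skip it.
    }
    if (strpos($target, 'field_') === 0) {
      // CCK-style storage (assumed): a list of values keyed by delta.
      $node->$target = array(array('value' => $item->$source));
    }
    else {
      $node->$target = $item->$source;
    }
  }
  return $node;
}

$item = new stdClass();
$item->title = 'Martians land at White House';
$item->description = 'Body text.';
$item->byline = 'By A. Reporter';
$item->slug = '27th-Ld-Writethru';  // Not in the mapping, so it is discarded.

$node = nitf_map_to_node($item, $mapping);
echo $node->title, "\n";
```

Anything the admin leaves out of the mapping is simply dropped, which is the "how much to discard" decision made explicit.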

Now, it's not necessary for all of this to be accomplished in the SoC project. I'm trying to paint the larger picture because if all we get is a parser that only works on clean, sane, proper NITF and doesn't accommodate the real world, we won't be laying the right foundation.


Example files

yelvington@drupal.org - Sat, 2008-03-15 15:40

Is this a good example feed for NITF? http://www.nitf.org/IPTC/NITF/3.2/examples/nitf-fishing.xml Are there others?

I'm not crazy about that example. I know what Alan Karben was trying to demonstrate but it would be better if IPTC also had some examples that were structured as common news stories.

Next week I'll see if I can get AP's permission to post some actual example files from its feeds, and also try to dig up some NITF output from a DTI system. If anyone else could provide some example data from other sources (hello, Reuters?) it would be helpful.


I see the PEAR classes that

johsw@drupal.org - Thu, 2008-03-13 21:19

I see the PEAR classes that you posted (PHP4 based, PHP 5 compatible) - is anybody aware of any SimpleXML based NITF parser?

I have a module that imports NITF-formatted XML files into our CCK content type. But because our editorial system doesn't produce valid NITF, it's kinda hackish. But it uses SimpleXML and it works for our use. (see: http://drupal.org/information.dk)

But I would really like to help out making a more generic solution using FeedAPI and mapping to cck-fields 'n all.


Right

agentrickard@drupal.org - Fri, 2008-03-14 00:53

But because our editorial system doesn't produce valid NITF, it's kinda hackish.

This is the common problem. Legacy systems for newspapers all churn out crap. We need a clean parser that is extensible to account for variations.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3


Rough architecture

alex_b@drupal.org - Fri, 2008-03-14 14:08

This is the common problem. Legacy systems for newspapers all churn out crap. We need a clean parser that is extensible to account for variations.

First step is to define an internal format of NITF data - a data structure that defines a place for all standard elements and a place for non-standard elements.

e.g., the internal data format for RSS/Atom feeds looks like this ATM:

$item->title;
$item->description;
$item->options->original_url;
$item->options->author->name;
$item->options->author->link;
...

The NITF parser can then handle the parsing of all standard parts. For handling non-standard elements, there needs to be a hook that passes the full news item (or full feed) down to implementing modules and sticks the result into the place reserved for non-standard elements.
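As a rough illustration of that split, the sketch below parses the standard parts into the item structure and then passes the raw NITF element to registered callbacks. It is written under assumptions: in Drupal the dispatch would presumably go through the module hook system, and a plain callback registry stands in for it here so the example runs on its own. All names (nitf_extensions, nitf_parse_item, example_ap_slug, $item->nonstandard) are hypothetical.

```php
<?php
// Hypothetical stand-in for a Drupal hook: a registry of callbacks that
// each receive the raw NITF element and return extra data.
function nitf_extensions($name = NULL, $callback = NULL) {
  static $callbacks = array();
  if ($name !== NULL) {
    $callbacks[$name] = $callback;
  }
  return $callbacks;
}

function nitf_parse_item(SimpleXMLElement $nitf) {
  $head = $nitf->body->{'body.head'};
  $item = new stdClass();
  // Standard parts, following the RSS/Atom item layout quoted above.
  $item->title = trim((string) $head->hedline->hl1);
  $paragraphs = array();
  foreach ($nitf->body->{'body.content'}->p as $p) {
    $paragraphs[] = trim((string) $p);
  }
  $item->description = implode("\n\n", $paragraphs);
  $item->options = new stdClass();
  $item->options->author = new stdClass();
  $item->options->author->name = trim((string) $head->byline);
  // Non-standard parts: each extension gets the full item and its result
  // is stored under the extension's own key.
  $item->nonstandard = array();
  foreach (nitf_extensions() as $name => $callback) {
    $item->nonstandard[$name] = call_user_func($callback, $nitf);
  }
  return $item;
}

// Example extension: keep whatever follows the first comma in hl1.
function example_ap_slug(SimpleXMLElement $nitf) {
  $hl1 = (string) $nitf->body->{'body.head'}->hedline->hl1;
  $pos = strpos($hl1, ',');
  return $pos === FALSE ? '' : trim(substr($hl1, $pos + 1));
}
nitf_extensions('ap_slug', 'example_ap_slug');

$nitf = simplexml_load_string(
  '<nitf><body><body.head>'
  . '<hedline><hl1>Martians land at White House, 27th-Ld-Writethru, NA</hl1></hedline>'
  . '<byline>By A. Reporter</byline>'
  . '</body.head><body.content><p>First paragraph.</p></body.content></body></nitf>'
);
$item = nitf_parse_item($nitf);
echo $item->nonstandard['ap_slug'], "\n";
```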

I assume that NITF has a concept of items - a feed basically consists of a series of news items - just like RSS (?). How big can NITF news feeds become? Are we talking about very large XML documents here?

Alex


Document formatting (and some background)

yelvington@drupal.org - Sat, 2008-03-15 16:09

I think there's an assumption that NITF is a single-document text formatting standard, although there are accommodations for embedding references to other media types and related items.

The support in NITF for packaging multiple related items is weak, and that's one of the reasons Reuters pushed for NewsML, which conceptually is a wrapper that can contain a mix of items.

Some background on NITF:

NITF began its life as SGML markup intended to replace the ANPA-1392 standard for wire service transmissions, which dated back to the era when newspapers set type in molten lead. The NITF standard was unfinished when the Web came along, so many of its tags were adjusted to accommodate HTML conventions. Then it was still unfinished when XML became the hot item, so it was adjusted some more to make it conform with XML conventions. Unfortunately after that it was still unfinished and delivering no value to wire service subscribers.

There is a division in the newspaper business between the slow-moving "big iron" print-centric camp and the "web guys." NITF was in the hands of the former group.

Back around 1997 or so a bunch of us "web guys" got together at an American Press Institute workshop. Frustrated at the lack of progress on NITF, we began working on our own quick-and-dirty XML standard designed to meet Web needs.

Dan Froomkin, who now writes about politics for washingtonpost.com, made the original pitch and it was called "Danco Web-O-Matic," quickly renamed NML as we brought into the conversation a number of serious players from Atex, WSJ.com (Alan Karben, who had implemented a proprietary XML standard for the Journal), some news librarians, and academicians. We were trying to be thorough and introduced geocoding, the Dublin Core and other concepts into the conversation. We discovered that Dave Megginson was working in the same area and swallowed as much of his work as we could.

This alarmed the NITF folks. They asked us to join with them, so we came together in a daylong summit in Dallas. We did a quick gap analysis, agreed to add some NML items to the NITF mix (notably the "abstract" tag, which is the equivalent of Drupal's "teaser" in a display context), and shoved it out the door. About a month later NITF 1.0 was published and the way was cleared for wire service adoption.

I pretty much walked away from the conversation at that point, having accomplished my immediate goal. Now, a decade later, I'm looking at what some of the implementors have done (or not done) with the standard and frankly I'm appalled. But that's the world we live in and that's the world any Drupal project has to accommodate.


Example of malformed nitf

johsw@drupal.org - Mon, 2008-03-17 09:46

I've pasted an elaborate example from our system here:

http://drupalbin.com/1210

This example shows two of the problems with our exporter.

1) The subheader is nested in the hl1 tags, so I have to split on the line break.
2) The same goes for the byline field, where a "trumpet" is inserted before the actual byline.
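Both workarounds come down to splitting a text value on its first line break. A minimal sketch follows; the sample markup is invented for illustration (it is not the drupalbin paste), and the helper name nitf_split_first_line is hypothetical.

```php
<?php
// Hypothetical helper: split a text value on its first line break and
// return array(first line, remainder). Surrounding whitespace is dropped.
function nitf_split_first_line($text) {
  $parts = preg_split('/\s*(?:\r\n|\r|\n)\s*/', trim($text), 2);
  return array($parts[0], isset($parts[1]) ? $parts[1] : '');
}

// Invented sample: the subheader shares hl1 with the headline, and the
// "trumpet" sits on the line before the actual byline.
$head = simplexml_load_string(
  "<body.head>"
  . "<hedline><hl1>Main headline\n      Subheader text</hl1></hedline>"
  . "<byline>Trumpet text\n      By A. Reporter</byline>"
  . "</body.head>"
);

list($headline, $subheader) = nitf_split_first_line((string) $head->hedline->hl1);
list($trumpet, $byline) = nitf_split_first_line((string) $head->byline);
echo $headline, ' / ', $subheader, "\n";
```

A value without any line break comes back with an empty second element, so well-formed input passes through unchanged.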

This is our reality :-)

Best,
/Johs.


I'm interested in this

T Yang - Tue, 2008-03-25 13:59

I agree with Alex's idea.
@yelvington
May I ask a question? Is the problem now only misplaced data? I think one more example would make it clearer to me. Thanks.

No conflict with Aggregator for D7 proposal

alex_b@drupal.org - Tue, 2008-03-25 20:30

If the core aggregator proposal for D7 goes through (http://groups.drupal.org/node/9857), and the project is successful, FeedAPI will be discontinued with D7.

However, FeedAPI functionality will be ported or portable to the D7 core aggregator.

Therefore, working on an NITF parser as an add-on for FeedAPI is not a lost investment. In any case I recommend developing on D6 and keeping the actual FeedAPI integration cleanly separated from the parser logic. That way, an upgrade to D7 further down the road will be easy.

Alex


Good idea

T Yang - Thu, 2008-03-27 05:42

@Alex
I'll look into the FeedAPI module to see how to make it part of FeedAPI.

@Yelvington
I'd like to hear your suggestions about the more specific features this module should implement. Could you be the mentor of this project?

Thanks,

Yang Tao

Mentoring

yelvington@drupal.org - Wed, 2008-04-16 20:06

Sorry, I have been traveling (Macau, Bangkok, Phuket, Kuala Lumpur) for a couple of weeks and almost completely offline.

Yes, I can mentor. I have a fairly good grasp of NITF and the (awful, in my opinion) implementations. The problem is not accommodating the standard, but rather dealing with real-world data.


thoughts

T Yang - Sun, 2008-04-20 03:04

Should it look like this?
About the parser: a standard NITF parser that can also consult a list describing a specific parser for each specific tag, so that the feed is parsed correctly.

In that list the user can define different types of feeds. Feeds of the same type share the same specific tags that need special handling. The next time the user creates a node, he can choose the type from the list if it has already been defined.

But if the type is defined during node creation, is it necessary to modify the FeedAPI module? How could it be done differently, without touching the FeedAPI module? Maybe by writing a processor and invoking feedapi_node?

Yang Tao

PRISM

chu@drupal.org - Fri, 2008-04-04 10:48

You might want to take a look at the PRISM standard as it encapsulates NITF and NewsML - http://www.prismstandard.org/

Sending you an actual sample

Prnewsnow - Tue, 2008-05-13 23:49

I am sending a sample of the raw XML from AP to Alex. I just got full access to 100% of the AP content to create real-time distribution, but their system is "labor intensive": downloading files to a production machine only to reload them back to the server. The AP is completely behind the times on what we can do with information.

Their combined XML format has media files for stories that have nothing to do with the images or captions.

The first step would be to get simple information from the file into a single line of database information: title, byline, image info, body text, pub date.
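That first step could look roughly like the sketch below, which flattens one story into an array suitable for a single database row. It assumes plain NITF 3.x element names (hedline/hl1, byline, media with media-reference and media-caption, docdata/date.issue) rather than AP's combined format, which is not shown in this thread. The function name nitf_to_row and the sample values are hypothetical.

```php
<?php
// Hypothetical helper: flatten one NITF story into a single row.
// Assumes plain NITF 3.x element names, not AP's combined format.
function nitf_to_row(SimpleXMLElement $nitf) {
  $head = $nitf->body->{'body.head'};
  $content = $nitf->body->{'body.content'};

  $paragraphs = array();
  foreach ($content->p as $p) {
    $paragraphs[] = trim((string) $p);
  }

  // Only the first media element is kept; stories without one get blanks.
  $image_source = '';
  $image_caption = '';
  if (isset($content->media)) {
    $image_source = (string) $content->media->{'media-reference'}['source'];
    $image_caption = trim((string) $content->media->{'media-caption'});
  }

  return array(
    'title' => trim((string) $head->hedline->hl1),
    'byline' => trim((string) $head->byline),
    'image_source' => $image_source,
    'image_caption' => $image_caption,
    'body' => implode("\n\n", $paragraphs),
    'pub_date' => (string) $nitf->head->docdata->{'date.issue'}['norm'],
  );
}

$nitf = simplexml_load_string(
  '<nitf>'
  . '<head><docdata><date.issue norm="20080513T234900"/></docdata></head>'
  . '<body><body.head>'
  . '<hedline><hl1>Sample headline</hl1></hedline>'
  . '<byline>By A. Reporter</byline>'
  . '</body.head><body.content>'
  . '<media media-type="image"><media-reference source="photo.jpg"/>'
  . '<media-caption>Sample caption</media-caption></media>'
  . '<p>First paragraph.</p><p>Second paragraph.</p>'
  . '</body.content></body></nitf>'
);
$row = nitf_to_row($nitf);
echo $row['title'], ' (', $row['pub_date'], ")\n";
```

Writing the row to MySQL is left out on purpose; the array keys map directly onto column names.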

Once I get that into MySQL reliably, I can report it back out of our news distribution in a clean XML format for a wider audience of webmasters to use on their sites.

Images themselves are an issue: in the sample I am seeing images from 50 KB to 1 MB in size, with the smaller images being of low quality for no apparent reason. I have a script to clean up large images into web-use sizes that maintain the clarity of the image. I will have to work on a server-to-server call for that, though, because downloading and uploading 1 MB images every few minutes doesn't sound like a reliable production method.


Hi there were some requests

Oyun - Sat, 2008-06-28 13:05

Hi

There were some requests for code for displaying categories/tags in their specific groups as opposed to one list. I meant to post it, but currently I am having problems with it, as we have changed the categories and the code was 'hard coded' to the initial set-up.

I will post when it is fixed.

Thanks for the feedback, everyone, and yes to Ray: the only reason I have used the usernode module is to have Drupal create an (unseen) node that the profiles are attached to using nodeprofile/nodefamily.

Oyun