NITF/Atom/NewsML extensions for FeedAPI

Posted by yelvington on March 13, 2008 at 7:13pm

Status: Added to official ideas list http://drupal.org/node/234652

I'm crossposting this to the following groups, all of which have a dog in this hunt: SOC2008, Knight Foundation, Newspapers on Drupal, RSS & Aggregation.

News agencies, "legacy" newsroom management system implementors, publishers and archiving companies all support an XML standard called News Industry Text Format (NITF), developed by the International Press Telecommunications Council.

We need a robust, broadly supported common NITF feed handler that works with the Drupal FeedAPI framework, ultimately enabling loading of NITF data into CCK nodes with configurable entity mapping. This feed handler should expose its own API so that additional handlers can be added to process NewsML (primarily championed by Reuters) and Atom wrappers (used by the Associated Press in AP Exchange).

This is a foundational piece that's needed for Drupal to be a long-term solution to many news-handling requirements of news publishers, who increasingly are flocking to Drupal as a website solution.

I'm proposing this as a Google Summer of Code idea and offering to serve as a mentor to the project.

This project is straightforward in theory but not in practice, as the quality of NITF implementation varies more than a little bit among the various data providers. In addition, both AP and Reuters have come up with significantly differing approaches to extending the capabilities of NITF to encompass multiple media types and sets of information. The AP is moving to a search-based "wire" in which editors will be able to create ad hoc and stored queries, exporting either individual NITF items or Atom feeds containing additional metadata and embedded NITF. Reuters originated and obtained IPTC approval of a "wrapper" called NewsML that predates Atom significantly.

References:
International Press Telecommunications Council http://www.iptc.org/
http://www.nitf.org/
http://www.newsml.org/

IPTC also maintains standards for events, sports data, TV listings, et cetera. Any work on this project should provide a sound foundation for further projects in those areas.

Town News, a US-based Web service provider to a number of smaller newspapers, contributed an NITF PEAR class that may be of use.
http://pear.php.net/reference/PHP_Beautifier-0.1.1/XML/XML_NITF.html

Comments

Sounds like a

Posted by alex_b on March 13, 2008 at 8:44pm

Sounds like a straightforward task and an exciting opportunity for Drupal.

Is this a good example feed for NITF? http://www.nitf.org/IPTC/NITF/3.2/examples/nitf-fishing.xml Are there others?

I found this discussion about NITF here: http://groups.drupal.org/node/2667

Technically, this is about writing a FeedAPI-parser that implements this functionality.

I see the PEAR classes that you posted (PHP4 based, PHP 5 compatible) - is anybody aware of any SimpleXML based NITF parser?

Alex

http://www.twitter.com/lxbarth

Not quite straightforward

Posted by yelvington on March 15, 2008 at 3:36pm

It should be straightforward but there are some twists.

The NITF spec is huge and overly ambitious, while the implementations tend to be invalid in one way or another. Some of the legacy newsroom systems emit real trash. In the AP Exchange and NewsML examples, NITF is entangled with yet another layer of XML.

So you have to accommodate error gracefully and provide an open interface for the Atom and NewsML integrations down the road if anybody is going to be able to use this.
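To make "accommodate error gracefully" concrete, here is a minimal sketch in Python (not the FeedAPI parser itself, which would be PHP). It assumes a tiny made-up NITF sample and hypothetical helper names (text_of, parse_nitf). The idea is only that missing or broken pieces come back as empty values or an error string rather than killing the whole import.

```python
import xml.etree.ElementTree as ET

# Made-up sample; real feeds carry far more markup and far more surprises.
SAMPLE = """<nitf>
  <head><title>Martians land</title></head>
  <body>
    <body.head>
      <hedline><hl1>Martians land at White House</hl1></hedline>
    </body.head>
    <body.content><p>First paragraph.</p><p>Second paragraph.</p></body.content>
  </body>
</nitf>"""


def text_of(root, tag):
    """Return the stripped text of the first <tag>, or None if missing or empty."""
    node = next(root.iter(tag), None)
    if node is None:
        return None
    return "".join(node.itertext()).strip() or None


def parse_nitf(xml_string):
    """Extract a few common fields without raising on bad or partial input."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError as err:
        # Not even well-formed XML: report it and let the caller decide.
        return {"error": str(err)}
    content = next(root.iter("body.content"), None)
    paragraphs = []
    if content is not None:
        paragraphs = ["".join(p.itertext()).strip() for p in content.iter("p")]
    return {
        "error": None,
        "headline": text_of(root, "hl1"),
        "byline": text_of(root, "byline"),  # absent in SAMPLE, so None
        "paragraphs": paragraphs,
    }


item = parse_nitf(SAMPLE)
broken = parse_nitf("<nitf><unclosed>")
```

A real handler would probably log the error and skip the item rather than stop the feed.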

Take the AP for example.

AP produces a general news service that's targeted for immediate online publication, called AP Online. It's minimally marked up, but generally clean. It's not hard to build a FeedAPI loader that works with it.

But a lot of us (particularly newspapers) don't want general newswires; everybody has them and there's no real value to being one of a thousand sites carrying AP Online. What we want is deep information about specific subjects of high interest to our geographic or other niche. Much of that information doesn't go on the AP Online wire and isn't available to competitors such as Google and Yahoo.

AP's proposed solution is AP Exchange, which gives editors global access to a huge AP database. But the output implementation is frustrating.

The NITF spec has a tag structure for compound headlines. AP sticks non-headline information into it.

<hedline>
  <hl1>Martians land at White House, 27th-Ld-Writethru, NA</hl1>
</hedline>

The proper headline in this example is "Martians land at White House" and everything else is metadata that might or might not be there and doesn't belong there. This is basically a 1970s ANPA-1392 slugline stuck into the headline field, probably for the benefit of some luddite newspaper that hasn't discovered the 21st century and is still using an editing system that runs on a PDP-8.
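As a rough illustration of separating the slug from the headline, here is a Python sketch. The pattern is purely a guess based on the one example above (a comma followed by something like "27th-Ld-..."); it is not an AP rule, and split_headline is a hypothetical name. Anything that doesn't match is left alone, since real headlines contain commas too.

```python
import re

# Assumption: slug metadata starts after a comma with an ordinal "-Ld" marker,
# as in "27th-Ld-Writethru". This is inferred from a single example only.
SLUG_RE = re.compile(r"^(?P<headline>.+?),\s*(?P<slug>\d+(?:st|nd|rd|th)-Ld\b.*)$")


def split_headline(hl1_text):
    """Return (headline, slug_metadata); slug_metadata is None if not found."""
    text = " ".join(hl1_text.split())  # collapse stray whitespace and newlines
    match = SLUG_RE.match(text)
    if match is None:
        return text, None
    return match.group("headline"), match.group("slug")


with_slug = split_headline("Martians land at White House, 27th-Ld-Writethru, NA")
without_slug = split_headline("Rain, wind and hail hit coast")
```

Because the metadata "might or might not be there", the fallback of returning the text untouched matters as much as the match.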

I've been bitching about this a great deal, and the AP has agreed to provide a clean headline -- but external to NITF, in the Atom layer. So in that example the process might work like this:

1) Editor defines a persistent search in AP Exchange, creating in effect a custom newswire.
2) The search is exported as an Atom feed. AP has extended the Atom namespace with custom tags and AP can embed the full NITF for a story inside the Atom feed.
3) Drupal process fetches the Atom feed. For each valid item, it gets the headline from the Atom layer, then hands the embedded NITF to the NITF parser, getting back an object. We then can ignore the NITF object's headline and prefer the Atom version, but we use the body and perhaps other components of the NITF object.
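The fetch-and-prefer-Atom step could be sketched like this in Python. Big assumption: the sample feed embeds the NITF directly inside the Atom content element with no namespace. That layout is invented for illustration; AP's actual custom tags and embedding are not shown in this thread. items_from_feed is a hypothetical name.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Invented sample: standard Atom elements only, NITF embedded in <content>.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Martians land at White House</title>
    <content type="application/xml">
      <nitf xmlns="">
        <body>
          <body.head>
            <hedline><hl1>Martians land at White House, 27th-Ld-Writethru, NA</hl1></hedline>
          </body.head>
          <body.content><p>They came in peace.</p></body.content>
        </body>
      </nitf>
    </content>
  </entry>
</feed>"""


def items_from_feed(feed_xml):
    """Take the headline from the Atom layer and the body from the embedded NITF."""
    items = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        nitf = entry.find(ATOM + "content/nitf")
        if nitf is None:
            continue  # no embedded NITF: not a usable item for this sketch
        atom_title = (entry.findtext(ATOM + "title") or "").strip()
        nitf_headline = (nitf.findtext(".//hl1") or "").strip()
        items.append({
            # Prefer the clean Atom headline; fall back to NITF only if absent.
            "title": atom_title or nitf_headline,
            "body": ["".join(p.itertext()).strip() for p in nitf.iter("p")],
        })
    return items


items = items_from_feed(FEED)
```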

Along the way you have to make some decisions about how much information to keep and how much to discard. Presumably this ought to be part of the setup process for each feed, allowing the Drupal admin to map entities to CCK fields.
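One way to picture the per-feed mapping is a small table from NITF paths to field names. The Python sketch below assumes hypothetical field names (field_byline, field_abstract) and a made-up sample; anything not listed in the map is simply discarded, which is the "how much to keep" decision expressed as configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-feed configuration: target field name -> path inside the NITF.
FIELD_MAP = {
    "title": ".//hedline/hl1",
    "field_byline": ".//byline",
    "field_abstract": ".//abstract",
}

SAMPLE = """<nitf><body><body.head>
  <hedline><hl1>Budget passes</hl1></hedline>
  <byline>By A. Reporter</byline>
  <abstract>Council approves budget.</abstract>
  <distributor>Example Wire</distributor>
</body.head></body></nitf>"""


def map_fields(nitf_xml, field_map):
    """Copy only the mapped elements into a flat dict; unmapped data is dropped."""
    root = ET.fromstring(nitf_xml)
    node = {}
    for field_name, path in field_map.items():
        element = root.find(path)
        if element is not None:
            node[field_name] = "".join(element.itertext()).strip()
    return node


node = map_fields(SAMPLE, FIELD_MAP)
```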

Now, it's not necessary for all of this to be accomplished in the SoC project. I'm trying to paint the larger picture because if all we get is a parser that only works on clean, sane, proper NITF and doesn't accommodate the real world, we won't be laying the right foundation.

Example files

Posted by yelvington on March 15, 2008 at 3:40pm

Is this a good example feed for NITF? http://www.nitf.org/IPTC/NITF/3.2/examples/nitf-fishing.xml Are there others?

I'm not crazy about that example. I know what Alan Karben was trying to demonstrate but it would be better if IPTC also had some examples that were structured as common news stories.

Next week I'll see if I can get AP's permission to post some actual example files from its feeds, and also try to dig up some NITF output from a DTI system. If anyone else could provide some example data from other sources (hello, Reuters?) it would be helpful.

I see the PEAR classes that

Posted by johsw@drupal.org on March 13, 2008 at 9:19pm

I see the PEAR classes that you posted (PHP4 based, PHP 5 compatible) - is anybody aware of any SimpleXML based NITF parser?

I have a module that imports NITF-formatted XML files into our CCK content type. But because our editorial system doesn't produce valid NITF, it's kinda hackish. But it uses SimpleXML and it works for our use. (see: http://drupal.org/information.dk)

But I would really like to help out making a more generic solution using FeedAPI and mapping to cck-fields 'n all.

Right

Posted by agentrickard on March 14, 2008 at 12:53am

But because our editorial system doesn't produce valid NITF, it's kinda hackish.

This is the common problem. Legacy systems for newspapers all churn out crap. We need a clean parser that is extensible to account for variations.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3


Rough architecture

Posted by alex_b on March 14, 2008 at 2:08pm

This is the common problem. Legacy systems for newspapers all churn out crap. We need a clean parser that is extensible to account for variations.

First step is to define an internal format of NITF data - a data structure that defines a place for all standard elements and a place for non-standard elements.

e.g., the internal data format for RSS/Atom feeds looks like this ATM:

$item->title;
$item->description;
$item->options->original_url;
$item->options->author->name;
$item->options->author->link;
...

The NITF parser can then handle all standard parts parsing. For handling non-standard elements, there needs to be a hook that passes the full news item (or full feed) down to implementing modules and sticks the result into the place for non standard elements.
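A rough sketch of that split between standard fields and a hook for everything else, written in Python rather than as Drupal hooks. All names here (NewsItem, register_hook, the x-priority element) are made up for illustration and are not part of NITF or FeedAPI.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class NewsItem:
    """Internal format: fixed places for standard parts, one bag for the rest."""
    title: str = ""
    body: str = ""
    extras: dict = field(default_factory=dict)


EXTENSION_HOOKS = []  # callables that receive the full parsed item tree


def register_hook(func):
    EXTENSION_HOOKS.append(func)
    return func


def parse_item(xml_string):
    root = ET.fromstring(xml_string)
    item = NewsItem(
        title=(root.findtext(".//hl1") or "").strip(),
        body="\n".join("".join(p.itertext()).strip() for p in root.iter("p")),
    )
    # Hand the whole item to each implementing module and keep what it returns.
    for hook in EXTENSION_HOOKS:
        item.extras.update(hook(root) or {})
    return item


@register_hook
def vendor_priority(root):
    """Example implementing module for a hypothetical non-standard element."""
    node = root.find(".//x-priority")
    return {"priority": node.text} if node is not None else None


parsed = parse_item(
    "<nitf><body><body.head><hedline><hl1>Storm warning</hl1></hedline>"
    "<x-priority>urgent</x-priority></body.head>"
    "<body.content><p>Take cover.</p></body.content></body></nitf>"
)
```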

I assume that NITF has a concept of items - a feed basically consists of a series of news items - just like RSS (?). How big can NITF news feeds become? Are we talking about very large XML documents here?

Alex

http://www.twitter.com/lxbarth

Document formatting (and some background)

Posted by yelvington on March 15, 2008 at 4:09pm

I think there's an assumption that NITF is a single-document text formatting standard, although there are accommodations for embedding references to other media types and related items.

The support in NITF for packaging multiple related items is weak, and that's one of the reasons Reuters pushed for NewsML, which conceptually is a wrapper that can contain a mix of items.

Some background on NITF:

NITF began its life as SGML markup intended to replace the ANPA-1392 standard for wire service transmissions, which dated back to the era when newspapers set type in molten lead. The NITF standard was unfinished when the Web came along, so many of its tags were adjusted to accommodate HTML conventions. Then it was still unfinished when XML became the hot item, so it was adjusted some more to make it conform with XML conventions. Unfortunately after that it was still unfinished and delivering no value to wire service subscribers.

There is a division in the newspaper business between slow-moving "big iron" print-centric people and "web guys." NITF was in the hands of the former group.

Back around 1997 or so a bunch of us "web guys" got together at an American Press Institute workshop. Frustrated at the lack of progress on NITF, we began working on our own quick-and-dirty XML standard designed to meet Web needs.

Dan Froomkin, who now writes about politics for washingtonpost.com, made the original pitch and it was called "Danco Web-O-Matic," quickly renamed NML as we brought into the conversation a number of serious players from Atex, WSJ.com (Alan Karben, who had implemented a proprietary XML standard for the Journal), some news librarians, and academicians. We were trying to be thorough and introduced geocoding, the Dublin Core and other concepts into the conversation. We discovered that Dave Megginson was working in the same area and swallowed as much of his work as we could.

This alarmed the NITF folks. They asked us to join with them, so we came together in a daylong summit in Dallas. We did a quick gap analysis, agreed to add some NML items to the NITF mix (notably the "abstract" tag, which is the equivalent of Drupal's "teaser" in a display context), and shoved it out the door. About a month later NITF 1.0 was published and the way was cleared for wire service adoption.

I pretty much walked away from the conversation at that point, having accomplished my immediate goal. Now, a decade later, I'm looking at what some of the implementors have done (or not done) with the standard and frankly I'm appalled. But that's the world we live in and that's the world any Drupal project has to accommodate.

Example of malformed nitf

Posted by johsw@drupal.org on March 17, 2008 at 9:46am

I've pasted an elaborate example from our system here:

http://drupalbin.com/1210

This example shows two of the problems with our exporter.

1) The subheader is nested in the hl1 tags, so I have to split on the line break.
2) The same goes for the byline field, where a "trumpet" is inserted before the actual byline.

This is our reality :-)
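The two workarounds might look roughly like this in Python. The pasted example is not reproduced here, so the exact layout is an assumption: the headline is taken to be the first line of hl1 with the subheader after it, and the trumpet is taken to be whatever precedes the last line of the byline. Function names are hypothetical.

```python
def non_empty_lines(text):
    """Split on line breaks and drop blank or whitespace-only lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def clean_headline(hl1_text):
    """Return (headline, subheader); subheader is None when there is only one line."""
    lines = non_empty_lines(hl1_text)
    if not lines:
        return "", None
    return lines[0], " ".join(lines[1:]) or None


def clean_byline(byline_text):
    """Return (trumpet, byline), assuming the real byline is the last line."""
    lines = non_empty_lines(byline_text)
    if not lines:
        return None, None
    return " ".join(lines[:-1]) or None, lines[-1]


headline = clean_headline("Big story\n  Smaller line under it\n")
byline = clean_byline("Exclusive report\nBy Jane Doe")
plain_byline = clean_byline("By Jane Doe")
```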

Best,
/Johs.

I'm interested in this

Posted by our1944 on March 25, 2008 at 1:59pm

I agree with Alex's idea.
@yelvington
May I ask a question? Is the problem now only misplaced data? I think one more example would make it clearer to me. Thanks.

No conflict with Aggregator for D7 proposal

Posted by alex_b on March 25, 2008 at 8:30pm

If the core aggregator proposal for D7 goes through (http://groups.drupal.org/node/9857), and the project is successful, FeedAPI will be discontinued with D7.

However, FeedAPI functionality will be ported or portable to the D7 core aggregator.

Therefore, working on an NITF parser as add on for FeedAPI is not lost investment. In any case I recommend developing on D6 and keeping the actual FeedAPI integration cleanly separated from the parser logic. Thus, an upgrade to D7 further down the road will be easy.

Alex

http://www.twitter.com/lxbarth

Good idea

Posted by our1944 on March 27, 2008 at 5:42am

@Alex
I'll look into FeedAPI module to see how to make it part of FeedAPI.

@Yelvington
I'd like to listen to your suggestion about more specific features that this module will implement. Could you be the mentor of this project?

Thanks,

Yang Tao

Mentoring

Posted by yelvington on April 16, 2008 at 8:06pm

Sorry, I have been traveling (Macau, Bangkok, Phuket, Kuala Lumpur) for a couple of weeks and almost completely offline.

Yes, I can mentor. I have a fairly good grasp of NITF and the (awful, in my opinion) implementations. The problem is not accommodating the standard, but rather dealing with real-world data.

thoughts

Posted by our1944 on April 20, 2008 at 3:04am

Should it look like this?
About the parser: a standard NITF parser, but one that can use a list describing a specific parser for each specific tag, so that the feed is parsed correctly.

In that list the user can define different types of feeds. Feeds of the same type have the same specific tags to be handled. The next time the user creates a node, he can choose the type from the list if it has already been defined.
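If I read the idea right, it could be sketched like this in Python: a default handler for every tag, plus a per-feed-type list of overrides. The feed type names and the first_line_only handler are invented for the example.

```python
import xml.etree.ElementTree as ET


def default_text(element):
    """Standard handling: all text inside the element, stripped."""
    return "".join(element.itertext()).strip()


def first_line_only(element):
    """Special handling for exports that pack extra lines into one tag."""
    text = default_text(element)
    return text.splitlines()[0].strip() if text else ""


# Hypothetical feed types; each lists only the tags that need special parsing.
FEED_TYPES = {
    "standard": {},
    "legacy_export": {"hl1": first_line_only},
}

SAMPLE = """<nitf><body><body.head>
  <hedline><hl1>Main headline
Subheader text</hl1></hedline>
  <byline>By Jane Doe</byline>
</body.head></body></nitf>"""


def parse_with_type(xml_string, feed_type, tags=("hl1", "byline")):
    """Parse the listed tags, using the feed type's override where one exists."""
    overrides = FEED_TYPES[feed_type]
    root = ET.fromstring(xml_string)
    result = {}
    for tag in tags:
        element = next(root.iter(tag), None)
        if element is not None:
            result[tag] = overrides.get(tag, default_text)(element)
    return result


legacy = parse_with_type(SAMPLE, "legacy_export")
standard = parse_with_type(SAMPLE, "standard")
```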

But if the type is defined during the node creation process, is it necessary to modify the FeedAPI module? How could it be done differently, without harming the FeedAPI module? Maybe write a processor and invoke feedapi_node?

Yang Tao

PRISM

Posted by chu on April 4, 2008 at 10:48am

You might want to take a look at the PRISM standard as it encapsulates NITF and NewsML - http://www.prismstandard.org/

Sending you an actual sample

Posted by Prnewsnow on May 13, 2008 at 11:49pm

I am sending a sample of the raw XML from AP to Alex. I just got full access to 100% of the AP content to create real-time distribution, but their system is "labor intensive": downloading files to a production machine, then reloading them back to the server. The AP is completely behind the times on what we can do with information.

Their combined xml format has media files for stories that have nothing to do with the images or captions.

The first step would be to get simple information from the file into a single line of database information: title, byline, image info, body text, pub date.

Once I get that into MySQL reliably, I can then report it back out of our news distribution in a clean XML format for a wider audience of webmasters to use on their sites.
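For what that first step might look like, here is a Python sketch that flattens one NITF item into a single row. It uses the standard library's sqlite3 as a stand-in for MySQL, and the sample document, table name and column names are all made up; the real AP file is not shown in this thread.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up sample using common NITF elements.
SAMPLE = """<nitf>
  <head><pubdata date.publication="20080513T234900Z"/></head>
  <body>
    <body.head>
      <hedline><hl1>Storm closes harbor</hl1></hedline>
      <byline>By Jane Doe</byline>
    </body.head>
    <body.content>
      <media media-type="image">
        <media-reference source="harbor.jpg"/>
        <media-caption>Waves at the pier.</media-caption>
      </media>
      <p>The harbor closed Tuesday.</p>
    </body.content>
  </body>
</nitf>"""


def to_row(nitf_xml):
    """Flatten one item to (title, byline, image, body, pub_date)."""
    root = ET.fromstring(nitf_xml)

    def text(tag):
        element = next(root.iter(tag), None)
        return "".join(element.itertext()).strip() if element is not None else None

    pubdata = next(root.iter("pubdata"), None)
    reference = next(root.iter("media-reference"), None)
    return (
        text("hl1"),
        text("byline"),
        reference.get("source") if reference is not None else None,
        "\n".join("".join(p.itertext()).strip() for p in root.iter("p")),
        pubdata.get("date.publication") if pubdata is not None else None,
    )


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.execute(
    "CREATE TABLE story (title TEXT, byline TEXT, image TEXT, body TEXT, pub_date TEXT)"
)
conn.execute("INSERT INTO story VALUES (?, ?, ?, ?, ?)", to_row(SAMPLE))
stored = conn.execute("SELECT title, image, pub_date FROM story").fetchone()
```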

Images themselves are an issue: in the sample I am seeing images from 50 KB to 1 MB in size, with the smaller images of low quality for no reason. I have a script to clean up large images into web-use sizes that maintain the clarity of the image. I will have to work on a server-to-server call for that, though, because downloading and uploading 1 MB images every few minutes doesn't sound like a reliable production method.

The comment in Russian is spam.

Posted by endrus on May 25, 2008 at 2:59pm

Please, delete.

Hi there were some requests

Posted by Oyun on June 28, 2008 at 1:05pm

There were some requests for code for displaying categories/tags in their specific groups as opposed to one list. I meant to post it but currently I am having problems with it, as we have changed the categories and the code was 'hard coded' to the initial set-up.

I will post when it is fixed.

thanks for feedback everyone, and yes to Ray, the only reason I have used usernode module is to have drupal create an (unseen) node that the profiles are attached to using nodeprofile/nodefamily.

Oyun

Missing Data

Posted by Go-Gulf.com on August 22, 2008 at 4:00am

How can I locate missing data? Is there any way to recover old data? I have lost some key data and tried to locate it, but was not able to find it.

NITF

Posted by greg.harvey on May 15, 2009 at 4:18pm

I'm in the process of writing a Views 2 style for NITF output of nodes, if anyone's interested. =)

It will be like the RSS Feed style in essence.

Greg, Would it be possible

Posted by stewsnooze on May 15, 2009 at 6:29pm

Greg,
Would it be possible to include mapping cck fields onto particular nitf xpaths?

Also, what version were you looking at?

Stew

Full Fat Things ( http://fullfatthings.com ), my Drupal consultancy that makes sites fast.

CCK

Posted by greg.harvey on May 15, 2009 at 7:18pm

Will need to look into it. Probably not. Client hasn't specified a version, so I guess I'll use latest (3.4).

Man at work

Posted by greg.harvey on May 18, 2009 at 5:01pm

Made good progress on this today. Moved the code I started to a module called nitf_views, heavily based on eaton's work with atom_views, and started working. So far my row plugin builds a dynamic header for the NITF document, including revision management. Doc ID I'm using is a text field in the interface where you can enter a unique publisher name as a string, then nid.vid, e.g.

<doc-id id-string="reuters.4.6"/>

Where "reuters" was entered in the settings for the row plugin in the Views 2 UI, the nid is 4 and the current revision (vid) is 6.

It also generates this if it is replacing a previous revision:

 <docdata management-status="usable" management-doc-idref="reuters.4.5" management-idref-status="canceled">

Note the vid is now 5, not 6. This is the document cancellation notice for the previous vid. =)
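The ID scheme is simple enough to sketch. This Python version only mirrors the two snippets above (publisher.nid.vid, plus the cancellation attributes when a previous revision exists); it is not the nitf_views code, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def doc_id_string(publisher, nid, vid):
    """Build the 'publisher.nid.vid' identifier described above."""
    return f"{publisher}.{nid}.{vid}"


def build_docdata(publisher, nid, vid, previous_vid=None):
    """Return a <docdata> element; add a cancellation notice if replacing a revision."""
    docdata = ET.Element("docdata", {"management-status": "usable"})
    if previous_vid is not None:
        # Point back at the revision being replaced and mark it canceled.
        docdata.set("management-doc-idref", doc_id_string(publisher, nid, previous_vid))
        docdata.set("management-idref-status", "canceled")
    ET.SubElement(docdata, "doc-id", {"id-string": doc_id_string(publisher, nid, vid)})
    return docdata


revised = build_docdata("reuters", 4, 6, previous_vid=5)
first = build_docdata("reuters", 4, 1)
```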

eaton's Atom style for Views 2 (available in CVS and working) can be used to wrap the various documents together using Views. I intend to expose the resulting view (probably Atom, maybe RSS) using the excellent Services module. Did a similar thing for Defaqto last year on www.defaqto.com, using Views and Services to distribute content, and it was very effective.

The Services module can be used to authenticate a user and deliver the contents of a view over a range of available servers, for those not familiar. So the strength is I can easily expose this, securely, to other newspapers with something like XML-RPC. I'll use that because it's native, but I used the nuSOAP libraries and the Services SOAP server effectively last year, so I can confirm that works too, for the MS guys.

Neat stuff. I'll blog it when I'm done and I'll release the NITF views plugin. Ping me for an early example.

Also may end up writing a web service for the Services module that can send all files when given a valid NITF document ID, since I suspect they'll want that. But the Services API is so cool, that'll probably take about 10 minutes!! =)

Project page created

Posted by greg.harvey on May 19, 2009 at 5:16pm

No code today - probably tomorrow: http://drupal.org/project/nitf_views

Commit

Posted by greg.harvey on May 20, 2009 at 6:50pm

Very close to a commit on this in its initial form. Only issue for me is base64_encode() seems to baulk when I pass it large video files. Probably a setting. Images are working fine (base64 encoded and in a <media-object> element as a child of <media>, one for each CCK image field).

I'll sort this tomorrow AM and commit some code. =)
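Whether the problem above is a memory setting is not established here. In case it is, one general way around it is to encode the file in pieces instead of in one call. A Python sketch of the idea follows; base64_stream is a hypothetical helper. The only trick is that each piece must be a multiple of 3 bytes so that padding can only appear at the very end.

```python
import base64
import io

CHUNK_BYTES = 3 * 1024  # multiple of 3: no "=" padding except after the last chunk


def base64_stream(source, sink, chunk_bytes=CHUNK_BYTES):
    """Encode source to sink piece by piece, never holding the whole file in memory.

    Assumes source.read(n) returns exactly n bytes until the end of the data,
    which holds for regular files and BytesIO but not necessarily for sockets.
    """
    if chunk_bytes % 3:
        raise ValueError("chunk_bytes must be a multiple of 3")
    while True:
        chunk = source.read(chunk_bytes)
        if not chunk:
            break
        sink.write(base64.b64encode(chunk))


# Stand-in for a large video file.
payload = bytes(range(256)) * 50
encoded = io.BytesIO()
base64_stream(io.BytesIO(payload), encoded)
```

The concatenated output is the same as encoding the whole payload at once, so the receiving end needs no changes.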

Need help

Posted by greg.harvey on January 14, 2010 at 4:55pm

We've taken this as far as we can. Happy to share latest work with interested maintainers. See:
http://groups.drupal.org/node/44562

Re: CCK again

Posted by greg.harvey on May 19, 2009 at 12:12pm

Ps - when I say "probably not", what I mean is I won't have time to do this this time around, and it's not a requirement for me, but I'll publish the module so feel free to add such functionality! =)

Issue with DTD

Posted by greg.harvey on May 21, 2009 at 10:32am

I don't know if anyone is tracking this... perhaps I should start a new post rather than extending this one, however this is a question for NITF folk out there.

Do any of you host your own slightly modified DTD? Do you think it's valid to do so?

I'll explain why I ask... the current DTD for NITF 3.4 supports embedded media, with the <media> element. It also supports delivery of the binary data in a <media-object> element as a child of that. What it does not consider is the possible need to split the media object into many chunks. This is surprising, given that one of the valid types is "video", which always had the potential to be a very large file that you will not (or perhaps sometimes cannot) push out all in one stream.

Problem is, while the current DTD allows several media files per media item (represented by a pair of <media-object> and <media-reference> elements) it doesn't really allow multiple chunks per object. I mean, I could fiddle it following the documentation for the current DTD. It reads:

media-reference

Reference to an external media object, OR to its following media-object. Points to any object, such as photo, audio, video and text; or to executable files.

Ok, so I might be able to do something like this and let the recipient sort out the mess:

<media media-type="video">
  <media-reference mime-type="video/mpeg" source="my-video.mpg"></media-reference>
  <media-object encoding="base64" id="my-video-chunk-1"> BASE 64 ENCODED BINARY HERE </media-object>

  <media-reference mime-type="video/mpeg" source="my-video.mpg"></media-reference>
  <media-object encoding="base64" id="my-video-chunk-2"> BASE 64 ENCODED BINARY HERE </media-object>

  <media-reference mime-type="video/mpeg" source="my-video.mpg"></media-reference>
  <media-object encoding="base64" id="my-video-chunk-3"> BASE 64 ENCODED BINARY HERE </media-object>
   
  <media-caption>The tides, captured on film late yesterday.</media-caption>

  <media-producer>greg.harvey</media-producer>
</media>

But that feels wrong! Surely it should be something like this:

<media media-type="video">
  <media-reference mime-type="video/mpeg" source="my-video.mpg">
    <media-object encoding="base64" chunk-id="1"> BASE 64 ENCODED BINARY HERE </media-object>
    <media-object encoding="base64" chunk-id="2"> BASE 64 ENCODED BINARY HERE </media-object>
    <media-object encoding="base64" chunk-id="3"> BASE 64 ENCODED BINARY HERE </media-object>
  </media-reference>
   
  <media-caption>The tides, captured on film late yesterday.</media-caption>

  <media-producer>greg.harvey</media-producer>
</media>

Or perhaps even:

<media media-type="video">
  <media-reference mime-type="video/mpeg" source="my-video.mpg">
    <media-object encoding="base64">
      <media-chunk id="1"> BASE 64 ENCODED BINARY HERE </media-chunk>
      <media-chunk id="2"> BASE 64 ENCODED BINARY HERE </media-chunk>
      <media-chunk id="3"> BASE 64 ENCODED BINARY HERE </media-chunk>
    </media-object>
  </media-reference>
   
  <media-caption>The tides, captured on film late yesterday.</media-caption>

  <media-producer>greg.harvey</media-producer>
</media>

Hence me wondering about slightly altering the DTD for the <media> element and hosting an altered version. Is this a terrible idea? What would you do? Fudge it with the current DTD or make your own?

Or am I missing the point and NITF already supports file chunking and I'm doing it wrong?

Advice welcome! =)

Leaving it alone

Posted by greg.harvey on May 21, 2009 at 11:11am

To quote a friend and former colleague at Dow Jones:

Leave it as is, and propagate your bolshy talk with the standards owner.

So fudge it is. But emailed the gatekeeper of the standard too. =)
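With the fudge (one media-object per chunk, repeated inside a single media element), the recipient has to stitch the pieces back together. A Python sketch of that side, assuming the chunks arrive in document order and that each one was base64-encoded on its own; the sample data and the reassemble name are made up.

```python
import base64
import xml.etree.ElementTree as ET

# Build a small sample in the shape of the first variant above.
PIECES = [b"tides-", b"on-", b"film"]
sample_media = ET.Element("media", {"media-type": "video"})
for number, piece in enumerate(PIECES, start=1):
    ET.SubElement(
        sample_media,
        "media-reference",
        {"mime-type": "video/mpeg", "source": "my-video.mpg"},
    )
    chunk = ET.SubElement(
        sample_media,
        "media-object",
        {"encoding": "base64", "id": f"my-video-chunk-{number}"},
    )
    chunk.text = base64.b64encode(piece).decode("ascii")
SAMPLE = ET.tostring(sample_media, encoding="unicode")


def reassemble(media_xml):
    """Decode every base64 media-object in document order and join the bytes."""
    media = ET.fromstring(media_xml)
    parts = []
    for obj in media.iter("media-object"):
        if obj.get("encoding") != "base64":
            continue
        # Drop whitespace the sender may have added around or inside the data.
        parts.append(base64.b64decode("".join((obj.text or "").split())))
    return b"".join(parts)


video_bytes = reassemble(SAMPLE)
```

If senders cannot guarantee order, the recipient would have to sort on the number in the id attribute instead, which is exactly the kind of private convention the DTD question is about.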

Prototype Nitf Parser for Feeds

Posted by SqyD on June 27, 2010 at 7:40pm

For anyone still interested in this: I've slapped together a very basic NITF parser. I've used the XML_NITF PEAR package mentioned before and have implemented just a subset of that subset. It's nowhere near a complete implementation, but with some fiddling around someone might be able to put it to some use.

I welcome comments/patches/cheers/boos but please be gentle since this is my first ever attempt at a module. ;-)

http://sqyd.net/feeds_nitfparser-6.x-0.1-dev.tar.gz

Paul Krischer
"SqyD"
