IPTC News Architecture (NAR)

Posted by Max Bell on September 27, 2006 at 11:15am

http://www.iptc.org/NAR/

Greetings, fellow standards junkies:

I stumbled across the site above while looking to see what was being done with aggregation these days. For anyone who's ever thought that RSS 2.0 is to aggregation what IE6 is to CSS, it might make sense when I admit that I originally bookmarked the site for future review and then subconsciously repressed the memory as a hoax intended to torment me.

If it was some big deal, why wasn't anybody talking about it? Little did I know...

To be completely serious, though, I've been thinking a lot these past couple of months about what could be done to make Drupal more attractive to publishers looking to convert their operations to the web or folks looking to create web-based publications, and there's been growing interest in the subject. Given the spike in growth seen in the community following the DeanSpace/CivicSpace project and how popular Drupal has become with non-profits and the like, I've been thinking a complementary project to the ecommerce suite aimed at publishers might both increase Drupal's popularity as a commercial platform and drive new business to contractors.

Initially, I'd been playing with aggregator2 and putting together various prototype installations, but then discovered that it had been rewritten and then seemingly abandoned by its author; this led me to begin researching the possibility of writing my own aggregator and thinking about how to combine support for RSS/RDF/ATOM with Views and CCK. I think such a project might have potential, and help to make Drupal more attractive for the purposes I've outlined.

Having gotten caught up on my sleep and revisited the IPTC's site, I realized that they have developed a fascinating and very rich collection of formats that have strong international support from various news wires and financial publications. The formats, which might seem familiar to anyone who's spent time examining the specs for RSS etc., are geared towards the semantic web and provide methods for handling payloads in a variety of media types, a standard for metadata relating to everything else, and best of all, A STANDARDIZED TAXONOMY.

Anyone who took a moment to examine the work already done on the first version of these formats will see right away where what is being done lends itself to the structure of a CMS without thinking too much about it, and one has apparently been written. No doubt you also know that such software is expensive, proprietary, and tends to be about the same quality as PHPNuke.

Until I'd stumbled on this, my intent was simply to work at putting together an aggregator for Drupal 5 and then demo it, so that you folks could explain all my stupid mistakes and I could begin assembling a prototype for my installation profile. This is a bit more work than I had anticipated, however, and doubtless fits the "join forces" criterion mentioned in the handbook, so I thought I should ask and see if anyone would be interested in pursuing this as a group project.

More importantly, though, I wanted to put this before the community at large, because regardless of the merit of my own project, I think it portends much larger ramifications for Drupal as a platform. To say that Drupal is a platform for publishing content might be the grossest overstatement of the obvious that can be made about it, but here is a collection of standards intended to structure the publication of content and I'd never heard anything said about it before this morning.

I invite comment, and hope there is some interest in discussing this topic. Not only might this assist in Drupal's march towards Global Domination, I see the potential to bring more users into the community and work for contractors, a way to level the playing field for small publishers who might find a cost barrier to market entry otherwise, and a means of opening up the field of commercial publication by providing a low cost, standards-compliant platform, with Drupal and everything that means as its core.

Let's see Joomla pull THAT one off. ;)

Comments

Had a look

Posted by boris mann on September 27, 2006 at 4:39pm

It would be good to support these. We need an extensible parsing framework, as well as a series of callbacks that can handle arbitrary feed types. I just posted about an overview I did on our wiki.
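To make the "callbacks that can handle arbitrary feed types" idea a bit more concrete, here is a minimal sketch. It is written in Python rather than Drupal's PHP, purely as an illustration, and every name in it (`PARSERS`, `register`, `parse_feed`, and the two parser functions) is hypothetical and not part of any existing module. It assumes a feed type can be recognized from its root XML element.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry mapping a root-element tag to a parser callback.
PARSERS = {}

# Namespace used by Atom 1.0 documents.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def register(root_tag):
    """Decorator that registers a callback for feeds with this root tag."""
    def decorator(func):
        PARSERS[root_tag] = func
        return func
    return decorator


@register("rss")
def parse_rss2(root):
    """Extract title/link pairs from an RSS 2.0 document."""
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


@register(ATOM_NS + "feed")
def parse_atom(root):
    """Extract title/link pairs from an Atom document."""
    items = []
    for entry in root.iter(ATOM_NS + "entry"):
        link = entry.find(ATOM_NS + "link")
        items.append({
            "title": entry.findtext(ATOM_NS + "title", default=""),
            "link": link.get("href", "") if link is not None else "",
        })
    return items


def parse_feed(xml_text):
    """Dispatch to whichever callback is registered for the root element."""
    root = ET.fromstring(xml_text)
    try:
        parser = PARSERS[root.tag]
    except KeyError:
        raise ValueError("no parser registered for root element %r" % root.tag)
    return parser(root)
```

Under this scheme, supporting another format such as NewsML would mean registering one more callback, without touching `parse_feed` itself.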

We have a wiki?

Posted by Max Bell on September 27, 2006 at 10:01pm

It bothers me a bit that NewsML and the other MLs, NITF etc are all somewhat monolithic, taken together, especially when the one-sentence summary is "News feeds with embedded Photoshop documents". But this standard is moving into its second iteration and when you really look at it, you realize "Hey, this is XML, but it isn't being written for the web". And then you start noticing that NITF is intended to be used as a straight text protocol and they're talking about applications for it that are about getting text out of devices using an RS232 port.

When I was looking at this strictly in terms of aggregation, the other thing I was thinking about was how to incorporate email with feeds, simply because I wanted to combine the mailing lists with the feeds from groups and drupal.org -- the amount of information about Drupal is pretty monolithic, too, and it's very easy to spend a lot of time just keeping up. I suspect there are a few other people besides me who have made a small hobby out of trying to second-guess what the inner circle is doing next.

Speaking of which, where is the wiki, and when did we get one of those?

Sorry...

Posted by boris mann on September 28, 2006 at 3:09am

"Our" wiki being Bryght's public dev wiki, that can be used for any Drupal purposes: https://svn.bryght.com/dev/

And I've written up several "recipes" for wikis in Drupal in the past: http://wiki.bryght.com/wiki/drupalwikirecipe

Crazy insane!

Posted by Max Bell on September 28, 2006 at 5:26am

Dude! The link you posted to the public dev wiki opened to a page that displayed a collection of links written in Chinese. I created an account (which made the text display properly) and, while looking for an indication of what might have caused the original problem (since this made me suspect that anyone with a cookie set could see the page correctly), I found this:

This Site Hacked By Adak Digital Security Team

Hacked By tin la day aon doi la be kho

Dont Worry Admin No Files Was Deleted

Adak Is The Best Hacking Team

Dont Forget S.W.A.T. Was Here

Adak Digital Security Team

[wWw.AdaK-HackinG.CoM]

https://svn.bryght.com/dev/wiki/TracGuide

Might wanna get somebody to take a look at this. A bad setting I'd just change if something was wrong, but this is probably not just a file that's been injected but a specific vulnerability (whee!).

Not a hack

Posted by boris mann on September 28, 2006 at 6:04am

It's wiki spam :( I cleaned up all the wiki spam I could see, and the Drupal Feedparser page should once again be visible: https://svn.bryght.com/dev/wiki/DrupalFeedParsing

Gotcha.

Posted by Max Bell on September 28, 2006 at 6:21am

I need to talk to budda and look at feedparser, then.

Knew he was working on the project via this group, but did not know it had made it as far as CVS.

Standardised taxonomy hmmm

Posted by Anonymous on September 27, 2006 at 9:32pm

I love taxonomy - see my one man project in sig, (thanks to DrupalCon people who helped me crystallise certain aspects).

I'd be open to discussion about whether or not there are synergies.

From a technical POV I'm trying to keep the structure of taxonomy simple (linked nodes, multiple parents/children) and embed the semantics into the structure - breadcrumb defines meaning/context.
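A rough way to picture that structure, as a sketch only: the class and method names below (`Taxonomy`, `link`, `breadcrumbs`) and the example terms are invented for illustration, and the code assumes the links never form a cycle. Each term can have several parents, and the breadcrumb (the path from a root down to the term) is what supplies the context.

```python
from collections import defaultdict


class Taxonomy:
    """Terms linked to any number of parents; hypothetical illustration."""

    def __init__(self):
        # Maps each term to the set of its parent terms.
        self.parents = defaultdict(set)

    def link(self, child, parent):
        """Record that `child` sits under `parent`."""
        self.parents[child].add(parent)

    def breadcrumbs(self, term):
        """Return every root-to-term path; assumes the graph has no cycles."""
        parents = self.parents.get(term)
        if not parents:
            # A term with no parents is a root, so its path is just itself.
            return [[term]]
        paths = []
        for parent in sorted(parents):
            for path in self.breadcrumbs(parent):
                paths.append(path + [term])
        return paths


taxonomy = Taxonomy()
taxonomy.link("Jaguar", "Cars")
taxonomy.link("Jaguar", "Big cats")
taxonomy.link("Big cats", "Animals")

for crumb in taxonomy.breadcrumbs("Jaguar"):
    print(" > ".join(crumb))
```

The same term ends up with two breadcrumbs, one under "Animals > Big cats" and one under "Cars", which is the sense in which the breadcrumb, not the term alone, carries the meaning.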

I am NOT a programmer, but hope to get it to the stage where the programmers can see what I'm aiming at and hopefully decide it's doable to do the cunning stuff (directed graph, editable directed graphs and export in standards compliant formats).

Maybe even start importing STANDARDISED TAXONOMIES :-)

oops

Posted by Anonymous on September 27, 2006 at 9:34pm

another registration area...

taxonomy project at www.iandickson.com/taxonomy/drupal/

Ah, slightly different...

Posted by Max Bell on September 28, 2006 at 6:15am

By "standardized taxonomy", I mean a taxonomy that would function both as a way of categorizing a news feed and the story type, i.e. "national" and "international" news. If you think of it strictly in terms of news feeds, however, I suspect that there is no written standard for a taxonomy of any kind outside of this -- and this is only an obvious example. A less obvious one is that metadata is available for, say, MPEG video and JPEG graphics files that, passed to gdlib as parameters, would return an image that could be taken from a frame of the MPEG file, or, via the same XML metadata tags, take a thumbnail from that MPEG frame or the JPEG file and be rendered via gdlib as a thumbnail.

Otherwise, outside of the binary data in digital audio/video files, you're really talking about an XML wrapper (NewsML) that contains the information stored as a node and including a whole lot more, like the taxonomy information, metadata relating to embedded graphics, video and audio, etc. -- all of which is also a very sophisticated syndication/aggregation format, like RSS 1.1 or ATOM (but is actually completely separate) -- point being that the XML data is passed from syndicating source to aggregating destination the exact same way RSS and ATOM are.

If all this seems confusing, I'm probably not explaining it as well as I should, and I may have to write up in much greater detail to explain.

But it's fair to say that as a standard for syndicating digital information and publishing it to a web site, it is complete, and is currently used by multinational media companies for the purpose of exchanging information between one another. When CNN International buys newswire feeds from, say, Das Bild in Germany, for example, the content is delivered using NewsML with the various other standards (SportsML, for example) embedded in it. Failing the emergence of a competing standard (and I do not see ANY so far, although I will need to check), this is what will eventually allow CNN to connect video from a reporter's cell phone to the actual broadcast without having to stop and reroute the audio signal because it's not being sent.

It's definitely the cure to RSS 1.0/2.0 newsfeeds being incompatible (try to import an RSS 2.0 feed into Drupal using the native aggregator; it doesn't work), but more, a NewsML-enabled cell phone would allow you to post stories with audio, video and text without ever being connected to the site via the page, using syndication.

standardized aggregation as nodes and auto-taxonomy essential

Posted by jt6919 on September 29, 2006 at 3:37pm

I don't know what this "Drupal Groups" thing is or why my normal drupal.org login won't work here - but I created a new account just so I could comment on this post. There are two areas I feel Drupal is significantly lacking. One is management of content, and the other is content aggregation.

The main problem with the default aggregator in 4.7 is that aggregated content isn't stored as nodes. This means you can't do most things with it - including search. The aggregator2 module was an attempt to fix this - and make all aggregated content into nodes, but there were all kinds of formatting issues and bugs. There are a bunch of patches out there, but for the most part aggregator2 is a total pain to make work.

The last maintainer of aggregator2 created a new version called "leech_news", and posted the code to one of the issues - but it never made it into CVS. He had promised a revision of leech_news to possibly replace aggregator2, but hasn't posted a single time to the Drupal forums since. He told me in email he was coming back Sept 18 to work on it again - but hasn't.

Even though leech_news is a separate module, it needs aggregator2 installed and enabled to work (on the server, it lives in a subdirectory of /modules/aggregator2). I use leech_news in a live production site today - and have aggregated over 12,000 nodes using it so far. leech_news is totally different because you create a 'node_template' that you want to use when aggregating and then associate that with a content-type. Then when you create a new node (that has been associated), you just put in the feed URL, and an ajax form brings up the corresponding fields on how you want the feed parsed, updated, etc. I use pathauto to auto-generate path names from the title.

Even though I'm aggregating upwards of 100 feed posts per day, my main complaints currently are:
- there is no way to categorize or tag any of the nodes when aggregated (at all), other than manually editing each node - so they aren't really a part of the taxonomy at all, other than a 'loose node'
- some random feeds don't parse right - they chop the page up into an unending indented list
- feed URLs longer than 250 characters can't be added (a HUGE problem)
- if you tell leech_news to 'promote 3 posts' - it will only promote 3 posts for 'all-time' not 'at a time'
- you can't tell leech to 'promote x, but then demote x'
- each feed item's main page doesn't even list the aggregated items - you have to click 'view items' to see the posts

I have been VERY interested in some kind of "auto-taxonomy" that (like pathauto) would auto-assign aggregated nodes to taxonomy terms and vocab. If this is what your 'standardized taxonomy' can do - it would be a watershed moment for Drupal. Especially if it can also be used to relate nodes to one another better - different media types included.

I have been building web sites for 10 years professionally and have used various CMS systems (OSS and non). The challenge at first was trying to set up a static, hierarchical structure in a flat formatted CMS. I think so many flocked to Drupal (like myself) because of the taxonomy idea. But for most - the taxonomy becomes a burden when you have to manually maintain it, and it becomes absolutely impossible if you are aggregating lots of content. This is why social software techniques like tagging are so popular.

Please describe how what you posted about (standardized taxonomy) and what I (and many others) need (auto-categorization of aggregated content) can possibly occur in Drupal 5.0?

I use Drupal everyday in about a dozen of my web sites - check out http://www.smorgasbord.net and http://celebritynewslive.com.

IPTC Newscodes

Posted by Max Bell on September 29, 2006 at 10:11pm

The real problem with taxonomy is that there is no standard for using it. Free tagging is some help, but also makes it possible to confuse apples and oranges. In the IPTC's instance, their solution is a taxonomy/metadata standard called NewsCodes, which involves a series of broad categories of metadata relating to aggregated content.

I'm sorry I don't have more time to address your post before I leave, but I want to be sure to leave this topic open for a bit and allow people a chance to respond.

But to briefly address your questions about taxonomy: besides the fact that NewsCodes address a finite and very specific set of metadata, I think the solution might involve the same mechanism used to search nodes and index them; however crude, taxonomy might at least be established for nodes according to whether or not the term existed within the node itself. More than that, I think there is need for a tool that would provide for bulk operations to be performed on nodes, and I am interested in exploring the suitability of NewsCodes as a possible structure and methodology for accomplishing these kinds of tasks.
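The crude version of that idea, assigning a term whenever it appears in the node's text, might look something like the sketch below. It is in Python, not Drupal code; the function name `auto_tag` and the sample vocabulary are made up for illustration and are not real NewsCodes, and matching is a plain case-insensitive whole-word search rather than anything tied to Drupal's search index.

```python
import re


def auto_tag(text, vocabulary):
    """Return the vocabulary terms that occur in `text` as whole words."""
    lowered = text.lower()
    matches = []
    for term in vocabulary:
        # \b keeps "national" from matching inside "international".
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            matches.append(term)
    return matches


# Hypothetical vocabulary and node body, for illustration only.
vocabulary = ["national", "international", "sports"]
body = "International news roundup, plus the weekend sports results."
print(auto_tag(body, vocabulary))
```

Running the same function over a batch of aggregated nodes would be one simple form of the bulk operation mentioned above, though anything beyond exact term matching (synonyms, stemming, relevance) is outside what this sketch attempts.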

Again, though, at a minimum, I think I'm going to have to provide a better summary of the formats before I can really explain them; if you have a chance to examine the link above, you'll note that NewsCodes by itself covers a very broad subject and it is only one part of a much broader set of standards.