Thomson Reuters and Phase2 have released the OpenCalais Drupal module for both D5 and D6. Note that this is not the same as the previously announced but unreleased Calais module.
It submits content to OpenCalais and gets back a set of metadata that it maps to Drupal vocabularies and terms. The results can be automatically inserted into the database or merely suggested to node authors.
I've been trying to test it on some full-text RSS feeds, but I'm having some issues with FeedAPI, so I'm doing a lot of cutting and pasting of newspaper content for my test. So far, I can see some opportunities for generating good related-item linksets, but there's a lot of work to do.
In the Drupal instance at least, there's no disambiguation among the many possible John Smiths, so proper name detection may not be as useful as we'd like. But other vocabularies may be more useful. For instance, in one story it detected "Fleming Island High school football field" as a facility, and "Clay County school, Clay County Sheriff's Office, Fleming Island High School, Green Cove Springs Police Department, W.E. Cherry Elementary School" as organizations.

Comments
Missing Piece 1?
Yelvington
If this is what I hope it is... I've been advocating for some module that effectively scans content for taxonomy matches then optionally presents the checked terms to node authors or automatically tags the content with the matching terms.
Is this what this does?
If so, this fills in a much-needed piece of functionality that Drupal has been missing.
The other? Ah, the ability to drag and drop content with images (or to automatically take in content from sources like Word, PDF, etc.).
Listen to Andy Partridge for a better life.
...and if you read, buy a book
Close...
Close. It manages a whole set of Calais-specific vocabularies (which you can turn on or off, at your pleasure). Examples of the vocabularies would be "person," "region," "company," "movie," "medical condition" and so forth, which gives you some fine-grained control. It doesn't just toss all the terms into one big bucket.
As it scans the content, if it finds a key word or phrase that ought to be in a vocabulary, it inserts the term. So it's sort of like an automated free tagger.
If you put it into "suggest" mode it finds the terms, but doesn't insert them. Instead, they're on a tagging page (presented as a tab), and an editor may choose to accept and insert them by clicking on the suggestions. A bit of JavaScript is used to copy the suggestions into the vocabulary's free tagging field. If you want to add a tag that Calais did not find, you can do so at that point.
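To make the difference between the two modes concrete, here is a rough sketch in Python. It is not the module's code (the module is PHP, and I haven't reproduced its internals); the function name `tag_node`, the `ENTITIES` list and the shape of the result are all made up for illustration. It only shows the idea: terms are grouped per vocabulary, vocabularies you have turned off are skipped, and the mode decides whether terms are applied or merely offered.

```python
from collections import defaultdict

# Hypothetical stand-in for a parsed Calais response: (vocabulary, term)
# pairs. The real service returns a richer format than this.
ENTITIES = [
    ("Facility", "Fleming Island High school football field"),
    ("Organization", "Clay County Sheriff's Office"),
    ("Organization", "Fleming Island High School"),
    ("Person", "John Smith"),
]


def tag_node(entities, enabled_vocabularies, mode="suggest"):
    """Group terms by vocabulary, then either assign or merely suggest them."""
    grouped = defaultdict(list)
    for vocabulary, term in entities:
        # Vocabularies that have been turned off are skipped entirely.
        if vocabulary in enabled_vocabularies:
            grouped[vocabulary].append(term)

    result = {"assigned": {}, "suggested": {}}
    # "assign" inserts the terms; anything else only offers them to an editor.
    bucket = "assigned" if mode == "assign" else "suggested"
    result[bucket] = dict(grouped)
    return result


if __name__ == "__main__":
    print(tag_node(ENTITIES, {"Facility", "Organization"}, mode="suggest"))
```

With "Person" left out of the enabled set, John Smith never shows up in either bucket, which is the kind of fine-grained control the separate vocabularies give you.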
Tried out your module today
And it worked great (mostly). I'm thinking of integrating your module with my Google Summer of Code project -- Memetracker
I had it working with FeedAPI -- it was pretty cool to import a bunch of blog posts and see them all automatically tagged. Pretty slick. There are some tagging errors -- one example, "AP - Google Inc." is not the right name for the company. I'm sure the problem is on their end, not yours. It'd be nice if there was some way of training Calais -- saying you got this one wrong or something like that. But I'm really liking the module so far.
Kyle Mathews
Calais tag modifications
Some of this might already exist -- if you check out admin/settings/calais/calais-tagmods you can specify tags that get overridden -- so, you could specify "AP - Google Inc.=AP", and that substitution would occur -- I haven't tested this functionality, but the UI suggests this is already there.
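For what it's worth, here is how I'd picture that kind of substitution working, sketched in Python. This is a guess at the behavior, not the module's implementation: the one-rule-per-line "original=replacement" format is assumed from the example above, and `parse_tag_mods` and `apply_tag_mods` are names I made up.

```python
def parse_tag_mods(text):
    """Parse one 'original=replacement' rule per line into a dict.

    The rule format is an assumption based on the "AP - Google Inc.=AP"
    example; the module's real settings format may differ.
    """
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # ignore blank or malformed lines
        # Split on the last '=' so the original tag may itself contain '='.
        original, replacement = line.rsplit("=", 1)
        rules[original.strip()] = replacement.strip()
    return rules


def apply_tag_mods(tags, rules):
    """Replace any tag that has an override; leave the rest untouched."""
    return [rules.get(tag, tag) for tag in tags]


if __name__ == "__main__":
    rules = parse_tag_mods("AP - Google Inc.=AP")
    print(apply_tag_mods(["AP - Google Inc.", "Drupal"], rules))
```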
Cheers,
Bill
FunnyMonkey
Tools for Teachers
Gone live
I've added OpenCalais to my blog as I continue to test. Still not sure whether we will use this on a live newspaper site, but it does look promising.
Interesting. I noted on this
Interesting. I noted that on this article it created two tags for "St. Paul Pioneer Press," with no discernible difference between them. Is that a bug with the Drupal module, or with OpenCalais itself? Can you tell?
The Boise Drupal Guy!
Not a bug ...
It's not a bug. Calais saw the term "St. Paul Pioneer Press" and assigned it to two different vocabularies: Company and PublishedMedium. In this case the latter is probably inappropriate. In fact, it's highly unlikely that PublishedMedium would appropriately apply to anything on my blog, so I should probably just turn it off.
The module provides an interface for managing the terms applied to each node. You can configure it to suggest, or assign. I've got "assign" turned on. You can override the assignments easily; in fact, there's a bit of JavaScript assistance, so all you have to do is click on a term you don't like and move it from "assigned" to "suggested" behind the scenes.
You can also, as I said, ignore an entire vocabulary that consistently doesn't fit with your content.
And you can map Calais terms to local terms, so if you want to map "Star Tribune" to "Enemy Newspaper" or vice versa, you can do so.
Displaying the tags
I should add that the default Drupal behavior of displaying all the tags may not be optimal over the long term for making the best use of Calais tagging. Thomson Reuters intends to continue expanding the number of vocabularies, and we're just going to be overwhelmed. It's more likely that the availability of the metadata will lead us to develop some custom queries that generate "related item" linksets. I'm not yet sure how that will shake out.
Over the next week we hope to start feeding complete live newspaper content into a test site to see how it shapes up.
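As a thought experiment on the "related item" idea, one simple approach would be to rank other nodes by how many Calais terms they share with the current one. The Python sketch below is only that, a sketch: the `NODE_TAGS` data is invented, `related_items` is a hypothetical helper, and on a real site this would more likely be a database query against Drupal's term tables than an in-memory loop.

```python
# Invented sample data: node id -> set of (vocabulary, term) pairs.
NODE_TAGS = {
    1: {("Organization", "Clay County Sheriff's Office"),
        ("Organization", "Fleming Island High School")},
    2: {("Organization", "Fleming Island High School"),
        ("Facility", "Fleming Island High school football field")},
    3: {("Company", "St. Paul Pioneer Press")},
    4: {("Organization", "Clay County Sheriff's Office"),
        ("Organization", "Fleming Island High School"),
        ("Person", "John Smith")},
}


def related_items(node_id, node_tags, limit=5):
    """Rank other nodes by how many (vocabulary, term) pairs they share."""
    target = node_tags[node_id]
    scores = []
    for other_id, tags in node_tags.items():
        if other_id == node_id:
            continue
        shared = len(target & tags)
        if shared:  # nodes with nothing in common are not "related"
            scores.append((shared, other_id))
    # Most shared terms first; ties broken by node id for stable output.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scores[:limit]]


if __name__ == "__main__":
    print(related_items(1, NODE_TAGS))
```

Matching on the vocabulary as well as the term means a "St. Paul Pioneer Press" tagged as Company would not be treated as the same thing as one tagged PublishedMedium. Whether that is desirable is an open question.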