RDF and autotagging.

Adam S's picture

I've been trying very hard to get the OpenCalais system working for me. In terms of auto-tagging it's been the best system I've used so far. I really don't have any idea what RDF is. My guess is that it's a way to store information about entities using an intuitive three-part structure: subject, predicate, and object.
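To make that guess concrete, here is a minimal sketch of the three-part idea using plain Python tuples. It is not an RDF library and not OpenCalais output: the `Triple` class, the `statements_about` helper, and the predicate names are all made up for illustration, and real RDF would identify subjects and predicates with URIs rather than bare strings.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


# Hand-written statements based on the Bulfinch passage quoted later in this post.
# The predicate names are invented for this sketch.
triples = [
    Triple("Pluto", "carriedOff", "Proserpine"),
    Triple("Pluto", "reached", "River Cyane"),
    Triple("River Cyane", "type", "NaturalFeature"),
]


def statements_about(store, subject):
    """Return every triple whose subject matches."""
    return [t for t in store if t.subject == subject]


for t in statements_about(triples, "Pluto"):
    print(t.subject, t.predicate, t.obj)
```

The point of the structure is that every fact has the same shape, so a store of such statements can be filtered by any of the three positions.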

My first impression of how OpenCalais worked was that it parsed sentences to determine relevance. But I'm getting odd results from a variety of unscientific tests. It seems that if it can't determine a pronoun's antecedent, it just skips over the whole sentence. I'm not sure what's going on or how it can be improved.

I wrote what I'm submitting below in the OpenCalais issue queue. Thinking that it didn't really belong there, and that I should stop bugging those guys so much, I am now going to bug all of you a little. My apologies.

I've been trying really hard to develop an understanding of how Calais works. It's very abstract. Originally, I thought 'semantic' meant that it was parsing sentences semantically to determine meaning. This isn't what RDF is, at least I don't think so. Now I'm beginning to think that Calais takes a text and tries to find objects within it, a bank, for example, and wraps each object in a three-part XML markup, creating a central repository, a database of objects, somewhere.

I always turn to Bulfinch for semantically correct writing. If you ever wanted to test a system that can parse English syntax I would start with his Mythology because his grammar has cleaner, more organized syntax than Drupal Core or jQuery Core -- Boston Latin, Exeter, and Harvard beat it into him. I was curious what exactly Calais is returning and placed a whole random chapter and then only one paragraph of a chapter of Bulfinch's Mythology in the viewer at viewer.opencalais.com.

In the vale of Enna there is a lake embowered in woods, which screen it from the fervid rays of the sun, while the moist ground is covered with flowers, and Spring reigns perpetual. Here Proserpine was playing with her companions, gathering lilies and violets, and filling her basket and her apron with them, when Pluto saw her, loved her, and carried her off. She screamed for help to her mother and companions; and when in her fright she dropped the corners of her apron and let the flowers fall, childlike she felt the loss of them as an addition to her grief. The ravisher urged on his steeds, calling them each by name, and throwing loose over their heads and necks his iron-coloured reins. When he reached the River Cyane, and it opposed his passage, he struck the river-bank with his trident, and the earth opened and gave him a passage to Tartarus.

I know that Calais is designed to parse American news and I'm really putting it to the test, but returning only two entities, 'bank', an industrial term, and 'river Cyane', a natural resource, doesn't make much sense to me. What's happening?

I look at this passage and think: how much can we glean from the syntax? Every sentence in the English language has to have a subject and a predicate.

1. There / is a lake
2. Proserpine / was playing
3. Proserpine / screamed
4. Proserpine / dropped her apron
5. Pluto / urged on his steeds
6. Pluto / reached the River Cyane
7. The River Cyane / opposed passage
8. Pluto / struck the river-bank
9. The earth / opened
10. The earth / gave Pluto a passage

This passage's subjects are Proserpine (3), Pluto (3), the River Cyane (1), earth (dirt, ground) (2), and a lake (1).
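The tally above can be reproduced with a few lines of Python. This is only a sketch: the subject/predicate pairs are transcribed by hand from the list above, not produced by any parser, and the variable names are my own.

```python
from collections import Counter

# (subject, predicate) pairs copied by hand from the numbered list.
# "There / is a lake" is recorded under its logical subject, "a lake",
# which is how the tally above counts it.
clauses = [
    ("a lake", "is"),
    ("Proserpine", "was playing"),
    ("Proserpine", "screamed"),
    ("Proserpine", "dropped her apron"),
    ("Pluto", "urged on his steeds"),
    ("Pluto", "reached the River Cyane"),
    ("The River Cyane", "opposed passage"),
    ("Pluto", "struck the river-bank"),
    ("The earth", "opened"),
    ("The earth", "gave Pluto a passage"),
]

# Count how often each subject heads a clause.
subject_counts = Counter(subject for subject, _ in clauses)

for subject, count in subject_counts.most_common():
    print(subject, count)
```

If an extractor ranked candidate tags by counts like these, Proserpine and Pluto would come out on top, which is roughly what I expected to see.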

I did pull some of the sentences that were compounded and made substitutions for pronouns. What about all the other grammatical structures with verbs? Relative, noun clauses, gerunds, and infinitives all have patterns and structure which -- using a term I see every day in Drupal -- can be parsed.

I'm sure that there are a lot of people working on computer systems that parse language.

Proserpine, Pluto, and the dirt contain all the symbolic meaning of the sexual innuendo of the passage, which I, by the way, randomly pulled from the mythology, just as it is contained in some more modern symbolic entities similar to Neo, Trinity, and the Matrix (aka dirt or womb), Disney's Beauty and the Beast, or Linda Woolverton's newest movie. What's up with Alice and the hole? It would be a decent enough extraction of terms. Perhaps from a literary perspective, a straight mapping between a sentence subject and an RDF subject would be the best situation.

Here is some of the RDF that OpenCalais returned, and I just don't think it's correct.

<rdf:Description rdf:about="http://d.opencalais.com/genericHasher-1/286f764a-3b21-35f2-b3ed-259f3ba3e387">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/NaturalFeature"/>
<c:name>River Cyane</c:name>
</rdf:Description>
<rdf:Description rdf:about="http://d.opencalais.com/dochash-1/e9499a15-6247-39af-9f89-ce4c4efca040/Instance/1">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/InstanceInfo"/>
<c:docId rdf:resource="http://d.opencalais.com/dochash-1/e9499a15-6247-39af-9f89-ce4c4efca040"/>
<c:subject rdf:resource="http://d.opencalais.com/genericHasher-1/286f764a-3b21-35f2-b3ed-259f3ba3e387"/>
<!--NaturalFeature: River Cyane; -->
<c:detection>[necks his iron-coloured reins. When he reached ]the River Cyane[, and it opposed his passage, he struck the]</c:detection>
<c:prefix>necks his iron-coloured reins. When he reached </c:prefix>
<c:exact>the River Cyane</c:exact>
<c:suffix>, and it opposed his passage, he struck the</c:suffix>
<c:offset>712</c:offset>
<c:length>15</c:length>
</rdf:Description>
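
For anyone who wants to see the three-part structure inside that markup, here is a rough sketch that reads a trimmed copy of the instance record with Python's standard `xml.etree.ElementTree` and prints each child element as a (subject, predicate, object) statement. Two things are assumptions on my part: the excerpt has no `rdf:RDF` wrapper, so I added one, and the URI bound to the `c:` prefix is a guess (only the `rdf:` namespace is the standard W3C one). The `instance_triples` function name is made up.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # standard RDF namespace
C_NS = "http://s.opencalais.com/1/pred/"  # ASSUMED: the excerpt does not show this URI

# Trimmed copy of the instance record, wrapped in rdf:RDF so it parses on its own.
SNIPPET = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:c="{C_NS}">
<rdf:Description rdf:about="http://d.opencalais.com/dochash-1/e9499a15-6247-39af-9f89-ce4c4efca040/Instance/1">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/InstanceInfo"/>
<c:subject rdf:resource="http://d.opencalais.com/genericHasher-1/286f764a-3b21-35f2-b3ed-259f3ba3e387"/>
<c:prefix>necks his iron-coloured reins. When he reached </c:prefix>
<c:exact>the River Cyane</c:exact>
<c:suffix>, and it opposed his passage, he struck the</c:suffix>
<c:offset>712</c:offset>
<c:length>15</c:length>
</rdf:Description>
</rdf:RDF>"""


def instance_triples(xml_text):
    """Turn each child of every rdf:Description into a (subject, predicate, object) tuple."""
    root = ET.fromstring(xml_text)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for child in desc:
            # The object is either a linked resource or the element's literal text.
            obj = child.get(f"{{{RDF_NS}}}resource")
            if obj is None:
                obj = child.text or ""
            triples.append((subject, child.tag, obj))
    return triples


triples = instance_triples(SNIPPET)
for subject, predicate, obj in triples:
    print(predicate, "->", repr(obj))
```

Read this way, the record is not a parse of the sentence at all. It says that one document instance mentions one entity, and records where in the text the mention was found (`c:exact`, `c:offset`, `c:length`) along with the text around it.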

Comments

Automatic Semantic Tagging

Dan@Ontos's picture

OpenCalais and other similar systems based on natural language processing all have the same problem. They can only extract named entities and relations from a very well-specified domain (e.g. business news, politics, or sport). They also need a set of sentences (not just one short one) in order to understand the meaning.
We at Ontos identified this issue and are working on a new approach that will allow a bit more flexibility.
First, we will let users add their own terms. Second, we will provide a mixed approach that combines automatic extraction with manual correction. We are currently developing this for a customer, and the plan is to provide a Drupal project and module within this year (2010).
An interesting article about NLP and Semantic Web can be found under http://sciyo.com/articles/show/title/providing-semantic-content-for-the-....

OpenCalais tagging

fran.sansalone's picture

Take a look at OpenCalais SocialTags extraction - http://www.opencalais.com/documentation/opencalais-web-service-api/api-m... - we enabled SocialTags and submitted Adamsohn's text to it, with the following results:

Proserpine
Roman mythology
Greek mythology
Polytheism
Lilies
Tartarus
Apron
Pluto
Cyane
Enna

As for RDF, you might want to investigate other output format options with OpenCalais - http://www.opencalais.com/documentation/calais-web-service-api/interpret....

Regards,
Fran Sansalone
Calais Community Manager


Adam S's picture

First, I'm glad that I started my little rant with "In terms of auto tagging it's been the best system I've used so far." Of the three auto-tagging systems used with this module, http://drupal.org/project/autotagging, the Calais system returned, in my humble opinion, the best results. However, the OpenCalais module, http://drupal.org/project/opencalais, returns far cleaner and more accurate tags for the same content.

Second, I built a tiny newspaper website using OpenPublish. I, however, stripped the OpenCalais system out of it because of numerous problems. The system finally works and I plan on implementing it on the production site very soon. However, the editor has been giving me flak about some mistakes OpenCalais is making. The first demonstration was a couple of months ago, and the system failed to resolve pronoun antecedents. I started a forum thread on OpenCalais.org asking about this, which was never answered.

Last Wednesday I went to the Miami Drupal Meetup hosted by The Knight Foundation at their corporate headquarters. Sadly, not many people showed up for this month's event. Since I had been working all week with OpenCalais, it was very fresh in my mind. Being in the board room of an organization whose purpose is to "advance journalism in the digital age," I thought it would be prudent to give a quick demonstration of OpenCalais and a Phase2 website, The New Republic.

Here is the first paragraph I sent Frank Frebbraro in an email:

Last night at the Miami Drupal Meetup hosted by The Knight Foundation (http://www.knightfdn.org/home/) I tried to demonstrate OpenCalais and OpenPublish by taking an article from tnr.com and placing it in viewer.opencalais.com. Judging from the first few sentences, the article (http://www.tnr.com/article/politics/presumed-innocent?page=0,0) was clearly about Guantanamo detainees. Unfortunately, OpenCalais completely missed that.

My letter to Frank was as much about that TNR article being poorly written as about OpenCalais missing an important point in it. I also suggested that OpenCalais yields better results from well-written compositions and could possibly be used as an editorial device to gauge readability.

In another demonstration I gave of the system two weeks ago, it yielded Monaco, California as a result instead of Monaco, France. The person I gave that demonstration to wasn't impressed either.

Third, I'm trying to understand what the semantic web and OpenCalais are. Just because I heard about the semantic web through OpenCalais, which parses content using semantics, doesn't mean that RDF is related to language semantics. At least I don't think so. This is the question I'm asking: what are the differences between the semantic web and language semantics?

The last issue is that OpenCalais might be using a poor engine to extract information from text. As a user, how can I help to make this better? Google Translate has a button that allows users to make suggestions to improve the quality of the translation. Does OpenCalais have user feedback functionality like this? If I point out that OpenCalais completely missed a pronoun antecedent and returned a result suggesting the entire article was about a person mentioned as the object of one sentence, will it be noticed and useful?

Marine job board with Drupal 7 at http://windwardjobs.com


linclark.research's picture

I gave a presentation at DrupalCamp Vienna that might help at least a little with understanding the Semantic Web.

http://www.youtube.com/watch?v=xcPf4PeF57Y

NLP, RDF and large datasets

Dan@Ontos's picture

@Lin: I like your YouTube contribution, and I guess it gives people who are new to semantics a nice overview. Ontos has been working since 2001 in the field of semantic web technologies, and we have built an engine called OntosMiner which extracts named entities and relations. We have also developed a framework which receives the extracted data (XML/RDF) and stores it in a large graph, or in other words in a knowledge base built on RDF triples. We also work on named entity resolution, because when you insert new data you want to be sure that a named entity (e.g. Guantanamo) is merged with the existing one.
On our website you will also find http://news.ontos.com, which shows how our technology works for English news articles that have been processed by our NLP engine (OntosMiner) and then stored as triples in our RDF store. The widgets just provide one possible user interface for accessing the data. In the background we use SPARQL.
@ALL: As said, we are currently involved in two projects where we want to integrate our technology into Drupal for the two customers.
Our plan is to make this available as a project inside Drupal for everyone else. We support English, Russian, and German for the natural language processing part.
I will be at SemTech (http://semtech2010.semanticuniverse.com/agenda.cfm?confid=42&scheduleDay...) in San Francisco in June and will give a demo about our Drupal and publishing activity.
@adamsohn: We developed an ontology dictionary which allows users to specify their own vocabulary and make it part of the NLP process. This provides more flexibility and personalization. Our API (free to use) already includes this function. As said, in 2010 (hopefully in the summer) we will have some first Drupal modules using our API and therefore our approach to semantics.
