Proposal: Auto Taxonomy Generation from Node-Content

narres's picture

Overview: Parse node content and automatically generate a taxonomy (semantic cloud).

Description: To be done. Step 1, the most important: find the words that are significant. The faster, the better.
Step 2: Build structures from these words.
Possible tasks, which can be solved by "Auto Semantic Taxonomy":

  • Language detection
  • RDF'ing
  • Spam Detection
  • Autotagging
  • User Matching (through node_profile)
  • Meta Generation

In fact it's very similar to

but does it all in Drupal.

Main exercise: Write something like "OpenCalais" within Drupal.

It should be able to build taxonomy trees like:
vehicle:

  • car
    • Ford
    • BMW
  • motorcycle
    • BMW

and capture relations, such as BMW appearing under both motorcycle and car.
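
A minimal sketch of such a structure, written in Python only to keep it self-contained (it is not Drupal code). The `Taxonomy` class and its method names are hypothetical; the one idea borrowed from Drupal is that a term is stored as (term, parent) pairs, so the same term can have more than one parent.

```python
from collections import defaultdict


class Taxonomy:
    """Terms with zero or more parents, so one term can sit under several branches."""

    def __init__(self):
        self.parents = defaultdict(set)   # term -> set of parent terms
        self.children = defaultdict(set)  # term -> set of child terms

    def add(self, term, parent=None):
        self.parents[term]  # register the term even when it has no parent
        if parent is not None:
            self.parents[parent]
            self.parents[term].add(parent)
            self.children[parent].add(term)

    def ancestors(self, term):
        """Return every term reachable by following parent links upward."""
        seen = set()
        stack = [term]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


tax = Taxonomy()
tax.add("vehicle")
tax.add("car", "vehicle")
tax.add("motorcycle", "vehicle")
tax.add("Ford", "car")
tax.add("BMW", "car")
tax.add("BMW", "motorcycle")  # second parent for the same term
```

Here `tax.parents["BMW"]` holds both "car" and "motorcycle", and `tax.ancestors("BMW")` also reaches "vehicle". The hard part of the proposal, deciding automatically which parents a term should get, is not addressed by this sketch.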

Mentors:

Interested Student:

Official Proposal

Difficulty: hard

Comments

Moving to Official Ideas list

Alex UA's picture

This seems like a very interesting way to create recommended content. Let's see if any students are interested.

--
Alex Urevick-Ackelsberg
ZivTech: Illuminating Technology


Great

narres's picture

I already have a co-mentor (semantic science), but no student so far :)

another similar project

danithaca's picture

another similar project might be NAT? http://drupal.org/project/nat

I'm quite interested in implementing this, because it's related to my research too. But I might be working in another SoC project, the Recommender Bundle.

This might require some heavy matrix computation like SVD (singular value decomposition). It could probably be outsourced to the Recommender API (which may later be renamed to Math API).

i like this

sandipdev's picture

I like this idea. It would also help me in another project that I am doing for the Indian Institute of Management, Ahmedabad and the National Innovation Foundation (nif.org). They want a project corpus site where undergraduate, graduate, and possibly PhD research projects relating to science and tech can be uploaded. Some projects would be open source, so anyone can view or modify them; others would be paid. Industry can also ask for projects suited to them. This kind of module would help here: it would generate taxonomy terms based on the project content.

I will submit a proposal for this.

PS: One more thing about the project I mentioned. I am also supposed to add a feature that tracks how one project was transformed into another, and then into another. Say someone submitted project A, someone else took it up and modified it into B, and so on to Z; I have to track those modifications. It would be good if the user told me the original source, but they want some kind of auto-detection, which is tricky. Any suggestions? They also want a feature for finding the similarity between two or more project theses. I will put this up on the forum too.

Memetracker

batje's picture

This module seems to be related to this, though it needs an additional Python library.

http://drupal.org/project/memetracker

Comparison

garthee's picture
  1. Autotag - checks the node for the presence of words that are equal to the terms defined under the associated vocabulary
  2. Autocategorize - offers a similar feature, but also provides synonym creation with word stems and regular expressions
  3. OpenCalais - offers a rich analysis through natural language processing, machine learning, etc.

In my opinion, the first two are not efficient and would not be useful for general purposes. For example, merely finding the term "book" in the body of a post that talks about assembling a bicycle, with a reference to "book x", doesn't necessarily mean we should tag the post with the term "book".
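
The bicycle example can be reproduced with a few lines of Python. This is only a sketch of presence-based matching in general, not the actual Autotag or Autocategorize code; the function name `naive_autotag`, the sample post, and the vocabulary are made up for illustration.

```python
import re


def naive_autotag(body, vocabulary):
    """Return every vocabulary term that appears as a whole word in the body."""
    words = set(re.findall(r"[a-z0-9]+", body.lower()))
    return sorted(term for term in vocabulary if term.lower() in words)


post = (
    "To assemble the bicycle, fit the wheels first, then the chain. "
    "The torque values are listed in book x."
)
tags = naive_autotag(post, ["bicycle", "book", "recipe"])
```

With this input the post is tagged with both "bicycle" and "book", although it is not about books at all, which is the weakness described above.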

At the same time, OpenCalais, being a third-party tool, is backed by a huge dataset (when it comes to machine learning, data is the most critical component). However, the same diverse dataset leads to irrelevant tagging (terms we wouldn't have chosen had we tagged manually), which they are trying to reduce by letting you select the types of classification (email, company, country, etc.), and that in turn requires admin configuration. It is still not perfect, and I turned it off after finding it a bit annoying.

I believe the requirement is not to fully automate tagging, but rather to provide semi-automatic ways of tagging that spare ordinary users the trouble, or to provide tag suggestions that improve accuracy and relevance.

I think it would be better to provide a module (with an API) that will:
1. Use the existing terms (not only by looking for the presence of the word in the body) and determine their relevance through machine learning; even Bidirectional Associative Memory[1] could be used. This can happen
a. in bulk processing,
b. on submission of nodes,
c. as suggested terms in a popup.
2. Generate new terms automatically through machine learning. Here the experience comes from
a. the occurrence, ranking, and any weight metrics of a noun
b. filtering with specific guidelines to avoid words such as "blog" and "post"
3. Generate relationships
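
A rough sketch of points 1 and 2, in Python and under loud assumptions: plain word counts with a title bonus stand in for the machine learning step, there is no noun detection, and `IGNORED`, `suggest_terms`, and `title_weight` are hypothetical names, not part of any existing module. It only illustrates splitting ranked candidates into suggestions that reuse existing terms and suggestions for new terms.

```python
import re
from collections import Counter

# Hypothetical filter list: words that should never become tags.
IGNORED = {"blog", "post", "the", "a", "and", "of", "to", "in", "is", "for"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def suggest_terms(title, body, existing_terms, limit=5, title_weight=3):
    """Rank candidate tags and return (existing, new) suggestion lists."""
    scores = Counter()
    for word in tokenize(body):
        scores[word] += 1
    for word in tokenize(title):
        scores[word] += title_weight  # words in the title count more
    for word in IGNORED:
        scores.pop(word, None)
    known = {term.lower() for term in existing_terms}
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))[:limit]
    existing = [w for w in ranked if w in known]
    new = [w for w in ranked if w not in known]
    return existing, new


existing, new = suggest_terms(
    "Assembling a bicycle",
    "This post explains how to fit the chain and the wheels of a bicycle. "
    "The bicycle chain needs oil.",
    existing_terms=["Bicycle", "Cooking"],
    limit=3,
)
```

For this input, "bicycle" comes back as a suggestion from the existing vocabulary, while "assembling" and "chain" come back as candidates for new terms. Returning suggestions instead of writing tags directly matches the semi-automatic approach argued for above.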

However, in my opinion the third case, generating relationships between terms (such as placing BMW under both Car and Motorbike), may go awry, especially considering the limited data available (confined to a single Drupal site).

Well written

Are we outputting RDF or RDFing the post?

garthee's picture

I am still unclear about the objective. If it is to help the process of RDFing the output (the current Drupal path), an automated or semi-automated intelligent tagging mechanism (A) will be a boon. However, if it is RDFing user-submitted content (B), as OpenCalais does, the aforementioned factors such as
1. Limited availability of words
2. Posts that come from different contexts but are still not diverse or numerous enough to capture relationships and learn from
3. Irrelevant tags that may raise more cons than pros (academics may overlook the results and praise the theory, but commercial people will not)
will severely limit how widely our module is used.

However, as I mentioned earlier, A can be achieved (in results), and for B we will either have to rely on external data or establish connections with other Drupal sites and share data.

The objective is to find

narres's picture

The objective is to automatically find relevant words in a node and create new tags from them if the tags do not exist yet. If a tag already exists, just tag the post with the existing tag.

I played a bit on my experimental http://newsclick.biz/ (D5) site:

  • It's just reading RSS-Feeds and using FeedAPI
  • Using the search table and cutting off the top 5% (too uncommon) and the bottom 50% (not significant enough) of scored words.
  • Depending on the length of the article (word count), I'm autotagging with 5% (or so) of the words from the search table, taking the highest-ranked ones.
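
The selection rule in the bullets above can be sketched as follows. This is Python rather than the Drupal code used on the site, and it makes assumptions: the cut-offs are read as rank percentiles over all scored words, `site_scores` stands in for the search table, and the function and parameter names are hypothetical.

```python
def autotag(article_words, site_scores, top_cut=0.05, bottom_cut=0.50, tag_share=0.05):
    """Pick tags for one article from the middle band of site-wide word scores."""
    # Rank all scored words, highest score first (ties broken alphabetically).
    ranked = sorted(site_scores, key=lambda w: (-site_scores[w], w))
    n = len(ranked)
    start = int(n * top_cut)        # skip the top 5% of ranked words
    stop = n - int(n * bottom_cut)  # skip the bottom 50%
    band = ranked[start:stop]
    present = set(article_words)
    # Number of tags grows with article length: about 5% of its word count.
    wanted = max(1, round(len(article_words) * tag_share))
    return [word for word in band if word in present][:wanted]


# Synthetic data: 20 scored words, w00 has the highest score and w19 the lowest.
site_scores = {f"w{i:02d}": 20 - i for i in range(20)}
article = ["w00", "w03", "w07", "w15", "w05"] * 8  # a 40-word article
tags = autotag(article, site_scores)
```

In this synthetic run, "w00" is dropped as part of the top 5%, "w15" is dropped as part of the bottom 50%, and the 40-word article gets two tags, the highest-ranked remaining words "w03" and "w05".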

So, for example, "All DrupalCon DC 2009 Session Videos are Now Online" is autotagged as "News and announcements, recorded, recording", and these posts end up in the same "cloud":

  • The Google Summer of Code is Back for 2009!
  • News from the Documentation Sprint at DrupalCon DC
  • Pregnancy.org relaunched on Drupal - a case study
  • Drupal 6.10 and 5.16 released
  • Drupal.org is upgraded to Drupal 6
  • New Book: Drupal 6 Site Builder Solutions
  • Fields in Drupal core code Sprint

It's very simple PHP code that I used with workflow_ng (now Rules), like this snippet:

<?php
// Assumes $node, $cntmin, $cntmax and $limit are provided by the calling rule.
// Select this node's indexed words whose site-wide score lies inside the band.
$sql = "
  SELECT i.word, t.count
  FROM {search_index} AS i, {search_total} AS t
  WHERE i.word = t.word
    AND t.count < %f
    AND t.count > %f
    AND i.sid = %d
  ORDER BY t.count DESC";
$result = db_query_range($sql, $cntmax, $cntmin, $node->nid, 0, $limit);

$terms = array();
while ($term = db_fetch_object($result)) {
  $terms[] = $term->word . "(" . round($term->count, 2) . ")";

  // Reuse the term if it already exists in vocabulary 3, otherwise create it.
  $termexist = db_fetch_object(db_query("SELECT tid FROM {term_data} WHERE vid = 3 AND name = '%s'", $term->word));
  if ($termexist) {
    $tid_next = $termexist->tid;
  }
  else {
    $tid_next = db_next_id('{term_data}_tid');
    db_query("INSERT INTO {term_data} (tid, name, description, vid, weight) VALUES (%d, '%s', '', 3, 0)", $tid_next, $term->word);
    db_query("INSERT INTO {term_hierarchy} (tid, parent) VALUES (%d, 0)", $tid_next);
  }

  // Do not tag the node with purely numeric words.
  if (!is_numeric($term->word)) {
    db_query("INSERT INTO {term_node} (nid, tid) VALUES (%d, %d)", $node->nid, $tid_next);
  }
}
?>

Very simple, but works as a playground for me :)

Proposal

garthee's picture

I have posted my proposal draft at http://theebgar.net/story/gsoc-2009-proposal
You can find the proposal at Official GSoC page: http://socghop.appspot.com/student_proposal/show/google/gsoc2009/garthee...

If you could give any feedback on this before the submission deadline, it would be really helpful.

Thank you
-Garthee

Another useful tool that could speed this up

mbutcher's picture

This is a very enticing proposal.

One thing that might expedite some aspects of the tagging may be using OpenAmplify. There's a rudimentary module that provides library support for Drupal, but it isn't used for tagging or taxonomy. It may provide a quicker way of doing some of the local (document-specific) statistical analysis, and the output could then be used to make broader ontologies for a site.

bobbell2's picture

Thanks for sharing, I learned a lot.
In WordPress there was a plugin that could auto-create tags based on post content. I don't know how it did this, but perhaps it looked up the most common words in the body content and used them as the tags. I have some posts that have lost their tags, and it would be great to be able to auto-generate new ones from the posts with a script rather than having to go back through them and tag them manually.
my site:http://www.businessrefinery.com/