Draft API specification for tag suggestion/autotagging modules

Posted by nedjo on October 18, 2010 at 5:04pm
Last updated by febbraro on Thu, 2010-11-04 14:08

This wiki page is aimed at promoting a community consensus on an API for providing tag suggestions. Please join the discussion and help move this initiative forward.

On the plus side, the comparison of autotagging tools on this site shows that lots of work has been done. But it also highlights the lack of coordination. Each module providing automated tags or tag suggestions tends to invent its own approaches, incompatible with others. There are no common standards. Crucially, most or all modules contain some mix of two very different areas of functionality:

A solution to provide tag (keyword, term) suggestions or matches.
UI and/or data manipulation of those tags.

The mixture of these two areas of functionality is a key source of incompatibility, since in most cases, even with custom coding, it is difficult or impossible to take advantage of tag suggestions offered by a given module without being weighed down by its UI or data manipulation decisions.

The lack of a common standard has produced typical problems: duplication of effort, monolithic code trying to do many distinct tasks at once, conflict between different developers, confusion among users (which tools should I use?), etc. While in many cases a diversity of approaches can be beneficial, in this case the lack of standards offers no tangible advantages and instead holds back progress in this important area of Drupal development.

To address these issues, we should develop a standard API for modules providing tag suggestions. Such a standard need only address the provision of tag suggestions--how these suggestions are used, saved, or presented to users is up to modules using but not providing the API.

Here is a draft Drupal Tag Suggestion Specification. Please review and add improvements.

Outstanding questions

The following are questions that have not yet been addressed in the spec. Please add to this list and/or fill in proposed approaches in the draft spec.

RDF information: Should RDF information associated with tags be returned as part of the tag suggestions? If so, how?
Specifying eligible vocabularies and/or terms. Should there be a site-wide way to flag vocabularies or terms as eligible for suggestion? E.g., the Extractor module uses such a method. If so, how should it be implemented? Should it go into the proposed tag_suggestion module?
Instead of a set of hooks, would it be better to use CTools plugin architecture? Is CTools an acceptable dependency?
If an API module is produced (e.g., tag_suggestions), should it ship with some base methods? If so, which?
If an API module is produced, should it be a new module? Or should it be e.g. the D7 version of an existing module? If using the namespace of an existing module (understanding that it would be a ground up rewrite), which modules should be candidates?

Draft specification

1. Aims and scope

This specification aims to provide a common set of methods and formats enabling interoperability among applications providing and using tag (keyword, term) suggestions.

Target use cases include:

Provide suggested taxonomy terms for a given field or entity (node, user, etc.).
Automatically register taxonomy terms based on content analysis.
Provide suggested meta tags for a given page.

For purposes of this specification, tag suggestion is given a broad meaning, including methods that may or may not analyze content and may or may not limit suggestions to existing taxonomy terms. Sample methods might include:

List the most used existing taxonomy terms as stored in the site's database, optionally limited to a given vocabulary.
Provide keyword suggestions based on an external service's analysis of a given text string.

This specification defines APIs to be used by back end modules providing tag suggestions, including:

API to describe tag suggestion methods: hook_tag_suggestion_info().
Callback to respond to tag suggestion requests.
A standard format for tag suggestions.

It also defines a proposed helper module that would provide methods to be used by front end modules--modules that request tag suggestions.

2. No tag handling UI or data manipulation in back end modules

To enable the maximum flexibility, modules implementing the API hooks defined in this spec must be pure back end solutions.

A module complying with this specification may not provide, and may not have a direct or indirect dependency upon any other module that provides, any tag handling UI or data manipulation such as:

Saving tag suggestions on entity (node, user, etc.) insert or update.
UI widgets to manipulate (add, move, delete) tag suggestions.
Presentation, AJAX loading, etc., of suggested tags.

A UI and/or data storage may be provided for tasks intrinsic to the back end, e.g., to configure settings needed for a particular tag suggestion service.

In addition, module authors are strongly encouraged not to bundle tag suggestion modules with other modules that provide UI or data manipulation but rather to provide them as separate projects.

3. API hooks

3.1. Tag suggestion methods

Modules offering tag suggestions must implement hook_tag_suggestion_info(), returning an array of one or more methods offered. A method should be an array with specified keys as follows.

[Edited to remove use of pseudo-hook, replacing with explicit callback]

<?php
/**
 * Define Tag Suggestion API methods.
 *
 * @return
 *   An array whose keys are method names and whose values are arrays
 *   describing the method, with the following key/value pairs:
 *   - label: The human-readable name of the method.
 *   - description: A short description for the method.
 *   - uses_args: An array of arguments to the request_callback
 *     that this method makes use of, if supplied.
 *   - provides_keys: An array of keys in the return format of the response
 *     callback that this method provides, if supplied all its uses_args arguments.
 *   - request_callback: callback function for requesting tag suggestions.
 */
function hook_tag_suggestion_info() {
  return array(
    'example' => array(
      'label' => t('Frequency analysis'),
      'description' => t('Return the most-used terms in a text.'),
      'uses_args' => array('text', 'limit', 'stop_words'),
      'provides_keys' => array('name', 'count', 'relevance'),
      'request_callback' => 'example_tag_suggestions',
    ),
  );
}
?>

3.2. Tag suggestion responses

Modules implementing hook_tag_suggestion_info() must supply a callback returning an array of suggested tags.

[Edited to remove use of pseudo-hook, replacing with explicit callback]

<?php
/**
 * Tag suggestion callback: Provide tag suggestion responses for this module's suggestion methods.
 *
 * All arguments are optional and implementing modules may use or ignore
 * any argument. For example, vocabulary IDs may not be relevant to methods
 * fetching solutions from external services or based on keyword frequency.
 *
 * @param $text
 *   Text to be analyzed.
 * @param $limit
 *   Maximum number of items to be returned. Defaults to 0, meaning no limit.
 * @param $vids
 *   Array of vocabularies to be referenced.
 * @param $timestamp
 *   Timestamp associated with the text (may be used to help interpret
 *   terms like "yesterday").
 * @param $stop_words
 *   Terms that should not be included in the results.
 *
 * @return
 *   An array of suggestions, each of which is an array with the following
 *   key/value pairs (all but 'name' are optional):
 *   - name: The tag or term.
 *   - tid: The ID of a locally available taxonomy term with this name.
 *   - count: The number of matches found for this term in analyzed text.
 *   - relevance: An integer from 1 (lowest) to 10 (highest) indicating the degree of
 *     relevance of the term.
 */
function example_tag_suggestions($text = NULL, $limit = 0, $vids = array(), $timestamp = NULL, $stop_words = array()) {
  $terms = array();
  if (!empty($text)) {
    $word_count = array_count_values(str_word_count(strtolower(strip_tags($text)), 1));
    arsort($word_count);
    // Skip stop words.
    if (!empty($stop_words)) {
      $word_count = array_diff_key($word_count, array_flip($stop_words));
    }
    // Reduce to limit.
    if ($limit > 0 && count($word_count) > $limit) {
      $word_count = array_slice($word_count, 0, $limit);
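    }
    $min = end($word_count);
    $max = reset($word_count);
    foreach ($word_count as $name => $count) {
      // Convert occurrence value to an integer scale from 1 to 10. When all
      // counts are equal there is no range to scale, so use the top value.
      $relevance = ($max > $min) ? (int) round(1 + ($count - $min) * (10 - 1) / ($max - $min)) : 10;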
      $terms[] = array(
        'name' => $name,
        'count' => $count,
        'relevance' => $relevance,
      );
    }
  }
  return $terms;
}
?>

4. Using the API

The tag_suggestion module is provided for use by module authors to request tag suggestion methods or responses.

The method tag_suggestion_get_info() is provided to fetch information on available methods. Sample usage: a module providing a tag suggestion field widget might invoke this method in a field configuration form, fetching a list of suggestion methods to be used for the field.

<?php
/**
 * Fetch information on all tag suggestion methods or a particular method.
 *
 * @param $method
 *   An individual method requested.
 * @param $key
 *   Return information only for the specified key.
 * @param $uses_args
 *   Array of argument names. If supplied, return only methods that make
 *   use of all of these arguments.
 * @param $provides_keys
 *   Array of response keys. If supplied, return only methods that provide
 *   all of these keys.
 *
 * @return
 *   An array of available methods meeting the request criteria.
 */
function tag_suggestion_get_info($method = NULL, $key = NULL, $uses_args = array(), $provides_keys = array()) {

}
?>
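
The function body above is left empty in the draft. As a rough sketch only, the filtering it describes might look like the following. The helper name example_filter_methods() and the 'popular_terms' sample method are hypothetical, and the method definitions are passed in as an argument so that the sketch runs without Drupal; in a real module they would presumably be collected from hook_tag_suggestion_info() implementations, e.g. with module_invoke_all().

<?php
/**
 * Hypothetical helper: filter method definitions by the requested criteria.
 *
 * @param $info
 *   Method definitions keyed by method name, in the format returned by
 *   hook_tag_suggestion_info().
 * @param $method
 *   An individual method requested, or NULL for all methods.
 * @param $key
 *   Return information only for the specified key, or NULL for everything.
 * @param $uses_args
 *   Keep only methods that make use of all of these arguments.
 * @param $provides_keys
 *   Keep only methods that provide all of these keys.
 *
 * @return
 *   An array of matching methods, keyed by method name.
 */
function example_filter_methods(array $info, $method = NULL, $key = NULL, $uses_args = array(), $provides_keys = array()) {
  $matches = array();
  foreach ($info as $name => $definition) {
    // Fill in optional keys so the checks below never hit a missing index.
    $definition += array('uses_args' => array(), 'provides_keys' => array());
    if ($method !== NULL && $name !== $method) {
      continue;
    }
    // Skip methods that do not use every requested argument.
    if (array_diff($uses_args, $definition['uses_args'])) {
      continue;
    }
    // Skip methods that do not provide every requested key.
    if (array_diff($provides_keys, $definition['provides_keys'])) {
      continue;
    }
    // Return either the full definition or just the requested key.
    if ($key === NULL) {
      $matches[$name] = $definition;
    }
    elseif (isset($definition[$key])) {
      $matches[$name] = $definition[$key];
    }
  }
  return $matches;
}

// Sample data standing in for the merged hook_tag_suggestion_info() results.
$info = array(
  'example' => array(
    'label' => 'Frequency analysis',
    'uses_args' => array('text', 'limit', 'stop_words'),
    'provides_keys' => array('name', 'count', 'relevance'),
    'request_callback' => 'example_tag_suggestions',
  ),
  'popular_terms' => array(
    'label' => 'Most used terms',
    'uses_args' => array('limit', 'vids'),
    'provides_keys' => array('name', 'tid', 'count'),
    'request_callback' => 'popular_terms_tag_suggestions',
  ),
);

// Labels of all methods able to return a local term ID.
$labels = example_filter_methods($info, NULL, 'label', array(), array('tid'));
?>

With this sample data, $labels contains only the 'popular_terms' entry, since the frequency analysis method does not provide 'tid'. A field widget's configuration form could use such a list to offer only the methods that suit the field.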

The method tag_suggestion_request() is provided to fetch tag suggestions. Sample usage: a module providing a tag suggestion field widget might invoke this method before rendering its widget, fetching a list of suggested terms to be displayed. (Existing terms might be passed in an array of stop_words to prevent repeat suggestions.)

<?php
/**
 * Request tag suggestions.
 *
 * All arguments are optional and implementing modules may use or ignore
 * any argument. For example, vocabulary IDs may not be relevant to methods
 * fetching solutions from external services or based on keyword frequency.
 *
 * @param $methods
 *   Array of methods to send requests to. If empty, all available methods
 *   are used.
 * @param $text
 *   Text to be analyzed.
 * @param $limit
 *   Maximum number of items to be returned. Defaults to 0, meaning no limit.
 * @param $vids
 *   Array of vocabularies to be referenced.
 * @param $timestamp
 *   Timestamp associated with the text (may be used to help interpret
 *   terms like "yesterday").
 * @param $stop_words
 *   Terms that should not be included in the results.
 * @param $sort_key
 *   Sort results by this key.
 * @param $sort_direction
 *   Sort order: 'asc' or 'desc'.
 *
 * @return
 *   An array of suggestions, each of which is an array with the following
 *   key/value pairs (all but 'name' are optional):
 *   - name: The tag or term.
 *   - tid: The ID of a locally available taxonomy term with this name.
 *   - count: The number of matches found for this term in analyzed text.
 *   - relevance: An integer from 1 (lowest) to 10 (highest) indicating the degree of
 *     relevance of the term.
 *   - method: The method that returned this suggestion.
 */
function tag_suggestion_request($methods = array(), $text = NULL, $limit = 0, $vids = array(), $timestamp = NULL, $stop_words = array(), $sort_key = NULL, $sort_direction = 'asc') {

}
?>
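
This function body is also left empty in the draft. The following is a minimal sketch, under stated assumptions, of how the request might be dispatched: each selected method's request_callback is called with the callback arguments from section 3.2, each result is labelled with its 'method', and the merged list is optionally sorted. The names example_dispatch_request(), demo_a_tag_suggestions() and demo_b_tag_suggestions() are hypothetical, the method definitions are passed in so the sketch runs without Drupal, and whether $limit should also apply to the merged list is not settled by the spec and is left out here.

<?php
/**
 * Hypothetical sketch of the dispatch logic behind tag_suggestion_request().
 *
 * @param $info
 *   Method definitions keyed by method name, as returned by
 *   hook_tag_suggestion_info(). The remaining arguments follow
 *   tag_suggestion_request().
 */
function example_dispatch_request(array $info, $methods = array(), $text = NULL, $limit = 0, $vids = array(), $timestamp = NULL, $stop_words = array(), $sort_key = NULL, $sort_direction = 'asc') {
  // An empty list of methods means "use every available method".
  if (!empty($methods)) {
    $info = array_intersect_key($info, array_flip($methods));
  }
  $suggestions = array();
  foreach ($info as $name => $definition) {
    $callback = $definition['request_callback'];
    if (!is_callable($callback)) {
      continue;
    }
    $results = call_user_func($callback, $text, $limit, $vids, $timestamp, $stop_words);
    foreach ($results as $suggestion) {
      // Record which method returned this suggestion.
      $suggestion['method'] = $name;
      $suggestions[] = $suggestion;
    }
  }
  if ($sort_key !== NULL) {
    // Collect the sort values, order them, then rebuild the list.
    $values = array();
    foreach ($suggestions as $index => $suggestion) {
      $values[$index] = isset($suggestion[$sort_key]) ? $suggestion[$sort_key] : 0;
    }
    if ($sort_direction == 'desc') {
      arsort($values);
    }
    else {
      asort($values);
    }
    $sorted = array();
    foreach (array_keys($values) as $index) {
      $sorted[] = $suggestions[$index];
    }
    $suggestions = $sorted;
  }
  return $suggestions;
}

// Two stand-in callbacks with fixed output, for illustration only.
function demo_a_tag_suggestions($text = NULL, $limit = 0, $vids = array(), $timestamp = NULL, $stop_words = array()) {
  return array(
    array('name' => 'drupal', 'relevance' => 4),
    array('name' => 'taxonomy', 'relevance' => 9),
  );
}

function demo_b_tag_suggestions($text = NULL, $limit = 0, $vids = array(), $timestamp = NULL, $stop_words = array()) {
  return array(
    array('name' => 'tagging', 'relevance' => 7),
  );
}

$info = array(
  'demo_a' => array('request_callback' => 'demo_a_tag_suggestions'),
  'demo_b' => array('request_callback' => 'demo_b_tag_suggestions'),
);

// Query all methods and sort the merged suggestions by relevance, highest first.
$suggestions = example_dispatch_request($info, array(), 'Some text', 0, array(), NULL, array(), 'relevance', 'desc');
?>

With the stand-in callbacks, $suggestions lists 'taxonomy', 'tagging' and 'drupal' in that order, each carrying the name of the method that produced it.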

Existing module maintainers

If you are a maintainer of an existing module that either provides or uses tag suggestions, your input here would be very appreciated. Please take the time to review the draft spec. Are you supportive of the effort? Would it meet your needs? What is missing? Please fill in and change the spec as needed. If there are priority areas not met in the spec, please either note them in the "Outstanding questions" section above or add them to the draft spec.

If you have a chance, please also fill in a row in the following table, indicating your take on this initiative. Please fill in:

Under Maintainer, your user name.
Under Support, whether you support this draft spec and would be willing to adopt it in your module. Choose Y (yes) if you basically support the idea and could see adopting this spec with a few changes. Choose M (maybe) if you support the idea but think it would need major changes and work before it would be workable. Choose N (no) if you don't support the basic idea or approach and/or would not wish to adopt a spec in your own module.
Under Comaintain, whether you would be interested in comaintaining a tag_suggestion module providing a small set of methods for tag suggestion requests.
Under Comments, please note changes that you've made to the draft spec or other comments.

Thanks!

Module	Maintainer	Support (Y/N/M)	Comaintain (Y/N/M)	Comments
Active Tags	dragonwize	M	Y	While this is not a horrible system, I think using the ctools plugin architecture would be better.
Alchemy
Amplify	mbutcher	Y	M
Auto Tagging
Autocategorise
Extractor	alex_b	Y	M
HILCC Taxonomy Autotag
Inform	JeremyFrench	M	M	Inform is no longer under active development. See comment
Keyword Analysis
OpenCalais	febbraro	Y	M	More thought needed.
Suggested Terms	Crell	M	M	No pseudo-hooks. Period.
Tagging
Taxonomy Autotagger	SDRycroft	Y	M
Yahoo Terms

Comments

No more pseudo-hook

Posted by Crell on October 19, 2010 at 5:22am

As maintainer of Suggested Terms I am very much in support of this idea in principle. However, the example API above appears to make use of a pseudo-hook aka magic callback aka spawn of evil. We absolutely do not need more of those, in fact we need to hunt them down and exterminate them. I cannot endorse any API design that makes use of magic callbacks.

The implementation of a given suggestion routine should be either an explicit callback like a menu hook, or a proper plugin.

Actually since I have Butler plugins on the brain these days I'm wondering if these couldn't be a simple use case in Drupal 7.

Possibly a stupid question.

Posted by sdrycroft on October 21, 2010 at 9:34am

What exactly are pseudo-hooks/magic-callbacks? I've done a quick search of Drupal.org, but all it comes up with are a number of pages where the discussion revolves around the removal of aforementioned code.

Cheers, Simon

Quick try at explanation

Posted by nedjo on October 21, 2010 at 5:40pm

Others pls correct/clarify, I'm not feeling very articulate on this. But here goes.

In the first draft of the spec, I followed what's done e.g. in field module: an info hook returns available items (e.g. content types) and callbacks for that module's items are constructed based on the module name.

The resulting callback is "magic" in the sense that it's constructed from the module name and a string and is not a hook in the regular Drupal sense, i.e., designed to be e.g. listed by module_implements(). It's like a method of the item (e.g., field type), though it's bundled with the methods of other items defined by the module. In short, it's a mishmash.

In the revised spec I've addressed this by providing explicit callbacks. Alternatives would include making the plugins objects and having a get_suggestions method.

Support in Principle

Posted by JeremyFrench on October 19, 2010 at 8:37am

I worked on the Inform module at The Economist. I haven’t worked there for a while, so I don’t know if they ended up using it, or something like it, for their eventual topic solution.

The module worked ok, but Drupal didn’t scale well to the number of terms and tags which were generated. I can’t remember the precise figures but there were millions of tags and a lot of terms.

This is something which should be thought about for any D6 implementation of this. D7 has fixed a few of the bugs.

I like the idea of centralising the housekeeping for the tag suggestion modules. This spec thus far focuses mainly on the API rather than the support functions it will provide. It would be nice to get an idea where people think the demarcation is.

For example, I feel that if this spec is properly implemented, all the tag providers should do is provide terms, and do no housekeeping of their own. I shall mull it over some more and perhaps edit the post later.

This is overall going in the

Posted by alex_b on October 19, 2010 at 2:42pm

This is overall going in the right direction. Consolidation is a really good thing here, the existing solutions seem ripe for it.

As a maintainer of Extractor I am interested in seeing this consolidation happen and I can help with reviews and - once a final API is in place - porting 'simple extractor' (a vocabulary based simple lookup tagger) and 'placemaker' (a yahoo placemaker based lookup tagger). My time budget in the next couple of months is going to be very tough though, so unfortunately I won't be able to commit to more.

On a more detail level:

I second Crell - pseudo hooks need to be changed to callbacks declared in info hooks.
We should make sure every piece of configuration is exportable.
I'd use CTools plugins for handling tagging plugins.

http://www.twitter.com/lxbarth

Pseudo hook removed

Posted by nedjo on October 19, 2010 at 10:45pm

Thanks for the comments! I removed the pseudo hook, replacing it with an explicit callback.

Yep, we could do this with Ctools or (when it's ready) Butler. Let's wait a few days, hopefully hear from some more module maintainers, then map out some plans.

Is this outside the scope?

Posted by janusman on October 20, 2010 at 2:35pm

In general, I think this is a good idea, and agree with most comments above.

In my particular case (the HILCC module) I feel it could be a little out of scope. I'll explain:

I have an established vocabulary with hierarchy from which to pick terms from (HILCC Subjects), depending on information on the node (Library of Congress Call Number-LCCN) which might come from a number of sources, depending on the modules installed (could be a CCK text field, a Biblio field, etc.) There is a basic mapping from one or more LCCN ranges into each particular HILCC subject.

Other similar cases could be:
* tagging nodes with State > County from scanning for zip codes, or area codes, etc. in nodes.
* tagging nodes with a color from an established palette, from scanning and averaging its attached image files.
* tagging nodes with named ranges (e.g. "$0-$10") from scanning a "price" or other numerical field.

I see that the current proposed API might not (?) support this. Again, it might be out of scope, but I feel it's important to actually wrangle in information to make it usable/findable (I'm into faceted search UIs).

Thoughts?

In scope

Posted by nedjo on October 20, 2010 at 9:55pm

I'm thinking this is in scope. From what you're describing, it sounds like it could fit into the spec. Request passes in text that's analyzed, returning an array of tags.

What may need work is: how do we designate what text (or other attributes, like images) are used for the analysis.

At present this is up to the module initiating the request, which could do any number of things: render a node and pass the full result; use the unrendered body field; allow an admin to select specific text field values for analysis; etc.

Part of what's challenging here is that different tag suggestion modules may need different data to analyze. E.g., HILCC may need specific fields that are meaningless to other tag suggestion modules, which may need their own data, e.g., from different fields.

Options:

We pass a full entity (node, etc.) to the tag suggestion callback, allowing individual modules to
In describing itself, each tag suggestion module includes data on the fields/properties it uses. These are extracted from the entity and passed to the tag suggestion callback (each callback gets only the string that's appropriate to it).

The first would be much more straightforward.

So we either add an $entity argument to the proposed callback or - probably better - we replace the existing $text argument with $entity (and maybe add $entity_type, e.g., 'node'), assuming that the tag suggestion module knows best what it needs.

My thoughts on this In the

Posted by JeremyFrench on October 21, 2010 at 7:04am

My thoughts on this

In the general case the api should decide what it is that is being tagged and pass the text to the implementing modules, this will allow the user to identify taggable fields. Or at least keep the decision about what to tag away from the implementing modules. It will also allow for tagging before nodes are saved, or of arbitrary text.

However some implementing modules may require more context to perform their analysis, so the entity should be a property which is described in the implementation definition.

Good point

Posted by nedjo on October 22, 2010 at 12:07am

Yes. And probably we should convert the current arguments (which are getting to be a long list) into an array of $options.

Separate problem space?

Posted by Crell on October 25, 2010 at 2:58am

$option parameters run the risk of turning into "big mass of undocumented crap" that pervades Drupal right now. We'd have to be careful not to replicate FAPI. :-)

I don't know how to resolve the context question here, though. Suggested Terms, for instance, relies on the vocabulary in question and on tags present on other nodes of the same type as the one being edited to come up with its suggestions. Textual analysis would be irrelevant there. In that case the relevant information is node type, vocabulary, and field. Other tagging modules would rely on textual analysis exclusively.

Which begs the question, are those two use cases that should be combined in the first place, or should textual-analysis-based data sources be treated separately from metadata-based data sources?

When working on DBTNG, Merge queries were originally a part of Insert queries. It was the realization that the resulting handling logic was getting totally ugly that prompted us to split Merge queries out into a separate operation all their own, vastly simplifying the logic in the process (and, as it turns out, allowing us to actually follow the SQL spec).

Understand your points. Re

Posted by JeremyFrench on October 25, 2010 at 3:46pm

Understand your points.

Re the options array, you have a good point, if we can keep it simple all the better.

Re the separate problem space, I'm not so sure. As I see it there are three alternatives.

Use text.
Use context.
Use text and context.

If option three doesn't exist in an implementation then I think separate calls are the way to go. Otherwise I think we need to work out a way to pass the attributes in a structured way.

Thoughts on the spec

Posted by febbraro on November 4, 2010 at 2:03pm

Sorry for the lateness of my reply, took a long vacation. :) There is a lot to think about here so these are my first brief thoughts....

First off, I agree with Alex B, every single piece of configuration needs to be exportable and the plugins should be developed with ctools plugins.

Now, with respect to what I do with OpenCalais, this needs to support the following and I'm not sure if I see it all here.

The returned Suggestions can be in any number of vocabularies, and would likely need to be organized as such (People, Companies, etc.). Some folks may want this all in one vocabulary while others see benefit in having each conceptual type of entity in a different vocab.
Suggestions and the resultant Tags from accepting a suggestion are not the same thing. Suggestions need to have any number of extra pieces of meta-data associated with them. (Linked Data URI, Ticker symbol, Lat/Lon, etc.) Am I correct in that this does not prescribe anything about the storage of the suggestion and just that it relates to a term? So potentially terms are created from a suggestion and then the new term is offered for relation? Or was the idea to have uniform (extendible) storage for Suggestions as well?
We would also need a set of alter hooks (or similar concept) to allow other modules to affect the suggestions (think about black/whitelisting).
The idea of automated triggering on specific actions, things like publishing not just insert/update. Also need support for manual requesting for services that send data to external sources. An example here would be embargoed press releases not having their data sent to a 3rd party until it was ready to be released. This might be more of a responsibility of the suggestion implementor, but a concern.
The configuration would need to support auto association based on relevance or the like for each individual service. This would need to go down to the specific vocabulary level too, example: All People over .75 relevant, All Cities over .25 relevant.

Reading back my thoughts are all over the board, but these are concerns that I have to address with respect to my specific implementations.

Thanks for the feedback. It

Posted by nedjo on January 17, 2011 at 6:53pm

Thanks for the feedback.

It looks like a lot of this should work fine with the spec as it's evolving.

"The returned Suggestions can be in any number of vocabularies..." Presumably then an optional "vocabulary" property to each suggestion.
"Suggestions and the resultant Tags from accepting a suggestion are not the same thing..." Definitely. Suggestions are just that, and are not necessarily related to storage.
"We would also need a set of alter hooks...". Yep.
"The idea of automated triggering on specific actions, things like publishing not just insert/update..." I think this is beyond the scope of the initial spec.
"The configuration would need to support auto association based on relevance or the like for each individual service." This too seems out of scope--it's what to do with suggestions once you have them.

Subscribing...looking for a

Posted by dave reid on January 17, 2011 at 6:43pm

Subscribing...looking for a solution that I can help work on for D7 for use with the Social bookmarking profile: http://drupal.org/project/drupmarks

Senior Drupal Developer for Lullabot | www.davereid.net | @davereid

Next steps

Posted by nedjo on January 17, 2011 at 7:08pm

None of the modules listed at http://groups.drupal.org/node/38290 yet has a stable release, so it seems like a good time to push ahead with this initiative.

Personally I can lend a bit of coordination time, but this isn't an area of focus for me otherwise. I posted this proposal because there seemed to be a need, but I'll need to look to others to take leadership in terms of code.

We've made some good progress:

There looks to be broad agreement on the approach of writing a new API module, based on ctools plugin architecture, that can be a dependency of modules either providing or requesting tag suggestions.
Several leaders in the area are potentially available to collaborate/co-maintain a solution.

How to get from here to there?

Needs look to include:

Revise the spec, incorporating changes.
Agree on a namespace for the new API module.
Draft and commit initial code for the tag suggestion API module.
Write two simple implementations of the API, one providing tag suggestions and the other consuming them.

Volunteers for any of these tasks? Should we organize an IRC meeting to get started?

Slight change to tag suggestion callback.

Posted by sdrycroft on March 4, 2011 at 1:09pm

I would suggest the following change to the tag suggestions callback, which makes things much more flexible.

<?php
/**
 * Tag suggestion callback: Provide tag suggestion responses for this module's suggestion methods.
 *
 * @param $args
 *   A keyed array of options. This must include the keys defined under
 *   uses_args in hook_tag_suggestion_info().
 *
 * @return
 *   An array of suggestions, each of which is an array with the following
 *   key/value pairs (all but 'name' are optional):
 *   - name: The tag or term.
 *   - tid: The ID of a locally available taxonomy term with this name.
 *   - count: The number of matches found for this term in analyzed text.
 *   - relevance: An integer from 1 (lowest) to 10 (highest) indicating the degree of
 *     relevance of the term.
 */
function example_tag_suggestions($args) {
...
?>