Taxonomy as a field

Posted by bjaspan on February 10, 2008 at 6:12pm

It's useful to explore how existing core functionality can or cannot be implemented as a field. Doing so may cause us to change how we decide to implement them. In this post I'm talking in terms of nodes but trying to avoid assuming that we are limited to nodes instead of Content objects and that all fields are stored locally in SQL. I'm also mostly just thinking as I go/type; I don't have a specific outcome in mind yet.

Taxonomy is a "multi-valued field" assigned to node types currently via variable settings. The term_node table is effectively a field table: (nid, tid) instead of (nid, delta, tid). One note is that the tid assignments to a nid are unordered whereas field values are ordered. Since the nid,tid order has never mattered before, arbitrarily assigning an order via delta should not matter.
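
A minimal sketch of that mapping, assuming plain PHP arrays stand in for database rows. The helper name term_rows_to_field_rows() is hypothetical; only the (nid, tid) and (nid, delta, tid) shapes come from the post. The deltas simply follow the order in which rows happen to arrive, which is the "arbitrary" order described above.

```php
<?php
// Hypothetical helper: turn unordered (nid, tid) rows, as in term_node,
// into (nid, delta, tid) rows as a field table would store them.
function term_rows_to_field_rows(array $term_node_rows): array {
  $next_delta = [];  // Next unused delta for each nid.
  $field_rows = [];
  foreach ($term_node_rows as $row) {
    $nid = $row['nid'];
    $delta = $next_delta[$nid] ?? 0;
    $next_delta[$nid] = $delta + 1;
    $field_rows[] = ['nid' => $nid, 'delta' => $delta, 'tid' => $row['tid']];
  }
  return $field_rows;
}

// Made-up sample rows for illustration.
$term_node = [
  ['nid' => 1, 'tid' => 12],
  ['nid' => 1, 'tid' => 7],
  ['nid' => 2, 'tid' => 12],
];
$field_rows = term_rows_to_field_rows($term_node);
```

Each nid gets its own delta sequence starting at 0, so node 1 ends up with deltas 0 and 1 while node 2 starts again at 0.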

During load, term_node connects to term_data and vocabulary. Observation: Although we wrote down that "drupal_load('node', $nid) will SELECT FROM node + load all fields", drupal_load() probably will not actually load all the data itself directly; instead, it will have to call a taxonomy_field_load() hook. For the specific taxonomy case, perhaps we could encode information in the Schema structure or the field definition that tells drupal_load() to join these other tables. However, consider a field that accesses remote data (e.g. "Amazon products related to this story" field). drupal_load() can't possibly load that data directly, so we need a field_load() hook, so taxonomy should probably just use it.
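
The dispatch described above could look roughly like the following. This is a sketch under assumptions, not a proposed API: the names drupal_load() and taxonomy_field_load() come from the post, but sketch_drupal_load(), amazon_field_load(), the hook signature, the registry shape, and all of the data are invented for illustration.

```php
<?php
// Stand-in for taxonomy's hook. A real one would query term_node,
// term_data and vocabulary; here a hard-coded array plays the database.
function taxonomy_field_load($nid, $field_name) {
  $fake_db = [1 => [(object) ['tid' => 7, 'name' => 'Drupal']]];
  return $fake_db[$nid] ?? [];
}

// Stand-in for a remote-data field. A real one would call a web service,
// which is why the loader cannot fetch this data with a SQL join.
function amazon_field_load($nid, $field_name) {
  return [(object) ['asin' => 'EXAMPLE', 'title' => 'Related product']];
}

// Hypothetical loader: $field_registry maps field name => field type.
function sketch_drupal_load($nid, array $field_registry) {
  $node = (object) ['nid' => $nid];  // Stands in for SELECT FROM node.
  foreach ($field_registry as $field_name => $field_type) {
    $hook = $field_type . '_field_load';
    if (function_exists($hook)) {
      // Each field type decides how to fetch its own values.
      $node->$field_name = $hook($nid, $field_name);
    }
  }
  return $node;
}

$node = sketch_drupal_load(1, [
  'taxonomy' => 'taxonomy',
  'related_products' => 'amazon',
]);
```

The loader never needs to know whether a field lives in local SQL or somewhere remote, which is the argument for having taxonomy use the same hook.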

term_data contains a term weight which determines the order of tids in the $node->taxonomy array. This is disjoint from the delta column in the term_node field table; the weight is a property of the term, not its assignment to a node. Vocabularies have weights too. taxonomy_field_load() will presumably return an array of field values to be merged into the node. When the node is saved, the taxonomy field table (term_node replacement) will be filled with (nid, delta, tid) but again the deltas will be "arbitrary".

So taxonomy_field_load() turns out to look almost exactly like taxonomy_nodeapi('load'); well, we always knew nodeapi was a pretty powerful model.

Today, $node->taxonomy is filled in as an array mapping tid to an "object" of all the columns of term_data. If taxonomy is a field, then $node->taxonomyfieldname is a field array which means its keys are deltas (in this case, meaninglessly so). $node->taxfieldname[n] is a term_data "object" including the tid. So code that assumes the indexes are tids will need to change to use the tid field directly. Such code exists in taxonomy.module, elsewhere in core, and throughout contrib, probably.
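
To make the difference concrete, here is a small sketch using made-up terms and a hypothetical field name, tags. Code that today reads tids from the array keys would have to read them from each term object instead.

```php
<?php
// Today: $node->taxonomy maps tid => term_data "object".
$node = new stdClass();
$node->taxonomy = [
  12 => (object) ['tid' => 12, 'name' => 'News'],
  7 => (object) ['tid' => 7, 'name' => 'Drupal'],
];
// Existing code often takes the tids straight from the keys.
$old_tids = array_keys($node->taxonomy);

// As a field (hypothetical field name "tags"): keys are deltas 0, 1, ...
$node->tags = array_values($node->taxonomy);
// The keys no longer mean anything, so read the tid property instead.
$new_tids = [];
foreach ($node->tags as $delta => $term) {
  $new_tids[] = $term->tid;
}
```

Both approaches yield the same tids here; only the place they are read from changes.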

Hmmm. I actually just glossed over something important. If taxonomy is a field type, then multiple fields of type taxonomy can be added to a node. This could be quite useful: "user assigned terms" and "editor assigned terms". However, fields are stored based on their name: $node->usertaxterms, $node->editortaxterms, or something like that. Right now, a node only has one set of terms and it is always available at $node->taxonomy. So if I have a node object I can ask, "what are its terms?" because I know where they are stored. With taxonomy as a field, I do not know where they are stored.

This is not limited to taxonomy; comments are the same way, always at $node->comments but if comments were a field they'd be in $node->commentfieldname.

I think the ability to access $node->taxonomy directly is an historical artifact. Drupal was designed as a CMS and taxonomy, comments, and other CMSy features were built into it in fixed ways. Many sites use Drupal for events, and event nodes often have a venue, but we do not expect to be able to access $node->venue to find out the venue for an event node. We have to know the name of the venue field.

What to do about this? Random thoughts:

1. The easy solution: At install time, when we create the 'page' and 'story' type, we create a field of type taxonomy named 'taxonomy' and assign it to those node types. If you create a new node type and want it to play with standard taxonomy functionality, assign the 'taxonomy' field to it. Otherwise, feel free to do whatever you want.
2. Perhaps we need to be able to ask a field type for all of the values on a node from that field type. e.g. $taxonomy = taxonomy_field_all_values($node). Or probably, since the core field registry knows what fields are assigned to each node type, it would be $taxonomy = field_all_values_of_field_type('taxonomy', $node). That would give me all the terms for the node. If I specifically wanted some subset (e.g. only one field's worth of terms), that implies I'm writing new code in a post-taxonomy-as-field world and presumably I know which field I'm looking for.
3. Perhaps what's going on here is that we're trying to turn Drupal into a more general web app development platform and we're discovering that it has aspects of the initial app built on the platform, the CMS app, integrated too deeply into core. The CMS app defines that $node->taxonomy and $node->comment exist as those specific names and then uses them. So perhaps we need to very clearly separate "core Drupal functionality" and "the CMS app" from each other. They can still be shipped in the same tarball but as clearly defined entities.
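
The helper from the second idea might be sketched as follows. The function name comes from the post, but the third $registry parameter is an assumption added so the example is self-contained (the post imagines the core field registry being consulted internally), and the node type, field names, and terms are made up.

```php
<?php
// Hypothetical registry shape: node type => field name => field type.
function field_all_values_of_field_type($field_type, $node, array $registry) {
  $values = [];
  $fields = $registry[$node->type] ?? [];
  foreach ($fields as $field_name => $type) {
    if ($type === $field_type && isset($node->$field_name)) {
      // Append every value of every matching field, in registry order.
      foreach ($node->$field_name as $value) {
        $values[] = $value;
      }
    }
  }
  return $values;
}

$registry = [
  'story' => [
    'body' => 'text',
    'usertaxterms' => 'taxonomy',
    'editortaxterms' => 'taxonomy',
  ],
];

$node = (object) [
  'type' => 'story',
  'body' => ['Some text'],
  'usertaxterms' => [(object) ['tid' => 7]],
  'editortaxterms' => [(object) ['tid' => 12], (object) ['tid' => 3]],
];

$taxonomy = field_all_values_of_field_type('taxonomy', $node, $registry);
```

A caller that only wants the editor's terms would skip the helper and read $node->editortaxterms directly, since it already knows the field name.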

All that from "let's think about taxonomy as a field"!

Comments

Fun thought experiment

Posted by moshe weitzman on February 10, 2008 at 8:06pm

These explorations are fun!

A few thoughts while reading this

delta for term assignments could be really useful. barry knows that as maintainer of the primary term module.
D6 actually maps tid to vid since term assignments are versioned. i only mention it because we are now more like how field modules natively work (a good thing)
we can go a long ways on just #1. i think we can go forever with just #1 if we have a field api that allows changing of field names (i.e. the key). then one can retroactively call a certain field a 'taxonomy' field. node types currently allow changing of 'machine name', but CCK fields do not. IMO, these changes are essential or else you have sites that have to hang on to spelling mistakes made when creating a field.

Widget

Posted by yched on February 10, 2008 at 8:50pm

Problem with #1, if I understand it correctly, is that on node edit, you get a single select to edit the terms for all vocabularies. Not too friendly...
Mmh, maybe not, after all. Spitting one select per vocab could be the job of widgets operating on 'taxonomy' field type.

Moshe : or else you have sites that have to hang on to spelling mistakes made when creating a field.
Well, current discussions involve a large extension of the amount of things you can't change after field creation - basically, everything that would change the db schema (text max length...)

Vocabularies

Posted by bjaspan on February 10, 2008 at 9:35pm

Hmmm, yes, not all vocabularies are associated with all node types, and those associations can change. This can be a field instance setting, right? When you assign a taxonomy field to a node type (thus making a field instance), you choose which vocabularies are valid and, as yched says, the widget is responsible for creating the multiple select boxes. (It is going to need to be a custom-callback widget anyway as the standard "add more" interface is not appropriate for taxonomy.)

Selecting different vocabularies for a field instance does not change anything about the schema, it's purely a widget issue, so there is no reason the vocabularies cannot be edited later.
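
A rough sketch of that idea, assuming the field instance settings carry a list of allowed vocabulary ids. The allowed_vids key, the function name, and the sample vocabularies and terms are all hypothetical; the #type, #title, #multiple and #options keys follow the usual Form API array style.

```php
<?php
// Hypothetical widget builder: one multi-select per allowed vocabulary.
function sketch_taxonomy_widget(array $instance, array $vocabularies, array $terms) {
  $form = [];
  foreach ($instance['allowed_vids'] as $vid) {
    $options = [];
    foreach ($terms as $term) {
      if ($term['vid'] === $vid) {
        $options[$term['tid']] = $term['name'];
      }
    }
    $form['vocab_' . $vid] = [
      '#type' => 'select',
      '#title' => $vocabularies[$vid],
      '#multiple' => TRUE,
      '#options' => $options,
    ];
  }
  return $form;
}

// Made-up data: vid => vocabulary name, plus a flat list of terms.
$vocabularies = [1 => 'Topics', 2 => 'Colors'];
$terms = [
  ['tid' => 7, 'vid' => 1, 'name' => 'Drupal'],
  ['tid' => 12, 'vid' => 1, 'name' => 'News'],
  ['tid' => 30, 'vid' => 2, 'name' => 'Red'],
];

// The same field, with different instance settings per node type.
$story_form = sketch_taxonomy_widget(['allowed_vids' => [1]], $vocabularies, $terms);
$page_form = sketch_taxonomy_widget(['allowed_vids' => [1, 2]], $vocabularies, $terms);
```

Changing allowed_vids only changes which select boxes are built; nothing about the stored (nid, delta, tid) rows depends on it.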

term properties

Posted by catch on February 12, 2008 at 4:02pm

One thing not mentioned in much depth yet is the properties of terms.

Currently we have 'Description' (now shown on taxonomy/term/n pages by default in D6), 'Parents', 'Related terms', and 'Synonyms' - there's a patch against D7 to integrate synonyms and freetagging: http://drupal.org/node/201269 - otherwise these are often dormant bits of core.

I'm not sure where this kind of thing falls in with content/fields/field-properties.

To me - term description (and maybe term name, path alias and others) are candidates for 'fields in core', with the extensibility that implies - there's certainly demand for this with taxonomy_image, geo-enabling, category module etc.

I'd personally think of term_parent, term_relations, term_synonyms etc. becoming instances of the taxonomy reference field. But is there any scope for fields being themselves fieldable in this model?

I intended this thread as

Posted by bjaspan on February 12, 2008 at 7:54pm

I intended this thread as discussion for making term association with nodes (or Content objects) a field instead of a custom nodeapi behavior. What you are talking about (I think), and what you then suggest directly at the end of your comment, is being able to add fields to terms.

Both of the proposed content models would support that. In Model 1, you'd implement the Content/Fieldable interface for terms and away you'd go. I think the same is possible in Model 2, but I have not yet fully grokked it.

Whether term parent, relations, synonyms would then become "fields" of a term or remain "properties" (like uid and created are properties of nodes), I don't know. It certainly does seem like Taxonomy Image could be implemented as an Image field on terms, terms could be geo-enabled by adding a GMap field, etc.

I still can't decide whether I like the name Content or Fieldable for the interface. I was leaning towards content because "fields are content," but I think that a term is not content, adding a map to a term does not really make it content (it just adds more info to the term), and so maybe Fieldable is right after all, even though it is a barbarism as our esteemed French colleague points out. :-) Perhaps even "Entity" is the right term, but Entity is a very high-level term and maybe we do not want to constrain ourselves that all Entities can accept fields. But this is just a naming issue.

Just an Interface

Posted by Crell on February 12, 2008 at 8:37pm

The association of a taxonomy term to a Node/Entity/Thingie/whatever it's called should definitely be a Field. That Field is a multi-value field, where value is a term (in a given vocabulary?) associated with the node in question.

The advantage of a good, clean interface is that what happens behind that Field is flexible.

It could be implemented as simply a front-end to the current database tables, and that's the end of it, thank you have a nice day.
It could be implemented so that a Vocabulary is an Entity type (using Model 2 terminology), and each Vocabulary is a subtype. A given term is then an entity (~node) of that subtype. The term field on a Node, then, is a "Vocabulary reference field", like Node reference field and User reference field. That of course would allow terms to have fields, and to even reference other terms, which is just weird but might be useful. :-)
It could be implemented such that each term is a field on a Vocabulary thingie, and then the "Vocab reference field" specifies the vocab ID and which field(s)/terms to point to. (I think this one is kinda silly, but I needed a 3rd.)

All of that is hidden behind the API Interface, which is why a good API is so important.

Barry, thanks for

Posted by catch on February 13, 2008 at 12:20am

Barry, thanks for clarifying. That's what I figured when I read the first few posts in this group, but it got less clear to me as time went on, back to some semblance of clarity now :) And apologies for taking the discussion out of scope.

Taxonomy is currently

Posted by yched on February 12, 2008 at 9:36pm

Taxonomy is currently slightly misaligned with current CCK concepts :

On one hand, term/node associations are currently stored in a single table for all vocabularies, and all terms can be found under a single $node->taxonomy entry, which maps to : all vocabularies use the same 'taxonomy' field, that holds all the info.
Then, as Barry mentioned, each actual field instance can limit the set of vocabularies that can be selected from (story uses v1, page uses v1 and v2).
On the other hand, different content types can be tagged with different vocabularies, which maps to : each vocabulary is a different 'taxonomy' field.
You get separate widgets quite naturally, can choose different widget styles (select, checkboxes, autocomplete...) and you can put them at different places in your edit form when rendering the node. If / when we have field-level access rules (better than CCK Field Perms, I mean), this more granular approach comes in handy too.
Also, under this condition only can we use the field's 'delta' order for the 'Primary term' feature Moshe mentioned.
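
As an illustration of that last point, and assuming the convention that the value with the lowest delta is the primary term (the post does not spell this out), a one-field-per-vocabulary setup could read the primary term like this. The function name and sample terms are hypothetical.

```php
<?php
// Hypothetical helper: treat the lowest delta as the 'Primary term'.
function sketch_primary_term(array $field_values) {
  ksort($field_values);            // Order values by delta.
  $first = reset($field_values);   // First value, or false if empty.
  return $first === false ? null : $first;
}

// A field holding terms from a single vocabulary, keyed by delta.
$topics = [
  1 => (object) ['tid' => 12, 'name' => 'News'],
  0 => (object) ['tid' => 7, 'name' => 'Drupal'],
];
$primary = sketch_primary_term($topics);
```

With several vocabularies mixed into one field, delta 0 would only identify the first term overall, not a primary term per vocabulary.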

The workflow then is :
Create a vocabulary and populate terms just like currently, this automatically creates a Field, which is available for you to create instances on [node types | fieldable entities].

Thinking while I'm typing : CCK currently has this limitation that you can't have 2 instances of the same field on the same content type (and my personal opinion is that this will stay). This gets in the way of Barry's "editor assigned terms / user assigned terms". Plus we're not sure a field will be shareable across different kinds of 'fieldable content' (nodes, users, foobars...)

So maybe : create a taxonomy field, define which vocab it uses (this can't be changed later on), assign it to node types.

I'm not sure the 'one place for all terms' aspect is that important. The get_all_values_for_this_field_type($field_type, $node) helper function Barry mentioned provides a replacement for the 'single entry in $node' part.
If we can ditch the 'same table' requirement (which probably needs some consideration), I think the last scenario above is more flexible and feature-rich.

I don't understand why we

Posted by bjaspan on February 12, 2008 at 10:23pm

I don't understand why we would need each vocabulary to be a separate field. In my earlier comment on vocabularies I wrote:

Hmmm, yes, not all vocabularies are associated with all node types, and those associations can change. This can be a field instance setting, right? When you assign a taxonomy field to a node type (thus making a field instance), you choose which vocabularies are valid and, as yched says, the widget is responsible for creating the multiple select boxes.

So any field of type taxonomy can in principle hold terms from any voc. It is the field instance (== field as assigned to a node type) settings that define which vocs that field instance will allow. So what we currently have as a property of a vocabulary (what node types to be associated with) becomes a property of the field instance, but the functionality is identical. I think it is more sensible for this to be a field property than a vocabulary property anyway.

Also, my "editor assigned terms" and "user assigned terms" example specifically said they'd be two separate fields, not two copies of the same field on the same type (which we've already discussed is not possible).

Finally, you wrote, "we're not sure a field will be shareable across different kinds of 'fieldable content' (nodes, users, foobars...)." Actually, I think we are pretty sure a field will be shareable across different kinds of "fieldable content" (or entities/thingies/whatever as Larry likes to call them :-). That's kinda the whole point of supporting multi-source data. I know my proposed content model intends to support it and I'm pretty sure Larry's does too.

Yep

Posted by Crell on February 13, 2008 at 7:45am

Ideally, yes. We should be able to take a Node, an Artwork, and a User and add taxonomy terms to all of them. However it gets stored in the database, the ability to value-add at the Entity/Uber-node level rather than the current simple-node level is where we want to end up.

'1 field per vocab' vs

Posted by yched on February 12, 2008 at 10:58pm

'1 field per vocab' vs 'taxo fields potentially holding any number of vocabs each' :
It's just that 1 field per vocab seems more conceptually straightforward to me, and I'm not sure what the other option buys us.
Well, the ability to display terms for a node from different vocabs in a single sequence, I guess...
The remark about the "editor assigned terms / user assigned terms" use case was a counter-argument for my "creating a vocab internally creates a field to be used" train of thought. Sorry for not being clear, I should have edited out.
Actually, I think we are pretty sure a field will be shareable across different kinds of "fieldable content"
You mean one field has one instance on a node type, another one on users, etc ?
Wasn't clear to me. This has been mentioned in Chicago, but I was left with the memory of a 'mmh, maybe not'. I think we even mentioned having the 'entity' type (node, user, term...) appear in the name of the field table. Not too sure of the underlying reasons right now. Then again, the 'models' writeups might have updated that without me realizing it...

field will be shareable

Posted by yched on February 13, 2008 at 12:36am

field will be shareable across different kinds of "fieldable content"
Right, you mention that in the 'Model 1' writeup. Added a comment over there.