Field API specification


This document has been moved to Google Docs for easier online collaboration. The link is:

To view the document: http://docs.google.com/Doc?id=dcz939d6_2gnh78thd
To edit the document: https://docs.google.com/a/acquia.com/Doc?id=dcz939d6_2gnh78thd

Everyone can view the document. People who have been invited as collaborators can edit it. If you want to edit it or if there is a problem, contact Barry Jaspan (firstname dot lastname at acquia dot com).

Comments

field_attach_load()

catch's picture

This should probably take a similar form to the new hook_nodeapi_load() (for the changes to node_load itself, see node_load_multiple() and the issue where it went in http://drupal.org/node/324313)

So it could look something like:

<?php
field_attach_load($content_types, $object_type, $objects);
?>

(incidentally, why pass the object ID if this is already in $object or the keys of the $objects array?)

Or following hook_load()

<?php
foreach ($types as $type) {
  // Call a module's implementation once per content type.
  $function = $module . '_field_attach_load';
  $function($type, $object_type, $objects_of_type);
}
?>

Although in that case you wouldn't get the benefit of loading shared fields in one query, which ought to be easy enough passing in everything together.

Note this same API is already applied to taxonomy terms - and I think it's applicable to other core objects as well (users, maybe comments).

Updated the document with

catch's picture

Updated the document with changes along these lines.

hook_field_attach_* variants

yched's picture

About hook name 'variants' : yes, name conflicts are definitely an issue.

if both hook_field_attach_[op]_[object_type] and hook_field_attach_[op]_[field_type] are allowed, then object_type cannot ever be the same as field_type.
plus ambiguity : hook_field_attach_load_node_foo_bar can be seen as
hook_field_attach_[op]_[object_type]_[content_type]_[field_type] for [content_type] = foo, [field_type] = bar
and hook_field_attach_[op]_[object_type]_[content_type] for [content_type] = foo_bar
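The ambiguity can be shown with a small sketch. The helper name below is hypothetical and exists only to enumerate the possible readings of the tail of a hook name; it is not part of any proposed API.

<?php
// Illustration only: list every way the tail of a hook name could be split
// into [content_type] and an optional [field_type].
function example_hook_suffix_readings($suffix) {
  $parts = explode('_', $suffix);
  // First reading: the whole suffix is the content type.
  $readings = array(array('content_type' => $suffix, 'field_type' => NULL));
  // Other readings: split at each underscore.
  for ($i = 1; $i < count($parts); $i++) {
    $readings[] = array(
      'content_type' => implode('_', array_slice($parts, 0, $i)),
      'field_type' => implode('_', array_slice($parts, $i)),
    );
  }
  return $readings;
}

// The tail of hook_field_attach_load_node_foo_bar after [op] and [object_type].
$readings = example_hook_suffix_readings('foo_bar');
?>

Both readings are equally valid names, so a dispatcher has no way to tell them apart from the function name alone.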

I updated the 'Field

yched's picture

I updated the 'Field definition functions and hooks' section according to http://drupal.org/cvs?commit=158154 and http://drupal.org/cvs?commit=158146

hook_field_attach_*

yched's picture

I guess hook_field_attach_* could be the way fields_ui can :
- customize ('override') some settings in module-defined fields
- provide richer display settings than the ones supported by core 'fields'
(think : per-context field order, per context labels... ).

ids

yched's picture

field_attach_*() functions will need a way to figure out that fields are attached to nodes by nid/vid but to users by uid. Not sure yet what's the best way to abstract that.

Key by vid?

catch's picture

I'm looking through content_nodeapi_load() and content_storage_load(). Seems like keying by vid ought to do it no?

<?php
field_attach_load($types, 'node', $nodes);
$vids = array_keys($nodes); // vids

field_attach_load($types, 'user', $users);
$uids = array_keys($users); // uids.
?>

Could the cache keys just be vid instead of content:$nid:$vid? If so, fetching from cache could be

<?php
cache_get_multiple(array_keys($nodes), 'content');
?>
- that function isn't in HEAD yet but it's in the cache node_load() patch - if that doesn't get in then I'll break it back out into a separate one.

That way everything is just attached to a unique id, and for nodes it happens to be the revision id instead of nid. I'm not sure how this would fit with tnid yet though if we wanted to support that.

Load, update, insert, delete

yched's picture

Load, update, insert, delete revision are keyed by vid, but delete is keyed by nid.
That's also why the nid is currently needed in the cache id : to delete all cache entries for the node on node deletion. Maybe a cache_set_multiple might let us go with only vid in cache ids, but some special logic will be needed anyway to retrieve all other vids for a given vid.
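A minimal sketch of why the nid prefix helps on delete, using a plain array as a stand-in for the cache table. The helper name is hypothetical; only the content:$nid:$vid id format comes from the discussion.

<?php
// Stand-in for a cache bin: a plain array keyed by cache id.
$cache = array(
  'content:12:30' => 'fields for node 12, revision 30',
  'content:12:31' => 'fields for node 12, revision 31',
  'content:15:40' => 'fields for node 15, revision 40',
);

// Hypothetical helper: drop every entry whose id starts with the prefix.
function example_cache_clear_prefix(array $cache, $prefix) {
  foreach (array_keys($cache) as $cid) {
    if (strpos($cid, $prefix) === 0) {
      unset($cache[$cid]);
    }
  }
  return $cache;
}

// Deleting node 12 clears all of its revisions without knowing their vids.
$cache = example_cache_clear_prefix($cache, 'content:12:');
?>

With ids made of the vid alone, the same delete would first need a lookup of every vid that belongs to the node.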

I think we'll need some sort of 'fieldable entity' meta-descriptor, to specify :
- whether the entity has subtypes (field instances are not attached to nodes as a whole, but to node types)
- what are the id keys : nid/vid, uid, tid, cid...
- maybe : whether the entity handles its own caching : if the 'cache node_load' patch gets in, field-level caching will be redundant for nodes, but still needed for other non-cached entities. Or, we state that 'fields' are the primary data units, and forget 'cache node_load' in favor of the fields cache (non-field additions are then not cached...). The latter might be part of the 'discourage non-field additions' idea Barry mentioned somewhere (can't find the post right now).

Then it's the fact that 'fieldable entities' expose themselves somehow that lets fields.module prepare a place for their fields in its internal structures (currently the _content_type_info() array).

makes sense

catch's picture

I'd not thought about delete, but yeah both will be needed in that case.

The cache node_load patch adds a new hook_nodeapi_load_alter() specifically for stuff which shouldn't be cached persistently. Depending on the order of things going in, a generic field cache might well make sense (although I'd be concerned about relying on that until poll choices, taxonomy terms and other core stuff attached to nodes are fields).

Right, most core additions

yched's picture

Right, most core additions will probably not be ported to 'fields' in D7, so node caching is still valuable over a generic field cache for now...

more object caches then?

catch's picture

The other option is to implement caches for other objects which will have fields attached to them. At the moment node_load() joins on the users table just for name and picture. It'd be nicer if user.module handled this in hook_nodeapi_load() - and a multiple load/cached user_load() would let us pull in even profile information pretty inexpensively. The main issue is the trade off with extra cache_sets on sites which don't have a lot of fields to load anyway - but for example if we could make user pictures and signatures into fields for D7 (let alone profile module), that might make it a bit more worthwhile in core. Or we could add an option in admin/settings/performance for sites which know they won't get the benefit of a persistent cache for otherwise lightweight (but potentially field heavy) objects.

edit: this also means less cache contention on large sites if using memcache - although presumably a per-field cache could also be implemented.

Response to comments so far

bjaspan's picture

Here are my thoughts on issues mentioned in comments so far. I have not yet integrated these suggestions into the wiki page:

field_attach_load for multiple content types cannot be implemented in a single query because the tables depend on content type. We can batch them together as much as possible (ie: do all the pages, then all the stories). We just have to make sure we do not spend more time wrangling data in PHP than we save with the queries.
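The batching step could look roughly like the sketch below. The helper name is hypothetical, and it assumes each object carries its content type in a ->type property, as nodes do.

<?php
// Hypothetical helper, not part of any committed API: bucket objects by
// content type so each bucket can be loaded in one pass.
function example_group_objects_by_type(array $objects) {
  $grouped = array();
  foreach ($objects as $id => $object) {
    // Assumes the content type lives in ->type.
    $grouped[$object->type][$id] = $object;
  }
  return $grouped;
}

$nodes = array(
  1 => (object) array('nid' => 1, 'type' => 'page'),
  2 => (object) array('nid' => 2, 'type' => 'story'),
  3 => (object) array('nid' => 3, 'type' => 'page'),
);
$grouped = example_group_objects_by_type($nodes);
// $grouped['page'] holds nodes 1 and 3; $grouped['story'] holds node 2.
?>

The grouping itself is one pass over the objects, so its PHP cost should stay small next to the queries it saves, though that would need measuring.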

The reason to pass the object ID directly is that we do not have a uniform location for the object ID in the objects: $node->nid vs $user->uid. We are not going to hard-code a list of types to id field in the Field API.

Regarding nid/vid vs uid, yes, I've thought a lot about that, and I forgot to represent my conclusion in that wiki page. My first thought was that node would simply choose to pass vid as the object id instead of nid but, as I realized and yched pointed out, that doesn't work for "delete" or potentially other operations on all versions. The best conclusion I've reached is that if the Field API wants to support revisions (which it does), then it should simply accept object ID and object VID arguments. For example, using the "non-multiple" API:

<?php
  field_attach_load($content_type, $object_type, $object_id, $object_vid, &$object);
?>

node.module would pass $node->nid and $node->vid. If the user.module does not care to support revisions, then it would just pass $user->uid and 0. The same would hold for fields on remote data objects that do not themselves have revisions.

The same idea applies to objects that do not have content types. If user.module doesn't care to support multiple content types of users, it just passes 'user' as the content type as well as the object type.

I had not yet considered the field caching question. I guess my initial thought is that if we want the Field API to be able to know to cache fields for some object types but not others, then (as I think yched suggested) we need to add an API function to set meta-data about each object type, and the meta-data would have a "cacheable" field. Whichever module defines the object type would use the API during installation to set the meta-data appropriately.

Regarding hook_field_attach_*, we can get around name collisions by encoding more info in the hook name, e.g.: hook_field_attach_[op]_class_[object_type] vs hook_field_attach_[op]_ctype_[content_type]. This would make for some pretty ugly hook names, but would let us be very specific about which hook functions get called on which operations. The question is this: Is it faster/better to let the code registry try to invoke a large number of uniquely-named hooks, many of which will never exist, or have it actually invoke a large number of hook_field_attach_[op] hooks, each of which will have an if statement at the beginning which filters out the invocations that particular implementation does not care about?
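The two strategies can be compared side by side in a rough sketch. The module names and the dispatcher are hypothetical; only the hook naming patterns come from the paragraph above.

<?php
// Strategy A: a specifically named hook per object type, called only if it exists.
function mymodule_field_attach_load_class_node($objects) {
  return 'mymodule handled ' . count($objects) . ' node(s)';
}

// Strategy B: one generic hook per module that filters for itself.
function othermodule_field_attach_load($object_type, $objects) {
  if ($object_type != 'node') {
    // Not interested in this invocation.
    return NULL;
  }
  return 'othermodule handled ' . count($objects) . ' node(s)';
}

// Hypothetical dispatcher that tries both naming schemes for each module.
function example_invoke_attach_load(array $modules, $object_type, array $objects) {
  $results = array();
  foreach ($modules as $module) {
    $specific = $module . '_field_attach_load_class_' . $object_type;
    $generic = $module . '_field_attach_load';
    if (function_exists($specific)) {
      $results[$module] = $specific($objects);
    }
    elseif (function_exists($generic)) {
      $results[$module] = $generic($object_type, $objects);
    }
  }
  return $results;
}

$modules = array('mymodule', 'othermodule');
$node_results = example_invoke_attach_load($modules, 'node', array(1 => new stdClass()));
$user_results = example_invoke_attach_load($modules, 'user', array(1 => new stdClass()));
?>

For the 'user' call, strategy A never enters mymodule at all, while strategy B still pays for a function call into othermodule that returns immediately.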

object ids

catch's picture

In hook_nodeapi_load() the node ids are passed as the keys of the $nodes array - so we're passing them directly, just not as an extra parameter. Same with taxonomy terms and tid, potentially the same for users. There's no reason we couldn't key $nodes by vid for field_attach_load() in a similar way - saves passing redundant data around - except on delete, but then maybe we just need an extra parameter for the delete op.

To save one key in the function names for granular hooks, can we not do hook_field_$op instead of hook_field_attach_$op?

In terms of the registry and hook naming, I think we'll only really get a speed gain if we actually load less code rather than saving a few function calls here or there - we've already de-opped on _load/_delete - it's likely that _object would save some code (say on a user profile page where no nodes are loaded). With type it's hard to say - since if a field is attachable to multiple types, then it won't be able to use the _$type hook efficiently. To save some re-implementing of object/type checking within hooks, we could maybe have a field_support_types($field, $types) to centralise the 'do I do anything to any of these object types' lookup.

Thinking ahead - if we implement attaching fields to taxonomy terms, do we want to treat/reimplement vocabularies as types?

To save one key in the

yched's picture
  • To save one key in the function names for granular hooks, can we not do hook_field_$op instead of hook_field_attach_$op.
    The hook_field_$op namespace will already be taken when we de-opify hook_field (field type implementations), and as Barry pointed out in the original post ('Constituencies'), those three APIs are really targeted at different uses and audiences, so we need to clearly separate them. The hooks we're considering here are actually related to the action of attaching them to entities...

  • if we implement attaching fields to taxonomy terms, do we want to treat/reimplement vocabularies as types?
    maybe not "reimplement", but being able to assign different fields to different vocabs would sure be valuable.

field_attach_load for

yched's picture

- field_attach_load for multiple content types cannot be implemented in a single query because the tables depend on content type. We can batch them together as much as possible (ie: do all the pages, then all the stories)
That's how load_multiple works in CCK HEAD currently - although it doesn't join tables (yet ?) and still does one query per field per content type. This can probably be made smarter, there are at least 2 strategies we might want to consider.

- Using 'vid' 0 for non-revisioned objects should work. +1 also on not hardcoding any knowledge on id fields for nodes, users etc in fields.module. I do think we'll need 'fieldable types' to expose the name of their id fields :

function node_field_metadata() {
  return array(
    'node' => array(
      'id' => 'nid',
      'revision id' => 'vid', // Means that nodes have 'revisions'
      'cacheable' => FALSE, // If node keeps its own cache
    ),
  );
}

- "invoke a large number of uniquely-named hooks, many of which will never exist", vs "invoke a large (but much smaller : at most one per module) number of hook_field_attach_[op] hooks, each of which will have an if statement at the beginning which filters out the invocations that particular implementation does not care about" :
The theme registry allows for something like the first option in theme-land (functions and template suggestions are pre-resolved). Not sure the code registry is comparable, though...

...the tables depend on

David Strauss's picture

...the tables depend on content type.

Are we certain we're keeping that part of the design?

No.

bjaspan's picture

Is anything ever certain in life?

There was lots of discussion about switching from per-type wide tables to all fields in their own tables; in fact, that was our original proposal from DADS. We got lots of push-back from Moshe, Earl, and others based on "too many joins." I know you've also suggested/considered per-field-type storage, though I assume that will get the same pushback.

You are the database expert. If you have a compelling argument to make for a different representation, by all means do so. :-)

Death and taxes

David Strauss's picture

We should go with per-type tables. Yes, the JOINs will slightly increase, but the sole condition under which the current model reduces JOINs is when, of the fields you're using, at least two are single-content-type and single-valued. And because such queries often already live in temporary table/filesort land, the marginal gain from fewer JOINs isn't great. I'd rather spend engineering effort on optimizations that have more dramatic impacts.

But, why per-type instead of per-field?

  • We don't have to modify the schema unless we're installing a module that creates a new field type.
  • We can still enforce the same integrity constraints.
  • We can easily and dramatically reduce the number of queries to load fields. (All text fields in one batch, etc.) Even with JOINs, you can't easily load multiple multi-valued fields if they're in different tables. The gains here multiply when we load multiple items at the same time.
  • Per-field tables often result in excessive numbers of tables.

As for old revisions, we won't want to keep those in the main tables. We can either move them to per-type old-revision tables or do something more dramatic.

I'm still willing to hear arguments for per-field storage, especially if it's less efficient to use an "fid" (field ID) key than pull from a different table. It will take a lot more to convince me that we should keep the current hybrid model, especially with its unreliable and complicated methods to handle single- to multi-valued field reconfiguration.

Per-field-type tables

bjaspan's picture

You say that per-content-type storage only saves joins/queries when 2+ fields are per-type and single-valued. That's true, but many of the sites I've built have content types with LOTS of per-type-single-value fields, sometimes up to 30 of them. So per-content-type storage saves LOTS of queries over per-field storage. I would say that forcing per-field storage for everything is D.O.A. for this reason.

Now, per-field-type storage does not have the same problem. Even if you have 30 single-value fields, you probably only have a few field types (integer, varchar, text, date). Four queries isn't so much worse than one, especially if you get multi-loading out of it.

HOWEVER, with per-field-type storage, you are trading queries for PHP execution time. Suppose you query your single text-field table for 15 different fields. Now you have to iterate through the results and break them up into the different field elements of the return object, whereas with per-content-type storage you get each value nicely separate straight from the db.
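The PHP-side work mentioned above is roughly the regrouping loop sketched here. The row layout and column names (id, field_name, delta, value) are assumptions for illustration, not a proposed schema.

<?php
// Rows as they might come back from one shared text-field table.
$rows = array(
  array('id' => 12, 'field_name' => 'field_subtitle', 'delta' => 0, 'value' => 'Hello'),
  array('id' => 12, 'field_name' => 'field_tags', 'delta' => 0, 'value' => 'drupal'),
  array('id' => 12, 'field_name' => 'field_tags', 'delta' => 1, 'value' => 'fields'),
  array('id' => 15, 'field_name' => 'field_subtitle', 'delta' => 0, 'value' => 'World'),
);

// Break the flat result set up into per-object, per-field value lists.
$objects = array();
foreach ($rows as $row) {
  $objects[$row['id']][$row['field_name']][$row['delta']] = $row['value'];
}
?>

This is one array assignment per row, so whether it outweighs the saved queries is exactly the kind of question the performance numbers would have to answer.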

The upshot of all this is that I think per-field-type storage might be a great idea. But we already have per-content and per-field implemented. Before we go to the extra time and effort of changing the current implementation to a new design, I'd like to see some performance numbers that consider both the database and PHP execution time.

I'll add this as a goal for the sprint and we can debate priorities during the planning session.

Separate revision tables

bjaspan's picture

I think your suggestion of keeping revisions in a separate set of tables is a good one. I guess each insert operation would become two operations, one insert/update into the "current revision" table and one insert into the "old revisions" table.

What is your "more dramatic" suggestion?

I'll add "separate revision tables" as a sprint goal to be prioritized during the planning session.

Yes, we would push the old

David Strauss's picture

Yes, we would push the old revision into the archive table and update the other one in-place. It would be kind of like "title" in {node}, but consistently applied as a technique for all fields.

"More dramatic" would be something like a pickling strategy, where we serialize and compress old revisions.
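The archive-then-update step can be sketched with two arrays standing in for the "current" and "old revisions" tables. The function name and array layout are hypothetical.

<?php
// Stand-ins for the two tables: current values keyed by object id, and an
// archive keyed by object id and then revision id.
$current = array(12 => array('vid' => 30, 'value' => 'old text'));
$archive = array();

function example_save_revision(array &$current, array &$archive, $id, $new_vid, $value) {
  if (isset($current[$id])) {
    // Push the outgoing revision into the archive.
    $old = $current[$id];
    $archive[$id][$old['vid']] = $old['value'];
  }
  // Update the current row in place.
  $current[$id] = array('vid' => $new_vid, 'value' => $value);
}

example_save_revision($current, $archive, 12, 31, 'new text');
?>

Reads of the current value never touch the archive, which is the point of keeping old revisions out of the main tables.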

Serializing and compressing

bjaspan's picture

Serializing and compressing old revisions sounds like a good feature for a contrib module.

Somehow this led me to a couple of vaguely related thoughts:

  • I think the main field tables should still contain a vid column to indicate the current value (as with node table). That way, you can load the current version and find out the vid, or do a join based on current vid to another table, without having to join to the field revisions table. Agree?
  • I previously suggested non-versioned data could always store a vid of 0. Perhaps NULL would be better to semantically indicate the difference between "not versioned" and "version 0." I've never really known the up/downsides of nullable vs. non-nullable columns, besides requiring an extra bit of storage.
  • Do we need an API to operate on multiple revisions at a time (besides delete), like "load all revisions of field F for node 12"?

per-field-type storage won't

yched's picture

per-field-type storage won't float, because fields of the same type might require different column settings (text : formatted/plain, max length...)

Different column settings

David Strauss's picture

Different column settings would qualify as different types.

Problematic, because widgets

yched's picture

Problematic, because widgets and formatters are attached to field types.
Means a textfield widget for a 100 char field cannot handle a 200 char field.

I've already said that

David Strauss's picture

I've already said that 100-char and 200-char fields would be considered different data types because they would have different column configurations in the database. Furthermore, field data types do not have 1:1 mappings to widgets and formatters.

object ids

bjaspan's picture

I am not a fan of catch's suggestion:

In hook_nodeapi_load() the node ids are passed as the keys of the $nodes array ... just not as an extra parameter. Same with taxonomy terms and tid, potentially the same for users. There's no reason we couldn't key $nodes by vid for field_attach_load() in a similar way... except on delete, but then maybe we just need an extra parameter for the delete op.

First, I do not like the idea of passing one set of data to load, save, form, validate, etc. but a different set to delete.

Second, I do not think passing the array of nodes keyed by vid will even work. The field tables are going to have columns for id (object_id, not necessarily nid) and vid (again, not necessarily $node->vid). For non-node objects, there is no nid... so how would field_attach_load() know that for nodes it is being passed a vid key but for users a uid key? Answer: it wouldn't, unless we effectively special-cased it. Bzzz.

I also observe that while we know that node nids are all unique and node vids are all unique, we should not assume that ALL version ids for all kinds of objects are unique (presumably the object ids themselves are all unique within a class). I'm sure there are web services that export data objects all of whose version numbers count 0, 1, 2, etc.

This implies that the primary key of all field tables will have to be (object_type, id, vid), not just (vid) as with our current tables, because (object_type, vid) will not always be unique. And we cannot make any queries of the form

[OPERATION] content_field_foo WHERE object_type = 'xyz' and vid=nnn

We will always have to include the id column in the WHERE clause.
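In Drupal Schema API array form, such a table might be declared roughly as below. The column names follow the description above; the column sizes and the 'value' column are placeholders, not a proposed definition.

<?php
$schema = array();
// Sketch only: a field table whose primary key is (object_type, id, vid).
$schema['content_field_foo'] = array(
  'fields' => array(
    'object_type' => array('type' => 'varchar', 'length' => 32, 'not null' => TRUE, 'default' => ''),
    'id' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0),
    'vid' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0),
    // Placeholder for the actual field data column(s).
    'value' => array('type' => 'text', 'not null' => FALSE),
  ),
  'primary key' => array('object_type', 'id', 'vid'),
);
?>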

At the moment I think we have two workable options:

  1. Pass in an array of objects, ids, and vids separately.
  2. Pass in just the array of objects and use yched's hook_field_metadata() idea to define what the id/vid keys are, and get them out of the objects directly.

#2 does seem better. "When in doubt, add another layer of indirection"?
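A minimal sketch of option 2, assuming metadata in the shape yched proposed. The function names and the 'user' entry are hypothetical, and vid 0 for non-revisioned objects follows the earlier suggestion in this thread.

<?php
// Metadata in the proposed shape; the 'user' entry is an assumed example.
function example_field_metadata() {
  return array(
    'node' => array('id' => 'nid', 'revision id' => 'vid'),
    'user' => array('id' => 'uid'),
  );
}

// Hypothetical helper: read (id, vid) from an object using the metadata.
function example_field_extract_ids($object_type, $object) {
  $metadata = example_field_metadata();
  $info = $metadata[$object_type];
  $id = $object->{$info['id']};
  // Non-revisioned object types get vid 0.
  $vid = isset($info['revision id']) ? $object->{$info['revision id']} : 0;
  return array($id, $vid);
}

$node = (object) array('nid' => 12, 'vid' => 31);
$user = (object) array('uid' => 7);
list($nid, $node_vid) = example_field_extract_ids('node', $node);
list($uid, $user_vid) = example_field_extract_ids('user', $user);
?>

The Field API code never mentions nid or uid directly; all knowledge of key names stays with the module that defines the object type.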

agree on #2

catch's picture

That seems like the cleanest way to do this, and avoid problems down the line - might also handle the tnid issue as well.

I wouldn't be so certain

David Strauss's picture

I wouldn't be so certain about revision IDs in the main tables. I'd like to eliminate those. We can store old revisions elsewhere, where they don't crowd our indexes and slow down our queries.

Field Property

neoliminal's picture

Should differences in field source be dealt with in core or on a module level?

Should a core property for fields be source ("internal" or "external")? This designation would describe the field as populated by the internal system (either by user input or php/drupal fabrication) or by an external server.

--
John Kipling Lewis


Field API and Profile

specmav's picture

I know that "fields in core" is a long way off and the complete specification is still being tinkered with. I am curious to know if there is a movement to make the "fields in core" module work with objects other than nodes (mainly using fields with profile module)?

Thanks

Yes

mfer's picture

Fields in core is well underway and fields can be attached to any object.

Matt Farina
www.innovatingtomorrow.net
www.geeksandgod.com
www.superaveragepodcast.com
www.mattfarina.com

Awesome, I was contemplating

specmav's picture

Awesome, I was contemplating using content profile in the future, but since "fields in core" will allow me to use other more specialized fields, I think I will use profile module for the time being until I upgrade to D7.

The upgrade path to D7's

yched's picture

The upgrade path to D7's 'fields on users' should be much easier from D6 Content Profile than from core's profile.module :-)

Well...I won't be adding

specmav's picture

Well...I won't be adding user profiles for a while, so I guess I will take a "wait and see" approach. Also upgrade paths in core are always supported. And since profile.module is part of core, that gives me at least some reassurance.
