Proposed Content Model 2

Posted by Crell on February 10, 2008 at 10:21pm

During the Data Architecture Design Sprint (or DADS, as I will henceforth call it), we discussed two general models for reaching our vision of "data anywhere, value-add anything". Those models became dubbed "Model 1" and "Model 2", because that's the order in which we happened to write them down.

Barry has already done a good job of explaining Model 1 in an earlier post. Here, I am going to lay out the structure of Model 2. I suspect that the final system, whatever it looks like, will draw heavily from both models.

As Barry described in Model 1 (and will probably be repeated in every summary we write for the next few months), Drupal's secret weapon is contrib. Once a piece of data can be made into a node, it automatically gets hundreds of extra features by magic. CCK and Views are the most common and most powerful magic, but by no means the only. However, shoe-horning everything into nodes breaks very quickly when dealing with content that doesn't fit into the "node" model neatly. In core, we have users. In my day job, I frequently have museum artwork databases. I have also considered ways to expose a DocBook tree to Drupal, such as the documentation Steven Peck has been writing for Drupal 6. I will use all of these examples in the descriptions that follow.

The Institute of Contemporary Art

This fictional museum is based on the site that, by coincidence, we were launching during the DADS, the Art Institute of Chicago. There's much more going on with that site, but for now we'll focus on the data access. (If you want to hear more, we've proposed a session for DrupalCon Boston. Go vote for it. :-) ) For consistency, though, I will continue to call it the ICA.

The ICA has their entire archive in a legacy database system. That system has a read-only SOAP interface, using RPC-style calls that return an array structure. There are several types of "thingies" that the system exposes to us: Artworks, Resources, Artists, and Categories. Each has its own integer key-space, in addition to between 2 and 30 single-value fields, ranging from an artwork title to the URL of the large-sized jpg of an artwork. Resources also have a type, which is one of "quicktime", "full HTML", "audio clip", "jpg", etc. Each Resource type needs somewhat different theming treatment.

In Drupal 5/6, the only options for integrating such a 3rd party system are to lazy-create a stub node that wraps a 3rd party object (which can be a bit messy) or to create a completely parallel system that doesn't touch nodes (or all of their yummy value-add) at all. Neither is a great solution.

Entities

In order to be able to value-add to non-local content, we need to be able to define content that is not nodes but still gets node-esque contrib-magic-connection-points. Rather than make everything a node, we generalize those connection points to a higher-level concept formerly known as Thingie and now represented by the name "Entity" (or Prince, if you prefer).

An Entity is a common interface that value-add modules can rely on. Specifically, an Entity type has, and is defined by:

A unique key-space. Users, Nodes, Artworks, and Resources all have a separate key-space. That is, their IDs do not collide with one another. User 123, Node 123, Artwork 123, and Resource 123 can all co-exist peacefully.
A set of properties. We discussed "properties" on and off during the DADS, and didn't quite nail down what they were. The best definition we came up with was "single-value data, and defined by the Entity type rather than the sub-type".
Some set of behaviors. An Entity type is able to define extra logic that applies to it, but doesn't apply to other Entity types. For instance, the User entity type defines the logic relating to logging in a user, a concept that has no meaning for a Node or an Artist.
One or more subtypes. A subtype defines a set of field definitions that it will contain.

Not all Entity types will have or need subtypes. Nodes do now. We discussed possible subtypes for Users, and came up with possible use-cases (users authenticated by different systems, such as local or LDAP) but nothing firm. Artworks and Artists do not have subtypes; Resources, however, do. For simplicity, an Entity type that doesn't need subtypes simply has a single subtype, as "1" can always be treated as a special case of "many".

Absent from the Entity definition is whether its fields are local or remote; we established fairly early on that this is a per-field question. The key-space of the Entity type, however, may be locally or remotely controlled.

Contrib modules then value-add to an Entity by adding a field to one or more of the types of that Entity. One could add, say, a fivestar field to the Artwork Entity's only subtype, the Resource Entity's "quicktime" subtype, and the "story" subtype of the Node entity.
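As a rough sketch of how that attachment might be recorded, assume a simple in-memory registry keyed by Entity type and subtype. The helper names attach_field() and fields_for() and the registry layout are hypothetical and only illustrate the idea; a real implementation would presumably persist this somewhere.

```php
<?php
// Hypothetical helper: record that $field_name is attached to one subtype
// of one Entity type. $registry maps entity type => subtype => field names.
function attach_field(array &$registry, $entity_type, $subtype, $field_name) {
  $registry[$entity_type][$subtype][] = $field_name;
}

// Hypothetical helper: list the field names attached to a given subtype,
// or an empty array if nothing has been attached.
function fields_for(array $registry, $entity_type, $subtype) {
  return isset($registry[$entity_type][$subtype])
    ? $registry[$entity_type][$subtype]
    : array();
}

$registry = array();
// Artwork needs no subtypes, so it has exactly one, named after itself.
attach_field($registry, 'artwork', 'artwork', 'fivestar');
attach_field($registry, 'resource', 'quicktime', 'fivestar');
attach_field($registry, 'node', 'story', 'fivestar');
```

With this layout, asking for the fields of the Node "page" subtype returns nothing, because the fivestar field was attached only to "story".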

Some examples

I had originally intended to have drupal_load() always take an integer ID, to normalize it across the board, but I realized while reading Barry's Model 1 description that won't work if the entity has yet to be locally linked from a remote service. That gives an Entity two IDs:

id: The unique key in the entity's native system. In the case of Nodes and Users, it is nid or uid. For, say, a DocBook data source, it could be an XPath query string or the ID of a specific element.
localId: This key, if present, is always an integer and is used by locally-stored fields to reference the Entity. In some cases it will map 1:1 to the id (pretty much any case where the id is an integer, including nid and uid). That is an implementation detail left to each Entity to determine.

Perhaps some code will explain the structure better.

<?php
function node_view() {
  $node = drupal_load('node', 123);
  return "<h2>". $node->title() ."</h2>". $node->field('body')->value();
}

function drupal_load($entity_type, $id) {
  return new $entity_type($id);
}

class Node {
  protected $_id;
  protected $_properties;
  protected $_fields;

  function __construct($id) {
    $this->_id = $id;
    $this->_properties = db_fetch_object(db_query("SELECT * FROM {node} WHERE nid=%d", $id));
    $this->_fields = get_fields_associated_with_type('Node', $this->type());
  }
  function id() {
    return $this->_id;
  }
  function localId() {
    return $this->_id;
  }
  function type() {
    return $this->_properties->type;
  }
  function title() {
    return $this->_properties->title;
  }
  function fields() {
    return $this->_fields;
  }
  function field($field_name) {
    return $this->_fields[$field_name];
  }
  function save() {
    foreach ($this->_fields as $field) {
      $field->save($this->localId());
    }
  }
}
?>

Things to note:

Yes, it is OO. OOP is very good at enforcing an interface and hiding implementation details, which is what we want to do here.
title() is a method rather than a direct property. Rendering a human-readable string to represent an Entity is a behavior, and therefore is Entity-type specific.
All properties of the Node class are prefixed with a _ as well as made protected. That way, we can still arbitrarily add extra properties to the node without colliding with them to support older code, at least temporarily until we figure out how to make "everything a field".
Because we're behind an object interface, we can implement lazy-loading or not, or other optimizations, in the constructor, the field() and fields() methods, and in the yet-to-be-written get_fields_associated_with_type() routine. In some cases, lazy-loading makes a lot of sense. In others, such as the Artwork example above, it's more efficient to pull the entire data object at once because the backend in this case supports it. That's information that the caller (node_view()) shouldn't have to care about.
Somewhere in there it would be wise to implement at least static load caching, probably in drupal_load(), but we'll deal with that later.
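A minimal sketch of what static load caching in drupal_load() might look like follows. DemoEntity is a hypothetical stand-in for Node or Artwork that only counts constructions, and the $reset parameter is an assumption that is not part of the signature shown above.

```php
<?php
// Hypothetical stand-in for Node/Artwork; it only counts how often it is built.
class DemoEntity {
  public static $constructed = 0;
  protected $_id;

  function __construct($id) {
    $this->_id = $id;
    self::$constructed++;
  }
  function id() {
    return $this->_id;
  }
}

function drupal_load($entity_type, $id, $reset = FALSE) {
  // The cache is keyed by entity type first, because key-spaces are
  // separate: Node 123 and Artwork 123 must not collide.
  static $cache = array();
  if ($reset) {
    $cache = array();
  }
  if (!isset($cache[$entity_type][$id])) {
    $cache[$entity_type][$id] = new $entity_type($id);
  }
  return $cache[$entity_type][$id];
}

$a = drupal_load('DemoEntity', 123);
$b = drupal_load('DemoEntity', 123);         // served from the static cache
$c = drupal_load('DemoEntity', '123.456/5'); // non-integer ids work as keys too
```

Here the second call returns the very same object as the first, so the constructor (and whatever query or SOAP call it makes) runs only once per id per request.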

For comparison, the Artwork Entity might look like this, if we had to deal with non-integer ids:

<?php
class Artwork {
  protected $_id;
  protected $_properties;
  protected $_fields;
  protected $_additionalFields;

  function __construct($id) {
    $this->_id = $id;
    $data = SoapConnection::query('artwork', $id);
    $this->_fields = parse_data_out_into_field_objects($data);
    $this->_additionalFields = get_fields_associated_with_type('Artwork', 'artwork');
  }
  function id() {
    return $this->_id;
  }
  function localId() {
    // Cache per id: a plain static variable would be shared by every Artwork instance.
    static $localIds = array();
    $id = $this->id();
    if (empty($localIds[$id])) {
      $localIds[$id] = db_result(db_query("SELECT localId FROM {artwork} WHERE id='%s'", $id));
      if (empty($localIds[$id])) {
        db_query("INSERT INTO {artwork} (id) VALUES('%s')", $id);
        $localIds[$id] = db_last_id();
      }
    }
    return $localIds[$id];
  }
  function type() {
    return 'artwork';
  }
  function title() {
    return $this->field('title')->value();
  }
  function fields() {
    return $this->_fields + $this->_additionalFields;
  }
  function field($field_name) {
    // Remote (read-only) fields take precedence over locally attached ones.
    if (isset($this->_fields[$field_name])) {
      return $this->_fields[$field_name];
    }
    else {
      return $this->_additionalFields[$field_name];
    }
  }
  function save() {
    foreach ($this->_additionalFields as $field) {
      $field->save($this->localId());
    }
  }
}
?>

Note here that the title is being generated differently, and we're internally classifying fields by where they came from. Some are saveable, some are not. If we try to save an artwork, a local ID will be created if necessary for our "additional fields" to use. Again, the outside world doesn't care. All it knows is that it asks for an Artwork of ID '123.456/5', adds a fivestar rating to it, and saves the Artwork. And until we call save(), the local database is unaffected.

Still an open question, of course, is what the interface for Field will look like. We've already said we want those to always be multi-value, but beyond that, I'm not sure what they'd look like.
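Purely as a strawman, and not a proposal for the real interface, here is the smallest Field that would satisfy the value() and save() calls used in the examples above while always being multi-value. The class name DemoField, the values() and add() methods, the $delta argument, and the in-memory $storage array are all assumptions made for illustration.

```php
<?php
// Strawman only: an in-memory field that is always multi-value.
class DemoField {
  // Stand-in for a per-field table: field name => localId => values.
  public static $storage = array();
  protected $_name;
  protected $_values;

  function __construct($name, array $values = array()) {
    $this->_name = $name;
    $this->_values = array_values($values);
  }
  // All values, in order.
  function values() {
    return $this->_values;
  }
  // Convenience accessor matching the value() calls in the examples;
  // returns NULL when there is no value at that position.
  function value($delta = 0) {
    return isset($this->_values[$delta]) ? $this->_values[$delta] : NULL;
  }
  function add($value) {
    $this->_values[] = $value;
  }
  // Saves against the integer localId, as the save() methods above do.
  function save($local_id) {
    self::$storage[$this->_name][$local_id] = $this->_values;
  }
}

$rating = new DemoField('fivestar', array(4));
$rating->add(5);
$rating->save(17);
```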

For those who like UML, here's what this model would look like, I think:

Attachment	Size
model2.png	20.46 KB

Comments

Excellent.

Posted by bjaspan on February 10, 2008 at 11:04pm

Excellent writeup, Larry. The two models have a lot in common and it may turn out they are practically identical except for terminology. I think now we both need to meditate on these proposals to figure out what is the same, what is different, and what the significance is. I'll do this as soon as I get over the flu I developed in Chicago. :-/

Icky

Posted by Crell on February 11, 2008 at 1:32am

Is there anyone who wasn't sick right before or right after the DADS? Yeesh. chx and I are sitting here in -2 degree weather now. We picked a really lousy time to have a meeting, apparently. :-)

I think the basic difference between the two models is that Model 1 puts most of the differentiation on the node type (thus making nodes significantly more powerful) while Model 2 adds the same logical additions to Drupal but does so on a split level between Entity type and subtype. Both end up with the same dual-key concept and per-field data sources, which is the main take-away here.

What I'm not entirely clear on is how, for instance, I'd implement ICA Resources in Model 1. They share a key-space, but have different fields and need different theming. That's isomorphic to nodes, which share a key-space (nid) but do not share field definitions.

Thanks for the great

Posted by Frando on February 10, 2008 at 11:31pm

Thanks for the great writeup. Even if I was not able to attend the Design Sprint, thanks to the great write-ups you guys produced one can really start thinking now ;)
My initial feeling is that the two models have a lot in common in their basic principles. Working out a document outlining the differences and their significance, as Barry just suggested, sounds like a great idea.

great!

Posted by fago on February 11, 2008 at 9:47am

this is great! I think this is exactly what drupal needs. It'd be really promising for the future..
@title(): good observation!

model2 looks to be more generic, as model 1 concentrates on content. While model 1 would make content even more powerful, model2 takes the power and spreads it to all entities! Imo this is the key point of this all, make it possible for contribs to deal with all entities, not only subsets (nodes) of it.

keep up with the good work!! thanks!

Subtypes

Posted by nedjo on February 13, 2008 at 5:00pm

For simplicity, an Entity type that doesn't need subtypes simply has a single subtype, as "1" can always be treated as a special case of "many".

I've been mulling over this idea a bit more. I think we framed the problem as: how do we attach fields to entities that have no subtypes? We also discussed two other problems that I suspect are closely related:

Can we attach arbitrary fields to a single (e.g. node) instance? For example, can I decide to attach an image to a single node (e.g. of type 'story') without having all story nodes have an image field?
What is the relationship between (a) subtypes (e.g., node types) and (b) other sets of properties that similarly can be applied to entities? Take the example of event information. An event might be characterized by a combination of several fields: start time, end time, location, etc. Currently, we integrate such sets of properties in two different ways, as a subtype and as a set of properties that can be applied to all subtypes. Does the distinction really need two very different approaches? Can we instead find a single approach that can apply to either a subset of entities of a type (one node type) or all entities of a type (all nodes, or at least several types of nodes)? Say we want to make event properties available to each of six node types. Do we really need to separately attach each of these fields to each of those node types? Wouldn't that be a step backward from what we have now?

I'm now seeing these three problems as variations of the same one: how can we attach properties or sets of properties to an entity type (rather than a subtype)?

And I think we're basically right with the idea of "a single subtype", except that likely we'll want this for entity types that have subtypes as well as those that don't. Specifically, we might reserve a namespace for a generic subtype that applies to all instances of an entity type (e.g., all nodes or all users).

<?php
define('GENERIC_SUBTYPE', 'generic_subtype');

// Keeping to our current node structure for now...
$node = new StdClass();
$node->type = GENERIC_SUBTYPE;
// ...
node_save($node);
?>

Then we can attach field instances to this generic subtype and have the ability to affect all nodes (or users, or whatever).

How would we handle e.g. the use case where we want to attach an image to a single node? Assume we don't want an image field to appear on all node edit forms. So we offer a UI through which users can select from additional properties to apply to their node. For simplicity, let's say a drop-down list. Select "image" and the image field widget UI is dynamically loaded and inserted into the node form. In node load, the appropriate table is consulted and, if an image is found, it is loaded for later presentation. In subsequent node edits, the image field is automatically available since it's known the node has an associated image.

Another possibility...

Posted by nedjo on February 13, 2008 at 5:09pm

We could also consider the approach of adding a new boolean property to subtypes (e.g., node types): multiple. Singular subtypes ($subtype['multiple'] = FALSE) would be what we have now, and would remain the default. Multiple subtypes ($subtype['multiple'] = TRUE) could be applied in combination to a given entity instance, in addition to a primary (singular) subtype.

I don't follow

Posted by Crell on February 18, 2008 at 5:48am

Right now, if we consider node type to be the subtype then nodes are default multiple TRUE, not FALSE.

In the more general case, there are three levels at which fields can be associated: At the Entity type, Entity subtype, or Entity Instance. Right now, we only implement Entity subtype (node type).

The first can be sort of emulated by the subtype, since there's typically only a handful of subtypes of any given entity. I can definitely see value to being able to define it at the Entity type level, but it's not critical since we do have a workable alternative. (Comments and Taxonomy essentially do this now.) The entity instance is the level that is (a) rather hard to do right now and (b) cannot effectively be used to emulate the other two. I can definitely see the value in it, but I don't think it's critical and definitely not a mechanism on which to build the entire system.

Actually, I am pretty sure Entity instance-specific fields could be handled via a contrib. For a first cut at a Data API implementation, I'd rather focus on the known quantity of subtype-based fields and then expand out from there as possible.
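To make the three levels concrete, here is a rough sketch of how field names might be resolved if all three were supported. The lookup arrays and the resolve_fields() function are hypothetical; only the subtype level corresponds to what exists today.

```php
<?php
// Hypothetical lookup tables; names and layout are illustrative only.
$by_type = array('node' => array('comments'));
$by_subtype = array('node' => array('story' => array('body', 'fivestar')));
$by_instance = array('node' => array(42 => array('image')));

// Collect field names from the Entity type, then the subtype, then the
// individual instance, dropping duplicates while keeping that order.
function resolve_fields($entity_type, $subtype, $local_id, $by_type, $by_subtype, $by_instance) {
  $fields = isset($by_type[$entity_type]) ? $by_type[$entity_type] : array();
  if (isset($by_subtype[$entity_type][$subtype])) {
    $fields = array_merge($fields, $by_subtype[$entity_type][$subtype]);
  }
  if (isset($by_instance[$entity_type][$local_id])) {
    $fields = array_merge($fields, $by_instance[$entity_type][$local_id]);
  }
  return array_values(array_unique($fields));
}

$story_42 = resolve_fields('node', 'story', 42, $by_type, $by_subtype, $by_instance);
$story_7 = resolve_fields('node', 'story', 7, $by_type, $by_subtype, $by_instance);
```

Node 42 picks up the instance-level image field, while node 7 of the same subtype does not. Dropping the third lookup gives the subtype-focused first cut suggested above.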

There are no subtypes

Posted by bjaspan on February 29, 2008 at 7:42pm

I have not yet written my "content model bake-off" document but the more I think about it the more I realize the two models are almost the same. The major differences are terminology and the way the terms make us think about things. So I'm going to see if I can get consensus on some terms, and I'm starting with "sub-type."

Larry asserts that "Not all Entity types will have or need subtypes. Nodes do now." Actually, they don't; "sub-type" is not a concept in today's Drupal at all. What we have are content types. A content type defines (a) which module controls the node's basic operations (hook_load, hook_save, etc) and (b) through CCK, a set of fields that will be attached to nodes of that content type. We are proposing to eliminate (a) completely, at least over time, so fields will be the way that core and contrib modules add custom behavior to content types.

So, terminology. We can say that "page" and "story" are "sub-types" of nodes. Or we can say that "page" and "story" are content types whose loader (I also used the term "controller") is "node". Either way, node_load() is the function that returns the object containing the basic content.

The difference is semantic. If page and story are sub-types of nodes, then it sounds like node should control which fields get attached to nodes of those sub-types. Indeed, in Larry's pseudo-code, the Node constructor is what actually loads the fields for the sub-type. On the other hand, if the loader for content type story is node, then it sounds like node_load() is called to load the object and the higher-level "add fields to content objects" code is what loads up the various fields for the content type and attaches them to the object.

To put it another way, I see no reason that class Node and class Artwork should need to call $this->_additionalFields = get_fields_associated_with_type('Type', 'subtype') and have the additional accessor methods for _additionalFields or do anything at all about fields. Node just loads basic data from the node table. Artwork just loads an artwork object from SOAP. Code outside of these classes then loads up the additional fields that are associated with the content type and adds them on to the object (or delays doing that until someone calls the function on the content object that tries to use a field).
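One way to picture this split, as a sketch under assumptions rather than a worked-out design: the loader class knows nothing about fields, and a separate wrapper owned by the content layer attaches them on first use. BasicArtwork, ContentObject, and the field-loader callback are all hypothetical names, and the SOAP call is replaced by a stand-in.

```php
<?php
// Hypothetical loader-only class: it knows nothing about fields.
class BasicArtwork {
  public $id;
  public $title;

  function __construct($id) {
    $this->id = $id;
    // Stand-in for the SOAP request made in the Artwork example.
    $this->title = "Artwork $id";
  }
}

// Hypothetical wrapper owned by the content layer, not by the loader.
class ContentObject {
  protected $_object;
  protected $_type;
  protected $_fieldLoader;
  protected $_fields = NULL;

  function __construct($object, $type, $field_loader) {
    $this->_object = $object;
    $this->_type = $type;
    $this->_fieldLoader = $field_loader;
  }
  function object() {
    return $this->_object;
  }
  // Fields are fetched once, the first time any field is requested.
  function field($name) {
    if ($this->_fields === NULL) {
      $this->_fields = call_user_func($this->_fieldLoader, $this->_type, $this->_object);
    }
    return isset($this->_fields[$name]) ? $this->_fields[$name] : NULL;
  }
}

$loads = 0;
// Stand-in for whatever looks up the fields attached to a content type.
$loader = function ($type, $object) use (&$loads) {
  $loads++;
  return array('fivestar' => 4);
};
$content = new ContentObject(new BasicArtwork('ICA17'), 'ICA-Art', $loader);
```

Until something calls field(), the field loader is never invoked, which is the "delays doing that" variant described above.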

Am I making any sense?

Subtle differences

Posted by Crell on March 1, 2008 at 7:51pm

As a thought experiment while working on the Final Report, I took the text from Model 1 and did a find/replace on it to change "Content" and "Contents" to "Entity". Aside from some weird grammatical issues that resulted, the two descriptions were indeed extremely similar.

As I understand what you're describing, using OO terminology here, a keyspace is not defined by an entity type but by a keyspace object. A given node type / entity type / thingie type (page, story, artwork, etc.) then references that keyspace object in order to do its lookups. (The keyspace object for what we now consider nodes would be a reference to the nid/vid columns of the node table, while for Artwork it would reference out over SOAP.) In Model 2, the keyspace is defined by the entity super-type.

So in that regard, what model 2 calls a sub-type model 1 would call "multiple types referencing the same keyspace object".

(I'm thinking aloud here; please bear with me.)

So the question is, which mechanism gives us the most flexibility and power while minimizing complexity. The other issue I see, though, is that of behaviors. In Model 1, I'm unclear on how additional behaviors get defined for an entity type, like "login" for User or whatever Files would have.

In both models, I see an advantage to being able to pre-load some fields (say, that all come from a single SOAP request or single DB request) while lazy-loading others (those that come from a contrib module). If that's handled through a lazy-load interface, then the constructor for the entity object can pre-load what makes sense for that entity type.

Of course, that then pushes that logic out to the "keyspace" object in Model 1. When an Entity class is constructed, it would be passed the "entity model" object it should use. Would we then have an admin screen where users can define new entity types, and the first step is selecting which "entity model" (Node, Artwork, User, File) it uses? That sounds like it would be very complex, both to implement and for the user. It also sounds like there would have to end up being a call_user_func() inside a __call() method, which is one of the slower combinations of indirection PHP offers.

So I can see a composition method of defining keyspace and behavior as being more flexible than a class-intrinsic one, but I also see it being more complex both to implement and to understand.

Hm. We may just have to code something and see how it looks. :-)

Keeping it simple

Posted by bjaspan on March 3, 2008 at 12:06am

I think we agree completely on key space. In Model 1, key spaces are defined by modules such as Node, Artwork, User. I think this is the same as Model 2, but in the previous post you said "In Model 2, the keyspace is defined by the entity super-type." I thought entities had sub-types, so I do not know what an entity super-type is.

Aside: I just said key spaces are defined by modules. I've previously said key spaces are defined by Loaders and Controllers. This is the same model, I just haven't chosen terminology. "modules" is easy to understand because it is just how we have it now but may not be completely accurate since, for example, a contrib module may implement the Content interface (e.g. the hooks) for User, making User into a Content Type loader/controller, and it isn't user.module that provides the code, it is the contrib module.

We agree again that simplicity is key. You are unclear how Model 1 handles "additional behaviors for an entity type" and the answer is IT DOESN'T. "Content Model 1" is a content model. It addresses how to add Fields to Content Types. Functionality like how the login operation uses a User object is not affected at all and continues to work exactly how it does now, or however the community decides to change it for D7.

This last point really gets to my major difference with Model 2. You are talking in terms of Entities and you want everything (Nodes, Users, Comments, Views, Artworks, etc.) to be Entities. You want to define a unified object model for loading and operating on all Entities. That's a fine goal, and one we discussed at DADS, but it is a substantially larger task than defining how we can attach fields to object types other than nodes even if those nodes are not defined in the local database.

I want to start very small and simple by making it possible to attach Fields to anything that implements the right interface (the right set of hooks) and That's All, Folks. Let's do that and get it committed. It will be a major step forward. We can work on a unified object model next or even in parallel.

Aha.

Posted by bjaspan on March 3, 2008 at 12:13am

I understand why you think I associated keys with content types and not with loaders (e.g. node). In my Proposed Content Model 1, I show table field_fivestar_vote:

Table field_fivestar_vote:

Type    | Id    | Vote
----------------------
ICA-Art | ICA17 | 3
story   | 12    | 5

From this it is reasonable to conclude perhaps that "story" keys are different from "page" keys but that was not my intention. Notice below that I wrote, "For nodes, the translation is the identity function (nid -> nid)." That "12" is a node id, not a "story id." It is possible that having "story" in the type column is a mistake, but my intention was something like this: "There is a content_type table that says that 'story' is a content type loaded by the node module. The content_type_fields table additionally says that story Content Types have a fivestar field, and a foo field, etc. So when someone asks the Content system to 'load story 12', it knows to load node 12, and then to load all the fields that are associated with stories."
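A sketch of that lookup, with the two tables represented as plain arrays: the array contents beyond the rows mentioned above, the demo_*_load() stand-ins, and the content_load() function name are all assumptions made for illustration.

```php
<?php
// Hypothetical contents of the content_type table: content type => loader.
$content_type = array(
  'story' => 'node',
  'page' => 'node',
  'ICA-Art' => 'artwork',
);
// Hypothetical contents of content_type_fields: content type => field names.
$content_type_fields = array(
  'story' => array('fivestar', 'foo'),
  'ICA-Art' => array('fivestar'),
);

// Stand-ins for node_load() and an artwork loader.
function demo_node_load($id) {
  return array('loader' => 'node', 'id' => $id);
}
function demo_artwork_load($id) {
  return array('loader' => 'artwork', 'id' => $id);
}

// "Load story 12": find the loader for the type, load the base object by
// its native id, then note which fields belong to that content type.
function content_load($type, $id, $content_type, $content_type_fields) {
  $loader = 'demo_' . $content_type[$type] . '_load';
  $object = $loader($id);
  $object['fields'] = isset($content_type_fields[$type]) ? $content_type_fields[$type] : array();
  return $object;
}

$story = content_load('story', 12, $content_type, $content_type_fields);
$page = content_load('page', 12, $content_type, $content_type_fields);
```

Both calls go through the node loader with the same id, so "story 12" and "page 12" name the same key in the node key-space; only the attached field list differs by content type.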