Proposed Content Model 1

Posted by bjaspan on February 6, 2008 at 2:09pm

Our concept for this Design Sprint is to use "fields in core" as a lens for thinking about Drupal's future data model needs. We will then propose a "Minimum Viable Product" for putting fields into core in a very simple way given our current node-based data model but hopefully do so in a way that provides a path towards supporting our vision of the future data model.

After two days of flaming, I'll summarize our "vision" as I understand it:

The future is about web services; if Drupal has no value in a web services world, it is dead.
Drupal's special sauce is that it provides a platform for contrib authors everywhere to write all kinds of value-adding functionality for content. So we have comments, fivestar, views, gmap, and all the other things we can add to and ways in which we can use what are currently nodes.
Therefore, we need to make Drupal into something that allows all our great contrib modules to provide their functionality not just for content stored locally in SQL but also content that comes from web services.
We also need to make Drupal into something that can provide its data as a web service. That's more of a Data API question and, as explained below, this article will not have much to say about that.

This document proposes what I'll call Content Model 1. I do not really remember what we called models 1 and 2 yesterday. In this document I'm describing the model that, I think, I had in mind at my most clear-feeling point yesterday.

This model is about content, about how what we currently understand as Nodes and Fields can mature into a way to interact with multi-source data. I am specifically not thinking about how a User might exist in this model and certainly not how a View object might exist in this model (a View object represents a way to retrieve and display content, so the content model needs to support Views, but I am not going to talk about how to load and save the View object itself). This statement is relevant because part of our original goal was "a consistent Data API for performing CRUD operations on any kind of object (user, node, view, frobnitz)" and this document DOES NOT address that (though it is not inconsistent with such an API).

The Artwork use case

I am going to use the Artwork example we've been working with as a use case. I'll clarify it somewhat first. I'm sure Larry has a very clear idea here already because he's actually implemented something for a client, but we haven't gone into much detail yet, so this may prove slightly different from what he has in mind or has done.

The ICA (Institute of Contemporary Art) is a big museum with a lot of works of art in it (paintings under glass, sculpture, etc). They have an existing "legacy" database (e.g. in Oracle) of their artwork. They have implemented a web service for accessing their data. The web service is an RPC-style service. It provides methods like:

Search. Search art by artist, year, type (painting, sculpture), keywords on text descriptions, etc. Returns a list of identifiers that match.
Load. Give me the full data object for a particular piece of art. Takes an id, returns the data in XML format. An AJAX consumer of the load method can then whip up some HTML to display the art in a web browser.
Save. If I'm the appropriate ICA employee, I can create or modify existing entries in the database.

Today if we wanted to use something like this with Drupal we'd have two choices:

We could import a large number of ICA-Art nodes into Drupal with an appropriate set of fields. Then the site could display, search, comment on, vote on, etc., all the art. We would not be using the web service or its backing database; we have essentially copied the database into Drupal.
We can drop in a bit of JavaScript or an iframe or something provided by the ICA to display ICA art info on my site. Maybe the ICA snippet gives search capability, etc., but it is all within a specific block on my site and is just AJAX-retrieved data that gets displayed. I cannot create a comment (to be stored in Drupal) on a piece of art because the art does not exist anywhere in the Drupal system.

The Content data flow

We've talked a lot about Entities. I'm going to make up a new name, Content. Content is the interface that things which are content implement. A Content may turn out to be an Entity; maybe the Data API operates on any Entity (users, Contents, artworks, frobnitzes, etc.) for consistent CRUD operations. I'm not sure yet.

Later, I'm going to use a pseudo-syntax for defining an Interface in the Java sense (btw PHP does support interfaces if we actually use classes but this Content Model 1 is not necessarily tied to a class structure, it could be procedural). For now, I'll talk through how Drupal interacts with Content objects. It is going to look an awful lot like how Drupal interacts with Node objects with a critical difference described at the end.

Suppose the ICA exposes a web service for its art. The Drupal ICA.module allows a Drupal site to use this content. It implements the Content interface for data from the ICA webservice. The developers who create the ICA's actual website wrote this module so the ICA's own website can display the ICA's art. It's possible the ICA staff can log in to the same public website and because of their roles can use ICA.module to execute the ICA webservice save method. (More likely, by the decision of the ICA IT department, the ICA webservice save method can only be invoked from within their internal network by some non-Drupal application, in which case the ICA.module just does not use the webservice save method; it provides read-only access. But for this example, let's suppose it is read/write.) Furthermore, the ICA.module developers who are ICA staff also contribute the ICA.module to the d.o contrib repository. Now, if I like art, I can install the ICA.module on my own site and immediately it appears like I have all the ICA's art info on my site just like I might have Flickr photos on my site.

Alternatively, maybe the ICA exposes its web service but does not care about Drupal at all. Some contrib author writes the ICA.module to expose the ICA artwork via the Drupal Content interface. Same deal.

When installed, ICA.module uses a Drupal API to create an ICA-Art Content type (what we currently think of as a node type). It specifies:

ICA-Art is controlled by the ICA.module. This means Contents of type ICA-Art use hooks ica_load, ica_save, ica_view, ica_search.
ICA-Art has a controller-provided set of fields:
- Title, artist name, year created, artist's description, critic's review. These use a Text input widget (for whoever has permission to edit an ICA-Art Content) and a Text formatter for display.
- Image. This uses a File Upload widget (e.g. whatever imagefield currently uses) and any formatter that can handle images (e.g. Default which is a simple img tag, or imagecache formatters).
- The controller-provided set of fields are equivalent to the fields that node.module currently provides for all nodes: title, body, uid, created, etc.
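
A minimal sketch of what such a type declaration could look like as plain data. The function names, array keys, and field machine names here are assumptions made for illustration; the proposal does not define this API.

```php
<?php
// Hypothetical sketch: these names and keys are not existing Drupal APIs.

/** Returns the declaration a controller module might register for its type. */
function sketch_ica_content_type_info(): array {
  $text = array('widget' => 'text', 'formatter' => 'text');
  return array(
    'ICA-Art' => array(
      'controller' => 'ica', // hooks resolve to ica_load, ica_save, ...
      'fields' => array(
        'title' => $text,
        'field_artist_name' => $text,
        'field_year_created' => $text,
        'field_artist_description' => $text,
        'field_critic_review' => $text,
        'field_image' => array('widget' => 'file_upload', 'formatter' => 'default'),
      ),
    ),
  );
}

/** Builds the name of the controller function for an operation on a type. */
function sketch_content_hook_name(array $types, string $type, string $op): string {
  if (!isset($types[$type])) {
    throw new InvalidArgumentException("Unknown Content type: $type");
  }
  return $types[$type]['controller'] . '_' . $op;
}

$types = sketch_ica_content_type_info();
echo sketch_content_hook_name($types, 'ICA-Art', 'load'), "\n"; // ica_load
```

The controller name is the only thing the dispatcher needs in order to find ica_load, ica_save, ica_view, and ica_search for this type.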

A user visits the home page and types "monet" into the search box. A search module runs hook_search for every kind of Content (or whatever subset it is configured for), including ica_search, and merges their output to form the displayed results list.

ica_search uses the web service to implement the search and returns its results, a list of records matching "monet", each containing:

id. An ICA-Art id (equivalent of a node.nid). The ids can be of whatever arbitrary format is defined by the web service (an integer, a string like ICA17, a unique URL, whatever) but to us it is just a unique opaque identifier.
link. The link to the full item such as icaart/ICA17. This is a Drupal path. ica_menu() declared this path and it calls the ica_page_view() callback.
display. This is whatever info should be displayed in the results.
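
The search flow above might be sketched as follows. The function names and sample records are made up for illustration, and the two search callbacks only filter in-memory arrays where real ones would call the web service or query SQL.

```php
<?php
// Hypothetical sketch: function names and sample data are illustrative only.

/** Stand-in for ica_search(); a real one would call the ICA web service. */
function sketch_ica_search(string $keys): array {
  $catalog = array(
    'ICA17' => 'Sample painting by Monet',
    'ICA42' => 'Sample sculpture by Rodin',
  );
  $results = array();
  foreach ($catalog as $id => $label) {
    if (stripos($label, $keys) !== false) {
      $results[] = array('id' => $id, 'link' => 'icaart/' . $id, 'display' => $label);
    }
  }
  return $results;
}

/** Stand-in for a node search; a real one would query the local database. */
function sketch_node_search(string $keys): array {
  $nodes = array(12 => 'My trip to see the Monet exhibit');
  $results = array();
  foreach ($nodes as $nid => $title) {
    if (stripos($title, $keys) !== false) {
      $results[] = array('id' => $nid, 'link' => 'node/' . $nid, 'display' => $title);
    }
  }
  return $results;
}

/** Runs every configured search callback and merges the result lists. */
function sketch_search_all(array $callbacks, string $keys): array {
  $merged = array();
  foreach ($callbacks as $callback) {
    $merged = array_merge($merged, $callback($keys));
  }
  return $merged;
}

$results = sketch_search_all(array('sketch_ica_search', 'sketch_node_search'), 'monet');
foreach ($results as $result) {
  echo $result['link'], ': ', $result['display'], "\n";
}
```

Each record carries the same three keys regardless of source, so the search module can render one list without knowing which results are local.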

The user clicks on one of the ICA-Art item titles, requesting the Drupal path icaart/ICA17. ica_page_view() calls drupal_load('ICA-Art', 'ICA17') to load the Content item of that type and id. This calls ica_load() because Drupal knows the ICA-Art is controlled by the ica.module (I'll get to the hook_nodeapi equivalent, hook_content(), later). What ica_page_view() gets back is an object that implements the Content interface and it is guaranteed to have at least the fields that ICA.module originally defined it to have. In our current rendering model where the page callback returns $content HTML, ica_page_view() calls drupal_view() on the Content object and returns it.

<?php
function ica_page_view($id) {
  $content = drupal_load('ICA-Art', $id);

  // $content->title contains the title
  // $content->field_artist_name contains the artist name
  // etc

  return drupal_view($content);
}

// This is called by drupal_load().
function ica_load($id) {
  $xml = ica_web_service_get($id);
  
  $content = create_object_with_appropriate_fields_based_on_data_from($xml);
  return $content; // which implements the Content interface
}

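// This is called by drupal_view().
function ica_view($content) {
  // A very dumb HTML rendering. The values come from a remote service, so
  // escape them with check_plain() before output.
  $title = check_plain($content->title);
  $src = check_plain($content->field_image[0]['href']);
  return '<h2>' . $title . '</h2><img src="' . $src . '">';
}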
?>

So far I've basically just described the way Drupal works with nodes. What's the point of all this?

We can now browse and search ICA-Art Contents as if they existed directly on our local site.
WE HAVE NOT TOUCHED THE DATABASE. There is no entry in the "node table" for ICA-Art contents. There is no "ICA-Art" table. The data does not exist locally.
ICA-Arts ARE NOT NODES. They are Contents.

Node is now also an implementation of the Content interface. They are controlled by the node.module. Nodes have an id (nid) and some controller-provided fields (title, uid, created, status). They implement hook_load, _save, _search, _etc. When you perform these operations on nodes, node.module uses the SQL database to fulfill them.
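
One way to picture the Content interface with two controllers behind it is the sketch below. It uses classes only for brevity; the interface name, class names, and sample data are assumptions, and both backends are faked with arrays.

```php
<?php
// Hypothetical sketch: the interface, classes, and sample data are assumptions.

interface ContentController {
  public function load($id);
  public function view($content): string;
}

/** ICA-Art: backed by a remote service, faked here with an array. */
class IcaArtController implements ContentController {
  private $remote = array('ICA17' => array('title' => 'Sample painting'));

  public function load($id) {
    if (!isset($this->remote[$id])) {
      return null;
    }
    // A real controller would call the web service and parse its XML.
    $content = (object) $this->remote[$id];
    $content->type = 'ICA-Art';
    $content->id = $id;
    return $content;
  }

  public function view($content): string {
    return '<h2>' . htmlspecialchars($content->title) . '</h2>';
  }
}

/** Node: backed by local SQL, faked here with an array of rows. */
class NodeController implements ContentController {
  private $rows = array(12 => array('type' => 'story', 'title' => 'Sample story'));

  public function load($id) {
    if (!isset($this->rows[$id])) {
      return null;
    }
    $content = (object) $this->rows[$id];
    $content->id = $id;
    return $content;
  }

  public function view($content): string {
    return '<h2>' . htmlspecialchars($content->title) . '</h2>';
  }
}

/** Generic entry point: callers never need to know where the data lives. */
function sketch_load(array $controllers, string $type, $id) {
  return isset($controllers[$type]) ? $controllers[$type]->load($id) : null;
}

$controllers = array(
  'ICA-Art' => new IcaArtController(),
  'story' => new NodeController(),
);
foreach (array(array('ICA-Art', 'ICA17'), array('story', 12)) as list($type, $id)) {
  $content = sketch_load($controllers, $type, $id);
  echo $controllers[$type]->view($content), "\n";
}
```

The same idea works procedurally with ica_load()/node_load() style functions; the point is only that both types answer the same set of operations.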

Custom fields on Contents

So far I've just declared that what we currently think of as the API for nodes is now (slightly altered) the API for Contents and node.module happens to implement that API (conveniently, it almost already does). What's the point?

Now I'd like to enable fivestar voting on a variety of Content types, including nodes and ICA-Arts. Nodes exist locally, but ICA-Arts do not exist in our database.

Using the same Content Type API that ICA.module used to declare that ICA-Art Contents exist with fields title, artist name, and image, the Fivestar module (driven by its Settings page submitted by an admin) uses the Content Type API to say "ICA-Art Contents have a field called 'vote', implemented by the fivestar module."

Now, drupal_load('ICA-Art', $id) first calls ica_load($id) (because that is the ICA-Art Content controller) and then using whatever API we define for "fields in core" calls the hook to add the fivestar 'vote' field to the $content object. The fivestar module looks in its local SQL-based table to find the data:

Table field_fivestar_vote:

Type    | Id    | Vote
----------------------
ICA-Art | ICA17 | 3
story   | 12    | 5

Field tables, which in D6 contain nid, now contain a Type column identifying the Content type for this row and the ID of that particular Content, along with the field's value. Fivestar does the select using $content->type and $content->id, gets the value 3, and sets $content->field_vote[0]['value'] = 3.
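
A rough sketch of that load sequence, with an array standing in for the field_fivestar_vote table. The function names and the shape of the loader callbacks are assumptions, since the "fields in core" API is not defined yet.

```php
<?php
// Hypothetical sketch: names are assumptions; the array stands in for
// the field_fivestar_vote table shown above.

function sketch_fivestar_rows(): array {
  return array(
    array('type' => 'ICA-Art', 'id' => 'ICA17', 'vote' => 3),
    array('type' => 'story', 'id' => '12', 'vote' => 5),
  );
}

/** Field load hook: attaches the vote stored for this Content, if any. */
function sketch_fivestar_field_load($content, array $rows) {
  $content->field_vote = array();
  foreach ($rows as $row) {
    if ($row['type'] === $content->type && $row['id'] === (string) $content->id) {
      $content->field_vote[] = array('value' => $row['vote']);
    }
  }
  return $content;
}

/** Generic load: the controller runs first, then every attached field module. */
function sketch_content_load(callable $controller_load, array $field_loaders, $id, array $rows) {
  $content = $controller_load($id);
  foreach ($field_loaders as $loader) {
    $content = $loader($content, $rows);
  }
  return $content;
}

$ica_load = function ($id) {
  // A real ica_load() would call the web service here.
  return (object) array('type' => 'ICA-Art', 'id' => $id, 'title' => 'Sample painting');
};

$content = sketch_content_load(
  $ica_load,
  array('sketch_fivestar_field_load'),
  'ICA17',
  sketch_fivestar_rows()
);
echo $content->field_vote[0]['value'], "\n"; // 3
```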

Content IDs are opaque strings, but it is not efficient to change the nid column in field tables to a string. Instead we use a level of indirection. The Content interface provides a method to convert a Content ID ('ICA17') to a local integer (1234), which is what gets stored in the field table. For nodes, the translation is the identity function (nid -> nid). For ICA.module, there is a lookaside table that maps the id 'ICA17' to a local integer.
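
The indirection might look something like this sketch, where an in-memory array stands in for the lookaside table. The class and function names are assumptions; a real implementation would presumably use an auto-increment column rather than a counter.

```php
<?php
// Hypothetical sketch: names are assumptions, and the array stands in
// for a lookaside table in the local database.

class SketchLookasideIdMap {
  private $map = array();
  private $next = 1;

  /** Returns the local integer for an opaque id, allocating one on first use. */
  public function localId(string $remote_id): int {
    if (!isset($this->map[$remote_id])) {
      $this->map[$remote_id] = $this->next++;
    }
    return $this->map[$remote_id];
  }
}

/** Nodes need no table: the nid already is the local integer. */
function sketch_node_local_id(int $nid): int {
  return $nid;
}

$ica_ids = new SketchLookasideIdMap();
echo $ica_ids->localId('ICA17'), "\n"; // 1
echo $ica_ids->localId('ICA18'), "\n"; // 2
echo $ica_ids->localId('ICA17'), "\n"; // 1 again: the mapping is stable
```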

The end result is that the ICA-Art Content with ID ICA17, which does not live in our local database, now carries value-added data from a Drupal module that does live in our database.

Summary

The Content interface defines the operations that allow Drupal to load, save, search, view, etc. content. Nodes are a kind of content (node.module implements the Content interface) that happen to be based in a locally stored SQL database. Other implementations of the Content interface do not have to be based in local SQL.

Fields can be added to any Content type. A field can exist in local SQL even if the Content type it connects to isn't. A field could also be stored somewhere else (the filesystem, a remote web service, whatever) in which case it would not have a local SQL table.

This addresses the first three points of the vision from the introduction. The fourth point is implicitly addressed by drupal_load() and friends but I do not pretend to have covered it here. However, I do not see why this proposed content model would not work with a consistent CRUD API.

Comments

An image of our hydra

Posted by karens on February 6, 2008 at 2:25pm

I added an image of our hydra to the Day 2 wiki page.

golden words

Posted by moshe weitzman on February 6, 2008 at 7:58pm

this is terrific stuff. i didn't realize we were focusing on the 'remote content' problem, but solving it will bring loads of new and interesting possibilities. this document is a lovely start. i can't wait to share it with the world.

i'll proceed in the engineer's tradition of ignoring all the patently great information in the doc and focusing on the unclear ...

in your example, the ICA module is creating an ICA-Art Content type, and it provides "controller fields" like artist and so on. I can see many use cases for this. However, there are other use cases where the content type should be a regular GUI created content type and the ICA-Art is just a field on the type. Going further, the content type might also have a piece of music that comes from a different external provider. think of this content type like an art+music pairing. So now we have two external IDs on the same node.

If you want to get crazier, I might indeed want to add ratings to this node type. The object users are rating might be the pairing (i.e. the whole node) or it might be the art and music individually (i.e. the fields). I have IDs for all of these, so why not? I might have just crossed the line of being able to communicate clearly. let me know.

Remote fields

Posted by bjaspan on February 7, 2008 at 3:11am

Yes. What you are describing we called remote fields. Our current field modules (text, nodereference, etc.) happen to only retrieve stuff from the local SQL database but there is no reason they are limited to that. In fact, we already have field modules that access external services. Gmap is an example though in fact it mostly just outputs Javascript which does the accessing. A simple example is a remoteimage field whose widget accepts a URL and whose formatter displays it as an IMG tag. A more complex field can perform a web service call (cached, probably) and display the results somehow. The args for the web service call could come from the Content the field is attached to or data provided on the Content edit form.

An example of a remote web service field: Passes the contents of the Content's text fields to a web service, gets back a list of Related Links from sites all over the web, and sets $node->content['related_links']['#value'] = theme('item_list', the list of links). (Of course, it would be $content->content['related_links'] since in this model fields attach to things that implement the Content interface, of which node is just one.)

I'm actually pretty sure this is described in one of our daily summaries somewhere. Or it should be. :-)

getting my head around this

Posted by catch on February 7, 2008 at 10:08am

All pretty exciting. I didn't see taxonomy in there, so will try to describe how I think that could work to see if I understand this, at all.

Taxonomy becomes an entity. Vocabularies become entity sub-types (which can have their own sets of fields attached like node types do now). To associate other entities with taxonomy terms you'd use a reference field, which would replace term_node and deal with any entity available, including webservice-supplied content like Art, not just nodes.

Taxonomy module itself could be swapped out by an external social tagging web service, but the way that Drupal handled the associations internally would be standardised. You could potentially have two or more external webservices all hooked into each other, alongside native drupal content. Anywhere close?

Brilliant

Posted by Chris Johnson on February 7, 2008 at 10:18pm

Brilliant ideas about the Content interface and remote fields. Thanks for writing about them so clearly, Barry.

Entities don't exist yet

Posted by bjaspan on February 7, 2008 at 10:27pm

The only reference I made in my post to "entities" is to say that I do not really know what they are yet (and neither does anyone else) so it is not sensible to say anything is an entity.

I described an interface called Content (which I think should be renamed to Fieldable) that has the property of being what Fields can be attached to. The Content (Fieldable) interface looks a lot like the current node API (node_load, node_save, node_view, etc.) with some functionality factored out of node into "fields.inc" or whatever we name the place that CCK's field-handling code goes.

I think that Taxonomy can be a field, continuing to exist basically exactly as it does now but the term_node table will change from storing (nid, tid) to (content_type, content_id, delta, tid) just like other fields. I think comments, in their current "not a node" state, will also be fields in exactly this way. However, we are not planning to convert these things all at once in our initial "minimal viable implementation" of fields in core. Once fields can be applied to remote Contents, however, I think the motivation to convert everything will be powerful.

Incidentally, all prior debates on the "comments as nodes" topic notwithstanding, when I peer into the future I see that comments are actually going to be Contents (implement the Content interface, allowing fields to be attached). Content types that want to be commentable (nodes, ICA-Arts, whatever) will have a souped-up nodereference field (maybe called commentfield) added to them that references the Comment Contents that belong to this Content (e.g. "the comment nodes that belong to this node").

It is a minor point, but not

Posted by moshe weitzman on February 8, 2008 at 2:41am

It is a minor point, but not everyone is a fan of type+id table design such as (content_type, content_id, delta, tid). for example, see david's reply to barry's post at http://groups.drupal.org/node/4328#comment-12844. i'll add that current Views is a bit unhappy about this design as well. No idea if Views1 becomes more happy.

A bit snippy...

Posted by bjaspan on February 8, 2008 at 4:10am

I think my comment comes across as a bit snippy. Sorry 'bout that, I didn't mean it that way. My point was to discourage the use of/reference to undefined concepts as doing so makes mistakes very likely.

Ok this clarifies things a

Posted by catch on February 8, 2008 at 1:22pm

Ok this clarifies things a bit, thanks for going into it further (and it wasn't snippy).

Good memory!

Posted by bjaspan on February 8, 2008 at 3:57am

Damn, Moshe... that post and reply is from last June! I'd forgotten about it. It's a fair point, though; the way I propose making field tables work basically prevents them from working with referential integrity. However:

I do not believe Drupal can or ever will support referential integrity for a wide variety of reasons.
"Referential integrity" or even just "join relationships" are a meaningless concept when the object the field is connecting to exists only via a web service, not in the local SQL database.
The alternative we discussed to a table field_fieldname storing (type, id, delta, data) is a table named type_field_fieldname storing (id, delta, data). For those types that do exist in the local SQL database, the former allows a single query returning all of the objects of those types in the system (i.e. "show me all comments on the site along with title of the related object") whereas the latter requires one query per type the field is attached to, followed by a merge. Of course, a separate query will be required for every type that is not in the database anyway (e.g. comments on ICA Art pieces), so "using a single query" may also be a meaningless goal.
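
To make the trade-off concrete, here is a small sketch of the two layouts with arrays standing in for SQL tables. The table names, rows, and function names are made up, and each loop pass over a table stands in for one query.

```php
<?php
// Hypothetical sketch: arrays stand in for SQL tables; all names and rows
// are made up for illustration.

// Layout A: one table per field, rows keyed by (type, id, delta).
$field_comment = array(
  array('type' => 'story', 'id' => 12, 'delta' => 0, 'data' => 'Nice post'),
  array('type' => 'page', 'id' => 7, 'delta' => 0, 'data' => 'Thanks'),
);

// Layout B: one table per (type, field), rows keyed by (id, delta).
$per_type_tables = array(
  'story_field_comment' => array(array('id' => 12, 'delta' => 0, 'data' => 'Nice post')),
  'page_field_comment' => array(array('id' => 7, 'delta' => 0, 'data' => 'Thanks')),
);

/** Layout A: a single pass (one query) yields every value of the field. */
function sketch_all_values_single_table(array $table): array {
  return array_column($table, 'data');
}

/** Layout B: one pass per type (one query each), followed by a merge. */
function sketch_all_values_per_type(array $tables, int &$queries): array {
  $merged = array();
  foreach ($tables as $rows) {
    $queries++;
    $merged = array_merge($merged, array_column($rows, 'data'));
  }
  return $merged;
}

$queries = 0;
$from_a = sketch_all_values_single_table($field_comment);
$from_b = sketch_all_values_per_type($per_type_tables, $queries);
echo count($from_a), ' values from 1 query; ', count($from_b), " values from $queries queries\n";
```

Both layouts return the same data for local types; they differ only in how many round trips it takes, and neither helps for types that live behind a web service.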

I need to nail this all down more clearly in the near future... there is still a big risk I do not know what I'm talking about.

Views integration

Posted by moshe weitzman on February 8, 2008 at 8:25pm

I see that you deferred discussion of Views integration for the moment. But I must say that this is a vital topic and a very hard problem when your View consists of remote content and specifically those "controller fields" you mentioned. I'm thinking about it and coming up pretty blank on an efficient way to deal with them.

I've thought a bit about

Posted by bjaspan on February 10, 2008 at 3:31am

I've thought a bit about this.

Views performs a search query: select nodes with various properties. Currently, Views is "cheating" by violating the node abstraction barrier: it works directly at the SQL level instead of going through an API provided by node.module. It represents its query directly in terms of the underlying SQL tables.

For Views to be able to integrate data from multiple sources, two things are needed: (1) It will have to perform multiple queries, one per data source, and merge the results. (2) It will have to represent its queries via some data structure and pass it to a Content interface API. The node implementation of Content will convert that into SQL. The ICA-Art implementation will pass it to a web service API.
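
A minimal sketch of the two requirements, assuming a very simple query structure of equality conditions. The structure, the function names, and the shape of the RPC request are all assumptions; no such Content query API exists yet.

```php
<?php
// Hypothetical sketch: the query format and function names are assumptions.

/** Node implementation: turns the query structure into SQL plus arguments. */
function sketch_node_query(array $query): array {
  $where = array();
  $args = array();
  foreach ($query['conditions'] as $field => $value) {
    // Field names are interpolated, so accept only simple identifiers.
    if (!preg_match('/^[a-z_]+$/', $field)) {
      throw new InvalidArgumentException("Bad field name: $field");
    }
    $where[] = "$field = ?";
    $args[] = $value;
  }
  $sql = 'SELECT nid FROM node';
  if ($where) {
    $sql .= ' WHERE ' . implode(' AND ', $where);
  }
  return array($sql, $args);
}

/** ICA-Art implementation: turns the same structure into an RPC request. */
function sketch_ica_query(array $query): array {
  return array('method' => 'search', 'params' => $query['conditions']);
}

/** Runs one query per source and merges the result lists. */
function sketch_query_all(array $sources, array $query): array {
  $merged = array();
  foreach ($sources as $run) {
    $merged = array_merge($merged, $run($query));
  }
  return $merged;
}

$query = array('conditions' => array('title' => 'Monet'));
list($sql, $args) = sketch_node_query($query);
echo $sql, "\n"; // SELECT nid FROM node WHERE title = ?

// Each source would execute its translated query; here they return canned rows.
$sources = array(
  function (array $q) { return array(array('type' => 'story', 'id' => 12)); },
  function (array $q) { return array(array('type' => 'ICA-Art', 'id' => 'ICA17')); },
);
$results = sketch_query_all($sources, $query);
echo count($results), " results\n"; // 2 results
```

Anything beyond equality conditions (sorting, paging across sources, joins) is where the fidelity problem described next starts to bite.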

Not all web services expose the same kind of search interfaces. It probably will not be possible to have a system exactly like Views that works with all interesting web services. So maybe Views will continue to exist tied directly to node and some lower-fidelity system will exist that operates on the Content interface.

plug-in data sources

Posted by yched on February 9, 2008 at 8:31pm

I think an important question on 'remote fields' is how they possibly relate to a 'local' version of the 'same' field type.
I briefly raised that point in Chicago, but i think this has great potential, and possibly partially answers Moshe's concerns about views integration.

Point is :
In Barry's example, we have 'local' imagefields, and 'remote' (ICA-based) imagefields. We should have a way to say that both are 'imagefields', that can support the same set of plug-in 'actors' (widgets, formatters, future (?) validators). Once we have common formatters, we get free views integration for those remote fields, free widgets if the data source isn't read only...
Remember, it's formatters and widgets who define what field types they can act on. Create a new field type, and you lose existing formatters...

A field type is mainly defined by its data 'columns'. This currently maps to columns in the db data tables, but what this really defines is: 'What elementary pieces of data constitute a value for this field type'. If provided those 'pieces of data', formatters, widgets, etc. can do their job regardless. Whether those values come from the local db or an external data source does not really define separate field types.

Which leads me to : 'Data source' could be another sort of 'plug-in actor', much like our current widgets and formatters.
Create a field of a given type, assign a data source, assign a widget, assign formatters.
Only difference being :
Widgets and formatters are assigned to field instances and can change from one instance to another.
'Data source' (like possibly 'validators', but we don't have those yet) is assigned at the field level, and applies to all shared instances (Shared instances means : data stored in one single place)
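
As a rough sketch of this split, the field definition below names its data source once at the field level while each instance picks its own formatter. All function names, array keys, and URLs are made up for illustration; CCK has no such 'source' key today.

```php
<?php
// Hypothetical sketch: names, keys, and URLs are illustrative assumptions.

/** Data sources return the field type's 'columns'; here, just an href. */
function sketch_source_local($id): array {
  return array('href' => "files/$id.jpg");
}

function sketch_source_remote($id): array {
  return array('href' => "http://example.org/art/$id.jpg");
}

/** Formatters act on the columns and do not care where they came from. */
function sketch_format_img(array $item): string {
  return '<img src="' . htmlspecialchars($item['href']) . '">';
}

function sketch_format_link(array $item): string {
  return '<a href="' . htmlspecialchars($item['href']) . '">image</a>';
}

$field = array(
  'name' => 'field_image',
  'type' => 'image',
  'source' => 'sketch_source_remote', // field level: shared by all instances
  'instances' => array(               // instance level: may differ per type
    'ICA-Art' => array('formatter' => 'sketch_format_img'),
    'story' => array('formatter' => 'sketch_format_link'),
  ),
);

/** Fetches the value from the field's source and renders it per instance. */
function sketch_render_field(array $field, string $content_type, $id): string {
  $item = call_user_func($field['source'], $id);
  $formatter = $field['instances'][$content_type]['formatter'];
  return call_user_func($formatter, $item);
}

echo sketch_render_field($field, 'ICA-Art', 'ICA17'), "\n";
echo sketch_render_field($field, 'story', 12), "\n";
```

Swapping 'source' to sketch_source_local changes where the data comes from without touching either formatter, which is the reuse being argued for here.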

CCK currently hardcodes a 'local db' source, which would need to be abstracted out into a 'plug-in' source.
Now, I'm not familiar enough with, say, SOAP-based data querying to determine whether a generic one-fits-all 'SOAP source' plug-in is possible, or if this would need to be tailor-coded for each data provider that your site interfaces with ('ICA-SOAP' in Barry's example). But even then, after you wrote your own source management code, you still benefit from the existing goodies for the field types you use, because you didn't create your own separate field type no-one has ever heard of.

SOAP too generic

Posted by Crell on February 10, 2008 at 7:25pm

SOAP is too generic an interface to have a single-source data plugin, I suspect. In the case of the museum we just launched, it was all RPC calls that returned arrays that needed to be translated. In the case of a more OO SOAP interface, it would be nearly the entire data object / entity / content object already. SOAP is just a communications protocol, really.

RDF and SPARQL

Posted by Chris Johnson on February 11, 2008 at 9:26am

I think you may want to look at RDF and SPARQL for remote data source / web services.

Clients for Services.module ?

Posted by yched on February 12, 2008 at 10:04pm

Yeah, I'd expected so.
So, in some/most cases you have to write your own code to query your source anyway. I don't think this brings down the 'field source plug-in' idea. You still gain more in writing your tailor-made 'field source plug-in' rather than pure custom code that doesn't integrate with Drupal 'content' concepts and ecosystem.

Maybe an interesting angle would be providing default 'client sources plugins' for the protocols the Services module implements on the 'server' side. This lets drupal sites freely exchange data 'out of the box', while being still potentially open to other repositories.

shared fields across different 'fieldable' types

Posted by yched on February 13, 2008 at 12:39am

Field tables, which in D6 contain nid, now contain a Type column identifying the Content type for this row and the ID of that particular Content, along with the field's value

I think I remember there was a discussion about the (id, entity-type, value) schema for data tables, with 'id' being an external key to a different table depending on the value of 'entity-type'.
This probably doesn't prevent us from sharing the field anyway, but maybe by using separate tables for different 'fieldable' types:
node_field_foo, user_field_foo, whatelse_field_foo