One place where the Web Services and Context Core Initiative (WSCCI) and the Configuration Management Initiative (CMI) overlap is the need for a standard, canonical format to represent nodes and other entities in a non-PHP, non-SQL format. There are a number of places where that is useful:
- Including entities in exported configuration, or in configuration files.
- Taking a content snapshot in some form other than an SQL dump file (which, you know, kinda sucks for most uses).
- Transferring a node from one site to another for content sharing purposes.
- Aggregating content from many sites together for improved searching and cataloging.
- Exposing Drupal content to other non-Drupal systems. This is made easier by using non-Drupal-specific formats.
These are all problem spaces that exist in Drupal 7 now, and did back in Drupal 6, too. Various one-off solutions exist. For Drupal 8 we should have a better universal answer to this question, and be able to build common tools to support it. Those can, and should, also influence our API design to help improve external integration.
There are four general approaches that I am aware of.
- Serialized PHP: The simplest of course is to simply dump a node to PHP code using var_export(), or just use PHP's serialize() function on it. While that does result in a string representation of a node (or other entity) that can be saved to disk or sent to another site, it is a generally poor format. It is PHP-specific and Drupal-specific, serialize() output is easily corrupted, and it does not do anything to help with IDs that differ between sites or references to other entities. In short, it's not worth our consideration.
- drupal_execute() arrays: This is the approach taken by the Deploy module in Drupal 6. The basic idea is that nodes in Drupal 6 rely too heavily on the Form API for, well, everything, so saving a node and not going through a form save operation would lose half the useful data. That is fortunately not the case in Drupal 7 anymore thanks to improvements in the Field API, and of course FAPI arrays are one of the least portable formats we could come up with so it fails goals 1, 2, and 5.
- json_encode(): Deploy in Drupal 7 drops drupal_execute() in favor of running json_encode() on a node object to send between sites via Services.module. The receiving end then simply runs json_decode() and node_save() (well, a custom entity_save() routine since core has no entity_save()). That is much cleaner and more portable. JSON is a well-known standard format that is dead-simple to parse in PHP, it's understood by a wide variety of systems, and can be included in either JSON-based or XML-based configuration files. (With some escaping it's just CDATA.) However, blindly dumping a node object to JSON without thinking about its structure is not useful for external integration, because the structure is too unpredictable for anything but custom parsing.
- Atom/XML: On a previous client project in Drupal 6, I worked on a team that produced the Views Atom and Feeds Atom modules. The basic idea was to serialize nodes to a custom XML format, and then use the Atom format (RFC 4287) to wrap them for transport between sites. Atom turned out to be an excellent choice, as Atom supports multiple payload formats, including non-XML; it supports encryption (although we did not use it); it supports UUIDs for synchronizing objects to avoid content duplication; and it supports PubSubHubbub, an Atom extension that makes push-based updates possible. (And yes, there's a module for that.) It worked well, and at Palantir we're now starting work on a Drupal 7 project based on the same tools. (Expect Drupal 7 versions of that full suite soon.)
I spoke with Deploy maintainer Dick Olsson (dixon_) earlier today, and we both agreed that we really ought to standardize on one format that Deploy can use now in D7, that we can use for clients like the one I'm working with now, and that can serve as a Drupal 8 standard. There are plenty of good reasons for it and no good reasons not to standardize, and we can standardize now, even without Drupal 8 being anywhere close to a release.
We also agreed that Atom was probably the best wrapper format, since it contains a number of features (as above) that are useful when needed and can be skipped when not. It's also a well-recognized standard, which in most cases is superior to some Drupal-proprietary format.
So, let's do. And let's do while I have a client that can pay for at least some of the work to help build a common library for it. :-)
A canonical serialized entity representation should:
- Be at least somewhat human-readable, at least once the whitespace is pretty-printed.
- Be reasonably straightforward to parse in PHP.
- Be parsable by non-Drupal, non-PHP systems as well.
- Have a consistent, regular, predictable structure.
- Be supportable by any entity automatically by virtue of being an entity.
- Not try to handle everything that an entity might have on it, only those things that are fully supported. That is, entities right now have basic properties that are defined by the entity type, and they have fields. It's been common for modules to also throw any random stuff they want onto the bare object structure at various times. Those are very specifically not supported, as that makes the structure too unpredictable.
- The following workflow must work, and result in no change to an entity (this is not an API example):
$entity = entity_load($type, $id);
$string = entity_serialize($entity);
$entity = entity_deserialize($string);
entity_save($entity); // Once this API call exists.
- Be revision-aware.
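To make the round-trip requirement a bit more concrete, here is a minimal sketch. It assumes an entity is just a plain array and that the caller knows which keys are properties and which are fields; the example_entity_serialize() and example_entity_deserialize() helpers are hypothetical and exist nowhere in Drupal. Anything a module has thrown onto the entity outside those lists is deliberately dropped.

```php
<?php
// Hypothetical helpers for illustration only; not a Drupal API.

/**
 * Serializes only the declared properties and fields of an entity array.
 */
function example_entity_serialize(array $entity, array $properties, array $fields) {
  $out = array('properties' => array(), 'fields' => array());
  foreach ($properties as $name) {
    if (array_key_exists($name, $entity)) {
      $out['properties'][$name] = $entity[$name];
    }
  }
  foreach ($fields as $name) {
    if (array_key_exists($name, $entity)) {
      $out['fields'][$name] = $entity[$name];
    }
  }
  // Pretty-printing keeps the result somewhat human-readable.
  return json_encode($out, JSON_PRETTY_PRINT);
}

/**
 * Rebuilds the entity array from its serialized string form.
 */
function example_entity_deserialize($string) {
  $data = json_decode($string, TRUE);
  if (!is_array($data) || !isset($data['properties'], $data['fields'])) {
    throw new InvalidArgumentException('Not a valid serialized entity.');
  }
  return $data['properties'] + $data['fields'];
}

// A made-up entity, including a revision ID and one unsupported extra key.
$entity = array(
  'type' => 'article',
  'title' => 'Hello',
  'vid' => 3,
  'body' => array(array('value' => 'Some text')),
  'random_module_junk' => 'not supported',
);
$properties = array('type', 'title', 'vid');
$fields = array('body');

$string = example_entity_serialize($entity, $properties, $fields);
$restored = example_entity_deserialize($string);
```

Serializing $restored again should give back exactly $string, which is the "no change" property the workflow above asks for. A real implementation would get the property and field lists from the entity type definition rather than from the caller.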
Two options immediately spring to mind. One is to reuse the XML format from the Views Atom module. (Note: The sample there is namespaced, which makes it uglier to read, but the actual tags are fairly simple; please pardon the namespacing.) That has the advantage of already existing, and we can rip parsing logic out of that module into a standalone library. We could also tweak it as needed before making it a canonical format.
The other is to use JSON, but something more robust than just throwing an object into json_encode(). For one thing, in Drupal 8 entities are classed objects and will have non-public properties, so that won't even work in the first place as those non-public properties would get lost. For another, we want a more regular and non-Drupal-specific structure than that would give us.
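The non-public property problem is easy to demonstrate. The sketch below uses a hypothetical ExampleNode class (not Drupal 8's actual entity class, and the property names and UUID are made up) to show that json_encode() on an object only sees public properties, while an explicit export method can emit whatever structure we decide is canonical.

```php
<?php
// ExampleNode is a hypothetical stand-in for a classed entity.
class ExampleNode {
  public $title = 'Hello';
  protected $uuid = '00000000-0000-4000-8000-000000000000';
  private $revisionLog = 'Initial draft';

  /**
   * Returns an explicit, predictable structure for serialization.
   */
  public function toCanonicalArray() {
    return array(
      'title' => $this->title,
      'uuid' => $this->uuid,
      'revision_log' => $this->revisionLog,
    );
  }
}

$node = new ExampleNode();

// Only the public property survives: {"title":"Hello"}
$naive = json_encode($node);

// The explicit form carries all three values.
$explicit = json_encode($node->toCanonicalArray());
```

The method name toCanonicalArray() is only a placeholder; the point is that the structure is chosen deliberately rather than falling out of whatever the object happens to expose.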
(Yes, the XML vs. JSON wars have already been fought. That CMI is going XML at this point is a mark in XML's favor. Please do not simply repeat anything already said in that thread. Please.)
I will also offer that there is no intrinsic reason we cannot provide both an XML and JSON canonical form, as long as they are reasonably related. It does not have to be either/or.
References and Dependencies
Here of course comes the ugly part. Drupal entities routinely contain references to other entities. Nodereference, Userreference, Entity Reference, File fields, OG group membership, the author property of nodes... the list goes on. Of course, those references are generally entity IDs, which means they are totally and utterly useless when a node is serialized and used anywhere except right back on the same site. We need some alternate way to represent them.
I encourage everyone to read these two articles on REST before commenting on this section. They contain very valid points regarding how resources should reference each other in a REST/hypermedia form. There's some discussion of them in this earlier thread, too. Remember, the receiving system may not be a Drupal site!
Just throwing UUIDs on everything is only a partial answer. Having some sort of /entity/$type/$uuid path that we can always rely on could be a part of the solution, but perhaps not. I'm not sure here yet.
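Purely as a conversation starter, here is one possible shape for an exported reference. Everything in it is an assumption: the example_export_references() helper, the 'target_id' key, the ID-to-UUID lookup array, and the entity/$type/$uuid path pattern are all hypothetical. It does not settle the question above; it just shows what "swap local IDs for something portable" might look like.

```php
<?php
// Hypothetical helper and URL pattern; neither exists in Drupal core.

/**
 * Rewrites site-local entity IDs in a reference field to UUID-based links.
 *
 * $uuid_map stands in for whatever lookup a real site would use; it maps
 * local entity ID => UUID.
 */
function example_export_references(array $items, $target_type, array $uuid_map) {
  $exported = array();
  foreach ($items as $item) {
    $id = $item['target_id'];
    if (!isset($uuid_map[$id])) {
      // Refuse to export a reference we cannot make portable.
      throw new RuntimeException("No UUID known for $target_type $id.");
    }
    $exported[] = array(
      'type' => $target_type,
      'uuid' => $uuid_map[$id],
      'href' => 'entity/' . $target_type . '/' . $uuid_map[$id],
    );
  }
  return $exported;
}

// Made-up data: a node referencing local user 42.
$uuid_map = array(42 => '11111111-1111-4111-8111-111111111111');
$references = example_export_references(
  array(array('target_id' => 42)),
  'user',
  $uuid_map
);
```

Carrying both the bare UUID and an href leaves room for a non-Drupal receiver to follow the link without knowing anything about Drupal's ID scheme, though whether a relative path is enough is exactly the open question.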
The other question is files. Not only do we need to translate fids into something useful, using a Drupal stream wrapper URL may not be useful. Sometimes it will be; actually in the client project I have right now we do want to send over Drupal stream wrapper URLs, because we have a common file server. However, that will not always be the case. So what do we want to do here?
Wrappers and control
For both an XML format and a JSON format, the Atom spec actually provides a very nice envelope. It's a widely understood format, extensible, supports both push and pull based updates, has an extension that can push deletion notifications, and scales well once you introduce an external PuSH hub server.
Naturally not every use case will need a wrapper; if we're just saving out a serialized entity to disk, then Atom doesn't have any real purpose. For a web service wrapper, though, we could do far worse.
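For the web service case, a rough sketch of wrapping a JSON payload in a single Atom entry using PHP's DOM extension follows. The example_atom_entry() helper and all of the sample values are made up for illustration, and this is not how Views Atom does it. As I read RFC 4287, content whose type is neither XML nor text/* has to be Base64-encoded and accompanied by a summary element, so the sketch does both.

```php
<?php
// Hypothetical helper; IDs, titles, and payload are invented for illustration.
function example_atom_entry($uuid, $title, $author, $updated, $json_payload) {
  $ns = 'http://www.w3.org/2005/Atom';
  $doc = new DOMDocument('1.0', 'UTF-8');
  $doc->formatOutput = TRUE;
  $entry = $doc->appendChild($doc->createElementNS($ns, 'entry'));

  // Appends a namespaced child element with safely escaped text content.
  $add = function (DOMNode $parent, $name, $text = NULL) use ($doc, $ns) {
    $el = $doc->createElementNS($ns, $name);
    if ($text !== NULL) {
      $el->appendChild($doc->createTextNode($text));
    }
    return $parent->appendChild($el);
  };

  $add($entry, 'id', 'urn:uuid:' . $uuid);
  $add($entry, 'title', $title);
  $add($entry, 'updated', $updated);
  $add($add($entry, 'author'), 'name', $author);
  $add($entry, 'summary', 'Serialized entity ' . $uuid);

  // Non-XML, non-text payloads are carried Base64-encoded.
  $content = $add($entry, 'content', base64_encode($json_payload));
  $content->setAttribute('type', 'application/json');

  return $doc->saveXML();
}

$payload = json_encode(array('title' => 'Hello & goodbye'));
$xml = example_atom_entry(
  '22222222-2222-4222-8222-222222222222',
  'Hello & goodbye',
  'Example Author',
  '2012-01-01T00:00:00Z',
  $payload
);
```

An XML payload could instead be embedded directly inside the content element with an XML media type, which avoids the Base64 step and is one practical argument for the XML form when Atom is the transport.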