WSCCI Web Services Format Sprint Report

Crell

A small team met in Paris at the office of Commerce Guys from June 3-5 to discuss Drupal's web services serialization and syndication format needs. In short, "OK, we are going to have all of this fun new routing capability; now what do we do with it?" More specifically, how do we go about serializing Drupal data for consumption by remote programs (whether other Drupal sites, as in the case of content staging, non-Drupal sites, client-side applications, or mobile apps), and what protocols and APIs do we make available to manipulate that data?

In attendance were:

The raw notes from the sprint are available as a Google Doc, but as usual are rather disjointed and likely won't make sense if you weren't there. A more complete report is below.

Executive summary

There are no clear and obvious winners in this space. All of the available options have different serious limitations, either in their format compatibility with Drupal, their access protocols (or lack thereof), or available mature toolchain.

Our recommendation at this time is to make JSON-LD our primary supported web services data format. It is quite flexible, and supports the self-discovery capabilities we want. What it appears to lack is the set of tools and standards provided by the Atom and AtomPub specifications, which provide everything we want except for an actual data payload format. For use cases where the capabilities of Atom (such as Pubsubhubbub support) are necessary, wrapping JSON-LD strings in an Atom wrapper is ugly but technically possible. Alternatively, the JCR/PHPCR XML serialization format can serve as a forward-looking XML-based serialization when Atom functionality and true hypermedia are required.

This will require changes to the Entity system, most of which are already in progress. However, this provides new impetus to complete these changes in a timely manner. In short:

  • "Fields" get renamed to "Properties", and become the one and only form of data on an Entity. Any non-Property data on an Entity will not be supported in any way (except for IDs).
  • Properties become classed objects and include what is currently fields plus what is currently raw entity data (e.g., {node}.uid).
  • Entity Reference (or similar) gets moved into core.
  • All entity relationships are considered intrinsic on one side (the side with a reference field) and extrinsic on the other (the side referenced). That is, all relationships are mono-directional.
  • Every relationship may have a virtual Property assigned to the entity that is linked to, which stores no data but provides a mechanism to look up "all entities that reference me". That is, back-references.
  • Content metadata (e.g., the sticky bit on nodes, Organic Groups memberships, etc.) is implemented as a foreign entity with a reference.
  • The responsibility for entity storage will be moved from the Field/Property level to the Entity level. That is, we eliminate per-field storage backends.
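As an illustrative sketch of the relationship rules above (all identifiers here are hypothetical, anticipating the Entity/Property API described later in this document):

```php
<?php
// The referencing side owns the relationship (intrinsic) ...
$article->author[0]->entity();              // article -> user, via a reference Property

// ... while the referenced side only sees it through a virtual
// back-reference Property that stores no data of its own (extrinsic).
$user->authored_nodes[0]->getReferences();  // "all entities that reference me"
?>
```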

Background

There are two broad categories of web services to consider: s2s (Server to Server, Drupal or otherwise) and s2c (Server to Client, where client could be a mobile app, web app, client-side editor like Aloha or CreateJS, etc.). There is of course plenty of overlap. Both markets have different existing conventions, which frequently are not entirely compatible as they have different histories and priorities.

Entity API Revisions

In order to support generic handling of Entity->serialized translations, we need to standardize and normalize how entities are structured. Currently in Drupal 7 entities are largely free-form naked data structures. While Fielded data has a semi-regular form, its API is inadequate and much data is present on an entity via some other means. In order to handle serialization of entities, we need to either:

  1. Allow modules to implement per-property, per-serialization format bridge code. That would result in n*m elements that would need to get written by someone (whether hooks or objects or plugins or whatever).
  2. Provide a single standard interface by which all relevant data on an entity can be accessed, so that a generic implementation may be written to handle all Property types.

Given the extremely high burden the first option would place on module developers, we felt strongly that the second option would be preferable and result in better DX.

Ongoing work on the "Entity Property Metadata" in core effort has already begun this process. What we describe here is not a radical change, but more a tweaking of ongoing work.

The renaming of "Fields" to "Properties" is largely for DX. The word "field" means three different things in Drupal right now: A data fragment on an entity, a column in an SQL table, and a part of a record in Views. With Views likely moving into core for Drupal 8, eliminating one use of the term will help avoid confusion.

We therefore have a data model that looks as follows:

Entity [ Property [ PropertyItem [ primitive data values ] ] ]

Where a "Property" was called a "Field" in Drupal 7, and a "PropertyItem" was called an "Item". This is largely just a rename.

That is, an Entity object is a glorified array of Property objects. A Property object is a glorified array of PropertyItem objects. A PropertyItem object contains some number of primitive values (strings and numbers), but no nested complex data structures. (An array or stdClass object may be PHP-serialized to a string as now, but the serialization system will treat that as an opaque string and will not support any additional sub-value structure.)

Additionally, each Entity class and Property class will be responsible for identifying its metadata on demand via a method. That is, much of the information currently captured in hook_entity_info() will move into a metadata() method of the Entity class; the information currently captured in hook_field_schema() and some of that captured in hook_field_info() will move into a metadata() method of the Property class. That allows the necessary information to be available where it is needed, without having to pre-define giant lookup arrays. It also allows for that information to vary per-instance, as field schema already does now.
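As a hypothetical illustration (the class name and array keys are assumptions, echoing what hook_field_schema() returns in Drupal 7), a Property class might report its own schema on demand like this:

```php
<?php
class TextProperty implements PropertyInterface {
  public function metadata() {
    // On-demand schema information, replacing the global
    // hook_field_schema() lookup arrays; may vary per instance.
    return array(
      'columns' => array(
        'value' => array('type' => 'varchar', 'length' => 255),
      ),
      'indexes' => array(
        'value' => array('value'),
      ),
    );
  }
}
?>
```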

Entity and Property classes will implement PHP magic methods for easier traversal. A preliminary, partial, demonstration-only implementation is as follows:

<?php
class Entity implements IteratorAggregate {

  // Keyed array of Property objects.
  protected $properties;

  public function getProperties() {
    return $this->properties;
  }

  public function getIterator() {
    return new ArrayIterator($this->properties);
  }

  public function __get($name) {
    // Returns the Property named $name.
    return $this->properties[$name];
  }

  public function __set($name, $value) {
    // Sets the Property named $name.
  }
}

interface PropertyInterface {
  // Returns the stuff that was in hook_field_schema().
  public function metadata();
}

interface PropertyItemInterface { }

interface ReferencePropertyItemInterface extends PropertyItemInterface {
  public function entity();
}

// For back-references: "what references me?"
interface ReferencedPropertyItemInterface extends PropertyItemInterface {
  public function getReferences();
}

class Property implements PropertyInterface, ArrayAccess, IteratorAggregate {

  // Indexed array of PropertyItem objects.
  protected $items;

  public function offsetGet($offset) {
    // Returns a PropertyItem object.
    return $this->items[$offset];
  }

  // On Properties, as a convenience, the [0] is optional. If you just access
  // a value name, you get the 0th item. That is useful for properties that
  // you know for sure are single-value. However, because the [] version is
  // always there, this will never fatal out the way it would if the data
  // structure itself actually changed.
  public function __get($name) {
    return $this->items[0]->$name;
  }

  public function getIterator() {
    return new ArrayIterator($this->items);
  }
}

class Node extends Entity {
  // Convenience method.
  public function author() {
    // Could also micro-optimize and call offsetGet(0).
    return $this->author[0]->entity();
  }
}

class PropertyItem implements PropertyItemInterface {

  // The internal primitive values.
  protected $primitives;

  public function __get($name) {
    return $this->primitives[$name];
  }

  public function processed($name) {
    // This is pseudo-code only; the real implementation will not call any
    // functions directly but will use something injected as appropriate.
    // We have not figured out that level of detail yet.
    return filter_format($this->primitives[$name]);
  }
}

class ReferencePropertyItem extends PropertyItem implements ReferencePropertyItemInterface {
  public function entity() {
    // Look up the ID of the entity we reference, load it, and return it.
  }
}

// Individual properties can add their own useful methods as appropriate.
// This is encouraged.
class DatePropertyItem extends PropertyItem {
  public function value() {
    return new DateTime($this->primitives['date_string'], new DateTimeZone($this->primitives['timezone']));
  }
}

class ReferencedPropertyItem extends PropertyItem implements ReferencedPropertyItemInterface {
  public function getReferences() {
    // Returns a list of all entities that reference TO this entity via this
    // property.
  }
}

// For values that do not store anything, but calculate values on the fly.
interface CalculatedPropertyInterface { /* ... */ }


// Iteration example.
$entity = new Entity();
foreach ($entity as $property) {
  // $property is an instance of Property, always.
  foreach ($property as $item) {
    // $item is an instance of PropertyItemInterface, always.
    if ($item instanceof ReferencePropertyItemInterface) {
      $referenced = $item->entity();
      // Do something with $referenced.
    }
    // Do something with $item.
  }
}

// Usage examples.

// __get() returns a Property; ArrayAccess returns a PropertyItem; __get()
// on the item returns the internal primitive string called "timezone".
$node->updated[0]->timezone;

// The same path with __set() assigns the value of the timezone primitive.
$node->updated[0]->timezone = 'America/Chicago';

// __get() returns a Property. If you leave out the [0], it defaults to the
// 0th item. The entity() method returns the referenced user object, whose
// "name" property holds the actual string value.
$node->author->entity()->name->value;

// In practice, you would often use the utility methods instead.
$node->author()->label();
?>

By default, when you load an entity you will specify the language to use. That value will propagate down to all Properties and Items, so by default module developers will not need to think about language in each call. If a module developer does care about specific languages further down, additional non-magic equivalent methods will be provided that allow for specific languages to be specified. The details here will have to be worked out with Gabor and the rest of the i18n team.
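A sketch of how that might look to a module developer; the function signature and method names here are placeholders, since the details still need to be worked out with the i18n team:

```php
<?php
// Load a node in German; the language propagates down to all
// Properties and Items automatically.
$node = entity_load('node', 4, 'de');  // hypothetical signature
$title = $node->title->value;          // German title, no language handling needed

// Hypothetical non-magic equivalent for code that does care about language:
$title_fr = $node->getTranslation('fr')->title->value;
?>
```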

When defining a new Entity Type, certain Properties may be defined as mandatory for the structure; the Title and Updated properties for nodes, for instance. These properties will be hard-coded into the definition of the Entity Type, and may be stored differently by the entity storage engine. However, to consuming code that is processing an entity there is no difference between a built-in Property and a user-added Property. An Entity Type is also free to define itself as not allowing user-added Properties (effectively mirroring non-fieldable entities today).

While objects are not as expensive in PHP as they were back in the PHP 4 days, the number of specialty method calls above MAY lead to performance concerns. We do not anticipate it being a large issue in practice. If it does become one, more direct, less magical methods may be used in high-cost critical path areas (such as calling offsetGet() directly rather than using []) to minimize the overhead.
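For example, these two traversals would be equivalent; the second avoids the magic-method dispatch on the Entity in hot code paths (names again follow the hypothetical API above):

```php
<?php
// Convenient, magical version.
$timezone = $node->updated[0]->timezone;

// More verbose, but bypasses Entity::__get() and the [] operator.
$properties = $node->getProperties();
$timezone = $properties['updated']->offsetGet(0)->timezone;
?>
```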

Extrinsic information

Currently there are a number of values on some entities that, in this model, do not "belong to" that entity. The best examples here are the sticky and promote flags on nodes. This data is properly extrinsic to the node, but for legacy reasons it is still there. It is information that often should not be syndicated. Organic Groups membership is another example of extrinsic data.

We discussed the need to therefore represent extrinsic data separately from Properties. However, developing yet-another-api seemed like a dead-end. Instead, we decided that the way to resolve intrinsic vs. extrinsic data was as follows:

  • All Properties are intrinsic to the Entity the Property is on.
  • A ReferencedProperty (backlink) is not a part of the Entity itself. That is, the Entity knows about the existence of such linked data, but the data in question is extrinsic to it.
  • Extrinsic data on an Entity should be implemented as a separate Entity type, which references the Entity it describes.
  • If data links two entities but is extrinsic to both, then an intermediary entity may have a reference to both entities.

For example, core will introduce a BinaryAttribute entity type (or something like that). It will contain only two values: its own ID, and a single-value ReferenceProperty pointing to the entity it describes. There will be two bundles provided by core: Sticky (referencing nodes) and Promoted (referencing nodes). To mark a node as Sticky, create a BinaryAttribute entity of bundle Sticky. To mark it unsticky, delete that entity. Same for Promoted. (Note: Additional metadata fields, such as the date the sticky flag was created or the user that marked it sticky, may also be desired. Unmarking an entity may also be implemented not by deleting the flagging entity but by having a boolean field that holds a yes or no. That is an implementation detail that we did not explore in full, as it is out of scope for this document.)
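In code, marking and unmarking might look roughly like this (the entity type, bundle, property, and function names are all hypothetical placeholders):

```php
<?php
// Mark node $node sticky: create a BinaryAttribute entity that references it.
$flag = entity_create('binary_attribute', array('bundle' => 'sticky'));
$flag->target[0] = $node;  // the single-value reference Property
$flag->save();

// Unmark it: simply delete the flagging entity again.
$flag->delete();
?>
```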

After speaking with Workbench Moderation maintainer Steve Persch, we concluded that some such metadata (such as published) is relevant not to entities but to entity revisions. Fortunately that is easy enough to implement by providing an EntityReference Property and an EntityVersionReference Property, the latter of which references by version ID while the former references by entity ID. Which is appropriate in which case is left as an exercise to the implementer of each case.

Although not the intent, this effectively ports the functionality of the Flag module into core, at least for global flags. Only a UI would be missing (which is out of scope for this document). It also suggests how per-user flags could be implemented: a UserBinaryAttribute entity type that references both an entity and a user (specifically).

These changes would open up a number of interesting possibilities, such as much more robust content workflows, the ability to control access to the Sticky and Promoted values without "administer nodes" or even without node-edit capability, etc. We did not fully explore the implications of this change, other than to decide we liked the possibilities that it opened.

Implications for services

The primary relevant reason for all of this refactoring is to normalize the data model sufficiently that we can automate the process of serializing entities to JSON or XML. As above, we want to avoid forcing n*m (or worse, i*j*k) necessary bridge components for each entity type, property type, and output format. It also neatly separates intrinsic and extrinsic data in a way that allows us to include it or not as the situation dictates. The other DX and data modeling benefits that it implies are very nice gravy, and exploring the implications of those changes and what additional benefits they offer is left as an exercise for the reader (and for later teams).

Syndication formats

With a generically mappable data model, we then turned to the question of what to do with it. We identified a number of needs and use cases that we needed to address:

  • Exposing entities in a machine-readable format
  • Exposing collections of entities in a machine-readable format
  • Exposing entities in both raw form suitable for round-tripping back to a node object and in a "processed" format that is safe for anonymous user consumption. (E.g., with public:// URLs converted to something useful, with text formats applied to textual data, etc.)
  • A way to resolve relationships between entities such that multiple related entities could be syndicated in a single string or a series of related strings (linked by some known mechanism). E.g., A node with its author object embedded or not, or with tags represented as links to tag entities or as inline term objects.
  • Every entity (even those that do not have an HTML URI) needs to have a universally accessible canonical URI.
  • Semantically correct use of HTTP hypermedia information (GET, POST, DELETE, etc.; PUT and PATCH are quirky and of questionable use).
  • Data primitives we must support: String, int, float, date (not just as a string/int), URI (special case of string), duration.
  • Compound data types (Fields) are limited to being built on those data primitives; includes "string (contains html)".
  • Data structure inspection: Given "node of type X", what are its fields? Given "field of type Y", what are its primitives?
  • While we were not directly concerning ourselves with arbitrary non-entity data, a format that lent itself to other uses (such as Views that did not map directly to a single entity) is a strong benefit.

Given that set of requirements, we evaluated a number of existing specifications. All of them had serious deficiencies vis-à-vis the above list.

CMIS
CMIS is a big and robust specification. However, it consists mainly of optional feature sets, which would allow us to implement only a portion of CMIS and punt on the rest of it. CMIS' data model is very traditional: Documents are very simple creatures, and are organized into Directories to form a hierarchy.

CMIS also includes a number of different bindings for manipulation. The basic web bindings are designed to closely mimic HTML forms, right down to requiring a POST for all manipulation operations. They also required very specific value structures that we felt did not map to how Drupal entities are structured nor to how Drupal forms work, making it of little use.

CMIS also includes bindings for AtomPub, which is a much more hypermedia-friendly high-level API for communication. CMIS has no innate concept of internationalization, so that needs to be emulated in the data with separate data properties.

CMIS is based in XML, although a JSON variant is in draft form at this time.

Atom
Atom is an XML-based envelope format. That is, it does not define the format of a single item. Rather, it defines a mechanism for collecting a set of items together feed-like, for defining links to related content, for paging sets of content, etc. The structure of a single content item is undefined, and may be defined by the user. Atom also includes a number of useful extensions, in particular Pubsubhubbub and Tombstone, which allow for push-notifications and push-deletion. That is extremely useful for many content sharing and content syndication situations.

There are a couple of JSON-variants of Atom, including one from Google, but none seem to have any market traction.

AtomPub
AtomPub is a separate IETF spec from Atom the format, although the two are designed to complement each other. AtomPub defines the HTTP level usage of Atom, as well as the semantic meaning of various links to embed within an Atom document. (e.g., link rel="edit", which defines the link to use to POST an updated version of the document or collection.)

JSON-LD
JSON-LD is not quite a format as much as it is a meta-format. Rather, it's a way to represent RDF-like semantic information in a JSON document, without firmly specifying the structure of the JSON document itself. That makes it much more flexible than CMIS in terms of supporting an existing data specification (like Drupal's), but also means we need to spend the time to define which semantics we're actually using. That includes determining what vocabularies to use where, and which to custom-define for Drupal.

Our initial thought was to try to map entities as above to CMIS, so that we could leverage the AtomPub bindings that were already defined. We figured that would result in the least amount of "we have to invent our own stuff". However, we determined that would be infeasible. Documents in CMIS are too limited to represent a Drupal entity, even in the more rigid form described above. We would have to map individual Properties to CMIS Documents, and Entities and Language would have to be represented as Directories. However, that would make representing an Entity in a single XML string quite difficult, and/or require custom extensions to the CMIS format. At that point, there's little advantage to using CMIS in the first place.

While CMIS may work very well for low-complexity highly-organized data such as a Document Repository like Alfresco, it is less well suited to highly-complex but low-organization data such as Drupal.

Atom/AtomPub, while really nice and offering almost everything we want, are missing the one most important piece of the puzzle: They are by design mum on the question of the actual data format itself.

We then turned to JSON-LD. It took a while to wrap our heads around it, but once we understood what it was trying to do we determined that it was possible to implement a Drupal entity data model in JSON-LD. While not the most pristine, it is not too bad. We developed a few prototypes before speaking with Lin Clark and ending up with the following prototype implementation:

{
  "@context": {
    "@language": "de",
    "ex": "http://example.org/schema/",
    "title": "ex:node/title",
    "body": "ex:node/body",
    "tags": "ex:node/tags"
  },
  "title": [
    {
      "@value": "Das Kapital"
    }
  ],
  "body": [
    {
      "@value": "Ich habe Durst."
    }
  ],
  "tags": [
    {
      "@id": "http://example.com/taxonomy/term/1",
      "@type": "ex:TaxonomyTerm/Tags",
      "title": "Wasser"
    }
  ]
}

This is still preliminary and will certainly evolve but should get the basic idea across.

Of specific note, JSON-LD has native support for language variation. It's imperfect, but should be adequate to represent Drupal's multi-lingual entities.

Defining what the semantic vocabularies in use will be is another question. Our conclusion there is that the schema information provided by a Property implementation should also include the vocabulary and particular semantics that field should use.

That is not actually as large a burden as it sounds. In most cases it will be reasonably obvious once standards are developed. For instance, date fields should use iCalendar. In cases where multiple possible vocabularies exist, a Property can make it variable in the same fashion as the field schema itself is currently variable, but only on Property creation (just as it is now). If no vocabulary is specified, it falls back to generic default "text" and "number" semantics.
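For instance, a date Property might declare an iCalendar-based mapping in its metadata, yielding a @context entry along these lines (the URIs and term names here are illustrative only):

```json
{
  "@context": {
    "ical": "http://www.w3.org/2002/12/cal/ical#",
    "event_date": "ical:dtstart",
    "subtitle": "http://example.org/schema/node/subtitle"
  }
}
```

Here "event_date" opts into the iCalendar vocabulary, while "subtitle" falls back to a generic site-local term.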

As a nice side-effect, this bakes RDF-esque semantics into our data model at a basic level, which should keep all of the semantic-web fans happy. It also will ease integration with CreateJS, VIE, and similar rich client-side editors that can integrate with Aloha, which is already under consideration for Spark and potentially Drupal 8.

This does not, of course, provide the REST/hypermedia semantics we need. As far as we are aware there is no JSON-based hypermedia standard. There are a couple of proposed standards, but none that is actually a standard standard.

Symfony Live Addendum

Following the Sprint, Larry attended Symfony Live Paris, the main developer conference for the Symfony project at which he was a speaker. There, Larry was able to do some additional research with domain experts in this area.

One of the keynote speakers was David Zülke of Agavi, and the topic was (surprise!) REST and Hypermedia APIs. The session video is not yet online, but the slides were 90% the same as this presentation. It is recommended viewing for everyone in this thread. In particular, note the hypermedia section that starts at slide 95. One of the key take-aways from the session (echoed in other articles that we've checked as a follow-up) is that we're not the only ones with trouble mapping JSON to Hypermedia. It just doesn't do it well. XML is simply a better underlying format for true-REST/HATEOAS functionality, and the speaker encouraged the audience to respond to knee-jerk JSON preference with "tough, it's the wrong tool."

After the session, David acknowledged that the situation is rather suboptimal right now (XML is better for document representation, JSON for object representation; and we need to do both).

Larry also spoke with Henri Bergius, Midgard developer, author of CreateJS, and future DrupalCon Munich speaker. Henri pointed out that the JCR/PHPCR standard (Java Content Repository and its PHP-based port) has its own XML serialization format independent of CMIS. After a brief look, that format appears much more viable than CMIS, although additional research is needed. It is defined in the JCR specification, section 6.4.

Assuming JCR/PHPCR's XML serialization can stand up to further scrutiny, particularly around multi-lingual needs, it would be a much more viable option for true-HATEOAS behavior as we could easily wrap it in Atom/AtomPub for standardized linking, flow control, subscription, and all of the other things HATEOAS and Atom offer. While Atom would allow JSON-LD to be wrapped as the payload as well, wrapping JSON-LD in Atom would require both producers and consumers to implement both an Atom and a JSON-LD parser, regardless of their language. That would be possible, but sub-optimal.
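To illustrate why the wrapping is ugly, an Atom entry carrying a JSON-LD payload would look something like the sketch below (illustrative only; strictly speaking, RFC 4287 requires non-XML, non-text content to be Base64-encoded, which makes the result even less pleasant):

```xml
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.com/node/1</id>
  <title>Das Kapital</title>
  <updated>2012-06-05T12:00:00Z</updated>
  <link rel="edit" href="http://example.com/node/1"/>
  <content type="application/ld+json">
    {"@context": {"@language": "de"}, "title": [{"@value": "Das Kapital"}]}
  </content>
</entry>
```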

At this time we are not firm on this conclusion, but given the varied needs of different use cases we are leaning toward recommending the use of PHPCR-in-Atom and JSON-LD as twin implementations. Attempts at implementing both options will likely highlight any flaws in either approach that cannot be determined at this time. Whether one or both ends up in core vs. contrib should be simply a matter of timing and resource availability, as being physically in core should not provide any magical architectural benefit. (If it does, then we did it wrong.) That said, the Entity API improvements discussed above are the same regardless of format, and offer a variety of additional benefits as well.

Acknowledgements

Thank you to everyone who attended the sprint, including those who just popped in briefly during the Tuesday biweekly WSCCI meeting. Thank you also to Steve Persch and Lin Clark for their impromptu help. Thanks to Acquia for sponsoring travel expenses for some attendees. And of course thank you to Commerce Guys for being such wonderful hosts, and to Sensio Labs for bringing Larry in to speak for Symfony Live as that's how we were able to have this sprint in the first place.

Onwards!

Comments

JSON-LD is Great! Should we eval HAL too, though?

ethanw

This is awesome. There's a ton of work and insight in this report, and as someone who's been struggling to find the time for in-depth review of multiple formats I am grateful for all expertise and perspective included here.

JSON-LD is certainly very well suited to the type of data Drupal sites need to serve; it has the backing of a number of important parties, is being actively developed by a growing and engaged community that's interested in supporting the Drupal effort, and is a joy to work with when building JavaScript-based web and mobile apps. What more could you ask for!?

That said, I do wonder if it would be helpful to evaluate one last format before finalizing the choice: HAL. HAL is very similar to JSON-LD in many ways: it handles linked and structured data in JSON in a similar fashion and is designed to work with other data standards (in HAL's case, URI templates and web linking standards).

The one major advantage HAL has over JSON-LD is that it natively supports both JSON and XML representations, and is designed to make either representation easy to work with.

I think HAL also has some disadvantages: the community is not as robust and the activity level not as high as JSON-LD's, it doesn't seem to have the same range of support for semantic data and i18n, and while it is an IANA standard, I don't know if its prospects are quite as bright as JSON-LD's at this moment.

That said, if there is a fair split between developers who would prefer JSON and XML, then HAL's XML representation could make it a good fit for a general, all-purpose format.

If it would be helpful at this point, I can pull together a quick review of HAL similar to the CMIS one, to determine if it meets the general requirements and should be put into a head-to-head with JSON-LD. This has been on my list for a bit, but unfortunately has taken me a while to get to.

To be clear: I'm not advocating for HAL over JSON-LD, but do think it might merit full consideration against JSON-LD before committing to one over the other.

Momentum

Crell

To be honest, we didn't discuss HAL in great detail at the sprint. Given how fragmented the market is, my inclination is to stick with formats that have more momentum. If we want to support CreateJS (I believe we do), then we will have to support JSON-LD one way or another. PHPCR has been discussed a number of times and has some support within the high-end PHP world, and is the only XML format we looked at that was even partially viable. (CMIS, as said, doesn't really work for us.) Between those two (and Atom/AtomPub as a wrapper), that should cover most use cases.

That said, I certainly won't stop you from giving HAL a more thorough look if you think it would be valuable. :-) And worst case, with the way we've approached the problem space (normalizing Entity API first) there should not be anything to prevent someone from implementing HAL payloads alongside whatever Core does, at the same URIs, with (relative) ease. (Certainly easier than it would be in Drupal 7.)

wow

chx

While I understand that the Relation module is more complex, the ... not sure how to word this ... tone of this writeup/decision is a little bit ... ick? People want symmetrical relations: http://drupal.org/node/1293792

Different problem space

Crell

We did briefly ponder "does this mean we need Relation in core?" and concluded that it was a different use case. There are plenty of use cases for complex relationship management, but not the ones we were trying to address here, which concerned simply the line between intrinsic and extrinsic data. Making relationships mono-directional provides a very clear line that we can follow. It also in no way precludes an intermediary relationship entity that references two other entities, thus making it extrinsic to both of the nodes/users in question.

We know that there will need to be some sort of "get entities that point to me" operation, and our tentative suggestion there is a virtual backreference field. That implementation may change in, er, implementation, but that's the tentative solution. (Suggestions for better mechanisms are welcome, as long as it doesn't balloon an already not-small roadmap.)

Great news

lanthaler

This is great news! If you need any support let us know. We are more than happy to help you.

Please note that in the example you posted, the language wouldn't apply to "Das Kapital" and "Ich habe Durst.", as you explicitly define them as literals without a language. You would need to use either

{
"@context": {
  "@language": "de",
  "ex": "http://example.org/schema/",
  "title": "ex:node/title",
  "body": "ex:node/body",
  "tags": "ex:node/tags"
},
"title": "Das Kapital",      <-- can also be an array, but I think there will always just be one value, right?
"body": "Ich habe Durst.",
"tags":
[
  {
    "@id": "http://example.com/taxonomy/term/1",
    "@type": "ex:TaxonomyTerm/Tags",
    "title": "Wasser"
  }
]
}

or something like

...
"title": [
  {
    "@value": "Das Kapital",
    "@language": "de"
  }
]
...

You also stated that JSON-LD's native support for language variation is imperfect. What do you miss? What could still be improved? Any other feedback, ideas, suggestions?

Markus Lanthaler
@markuslanthaler

I haven't studied the spec,

linclark's picture

I haven't studied the spec, but they all show up as language-tagged literals in the N-Quads tab of http://json-ld.org/playground/ .... is there something in the spec that says that shouldn't be the case?

EDIT: Actually, I forgot I did study the spec for that part. It works because I put the language in the context (as per section 3.5).

At times, it is important to annotate a string with its language. In JSON-LD this is possible in a variety of ways. Firstly, it is possible to define a default language for a JSON-LD document by setting the @language key in the @context or in a term definition:

EDIT #2:

Sorry, you're right about the snippet that was in the document. I was using a different snippet that I had on my local but must not have added to the Google doc.

EDIT: Actually, I forgot I

fago's picture

EDIT: Actually, I forgot I did study the spec for that part. It works because I put the language in the context (as per section 3.5).

Yep. What we need to achieve is to have language-specific properties. I.e. we have the node tags property but possibly in multiple languages. So it wasn't clear to me whether JSON-LD "only" applies the language to string values. So if the language is in the context, is it just the default for the string values or can we actually do language-specific properties via that as well?

To make the example more clear: the "tags" property of a node would reference other resources (the tags), so there is no string value. Still, we have different values per language, which we need to represent.

Tags in different languages

lanthaler's picture

To make the example more clear: the "tags" property of a node would reference other resources (the tags), so there is no string value. Still, we have different values per language, which we need to represent.

That depends on how you model your data. Most of the time you would have a tag representing a specific concept and have labels for that concept in different languages. Something like:

  "tags": [
    {
      "@id": "http://example.com/taxonomy/term/1",
      "@type": "ex:TaxonomyTerm/Tags",
      "title": [
        { "@value": "Wasser", "@language": "de"  },
        { "@value": "Water", "@language": "en"  },
        { "@value": "Acqua", "@language": "it"  }
      ]
    }
  ]

Markus Lanthaler
@markuslanthaler

per-language tags

fago's picture

That depends on how you model your data.

Indeed. Drupal works both ways: we have translated term labels, but we can also have different terms per language for a single content node. Thus, we need a way to depict the translatable property "tags" as having different values (= references to terms) per language, e.g. something like "tags_de" and "tags_en". The question is whether it's possible to leverage the context language for that in JSON-LD?

In thinking about this, one

linclark's picture

In thinking about this, one distinction I see is between application-specific detail and true semantic difference.

Application detail

For example, let's say that I have a node that is in English and German. Term 1 is "Schadenfreude". This doesn't have an English translation, and I don't think my English speaking audience would understand what the word means, so I don't offer an English translation for the term. I also don't add the term as a term reference on the English version of the node.

This does not mean that the concept of "Schadenfreude" doesn't apply to the English version of the node, just that I don't want to display it on the node. So this is not a semantic difference and it doesn't matter whether non-Drupal consumers of our data understand the distinction. If they think their English speaking audience understands the word "Schadenfreude", then they can show it with the node content.

Semantic difference

Let's say that in the English version of a node, the body text is different enough that it contains totally separate concepts, such as "revolution". The English version then is tagged with term references to the extra concepts. Even though the term for "revolution" has a German translation and is used on other German nodes, the German text for this node does not talk about revolution, so should not contain the term reference.

What this difference means

Application—If we are trying to support an application level difference, we should consider whether the JSON-LD needs to represent that distinction. As I mention above, this distinction isn't important to non-Drupal consumers of our data, so we might be able to use application-specific processing rules.

Or we might be able to represent it in the JSON-LD by using multiple linked variants. We could provide both a full serialization (which has all term references) and then a language specific serialization (which only contains the values pertinent to that language), using something like owl:sameAs to relate the two.

Semantic—If we are trying to support a true semantic difference, we should consider whether the appropriate relationship between the two language versions is really identity. My feeling at the moment is that identity would not be the right relationship.

In thinking about this even

linclark's picture

In thinking about this even further, there are a few options for supporting semantic difference. Here are the two I think are least problematic.

  • Named Graphs
  • Object keyed by language

Named graphs

First, I started looking at JSON-LD's handling of named graphs. The idea of named graphs is that you can say that an entity has a certain property value in a specific context. If we create a named graph for each language, then we can say that certain entity references only occur within that language's graph.

JSON-LD introduced support for this a few months ago. However, this support isn't completely finalized until a W3C working group says so. It is highly unlikely, but still possible, that it will be removed. Another concern, which Manu Sporny (a JSON-LD editor) and I share, is that named graphs are confusing for most developers.

One pro of this approach is that it maintains the distinction between different language versions even when transformed to RDF... if we care about that sort of thing.

Rough example:

{
  "@context": {
    "entity": "http://example.com/node/1",
    "ex": "http://example.org/schema/",
    "title": "ex:node/title",
    "body": "ex:node/body",
    "tags": "ex:node/tags"
  },
  "@graph": [
  {
    "@id": "http://example.com/node/1/en",
    "@type": "ex:Node/Article",
    "@graph": [
      {
        "@context": {
          "@language": "en"
        },
        "@id": "entity",
        "title": ["Capital"],
        "tags": [
          {
            "@id": "http://example.com/taxonomy/term/1",
            "@type": "ex:TaxonomyTerm/Tags",
            "title": "Water"
          },
          {
            "@id": "http://example.com/taxonomy/term/2",
            "@type": "ex:TaxonomyTerm/Tags",
            "title": "revolution"
          }
        ]
      }
    ]
  },
  {
    "@id": "http://example.com/node/1/de",
    "@type": "ex:Node/Article",
    "@graph": [
      {
        "@context": {
          "@language": "de"
        },
        "@id": "entity",
        "title": "Das Kapital",
        "tags": [
          {
            "@id": "http://example.com/taxonomy/term/1",
            "@type": "ex:TaxonomyTerm/Tags",
            "title": "Wasser"
          }
        ]
      }
    ]
  }
  ]
}

Object keyed by language

Manu suggested a variation of fago's suggestion.

fago suggested that properties have different names based on the language. For example, field_tags_en and field_tags_de.

Manu suggested that we could create one level of nesting in the JSON. Instead of using the top-level JSON object to represent the entity, the top-level object would have attributes like "model_en" and "model_de". These would contain objects with the values for each language version.

This makes me slightly uncomfortable because it means creating a vocabulary term for each language. It also does not maintain the separation between different language variations when transformed to RDF... while the literal values will still be language tagged, the entity references themselves will no longer have a language context in RDF.

But, as Manu pointed out, this object structure might be more acceptable for average Web devs.

Rough example (thanks to Manu):

{
  "@context": {
    "entity": "http://example.com/node/1",
    "ex": "http://example.org/schema/",
    "title": "ex:node/title",
    "body": "ex:node/body",
    "tags": "ex:node/tags",
    "model_en": "ex:node/model_en",
    "model_de": "ex:node/model_de"
  },
  "model_en": {
    "@context": {
      "@language": "en"
    },
    "@id": "http://example.com/node/1",
    "@type": "ex:Node/Article",
    "title": ["Capital"],
    "tags": [
    {
      "@id": "http://example.com/taxonomy/term/1",
      "@type": "ex:TaxonomyTerm/Tags",
      "title": "Water"
    }, {
      "@id": "http://example.com/taxonomy/term/2",
      "@type": "ex:TaxonomyTerm/Tags",
      "title": "revolution"
    }]
  },
  "model_de": {
    "@context": {
      "@language": "de"
    },
    "@id": "http://example.com/node/1",
    "@type": "ex:Node/Article",
    "title": ["Das Kapital"],
    "tags": {
      "@id": "http://example.com/taxonomy/term/1",
      "@type": "ex:TaxonomyTerm/Tags",
      "title": "Wasser"
    }
  }
}

There was a third option that I could elaborate, but I find it more problematic than these two. We can discuss it further if neither of these seem to fit.

Machine-readable

Crell's picture

My knee-jerk response is that the second is better, not because it's simpler but because the first implies a different @id for different languages, and thus that different translations of an object are different objects. As effulgentsia notes below, that's something we're trying to move away from in Drupal 8.

Also, the point of a highly regular and standardized Entity/Property model is that, I hope, most developers won't need to deal with this format directly. They'll just take an entity, toss it into a black box encoder, take the string and toss it into a black box decoder, and get back the object they want. (There are two such black boxes available, as previously discussed.) At least on the PHP side; JS developers I can see having to deal with it more directly. Not sure there.

So immediate readability of the string is a factor, but I wouldn't say it's the most important factor; this is, mainly, a machine-readable format, and human-readable second.

My knee-jerk response is that

linclark's picture

My knee-jerk response is that the second is better... because the first implies a different @id for different languages

Because of the way named graphs work, you are still using the entity ID (i.e. node/1) as the subject of the properties. The graph ID takes on a different meaning.

If you were to transform this to RDF without named graphs, it would look like the following. The statements below have three parts, entity-property-value, and are called triples.

# Title.
<node/1> <title> "Capital"@en
<node/1> <title> "Das Kapital"@de

# Tags.
<node/1> <tags> <taxonomy/term/1>
<node/1> <tags> <taxonomy/term/2>

If you use named graphs, it maintains the same entity-property-value relationship. It also appends a fourth value to the end, which indicates the context in which this statement is true. These statements are called quads because of their 4 parts, entity-property-value + provenance.

You'll note that the Tags section has an extra quad. This is to indicate that <taxonomy/term/1> shows up in both the English version and the German version of <node/1>.

# Title.
<node/1> <title> "Capital"@en <graph/node/1/en>
<node/1> <title> "Das Kapital"@de <graph/node/1/de>

# Tags.
<node/1> <tags> <taxonomy/term/1> <graph/node/1/en>
<node/1> <tags> <taxonomy/term/1> <graph/node/1/de>
<node/1> <tags> <taxonomy/term/2> <graph/node/1/en>

However, you are right that because of the JSON structure, a developer might infer that the graph ID is the identifier for the entity, not just the context for the statements.

So immediate readability of the string is a factor, but I wouldn't say it's the most important factor; this is, mainly, a machine-readable format, and human-readable second.

In my opinion, the named graph approach is more generically machine readable. It can be transformed to other Linked Data formats in a lossless way, maintaining the distinction between the English version and the German version while still using the same entity ID.

JS developers I can see having to deal with it more directly.

One thing Manu did mention is the following:

manu-db: oh! Also, keep in mind that you can send side-band data w/ JSON-LD... to aid lookups, if you have to...
[4:59pm] manu-db: so you could include something that looks like XPATH into the JSON (JSON-PATH?) that would help the developer understand where the data they are looking for is in the JSON-LD document...
.....
[5:02pm] manu-db: it's like shipping custom indexes for your JSON-LD document based on data you know the developer is going to need to look up quickly.

The example he gave was "en_title": "obj.title[0]".

My assumption is that this side-band data would only work when the JS developer is using a JSON-LD parser... but I frankly don't know enough about JSON libraries; maybe this is something that is common for JSON? If any hardcore JS developers want to chime in, feel free.
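To make the side-band idea concrete, here is a rough sketch (in Python, purely illustrative; the resolver function and the "de_title" index key are my own invention, not part of any spec) of how a consumer could use a shipped path index like "obj.title[0]" without a full JSON-LD parser:

```python
import json
import re

def resolve_path(doc, path):
    """Resolve a dotted/indexed path like 'obj.title[0]' against a parsed
    JSON document. 'obj' is treated as an alias for the document root."""
    tokens = re.findall(r'([A-Za-z_@][\w@]*)|\[(\d+)\]', path)
    current = doc
    for name, index in tokens:
        if name == 'obj':
            continue  # root alias, nothing to descend into
        current = current[int(index)] if index else current[name]
    return current

doc = json.loads('{"title": [{"@value": "Das Kapital", "@language": "de"}]}')
# Hypothetical side-band index shipped alongside the JSON-LD document:
sideband = {"de_title": "obj.title[0]"}
node = resolve_path(doc, sideband["de_title"])
print(node["@value"])  # Das Kapital
```

The point of the sketch is only that such an index lets a consumer jump straight to a value it knows it needs, regardless of how deeply the JSON-LD structure nests it.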

Named graphs & single language?

fago's picture

I like the named graphs approach, as it keeps the semantic meaning. Maybe we can find a way to make it simpler for average devs to deal with?

Maybe the extra named-graph context layer could be dropped if we only have to serve a single language?

Let's keep in mind that we'll also have the use-case of getting the single-language properties of an entity, as pointed out by effulgentsia.

So I guess it can be assumed that the average front-end developer or mobile application developer is mostly interested in getting the entity representation in a certain language. Thus, simplifying the single-language representation should help.

+1 for the graph approach. I

scor's picture

+1 for the graph approach. I think a hybrid solution would fit several use cases:
  • a single specific language (requested by content negotiation or in the URL)
  • all languages, for content sharing and content syndication situations where you want to have the data across all languages.

In practice it would mean that consumers have to hop through the @graph level when no language is specified, but the content of the @graph could be returned when a language is specified?

Named graphs could also potentially be used for Entity API revisions: once you have the notion of graph metadata, you can not only associate a language with it, but also a version ID. I just asked Manu: even though there is no built-in versioning mechanism in JSON-LD, it's something that can be part of the graph @id:

  "@graph": [
  {
    "@id": "http://example.com/node/1/7b64cfe/en",
    "@type": "ex:Node/Article",
    ...

What we need to achieve is to

scor's picture

What we need to achieve is to have language-specific properties

I'm not sure I understand at what level the language is set: PropertyItem or primitive data values? Are the translations part of the same entity or not? Let's take the example of a node with a property "tags" which references a tag "Dog" with the URI http://example.com/taxonomy/term/1. Say I want to translate that tag (aka taxonomy term) into French. Would I need to create a different tag, or would I be able to translate the same entity property 'title' into French? I think (and hope!) you mean the latter, which would map to the example Markus posted above. In other words, the term entity is the same no matter what translation you consider: http://example.com/taxonomy/term/1 has a title available in English (Dog) and French (Chien). I know that up to D7, language tags are entity specific, so a node is either in English or German, and to translate a node into another language you have to create a new node in that language (and the system keeps track of the link between them behind the scenes). I hope this is no longer the case, and that translations are available at the field value level. This would map better to Linked Data (we've been struggling to support multiple translations in the RDF module for that reason).

field translation

fago's picture

@scor:

Yep, the goal is to have the D7 field-translation approach for everything, i.e. in-object translation.

http://example.com/taxonomy/term/1 has a title available in English (Dog) and French (Chien).

Exactly. But still, if the term-reference field "tags" is translatable, you'll have different tags per language of the content node. That's different from having the same tags for each language of the node, but with translated term names.

node:
tags[de]: 1, 2,
tags[en]: 2, 3

tag 1:
name[de]: Zehe
name[en]: foo
tag 2:
...
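The sketch above could be modeled in memory roughly as follows (a hypothetical Python illustration, not Drupal code; the names for terms 2 and 3 are made up here purely to complete the example):

```python
# The node's translatable "tags" property holds different term IDs per
# language, while each term entity carries its own translated "name".
node = {
    "tags": {"de": [1, 2], "en": [2, 3]},
}
terms = {
    1: {"name": {"de": "Zehe", "en": "foo"}},
    2: {"name": {"de": "Wasser", "en": "Water"}},       # invented labels
    3: {"name": {"de": "Revolution", "en": "revolution"}},
}

def tag_names(node, terms, language):
    """Labels of the tags attached to the node in a given language."""
    return [terms[tid]["name"][language] for tid in node["tags"][language]]

print(tag_names(node, terms, "de"))  # ['Zehe', 'Wasser']
```

Note that the two layers are independent: which terms a node references can vary per language, and separately, each referenced term has per-language labels.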

I haven't seen the proposal

linclark's picture

I haven't seen the proposal for in-object translation, so I'm not sure what that looks like yet. It seems we're still handling each language version as a separate node in D8-dev.

Based on the evolving notes from the sprint, it sounded like the JSON would only present one language version of the entity at a time. This was the line that I referenced, though I can no longer find it outside of my text in the Google Doc:

By default, when you load an entity you will specify the language to use.

Did this change during the sprint? Will one JSON document have to contain all of the entity's field translations for all languages?

If the request for the entity must specify the language, then this shouldn't be a problem. If I understand correctly, the entity returned would only include field values in the specified language (and also 'und' values). So in your example above, if the English version was requested, the field values would be the IDs for terms 2 and 3. The ID for term 1 would not be included in the response, since it only applies to the German version of the node.

We could then optionally include the term's name field in all languages, or we could limit it to the specified language. Whichever we choose wouldn't have much impact on the overall structure of the JSON (though it would have an impact on its verbosity).

But I could be missing something since I haven't been involved in i18n.

effulgentsia's picture

It seems we're still handling each language version as a separate node in D8-dev... But I could be missing something since I haven't been involved in i18n.

D8 HEAD is still set up this way, but the goal is to stop doing it that way and extend the translatable fields concept to all entity properties.

Some resources:
From http://hojtsy.hu/d8mi/content-language-and-translation:

The biggest problem however is that there are two systems [field translation and node translation], people need to choose between them and developers need to code against both. So the primary plan for Drupal 8 is to unify content language support (and translation) on the general entity.

From http://drupal.org/node/1498634#comment-6093822:

Yes, the plan is still that we'd drop tnids and the current translation.module altogether and move in a UI for field/property/entity translation.

A patch to start doing this while the rest of entity property API is being worked on in parallel: http://drupal.org/node/1658712

Will one JSON document have to contain all of the entity's field translations for all languages?

We may have 2 different use-cases we need to support:

  • We must support a use-case that returns a single JSON document for a single entity that contains all language variants for the entity's properties/fields. This is to allow content staging / migration: Site X has node foo that has properties stored in multiple languages, and all of node foo must be transferred to Site Y.
  • We will likely also want to support a use-case that returns a JSON document for an entity with property/field values only in a single language: for example, a mobile app wants to show the contents of node foo to a user in a language appropriate for that user (based on whatever language negotiation rules are in effect), and we want to conserve on bandwidth, so no point returning values in any other language.
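The second use-case could be sketched as follows (illustrative Python only, not actual Drupal code; the field shapes, the 'und' fallback key, and the "uuid" field are assumptions for the example):

```python
# A multi-language entity structure, with 'und' for language-neutral values.
entity = {
    "title": {"en": "Capital", "de": "Das Kapital"},
    "tags":  {"en": [2, 3], "de": [1, 2]},
    "uuid":  {"und": "abc-123"},
}

def single_language_view(entity, language, fallback="und"):
    """Keep only the values for the requested language, plus any
    language-neutral fallback values, to conserve bandwidth."""
    view = {}
    for field, values in entity.items():
        if language in values:
            view[field] = values[language]
        elif fallback in values:
            view[field] = values[fallback]
    return view

print(single_language_view(entity, "de"))
# {'title': 'Das Kapital', 'tags': [1, 2], 'uuid': 'abc-123'}
```

The first use-case (content staging) would simply serialize the full `entity` structure with all language variants intact.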

"title": "Das Kapital",

fago's picture

Thanks for the feedback!!!

"title": "Das Kapital", <-- can also be an array, but I think there will always just be one value, right?

The plan would be to move to a fixed structure for all entity properties, i.e. Entity [ Property [ PropertyItem [ primitive data values ] ] ]. On the entity property level we'd always have a list of PropertyItems.

Thus, we should probably follow the same structure in JSON-LD for each property as well.
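For illustration, the fixed structure might look like this (a hedged sketch in Python; the inner key names "value" and "target_id" are placeholders I chose, not settled API):

```python
# Entity [ Property [ PropertyItem [ primitive data values ] ] ]:
# every property is a list of PropertyItems, even single-valued ones.
entity = {
    "title": [                     # Property: always a list
        {"value": "Das Kapital"},  # PropertyItem holding primitive values
    ],
    "tags": [
        {"target_id": 1},          # reference items follow the same shape
        {"target_id": 2},
    ],
}

# Single-valued and multi-valued properties are accessed uniformly:
print(entity["title"][0]["value"])  # Das Kapital
```

The payoff of the uniform shape is that serializers and consumers never need per-field special cases for cardinality.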

Language-tagged literals

lanthaler's picture

If it showed up as a language-tagged literal in the playground it would be a bug, but it doesn't; I just checked: bit.ly/LpvA2a

It's the third example in the spec: http://json-ld.org/spec/latest/json-ld-syntax/#string-internationalization

It is also possible to override the default language or specify a plain value by omitting the @language tag or setting it to null when expressing the expanded value:

{
  "@context": {
    ...
    "@language": "ja"
  },
  "name": {
    "@value": "Frank"
  },
  "occupation":  {
    "@value": "Ninja",
    "@language": "en"
  },
  "speciality": "手裏剣"
}

Markus Lanthaler
@markuslanthaler

Right, in my second edit I

linclark's picture

Right, in my second edit I realized that I had been testing with a version I had saved on my local, not the one in the post.

Correct Create.js link

henribergius's picture

Hi,

This is great news!

The correct link for Create.js is http://createjs.org/

Fixed

Crell's picture

Thanks.

I only followed 2/3rds of

nicksanta's picture

I only followed 2/3rds of that post, but it looks awesome nonetheless! Keep up the great contributions guys!

Multiple Formats?

MD3's picture

I feel like I may have missed something from earlier, why are we only going to limit core services to 1 format? And if so, do we have a way to add more formats in a pluggable fashion?

I see a lot of people (such as myself) who implement JSON/REST interfaces for expensive iPhone applications croaking if we chose XML without the ability to use JSON. This would kill the companies I work with cost-wise when upgrading from D7 to D8.

For contrib, yes, but for core, too hard

effulgentsia's picture

do we have a way to add more formats in a pluggable fashion?

This is definitely the goal. That's why much of this post discusses the Entity Property API needed to support that well.

why are we only going to limit core services to 1 format?

Because a useful implementation is more than just a format; it's something that follows a specification with good client library support: for example, an AtomPub interface with a JSON-LD payload, or the CMIS spec with XML bindings. Implementing each of these is time consuming, so it's better to focus on getting one done as completely as possible than multiple done very partially. Also, some of these specs are still undergoing major improvements: implementing CMIS 1 now would pose major challenges, and while implementing CMIS 2 may be much easier, CMIS 2 might not be ready in time for Drupal 8 core, so a contrib module would be a better home for it.
