Entity RDF: data descriptions for pluggable serializations of Drupal entities (JSON-LD and more)

scor

There have been many discussions in the WSCCI group about approaches for serializing entities, and with the Drupal 8 feature freeze only a few months away, it's time to start coding. JSON-LD was chosen as a good potential candidate for serializing Drupal entities. One of the key features of JSON-LD is the ability to include a @context along with your data. This context describes the data present in the JSON document and helps consumers understand how best to use it.

Taking a step back and looking at Drupal's ecosystem, there are other places where such data descriptions are useful: JSON-LD is only one of the many formats that Drupal can support. Others include RDFa, Microdata, N-Triples, and SPARQL. Generalizing data descriptions in such a way that they can be used in any serialization format is the topic of my proposal today.

Pluggable serializations of Drupal entities overview

The RDF module in Drupal 7 was a first attempt at describing Drupal's data in a way that is useful for adding RDFa in HTML. Back in 2009 when we worked on it, there was no solid entity API in core and we had to come up with our own way to store these data description mappings and expose them to other APIs. At the time we talked with fago, who had ideas on how to bake RDF into his entity API proposal, but given how much work it would take to get the entity API alone into core, we decided it was best not to rely on it. Today, in the context of Drupal 8, the situation has changed a lot. We're going to have the Entity Property API in Drupal 8 core, so it's time to revisit the decision we made in 2009. Also, since the launch of Drupal 7 last year, we've learned a lot about what works and what doesn't work in rdf.module, and what its limitations are.

Enter Entity RDF, a complete rewrite of the core RDF module which takes advantage of the Entity Property API. During DrupalCon Denver I suggested to fago the idea of making the data descriptions (aka RDF mappings) available with the rest of the entity property info. The Entity RDF module does just that, with the following improvements over the older RDF implementation:

  • RDF mappings can be assigned not only to entity properties and fields but also to field internal properties, which is useful for compound fields such as addressfield
  • References to classes and properties are stored as full URIs to avoid depending on prefix bindings; the use of shorthand notations (CURIEs) is left up to the administration UI and/or the serialization
  • RDF mappings are available in the entity property info via the metadata wrapper in Drupal 7 (and natively in Drupal 8)
  • RDF mappings are no longer carried around in the entity object; the structure info is separate from the data
  • Default RDF mappings are no longer automatically set for common properties; they need to be enabled instead. Suggestions for mappings can be provided to the administrator setting the mappings
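To illustrate the second point, here is a minimal sketch of how a UI or a serializer might turn a stored full URI into a CURIE at display time. The function name example_uri_to_curie() and the prefix list are hypothetical and not part of Entity RDF.

```php
<?php
// Sketch only: example_uri_to_curie() and $prefixes are hypothetical names,
// not part of the Entity RDF module.

/**
 * Shortens a full URI to a CURIE when a matching prefix binding is known.
 */
function example_uri_to_curie($uri, array $prefixes) {
  foreach ($prefixes as $prefix => $namespace) {
    // A binding matches when the URI starts with its namespace.
    if (strpos($uri, $namespace) === 0) {
      return $prefix . ':' . substr($uri, strlen($namespace));
    }
  }
  // No known prefix: keep the full URI, which is always unambiguous.
  return $uri;
}

$prefixes = array('schema' => 'http://schema.org/');
echo example_uri_to_curie('http://schema.org/name', $prefixes), "\n";
```

Because the stored mapping is always the full URI, changing or removing a prefix binding only affects how the URI is displayed, never what it means.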

I'm hoping these changes address the shortcomings of the current RDF Mapping API. Many of these changes come from looking at the various implementations in contrib that have been developed mostly by Lin Clark and/or myself: RDF Extensions, Microdata, RDFa.

Proof of concept implementations

Entity RDF is a work in progress and there are many things that need to be improved, but I feel it's in good enough shape to start sharing the work with others and asking for feedback. Currently there is no UI for setting the data descriptions and these have to be maintained via code, but once we've ironed out the right format to express these data descriptions, the work on the UI can begin. I've also started porting the JSON-LD module and RDF Extensions to the new Entity RDF module; you will find working implementations in the 'entity_rdf' branch of both of these modules. I was pleasantly surprised by how easy it was to switch from the old API to the new one while porting RDF Extensions. In fact, that's the one I used while developing entity_rdf.module to ensure the mapping structure was working and made sense.

I created a few Drupal entities for demo purposes on a sandbox to allow easy preview of the various outputs. One of them is an event for DrupalCon Munich and I've mapped its fields to the schema.org vocabulary. At the moment you can view the serialization of these demo entities in JSON-LD or in N-Triples. For those unfamiliar with N-Triples, it's a basic plain text serialization of an entity where each line asserts a value of the entity in the form:
<Entity URI> <Property URI> "Value" .

For example, the title of the DrupalCon event is serialized as:

<http://wscci1.openspring.net/node/3> <http://schema.org/name> "DrupalCon Munich" .

Note that "value" can be a URI sometimes (in this case it will be wrapped with angle brackets like the other URIs). URI values can link to other entities on your site in the case of an entity reference or an image field, or to another site in the case of a link field.
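As a rough illustration of the line format described above, the following sketch builds one N-Triples statement for a literal value and one for a URI value. The helper names are hypothetical and this is not the code used by RDF Extensions; language tags and datatypes are left out.

```php
<?php
// Sketch only: these helpers are hypothetical and are not taken from
// RDF Extensions. Language tags and datatypes are not handled.

/**
 * Escapes a string for use inside a double-quoted N-Triples literal.
 */
function example_ntriples_escape($string) {
  // strtr() replaces each match once, so escaped output is never re-escaped.
  return strtr($string, array(
    '\\' => '\\\\',
    '"' => '\\"',
    "\n" => '\\n',
    "\r" => '\\r',
    "\t" => '\\t',
  ));
}

/**
 * Formats a single statement: <subject> <property> object .
 */
function example_ntriples_line($subject_uri, $property_uri, $value, $value_is_uri = FALSE) {
  // URI values are wrapped in angle brackets, literals in double quotes.
  $object = $value_is_uri
    ? '<' . $value . '>'
    : '"' . example_ntriples_escape($value) . '"';
  return '<' . $subject_uri . '> <' . $property_uri . '> ' . $object . ' .';
}

echo example_ntriples_line('http://wscci1.openspring.net/node/3', 'http://schema.org/name', 'DrupalCon Munich'), "\n";
echo example_ntriples_line('http://wscci1.openspring.net/node/1', 'http://schema.org/servesCuisine', 'http://en.wikipedia.org/wiki/Cuisine_of_Germany', TRUE), "\n";
```

The first call reproduces the title statement shown above; the second shows the angle-bracket form used when the value is itself a URI.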

Next is JSON-LD, which includes two parts: the context, where the data descriptions are grouped, followed by the actual data. Note that you might want to use a browser extension to view JSON in a human readable fashion; in Firefox I use JSONView, for example. In JSON-LD, the context maps certain keys of the JSON data to URIs, the same "property URIs" that we saw in the N-Triples output above. The screenshot below shows the JSON-LD output of the title and location of the event (the actual output contains more data than that):

@context is a flat list of mappings for the keys we care about: title, field_address as well as some of the nested keys inside field_address. Right under @context you will find other elements of information about the entity such as its URI (@id) and its types (@type).

You can compare the N-Triples and the JSON-LD outputs and you will find that the same RDF mappings are used in both serializations. They come from the entity property info provided by Entity API. Each serializer receives these mappings with the rest of the property info and can insert the mappings at the appropriate places in the output. Entity RDF collects all the RDF mappings via hook_entity_rdf_mappings() and implements hook_entity_property_info_alter() to insert these mappings in the property info for all entity types and bundles. All of this data is then cached by Entity API. Here is what the RDF mapping definitions of our event bundle, with title and location, look like:

<?php
'node:event' => array(
  'rdf types' => array('http://schema.org/Event', 'http://schema.org/EducationEvent'),
),
'node:event:title' => array(
  'rdf properties' => array('http://schema.org/name'),
),
'node:event:field_address' => array(
  'rdf properties' => array('http://schema.org/location'),
  'rdf types' => array('http://schema.org/PostalAddress'),
  'field properties' => array(
    'thoroughfare' => array(
      'rdf properties' => array('http://schema.org/streetAddress'),
    ),
    'locality' => array(
      'rdf properties' => array('http://schema.org/addressLocality'),
    ),
    'administrative_area' => array(
      'rdf properties' => array('http://schema.org/addressRegion'),
    ),
    'postal_code' => array(
      'rdf properties' => array('http://schema.org/postalCode'),
    ),
    'country' => array(
      'rdf properties' => array('http://schema.org/addressCountry'),
    ),
  ),
),
?>

You can view the entire RDF mapping definitions in the git repository of this demo. The structure above is under discussion and any feedback is appreciated (issue).
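To make the link between these mappings and the JSON-LD output more concrete, here is a minimal sketch that derives a flat @context from a trimmed-down copy of the event mappings. The function name example_jsonld_context() is hypothetical, and the real JSON-LD module may well build its output differently.

```php
<?php
// Sketch only: example_jsonld_context() is a hypothetical helper, and the
// real JSON-LD module may build its output differently.

/**
 * Builds a flat JSON-LD @context from mappings keyed by "type:bundle:property".
 */
function example_jsonld_context(array $mappings, $bundle_key) {
  $context = array();
  foreach ($mappings as $key => $mapping) {
    // Only keep property-level mappings of the requested bundle.
    if (strpos($key, $bundle_key . ':') !== 0 || empty($mapping['rdf properties'])) {
      continue;
    }
    $property = substr($key, strlen($bundle_key) + 1);
    // A JSON-LD term maps to a single URI, so only the first one is used here.
    $context[$property] = $mapping['rdf properties'][0];
    // Field internal properties land in the same flat list of keys.
    if (!empty($mapping['field properties'])) {
      foreach ($mapping['field properties'] as $name => $sub_mapping) {
        $context[$name] = $sub_mapping['rdf properties'][0];
      }
    }
  }
  return $context;
}

// A trimmed-down copy of the event mappings shown above.
$mappings = array(
  'node:event' => array(
    'rdf types' => array('http://schema.org/Event'),
  ),
  'node:event:title' => array(
    'rdf properties' => array('http://schema.org/name'),
  ),
  'node:event:field_address' => array(
    'rdf properties' => array('http://schema.org/location'),
    'rdf types' => array('http://schema.org/PostalAddress'),
    'field properties' => array(
      'locality' => array('rdf properties' => array('http://schema.org/addressLocality')),
    ),
  ),
);

$document = array(
  '@context' => example_jsonld_context($mappings, 'node:event'),
  '@id' => 'http://wscci1.openspring.net/node/3',
  '@type' => $mappings['node:event']['rdf types'],
  'title' => 'DrupalCon Munich',
  'field_address' => array('locality' => 'München'),
);

echo json_encode($document, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE), "\n";
```

Note that the nested 'locality' key ends up next to 'title' in the same flat list, which is what makes key clashes possible when two fields share an internal property name.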

Entity RDF ships with token support, which lets you override a value when the default value isn't what you want. For instance, Entity API will by default link a node to its author URI. What if you don't want to publish the author as a URI, but simply as a string? You can specify the value in the RDF mappings with a token:

<?php
'node:article:author' => array(
  'rdf properties' => array('http://schema.org/creator'),
  'rdf value' => '[node:author:name]',
),
?>

Both serializations will reflect this value; see it in action with this article in JSON-LD and N-Triples. By default <http://wscci1.openspring.net/user/1> would be used as the value, but now the string "Stephane Corlosquet" is used instead.

Another situation where this is useful is when using Linked Data, for example with schema.org enumerations when referring to different types of cuisines when describing restaurants. I've created an entity for the Schlösselgarten restaurant (JSON-LD, N-Triples) and tagged it with the term "German" from a cuisine taxonomy vocabulary I created on the site. Wikipedia has an extensive, well-described list of cuisine types. When exporting data, it makes more sense to refer to the list that Wikipedia maintains than to create my own, and that's what schema.org started to recommend not too long ago. By default, Drupal's entity serialization would link the restaurant to the cuisine tag URI (http://wscci1.openspring.net/taxonomy/term/5), but with a token we can easily say in our RDF mappings that we want to use the Wikipedia URI instead:

<?php
'node:restaurant:field_cuisine' => array(
  'rdf properties' => array('http://schema.org/servesCuisine'),
  'rdf resource' => '[node:field-cuisine:field-wikipedia-url:url]',
),
?>

which will give us this kind of output:
<http://wscci1.openspring.net/node/1> <http://schema.org/servesCuisine> <http://en.wikipedia.org/wiki/Cuisine_of_Germany> .

It's also possible to set a datatype and a callback function for the values. This is useful in the case of dates; check out the event serializations for examples (JSON-LD, N-Triples).
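The 'rdf value' and 'rdf resource' overrides can be pictured with the following sketch. It is only an illustration: example_replace_tokens() is a hypothetical stand-in for Drupal's token replacement so that the snippet runs on its own, and checking 'rdf resource' before 'rdf value' is my assumption for the sketch, not documented behavior.

```php
<?php
// Sketch only. In Drupal the tokens would be resolved by the token system;
// example_replace_tokens() is a hypothetical stand-in so the snippet runs on
// its own. The order of the two checks below is an assumption.

/**
 * Replaces tokens using a map of full tokens to their values.
 */
function example_replace_tokens($text, array $token_values) {
  return strtr($text, $token_values);
}

/**
 * Returns array(value, is_uri) for one mapped property.
 */
function example_resolve_object(array $mapping, $default_value, $default_is_uri, array $token_values) {
  if (isset($mapping['rdf resource'])) {
    // 'rdf resource' replaces the default with a URI.
    return array(example_replace_tokens($mapping['rdf resource'], $token_values), TRUE);
  }
  if (isset($mapping['rdf value'])) {
    // 'rdf value' replaces the default with a plain string.
    return array(example_replace_tokens($mapping['rdf value'], $token_values), FALSE);
  }
  // No override: keep whatever the serializer would output by default.
  return array($default_value, $default_is_uri);
}

$mapping = array(
  'rdf properties' => array('http://schema.org/creator'),
  'rdf value' => '[node:author:name]',
);
$token_values = array('[node:author:name]' => 'Stephane Corlosquet');

list($value, $is_uri) = example_resolve_object($mapping, 'http://wscci1.openspring.net/user/1', TRUE, $token_values);
echo ($is_uri ? "<$value>" : "\"$value\""), "\n";
```

Without the 'rdf value' key, the same call would fall through to the default and return the author URI.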

Validation

So far, all the implementations I've used in the above examples work on Drupal 7, but once we agree that this approach makes sense and that all the issues found through this first round of implementation can be addressed, we can work on a port to Drupal 8. One benefit of working in Drupal 7 is that we can try this approach without waiting for the Entity Property API to mature in Drupal 8, and we can see whether existing Drupal 7 sites can be made to work with JSON-LD or other serializations the way we plan to see it working in Drupal 8. I've personally already started to switch to Entity RDF for my current projects at work, and I've been pleasantly surprised by how trivial the switch was. Entity RDF actually ships with a migration feature that migrates Drupal 7 core RDF mappings to the new format. By implementing this approach on actual sites we'll be able to validate our current plans very quickly. Expect to hear feedback soon if things don't work out.

Issues to investigate in JSON-LD

While working on the entity_rdf branch of the JSON-LD module (which will become 7.x-2.x if all goes well), I've found a few JSON-LD issues which need to be addressed.

Nested field property values

First of all, I've tried to stick to the JSON data output of RESTful Web Services, and add a new @context element with the data descriptions in it. By default RestWS and Entity API serialize each compound field as a nested set of key-value pairs. The field property values are nested inside the field, like this:

"field_address": {
    "country": "DE",
    "locality": "München",
    "postal_code": "81925",
    "thoroughfare": "Cosimastraße 41"
},

In JSON-LD, you need a mapping for each level of your nested structure: the field itself and the compound values. While this makes sense for the addressfield, where the address values are meant to be grouped together, there are other fields where this doesn't work quite as well, such as the date field or the body field.

Date field output:

"field_date": {
    "value": "2012-08-20T00:00:00-05:00",
    "value2": "2012-08-24T00:00:00-05:00",
    "duration": "PT345600M"
},

Body field output:

"body": {
    "value": "<p>DrupalCon is an international event...</p>\n",
    "summary": "",
    "format": "filtered_html"
},

In both cases, for JSON-LD to work we would need to flatten these to match what consumers typically expect. For an event, the start date and the end date are usually directly attached to the event entity, that's what http://schema.org/Event expects for example, and there is no intermediary property we can use here for the "set of start and end dates". The same applies for the body field, where the value of the body is what's expected to be directly attached to the entity without an intermediary level of nesting. We might have to break the mold of the default entity API serialization. That's in fact what I did in the entity_rdf branch of RDF Extensions to serialize in N-Triples.

Duplicate keys

The second issue I found is that duplicate keys in the JSON data can clash when creating the @context. For example, 'value' is used inside both the body field and the date field, and because the context is a flat list of keys and their associated mappings, there is no way to know which field the 'value' field property mapping applies to. The same problem applies to the 'url' key, which is a native property of all entities and is also present inside any link field. The consequence is that duplicate keys in the JSON data end up having the same mapping set in @context, e.g. the mapping for 'value' will affect both the body field and the date field.

One possibility is to have nested @contexts. This idea was brought up on the JSON-LD mailing list a few weeks ago, but the group decided against it due to the added complexity. Given that the group is in feature freeze at the moment, it will be hard to revisit this decision. One workaround suggested by the group is to override the context where necessary inside the nested data element. My feeling is that we then start to mix data and structure, and I kind of like having the data descriptions tucked away in @context at the beginning of the JSON-LD output. Another workaround would be to ensure there are no duplicate keys by renaming them, maybe using their path, e.g. 'body_value' or 'field_date_value'. That's error prone unless we do it systematically, whether we have a mapping in @context or not; otherwise the keys in the JSON-LD output will change if a mapping is added later on by the site administrator.
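The path-based renaming workaround could look roughly like the sketch below, which flattens one level of field properties for every field regardless of whether a mapping exists. example_flatten_fields() is a hypothetical helper, not code from RestWS, Entity API or the JSON-LD module, and the schema.org property chosen for the body is only an assumption for the example. Multi-valued fields are not handled.

```php
<?php
// Sketch only: example_flatten_fields() is a hypothetical helper, not code
// from RestWS, Entity API or the JSON-LD module. Multi-valued fields are
// not handled.

/**
 * Flattens one level of nested field properties into path-based keys.
 */
function example_flatten_fields(array $data) {
  $flat = array();
  foreach ($data as $field => $value) {
    if (is_array($value)) {
      // Prefix every field property with its field name, mapped or not,
      // so keys stay stable when mappings are added later.
      foreach ($value as $property => $property_value) {
        $flat[$field . '_' . $property] = $property_value;
      }
    }
    else {
      $flat[$field] = $value;
    }
  }
  return $flat;
}

$data = array(
  'title' => 'DrupalCon Munich',
  'body' => array(
    'value' => "<p>DrupalCon is an international event...</p>\n",
    'summary' => '',
    'format' => 'filtered_html',
  ),
  'field_date' => array(
    'value' => '2012-08-20T00:00:00-05:00',
    'value2' => '2012-08-24T00:00:00-05:00',
  ),
);

$flat = example_flatten_fields($data);

// The two former 'value' keys can now carry different mappings.
// Using schema.org/description for the body is an assumption.
$context = array(
  'body_value' => 'http://schema.org/description',
  'field_date_value' => 'http://schema.org/startDate',
  'field_date_value2' => 'http://schema.org/endDate',
);

print_r(array_keys($flat));
```

This also attaches the start and end dates directly to the entity, at the cost of moving away from the default nested serialization.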

Feedback on the above proposal on how to define data descriptions a la entity RDF as well as any of the proof of concept implementations is highly appreciated. Please comment here or in the issue queues (Entity RDF, JSON-LD and RDF Extensions). I'm on my way to DrupalCon and I look forward to talking about these ideas with anyone interested.

Comments

scor

I've had a few people telling me they didn't get any notification for this post, so hopefully this comment will trigger them to be sent out.

Something I forgot to mention in the post is that I use RestWS to handle the HTTP request routing to the right serialization based on either the extension used in the url or the HTTP headers.

Duplicate key issue

Stefan Freudenberg

I don't think having field specific context attached to the field output in JSON-LD is technically bad. It follows a common pattern of overriding basic class properties at the subclass level. In this case @context would be the base which each key-value pair inherits. So it makes sense to have an @context element there overriding the default.

Example:

{
  "@context":
  {
    "name": "http://example.com/person#name",
    "details": "http://example.com/person#details"
  },
  "name": "Markus Lanthaler",
  ...
  "details":
  {
    "@context": {
      "name": "http://example.com/organization#name"
    },
    "name": "Graz University of Technology"
  }
}

http://json-ld.org/spec/FCGS/json-ld-syntax/20120626/#external-contexts

Multivalued?

effulgentsia

Nesting @context like the above makes sense to me. Would it support multivalued fields? Like this?

{
  "@context":
  {
    "name": "http://example.com/person#name",
    "details": "http://example.com/person#details"
  },
  "name": "Markus Lanthaler",
  ...
  "details":
  [
    {
      "@context": {
        "name": "http://example.com/organization#name"
      },
      "name": "Graz University of Technology"
    },
    {
      "@context": {
        "name": "http://example.com/organization#name"
      },
      "name": "Some other university"
    }
  ]
}

If there's a way to not need to duplicate the @context for each item, that would be even better, but not critical.

lanthaler

Yes, that's possible.

Markus Lanthaler
@markuslanthaler

Field properties

Stefan Freudenberg

The problem of nested fields is due to us not having clearly defined the roles of entities and fields with respect to what kind of data types they store. At a conceptual level I would even say that the relation between entities and fields is reversed with respect to other data modelling techniques. Compare with the way SQL allows us to define entities with columns (fields) or with the way nodes and properties are handled in JCR. In Drupal fields are way richer than JCR properties or MySQL columns which can only hold values of one type (string, integer, etc).

Drupal's fields are definitely very flexible and powerful but that comes at the cost of additional workarounds when delivering them in other formats defined by common schemas. It would in my opinion also make our interfaces easier if we put the restriction on fields that they can only have one (semantic) type of data, either a single value or multiple thereof. Or we somehow have to work around the fact that our fields are to be mapped to what is considered an entity or node in other data models.

ashepherd

This looks great! I wish I'd seen this post when it came out. If anyone is still interested in this work, I propose we hold a BoF at DrupalCon LA.

Agreed. . . .

jpw1116

Count me in, please.

ashepherd

I've got a Google doc going to help plan the BoF, and I'll keep this thread updated as we make progress leading up to DrupalCon.

Doc: https://docs.google.com/document/d/1Y1x92G6jL5yzmM6kRS1boRERuMzKWYvuzbVM...

If anyone has agenda items or topics they want to cover in the BoF, post them here and I'll add them to the Google Doc.
