State of the RDF in Drupal core after the code sprint


The objectives of the sprint were to flesh out a lightweight, basic design for integrating RDF into Drupal core and to implement as many of the features agreed on during the sprint as possible.

The approach we chose is to define mappings to RDF classes and properties for each type of object in Drupal. That includes bundles (nodes, users) but also comments and terms. These mappings are stored as arrays and can be defined programmatically in modules. They can also be altered the same way forms, links or other objects can be altered in Drupal. A user interface in contributed modules will allow users to edit these mappings in the same fashion as RDF API or RDF CCK currently do for Drupal 6. There might be a very simple UI in core to define the RDF class of custom content types (event, person, album), but we don't think there should be more than that.

There is a default mapping for node types, used when no module overrides it, which is defined in node_type_set_defaults():

<?php
$type->rdf_mapping = array(
  'rdftype' => array('sioc:Item', 'foaf:Document'),
  'title'   => array('dc:title'),
  'created' => 'dc:created',
  'changed' => 'dc:modified',
  'body'    => 'content:encoded',
  'uid'     => 'sioc:has_creator',
  'name'    => 'foaf:name',
);
?>

Modules implementing a bundle can define the mapping for its fields:

<?php
/**
 * Implementation of hook_rdf_mapping().
 */
function hook_rdf_mapping() {
  return array(
    'blog' => array(
      'rdftype' => 'sioc:Post',
      'title'   => array('dc:title'),
      'created' => array(
        'property' => array('dc:date', 'dc:created'),
        'datatype' => 'xsd:dateTime',
        'callback' => 'date_iso8601',
      ),
      'name'    => array('dc:creator', 'foaf:name'),
      'uid'     => 'sioc:has_creator',
    ),
  );
}
?>

The 'created' field uses a more complex array because its value needs to be output in ISO 8601 format and with a specific datatype.
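As a rough sketch of how such an entry might be consumed at rendering time (this is not the code from the sprint branch): example_rdfa_span() and example_date_iso8601() are hypothetical names, and the latter only stands in for the date_iso8601 callback named in the mapping, assuming it formats a Unix timestamp as ISO 8601.

```php
<?php
// Stand-in for the 'date_iso8601' callback named in the mapping (assumed
// behaviour: format a Unix timestamp as ISO 8601, here always in UTC).
function example_date_iso8601($timestamp) {
  return gmdate('c', $timestamp);
}

// Hypothetical helper: wraps a human-readable value in a span carrying the
// RDFa attributes described by one mapping entry.
function example_rdfa_span($mapping, $raw_value, $display_value) {
  // A simple entry is a property name or a list of names; a complex entry
  // keeps them under the 'property' key.
  $complex = is_array($mapping) && isset($mapping['property']);
  $properties = $complex ? $mapping['property'] : $mapping;
  $attributes = array('property' => implode(' ', (array) $properties));
  if ($complex && isset($mapping['datatype'])) {
    $attributes['datatype'] = $mapping['datatype'];
  }
  if ($complex && isset($mapping['callback'])) {
    // The machine-readable value goes into the content attribute.
    $attributes['content'] = call_user_func($mapping['callback'], $raw_value);
  }
  $html = '<span';
  foreach ($attributes as $name => $value) {
    $html .= ' ' . $name . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
  }
  return $html . '>' . htmlspecialchars($display_value, ENT_QUOTES, 'UTF-8') . '</span>';
}

$created = array(
  'property' => array('dc:date', 'dc:created'),
  'datatype' => 'xsd:dateTime',
  'callback' => 'example_date_iso8601',
);
echo example_rdfa_span($created, 1242037960, 'Mon, 05/11/2009 - 10:32'), "\n";
```

A simple entry such as 'dc:title' goes through the same helper and only produces a property attribute, which is why the extra keys are needed for dates alone.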

The RDF mappings are inserted into the object (node, user, term...) as it gets built, so they remain available for manipulation. This gives creative developers an extra level of granularity in cases where a mapping rule per object type is not enough. To retrieve the mappings of a given bundle, use the function rdf_get_mapping($bundle):

<?php
/**
 * Returns the mapping for the attributes of the given bundle as an
 * associative array.
 *
 * @param $bundle
 *   A bundle name
 * @return array
 *   The mappings
 */
function rdf_get_mapping($bundle) {
  $rdf_mapping = module_invoke_all('rdf_mapping');

  // Allow other modules to alter RDF mappings.
  drupal_alter('rdf_mapping', $rdf_mapping);

  if (!array_key_exists($bundle, $rdf_mapping)) {
    $rdf_mapping[$bundle] = array();
  }

  return $rdf_mapping[$bundle];
}
?>
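Because rdf_get_mapping() passes the collected mappings through drupal_alter('rdf_mapping', ...), another module can adjust them by implementing hook_rdf_mapping_alter(). A minimal sketch follows, in which the "example" module name and the added rdfs:label and sioct:BlogPost values are illustrative assumptions only. The two hooks are called by hand so that the snippet runs outside Drupal.

```php
<?php
// Mapping as a module might declare it (shortened from the blog example).
function example_rdf_mapping() {
  return array(
    'blog' => array(
      'rdftype' => 'sioc:Post',
      'title'   => array('dc:title'),
    ),
  );
}

// Hypothetical alter hook of a second module: drupal_alter() hands the
// collected mappings over by reference, so changes are made in place.
function example_rdf_mapping_alter(&$rdf_mapping) {
  if (isset($rdf_mapping['blog'])) {
    // Expose the title under an additional property.
    $rdf_mapping['blog']['title'][] = 'rdfs:label';
    // Replace the single class with a list of classes.
    $rdf_mapping['blog']['rdftype'] = array('sioc:Post', 'sioct:BlogPost');
  }
}

// Outside Drupal, the collect and alter steps are simply called in order.
$rdf_mapping = example_rdf_mapping();
example_rdf_mapping_alter($rdf_mapping);
print_r($rdf_mapping['blog']);
```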

We are aware that the term "bundle" is restrictive, as our approach also caters for other object types such as comments and terms. If you have suggestions, please post them. So far we've been using "object type", but that's too vague.

Each instance in Drupal should have a URI (Uniform Resource Identifier) that is different from the URL of the page. This is a subtle difference which is best explained with an example. Take a bundle "artist": it is important to differentiate between the artist himself (the person) and the page which describes the artist on a given site. Typically, the creation date of the page has nothing to do with the artist and should only refer to the page. The same goes for the user who created the page but who didn't "create" the artist (this would not make sense, unless you are actually talking about a parent of the artist!). On the other hand, a date of birth field describes the artist and not the web page, which cannot have a date of birth. URIs can typically be generated easily by appending #this to the URL, the exception being comments, which already have a fragment to which "this" can be added.
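A minimal sketch of that rule, where example_resource_uri() is a hypothetical name. How exactly "this" is joined to an existing comment fragment is not settled above, so the "-this" suffix below is purely an assumption.

```php
<?php
// Hypothetical helper: derives the URI of the described resource from the
// URL of the page (or page fragment) that describes it.
function example_resource_uri($url) {
  if (strpos($url, '#') === FALSE) {
    // No fragment yet: /node/5 becomes /node/5#this.
    return $url . '#this';
  }
  // Assumption: an existing fragment (as on comments) is extended with
  // "-this"; the separator is not specified in the text above.
  return $url . '-this';
}

echo example_resource_uri('/node/5'), "\n";
echo example_resource_uri('/node/5#comment-12'), "\n";
```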

So far we haven't mentioned RDFa, because everything we've done up to this point is very generic and not serialized yet. That means it can be serialized in literally any format. Core will only support RDFa, but contributed modules could implement other serialization formats such as RDF/XML, N-Triples, etc. RDFa is an RDF serialization format which is embedded in the HTML code. Parsing an RDFa page consists of walking the DOM tree and applying an algorithm to build up triples (subject, predicate, object).

Some RDFa helper functions are made available to the theme functions responsible for generating the HTML code. Centralizing the generation of RDFa attributes makes it easier to switch off RDFa if the user wishes to do so. We still need to determine what level of granularity such a switch should have, but a global switch might be enough, given that it is possible to implement custom RDFa output via custom theme functions if necessary.

Some sets of RDFa attributes are pre-generated depending on the context (full node or teaser) and used by default in the theme functions or tpl files. Moreover, each RDFa attribute value is also available to developers who wish to build their own RDFa output. One example is page.tpl.php, which outputs some RDFa attributes describing what the page is about and its type via the snippet <?php print $rdfa_page_about_typeof; ?>. The $rdfa_page_about_typeof variable might contain, for example, the set of RDFa attributes about="#this" typeof="foaf:Document". Each of these about and typeof values will also be accessible to the themer in case she wants to place them in a different location within the theme function.
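To illustrate the idea of a centralized helper with a switch, here is a small sketch. The function name example_rdfa_attributes() and its $rdfa_enabled parameter are hypothetical and do not reflect the helper API in the sprint branch.

```php
<?php
// Hypothetical helper: builds an RDFa attribute string, or an empty string
// when RDFa output is switched off.
function example_rdfa_attributes(array $attributes, $rdfa_enabled = TRUE) {
  if (!$rdfa_enabled) {
    return '';
  }
  $output = array();
  foreach ($attributes as $name => $value) {
    // Multiple values (e.g. several types) are space-separated.
    $value = implode(' ', (array) $value);
    $output[] = $name . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
  }
  return implode(' ', $output);
}

// Pre-generated set, as it might be handed to page.tpl.php.
$rdfa_page_about_typeof = example_rdfa_attributes(array(
  'about' => '#this',
  'typeof' => 'foaf:Document',
));
echo $rdfa_page_about_typeof, "\n";
```

Since every template prints a pre-generated string, turning the switch off leaves the templates untouched and merely empties the variables.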

In situations where RDFa markup is likely to be repeated or redundant, we keep the amount of markup minimal. This is the case for users, for nodes in teaser mode (several teasers can appear on pages such as the front page) and for term links (many term links can occur on some pages). Exhaustive information about these instances can be found on the full node page or on the term page. The main RDFa attribute we've been able to spare in these situations is the typeof attribute. Comments don't have dedicated pages, so all their markup will be in the full node output.

Typical RDFa output for a teaser:

<?php
<div about="/node/5#this" id="node-5" class="node clearfix">
  <h2><a property="dc:title" href="/d701/node/5">title of the article</a></h2>
  <div property="content:encoded dc:description" class="content">
    <p>The content of the page...</p>
  </div>
</div>
?>

We've taken a snapshot of the "close to ideal" RDFa page for an article. See the recent post "Half way through the RDF code sprint" to get up and running with the JavaScript RDFa checker and make sure you are able to read the RDFa out of the static HTML page which was attached to the post. Note that not all of the RDFa markup present in this static file is implemented in the branch yet.

Issues with the current theme system

We ran into some cumbersome issues with theme functions such as theme_username(), and we plan to submit patches to refactor them and improve their design outside the scope of this RDF in core effort.

RDF model vs. RDFa serialization

We will break the patches into two categories. The RDF model is how and where the RDF mappings and the object values are stored and represented in memory: put together, they define the RDF graph of the object (node, user...) [think PHP objects in memory]. This RDF data must then be serialized as RDFa via the theme layer [think serialize($my_array) in PHP]. There are other RDF serialization formats such as RDF/XML and N-Triples. In Drupal core we only serialize to RDFa, but the RDF model is made available in memory for any modules that want to serialize it in other formats.
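The split can be pictured with a small sketch that builds format-independent triples from a mapping and some field values. The function example_rdf_triples() is hypothetical. It treats every value as a literal, so resource-valued fields such as 'uid' would need separate handling that is not shown here.

```php
<?php
// Hypothetical helper: turns a mapping plus field values into a list of
// (subject, predicate, object) triples, independent of any output format.
function example_rdf_triples($subject, array $mapping, array $values) {
  $triples = array();
  foreach ($mapping as $field => $entry) {
    if ($field === 'rdftype') {
      // One rdf:type triple per mapped class.
      foreach ((array) $entry as $class) {
        $triples[] = array($subject, 'rdf:type', $class);
      }
      continue;
    }
    if (!isset($values[$field])) {
      continue;
    }
    // Complex entries keep their predicates under the 'property' key.
    $complex = is_array($entry) && isset($entry['property']);
    $properties = $complex ? $entry['property'] : $entry;
    foreach ((array) $properties as $predicate) {
      $triples[] = array($subject, $predicate, $values[$field]);
    }
  }
  return $triples;
}

$mapping = array(
  'rdftype' => array('sioc:Item', 'foaf:Document'),
  'title'   => array('dc:title'),
  'created' => 'dc:created',
);
$values = array(
  'title'   => 'title of the article',
  'created' => '2009-05-11T10:32:40+00:00',
);
foreach (example_rdf_triples('/node/5#this', $mapping, $values) as $triple) {
  echo implode(' ', $triple), "\n";
}
```

A contributed module could walk the same list to write RDF/XML or N-Triples, while core walks it only to produce RDFa attributes.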

Polls and questions

We were not able to decide whether to poll opinions about the overhead of RDFa markup (on by default? pluggable?).
We are aware that the term "bundle" is restrictive, as our approach also caters for other object types such as comments and terms. If you have suggestions, please post them.

Suggested steps for proceeding

  1. commit the RDF API. It is not ready yet and needs more tests.
  2. commit the helper functions for RDFa (used in the theme system).
  3. add RDFa output for specific subsystems separately: node, fields, user, blog, etc.

If you want to get involved and look at some of the code, please visit http://happypixels.net/blog/rdf-core-code-sprint-more-details

This is a wiki page with comments, so we'll keep the page updated based on your feedback. Please tell us what you think of the approach above and of the open issues.

Comments

follow up sprint

kvantomme

Great job! When you say that it's not ready yet, do you have a task list somewhere of the things that still need to be done or polished?

Let me know if a follow-up sprint in Belgium at the end of June/beginning of July would be feasible.

--

I blog and Tweet

scor

We have a list of issues on Google Docs; we need to move them to the drupal.org issue tracker and continue from there.

good work!

fago

Here are some questions, thoughts:

  • Why is the 'datatype' property needed for 'created' and else not? Does the absence of the datatype property tell us something about the datatype? (It's clear to me why the handler is there.)

  • What about using the term "entity type" instead of bundle / object type ?

  • Global switch: I think a lot of people like RDFa output in general, but don't want to share some parts of their data, e.g. a special content type. Nowadays the value of data is enormous and people need 100% control over it. Leaving it up to the themers makes this common case quite complicated to achieve. Apart from that, what if overriding the node template doesn't suffice, e.g. because the fields would still generate RDFa markup? IMO at least a per-content-type kill switch would make sense. The API should also allow more fine-grained control, but that should be fine with hook_rdf_mapping_alter().

  • Is there a way to formalize whether a property talks about the document or the thing it represents? Let's use your example, take nodes describing artists, can I specify that 'created' refers to creation time of the document and another field "born" referring to the birthday?

  • I suppose that should be foaf:name?

    'name'    => array('dc:creator', 'foaf:Person'),

scor

"Why is the 'datatype' property needed for 'created' and else not? Does the absence of the datatype property tell us something about the datatype? (It's clear to me why the handler is there.)"

By default in RDFa, if you don't specify a datatype, the value is parsed as a string (like the name of a user), which is what you want most of the time. But in the case of a date, for example, we want RDFa output like:

<?php
<span property="dc:created" datatype="xsd:dateTime" content="2009-05-11T10:32:40+00:00">Mon, 05/11/2009 - 10:32</span>
?>

"Is there a way to formalize whether a property talks about the document or the thing it represents? Let's use your example, take nodes describing artists, can I specify that 'created' refers to creation time of the document and another field "born" referring to the birthday?"

Yes. The problem is that Drupal natively does not really distinguish between the two. How about:

<?php
'created' => array(
  'property' => array('dc:date', 'dc:created'),
  'datatype' => 'xsd:dateTime',
  'callback' => 'date_iso8601',
  'type' => 'document',
),
?>

Alternatively, we could hardcode the native metadata as document metadata, since the Fields a user adds are very unlikely to be about the page, but rather about the resource described.

"I suppose that should be foaf:name?"

Of course :). Fixed.

fago

@datatype: ah I see. Isn't something similar needed to generate this markup taken from your example?

john

"Alternatively, we can also hardcode the native metadata as document metadata, since the Fields which a user might use are very unlikely to be about the page, but rather about the resource described."

Hm, I think it depends. When you add some further fields to an article content type, they refer to the document. Anyway, adding such a property does no harm as it can be optional, so I'd just add one. Perhaps "subject" would be a good name?
