Architecture of the centralized vocabulary republishing service


Posted by scor on June 28, 2011 at 4:06pm
Last updated by scor on Wed, 2012-02-22 21:26

These are notes from the brainstorming session Lin and I had on IRC in #drupal-rdf.

The goal of this service is to republish well-known vocabularies in a form that is convenient for Drupal site administrators to use with the RDF Mapping UI. The service will streamline the vocabulary import process and hide some of the more tedious steps, such as discovering the right location for a vocabulary, finding its namespace, and picking the right format for the import.

The service will be powered by Drupal and will make the vocabularies available at easy-to-remember URLs, for example http://example.com/foaf for the FOAF vocabulary. The prefixes will be aligned with the RDFa Core Default Profile (draft) and http://prefix.cc/.

Below is the initial proposed architecture for the custom module handling the vocabulary management and publishing. It is purposely kept simple, but could be extended in the future.

Architecture

The module would contain a list of vocabularies that we want to offer. The vocabulary registry is a simple PHP array stored in a configuration file of the form:

<?php
$vocabs = array(
  'foaf' => array(
    'prefix' => 'foaf',
    'name' => 'FOAF',
    'description' => 'Friend of a Friend is a vocabulary devoted to linking people and information using the Web',
    'location' => 'http://xmlns.com/foaf/spec/index.rdf',
    'namespace' => 'http://xmlns.com/foaf/0.1/',
    'spec' => 'http://xmlns.com/foaf/spec/',
  ),
  'sioc' => array(
    'prefix' => 'sioc',
    'name' => 'SIOC',
    'description' => 'The SIOC (Semantically-Interlinked Online Communities) Core Ontology provides the main concepts and properties required to describe information from online communities',
    'location' => 'http://rdfs.org/sioc/ns#',
    'namespace' => 'http://rdfs.org/sioc/ns#',
    'spec' => 'http://rdfs.org/sioc/spec/',
  ),
  'schema' => array(
    'prefix' => 'schema',
    'name' => 'Schema.org',
    'description' => 'Schema.org provides a collection of schemas that webmasters can use in their pages in ways recognized by major search providers',
    'location' => 'http://schema.org/docs/schemaorg.owl',
    'namespace' => 'http://schema.org/',
    'spec' => 'http://schema.org/docs/schemas.html',
  ),
);
?>
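Since the registry is maintained by hand, a small sanity check can catch incomplete entries before they are published. The following is only a sketch: the function name vocab_service_validate_registry() is hypothetical, the list of required keys is simply taken from the array above, and the rules that the array key matches the 'prefix' value and that a namespace ends in '/' or '#' are assumptions.

```php
<?php
/**
 * Hypothetical helper: returns a list of problems found in the registry.
 */
function vocab_service_validate_registry(array $vocabs) {
  // Keys used by every entry in the registry shown above.
  $required = array('prefix', 'name', 'description', 'location', 'namespace', 'spec');
  $errors = array();
  foreach ($vocabs as $key => $vocab) {
    foreach ($required as $field) {
      if (empty($vocab[$field])) {
        $errors[] = "$key: missing '$field'";
      }
    }
    // Assumption: the array key and the 'prefix' value are meant to agree.
    if (isset($vocab['prefix']) && $vocab['prefix'] !== $key) {
      $errors[] = "$key: key does not match prefix '{$vocab['prefix']}'";
    }
    // Assumption: namespaces end in '/' or '#' so term names can be appended.
    if (!empty($vocab['namespace']) && !preg_match('@[/#]$@', $vocab['namespace'])) {
      $errors[] = "$key: namespace should end with '/' or '#'";
    }
  }
  return $errors;
}

// Example: one complete entry and one deliberately broken entry.
$vocabs = array(
  'foaf' => array(
    'prefix' => 'foaf',
    'name' => 'FOAF',
    'description' => 'Friend of a Friend',
    'location' => 'http://xmlns.com/foaf/spec/index.rdf',
    'namespace' => 'http://xmlns.com/foaf/0.1/',
    'spec' => 'http://xmlns.com/foaf/spec/',
  ),
  'broken' => array(
    'prefix' => 'brk',
    'name' => 'Broken',
    'namespace' => 'http://example.com/ns',
  ),
);
$errors = vocab_service_validate_registry($vocabs);
foreach ($errors as $error) {
  print $error . "\n";
}
```

A check like this could run whenever the configuration file changes, so that a bad entry never reaches the published pages.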

Each vocabulary is published as JSON at http://example.com/[prefix].json, and http://example.com/[prefix] shows some information about the vocabulary (such as its namespace and any changes).
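To illustrate the URL scheme, here is a minimal sketch of how a request path could be mapped to a registry entry and an output format. The function name vocab_service_resolve_path() and the 'html'/'json' format labels are hypothetical; in a real Drupal module this would more likely be handled through menu callbacks.

```php
<?php
/**
 * Hypothetical helper: maps "foaf" or "foaf.json" to a registry entry.
 *
 * Returns NULL when the prefix is not in the registry.
 */
function vocab_service_resolve_path($path, array $vocabs) {
  $path = trim($path, '/');
  $format = 'html';
  // A trailing ".json" selects the JSON export instead of the info page.
  if (substr($path, -5) === '.json') {
    $format = 'json';
    $path = substr($path, 0, -5);
  }
  if (!isset($vocabs[$path])) {
    return NULL;
  }
  return array('format' => $format, 'vocab' => $vocabs[$path]);
}

// Example with a shortened registry.
$vocabs = array(
  'foaf' => array(
    'prefix' => 'foaf',
    'name' => 'FOAF',
    'namespace' => 'http://xmlns.com/foaf/0.1/',
  ),
);
$json_match = vocab_service_resolve_path('/foaf.json', $vocabs);
$page_match = vocab_service_resolve_path('/foaf', $vocabs);
$no_match = vocab_service_resolve_path('/unknown', $vocabs);
```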

The vocabularies are kept in sync with their sources via a daily or weekly cron job, which regenerates the JSON file when necessary. The administrators of the service site can choose to receive an email when a change in a vocabulary is detected (for example, if they want to advertise it on the vocabulary page or link to an announcement).
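One way the cron job could detect changes is to compare a hash of the freshly fetched source with the hash stored at the last run. This is a sketch under assumptions: the function name vocab_service_find_changes() is hypothetical, hashing is only one possible change-detection strategy, and the fetcher is passed in as a callable so that the example runs without network access. In a Drupal module the loop would typically be triggered from hook_cron().

```php
<?php
/**
 * Hypothetical helper: returns new hashes for vocabularies whose source changed.
 *
 * @param array $vocabs Registry entries keyed by prefix.
 * @param array $stored_hashes Hashes from the previous run, keyed by prefix.
 * @param callable $fetch Takes a URL, returns the source as a string or FALSE.
 */
function vocab_service_find_changes(array $vocabs, array $stored_hashes, callable $fetch) {
  $changed = array();
  foreach ($vocabs as $prefix => $vocab) {
    $source = $fetch($vocab['location']);
    if ($source === FALSE) {
      // Fetch failed: keep the current export and try again on the next run.
      continue;
    }
    $hash = hash('sha256', $source);
    if (!isset($stored_hashes[$prefix]) || $stored_hashes[$prefix] !== $hash) {
      $changed[$prefix] = $hash;
    }
  }
  return $changed;
}

// Example with a stub fetcher instead of a real HTTP request.
$vocabs = array(
  'foaf' => array('location' => 'http://xmlns.com/foaf/spec/index.rdf'),
  'sioc' => array('location' => 'http://rdfs.org/sioc/ns#'),
);
$stub_fetch = function ($url) {
  return 'contents of ' . $url;
};
$stored_hashes = array(
  'foaf' => hash('sha256', 'contents of http://xmlns.com/foaf/spec/index.rdf'),
);
$changed = vocab_service_find_changes($vocabs, $stored_hashes, $stub_fetch);
```

The prefixes returned here are the ones whose JSON file would be regenerated, and the same list could drive the optional email to the administrators.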

The JSON export is kept as lightweight as possible so that it can be used as is on the consumer sites with the RDF Mapping UI, in a similar fashion to the current version of the schema.org module. The values relevant for the JSON export are the id, label and comment of each term.
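As a rough illustration of such a lightweight export, the sketch below reduces a list of parsed terms to their id, label and comment and encodes the result as JSON. The function name vocab_service_export_json(), the flat list structure and the sample terms are assumptions; the exact layout used by the schema.org module is not reproduced here.

```php
<?php
/**
 * Hypothetical helper: keeps only id, label and comment for each term.
 */
function vocab_service_export_json(array $terms) {
  $export = array();
  foreach ($terms as $term) {
    $export[] = array(
      'id' => $term['id'],
      // Label and comment default to an empty string when the source has none.
      'label' => isset($term['label']) ? $term['label'] : '',
      'comment' => isset($term['comment']) ? $term['comment'] : '',
    );
  }
  return json_encode($export);
}

// Illustrative sample data; a real run would use the parsed vocabulary source.
$terms = array(
  array('id' => 'foaf:Person', 'label' => 'Person', 'comment' => 'A person.', 'type' => 'rdfs:Class'),
  array('id' => 'foaf:name', 'label' => 'name'),
);
$json = vocab_service_export_json($terms);
$decoded = json_decode($json, TRUE);
```

Any other values present in the parsed source (such as the 'type' key in the sample) are dropped, which keeps the file small for consumer sites.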

Hosting on Git

While the initial thought was to host this service on a Drupal site, an alternative, more collaborative model is to use GitHub, where the service is built and maintained by the community in the same fashion as Homebrew. Anyone can fork the vocabulary repository and suggest new vocabularies or updates to existing ones. Another benefit of this approach is the decentralized nature of Git: for example, people can add their own custom vocabularies while keeping up with the official repository. It is also easy to set up redundant service nodes that can take over if, for example, GitHub is down. These benefits would be harder to achieve with a regular Drupal site.
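If custom vocabularies were kept in a separate registry array in a fork, combining them with the official registry could be as simple as the sketch below. This layout is an assumption, as are the function name vocab_service_merge_registries() and the rule that official entries win when a prefix is defined in both places.

```php
<?php
/**
 * Hypothetical helper: combines the official registry with local additions.
 *
 * Returns the merged registry and the list of prefixes defined in both.
 */
function vocab_service_merge_registries(array $official, array $custom) {
  // Prefixes defined in both registries, reported so they can be reviewed.
  $collisions = array_keys(array_intersect_key($custom, $official));
  // The + operator keeps the left-hand entry when both arrays share a key,
  // so official definitions take precedence over custom ones.
  $merged = $official + $custom;
  return array('vocabs' => $merged, 'collisions' => $collisions);
}

// Example: one local addition and one prefix that collides with upstream.
$official = array(
  'foaf' => array('prefix' => 'foaf', 'namespace' => 'http://xmlns.com/foaf/0.1/'),
);
$custom = array(
  'foaf' => array('prefix' => 'foaf', 'namespace' => 'http://example.com/my-foaf/'),
  'acme' => array('prefix' => 'acme', 'namespace' => 'http://example.com/acme#'),
);
$result = vocab_service_merge_registries($official, $custom);
```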

Comments

Git decentralized model for vocabulary republishing service?

Posted by scor on February 22, 2012 at 8:24pm

I'd like to revive this wiki page and also allow some discussion around it. I've just added another idea regarding hosting, which would be to rely on GitHub in a model similar to Homebrew (check the last section of the wiki page above).

For reference, if you are

Posted by dman on February 22, 2012 at 8:57pm

For reference, if you are interested, over in taxonomy_xml I've been doing something similar for (taxonomy) vocabularies for a few years already. I never got around to doing the centralized server/index repo thing (though have always wanted to) but I thought I'd throw up my info schema for comparison.
As seen at
http://drupalcode.org/project/taxonomy_xml.git/blob/refs/heads/7.x-1.x:/...

    'WCA-terms' => array(
      'provider' => 'W3C ',
      'name' => 'Web Characterization Terminology & Definitions Sheet',
      'servicetype' => 'lookup',
      'description' => 'Glossary of web technology terms used by the W3C. (WCA-terms)',
      'about' => 'http://www.w3.org/QA/2003/01/Glossary',
      #'identifier' => 'identifier',
      'protocol' => 'URI',
      'pattern' => 'http://www.w3.org/2003/03/glossary-project/data/glossaries/WCA-terms.rdf',
      'format' => 'rdf',
    ),

If this is to be consumed and selected by end-users, I found it vital to add human labels, descriptions, and documentation URL references to extend the data pretty quick.
My UI presents a drop-down where users can choose which external datasets to import, and a help page where each of the repositories and services is provided.

Despite some crossover (especially of terminology), the jobs we are doing are different. I just wanted to suggest: add a human label and description into your spec early on!

... and now that I think of it ... Why not use something DOAP-shaped ? :-B

Hey Dan!

Posted by scor on February 22, 2012 at 9:25pm

Thanks for your suggestions Dan. You're definitely right that human labels are important. I've added them to the wiki (feel free to edit it too!). re 'format', 'servicetype' and 'protocol', these might be useful for the script fetching the source and generating the format(s) published by the service. 'provider' could be added too.

DOAP is a good idea and needs to be explored. Turtle or JSON-LD might be more appropriate than XML, and could look close to the arrays above. We will have an RDF toolchain anyway when fetching, parsing and republishing the vocabs. It might be an extra requirement on the consumer side though, let's see.

Thanks for posting, you're

Posted by linclark on February 22, 2012 at 9:32pm

Thanks for posting, you're definitely right about adding in labels and descriptions early. I have been implementing a service-based vocabulary import for Microdata and while I did have labels, I neglected descriptions at first. I just added them when a user requested them.

One of the motivations for this is to move away from requiring an RDF parsing library. Unfortunately, the library we use is no longer supported or even recommended by its developer, and there don't seem to be any up-and-coming PHP-based libraries, which is worrisome. As a result, we want to stay in JSON and I think we want to stay as lightweight as possible.

We could potentially reuse some of the DOAP modeling and the property names. I don't know whether this would buy us anything or not.

Problem loading software

Posted by rongcheek on February 22, 2012 at 10:37pm

Hostgator is trying to work out the problems. I just spent 55 minutes on the phone with them. Scor, I appreciate your help and once it is in place I'll get back. Enjoy your meeting.

Ron

Not serious about DOAP, just for reference

Posted by dman on February 22, 2012 at 10:39pm

Yeah, not really very interested in overloading the simple JSON to become full RDF - but just re-using the attribute names for consistency with DOAP terminology is probably a win. Even if just for the saving in brain-space.