Implementing Kendra Signpost using Drupal and RDF?

dahacouk

We (the Kendra team) are in a bit of a quandary. We need to implement our EU-funded project (Kendra Signpost) using Drupal, but given recent D7/RDF announcements we are unsure which direction to take: D6 or D7, and which modules to use.

Here is a brief description of what we want to achieve, along with questions on how to proceed, given that we have developers who can contribute to collaborative efforts so long as there is mutual benefit and we get to build Kendra Signpost.

Overview

  • Kendra is being paid by the EU to build Kendra Signpost, which needs to launch in mid-2010. Let's put this EU money to good use.
  • Initially we wanted to build this system using only Drupal 7, but a number of modules are just not ready for us yet. As soon as Drupal 7 is ready, though, we want to move all projects to it. So we need clear upgrade paths when we choose Drupal modules, and perhaps should only work on and contribute to efforts that target Drupal 6 and 7 simultaneously.

Questions

  • How can Kendra implement Kendra Signpost using existing Drupal modules/APIs or extending existing Drupal modules/APIs?
  • How can Kendra combine this work with extending Drupal and RDF in Drupal to the maximum benefit?
  • Would now be a good time for us to join the Drupal core RDF effort, or are things sufficiently advanced that we would just get in the way of current developers?
  • If the latter is the case, how can we best help build on top of the core in ways that would provide the best advantage to both the Kendra and Drupal projects?
  • What stage are we at with Drupal 7 core RDF development, e.g. alpha, pre-alpha, beta?
  • I know it's a tricky one but, given the current pace of effort, can you give us any guidance on timescales?
  • Do you have any design specs, documentation, etc, we could read to get up to speed on the current state of development?

Kendra Signpost

  • Kendra Signpost is a system for authoring, publishing and consuming media metadata using RDF (and possibly other markup schemes).
  • The emphasis is on high-level metadata: creating and maintaining ontologies and vocabularies.
  • But there needs to be at least simple low-level metadata catalog support present to help prove the concept - see the scenario below.
  • Wherever possible, we want to use existing RDF infrastructure.
  • In the short term, we will be emitting RDF from structured data held in CCK nodes (see the sketch after this list).
  • We also need to demonstrate ingesting CSV and various XML schemas and, if possible, RDF too - this is the interesting bit, and the real long-term goal of the project.
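
To illustrate the short-term, output-only approach mentioned above, here is a minimal sketch - not working Kendra code - of serializing CCK-style node fields to Turtle. The field names and the field-to-predicate map are hypothetical placeholders.

<?php
// Minimal sketch of output-only RDF: serialize a couple of CCK-style
// node fields to Turtle. Field names and the field-to-predicate map
// are hypothetical.
function kendra_node_to_turtle($node) {
  $map = array(
    'field_title'    => 'dc:title',
    'field_composer' => 'dc:contributor',
  );
  $subject = url('node/' . $node->nid, array('absolute' => TRUE));
  $lines = array('@prefix dc: <http://purl.org/dc/terms/> .');
  foreach ($map as $field => $predicate) {
    if (!empty($node->{$field}[0]['value'])) {
      // Crude literal escaping; a real serializer should do better.
      $value = addslashes($node->{$field}[0]['value']);
      $lines[] = '<' . $subject . '> ' . $predicate . ' "' . $value . '" .';
    }
  }
  return implode("\n", $lines);
}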

Working with Drupal RDF

  • We don't want to reinvent the wheel.
  • We understand that Drupal 6 RDF support is being withdrawn.
  • We need to know the current state of Drupal 7 RDF support.
  • Our backup plan is to cobble together our own output-only RDF support - simple but tedious to implement, inelegant and wasteful compared to working with proper core RDF support.
  • However, if we do have to bodge things in the short term, we would like to at least do this in such a way that it is ready to be ported to the Drupal 7 RDF support as and when it is available.

Walkthrough

  • Sign up via Drupal.
  • Upload a file of music or film catalogue data: CSV, RDF, XLS, or whatever.
  • Define fields and formats.
  • Define mappings to other schemas already in the Kendra Signpost database.
  • Publish!
  • Help/comments e-mail service, so we can add new features if needed.
  • Be able to re-jig and re-publish at any time (i.e. change definitions or add/remove/change data).
  • Keep the original data in a flat file.
  • Enable searches across multiple catalogues using similar/mapped terminology/fields, since every field in a schema has been mapped to at least one field in another schema.

Scenario A

Here's a walkthrough of how the Kendra Signpost authoring system should work...

Consider the owner of a small record label who wants to make their content available on the web in a way that allows it to be automatically indexed and aggregated/syndicated by third parties.

Almost universally, they have a catalog of all their past releases stored as a set of Excel spreadsheets in a range of different formats that have changed over time, without any standardisation of their fields or the data within them.

First, they get hold of an account on the Kendra website. For the purposes of this trial, Kendra will host the Kendra Signpost system. In the future the system will be installable on their own server, or provided as a hosted service by a third party.

Once they've logged in, they are presented with a start page with a short summary of how to get started, with links to more help information and examples of how to use the system. They decide to dive in, and click the "get started" button.

First, the system invites them either to upload a catalog file or to start defining catalog fields from scratch. In either case the user will be able to choose from a wizard, a blank slate or a pre-defined schema - these could be industry-standard schemas, schemas from other users, or their own previous definitions. Can we import schemas from other sources?

Let's say they upload a file. Once the file has been uploaded, the system parses it and presents them with a quick preview of the top of the file, in a way similar to a database import wizard. Below the preview, there is a list of the columns in the file, with places to fill in the definition of that column's data.

In some cases, the system will have already made a guess at the format and interpretation of a column, based on its heading, or the format of the data in the column below. In others, the definition will be partially or completely blank.

They can now start to work through the definition of the fields in the file, using a guided system of choices to determine the characteristics of that column's data. Wherever possible, that data type is mapped onto an existing metadata type from a known standard. Where it is not, the user may need to define their own formats and interpretations, with assistance from the system.

Even where it is impossible to get a precise match, it may be possible to suggest a field from a known metadata standard that has a similar or related meaning. For example, in a system that does not have a concept of "composer" or "lyricist", both may be said to be similar to the Dublin Core concepts of "contributor" and "creator", without necessarily implying that they are identical concepts. At the same time, they can let the system know that both of these concepts also describe people, that there may be one or many of them, and so forth.

At the end of the description process, all the metadata fields will have been described either by matching them to existing metadata fields, or by describing them in terms of similarities to existing concepts.

The system can then do two things: it can generate a description of their metadata fields as an ontology, itself defined in terms of relationships to other ontologies, and it can generate the metadata for every row of the spreadsheet in terms of that ontology or - where possible - in terms of other existing RDF vocabularies, producing either a subset of, or an approximation to, the entered data.
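
To make the "composer" example above concrete, this is the kind of Turtle fragment the description step might generate (held here in a PHP string). The ex: namespace is invented, and rdfs:subPropertyOf is only one possible formalization of "similar but not identical"; a weaker relation could be used instead.

<?php
// Hypothetical ontology fragment for a "composer" column that has been
// described as similar to, but not identical with, dc:contributor.
$ontology = <<<TTL
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dc:   <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/kendra/schema#> .

ex:composer a rdf:Property ;
    rdfs:label "Composer" ;
    rdfs:comment "The person who composed the work." ;
    rdfs:subPropertyOf dc:contributor .
TTL;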

All of this is available either for embedding within a website, or as a feed to, or in conjunction with, other services. Wherever possible, Kendra Signpost will use native Drupal constructs, in the simplest ways possible. Where necessary, it will define its own formats, but the Kendra Foundation will work together with Drupal developers and the wider metadata community to bring Kendra Signpost into conformance with existing standards, or to work with developers of emerging standards.

Since the Kendra Signpost software is implemented in Drupal, its content can naturally be used in any Drupal site. However, it can also generate RDF that can be linked to by, or embedded in, other web pages, or provided as feeds, etc, etc.

However, the process does not stop with ingesting and marking up a single file. Other files can be added that re-use, extend or amend the same semantic markup, in whole or in part. Files can be re-uploaded with revisions, if necessary.

For small users, the Kendra Signpost system can be used as an element of a Drupal-based content management system in its own right. Large users may prefer to use it merely as an adjunct to a much larger content management system, used only as a convenient way to create initial ontology markup, leaving the actual metadata generation to the main CMS.

Scenario B

All is dark and still. There is nothing in the Kendra Signpost metadata store. We import the following RDF vocabularies:

The next stage is to map them to each other. Can we use the current P2P-Next project mappings in .xsd/.rdf/.xml format?

This is a big matrix! So, we need a matrix for each schema and how it relates to each other schema. For 5 schemas the full matrix has 5 × 5 = 25 one-to-one tables; if order doesn't matter (as in A-relates-to-B is the same table as B-relates-to-A), that drops to 15 (the upper triangle of the matrix, which still counts each schema's trivial mapping to itself), or 10 if we leave the self-mappings out.

Or... we don't actually have to have a one-to-one table for every pair of schemas; tables can exist just for certain pairs. Say we map from P2P-Next onto all the other schemas for a start - that's only 4 one-to-one tables (relationship sets).

Notice we have no central core schema to map to? There is no master schema hard coded into the system. We are just hosting a spaghetti of relationships!
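
As a sketch of what that spaghetti might look like as plain data, here is a hypothetical set of pairwise relationship rows (schema and field names invented); each row is one entry in a one-to-one table, and no schema is privileged over any other:

<?php
// Hypothetical pairwise mappings: each row relates a field in one schema
// to a field in another, with a relation type. No master schema exists.
$mappings = array(
  array('p2p-next:title',     'equivalent', 'dc:title'),
  array('p2p-next:performer', 'similar',    'dc:contributor'),
  array('label-a:composer',   'similar',    'dc:contributor'),
  array('label-a:song_name',  'equivalent', 'p2p-next:title'),
);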

Then we bring in a maverick record label that has to do it their own way, with their own field names in an Excel/CSV file. They need to map into the Kendra Signpost metadata store too - but in much the same way.

Then we want to be able to perform a search on the Kendra Signpost metadata store - using both facets and 'smart' query-builder metaphors. We should be able to search using the terms of one schema and find matches where equivalent terms in other schemas have been mapped/linked together.

The search system should also be able to traverse the logic of the mappings. So, if term A=B and B=C, then we should be able to infer that A=C without having previously specified a direct mapping.
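
A minimal sketch of that inference, reusing the hypothetical $mappings rows above: a breadth-first walk that collects every field reachable through 'equivalent' links, so A=B and B=C yields A=C without a direct mapping.

<?php
// Expand a field to everything reachable through 'equivalent' mappings.
// Assumes the hypothetical $mappings array sketched earlier.
function kendra_equivalents(array $mappings, $field) {
  $seen = array($field => TRUE);
  $queue = array($field);
  while ($queue) {
    $current = array_shift($queue);
    foreach ($mappings as $m) {
      list($a, $relation, $b) = $m;
      if ($relation != 'equivalent') {
        continue;
      }
      // Equivalence is symmetric: follow the link in either direction.
      $next = ($a == $current) ? $b : (($b == $current) ? $a : NULL);
      if ($next !== NULL && empty($seen[$next])) {
        $seen[$next] = TRUE;
        $queue[] = $next;
      }
    }
  }
  return array_keys($seen);
}

// E.g. kendra_equivalents($mappings, 'dc:title') would also return
// 'p2p-next:title' and 'label-a:song_name'.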

We can then go on to optimise the search system with intelligent caches and automatically created indexes of relationship logic, all built from the mappings we create and by looking at the usage and weight of those relationships.


Cheers, Kendra Team

Comments

Initial thoughts

scor

This is great news, Daniel! Some initial thoughts before going into the details of the project...

# Would now be a good time for us to join the Drupal core RDF effort, or are things sufficiently advanced that we would just get in the way of current developers?

We're way past the code freeze, so it's too late. But the RDF mapping API in core should already get you quite far in terms of RDFa alone - then there is contrib.

# What stage are we at with Drupal 7 core RDF development, eg. alpha, pre-alpha, beta?

alpha3. API changes are quite rare now, so it's a good time to start building D7 modules. It's not recommended to go live yet, but you should get ready for when the first release candidate is out and have your site architecture and modules thought out and ready by then ;)

# I know it's a tricky one but, given current pace of effort, can you give us any guidance as to timescales?

yes, tricky. Dries promised to sing on stage during DrupalCon if there was a first RC out by April 19th, but that's unlikely since we've had no beta yet. It'll take several weeks to get rc1 out the door, but that does not mean we should wait and see (see my answer above).

# Do you have any design specs, documentation, etc, we could read to get up to speed on the current state of development?

The best place right now is to look at code documentation: http://api.drupal.org/api/group/rdf/7

I would recommend focusing on Drupal 7 and dropping Drupal 6. Note that Drupal 7 doesn't just bring you RDFa; you also get to benefit from all the other goodness of Drupal 7! You should make a list of the modules you would have used in Drupal 6 and check their status for Drupal 7. Some Drupal 6 modules are now unnecessary thanks to improvements in Drupal 7. And if you have resources to help port modules to Drupal 7, that's good too! Remember, the drop is always moving...

Drupal 7 all the way...

dahacouk

Thanks Stephane! I hear you. Drupal 7 it has to be. We've started with Drupal 6 for another project (actually a survey) because we need multilingual capabilities and we couldn't see the Drupal 7 modules we needed being mature enough. What's important is that we have a single user/profile database shared between the survey and the Kendra Signpost trial. Hopefully we can cobble that together until Drupal 7 has the multilingual capabilities we need.

What we really need to do now then is figure out how we implement Kendra Signpost using Drupal 7. I guess what would be preferable is working on or extending the Drupal 7 RDF contrib module to cater for our needs if the goals are aligned. I've just added a "Walkthrough" section above that summarises the steps we need to take.

There are a number of issues we need to address in terms of the UI that enables metadata fields to be mapped from one schema to another, and also in terms of how it all works under the hood. Looking forward to our coming discussions...

D7 for new builds

JeremyFrench

Paraphrasing what Dries said at the London meetup here:
"The biggest blocker to D7 is the upgrade path"
"D7 is more stable now than D6 was when released"
"Drupal gardens is D7"

So for a new build D7 is not a bad way to go, especially a new build that is focusing on RDF. The downside is that it may not be as stable as D6 is right now, and there are fewer modules.

Suggest you consider DrupalGardens

mlangfeld

You can try out Drupal 7 on DrupalGardens today (well, once you're accepted into the alpha/beta). It's a great way to see how it works, though you won't be able to use modules not already included (pretty basic at the moment). Still, with fields in core plus Simple Views, you can build a fair amount.

You will, however, have a voice to promote the modules you'd like to see. You'll also be beta-testing both Gardens and Drupal 7 itself (I've found a few bugs), helping the Acquia team advance Gardens and, occasionally, Drupal 7 and its launch date.

I expect you'd outgrow the Garden quickly. Your site can either stay in the Garden or move out freely at any time, and can now use a custom domain name.

Best, Marilyn

Just added Scenario B

dahacouk

Just added Scenario B, if that helps explain what we are trying to do. Looking forward to feedback...

Interesting project

milesw

This is an interesting project you've got on your hands. One thing I'm still unclear on after reading this overview is what the system provides for end users. This description stands out:

Kendra Signpost is a system for authoring, publishing and consuming media metadata using RDF (and possibly other markup schemes).

and then I see...

Consider the owner of a small record label which wants to make their content available to the web...

What does the record label owner get out of his time spent mapping his spreadsheet? Hopefully more than just some RDF/XML. I'm guessing his label will have some kind of nice profile on your larger site with all their releases, where the content is marked up with RDFa?

Should be able to re-jig and re-publish at any time (ie change definitions or add/remove/change data).

That means you want the record label owner to be able to correct/update what he's already uploaded?

In that case, I see two phases involved: 1) getting data into Drupal as content, and 2) getting the content out as RDF.

Some thoughts...

1) Look at the Feeds module. Even if it doesn't completely fit your needs, it's an extremely flexible and well-designed framework for importing and processing data that will surely give you some insight. The concept is that there are three steps involved in importing external data: 1) Fetching 2) Parsing 3) Processing. It comes with CSV and XML parsers, along with a processor that produces nodes. You can easily add your own plugins to handle any of those three steps, though. In your case, you would...

Fetching: Let a user upload a file
Parsing: Parse their file and provide CCK mapping options
Processing: Produce nodes based on those mappings
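
For illustration, a hedged sketch of how a plugin slots in. Feeds discovers plugins via hook_feeds_plugins(); the module, class and file names here are invented, and the parse() signature is from memory of the Feeds 2.x API, so check it against the branch you end up using.

<?php
// Hypothetical registration of a custom Feeds parser.
function kendra_feeds_plugins() {
  return array(
    'KendraCatalogueParser' => array(
      'name' => 'Kendra catalogue parser',
      'description' => 'Parses uploaded catalogue files into items.',
      'handler' => array(
        'parent' => 'FeedsParser',
        'class' => 'KendraCatalogueParser',
        'file' => 'KendraCatalogueParser.inc',
        'path' => drupal_get_path('module', 'kendra'),
      ),
    ),
  );
}

// Sketch of the parser class itself; placeholder logic only.
class KendraCatalogueParser extends FeedsParser {
  public function parse(FeedsSource $source, FeedsFetcherResult $fetcher_result) {
    $items = array();
    // One item per line of the raw upload (placeholder parsing).
    foreach (array_filter(explode("\n", $fetcher_result->getRaw())) as $line) {
      $items[] = array('raw' => $line);
    }
    return new FeedsParserResult($items);
  }
}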

Whether you use one big content type or many depends on the data and the number of properties you're dealing with. You can also do some mapping to taxonomies (or Content Taxonomy) for more classification and filtering possibilities.

Note: Feeds doesn't have a D7 release yet.

2) Generate RDF on the fly. Each of the CCK fields would represent an RDF property. You define those mappings behind the scenes -- the user shouldn't have to see any URIs. Stephane's RDF CCK module should be a good start.

Personally I don't think storing and manipulating RDF makes sense, though I haven't tried out the core RDF tools in D7 yet. However, the vast majority of Drupal tools are built to work with traditional content from the Drupal database -- nodes, fields, terms etc.

RDF storage and mapping

darrenmothersele

Thanks Miles, there are some interesting points here. I've been working with Kendra on this, and we're looking at building on D7 and the RDF features in core and contrib.

Miles, I understand your final point about Drupal tools being built to work with nodes, fields, terms, etc., but in this case we are talking about a catalogue site where we will be working with RDF data as input as well as output. More on this below under Storage.

Feeds module

I have used the Feeds module with D6 and it's a good fit for what we are trying to do. Eventually we will definitely benefit from pluggable fetching/parsing/processing, as we could have uploaded files, imported RDF, and other sources. The fact that there is no D7 version in the works is an issue, and something I'm having a look at.

Storage

It seems to make sense that each imported item is assigned to a node, but each import can potentially have a different structure. We could generate content types programmatically with the appropriate field specifications to receive the imported data, but I'm not sure this is scalable, as we could end up with a lot of content types.

I think we will need a generic node format to import the data into, and then store attributes either in Data (as with Feeds) or in a generic field. As Data is not available in D7 yet, we will probably use the Schema API directly.
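
As a rough illustration of the Schema API route, a minimal sketch of a generic attribute table keyed by node; the module and table names are hypothetical:

<?php
// Hypothetical generic attribute storage: one row per imported attribute,
// keyed by the node that represents the imported item.
function kendra_schema() {
  $schema['kendra_attribute'] = array(
    'description' => 'Generic attribute storage for imported catalogue rows.',
    'fields' => array(
      'nid' => array(
        'type' => 'int',
        'unsigned' => TRUE,
        'not null' => TRUE,
        'description' => 'Node that holds the imported item.',
      ),
      'name' => array(
        'type' => 'varchar',
        'length' => 255,
        'not null' => TRUE,
        'description' => 'Attribute (source column) name.',
      ),
      'value' => array(
        'type' => 'text',
        'description' => 'Attribute value as imported.',
      ),
    ),
    'primary key' => array('nid', 'name'),
  );
  return $schema;
}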

The Feeds API does provide processors that store content using Data, and I see there is a module that manages links between this imported Data and nodes. I am wondering whether that is really a better solution here than generating lots of different content types.

RDF Mapping

We can use hook_rdf_mapping() in D7 to define new RDFa output mappings. There doesn't appear to be a hook_rdf_mapping_alter() hook or similar, so I'm not sure it's possible to manipulate existing mapping definitions from other modules. I am wondering how this could be achieved if required?
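
For reference, a minimal sketch of defining a mapping this way; the bundle and field names are hypothetical (the dc: prefix is declared by core, other namespaces would need hook_rdf_namespaces()):

<?php
// Hypothetical D7 mapping for a "catalogue_item" node type.
function kendra_rdf_mapping() {
  return array(
    array(
      'type' => 'node',
      'bundle' => 'catalogue_item',
      'mapping' => array(
        'rdftype' => array('dc:BibliographicResource'),
        'title' => array(
          'predicates' => array('dc:title'),
        ),
        'field_composer' => array(
          'predicates' => array('dc:contributor'),
        ),
      ),
    ),
  );
}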

RDF mapping

milesw

Cool, thanks for posting details about your progress. Thorough writeups for real projects like this one can be amazingly valuable, especially in this case where you're working with RDF -- something new and less familiar. So please do post updates when you guys find time :)

I'm still a bit unclear on how much of a public frontend you intend to build. If you don't need to create fancy displays or filtering, and it's really all about the metadata, then maybe there's no reason to map to content types / CCK fields?

Today I finally installed D7 and started looking at the core RDF features. If you take a look at the comments for rdf_modules_installed() it explains why there is no hook_rdf_mapping_alter(). Basically, if a module defines RDF mappings and claims them as defaults, those mappings are not saved to the database. When you define mappings using hook_rdf_mapping() that are not marked as defaults (with 'RDF_DEFAULT_BUNDLE'), they'll override any existing defaults.
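
So, to answer the earlier question about manipulating mappings from another module, a hedged sketch using core's CRUD functions (bundle and field names hypothetical):

<?php
// Sketch: override part of an existing mapping at install time using
// rdf_mapping_load()/rdf_mapping_save() from the D7 core RDF module.
function kendra_install() {
  $mapping = rdf_mapping_load('node', 'catalogue_item');
  $mapping['field_composer']['predicates'] = array('dc:creator');
  rdf_mapping_save(array(
    'type' => 'node',
    'bundle' => 'catalogue_item',
    'mapping' => $mapping,
  ));
}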

At first glance the D7 core aggregator looks very flexible. You can define custom fetchers, parsers and processors, but apparently not on a per-feed basis. I'm way behind the times here and need to start testing some of this stuff...
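
For anyone following along, a rough sketch of those aggregator hooks; the module name is invented, and the use of $feed->source_string and $feed->items follows my reading of aggregator.api.php, so verify it against your Drupal version before relying on it.

<?php
// Hypothetical custom parser for the D7 core aggregator.
function kendra_aggregator_parse_info() {
  return array(
    'title' => t('Kendra catalogue parser'),
    'description' => t('Parses catalogue data into aggregator items.'),
  );
}

function kendra_aggregator_parse($feed) {
  // $feed->source_string holds the raw fetched data; populate $feed->items.
  $feed->items = array();
  foreach (array_filter(explode("\n", (string) $feed->source_string)) as $line) {
    $feed->items[] = array(
      'title' => $line,
      'link' => '',
      'description' => '',
      'guid' => md5($line),
      'timestamp' => REQUEST_TIME,
    );
  }
  return TRUE;
}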
