Advice on Technique for LOD-enabled "Digital Library"

joelrichard

Good morning,

I've cross-posted this to both Semantic Web and Libraries because I think there are elements that appeal to both.

I'm going to give a bit of background on the project I'm working on and hopefully I can get some confirmation as to whether or not this is the "best" way to go about doing this. This is Drupal 7, of course. So far, I've been moving in the most obvious direction and using the tools that are most readily available.

Back-story:

We're looking to develop a digital library (which is, I know, a somewhat vague term) containing online books, exhibitions, collections of images, collections of other things, and a few online databases. The Linked Open Data (LOD) part of this is easy for most of the content, be it schema.org or other methodologies.

One big part of this digital library is an online database of sorts containing an authoritative index of botanists and their publications. The goal is to have a semantic web identifier for each of these 10,000 botanists and their 37,000+ publications. And here are my questions and my possible answers.
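The "semantic web identifier for each record" goal usually comes down to minting a stable URI from each record's ID. A minimal sketch of that pattern (the domain and path scheme here are placeholders, not anything from this project):

```python
# Sketch: minting stable identifiers for botanist and publication records.
# BASE and the path scheme are illustrative placeholders.
BASE = "https://example.org/id"

def botanist_uri(botanist_id: int) -> str:
    """Return a stable URI for a botanist record."""
    return f"{BASE}/botanist/{botanist_id}"

def publication_uri(pub_id: int) -> str:
    """Return a stable URI for a publication record."""
    return f"{BASE}/publication/{pub_id}"

print(botanist_uri(42))  # https://example.org/id/botanist/42
```

The important property is that the URI is derived only from data that never changes (a primary key), so the identifier stays valid as the site evolves.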

Questions

Q1: Is importing these into Drupal nodes the best way to go? (I think it's the only way to go, as I need node references and I'd like to take advantage of all that D7 has built in.)

Q2: Is the Feeds Import module the best/fastest/most reliable way to get data into D7?

Q3: With regard to RDFa and the RDFx module, how much should I care about content negotiation? Obviously RDFa is handled for me, but if/when a remote computer comes poking at my site and requests application/rdf+xml, is something useful going to happen?

Q4: How much should I expect performance to degrade with an extra 47k nodes in my system (if we're talking triples, then it's roughly 500k), and what will happen when we load another database that will increase the amount of data by a factor of 10? I assume I'll need more server resources, but can Drupal gracefully handle this amount of content? Do I have to worry about ARC or SPARQL performance?
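A quick back-of-envelope on the figures above (the per-node ratio is just what the stated numbers imply, not a measured value):

```python
# Rough arithmetic for the figures in Q4.
nodes = 47_000                        # 10k botanists + 37k publications
triples = 500_000                     # the rough estimate from the question
triples_per_node = triples / nodes
print(round(triples_per_node))        # 11 (roughly 10-11 triples per node)

future_triples = triples * 10         # after the 10x data load
print(future_triples)                 # 5000000
```

So the second database would put the store at around 5 million triples, which is the number that matters when judging whether ARC2 is still an option.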

I hope that when this is done and working, it will be something that exhibits all that Drupal + RDF can be, and although I've had challenges already, I hope I don't encounter more. :)

Thanks!

--Joel

Comments

Unfortunately, I'm not sure

linclark

Unfortunately, I'm not sure whether you can do what you would like to do with off-the-shelf components in Drupal, mostly because of Q4.

Q1: You could also create custom entities and use entity reference module. If you are building this for pedants, you will want to note that Drupal's handling of RDF doesn't concern itself with httpRange-14.

Q2: Generally, Feeds module (different from Feed import module) works pretty well and is the most widely used.

Q3: Unfortunately, the maintainers of RDFx (myself included) haven't been actively working on the module in nearly a year. An issue here or there gets fixed, but there hasn't been real forward progress.

That said, one feature that was completed was integration with RestWS to serve different serializations of the content. You will need to enable that module and set up the RESTful web services permissions. RestWS does respect Accept headers.
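For illustration, the essence of what Accept-header negotiation does can be sketched as follows. This is only the idea in miniature; RestWS handles all of this inside Drupal, and the function here is illustrative, not RestWS's actual API:

```python
# Minimal sketch of choosing a response format from an Accept header.
def negotiate(accept_header: str) -> str:
    """Pick the first offered media type that the client accepts."""
    offered = ["application/rdf+xml", "text/html"]
    for media_type in accept_header.split(","):
        media_type = media_type.split(";")[0].strip()  # drop q-values
        if media_type in offered:
            return media_type
    return "text/html"  # sensible default for browsers and unknown clients

print(negotiate("application/rdf+xml"))              # application/rdf+xml
print(negotiate("text/html,application/xhtml+xml"))  # text/html
```

A real implementation also weighs q-values and wildcards, which is exactly why it's worth letting a module do it for you.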

Q4: If you are looking to provide a performant SPARQL endpoint with 500K triples, ARC2 is probably not your best bet. While there was some talk of making the SPARQL backend pluggable, I don't believe any work has happened on that front (someone correct me if I'm wrong).

SPARQL endpoint performance aside, Drupal can handle that much content. However, you would need to tune performance. There are a variety of tutorials and presentations for this.

Migrate module

scor

For importing data into Drupal, provided you are willing to write some code, I highly recommend the Migrate module. It's ideal if you have custom data to import into various field types. It supports an increasing number of contrib fields, integrates with Drush, supports rollback, and has lots of documentation; it's great for debugging and getting your import script right. It's also very fast, and you can see the progress of your import while it's running. As a developer/hacker you will love it.

You could use ARC2 during the prototyping phase, but having an alternative store would be good, especially if you plan to grow (it also depends on the type of queries you plan to run). Pluggable backends will most likely be built as service classes for search_api; if you're interested, please feel free to contribute to this issue.

I forgot to note two gotchas

linclark

I forgot to note two gotchas with the core RDF output. You will want to be aware of these.

  1. RDFa output doesn't work for some fields, such as Relation or compound fields like AddressField, because of the way the HTML is constructed in the field formatters.
  2. RDFa output doesn't work with Views very easily. You can make it work by configuring the view to output the whole node instead of individual fields, but that isn't an option for some people.

Agree, and a few pointers

tinflute

Q1: Yes, importing in some form or another would be required. (The alternative of using Drupal to access an existing legacy database would not be fun, and Drupal probably wouldn't be the right tool for such an approach.)

Q2: From my research on this topic, there is a spectrum from hard+flexible to easy+inflexible, with Migrate on the left, Feeds in the middle, and Node import on the right. Node import requires no coding, and I have successfully imported 10k nodes with it (took about 36 hours).
-- No matter what approach you take, definitely take a tiny slice of your import dataset and test on that until you've debugged the import script. Only run the entire set once you are totally sure the script handles all the outlier items (nulls, invalid strings, weird characters, etc.).
-- Another tip on importing: you'll most likely be importing your content into custom content types defined with CCK and various contrib field types (e.g. location, ISBN, etc.). The ease of the import will depend on how well you set up your content types, so take the time to thoroughly understand CCK and the content system. Some contrib field types will cause problems with the import, so be ready to mix and match, and find workarounds that facilitate the import process.
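The "test on a small slice first" advice can be scripted. A hypothetical pre-flight check that scans the first rows of a CSV export for the outlier items mentioned above (the field names and sample data are made up):

```python
import csv
import io

def preflight(csv_text: str, sample_size: int = 100) -> list:
    """Flag rows in a small sample that would likely break an import."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        if i >= sample_size:
            break
        for field, value in row.items():
            if value is None or value.strip() == "":
                # +2: one for the header row, one for 1-based line numbers
                problems.append(f"row {i + 2}: empty {field!r}")
            elif not value.isprintable():
                problems.append(f"row {i + 2}: odd characters in {field!r}")
    return problems

sample = "name,year\nLinnaeus,1753\n,1789\n"
print(preflight(sample))  # flags the row with the empty 'name'
```

Running something like this against each slice before importing catches most of the nulls and weird characters without burning hours on a full run.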

Q3. The Drupal semantic web crew (incl. linclark) are heroes, and their efforts are bringing Drupal to the fore of the semantic web. If the contrib modules don't handle your situation, you could check out the suite of semantic tools offered by Structured Dynamics. They're open source and Drupal-based, and, yes, they have a steep learning curve. I haven't set up their system before but hope to one day soon.

Q4. I recommend using Apache Solr as your search index. And if this is a production site, plan to have separate server boxes for Solr and your database.

Q4. I recommend using Apache

scor

> Q4. I recommend using Apache Solr as your search index. And if this is a production site, plan to have separate server boxes for Solr and your database.

I'm not sure what kind of search Joel wants to build. If it's just a regular keyword search (with maybe some facets), then you don't need SPARQL, and you should just go with Solr, which will scale very well (you can either host it yourself or buy a slice from a commercial SaaS vendor). You only need SPARQL if you intend to offer a SPARQL endpoint for people or tools to query your site using the SPARQL query language for RDF. Solr doesn't support SPARQL.

Import speeds

joelrichard

36 hours to import 10k nodes? That's a vast difference from what I was seeing with Feeds. I think I was seeing about 5k nodes per hour (or so) with my import, but perhaps my nodes were smaller or my computer faster. (37k nodes in about 8 hours, assuming I had no errors.)
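As a sanity check on the two reported rates:

```python
# Comparing the two import rates mentioned in this thread.
feeds_rate = 5_000              # nodes/hour reported with Feeds
print(37_000 / feeds_rate)      # 7.4 hours for 37k nodes, matching "about 8"

node_import_rate = 10_000 / 36  # nodes/hour implied for Node import
print(round(node_import_rate))  # roughly 278 nodes/hour
```

So the gap is roughly a factor of eighteen, which is large enough that node size alone probably doesn't explain it.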

Thanks everyone

joelrichard

I'm going to go ahead and move forward with the project, using native Drupal / RDFx.

Q1/Q2. I've decided to forgo Node import, and I'm building this as a module for portability. The .install file will create the content types and import the nodes, which I think will be faster than Node import. We'll see. I should have that done today.

Q4. I think for now, based on what Lin said, that I will stay away from a SPARQL endpoint until such time as I can test the performance and understand it better.

I'm learning D7 module building, so I'd like to think that one day I can contribute back to RDFx. I strongly believe that Linked Open Data is the way to go, and Drupal is my platform of choice...

Regarding Solr, I do believe we'll go there one day. In the short term, we're using an in-house Google Search Appliance for our general keyword search, so I'm not going to fall back on the built-in Drupal search.

When the module is done, perhaps I'll post it here and share it with you all. (botany! exciting!)

I will continue to investigate the other things mentioned here. Thanks, everyone!
