Triples Stores Integration With Drupal

kunaljain18n - Mon, 2008-03-24 08:58

Hi All,

I have recently been looking into Drupal's initiative around the Semantic Web. I also looked at the contributed RDF modules and the APIs associated with them.

In that context, I would like to open a discussion about integrating third-party triple stores with Drupal and the available RDF APIs. Right now the triples are stored in an RDBMS, say MySQL, but I have doubts about how well this approach scales. With, say, 1 billion triples in the store, how would it perform?

Any ideas on how to plug in another triple store and use its semantic APIs to store the information?
How about integrating Franz AllegroGraph, which I have read is considered a good triple store in terms of performance?

Thanks
Kunal

triples stores integration

dorgon@drupal.org - Mon, 2008-03-24 12:46

hi,

I haven't looked at the Drupal APIs yet, but I was looking at RAP and ARC2 last week [1]. I'm currently working on a best-practice proposal for a scalable three-tier Semantic Web architecture with Drupal in the front-end, Java/.NET/etc. in the back-end and a scalable RDF store in the middle.

The simplest way to access a store is via the SPARQL protocol (REST). Although this works (even for concurrent requests), in a scalable three-tier scenario under heavy load this approach will likely fail because of the lack of transactions and worse performance compared to persistent ODBC/JDBC connections.

On the other hand, REST is universal, and I think it would be the best choice for smaller applications or for partial integration of RDF in some module of your application. If everything you store is triples and you'd like to rely fully on the store, I think it is not a good solution.

My conclusion: if you want to partly integrate RDF or play around, use one of the available SPARQL clients to access your store of choice (some do support SPARUL (Jena) or SPARQL+ (ARC2) for updates). If you want to use Drupal/PHP as a frontend for a larger application with a triple store in the middle, help me find an optimal solution ;-).
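As a rough illustration of the REST route, here is a minimal Python sketch of how a SPARQL protocol query is typically sent: the query text goes into a URL-encoded `query` parameter of an HTTP GET. The endpoint URL and the helper name `build_sparql_request` are assumptions made up for this example, not something taken from a particular store or client library.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build an HTTP GET request that carries a SPARQL query.

    The SPARQL protocol passes the query text in a URL-encoded
    "query" parameter. The Accept header asks for JSON results.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(
    "http://localhost:8890/sparql",  # assumed endpoint URL, adjust for your store
    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
)
print(request.full_url)
# Actually sending it would be: urllib.request.urlopen(request).read()
```

Each call like this is an independent HTTP request, which is why there is no transaction spanning several queries.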

I've started with Virtuoso and ODBC - you can just send SPARQL queries over ODBC by putting the "sparql" prefix in front of your query string. What you get back is an ordinary iterable result set. I compiled mod_odbc into PHP and now I'm playing around with Virtuoso. Next, I'd like to write a wrapper for RAP graphs which allows me to transparently access Virtuoso-backed graphs/models. I'll start as soon as OpenLink has fixed the broken graph wrapper for Jena (according to Kingsley it's on the agenda with high priority), since this should be my basis for the PHP/RAP port.
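The prefix trick can be sketched in a few lines of Python. This is only an illustration under assumptions: `run_sparql_over_sql` is a hypothetical helper, it expects any DB-API style cursor (for example one obtained from an ODBC connection to Virtuoso), and `RecordingCursor` is a stand-in so the sketch runs without a database.

```python
from typing import Any, List, Sequence


def run_sparql_over_sql(cursor: Any, sparql: str) -> List[Sequence[Any]]:
    """Send a SPARQL query through a SQL connection.

    The query is prefixed with the SPARQL keyword so that the server
    treats the statement as SPARQL. `cursor` is any DB-API style cursor.
    """
    cursor.execute("SPARQL " + sparql.strip())
    return cursor.fetchall()


class RecordingCursor:
    """Stand-in cursor for the demo: records statements, returns canned rows."""

    def __init__(self, rows):
        self.rows = rows
        self.executed = []

    def execute(self, statement):
        self.executed.append(statement)

    def fetchall(self):
        return list(self.rows)


cursor = RecordingCursor([("http://example.org/a", "http://example.org/p", "b")])
for row in run_sparql_over_sql(cursor, "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 1"):
    print(row)
```

With a real ODBC driver the cursor would come from the driver's connection object instead of the stand-in; the rest stays the same.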

Any comments welcome!

Regards,
Andy

[1] http://groups.drupal.org/node/10044


Deployment obstacle

soeren - Mon, 2008-03-24 13:07

Andy,

I think it would be a major obstacle for Drupal deployment if the knowledge store functionality relied solely on a Java-based or other external triple store. We actually did a lot of work to make RAP/Erfurt (the RDF API behind OntoWiki) scale in the same range as other triple stores (cf. my earlier post in [1]). Of course, for large installations it can definitely make sense to allow plugging in a specialized triple store such as Virtuoso. In OntoWiki/Erfurt we solved the problem by establishing a clear triple store interface (based on SPARQL) - the out-of-the-box version works directly on top of a MySQL table (which is surprisingly fast, due to our direct translation from SPARQL to SQL). When you need more performance you can plug in other triple stores that offer SPARQL support. According to our experiments you can gain 5-25% depending on the size of the graph (which in my view is not enough to justify external deployment dependencies).
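To give an idea of what a direct SPARQL-to-SQL translation can look like, here is a small Python sketch. It is not the Erfurt/OntoWiki code: the single `triples(s, p, o)` table, the function `bgp_to_sql` and the sample data are assumptions for illustration, and SQLite stands in for MySQL so the example is self-contained. Each triple pattern becomes one alias of the table, and a variable shared between patterns becomes a join condition.

```python
import sqlite3


def bgp_to_sql(patterns):
    """Translate a list of (s, p, o) triple patterns into (sql, params).

    Terms starting with "?" are variables, everything else is a constant.
    Variable names are assumed to be plain identifiers.
    """
    select, where, params = {}, [], []
    for i, pattern in enumerate(patterns):
        for column, term in zip(("s", "p", "o"), pattern):
            ref = f"t{i}.{column}"
            if term.startswith("?"):
                if term in select:
                    where.append(f"{ref} = {select[term]}")  # join on shared variable
                else:
                    select[term] = ref  # first occurrence defines the output column
            else:
                where.append(f"{ref} = ?")
                params.append(term)
    columns = ", ".join(f'{ref} AS "{var[1:]}"' for var, ref in select.items()) or "1"
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT {columns} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


db = sqlite3.connect(":memory:")  # SQLite as a stand-in for a MySQL table
db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:alice", "foaf:name", "Alice"),
])

# Roughly: SELECT ?friend ?name WHERE { ex:alice foaf:knows ?friend . ?friend foaf:name ?name }
sql, params = bgp_to_sql([
    ("ex:alice", "foaf:knows", "?friend"),
    ("?friend", "foaf:name", "?name"),
])
rows = db.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Because the whole pattern is answered by one SQL statement, the database can use its own join planning and indexes, which is presumably where much of the speed of such an approach comes from.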

Happy Easter egg hunt

Sören

[1] http://groups.drupal.org/node/10044

what do you mean with...

dorgon@drupal.org - Mon, 2008-03-24 17:00

..."clear triple store interface"? Does it mean another application, e.g. a Java backend can access the store from the outside? This would be reasonable.

By the way, I'd never suggest integrating external stores for PHP-only Drupal sites. The reason I came up with the idea of integrating Virtuoso is the following:

We have a Java application which does crawling, scraping, and analyzing of Web data. So, first point: we already use Java and Jena but could switch to Virtuoso (there is a Jena graph wrapper available from OpenLink; in fact it's currently broken, but they will fix it soon). Secondly, we would like to use Drupal for the frontend which is PHP. So I'm looking for a scalable solution with Java in the back, a universal RDF store in the middle and PHP in the front. Each frontend request will require multiple SPARQL queries for pulling data. And the Java app in the back will update the store frequently.

Concerning RDF storage there are basically three options:

  1. Backend-centric storage: use a Java in-process store (e.g. Jena backed with MySQL) + php-java-bridge
  2. Frontend-centric storage: use a PHP store (e.g. RAP) in the Frontend + REST/SPARQL interface for the Java app in the back
  3. Three-tier: use a universal store in the middle (REST/SPARQL or ODBC)

Additionally, it may become relevant for us to use transactions which I get for free when choosing option #3.

Based on the experiences you mention, option #2 is more interesting than I first believed. I think it also depends on the load of the frontend vs. the backend. Especially if there are many more frontend requests than queries issued by the Java app in the back, option #2 may be preferable. I'd really be interested in the interface you talked about. Can you add some details (maybe pointers into some SVN/code)?

thanks
Andy


SPARQL and HTTP/1.1 keeping connections temporarily open

dorgon@drupal.org - Mon, 2008-03-24 17:10

What I experienced when executing multiple SPARQL queries over HTTP was that the JVM HTTP client didn't keep connections open like browsers do. In HTTP/1.1 the basically stateless protocol was extended for performance reasons so that browsers and other clients can keep connections to a host open, since it is very likely that a user clicks on a further link within seconds... Without a persistent connection, each SPARQL query requires a new name resolution and a new connection setup, which slows down the whole system.

I'll have to find a way to tell the Java client to keep the connection open for some seconds.
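The effect of a persistent connection can be shown with a small self-contained Python sketch (Python rather than Java, purely for illustration). Everything here is an assumption made for the demo: `PortEchoHandler` is a toy stand-in for a SPARQL endpoint that answers with the client's TCP source port, and `query_with_one_connection` sends several requests over one `http.client.HTTPConnection`. If the connection is really reused, all replies should report the same source port.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PortEchoHandler(BaseHTTPRequestHandler):
    """Toy endpoint: replies with the TCP source port of the client."""

    protocol_version = "HTTP/1.1"  # lets the server keep connections open

    def do_GET(self):
        body = str(self.client_address[1]).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))  # needed for keep-alive
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def query_with_one_connection(host, port, paths):
    """Send all requests over a single persistent HTTP/1.1 connection."""
    connection = http.client.HTTPConnection(host, port, timeout=5)
    bodies = []
    try:
        for path in paths:
            connection.request("GET", path)
            # Read the full body before sending the next request.
            bodies.append(connection.getresponse().read().decode("ascii"))
    finally:
        connection.close()
    return bodies


server = ThreadingHTTPServer(("127.0.0.1", 0), PortEchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    source_ports = query_with_one_connection(
        "127.0.0.1",
        server.server_address[1],
        ["/sparql?query=q1", "/sparql?query=q2", "/sparql?query=q3"],
    )
finally:
    server.shutdown()
    server.server_close()

print("distinct client ports:", len(set(source_ports)))
```

Opening a fresh connection per query would instead show a different source port for every request, along with the extra connection setup cost described above.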

Happy Easter also!


Triple store interface in PHP

soeren - Mon, 2008-03-24 17:42

All code is GPLed and available from the OntoWiki [1] website. The clean separation between frontend and backend is currently a work in progress, but already quite far along. By means of SPARQL it is working right now, but for OntoWiki we needed some additional functionality/extensions (such as counting, aggregating, updating etc.) which are not (yet) available in SPARQL. For technical details regarding the OntoWiki/Erfurt implementations please inquire on the OntoWiki mailing list, since all developers read it.

Sören

[1] http://ontowiki.net

Multiple backends supported by the RDF API

arto@drupal.org - Mon, 2008-03-24 20:17

Kunal et al,

To cater for varying use cases, Drupal 6.x's RDF API has been explicitly designed with built-in support for plugging in various backend stores.

The bundled RDF DB submodule provides a MySQL-backed triple store that will likely be sufficient for the majority of casual users, but a Sesame connector is also in the works and additionally we'll eventually provide for the possibility of adding any SPARQL endpoints as repositories.

The key here is that RDF API queries can execute against all configured repositories, so using an external native repository (Sesame, Virtuoso, AllegroGraph, etc) will be no different than using a built-in MySQL store, from the Drupal developers' point of view. See the (in the works) developer documentation for some more details.
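The idea of querying all configured repositories through one interface can be sketched roughly as follows. This is a Python illustration only, not the Drupal RDF API: `Repository`, `InMemoryRepository` and `RepositoryRegistry` are hypothetical names, and the in-memory class stands in for a MySQL-backed store or a connector to an external one.

```python
from typing import Dict, Iterable, Iterator, Optional, Protocol, Set, Tuple

Triple = Tuple[str, str, str]


class Repository(Protocol):
    """Anything that can answer a triple pattern; None means wildcard."""

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Iterable[Triple]:
        ...


class InMemoryRepository:
    """Stand-in backend; a real one could wrap a SQL table or a remote store."""

    def __init__(self, triples: Iterable[Triple]) -> None:
        self._triples = list(triples)

    def match(self, s=None, p=None, o=None) -> Iterator[Triple]:
        wanted = (s, p, o)
        for triple in self._triples:
            if all(w is None or w == got for w, got in zip(wanted, triple)):
                yield triple


class RepositoryRegistry:
    """Runs a query against every registered repository and merges the results."""

    def __init__(self) -> None:
        self._repositories: Dict[str, Repository] = {}

    def register(self, name: str, repository: Repository) -> None:
        self._repositories[name] = repository

    def query(self, s=None, p=None, o=None) -> Iterator[Triple]:
        seen: Set[Triple] = set()
        for repository in self._repositories.values():
            for triple in repository.match(s, p, o):
                if triple not in seen:  # drop duplicates across repositories
                    seen.add(triple)
                    yield triple


registry = RepositoryRegistry()
registry.register("local", InMemoryRepository([("ex:node/1", "dc:title", "Hello")]))
registry.register("remote", InMemoryRepository([
    ("ex:node/1", "dc:creator", "ex:user/7"),
    ("ex:node/1", "dc:title", "Hello"),  # duplicate of a local triple
]))
results = list(registry.query(s="ex:node/1"))
print(results)
```

The calling code only ever talks to the registry, so swapping the built-in store for an external one would not change how queries are written.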


Sounds good

dorgon@drupal.org - Tue, 2008-03-25 17:59

Sounds good! Are you building the RDF API on RAP or ARC2, or did you start from scratch?


Started from a clean slate

arto@drupal.org - Tue, 2008-03-25 18:53

I started from scratch last October, as I felt that neither RAP nor ARC, as they existed back then, was compelling in a PHP5+ scenario. (This was before ARC2 was available.) In addition, neither RAP nor ARC (and this includes ARC2) was, or is, suitable for inclusion in Drupal core based on considerations of licensing, code footprint, and design, so this practically mandates developing our own more lightweight and more Drupalesque RDF API.

I've since then retrofitted ARC2 into the RDF API module bundle, where it is used as an optional component to provide support for additional RDF formats (RDF/XML and Turtle). It still requires separate manual installation due to differing licensing (Drupal.org only hosts GPL-licensed code), but it's there for any module author to make use of directly, if so desired, by simply introducing a dependency on the RDF API module, which will then load up ARC2 if the library has been installed.

ARC2 is also a required dependency for the SPARQL module, as it is used to parse textual SPARQL queries into the domain-specific language we use internally, suitable for execution by the federated SPARQL engine that I'm currently working on.


SPARQL on PHP/MySQL Performance

fessio - Thu, 2008-09-04 03:05

Hi everyone. I didn't know where to ask this question, so I decided to do it here. Could you tell me more about the performance (i.e. retrieval speed) of storing RDF data in MySQL (like ARC does)? Is it suitable for production use with huge amounts of data? And if not, what solutions would you advise for this purpose?
And one more: is RDF the right technology for the task I need to accomplish? The site I need to develop is a video streaming site with a huge amount of content. The point is to make mutual connections between videos based on topics, actors, directors, etc., to make searching the content more usable and to add extra interactivity.
Thanks in advance.
Any feedback is welcome :)

ARC performs well up to a

scor@drupal.org - Thu, 2008-09-04 07:25

ARC performs well up to a few hundred thousand triples. If you need to store more than that, look at dedicated triple stores such as the Java-based Jena and Sesame, or Virtuoso.
Your application sounds like a good use case for RDF.
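To make the video use case a bit more concrete, here is a minimal Python sketch of how such connections could be expressed as triples and used to find related videos. The identifiers, the property names (`hasActor`, `hasDirector`, `hasTopic`) and the function `related_videos` are made up for illustration; a real site would use a proper vocabulary and a triple store rather than a Python set.

```python
from collections import Counter

# Made-up sample data: (subject, predicate, object)
triples = {
    ("video:1", "hasActor", "person:x"),
    ("video:1", "hasDirector", "person:y"),
    ("video:1", "hasTopic", "topic:heist"),
    ("video:2", "hasActor", "person:x"),
    ("video:2", "hasTopic", "topic:heist"),
    ("video:3", "hasDirector", "person:y"),
    ("video:4", "hasTopic", "topic:space"),
}


def related_videos(triples, video):
    """Rank other videos by how many (property, value) pairs they share with `video`."""
    features = {(p, o) for s, p, o in triples if s == video}
    scores = Counter(
        s for s, p, o in triples if s != video and (p, o) in features
    )
    # Highest score first; ties are broken by identifier for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(related_videos(triples, "video:1"))
```

Adding a new kind of connection (say, a production studio) only means adding triples with a new property; the ranking function does not need to change, which is the flexibility that makes RDF attractive here.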


Take a look at Mulgara and AllegroGraph

cameronhunt - Thu, 2008-09-04 14:39

I've heard good things about Mulgara - and I'll be starting to use it soon on another, non-Drupal project. If you're looking for very high performance and aren't bothered by the closed-source license, also consider AllegroGraph.