Solr RDF Support

Posted by drunken monkey

Overview

This project is about adding RDF support to the popular ApacheSolr module in the form of a Solr RDF contrib module. The module should be able to read an RDF class specification and automatically generate the necessary mapping to a Solr server, provide the capability to search resources of that type, and also generate facets based on its properties. It would even be possible to build the existing Node search capabilities completely on top of this mechanism! But in any case you could also add arbitrary other types like users or taxonomy terms, or resources from other websites altogether.

Description

The ApacheSolr module neatly integrates the fulltext search capabilities of Solr into Drupal and thus provides an appealing alternative to Drupal's standard search. However, at the moment the module is Node-centric; with the help of RDF meta-data it could easily be made completely flexible regarding the type of searched data. Since both facets and the module's Views integration are rather abstract even now, new search types could benefit from the complete capabilities of the current module.

At the end of the project, the following things will be possible:

  • Index available RDF resources with Solr
  • Automatically generate facets for RDF Properties
  • Use Solr to build a faceted search, which can be used to browse through the site's RDF resources
  • Automatically generate Solr Views integration (based on the Views backend abstraction patch)
  • Easily add RDFa support to the search output
  • Specify different Solr servers for different RDF classes (e.g. for performance reasons)

The module will of course include a suitable admin user interface and will be highly flexible regarding the presentation of the search.
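As a rough illustration of the kind of mapping the module could generate from an RDF class description, here is a minimal sketch (the property list, datatype names, and dynamic-field suffixes are hypothetical illustrations, not the module's actual API; `_t`/`_i`/`_dt` follow common Solr schema.xml dynamic-field conventions):

```python
# Sketch: derive Solr dynamic-field names from an RDF class description.
# The class-description format and suffix table are hypothetical.

# XSD datatype -> Solr dynamic-field suffix (as commonly declared in schema.xml)
SUFFIXES = {
    "xsd:string": "_t",    # tokenized text
    "xsd:integer": "_i",
    "xsd:dateTime": "_dt",
}

def solr_fields_for_class(properties):
    """Map each RDF property (name -> range datatype) to a Solr field name."""
    fields = {}
    for name, datatype in properties.items():
        suffix = SUFFIXES.get(datatype, "_s")  # fall back to a plain string field
        fields[name] = name + suffix
    return fields

# Example: a simple "Document" class with three properties.
doc_class = {
    "title": "xsd:string",
    "pageCount": "xsd:integer",
    "issued": "xsd:dateTime",
}
print(solr_fields_for_class(doc_class))
# {'title': 'title_t', 'pageCount': 'pageCount_i', 'issued': 'issued_dt'}
```

In practice the suffix table would be derived from the Solr server's actual schema rather than hard-coded.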

Use Cases/Benefits

  • Use it to search the content of Drupal sites as exposed by the RDF module
  • Modules can easily expose further data as RDF, which is picked up by Solr without any further work
  • Build a multisite search upon the RDF data of several Drupal sites, or even
  • Use it to search through arbitrary RDF content out there (once added to the RDF module). Want to build a faceted search for DBpedia? ;)

Mentors

Robert Douglass

Comments

First draft of the proposal

Posted by drunken monkey

Title:
Adding RDF Support to the ApacheSolr module

Abstract:
This project will improve the ApacheSolr module by enabling it to handle (i.e., index and search with a comfortable UI) any kind of RDF data. This will instantly make it possible to provide meaningful searches for all site content that isn't node-centric, or shouldn't be, as well as content from virtually anywhere else on the web. Only an RDF class description and a way to access the data would have to be provided (apart from the normal Solr requirements) and the module would automatically do the rest of the work.

Content:

The ApacheSolr module neatly integrates the fulltext search capabilities of Solr into Drupal and thus provides an appealing alternative to Drupal's standard search. However, at the moment the module is rather node-centric, even though most parts of it are written generally enough to be easily used for all other kinds of data.
This project aims to use these capabilities by providing a way for module users (i.e., site admins) to index and search arbitrary data in a meaningful way, just by providing an RDF class for the data. The meta-data from this class is then automatically processed by the module to generate the necessary structures to index and search resources of this class, as well as to provide facets and a comfortable admin UI.

The module will of course integrate with the existing RDF module(s), especially with the RDF API module. It will also include a suitable admin user interface and will be highly flexible regarding the presentation of the search (probably also by integrating with Views).

Benefits: At the end of the project, the following things will be possible:

  • Index available RDF resources with Solr
  • Automatically generate facets for RDF Properties
  • Use Solr to build a faceted search, which can be used to browse through the site's RDF resources
  • Automatically generate Solr Views integration (based on the Views backend abstraction patch)
  • Thereby let the site easily display RDF data with the presentational power of Views
  • Easily add RDFa support to the search output
  • Specify different Solr servers for different RDF classes (e.g. for performance reasons)

Use Cases:

  • Use it to search the content of Drupal sites as exposed by the RDF module
  • Modules can easily expose further data as RDF, which is picked up by Solr without any further work
  • Build a multisite search upon the RDF data of several Drupal sites, or even
  • Use it to search through arbitrary RDF content out there (once added to the RDF module). Want to build a faceted search for DBpedia? ;)

Personal Information: I was born on 03/20/1987 in Vienna. After completing school in 2005 I started studying "Software and Information Engineering" at the Vienna University of Technology. I successfully completed this program in 2008, earning a Bachelor of Science, and am now studying for a Master's degree in "Software Engineering and Internet Computing" at the same university. I am also studying for a Bachelor's in Engineering Physics.

I have been a strong admirer of the FOSS philosophy for a long time and eventually, in 2007, began actively contributing to it, soon settling on Drupal, which struck me as a highly flexible and still astonishingly user-friendly CMS. I created some (rather poor) patches before successfully participating in last year's GSoC, in which I created Views integration for the ApacheSolr module. The patch to the Views module that would allow different data sources for views is still pending, but has drawn a lot of attention throughout the community and will be committed soon.

Since I have already worked very closely on the ApacheSolr module and was pretty impressed by it, I would like to continue contributing to it and am therefore applying for this project. I hope that with it I can improve the module's back-end just like I improved its front-end last year.

My drupal.org user name is drunken monkey.

Deliverables:

  • A module (apachesolr contrib or stand-alone) that allows the indexing and searching of arbitrary RDF data via a comfortable and flexible UI
  • Test cases for that module

Project Schedule: Since my semester doesn't end until the end of June, I probably won't be able to code much in the first few weeks. I plan to make up for this by concentrating on gathering information and developing concrete concepts in this time frame, and then working harder for the next two months.

Part of the project will be R&D on how to reach the planned goal, so a detailed schedule cannot be given yet.

  • Until May 23: Research existing efforts and related modules
  • May 23 - June 21: Continue research and come up with concrete ideas on how to tackle the problem; development of first prototypes
  • June 22 - July 5: Decide on a definitive method; start coding for this approach
  • July 6 - August 10: Develop this approach into a full-fledged module that supports the tasks listed under "Benefits"
  • August 11 - August 17: Clean up code, document, ...
  • Continuous tasks: Write unit tests for finished segments; keep in touch with the communities to get tips, hints, suggestions, complaints, etc., and react to them; contribute to the apachesolr module by reviewing/writing small patches, etc.

Additional Info:
http://groups.drupal.org/node/20589

Very exciting proposal

Posted by robertDouglass

Thomas, fago and I have been hatching this proposal (with input from many others) for a few days. I'm thrilled and really hope to get a chance to mentor it.

Creating your own Semantic Google

Posted by Arto

Thanks for pointing me at this, Robert. I think this would, quite simply, kick arse.

Combined with some other ongoing efforts (such as an upcoming RDF API storage adapter that uses Bitcache to enable the storage and usage of vast amounts of RDF data in Amazon S3-based repositories), this means that the road to building your very own Linked Data-enabled Google can soon start at Drupal.

This is essential infrastructure, both for the cloud and for the Semantic Web.

I really like this

Posted by bhuga-gdo

There's so much to like here I can't even begin. For a lot of use cases that don't need the full power of SPARQL, this is just as good as a full-fledged RDF store but would probably perform a lot better, and Solr's search correction functionality makes it even better than a normal RDF store in a lot of cases. Fantastic, fantastic, fantastic.

Is there a place to vote for this? :)

I'm really fascinated by this

Posted by miglius

I'm really fascinated by this initiative and can give a concrete example which could immediately take advantage of it: <a href="http://drupal.org/project/fileframework">FileFramework</a> extracts all metadata of the files uploaded to Drupal and stores it in the RDF storage backend, but currently files cannot be located or searched based on metadata such as image EXIF or pdfinfo data, which lives in RDF. The project you propose would open this door, and all the files uploaded to Drupal could instantly appear in the search index.

Thanks for the tip

Posted by drunken monkey

Thanks, additional use cases are always useful!

And also thanks for the other encouraging comments, @ all.

stability of RDF schema in Drupal

Posted by scor

switching to online discussion
Renaud Delbru wrote:

In my opinion, the first step will be to clearly define the use case of the project. What will be the input data? RDF, ok, but RDF data with a relatively fixed schema? Or very heterogeneous RDF data? I see in your proposal that one of the first use cases is to search Drupal content that is exposed as RDF. I imagine that such data has a relatively fixed schema. Is that right? Could you give some example RDF data?

I ask this question because it will have strong consequences for the implementation. If you have a relatively fixed schema, the implementation of the RDF modules and of the facet navigation is greatly simplified.

However, if the RDF data input is arbitrary, with very different schemas, this is another problem (and a little trickier to manage).

Also, what kind of queries are you expecting (simple keyword search, or more advanced queries like structured search)? What do you want to search (documents or entities)? If you can give some examples, it will allow me to better understand your requirements.

As I said on the Solr mailing list, dynamic fields *can work* in certain cases. They can work if the schema is fixed, with a low number of different predicates, and if the queries stay relatively simple.

To answer Renaud's question, I would say the native RDF data schema coming from Drupal would be relatively stable, since it would be predefined by the site administrator upon installation of the site via the various RDF modules. It is not like a "semantic wiki" where you cannot predict the schema. Most of Drupal's data is pre-structured via content types and their CCK fields. However, this structure can vary from one site to another, and this needs to be taken into account as well. So maybe a templating mechanism for Solr needs to be used, which could be installed as part of a Drupal module (think context/spaces here, in Drupal jargon).

The second point is about local schema vs. external mappings. Natively, Drupal will have its own local RDF schema, which can optionally be mapped to external RDF classes and properties via the RDF API and RDF CCK. How does Solr deal with that? Does Solr require well-known RDF terms (in which case mappings are required) or can it be used with local terms?

Renaud Delbru wrote: You

Posted by scor

Renaud Delbru wrote:

You are saying that the schema will vary from one site to another. Are these sites managed by the same Drupal instance? Or are they multiple distinct Drupal instances?
In this architecture, where is Solr located? Is Solr used within / embedded in Drupal to search content? Or is Solr external to the Drupal instances?
Could you try to describe the architecture with multiple Drupal instances and a Solr server? Thanks.

What is the use case here? Do you mean Drupal will expose heterogeneous RDF data, and that you want to be able to search this data using Solr?

If you are planning to index just one site, the schema is more or less fixed over time, unless the administrator installs new modules or creates new content types. What I meant by saying the schema will vary from one site to another is that you cannot predefine the schema in the Solr/RDF module, because every Drupal site is different. Each Drupal site would have to provide its schema + data to the Solr server, which would send back indexed data for Drupal to display. This is the way the current Solr integration with Drupal works (but without the RDF part).
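To make "each site provides its schema + data to the Solr server" a bit more concrete, here is a minimal sketch of serializing one resource as a Solr XML update message (the field names are made up for illustration; the `<add>/<doc>/<field>` envelope is the format Solr's XML update handler accepts):

```python
# Sketch: serialize an indexed resource as a Solr XML update message.
# A Drupal site would POST this document to its Solr server's update handler.
from xml.sax.saxutils import escape

def to_solr_add(doc_id, fields):
    """Build an <add><doc>...</doc></add> payload for one resource."""
    parts = ['<add><doc>', '<field name="id">%s</field>' % escape(doc_id)]
    for name, value in fields.items():
        parts.append('<field name="%s">%s</field>'
                     % (escape(name), escape(str(value))))
    parts.append('</doc></add>')
    return "".join(parts)

# Hypothetical resource: an RDF subject URI plus two mapped properties.
payload = to_solr_add("http://example.org/node/1",
                      {"title_t": "Hello <world>", "type_s": "sioc:Post"})
print(payload)
```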

From reading your replies above, managing several Drupal sites federated on the same Solr instance may be beyond this proposal, although RDF in itself would have no problem handling it.

Second point is about local

Posted by fago

The second point is about local schema vs. external mappings. Natively, Drupal will have its own local RDF schema, which can optionally be mapped to external RDF classes and properties via the RDF API and RDF CCK. How does Solr deal with that? Does Solr require well-known RDF terms (in which case mappings are required) or can it be used with local terms?

Hm, why should Solr care which vocabularies one is using?

To bring the mail discussion here on g.d.o, here is what I wrote:

Ok, to have a more concrete use-case, let's work out how a multisite search could work: Each Drupal instance has its own schema and sends its data for indexing to the Solr server. Then I think you would have to import the schemas of the other instances on the "main instance", which runs the search. So you could use that information to do things like generating facets or Views integration.
The same way could be used for other RDF data: import the schema, but also get the actual RDF data and let it be indexed by Solr. Does that make sense?

As I said on the Solr mailing list, dynamic fields can work in certain cases. It can work if the schema is fixed, with a low number of different predicates, and if the query will stay relatively simple.

Hm yes, I think that's still a possible way to go. If using SIREn isn't too different, it might make sense to support both and let the user choose what to use. So one could just stay simple and use "Solr out of the box" or go with the fastest solution.. :)

Stephane and Renaud have agreed to be co-mentors on this project, so we'll have three (yay!)

I'd also love to help, so of course I'm available as co-mentor too. As I live in the same area as Thomas, we can easily meet in person and discuss the whole topic... :)

Reply to Fargo mail

Posted by renaud

Hi,

sorry for the late reply, I was kind of busy these last few days.

fago wrote:

How did you want to implement the RDF support for Solr and Drupal without working with the Solr source? ;o)

By using SIREn, or just with Solr as it is, using dynamic fields. Thomas' proposal is about integrating the Drupal RDF module with Solr, so it's a Drupal SoC project and I really think he should concentrate on the Drupal side only. So if SIREn doesn't support facets, he would have to accept that (or work around it).
So probably going just with the deliverable "explore facets" is the best way to go.

Ok, I got it. Sorry for the confusion; when replying I was thinking of Thomas's proposal to send raw RDF data directly to Solr, which would require creating simple Solr extensions (an RDF analyser, for example).
But now I understand that the idea of the project proposal is to use existing technologies (Solr, maybe SIREn) in order to create a module / extension for Drupal, preferably without creating any extensions to third-party projects (in this case Solr or SIREn). Is that right?

I don't think solr cares which vocabularies you are using?

Right, in Solr and SIREn you are not limited in terms of vocabularies.

Yep, I think the major use case is searching through data provided by Drupal or several Drupal instances. As Stephane mentioned, the schema should be known very well and we can go with that.

You are saying that the schema will vary from one site to another.

It does.

Ok, so to summarise: each Drupal instance will have a known and (more or less) fixed schema, but the schema will change from one Drupal instance to another. As a consequence, the schema over multiple Drupal instances will be large and free-form.

In this architecture, where is Solr located? Is Solr used within / embedded in Drupal to search content? Or is Solr external to the Drupal instances?

Solr is external and accessed via its web service.

Ok, to have a more concrete use-case, let's work out how a multisite search could work: Each Drupal instance has its own schema and sends its data for indexing to the Solr server. Then I think you would have to import the schemas of the other instances on the "main instance", which runs the search. So you could use that information to do things like generating facets or Views integration.

Just for clarification, what do you call "Views integration"? A view (search) over multiple (selected) Drupal instances?

The same way could be used for other RDF data: import the schema, but also get the actual RDF data and let it be indexed by Solr. Does that make sense?

Yes, that makes more sense to me now ;o).

As I said on the Solr mailing list, dynamic fields can work in certain cases. It can work if the schema is fixed, with a low number of different predicates, and if the query will stay relatively simple.

Hm yes, I think that's still a possible way to go. If using SIREn isn't too different, it might make sense to support both and let the user choose what to use. So one could just stay simple and use "Solr out of the box" or go with the fastest solution.. :)

Ok, so I think it is necessary that I explain in a little more detail the advantages of using SIREn over plain Solr. It will help in choosing the best approach based on your use case. The following is part of the case study we wrote for the next "Lucene in Action" book (so please keep it internal).

SIREn is meant for indexing and querying large amounts of semi-structured data. The use of the SIREn extension is appropriate when you expect data with a large schema or with multiple schemas, with a rapidly evolving schema, or when the types of data elements are eclectic. SIREn allows you to efficiently perform (1) complex structured queries (when the program / user is aware of the schema) and (2) more imprecise queries (a list of simple keywords, since the schema is unknown). Moreover, even in the case of imprecise queries, SIREn will be able to "respect" the data structure during query processing in order to avoid false positive matches, as explained in the following.

  • SIREn enables searching across all fields: With semi-structured documents, the data schema is generally unknown and the number of different fields in the data collection can be large. Sometimes we cannot know which field to use when searching. In that case, we have to execute the search in every field.

The only way to search multiple fields at the same time in Lucene is to use the wildcard '*' pattern for the field element. Internally, Lucene will expand the query by concatenating all the fields to each query term. As a consequence, this will generally cause an explosion in the number of query terms. A common workaround is to copy all other fields into a master search field. While this solution can be convenient in certain cases, it 1. duplicates the information in the index, and therefore increases the index size; and 2. loses the information structure, since all values are concatenated inside one field.

  • SIREn maintains a single efficient lexicon: Since the data schema is free, many identical terms can appear in different fields. Lucene maintains one lexicon per field, by concatenating the field name with the term. If the same term appears in multiple fields, it will be stored in multiple lexicons. As SIREn primarily uses a single field, it has a single large lexicon and avoids term duplication in the dictionary.

  • SIREn enables flexible fields: The data schema is generally unknown and it is common to have many name variations for the same field. In SIREn, a field is a term like any other element of the data collection and can therefore be normalized before indexing. In addition, this enables full-text search (boolean operators, fuzzy operators, spelling suggestions, etc.) on a field name. For example, on the Web of Data we generally don't know the exact term (e.g. URI) used as a predicate when searching entities. By tokenising and normalising the predicate URI, e.g. http://purl.org/dc/elements/1.1/author, one can search for all the predicates from one or more schemas containing the term author.

  • SIREn allows efficient handling of multi-valued fields: In Lucene, multi-valued fields are handled by concatenating the values together before indexing. As a consequence, if the two string values "Mark James Smith" and "John Christopher Davis" are indexed in a field author, the query author:"James AND Christopher", looking for an author called "James Christopher", will return a false positive. In SIREn, the values of a multi-valued property are independent: each one can be searched separately, while leaving the possibility to search them together as one unit.
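The multi-valued false positive described above can be simulated with a toy position model (a sketch only; real Lucene/Solr achieve the same separation of values with a position increment gap between them):

```python
# Sketch: why concatenating multi-valued fields causes false positives.
# We model tokens with integer positions; a large gap between successive
# values keeps proximity queries from matching across value boundaries.

def tokens_with_positions(values, gap):
    """Assign token positions, inserting `gap` extra positions between values."""
    pos, out = 0, []
    for value in values:
        for tok in value.split():
            out.append((tok, pos))
            pos += 1
        pos += gap
    return out

def near_match(tokens, a, b, max_dist):
    """True if tokens a and b both occur within max_dist positions."""
    positions = dict(tokens)
    return (a in positions and b in positions
            and abs(positions[a] - positions[b]) <= max_dist)

authors = ["Mark James Smith", "John Christopher Davis"]
# With no gap, "James" and "Christopher" appear close together: false positive.
print(near_match(tokens_with_positions(authors, 0), "James", "Christopher", 3))
# With a large gap between values, the spurious cross-value match disappears.
print(near_match(tokens_with_positions(authors, 100), "James", "Christopher", 3))
```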

So, based on this explanation, could you tell me if, among your use cases, you think you will need one (or more) of these features?
In general, if you plan to index RDF data with a fixed schema and you think you will not need to query across all the fields (i.e. the field or predicate is unknown; in SPARQL this gives the following query: SELECT ?s WHERE { ?s ?p "my keyword search" . }), then maybe the simple approach with dynamic fields will be sufficient.
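The cost of the "unknown field" case can be sketched quickly (field names invented for illustration): rewriting a field-less query over every known field multiplies the number of clauses, which is why the copy-to-one-master-field workaround exists.

```python
# Sketch: wildcard-field query expansion. Each field-less term is
# rewritten into one clause per known field, so the clause count is
# len(terms) * len(fields). Field names here are hypothetical.

def expand_query(terms, fields):
    """Rewrite field-less terms into clauses over every known field."""
    return ["%s:%s" % (field, term) for term in terms for field in fields]

fields = ["title", "author", "body", "tags"]
clauses = expand_query(["semantic", "web"], fields)
print(len(clauses))  # 2 terms x 4 fields = 8 clauses
```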

Hope my explanation was clear enough and gives you a better idea of where SIREn can help.

Regards,
Renaud Delbru

oh, it's fago, not fargo..

Posted by fago

oh, it's fago, not fargo.. ;)

Just for clarification, what do you call "Views integration"? A view (search) over multiple (selected) Drupal instances?

Views is a popular Drupal module, originally built for listing any data contained in the MySQL database. It's also capable of building searches, and there is currently some work in progress which should make it possible to use it directly with Solr. For this to work, modules have to provide some information about the database structure and how to deal with it, called "Views integration".

@SIREn:
Thanks a lot for the detailed description!

The following is part of the case study we wrote for the next "Lucene in Action" book (so please keep it internal).

Oh, this site is public and probably already indexed by Google? Oops.. :/ However, I think you can still edit your comment if you want.

So, based on this explanation, could you tell me if, among your use cases, you think you will need one (or more) of these features?

Hm, the features sound like great improvements, certainly useful, but I don't think they are a must-have. Anyway, I think Thomas has to evaluate the pros and cons carefully when he starts the project, and I'm sure your explanation will be a great help then. Thanks!

So for now, we have to wait and hope that the project makes it.. :)

Some additional details

Posted by renaud

Just a little additional information:

It is also possible to use the basic functionality of Solr (dynamic fields / facets) and SIREn at the same time. The idea is to use normal Solr fields for relatively generic properties shared across all Drupal instances (and thus benefit from the automatic facet generation), and use a SIREn field for indexing free-form RDF data.

Lucene, Nutch and Solr
