RDF for Solr: Possible improvements

Posted by drunken monkey on July 25, 2009 at 12:42am

The Apache Solr RDF module is now in a state where it can, in theory, already be used. However, there is much room for improvement, so I'd like to discuss some possible ways to improve it.

The Status Quo

At the moment, the process for setting up a search for RDF data is the following:

Set up a Solr server suitable for indexing data for the desired schema/method. (There are, at the moment, two different ones to choose from, with a hook for adding other ones.)
This involves copying the schema.xml and solrconfig.xml files corresponding to the schema into Solr's conf directory and, in one case, also copying a JAR file.
Enable the apachesolr_rdf module (and all it depends on). (Note that, at the moment, you also have to manually apply two patches to the apachesolr module for this module to work, but this should be solved soon, one way or the other.)
Optionally, go to Search > RDF to verify that a message is displayed saying that no contexts are enabled.
Go to the module's configuration page, where a list of all contexts known to the RDF module is displayed. Searching and indexing work on a per-context basis, which seemed the most practical approach to me.
Click "Edit" beside the context you want to search. Check the "Enabled" box and fill out the form. Click "Submit".
If there are no errors, you should be able to index data by running cron. Afterwards, check the logs: there should be a message that resources were indexed, and no error messages.
Now go to Search > RDF, where the name of your previously enabled context should be listed, linking to a search form.
Try out the search.

Possible improvements

At the moment, I have thought of the following possible improvements (apart from UI improvements, advanced search fields and facets, which are already "TODO"):

Alter the "dynamic fields" approach to let users decide which properties (apart from rdf:type, rdfs:label and rdfs:comment) are indexed in dynamic fields.
Alter the schemas in some other way (although I don't know what else might be done better).
Decouple indexing and searching (further). At the moment, searching over data of more than one context can only be done rather hackishly, by giving several contexts the same IDs and servers. It would probably be more user-friendly to let users specify servers (along with their schemas), contexts to be indexed (maybe even with further filter criteria, or linked not to contexts but just to a filter) and searches (possibly with options such as filters to be applied to all queries, which would allow a single index to be used for different searches).
Add more control over and information about indexes (re-index, delete index, delete only some resources?, meta-data on the index).
Views integration (probably rather laborious, but could be worth it).
Allow data sources other than the RDF module (probably via a hook).
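
To make the decoupling idea from the list above more concrete, here is a minimal sketch in Python (not the module's actual code or data model). All class names, field names, the schema identifier, the URL and the filter strings are assumptions made up for illustration. It only shows how servers, indexes and searches could be separate things, so that two searches share one index and differ in the filters applied to every query.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    name: str
    url: str      # placeholder URL, not a real deployment
    schema: str   # hypothetical schema identifier


@dataclass
class Index:
    server: Server
    contexts: List[str]                                # contexts to be indexed
    filters: List[str] = field(default_factory=list)   # optional further criteria


@dataclass
class Search:
    name: str
    index: Index
    base_filters: List[str] = field(default_factory=list)

    def filter_queries(self, user_filters=()):
        """Filters sent with a query: the search's fixed ones, then the user's."""
        return list(self.base_filters) + list(user_filters)


# One server and one index, used by two different searches.
server = Server("local", "http://localhost:8983/solr", "dynamic_fields")
index = Index(server, contexts=["http://example.org/people",
                                "http://example.org/publications"])
people = Search("People", index, base_filters=['type:"foaf:Person"'])
everything = Search("Everything", index)
```

With such a structure, searching across several contexts would no longer require giving them the same IDs and servers; they would simply belong to the same index.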

Your comments on these, as well as other ideas or general feedback on the module, would be appreciated!

Comments

Another one…

Posted by drunken monkey on July 25, 2009 at 1:16am

Another improvement, somewhere between "possible" and "TODO": at the moment, filters for the "dynamic fields" schema have two problems. First, they are absent from the user interface, so they can only be added by manipulating the URL directly. Second, they would somehow have to figure out whether the field to be searched is of type "text" or "string", since this determines the field name. Searching both fields and ORing the results together would be possible, but a) unnecessarily complicated and b) probably slower.

If the method were altered so that the user has to specify the properties to index in dynamic fields, this would be solved implicitly: the module would then have to know which filters are possible anyway, and could also store whether each field has string or text values.
The value type could simply be specified by the user. If users could additionally set a name and maybe even the possible properties, filters could become a rather powerful presentation tool.
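
The difference between the two approaches can be sketched as follows, in Python rather than the module's own code. The field-name prefixes `property_text_` and `property_string_` and the function names are assumptions for illustration only; the actual dynamic field names in the schema may differ. If the value type of a property is known, the filter targets a single field; otherwise it falls back to ORing both.

```python
import re

# Hypothetical prefixes; the real dynamic field names may differ.
TEXT_PREFIX = "property_text_"
STRING_PREFIX = "property_string_"


def escape_field(name):
    """Backslash-escape characters that are special in Lucene query syntax."""
    return re.sub(r'([+\-&|!(){}\[\]^"~*?:\\/\s])', r'\\\1', name)


def quote_value(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_filter(prop, value, known_types=None):
    """Build a filter for one property.

    known_types maps a property to "text" or "string" where that is stored.
    """
    kind = (known_types or {}).get(prop)
    quoted = quote_value(value)
    text_field = escape_field(TEXT_PREFIX + prop)
    string_field = escape_field(STRING_PREFIX + prop)
    if kind == "text":
        return f"{text_field}:{quoted}"
    if kind == "string":
        return f"{string_field}:{quoted}"
    # Type unknown: search both fields and OR the clauses together.
    return f"({text_field}:{quoted} OR {string_field}:{quoted})"


known = {"foaf_name": "text"}
single = build_filter("foaf_name", "Alice", known)
fallback = build_filter("foaf_name", "Alice")
```

Here `single` addresses only the text field, while `fallback` has to query both fields, which is the more complicated and probably slower variant mentioned above.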

Another possibility would be to fuse the two dynamic fields together and index all values as text, even URIs. But then the impact on faceting, general search results and other things would have to be considered.

How is the mapping RDF -> Solr Docs

Posted by arademaker on September 16, 2009 at 11:49pm

Could you please explain in more detail how you map RDF data to Solr docs in the current implementation? Can we control that? I mean, is it resources to docs and properties to fields, and that's it? What if I have a graph with resources of different types, like people and publications?

One document per resource

Posted by drunken monkey on September 17, 2009 at 11:39am

Yes, each resource gets mapped to one Solr document that contains its triples in some way. The way in which the resource's properties are included in this document varies according to the schema selected, however. See the schemas' explanations for (hopefully sufficient) details. (To do this, at the moment you have to create a server with the schema and look at its overview. Or you can look directly at apachesolr_rdf_apachesolr_rdf_schemas().)
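
As a rough illustration of "one document per resource", here is a sketch in Python (not the module's code). The function name, the `id` field and the `property_` prefix are assumptions for this example; how properties actually end up in fields depends on the selected schema. It groups triples by subject, so resources of different types, such as a person and a publication, each become their own document, with rdf:type treated like any other property.

```python
def triples_to_documents(triples):
    """Group (subject, predicate, object) triples into one document per subject."""
    docs = {}
    for subject, predicate, obj in triples:
        # Create the document the first time a subject is seen.
        doc = docs.setdefault(subject, {"id": subject})
        # Hypothetical naming: one multi-valued field per property.
        doc.setdefault("property_" + predicate, []).append(obj)
    return list(docs.values())


triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:paper1", "rdf:type", "bibo:Article"),
    ("ex:paper1", "dc:creator", "ex:alice"),
]
documents = triples_to_documents(triples)
```

In this example, `documents` holds two entries: one for `ex:alice` and one for `ex:paper1`.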
