Posted by drunken monkey on June 12, 2009 at 5:29pm
(For information about my project, see here. In short, it's about enabling Solr to index RDF data via Drupal.)
Before starting the actual coding, even on prototypes, the basic options for implementing this will have to be discussed. At the moment, my mentors and I see the following three possibilities:
- Requiring a running SIREn instance and using that. (Depending on when it gets released.)
- The method proposed in this blog post, using seven static fields, which is (like SIREn) good for general RDF data, but doesn't allow, e.g., faceting on a specific property (only on the existence of a specific property), as the index generally contains no link from the predicate to the object.
- Using dynamic fields to store each predicate along with its corresponding object in the statement (apart from other optional information, like a list of all predicates used for the resource); see the sketch below. This would of course bloat the index rather badly when used with generic data, but could be very useful when employed on data of a fixed structure, e.g. when using it to index data provided by Drupal.
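To make variant three concrete, here is a rough, untested sketch of what indexing could look like through the SolrJ client (the field names, the URI-to-field-name mapping and the dynamic field patterns "*_s"/"*_t" are just assumptions for illustration):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DynamicFieldIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("uri", "http://example.com/resource/book-1");
    // One dynamic field per predicate, so the index keeps the
    // predicate-object link: "*_s" for URI/string objects, "*_t" for
    // literal text objects.
    doc.addField("rdf_type_s", "http://example.com/terms/Book");
    doc.addField("dc_title_t", "A book about Solr and RDF");
    server.add(doc);
    server.commit();
  }
}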
At the moment it looks like I'll implement variants two and three and, if the prototypes work well enough, allow admins to choose between the two, using the former for generic data and the latter for specific Drupal data.
What are your thoughts on this? Any other variants that sound promising? Anything I might have overlooked?
Comments
To get the discussion
To get the discussion started, I re-post some thoughts from my previous mail:
As the main use case is handling Drupal-related data, I think the method with dynamic fields is better suited. On a Drupal site the number of fields is roughly predictable, even when indexing multiple "Drupal entities". I don't think this results in many more fields than the apachesolr module already creates for nodes (CCK).
But compared to the solution above, you have the advantage that numbers can be indexed as "solr numbers" and dates as "solr dates". Thus one can do things like range queries.
So a short summary:
variant 1: !?
variant 2: It's not possible to filter for the values of a certain predicate -> it's basically restricted to full-text search + filtering/faceting by rdf:type
variant 3: No limitations, but when data with a huge number of different predicates is indexed, the index gets bloated. So indexing big chunks of RDF data with an arbitrary schema - maybe imported from somewhere - might become inefficient.
Variant 2
Hi everybody!
First, thanks for linking to my blog post about this matter. This is going to be quite an interesting addition to Drupal.
So, I wanted to make some things clearer vis-à-vis variant #2.
First, there is a way to filter by property. We have to use the "property" and "object_property" fields.
"preperty": a property that doesn't have a resource IRI as range (right now, all data types become literals; but we can certainly tweak it so that we can enable what fago is talking about: the possibility to do range queries).
"object_property": a property that has a resource IRI as range.
Right now, the object_label is a human-readable label used to refer to the resource referenced by the object_property (if such a label is available). So, we would have two choices here to enable what drunken_monkey wants to do: filtering per property AND on the IRIs of the resources in the range of these properties:
(1) the first choice is to re-use object_label and to put the IRIs of the objects there instead of the human-readable label.
(2) the second is to create a new field that we could call property_object_iri (or anything else), where the IRIs of the objects of these object_properties would be stored.
I prefer option #2. This enables you to do filtering on properties + IRIs of object properties.
Personally, I didn't implement this because my goal was really not to create some kind of triple store engine out of Solr, but just to complement an existing triple store with a high-performance fulltext & faceting search engine (Solr).
However, everything is possible :)
Does this clarification make sense?
Thanks,
Take care,
Fred
Thanks for the comment!
Thanks for commenting on this; your knowledge and the experience from your project could be very helpful here!
If faceting on objects of specific properties can really work with your method, that would be great news, of course! However, I can't really follow your explanation, or rather don't see how this would enable us to provide facets of the desired kind.
I'm still just a beginner in RDF, so maybe that's why I can't really grasp the whole idea – or maybe I just didn't phrase my intentions clearly enough, or used misleading (or plain wrong) lingo.
What I want to do is provide facets showing all resources with a specific object for a specific predicate, e.g., all resources with "type" "book" (mentally replace the words with the corresponding URIs). So we'd have (at least, I think so) to have some mapping between the properties and their respective objects that Solr knows about, not just the properties and the objects in two separate fields. Objects could also be literals in this case, so they wouldn't even need to have an IRI.
As I understand it (and, as said, that might be far from what's really the case), your method would only allow filtering on resources that have a property "type" and an object "book" somewhere, which could also find, e.g., a TV documentary about books.
I'm sorry, but it would be great if you could clarify this for me.
Hi! Well, I think that one
Hi!
Well, I think that one of the problems is that I am missing some information about what you are trying to do :)
So, I will try to answer your questions the best way I can, and will seek some clarifications too (to make sure I understand your aims).
About: "If facetting on objects in specific properties"; could you please give me an example of what are your aims here.
"I'm still just a beginner in RDF, maybe therefore I can't really grasp the whole idea – or maybe I just phrased my intentions not clearly enough, or used misleading (or plain wrong) lingo." This is not a problem; it always take some exchanges before people understand each other on these subjects. There is unfortunately nothing easy with this stuff :)
"provide facets showing all resources with a specific object to a specific predicate"; just to clarify your thought/language here: an object of a triple is a resource (like the subject). So, you probably mean something like "showing all objects of a property which are resource of a specific kind (if this matter). A "kind" can be a Resource, Class, Literal, Person, etc. (then you can play with some inference as well).
Is it what you mean?
About "So we'd have (at least, I think so) to have some mapping between the properties and their respecitve objects that Solr knows about, not just the properties and the objects in two separate fields.", well, here I use a little trick that I am not mentioning in my blog post. Because of the way Solr index new docs, I know what property is related to what object because of the sequence of indexation. So, I know that the "property" #1 is related to the "text" #1. I know that the "object_property" #1 is related to the "object_label" #1. So, this means that I can easily recreate my triples that way. But it is maybe not the best way to do this with Solr :) (pretty new to it myself).
Personally, what I do is use Solr only for faceting and fulltext search. Otherwise, once a Solr search is done, I query a triple store using SPARQL to get more information about the descriptions of the resource IRIs returned by Solr. Also remember that the Solr index is itself populated by a triple store (RDF is the canonical data format I use, and the triple store (Virtuoso) is the tool to manipulate that canonical format).
"As I understand it (and, as said, that might be far from what's really the case), your method would only allow to filter on resources that have a property "type" and an object "book", which could also find, e.g., a TV documentation about books."
As-is: no. The only thing that I care about is to index literals in a high performance system :)
However, this could be done by enhancing this basic idea; partially with the method I proposed above.
In any case, I have a question for you: is your goal to "get all books", or "get all books which have Mr. A as author", or "get all cities of France", or "..."? If so, I would suggest you experiment with some triple stores and SPARQL queries.
But if you are trying to do "give me all the resources with the keywords "vive la France"", or "sementic web", or "semantic web"~10, then for this task I think Solr is the way to go for now.
So, I think this really depends on what you are trying to do. Solr is a tool, like Virtuoso or MySQL. RDF is the canonical format, and that is what matters, I think.
BTW, you can experiment with this search engine here:
http://constructscs.com/conStruct/search/?query=semantic+web&type=all&da...
More information about this demo can be found here: http://constructscs.com (code to be released this summer under GPL2 here: http://drupal.org/project/construct)
It is related to a new system called structWSF: http://openstructs.org (code to be released this summer under Apache 2)
We just released these two web sites today. So this should give you some more details about this stuff (more to come with better documentation).
But for example, if you click on a result returned by this module (Solr powered):
http://constructscs.com/conStruct/view/?uri=http%3A%2F%2Fconstructscs.co...
What you see on this page comes from the Virtuoso triple store and is templated using Smarty & a special API on top of it.
I think we have to keep one thing in mind here: different tools for different purposes :)
So please answer the couple of questions I asked in this reply, so that I have a better understanding of your aims, etc.
Take care,
Fred
Clarifications
This is what the example was about: e.g., finding via a facet all resources with "type" "book". Or, put perhaps more unambiguously: find all resources that are the subject of a triple with a (specific) predicate meaning "type" and a (specific) object meaning "book".
So this would supply a facet that shows resources with different types and allows filtering on them. But the type is just an example; this could be done for any specific predicate.
Of course, this would only make sense if more than a few resources have this property, and only if it doesn't have too many different values (e.g., a comment). Which facets are provided would be configurable by the admin, so he or she would have to decide which ones make sense.
OK, this seems to be one of the main sources of misunderstanding here: I use "object" just as the third component of a triple; it doesn't matter (for the general goal) whether these are other resources or literals.
I already knew that trick: the apachesolr module uses the same one, e.g., for taxonomy terms. But the problem is, Solr itself doesn't know about it, so you can't search for values at corresponding positions of two multi-valued fields (which would be necessary for searching for objects of specific properties).
Well, what I want is fulltext search capability, so using Solr is pretty much a premise for this project. But faceting, or at least filtering, on the values of specific properties would be nice, too.
(God, the more I write, the less comprehensible it looks to me... I hope I have made myself clearer this time.)
And thanks for the links, that project also looks very interesting!
Hi! This is, what the
Hi!
So, what you want is simple filtering on key/value pairs (the property/object part of the triple). If so, check below.
Well, an "object" is the third third component of a triple. I just wanted to make sure it was the way you were using this third in this context too :)
Well, it really depends on the applications that will use that information (whether it is important or not).
For example, if you get "this-is-an-object" as an object:
Are we referring to a string (part of some text), or is it an IRI referring to a resource (with its own description, etc.)?
In some cases you don't care; in other cases you really do care :)
For example, in my personal use case, I only index the "objects" that belong to the class of literals (so all my objects are literals; I don't archive any IRIs). By the way, an IRI is an enhanced URI; this term is mainly used in the SPARQL spec, just to get rid of some possible ambiguity.
When I find a property which doesn't have a literal as its "object", I fetch the description of that non-literal object and try to find a "label" for it. If I find one, I save it in my "object_label" Solr field. Otherwise I ignore it.
But this is really geared toward my needs and aims. In your use case, you seem to care about the non-literal objects.
Okay, so about the question of how to get all the Solr docs that have a specific property/object: what about changing the schema a little bit to end up with a "property" field and an "object" field, and then sending Solr queries such as:
Faceting only:
q=:&start=0&rows=10&facet=true&facet.limit=-1&facet.field=property&facet.field=object&facet.mincount=1&fq=property:%22http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2Fbased_near%22 AND object:%22http%3A%2F%2Fsws.geonames.org%2F1850147%2F%22
This query would give you all the Solr docs that have a property foaf:based_near and an object geonames:Tokyo. So, it would give you all the resources (mainly people and organizations) based near Tokyo.
If you want only the people based near Tokyo, you could enhance this query with:
q=:&start=0&rows=10&facet=true&facet.limit=-1&facet.field=property&facet.field=object&facet.mincount=1&fq=property:%22http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2Fbased_near%22 AND object:%22http%3A%2F%2Fsws.geonames.org%2F1850147%2F%22 AND property:%22http%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%22 AND object:%22http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2FPerson%22
And so on...
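For reference, the same kind of query could be built with the SolrJ client roughly like this (a sketch only; it assumes the "property"/"object" schema described above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PropertyObjectSearch {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("*:*");
    query.setFacet(true);
    query.setFacetLimit(-1);
    query.setFacetMinCount(1);
    query.addFacetField("property", "object");
    // Resources having foaf:based_near and the geonames IRI for Tokyo.
    query.addFilterQuery("property:\"http://xmlns.com/foaf/0.1/based_near\""
        + " AND object:\"http://sws.geonames.org/1850147/\"");
    QueryResponse response = server.query(query);
    System.out.println("Matches: " + response.getResults().getNumFound());
  }
}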
HOWEVER, take care: if your only goal is to do this kind of query, I would REALLY suggest you take a look at SPARQL & some triple stores. The only reason I would suggest using Solr instead of a triple store to manipulate RDF data is when you meet one of these two conditions:
1- You want to perform full text search over big datasets (5, 10, 20, etc. GB of data) on small servers
2- You want to perform aggregate functions (not the filtering itself, but counting the results of such filterings) on big datasets, on small servers
This is the reason why we are using both Solr & Virtuoso in structWSF to feed conStruct.
Hope this will help you achieve your aims :) Please tell me if it makes sense/is clearer :)
Take care,
Fred
Thanks again!
Thanks again!
Now everything has really become a lot clearer to me, and I think I understand exactly what you mean.
However:
As said, this wouldn't be exactly the same thing, I think. For example, it would also find a resource with the triples:
resource foaf:based_near geonames:Vienna
resource http://example.com/rdf/#based_very_far geonames:Tokyo
At least if I'm not mistaken. But of course it would depend on the data how probable such a situation is, and in most cases your suggested implementation would probably suffice.
In any case, you've really helped me a lot, with ideas, knowledge and the great links to related work. Thank you! This will surely be useful when implementing the prototype.
PS: "IRI" was really new to me, but I already looked it up in Wikipedia the first time you used it. ;)
Hi! As said, this wouldn't
Hi!
Right, and here is the issue we are facing (now that I understand what you were thinking about :) ):
Let's say you have these triples:
<resource-A> foaf:based_near geonames:Vienna .
<resource-A> foaf:based_very_far geonames:Tokyo .
<resource-B> foaf:based_near geonames:Tokyo .
<resource-B> foaf:name "Bob" .
...
Translated into Solr docs, it would give something like:
[solr-field] -> [value]
"uri" -> "resource-A"
"property" -> "foaf:based_near"
"property" -> "foaf:based_very_far"
"object" -> "geonames:Vienna"
"object" -> "geonames:Tokyo"
"uri" -> "resource-B"
"property" -> "foaf:based_near"
"property" -> "foaf:name"
"object" -> "geonames:Tokyo"
"object" -> "Bob"
Note: "foaf:", "geonames:", etc... all these prefixes have to be extended with their full URI before indexation in the index.
Then, this query:
q=:&start=0&rows=10&facet=true&facet.limit=-1&facet.field=property&facet.field=object&facet.mincount=1&fq=property:%22http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2Fbased_near%22 AND object:%22http%3A%2F%2Fsws.geonames.org%2F1850147%2F%22
This would return both Solr docs above, because Solr has no knowledge of the order of the "properties" and "objects".
So, the only solution would be to filter this at the application level; otherwise you get too many false positives.
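A sketch of what that application-level filtering could look like, reusing the positional pairing trick from earlier (it assumes the "property" and "object" lists were written in lockstep at indexing time):

import java.util.List;

public class PairFilter {
  // Keep a doc only if the wanted predicate and object occur at the SAME
  // position in the two parallel multi-valued fields.
  public static boolean hasPair(List<String> properties, List<String> objects,
      String wantedProperty, String wantedObject) {
    for (int i = 0; i < properties.size(); i++) {
      if (properties.get(i).equals(wantedProperty)
          && objects.get(i).equals(wantedObject)) {
        return true;
      }
    }
    return false;
  }
}

The cost is that Solr may hand back many candidate docs that the application then has to throw away.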
I think the conclusion is: with the method I pushed forward in my blog post, we don't have this problem for the specific fields ("type", for example), which don't involve this issue. To enable what you want, it appears that dynamic fields would be needed (but what does that mean if we have a dataset of 1.5 million resources and about 50 to 100 million triples?); I cannot say, because I have never tried dynamic fields with that kind of dataset.
I will suggest one more time that you take a look at some triple stores and the SPARQL query language for this kind of querying over RDF data. As I said, use Solr for two things:
1- Full text search and,
2- Aggregates (how many things of type "X"; how many properties "Y"; how many....)
Thanks for creating this topic: quite an interesting one.
Don't hesitate to contact me in the future about this kind of stuff!
Take care,
Fred
Fred, any chance Thomas could study your code?
http://openstructs.org/downloads
I don't know how (or if) you intend to release the code, but it would likely be very helpful to Thomas' project to see your implementation.
-Robert
Hi Robert, Hoo certainly. I
Hi Robert,
Hoo, certainly. I mean, the structWSF code will be available at http://openstructs.org, and conStruct's code (which is an aggregation of multiple modules: "conStruct core", "structSearch", "structBrowse", "structCreate", "structUpdate", "structDelete", "structView", "structResource", and then to come: "structImport", "structExport", "structDisplay" and "structOntology") will be available at http://constructscs.com.
These should be released by the end of this month (June) with the code documentation. More technical documentation will be written in July and August.
The conStruct modules will be available on Drupal CVS under GPL2.
structWSF will be available on our own SVN server under Apache 2.
So, I think we only have to wait another week or two before I release this code (I still have other milestones to finish, and I have to make sure the code documentation is clear for a first alpha release).
I will also be available to assist him in understanding the code, and in creating his own.
Take care,
Fred
Awesome!
That's sure to help us nail this SoC project - along with all the experience you bring having nearly completed such an implementation. How exciting!
Hi Robert! Good that you are
Hi Robert!
Good that you are happy. I think this SoC project has (at least) two possibilities:
(1) use structWSF and enhance it to add new web service endpoints (for example, other than the Search and Browse web services that support the search and browse tools of the constructscs.com demo sandbox), and then create the proper UI (Drupal module) for the project. That way, you could save much time implementing some of the existing capabilities, and concentrate time, resources and effort on implementing the real meat of the project.
(2) re-use some of the code needed for the project (some GPL2 code (conStruct) and some Apache 2 code (structWSF)).
In any case, I think this SoC project can benefit from what is coming within 2 weeks (it is just a matter of me wrapping up and cleaning the first version of the code documentation; but as I said, the more technical and general (traditional paper-like) documentation is still to come, beyond what we currently have on the web sites).
So, let's see what can be done. In any case, I would be pleased to assist with any specific enhancement for this project, and to help you implement it on your servers if you take path #1 (considering the documentation for that purpose is lacking and won't be available within the next 2 weeks).
Thanks,
Fred
It's amazing that you are
It's amazing that you are already doing such similar work :) It looks really interesting. However, I wonder: what's the role of the CMS/Drupal in your system? Is it a data provider or maybe also the frontend?
I think it can do much more than just "full text search". Think of full text search in a given property, or returning resources that are relevant to another one. Though to let such features work properly, Solr has to know about the predicate-object relations.
I think "dynamic fields" could handle that much triples - the count doesn't make the distinction, the structure of the schema does. If those 1.5 million resources have a heterogeneous schema you would end up with a bloated index, with a homogeneous schema I see no problems. So whether dynamic fields are useful or not, ends up with the question: What data do you want to index?
For indexing data of one or a handful of Drupal sites, see what I already wrote above.
Also, if you want to go and add some external data, this approach seems fine to me, and thanks to RDF this is no big deal. But of course, for searching the "web of data" the solution would have to look different. In that case your approach looks to be the better choice - also SIREn looks promising to me.
More about structWSF and conStruct
Hi Fago!
First, I will re-state a couple of things vis-à-vis structWSF (web service framework) and conStruct (Drupal) and the relation between the two.
Let's take the schema that shows the "whole system in action":
http://openstructs.org/structwsf/architecture
The goal of structWSF is to make all functions of our framework available as a web service. This means that any function of the system (search, browse, authentication, CRUD operations, dataset handling, etc, etc, etc) is a web service endpoint. This is structWSF.
Then, as you can notice at the top of this schema, we have a CMS layer on top of these capabilities. So, what we have done with conStruct is create a suite of Drupal modules that interact with these functions. In such a setup, Drupal gives us many benefits. We use its various APIs (core + extensions via other modules such as OG) to handle user creation & management, and to have access to a complete CMS that lets us create quality web sites, etc.
At this level, Drupal is the frontend of the framework. But what is quite interesting is that you can have one structWSF instance on one server, and multiple Drupal instances (different web sites, with different skins that use some or all datasets hosted on the structWSF instance) on other servers. All the Drupal instances can play with the same data (or their own), remotely on that structWSF instance.
Additionally, a single Drupal instance can interact with multiple structWSF instances (if that Drupal instance has the proper rights...).
So, let's say that you have a Drupal instance -A-, and that this Drupal instance has access to a structWSF instance -B-. This means that you can create datasets on this structWSF instance; that you can CRUD data; search/browse it; etc. It is nice to be able to do all these things from Drupal, but at some point in time you may want to manage that same information with, say, a Firefox addon you created. That is not a problem: since all capabilities of structWSF are available as web service endpoints, you can interact with the same data by using these endpoints (exactly like Drupal does).
This means that your data is not locked in any system; and that it is accessible to use from any application.
So, this is more or less the relation between structWSF and conStruct (drupal).
However, if your goal is to put all the data you create from Drupal into a structWSF instance, then the only thing you have to do is use the existing Drupal RDF modules to take some CCK types and fields, map them to some classes and properties, and use the structRdfCCK module to make the bridge between what these modules create and the target structWSF instance where you want that information. Then this information would become accessible to any application that knows how to interact with the structWSF endpoints, and naturally, that information will become accessible to the conStruct modules (search, browse and others to come, from us and contributed). BTW, structRdfCCK is a test that I have done to make sure the idea was working; so eventually I will release it (but it won't be in the first release).
Sure it can; but it is still limited, I think (and as you put it in your message, it is on a case-by-case basis, depending on the schema used). Depending on the size and complexity of the RDF graph of the dataset, know that it is possible to do SPARQL fulltext search queries that can return references to resources such as:
Give me the "family name" of a "person" that "knows" a "person" that has "Bob" as "proper name" and that "lives" in "Montreal" in the "country" of "France" (and not Montreal in Quebec).
If you follow the path of such a query:
"person - family name -" -> "person - Bob -" -> "city Montreal" -> "country France"
This involves 4 different resources that are linked together. This can easily be done with a SPARQL query, but not with Solr, I fear.
However, triple stores are not good at aggregations of things (the number of resources that are Persons; the number of resources that use the property "foaf:family_name", etc.) because of the way they perform these counts. Solr, however, is quite good at it. (Note: I am always referring to big datasets of millions of resources, not datasets of a couple of thousand or hundreds of thousands of resources.)
So, this is why I said these two points above.
Thanks for confirming this. So, it is always the same thing: the right tool for the right task :)
Tell me if I am wrong, but I think that, per my description above, SIREn has quite a different aim than structWSF and conStruct, no? (connectivity drivers for Drupal, to other RDBMSes?)
So, I hope this additional information about all these possibilities will help in understanding how structWSF and conStruct could be leveraged for this project :)
Thanks!
Take care,
Fred
thanks for your
thanks for your explanation!
I agree, that's not feasible with Solr. However, things like full text search in a given property, or returning resources that are relevant to another one, are - but only if Solr knows about the predicate-object relation. So the right schema also depends on what you want to achieve.
Exactly!
Of course - sorry, I was referring just to your Solr schema, not to the whole system. So SIREn could be used as an alternative to the standard Solr server for indexing RDF; however, unfortunately we don't know much about it yet :(
I agree, that's not
Sure, no doubts about that :)
Sure, this seems to be possible. In fact, in the end, many things can be used to index and query triples. The most flexible systems are the triple stores that have been developed for some years now (and even there, many strategies have been employed so far); otherwise we have this Solr example (well, we more or less archive triples, but we can reconstruct them from the information indexed with the schema :) ). Others will index them in-memory in JavaScript objects for other purposes, etc.
So you are naturally right here.
Thanks!
Take care,
Fred
Hi again, Here is another
Hi again,
Here is another example of what you can do with the approach I was talking about in the blog post you mentioned, with some modifications to the Solr schema:
http://constructscs.com/blog/fgiasson/2009/6/structbrowse-now-available-...
Hope it helps,
Thanks,
Fred
Thanks for the insights, Fred
All help is appreciated.
I believe that the Tracker
I know that the Tracker project (http://projects.gnome.org/tracker/) does RDF querying the following way:
1) it parses the RDF query
2) it maps the predicates to actual fields in the index
3) it compiles those fields into a standard Solr query
Basically, the RDF query part of the Tracker project doesn't touch the index. It is a compiler from RDF to Tracker queries. I think this is the way to go: compile the RDF query into the proper Solr query.
Which, as noted above, is a huge plus. Big horizontal Solr indices are sooo bad, particularly if the RDF query isn't pushing a q parameter to the Solr server.
edit: source: http://www.mail-archive.com/tracker-list@gnome.org/msg04851.html
Tracker uses Solr?
From your comment I understood that they use Solr to do this. Looking at their project page I find that unlikely.
Discussion continued..
I think we should keep the discussion in one place, so I'll try to move the new mail discussion over to this thread by posting my reply here:
Oh, interesting idea. I don't think it would bloat the index, as the "fixed" schema has only a low number of fields. Thus one could index "more important" properties with dynamic fields and the others with the fixed schema. However, the question remains how to determine the important fields. And of course, once this changes, you'd have to re-index...
When I think of Views integration on top of that, I'd also say a typical query is to filter for a given value of a given property and do full text search and/or faceted browsing on the other properties.
Hm, I don't think a human does, but Drupal might, e.g. Views. However, I don't see why this shouldn't be possible with the simple schema too. I think the RDF module already provides such CURIEs, which we could rely on. Thus the index would be a bit smaller, but we would assume that different Drupal sites talking to the server use the same CURIEs.
Again, an interesting idea :) But couldn't we take the subject and save it once in a separate field? Then we could just index, in a multi-valued field, "type person.", "foaf:knows uri://Fritzi."
However, I don't see why we can't apply it to literals too?
Just index:
"foaf:name "Wolfgang Ziegler"" ?
I think this approach wouldn't bloat the index, but would already allow for a lot of queries. So I wonder whether Fred's approach has some advantages over this one.
Oh, interesting idea. I
I would see the use of a fixed schema the other way round:
I would index the most common (generic) and important properties with static fields (such as rdf:type, rdfs:label, the specific Drupal vocabulary, etc.), and the uncommon or unknown ones with dynamic fields. As you said, the important fields should be well defined from the beginning; that is why I propose very generic properties. Anyway, even if the schema changes, you don't have to re-index all the documents. Only the freshly indexed documents will have the new schema, whereas the other ones keep the previous schema. Then, incrementally, over time, most of the documents will be re-indexed and their schemas updated.
Sure, the idea is to have a 'subject' field that indexes and stores the subject URI, and a second field that indexes the predicate-object relations.
And as you said, we are not limited to plain URIs. We can also index prefix:shortname forms such as foaf:knows, or just shortnames such as knows. All these variations can be indexed at the same time (Lucene/Solr allows indexing multiple tokens at the same position). So in fact, you can index:
# syntax -> position:token
0:http://www.w3.org/1999/02/22-rdf-syntax-ns#type 1:http://xmlns.com/foaf/0.1/Person
0:rdf:type 1:foaf:Person
0:type 1:Person
The only requirement is that both the predicate and the object should each be a single token.
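To illustrate the "multiple tokens at the same position" mechanism (this is not Renaud's actual code, just a sketch against the newer Lucene attribute API, which may need adapting for older versions): a position increment of 0 stacks a token on top of the previous one, so the full URI, the CURIE and the local name can all occupy the same position.

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class SameMeaningTokenStream extends TokenStream {
  private final String[] terms;
  private final int[] increments; // 1 = next position, 0 = same position
  private int index = 0;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posAtt =
      addAttribute(PositionIncrementAttribute.class);

  public SameMeaningTokenStream(String[] terms, int[] increments) {
    this.terms = terms;
    this.increments = increments;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (index >= terms.length) {
      return false;
    }
    clearAttributes();
    termAtt.append(terms[index]);
    posAtt.setPositionIncrement(increments[index]);
    index++;
    return true;
  }
}

For the example above, the terms would be {"http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "rdf:type", "type"} with increments {1, 0, 0}.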
If you think that the user will always give the "full literal", it will work. But for fuzzier keyword searches, such as "foaf:name ziegler", it does not. In fact, the proposed trick works only if both the predicate and the object are taken as single elements (even if the literal contains multiple words). This approach is close to the mechanism employed in triple stores, where literal objects are interpreted as a single element in the underlying database system.
My first proposition for the schema would be:
- subject: a field that indexes and stores the subject URI
- context: a field that stores the context (or provenance) of the entity. This corresponds to the fourth element, i.e. c, in what we call quads when manipulating named graphs.
- type: a field that indexes the class of the entity (it is always a good idea to put the rdf:type predicate in a separate static field, since you know in advance that it is likely to appear in every entity. Moreover, this saves space, since you don't have a long posting list for the term rdf:type, which would be completely useless anyway, the term appearing in every entity).
- ... (other static fields have to be defined. I am pretty sure there are others in the Drupal data schema. I generally also add rdfs:label as a static field, since when reasoning is performed, it is likely that an entity has a rdfs:label property).
- po-relation: the field that indexes all the predicate-object relations (as discussed above). This can also be used as the default search field, since a simple keyword lookup or a boolean combination of keywords is also supported by this field.
- *_p: optionally, a few dynamic fields indexing the predicates that have literals as values. This will allow supporting queries such as "foaf:name = ziegler".
Using this schema, you can search for any entity matching certain keywords (using the object field) or a certain relation (using po-relation; see the sketch below). You can also generate facets using the static and dynamic fields.
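To make the po-relation lookups concrete: if each statement is indexed as the adjacent token pair "<predicate> <object>", an exact predicate-object match is simply a phrase query over two consecutive positions. A sketch of one way to do it (older Lucene API; the field name is taken from the proposal above):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PoRelationQuery {
  // Matches docs where the predicate token is immediately followed by the
  // object token, i.e. where both belong to the same statement.
  public static PhraseQuery pairQuery(String predicate, String object) {
    PhraseQuery query = new PhraseQuery(); // default slop 0 = exact adjacency
    query.add(new Term("po-relation", predicate));
    query.add(new Term("po-relation", object));
    return query;
  }
}

A positionIncrementGap between the values of the multi-valued field keeps such phrases from crossing statement boundaries.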
An important note about facets in Solr: facets are not restricted to certain field types; Solr provides a more flexible methodology for creating facets (the facet.query feature, see http://wiki.apache.org/solr/SimpleFacetParameters). facet.query allows defining facets based on arbitrary queries. If you want to allow flexibility on the admin side, this can be an interesting solution. Indeed, the admin could create exactly the facets he wants. You just need to provide an interface so that the admin can explore the data and find the interesting facets for his data collection; see the sketch below.
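For example, such admin-defined facet queries could be attached to a search roughly like this (a SolrJ sketch; the field names and values are only illustrative):

import org.apache.solr.client.solrj.SolrQuery;

public class AdminFacets {
  // Each facet.query is an arbitrary query string an admin could configure,
  // so the facets can match exactly the data collection at hand.
  public static SolrQuery withAdminFacets(SolrQuery query) {
    query.setFacet(true);
    query.addFacetQuery("type:\"http://xmlns.com/foaf/0.1/Person\"");
    query.addFacetQuery("context:\"http://example.com/dataset/site-a\"");
    return query;
  }
}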
Another point about the facets: in Solr, facets are just a kind of "filter query", in the sense that each facet is in fact a query that will be executed at Solr initialisation, and the result of the query (i.e. a list of document identifiers) will be cached in memory. Then, when the user selects a facet, the current search result will be intersected (or filtered) with the facet query result (which is cached).
You need to keep this in mind when creating facets. It will be difficult to create facets for all the predicates (your memory will explode), and even if you select a small number of predicates (or fields) as facets, you need to be careful and look at how many different values each predicate has. For example, for the field rdf:type, let's say we have 100 different classes. If we select rdf:type as a field facet, Solr will cache 100 query results (one for each class) in memory.
As you can see, the memory requirement can rapidly increase if you don't carefully select the field facets.
It is for this reason that I am a little afraid of just using dynamic fields and using them as facets. I would advise taking the time to carefully select a small number of predicates and create a static field for each of them in the index schema, in order to define a set of generic facets. Yes, you will lose some "expressiveness" and flexibility for the faceted interface, but I think a trade-off has to be made.
Best,
Renaud Delbru
OK, this is another idea,
OK, this is another interesting idea. What advantages would you say the static property fields have? What's the difference between having the type in a field "type" and having it in, e.g., "rdfs_type_p" (or, rather, some translation of the complete URI, probably)?
Unless the user has to input it directly, of course, then the benefits would be rather obvious. ;)
Also, I don't think we should hardcode specific Drupal properties/fields in the schema. Drupal data might be a prominent use case, but homogeneous data from other sources should be just as add-able, in my opinion, if there are no big reasons against it.
But using dynamic fields for just a few properties (maybe/probably again specified by the site admin) is a very interesting idea in any case.
And the part about facet generation is news to me. Do you mean to say that for every field that could possibly be used as a facet, all results are kept in memory? This sounds almost unbelievable to me, especially keeping in mind the performance Solr delivers on the whole. Could you clarify this? (But maybe it's just me being a bit tired... ;))
In any case, you're right that such performance considerations also have to be taken into account. And facets sure are a potential source of hassle there.
OK, this is another idea,
Indeed, there are not many benefits of a static field over a dynamic one. With a static field, you just know in advance that the field exists, and can then use it, for example, as a default facet.
You're saying that Drupal properties might be a prominent use case. In addition, you know in advance that they will be there. So why not define them as static?
And as you said, homogeneous data from other sources should be just as add-able. This, for me, is a perfect use case for dynamic fields, i.e. they will be created on the fly based on the incoming data.
I think it will be interesting to work out which actions the admin will be able to perform. For example, it would be possible to present to the admin a summary of the data collection (e.g. statistics); then the admin can
* choose which predicate to keep (and index) or to discard
* choose which predicate to use as facets (and then, automatically create specific fields for them),
* create manually a list of "facet queries"
* ...
You can read http://wiki.apache.org/solr/SolrFacetingOverview to get a general background on how faceting works in Solr, and what you should take care of.
There are three types of facet operations:
* facet query: created manually (this is in fact a query); the facet is cached only the first time it is used
* Enum Based Facet: given a field name, Solr will iterate over all the terms of the field (using the dictionary) and cache their inverted lists (the lists of document IDs) in the FilterCache.
* Field Cache Based Facet: I don't know exactly how it works, but it seems to be for a special use case (facets with a large number of values)
As you see, Solr relies on the FilterCache for the facet operations. The FilterCache will store, for each facet value, the list of document IDs. In most cases, the list of document IDs (a list of integers) has a small memory footprint, so you can cache many of them (hundreds without problems).
There are some cases where the list of document IDs can be very large, in fact covering a large part of the document ID space. For example, this happens with a very common facet value (a facet value that appears in all the documents).
Imho, the best strategy for the facets will be to leave the question of selecting the correct facets for the data collection to the admin. But you can propose "generic facets" that you know will be relevant for every data collection (like rdf:type).
Best,
Renaud Delbru
Indeed, there is not so
Hm, if there is no benefit, why should we do so?
Yep, that makes sense. Also, allowing the admin to restrict the indexing of RDF resources to certain types might make sense too.
But using dynamic fields for
This is particularly interesting if you consider that a Solr index will mostly (maybe only) index one kind of dataset. Even with big datasets (GBs of data), the schema(s) used to describe the entities of the dataset are not that big (except for some bio ontologies, etc.). So, in most cases, this is doable. In the worst case, with some additional setup, you can always use Solr in multi-core mode, where you can have different schemas running on different cores of the same instance (just a possibility).
Sure, the idea is to have a
I don't think we'll have humans searching for whole URIs, and automatically generated queries can use the abbreviations too - so why bother with the whole URIs?
Indeed, this schema is powerful - it probably makes sense to use it as the default for all properties of indexed resources. Apart from that, we can try to ship with a sensible default that adds dynamic fields for important Drupal resources that we need to be able to use for facets or sorting. Thus the admin could still customize the defaults and re-index afterwards. That would give us a really performant and powerful solution :)
The challenge: the User Interface
Because if you rely on prefixes, you will end up with all kinds of problems when the time comes to integrate external data into this instance.
In any case, you shouldn't expect any sane human to do any query using either full URIs or prefixes.
What you have to do is put the burden on the developer and the UI designer. But even there, the UIs can become quite complex and hard to use for someone who has no knowledge of these concepts. It is not natural to describe the entity you are searching for. But it is natural (now) to write keywords.
Here is one of my old query UIs (2 years ago) that turned out to be quite complex:
http://fgiasson.com/blog/index.php/2007/03/26/zitgist-search-query-inter...
The hardest part of the story is not creating the backend, but really figuring out how to create a meaningful UI that is easy to use and understand for users that have no prior knowledge of any of this stuff.
CURIEs
Hi there!
Take care here. You can't rely on these CURIEs. Sure, most people use FOAF for Friend of a Friend, but these are local prefixes within a single RDF document. There is no standard, and nothing would stop me from using NS1:
In fact, many systems that automatically generate triples use generic prefixes. The only things you can rely on are the full URIs.
The latest proposition could work. But the problem I see (depending on the use case) is that you won't be able to do filtering and aggregation on properties alone (the number of times property A is used; give me all the resources that are described by property B, etc.) - unless you add this more granular information to your index too (so, something to keep in mind).
Thanks,
Fred
First shot at dynamic-fields schema
I've committed a first draft of a schema for the dynamic-fields variant to the module's CVS (I would've attached it here additionally, but the upload doesn't allow XML).
The first version had just one dynamic field ("*", so any unknown field would be caught), but having one for strings (URIs, mostly) and one for texts (literal objects) would be more practical, wouldn't it?
There is also the property_object field, which just stores the triple's second and third parts and uses only a WhitespaceTokenizerFactory for tokenizing, so phrase queries with URIs should be handled correctly.
The only other field is the URI.
(Update: Added object field for fulltext data about the object.)
NB: At the moment, all fields are indexed and stored; when programming the interface we'll see which of them really need to be retrievable.
I still don't think that static fields add something valuable, and if someone wanted to use the module for completely non-Drupal-related data, the field could also be absent in all resources. That it would be there would therefore still be just an assumption, as with a dynamic field.
And since the RDF module doesn't store any context (at least I think so) and I'm not sure about its uses, I've also left that out, compared to Renaud's proposition.
But about the "object" field: You mentioned it, Renaud, but didn't list it. Would you add that one too?
Thinking about it properly, this would probably be a rather good idea: using that one for the real fulltext data (trying to use the resources' labels instead of URIs, etc.) and therefore also as the default search field. We could also add the type or other data of the (non-literal) object to this field, if we just index it without storing it.
Committing that one as revision 3...
Also, regarding the other meta-discussion: I too am in favor of using complete URIs. Mere prefixes are not guaranteed to be unique or consistent among different sources, and could therefore lead to serious problems (albeit only in rare cases).
Hi Thomas, I had a look on
Hi Thomas,
I had a look at your schema, and I have some feedback.
I mentioned it, but then modified the schema, because in fact it is not needed if the 'property_object' field is correctly analysed. If the values of the property_object field are analysed in such a way that URIs are not tokenised but literals are, it becomes possible to answer the same queries as with two fields (*_s and *_t) indexing URI and literal objects separately. The latter fields are, imho, useful only if you plan to use them as facets.
I have a modified version of Lucene's StandardAnalyzer that recognises URIs (based on RFC 3986), and hence does not tokenise them. I don't know how to upload these files here. Can someone explain it to me?
I would also advise keeping the solr.LowerCaseFilterFactory for URIs; it helps to normalise them (useful for search engines, see http://www.silverspike.co.uk/2009/02/28/url-canonicalisation-and-normali...). I also have another URI filter that removes the trailing slash.
I would advise removing the solr.EnglishPorterFilterFactory. It is defined by default, but from my experience it gives unexpected search results; it is better to keep the full word instead of its stemmed version. The consequence is just a small increase in index size.
I think the rdf api already
I think the RDF API already supports contexts, see rdf_rdf_contexts(). For a description see http://sw.deri.org/2008/07/n-quads/.
So when indexing RDF data from multiple Drupal instances, it would be useful to be able to search/filter by it, so a static field probably makes perfect sense for it.
Named Graphs
There is also this paper that introduces the concept of named graphs:
http://www.hpl.hp.com/techreports/2004/HPL-2004-57R1.html
Named graphs are important when managing and integrating data from multiples sources of information.
context in the RDF API
Yes, the RDF module does store contexts, and it's also part of the UI on the RDF data page (admin/content/rdf), in the advanced options fieldset.
Graph
As you will notice in the conStruct release to be finished by tomorrow, I have updated the initial schema I posted in my blog post with this additional information (dataset/graph). This is what you can already experiment with via the multiple references to "dataset" on the conStruct demo pages. In any case, I would really encourage this practice in this project.
Context
OK, thanks @ all for the input! I've now added a (stored and indexed) string field "context" to the schema. I've also added Renaud's suggestions regarding filters.
Wow, perfect! The other one, the filter that removes trailing slashes, could also be useful.
About uploading: basically, the attachment field is directly under the comment textarea – but only certain filename extensions are allowed. So you'll either have to append a dummy extension, or send the filters via mail.
CustomAnalyzer
Ok, I see it. Lazy me, I should have looked better ;o). But it does not accept archives. I have uploaded the file to my dropbox:
http://files.getdropbox.com/u/1278798/custom-analyzer.tar.gz
I just copy-pasted the files from various projects. The URI filter has to be modified with the correct token type for URI.
In order to use these classes in Solr, you just have to create a jar and put it in the lib directory of Solr. When the Solr server is launched, it will automatically load all the jar files found in this directory.
I suggest you first play around with them a bit on your local computer. Just create some unit test files that instantiate the custom analyzer, and try to parse some data with it to check the output; something like the sketch below.
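Such a throwaway test could look roughly like this (it assumes CustomStandardAnalyzer from the archive exposes the usual Analyzer interface; the attribute API shown is the newer Lucene one and may need adapting):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerSmokeTest {
  public static void main(String[] args) throws Exception {
    // Constructor arguments, if any, are omitted here.
    CustomStandardAnalyzer analyzer = new CustomStandardAnalyzer();
    TokenStream stream = analyzer.tokenStream("property_object",
        new StringReader("http://example.com/New%20File.txt some literal text"));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println("token: " + term.toString());
    }
    stream.end();
    stream.close();
  }
}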
Thanks a lot!
Thanks a lot! There is a problem, though: some dependencies are not included, namely:
org.sindice.solr.plugins.analysis.standard.FieldAnalyzerMapping
org.sindice.lucene.index.benchmark.btc.NodeIteratorTokenizer
I've quickly looked through the JAR files distributed with Solr, and with Google, but have found neither. Could you send them, too? Or would it be better (or possible) to slightly change the code so they aren't needed?
And yeah, I'll surely first check it out a bit instead of just using it. ;) And which token type would you suggest for URIs, just "word"?
org.sindice.lucene.index.ben
In the URI filter, you have to change:
if (reusableToken.type().equals(NodeIteratorTokenizer.getTokenTypes()[NodeIteratorTokenizer.URI])) {
to
if (reusableToken.type().equals(CustomStandardTokeniser.getTokenTypes()[CustomStandardTokeniser.URI])) {
This gets rid of the NodeIteratorTokenizer dependency.
I see the dependency in CustomStandardAnalyzer, and you can remove it safely, along with everything that references it. It was a home-made field mapping, used to apply different analyzers to different fields. Useless here.
Tried it out a bit...
I also had to change
CustomStandardTokenizer.getTokenTypes()[CustomStandardTokenizer.URI]
to
CustomStandardTokenizer.getTokenType(CustomStandardTokenizer.URI)
in URITrailingSlashFilter.java, but now it works great, thanks again!
However, one "bug" report and one question:
"Bug": URI escape sequences (via "%xx") aren't recognized as being part of a URI. E.g., "http://example.com/New%20File.txt" is split into tokens "http://example.com/New" and "20File.txt". And although packing the character classes map is a nice idea, it makes changes like this a bit difficult. ;)
Question: Will this code be released, and under a GPL-compatible license? Otherwise I would probably have to abandon this option... Or provide some complicated instructions on how to obtain these classes, pack them into a JAR file and put it in the right directory.
Although, I don't even know if I could just put the JAR file into my project, even if it were licensed under the GPL, considering Drupal's third-party-libs policy...
What would be your suggestion on how to make this available along with my module?
I wouldn't like to do without this tokenizer; it really makes things a lot easier! And re-implementing it myself is probably a bit beyond the scope of this project...
"Bug": URI escape sequences
What do you mean by "packing the character classes map"?
For the bug, I'll try to look at it, or I can send you the JFlex file so that you can fix it, and regenerate the CustomStandardTokeniserImpl file ;o).
Good question. Maybe it would be a good idea to ask Drupal about the licensing, so that we have an idea of what is possible and what is not (for later, too, if I need to give you other files). Otherwise, I can offer the JFlex file to Drupal; this will be my small contribution to the project. But on one condition ;o): that Drupal asserts my moral right on those files (i.e. that Drupal identifies me as the author of the files).
Renaud Delbru
"Bug": URI escape sequences
I have modified the JFlex file for the tokenizer, in order to recognise the encoded characters '%xx' in a URI. Tell me if the problem is still there (I didn't test it). You can download the new version at:
http://files.getdropbox.com/u/1278798/custom-tokenizer.tar.gz
It also contains the JFlex file, if you want to play with it, modify it and regenerate the scanner (CustomStandardAnalyzerImpl.java).
Renaud Delbru
I have modified the JFlex
Yes, sadly it is. The tokenizer just behaves a bit differently when inputting interactively (it only returns the latest-but-one token, and the last one only once the input is ended), but the bug is still there, unchanged.
I haven't used JFlex yet, so it would be great if you could fix this for me! But I could also do it myself, I think, if you prefer.
Bug Fixed
I modified the JFlex file, tested it and solved the bug (it was just a syntax problem in the JFlex file).
I have uploaded the files, as well as the unit tests, to the same dropbox location (custom-analyzer.tar.gz). Enjoy.
Renaud Delbru
Sorry, it seems that I
Sorry, it seems that I forgot to download this new version, or saved it somewhere and can't find it now. And now the above links don't work anymore.
Could you please upload the fixed version once more?
File uploaded
File uploaded.
Sorry, I removed it, thinking you had stored the archive somewhere.
License
Sorry, I didn't realize this was generated by JFlex and that JFlex encodes such mappings that way (the "\2\u24\3..." strings). I thought this was some trick of yours.
Come to think of it, the specific license probably won't be a problem. It's just that, afaik, third-party libraries (which the JAR would probably count as) mustn't be included in a module's CVS project. So I'll just have to upload it somewhere else (just need to find a good spot) and link to it from the module's main page and the README. As long as your license allows this, you can choose whichever you want. And of course I'd give you full credit as the author of these classes, and you'd keep full rights to them.
These are just my thoughts, maybe I missed something, but I think it should be OK this way.
Licensing Issue
Don't bother too much with that. Just include the files in the project. Add me as an author, change the file headers (to remove any mention of the Sindice project), and add yourself as an author too.
Or maybe there is a licensing issue that I am not aware of. Can somebody from Drupal check that?
Since these files are based on original Lucene files that are under the Apache 2.0 License, I think you just have to mention Lucene in the project, and mention the files that have been derived from the Lucene project:
Some code in src/java/.../CustomStandardAnalyser.java was
derived from Lucene 2.4 sources available at
http://lucene.apache.org/
Renaud Delbru
Found the link
I was referring to the rule in drupal's CVS Usage Policy about not committing third-party code.
So I think it's pretty obvious that I really shouldn't directly include the JAR or the Java source files in the CVS project. But it's not much of an inconvenience; I just need some space where I can upload the file, plus an accompanying page explaining its author, provenance and license.
Drupal CVS
Ok, I understand the restrictions better now. Let's do as you proposed: a small jar/tarball provided somewhere, under the Apache License 2.0. I don't think there will be any problem with this solution.
You could host your
You could host your code/files on a public repository like github or unfuddle.
Doesn't HAVE to be external
If it can be GPL licensed, and if you are maintaining it as part of THIS project, you can include it on Drupal.org. If either of those statements isn't true, then yes, keep it external.
Oh, good!
Good news! But then I'd have to put the Java source files up there, not the JAR, right? If so, it would be rather impractical for users, who would have to compile and JAR the files themselves just to use the module.
Also I don't know if it could be GPL licensed, or if Renaud would do that.
But if he would, and the JAR wouldn't be a problem, then this would of course be the best alternative.
GPL Licensing
Hi,
on my side, I don't see any problem with putting these files under the GPL license. How do we proceed?
Renaud Delbru
Well, that depends on if I
Well, that depends on whether I could just commit the JAR file or would have to commit the individual Java files. I'm not really versed in Drupal's rules in this respect.
But I think, if there are no objections, I'll just commit the JAR file and add some lines to the README.txt giving you and Lucene credit for it. If that's OK with you.
Context again...
Hi!
One quick question for the RDF pros here: Is the context assigned per resource/subject or per statement/triple?
In other words: is it possible that a resource is the subject of statements in different contexts?
Or is this maybe rare enough to be ignored?
Because right now, with "context" being just a single-valued field per resource, multiple contexts for one resource would obviously lead to problems. I also wanted to let admins control the indexing of RDF content (Is it indexed? And: using which method and which Solr server?) based on different contexts, but if a resource can be in multiple ones, this could also lead to problems.
On the other hand, if I could count on all resources having a single context in which statements about them are made, other program logic would also get significantly easier.
Thanks in advance!
Hi! First question, when you
Hi!
First question: when you say "context", are you referring to a "data source", a "dataset"? In RDF, what we use to "cluster" information is one more element in the triple, known as the "graph" IRI (this is why we more often than not talk about a quad store, and not only a triple store; a quad store has 4 elements: graph, subject, predicate and object).
A "graph" can be used for many things. Personally I use it most of the time to track provenance of data (a certain dataset from a certain data source).
Certainly. Consider a resource <resource-A>. I can describe the same resource in multiple contexts (graphs). We can have quads such as:
<graph-A> <resource-A> <foaf:name> "bob".
<graph-B> <resource-A> <olb:bio> "bob is...".
etc.
Personally, what I do with the Solr index for that use case is create the key value of a Solr document from two fields: URI & dataset (i.e. context, graph, etc.).
In my example above, this would lead to two Solr documents for the same resource in two different graphs.
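A sketch of that document construction (the field names are only illustrative):

import org.apache.solr.common.SolrInputDocument;

public class GraphAwareDoc {
  // The unique key combines the resource URI and the graph URI, so the same
  // resource described in two graphs yields two distinct Solr documents.
  public static SolrInputDocument build(String resourceUri, String graphUri) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", resourceUri + " " + graphUri);
    doc.addField("uri", resourceUri);
    doc.addField("context", graphUri);
    return doc;
  }
}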
Does this answer your questions?
Tell me if I missed something :)
Thanks,
Take care,
Fred
Thanks, exactly what I wanted to know!
Yes, with "context" I meant the graph the triple was a part of.
OK, I feared as much.
But your idea of generating several documents per subject, one for each context, seems like a good solution for at least one problem. I just have to think through a bit how the rest of the code will have to look, then.
Anyways, thanks for the quick reply!
Good
Hi again,
Good. Just tell me if you hit issues in some edge cases. On my side, this is working perfectly for all the use cases I have tested, and for the way I use Solr in our systems.
Thanks!
Take care,
Fred
And one question about schema.xml...
OK, this discussion is slowly becoming kind of an all-purpose help site, but this has at least some relevance to the implementation:
What kinds of strings are allowed as field names? Any, or just standard identifiers (/[a-z][a-z_0-9]*/)?
Or, to be more precise: would it be possible to use, e.g., "http://www.w3.org/1999/02/22-rdf-syntax-ns#type_s" directly as the dynamic field name for the type predicate, or will I have to think of some "predicate URI ==> field name" translation?
The official Solr documentation is a bit meager in that respect, but maybe someone here will know. Otherwise I'll just have to post to the Solr Developers List.
I think all kind of string
I think all kinds of strings are allowed.
The only problem that I am aware of is with strings containing the character ':', since it is normally reserved by Lucene (there are other such characters too). You are obliged to escape these characters at query time. For example, if the original query is
http://www.w3.org/1999/02/22-rdf-syntax-ns#type_s:person
you should rewrite it to
http\://www.w3.org/1999/02/22-rdf-syntax-ns#type_s:person
before giving it to Lucene / Solr.
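On the Java side, Lucene ships a helper for exactly this kind of escaping; a sketch (note that the QueryParser package path differs between Lucene versions):

import org.apache.lucene.queryParser.QueryParser;

public class FieldNameEscaper {
  // Escapes ':' (and Lucene's other special characters) in a URI-derived
  // field name before the query string is assembled.
  public static String escapedClause(String fieldName, String value) {
    return QueryParser.escape(fieldName) + ":" + QueryParser.escape(value);
  }
}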
Renaud Delbru
Ah, yeah, your'e right,
Ah, yeah, you're right, I wouldn't have thought of that! That could have been some nasty bug-chasing...
(The two queries originally showed up identical here, because the site stripped the backslash.) Luckily, the SolrPhpClient used by the apachesolr module already has a handy escape() method, so I won't have to bother with the details. ;) I just have to remember to use it when implementing the query functionality...
In any case, thanks for the answer! Nice to hear that I won't have to worry about that. ;)
Your two queries are
Yes, the Drupal system removed the backslash before the ':'. But I think you understood the problem.
Renaud Delbru
SIREn was released
SIREn 0.1 was released on July 22.
http://siren.sindice.com/
Looks good!