We've learned a lot about Solr and Drupal in the past three years. Much is possible with the current ApacheSolr module, but some things aren't possible, and many things aren't easy. There are quirks and limitations that reflect early design decisions which we now could solve better. To move forward and make the future a better place for Solr and Drupal, a new effort is beginning to redesign the integration from the ground up.
These are some of the high level design goals:
- Study the PECL library, learn from it, and possibly use it: http://pecl.php.net/package/solr
- Develop a query library that has improved developer usability; for example http://github.com/technosophos/SolrAPI/blob/master/solrapi.inc
- Develop (with) components that are not Drupal specific so that other open source projects can use them (see above two points)
- Take advantage of cool things like the Search API, where it makes sense: http://drupal.org/project/search_api
- Learn from efforts like Searchlight and enable them to build on shared core components: http://drupal.org/project/searchlight
- Remove node centricity, embrace the Entity API in Drupal-7, and rely on Views for as much as possible
- Make indexing more flexible and faster
- Decentralize the module structure so that more contributors can become involved in more projects
- Allow Solr to be used in more contexts than traditional search (faceted browsing, for example)
Matt Butcher will be coordinating initial planning and research, and together with input from you, will draft architecture documents to get us from where we are today to the bright and shiny future.
The project will use this group as its home. There is now a new tab/page, "Solr Next Gen", as well as a tag that you can subscribe to, so that we can track discussions. Development will happen in the 7.x-2.x branch of the apachesolr module.
Now is a good time to elaborate on the list of design goals that you'd like to see in the new architecture by commenting here.

Comments
Sounds fantastic
Look forward to seeing the next version happen! The objectives are really spot on.
Can we set a vague though doable time frame for an alpha release?
That's a great initiative, I would give you as much help as I can.
I'm starting right now, in fact:
That's it right now, I would be pleased to help! See you.
EDIT: Poor formatting.
Pierre.
I haven't been following the progress on the D8 Search API (e.g. http://groups.drupal.org/node/71988), but as some of our efforts will overlap, we should share ideas with that group.
A Quick Overview of What I'm Planning
I'm just starting to plan out how I want to do things, and one of the first things that I will produce is a nice high-level diagram of how the components will fit together. Not quite there yet, though. So here's sort of an overview of what I'm thinking.
Solr Core
At the very root of the new effort will be a library of tools that make low-level access to Solr available. In short, it will make Solr functionality available from within Drupal. This library will include:
If we can see that it is profitable, this level will also provide the means to use either the PHP Solr Client or the PECL Solr client.
The work I started as SolrAPI will provide foundational material for this library.
Drupal Adapters
This is really rough right now, but I am envisioning building an entity-based adapter module that will be able to index nodes and other entity-based data structures.
While working on this, I'd like to build an adapter-friendly API that others can use to hook into the indexing system using their own data. Ideally, a developer interested in, say, indexing custom tables should be able to write a module containing a small bit of adapter code.
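To make the adapter idea a bit more concrete, here is a minimal, language-neutral sketch (written in Python only for brevity; a real version would be PHP inside a Drupal module). Every name in it (`IndexAdapter`, `CustomTableAdapter`, `collect_documents`, and the field names) is hypothetical and not part of any existing module. It only illustrates the shape such an API could take: each data source supplies a small adapter, and the indexer asks every adapter for its documents.

```python
from abc import ABC, abstractmethod


class IndexAdapter(ABC):
    """Hypothetical contract: turn items from one data source into flat documents."""

    @abstractmethod
    def item_ids(self):
        """Return an iterable of ids for all items that should be indexed."""

    @abstractmethod
    def build_document(self, item_id):
        """Return a dict mapping field names to values for one item."""


class CustomTableAdapter(IndexAdapter):
    """Example adapter over rows of a custom table (modelled here as a dict)."""

    def __init__(self, rows):
        self.rows = rows

    def item_ids(self):
        return iter(self.rows)

    def build_document(self, item_id):
        row = self.rows[item_id]
        # The "custom/<id>" key format and the field names are illustrative only.
        return {"id": f"custom/{item_id}", "title": row["name"], "body": row["notes"]}


def collect_documents(adapters):
    """Ask every registered adapter for its documents, in registration order."""
    documents = []
    for adapter in adapters:
        for item_id in adapter.item_ids():
            documents.append(adapter.build_document(item_id))
    return documents
```

With something along these lines, the "small bit of adapter code" a developer writes would be roughly the size of `CustomTableAdapter`, and the indexing system itself would never need to know where the rows came from.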
Search Integration
In addition to the core library, we will also have a module or multiple modules that provide support for integrating with Drupal's search API. Essentially, this will be an adapter module.
And from here...
Once these three systems have been developed, other modules (Views integration, advanced faceting, etc.) should have programmatic access to whatever lower levels they need. My hope is that we do the lower level parts well enough that these other modules can be done with new levels of capabilities, performance, and features.
Blog: http://technosophos.com
QueryPath: http://querypath.org
Awesome ideas. Excited to see how this shapes up in addition to helping out in any way I can. There are a couple of projects I am working on that I think apply to this new architecture initiative.
The first project is called Facet API. Its main goals are the following:
The benefit of Facet API being outside of Apache Solr is that it is really stress-testing the developer APIs and making sure they are as friendly and flexible as possible. In addition, it is backend agnostic which allows non-Solr users / developers to build rich faceted interfaces that the Solr community can benefit from as well.
The second project is Search Lucene API, which targets a completely different audience, but there is a ton of overlap in terms of architecture and overall goals. Some of the features in Search Lucene API (current and future iterations) that I would love to see in Apache Solr are the following:
hook_apachesolr_index_info(). If we add a hook, we can easily add a UI component that will allow users to add a new connection via the admin interface.
hook_apachesolr_index_info() above, the boost fields would be one of the administrative settings you would get out of the box by implementing it.
There are a few more things that could be cool, but this post is getting really long. I think the three things above require some refactoring at the lowest level of the module, and the flexibility provided by them could be an enabler for some other cool features to be built on top of Apache Solr in contrib.
I'm fond of the Facet API as you describe it; the pattern, at a first read, seems to be good. I think that schemas are better than explanations, and maybe some simple class diagrams would make it easier to see the whole thing and detect redundancies or inconsistencies.
Pierre.
My projects
First off: Great initiative! The module really needed a revamp and this seems to move in exactly the right direction. I hope I can be of help regarding this process.
Past experiences
Robert asked me to write a bit about my past experiences with writing contrib modules for the apachesolr project. These are two GSoC projects I did, the Views integration in 2008 and the Apachesolr RDF module in 2009. Sadly I have already forgotten half of it, but maybe what I got can still help to flesh out some requirements for the Solr Next Generation.
From what I've read here, it seems most of this is already addressed, but since this was my main issue with the module, I still wanted to mention it. I also don't think this is still quite as bad as back then.
For Solr NG the scope of single functions should be better thought through, and also important API functions that could or will be useful for other projects should be well documented.
OK, this is pretty poor as far as feedback goes, but it's been quite some time. I also think the current apachesolr module has already fixed some of these problems. Anyway, maybe this helps you.
But what I wanted to mostly talk about was my current project.
Search API
Before implementing anything other than the Solr connection library, I strongly urge you to take a close look at my Search API project (if you haven't done so, already). This already takes care of several important features and reading this discussion I really think that basing your module upon the Search API would remove a lot of duplicated effort from this project. For example, my module already takes care of the following:
Right now, all core entities are supported out-of-the-box; a programmer who wants to index and search for his own entities just has to implement hook_entity_info() and hook_entity_property_info() (which he should do in any case) to achieve that. All search functionality is then available for his entities.
The fields that should be indexed, as well as the way in which they should be indexed, are completely configurable. All fields from related entities, e.g. a node's author's mail address, are also available if desired.
Almost any special features in any other search module can already be implemented with the Search API by writing custom preprocessors or "data alter callbacks".
With a Solr module built upon the Search API, users previously content with a database-based search could, for example, easily switch to a Solr search when their site gets larger, while keeping all their settings and just needing to re-index their data. Especially once more features are available for the Search API, this will be critical for making switching the search backend as painless as possible. With a completely separate Solr module, users would have to set up additional fields, facets, views and the like all over again.
Additionally, I'm at the moment starting to implement three additional important features (which will be completed over the course of the next few weeks):
Facets
The module will soon support creating facets for any non-fulltext field, as far as possible with backend-independence and for any kind of entity.
Views integration
Once this is completed, users will be able to create views with search keys as filters or arguments, searching or filtering on any indexed property defined for the index and displaying all of an entity's properties the way they want. Facet support will also be included right away, so that faceting views queries can easily be set up.
Apache Solr backend
In fact, I'm already developing this one, and it will be completed in one or two weeks. So after finishing the base library, you could just test out the Search API's Solr backend. If it meets your basic requirements, you could then just take it over for the apachesolr-7.x-2.x branch and apply any modifications or add any additional features you had in mind. This would be both a great test and chance for the Search API and a considerable relief for you, as you would only have to implement Solr-specific features and maybe help to improve the Search API in a few places.
It would be really cool if you thought about it. I think the Search API can help you a lot with building a better Solr search integration, and keep you from wasting time duplicating things I've already implemented.
But in any case, I'd be glad to help in any way I can, as far as my time permits.
Another thing, regarding my task of implementing a Solr backend for the Search API: I'm currently thinking about which library to use for communicating with the Solr server, and Robert told me it would be best to ask here.
Of course, using your new library, once it's finished, would be best, but since I'm already working on it I have few options but to stick with another one now and then switch to your library later. So, do you have any recommendation on which library you would use if you weren't implementing your own?
The PECL library looks great, but it has to be added in the php.ini, which won't be an option for a lot of users. And your current Solr API looks cool, too, but relies on the apachesolr module at the moment, as far as I can see.
So, any tips? Or is the SolrPhpClient used by the apachesolr module right now the best option anyways?
@ Chris: The Facet API looks interesting, too, yeah. A pity it's not usable, it would be great to build on it when implementing facets for the Search API. But I'll take a good look at it and try to keep as much compatibility as possible, so I can switch later when your module is released.
Hi Thomas.
I have been keeping up with your progress, and my goal is not to step on your project's toes because I think what you are doing is super important. To me duplication of functionality isn't necessarily a bad thing, especially since the search community is still trying to vet out issues and determine a direction on where to go. This project is really the 3.0 version of Search Lucene API's facet functionality, but in a backend-agnostic package. Since it uses adapters for the backend, I am definitely planning on implementing a "Search API" adapter once things stabilize on my end.
I look forward to integrating with your stuff,
Chris
Backend library
For communicating with the Solr server, I'd go with the PHP-Solr-Client that the Solr module already uses. It's portable, it's likely to be known by Drupal users already, and it doesn't have the requirements that the PECL module has. The code base also seems pretty stable and in my experience the maintainers have been very responsive.
For SolrAPI (http://github.com/technosophos/solrapi), the query part is complete. You can do everything from switching query parsers to tuning highlighting parameters. But a query currently just returns a result object from PHP-Solr-Client. The goal is to add a response framework and an indexing framework. SolrAPI is running live on our server clusters, and is performing well.
Your work on SearchAPI is very exciting. I've read much of the code, and I like the solid construction. I don't know whether the refactoring will depend on SearchAPI (that's a decision I just can't make yet), but it will certainly be able to easily interoperate with it. I'm very excited about the potential of SearchAPI. I just want to make sure that however things go, we can make use of 100% of Solr's features without being constricted by a generic API on top of that. I also want to make sure that performance stays front and center. Again, not criticisms of SearchAPI (I don't know enough about it to criticize), but design goals that will determine how Apache Solr gets re-architected.
I'm also very excited about FacetAPI, Views integration, and some of these other higher-level modules. A huge part of my design goals will be to make it easy for those modules to build off of a solid API.
mkalkbrenner
Robert asked us to introduce ourselves. So here I go:
I started working with Solr at Cocomore back in 2007 using version 1.1. We developed our own Drupal 5 module to replace the internal search with a powerful Solr solution. In early 2009 I attended Robert's session about Solr at the Drupal Camp Cologne. A few days later we decided to stop porting our module to Drupal 6 and to use Robert's module instead.
After that I contributed various bugfixes and features to apachesolr, like support for hierarchical facets, as well as some i18n / l10n patches which have never been committed. But based on these patches and the ideas I shared at a BOF at Drupal Con Paris 2009, a Swiss company paid us to create a prototype for multilingual searches. The result was Apache Solr Multilingual, which is also suitable for single non-English languages.
In general, full text searching depends heavily on the language and on the domain ("Apple" might mean something different to IT people). So besides all the possibilities regarding the different back end technologies that have been mentioned in the posts above, it would be really nice to see these kinds of features in Solr Next Gen to enable the contrib modules to increase end users' satisfaction:
My personal wish: think about internationalization first before implementing anything. At least provide a clean API that a module like Apache Solr Multilingual could use to provide internationalization support.
Personal Genomics Services
bio.logis GmbH
Internationalization
This is interesting for me as well. I tried to keep that in mind while developing the Search API, but since I haven't ever worked with any kind of i18n, I didn't really know what had to be done.
Right now, every indexed item contains a language field, so that searches can filter on that, or preprocessors can reject indexing items with certain languages (e.g. to build specific indexes for each language and add language-specific preprocessors there).
Would this be enough for your use cases? What exactly do you need in a search module for easing the implementation of i18n support? Your experience would be very valuable here, I think.
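As a rough illustration of the language-field approach just described, the sketch below shows a preprocessor-style filter that keeps only items in the languages wanted for a given index. It is written in Python purely for brevity and does not reflect the Search API's actual interfaces; the function name and the `language` key are assumptions made for the example.

```python
def filter_by_language(items, allowed_languages):
    """Keep only items whose 'language' value is in allowed_languages.

    Items are plain dicts here; an item without a 'language' key is rejected.
    """
    allowed = set(allowed_languages)
    return [item for item in items if item.get("language") in allowed]


# Hypothetical items waiting to be indexed.
pending = [
    {"id": 1, "language": "en", "title": "Hello"},
    {"id": 2, "language": "de", "title": "Hallo"},
    {"id": 3, "title": "No language set"},
]

# Build the input for a German-only index.
german_items = filter_by_language(pending, ["de"])
```

A per-language index would run a filter like this first and then apply its language-specific preprocessors (stemming, stopwords and so on) to whatever remains.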
Hey mkalkbrenner,
With apachesolr_views and apachesolr_mutliserver you have a facility to do a lot of these things. We can also look at how some of these things can move into the main API, as some things like multicore will be default in later versions of Apache Solr.
In terms of managing variations on indexes, including different languages, I think we should look at the search API as a whole in terms of storing this data so it can be indexed by various search engines, whether via MongoDB, HBase, etc. This gives the greatest flexibility in terms of being able to manipulate things in code (which gives a lower barrier to entry for most Drupal coders), and then people can use things like Acquia search or other third party services if they don't have the systems knowledge.
I also think we should look at making our default schema mostly or entirely dynamic, so it gives the flexibility of indexing data structures that aren't just pages. This will definitely affect the exposed APIs in terms of data retrieval, but I think this may be covered by Thomas's hook_entity_info() and hook_entity_property_info(): http://groups.drupal.org/node/92799#comment-297949.
Cheers
Dave
With apachesolr_views and
I know these modules, but it would be nice to have the main functionality in the main API, because otherwise there are too many moving targets. Currently it's pretty hard to stay in sync with apachesolr 6.x-1.x and 6.x-2.x and other modules like apachesolr_attachments. Right now it's impossible for an administrator of a multilingual site to survive a module update of a single apachesolr module, because we're not fast enough to implement all the changes that happen across all modules.
Basically this is a good approach. But when it comes to full text search it might get complicated. If you take Solr as an example, you have to configure a lot in the back end itself to support non-English languages.
I hope to find the time soon to have a closer look at this. A dynamic schema sounds good. In terms of hook_entity_info() it will mean that we have to define many localized entities per each single entity, right?
I hope to find the time soon
Exactly – just like node translations are handled in Core. The Search API doesn't right now support translatable fields (i.e., somehow indexing all versions and searching for the right one), though, since (as far as I can tell) those aren't really supported by Core either, at least UI-wise. But it shouldn't be too hard to patch (I would have one or two ideas concerning this), if there are enough use cases.
By the way: Views integration and the Solr backend are basically done (i.e., in a well usable state), working on refining and facets. The post in the other discussion has more details.
I agree... where do we start?
Internationalization should certainly be front-and-center, and I agree that if we architect the lowest layers with this in mind, then the higher level applications will be much easier.
So where do we start? Is it best to go with separating different languages into different indexes? Into different fields in the same index (processed by different analyzers)?
Is it best to go with
Currently we use one index with different field types. The advantage is that you can not only do language specific searches but multilingual searches and things like CLIR (cross lingual information retrieval).
If you want to realize these features with different indexes, you have to aggregate the search results, which means that you have to do things like facet counts in your code instead of letting Solr do this work.
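To show what "doing facet counts in your code" means in practice, here is a minimal sketch of merging facet counts returned by several per-language indexes. It is Python for brevity, and the data shape (one dict of facet value to count per index) is an assumption for the example, not Solr's actual response format.

```python
from collections import Counter


def merge_facet_counts(per_index_counts):
    """Sum facet counts from several indexes into one dict.

    per_index_counts is an iterable of dicts mapping facet value -> count.
    """
    total = Counter()
    for counts in per_index_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return dict(total)


# Hypothetical facet counts on a "type" field from an English and a German index.
english = {"news": 3, "blog": 1}
german = {"news": 2, "page": 4}
merged = merge_facet_counts([english, german])
```

The summing itself is easy; the harder parts are everything around it (paging, relevance ordering across indexes, and entities that appear in more than one index), which a single index leaves to Solr.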
Another approach for distributed indexes is Solr shards. But they might be too complicated to set up, and the unique key across the shards for an entity might become a problem for CLIR. (The easiest way to implement CLIR is to store all translations of an entity in one record.)
If you run a multilingual site and you are sure that you don't want to support multilingual searches or CLIR, distributed indexes are the better and easier approach: if you switch the language of the site, you switch the index too.
So, as with stopwords or synonyms, the decision between distributed indexes and a single index depends on the use case.
Great example of why this is important
http://drupal.org/node/915626
It goes back to the fl and hl.fl parameters that Solr expects to be comma separated (a weirdness of their API). Our new API needs to make the addition of parameters and values easy and safe in our hook and drupal_alter based architecture. The array based parameters work out ok, but comma separated parameters have the tendency to get clobbered, so our API needs to wrap this and manage the values in a safe way.
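One way the wrapping could look is sketched below, in Python for brevity rather than PHP. The class and method names are hypothetical. The idea is that callers only ever add values, the comma-separated parameters (`fl` and `hl.fl`) are stored as lists internally, and the comma-joined string is produced once, when the request is built. That way a later hook cannot overwrite what an earlier one added.

```python
class SolrParams:
    """Hypothetical parameter bag that keeps every parameter as a list of values."""

    # Parameters sent to Solr as a single comma-separated string.
    COMMA_SEPARATED = frozenset({"fl", "hl.fl"})

    def __init__(self):
        self._values = {}

    def add(self, name, *values):
        """Append values to a parameter without discarding existing ones."""
        current = self._values.setdefault(name, [])
        for value in values:
            text = str(value)
            # Accept "a,b" as well as separate arguments for comma-separated params.
            parts = text.split(",") if name in self.COMMA_SEPARATED else [text]
            for part in parts:
                part = part.strip()
                if part and part not in current:
                    current.append(part)
        return self

    def to_pairs(self):
        """Return (name, value) pairs; comma-separated params are joined once here."""
        pairs = []
        for name, values in self._values.items():
            if name in self.COMMA_SEPARATED:
                pairs.append((name, ",".join(values)))
            else:
                pairs.extend((name, value) for value in values)
        return pairs


params = SolrParams()
params.add("fl", "id,title")      # the base module asks for two fields
params.add("fl", "score", "id")   # another module adds one; the duplicate is ignored
params.add("fq", "type:page")
params.add("fq", "status:1")
```

The pairs from `to_pairs()` can be passed straight to something like `urllib.parse.urlencode`, so repeated parameters such as `fq` stay repeated while `fl` goes out as one comma-separated value.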
I'm really looking forward to the Search API integration...
Any news regarding this refactoring?
I am wondering if this initiative is still in progress or if it has been cancelled. Any idea?
At this point, I don't see any plan to work on a 7.x-2.x branch. The PECL library and other libraries don't offer a compelling reason to re-write the module.
Some of the features mentioned in the opening were implemented in 7.x-1.x. We did move most of the "contrib" modules out, etc.
For 8.x, Nick_vh and I have discussed the idea that the main apachesolr module could displace search_api_solr as the connector to the index.
For 8.x, Nick_vh and I have
So you would want to base the D8 version of your module on the Search API? We should discuss this in Prague – it would be good to know before I begin to port the Search API Solr module. (Although this probably won't happen for another month or two, so you still have time to decide in any case.)
Also, you might want to get involved in the Search API porting process to identify any issues you might encounter, or have already noticed with the D7 version.