Solr Next Gen - the 7.x-2.x refactoring

robertdouglass's picture

We've learned a lot about Solr and Drupal in the past three years. Much is possible with the current ApacheSolr module, but some things aren't possible, and many things aren't easy. There are quirks and limitations that reflect early design decisions, which we could now solve better. To move forward and make the future a better place for Solr and Drupal, a new effort is beginning to redesign the integration from the ground up.

These are some of the high level design goals:

  • Study the PECL library, learn from it, and possibly use it: http://pecl.php.net/package/solr
  • Develop a query library that has improved developer usability; for example http://github.com/technosophos/SolrAPI/blob/master/solrapi.inc
  • Develop (with) components that are not Drupal specific so that other open source projects can use them (see above two points)
  • Take advantage of cool things like the Search API, where it makes sense: http://drupal.org/project/search_api
  • Learn from efforts like Searchlight and enable them to build on shared core components: http://drupal.org/project/searchlight
  • Remove node centricity, embrace the Entity API in Drupal 7, and rely on Views for as much as possible
  • Make indexing more flexible and faster
  • Decentralize the module structure so that more contributors can become involved in more projects
  • Allow Solr to be used in more contexts than traditional search (faceted browsing, for example)

Matt Butcher will be coordinating initial planning and research, and together with input from you, will draft architecture documents to get us from where we are today to the bright and shiny future.

The project will use this group as its home. There is now a new tab/page, "Solr Next Gen", as well as a tag that you can subscribe to, so that we can track discussions. Development will happen in the 7.x-2.x branch of the apachesolr module.

Now is a good time to elaborate on the list of design goals that you'd like to see in the new architecture by commenting here.

Comments

Sounds fantastic

sidharth_k's picture

Look forward to seeing the next version happen! The objectives are really spot on.

Can we set a vague though doable time frame for an alpha release?

That's a great initiative, I

pounard's picture

That's a great initiative, and I will give as much help as I can.

I'm starting right now, in fact:

  • Does "Decentralize the module structure" mean that you will totally cut off the search UI part and let the indexer live by itself (and maybe other components)?
  • "and rely on Views for as much as possible": how could you rely on Views? Providing support for querying with Views is a great feature, but the indexer and API should not be Views-centric. Views is for frontends, and you are talking about making your module a pure backend and API module!
  • The query library is a really cool idea, and that is exactly what is missing in the current module.

That's it for now. I would be pleased to help! See you.

EDIT: Poor formatting.

Pierre.

I haven't been following the

jpmckinney's picture

I haven't been following the progress on the D8 Search API (e.g. http://groups.drupal.org/node/71988), but as some of our efforts will overlap, we should share ideas with that group.

A Quick Overview of What I'm Planning

mbutcher's picture

I'm just starting to plan out how I want to do things, and one of the first things that I will produce is a nice high-level diagram of how the components will fit together. Not quite there yet, though. So here's sort of an overview of what I'm thinking.

Solr Core

At the very root of the new effort will be a library of tools that make low-level access to Solr available. In short, it will make Solr functionality available from within Drupal. This library will include:

  • Indexing support
  • Searching support
  • At least some degree of support for non-search access to Solr data (e.g. lists of indices)

If we can see that it is profitable, this level will also provide the means to use either the PHP Solr Client or the PECL Solr client.

The work I started as SolrAPI will provide foundational material for this library.

Drupal Adapters

This is really rough right now, but I am envisioning building an entity-based adapter module that will be able to index nodes and other entity-based data structures.

While working on this, I'd like to build an adapter-friendly API that others can use to hook into the indexing system using their own data. Ideally, a developer interested in, say, indexing custom tables should be able to write a module containing a small bit of adapter code.
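
As a rough illustration of the adapter idea, the sketch below shows one possible shape for such a contract. It is written in Python purely for illustration (the module itself would be PHP), and every name in it (`IndexAdapter`, `CustomTableAdapter`, `collect_documents`) is hypothetical rather than part of any existing API.

```python
from typing import Any, Dict, Iterable, List, Protocol


class IndexAdapter(Protocol):
    """Hypothetical contract: expose a data source as indexable documents."""

    def item_ids(self) -> Iterable[str]:
        ...

    def build_document(self, item_id: str) -> Dict[str, Any]:
        ...


class CustomTableAdapter:
    """Small adapter for rows of a custom table, held here as plain dicts."""

    def __init__(self, table_name: str, rows: Dict[str, Dict[str, Any]]) -> None:
        self.table_name = table_name
        self.rows = rows

    def item_ids(self) -> Iterable[str]:
        return sorted(self.rows)

    def build_document(self, item_id: str) -> Dict[str, Any]:
        # Prefix the id with the source name so that ids coming from
        # different adapters do not collide in the index.
        return {"id": f"{self.table_name}/{item_id}", **self.rows[item_id]}


def collect_documents(adapters: List[IndexAdapter]) -> List[Dict[str, Any]]:
    """The indexing side only talks to the adapter contract."""
    documents = []
    for adapter in adapters:
        for item_id in adapter.item_ids():
            documents.append(adapter.build_document(item_id))
    return documents
```

The point of the sketch is that the indexing system never needs to know whether the documents come from entities or from a custom table; the adapter is the only piece a module author would write.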

Search Integration

In addition to the core library, we will also have a module or multiple modules that provide support for integrating with Drupal's search API. Essentially, this will be an adapter module.

And from here...

Once these three systems have been developed, other modules (Views integration, advanced faceting, etc.) should have programmatic access to whatever lower levels they need. My hope is that we do the lower level parts well enough that these other modules can be done with new levels of capabilities, performance, and features.

Awesome ideas. Excited to

cpliakas's picture

Awesome ideas. Excited to see how this shapes up in addition to helping out in any way I can. There are a couple of projects I am working on that I think apply to this new architecture initiative.

The first project is called Facet API. Its main goals are the following:

  • Provide a simple API that allows any search backend to create and manage facets. Currently the Search Lucene API and Apache Solr Search Integration projects are supported, and the plan is to add an adapter for Searchlight.
  • Make the API backend agnostic so facet configurations can be shared across the various modules.
  • Introduce the concept of facet realms which render facets in a similar fashion, for example as a list of clickable links or series of form elements.
  • Provide a simple, pluggable widget API that allows developers to build varying displays of facets within a realm. One example is a tagcloud widget.
  • Provide a UI component that serves as a GUI to the underlying APIs which will allow for non-developers to create interesting faceted displays via the admin interface. Ultimately this is a facet builder.
  • Integrate with Context and Features so the facet configurations can more easily be packaged and shared.

The benefit of Facet API being outside of Apache Solr is that it is really stress-testing the developer APIs and making sure they are as friendly and flexible as possible. In addition, it is backend agnostic which allows non-Solr users / developers to build rich faceted interfaces that the Solr community can benefit from as well.

The second project is Search Lucene API, which targets a completely different audience, but there is a ton of overlap in terms of architecture and overall goals. Some of the features in Search Lucene API (current and future iterations) that I would love to see in Apache Solr are the following:

  • The ability to more easily add new index connections. For example, a site may want to connect to an additional Solr instance that is managed outside of Drupal. Search Lucene API attacks this problem by putting the basic administrative forms in the core framework. Any module that implements the index info hook gets the basic administration interface with connection parameters, index information, etc. for free. For the sake of this post, let's call the new hook hook_apachesolr_index_info(). If we add a hook, we can easily add a UI component that will allow users to add a new connection via the admin interface.
  • The ability to add sub-indexes, or separate, more targeted searches. Sticking with the hook above, there could be a "base index" key which could, for example, point to the normal apachesolr_search index and inherit its connection parameters. Either by using the Facet API module, creating an option to automatically apply filters, or finding some way to hook this into the Views interface, users could create things like blog or forum searches without having to do any custom coding.
  • Implement a hook to define boost fields. There is currently a variable that stores boost fields which you can work with to add boost fields, but it would be a little more transparent to add the fields via a hook. Sticking with the hook_apachesolr_index_info() above, the boost fields would be one of the administrative settings you would get out of the box by implementing it.
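
To make the three points above more concrete, here is a minimal sketch of what the data returned by something like hook_apachesolr_index_info() could contain, and how a "base index" might be resolved. It is written in Python only for illustration; the key names other than "base index", the connection values, and the `resolve_index` helper are all assumptions, not an existing API.

```python
from typing import Any, Dict

# Hypothetical index definitions, as an index info hook might return them.
INDEX_INFO: Dict[str, Dict[str, Any]] = {
    "apachesolr_search": {
        "connection": {"host": "localhost", "port": 8983, "path": "/solr"},
        "boost fields": {"title": 5.0, "body": 1.0},
    },
    "blog_search": {
        # Sub-index: reuse the parent's connection and add a fixed filter.
        "base index": "apachesolr_search",
        "filters": ["type:blog"],
        "boost fields": {"title": 8.0},
    },
}


def resolve_index(name: str, info: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Merge an index definition with its chain of base indexes.

    Assumes the chain of "base index" references contains no cycles.
    """
    definition = info[name]
    base_name = definition.get("base index")
    if base_name is None:
        resolved: Dict[str, Any] = {"connection": {}, "filters": [], "boost fields": {}}
    else:
        resolved = resolve_index(base_name, info)
    # A sub-index only overrides the connection if it defines its own.
    if "connection" in definition:
        resolved["connection"] = dict(definition["connection"])
    # Filters accumulate; boost fields are overridden field by field.
    resolved["filters"] = resolved["filters"] + list(definition.get("filters", []))
    resolved["boost fields"] = {**resolved["boost fields"], **definition.get("boost fields", {})}
    return resolved
```

With definitions like these, an admin UI could list every registered index, and a blog search would need no custom code beyond its entry in the hook.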

There are a few more things that could be cool, but this post is getting really long. I think the three things above require some refactoring at the lowest level of the module, and the flexibility provided by them could be an enabler for some other cool features to be built on top of Apache Solr in contrib.

I'm fond of the Facet API as

pounard's picture

I'm fond of the Facet API as you describe it; the pattern, at first read, seems to be good. I think that schemas are better than explanations, and maybe some simple class diagrams would make it easier to see the whole thing and detect redundancies or inconsistencies.

Pierre.

My projects

drunken monkey's picture

First off: Great initiative! The module really needed a revamp and this seems to move in exactly the right direction. I hope I can be of help regarding this process.

Past experiences

Robert asked me to write a bit about my past experiences with writing contrib modules for the apachesolr project. These are two GSoC projects I did, the Views integration in 2008 and the Apachesolr RDF module in 2009. Sadly I have already forgotten half of it, but maybe what I got can still help to flesh out some requirements for the Solr Next Generation.

  • The basic problem I had with the module, in several places, was the assumptions it made about what you wanted to do with a function, class or the whole module. Everything was built to index nodes on a single server and then to display a page with a single search on it. Creating different servers or no server at all, indexing something other than nodes, not using the standard search page with URL parameters or not wanting facets to display could all only be done in slightly hack-ish ways. E.g., something like executing a search with the Drupal_Apache_Solr_Service class depended on a variable ("apachesolr_index_status" or the like – telling, basically, whether indexing was turned on by the user) being activated, even though I wanted to search a completely different server.
    From what I've read here, it seems like most of this is already addressed, but since this was my main issue with the module, I still wanted to mention it. I also don't think this is still quite as bad as back then.
  • Another thing was a confusing API, with some functionality being split between several magic functions, and other functions on the other hand doing several things at once (where a programmer of another module might only need a single one of them).
    For Solr NG the scope of single functions should be better thought through, and also important API functions that could or will be useful for other projects should be well documented.

OK, this is pretty poor as far as feedback goes, but it's been quite some time. I also think the current apachesolr module has already fixed some of these problems. Anyway, maybe this helps you.
But what I wanted to mostly talk about was my current project.

Search API

Before implementing anything other than the Solr connection library, I strongly urge you to take a close look at my Search API project (if you haven't done so, already). This already takes care of several important features and reading this discussion I really think that basing your module upon the Search API would remove a lot of duplicated effort from this project. For example, my module already takes care of the following:

  • Management of different search servers and separate indexes.
  • Indexing management (list of unindexed and updated items) [OK, this one is very basic]
  • Indexing any kind of entity (basically your whole "Drupal Adapters" item)
    Right now, all core entities are supported out-of-the-box, a programmer who wants to index and search for his own entities just has to implement hook_entity_info() and hook_entity_property_info() (which he should do in any case) to achieve that. All search functionality is then available for his entities.
  • Very flexible indexing (and searching) workflow
    The fields that should be indexed, as well as the way in which they should be indexed, is completely configurable. All fields from related entities, e.g. a node's author's mail address, are also available if desired.
    Almost any special features in any other search module can already be implemented with the Search API by writing custom preprocessors or "data alter callbacks".
  • Backend independence
    With a Solr module built upon the Search API, users previously content with a database-based search could, for example, easily switch to a Solr search when their site gets larger, while keeping all their settings and just needing to re-index their data. Especially once more features are available for the Search API, this will be critical for making switching the search backend as painless as possible. With a completely separate Solr module, users would have to set up additional fields, facets, views and the like all over again.

Additionally, I'm at the moment starting to implement three additional important features (which will be completed over the course of the next few weeks):

Facets

The module will soon support creating facets for any non-fulltext field, as far as possible with backend-independence and for any kind of entity.

Views integration

Once this is completed, users will be able to create views with search keys as filters or arguments, searching or filtering on any indexed property defined for the index and displaying all of an entity's properties the way they want. Facet support will also be included right away, so that faceting views queries can easily be set up.

Apache Solr backend

In fact, I'm already developing this one, and it will be completed in one or two weeks. So after finishing the base library, you could just test out the Search API's Solr backend. If it meets your basic requirements, you could then just take it over for the apachesolr-7.x-2.x branch and apply any modifications or add any additional features you had in mind. This would be both a great test and chance for the Search API, and would make things considerably easier for you, as you would only have to implement Solr-specific features and maybe help to improve the Search API in a few places.

It would be really cool if you thought about it. I think the Search API can help you a lot with building a better Solr search integration, and keep you from wasting time duplicating things I've already implemented.
But in any case, I'd be glad to help in any way I can, as far as my time permits.


Another thing, regarding my task of implementing a Solr backend for the Search API: I'm currently thinking about which library to use for communicating with the Solr server, and Robert told me it would be best to ask here.
Of course, using your new library, once it's finished, would be best, but since I'm already working on it I have few options but to stick with another one now and then switch to your library later. So, do you have any recommendation on which library you would use if you weren't implementing your own?
The PECL library looks great, but it has to be added in the php.ini, which won't be an option for a lot of users. And your current Solr API looks cool, too, but relies on the apachesolr module at the moment, as far as I can see.
So, any tips? Or is the SolrPhpClient used by the apachesolr module right now the best option anyways?


@ Chris: The Facet API looks interesting, too, yeah. A pity it's not usable, it would be great to build on it when implementing facets for the Search API. But I'll take a good look at it and try to keep as much compatibility as possible, so I can switch later when your module is released.

Hi Thomas. I have been

cpliakas's picture

Hi Thomas.

I have been keeping up with your progress, and my goal is not to step on your project's toes because I think what you are doing is super important. To me duplication of functionality isn't necessarily a bad thing, especially since the search community is still trying to vet out issues and determine a direction on where to go. This project is really the 3.0 version of Search Lucene API's facet functionality, but in a backend-agnostic package. Since it uses adapters for the backend, I am definitely planning on implementing a "Search API" adapter once things stabilize on my end.

I look forward to integrating with your stuff,
Chris

Backend library

mbutcher's picture

For communicating with the Solr server, I'd go with the PHP-Solr-Client that the Solr module already uses. It's portable, it's likely to be known by Drupal users already, and it doesn't have the requirements that the PECL module has. The code base also seems pretty stable and in my experience the maintainers have been very responsive.

For SolrAPI (http://github.com/technosophos/solrapi), the query part is complete. You can do everything from switching query parsers to tuning highlighting parameters. But a query currently just returns a result object from PHP-Solr-Client. The goal is to add a response framework and an indexing framework. SolrAPI is running live on our server clusters, and is performing well.

Your work on SearchAPI is very exciting. I've read much of the code, and I like the solid construction. I don't know whether the refactoring will depend on SearchAPI (that's a decision I just can't make yet), but it will certainly be able to easily interoperate with it. I'm very excited about the potential of SearchAPI. I just want to make sure that however things go, we can make use of 100% of Solr's features without being constricted by a generic API on top of that. I also want to make sure that performance stays front and center. Again, not criticisms of SearchAPI (I don't know enough about it to criticize), but design goals that will determine how Apache Solr gets re-architected.

I'm also very excited about FacetAPI, Views integration, and some of these other higher-level modules. A huge part of my design goals will be to make it easy for those modules to build off of a solid API.

mkalkbrenner

mkalkbrenner's picture

Robert asked us to introduce ourselves. So here I go:

I started working with Solr at Cocomore back in 2007 using version 1.1. We developed our own Drupal 5 module to replace the internal search by a powerful Solr solution. Early 2009 I attended Robert's session about Solr at the Drupal Camp Cologne. A few days later we decided to stop porting our module to Drupal 6 but to use Robert's module instead.

After that I contributed various bugfixes and features to apachesolr, like support for hierarchical facets, and some i18n / l10n patches which have never been committed. But based on these patches and the ideas I shared at a BoF at DrupalCon Paris 2009, a Swiss company paid us to create a prototype for multilingual searches. The result was Apache Solr Multilingual, which is also suitable for single non-English languages.

In general, full text searching highly depends on the language and on the domain ("Apple" might mean something different to IT people). So besides all the possibilities regarding the different back end technologies that have been mentioned in the posts above, it would be really nice to see these kinds of features in Solr Next Gen to enable the contrib modules to increase end users' satisfaction:

  • support of multiple solr servers (which is currently possible using ugly variables translation)
  • support of multiple indexes on a server
  • aggregate search results from different indexes

My personal wish: think about internationalization first before implementing anything. At least provide a clean API that a module like Apache Solr Multilingual could use to provide internationalization support.

Internationalization

drunken monkey's picture

My personal wish: think about internationalization first before implementing anything. At least provide a clean API that a module like Apache Solr Multilingual could use to provide internationalization support.

This is interesting for me as well. I tried to keep that in mind while developing the Search API, but since I haven't ever worked with any kind of i18n, I didn't really know what had to be done.
Right now, every indexed item contains a language field, so that searches can filter on that, or preprocessors can reject items with certain languages at indexing time (e.g. to build specific indexes for each language and then add language-specific preprocessors there).
Would this be enough for your use cases? What exactly do you need in a search module for easing the implementation of i18n support? Your experience would be very valuable here, I think.
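
A minimal sketch of the "reject items with certain languages" idea, in Python for illustration only. The item structure and the `filter_by_language` helper are assumptions made for the example and are not the Search API's actual preprocessor interface.

```python
from typing import Any, Dict, Iterable, List, Set


def filter_by_language(items: Iterable[Dict[str, Any]], allowed: Set[str]) -> List[Dict[str, Any]]:
    """Keep only items whose language field is one of the allowed codes.

    Items without a language field are rejected as well.
    """
    return [item for item in items if item.get("language") in allowed]


items = [
    {"id": "node/1", "language": "en", "title": "House"},
    {"id": "node/2", "language": "de", "title": "Haus"},
    {"id": "node/3", "title": "No language set"},
]

# Each language-specific index only receives the items in its language.
german_items = filter_by_language(items, {"de"})
english_items = filter_by_language(items, {"en"})
```

Each per-language index could then add its own language-specific processing on top of such a filter.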

Hey mkalkbrenner, With

dstuart's picture

Hey mkalkbrenner,

With apachesolr_views and apachesolr_multiserver you have a facility to do a lot of these things. We can also look at how some of these things can move into the main API, as some things like multicore will be default in later versions of Apache Solr.

In terms of managing variations on indexes, including different languages, I think we should look at the search API as a whole in terms of storing this data so it can be indexed by various search engines, either via MongoDB, HBase, etc. This gives the greatest flexibility in terms of being able to manipulate things in code (which gives a lower barrier to entry for most Drupal coders), and then people can use things like Acquia search or other third party services if they don't have the systems knowledge.

I also think we should look at making our default schema mostly or entirely dynamic, so that it gives the flexibility of indexing data structures that aren't just pages. This will definitely affect the exposed APIs in terms of data retrieval, but I think this may be covered by Thomas's hook_entity_info() and hook_entity_property_info() http://groups.drupal.org/node/92799#comment-297949.

Cheers

Dave

With apachesolr_views and

mkalkbrenner's picture

With apachesolr_views and apachesolr_multiserver you have a facility to do a lot of these things. We can also look at how some of these things can move into the main API, as some things like multicore will be default in later versions of Apache Solr.

I know these modules, but it would be nice to have the main functionality in the main API, because otherwise there are too many moving targets. Currently it's pretty hard to stay in sync with apachesolr 6.x-1.x and 6.x-2.x and other modules like apachesolr_attachments. Right now it's impossible for an administrator of a multilingual site to survive a module update of a single apachesolr module, because we're not fast enough to implement all the changes that happen across all modules.

In terms of managing variations on indexes, including different languages, I think we should look at the search API as a whole in terms of storing this data so it can be indexed by various search engines, either via MongoDB, HBase, etc. This gives the greatest flexibility in terms of being able to manipulate things in code (which gives a lower barrier to entry for most Drupal coders), and then people can use things like Acquia search or other third party services if they don't have the systems knowledge.

Basically this is a good approach. But when it comes to full text search, it might get complicated. If you take Solr as an example, you have to configure a lot in the back end itself to support non-English languages.

I also think we should look at making our default schema mostly or entirely dynamic, so that it gives the flexibility of indexing data structures that aren't just pages. This will definitely affect the exposed APIs in terms of data retrieval, but I think this may be covered by Thomas's hook_entity_info() and hook_entity_property_info() http://groups.drupal.org/node/92799#comment-297949.

I hope to find the time soon to have a closer look at this. A dynamic schema sounds good. In terms of hook_entity_info() it will mean that we have to define many localized entities per each single entity, right?

I hope to find the time soon

drunken monkey's picture

I hope to find the time soon to have a closer look at this. A dynamic schema sounds good. In terms of hook_entity_info() it will mean that we have to define many localized entities per each single entity, right?

Exactly – just like node translations are handled in Core. The Search API doesn't right now support translatable fields (i.e., somehow indexing all versions and searching for the right one), though, since (as far as I can tell) those aren't really supported by Core either, at least UI-wise. But it shouldn't be too hard to patch (I would have one or two ideas concerning this), if there are enough use cases.

By the way: Views integration and the Solr backend are basically done (i.e., in a well usable state), working on refining and facets. The post in the other discussion has more details.

I agree... where do we start?

mbutcher's picture

Internationalization should certainly be front-and-center, and I agree that if we architect the lowest layers with this in mind, then the higher level applications will be much easier.

So where do we start? Is it best to go with separating different languages into different indexes? Into different fields in the same index (processed by different analyzers)?

Is it best to go with

mkalkbrenner's picture

Is it best to go with separating different languages into different indexes? Into different fields in the same index (processed by different analyzers)?

Currently we use one index with different field types. The advantage is that you can not only do language specific searches but multilingual searches and things like CLIR (cross lingual information retrieval).
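
As a rough sketch of the single-index approach, the Python below stores all translations of an entity in one record and derives per-language field names. The `<field>_<langcode>` naming scheme and both helper functions are assumptions made for illustration; the assumption is that the Solr schema would map each suffix to a field type with the matching language analysis.

```python
from typing import Dict, List


def build_multilingual_document(entity_id: str, translations: Dict[str, Dict[str, str]]) -> Dict[str, object]:
    """Store all translations of one entity in a single record."""
    doc: Dict[str, object] = {"id": entity_id, "languages": sorted(translations)}
    for langcode, fields in translations.items():
        for name, value in fields.items():
            # Hypothetical naming scheme: "<field>_<langcode>", e.g. "title_de".
            doc[f"{name}_{langcode}"] = value
    return doc


def query_fields(field_names: List[str], langcodes: List[str]) -> str:
    """Build a space-separated field list for the languages to search.

    One language code gives a language-specific search; several give a
    multilingual search over the same index.
    """
    return " ".join(f"{name}_{lang}" for lang in langcodes for name in field_names)
```

A field list built this way could, for example, be passed as the dismax `qf` parameter, so that switching between a language-specific and a multilingual search only changes which fields are queried.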

If you want to realize these features with different indexes, you have to aggregate the search results, which means that you have to do things like facet counts in your code instead of letting Solr do this work.
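
To illustrate the kind of work that moves into your own code with separate indexes, here is a minimal Python sketch that merges facet counts returned by several indexes. The function name and input shape are assumptions for the example, and the result is only correct if each entity lives in exactly one index; otherwise it would be counted more than once.

```python
from collections import Counter
from typing import Dict, Iterable


def merge_facet_counts(per_index_counts: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum the facet counts reported by each index for the same facet value."""
    total: Counter = Counter()
    for counts in per_index_counts:
        total.update(counts)
    return dict(total)


# Facet counts for the same field, as returned by two separate indexes.
english_index = {"blog": 3, "page": 1}
german_index = {"blog": 2, "forum": 4}
merged = merge_facet_counts([english_index, german_index])
```

Merging result lists is harder than this, since relevance scores from different indexes are not directly comparable, which is part of why a single index is attractive for multilingual searches.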

Another approach for distributed indexes is Solr shards. But they might be too complicated to set up, and the unique key across the shards for an entity might become a problem for CLIR. (The easiest way to implement CLIR is to store all translations of an entity in one record.)

If you run a multilingual site and you are sure that you don't want to support multilingual searches or CLIR, distributed indexes are the better and easier approach: if you switch the language of the site, you switch the index too.

So, as with stopwords or synonyms, the decision between distributed indexes and a single index depends on the use case.

Great example of why this is important

robertdouglass's picture

http://drupal.org/node/915626

It goes back to the fl and hl.fl parameters that Solr expects to be comma separated (a weirdness of their API). Our new API needs to make the addition of parameters and values easy and safe in our hook and drupal_alter based architecture. The array based parameters work out ok, but comma separated parameters have the tendency to get clobbered, so our API needs to wrap this and manage the values in a safe way.
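
A minimal sketch of such a wrapper, in Python for illustration only. The `SolrParams` class and its methods are hypothetical; the idea is that comma-separated parameters such as fl and hl.fl are kept as lists internally and only joined when the request is built, so that one module adding a field cannot clobber what another module added.

```python
from typing import Dict, List


class SolrParams:
    """Hypothetical parameter container that keeps list parameters safe."""

    # Parameters that are sent to Solr as one comma-separated value.
    LIST_PARAMS = frozenset({"fl", "hl.fl"})

    def __init__(self) -> None:
        self._single: Dict[str, str] = {}
        self._lists: Dict[str, List[str]] = {}

    def set_value(self, name: str, value: str) -> None:
        """Set an ordinary single-valued parameter."""
        if name in self.LIST_PARAMS:
            raise ValueError(f"use add_values() for list parameter {name!r}")
        self._single[name] = value

    def add_values(self, name: str, *values: str) -> None:
        """Append to a list parameter without dropping existing values."""
        if name not in self.LIST_PARAMS:
            raise ValueError(f"{name!r} is not a list parameter")
        current = self._lists.setdefault(name, [])
        for value in values:
            # Accept "id,title" as well as separate arguments.
            for part in value.split(","):
                part = part.strip()
                if part and part not in current:
                    current.append(part)

    def as_dict(self) -> Dict[str, str]:
        """Join list parameters with commas only at the very end."""
        params = dict(self._single)
        for name, values in self._lists.items():
            params[name] = ",".join(values)
        return params


params = SolrParams()
params.set_value("q", "drupal")
params.add_values("fl", "id,title")      # e.g. added by the core module
params.add_values("fl", "score", "id")   # e.g. added later by an alter hook
params.add_values("hl.fl", "body")
```

Because every caller appends through the same method, the order in which hooks run no longer decides which fields survive.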

I'm really looking forward

Fidelix's picture

I'm really looking forward to the Search API integration...

Any news regarding this refactoring?

sdelbosc's picture

I am wondering if this initiative is still under work or if it has been cancelled. Any idea?

At this point, I don't see

pwolanin's picture

At this point, I don't see any plan to work on a 7.x-2.x branch. The PECL library and other libraries don't offer a compelling reason to re-write the module.

Some of the features mentioned in the opening were implemented in 7.x-1.x. We did move most of the "contrib" modules out, etc.

For 8.x, Nick_vh and I have discussed the idea that the main apachesolr module could displace search_api_solr as the connector to the index.

For 8.x, Nick_vh and I have

drunken monkey's picture

For 8.x, Nick_vh and I have discussed the idea that the main apachesolr module could displace search_api_solr as the connector to the index.

So you would want to base the D8 version of your module on the Search API? We should discuss this in Prague – it would be good to know before I begin to port the Search API Solr module. (Although this probably won't happen for another month or two, so you still have time to decide in any case.)
Also, you might want to get involved in the Search API porting process to identify any issues you might encounter, or have already noticed with the D7 version.

Lucene, Nutch and Solr
