Core Drupal Search Architecture for D8

jhodgdon's picture

Today at DrupalCon San Francisco, 11 of us gathered to discuss what the ideal architecture would be for Search in Drupal Core. Present were maintainers/users of core Search, various contrib search modules, Lucene, and Solr:
http://drupal.org/user/153120 - janusman
http://drupal.org/user/277371 - awolfey
http://drupal.org/user/29191 - douggreen
http://drupal.org/user/266779 - cpliakas
http://drupal.org/user/472460 - jpmckinney
http://drupal.org/user/10297 - unexpand
http://drupal.org/user/733232 - nihiliad
http://drupal.org/user/49851 - pwolanin
http://drupal.org/user/157079 - mradcliffe
http://drupal.org/user/155601 - jhodgdon

We all agreed that we want the core Search module to be pluggable/modular. Here are some notes:
* We want the whole system to be pluggable.
* Steps in the indexing/cron process
- Decide what needs to be indexed (could be nodes, other entities, ...).
- Render each item, or build a structured renderable array
- Pre-process (stemming, n-grams, word splitting, etc.)
- Index each item
* Steps at search time:
- User interface - ask user for the search query (defined syntax, faceted, etc.)
- Preprocess (as in indexing)
- Query to get search results (with ranking)
- Post-process (spelling suggestions, etc.)
- Extract excerpts/highlight
- Display results
* All of these steps should be pluggable.
* Core search would be a framework that would coordinate the steps, and keep track of what content needs to be indexed for each pluggable search index/retrieval framework.
* We could also provide a google/yahoo/etc. search box, like what you get in Firefox
* We would also provide (maybe as a contrib module) a default storage/retrieval method, basically the current core search mechanism but maybe limited to single keywords (for efficiency).
* Doug is also interested in building a MongoDB implementation of storage/retrieval
* Solr, Lucene, etc. would also be able to build storage/retrieval
* Needs to be language-aware and compatible with multi-lingual sites
* Needs to be extensible, such as supporting facets as one extension, "advanced search", etc.
* Write it using PHP objects
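
A minimal sketch of how the steps in the notes above might fit together, assuming each step is a swappable callable or object and the framework only coordinates them. It is written in Python for brevity rather than PHP, and every name in it (SearchFramework, Backend, InMemoryBackend, split_words) is hypothetical, not part of any Drupal API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


class Backend(Protocol):
    """Hypothetical storage/retrieval plugin (core default, Solr, Lucene, ...)."""

    def index(self, item_id: str, tokens: List[str]) -> None: ...

    def query(self, tokens: List[str]) -> List[str]: ...


class InMemoryBackend:
    """Toy single-keyword inverted index standing in for a real backend."""

    def __init__(self) -> None:
        self.postings: Dict[str, set] = {}

    def index(self, item_id: str, tokens: List[str]) -> None:
        for token in tokens:
            self.postings.setdefault(token, set()).add(item_id)

    def query(self, tokens: List[str]) -> List[str]:
        # Rank by number of matching tokens, then by id for a stable order.
        hits: Dict[str, int] = {}
        for token in tokens:
            for item_id in self.postings.get(token, ()):
                hits[item_id] = hits.get(item_id, 0) + 1
        return sorted(hits, key=lambda item_id: (-hits[item_id], item_id))


def split_words(text: str) -> List[str]:
    """Trivial preprocessor: lower-case and split on whitespace."""
    return text.lower().split()


@dataclass
class SearchFramework:
    """Coordinates the steps; each step can be replaced independently."""

    render: Callable[[dict], str]
    preprocess: Callable[[str], List[str]]
    backend: Backend

    def index_items(self, items: Dict[str, dict]) -> None:
        # Indexing: render each item, preprocess it, hand it to the backend.
        for item_id, item in items.items():
            self.backend.index(item_id, self.preprocess(self.render(item)))

    def search(self, keywords: str) -> List[str]:
        # Search time: the same preprocessing is applied to the query.
        return self.backend.query(self.preprocess(keywords))


framework = SearchFramework(
    render=lambda item: item["title"] + " " + item["body"],
    preprocess=split_words,
    backend=InMemoryBackend(),
)
framework.index_items({
    "node/1": {"title": "Solr setup", "body": "Install Solr"},
    "node/2": {"title": "Views", "body": "Solr and Views"},
})
results = framework.search("solr views")  # ["node/2", "node/1"]
```

Swapping in a stemmer or a Solr-backed backend would mean passing a different preprocess callable or backend object; the coordinating class would not need to change.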

Next steps:
* Maintainers of Lucene, Solr, etc. will provide descriptions of what they needed to modify in the core framework to get things working
* Hopefully everyone will tell me what I missed and got wrong in this post.
* We'll have some meetings on IRC/Skype, working towards a sprint or work session
* Keep in touch with the GSoC person who's working on search, so hopefully they can do something that will be productive to the effort

See also:
http://drupal.org/node/717654 (Search in D8 and beyond - basically a collection of feature requests for D8, somewhat categorized)

Comments

Search term auto-completion and live results...

dahacouk's picture

Are features like auto-completion of search terms (a la Google) and live as-you-type results (a la Spotlight) on the feature radar for searching within a site? Would be cool.

For an example of live as-you-type results see: http://www.w3schools.com/php/php_ajax_livesearch.asp

Looks like there's already a module but stalled (?) http://drupal.org/project/livesearch

@dahacouk: I think core

jpmckinney's picture

@dahacouk: I think core should make it easier to develop such widgets, but it should not be responsible for maintaining any of them.

@jhodgdon Thanks for getting

cpliakas's picture

@jhodgdon Thanks for getting all this in writing. It was great to meet all those who attended the dinner, and I look forward to this greatly. Will post something more substantial after I am done with my vacation :-).

Interesting project by Young

cpliakas's picture

Interesting project by Young Hahn at http://github.com/yhahn/searchlight. It was mentioned by kyle_mathews in http://groups.drupal.org/node/57273#comment-163408, but I thought it would be appropriate to cross-post it here. It seems to have a lot of the elements we are looking for, although as written it wouldn't be a fit for core because of its dependencies. Regardless, lots of great ideas in there.

And now...

mradcliffe's picture

And now the "Creating a generic Search API" SoC project has been approved. I guess we need to hurry up and get organized so we don't waste his effort.

Awesome initiative!

skilip's picture

Awesome initiative!

As one of the search subsystem maintainers...

douggreen's picture

As one of the search subsystem maintainers (jhodgdon being the other), I'd like to be somewhat central to what happens here. I'm hoping that the SoC project can take some guidance from all of us.

Here's my vision:

  • pluggable, lightweight and fast, with a somewhat basic plugin that ships with core. We can haggle over what this is later, but I'm thinking that we might ship core with two simple search plugins: (a) one loosely based on the current engine but without the need for the second query, which is only necessary for OR terms and quoted strings, and (b) one that allows you to select which search engine you want to use (Yahoo, Google, Bing, and any generic one, whatever...)
  • fields, supports searching by fields and facets
  • entities, supports any fieldable object, not just nodes
  • permissions, supports searching and display of entities and fields based on permissions
  • design checked against known engines: lucene, solr, core search, generic search engine, and others?
  • there was one more bullet point, but I can't remember it, ... when I remember I'll come back and edit this...

To accomplish this we'll probably start from scratch. There seem to be a couple projects already under way for this, maybe searchlight or the SoC project.

Someone please enlighten me as to the SoC schedule. If we have time, it would be nice to add to this for the next week or so, then have a phone/IRC meeting with all the interested parties, and decide how much (if any) code we need to write before the SoC project gets started.

IMO, those who know Drupal search and the Drupal search problems, should provide enough leadership here, so that the SoC student has some general direction, before turning them loose. I think that weekly meetings between us and the SoC student (think lots of mentors) would be a good thing.

Google/Yahoo...

jhodgdon's picture

One thing that you've mentioned a couple of times and that I'm not sure about is the idea of a "choose your engine" thing that would search Google, Yahoo, etc.

Are you just suggesting that if someone uses this, and searches for "foo" for example, they would be redirected to google.com (or whichever engine), with their keywords + site:example.com in there (or whatever is appropriate for the engine)? So it would take the visitor completely off their site?
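
To make the redirect idea concrete, here is a small sketch of building such a URL. It is in Python rather than PHP, and the engine table, the "q" parameter name, and the function name are assumptions for illustration only, not part of any proposed Drupal API.

```python
from urllib.parse import urlencode

# Hypothetical table of engine search endpoints; the URLs and the "q"
# parameter name are assumptions for illustration only.
ENGINES = {
    "google": "https://www.google.com/search",
    "bing": "https://www.bing.com/search",
}


def external_search_url(engine: str, keywords: str, site: str) -> str:
    """Build a redirect URL that restricts results to one site."""
    query = f"{keywords} site:{site}"
    return f"{ENGINES[engine]}?{urlencode({'q': query})}"


url = external_search_url("google", "foo", "example.com")
# url == "https://www.google.com/search?q=foo+site%3Aexample.com"
```

Nothing here touches indexing or ranking; it only builds a link that takes the visitor off the site.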

If so, I think this would be better as a contributed module, rather than something that Drupal core endorses.

Yes

douggreen's picture

Yes, I hear you. Where we put it is not a technical decision. I do want to make sure that our "pluggable" search API supports it. I think it would be nice to support the top X search engines with core, out-of-the-box Drupal.

I think that it would be a little more than just search though, for example, when we mark a node for update, we'll want to send this to the search engine backend requesting a re-index.
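
A rough sketch of the bookkeeping this implies, assuming the framework keeps a per-backend list of items awaiting (re)indexing. It is in Python rather than PHP, and the class and method names (IndexTracker, mark_updated, claim) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class IndexTracker:
    """Tracks which items each registered backend still has to (re)index."""

    def __init__(self) -> None:
        self.dirty: Dict[str, Set[str]] = defaultdict(set)

    def register(self, backend_name: str) -> None:
        self.dirty[backend_name]  # touching the key creates an empty set

    def mark_updated(self, item_id: str) -> None:
        # Called when an item is saved: every backend must re-index it.
        for pending in self.dirty.values():
            pending.add(item_id)

    def claim(self, backend_name: str, limit: int) -> List[str]:
        # Called from cron: hand out up to `limit` items and clear them.
        batch = sorted(self.dirty[backend_name])[:limit]
        self.dirty[backend_name].difference_update(batch)
        return batch


tracker = IndexTracker()
tracker.register("core")
tracker.register("external")
tracker.mark_updated("node/1")
tracker.mark_updated("node/2")
first_batch = tracker.claim("core", 1)  # ["node/1"]
```

A remote backend could drain its own queue on its own schedule, for example by sending re-index requests to the provider, without affecting the other backends.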

Requires business-level account

jhodgdon's picture

I think what you're talking about (node notifications) would require the site owner to sign up for a business-level, paid account with the search provider. That would also potentially let them get the results and display them within the Drupal site (at least, Google Custom Search does that), and that would be a good thing for our architecture to support.

So I agree we should make the pluggable architecture support the ability to get results from someone like Google and display them, if someone has an account that lets them do that. I don't see why our architecture would need to support putting up a box that lets people type in a search query and then redirects to an outside site to display the results, though. That's a simple block containing a form that goes out of the site, and not really related to our search architecture at all IMO.

You lose me with field-level

pwolanin's picture

You lose me with field-level permissions for searching. We don't support this with Apache Solr integration now, and while an implementation could feasibly be done, it would be quite a pain. I don't think it should be a priority, and perhaps it should not be included at all in a generic framework.

Perhaps we should also be looking at Views 3 in terms of generically defining queries and providing ways for the response to be formatted?

Isn't this a security problem?

douggreen's picture

If a site has hidden a field from some users, but we index it, and let someone search on it, whether we display it or not, isn't this a security problem?

Yes, so don't use field-level

pwolanin's picture

Yes, so don't use field-level permissions, or accept that you should only index and search on fields that are accessible to all users. If you need to search or filter on some special administrative field, you might have to build a custom interface.

I'm pretty sure that restricting search based on field-level permissions is not supported by core search today. You are restricted at the entity (node) level but not at the field level; see http://api.drupal.org/api/function/node_search_execute/7
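
One way to read the "only index fields accessible to all users" advice is as a filter applied before indexing. The sketch below assumes the site can supply a set of publicly viewable field names; the function and field names are hypothetical, and it is written in Python rather than PHP.

```python
from typing import Dict, Set


def indexable_text(fields: Dict[str, str], public_fields: Set[str]) -> str:
    """Join only the field values that every user may view."""
    return " ".join(
        value
        for name, value in sorted(fields.items())
        if name in public_fields
    )


text = indexable_text(
    {"title": "Budget", "body": "Annual report", "internal_notes": "secret"},
    public_fields={"title", "body"},
)
# text == "Annual report Budget" (fields are joined in name order)
```

Because restricted values never reach the index, they cannot leak through search, at the cost of not being searchable even by privileged users.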

SoC project co-ordination...

dahacouk's picture

I hope all of this will be co-ordinated with the Creating a generic Search API for Drupal project.

@dahacouk I think that is

cpliakas's picture

@dahacouk I think that is what Doug is trying to do here, and I applaud his efforts for doing so. Search development has become fragmented over the years, and this is the strongest initiative I have seen to consolidate our efforts. Therefore, I agree with Doug that the search subsystem maintainers should take a central role in making sure all the efforts that are a result of this initiative are pointing in the same direction. In addition, I think it is the responsibility of people like myself as well as the SoC "students" to make sure we filter information upwards to facilitate coordination.

"Hi" from the GSoC guy

drunken monkey's picture

Hi, I'm the infamous GSoC student. ;)
Looking at the start of the discussion, there are several great ideas and I sense a lot of potential here. My project really overlaps in large parts with this effort (even though it's in the context of a contributed module, not directly improving core) and I'd very much appreciate any directions you could give me. I think that joining efforts here could really be a big step forward for Drupal search, not only in regard to D8 but also for D7. I'd love to help improve Drupal search with my project, and only through feedback from different experienced search developers will I be able to do that.

Apart from a detailed discussion on further progress, I'd especially be interested in the rationale behind the choice for object orientation, and how you think this should work in practice. Define some interfaces in core that search engines have to implement? There surely are a number of pros and cons here and it would be nice to have your insights on these.

@douggreen: See here for the GSoC timeline. The official start of coding is May 24. However, since my semester continues until the end of June, I probably won't code very much in the first month or so. Therefore, coding efforts until the end of June could probably be easily incorporated into my project.
And of course, advice, insights, comments and feedback will be appreciated the whole time.

SoC

douggreen's picture

The Drupal model is to prove it in contrib first, then move it to core. So creating a proof-of-concept in contrib is the right thing to do. Your contrib module should be for 7.x. There will be work beyond the SoC project, but that will be up to us (the Drupal community, but you too), to polish it for core worthiness. But if we start with that goal in mind, hopefully we can build it such that it won't need too much cleanup.

We shouldn't get crazy with OOP. One basic tenet of Drupal is that it should be easy for people to contribute to, and too much OOP makes this difficult. I like many of the 7.x OOP interface classes. The 7.x caching system is a good model: an interface class, a default implementation, sometimes a couple of implementations, with a variable override to change the class. The interface is OOP, but once you drill down into the implementation, it's procedural.
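
The pattern described here (an interface, a default implementation, and a variable that selects the class) might look roughly like the sketch below. It is in Python rather than PHP, and all names, including the "search_backend_class" variable, are hypothetical and not taken from Drupal 7.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class SearchBackendInterface(ABC):
    """The only contract the rest of the system relies on."""

    @abstractmethod
    def search(self, keywords: str) -> List[str]: ...


class DefaultSearchBackend(SearchBackendInterface):
    def search(self, keywords: str) -> List[str]:
        return [f"default:{keywords}"]


class SolrSearchBackend(SearchBackendInterface):
    def search(self, keywords: str) -> List[str]:
        return [f"solr:{keywords}"]


# Registry of known implementations, keyed by class name.
BACKEND_CLASSES: Dict[str, Type[SearchBackendInterface]] = {
    cls.__name__: cls for cls in (DefaultSearchBackend, SolrSearchBackend)
}


def get_search_backend(variables: Dict[str, str]) -> SearchBackendInterface:
    """Instantiate the class named by a site variable, else the default."""
    name = variables.get("search_backend_class", "DefaultSearchBackend")
    return BACKEND_CLASSES[name]()


default_backend = get_search_backend({})
solr_backend = get_search_backend({"search_backend_class": "SolrSearchBackend"})
```

Callers only ever see the interface, so a site switches engines by changing one variable rather than any calling code.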

If you're willing, I'd like to co-mentor you with Robert, and have regular (weekly) meetings to plan and review.

@drunkenmonkey, what do you

douggreen's picture

@drunkenmonkey, what do you mean by Views integration here? Are you talking about integration with the Drupal Views module?

At the Minnesota Search Sprint, we discussed abstracting:

  • searching and indexing
  • display, where views integration might just be a display abstraction???

If you swap the search and index implementation, can the standard display implementation work? Or do we need to tie all three of these together?

Another nice-to-have use case is intermixing results from two different sources. This is hard to do right now. But say that you're searching the world's libraries, and you want to combine the results from multiple Z39.50 servers: how would you do that?

Yes, Drupal Views module

drunken monkey's picture

Yes, I mean integrating the search API with Views, just like e.g. apachesolr_views does at the moment for the apachesolr module. I'm pretty sure that when backends (search implementations) return their search results in a uniform way, and with the help of entity_metadata, we can create a single set of views plugins that will provide all necessary data to display search results from arbitrary search engines.

As for the query over multiple datasources: this indeed could be an interesting use case. I haven't thoroughly thought about it yet, but I think that as long as all backends provide the same set of data fields for each search result (especially relevancy, which might be tricky), it should be possible to implement that. I'll keep it in mind and see how easily this could be added.
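
As an illustration of why relevancy is the tricky part, here is a sketch that merges results from several backends after rescaling each backend's scores to the range 0 to 1. Min-max rescaling is an assumption chosen for simplicity, not something proposed in this thread, and the function names are hypothetical. It is written in Python rather than PHP.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, float]  # (item id, backend-specific relevancy score)


def normalize(results: List[Result]) -> List[Result]:
    """Rescale one backend's scores to 0..1 so they become comparable."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, span = min(scores), max(scores) - min(scores)
    return [
        (item_id, 1.0 if span == 0 else (score - low) / span)
        for item_id, score in results
    ]


def merge(per_backend: Dict[str, List[Result]]) -> List[Tuple[str, str, float]]:
    """Combine all backends into one list, best normalized score first."""
    merged = [
        (backend, item_id, score)
        for backend, results in per_backend.items()
        for item_id, score in normalize(results)
    ]
    # Ties are broken by backend name and item id for a stable order.
    return sorted(merged, key=lambda row: (-row[2], row[0], row[1]))


combined = merge({
    "solr": [("a", 10.0), ("b", 5.0)],
    "core": [("c", 4.0), ("d", 2.0), ("e", 3.0)],
})
order = [item_id for _, item_id, _ in combined]  # ["c", "a", "e", "d", "b"]
```

Rescaling this way treats every backend's top hit as equally relevant, which is often untrue; a real implementation would need a better-founded way to compare scores across engines.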