Improving the Search API

drunken monkey's picture

As some of you might know, in last year's GSoC I created the Search API module, which has already gained some fame since then. It's a highly flexible search solution for Drupal 7, already coming with support for Views and faceted searches out of the box, amongst other things.
Still, like all software, it isn't perfect and there are still a number of known shortcomings, as well as other potential for improvement. So for this Google Summer of Code, I propose to fix some of these shortcomings, and add some additional features.
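The tasks I have in mind are the following (in order of importance):

  • Add autocompletion feature
  • Provide ways to index data other than entities
  • Add a "More like this" feature
  • Add hierarchical facets for taxonomy terms
  • Add additional little multi-language features
  • Extend test coverage
  • Extend documentation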

The last two are of course the buffers for any remaining time at the end — even when I'm done with the others, there will always be numerous ways to improve tests and documentation.

Detailed tasks

Add autocompletion feature

While this is rarely a critical requirement, autocompletion for search keywords is definitely nice to have. For an example, see Google.
Since there currently is no way to search for partial hits with the Search API, this feature will very probably be a Solr-specific extension. Or at least a backend-dependent feature, with only a Solr implementation provided by me.
This will probably be a contrib module of its own, and based on the current Apache Solr Autocomplete module.
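
To illustrate what a backend would have to provide for this, here is a minimal sketch of prefix matching over a sorted list of indexed keywords. It is written in Python purely for illustration and is not how Solr or the Apache Solr Autocomplete module implement suggestions; the function name `suggest` and the sample keyword list are hypothetical.

```python
import bisect


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms from `sorted_terms` that start with `prefix`."""
    if not prefix:
        return []
    # Binary search finds the first term that is >= the prefix; every term
    # sharing the prefix sits in one contiguous run starting at that position.
    position = bisect.bisect_left(sorted_terms, prefix)
    suggestions = []
    while (position < len(sorted_terms)
           and len(suggestions) < limit
           and sorted_terms[position].startswith(prefix)):
        suggestions.append(sorted_terms[position])
        position += 1
    return suggestions


# Hypothetical keywords previously extracted from indexed items.
indexed_terms = sorted(["search", "season", "seat", "solr", "views"])
print(suggest(indexed_terms, "sea"))
```

A backend without any partial-match support cannot answer this kind of query, which is why the feature would have to be declared per backend.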

Provide ways to index data other than entities

This is one of the most long-standing shortcomings of the module that hasn't been fixed yet, so it's really time to tackle it. Right now, only things defined as Drupal entities (nodes, comments, users, taxonomy terms, etc., and ones from contrib modules) are available for indexing. On the one hand, this is a big improvement over previous search solutions, which allowed only nodes (and node-related data) to be searched. On the other hand, if you want to index whole pages or even external data, you're out of luck. You would have to define this other data as an entity, with all its information. That works, but it is still a work-around, and it doesn't help in cases where you, e.g., only want to search external data, not index it.
The solution would probably be to add another abstraction layer between the searches and the entity layer, that would allow things other than normal entities to be included. Basically, indexes would then have a datasource-specific implementation class (like the backend-specific service classes for servers) which would take care of item retrieval, property extraction, reacting to creations, updates and deletions of items, etc.
This is far from trivial, as the "searched items are entities" assumption is baked into the Search API in numerous places on a rather basic level. Coming up with a way to solve all these references won't be easy. Also, an upgrade path has to be provided for current users.
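
The idea of a datasource-specific class sitting between the index and the items can be sketched roughly as follows. This is only an illustration in Python under stated assumptions, not the Search API's actual (PHP) design: the names `DataSource`, `Index`, `ExternalPageSource`, `track_change` and `index_pending` are all hypothetical.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical abstraction: everything the index needs to know about items."""

    @abstractmethod
    def load_items(self, ids):
        """Return a dict mapping each known ID to its raw item."""

    @abstractmethod
    def extract_properties(self, item):
        """Return the dict of field values that should be indexed for an item."""


class ExternalPageSource(DataSource):
    """Example datasource for data that is not a Drupal entity."""

    def __init__(self, pages):
        self.pages = pages

    def load_items(self, ids):
        return {i: self.pages[i] for i in ids if i in self.pages}

    def extract_properties(self, item):
        return {"title": item["title"], "body": item["body"]}


class Index:
    """Talks only to the DataSource interface, never to entities directly."""

    def __init__(self, datasource):
        self.datasource = datasource
        self.dirty = set()    # IDs of created or updated items awaiting indexing
        self.documents = {}   # stand-in for whatever the server backend stores

    def track_change(self, item_id):
        self.dirty.add(item_id)

    def track_delete(self, item_id):
        self.dirty.discard(item_id)
        self.documents.pop(item_id, None)

    def index_pending(self):
        items = self.datasource.load_items(sorted(self.dirty))
        for item_id, item in items.items():
            self.documents[item_id] = self.datasource.extract_properties(item)
        self.dirty.clear()
        return len(items)


pages = {"about": {"title": "About us", "body": "Who we are."}}
index = Index(ExternalPageSource(pages))
index.track_change("about")
print(index.index_pending(), index.documents)
```

With this split, supporting a new kind of item would mean writing another `DataSource` subclass, while the index and the server backends stay unchanged.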

Note: Robin Barre recently created an issue where he tries to solve this problem. Therefore, maybe this will already be solved once GSoC starts, or my main work might merely be to review Robin's solution and help him complete it.

Add a "More like this" feature

"More like this" is supported natively by Apache Solr and is chiefly used for displaying blocks with links to related content on node pages. This should also be supported by the Search API, through a generic feature that the Solr backend would then implement. A probable implementation would be a simple Views argument handler, though there are still some issues to think about with this approach. If it turns out to be such a minor addition, this could well live in the core Search API module for now.
Note that there is already an issue with some more detailed thoughts on this, where miiimooo also wants to implement this. So maybe there will be only slight improvements, or nothing at all, left to do for this task when GSoC starts.

Add hierarchical facets for taxonomy terms

For hierarchical taxonomy terms, the facets currently don't really represent the hierarchy of terms, but just display them in a flat list. With this task I'd like to change this. Taxonomy facets should then be laid out so that when you, e.g., select "America", you are presented with facets for the lower level (e.g., "United States", "Canada", "Mexico", "Brazil", …) and so on. In many situations, this fits the desired behaviour much better than directly presenting the terms actually added (which will probably be on the lowest level).
The current thought for the implementation of this would be to provide a data alteration that adds all parent terms to the indexed data in a single field, and then some JavaScript UI magic that displays that data correctly.
Since it wouldn't add generic behaviour (don't want to build a Relations API) but only a feature specifically for taxonomy terms, this would most likely live in a separate module.
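
The data alteration described above, which adds all parent terms to a single indexed field, can be sketched as follows. This is a minimal Python illustration, not Search API code; the function name `with_ancestors` and the `parent_of` mapping are assumptions, and each term is assumed to have at most one parent.

```python
def with_ancestors(term_ids, parent_of):
    """Return the given terms plus all of their ancestors, without duplicates."""
    collected = []
    seen = set()
    for term in term_ids:
        current = term
        # Walk up the hierarchy. Stop at a root, or at a term that was already
        # collected, since its own ancestors were collected along with it.
        while current is not None and current not in seen:
            seen.add(current)
            collected.append(current)
            current = parent_of.get(current)
    return collected


# Hypothetical vocabulary: child term -> parent term.
parent_of = {
    "United States": "America",
    "Canada": "America",
    "Vienna": "Austria",
    "Austria": "Europe",
}
print(with_ancestors(["Canada", "Vienna"], parent_of))
```

An item tagged only with "Canada" would then also match a facet filter on "America", which is what the drill-down display needs.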

Add additional little multi-language features

While some basic requirements for i18n were included in the Search API right from the start, only very little is provided yet on the frontend in that respect. Therefore, a few helpful features should be included (all of them in the main project, in the fitting module):

  • Add an option list to the "Item language" field, so e.g. facets display the language name, instead of its ID.
  • Add a data alteration for indexing only items in a certain language.
  • (Maybe) Add a setting to indexes, which language they should use for retrieving the data from translatable fields when indexing.
  • Improve the "Item language" Views filter to add a "Current language" option.

While still leaving open a number of problems for multi-language sites, these would at least solve several frequent use cases, and should all be pretty simple to implement.
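
Two of the items above, the language-restricting data alteration and the option list for language names, are simple enough to sketch. The following Python is only an illustration under assumptions: items are plain dicts with a `language` key, and `LANGUAGE_NAMES`, `keep_languages` and `language_facet_labels` are hypothetical names, not part of the Search API.

```python
# Hypothetical option list mapping language IDs to display names.
LANGUAGE_NAMES = {"en": "English", "de": "German", "da": "Danish"}


def keep_languages(items, allowed):
    """Data alteration: drop every item whose language is not allowed."""
    allowed = set(allowed)
    return [item for item in items if item["language"] in allowed]


def language_facet_labels(counts):
    """Replace language IDs in facet counts with names, keeping unknown IDs."""
    return {LANGUAGE_NAMES.get(code, code): n for code, n in counts.items()}


items = [
    {"id": 1, "language": "en"},
    {"id": 2, "language": "de"},
    {"id": 3, "language": "da"},
]
print(keep_languages(items, ["de"]))
print(language_facet_labels({"en": 12, "de": 7, "xx": 1}))
```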

Extend test coverage

There are already several tests, for both the UI and the database backend. (Testing the Solr backend is almost impossible to do cleanly, as this would require setting up a test Solr server.) Still, most additional modules (Views, Facets, …) are untested, and there is also always room for improving the existing tests.
Also, all other tasks worked on during this project should have some (or, better, extensive) test coverage.

Extend documentation

Basically, the same holds here as for test coverage. I already consider the documentation pretty good (compared to many other, popular modules), but there is always room for improvement. E.g., handbook pages for both users and developers could be added, as well as advanced_help integration and even (additional) tutorial videos.

Additional notes

As you can see, those are several tasks, two of which (indexing non-entities and hierarchical facets) I'd also consider rather hard to do. Still, I'm confident that I'd be able to complete all of those (except testing and documentation, which are inherently almost incompletable) during the three months of GSoC, and would love to have that sort of incentive to work on these problems and features.

Rough schedule / timeline

As in previous years, my semester doesn't end until the end of June, so I'll concentrate most of the work in July and August. This worked quite well in the last years, so I don't expect any problems in that regard. Since some things depend on the state of other projects (how much I'll have to do myself), and since I can only guess as to the amount of work needed for all tasks, the schedule below is more or less makeshift, and will be revised as I go.

  • May 23 – June 12: Autocompletion [As said, I'll still have university courses and exams at this time, hence the rather long time estimated for the task.]
  • June 13 – July 10: Indexing non-entities [This is a rather big one.]
  • July 11 – July 17: Add "More like this" feature
  • July 18 – July 31: Hierarchical facets
  • August 1 – August 7: Additional multi-language features
  • August 8 – August 22: Writing additional tests and documentation, fixing remaining issues, general polishing, and probably finishing tasks that should have been completed weeks ago. ;)

Of course, tasks will also overlap, as I'll start new tasks while previous ones are being reviewed or have issues filed against them. And testing and documentation are of course also continuous tasks that run concurrently.

About me

I'm a 24-year-old CS Master's student living in Vienna, Austria, and already a bit of a GSoC veteran, as this would be my fourth Summer of Code project for Drupal. In 2008, I provided Views with pluggable data backends and implemented one for the apachesolr module.
In 2009, I created the apachesolr_rdf module, which was a bit like a much weaker version of the Search API, centered on RDF instead of entities and using only Solr.
And, as mentioned, last year I created the Search API module.

Mentor

mh86

Comments

robertDouglass's picture

Thomas has a great track record of completing GSoC projects and exceeding expectations. I'm thrilled to see this proposal and think it's very much spot on for taking Drupal search to the next level.

Good luck!

stevepurkiss's picture

Sounds like a great project! I had heard of your work but not tried it out yet. I set up Solr the other day for a new project and was planning to watch the Solr DrupalCon Chicago session tomorrow, as I want to learn more about setting up custom searches. Sounds like your module will help with that.

Autocomplete in particular is a much-loved feature, so I hope you get to work on this!

Gah... now I want to go play more!

mh86's picture

Great proposal! I've already been using the Search API and the way it allows you to create searches convinced me (already saved me lots of hours with code customizations). The proposed extensions, especially the search notifications and the autocomplete, will enhance the search experience and I'm really looking forward to these features.

I'd like to help with mentoring this project and as we both live in Vienna, regular meetings in person would be possible.

Small update

drunken monkey's picture

I just added a (very rough) time schedule and a link to the Views Saved Searches module, which might be relevant for the first task.

Search in core...

jhodgdon's picture

I would really like to see this GSoC project aimed squarely at getting Search API or something like it ready to be the core Search framework module. I think some of the proposed tasks work towards that, and some don't. Would it be possible to aim for the ones that do first, at least?

Which tasks?

drunken monkey's picture

I take it you mean indexing non-entity data and the multi-language improvements? Or tests/documentation?
In any case, it's a difficult balance between making the Search API module itself more useful to users right now, and improving it architecturally to advance it to something that could become a core search framework. I'm pretty confident, though, that I'll complete all the tasks in the project, and the time difference won't be more than one or two months anyways.

I did initially think of making "Core search in D8" a GSoC proposal directly, but that is simply not suitable for GSoC for various reasons. So instead I opted for this indirect approach, which in my opinion encompasses a good mix of architectural improvements and feature additions. And making the module more useful right now will ultimately also benefit the larger goal of D8 search: the more users the Search API in D7 gets, the more issues with its current architecture will be found. Which leads to a clearer picture of how the architecture for D8 core search could look.
(Chris also said something to this effect in the D8 core search discussion.)

Or other tasks?

drunken monkey's picture

Or, a different question: What additional, or alternative, tasks would you propose to this end?
Of course, I've already said I'll work on D8 core search, but such things without a real deliverable aren't really suited for GSoC, I think. If you have something specific in mind, please suggest it and I'll see if I can add this.

marvil07's picture

jhodgdon started a wiki page, "Search module as API framework", with some ideas.

One of the main points I would like to see happen is embracing plugins (CTools plugins) inside the Search API, since the project will be touching several parts of the source anyway. So, Thomas, it would be great if you wanted to include it in your goals ;-)

Why CTools?

drunken monkey's picture

I'm not sure what the real advantage would be in switching to CTools plugins. Right now, the Search API has plugins which, I think, are pretty cleanly defined and provide all the needed flexibility. They're just not CTools plugins. If Drupal core had a plugin framework (which is apparently planned for D8), it would of course be good to use that. But using CTools would, from what I can see, only add another module dependency, with all its related issues (version mismatches, scattered bugs/issues, scattered documentation, possible inconsistencies in style, etc.).
I admit, I haven't really looked into CTools much – partly because I couldn't find any documentation to speak of. But I don't really see much advantage in relying on it to manage my plugins. Especially now that there's already my own code for that.

However, there is currently a Search API issue where Rebecca White also wants to introduce CTools plugins. Maybe I'll see some advantages there. ;)

ideas on link, little note on plugins future

marvil07's picture

Hey Thomas :-)

There are some reasons under the "Implementation Notes" header on the page I linked. By the way, I also talked with Larry (probably one of the people who will be providing us with a plugins API in core for Drupal 8) at DrupalCon about what to do to use plugins. He also told me that it is a good idea to use CTools plugins, since many modules are already using them, and it should not be so hard to move from CTools OO plugins to the future Drupal 8 ones.

I'm aware of that.

drunken monkey's picture

I just don't see a reason to switch right now. As far as I can see, we could just as well switch in D8, when there is the real core system. It will probably be even a little less work (as the ctools-7.x -> drupal-8.x step drops out, however tiny) and would take place at a point where the API changes anyway, instead of now, when I'm mostly set on keeping the API fixed.

I am with you on this one

guy_schneerson's picture

I am designing a module with custom plugins and am not planning to use CTools; when it's in 8, I will.
I'm not saying it is the correct way. It just looks to me like it will have all the advantages of doing exactly what I need, without overhead and dependencies.

dawehner's picture

If you are looking for documentation about CTools, look at CTools itself.

For example, there is the ctools_plugin_example module, which explains it extensively.

In general, adding CTools as a dependency isn't that hard.

Many modules already depend on it, for example the whole features/context stack.
It's hard to imagine building a site without CTools nowadays.

I want to participate in this in GSoC 11

kumar_rakesh's picture

Please guide me: how can I participate in this?

What do you mean?

drunken monkey's picture

Do you mean in this project, or in the GSoC in general?

This project is mine, so you can't officially use this as your GSoC project. If you want to help in the project, you can wait for me to start (if I get accepted) and then review patches and provide feedback, and participate in discussions about it.
If you want to participate in GSoC in general, best search this group for other projects that haven't got a student yet and get in touch with the people who proposed it, or who offered to mentor it. You then have until April 8 to actually submit your proposal to Google.

Change in subprojects

drunken monkey's picture

A Danish company, Mediehuset Ingeniøren, just offered to sponsor the first part of this proposal, "Add saved searches and search notifications". So this will probably not be included in the GSoC project.
Instead, I replaced this with two other tasks, "More like this" and "Hierarchical taxonomy facets", that were suggested to me.

Official GSoC proposal created

drunken monkey's picture

I just created the official proposal on the GSoC site: see here.
However, it sadly isn't publicly visible at the moment, as that also displays my mail address publicly and I'm getting more than enough spam mails as it is. ;) (See this issue I created in the Melange issue queue.) Therefore, probably only mentors and organization admins will be able to view it. The same information should be contained in the opening post here, though.
Update: The bug was fixed, so I have now set the proposal to be publicly visible.

cpliakas's picture

Awesome. Great to see search get some love. Regarding the hierarchical facets, that should be addressed not only for taxonomy terms but for anything that has hierarchical relationships once the "Integrate with Facet API" issue is completed. Could be one less task to tackle!

drunken monkey's picture

Oh, so this is already possible with Facet API? Or is it just planned?
In any case, great news! (And good thing I didn't start that task first …)

cpliakas's picture

This is already in there and working very well. Check out this image from Apache Solr Search Integration's implementation. All you have to do is set a "hierarchy callback" for the facet which simply maps the facet items to their parents. Facet API actually comes with a built-in hierarchy callback for taxonomy terms that Search API can use directly as well as a model for creating callbacks for non-taxonomy fields. If a facet implements a hierarchy callback, there is an option for it to be flattened which is useful when using widgets such as Tagcloud Facets and Chart Facets.

Awesome!

drunken monkey's picture

Wow, awesome!
Thanks again for pointing this out and explaining!

cpliakas's picture

Absolutely! Glad to help out.

Wiki page

drunken monkey's picture

Just realized I hadn't even linked the project's official wiki page here yet:
http://groups.drupal.org/node/145259