As some of you might know, in last year's GSoC I created the Search API module, which has already gained some fame since then. It's a highly flexible search solution for Drupal 7, already coming with support for Views and faceted searches out-of-the-box, amongst other things.
Like all software, though, it isn't perfect: there are a number of known shortcomings, as well as other room for improvement. So for this Google Summer of Code, I propose to fix some of these shortcomings and add some new features.
The tasks I'd have in mind are the following (in order of importance):
- Add autocompletion feature
- Provide ways to index data other than entities
- Add a "More like this" feature
- Add hierarchical facets for taxonomy terms
- Add additional little multi-language features
- Extend test coverage
- Extend documentation
The last two are of course the buffers for any remaining time at the end: even when I'm done with the others, there will always be numerous ways to improve tests and documentation.
Add autocompletion feature
While this is rarely a critical requirement, autocompletion for search keywords is definitely nice to have. For an example, see Google.
Since there is currently no way to search for partial hits with the Search API, this feature will very probably be a Solr-specific extension, or at least a backend-dependent feature with only a Solr implementation provided by me.
This will probably be a contrib module of its own, and based on the current Apache Solr Autocomplete module.
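To illustrate the kind of partial (prefix) matching that autocompletion needs, here is a minimal sketch in Python. It is not Drupal or Solr code: the function name `suggest` and the `TERMS` list are hypothetical, and it assumes the backend can expose a sorted list of lowercased index terms.

```python
from bisect import bisect_left


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms from `sorted_terms` that start with `prefix`."""
    prefix = prefix.lower()
    # Binary search for the first term that is >= the prefix.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        # Matching terms are contiguous in a sorted list, so stop at the first miss.
        if not term.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(term)
    return matches


# Hypothetical, already-lowercased index terms (an assumption for illustration).
TERMS = sorted(["drupal", "drush", "entity", "facet", "facets", "search", "solr"])
```

A real implementation would more likely delegate this lookup to the search backend, which is why the feature is expected to be backend-dependent.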
Provide ways to index data other than entities
This is one of the longest-standing unfixed shortcomings of the module, so it's really time to tackle it. Right now, only things defined as Drupal entities (nodes, comments, users, taxonomy terms, etc., and ones from contrib modules) are available for indexing. On the one hand, this is a big improvement over previous search solutions, which allowed only nodes (and node-related data) to be searched. On the other hand, if you want to index whole pages or even external data, you're out of luck. You would have to define this other data as an entity, with all its information. That works, but it is still a work-around, and it doesn't help in cases where you, e.g., only want to search external data, not index it.
The solution would probably be to add another abstraction layer between the searches and the entity layer, that would allow things other than normal entities to be included. Basically, indexes would then have a datasource-specific implementation class (like the backend-specific service classes for servers) which would take care of item retrieval, property extraction, reacting to creations, updates and deletions of items, etc.
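The shape of such an abstraction layer can be sketched as follows. This is a language-neutral illustration written in Python, not the Search API's actual design: the class and method names (`DataSource`, `ExternalFeedSource`, `Index`, `track_change`, and so on) are hypothetical and only mirror the responsibilities listed above (item retrieval, property extraction, reacting to changes and deletions).

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical per-index datasource: how items are loaded and described."""

    @abstractmethod
    def load_items(self, ids):
        """Return a dict mapping each known ID to its raw item."""

    @abstractmethod
    def extract_properties(self, item):
        """Return a dict of indexable field values for one item."""


class ExternalFeedSource(DataSource):
    """Toy datasource for non-entity data held in a plain dict."""

    def __init__(self, records):
        self._records = records

    def load_items(self, ids):
        # Unknown IDs are silently skipped.
        return {i: self._records[i] for i in ids if i in self._records}

    def extract_properties(self, item):
        return {"title": item["title"], "body": item.get("body", "")}


class Index:
    """Tracks changed items and hands them to the datasource for indexing."""

    def __init__(self, datasource):
        self.datasource = datasource
        self.dirty = set()
        self.documents = {}

    def track_change(self, item_id):
        # Called on creation or update of an item.
        self.dirty.add(item_id)

    def track_delete(self, item_id):
        self.dirty.discard(item_id)
        self.documents.pop(item_id, None)

    def index_dirty(self):
        # Load all pending items through the datasource and store their fields.
        items = self.datasource.load_items(sorted(self.dirty))
        for item_id, item in items.items():
            self.documents[item_id] = self.datasource.extract_properties(item)
        self.dirty.clear()
        return len(items)
```

The point of the sketch is that the index never touches entities directly; an entity-based datasource would simply be one implementation among others.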
This is far from trivial, as the "searched items are entities" assumption is baked into the Search API in numerous places on a rather basic level. Coming up with a way to solve all these references won't be easy. Also, an upgrade path has to be provided for current users.
Note: Robin Barre recently created an issue where he tries to solve this problem. Therefore, maybe this will already be solved once GSoC starts, or my main work might merely be to review Robin's solution and help him complete it.
Add a "More like this" feature
"More like this" is supported natively by Apache Solr and is chiefly used for displaying blocks with links to related content on node pages. This should also be supported by the Search API, through a generic feature that the Solr backend would then implement. A probable implementation would be a simple Views argument handler, though there are still some issues to think about with this approach. If it turns out to be such a minor addition, it could well live in the core Search API module for now.
Note that there is already an issue with some more detailed thoughts on this, where miiimooo also wants to implement this. So maybe only slight improvements, or nothing at all, will be left to do for this task when GSoC starts.
Add hierarchical facets for taxonomy terms
For hierarchical taxonomy terms, the facets currently don't really represent the hierarchy of terms, but just display them in a flat list. With this task I'd like to change this. Taxonomy facets should then be laid out so that when you, e.g., select "America", you are presented with facets for the next lower level (e.g., "United States", "Canada", "Mexico", "Brazil", …) and so on. In many situations, this fits the desired behaviour much better than directly presenting the terms actually added to the items (which will probably be on the lowest level).
Since it wouldn't add generic behaviour (don't want to build a Relations API) but only a feature specifically for taxonomy terms, this would most likely live in a separate module.
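The drill-down behaviour described above can be sketched like this. It is a minimal Python illustration, not Drupal code: the `PARENTS` mapping, the sample terms, and the function names are hypothetical, and it assumes each item is tagged with one or more (typically lowest-level) terms.

```python
from collections import Counter

# Hypothetical taxonomy: term -> parent (None for top-level terms).
PARENTS = {
    "America": None,
    "Europe": None,
    "United States": "America",
    "Canada": "America",
    "Austria": "Europe",
}


def ancestors_and_self(term):
    """Yield the term itself, then each of its ancestors up to the top level."""
    while term is not None:
        yield term
        term = PARENTS[term]


def facet_counts(item_terms, selected=None):
    """Count items per direct child of `selected` (top-level terms if None)."""
    counts = Counter()
    for terms in item_terms:
        # An item counts once for every term on any of its terms' ancestor chains.
        seen = {a for t in terms for a in ancestors_and_self(t)}
        if selected is not None and selected not in seen:
            continue  # item lies outside the selected branch
        counts.update(t for t in seen if PARENTS[t] == selected)
    return dict(counts)
```

With no selection, only top-level terms are offered as facets; selecting one of them restricts the items to that branch and offers its direct children instead.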
Add additional little multi-language features
While some basic requirements for i18n were included in the Search API right from the start, only very little is provided yet on the frontend in that respect. Therefore, a few helpful features should be included (all of them in the main project, in the fitting module):
- Add an option list to the "Item language" field, so that, e.g., facets display the language name instead of its ID.
- Add a data alteration for indexing only items in a certain language.
- (Maybe) Add a setting to indexes specifying which language they should use for retrieving the data from translatable fields when indexing.
- Improve the "Item language" Views filter to add a "Current language" option.
While still leaving open a number of problems for multi-language sites, these would at least solve several frequent use cases, and should all be pretty simple to implement.
Extend test coverage
There are already several tests, for both the UI and the database backend. (Testing the Solr backend is almost impossible to do cleanly, as this would require setting up a test Solr server.) Still, most additional modules (Views, Facets, …) are untested, and there is also always room for improving the existing tests.
Also, all other tasks worked on during this project should have some (or, better, extensive) test coverage.
Extend documentation
Basically, the same holds here as for test coverage. I already consider the documentation pretty good (compared to many other popular modules), but there is always room for improvement. E.g., handbook pages for both users and developers could be added, as well as advanced_help integration and even (additional) tutorial videos.
As you can see, those are several tasks, two of which (indexing non-entities and hierarchical facets) I'd also consider rather hard to do. Still, I'm confident that I'd be able to complete all of those (except testing and documentation, which are inherently almost incompletable) during the three months of GSoC, and would love to have that sort of incentive to work on these problems and features.
Rough schedule / timeline
As in previous years, my semester doesn't end until the end of June, so I'll concentrate most of the work in July and August. This has worked quite well in the past, so I don't expect any problems in that regard. Since some things depend on the state of other projects (how much I'll have to do myself), and since I can only guess at the amount of work needed for each task, the schedule below is more or less makeshift and will be revised as I go.
- May 23 – June 12: Autocompletion [As said, I'll still have university courses and exams at this time, hence the rather long time estimated for the task.]
- June 13 – July 10: Indexing non-entities [This is a rather big one.]
- July 11 – July 17: Add "More like this" feature
- July 18 – July 31: Hierarchical facets
- August 1 – August 7: Additional multi-language features
- August 8 – August 22: Writing additional tests and documentation, fixing remaining issues, general polishing, and probably finishing tasks that should have been completed weeks ago. ;)
Of course, tasks will also overlap, as I will start new tasks while previous ones are being reviewed or have issues filed against them. Testing and documentation are of course also continuous, concurrent tasks.
I'm a 24-year-old CS Master's student living in Vienna, Austria, and already a bit of a GSoC veteran, as this would be my fourth Summer of Code project for Drupal. In 2008, I provided Views with pluggable data backends and implemented one for the apachesolr module.
In 2009, I created the apachesolr_rdf module, which was a bit like a much weaker version of the Search API, centered on RDF instead of entities and using only Solr.
And, as mentioned, last year I created the Search API module.