Best approach to indexing stemmed and unstemmed fulltext in Drupal?

alanom's picture

A common goal with Apache Solr search servers is to get the "best of both" stemmed and unstemmed matching by indexing both the original term and its stem (produced by something like SnowballPorterFilterFactory). Stemming matches grammatical variations, while indexing the original lets exact matches rank higher than near matches, and protects against awkward cases where, after stemming, the original term no longer matches.

When Solr schemas are defined wholly statically, there are two straightforward approaches I'm aware of (below). What's the best (meaning most robust and least dependent on brittle hard coding) way to adapt these to Drupal, where the field schemas are generated dynamically by Search API or the Apache Solr module? (I'm particularly interested in Search API, but Apache Solr module approaches would be useful information too.)

The two approaches I'm aware of are:

  • Using copyField in the schema to copy each full text field. Pass one copy to the stemmer, and leave the other unstemmed
  • Before stemming, make a composite field of all full text fields, defined as a string, so it isn't processed or stemmed (Search API can create composite fields, but not as strings - only full text or numeric etc)
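
As a rough sketch of the first approach in a static schema.xml: the field and type names here (`body`, `body_exact`, `text_stemmed`, `text_exact`) are made up for illustration and are not what Search API or the Apache Solr module generate. The tokenizer and filter classes are standard Solr ones, but the exact analysis chain is an assumption.

```xml
<!-- Hypothetical type: lowercases, then stems with the Snowball Porter stemmer. -->
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<!-- Hypothetical type: same chain without the stemmer, so original terms are kept. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field names. -->
<field name="body" type="text_stemmed" indexed="true" stored="true"/>
<field name="body_exact" type="text_exact" indexed="true" stored="false"/>

<!-- copyField copies the raw input before analysis, so body_exact sees unstemmed text. -->
<copyField source="body" dest="body_exact"/>
```

Because copyField works on the raw field value rather than the analyzed tokens, each destination field applies its own analysis chain independently.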

Any advice on doing equivalents in Drupal with existing modules will be welcome.

I've posted a Search API-specific variant as a support request on the Search API issue queue - but with only one project maintainer, we know that Search API support requests are seldom answered. There's also a similar Search API-focused question on Drupal Answers with a 150 point bounty, for anyone interested in that site wanting some easy rep :)

I'm asking here as well because those questions are focused squarely on Search API, but comments, insights and thoughts from people with Apache Solr module experience will also be relevant and useful.


pwolanin's picture

Can you give an example of a search term that fails to match due to stemming?

Also, looking at the apachesolr schema.xml, the label and content fields are already copied to the "spell" field which uses a simple (non-stemming) analysis chain.

So, you could potentially add that to your query fields with a boost for improving the score of exact matches.
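
A minimal sketch of what that could look like as handler defaults in solrconfig.xml, assuming an edismax-style handler. The handler name and the boost values are placeholders chosen for illustration, not taken from the apachesolr module; only the `label`, `content` and `spell` field names come from the schema mentioned above.

```xml
<!-- Assumed handler name and parser; adjust to whatever the site's solrconfig.xml uses. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Placeholder boosts: the unstemmed "spell" field is weighted higher
         so that exact matches score above stem-only matches. -->
    <str name="qf">content^1.0 label^1.5 spell^2.0</str>
  </lst>
</requestHandler>
```

Note that request parameters override handler defaults, so if the module sends its own `qf` at query time, the boost would need to be added there instead.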