Optimizing Apachesolr for non-English languages

ducdebreme's picture

I have done a lot of research on optimizing Apachesolr for non-English languages, especially German. It turns out that search results can be dramatically improved by adjusting Solr's stemming and by breaking up compound words. Both can be achieved with slight changes to Apachesolr's schema.xml.

This post is about configuring stemming:
http://www.early-dance.de/news/9188-optimizing-apachesolr-non-english-la...
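
To give an idea of what such a stemming change can look like, here is a minimal sketch of a German field type for schema.xml. It is not taken from the linked post. The field type name "text_de_stem" is made up for illustration, and the choice of tokenizer and of the "German2" Snowball variant are assumptions you may want to adjust for your own index.

```xml
<!-- Sketch only: "text_de_stem" is a hypothetical name, not part of the stock Apachesolr schema.xml -->
<fieldType name="text_de_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first so that stemming sees a consistent form -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Snowball stemmer for German; the "German2" variant also handles ae/oe/ue spellings of umlauts -->
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>
```

Because no "type" attribute is set on the analyzer, the same chain applies at index and query time. After changing the analysis chain you need to reindex your content.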

And this post is about compound word splitting, which is needed in languages like German that have long compound words such as "Dampfschifffahrt":
http://www.early-dance.de/news/9189-apachesolr-issues-german-and-other-g...
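
As a rough illustration of compound splitting (again not copied from the linked post), Solr ships a dictionary-based decompounding filter that can be added to the analyzer chain. In the sketch below the field type name "text_de_compound" and the dictionary file name "german-dictionary.txt" are hypothetical, and the size limits are only starting values to experiment with.

```xml
<!-- Sketch only: field type name and dictionary file name are assumptions -->
<fieldType name="text_de_compound" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Adds dictionary words found inside long tokens as extra tokens;
         the dictionary is a plain text file with one (lowercase) word per line -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="german-dictionary.txt"
            minWordSize="5"
            minSubwordSize="4"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>
  </analyzer>
</fieldType>
```

With a dictionary that contains "dampf", "schiff" and "fahrt", a token like "dampfschifffahrt" should be indexed together with those parts, so a search for "Schiff" can find it. The quality of the results depends heavily on the word list you provide.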

Comments

Thanks!

robertdouglass's picture

That's great information. A lot of people ask about this, and first-hand experiences are super helpful.

mkalkbrenner's picture

I made my first proposals regarding these issues half a year ago:
http://drupal.org/node/463886

We also tweaked word splitting on one index in our production environment, but have not provided a patch yet.

Markus Kalkbrenner
Cocomore AG
drupal.cocomore.com

ducdebreme's picture

Yes Markus, your post was very helpful and encouraged me to do further research.
But what surprised me most was that a compound splitter is already built into Apachesolr!

compound splitter

mkalkbrenner's picture

We had a look at the compound splitter and will shortly integrate it into our localized version of apachesolr, which is freely available here:
http://drupal.cocomore.com/de/project/apachesolr

BTW, besides a compound splitter patch, we're currently working on xi:include patches and a German manual.

ducdebreme's picture

Hi Markus, this definitely sounds interesting. I'll check it out!
Stefan

Apache Solr Multilingual

mkalkbrenner's picture

We just released the first alpha version of Apache Solr Multilingual, which supports language-specific stemming and compound word splitting.

Lucene, Nutch and Solr
