Optimizing Apachesolr for non-English languages

ducdebreme

I have done a lot of research on optimizing Apachesolr for non-English languages, especially German. It turns out that search results can be dramatically improved by adjusting Solr's stemming and by breaking up compound words. Both can be achieved with small changes to Apachesolr's schema.xml.

This post is about configuring stemming:
http://www.early-dance.de/en/news/9188-optimizing-apachesolr-non-english...

And this post is about compound word splitting, which is needed in languages like German that have long compound words such as "Dampfschifffahrt":
http://www.early-dance.de/en/news/9189-apachesolr-issues-german-and-othe...
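
To make the idea of dictionary-based compound splitting concrete, here is a minimal Python sketch. It is not Solr's implementation, only an illustration of the principle: the function names, the `min_subword_size` parameter, and the three-word dictionary are assumptions made up for this example. The compound itself is kept as a token, and every dictionary word found inside it is added as an extra token, so a search for "Schiff" can match a document containing "Dampfschifffahrt".

```python
def split_compound(word, dictionary, min_subword_size=4):
    """Return the dictionary words found inside `word` (brute-force scan).

    Hypothetical helper for illustration only; not part of Solr or Apachesolr.
    """
    lowered = word.lower()
    parts = []
    for start in range(len(lowered)):
        # Only consider substrings of at least min_subword_size characters.
        for end in range(start + min_subword_size, len(lowered) + 1):
            candidate = lowered[start:end]
            # Skip the whole word itself; it is indexed separately.
            if candidate != lowered and candidate in dictionary:
                parts.append(candidate)
    return parts


def index_tokens(word, dictionary):
    """Tokens to index: the lowercased word plus any subwords found in it."""
    return [word.lower()] + split_compound(word, dictionary)


# Toy dictionary (an assumption); a real one would be a large word list.
toy_dictionary = {"dampf", "schiff", "fahrt"}

tokens = index_tokens("Dampfschifffahrt", toy_dictionary)
print(tokens)
# ['dampfschifffahrt', 'dampf', 'schiff', 'fahrt']
```

The quality of the result depends almost entirely on the dictionary: a word list that is too small misses parts, while one that is too permissive splits words that should stay whole.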


Thanks!

robertDouglass - Thu, 2009-10-15 11:50

That's great information. A lot of people ask about this, and first-hand experiences are super helpful.


I made my first proposals

mkalkbrenner - Thu, 2009-10-15 12:06

I made my first proposals regarding these issues half a year ago:
http://drupal.org/node/463886

We also tweaked word splitting on one index in our production environment, but haven't provided a patch yet.

Markus Kalkbrenner
Cocomore AG
drupal.cocomore.com


Yes Markus, your post was

ducdebreme - Thu, 2009-10-15 12:15

Yes Markus, your post was very helpful and encouraged me to do further research.
But what surprised me most was that a compound splitter is already built into Apachesolr!


compound splitter

mkalkbrenner - Mon, 2009-11-02 10:32

We had a look at the compound splitter and will shortly integrate it into our localized version of apachesolr, which is freely available here:
http://drupal.cocomore.com/de/project/apachesolr

BTW, besides a compound splitter patch, we're currently working on xi:include patches and a German manual.


Hi Markus, this sounds

ducdebreme - Wed, 2009-11-04 15:39

Hi Markus, this definitely sounds interesting. I'll check it out!
Stefan