Solr in non-English

fp's picture

I am trying to run apachesolr on a site which for now has only French content.

I have attached both the schema.xml I use and the query results from solr for a query on the word "Vidéocassettes".

From what I have gathered so far, I assumed that the following filter

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

would take care of mapping the accents for both indexing and querying. If I remove the accent from my query (e.g. "Videocassettes"), I get the expected results, which leads me to think that the character mapping at index time is working. However, the accented query returns no results.

What am I missing?

Thanks!
fp

Attachment         Size
schema.xml.txt     10.11 KB
results.xml.txt    2.45 KB

Comments

Might be a Tomcat configuration issue

David Lesieur's picture

I have encountered the same issue, which was fixed by adding the URIEncoding="UTF-8" attribute to the right Connector element in Tomcat's server.xml.
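
As a rough illustration of the fix described above, the Connector in Tomcat's conf/server.xml might look like the fragment below. Only URIEncoding="UTF-8" comes from this thread; the port, protocol, timeout, and redirect values are assumptions modeled on a typical default install, so adjust them to match the Connector that actually serves Solr.

```xml
<!-- conf/server.xml (sketch): the HTTP Connector that serves Solr. -->
<!-- Port, protocol, timeout, and redirect values are assumed defaults,
     not taken from the thread. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
```

Older Tomcat versions such as Tomcat 6 decode the query string as ISO-8859-1 unless told otherwise, so a UTF-8 encoded "é" (sent as %C3%A9) would reach Solr as two unrelated characters and match nothing in the index. Tomcat needs a restart for the change to take effect.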

Indeed

fp's picture

Fantastic! Thanks for your timely response David. Much appreciated.

Is this solution really working?

briandorval2's picture

I've tried it, but maybe I made a mistake. Are you talking about the server.xml located in tomcat6/conf?

My problem is that if I index data with accented letters like É, À, or È, the entries are not listed alphabetically with the E or A but at the end of the list...

I've tried to reindex with your modification and the accented entries are still sorted at the end of the list... Am I missing something?

Thanks!

The moon is closer to the sun than I am to anyone.

Beyond the server.xml

fp's picture

Yes, you are referring to the correct server.xml. I assume that you have restarted Tomcat...

Have you had a look at the schema.xml(.txt) file that I posted originally? There are a couple of important bits, such as mapping-ISOLatin1Accent.txt for the charFilter and <filter class="solr.SnowballPorterFilterFactory" language="French"/>.
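
For readers without access to the attachment, here is a minimal sketch of how those two pieces might sit together in a schema.xml field type. It is not the attached file: the field type name, the tokenizer, and the lowercase filter are assumptions chosen for illustration, and only the charFilter and the Snowball filter are quoted from this thread.

```xml
<!-- schema.xml (sketch): a text field type for French content. -->
<!-- The name "text", the tokenizer, and the lowercase filter are assumptions;
     the charFilter and Snowball filter are the ones mentioned in the thread. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Fold accented characters to their unaccented form before tokenizing. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stem tokens with the French Snowball stemmer. -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>
```

An analyzer element without a type attribute applies to both indexing and querying, which is why the same accent mapping is expected on both sides. Changes to the analysis chain only affect documents indexed afterwards, so a reindex is needed.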

You seem to have modified the

briandorval2's picture

You seem to have modified the stopwords to be French. Do you have an example of what you did in this file?

I think it's the only thing that I'm missing.

Thank you for your help, I really appreciate it!


One other thing to take into

wmostrey's picture

One other thing to take into account in the server.xml is to remove useBodyEncodingForURI from the Connector.

I spent hours trying to figure out why UTF-8 wasn't working while I had URIEncoding="UTF-8" enabled. After a whole lot of testing it was the useBodyEncodingForURI that was causing the issue.
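
A sketch of the change described above, assuming a Connector that started out with both attributes set (the port and protocol values are placeholders, not taken from the thread):

```xml
<!-- Before (assumed starting point): both attributes set on the same Connector.
     <Connector port="8080" protocol="HTTP/1.1"
                URIEncoding="UTF-8" useBodyEncodingForURI="true" />
-->
<!-- After: useBodyEncodingForURI removed, URIEncoding kept. -->
<Connector port="8080" protocol="HTTP/1.1"
           URIEncoding="UTF-8" />
```

A likely explanation is that, when useBodyEncodingForURI is enabled, Tomcat decodes query parameters with the encoding declared for the request body instead of the one given in URIEncoding, so the UTF-8 setting is effectively bypassed.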

Lucene, Nutch and Solr
