Telling Solr to ignore certain patterns in indexed fields when querying them

Posted by katbailey on December 1, 2009 at 1:21am

We need to implement HTML node titles with bbcode (as per http://drupal.org/node/28537) for a client site that's using ApacheSolr. The titles need to be displayed in search results with their HTML intact, so the bbcode version has to get indexed, and I need to tell Solr to ignore this code when matching queries. For example, if a user searches for "blue smurf" (with the quotation marks) and there's a node with the title "[strong]blue[/strong] smurf" in the index, Solr needs to recognise this as a match. I thought that this was what the PatternReplaceFilterFactory was for, and added the following to the query analyzer for the "text" fieldType definition:

<filter class="solr.PatternReplaceFilterFactory" pattern="[\/?(em|i|b|strong|u)]" replacement="" replace="all"/>

... but it's not doing the trick. Maybe I'm misunderstanding how this should work. Anyone familiar with this kind of thing who could give me some guidance?

Comments

Pattern

Posted by dstuart on December 1, 2009 at 11:12am

Hey Kat,

I haven't tested it yet but wouldn't your regex need to be

as not escaping the square brackets would cause them to be treated as part of the pattern rather than as actual characters?

Regards,

Dave

Sorry the code got stripped take 2

Posted by dstuart on December 1, 2009 at 11:25am

<filter class="solr.PatternReplaceFilterFactory" pattern="[\/?(em|i|b|strong|u)]" replacement="" replace="all"/>

take 3

Posted by dstuart on December 1, 2009 at 11:27am

Right, for some reason it is stripping the forward slashes, which means that your code may be right.

pattern="forwardslash[\/?(em|i|b|strong|u)forwardslash]"

This seems to work

Posted by dstuart on December 1, 2009 at 12:27pm

Had a play around and this seems to work, with the added bonus of working with and without strong tags, e.g. "blue smurf" and "blue[/strong] smurf".
Please note forwardslash means / (just in case).

    <fieldType name="bbTitle" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1" />
        <filter class="solr.PatternReplaceFilterFactory"  pattern="forwardslash[\/?(em|i|b|strong|u)forwardslash]" replacement="" replace="all" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
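A possible variation on the field type above, offered only as an untested sketch: assuming Solr 1.4 or later, where solr.PatternReplaceCharFilterFactory is available, the tags can be stripped from the raw text before tokenizing. The same analyzer then serves both indexing and querying, and ordinary whitespace tokenization still applies. The name bbTitleChar is made up for illustration, and the pattern assumes the backslashes are written out in schema.xml.

```xml
<!-- Hypothetical field type: the name bbTitleChar is an assumption, not from the thread. -->
<fieldType name="bbTitleChar" class="solr.TextField" positionIncrementGap="100">
  <!-- One analyzer with no type attribute is used for both index and query. -->
  <analyzer>
    <!-- Runs on the raw characters before the tokenizer sees them.
         \[ and \] match literal square brackets; /? makes the slash optional.
         Only lowercase tag names are matched, because this runs before lowercasing. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\[/?(em|i|b|strong|u)\]" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, "[strong]blue[/strong] smurf" should be analyzed as the two tokens "blue" and "smurf", which is what a phrase query for "blue smurf" needs. The analysis page in the Solr admin interface is the place to confirm that on a real install.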

Hi Dave, I hadn't noticed

Posted by katbailey on December 1, 2009 at 4:17pm

Hi Dave,
I hadn't noticed that it had actually stripped my backslashes out when I posted above. Your forwardslashes above should be backslashes, right? That's exactly what my pattern is - the thing is I've added it right into the "text" fieldType definition, I guess I should create a new fieldType definition for this as you have done above. I'll try that.
Thanks so much for taking the time to look at this.

Katherine

oops

Posted by dstuart on December 1, 2009 at 4:43pm

Yep, you're right: backslash.
A really good way of testing is the analysis tool in the Apache Solr admin interface. If you go to http://[servername]:[solr port]/solr/admin/analysis.jsp you can type in either the field name or the field type, enter some sample text, turn on "verbose output" and "highlight matches", and you get a step-by-step trace of what each stage of the analyzer does.
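For reference, here is a sketch of what the filter line presumably looks like with the backslashes restored (they were stripped from the snippets posted earlier in this thread). Without them, the square brackets form a regex character class that matches any single one of the listed characters rather than a whole tag.

```xml
<!-- Presumed intended pattern, with the stripped backslashes put back.
     \[ and \] match literal square brackets; \/? matches an optional slash,
     so both [strong] and [/strong] are removed from each token. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="\[\/?(em|i|b|strong|u)\]"
        replacement="" replace="all"/>
```

Because this is a token filter, it only sees the tokens that the tokenizer and any earlier filters in the chain produce, so where it sits in the analyzer matters.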

Regards,

Dave

I haven't had a chance to

Posted by katbailey on December 4, 2009 at 2:05am

I haven't had a chance to work on this the last couple of days but did try the analysis tool mentioned above and it looks to me like it's telling me my filter is not working, though I'm not sure I'm reading the results correctly. I've been trying to attach some screenshots showing what I'm getting when I enter "[strong]blue[/strong] smurf" as the indexed term and "blue smurf" as the query. To me it looks like my filter isn't ignoring the "strong" in "[strong]".
Anyway g.d.o's captcha has been giving me a seriously hard time... will keep trying to attach the screen shots in the hopes that someone else (Dave?) can make more sense of them than I can and maybe point out where I'm going wrong with this...

Seems crazy to have to do it

Posted by katbailey on December 4, 2009 at 6:21pm

Seems crazy to have to do it like this but hopefully this comment will get through and here are my screenshots:
http://katbailey.net/sites/default/files/ss_apachesolr_1.gif
http://katbailey.net/sites/default/files/ss_apachesolr_2.gif
http://katbailey.net/sites/default/files/ss_apachesolr_3.gif

OK, I found a workaround. I'm

Posted by katbailey on December 4, 2009 at 7:40pm

OK, I found a workaround. I'm just putting the bbcode version of the title in a separate bbtitle field, so that Solr queries on the title field (which has the bbcode stripped), and when we display the titles in results we use the bbtitle field.
Anyway, thanks again Dave for pointing me to that analysis tool :-)
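A minimal sketch of how the two fields might be declared in schema.xml. The field types and the indexed/stored settings are assumptions for illustration, not copied from the actual site:

```xml
<!-- Searched field: receives the title with the bbcode already stripped
     before it is sent to Solr. Type "text" is assumed. -->
<field name="title" type="text" indexed="true" stored="true"/>

<!-- Display-only field: keeps the bbcode version of the title for the results.
     Stored but not indexed is an assumption; it is never queried. -->
<field name="bbtitle" type="string" indexed="false" stored="true"/>
```

The stripping happens on the Drupal side before indexing, so no pattern filter is needed in the analyzer at all.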

Katherine

no problem

Posted by dstuart on December 5, 2009 at 3:41pm

Hi Katherine,

Had a look at your analysis output. In order to get it working, I think you needed to apply the EdgeNGramFilterFactory on top of the pattern, as well as using KeywordTokenizerFactory instead of the whitespace one. Nice to see you have a workaround. If you would like me to take a look at your field type, I can.

Regards,

Dave
