Telling Solr to ignore certain patterns in indexed fields when querying them

katbailey's picture

We need to implement HTML node titles with bbcode (as per http://drupal.org/node/28537) for a client site that's using ApacheSolr. The titles need to be displayed in search results with their HTML intact, so the bbcode version has to get indexed. I need to tell Solr to ignore this code when trying to match queries. For example, if a user searches for "blue smurf" (with the quotation marks), and there's a node with the title "[strong]blue[/strong] smurf" in the index, Solr needs to recognise this as a match. I thought that this was what the PatternReplaceFilterFactory was for and added the following to the query analyzer for the "text" fieldType definition:

<filter class="solr.PatternReplaceFilterFactory" pattern="[\/?(em|i|b|strong|u)]" replacement="" replace="all"/>

... but it's not doing the trick. Maybe I'm misunderstanding how this should work. Anyone familiar with this kind of thing who could give me some guidance?

Comments

Pattern

dstuart's picture

Hey Kat,

I haven't tested it yet but wouldn't your regex need to be

as not escaping the square brackets would cause them to be treated as part of the pattern syntax, as opposed to being actual characters?

Regards,

Dave

Sorry the code got stripped take 2

dstuart's picture

<filter class="solr.PatternReplaceFilterFactory" pattern="[\/?(em|i|b|strong|u)]" replacement="" replace="all"/>

take 3

dstuart's picture

Right, for some reason it is stripping the forward slashes, which means that your code may be right

pattern="forwardslash[\/?(em|i|b|strong|u)forwardslash]"

This seems to work

dstuart's picture

Had a play around and this seems to work, with the added bonus of working with and without strong tags, e.g. "blue smurf" and "blue[/strong] smurf".
Please note that forwardslash means / (just in case).

    <fieldType name="bbTitle" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1" />
        <filter class="solr.PatternReplaceFilterFactory"  pattern="forwardslash[\/?(em|i|b|strong|u)forwardslash]" replacement="" replace="all" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Hi Dave, I hadn't noticed

katbailey's picture

Hi Dave,
I hadn't noticed that it had actually stripped my backslashes out when I posted above. Your forwardslashes above should be backslashes, right? That's exactly what my pattern is - the thing is I've added it right into the "text" fieldType definition, I guess I should create a new fieldType definition for this as you have done above. I'll try that.
Thanks so much for taking the time to look at this.

Katherine

oops

dstuart's picture

Yep, you're right: backslash.
A really good way of testing is using the analysis tool in the Apache Solr interface. If you go to http://[servername]:[solr port]/solr/admin/analysis.jsp you can type in either the field name or the field type, put in some sample text, and turn on "verbose output" and "highlight matches". You then get a step-by-step trace of what each stage of the analyzer does to the text.

Regards,

Dave
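For anyone reading this later: with the backslashes that the forum stripped put back in, the filter discussed above would presumably look like the line below. This is a reconstruction based on the "forwardslash" placeholders and the correction in this comment, not a copy of either poster's actual schema.xml.

```xml
<!-- Reconstructed pattern: \[ and \] match literal square brackets,
     \/? matches an optional slash for closing tags such as [/strong] -->
<filter class="solr.PatternReplaceFilterFactory" pattern="\[\/?(em|i|b|strong|u)\]" replacement="" replace="all"/>
```

Without the backslashes, the square brackets are read as a regex character class, so the pattern matches any single one of the characters listed inside the brackets instead of a whole tag.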

I haven't had a chance to

katbailey's picture

I haven't had a chance to work on this the last couple of days but did try the analysis tool mentioned above and it looks to me like it's telling me my filter is not working, though I'm not sure I'm reading the results correctly. I've been trying to attach some screenshots showing what I'm getting when I enter "[strong]blue[/strong] smurf" as the indexed term and "blue smurf" as the query. To me it looks like my filter isn't ignoring the "strong" in "[strong]".
Anyway g.d.o's captcha has been giving me a seriously hard time... will keep trying to attach the screen shots in the hopes that someone else (Dave?) can make more sense of them than I can and maybe point out where I'm going wrong with this...

Seems crazy to have to do it

OK, I found a workaround. I'm

katbailey's picture

OK, I found a workaround. I'm just putting the bbcode version of the title in a bbtitle field, so that Solr will query on the title field (which has the bbcode stripped), and when we display the titles in results we use the bbtitle field.
Anyway, thanks again Dave for pointing me to that analysis tool :-)

Katherine
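A minimal sketch of what this workaround might look like in schema.xml. The field names follow the comment above, but the types and attributes are assumptions (the "string" type is taken from the stock Solr example schema), and the actual site schema may differ.

```xml
<!-- Hypothetical declarations; adjust names and types to the site's schema.xml -->
<!-- Searched field: holds the title with the bbcode already stripped before indexing -->
<field name="title" type="text" indexed="true" stored="true"/>
<!-- Display-only field: holds the bbcode version, stored but not searched -->
<field name="bbtitle" type="string" indexed="false" stored="true"/>
```

The idea is that queries only ever touch the clean title, while the results page reads the stored bbtitle value and renders it.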

no problem

dstuart's picture

Hi Katherine,

Had a look at your analysis output. In order to get it working, I think you needed to apply the EdgeNGramFilterFactory on top of the pattern filter, as well as using KeywordTokenizerFactory instead of the whitespace one. Nice to see you have a workaround. If you would like me to take a look at your field type, I can.

Regards,

Dave
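One alternative that is not tried in this thread is to strip the tags before tokenization with a char filter rather than a token filter. The sketch below assumes a Solr version that ships solr.PatternReplaceCharFilterFactory (1.4 or later, as far as I know); the field type name text_bbcode is made up for illustration, and the tokenizer and filter choices are just a plain example.

```xml
<!-- Hypothetical field type name; the charFilter must come before the tokenizer -->
<fieldType name="text_bbcode" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Strips tags such as [strong] and [/strong] from the raw text before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\[\/?(em|i|b|strong|u)\]" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because analysis only affects the indexed terms and not the stored value, a stored field of this type should still return the original title with its bbcode for display. It is worth checking the result in the analysis tool mentioned earlier before relying on it.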

Lucene, Nutch and Solr
