Posted by katbailey on December 1, 2009 at 1:21am
We need to implement html node titles with bbcode (as per http://drupal.org/node/28537) for a client site that's using ApacheSolr. The titles need to be displayed in search results with their html intact, so the bbcode version has to get indexed. I need to tell Solr to ignore this code when trying to match queries. For example, if a user searches for "blue smurf" (with the quotation marks), and there's a node with the title "[strong]blue[/strong] smurf" in the index, Solr needs to recognise this as a match. I thought that this was what the PatternReplaceFilterFactory was for and added the following to the query analyzer for the "text" fieldType definition:
<filter class="solr.PatternReplaceFilterFactory" pattern="[\/?(em|i|b|strong|u)]" replacement="" replace="all"/>
... but it's not doing the trick. Maybe I'm misunderstanding how this should work. Is anyone familiar with this kind of thing who could give me some guidance?

Comments
Pattern
Hey Kat,
I haven't tested it yet, but wouldn't your regex need to be
as not escaping the square brackets would cause them to be treated as part of the pattern syntax, as opposed to being actual characters?
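To make the point about unescaped brackets concrete, here is a minimal sketch using Python's re module rather than Solr itself. It assumes Python treats these particular constructs the same way as the Java regex engine Solr uses, which I believe is the case for character classes and escaped brackets. The sample title is the one from the original post.

```python
import re

title = "[strong]blue[/strong] smurf"

# Unescaped brackets: the whole pattern is ONE character class, so it
# matches any single character from the set / ? ( ) | e m i b s t r o n g u.
unescaped = re.compile(r"[\/?(em|i|b|strong|u)]")

# Escaped brackets: matches a literal "[", an optional "/", one of the
# listed tag names, then a literal "]".
escaped = re.compile(r"\[\/?(em|i|b|strong|u)\]")

stripped_chars = unescaped.sub("", title)
stripped_tags = escaped.sub("", title)

print(repr(stripped_chars))  # '[]l[] f'
print(repr(stripped_tags))   # 'blue smurf'
```

So with the brackets unescaped, the filter would eat letters out of the title itself instead of removing the tags.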
Regards,
Dave
Sorry the code got stripped take 2
<filter class="solr.PatternReplaceFilterFactory" pattern="[\/?(em|i|b|strong|u)]" replacement="" replace="all"/>
take 3
Right, for some reason it is stripping the forward slashes, which means that your code may be right.
pattern="forwardslash[\/?(em|i|b|strong|u)forwardslash]"
This seems to work
Had a play around and this seems to work, with the added bonus of working with and without strong tags, e.g. "blue smurf" and "blue[/strong] smurf".
Please note forwardslash means / (just in case).
<fieldType name="bbTitle" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1" />
<filter class="solr.PatternReplaceFilterFactory" pattern="forwardslash[\/?(em|i|b|strong|u)forwardslash]" replacement="" replace="all" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
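As a rough illustration of what the index analyzer above does to a title, here is a small simulation in Python. It is only a sketch, not Solr: it assumes the pattern has its square brackets escaped with backslashes, that the edge n-grams are taken from the front of the token, and that empty tokens are dropped. The function name index_tokens is made up for the example.

```python
import re

# Assumed form of the pattern once the brackets are properly escaped.
TAG = re.compile(r"\[/?(em|i|b|strong|u)\]")

def index_tokens(title, min_gram=1, max_gram=100):
    """Mimic the index chain: keyword tokenizer, lowercase,
    front edge n-grams, then tag stripping on each gram."""
    whole = title.lower()  # keyword tokenizer keeps the title as one token
    longest = min(max_gram, len(whole))
    grams = (whole[:n] for n in range(min_gram, longest + 1))
    stripped = (TAG.sub("", gram) for gram in grams)
    return {token for token in stripped if token}  # drop empty tokens

tokens = index_tokens("[strong]blue[/strong] smurf")

# The query analyzer only lowercases, so the query is compared as one token.
print("blue smurf" in tokens)  # True
print("blue s" in tokens)      # True (prefix match from the edge n-grams)
print("smurf" in tokens)       # False (only prefixes of the title are indexed)
```

If the simulation is faithful, this field type behaves like a prefix match on the whole title rather than a word-by-word text search.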
Hi Dave, I hadn't noticed
Hi Dave,
I hadn't noticed that it had actually stripped my backslashes out when I posted above. Your forwardslashes above should be backslashes, right? That's exactly what my pattern is. The thing is, I've added it right into the "text" fieldType definition; I guess I should create a new fieldType definition for this, as you have done above. I'll try that.
Thanks so much for taking the time to look at this.
Katherine
oops
Yep, you're right: backslash.
A really good way of testing is the analysis tool in the Apache Solr admin interface. If you go to http://[servername]:[solr port]/solr/admin/analysis.jsp you can type in either the field name or the field type, put in some sample text, and turn on "verbose output" and "highlight matches"; you then get a step-by-step trace of what each stage of the analyzer does.
Regards,
Dave
I haven't had a chance to
I haven't had a chance to work on this the last couple of days but did try the analysis tool mentioned above and it looks to me like it's telling me my filter is not working, though I'm not sure I'm reading the results correctly. I've been trying to attach some screenshots showing what I'm getting when I enter "[strong]blue[/strong] smurf" as the indexed term and "blue smurf" as the query. To me it looks like my filter isn't ignoring the "strong" in "[strong]".
Anyway, g.d.o's captcha has been giving me a seriously hard time... will keep trying to attach the screenshots in the hope that someone else (Dave?) can make more sense of them than I can and maybe point out where I'm going wrong with this...
Seems crazy to have to do it
Seems crazy to have to do it like this but hopefully this comment will get through and here are my screenshots:
http://katbailey.net/sites/default/files/ss_apachesolr_1.gif
http://katbailey.net/sites/default/files/ss_apachesolr_2.gif
http://katbailey.net/sites/default/files/ss_apachesolr_3.gif
OK, I found a workaround. I'm
OK, I found a workaround. I'm just putting the bbcode version of the title in a separate bbtitle field, so that Solr will query on the title field (which has the bbcode stripped), and when we display the titles in results we use the bbtitle field.
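In case it helps anyone else, the idea can be sketched like this. It is Python purely for illustration (on the actual site this would happen on the Drupal side when the document is built for indexing). The function name build_solr_doc, the "id" field and the sample node id are made up for the example; only the title and bbtitle fields come from the setup described above, and the tag list is the one from the pattern earlier in the thread.

```python
import re

# Same small set of bbcode tags as in the filter pattern discussed above.
BBCODE_TAG = re.compile(r"\[/?(em|i|b|strong|u)\]", re.IGNORECASE)

def build_solr_doc(node_id, bb_title):
    """Return a dict with a searchable title and a display-only bbtitle."""
    return {
        "id": node_id,                            # hypothetical identifier
        "title": BBCODE_TAG.sub("", bb_title),    # queried by Solr
        "bbtitle": bb_title,                      # shown in search results
    }

doc = build_solr_doc("node/1", "[strong]blue[/strong] smurf")
print(doc["title"])    # blue smurf
print(doc["bbtitle"])  # [strong]blue[/strong] smurf
```

This keeps the stock "text" analysis untouched, since the markup never reaches the field that gets searched.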
Anyway, thanks again Dave for pointing me to that analysis tool :-)
Katherine
no problem
Hi Katherine,
Had a look at your analysis output. In order to get it working, I think you needed to apply the EdgeNGramFilterFactory on top of the pattern, as well as using KeywordTokenizerFactory instead of the whitespace one. Nice to see you have a workaround. If you would like me to take a look at your field type, I can.
Regards,
Dave