DelimitedBoostTokenFilterFactory Issue - Boosting and BooleanSimilarity
Hi there,

I'm developing a custom Java application with Lucene 8.5.0.

I've tried to use DelimitedBoostTokenFilterFactory, but I've run into a
problem. Please let me know if I'm doing something wrong.

When I use BM25Similarity with the delimitedBoost filter, everything works as
expected, but if I switch to BooleanSimilarity, the boosts have no effect.

The parsed query looks OK. It contains the synonyms with the proper boost
values, but the final score doesn't change.

I'm using StandardAnalyzer for search, and my SynonymGraphFilter uses a
default configuration:

Map<String, String> synonymParam = new HashMap<>();
synonymParam.put("synonyms", synonymFileName);
synonymParam.put("ignoreCase", "true");
synonymParam.put("format", "solr");
synonymParam.put("expand", "true");
synonymParam.put("tokenizerFactory", "org.apache.lucene.analysis.core.WhitespaceTokenizerFactory");

Map<String, String> delimitedBoostTokenFilterMap = new HashMap<>();
delimitedBoostTokenFilterMap.put("delimiter", "|");

Analyzer customAnalyzer = CustomAnalyzer.builder(Paths.get(synonymFolder))
    .withTokenizer(StandardTokenizerFactory.NAME)
    .addTokenFilter(SynonymGraphFilterFactory.NAME, synonymParam)
    .addTokenFilter(DelimitedBoostTokenFilterFactory.NAME, delimitedBoostTokenFilterMap)
    .build();


Here’s my debug output and some additional info:

Query: +Synonym(morphology_term_original_name_key:neoplasm^0.7
    morphology_term_original_name_key:tumor^0.8
    morphology_term_original_name_key:tumour^0.6)

1.0 = weight(Synonym(morphology_term_original_name:neoplasm^0.7
    morphology_term_original_name:tumor^0.8
    morphology_term_original_name:tumour^0.6) in 0) [BooleanSimilarity], result of:
  1.0 = score(BooleanWeight), computed from:
    1.0 = boost, query boost

If I use the BM25Similarity, the printout is as follows:

0.75188845 = weight(Synonym(morphology_term_original_name:neoplasm^0.7
    morphology_term_original_name:tumor^0.8
    morphology_term_original_name:tumour^0.6) in 0) [BM25Similarity], result of:
  0.75188845 = score(freq=0.8), computed as boost * idf * tf from:
    1.3862944 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
      1 = n, number of documents containing term
      5 = N, total number of documents with field
    0.5423729 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
      0.8 = termFreq=0.8
      1.2 = k1, term saturation parameter
      0.75 = b, length normalization parameter
      1.0 = dl, length of field
      2.4 = avgdl, average length of field
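
As a sanity check, the BM25 figures above can be recomputed by hand from the two formulas printed in the explain output. The sketch below is plain Java with no Lucene dependency. The class and method names are hypothetical, and the boost of 1.0 is an assumption: the BM25 output lists no separate boost line, and 1.0 is the value that reproduces the reported score.

```java
// Recomputes the BM25 explain figures quoted above (illustrative only, not Lucene code).
final class Bm25ExplainCheck {

    // idf = log(1 + (N - n + 0.5) / (n + 0.5)), as printed in the explain output.
    static double idf(long n, long totalDocs) {
        return Math.log(1 + (totalDocs - n + 0.5) / (n + 0.5));
    }

    // tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)), as printed in the explain output.
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    // score = boost * idf * tf, as printed in the explain output.
    static double score(double boost, double idf, double tf) {
        return boost * idf * tf;
    }

    public static void main(String[] args) {
        double idfValue = idf(1, 5);                     // n = 1, N = 5
        double tfValue = tf(0.8, 1.2, 0.75, 1.0, 2.4);   // termFreq = 0.8 from the output
        double boost = 1.0;                              // assumption: not shown in the output
        System.out.printf("idf=%.7f tf=%.7f score=%.7f%n",
                idfValue, tfValue, score(boost, idfValue, tfValue));
    }
}
```

With these inputs, the result agrees with the reported 0.75188845 to within float rounding. The value 0.8 enters the calculation only as termFreq, whereas the BooleanSimilarity output shows only the query boost and no frequency term.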

Thanks in advance!

Ivana