Hi folks,
I'm working with some fuzzy queries and trying to understand the expected
behaviour of the searcher. I'm not sure whether this is a similarity bug
or incorrect usage on my end.
The problem: when I run a fuzzy search for the term "spark~", documents
containing "spark" are not ranked first. Instead, documents that contain
several other near terms, such as "spar" and "spars", rank higher. I see
the same thing with both ClassicSimilarity and BM25.
The example here is from a much smaller (two-document) index that I built
to isolate and reproduce the issue, but I see comparable behaviour, with
more varied scoring, on a much larger corpus. The two documents are:
addDoc("spark spark", writer); // exact match
addDoc("spar spars", writer); // multiple fuzzy terms
Terms with a non-zero edit distance get a slight down-boost, but it is not
enough: the sum of their scores still exceeds the score of the desired
document, even with its TF boost.
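To make the down-boost concrete, here is a minimal sketch of how it appears to be derived. The formula (1 - editDistance / min(query term length, matched term length)) is my assumption; it is consistent with the 0.75 and 0.8 boosts in the score explanations further down, but I have not confirmed it against every Lucene version. The class and method names are made up for illustration and are not Lucene APIs.

```java
public class FuzzyBoostSketch {

    // Plain Levenshtein distance (insert, delete, substitute) using two rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed boost formula: 1 - distance / min(query length, matched length).
    // An exact match has distance 0 and therefore keeps a boost of 1.0.
    static float fuzzyBoost(String queryTerm, String matchedTerm) {
        int distance = editDistance(queryTerm, matchedTerm);
        int minLength = Math.min(queryTerm.length(), matchedTerm.length());
        return 1.0f - (float) distance / (float) minLength;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"spark", "spar", "spars"}) {
            System.out.println(term + " -> boost " + fuzzyBoost("spark", term));
        }
    }
}
```

Under this assumption a single edit on a four- or five-letter term only costs 20 to 25 percent of the score, so two such terms together easily outweigh one exact term.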
A full reproducible unit test is at
https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546
What is the recommended approach to rank the document with the exact term
match first again? One idea I had was to tweak the internal boost applied
by FuzzyQuery, but I don't see an option for that. Or is this something
that needs to be fixed at the Lucene level rather than the application
level?
Thanks,
Mike
More detail:
The first document with the field "spark spark" has a score explanation:
1.4054651 = sum of:
  1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
    1.4054651 = score(freq=2.0), product of:
      1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        1 = docFreq, number of documents containing term
        2 = docCount, total number of documents with field
      1.4142135 = tf(freq=2.0), with freq of:
        2.0 = freq, occurrences of term within document
      0.70710677 = fieldNorm
The second document, with the field "spar spars", comes in slightly higher
at:
1.5404116 = sum of:
  0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
    0.74536043 = score(freq=1.0), product of:
      0.75 = boost
      1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        1 = docFreq, number of documents containing term
        2 = docCount, total number of documents with field
      1.0 = tf(freq=1.0), with freq of:
        1.0 = freq, occurrences of term within document
      0.70710677 = fieldNorm
  0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
    0.79505116 = score(freq=1.0), product of:
      0.8 = boost
      1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        1 = docFreq, number of documents containing term
        2 = docCount, total number of documents with field
      1.0 = tf(freq=1.0), with freq of:
        1.0 = freq, occurrences of term within document
      0.70710677 = fieldNorm
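
The arithmetic in the two explanations can be replayed outside Lucene. The sketch below only re-multiplies the factors printed above (boost, idf, tf as the square root of freq, and fieldNorm); it does not call any Lucene code, and the class and method names are hypothetical.

```java
public class ClassicScoreSketch {

    // idf exactly as printed in the explanation: log((docCount+1)/(docFreq+1)) + 1
    static double idf(int docFreq, int docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    // ClassicSimilarity tf: square root of the term frequency.
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Score of one matched term: boost * idf * tf * fieldNorm.
    static double termScore(double boost, int docFreq, int docCount,
                            double freq, double fieldNorm) {
        return boost * idf(docFreq, docCount) * tf(freq) * fieldNorm;
    }

    // Document 0, "spark spark": one exact term with freq 2 and boost 1.0.
    static double exactDocScore(double fieldNorm) {
        return termScore(1.0, 1, 2, 2.0, fieldNorm);
    }

    // Document 1, "spar spars": two fuzzy terms, freq 1 each, boosts 0.75 and 0.8.
    static double fuzzyDocScore(double fieldNorm) {
        return termScore(0.75, 1, 2, 1.0, fieldNorm)
             + termScore(0.8, 1, 2, 1.0, fieldNorm);
    }

    public static void main(String[] args) {
        double fieldNorm = 0.70710677; // value printed in both explanations
        System.out.println("spark spark: " + exactDocScore(fieldNorm));
        System.out.println("spar spars:  " + fuzzyDocScore(fieldNorm));
    }
}
```

Since idf and fieldNorm are identical for all three terms in this index, the comparison reduces to sqrt(2) (about 1.414) for the exact document against 0.75 + 0.8 = 1.55 for the fuzzy one, which is why the fuzzy document wins.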