Mailing List Archive: Fuzzy Query Similarity

Fuzzy Query Similarity

Jul 8, 2022, 1:14 PM

Post #1 of 7 (590 views)

Hi folks,

I'm working with some fuzzy queries and trying my best to understand what
is the expected behaviour of the searcher. I'm not sure if this is a
similarity bug or an incorrect usage on my end.

The problem is when I do a fuzzy search for a term "spark~" then instead of
matching documents with spark first, it will match other documents that
have multiple other near terms like "spar" and "spars". I see this same
thing with both ClassicSimilarity and BM25.

This is from a much smaller (two document) index when I was trying to
isolate and reproduce the issue, but I see comparable behaviour with more
varied scoring on a much larger corpus. The two documents are:

addDoc("spark spark", writer); // exact match

addDoc("spar spars", writer); // multiple fuzzy terms

The non-zero edit distance terms get a slight down-boost, but it's not
enough to overcome their sum exceeding even the TF boost for the desired
document.

A full reproducible unit test is at
https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546

What is the recommended approach to get the document with exact term
matching for me again? I don't see an option to tweak the internal boost
provided by FuzzyQuery, that's one idea I had. Or is this a different
change that needs to be fixed at the lucene level rather than application
level?

Thanks,
Mike

More detail:

The first document with the field "spark spark" has a score explanation:

1.4054651 = sum of:
1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
1.4054651 = score(freq=2.0), product of:
1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
1 = docFreq, number of documents containing term
2 = docCount, total number of documents with field
1.4142135 = tf(freq=2.0), with freq of:
2.0 = freq, occurrences of term within document
0.70710677 = fieldNorm

And a document with the field "spar spars" comes in ever so slightly higher
at

1.5404116 = sum of:
0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
0.74536043 = score(freq=1.0), product of:
0.75 = boost
1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
1 = docFreq, number of documents containing term
2 = docCount, total number of documents with field
1.0 = tf(freq=1.0), with freq of:
1.0 = freq, occurrences of term within document
0.70710677 = fieldNorm
0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
0.79505116 = score(freq=1.0), product of:
0.8 = boost
1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
1 = docFreq, number of documents containing term
2 = docCount, total number of documents with field
1.0 = tf(freq=1.0), with freq of:
1.0 = freq, occurrences of term within document
0.70710677 = fieldNorm

Re: Fuzzy Query Similarity [ In reply to ]

msokolov at gmail

Jul 9, 2022, 5:08 AM

Post #2 of 7 (587 views)

I am no expert with this, but I got curious and looked at
FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
matches, or even to incorporate the edit distance more generally into
the per-term score, although it does seem like that would be something
people would generally expect. So maybe FuzzyQuery should somehow do
that? But without changing it, you could also use a query that does it
explicitly; if you get a term "foo", you could maybe search for "foo
OR foo~" ?

On Fri, Jul 8, 2022 at 4:14 PM Mike Drob <mdrob@mdrob.com> wrote:
>
> Hi folks,
>
> I'm working with some fuzzy queries and trying my best to understand what
> is the expected behaviour of the searcher. I'm not sure if this is a
> similarity bug or an incorrect usage on my end.
>
> The problem is when I do a fuzzy search for a term "spark~" then instead of
> matching documents with spark first, it will match other documents that
> have multiple other near terms like "spar" and "spars". I see this same
> thing with both ClassicSimilarity and BM25.
>
> This is from a much smaller (two document) index when I was trying to
> isolate and reproduce the issue, but I see comparable behaviour with more
> varied scoring on a much larger corpus. The two documents are:
>
> addDoc("spark spark", writer); // exact match
>
> addDoc("spar spars", writer); // multiple fuzzy terms
>
> The non-zero edit distance terms get a slight down-boost, but it's not
> enough to overcome their sum exceeding even the TF boost for the desired
> document.
>
> A full reproducible unit test is at
> https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546
>
> What is the recommended approach to get the document with exact term
> matching for me again? I don't see an option to tweak the internal boost
> provided by FuzzyQuery, that's one idea I had. Or is this a different
> change that needs to be fixed at the lucene level rather than application
> level?
>
> Thanks,
> Mike
>
>
>
> More detail:
>
>
> The first document with the field "spark spark" has a score explanation:
>
> 1.4054651 = sum of:
> 1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
> 1.4054651 = score(freq=2.0), product of:
> 1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
> 1 = docFreq, number of documents containing term
> 2 = docCount, total number of documents with field
> 1.4142135 = tf(freq=2.0), with freq of:
> 2.0 = freq, occurrences of term within document
> 0.70710677 = fieldNorm
>
> And a document with the field "spar spars" comes in ever so slightly higher
> at
>
> 1.5404116 = sum of:
> 0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
> 0.74536043 = score(freq=1.0), product of:
> 0.75 = boost
> 1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
> 1 = docFreq, number of documents containing term
> 2 = docCount, total number of documents with field
> 1.0 = tf(freq=1.0), with freq of:
> 1.0 = freq, occurrences of term within document
> 0.70710677 = fieldNorm
> 0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
> 0.79505116 = score(freq=1.0), product of:
> 0.8 = boost
> 1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
> 1 = docFreq, number of documents containing term
> 2 = docCount, total number of documents with field
> 1.0 = tf(freq=1.0), with freq of:
> 1.0 = freq, occurrences of term within document
> 0.70710677 = fieldNorm

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Fuzzy Query Similarity [ In reply to ]

uwe at thetaphi

Jul 9, 2022, 8:49 AM

Post #3 of 7 (587 views)

The problem is that the query combines the native termquery score (which
depends on length of document and term's statistic). The edit distance
is also multiplied in. When the difference in term statistics is too
large, the edit distance no longer matters. This is perfectly fine and
also happens with other types of queries. When you have seldom terms in
small documents, those matches will always come up. This is also a
problem if you for example boost cheaper products to the top.

If you are only interested in the query distance, you should configure
IndexSearcher to use BooleanSimilarity - in that case it will ignore the
term statistics and disable norms on the field (during indexing or with
a wrapper on the IndexReader):
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/similarities/BooleanSimilarity.html

You can do this only for a specific field:
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/similarities/PerFieldSimilarityWrapper.html

Uwe

Am 09.07.2022 um 14:08 schrieb Michael Sokolov:
> I am no expert with this, but I got curious and looked at
> FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> matches, or even to incorporate the edit distance more generally into
> the per-term score, although it does seem like that would be something
> people would generally expect. So maybe FuzzyQuery should somehow do
> that? But without changing it, you could also use a query that does it
> explicitly; if you get a term "foo", you could maybe search for "foo
> OR foo~" ?
>
> On Fri, Jul 8, 2022 at 4:14 PM Mike Drob <mdrob@mdrob.com> wrote:
>> Hi folks,
>>
>> I'm working with some fuzzy queries and trying my best to understand what
>> is the expected behaviour of the searcher. I'm not sure if this is a
>> similarity bug or an incorrect usage on my end.
>>
>> The problem is when I do a fuzzy search for a term "spark~" then instead of
>> matching documents with spark first, it will match other documents that
>> have multiple other near terms like "spar" and "spars". I see this same
>> thing with both ClassicSimilarity and BM25.
>>
>> This is from a much smaller (two document) index when I was trying to
>> isolate and reproduce the issue, but I see comparable behaviour with more
>> varied scoring on a much larger corpus. The two documents are:
>>
>> addDoc("spark spark", writer); // exact match
>>
>> addDoc("spar spars", writer); // multiple fuzzy terms
>>
>> The non-zero edit distance terms get a slight down-boost, but it's not
>> enough to overcome their sum exceeding even the TF boost for the desired
>> document.
>>
>> A full reproducible unit test is at
>> https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546
>>
>> What is the recommended approach to get the document with exact term
>> matching for me again? I don't see an option to tweak the internal boost
>> provided by FuzzyQuery, that's one idea I had. Or is this a different
>> change that needs to be fixed at the lucene level rather than application
>> level?
>>
>> Thanks,
>> Mike
>>
>>
>>
>> More detail:
>>
>>
>> The first document with the field "spark spark" has a score explanation:
>>
>> 1.4054651 = sum of:
>> 1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
>> 1.4054651 = score(freq=2.0), product of:
>> 1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>> 1 = docFreq, number of documents containing term
>> 2 = docCount, total number of documents with field
>> 1.4142135 = tf(freq=2.0), with freq of:
>> 2.0 = freq, occurrences of term within document
>> 0.70710677 = fieldNorm
>>
>> And a document with the field "spar spars" comes in ever so slightly higher
>> at
>>
>> 1.5404116 = sum of:
>> 0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
>> 0.74536043 = score(freq=1.0), product of:
>> 0.75 = boost
>> 1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>> 1 = docFreq, number of documents containing term
>> 2 = docCount, total number of documents with field
>> 1.0 = tf(freq=1.0), with freq of:
>> 1.0 = freq, occurrences of term within document
>> 0.70710677 = fieldNorm
>> 0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
>> 0.79505116 = score(freq=1.0), product of:
>> 0.8 = boost
>> 1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>> 1 = docFreq, number of documents containing term
>> 2 = docCount, total number of documents with field
>> 1.0 = tf(freq=1.0), with freq of:
>> 1.0 = freq, occurrences of term within document
>> 0.70710677 = fieldNorm
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Fuzzy Query Similarity [ In reply to ]

uwe at thetaphi

Jul 9, 2022, 9:23 AM

Post #4 of 7 (587 views)

Hi
> FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> matches, or even to incorporate the edit distance more generally into
> the per-term score, although it does seem like that would be something
> people would generally expect.

Actually it does this:

* By default FuzzyQuery uses a rewrite method that expands all terms
as should clauses into a boolean query:
MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
* TopTermsReqrite basically keeps track of a "boost" factor for each
term and sorts the "best" terms in a PQ:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
* For each collected term the term enumeration sets a boost (1.0 for
exact match):
https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256

So in short the exact term gets a boost factor of 1 in the resulting
term query, all other terms a lower one.

Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:uwe@thetaphi.de

Re: Fuzzy Query Similarity [ In reply to ]

msokolov at gmail

Jul 9, 2022, 10:07 AM

Post #5 of 7 (587 views)

Oh good! Thanks for clarifying, Uwe

On Sat, Jul 9, 2022, 12:23 PM Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi
> > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > matches, or even to incorporate the edit distance more generally into
> > the per-term score, although it does seem like that would be something
> > people would generally expect.
>
> Actually it does this:
>
> * By default FuzzyQuery uses a rewrite method that expands all terms
> as should clauses into a boolean query:
> MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
> * TopTermsReqrite basically keeps track of a "boost" factor for each
> term and sorts the "best" terms in a PQ:
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
> * For each collected term the term enumeration sets a boost (1.0 for
> exact match):
>
> https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256
>
> So in short the exact term gets a boost factor of 1 in the resulting
> term query, all other terms a lower one.
>
> Uwe
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail:uwe@thetaphi.de
>

Re: Fuzzy Query Similarity [ In reply to ]

mdrob at apache

Jul 11, 2022, 1:36 PM

Post #6 of 7 (580 views)

Hi Uwe, thanks for all the pointers!

I tried using BooleanSimilarity and the resulting scores were even more divergent! 1.0 for the exact match vs 1.55 (= 0.8 + 0.75) for the multiple terms that were close. Which makes sense with ignoring TF but still doesn't help me down-boost the other terms.

On 2022/07/09 16:23:37 Uwe Schindler wrote:
> Hi
> > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > matches, or even to incorporate the edit distance more generally into
> > the per-term score, although it does seem like that would be something
> > people would generally expect.
>
> Actually it does this:
>
> * By default FuzzyQuery uses a rewrite method that expands all terms
> as should clauses into a boolean query:
> MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
> * TopTermsReqrite basically keeps track of a "boost" factor for each
> term and sorts the "best" terms in a PQ:
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
> * For each collected term the term enumeration sets a boost (1.0 for
> exact match):
> https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256
>

Thanks for the link to this calculation. I spent a long time trying to find it but kept missing.

There's some interesting things happening here by making longer terms more similar. Starting from "spark" we say that "spar" is 75% similar because it's a 4 character term that needs a single edit (1/4) and "spare" is 80% similar because it's a 5 character term with a single edit (1/5). I don't have enough information yet to say if this is expected in the application or not, but it explains how we get the scores so there's something satisfying about at least that bit.

As a hacky idea, I tried changing the boost in FuzzyTermsEnum from that computed similarity to it squared, which worked for this exact case but didn't keep up with adding a third fuzzy term to that competing document.

After thinking about this more, I suspect that what I really want is for FuzzyQuery to score as the max of any of the matching terms, rather than the sum? This would be a big change though. I don't know that it's fair for multiple approximate matches to outweigh a single exact match here. We get so close to what I need with TestFuzzyQuery.testSingleQueryExactMatchScoresHighest but it doesn't quite make it all the way.

What do you think?

> So in short the exact term gets a boost factor of 1 in the resulting
> term query, all other terms a lower one.
>
> Uwe
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail:uwe@thetaphi.de
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Fuzzy Query Similarity [ In reply to ]

mdrob at apache

Jul 12, 2022, 9:22 AM

Post #7 of 7 (576 views)

On Mon, Jul 11, 2022 at 3:36 PM Mike Drob <mdrob@apache.org> wrote:

> Hi Uwe, thanks for all the pointers!
>
> I tried using BooleanSimilarity and the resulting scores were even more
> divergent! 1.0 for the exact match vs 1.55 (= 0.8 + 0.75) for the multiple
> terms that were close. Which makes sense with ignoring TF but still doesn't
> help me down-boost the other terms.
>
> On 2022/07/09 16:23:37 Uwe Schindler wrote:
> > Hi
> > > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > > matches, or even to incorporate the edit distance more generally into
> > > the per-term score, although it does seem like that would be something
> > > people would generally expect.
> >
> > Actually it does this:
> >
> > * By default FuzzyQuery uses a rewrite method that expands all terms
> > as should clauses into a boolean query:
> > MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
> > * TopTermsReqrite basically keeps track of a "boost" factor for each
> > term and sorts the "best" terms in a PQ:
> >
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
> > * For each collected term the term enumeration sets a boost (1.0 for
> > exact match):
> >
> https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256
> >
>
> Thanks for the link to this calculation. I spent a long time trying to
> find it but kept missing.
>
> There's some interesting things happening here by making longer terms more
> similar. Starting from "spark" we say that "spar" is 75% similar because
> it's a 4 character term that needs a single edit (1/4) and "spare" is 80%
> similar because it's a 5 character term with a single edit (1/5). I don't
> have enough information yet to say if this is expected in the application
> or not, but it explains how we get the scores so there's something
> satisfying about at least that bit.
>
> As a hacky idea, I tried changing the boost in FuzzyTermsEnum from that
> computed similarity to it squared, which worked for this exact case but
> didn't keep up with adding a third fuzzy term to that competing document.
>
> After thinking about this more, I suspect that what I really want is for
> FuzzyQuery to score as the max of any of the matching terms, rather than
> the sum? This would be a big change though. I don't know that it's fair for
> multiple approximate matches to outweigh a single exact match here. We get
> so close to what I need with
> TestFuzzyQuery.testSingleQueryExactMatchScoresHighest but it doesn't quite
> make it all the way.
>
> It looks like if I remove the hard coded use of Boolean RewriteMethod and
let it fall back to the default Disjunction Max I get the behaviour that I
want.
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MultiTermQuery.java#L184

What are the use cases where we need a summation of the scores instead of
taking the max?

> What do you think?
>
> > So in short the exact term gets a boost factor of 1 in the resulting
> > term query, all other terms a lower one.
> >
> > Uwe
> >
> > --
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail:uwe@thetaphi.de
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>