Mailing List Archive: Fuzzy Search Scoring Adjustment

When performing a fuzzy search inside a BooleanQuery, it looks like the
default behavior is to score all fuzzy matches separately and then sum them
up to get an aggregate score. However, I need it to instead score based on
the maximum of each distinct match it might find, rather than the sum of
them, to avoid overly inflated scores in some circumstances.

For example, consider a query for "Bstn~2" and four documents containing
"Boston", "Basin", "Boston Basin", and "Boston Boston Basin". The query
might respectively score them as 1, 1, 2, and 3 (or something like that,
depending on the scorer used, of course). However, I need it to instead
score them as 1, 1, 1, and 2, since that's the count of just the most
frequent unique fuzzy match in each document.
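
To make the arithmetic above concrete, here is a small plain-Java sketch (no
Lucene involved). It assumes that "Bstn~2" expands to exactly the two terms
"boston" and "basin", and it uses raw term counts as a stand-in for real
scores. The class and method names are hypothetical and exist only for this
illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class FuzzyScoreDemo {
    // Assumed expansion of "Bstn~2"; a real index may yield other terms.
    static final List<String> EXPANSIONS = Arrays.asList("boston", "basin");

    // Counts occurrences of each expanded term in a whitespace-tokenized document.
    static Map<String, Integer> termCounts(String doc) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (EXPANSIONS.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Sum over all matching terms: the behavior described as the default.
    static int sumScore(String doc) {
        return termCounts(doc).values().stream().mapToInt(Integer::intValue).sum();
    }

    // Maximum over the matching terms: the behavior being asked for.
    static int maxScore(String doc) {
        return termCounts(doc).values().stream().mapToInt(Integer::intValue).max().orElse(0);
    }

    public static void main(String[] args) {
        for (String doc : Arrays.asList("Boston", "Basin", "Boston Basin", "Boston Boston Basin")) {
            System.out.println(doc + ": sum=" + sumScore(doc) + ", max=" + maxScore(doc));
        }
    }
}
```

Under these assumptions the sum column gives 1, 1, 2, 3 and the max column
gives 1, 1, 1, 2 for the four documents.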

Ideally I'd like to use a built-in mechanism for this. If none is
available, extending the BooleanQuery, BooleanWeight, and/or BooleanScorer
classes to use slightly different scoring logic, while otherwise behaving
exactly the same, would also work. However, as near as I can tell, all of
those are either final classes or have no public constructor, which
effectively makes it impossible to reuse their logic directly.

If anyone has any ideas of how to approach this, it would be very helpful.

Thanks,
Kainoa

You can create a different RewriteMethod for MultiTermQuery (see the default used by FuzzyQuery). The RewriteMethod is what converts the FuzzyQuery into a BooleanQuery on rewrite. To get the behavior you want, create a subclass of RewriteMethod that collects the clauses in a DisjunctionMaxQuery instead of a BooleanQuery.

Subclass this abstract one:
https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/search/TopTermsRewrite.html

...and set it as the RewriteMethod on the FuzzyQuery. Use one of the existing subclasses as an example and adapt it for DisjunctionMaxQuery.
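
A minimal sketch of what such a subclass might look like, written against the Lucene 8.x API linked above and modeled on the existing TopTermsScoringBooleanQueryRewrite. The class name TopTermsDisMaxRewrite and the helper fuzzyMax are hypothetical, the tie-breaker of 0.0f is an assumption (pure maximum, no contribution from the other matches), and unlike FuzzyQuery's default rewrite this version does not blend document frequencies. Treat it as a starting point rather than a tested implementation; other Lucene versions may differ.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermStates;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopTermsRewrite;

// Hypothetical rewrite method: collects the top expanded terms into a
// DisjunctionMaxQuery so a document is scored by its best-matching term.
class TopTermsDisMaxRewrite extends TopTermsRewrite<List<Query>> {

    TopTermsDisMaxRewrite(int size) {
        super(size);
    }

    @Override
    protected int getMaxSize() {
        // Same upper bound the BooleanQuery-based rewrites use.
        return BooleanQuery.getMaxClauseCount();
    }

    @Override
    protected List<Query> getTopLevelBuilder() {
        // A plain list serves as the "builder" for the disjuncts.
        return new ArrayList<>();
    }

    @Override
    protected void addClause(List<Query> topLevel, Term term, int docCount,
                             float boost, TermStates states) {
        // One boosted TermQuery per expanded term, as in the existing subclasses.
        topLevel.add(new BoostQuery(new TermQuery(term, states), boost));
    }

    @Override
    protected Query build(List<Query> disjuncts) {
        // Tie-breaker 0.0f (assumed): score is the maximum over the disjuncts.
        return new DisjunctionMaxQuery(disjuncts, 0.0f);
    }

    // Hypothetical helper: builds a FuzzyQuery that uses this rewrite method.
    static FuzzyQuery fuzzyMax(String field, String text, int maxEdits) {
        FuzzyQuery query = new FuzzyQuery(new Term(field, text), maxEdits);
        query.setRewriteMethod(new TopTermsDisMaxRewrite(FuzzyQuery.defaultMaxExpansions));
        return query;
    }
}
```

With this in place, something like TopTermsDisMaxRewrite.fuzzyMax("body", "bstn", 2) could be added to the outer BooleanQuery in place of the plain FuzzyQuery. The field name "body" is only an example.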

Uwe

On September 23, 2020 5:58:29 PM UTC, "Eastlack, Kainoa" <keastlack@novetta.com> wrote:
>When performing a fuzzy search inside a BooleanQuery, it looks like the
>default behavior is to score all fuzzy matches separately and then sum them
>up to get an aggregate score. However, I need it to instead score based on
>the maximum of each distinct match it might find, rather than the sum of
>them, to avoid overly inflated scores in some circumstances.
>
>For example, consider a query for "Bstn~2" and four documents containing
>"Boston", "Basin", "Boston Basin", and "Boston Boston Basin". The query
>might respectively score them as 1, 1, 2, and 3 (or something like that,
>depending on the scorer used, of course). However, I need it to instead
>score them as 1, 1, 1, and 2, since that's the count of just the most
>frequent unique fuzzy match in each document.
>
>Ideally I'd like to use a built in mechanism for achieving this, but if
>it's not available, a way to extend the BooleanQuery, BooleanWeight, and/or
>BooleanScorer classes to have slightly different scoring logic but
>otherwise function exactly the same would also work, but all of those are
>either final classes or have no public constructor, effectively making it
>impossible to reuse their logic directly, as near as I can tell.
>
>If anyone has any ideas of how to approach this, it would be very helpful.
>
>Thanks,
>Kainoa

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de