Mailing List Archive: question about impacts use case

Hi, I've been working on seeing whether we can make use of impacts in
Amazon search and I have some questions. To date, we haven't used
Lucene's scoring APIs at all; all of our queries are constant score,
we early terminate based on a sorted index rank and then re-rank using
custom non-Lucene ranking models. There is now an opportunity (some
early ranking models have gotten simplified) for us to move some of
the ranking workload into Lucene where we should be able to benefit
from skipping hits via impacts.

I'm struggling with a typical query (not our actual setup, but
illustrates the functional gap) that is an OR-query something like:

title:Harry_Potter_and_the_sorcerers_stone^100 (+fulltext:harry
+fulltext:potter +fulltext:sorcerer +fulltext:stone)

Suppose there is only one document with that title, but a few dozen
match all the individual terms. The one-word terms occur frequently in
the fulltext field, but the title only once, yet it is a "high impact"
term from the point of view of the query score. We don't index impacts
for a term when docFreq < 128. This means we will never be able to
skip low-scoring documents for this query, assuming that the score of
the fulltext clause will always be much less than the score from the
exact title match (which is by design - we always want exact title
matches to rank highly). Even when min-competitive-score is for a
document that has each word twice, we still can't skip documents where
they only occur once, because the maximum score for the title scorer
is the maximum *over the whole index* -- basically the scorer is
thinking there might be another exact title match somewhere deeper in
the index *even though its postings have already been exhausted*.
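To make the arithmetic concrete, here is a minimal sketch of the skip test as I understand it. This is a toy model, not Lucene code: the class name, method name, and all score values are made up for illustration, and the real disjunction scorers are more involved.

```java
/** Toy model of the skip test in a top-scores disjunction (hypothetical, not Lucene code). */
final class DisjunctionBound {

  /**
   * A document can be skipped only when the sum of the clauses' score upper
   * bounds is strictly below the minimum competitive score.
   */
  static boolean canSkip(float titleMaxScore, float fulltextMaxScore, float minCompetitiveScore) {
    return titleMaxScore + fulltextMaxScore < minCompetitiveScore;
  }

  public static void main(String[] args) {
    float titleGlobalMax = 100f;  // made-up bound for the boosted exact-title clause
    float fulltextOnce = 1.0f;    // made-up bound for docs where each word occurs once
    float minCompetitive = 1.5f;  // made-up score of a doc where each word occurs twice

    // With the index-wide title bound, nothing can ever be skipped.
    System.out.println(canSkip(titleGlobalMax, fulltextOnce, minCompetitive)); // false

    // If the title clause could report 0 (no postings left), skipping becomes possible.
    System.out.println(canSkip(0f, fulltextOnce, minCompetitive)); // true
  }
}
```

As long as the title clause contributes its index-wide maximum, the sum never drops below the threshold, regardless of how small the fulltext bound is.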

I have only just started to look at the impacts code and don't have
any clear idea whether this is difficult to fix, or whether I may have
misconfigured something, but thought I would ask here to see if anyone
has any idea. Things I did check:

- the query is running in TOP_SCORES mode
- the collector is calling Scorable.setMinCompetitiveScore with a low
score, and subsequently collecting all matching hits even though their
scores are all lower than the min
- the title term's impacts are represented by SlowImpactsEnum

One thing that may be relevant is that I am using a custom
Query/Weight/Scorer wrapping the two clauses in order to modify their
scores, because I am trying to mimic a pre-existing scoring function.
These apply a linear function with an offset, scale and a maximum
ceiling (so can't be done just with boosts as shown above). This
Scorer implements score/getMaxScore by applying its modifications to
the underlying scores, setMinCompetitiveScore basically inverts that,
and advanceShallow delegates to the inner Scorer. I didn't implement
anything around BulkScorer - maybe that's a gap?
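For reference, a minimal sketch of the kind of score transform described above and its inverse. This is a standalone toy, not the actual wrapper: the class and method names are hypothetical, it assumes scale > 0 and non-negative raw scores, and the one-ulp nudge is only a cheap guard against float rounding, not a proof of safety.

```java
/** Toy model of the wrapper's math: transformed = min(ceiling, offset + scale * raw). */
final class LinearCappedScore {
  final float offset;
  final float scale;
  final float ceiling;

  LinearCappedScore(float offset, float scale, float ceiling) {
    if (!(scale > 0f)) {
      throw new IllegalArgumentException("scale must be positive so the transform is monotonic");
    }
    this.offset = offset;
    this.scale = scale;
    this.ceiling = ceiling;
  }

  /**
   * Used for both the score and the max score: because the function is
   * non-decreasing, an upper bound on the raw score maps to an upper bound
   * on the transformed score.
   */
  float apply(float raw) {
    return Math.min(ceiling, offset + scale * raw);
  }

  /** Maps a minimum competitive score on the wrapper to one for the inner scorer. */
  float invertMinCompetitive(float minScore) {
    if (minScore > ceiling) {
      // The ceiling can never be exceeded, so no raw score is competitive.
      return Float.POSITIVE_INFINITY;
    }
    float raw = (minScore - offset) / scale;
    // Err on the low side so rounding does not cause a competitive hit to be skipped.
    raw = Math.nextDown(raw);
    return Math.max(0f, raw);
  }
}
```

The important property is that the inverse never overestimates: any raw score whose transformed value reaches the threshold must also reach the inverted threshold, otherwise competitive hits could be skipped.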

Any pointers appreciated!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Well, digging a little deeper I can see that skipping behavior is
going to depend heavily on the distribution of documents in the index,
and how many skip levels there are and so on, and I may be getting
hung up on a particular test case that doesn't generalize. In this
case all the high-scoring documents come early in the docid order (due
to our static index sort), so there are lots of possibilities for
skipping that may be unusual? One thing that occurred to me was that
when the Query writer knows that a child Query will always lead the
disjunction, they could possibly indicate that somehow - we could have
a UNION query or so that would process its child Queries in series and
then merge their results? Which would be a bad strategy in general,
but good when there is one high-scoring lead query that has few
results. But I am of course hoping this would just fall out of
WANDScorer as it is already dividing up head/tail queries ...

One thing that seems odd (I think, unless I'm confused! - very
possible) is that TermScorer reports its max score as the global max
once its iterator has been exhausted, when it seems it ought to report
0. I added a check for docID() == NO_MORE_DOCS in my wrapping Query to
assert this, and I can see it has some effect.
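A minimal sketch of that check, using a stand-in scorer rather than the real TermScorer. Everything here is hypothetical apart from the NO_MORE_DOCS sentinel, which mirrors the value of Lucene's DocIdSetIterator.NO_MORE_DOCS (Integer.MAX_VALUE); the toy scorer has no impacts and always reports its index-wide bound, as in the case described above.

```java
/** Toy model of reporting a zero max score once the postings are exhausted. */
final class ExhaustionAwareMaxScore {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE; // same value as Lucene's sentinel

  /** Stand-in for a term scorer over a sorted postings list with no impacts. */
  static final class ToyTermScorer {
    private final int[] docs;
    private final float globalMax;
    private int idx = -1; // -1 means not yet positioned

    ToyTermScorer(int[] sortedDocs, float globalMax) {
      this.docs = sortedDocs;
      this.globalMax = globalMax;
    }

    int docID() {
      if (idx < 0) return -1;
      return idx >= docs.length ? NO_MORE_DOCS : docs[idx];
    }

    /** Moves to the first doc >= target (simplified compared to the real contract). */
    int advance(int target) {
      if (idx < 0) idx = 0;
      while (idx < docs.length && docs[idx] < target) idx++;
      return docID();
    }

    /** Without impacts, the only available bound is the index-wide maximum. */
    float getMaxScore(int upTo) {
      return globalMax;
    }
  }

  /** The wrapper's bound: nothing is left to match after exhaustion, so report 0. */
  static float maxScore(ToyTermScorer inner, int upTo) {
    return inner.docID() == NO_MORE_DOCS ? 0f : inner.getMaxScore(upTo);
  }
}
```

With a single posting at doc 5 and a made-up bound of 100, the wrapper reports 100 until the scorer is advanced past doc 5, and 0 afterwards.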

Anyway I am seeing *some* skipping, which is tantalizing.

On Sat, Apr 1, 2023 at 10:00 AM Michael Sokolov <msokolov@gmail.com> wrote:
