Hi, I've been working on seeing whether we can make use of impacts in
Amazon search and I have some questions. To date, we haven't used
Lucene's scoring APIs at all; all of our queries are constant score,
we early terminate based on a sorted index rank and then re-rank using
custom non-Lucene ranking models. There is now an opportunity (some
early ranking models have gotten simplified) for us to move some of
the ranking workload into Lucene where we should be able to benefit
from skipping hits via impacts.
I'm struggling with a typical query (not our actual setup, but
illustrates the functional gap) that is an OR-query something like:
title:Harry_Potter_and_the_sorcerers_stone^100 (+fulltext:harry
+fulltext:potter +fulltext:sorcerer +fulltext:stone)
Suppose there is only one document with that title, but a few dozen
match all the individual terms. The one-word terms occur frequently in
the fulltext field, whereas the title term occurs only once, yet it is a
"high impact" term from the point of view of the query score. We don't
index impacts
for a term when docFreq < 128. This means we will never be able to
skip low-scoring documents for this query, assuming that the score of
the fulltext clause will always be much less than the score from the
exact title match (which is by design - we always want exact title
matches to rank highly). Even when the minimum competitive score comes
from a document that contains each word twice, we still can't skip
documents where each word occurs only once, because the maximum score
for the title scorer
is the maximum *over the whole index* -- basically the scorer is
thinking there might be another exact title match somewhere deeper in
the index *even though its postings have already been exhausted*.
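To make the arithmetic concrete, here is a minimal sketch of the bound check as I understand it. It models only the upper-bound comparison and is not Lucene's actual scorer code; the class name, the method names, the numbers, and in particular the idea that the title bound could drop to zero once its postings are exhausted are all assumptions for illustration.

```java
// Hypothetical model of the pruning decision for a two-clause OR query.
// None of these names exist in Lucene; the scores are made-up examples.
public class UpperBoundSketch {

    /** A doc range can be skipped only if its best possible score is below the minimum. */
    public static boolean canSkip(float titleMax, float fulltextBlockMax, float minCompetitive) {
        return titleMax + fulltextBlockMax < minCompetitive;
    }

    /** Assumed behaviour: the title bound falls to zero once its postings are exhausted. */
    public static float titleBound(float globalTitleMax, boolean titleExhausted) {
        return titleExhausted ? 0f : globalTitleMax;
    }

    public static void main(String[] args) {
        float globalTitleMax = 100f; // boosted exact title match, maximum over the whole index
        float blockMax = 4f;         // best fulltext score where each term occurs once
        float minCompetitive = 6f;   // score of a hit where each term occurs twice

        // Index-wide title bound: 100 + 4 is not below 6, so nothing is skipped.
        System.out.println(canSkip(globalTitleMax, blockMax, minCompetitive));

        // Exhaustion-aware title bound: 0 + 4 is below 6, so the range could be skipped.
        System.out.println(canSkip(titleBound(globalTitleMax, true), blockMax, minCompetitive));
    }
}
```

With the index-wide bound the check never succeeds for any realistic minimum score, which matches what I observe.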
I have only just started to look at the impacts code and don't have
any clear idea whether this is difficult to fix, or whether I may have
misconfigured something, but thought I would ask here to see if anyone
has any idea. Things I did check:
- the query is running in TOP_SCORES mode
- the collector is calling Scorer.setMinCompetitiveScore with a low
score, and subsequently collecting all matching hits even though their
scores are all lower than that minimum
- the title impacts are represented by a SlowImpactsEnum
One thing that may be relevant is that I am using a custom
Query/Weight/Scorer wrapping the two clauses in order to modify their
scores, because I am trying to mimic a pre-existing scoring function.
These apply a linear function with an offset, scale and a maximum
ceiling (so can't be done just with boosts as shown above). This
Scorer implements score/getMaxScore by applying its modifications to
the underlying scores, setMinCompetitiveScore basically inverts that,
and advanceShallow delegates to the inner Scorer. I didn't implement
anything around BulkScorer - maybe that's a gap?
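For reference, the score mapping and its inverse look roughly like the sketch below. This is a simplified stand-alone version, not my actual code and not a Lucene class: the name LinearCappedScore, the parameter values, and the one-ulp rounding guard are assumptions made for illustration.

```java
// Hypothetical stand-alone model of the wrapper's score arithmetic.
public final class LinearCappedScore {
    private final float offset;
    private final float scale;
    private final float ceiling;

    public LinearCappedScore(float offset, float scale, float ceiling) {
        if (!(scale > 0f)) {
            // A positive scale keeps the mapping monotonic, so a max maps to a max.
            throw new IllegalArgumentException("scale must be positive");
        }
        this.offset = offset;
        this.scale = scale;
        this.ceiling = ceiling;
    }

    /** Forward mapping, used for both score() and getMaxScore(). */
    public float apply(float raw) {
        return Math.min(ceiling, offset + scale * raw);
    }

    /** Raw score threshold to pass to the inner scorer for a given outer minimum. */
    public float invert(float minScore) {
        if (minScore > ceiling) {
            // No raw score can reach a minimum above the ceiling.
            return Float.POSITIVE_INFINITY;
        }
        float raw = (minScore - offset) / scale;
        // Nudge down one ulp as a guard against float rounding (an assumption;
        // a real implementation may need a wider margin), and clamp at zero.
        return Math.max(0f, Math.nextDown(raw));
    }

    public static void main(String[] args) {
        LinearCappedScore f = new LinearCappedScore(1f, 2f, 10f);
        System.out.println(f.apply(2f));   // 1 + 2 * 2 = 5.0
        System.out.println(f.apply(100f)); // capped at 10.0
        System.out.println(f.invert(5f));  // just below 2.0
    }
}
```

The inverse deliberately errs on the low side: passing a threshold that is slightly too small only costs some skipping, whereas one that is too large could drop competitive hits.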
Any pointers appreciated!
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org