Mailing List Archive

Maximum score estimation
Hello dev!
Users are interested in the meaning of the absolute value of the score, but
we always reply that it's just a relative value. The maximum score of the
matched docs is not an answer.
Ultimately we need to measure how much sense a query has in the index, e.g.
a [jet OR propulsion OR spider] query should be measured as nonsense,
because the best matching docs have much lower scores than a hypothetical
(and presumably absent) doc matching [jet AND propulsion AND spider].
Could there be a method that returns the maximum possible score if all
query terms matched? Something like stubbing out postings for a virtual
all_matching doc with average stats like tf and field length, and kicking
the scorers in? It reminds me a little of probabilistic retrieval, but not
much. Is there anything like this already?
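
A back-of-the-envelope sketch of what I mean, with invented statistics and
Lucene's default BM25 parameters (k1=1.2, b=0.75); the class and numbers
below are hypothetical, not an existing Lucene API:

// Hypothetical: upper-bound BM25 score if one doc matched every query term
// with an assumed "average" tf and an average field length (dl == avgdl).
public class MaxScoreSketch {
    static final double K1 = 1.2, B = 0.75;

    // idf as in Lucene's BM25Similarity
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // per-term score; recent Lucene omits the classic (k1+1) numerator factor
    static double termScore(long docFreq, long docCount, double tf) {
        double norm = K1 * (1 - B + B * 1.0); // dl/avgdl == 1 by assumption
        return idf(docFreq, docCount) * tf / (tf + norm);
    }

    public static void main(String[] args) {
        long docCount = 1_000_000;          // invented index size
        long[] df = {12_000, 3_500, 9_000}; // jet, propulsion, spider
        double avgTf = 2.0;                 // invented "average stats"
        double bound = 0;
        for (long d : df) bound += termScore(d, docCount, avgTf);
        System.out.printf("score if all terms matched: %.3f%n", bound);
    }
}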

--
Sincerely yours
Mikhail Khludnev
Re: Maximum score estimation
As you point out, this is a probabilistic relevance model. Lucene uses a vector space model.

A probabilistic model gives an estimate of how relevant each document is to the query. Unfortunately, its overall relevance isn’t as good as a vector space model’s.

You could calculate an ideal score, but that can change every time a document is added to or deleted from the index, because of idf. So the ideal score isn’t a useful mental model.
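
A tiny illustration of the drift (numbers invented): the BM25 idf component
depends on live corpus statistics, so any precomputed "ideal" moves as soon
as the index changes.

// Self-contained: idf for the same term, before and after an indexing batch.
public class IdfDrift {
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }
    public static void main(String[] args) {
        System.out.println(idf(500, 1_000_000)); // before the batch
        System.out.println(idf(530, 1_001_000)); // after: the "ideal" moved
    }
}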

Essentially, you need to tell your users to worry about something that matters. The absolute value of the score does not matter.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

Re: Maximum score estimation
Thanks for the reply, Walter.
Recently Robert commented on a PR with a link to
https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages,
which gives arguments against my proposal. Honestly, I'm still in doubt.

--
Sincerely yours
Mikhail Khludnev
Re: Maximum score estimation
That article is copied from the old wiki, so it is much earlier than 2019, more like 2009. Unfortunately, the links to the email discussion are all dead, but the issues in the article are still true.

If you really want to go down that path, you might be able to do it with a similarity class that implements a probabilistic relevance model. I’d start the literature search with this Google query.

probabilistic information retrieval <https://www.google.com/search?client=safari&rls=en&q=probablistic+information+retrieval&ie=UTF-8&oe=UTF-8>

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

Re: Maximum score estimation
Actually, I believe that the Lucene scoring function is based on *Okapi
BM25* (BM is an abbreviation of best matching), which is based on the
probabilistic retrieval framework
<https://en.m.wikipedia.org/wiki/Probabilistic_relevance_model> developed
in the 1970s and 1980s by Stephen E. Robertson
<https://en.m.wikipedia.org/wiki/Stephen_E._Robertson>, Karen Spärck Jones
<https://en.m.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones>, and others.

There are several interpretations for IDF and slight variations on its
formula. In the original BM25 derivation, the IDF component is derived from
the Binary Independence Model
<https://en.m.wikipedia.org/wiki/Binary_Independence_Model>.

Info from:
https://en.m.wikipedia.org/wiki/Okapi_BM25
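
For reference, that page's formulation of the scoring function (current
Lucene's BM25Similarity drops the (k1+1) factor from the numerator, which
rescales scores but doesn't change the ranking), in LaTeX:

  \mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
      \frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

  \mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)

where f(q_i,D) is the term frequency of q_i in D, |D| the field length in
tokens, avgdl the average field length, N the number of documents, and
n(q_i) the number of documents containing q_i.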

> You could calculate an ideal score, but that can change every time a
> document is added to or deleted from the index, because of idf. So the
> ideal score isn’t a useful mental model.
>
> Essentially, you need to tell your users to worry about something that
> matters. The absolute value of the score does not matter.
>

While I understand the concern, quite often BM25 scores are used
post-retrieval (in 2-stage retrieval/ranking systems) to feed
learning-to-rank models, which often transform the score into [0,1] using
some normalization function that involves estimating a max score from the
score distribution.
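
A minimal sketch of that kind of normalization (not from any particular
library; the percentile cutoff is an arbitrary assumption):

import java.util.Arrays;

public class ScoreNormalizer {
    // Map raw scores into [0,1] against a high-percentile "max" so one
    // outlier doesn't flatten the rest; values above it are clipped to 1.
    static double[] normalize(double[] scores, double percentile) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.min(sorted.length - 1,
                Math.floor(percentile * sorted.length));
        double max = sorted[idx];
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.min(1.0, scores[i] / max);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] bm25 = {12.4, 9.8, 7.1, 3.3, 0.9}; // invented raw scores
        System.out.println(Arrays.toString(normalize(bm25, 0.95)));
    }
}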

J

Re: Maximum score estimation
Comparing scores within the result set for a single query works fine. Mapping those to [0,1] is fine, too.

Comparing scores for different queries, or even for the same query at different times, isn’t valid. Showing the scores to people almost guarantees they’ll compare the scores between different queries.

The BM25 in Lucene is a change to the formulas for idf, tf, and length normalization. It is still fundamentally a tf.idf model.

https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
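
To make the "still tf.idf" point concrete, a toy comparison of the two tf
shapes (invented frequencies; sqrt(freq) is ClassicSimilarity's tf, the
saturating curve is BM25's tf component with default k1=1.2, b=0.75):

public class TfShapes {
    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75;
        double norm = k1 * (1 - b + b * 1.0); // doc length == average length
        for (int freq = 1; freq <= 16; freq *= 2) {
            double classicTf = Math.sqrt(freq);   // classic: grows unbounded
            double bm25Tf = freq / (freq + norm); // BM25: saturates below 1
            System.out.printf("freq=%2d  classic=%.2f  bm25=%.2f%n",
                    freq, classicTf, bm25Tf);
        }
    }
}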

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

Re: Maximum score estimation
Hello.
Just FYI, I sketched a little prototype:
https://github.com/mkhludnev/likely/blob/main/src/test/java/org/apache/lucene/contrb/highly/TestLikelyReader.java#L53
To estimate the maximum possible score for a query against an index:
- it creates a virtual index (LikelyReader), which
- contains all terms from the original index with the same docCount, and
- matches all of these terms in the first doc (docnum=0) with the maximum
termFreq (how to estimate that is a separate question).
So, if we search over this LikelyReader, we get a score estimate that can
hardly be exceeded by the same query over the original index; a condensed
sketch of the idea follows below.
I suppose this might be useful for LTR as a better alternative to the query
score feature.
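
For anyone who doesn't want to open the repo, a condensed, self-contained
toy of what searching the LikelyReader amounts to (all statistics are
invented here; the real prototype reads them from the original index):

public class LikelyBound {
    static final double K1 = 1.2, B = 0.75; // Lucene BM25 defaults

    // BM25 per-term score as in recent Lucene (no (k1+1) factor), dl == avgdl
    static double termScore(long docFreq, long docCount, double tf) {
        double idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
        return idf * tf / (tf + K1 * (1 - B + B * 1.0));
    }

    public static void main(String[] args) {
        long docCount = 1_000_000;          // invented
        long[] df = {12_000, 3_500, 9_000}; // jet, propulsion, spider
        double[] maxTf = {7, 4, 6};         // invented per-term maxima
        double bound = 0;
        for (int i = 0; i < df.length; i++) {
            bound += termScore(df[i], docCount, maxTf[i]);
        }
        double actualBest = 6.2;            // pretend top score on real index
        // a low ratio is the "this query is nonsense here" signal
        System.out.printf("bound=%.2f actual=%.2f ratio=%.2f%n",
                bound, actualBest, actualBest / bound);
    }
}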

--
Sincerely yours
Mikhail Khludnev