Actually, I believe that the Lucene scoring function is based on *Okapi
BM25* (BM is an abbreviation of "best matching"), which is based on the
probabilistic retrieval
<https://en.m.wikipedia.org/wiki/Probabilistic_relevance_model> framework
developed in the 1970s and 1980s by Stephen E. Robertson
<https://en.m.wikipedia.org/wiki/Stephen_E._Robertson>, Karen Spärck Jones
<https://en.m.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones>, and others.

There are several interpretations for IDF and slight variations on its
formula. In the original BM25 derivation, the IDF component is derived from
the Binary Independence Model
<https://en.m.wikipedia.org/wiki/Binary_Independence_Model>.
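
To make the "slight variations" on IDF concrete, here is a minimal Python sketch of two common forms. The function names are mine, not from any library. The first is the Robertson/Spärck Jones style weight, which goes negative for terms that occur in more than half of the documents. The second adds 1 inside the logarithm, which keeps the value positive and is, as far as I know, the form Lucene's BM25Similarity uses.

```python
import math


def idf_rsj(doc_freq, doc_count):
    """Robertson/Sparck Jones style IDF; negative when doc_freq > doc_count / 2."""
    return math.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_plus_one(doc_freq, doc_count):
    """Variant with 1 added inside the log, so the result is always positive."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


if __name__ == "__main__":
    # Illustrative numbers only: a rare term and a very common term in 1000 docs.
    for df in (5, 900):
        print(df, idf_rsj(df, 1000), idf_plus_one(df, 1000))
```
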
Info from: https://en.m.wikipedia.org/wiki/Okapi_BM25

> You could calculate an ideal score, but that can change every time a
> document is added to or deleted from the index, because of idf. So the
> ideal score isn’t a useful mental model.
>
> Essentially, you need to tell your users to worry about something that
> matters. The absolute value of the score does not matter.
>
While I understand the concern, BM25 scores are quite often used
post-retrieval (in two-stage retrieval/ranking systems) as features for
learning-to-rank models. These typically map the score into [0,1] with a
normalization function, which often involves estimating a max score from
the score distribution.
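
A minimal sketch of what I mean, in Python. This is one possible normalization, not a standard one: the helper names and the choice of the 0.99 quantile are assumptions for illustration. It takes a high quantile of previously observed scores as the "max" and clips the ratio into [0,1].

```python
def estimate_max_score(scores, quantile=0.99):
    """Hypothetical helper: use a high quantile of observed scores as the max."""
    if not scores:
        raise ValueError("need at least one observed score")
    ordered = sorted(scores)
    # Index of the chosen quantile, clamped to the last element.
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx]


def normalize(score, max_score):
    """Map a raw score into [0, 1] by dividing by the estimated max and clipping."""
    if max_score <= 0:
        return 0.0
    return min(1.0, max(0.0, score / max_score))


if __name__ == "__main__":
    observed = [1.0, 2.0, 4.0, 8.0]  # made-up scores from earlier queries
    max_score = estimate_max_score(observed)
    print([normalize(s, max_score) for s in (0.5, 4.0, 10.0)])
```

Using a quantile rather than the true maximum makes the estimate less sensitive to a single outlier query, at the cost of clipping the very top scores to 1.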
J
On Mon, Dec 19, 2022 at 11:31 AM Walter Underwood <wunder@wunderwood.org>
wrote:
> That article is copied from the old wiki, so it is much earlier than 2019,
> more like 2009. Unfortunately, the links to the email discussion are all
> dead, but the issues in the article are still true.
>
> If you really want to go down that path, you might be able to do it with a
> similarity class that implements a probabilistic relevance model. I’d start
> the literature search with this Google query.
>
> probablistic information retrieval
> <https://www.google.com/search?client=safari&rls=en&q=probablistic+information+retrieval&ie=UTF-8&oe=UTF-8>
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Dec 18, 2022, at 2:47 AM, Mikhail Khludnev <mkhl@apache.org> wrote:
>
> Thanks for the reply, Walter.
> Recently Robert commented on the PR with the link
> https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages, which
> gives arguments against my proposal. Honestly, I'm still in doubt.
>
> On Tue, Dec 6, 2022 at 8:15 PM Walter Underwood <wunder@wunderwood.org>
> wrote:
>
>> As you point out, this is a probabilistic relevance model. Lucene uses a
>> vector space model.
>>
>> A probabilistic model gives an estimate of how relevant each document is
>> to the query. Unfortunately, their overall relevance isn’t as good as a
>> vector space model.
>>
>> You could calculate an ideal score, but that can change every time a
>> document is added to or deleted from the index, because of idf. So the
>> ideal score isn’t a useful mental model.
>>
>> Essentially, you need to tell your users to worry about something that
>> matters. The absolute value of the score does not matter.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>> On Dec 5, 2022, at 11:02 PM, Mikhail Khludnev <mkhl@apache.org> wrote:
>>
>> Hello dev!
>> Users are interested in the meaning of the absolute value of the score,
>> but we always reply that it's just a relative value. The maximum score of
>> the matched docs is not an answer.
>> Ultimately we need to measure how much sense a query makes in the index.
>> E.g. the query [jet OR propulsion OR spider] should be measured as
>> nonsense, because the best matching docs have much lower scores than a
>> hypothetical (and presumably absent) doc matching [jet AND propulsion AND
>> spider].
>> Could there be a method that returns the maximum possible score if all
>> query terms matched? Something like stubbing postings on a virtual
>> all_matching doc with average stats such as tf and field length, and
>> kicking the scorers in? It reminds me somewhat of probabilistic
>> retrieval, but not much. Is there anything like this already?
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>>
>>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
>
>
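
For reference, a rough Python sketch of the "virtual all_matching doc" idea from the first message in this thread. It is not Lucene code. It assumes the textbook BM25 formula with k1 = 1.2 and b = 0.75, and all function names, document frequencies, and stats are made up for illustration. It also shows the objection raised in the replies: the ideal score shifts as soon as the document count changes, because idf changes.

```python
import math

K1 = 1.2   # assumed BM25 term-frequency saturation parameter
B = 0.75   # assumed BM25 length-normalization parameter


def idf(doc_freq, doc_count):
    """BM25 IDF variant with 1 added inside the log."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term(tf, doc_len, avg_doc_len, doc_freq, doc_count):
    """Textbook BM25 contribution of one term to one document's score."""
    norm = K1 * (1.0 - B + B * doc_len / avg_doc_len)
    return idf(doc_freq, doc_count) * tf * (K1 + 1.0) / (tf + norm)


def ideal_score(doc_freqs, doc_count, avg_tf=1.0, avg_doc_len=100.0):
    """Score of a hypothetical doc of average length matching every query term."""
    return sum(
        bm25_term(avg_tf, avg_doc_len, avg_doc_len, df, doc_count)
        for df in doc_freqs.values()
    )


if __name__ == "__main__":
    # Made-up document frequencies for the example query terms.
    doc_freqs = {"jet": 50, "propulsion": 20, "spider": 5}
    ideal = ideal_score(doc_freqs, 1000)
    # Best real doc is assumed to match only "jet" and "propulsion".
    best = sum(bm25_term(1.0, 100.0, 100.0, doc_freqs[t], 1000)
               for t in ("jet", "propulsion"))
    print("ratio of best actual to ideal:", best / ideal)
    # Adding one unrelated document changes idf, and with it the ideal score.
    print("ideal before/after one added doc:", ideal, ideal_score(doc_freqs, 1001))
```
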