Mailing List Archive: Lucene Approximation

Jun 2, 2020, 8:50 AM

Post #1 of 4

Hello,

I am not sure whether this is the right place to ask, but I have a question
about an approximation that Lucene makes in my application.

I am trying to reproduce the scores that Lucene's BM25Similarity calculates,
but I found that Lucene only approximates the length of each document for
scoring, while it uses the exact values for the average document length.
Is there a way to turn off this approximation, or to get the approximated
values so that I can save them for my own calculations?

I use Lucene 8.4.1 in combination with Spring Boot, in case that matters.

Thank you in advance,
Moritz

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene Approximation

msokolov at gmail

Jun 2, 2020, 9:48 AM

Post #2 of 4

You could append an EOF token to every indexed text, and then iterate
over Terms to get the positions of those tokens?

Re: Lucene Approximation

moritz at staudinger

Jun 2, 2020, 10:23 AM

Post #3 of 4

Thank you for your answer, but could you please explain this idea in
more detail? I cannot see how it would help solve my problem.

For example, I indexed the Wikipedia article "Alan Smithee", which has a
document length of 756; that exact value is also used when calculating the
average document length. But when the BM25 score for this article is
calculated, Lucene uses the approximated document length of 728, which
gives a different result than scoring with the exact document length. So I
wonder where this value is calculated, and how I might change the
approximation or at least get the approximated value, so that I can use it
for my own calculations.
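
To make the effect concrete, here is a minimal sketch of the textbook BM25
length normalization evaluated with both lengths. It is not Lucene's code:
the class name Bm25LengthSketch, the term frequency of 3 and the average
document length of 500 are assumptions chosen for illustration. Only
k1 = 1.2 and b = 0.75 are taken from BM25Similarity's defaults, and Lucene
versions may differ from the textbook form by constant factors.

```java
// Sketch only: textbook BM25 term-frequency normalization, not Lucene's code.
final class Bm25LengthSketch {
    // Default parameters of BM25Similarity.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Length-normalized tf factor: tf / (tf + k1 * (1 - b + b * dl / avgdl)). */
    static double tfFactor(double tf, double docLength, double avgDocLength) {
        double norm = K1 * (1.0 - B + B * docLength / avgDocLength);
        return tf / (tf + norm);
    }

    public static void main(String[] args) {
        double tf = 3.0;      // hypothetical term frequency
        double avgdl = 500.0; // hypothetical average document length
        System.out.println("exact length 756:  " + tfFactor(tf, 756, avgdl));
        System.out.println("approx length 728: " + tfFactor(tf, 728, avgdl));
    }
}
```

Because the shorter, approximated length makes the denominator smaller, the
factor computed with 728 is slightly larger than the one computed with 756,
which is the kind of mismatch described above.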

Re: Lucene Approximation

msokolov at gmail

Jun 2, 2020, 2:57 PM

Post #4 of 4

Sorry, I thought that you wanted to maintain the true value rather
than the approximated value. I am not entirely sure, but I think the
approximation arises due to rounding and low-precision storage of
these values in the index. You might be able to reverse engineer it by
looking at "Norms," which involve the document length. TBH there has
been a fair amount of change there in recent releases, and I'm not
completely up to speed on what we store, so I'll decline to provide
more misinformation at this point!
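
For reference, the figures quoted in this thread (756 turning into 728) are
what a one-byte encoding with four significant bits would produce. That
appears to be the scheme behind SmallFloat.intToByte4 and
SmallFloat.byte4ToInt, which recent Lucene versions seem to use for norms,
but treat that attribution as an assumption and check it against the
sources of your Lucene version. The class below is a standalone
re-implementation for illustration, not Lucene's source; the names
NormLengthSketch, encode and decode are made up here.

```java
// Sketch only: a standalone lossy one-byte length encoding (assumed to
// mirror Lucene's SmallFloat.intToByte4 / byte4ToInt; not Lucene's code).
final class NormLengthSketch {
    // Lengths below this threshold are stored exactly (24 in this scheme).
    static final int NUM_FREE_VALUES = 255 - longToInt4(Integer.MAX_VALUE);

    // Float-like encoding: 3 explicit mantissa bits plus a shift (exponent).
    static int longToInt4(long i) {
        if (i < 0) throw new IllegalArgumentException("negative value: " + i);
        int numBits = 64 - Long.numberOfLeadingZeros(i);
        if (numBits < 4) {
            return Math.toIntExact(i);            // small values are kept as-is
        }
        int shift = numBits - 4;
        int encoded = Math.toIntExact(i >>> shift); // keep the 4 top bits
        encoded &= 0x07;                          // leading 1 bit is implicit
        encoded |= (shift + 1) << 3;              // 0 is reserved for small values
        return encoded;
    }

    static long int4ToLong(int i) {
        long bits = i & 0x07;
        int shift = (i >>> 3) - 1;
        return shift == -1 ? bits : (bits | 0x08) << shift;
    }

    /** Encodes a document length into a single byte (lossy above the threshold). */
    static byte encode(int length) {
        if (length < 0) throw new IllegalArgumentException("negative length: " + length);
        if (length < NUM_FREE_VALUES) return (byte) length;
        return (byte) (NUM_FREE_VALUES + longToInt4(length - NUM_FREE_VALUES));
    }

    /** Decodes the byte back into the approximated length used for scoring. */
    static int decode(byte b) {
        int i = Byte.toUnsignedInt(b);
        if (i < NUM_FREE_VALUES) return i;
        return Math.toIntExact(NUM_FREE_VALUES + int4ToLong(i - NUM_FREE_VALUES));
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(756))); // prints 728
    }
}
```

If the assumption holds, running the exact length through encode and decode
yields the value the scorer sees, so the approximated length can be
reproduced outside Lucene without changing the index.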
