Mailing List Archive

How exactly is the normalized length of documents stored in the index?
During indexing, an inverted index is built from the terms of the
documents, and statistics such as term frequency and document frequency
are stored. If I understand correctly, the exact document length is not
stored in the index, to reduce its size. Instead, a normalized length is
stored for each document. Most retrieval functions need document length
as a component, so they use this normalized length.

I want to ask how exactly the normalization process is performed. The
question might have been answered already, but I was unable to find the
proper response. Your help is much appreciated.

Thanks.
Re: How exactly is the normalized length of documents stored in the index?
The BM25 similarity computes the normalized length as the number of tokens,
ignoring synonyms (tokens at the same position).

Then it encodes this length as an 8-bit integer in the index using this
logic:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L147-L156,
which preserves a bit more than 4 significant bits.
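
To make the encoding concrete, here is a minimal, self-contained sketch
of a float-like one-byte encoding in the spirit of the linked code. It
is not a copy of SmallFloat: the class and method names are hypothetical,
and the threshold of 24 exactly-stored values is an assumption based on
my reading of the source, so treat the linked file as authoritative.

```java
final class NormLengthSketch {
    // Assumption: lengths below this threshold are stored exactly.
    static final int FREE_VALUES = 24;

    // Float-like encoding: keep the top 4 significant bits plus a shift.
    static int toInt4(long v) {
        int numBits = 64 - Long.numberOfLeadingZeros(v);
        if (numBits < 4) {
            return (int) v; // small values are stored as-is
        }
        int shift = numBits - 4;
        // Keep 4 bits, then drop the leading 1 bit, which is implicit.
        int mantissa = (int) (v >>> shift) & 0x07;
        // Store shift + 1, because 0 is reserved for "stored as-is".
        return mantissa | ((shift + 1) << 3);
    }

    static long fromInt4(int e) {
        long mantissa = e & 0x07;
        int shift = (e >>> 3) - 1;
        return shift == -1 ? mantissa : (mantissa | 0x08) << shift;
    }

    // Encode a token count into one byte (lossy for larger counts).
    static byte encodeLength(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        if (length < FREE_VALUES) {
            return (byte) length;
        }
        return (byte) (FREE_VALUES + toInt4(length - FREE_VALUES));
    }

    // Decode the byte back into an approximate token count.
    static int decodeLength(byte b) {
        int i = Byte.toUnsignedInt(b);
        if (i < FREE_VALUES) {
            return i;
        }
        return (int) (FREE_VALUES + fromInt4(i - FREE_VALUES));
    }

    public static void main(String[] args) {
        for (int len : new int[] {5, 23, 100, 1000, 100000}) {
            int decoded = decodeLength(encodeLength(len));
            System.out.println(len + " -> " + decoded);
        }
    }
}
```

In this sketch short documents round-trip exactly, while longer ones
are rounded down to a nearby representable value, so the scorer sees
an approximate length rather than the true token count.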

On Tue, Jul 13, 2021 at 1:22 PM Dwaipayan Roy <dwaipayan.roy@gmail.com>
wrote:

> During indexing, an inverted index is built from the terms of the
> documents, and statistics such as term frequency and document frequency
> are stored. If I understand correctly, the exact document length is not
> stored in the index, to reduce its size. Instead, a normalized length is
> stored for each document. Most retrieval functions need document length
> as a component, so they use this normalized length.
>
> I want to ask how exactly the normalization process is performed. The
> question might have been answered already, but I was unable to find the
> proper response. Your help is much appreciated.
>
> Thanks.
>


--
Adrien