Mailing List Archive

PackedInts functionalities
Hi devs,

As I was working on https://github.com/apache/lucene/issues/12513 I needed to compress positive integers which are used to locate postings etc.

To put it concretely, I will need to pack a few values per term contiguously and those values can have different bit-width. For example, consider that we need to encode docFreq and postingsStartOffset per term and docFreq takes 4 bit and the postingsStartOffset takes 6 bit. We expect to write the following for two terms.

```
Term1??????????????????????????????????????????????????????| ????Term2

docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) | postingsStartOffset(6bit)

```

On the read path, I expect to locate the offest for a term first and followed by reading two values that have different bit-width.

In the spirit of not re-inventing necessarily, I tried to explore the existing PackedInts util classes and I believe there is no support for this at the moment. The biggest gap I found is that the existing classes expect to write/read values of same bit-width.

I'm writing to get feedback from yall to see if I missed anything.

Cheers,
Tony X
Re: PackedInts functionalities [ In reply to ]
Hello Tony
Is it possible to write a block of docfreqs and then a block of
postingoffsets?
Or why not write them as 10-bit integers and then split to quad and sextet
in the posting format code?

On Mon, Oct 16, 2023 at 11:50?PM Dongyu Xu <dongyu214@hotmail.com> wrote:

> Hi devs,
>
> As I was working on https://github.com/apache/lucene/issues/12513 I
> needed to compress positive integers which are used to locate postings etc.
>
> To put it concretely, I will need to pack a few values per term
> contiguously and those values can have different bit-width. For example,
> consider that we need to encode docFreq and postingsStartOffset per term
> and docFreq takes 4 bit and the postingsStartOffset takes 6 bit. We
> expect to write the following for two terms.
>
> ```
> Term1 | Term2
>
> docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) |
> postingsStartOffset(6bit)
>
> ```
>
> On the read path, I expect to locate the offest for a term first and
> followed by reading two values that have different bit-width.
>
> In the spirit of not re-inventing necessarily, I tried to explore the
> existing PackedInts util classes and I believe there is no support for this
> at the moment. The biggest gap I found is that the existing classes expect
> to write/read values of same bit-width.
>
> I'm writing to get feedback from yall to see if I missed anything.
>
> Cheers,
> Tony X
>


--
Sincerely yours
Mikhail Khludnev
Re: PackedInts functionalities [ In reply to ]
+1 to what Mikhail wrote, this is e.g. how postings work: instead of
interleaving doc IDs and frequencies, they always store a block of 128 doc
IDs followed by a block of 128 frequencies.

For reference, bit packing feels space-inefficient for this kind of data. I
would expect docFreqs to have a zipfian distribution, so you would end up
using a number of bits per docFreq that is driven by the highest docFreq in
the block while most values might be very low. Do you need random-access
into these doc freqs and postings start offsets or will you decode data for
an entire block every time anyway?


On Tue, Oct 17, 2023 at 8:39?AM Mikhail Khludnev <mkhl@apache.org> wrote:

> Hello Tony
> Is it possible to write a block of docfreqs and then a block of
> postingoffsets?
> Or why not write them as 10-bit integers and then split to quad and sextet
> in the posting format code?
>
> On Mon, Oct 16, 2023 at 11:50?PM Dongyu Xu <dongyu214@hotmail.com> wrote:
>
>> Hi devs,
>>
>> As I was working on https://github.com/apache/lucene/issues/12513 I
>> needed to compress positive integers which are used to locate postings etc.
>>
>> To put it concretely, I will need to pack a few values per term
>> contiguously and those values can have different bit-width. For example,
>> consider that we need to encode docFreq and postingsStartOffset per term
>> and docFreq takes 4 bit and the postingsStartOffset takes 6 bit. We
>> expect to write the following for two terms.
>>
>> ```
>> Term1 | Term2
>>
>> docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) |
>> postingsStartOffset(6bit)
>>
>> ```
>>
>> On the read path, I expect to locate the offest for a term first and
>> followed by reading two values that have different bit-width.
>>
>> In the spirit of not re-inventing necessarily, I tried to explore the
>> existing PackedInts util classes and I believe there is no support for this
>> at the moment. The biggest gap I found is that the existing classes expect
>> to write/read values of same bit-width.
>>
>> I'm writing to get feedback from yall to see if I missed anything.
>>
>> Cheers,
>> Tony X
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


--
Adrien
Re: PackedInts functionalities [ In reply to ]
Thank you both for your ideas.

I sensed that there could be potential confusions. Let me try to clear them first --
I'm not changing how postings are encoded. Rather I'm changing how the postings's metadata (in code, this will be the IntBlockTermState<https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsFormat.java#L443>?) are encoded in the term dictionary.

As a useful example, here are the data that we need to encode for a field that has "everything" enabled (doc, freq, positions, payload and offsets)

* docStartFp: the posting file pointer
* skipOffset: pointer to skip data
* posStartFp: positions file pointer
* payStartFp: payload file pointer
* docFreq: number of docs that has this term
* totalTermFeq: total freq of this term across all docs
* singletonDocId: when a term just has a single posting, we store the docid directly
* lastPosBlockOffset: the relative offset of the last position block relative to posStartFp


> Is it possible to write a block of docfreqs and then a block of postingoffsets?

Yes, but that is less desirable since it will lose locality. It'd be multiple seeks instead of one. Plus, the example I used is rather simplified. There are more than two type of values that we need to deal with.


> why not write them as 10-bit integers and then split to quad and sextet in the posting format code?

I've considered this. It is actually a decent workaround but comes with its limitation. As mentioned there could be multiple type of values and in total they may exceed 64 bits.


Adrien: The docFreq distribution is interesting. bit-packing can be a good start, I think? To the rest of your point, I do need random access within the block of terms as its main usage is to back the term dictionary.



Thanks,
Tony

________________________________
From: Adrien Grand <jpountz@gmail.com>
Sent: Tuesday, October 17, 2023 1:26 AM
To: dev@lucene.apache.org <dev@lucene.apache.org>
Subject: Re: PackedInts functionalities

+1 to what Mikhail wrote, this is e.g. how postings work: instead of interleaving doc IDs and frequencies, they always store a block of 128 doc IDs followed by a block of 128 frequencies.

For reference, bit packing feels space-inefficient for this kind of data. I would expect docFreqs to have a zipfian distribution, so you would end up using a number of bits per docFreq that is driven by the highest docFreq in the block while most values might be very low. Do you need random-access into these doc freqs and postings start offsets or will you decode data for an entire block every time anyway?


On Tue, Oct 17, 2023 at 8:39?AM Mikhail Khludnev <mkhl@apache.org<mailto:mkhl@apache.org>> wrote:
Hello Tony
Is it possible to write a block of docfreqs and then a block of postingoffsets?
Or why not write them as 10-bit integers and then split to quad and sextet in the posting format code?

On Mon, Oct 16, 2023 at 11:50?PM Dongyu Xu <dongyu214@hotmail.com<mailto:dongyu214@hotmail.com>> wrote:
Hi devs,

As I was working on https://github.com/apache/lucene/issues/12513 I needed to compress positive integers which are used to locate postings etc.

To put it concretely, I will need to pack a few values per term contiguously and those values can have different bit-width. For example, consider that we need to encode docFreq and postingsStartOffset per term and docFreq takes 4 bit and the postingsStartOffset takes 6 bit. We expect to write the following for two terms.

```
Term1??????????????????????????????????????????????????????| ????Term2

docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) | postingsStartOffset(6bit)

```

On the read path, I expect to locate the offest for a term first and followed by reading two values that have different bit-width.

In the spirit of not re-inventing necessarily, I tried to explore the existing PackedInts util classes and I believe there is no support for this at the moment. The biggest gap I found is that the existing classes expect to write/read values of same bit-width.

I'm writing to get feedback from yall to see if I missed anything.

Cheers,
Tony X


--
Sincerely yours
Mikhail Khludnev


--
Adrien