Mailing List Archive

PFOR for docids?
Hi folks-

I'm curious to understand the history/context of using PFOR for positions
and frequencies while continuing to use basic FOR for docid encoding. I've
done my best to turn up any past conversations on this, but wasn't able to
find much. Apologies if I missed it in my digging! From what I've gathered,
the basic FOR encoding was introduced to Lucene with LUCENE-3892
<https://issues.apache.org/jira/browse/LUCENE-3892> (which was a
continuation of LUCENE-1410
<https://issues.apache.org/jira/browse/LUCENE-1410>). While PFOR had been
discussed plenty in the earlier issues, I gather that it wasn't actually
committed until LUCENE-9027
<https://issues.apache.org/jira/browse/LUCENE-9027>. Hopefully I've got
that much right. And it appears at that time to have been introduced for
positions and frequencies, but not docids.

Is the reasoning here that, a) since docids are delta-encoded already,
outliers/exceptions will be less likely/beneficial, and b) FOR allows for
an optimization in decoding the deltas (via ForUtil#decodeAndPrefixSum)
which can't be utilized with PFOR, since the exceptions must be patched in
before decoding deltas? Are there other reasons FOR continues to be used for
docids that I'm overlooking?
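For anyone following along, here's a toy sketch of the trade-off I mean. The class and method names are mine, not Lucene's actual ForUtil/PForUtil: FOR pays the bit width of the largest delta in the block for every value, while PFOR picks a smaller width that covers most values and stores the few outliers separately as exceptions.

```java
// Toy sketch (not Lucene's real encoder): contrast plain FOR with
// patched FOR (PFOR) on a block of 128 doc-id deltas.
import java.util.Arrays;

public class ForVsPfor {
    static int bitsRequired(long v) {
        return 64 - Long.numberOfLeadingZeros(v | 1);
    }

    // FOR: every value in the block is stored at the width of the largest value.
    static int forBitsPerValue(long[] deltas) {
        long max = 0;
        for (long d : deltas) max = Math.max(max, d);
        return bitsRequired(max);
    }

    // PFOR: pick a smaller width covering ~90% of values; the remaining
    // outliers ("exceptions") are stored on the side and patched back in
    // after the packed block is decoded.
    static int pforBitsPerValue(long[] deltas) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        return bitsRequired(sorted[(int) (sorted.length * 0.9) - 1]);
    }

    public static void main(String[] args) {
        // 127 small deltas plus one outlier: FOR pays for the outlier on every value.
        long[] deltas = new long[128];
        Arrays.fill(deltas, 20);   // fits in 5 bits
        deltas[77] = 100_000;      // needs 17 bits
        System.out.println("FOR  bpv: " + forBitsPerValue(deltas));  // 17
        System.out.println("PFOR bpv: " + pforBitsPerValue(deltas)); // 5 (plus 1 exception)
    }
}
```

In this (deliberately extreme) example FOR spends 17 bits on all 128 values while PFOR spends 5 bits plus the cost of one exception.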

I'm curious because I recently ran some internal benchmarks on the Amazon
product search engine, replacing FOR with PFOR for docid delta encoding,
and saw a 0.93% index size reduction while also improving our red-line
queries/sec by 1.0%. I expected the index size reduction but wasn't
expecting to see a QPS improvement, which I haven't yet been able to
explain. I'm wondering if there are good reasons to keep using FOR for
docids, or if there'd be any appetite to discuss using PFOR for everything?
Again, apologies if I've overlooked some past discussion in my digging. Any
history/context is much appreciated!

Cheers,
-Greg
Re: PFOR for docids?
Hi Greg,

You guessed b) right. I had done some microbenchmarks
<https://github.com/jpountz/decode-128-ints-benchmark> that suggested that
fusing the prefix sum into decoding was an important performance win
compared to unpacking first and then computing the prefix sum.
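To make the ordering constraint concrete, here's a toy sketch (the names are mine, not actual Lucene code): with FOR, unpacking and the prefix sum can run as one fused pass over the block, but with PFOR the exceptions have to be patched into the raw deltas first, which forces the cumulative sum into a separate pass.

```java
// Toy illustration of why PFOR can't fuse delta decoding with the prefix
// sum: exceptions must be patched into the raw deltas before the
// cumulative sum turns them into absolute doc IDs.
public class PrefixSumOrder {
    // FOR path: a single pass can turn unpacked deltas into absolute doc IDs.
    static void prefixSumInPlace(long[] deltas, long base) {
        deltas[0] += base;
        for (int i = 1; i < deltas.length; i++) deltas[i] += deltas[i - 1];
    }

    // PFOR path: patching must happen first, as an extra pass over the block.
    // (Real PFOR stores exceptions more cleverly; this just shows the ordering.)
    static void patchExceptions(long[] deltas, int[] positions, long[] values) {
        for (int i = 0; i < positions.length; i++) deltas[positions[i]] = values[i];
    }

    public static void main(String[] args) {
        long[] deltas = {3, 1, 1, 2};                             // raw unpacked deltas
        patchExceptions(deltas, new int[] {2}, new long[] {50});  // restore the outlier
        prefixSumInPlace(deltas, 0);                              // only now sum
        System.out.println(java.util.Arrays.toString(deltas));    // [3, 4, 54, 56]
    }
}
```

Running the prefix sum before patching would produce wrong doc IDs for every value at or after the exception, which is why the two steps can't be fused.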

The other factor that played a role for me is that frequencies and positions
are usually less performance-sensitive than doc IDs. For instance
conjunctions only decode frequencies after reaching agreement across all
clauses, and phrase queries only decode positions if term frequencies
suggest that the current document could have a competitive score.

I would be interested in discussing using PFOR for everything. At Elastic
we've been looking into building 3-gram indexes to perform infix search,
and noticed that we would save a lot of space if we moved to PFOR rather
than FOR for doc ID deltas. The reason is that these postings of 3-grams
are very dense, and saving one bit per value saves a lot more space
relatively when blocks have 5 bits per value than when they have e.g. 15
bits per value.
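A quick back-of-the-envelope illustration of that relative saving:

```java
// Relative space saved in packed postings data by shaving one bit per value:
// the denser the postings (lower bpv), the bigger the relative win.
public class BitSavings {
    public static void main(String[] args) {
        for (int bpv : new int[] {5, 15}) {
            System.out.printf("%d -> %d bpv saves %.1f%%%n", bpv, bpv - 1, 100.0 / bpv);
        }
    }
}
```

Dropping from 5 to 4 bits per value saves 20% of the packed data, while dropping from 15 to 14 saves only about 6.7%.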

--
Adrien
Re: PFOR for docids?
Hi Adrien-

Thanks for all the additional context and the benchmark link!

> The reason is that these postings of 3-grams are very dense, and saving one bit per value saves a lot more space relatively when blocks have 5 bits per value than when they have e.g. 15 bits per value.

That's an interesting observation. I'm working on gathering this same
bpv stat in our application right now, so this is useful to keep in
mind.

It seems like it would be valuable to measure the performance impact
of moving to PFOR for doc IDs in contrast to what it might save in
index size. While I'm able to do this on our internal system with our
own benchmarks, that's a very specific use-case and may not translate
well to a more general upstream solution. Do you have any guidance on
how this type of evaluation is generally done in the context of an
upstream change? As a first step, I can open a Jira issue to track the
evaluation if you think that would be useful. Thanks again!

Cheers,
-Greg



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: PFOR for docids?
I went ahead and created a Jira issue to track this investigation:
https://issues.apache.org/jira/browse/LUCENE-9850. I thought a Jira
issue might help us better track my findings as I work on this. I hope
this was appropriate, but if not, please let me know. Thanks again!

Cheers,
-Greg


Re: PFOR for docids?
This isn't only appropriate, it's the preferred way to move such
discussions forward. Thanks Greg!


--
Adrien