Hi folks-
I'm curious to understand the history/context of using PFOR for positions
and frequencies while continuing to use basic FOR for docid encoding. I've
done my best to turn up any past conversations on this, but wasn't able to
find much. Apologies if I missed it in my digging! From what I've gathered,
the basic FOR encoding was introduced to Lucene with LUCENE-3892
<https://issues.apache.org/jira/browse/LUCENE-3892> (which was a
continuation of LUCENE-1410
<https://issues.apache.org/jira/browse/LUCENE-1410>). While PFOR had been
discussed plenty in the earlier issues, I gather that it wasn't actually
committed until LUCENE-9027
<https://issues.apache.org/jira/browse/LUCENE-9027>. Hopefully I've got
that much right. And it appears at that time to have been introduced for
positions and frequencies, but not docids.
Is the reasoning here that a) since docids are delta-encoded already,
outliers/exceptions will be less likely/beneficial, and b) FOR allows for
an optimization in decoding the deltas (via ForUtil#decodeAndPrefixSum)
which can't be utilized with PFOR, since the exceptions must be patched in
before decoding deltas? Are there other reasons FOR continues to be used for
docids that I'm overlooking?
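
To make (a) and (b) concrete, here is a rough sketch of how I'm picturing the
trade-off. It is not Lucene's actual code: the class and method names are
made up for illustration, the block is only 8 values, and the cost of 16 bits
per exception is an assumption rather than something taken from the codec.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Lucene.
final class ForVsPforSketch {

    /** Bits needed to hold v; at least 1 so an all-zero block still has a width. */
    static int bitsRequired(int v) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(v));
    }

    /** FOR: every delta in the block is packed at the width of the largest delta. */
    static int forBits(int[] deltas) {
        int max = 0;
        for (int d : deltas) max = Math.max(max, d);
        return deltas.length * bitsRequired(max);
    }

    /**
     * Simplified PFOR: pick the width that fits all but the maxExceptions largest
     * deltas, then pay an assumed 16 bits per value that overflows that width.
     * Requires maxExceptions < deltas.length.
     */
    static int pforBits(int[] deltas, int maxExceptions) {
        int[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int width = bitsRequired(sorted[sorted.length - 1 - maxExceptions]);
        int exceptions = 0;
        for (int d : deltas) if (bitsRequired(d) > width) exceptions++;
        return deltas.length * width + exceptions * 16;
    }

    /** FOR decode path: the unpacked deltas can be prefix-summed straight away. */
    static int[] prefixSum(int[] deltas, int base) {
        int[] docIds = new int[deltas.length];
        int sum = base;
        for (int i = 0; i < deltas.length; i++) {
            sum += deltas[i];
            docIds[i] = sum;
        }
        return docIds;
    }

    /** PFOR decode path: high bits of exceptions are patched in before the prefix sum. */
    static int[] patchThenPrefixSum(int[] lowBits, int width,
                                    int[] excIndex, int[] excHigh, int base) {
        int[] deltas = lowBits.clone();
        for (int i = 0; i < excIndex.length; i++) {
            deltas[excIndex[i]] |= excHigh[i] << width;
        }
        return prefixSum(deltas, base);
    }

    public static void main(String[] args) {
        int[] deltas = {3, 1, 4, 1, 5, 900, 2, 6}; // one outlier delta
        System.out.println("FOR bits:  " + forBits(deltas));
        System.out.println("PFOR bits: " + pforBits(deltas, 1));

        // Same block as PFOR would see it: low 3 bits packed, high bits of 900 kept aside.
        int[] low = {3, 1, 4, 1, 5, 900 & 7, 2, 6};
        int[] viaPfor = patchThenPrefixSum(low, 3, new int[] {5}, new int[] {900 >>> 3}, 100);
        System.out.println("same docids: " + Arrays.equals(viaPfor, prefixSum(deltas, 100)));
    }
}
```

With this toy block, the single outlier forces FOR to 10 bits per value (80
bits total), while the simplified PFOR packs at 3 bits plus one exception (40
bits total), at the price of the extra patching pass before the prefix sum.
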
I'm curious as I recently ran some internal benchmarks on the Amazon
product search engine replacing FOR with PFOR for docid delta encoding,
and saw an index size reduction of 0.93% while also improving our red-line
queries/sec by 1.0%. I expected the index size reduction but wasn't
expecting to see a QPS improvement, which I haven't yet been able to
explain. I'm wondering if there are some good reasons to keep using FOR for
docids, or if there'd be any appetite to discuss using PFOR for everything?
Again, apologies if I've overlooked some past discussion in my digging. Any
history/context is much appreciated!
Cheers,
-Greg