Mailing List Archive

Configurable Postings Block Size?
Hi folks!

I've been a bit curious to test out different block size configurations in
the Lucene postings list format, but thought I'd reach out to the community
here first to see what work may have gone into this previously. I'm
essentially interested in benchmarking different block size configurations
on the real-world application of Lucene I'm working on.

If my understanding of the code is correct, we're currently encoding
compressed runs of 128 docs per block, relying on ForUtil for
encoding/decoding purposes. It looks like we define this in
ForUtil#BLOCK_SIZE (and reference it in a few external classes), but I also
know that it's not as simple as just changing that one definition: it
appears much of the logic in ForUtil relies on the assumption of 128
docs per block.
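
(For concreteness, here's a toy sketch of what I mean by fixed-size FOR
blocks. This is not the actual ForUtil code, which is far more optimized;
the class and names below are just mine for illustration.)

    // Toy frame-of-reference packer: every block is exactly BLOCK_SIZE values,
    // packed at the minimum bits-per-value for that block. Illustration only.
    final class ToyForBlock {
      static final int BLOCK_SIZE = 128;

      /** Pack BLOCK_SIZE non-negative ints; slot 0 of the result holds bits-per-value. */
      static long[] encode(int[] block) {
        if (block.length != BLOCK_SIZE) {
          throw new IllegalArgumentException("expected exactly " + BLOCK_SIZE + " values");
        }
        int orOfValues = 0;
        for (int v : block) {
          orOfValues |= v;
        }
        int bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(orOfValues));
        long[] packed = new long[(BLOCK_SIZE * bitsPerValue + 63) / 64 + 1];
        packed[0] = bitsPerValue;
        long bitPos = 0;
        for (int v : block) {
          int word = (int) (bitPos >>> 6) + 1;
          int shift = (int) (bitPos & 63);
          packed[word] |= ((long) v) << shift;
          if (shift + bitsPerValue > 64) { // value straddles two longs
            packed[word + 1] |= ((long) v) >>> (64 - shift);
          }
          bitPos += bitsPerValue;
        }
        return packed;
      }

      static int[] decode(long[] packed) {
        int bitsPerValue = (int) packed[0]; // always <= 31 here
        long mask = (1L << bitsPerValue) - 1;
        int[] block = new int[BLOCK_SIZE];
        long bitPos = 0;
        for (int i = 0; i < BLOCK_SIZE; i++) {
          int word = (int) (bitPos >>> 6) + 1;
          int shift = (int) (bitPos & 63);
          long v = packed[word] >>> shift;
          if (shift + bitsPerValue > 64) {
            v |= packed[word + 1] << (64 - shift);
          }
          block[i] = (int) (v & mask);
          bitPos += bitsPerValue;
        }
        return block;
      }
    }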

I'm toying with the idea of making ForUtil a bit more flexible to allow
different block sizes to be tested for the benchmarking I'd like to run, but
the class looks heavily optimized to generate SIMD instructions (I think?),
so that might be folly. Before I start hacking on a local branch to see what
I can learn, is there any prior work that might be useful to be aware of?
Anyone gone down this path and have some learnings to share? Any thoughts
would be much appreciated!

Cheers,
-Greg
Re: Configurable Postings Block Size?
If you want to test a different block size (say 64 or 256), I really
recommend just forking a separate codec for the experiment.

There will likely be higher-level changes you need to make, not just
changing a number. For example, if you just increased this number to 256
without doing anything else, I wouldn't be surprised if you saw worse
performance. More of the postings would be vint-encoded than before with
128, which might have some consequences. The skipdata layout might be
inappropriate; these things are optimized for blocks of 128.

Just in general, I recommend making a codec for the benchmarking
experiments; tools like luceneutil support comparing codecs against each
other anyway, so you can easily compare fairly against the existing codec.
Also, it should be much easier/faster to just make a new codec and adapt it
to test what you want!
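
Roughly the shape of the fork I mean (a sketch only, assuming recent Lucene
APIs; "Experimental128PostingsFormat" is a hypothetical name for your copy
of the default postings format with whatever block size you're testing):

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.PostingsFormat;

    // Delegates everything to the current default codec except postings,
    // which come from your forked (hypothetical) postings format.
    public class ExperimentalBlockSizeCodec extends FilterCodec {
      private final PostingsFormat postings = new Experimental128PostingsFormat();

      public ExperimentalBlockSizeCodec() {
        super("ExperimentalBlockSizeCodec", Codec.getDefault());
      }

      @Override
      public PostingsFormat postingsFormat() {
        return postings;
      }
    }

Set it via IndexWriterConfig#setCodec for indexing, and register it through
SPI (META-INF/services/org.apache.lucene.codecs.Codec) so the index can be
opened again for searching.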

I think it is an antipattern to make stuff within the codec "flexible";
it is autogenerated decompression code :) I am concerned such "flexibility"
would create barriers to future optimizations. For example, we should be
able to experiment with converting this compression code over to the
explicit vector API in Java.
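
(To give a flavor of that direction: the incubating jdk.incubator.vector API
looks roughly like the toy below. It needs --add-modules jdk.incubator.vector
on a recent JDK and is obviously not the real decode loop, just an
illustration.)

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorSpecies;

    final class VectorizedAddExample {
      private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

      /** Add a base value back onto every element of a decoded block. */
      static void addBase(int[] block, int base) {
        int i = 0;
        for (; i <= block.length - SPECIES.length(); i += SPECIES.length()) {
          IntVector v = IntVector.fromArray(SPECIES, block, i);
          v.add(base).intoArray(block, i);
        }
        for (; i < block.length; i++) { // scalar tail
          block[i] += base;
        }
      }
    }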

On Sat, Feb 27, 2021 at 4:29 PM Greg Miller <gsmiller@gmail.com> wrote:

> Hi folks!
>
> I've been a bit curious to test out different block size configurations in
> the Lucene postings list format, but thought I'd reach out to the community
> here first to see what work may have gone into this previously. I'm
> essentially interested in benchmarking different block size configurations
> on the real-world application of Lucene I'm working on.
>
> If my understanding of the code is correct, I know we're currently
> encoding compressed runs of 128 docs per block, relying on ForUtil for
> encoding/decoding purposes. It looks like we define this in
> ForUtil#BLOCK_SIZE (and reference it in a few external classes), but also
> know that it's not as simple as just changing that one definition. It
> appears much of the logic in ForUtil relies on the assumption of 128
> docs-per-block.
>
> I'm toying with the idea of making ForUtil a bit more flexible to allow
> for different block sizes to be tested in order to run the benchmarking I'd
> like to run, but the class looks heavily optimized to generate SIMD
> instructions (I think?), so that might be folly. Before I start hacking on
> a local branch to see what I can learn, is there any prior work that might
> be useful to be aware of? Anyone gone down this path and have some
> learnings to share? Any thoughts would be much appreciated!
>
> Cheers,
> -Greg
>
Re: Configurable Postings Block Size?
Thanks for the feedback, Robert; makes sense to me. I'll tinker with a
forked codec and see if the experimentation produces anything interesting.

When you mention "autogenerated decompression code", do you mean that some
of this code is actually being generated?

Cheers,
-Greg

On Sun, Feb 28, 2021 at 5:05 AM Robert Muir <rcmuir@gmail.com> wrote:

> If you want to test a different block size (say 64 or 256), I really
> recommend to just fork a different codec for the experiment.
>
> There will likely be higher level changes you need to make, not just
> changing a number. For example if you just increased this number to 256
> without doing anything else, I wouldn't be surprised if you see worse
> performance. More of the postings would be vint-encoded than before with
> 128, which might have some consequences. skipdata layout might be
> inappropriate, these things are optimized for blocks of 128.
>
> Just in general, I recommend making a codec for the benchmarking
> experiments, tools like luceneutil support comparing codecs against each
> other anyway so you can easily compare fairly against the existing codec.
> Also, it should be much easier/faster to just make a new codec and adapt it
> to test what you want!
>
> I think it is an antipattern to make stuff within the codec "flexible", it
> is autogenerated decompression code :) I am concerned such "flexibility"
> would create barriers in the future to optimizations. For example we should
> be able to experiment with converting this compression code over to
> explicit vector API in java.
>
> On Sat, Feb 27, 2021 at 4:29 PM Greg Miller <gsmiller@gmail.com> wrote:
>
>> Hi folks!
>>
>> I've been a bit curious to test out different block size configurations
>> in the Lucene postings list format, but thought I'd reach out to the
>> community here first to see what work may have gone into this previously.
>> I'm essentially interested in benchmarking different block size
>> configurations on the real-world application of Lucene I'm working on.
>>
>> If my understanding of the code is correct, I know we're currently
>> encoding compressed runs of 128 docs per block, relying on ForUtil for
>> encoding/decoding purposes. It looks like we define this in
>> ForUtil#BLOCK_SIZE (and reference it in a few external classes), but also
>> know that it's not as simple as just changing that one definition. It
>> appears much of the logic in ForUtil relies on the assumption of 128
>> docs-per-block.
>>
>> I'm toying with the idea of making ForUtil a bit more flexible to allow
>> for different block sizes to be tested in order to run the benchmarking I'd
>> like to run, but the class looks heavily optimized to generate SIMD
>> instructions (I think?), so that might be folly. Before I start hacking on
>> a local branch to see what I can learn, is there any prior work that might
>> be useful to be aware of? Anyone gone down this path and have some
>> learnings to share? Any thoughts would be much appreciated!
>>
>> Cheers,
>> -Greg
>>
>
Re: Configurable Postings Block Size?
Yeah, have a look at gen_ForUtil.py

On Mon, Mar 1, 2021 at 1:05 PM Greg Miller <gsmiller@gmail.com> wrote:

> Thanks for the feedback Robert; makes sense to me. I'll tinker with a
> forked codec and see if the experimentation produces anything interesting.
>
> When you mention "autogenerated decompression code", do you mean that
> some of this code is actually being generated?
>
> Cheers,
> -Greg
>
> On Sun, Feb 28, 2021 at 5:05 AM Robert Muir <rcmuir@gmail.com> wrote:
>
>> If you want to test a different block size (say 64 or 256), I really
>> recommend to just fork a different codec for the experiment.
>>
>> There will likely be higher level changes you need to make, not just
>> changing a number. For example if you just increased this number to 256
>> without doing anything else, I wouldn't be surprised if you see worse
>> performance. More of the postings would be vint-encoded than before with
>> 128, which might have some consequences. skipdata layout might be
>> inappropriate, these things are optimized for blocks of 128.
>>
>> Just in general, I recommend making a codec for the benchmarking
>> experiments, tools like luceneutil support comparing codecs against each
>> other anyway so you can easily compare fairly against the existing codec.
>> Also, it should be much easier/faster to just make a new codec and adapt it
>> to test what you want!
>>
>> I think it is an antipattern to make stuff within the codec "flexible",
>> it is autogenerated decompression code :) I am concerned such "flexibility"
>> would create barriers in the future to optimizations. For example we should
>> be able to experiment with converting this compression code over to
>> explicit vector API in java.
>>
>> On Sat, Feb 27, 2021 at 4:29 PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>>> Hi folks!
>>>
>>> I've been a bit curious to test out different block size configurations
>>> in the Lucene postings list format, but thought I'd reach out to the
>>> community here first to see what work may have gone into this previously.
>>> I'm essentially interested in benchmarking different block size
>>> configurations on the real-world application of Lucene I'm working on.
>>>
>>> If my understanding of the code is correct, I know we're currently
>>> encoding compressed runs of 128 docs per block, relying on ForUtil for
>>> encoding/decoding purposes. It looks like we define this in
>>> ForUtil#BLOCK_SIZE (and reference it in a few external classes), but also
>>> know that it's not as simple as just changing that one definition. It
>>> appears much of the logic in ForUtil relies on the assumption of 128
>>> docs-per-block.
>>>
>>> I'm toying with the idea of making ForUtil a bit more flexible to allow
>>> for different block sizes to be tested in order to run the benchmarking I'd
>>> like to run, but the class looks heavily optimized to generate SIMD
>>> instructions (I think?), so that might be folly. Before I start hacking on
>>> a local branch to see what I can learn, is there any prior work that might
>>> be useful to be aware of? Anyone gone down this path and have some
>>> learnings to share? Any thoughts would be much appreciated!
>>>
>>> Cheers,
>>> -Greg
>>>
>>
Re: Configurable Postings Block Size?
Oh, got it. This is great, thanks!

Cheers,
-Greg

On Mon, Mar 1, 2021 at 11:28 AM Robert Muir <rcmuir@gmail.com> wrote:

> Yeah, have a look at gen_ForUtil.py
>
> On Mon, Mar 1, 2021 at 1:05 PM Greg Miller <gsmiller@gmail.com> wrote:
>
>> Thanks for the feedback Robert; makes sense to me. I'll tinker with a
>> forked codec and see if the experimentation produces anything interesting.
>>
>> When you mention "autogenerated decompression code", do you mean that
>> some of this code is actually being generated?
>>
>> Cheers,
>> -Greg
>>
>> On Sun, Feb 28, 2021 at 5:05 AM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> If you want to test a different block size (say 64 or 256), I really
>>> recommend to just fork a different codec for the experiment.
>>>
>>> There will likely be higher level changes you need to make, not just
>>> changing a number. For example if you just increased this number to 256
>>> without doing anything else, I wouldn't be surprised if you see worse
>>> performance. More of the postings would be vint-encoded than before with
>>> 128, which might have some consequences. skipdata layout might be
>>> inappropriate, these things are optimized for blocks of 128.
>>>
>>> Just in general, I recommend making a codec for the benchmarking
>>> experiments, tools like luceneutil support comparing codecs against each
>>> other anyway so you can easily compare fairly against the existing codec.
>>> Also, it should be much easier/faster to just make a new codec and adapt it
>>> to test what you want!
>>>
>>> I think it is an antipattern to make stuff within the codec "flexible",
>>> it is autogenerated decompression code :) I am concerned such "flexibility"
>>> would create barriers in the future to optimizations. For example we should
>>> be able to experiment with converting this compression code over to
>>> explicit vector API in java.
>>>
>>> On Sat, Feb 27, 2021 at 4:29 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>
>>>> Hi folks!
>>>>
>>>> I've been a bit curious to test out different block size configurations
>>>> in the Lucene postings list format, but thought I'd reach out to the
>>>> community here first to see what work may have gone into this previously.
>>>> I'm essentially interested in benchmarking different block size
>>>> configurations on the real-world application of Lucene I'm working on.
>>>>
>>>> If my understanding of the code is correct, I know we're currently
>>>> encoding compressed runs of 128 docs per block, relying on ForUtil for
>>>> encoding/decoding purposes. It looks like we define this in
>>>> ForUtil#BLOCK_SIZE (and reference it in a few external classes), but also
>>>> know that it's not as simple as just changing that one definition. It
>>>> appears much of the logic in ForUtil relies on the assumption of 128
>>>> docs-per-block.
>>>>
>>>> I'm toying with the idea of making ForUtil a bit more flexible to allow
>>>> for different block sizes to be tested in order to run the benchmarking I'd
>>>> like to run, but the class looks heavily optimized to generate SIMD
>>>> instructions (I think?), so that might be folly. Before I start hacking on
>>>> a local branch to see what I can learn, is there any prior work that might
>>>> be useful to be aware of? Anyone gone down this path and have some
>>>> learnings to share? Any thoughts would be much appreciated!
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>
Re: Configurable Postings Block Size?
So, slightly different topic, maybe, but related so tacking onto this
thread...

While tweaking ForUtil locally to experiment with different block sizes, I
realized that PForUtil encodes the offset for each "patch" using a single
byte, which implies a strict upper limit of 256 on the BLOCK_SIZE defined
in ForUtil. This failed essentially silently on me when I was trying to set
up blocks of 512. The unit tests caught it since the results were incorrect
after encoding/decoding with PForUtil (hooray!), but it would have been
nice to have an assert somewhere guarding against this to make matters a
little more explicit.

While I realize the likelihood of changing the block size in ForUtil may be
low for now, tossing in an assert is such a small, easy change that it seems
worthwhile. What do you all think? Worth opening a minor issue for this and
putting in a one-liner?
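
(To be concrete, roughly this kind of thing; a toy sketch, not the real
PForUtil code, and the names below are mine:)

    // Why a one-byte patch offset caps the block size at 256, and where an
    // assert would have made the failure explicit.
    final class ToyPatchEncoder {
      static final int BLOCK_SIZE = 128; // i.e., whatever ForUtil#BLOCK_SIZE is

      static int writePatchOffsets(int[] exceptionOffsets, byte[] out, int outPos) {
        // A single byte can only address positions 0..255 within the block;
        // anything larger wraps silently, which is exactly what bit me at 512.
        assert BLOCK_SIZE <= 256 : "patch offsets are encoded in a single byte";
        for (int offset : exceptionOffsets) {
          out[outPos++] = (byte) offset;
        }
        return outPos;
      }
    }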

Cheers,
-Greg

On Mon, Mar 1, 2021 at 11:30 AM Greg Miller <gsmiller@gmail.com> wrote:

> Oh, got it. This is great, thanks!
>
> Cheers,
> -Greg
>
> On Mon, Mar 1, 2021 at 11:28 AM Robert Muir <rcmuir@gmail.com> wrote:
>
>> Yeah, have a look at gen_ForUtil.py
>>
>> On Mon, Mar 1, 2021 at 1:05 PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>>> Thanks for the feedback Robert; makes sense to me. I'll tinker with a
>>> forked codec and see if the experimentation produces anything interesting.
>>>
>>> When you mention "autogenerated decompression code", do you mean that
>>> some of this code is actually being generated?
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Sun, Feb 28, 2021 at 5:05 AM Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>>> If you want to test a different block size (say 64 or 256), I really
>>>> recommend to just fork a different codec for the experiment.
>>>>
>>>> There will likely be higher level changes you need to make, not just
>>>> changing a number. For example if you just increased this number to 256
>>>> without doing anything else, I wouldn't be surprised if you see worse
>>>> performance. More of the postings would be vint-encoded than before with
>>>> 128, which might have some consequences. skipdata layout might be
>>>> inappropriate, these things are optimized for blocks of 128.
>>>>
>>>> Just in general, I recommend making a codec for the benchmarking
>>>> experiments, tools like luceneutil support comparing codecs against each
>>>> other anyway so you can easily compare fairly against the existing codec.
>>>> Also, it should be much easier/faster to just make a new codec and adapt it
>>>> to test what you want!
>>>>
>>>> I think it is an antipattern to make stuff within the codec "flexible",
>>>> it is autogenerated decompression code :) I am concerned such "flexibility"
>>>> would create barriers in the future to optimizations. For example we should
>>>> be able to experiment with converting this compression code over to
>>>> explicit vector API in java.
>>>>
>>>> On Sat, Feb 27, 2021 at 4:29 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>>
>>>>> Hi folks!
>>>>>
>>>>> I've been a bit curious to test out different block size
>>>>> configurations in the Lucene postings list format, but thought I'd reach
>>>>> out to the community here first to see what work may have gone into this
>>>>> previously. I'm essentially interested in benchmarking different block size
>>>>> configurations on the real-world application of Lucene I'm working on.
>>>>>
>>>>> If my understanding of the code is correct, I know we're currently
>>>>> encoding compressed runs of 128 docs per block, relying on ForUtil for
>>>>> encoding/decoding purposes. It looks like we define this in
>>>>> ForUtil#BLOCK_SIZE (and reference it in a few external classes), but also
>>>>> know that it's not as simple as just changing that one definition. It
>>>>> appears much of the logic in ForUtil relies on the assumption of 128
>>>>> docs-per-block.
>>>>>
>>>>> I'm toying with the idea of making ForUtil a bit more flexible to
>>>>> allow for different block sizes to be tested in order to run the
>>>>> benchmarking I'd like to run, but the class looks heavily optimized to
>>>>> generate SIMD instructions (I think?), so that might be folly. Before I
>>>>> start hacking on a local branch to see what I can learn, is there any prior
>>>>> work that might be useful to be aware of? Anyone gone down this path and
>>>>> have some learnings to share? Any thoughts would be much appreciated!
>>>>>
>>>>> Cheers,
>>>>> -Greg
>>>>>
>>>>
Re: Configurable Postings Block Size?
I think it's a good idea, especially if the assert can be in a good place
(ideally a not-so-hot place, e.g. the encoding/patching code). Asserts have
some cost for this kind of code even when disabled; bytecode size factors
into the compiler's inlining thresholds and such.

On Wed, Mar 3, 2021 at 9:05 PM Greg Miller <gsmiller@gmail.com> wrote:

> So, slightly different topic, maybe, but related so tacking onto this
> thread...
>
> While tweaking ForUtil locally to experiment with different block sizes, I
> realized that PForUtil encodes the offset for each "patch" using a single
> byte, which implies a strict upper limit of 256 on the BLOCK_SIZE defined
> in ForUtil. This essentially silently failed on me when I was trying to set
> up blocks of 512. The unit tests caught it since the results were incorrect
> after encoding/decoding with PForUtil (hooray!), but it would have been
> nice to have an assert somewhere guarding for this to make matters a little
> more explicit.
>
> While I realize that the likelihood of changing the blockside in ForUtil
> may be low for now, it seems like such a small, easy change to toss an
> assert in that it seems useful. What do you all think? Worth opening a
> minor issue for this and putting in a one-liner?
>
> Cheers,
> -Greg
>
> On Mon, Mar 1, 2021 at 11:30 AM Greg Miller <gsmiller@gmail.com> wrote:
>
>> Oh, got it. This is great, thanks!
>>
>> Cheers,
>> -Greg
>>
>> On Mon, Mar 1, 2021 at 11:28 AM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> Yeah, have a look at gen_ForUtil.py
>>>
>>> On Mon, Mar 1, 2021 at 1:05 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>
>>>> Thanks for the feedback Robert; makes sense to me. I'll tinker with a
>>>> forked codec and see if the experimentation produces anything interesting.
>>>>
>>>> When you mention "autogenerated decompression code", do you mean that
>>>> some of this code is actually being generated?
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>> On Sun, Feb 28, 2021 at 5:05 AM Robert Muir <rcmuir@gmail.com> wrote:
>>>>
>>>>> If you want to test a different block size (say 64 or 256), I really
>>>>> recommend to just fork a different codec for the experiment.
>>>>>
>>>>> There will likely be higher level changes you need to make, not just
>>>>> changing a number. For example if you just increased this number to 256
>>>>> without doing anything else, I wouldn't be surprised if you see worse
>>>>> performance. More of the postings would be vint-encoded than before with
>>>>> 128, which might have some consequences. skipdata layout might be
>>>>> inappropriate, these things are optimized for blocks of 128.
>>>>>
>>>>> Just in general, I recommend making a codec for the benchmarking
>>>>> experiments, tools like luceneutil support comparing codecs against each
>>>>> other anyway so you can easily compare fairly against the existing codec.
>>>>> Also, it should be much easier/faster to just make a new codec and adapt it
>>>>> to test what you want!
>>>>>
>>>>> I think it is an antipattern to make stuff within the codec
>>>>> "flexible", it is autogenerated decompression code :) I am concerned such
>>>>> "flexibility" would create barriers in the future to optimizations. For
>>>>> example we should be able to experiment with converting this compression
>>>>> code over to explicit vector API in java.
>>>>>
>>>>> On Sat, Feb 27, 2021 at 4:29 PM Greg Miller <gsmiller@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi folks!
>>>>>>
>>>>>> I've been a bit curious to test out different block size
>>>>>> configurations in the Lucene postings list format, but thought I'd reach
>>>>>> out to the community here first to see what work may have gone into this
>>>>>> previously. I'm essentially interested in benchmarking different block size
>>>>>> configurations on the real-world application of Lucene I'm working on.
>>>>>>
>>>>>> If my understanding of the code is correct, I know we're currently
>>>>>> encoding compressed runs of 128 docs per block, relying on ForUtil for
>>>>>> encoding/decoding purposes. It looks like we define this in
>>>>>> ForUtil#BLOCK_SIZE (and reference it in a few external classes), but also
>>>>>> know that it's not as simple as just changing that one definition. It
>>>>>> appears much of the logic in ForUtil relies on the assumption of 128
>>>>>> docs-per-block.
>>>>>>
>>>>>> I'm toying with the idea of making ForUtil a bit more flexible to
>>>>>> allow for different block sizes to be tested in order to run the
>>>>>> benchmarking I'd like to run, but the class looks heavily optimized to
>>>>>> generate SIMD instructions (I think?), so that might be folly. Before I
>>>>>> start hacking on a local branch to see what I can learn, is there any prior
>>>>>> work that might be useful to be aware of? Anyone gone down this path and
>>>>>> have some learnings to share? Any thoughts would be much appreciated!
>>>>>>
>>>>>> Cheers,
>>>>>> -Greg
>>>>>>
>>>>>
Re: Configurable Postings Block Size?
Thanks, Robert. I've created
https://issues.apache.org/jira/browse/LUCENE-9822 and will attach a patch
shortly.

Cheers,
-Greg

On Wed, Mar 3, 2021 at 6:21 PM Robert Muir <rcmuir@gmail.com> wrote:

> I think its a good idea, especially if the assert can be in a good place
> (ideally a not-so-hot place, e.g. encoding, patching code). asserts have
> some costs for this kind of code even when disabled, bytecode count limits
> are used for compiler threshold and stuff.
>
> On Wed, Mar 3, 2021 at 9:05 PM Greg Miller <gsmiller@gmail.com> wrote:
>
>> So, slightly different topic, maybe, but related so tacking onto this
>> thread...
>>
>> While tweaking ForUtil locally to experiment with different block sizes,
>> I realized that PForUtil encodes the offset for each "patch" using a single
>> byte, which implies a strict upper limit of 256 on the BLOCK_SIZE defined
>> in ForUtil. This essentially silently failed on me when I was trying to set
>> up blocks of 512. The unit tests caught it since the results were incorrect
>> after encoding/decoding with PForUtil (hooray!), but it would have been
>> nice to have an assert somewhere guarding for this to make matters a little
>> more explicit.
>>
>> While I realize that the likelihood of changing the blockside in ForUtil
>> may be low for now, it seems like such a small, easy change to toss an
>> assert in that it seems useful. What do you all think? Worth opening a
>> minor issue for this and putting in a one-liner?
>>
>> Cheers,
>> -Greg
>>
>> On Mon, Mar 1, 2021 at 11:30 AM Greg Miller <gsmiller@gmail.com> wrote:
>>
>>> Oh, got it. This is great, thanks!
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Mon, Mar 1, 2021 at 11:28 AM Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>>> Yeah, have a look at gen_ForUtil.py
>>>>
>>>> On Mon, Mar 1, 2021 at 1:05 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>>
>>>>> Thanks for the feedback Robert; makes sense to me. I'll tinker with a
>>>>> forked codec and see if the experimentation produces anything interesting.
>>>>>
>>>>> When you mention "autogenerated decompression code", do you mean that
>>>>> some of this code is actually being generated?
>>>>>
>>>>> Cheers,
>>>>> -Greg
>>>>>
>>>>> On Sun, Feb 28, 2021 at 5:05 AM Robert Muir <rcmuir@gmail.com> wrote:
>>>>>
>>>>>> If you want to test a different block size (say 64 or 256), I really
>>>>>> recommend to just fork a different codec for the experiment.
>>>>>>
>>>>>> There will likely be higher level changes you need to make, not just
>>>>>> changing a number. For example if you just increased this number to 256
>>>>>> without doing anything else, I wouldn't be surprised if you see worse
>>>>>> performance. More of the postings would be vint-encoded than before with
>>>>>> 128, which might have some consequences. skipdata layout might be
>>>>>> inappropriate, these things are optimized for blocks of 128.
>>>>>>
>>>>>> Just in general, I recommend making a codec for the benchmarking
>>>>>> experiments, tools like luceneutil support comparing codecs against each
>>>>>> other anyway so you can easily compare fairly against the existing codec.
>>>>>> Also, it should be much easier/faster to just make a new codec and adapt it
>>>>>> to test what you want!
>>>>>>
>>>>>> I think it is an antipattern to make stuff within the codec
>>>>>> "flexible", it is autogenerated decompression code :) I am concerned such
>>>>>> "flexibility" would create barriers in the future to optimizations. For
>>>>>> example we should be able to experiment with converting this compression
>>>>>> code over to explicit vector API in java.
>>>>>>
>>>>>> On Sat, Feb 27, 2021 at 4:29 PM Greg Miller <gsmiller@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi folks!
>>>>>>>
>>>>>>> I've been a bit curious to test out different block size
>>>>>>> configurations in the Lucene postings list format, but thought I'd reach
>>>>>>> out to the community here first to see what work may have gone into this
>>>>>>> previously. I'm essentially interested in benchmarking different block size
>>>>>>> configurations on the real-world application of Lucene I'm working on.
>>>>>>>
>>>>>>> If my understanding of the code is correct, I know we're currently
>>>>>>> encoding compressed runs of 128 docs per block, relying on ForUtil for
>>>>>>> encoding/decoding purposes. It looks like we define this in
>>>>>>> ForUtil#BLOCK_SIZE (and reference it in a few external classes), but also
>>>>>>> know that it's not as simple as just changing that one definition. It
>>>>>>> appears much of the logic in ForUtil relies on the assumption of 128
>>>>>>> docs-per-block.
>>>>>>>
>>>>>>> I'm toying with the idea of making ForUtil a bit more flexible to
>>>>>>> allow for different block sizes to be tested in order to run the
>>>>>>> benchmarking I'd like to run, but the class looks heavily optimized to
>>>>>>> generate SIMD instructions (I think?), so that might be folly. Before I
>>>>>>> start hacking on a local branch to see what I can learn, is there any prior
>>>>>>> work that might be useful to be aware of? Anyone gone down this path and
>>>>>>> have some learnings to share? Any thoughts would be much appreciated!
>>>>>>>
>>>>>>> Cheers,
>>>>>>> -Greg
>>>>>>>
>>>>>>
Re: Configurable Postings Block Size?
I did end up internally benchmarking a few different FOR block sizes and
wanted to circle back here with the results in case anyone else was
curious. Maybe these results will be useful to others. Or maybe someone
will spot something interesting here that I overlooked. The tl;dr is that
the current block size (128) seems to perform the best for us.

These results were all from an internal benchmarking tool we use against
Amazon's product search engine. For my methodology, I directly modified
ForUtil on my local branch to work on different block sizes (64, 256 and
512 in addition to the default 128). Because ForUtil is common to both
ForDeltaUtil and PForUtil, and there's an interesting interaction between
the PFOR "patched exception" count and the block size (changing only the
block size without changing the exception count skews the ratio), I took the
"P" out of PForUtil for testing (i.e., set exceptions to zero to make this
just vanilla FOR). Maybe this methodology is flawed, but it was a start. I'm
highlighting the impact on index size, red-line queries/sec, GC time and
avg. latency since those were the most interesting. I hope this table is
moderately readable...

baseline | candidate | index size impact | red-line qps impact | gc time impact | avg. latency impact
---------|-----------|-------------------|---------------------|----------------|---------------------
     128 |        64 |                0% |               -1.5% |         -4.88% |              +0.96%
     128 |       256 |            +0.42% |              -0.73% |         +4.53% |              +0.49%
     256 |       512 |            +0.58% |              -1.17% |         +4.67% |              +1.53%

It makes sense that increasing the block size would cause the index size to
increase. Especially without any exceptions, the number of bits-per-value
needed in each block would be expected to increase. I'm also not surprised
that GC time increases with block size since decoding each block will
create more garbage.

I did find it a bit surprising that our red-line qps and avg. latency both
regressed when shrinking the block size to 64. Our searcher does a lot of
conjunctive querying, so I had hypothesized that we'd see better qps and
latency by shrinking the block size given that we should be doing a lot of
skipping. I'm not sure what's happening here. One speculation is that we
might be "skipping" to adjacent blocks more often than I'd guess. When that
happens, we'd essentially decode a block of 64 then turn right around and
decode the next block of 64. In these cases, I expect it's more efficient
to decode all 128 in one shot. I don't have any metrics to support this
though, so just throwing out a guess. If I do end up instrumenting some
metrics to support or refute this, I'll report back for those curious.
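
(The instrumentation I have in mind is something like the counter below,
hooked into the forked postings reader wherever it refills a block;
completely hypothetical, just to show what I'd measure.)

    // Classify each block refill by how far it jumped from the previous block.
    final class SkipDistanceStats {
      long adjacentRefills; // the next block immediately follows the current one
      long farRefills;      // we skipped over at least one whole block

      void onRefill(long previousBlockIndex, long newBlockIndex) {
        if (newBlockIndex == previousBlockIndex + 1) {
          adjacentRefills++;
        } else {
          farRefills++;
        }
      }
    }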

I also tested the impact of keeping the block size at 128 but just removing
the three PFOR "exceptions" (so basically testing PFOR vs. FOR for the same
block size). This was also curious. The index size increased by 0.66%
(makes sense), but red-line qps dropped by 0.58%. I would have expected
red-line qps to improve slightly by moving from PFOR to FOR since the
"patching" should be relatively expensive. Maybe the slightly more
compressed data resulting from PFOR has some other advantages? Dunno.
Another curiosity for the time being.

Cheers,
-Greg

On Thu, Mar 4, 2021 at 6:22 AM Greg Miller <gsmiller@gmail.com> wrote:

> Thanks Robert. I've created
> https://issues.apache.org/jira/browse/LUCENE-9822 and will attach a patch
> shortly.
>
> Cheers,
> -Greg
>
> On Wed, Mar 3, 2021 at 6:21 PM Robert Muir <rcmuir@gmail.com> wrote:
>
>> I think its a good idea, especially if the assert can be in a good place
>> (ideally a not-so-hot place, e.g. encoding, patching code). asserts have
>> some costs for this kind of code even when disabled, bytecode count limits
>> are used for compiler threshold and stuff.
>>
>> On Wed, Mar 3, 2021 at 9:05 PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>>> So, slightly different topic, maybe, but related so tacking onto this
>>> thread...
>>>
>>> While tweaking ForUtil locally to experiment with different block sizes,
>>> I realized that PForUtil encodes the offset for each "patch" using a single
>>> byte, which implies a strict upper limit of 256 on the BLOCK_SIZE defined
>>> in ForUtil. This essentially silently failed on me when I was trying to set
>>> up blocks of 512. The unit tests caught it since the results were incorrect
>>> after encoding/decoding with PForUtil (hooray!), but it would have been
>>> nice to have an assert somewhere guarding for this to make matters a little
>>> more explicit.
>>>
>>> While I realize that the likelihood of changing the blockside in ForUtil
>>> may be low for now, it seems like such a small, easy change to toss an
>>> assert in that it seems useful. What do you all think? Worth opening a
>>> minor issue for this and putting in a one-liner?
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Mon, Mar 1, 2021 at 11:30 AM Greg Miller <gsmiller@gmail.com> wrote:
>>>
>>>> Oh, got it. This is great, thanks!
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>> On Mon, Mar 1, 2021 at 11:28 AM Robert Muir <rcmuir@gmail.com> wrote:
>>>>
>>>>> Yeah, have a look at gen_ForUtil.py
>>>>>
>>>>> On Mon, Mar 1, 2021 at 1:05 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the feedback Robert; makes sense to me. I'll tinker with a
>>>>>> forked codec and see if the experimentation produces anything interesting.
>>>>>>
>>>>>> When you mention "autogenerated decompression code", do you mean
>>>>>> that some of this code is actually being generated?
>>>>>>
>>>>>> Cheers,
>>>>>> -Greg
>>>>>>
>>>>>> On Sun, Feb 28, 2021 at 5:05 AM Robert Muir <rcmuir@gmail.com> wrote:
>>>>>>
>>>>>>> If you want to test a different block size (say 64 or 256), I really
>>>>>>> recommend to just fork a different codec for the experiment.
>>>>>>>
>>>>>>> There will likely be higher level changes you need to make, not just
>>>>>>> changing a number. For example if you just increased this number to 256
>>>>>>> without doing anything else, I wouldn't be surprised if you see worse
>>>>>>> performance. More of the postings would be vint-encoded than before with
>>>>>>> 128, which might have some consequences. skipdata layout might be
>>>>>>> inappropriate, these things are optimized for blocks of 128.
>>>>>>>
>>>>>>> Just in general, I recommend making a codec for the benchmarking
>>>>>>> experiments, tools like luceneutil support comparing codecs against each
>>>>>>> other anyway so you can easily compare fairly against the existing codec.
>>>>>>> Also, it should be much easier/faster to just make a new codec and adapt it
>>>>>>> to test what you want!
>>>>>>>
>>>>>>> I think it is an antipattern to make stuff within the codec
>>>>>>> "flexible", it is autogenerated decompression code :) I am concerned such
>>>>>>> "flexibility" would create barriers in the future to optimizations. For
>>>>>>> example we should be able to experiment with converting this compression
>>>>>>> code over to explicit vector API in java.
>>>>>>>
>>>>>>> On Sat, Feb 27, 2021 at 4:29 PM Greg Miller <gsmiller@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi folks!
>>>>>>>>
>>>>>>>> I've been a bit curious to test out different block size
>>>>>>>> configurations in the Lucene postings list format, but thought I'd reach
>>>>>>>> out to the community here first to see what work may have gone into this
>>>>>>>> previously. I'm essentially interested in benchmarking different block size
>>>>>>>> configurations on the real-world application of Lucene I'm working on.
>>>>>>>>
>>>>>>>> If my understanding of the code is correct, I know we're currently
>>>>>>>> encoding compressed runs of 128 docs per block, relying on ForUtil for
>>>>>>>> encoding/decoding purposes. It looks like we define this in
>>>>>>>> ForUtil#BLOCK_SIZE (and reference it in a few external classes), but also
>>>>>>>> know that it's not as simple as just changing that one definition. It
>>>>>>>> appears much of the logic in ForUtil relies on the assumption of 128
>>>>>>>> docs-per-block.
>>>>>>>>
>>>>>>>> I'm toying with the idea of making ForUtil a bit more flexible to
>>>>>>>> allow for different block sizes to be tested in order to run the
>>>>>>>> benchmarking I'd like to run, but the class looks heavily optimized to
>>>>>>>> generate SIMD instructions (I think?), so that might be folly. Before I
>>>>>>>> start hacking on a local branch to see what I can learn, is there any prior
>>>>>>>> work that might be useful to be aware of? Anyone gone down this path and
>>>>>>>> have some learnings to share? Any thoughts would be much appreciated!
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> -Greg
>>>>>>>>
>>>>>>>