Mailing List Archive: GC improvement when skipping bytes in DataInput

GC improvement when skipping bytes in DataInput

gsmiller at gmail

Feb 17, 2021, 1:14 PM

Post #1 of 10 (711 views)

Hi folks-

I work on a Lucene-based search system and we recently added Java Flight
Recorder to our benchmark tooling. When looking through results, we found
DataInput#skipBytes() to be a top contributor to garbage creation. We're
using Lucene84SkipReader and always skipping over Impacts in our use-case.
At first glance, it appeared pretty obvious that creating new instances of
the skipBuffer byte[] for each instance of DataInput was the culprit.

It looks like alternatives were discussed originally in LUCENE-5583
<https://issues.amazon.com/issues/LUCENE-4947>, one of which being a
thread-local implementation of the skip buffer (since it can't be a static
field without breaking delegating subclasses, like ChecksumIndexInput). At
the time, a thread-local was advised against
<https://issues.apache.org/jira/browse/LUCENE-5583?focusedCommentId=13964258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13964258>
by
Uwe due to GC expense, but in our benchmarks, bringing in a thread-local
implementation reduced overall GC time by ~7%.

I'd like to revisit this implementation decision and discuss ways in which
we can reduce this unnecessary garbage creation. It seems like moving to a
thread-local implementation is a win here, but I'd love to hear more
thoughts or alternative suggestions from the group. I'm new to this
community, so I'm not sure the best way to proceed. Should I open a Jira
issue as a next step? Thanks in advance!

Cheers,
-Greg

Re: GC improvement when skipping bytes in DataInput [ In reply to ]

rcmuir at gmail

Feb 17, 2021, 1:47 PM

Post #2 of 10 (711 views)

See this already-open issue:
https://issues.apache.org/jira/browse/LUCENE-9480

No threadlocals are necessary. DataInput/IndexInput are already intended
for use by one thread.

For e.g. mmapdirectory this should be "+=". no buffer is required.

On Wed, Feb 17, 2021 at 4:15 PM Greg Miller <gsmiller@gmail.com> wrote:

> Hi folks-
>
> I work on a Lucene-based search system and we recently added Java Flight
> Recorder to our benchmark tooling. When looking through results, we found
> DataInput#skipBytes() to be a top contributor to garbage creation. We're
> using Lucene84SkipReader and always skipping over Impacts in our use-case.
> At first glance, it appeared pretty obvious that creating new instances of
> the skipBuffer byte[] for each instance of DataInput was the culprit.
>
> It looks like alternatives were discussed originally in LUCENE-5583
> <https://issues.amazon.com/issues/LUCENE-4947>, one of which being a
> thread-local implementation of the skip buffer (since it can't be a static
> field without breaking delegating subclasses, like ChecksumIndexInput). At
> the time, a thread-local was advised against
> <https://issues.apache.org/jira/browse/LUCENE-5583?focusedCommentId=13964258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13964258> by
> Uwe due to GC expense, but in our benchmarks, bringing in a thread-local
> implementation reduced overall GC time by ~7%.
>
> I'd like to revisit this implementation decision and discuss ways in which
> we can reduce this unnecessary garbage creation. It seems like moving to a
> thread-local implementation is a win here, but I'd love to hear more
> thoughts or alternative suggestions from the group. I'm new to this
> community, so I'm not sure the best way to proceed. Should I open a Jira
> issue as a next step? Thanks in advance!
>
> Cheers,
> -Greg
>

Re: GC improvement when skipping bytes in DataInput [ In reply to ]

gsmiller at gmail

Feb 17, 2021, 2:10 PM

Post #3 of 10 (711 views)

Thanks for pointing me to the existing issue! I think I generally agree
that no threadlocal would be needed if the functionality of skipBytes() and
seek() were collapsed as per the issue you pointed me to. In the current
state though, concurrency causes problems for delegating implementations of
DataInput, specifically ChecksumIndexInput (as detailed in LUCENE-5583
<https://issues.apache.org/jira/browse/LUCENE-5583>).

I'll spend a little more time thinking about LUCENE-9480 and comment there.
Seems like a better approach to eliminate the need for reading in bytes
just to skip, assuming it can be made to play nice with checksums, etc.
Thanks again!

Cheers,
-Greg

On Wed, Feb 17, 2021 at 1:47 PM Robert Muir <rcmuir@gmail.com> wrote:

> See this already-open issue:
> https://issues.apache.org/jira/browse/LUCENE-9480
>
> No threadlocals are necessary. DataInput/IndexInput are already intended
> for use by one thread.
>
> For e.g. mmapdirectory this should be "+=". no buffer is required.
>
> On Wed, Feb 17, 2021 at 4:15 PM Greg Miller <gsmiller@gmail.com> wrote:
>
>> Hi folks-
>>
>> I work on a Lucene-based search system and we recently added Java Flight
>> Recorder to our benchmark tooling. When looking through results, we found
>> DataInput#skipBytes() to be a top contributor to garbage creation. We're
>> using Lucene84SkipReader and always skipping over Impacts in our use-case.
>> At first glance, it appeared pretty obvious that creating new instances of
>> the skipBuffer byte[] for each instance of DataInput was the culprit.
>>
>> It looks like alternatives were discussed originally in LUCENE-5583
>> <https://issues.amazon.com/issues/LUCENE-4947>, one of which being a
>> thread-local implementation of the skip buffer (since it can't be a static
>> field without breaking delegating subclasses, like ChecksumIndexInput). At
>> the time, a thread-local was advised against
>> <https://issues.apache.org/jira/browse/LUCENE-5583?focusedCommentId=13964258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13964258> by
>> Uwe due to GC expense, but in our benchmarks, bringing in a thread-local
>> implementation reduced overall GC time by ~7%.
>>
>> I'd like to revisit this implementation decision and discuss ways in
>> which we can reduce this unnecessary garbage creation. It seems like moving
>> to a thread-local implementation is a win here, but I'd love to hear more
>> thoughts or alternative suggestions from the group. I'm new to this
>> community, so I'm not sure the best way to proceed. Should I open a Jira
>> issue as a next step? Thanks in advance!
>>
>> Cheers,
>> -Greg
>>
>

Re: GC improvement when skipping bytes in DataInput [ In reply to ]

rcmuir at gmail

Feb 17, 2021, 3:33 PM

Post #4 of 10 (711 views)

There is zero concurrency issue. I think you are reading discussion about
old patches that didn't get committed, ignore that and look at the code.
Look at DataInput javadoc: "DataInput may only be used from one thread,
because it is not thread safe (it keeps internal state like file position)."
Look at IndexInput javadoc: "IndexInput may only be used from one thread,
because it is not thread safe (it keeps internal state like file position)."
This applies to ChecksumIndexInput, too, it is a subclass.

Let's try to avoid adding more complex logic to the situation. Base classes
like DataInput shouldn't have "slow" methods that do things like read bytes
and copy them around just to increment a position pointer... that is not
good. It is supposed to just have some simple decoding helpers (like vint
decode). So it is at the wrong "level" to have such a method, as it can't
do it in an efficient way.

A good start would be to rename DataInput.skipBytes() to a deprecated
DataInput.skipBytesSlowly() and add a new abstract DataInput.skipBytes().
Now you just have to implement skipBytes() with all the subclasses, but you
can always start with skipBytesSlowly() // TODO: fix this, so it allows
incremental progress. For IndexInput, you can make skipBytes() just call
seek(), that is an easy win. ByteArrayDataInput is already good to go,
perhaps it is the only one with a correct implementation :)

On Wed, Feb 17, 2021 at 5:10 PM Greg Miller <gsmiller@gmail.com> wrote:

> Thanks for pointing me to the existing issue! I think I generally agree
> that no threadlocal would be needed if the functionality of skipBytes() and
> seek() were collapsed as per the issue you pointed me to. In the current
> state though, concurrency causes problems for delegating implementations of
> DataInput, specifically ChecksumIndexInput (as detailed in LUCENE-5583
> <https://issues.apache.org/jira/browse/LUCENE-5583>).
>
> I'll spend a little more time thinking about LUCENE-9480 and comment
> there. Seems like a better approach to eliminate the need for reading in
> bytes just to skip, assuming it can be made to play nice with checksums,
> etc. Thanks again!
>
> Cheers,
> -Greg
>
> On Wed, Feb 17, 2021 at 1:47 PM Robert Muir <rcmuir@gmail.com> wrote:
>
>> See this already-open issue:
>> https://issues.apache.org/jira/browse/LUCENE-9480
>>
>> No threadlocals are necessary. DataInput/IndexInput are already intended
>> for use by one thread.
>>
>> For e.g. mmapdirectory this should be "+=". no buffer is required.
>>
>> On Wed, Feb 17, 2021 at 4:15 PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>>> Hi folks-
>>>
>>> I work on a Lucene-based search system and we recently added Java Flight
>>> Recorder to our benchmark tooling. When looking through results, we found
>>> DataInput#skipBytes() to be a top contributor to garbage creation. We're
>>> using Lucene84SkipReader and always skipping over Impacts in our use-case.
>>> At first glance, it appeared pretty obvious that creating new instances of
>>> the skipBuffer byte[] for each instance of DataInput was the culprit.
>>>
>>> It looks like alternatives were discussed originally in LUCENE-5583
>>> <https://issues.amazon.com/issues/LUCENE-4947>, one of which being a
>>> thread-local implementation of the skip buffer (since it can't be a static
>>> field without breaking delegating subclasses, like ChecksumIndexInput). At
>>> the time, a thread-local was advised against
>>> <https://issues.apache.org/jira/browse/LUCENE-5583?focusedCommentId=13964258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13964258> by
>>> Uwe due to GC expense, but in our benchmarks, bringing in a thread-local
>>> implementation reduced overall GC time by ~7%.
>>>
>>> I'd like to revisit this implementation decision and discuss ways in
>>> which we can reduce this unnecessary garbage creation. It seems like moving
>>> to a thread-local implementation is a win here, but I'd love to hear more
>>> thoughts or alternative suggestions from the group. I'm new to this
>>> community, so I'm not sure the best way to proceed. Should I open a Jira
>>> issue as a next step? Thanks in advance!
>>>
>>> Cheers,
>>> -Greg
>>>
>>

Re: GC improvement when skipping bytes in DataInput [ In reply to ]

gsmiller at gmail

Feb 17, 2021, 3:52 PM

Post #5 of 10 (711 views)

Thanks Robert for the detailed response. Let me try to address a few points
inline if I may...

There is zero concurrency issue. I think you are reading discussion about
old patches that didn't get committed, ignore that and look at the code.
Look at DataInput javadoc: "DataInput may only be used from one thread,
because it is not thread safe (it keeps internal state like file position)."
Look at IndexInput javadoc: "IndexInput may only be used from one thread,
because it is not thread safe (it keeps internal state like file position)."
This applies to ChecksumIndexInput, too, it is a subclass.

Right, I am looking at the code but maybe we're talking about two
different things, so let me clarify. I agree that there is no concurrency
issue with the current code and I apologise if my point was confusing. The
reason skipBytes was made an instance variable as opposed to a static one
was to *avoid *creating a concurrency issue (which certainly would exist if
it had been made static). Making it an instance variable is wasteful for GC
though, no? My suggestion of moving to a threadlocal hits a "happy medium"
where we're not allocating these silly buffers for each DataInput instance
but making sure each thread has a separate one. Does this make more sense
now?

Let's try to avoid adding more complex logic to the situation. Base classes
like DataInput shouldn't have "slow" methods that do things like read bytes
and copy them around just to increment a position pointer... that is not
good. It is supposed to just have some simple decoding helpers (like vint
decode). So it is at the wrong "level" to have such a method, as it can't
do it in an efficient way.

Sure, agree. But the more I think about the Jira you pointed me to, the
more I think we're talking about two separate problems (that could
certainly have a common solution). I agree that the current approach of
skipping bytes in general is wasteful and I don't disagree that it
shouldn't be in DataInput (or that DataInput and IndexInput should be
collapsed). Seems right. But... there's a potential needle-moving change by
just reducing garbage creation here. I just want to make sure we don't let
perfection block progress here.

A good start would be to rename DataInput.skipBytes() to a deprecated
DataInput.skipBytesSlowly() and add a new abstract DataInput.skipBytes().
Now you just have to implement skipBytes() with all the subclasses, but you
can always start with skipBytesSlowly() // TODO: fix this, so it allows
incremental progress. For IndexInput, you can make skipBytes() just call
seek(), that is an easy win. ByteArrayDataInput is already good to go,
perhaps it is the only one with a correct implementation :)

I like this thought. Let me see if I can run with your suggestion and come
up with an incremental approach to reap some short-term GC gains while
moving in a better direction overall. Just need to noodle on it a bit...

Thanks again for the discussion!

Cheers,
-Greg

On Wed, Feb 17, 2021 at 3:34 PM Robert Muir <rcmuir@gmail.com> wrote:

> There is zero concurrency issue. I think you are reading discussion about
> old patches that didn't get committed, ignore that and look at the code.
> Look at DataInput javadoc: "DataInput may only be used from one thread,
> because it is not thread safe (it keeps internal state like file position)."
> Look at IndexInput javadoc: "IndexInput may only be used from one thread,
> because it is not thread safe (it keeps internal state like file position)."
> This applies to ChecksumIndexInput, too, it is a subclass.
>
> Let's try to avoid adding more complex logic to the situation. Base
> classes like DataInput shouldn't have "slow" methods that do things like
> read bytes and copy them around just to increment a position pointer...
> that is not good. It is supposed to just have some simple decoding helpers
> (like vint decode). So it is at the wrong "level" to have such a method, as
> it can't do it in an efficient way.
>
> A good start would be to rename DataInput.skipBytes() to a deprecated
> DataInput.skipBytesSlowly() and add a new abstract DataInput.skipBytes().
> Now you just have to implement skipBytes() with all the subclasses, but you
> can always start with skipBytesSlowly() // TODO: fix this, so it allows
> incremental progress. For IndexInput, you can make skipBytes() just call
> seek(), that is an easy win. ByteArrayDataInput is already good to go,
> perhaps it is the only one with a correct implementation :)
>
>
> On Wed, Feb 17, 2021 at 5:10 PM Greg Miller <gsmiller@gmail.com> wrote:
>
>> Thanks for pointing me to the existing issue! I think I generally agree
>> that no threadlocal would be needed if the functionality of skipBytes() and
>> seek() were collapsed as per the issue you pointed me to. In the current
>> state though, concurrency causes problems for delegating implementations of
>> DataInput, specifically ChecksumIndexInput (as detailed in LUCENE-5583
>> <https://issues.apache.org/jira/browse/LUCENE-5583>).
>>
>> I'll spend a little more time thinking about LUCENE-9480 and comment
>> there. Seems like a better approach to eliminate the need for reading in
>> bytes just to skip, assuming it can be made to play nice with checksums,
>> etc. Thanks again!
>>
>> Cheers,
>> -Greg
>>
>> On Wed, Feb 17, 2021 at 1:47 PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> See this already-open issue:
>>> https://issues.apache.org/jira/browse/LUCENE-9480
>>>
>>> No threadlocals are necessary. DataInput/IndexInput are already intended
>>> for use by one thread.
>>>
>>> For e.g. mmapdirectory this should be "+=". no buffer is required.
>>>
>>> On Wed, Feb 17, 2021 at 4:15 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>
>>>> Hi folks-
>>>>
>>>> I work on a Lucene-based search system and we recently added Java
>>>> Flight Recorder to our benchmark tooling. When looking through results, we
>>>> found DataInput#skipBytes() to be a top contributor to garbage creation.
>>>> We're using Lucene84SkipReader and always skipping over Impacts in our
>>>> use-case. At first glance, it appeared pretty obvious that creating new
>>>> instances of the skipBuffer byte[] for each instance of DataInput was the
>>>> culprit.
>>>>
>>>> It looks like alternatives were discussed originally in LUCENE-5583
>>>> <https://issues.amazon.com/issues/LUCENE-4947>, one of which being a
>>>> thread-local implementation of the skip buffer (since it can't be a static
>>>> field without breaking delegating subclasses, like ChecksumIndexInput). At
>>>> the time, a thread-local was advised against
>>>> <https://issues.apache.org/jira/browse/LUCENE-5583?focusedCommentId=13964258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13964258> by
>>>> Uwe due to GC expense, but in our benchmarks, bringing in a thread-local
>>>> implementation reduced overall GC time by ~7%.
>>>>
>>>> I'd like to revisit this implementation decision and discuss ways in
>>>> which we can reduce this unnecessary garbage creation. It seems like moving
>>>> to a thread-local implementation is a win here, but I'd love to hear more
>>>> thoughts or alternative suggestions from the group. I'm new to this
>>>> community, so I'm not sure the best way to proceed. Should I open a Jira
>>>> issue as a next step? Thanks in advance!
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>

Re: GC improvement when skipping bytes in DataInput [ In reply to ]

rcmuir at gmail

Feb 17, 2021, 4:09 PM

Post #6 of 10 (711 views)

On Wed, Feb 17, 2021 at 6:53 PM Greg Miller <gsmiller@gmail.com> wrote:

>
> Right, I am looking at the code but maybe we're talking about two
> different things, so let me clarify. I agree that there is no concurrency
> issue with the current code and I apologise if my point was confusing. The
> reason skipBytes was made an instance variable as opposed to a static one
> was to *avoid *creating a concurrency issue (which certainly would exist
> if it had been made static). Making it an instance variable is wasteful for
> GC though, no? My suggestion of moving to a threadlocal hits a "happy
> medium" where we're not allocating these silly buffers for each DataInput
> instance but making sure each thread has a separate one. Does this make
> more sense now?
>
>
Adding a threadlocal isn't a happy medium here. The first thing I see is
thread churn issues (apps that use non-fixed threadpools). See javadocs for
ClosableThreadLocal for more information. We can't even use that
CloseableThreadLocal hack here, because nobody calls close() on clones of
IndexInputs. So threadlocal would seriously make matters worse for some
apps using the library, all for something that should be a "+=" :)

I'm not trying to suggest perfection, just start by deprecating the slow
impl, make it abstract and "hoist" the responsibility of implementation
upwards to subclasses that are better prepared to handle them. For the
majority use case (e.g. IndexInput), your buffer trivially goes away since
you implemented it with seek(). The ones that are left can remain slow,
that's fine, you speed up 90% easily, and we can see what is needed to fix
the leftovers. But I really don't think these will be difficult either,
e.g. for ChecksumIndexInput we can probably leave it abstract and implement
the skipping directly on BufferedChecksumIndexInput (it already has a
buffer, so it should be able to use that one, and avoid 2 buffers like
today).

Re: GC improvement when skipping bytes in DataInput [ In reply to ]

gsmiller at gmail

Feb 17, 2021, 4:57 PM

Post #7 of 10 (711 views)

Fair points. I'm going to see if I can carve out some time to take up your
suggested approach. Would you suggest using LUCENE-9480
<https://issues.apache.org/jira/browse/LUCENE-9480> to report back on any
progress made? There are a few different ideas captured there, and it
sounds like the leading suggestion is to collapse the ideas of DataInput
and IndexInput, so I'm not sure if that's the best place to track the
suggested approach we've discussed here or not.

Thanks again for the discussion!

Cheers,
-Greg

On Wed, Feb 17, 2021 at 4:09 PM Robert Muir <rcmuir@gmail.com> wrote:

>
>
> On Wed, Feb 17, 2021 at 6:53 PM Greg Miller <gsmiller@gmail.com> wrote:
>
>>
>> Right, I am looking at the code but maybe we're talking about two
>> different things, so let me clarify. I agree that there is no concurrency
>> issue with the current code and I apologise if my point was confusing. The
>> reason skipBytes was made an instance variable as opposed to a static one
>> was to *avoid *creating a concurrency issue (which certainly would exist
>> if it had been made static). Making it an instance variable is wasteful for
>> GC though, no? My suggestion of moving to a threadlocal hits a "happy
>> medium" where we're not allocating these silly buffers for each DataInput
>> instance but making sure each thread has a separate one. Does this make
>> more sense now?
>>
>>
> Adding a threadlocal isn't a happy medium here. The first thing I see is
> thread churn issues (apps that use non-fixed threadpools). See javadocs for
> ClosableThreadLocal for more information. We can't even use that
> CloseableThreadLocal hack here, because nobody calls close() on clones of
> IndexInputs. So threadlocal would seriously make matters worse for some
> apps using the library, all for something that should be a "+=" :)
>
> I'm not trying to suggest perfection, just start by deprecating the slow
> impl, make it abstract and "hoist" the responsibility of implementation
> upwards to subclasses that are better prepared to handle them. For the
> majority use case (e.g. IndexInput), your buffer trivially goes away since
> you implemented it with seek(). The ones that are left can remain slow,
> that's fine, you speed up 90% easily, and we can see what is needed to fix
> the leftovers. But I really don't think these will be difficult either,
> e.g. for ChecksumIndexInput we can probably leave it abstract and implement
> the skipping directly on BufferedChecksumIndexInput (it already has a
> buffer, so it should be able to use that one, and avoid 2 buffers like
> today).
>

Re: GC improvement when skipping bytes in DataInput [ In reply to ]

rcmuir at gmail

Feb 17, 2021, 5:15 PM

Post #8 of 10 (711 views)

Sure, we can always create followups, but we can start with trying to make
the implementations efficient.

We'd probably want to create more followup issues later anyway, e.g. to
address any TODOs, and ultimately to fix the API: it is silly to have
seek(long) and skipBytes(long) that do exactly the same thing.

It is just especially egregious that one of these has a default
implementation that will allocate buffers and sequentially read+throw away
up to 2^63-1 bytes, so let's fix that first.

On Wed, Feb 17, 2021 at 7:58 PM Greg Miller <gsmiller@gmail.com> wrote:

> Fair points. I'm going to see if I can carve out some time to take up your
> suggested approach. Would you suggest using LUCENE-9480
> <https://issues.apache.org/jira/browse/LUCENE-9480> to report back on any
> progress made? There are a few different ideas captured there, and it
> sounds like the leading suggestion is to collapse the ideas of DataInput
> and IndexInput, so I'm not sure if that's the best place to track the
> suggested approach we've discussed here or not.
>
> Thanks again for the discussion!
>
> Cheers,
> -Greg
>
> On Wed, Feb 17, 2021 at 4:09 PM Robert Muir <rcmuir@gmail.com> wrote:
>
>>
>>
>> On Wed, Feb 17, 2021 at 6:53 PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>>>
>>> Right, I am looking at the code but maybe we're talking about two
>>> different things, so let me clarify. I agree that there is no concurrency
>>> issue with the current code and I apologise if my point was confusing. The
>>> reason skipBytes was made an instance variable as opposed to a static one
>>> was to *avoid *creating a concurrency issue (which certainly would
>>> exist if it had been made static). Making it an instance variable is
>>> wasteful for GC though, no? My suggestion of moving to a threadlocal hits a
>>> "happy medium" where we're not allocating these silly buffers for each
>>> DataInput instance but making sure each thread has a separate one. Does
>>> this make more sense now?
>>>
>>>
>> Adding a threadlocal isn't a happy medium here. The first thing I see is
>> thread churn issues (apps that use non-fixed threadpools). See javadocs for
>> ClosableThreadLocal for more information. We can't even use that
>> CloseableThreadLocal hack here, because nobody calls close() on clones of
>> IndexInputs. So threadlocal would seriously make matters worse for some
>> apps using the library, all for something that should be a "+=" :)
>>
>> I'm not trying to suggest perfection, just start by deprecating the slow
>> impl, make it abstract and "hoist" the responsibility of implementation
>> upwards to subclasses that are better prepared to handle them. For the
>> majority use case (e.g. IndexInput), your buffer trivially goes away since
>> you implemented it with seek(). The ones that are left can remain slow,
>> that's fine, you speed up 90% easily, and we can see what is needed to fix
>> the leftovers. But I really don't think these will be difficult either,
>> e.g. for ChecksumIndexInput we can probably leave it abstract and implement
>> the skipping directly on BufferedChecksumIndexInput (it already has a
>> buffer, so it should be able to use that one, and avoid 2 buffers like
>> today).
>>
>

Re: GC improvement when skipping bytes in DataInput [ In reply to ]

gsmiller at gmail

Feb 17, 2021, 5:55 PM

Post #9 of 10 (711 views)

Sounds good; thanks again. I'll see if I can cook something up this week.
As I thought about this a little further, I think I'll need to avoid a
default "skipSlowly" in DataInput in order to see the same benefits I
observed with my local change, since it requires doing away with allocating
a byte[] for skipping in each DataInput. So if DataInput still has a
default "slow skip" implementation, that doesn't quite solve the
accumulation of byte array garbage (although there would be other benefits
of course). After looking at the code a little more though, it doesn't seem
too difficult to create a "proper" skipBytes implementation in all the
necessary places.

Thanks again!

Cheers,
-Greg

On Wed, Feb 17, 2021 at 5:15 PM Robert Muir <rcmuir@gmail.com> wrote:

> Sure, we can always create followups, but we can start with trying to make
> the implementations efficient.
>
> We'd probably want to create more followup issues later anyway, e.g. to
> address any TODOs, and ultimately to fix the API: it is silly to have
> seek(long) and skipBytes(long) that do exactly the same thing.
>
> It is just especially egregious that one of these has a default
> implementation that will allocate buffers and sequentially read+throw away
> up to 2^63-1 bytes, so let's fix that first.
>
> On Wed, Feb 17, 2021 at 7:58 PM Greg Miller <gsmiller@gmail.com> wrote:
>
>> Fair points. I'm going to see if I can carve out some time to take up
>> your suggested approach. Would you suggest using LUCENE-9480
>> <https://issues.apache.org/jira/browse/LUCENE-9480> to report back on
>> any progress made? There are a few different ideas captured there, and it
>> sounds like the leading suggestion is to collapse the ideas of DataInput
>> and IndexInput, so I'm not sure if that's the best place to track the
>> suggested approach we've discussed here or not.
>>
>> Thanks again for the discussion!
>>
>> Cheers,
>> -Greg
>>
>> On Wed, Feb 17, 2021 at 4:09 PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>>
>>>
>>> On Wed, Feb 17, 2021 at 6:53 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>
>>>>
>>>> Right, I am looking at the code but maybe we're talking about two
>>>> different things, so let me clarify. I agree that there is no concurrency
>>>> issue with the current code and I apologise if my point was confusing. The
>>>> reason skipBytes was made an instance variable as opposed to a static one
>>>> was to *avoid *creating a concurrency issue (which certainly would
>>>> exist if it had been made static). Making it an instance variable is
>>>> wasteful for GC though, no? My suggestion of moving to a threadlocal hits a
>>>> "happy medium" where we're not allocating these silly buffers for each
>>>> DataInput instance but making sure each thread has a separate one. Does
>>>> this make more sense now?
>>>>
>>>>
>>> Adding a threadlocal isn't a happy medium here. The first thing I see is
>>> thread churn issues (apps that use non-fixed threadpools). See javadocs for
>>> ClosableThreadLocal for more information. We can't even use that
>>> CloseableThreadLocal hack here, because nobody calls close() on clones of
>>> IndexInputs. So threadlocal would seriously make matters worse for some
>>> apps using the library, all for something that should be a "+=" :)
>>>
>>> I'm not trying to suggest perfection, just start by deprecating the slow
>>> impl, make it abstract and "hoist" the responsibility of implementation
>>> upwards to subclasses that are better prepared to handle them. For the
>>> majority use case (e.g. IndexInput), your buffer trivially goes away since
>>> you implemented it with seek(). The ones that are left can remain slow,
>>> that's fine, you speed up 90% easily, and we can see what is needed to fix
>>> the leftovers. But I really don't think these will be difficult either,
>>> e.g. for ChecksumIndexInput we can probably leave it abstract and implement
>>> the skipping directly on BufferedChecksumIndexInput (it already has a
>>> buffer, so it should be able to use that one, and avoid 2 buffers like
>>> today).
>>>
>>

Re: GC improvement when skipping bytes in DataInput [ In reply to ]

rcmuir at gmail

Feb 17, 2021, 6:10 PM

Post #10 of 10 (711 views)

The current allocation is lazy/on-demand (see the code), so I wouldn't
worry about it. If skipSlowly is not called explicitly, nothing will
allocate byte[]s.

On Wed, Feb 17, 2021 at 8:56 PM Greg Miller <gsmiller@gmail.com> wrote:

> Sounds good; thanks again. I'll see if I can cook something up this week.
> As I thought about this a little further, I think I'll need to avoid a
> default "skipSlowly" in DataInput in order to see the same benefits I
> observed with my local change, since it requires doing away with allocating
> a byte[] for skipping in each DataInput. So if DataInput still has a
> default "slow skip" implementation, that doesn't quite solve the
> accumulation of byte array garbage (although there would be other benefits
> of course). After looking at the code a little more though, it doesn't seem
> too difficult to create a "proper" skipBytes implementation in all the
> necessary places.
>
> Thanks again!
>
> Cheers,
> -Greg
>
> On Wed, Feb 17, 2021 at 5:15 PM Robert Muir <rcmuir@gmail.com> wrote:
>
>> Sure, we can always create followups, but we can start with trying to
>> make the implementations efficient.
>>
>> We'd probably want to create more followup issues later anyway, e.g. to
>> address any TODOs, and ultimately to fix the API: it is silly to have
>> seek(long) and skipBytes(long) that do exactly the same thing.
>>
>> It is just especially egregious that one of these has a default
>> implementation that will allocate buffers and sequentially read+throw away
>> up to 2^63-1 bytes, so let's fix that first.
>>
>> On Wed, Feb 17, 2021 at 7:58 PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>>> Fair points. I'm going to see if I can carve out some time to take up
>>> your suggested approach. Would you suggest using LUCENE-9480
>>> <https://issues.apache.org/jira/browse/LUCENE-9480> to report back on
>>> any progress made? There are a few different ideas captured there, and it
>>> sounds like the leading suggestion is to collapse the ideas of DataInput
>>> and IndexInput, so I'm not sure if that's the best place to track the
>>> suggested approach we've discussed here or not.
>>>
>>> Thanks again for the discussion!
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Wed, Feb 17, 2021 at 4:09 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Feb 17, 2021 at 6:53 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>>
>>>>>
>>>>> Right, I am looking at the code but maybe we're talking about two
>>>>> different things, so let me clarify. I agree that there is no concurrency
>>>>> issue with the current code and I apologise if my point was confusing. The
>>>>> reason skipBytes was made an instance variable as opposed to a static one
>>>>> was to *avoid *creating a concurrency issue (which certainly would
>>>>> exist if it had been made static). Making it an instance variable is
>>>>> wasteful for GC though, no? My suggestion of moving to a threadlocal hits a
>>>>> "happy medium" where we're not allocating these silly buffers for each
>>>>> DataInput instance but making sure each thread has a separate one. Does
>>>>> this make more sense now?
>>>>>
>>>>>
>>>> Adding a threadlocal isn't a happy medium here. The first thing I see
>>>> is thread churn issues (apps that use non-fixed threadpools). See javadocs
>>>> for ClosableThreadLocal for more information. We can't even use that
>>>> CloseableThreadLocal hack here, because nobody calls close() on clones of
>>>> IndexInputs. So threadlocal would seriously make matters worse for some
>>>> apps using the library, all for something that should be a "+=" :)
>>>>
>>>> I'm not trying to suggest perfection, just start by deprecating the
>>>> slow impl, make it abstract and "hoist" the responsibility of
>>>> implementation upwards to subclasses that are better prepared to handle
>>>> them. For the majority use case (e.g. IndexInput), your buffer trivially
>>>> goes away since you implemented it with seek(). The ones that are left can
>>>> remain slow, that's fine, you speed up 90% easily, and we can see what is
>>>> needed to fix the leftovers. But I really don't think these will be
>>>> difficult either, e.g. for ChecksumIndexInput we can probably leave it
>>>> abstract and implement the skipping directly on BufferedChecksumIndexInput
>>>> (it already has a buffer, so it should be able to use that one, and avoid 2
>>>> buffers like today).
>>>>
>>>