Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()
I have some code that is kind of abusing IndexWriter.deleteAll(). In
short, I'm experimenting with using tiny indexes (each holding one
block of joined parent/child documents) as a serialization format:
index on one fleet, then merge the tiny indexes on another fleet. I'm
doing this by indexing a block, committing, storing the contents of
the index directory in a zip file, invoking deleteAll(), and
repeating. Believe it or not, the performance is not terrible.
(Currently I'm getting about 20% of the throughput I see with regular
indexing.)
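
To make that loop concrete, here's a heavily simplified sketch of the
cycle (the class name and the zipDirectory helper are made-up
stand-ins, not the actual production code):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    class TinyIndexSerializer {

      // One cycle per block: index the joined parent/child block, commit so
      // the segment files hit the directory, archive them, then wipe the index.
      static void serializeBlocks(IndexWriter writer, Path indexDir,
                                  Iterable<List<Document>> blocks) throws IOException {
        int blockId = 0;
        for (List<Document> block : blocks) {
          writer.addDocuments(block);        // adds the block atomically
          writer.commit();                   // flush the tiny index to disk
          zipDirectory(indexDir, blockId++); // snapshot the index files
          writer.deleteAll();                // reset for the next block
        }
      }

      // Stand-in for the real archiving step: zip every file in the index
      // directory (the real code would skip write.lock, handle errors, etc.).
      static void zipDirectory(Path indexDir, int blockId) throws IOException {
        Path zipFile = indexDir.resolveSibling("block-" + blockId + ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile));
             DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
          for (Path file : files) {
            if (!Files.isRegularFile(file)) {
              continue;
            }
            zos.putNextEntry(new ZipEntry(file.getFileName().toString()));
            Files.copy(file, zos);
            zos.closeEntry();
          }
        }
      }
    }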

Regardless of my serialization shenanigans above, I've found that the
process's performance degrades over time, as it spends more and more
time allocating and freeing memory. Analyzing some heap dumps showed
that it's because FieldInfos.byNumber keeps getting bigger and bigger:
IndexWriter.deleteAll() doesn't truly reset state. Specifically, it
calls globalFieldNumberMap.clear(), which clears all of the
FieldNumbers collections, but it doesn't reset
lowestUnassignedFieldNumber. So that number keeps counting up, and
each new FieldInfos instance allocates a larger and larger array (and
only uses the top indices).
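
If it helps to see the shape of the problem, here's a toy sketch (this
is not the real Lucene code, just an illustration of how a counter
that survives clear() makes every later array bigger):

    import java.util.HashMap;
    import java.util.Map;

    // Toy illustration of the field-number bookkeeping described above.
    class FieldNumbersSketch {
      private final Map<String, Integer> nameToNumber = new HashMap<>();
      private int lowestUnassignedFieldNumber = -1;

      int addOrGet(String fieldName) {
        // Keeps counting up across clear() calls, so after N deleteAll()
        // cycles with F fields each, the next assigned number is about N * F.
        return nameToNumber.computeIfAbsent(
            fieldName, name -> ++lowestUnassignedFieldNumber);
      }

      void clear() {
        nameToNumber.clear();
        // lowestUnassignedFieldNumber is NOT reset, so anything sized by the
        // max field number (like FieldInfos.byNumber) keeps growing even
        // though only the top few slots are ever used.
      }
    }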

Has anyone else encountered this? Can I open an issue for resetting
lowestUnassignedFieldNumber in FieldNumbers.clear()? Is there any risk in
doing so?

(For my specific use-case, I would be okay with not clearing
globalFieldNumberMap at all, since the set of fields is bounded, but
assigning new field numbers is probably among the least of my costs.)
Re: Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()
Thanks for sharing the background of your indexing serialization
shenanigans :-) -- interesting.

I think IndexWriter.deleteAll() should ultimately reset
lowestUnassignedFieldNumber. globalFieldNumberMap.clear() is only called
by deleteAll, so this simple proposal makes sense to me. File a JIRA issue.
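
Something like this, I'd imagine, in terms of the toy sketch from your
first message (the real FieldNumbers.clear() clears a few more
per-field maps, but the added line would be the same):

    void clear() {
      nameToNumber.clear();
      // ... the real method also clears the other per-field collections ...
      lowestUnassignedFieldNumber = -1; // restart numbering after deleteAll()
    }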

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Nov 18, 2020 at 1:17 PM Michael Froh <msfroh@gmail.com> wrote:

Re: Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()
I'm curious if you tried creating a new IndexWriter for each batch?

On Wed, Nov 18, 2020 at 1:18 PM Michael Froh <msfroh@gmail.com> wrote:
Re: Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()
I didn't try creating a new IndexWriter for each batch, but I was
assuming that would be heavier, as it would allocate a new
DocumentsWriter and, through that, new DocumentsWriterPerThreads.
Skimming through the code for DWPT, it looks like there are various
pools involved in creating each DWPT's instance of
DefaultIndexingChain, which might be expensive to recreate frequently
rather than reused across flush() calls.

Also, I was partly motivated by laziness. The production code I'm
borrowing for this prototype doesn't make it easy to recreate the
IndexWriterConfig, and an IWC is not reusable across IndexWriter
instances.
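
For reference, the per-batch-writer variant would look roughly like
this; the Supplier is just a placeholder for whatever would rebuild
the IndexWriterConfig, which is exactly the part that's awkward in the
code I'm borrowing:

    import java.io.IOException;
    import java.util.List;
    import java.util.function.Supplier;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;

    class FreshWriterPerBatch {
      // Alternative: open (and tear down) a brand-new IndexWriter per block.
      // IndexWriterConfig is single-use, so it must be rebuilt for every writer.
      static void indexBlock(Directory dir, List<Document> block,
                             Supplier<IndexWriterConfig> newConfig) throws IOException {
        IndexWriterConfig config = newConfig.get()
            .setOpenMode(IndexWriterConfig.OpenMode.CREATE); // start empty each time
        try (IndexWriter writer = new IndexWriter(dir, config)) {
          writer.addDocuments(block);
          writer.commit();
        } // close() discards the DocumentsWriter and its per-thread state every time
      }
    }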

On Wed, Nov 18, 2020 at 12:25 PM Michael Sokolov <msokolov@gmail.com> wrote:

Re: Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()
Yeah, that sounds as if it would be too expensive. I wasn't quite sure
what would be involved.

On Wed, Nov 18, 2020 at 3:56 PM Michael Froh <msfroh@gmail.com> wrote:
Re: Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()
Thanks David!

Created https://issues.apache.org/jira/browse/LUCENE-9617 and posted a PR:
https://github.com/apache/lucene-solr/pull/2088

On Wed, Nov 18, 2020 at 10:26 AM David Smiley <dsmiley@apache.org> wrote:
