I have some code that is kind of abusing IndexWriter.deleteAll(). In short,
I'm basically experimenting with using tiny (one block of joined
parent/child documents) indexes as a serialized format to index on one
fleet and then merge these tiny indexes on another fleet. I'm doing this by
indexing a block, committing, storing the contents of the index directory
in a zip file, invoking deleteAll(), and repeating. Believe it or not, the
performance is not terrible. (Currently getting about 20% of the throughput
I see with regular indexing.)
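For anyone curious about the "store the index directory in a zip file" step, here is a rough sketch of it using only java.util.zip. The class and method names (IndexDirZipper, zipDirectory) are made up for illustration and are not my actual code. It assumes a flat index directory, a zip target outside that directory, and that the write.lock file should be skipped.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: packs the files of a (flat) index directory into one zip.
final class IndexDirZipper {

    /**
     * Writes every regular file directly under indexDir into zipFile and
     * returns the number of entries written. Assumes zipFile lives outside
     * indexDir and that the caller has already committed the writer.
     */
    static int zipDirectory(Path indexDir, Path zipFile) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(indexDir)) {
            files = listing
                .filter(Files::isRegularFile)
                // Assumption: the lock file is not part of the committed index.
                .filter(p -> !p.getFileName().toString().equals("write.lock"))
                .sorted()
                .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zip); // stream the file bytes into the current entry
                zip.closeEntry();
            }
        }
        return files.size();
    }
}
```

In the loop described above, this would run after the commit and before deleteAll().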
Regardless of my serialization shenanigans above, I've found that
performance degrades over the life of the process, as it spends more and
more time allocating and freeing memory. Heap dumps show the cause:
FieldInfos.byNumber keeps getting bigger. IndexWriter.deleteAll()
doesn't truly reset state. Specifically, it calls
globalFieldNumberMap.clear(), which clears all of the FieldNumbers
collections, but it doesn't reset lowestUnassignedFieldNumber. So that
number keeps counting up across deleteAll() calls, and each new
FieldInfos instance allocates a larger array than the last (and only
uses the top indices).
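To make the growth concrete, here is a toy model of the behavior described above. It is not Lucene's code: ToyFieldNumbers and byNumberLengthAfter are invented names, and the only things borrowed from the description are a counter that clear() leaves alone and a by-number array sized to the highest field number plus one. The resetCounterOnClear flag shows what resetting the counter in clear() would change.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the described behavior; names are hypothetical, not Lucene's.
final class FieldNumberGrowthDemo {

    /** Simplified stand-in for a global field-name-to-number map. */
    static final class ToyFieldNumbers {
        private final Map<String, Integer> nameToNumber = new HashMap<>();
        private final boolean resetCounterOnClear;
        private int lowestUnassignedFieldNumber = -1;

        ToyFieldNumbers(boolean resetCounterOnClear) {
            this.resetCounterOnClear = resetCounterOnClear;
        }

        /** Returns the existing number for name, or assigns the next one. */
        int addOrGet(String name) {
            Integer existing = nameToNumber.get(name);
            if (existing != null) {
                return existing;
            }
            int assigned = ++lowestUnassignedFieldNumber;
            nameToNumber.put(name, assigned);
            return assigned;
        }

        /** Clears the mappings; only resets the counter if asked to. */
        void clear() {
            nameToNumber.clear();
            if (resetCounterOnClear) {
                lowestUnassignedFieldNumber = -1;
            }
        }
    }

    /**
     * Runs `cycles` rounds of "register fields, then clear" and returns the
     * length a by-number array would need in the last round
     * (highest field number + 1).
     */
    static int byNumberLengthAfter(int cycles, int fieldsPerCycle, boolean reset) {
        ToyFieldNumbers numbers = new ToyFieldNumbers(reset);
        int maxNumber = -1;
        for (int c = 0; c < cycles; c++) {
            maxNumber = -1;
            for (int f = 0; f < fieldsPerCycle; f++) {
                maxNumber = Math.max(maxNumber, numbers.addOrGet("field" + f));
            }
            numbers.clear();
        }
        return maxNumber + 1;
    }

    public static void main(String[] args) {
        // Counter never reset: array length grows with the number of cycles.
        System.out.println(byNumberLengthAfter(1000, 10, false)); // prints 10000
        // Counter reset in clear(): array length stays at the field count.
        System.out.println(byNumberLengthAfter(1000, 10, true));  // prints 10
    }
}
```

With 10 fields per block, the model needs a 10,000-slot array after 1,000 cycles when the counter is not reset, of which only the top 10 slots are used.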
Has anyone else encountered this? Can I open an issue for resetting
lowestUnassignedFieldNumber in FieldNumbers.clear()? Is there any risk in
doing so?
(For my specific use case, I would be okay with not clearing
globalFieldNumberMap at all, since the set of fields is bounded, but
assigning new field numbers is probably among the least of my costs.)