Mailing List Archive

Reindexing leaving behind 0 live doc segments
Hello,
I am trying to run a program that reads documents segment-by-segment and
reindexes them into the same index. I am reading with the Lucene APIs and
indexing through the Solr API (in a core that is currently loaded).

What I am observing is that even after a segment has been fully processed
and an autoCommit (as well as an autoSoftCommit) has kicked in, the segment
with 0 live docs gets left behind. *Upon Solr restart, the segment does get
cleared successfully.*

I tried to replicate the same thing without the code by indexing 3 docs on an
empty test core and then reindexing the same docs. The older segment gets
deleted as soon as the softCommit interval hits or an explicit commit=true is
called.

Here are the two approaches that I have tried. Approach 2 is inspired by
the merge logic's way of accessing segments, in case opening a DirectoryReader
externally (Approach 1) is what causes this issue.

But both approaches leave undeleted segments behind until I restart Solr
and load the core again. What am I missing? I don't have any more brain
cells left to fry on this!

Approach 1:
=========
try (FSDirectory dir = FSDirectory.open(Paths.get(core.getIndexDir()));
     IndexReader reader = DirectoryReader.open(dir)) {
    for (LeafReaderContext lrc : reader.leaves()) {
        // read live docs from each leaf, build a SolrInputDocument
        // out of each stored Document and index it using the Solr api
    }
} catch (Exception e) {
    // log/handle the exception
}

Approach 2:
==========
ReadersAndUpdates rld = null;
SegmentReader segmentReader = null;
RefCounted<IndexWriter> iwRef = core.getSolrCoreState().getIndexWriter(core);
IndexWriter iw = iwRef.get();
try {
    for (SegmentCommitInfo sci : segmentInfos) {
        rld = iw.getPooledInstance(sci, true);
        segmentReader = rld.getReader(IOContext.READ);

        // process all live docs similar to above using the segmentReader

        rld.release(segmentReader);
        iw.release(rld);
    }
} finally {
    if (iwRef != null) {
        iwRef.decref();
    }
}

Help would be much appreciated!

Thanks,
Rahul
Re: Reindexing leaving behind 0 live doc segments
Hi Rahul.
Are you looking for
https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/index/IndexWriter.html#forceMergeDeletes()
?
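
A minimal sketch of invoking it through the writer Solr already holds
(reusing the RefCounted pattern from your Approach 2; just an illustration,
not tested code):

RefCounted<IndexWriter> iwRef = core.getSolrCoreState().getIndexWriter(core);
try {
    IndexWriter iw = iwRef.get();
    // asks the merge policy to merge segments carrying deletes, which also
    // drops segments that are 100% deleted
    iw.forceMergeDeletes();
} finally {
    iwRef.decref();
}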


--
Sincerely yours
Mikhail Khludnev
Re: Reindexing leaving behind 0 live doc segments
Thanks for the response Mikhail. I don't think I am looking for
forceMergeDeletes() though, since it could be more expensive than I would
like, and I only want the unreferenced segments with 0 live docs to be
deleted - just the way they get deleted with a commit=true option or even
a soft commit.

Another important piece of information that I missed earlier: when I examine
the segments referenced by the segments_* file, these segments (with 0 live
docs) are no longer part of it, yet their files are still not cleared from
disk. Would appreciate more lines of thought!
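
(In case it helps, this is roughly how that check can be done - a sketch that
reads the latest commit point directly from the core's index directory:)

try (FSDirectory dir = FSDirectory.open(Paths.get(core.getIndexDir()))) {
    SegmentInfos latest = SegmentInfos.readLatestCommit(dir);
    for (SegmentCommitInfo sci : latest) {
        // a segment whose delCount equals maxDoc has 0 live docs
        System.out.println(sci.info.name + " maxDoc=" + sci.info.maxDoc()
                + " delCount=" + sci.getDelCount());
    }
}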

Thanks,
Rahul

Re: Reindexing leaving behind 0 live doc segments
Hi Rahul,

What you're describing sounds similar to index rearranging [1], although in
that case the reindexing is done in a new index. The last commit in the
IndexRearranger class added support for reading and reindexing deletes -
maybe
having a look at that and at the Javadoc would help?


Stefan

[1]
https://github.com/apache/lucene/blob/d1c353116157d0375de9d673ae5e9c90524ffe2f/lucene/misc/src/java/org/apache/lucene/misc/index/IndexRearranger.java


Re: Reindexing leaving behind 0 live doc segments
Hi Rahul,

Please do not pursue Approach 2 :) ReadersAndUpdates.release is not
something the application should be calling. This path can only lead to
pain.

It sounds to me like something in Solr is holding an old reader open (maybe
the last commit point, or the reader from before the refresh that followed
your re-indexing of all docs in a now 100% deleted segment).

Does Solr keep old readers open, older than the most recent commit? Do you
have queries in flight that might be holding the old reader open?

Given that your small by-hand test case (3 docs) correctly showed the 100%
deleted segment being reclaimed after the soft commit interval or a manual
hard commit, something must be different in the larger use case that is
causing Solr to keep an old reader open. Is there any logging you can
enable to understand Solr's handling of its IndexReaders' lifecycle?
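
To make the reader-pinning point concrete, a bare-bones pure-Lucene sketch
(not Solr code; deleteAllDocsOfOldSegment is a made-up helper):

DirectoryReader oldReader = DirectoryReader.open(writer); // NRT reader over the old segment
deleteAllDocsOfOldSegment(writer);                        // hypothetical: make it 100% deleted
writer.commit();                  // the new segments_N no longer references the old segment
// ...but its files stay on disk while oldReader remains open...
oldReader.close();                // once the last reader holding it is released,
writer.deleteUnusedFiles();       // the writer can physically remove the 0-live-doc files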

Mike McCandless

http://blog.mikemccandless.com


Re: Reindexing leaving behind 0 live doc segments
Stefan, Mike,
Appreciate your responses! I spent some time analyzing your inputs and
going further down the rabbit hole.

Stefan,
I looked at the IndexRearranger code you referenced where it tries to drop
the segment. I see that the drop eventually gets handled via
IndexFileDeleter.checkpoint() through file refCounts (a refCount of 0 is the
deletion criterion). The same method also gets called as part of the
IndexWriter.commit() flow (inside finishCommit()). So in an ideal scenario a
commit should have taken care of dropping the segment files, which tells me
the refCounts for the files are not reaching 0. I have a fair suspicion that
the reindexing process running on the same index inside the same JVM has
something to do with it.

Mike,
Thanks for the caution on Approach 2... good to at least be able to
continue on one train of thought. As mentioned in my response to Stefan,
the reindexing is going on *inside* the Solr JVM as an asynchronous
thread, not as a separate process. So I believe the open reader you are
alluding to might be the one I am opening through DirectoryReader.open()
(?). However, looking at the code, I only see IndexFileDeleter.incRef()
being called on the files in the SegmentCommitInfos.

Does an incRef() also happen when an IndexReader is opened?

Note: The index is a mix of 7.x and 8.x segments (on Solr 8.x). By extending
TieredMergePolicy and overriding findMerges() I am preventing 7.x segments
from participating in merges, and the code only reindexes these 7.x segments
into the same index, segment-by-segment (see the sketch below).
In the current tests I am performing, there are no parallel search or
indexing threads from external requests; the reindexing is the only
process interacting with the index. The goal is to eventually have this
running alongside any parallel indexing/search requests on the index.
Also, as noted earlier, by inspecting the SegmentInfos I can see the number
of 7.x segments progressively shrinking, but their files never get cleared.
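
A rough sketch of the kind of override described above (illustrative class
name, Lucene 8.x signature; an approximation of the approach, not the exact
code in use):

// imports from org.apache.lucene.index.* and java.io.IOException omitted
public class SkipPre8SegmentsMergePolicy extends TieredMergePolicy {
    @Override
    public MergeSpecification findMerges(MergeTrigger mergeTrigger,
                                         SegmentInfos segmentInfos,
                                         MergeContext mergeContext) throws IOException {
        // hand TieredMergePolicy only the segments written by Lucene 8.x or later,
        // so the 7.x segments never become merge candidates
        SegmentInfos eligible = new SegmentInfos(segmentInfos.getIndexCreatedVersionMajor());
        for (SegmentCommitInfo sci : segmentInfos) {
            if (sci.info.getVersion().major >= 8) {
                eligible.add(sci);
            }
        }
        return super.findMerges(mergeTrigger, eligible, mergeContext);
    }
}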

If it is my reader that is throwing off the refCount for Solr, what could
be another way of reading the index without bloating it up with 0 doc
segments?

I will also try floating this in the Solr list to get answers to some of
the questions you pose around Solr's handling of readers.

Thanks,
Rahul




Re: Reindexing leaving behind 0 live doc segments
Hi,

in Solr the empty segment is kept around as long as there is a Searcher
still open on it. At some point the empty segment (100% deletions) will be
deleted, but you have to wait until the SolrIndexSearcher has been reopened.
Maybe check your solrconfig.xml and see if openSearcher is enabled
after autoSoftCommit:
https://solr.apache.org/guide/solr/latest/configuration-guide/commits-transaction-logs.html
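
For reference, the relevant commit settings in solrconfig.xml look roughly
like this (the intervals are only illustrative):

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <!-- a soft commit opens a new searcher, which is what lets Solr drop 100% deleted segments -->
  <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
</autoSoftCommit>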

Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de


Re: Reindexing leaving behind 0 live doc segments
Uwe,
Thanks for the response. I have openSearcher=false in autoCommit, but I also
have an autoSoftCommit interval of 5 minutes configured, which should open a
searcher.
In vanilla Solr, without my code, I see that if I completely reindex all
documents in a segment (via a client call), the segment does get deleted
after the soft commit interval. However, if I process the segments as per
Approach 1 in my original email, the 0-doc 7.x segment stays
even after the process finishes, i.e. even after I exit the
try-with-resources block. Note that my index is a mix of 7.x and 8.x
segments and I am only reindexing the 7.x segments, preventing them from
participating in merges via a custom MergePolicy.
Additionally, as mentioned, Solr provides a handler (<core>/admin/segments)
which does what Luke does, and it shows that by the end of the process there
are no more 7.x segments referenced by the segments_x file. But for some
reason the physical 7.x segment files stay behind until I restart Solr.

Thanks,
Rahul

Re: Reindexing leaving behind 0 live doc segments
It looks like your code has a leak and does not close all of the
IndexReaders/IndexWriters that your custom code opens inside Solr. It is
impossible to review this from the outside.

You should use the Solr-provided SolrIndexWriter and SolrIndexSearcher to
do your custom work and let Solr manage them.
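
For example, a minimal sketch of reading through the Solr-managed searcher
instead of a privately opened DirectoryReader (illustrative only):

RefCounted<SolrIndexSearcher> searcherRef = core.getSearcher();
try {
    SolrIndexSearcher searcher = searcherRef.get();
    for (LeafReaderContext lrc : searcher.getIndexReader().leaves()) {
        // read live docs from each leaf and reindex via the Solr api, as before
    }
} finally {
    searcherRef.decref(); // let Solr close the underlying reader when it is no longer needed
}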

Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de

