Mailing List Archive

Questions about Lucene source
Hello,

I was going through some parts of the Lucene source and had some questions:
1) Can Lucene have 0-document segments? Or will they always be purged
(either by TMP or otherwise) on a commit?
Eg: A segment has 4 docs, and I make a /update call to overwrite all 4 docs
(so deleted docs == max docs) and call commit. Will/Can this segment still
exist after commit?

2) Starting with Lucene 7.0, each segment also stores a "minVersion", which
tracks the minimum version among the segments that contributed docs to this
segment.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfo.java#L83

Reading through LUCENE-7756 I see that one reason to have minVersion was to
have the entire version of the original index stored somewhere since a
change was made to store only the major version at the index level (in
SegmentInfos).

https://issues.apache.org/jira/browse/LUCENE-7756?focusedCommentId=15945863&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15945863

Checking the code, I found it's being consulted for any signs of index
corruption but that was pretty much it. Curious if there is any other
intended/planned use for minVersion? Eg: some choice of codec at read time
based on this field or anything else?

Thanks,
Rahul
Re: Questions about Lucene source
Following up on my questions since they didn't get much love the first
time. Any input is greatly appreciated!

Thanks,
Rahul

On Wed, Sep 14, 2022 at 3:58 PM Rahul Goswami <rahul196452@gmail.com> wrote:

> Hello,
>
> I was going through some parts of the Lucene source and had some questions:
> 1) Can Lucene have 0-document segments? Or will they always be purged
> (either by TMP or otherwise) on a commit?
> Eg: A segment has 4 docs, and I make a /update call to overwrite all 4
> docs (so deleted docs == max docs) and call commit. Will/Can this segment
> still exist after commit?
>
> 2) Starting with Lucene 7.0, each segment also stores a "minVersion", which
> tracks the minimum version among the segments that contributed docs to this
> segment.
>
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfo.java#L83
>
> Reading through LUCENE-7756 I see that one reason to have minVersion was
> to have the entire version of the original index stored somewhere since a
> change was made to store only the major version at the index level (in
> SegmentInfos)
>
>
> https://issues.apache.org/jira/browse/LUCENE-7756?focusedCommentId=15945863&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15945863
>
> Checking the code, I found it's being consulted for any signs of index
> corruption but that was pretty much it. Curious if there is any other
> intended/planned use for minVersion? Eg: some choice of codec at read time
> based on this field or anything else?
>
> Thanks,
> Rahul
>
>
Re: Questions about Lucene source
> (so deleted docs == max docs) and call commit. Will/Can this segment still
> exist after commit?
>

Depends on your merge policy and index deletion policy. You can configure
Lucene to keep older commits (and then you'll preserve all historical
segments).

I don't know the answer to your second question.

D.
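
To illustrate the point about retained commits, here is a toy model in
plain Java. It uses no Lucene classes, and every name in it
(CommitRetentionSketch, segmentsOnDisk, commitsToKeep) is hypothetical. It
assumes that the newest commit no longer references the fully deleted
segment "_0"; whether that holds in practice depends on the merge policy,
as noted above. Under that assumption, the segment's files survive only
while an older commit that references them is retained.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CommitRetentionSketch {

    // Each commit is modeled as the set of segment names it references.
    // A segment stays on disk while any retained commit references it.
    static Set<String> segmentsOnDisk(List<Set<String>> commits, int commitsToKeep) {
        int from = Math.max(0, commits.size() - commitsToKeep);
        Set<String> live = new TreeSet<>();
        for (Set<String> commit : commits.subList(from, commits.size())) {
            live.addAll(commit);
        }
        return live;
    }

    public static void main(String[] args) {
        // Commit 1 references "_0" (4 docs). All 4 docs are then overwritten,
        // producing "_1". ASSUMPTION: commit 2 references only "_1".
        List<Set<String>> commits = List.of(Set.of("_0"), Set.of("_1"));

        System.out.println(segmentsOnDisk(commits, 1)); // keep only the latest commit
        System.out.println(segmentsOnDisk(commits, 2)); // keep both commits
    }
}
```

With one retained commit the model reports only "_1"; with two retained
commits "_0" is still reported, which mirrors the "preserve all historical
segments" behavior described above.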
Re: Questions about Lucene source
On the 2nd question, we do not plan on leveraging this information to
figure out the codec: the codec that should be used to read a segment is
stored separately (also in segment infos).

It is mostly useful for diagnostic purposes. E.g. if we see an interesting
corruption case where checksums match, we can guess that there is a bug
somewhere in Lucene in a version that is between this minimum version and
the version that was used to write the segment.
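
The diagnostic reasoning above can be sketched as a simple range filter.
This is a toy model in plain Java, not Lucene code: the class V is a
hypothetical stand-in for a version triple (it is not
org.apache.lucene.util.Version), suspectReleases is an invented helper,
and the release list is purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class VersionRangeSketch {

    // Hypothetical stand-in for a (major, minor, bugfix) version triple.
    static final class V implements Comparable<V> {
        final int major, minor, bugfix;

        V(int major, int minor, int bugfix) {
            this.major = major;
            this.minor = minor;
            this.bugfix = bugfix;
        }

        @Override
        public int compareTo(V o) {
            if (major != o.major) return Integer.compare(major, o.major);
            if (minor != o.minor) return Integer.compare(minor, o.minor);
            return Integer.compare(bugfix, o.bugfix);
        }

        @Override
        public String toString() {
            return major + "." + minor + "." + bugfix;
        }
    }

    // Releases in [minVersion, writtenBy] are the candidates for a bug that
    // produced a corrupt segment whose checksums still match.
    static List<V> suspectReleases(V minVersion, V writtenBy, List<V> releases) {
        List<V> suspects = new ArrayList<>();
        for (V r : releases) {
            if (r.compareTo(minVersion) >= 0 && r.compareTo(writtenBy) <= 0) {
                suspects.add(r);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Illustrative release list only.
        List<V> releases = List.of(
            new V(8, 6, 0), new V(8, 8, 2), new V(8, 9, 0), new V(8, 11, 1), new V(9, 0, 0));

        // Segment with minVersion 8.8.2 that was written by 8.11.1.
        System.out.println(suspectReleases(new V(8, 8, 2), new V(8, 11, 1), releases));
        // prints [8.8.2, 8.9.0, 8.11.1]
    }
}
```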

On Sat, Sep 17, 2022 at 11:07 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:

> > (so deleted docs == max docs) and call commit. Will/Can this segment
> still
> > exist after commit?
> >
>
> Depends on your merge policy and index deletion policy. You can configure
> Lucene to keep older commits (and then you'll preserve all historical
> segments).
>
> I don't know the answer to your second question.
>
> D.
>


--
Adrien
Re: Questions about Lucene source
Dawid and Adrien, thanks for your responses. Bringing up an old thread
here to revisit this question:
> (so deleted docs == max docs) and call commit. Will/Can this segment still
> exist after commit?

Since I am using Solr (8.11.1), the default deletion policy is
SolrDeletionPolicy, which retains only the latest commit by default and
deletes the rest. In that case, would a segment be automatically
deleted once all of the docs in it have been marked deleted (e.g. via
reindexing)? If yes, at what point (commit or merge)?

Thanks,
Rahul

On Fri, Sep 23, 2022 at 9:25 AM Adrien Grand <jpountz@gmail.com> wrote:

> On the 2nd question, we do not plan on leveraging this information to
> figure out the codec: the codec that should be used to read a segment is
> stored separately (also in segment infos).
>
> It is mostly useful for diagnostics purposes. E.g. if we see an interesting
> corruption case where checksums match, we can guess that there is a bug
> somewhere in Lucene in a version that is between this minimum version and
> the version that was used to write the segment.
>
> On Sat, Sep 17, 2022 at 11:07 AM Dawid Weiss <dawid.weiss@gmail.com>
> wrote:
>
> > > (so deleted docs == max docs) and call commit. Will/Can this segment
> > still
> > > exist after commit?
> > >
> >
> > Depends on your merge policy and index deletion policy. You can configure
> > Lucene to keep older commits (and then you'll preserve all historical
> > segments).
> >
> > I don't know the answer to your second question.
> >
> > D.
> >
>
>
> --
> Adrien
>