Mailing List Archive: How to handle corrupt Lucene index

How to handle corrupt Lucene index

tim at whittington

Apr 13, 2022, 5:24 PM

Post #1 of 11 (1148 views)

I'm working on a database system that uses Lucene for full-text
indexes (currently using 7.3.0).
We're encountering occasional problems after unclean shutdowns of the
database, resulting in
"org.apache.lucene.index.CorruptIndexException: file mismatch" errors when
the IndexWriter is constructed.

In every case where this has occurred, CheckIndex finds no issues with the
Lucene index.

The database has write-ahead-log and recovery facilities, so making the
Lucene indexes durable with respect to database operations is doable, but
in this case the IndexWriter itself is failing to initialise, so it looks
like there needs to be a lower-level validation/recovery operation before
reconciling transactions can take place.

Can anyone provide any advice about how the database can detect and recover
from this situation?

thanks
Tim
---

Relevant parts of the exception:

org.apache.lucene.index.CorruptIndexException: file mismatch, expected id=e673n8syolqg0phzxvw8d7czu, got=dwpa40yzwp7gf06xibrsx1pn2 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/databases/xxxx/luceneIndexes/SearchNameIx/_8x.si")))
    at org.apache.lucene.codecs.CodecUtil.checkIndexHeaderID(CodecUtil.java:351)
    at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:256)
    at org.apache.lucene.codecs.lucene70.Lucene70SegmentInfoFormat.read(Lucene70SegmentInfoFormat.java:95)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:360)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
    at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1121)
------------------ 8< --------------------------

Re: How to handle corrupt Lucene index

rcmuir at gmail

Apr 13, 2022, 5:41 PM

Post #2 of 11 (1148 views)

File mismatch means files are getting mixed up. It is the equivalent
of swapping, say, /etc/hosts and /etc/passwd on your computer.

In your case you have a .si file (let's say it is named _79.si) that
really belongs to another segment (e.g. _42).

This isn't a Lucene issue; it is something else you must be using
that is "transporting files around", and it is mixing the files up.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to handle corrupt Lucene index

baris.kazar at oracle

Apr 13, 2022, 5:45 PM

Post #3 of 11 (1148 views)

In my experience, if you build an index with version x, you should also use that index with version x.
I have never encountered any problems with Lucene this way.

Can you maybe recreate the Lucene index on 7.3.0?

Also, how do you use the database in your scenario?
Are you using JDBC-like operations, as with an Oracle database? Lucene operations are independent of database operations.

Best regards

Re: How to handle corrupt Lucene index

Apr 13, 2022, 6:15 PM

Post #4 of 11 (1148 views)

To be clear, these indexes are created and read with the same Lucene
version (7.3.0).

Tim

Re: How to handle corrupt Lucene index

Apr 13, 2022, 6:17 PM

Post #5 of 11 (1148 views)

Thanks for this - I'll have a look at the database server code that is
managing the Lucene indexes and see if I can track it down.

Tim

Re: How to handle corrupt Lucene index

baris.kazar at oracle

Apr 13, 2022, 6:18 PM

Post #6 of 11 (1148 views)

That is good practice; I pointed it out because I saw Lucene 7.0 in the stack trace.

Best regards

Re: How to handle corrupt Lucene index

baris.kazar at oracle

Apr 13, 2022, 6:20 PM

Post #7 of 11 (1148 views)

Yes, that is a great place to look first, and it would rule out any JDBC-related issues that might lead to such problems.
Best regards

Re: How to handle corrupt Lucene index

Apr 13, 2022, 7:39 PM

Post #8 of 11 (1148 views)

Using a known-broken Lucene index directory, I dropped down to the Lucene
API and tracked this down a bit further.

My directory listing is this:

----------------
17 Mar 13:39 _8w.fdt
17 Mar 13:39 _8w.fdx
17 Mar 13:39 _8w.fnm
17 Mar 13:39 _8w.nvd
17 Mar 13:39 _8w.nvm
17 Mar 13:39 _8w.si
17 Mar 13:39 _8w_Lucene50_0.doc
17 Mar 13:39 _8w_Lucene50_0.pos
17 Mar 13:39 _8w_Lucene50_0.tim
17 Mar 13:39 _8w_Lucene50_0.tip
17 Mar 13:39 _8w_Lucene70_0.dvd
17 Mar 13:39 _8w_Lucene70_0.dvm
17 Mar 14:33 _8x.cfe
17 Mar 14:33 _8x.cfs
20 Mar 21:19 _8x.fdt
20 Mar 21:19 _8x.fdx
20 Mar 21:19 _8x.fnm
20 Mar 21:19 _8x.nvd
20 Mar 21:19 _8x.nvm
20 Mar 21:19 _8x.si
20 Mar 21:19 _8x_Lucene50_0.doc
20 Mar 21:19 _8x_Lucene50_0.pos
20 Mar 21:19 _8x_Lucene50_0.tim
20 Mar 21:19 _8x_Lucene50_0.tip
20 Mar 21:19 _8x_Lucene70_0.dvd
20 Mar 21:19 _8x_Lucene70_0.dvm
20 Mar 21:19 _8y.cfe
20 Mar 21:19 _8y.cfs
20 Mar 21:19 _8y.si
20 Mar 21:19 _8z.cfe
20 Mar 21:19 _8z.cfs
20 Mar 21:19 _8z.si
20 Mar 21:19 _90.cfe
20 Mar 21:19 _90.cfs
20 Mar 21:19 _90.si
20 Mar 21:19 _91.cfe
20 Mar 21:19 _91.cfs
20 Mar 21:19 _91.si
20 Mar 21:19 _92.cfe
20 Mar 21:19 _92.cfs
20 Mar 21:19 _92.si
20 Mar 21:19 _93.cfe
20 Mar 21:19 _93.cfs
20 Mar 21:19 _93.si
20 Mar 21:19 _94.cfe
20 Mar 21:19 _94.cfs
20 Mar 21:19 _94.si
20 Mar 21:19 _95.cfe
20 Mar 21:19 _95.cfs
20 Mar 21:19 _95.si
18 Mar 06:49 segments_93
20 Mar 21:19 segments_96
6 Mar 21:22 write.lock

----------------

When I load SegmentInfos for segments_96 directly, it succeeds, and I can
see it's referencing every SegmentInfo except _8w.
If I try to load SegmentInfos for segments_93, it gets past loading _8w and
fails on _8x.
Checking with a hex editor, segments_93 is referencing _8w ... _94 and
segments_96 is referencing _8x ... _95.

The IndexWriter failure is due to the IndexFileDeleter attempting to load
segments_93 to track referenced commit infos.

Is this a state an IndexWriter could get the directory into, or does it
involve higher-level interference (like copying files around)?

Tim
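
The finding above (an older segments_93 that no longer loads, sitting next
to a segments_96 that does) suggests one cheap check a host application
could run before constructing an IndexWriter: list which commit files are
present and in what order. The sketch below is plain Java with no Lucene
dependency. The class and method names are made up for illustration, and it
relies only on the convention that the suffix of segments_N is the commit
generation written in base 36. It validates nothing by itself; it only
shows which commit points exist so that each one can then be examined.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class CommitPoints {
    static final String PREFIX = "segments_";

    /** Returns the generation encoded in a commit file name, or -1 if the name is not one. */
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX) || fileName.length() == PREFIX.length()) {
            return -1;
        }
        try {
            // The suffix is assumed to be the generation in base 36.
            return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Filters a directory listing down to commit files, oldest generation first. */
    static List<String> commitsOldestFirst(Collection<String> fileNames) {
        List<String> commits = new ArrayList<>();
        for (String name : fileNames) {
            if (generationOf(name) >= 0) {
                commits.add(name);
            }
        }
        commits.sort(Comparator.comparingLong(CommitPoints::generationOf));
        return commits;
    }

    public static void main(String[] args) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get(args[0]))) {
            for (Path p : dir) {
                names.add(p.getFileName().toString());
            }
        }
        for (String commit : commitsOldestFirst(names)) {
            System.out.println(commit + " generation=" + generationOf(commit));
        }
    }
}
```

For the directory listed above this would report segments_93 (generation
327) followed by segments_96 (generation 330). More than one commit file is
not an error in itself; the point is only to know that an older commit
exists and will also be read when the IndexWriter starts up.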

Re: How to handle corrupt Lucene index

rcmuir at gmail

Apr 13, 2022, 7:59 PM

Post #9 of 11 (1148 views)

Honestly, the only time I've seen mixed-up files before (and the
motivation for the paranoid checks in Lucene) was bugs in some
distributed replication code. In that case, code that was copying files
across the network had some bugs (e.g. it used hashing of file contents
to try to reduce network chatter but didn't handle hash collisions
properly). It would most commonly happen for the .si file, simply
because it is typically a tiny file and more likely to cause hash
collisions in distributed code doing that. This was the motivation for
adding a unique ID to each segment and to all files corresponding to
that segment... basically, as a library, we can't trust filenames to be
what they claim.

segments_N doesn't just reference your segments by names like _8w and
_94; it also has each segment's unique ID. I would have to look at its
file format to tell you how to see this with your hex editor. But in
general, the segment's unique ID is referenced everywhere, starting
from segments_N. This way, when loading any index files for that
segment (including *.si), Lucene checks that they have a matching ID,
so we know they really do belong to that segment. Because we can't
trust filenames when users may manipulate them :)

If the file really belongs to another segment (e.g. because files got
mixed up), this check gives a clear error saying so. Without it, you
get pure insanity trying to debug problems when files get mixed up.
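
For anyone who wants to look at these IDs without a hex editor, here is a
minimal sketch in plain Java (11 or later, no Lucene dependency). It
assumes the index header layout that CodecUtil.writeIndexHeader produces in
Lucene 7.x: a big-endian magic int (0x3fd76c17), a length-prefixed codec
name, a big-endian version int, then the 16-byte ID. The class and method
names are invented for illustration, and the layout should be confirmed
against the CodecUtil source for the Lucene version actually in use. The ID
is rendered in base 36, which appears to be the form used in the "expected
id=..., got=..." exception message.

```java
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Lucene.
class IndexHeaderPeek {
    // Magic number assumed to open every codec header.
    static final int CODEC_MAGIC = 0x3fd76c17;
    static final int ID_LENGTH = 16;
    // magic + name-length byte + longest codec name + version + ID
    static final int MAX_HEADER_BYTES = 4 + 1 + 127 + 4 + ID_LENGTH;

    /** Reads the 16-byte ID from the start of a file and renders it in base 36. */
    static String readId(ByteBuffer header) {
        // ByteBuffer reads big-endian by default, matching the assumed layout.
        int magic = header.getInt();
        if (magic != CODEC_MAGIC) {
            throw new IllegalArgumentException(
                "not a codec header, magic=" + Integer.toHexString(magic));
        }
        int nameLength = header.get();
        if (nameLength < 0) {
            // High bit set would mean a multi-byte length, not expected for codec names.
            throw new IllegalArgumentException("unexpected codec name length");
        }
        header.position(header.position() + nameLength); // skip the codec name
        header.getInt();                                 // skip the format version
        byte[] id = new byte[ID_LENGTH];
        header.get(id);
        return new BigInteger(1, id).toString(Character.MAX_RADIX);
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            byte[] head = in.readNBytes(MAX_HEADER_BYTES);
            System.out.println(args[0] + " id=" + readId(ByteBuffer.wrap(head)));
        }
    }
}
```

If the layout assumption holds, running this against a suspect .si file
should print the same value the exception reports after "got=", which makes
it possible to search the other segments (or other nodes) for the file that
ID really belongs to.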

Re: How to handle corrupt Lucene index

Apr 13, 2022, 8:21 PM

Post #10 of 11 (1148 views)

Yeah, I really appreciate the paranoia in the file format.

This is a distributed/replicated database (I'd forgotten to mention that
until you mentioned distributed replication), so I suspect the database
server is shunting actual segment files around during a recovery process
and getting things muddled up.
I actually captured one of the other nodes, and it seems to have a similar
problem, except it has 3 segments_ files (2 of which are identical to the
ones in the index I listed).

I'll continue to dig through the database server code to track down what's
causing this.

Thanks a lot for the quick help.
Tim

On Thu, 14 Apr 2022 at 15:00, Robert Muir <rcmuir@gmail.com> wrote:

> Honestly the only time i've seen the mixed up files before (and the
> motivation for the paranoid checks in lucene), was bugs in some
> distributed replication code. In this case code that was copying files
> across the network had some bugs (e.g. used hashing of file contents
> to try to reduce network chatter but didn't handle hash collisions
> properly). So it would actually most commonly happen for .si file
> simply because it is typically a tiny file and more likely to cause
> hash collisions in some distributed code doing that. This was the
> motivation for adding unique id to each segment and all files
> corresponding to that segment... basically as a library, we can't
> trust filenames to be what they claim.
>
> segments_N doesn't just reference your segments by names like _8w and
> _94 but it also has segment's unique IDs, too. Would have to look at
> its file format to tell you how to see this with your hex editor. But
> in general, the segment unique ID is referenced everywhere, starting
> from segments_N. This way, when loading any index files for that
> segment (including *.si), lucene checks they have matching ID so that
> we know they really do belong to that segment. Because we can't trust
> filenames when users may manipulate them :)
>
> If the file really belongs to another segment (e.g. because files got
> mixed up), there's a clear error this way that files are mixed up.
> otherwise, without this check, you get pure insanity trying to debug
> problems when files get mixed up.
>
> On Wed, Apr 13, 2022 at 10:39 PM Tim Whittington <timw@apache.org> wrote:
> >
> > Using a known-broken Lucene index directory, I dropped down to the Lucene
> > API and tracked this down a bit further.
> >
> > My directory listing is this:
> >
> > ----------------
> > 17 Mar 13:39 _8w.fdt
> > 17 Mar 13:39 _8w.fdx
> > 17 Mar 13:39 _8w.fnm
> > 17 Mar 13:39 _8w.nvd
> > 17 Mar 13:39 _8w.nvm
> > 17 Mar 13:39 _8w.si
> > 17 Mar 13:39 _8w_Lucene50_0.doc
> > 17 Mar 13:39 _8w_Lucene50_0.pos
> > 17 Mar 13:39 _8w_Lucene50_0.tim
> > 17 Mar 13:39 _8w_Lucene50_0.tip
> > 17 Mar 13:39 _8w_Lucene70_0.dvd
> > 17 Mar 13:39 _8w_Lucene70_0.dvm
> > 17 Mar 14:33 _8x.cfe
> > 17 Mar 14:33 _8x.cfs
> > 20 Mar 21:19 _8x.fdt
> > 20 Mar 21:19 _8x.fdx
> > 20 Mar 21:19 _8x.fnm
> > 20 Mar 21:19 _8x.nvd
> > 20 Mar 21:19 _8x.nvm
> > 20 Mar 21:19 _8x.si
> > 20 Mar 21:19 _8x_Lucene50_0.doc
> > 20 Mar 21:19 _8x_Lucene50_0.pos
> > 20 Mar 21:19 _8x_Lucene50_0.tim
> > 20 Mar 21:19 _8x_Lucene50_0.tip
> > 20 Mar 21:19 _8x_Lucene70_0.dvd
> > 20 Mar 21:19 _8x_Lucene70_0.dvm
> > 20 Mar 21:19 _8y.cfe
> > 20 Mar 21:19 _8y.cfs
> > 20 Mar 21:19 _8y.si
> > 20 Mar 21:19 _8z.cfe
> > 20 Mar 21:19 _8z.cfs
> > 20 Mar 21:19 _8z.si
> > 20 Mar 21:19 _90.cfe
> > 20 Mar 21:19 _90.cfs
> > 20 Mar 21:19 _90.si
> > 20 Mar 21:19 _91.cfe
> > 20 Mar 21:19 _91.cfs
> > 20 Mar 21:19 _91.si
> > 20 Mar 21:19 _92.cfe
> > 20 Mar 21:19 _92.cfs
> > 20 Mar 21:19 _92.si
> > 20 Mar 21:19 _93.cfe
> > 20 Mar 21:19 _93.cfs
> > 20 Mar 21:19 _93.si
> > 20 Mar 21:19 _94.cfe
> > 20 Mar 21:19 _94.cfs
> > 20 Mar 21:19 _94.si
> > 20 Mar 21:19 _95.cfe
> > 20 Mar 21:19 _95.cfs
> > 20 Mar 21:19 _95.si
> > 18 Mar 06:49 segments_93
> > 20 Mar 21:19 segments_96
> > 6 Mar 21:22 write.lock
> >
> > ----------------
> >
> > When I load SegmentInfos for segments_96 directly, it succeeds, and I can
> > see it's referencing all the SegmentInfo except for _8w.
> > If I try to load SegmentInfos for segments_93, it gets past loading _8w
> and
> > fails on _8x.
> > Checking with a hex editor, segments_93 is referencing _8w ... _94 and
> > segments_96 is referencing _8x ... _95
> >
> > The IndexWriter failure is due to the IndexFileDeleter attempting to load
> > segments_93 to track referenced commit infos.
> >
> > Is this a state an IndexWriter could get the directory into, or does it
> > involve higher level interference (like copying files around)?
> >
> > Tim
> >
> > On Thu, 14 Apr 2022 at 13:20, Baris Kazar <baris.kazar@oracle.com>
> wrote:
> >
> > > yes that is a great point to look at first and that would eliminate any
> > > jdbc related issues that may lead to such problems.
> > > Best regards
> > > ________________________________
> > > From: Tim Whittington <timw@apache.org>
> > > Sent: Wednesday, April 13, 2022 9:17:44 PM
> > > To: java-user@lucene.apache.org <java-user@lucene.apache.org>
> > > Subject: Re: How to handle corrupt Lucene index
> > >
> > > Thanks for this - I'll have a look at the database server code that is
> > > managing the Lucene indexes and see if I can track it down.
> > >
> > > Tim
> > >
> > > On Thu, 14 Apr 2022 at 12:41, Robert Muir <rcmuir@gmail.com> wrote:
> > >
> > > > On Wed, Apr 13, 2022 at 8:24 PM Tim Whittington
> > > > <tim@whittington.nz.invalid> wrote:
> > > > >
> > > > > > I'm working with/on a database system that uses Lucene for full text
> > > > > > indexes (currently using 7.3.0).
> > > > > > We're encountering occasional problems that occur after unclean shutdowns
> > > > > > of the database , resulting in
> > > > > > "org.apache.lucene.index.CorruptIndexException: file mismatch" errors when
> > > > > > the IndexWriter is constructed.
> > > > > >
> > > > > > In all of the cases this has occurred, CheckIndex finds no issues with the
> > > > > > Lucene index.
> > > > > >
> > > > > > The database has write-ahead-log and recovery facilities, so making the
> > > > > > Lucene indexes durable wrt database operations is doable, but in this case
> > > > > > the IndexWriter itself is failing to initialise, so it looks like there
> > > > > > needs to be a lower-level validation/recovery operation before reconciling
> > > > > > transactions can take place.
> > > > > >
> > > > > > Can anyone provide any advice about how the database can detect and recover
> > > > > > from this situation?
> > > > >
> > > >
> > > > File mismatch means files are getting mixed up. It is the equivalent
> > > > of swapping say, /etc/hosts and /etc/passwd on your computer.
> > > >
> > > > In your case you have a .si file (lets say it is named _79.si) that
> > > > really belongs to another segment (e.g. _42).
> > > >
> > > > This isn't a lucene issue, this is something else you must be using
> > > > that is "transporting files around", and it is mixing the files up.
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
>

Re: How to handle corrupt Lucene index [ In reply to ]

rcmuir at gmail

Apr 13, 2022, 8:23 PM

Post #11 of 11 (1148 views)

If you are looking at the files in hex, you can see the file format
docs online for your version:
https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/index/SegmentInfos.html
SegID is written right after SegName; it is 16 bytes (a 128-bit number).
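
For anyone following along with a hex editor, here is a minimal sketch (not
from the thread) that pulls the 16-byte ID out of the header of an individual
index file such as _8x.si, so it can be compared with the SegID stored in
segments_N. It assumes the 7.x index header layout (4-byte magic 0x3fd76c17,
the codec name as a length-prefixed string, a 4-byte big-endian version, then
the 16-byte ID) and that the id=... values in the exception are that ID printed
as an unsigned base-36 number. The class and method names are made up for
illustration; only the JDK is used.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: prints the ID stored in the header of Lucene 7.x index files.
final class IndexHeaderId {
    // Magic number assumed to start every codec header.
    static final int CODEC_MAGIC = 0x3fd76c17;

    // Formats a 16-byte ID as an unsigned base-36 number
    // (assumed to match the id=... text in CorruptIndexException).
    static String idToString(byte[] id) {
        return new BigInteger(1, id).toString(Character.MAX_RADIX);
    }

    // Reads magic, codec name, and version, then returns the 16-byte ID that follows.
    static String readHeaderId(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        if (in.readInt() != CODEC_MAGIC) {
            throw new IOException("no codec header magic at start of file");
        }
        // The name length is a vInt; a single byte is enough for names under 128 bytes.
        int nameLength = in.readUnsignedByte();
        if (nameLength >= 128) {
            throw new IOException("codec name longer than this sketch handles");
        }
        in.readFully(new byte[nameLength]); // skip codec name
        in.readInt();                       // skip version
        byte[] id = new byte[16];
        in.readFully(id);
        return idToString(id);
    }

    public static void main(String[] args) throws IOException {
        for (String file : args) {
            try (InputStream in = Files.newInputStream(Paths.get(file))) {
                System.out.println(file + " id=" + readHeaderId(in));
            }
        }
    }
}
```

On Java 11 or later this can be run directly, e.g.
"java IndexHeaderId.java _8x.si _8x.cfe". If two files with the same segment
name print different IDs, that would suggest they were written for different
segments.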

On Wed, Apr 13, 2022 at 10:59 PM Robert Muir <rcmuir@gmail.com> wrote:
>
> Honestly the only time i've seen the mixed up files before (and the
> motivation for the paranoid checks in lucene), was bugs in some
> distributed replication code. In this case code that was copying files
> across the network had some bugs (e.g. used hashing of file contents
> to try to reduce network chatter but didn't handle hash collisions
> properly). So it would actually most commonly happen for .si file
> simply because it is typically a tiny file and more likely to cause
> hash collisions in some distributed code doing that. This was the
> motivation for adding unique id to each segment and all files
> corresponding to that segment... basically as a library, we can't
> trust filenames to be what they claim.
>
> segments_N doesn't just reference your segments by names like _8w and
> _94 but it also has segment's unique IDs, too. Would have to look at
> its file format to tell you how to see this with your hex editor. But
> in general, the segment unique ID is referenced everywhere, starting
> from segments_N. This way, when loading any index files for that
> segment (including *.si), lucene checks they have matching ID so that
> we know they really do belong to that segment. Because we can't trust
> filenames when users may manipulate them :)
>
> If the file really belongs to another segment (e.g. because files got
> mixed up), there's a clear error this way that files are mixed up.
> otherwise, without this check, you get pure insanity trying to debug
> problems when files get mixed up.
>
> On Wed, Apr 13, 2022 at 10:39 PM Tim Whittington <timw@apache.org> wrote:
> >
> > Using a known-broken Lucene index directory, I dropped down to the Lucene
> > API and tracked this down a bit further.
> >
> > My directory listing is this:
> >
> > ----------------
> > 17 Mar 13:39 _8w.fdt
> > 17 Mar 13:39 _8w.fdx
> > 17 Mar 13:39 _8w.fnm
> > 17 Mar 13:39 _8w.nvd
> > 17 Mar 13:39 _8w.nvm
> > 17 Mar 13:39 _8w.si
> > 17 Mar 13:39 _8w_Lucene50_0.doc
> > 17 Mar 13:39 _8w_Lucene50_0.pos
> > 17 Mar 13:39 _8w_Lucene50_0.tim
> > 17 Mar 13:39 _8w_Lucene50_0.tip
> > 17 Mar 13:39 _8w_Lucene70_0.dvd
> > 17 Mar 13:39 _8w_Lucene70_0.dvm
> > 17 Mar 14:33 _8x.cfe
> > 17 Mar 14:33 _8x.cfs
> > 20 Mar 21:19 _8x.fdt
> > 20 Mar 21:19 _8x.fdx
> > 20 Mar 21:19 _8x.fnm
> > 20 Mar 21:19 _8x.nvd
> > 20 Mar 21:19 _8x.nvm
> > 20 Mar 21:19 _8x.si
> > 20 Mar 21:19 _8x_Lucene50_0.doc
> > 20 Mar 21:19 _8x_Lucene50_0.pos
> > 20 Mar 21:19 _8x_Lucene50_0.tim
> > 20 Mar 21:19 _8x_Lucene50_0.tip
> > 20 Mar 21:19 _8x_Lucene70_0.dvd
> > 20 Mar 21:19 _8x_Lucene70_0.dvm
> > 20 Mar 21:19 _8y.cfe
> > 20 Mar 21:19 _8y.cfs
> > 20 Mar 21:19 _8y.si
> > 20 Mar 21:19 _8z.cfe
> > 20 Mar 21:19 _8z.cfs
> > 20 Mar 21:19 _8z.si
> > 20 Mar 21:19 _90.cfe
> > 20 Mar 21:19 _90.cfs
> > 20 Mar 21:19 _90.si
> > 20 Mar 21:19 _91.cfe
> > 20 Mar 21:19 _91.cfs
> > 20 Mar 21:19 _91.si
> > 20 Mar 21:19 _92.cfe
> > 20 Mar 21:19 _92.cfs
> > 20 Mar 21:19 _92.si
> > 20 Mar 21:19 _93.cfe
> > 20 Mar 21:19 _93.cfs
> > 20 Mar 21:19 _93.si
> > 20 Mar 21:19 _94.cfe
> > 20 Mar 21:19 _94.cfs
> > 20 Mar 21:19 _94.si
> > 20 Mar 21:19 _95.cfe
> > 20 Mar 21:19 _95.cfs
> > 20 Mar 21:19 _95.si
> > 18 Mar 06:49 segments_93
> > 20 Mar 21:19 segments_96
> > 6 Mar 21:22 write.lock
> >
> > ----------------
> >
> > When I load SegmentInfos for segments_96 directly, it succeeds, and I can
> > see it's referencing all the SegmentInfo except for _8w.
> > If I try to load SegmentInfos for segments_93, it gets past loading _8w and
> > fails on _8x.
> > Checking with a hex editor, segments_93 is referencing _8w ... _94 and
> > segments_96 is referencing _8x ... _95
> >
> > The IndexWriter failure is due to the IndexFileDeleter attempting to load
> > segments_93 to track referenced commit infos.
> >
> > Is this a state an IndexWriter could get the directory into, or does it
> > involve higher level interference (like copying files around)?
> >
> > Tim
> >
> > On Thu, 14 Apr 2022 at 13:20, Baris Kazar <baris.kazar@oracle.com> wrote:
> >
> > > yes that is a great point to look at first and that would eliminate any
> > > jdbc related issues that may lead to such problems.
> > > Best regards
> > > ________________________________
> > > From: Tim Whittington <timw@apache.org>
> > > Sent: Wednesday, April 13, 2022 9:17:44 PM
> > > To: java-user@lucene.apache.org <java-user@lucene.apache.org>
> > > Subject: Re: How to handle corrupt Lucene index
> > >
> > > Thanks for this - I'll have a look at the database server code that is
> > > managing the Lucene indexes and see if I can track it down.
> > >
> > > Tim
> > >
> > > On Thu, 14 Apr 2022 at 12:41, Robert Muir <rcmuir@gmail.com> wrote:
> > >
> > > > On Wed, Apr 13, 2022 at 8:24 PM Tim Whittington
> > > > <tim@whittington.nz.invalid> wrote:
> > > > >
> > > > > I'm working with/on a database system that uses Lucene for full text
> > > > > indexes (currently using 7.3.0).
> > > > > > We're encountering occasional problems that occur after unclean shutdowns
> > > > > > of the database , resulting in
> > > > > > "org.apache.lucene.index.CorruptIndexException: file mismatch" errors when
> > > > > > the IndexWriter is constructed.
> > > > > >
> > > > > > In all of the cases this has occurred, CheckIndex finds no issues with the
> > > > > > Lucene index.
> > > > > >
> > > > > > The database has write-ahead-log and recovery facilities, so making the
> > > > > > Lucene indexes durable wrt database operations is doable, but in this case
> > > > > > the IndexWriter itself is failing to initialise, so it looks like there
> > > > > > needs to be a lower-level validation/recovery operation before reconciling
> > > > > > transactions can take place.
> > > > > >
> > > > > > Can anyone provide any advice about how the database can detect and recover
> > > > > > from this situation?
> > > > >
> > > >
> > > > File mismatch means files are getting mixed up. It is the equivalent
> > > > of swapping say, /etc/hosts and /etc/passwd on your computer.
> > > >
> > > > In your case you have a .si file (lets say it is named _79.si) that
> > > > really belongs to another segment (e.g. _42).
> > > >
> > > > This isn't a lucene issue, this is something else you must be using
> > > > that is "transporting files around", and it is mixing the files up.
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
