Mailing List Archive: CheckIndex complaining about -1 for norms value

trejkaz at trypticon

Jun 10, 2020, 7:21 PM

Post #1 of 7 (1182 views)

Hi all.

We use CheckIndex as a post-migration sanity check and are seeing this
quirk. I'm wondering whether negative norms are even legit, or whether
the value should have been treated as if it were zero...

TX

0.00% total deletions; 378 documents; 0 deleteions
Segments file=segments_1 numSegments=1 version=8.5.1
id=52isly98kogao7j0cnautwknj
1 of 1: name=_0 maxDoc=378
version=8.5.1
id=52isly98kogao7j0cnautwkni
codec=Lucene84
compound=false
numFiles=18
size (MB)=0.663
diagnostics = {java.vendor=Oracle Corporation, os=Mac OS X,
java.version=1.8.0_191, java.vm.version=25.191-b12,
lucene.version=8.5.1, os.arch=x86_64,
java.runtime.version=1.8.0_191-b12, source=addIndexes(CodecReader...),
os.version=10.15.5, timestamp=1591841756208}
no deletions
test: open reader.........OK [took 0.004 sec]
test: check integrity.....OK [took 0.002 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [36 fields] [took 0.000 sec]
test: field norms.........OK [26 fields] [took 0.001 sec]
test: terms, freq, prox...ERROR: java.lang.RuntimeException:
Document 0 doesn't have terms according to postings but has a norm
value that is not zero: -1

java.lang.RuntimeException: Document 0 doesn't have terms according to
postings but has a norm value that is not zero: -1
at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1678)
at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1871)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:724)
at org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:2973)

test: stored fields.......OK [15935 total field count; avg 42.2
fields per doc] [took 0.003 sec]
test: term vectors........OK [1173 total term vector count; avg
3.1 term/freq vector fields per doc] [took 0.170 sec]
test: docvalues...........OK [16 docvalues fields; 11 BINARY; 2
NUMERIC; 0 SORTED; 2 SORTED_NUMERIC; 1 SORTED_SET] [took 0.003 sec]
test: points..............OK [4 fields, 1509 points] [took 0.000 sec]
FAILED
WARNING: exorciseIndex() would remove reference to this segment;
full exception:
java.lang.RuntimeException: Term Index test failed
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:750)
at org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:2973)

WARNING: 1 broken segments (containing 378 documents) detected
Took 0.355 sec total.
WARNING: would write new segments file, and 378 documents would be
lost, if -exorcise were specified

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: CheckIndex complaining about -1 for norms value

jpountz at gmail

Jun 10, 2020, 11:37 PM

Post #2 of 7 (1180 views)

Hi Trejkaz,

Negative norm values are legal. The problem here is that Lucene expects
documents that have no terms to either have no norm value (typically
because the document doesn't have a value for the field) or have a norm
value equal to 0 (typically because the token stream over the field value
produced no tokens).

Are you using a custom similarity or one of the Lucene ones? With the
Lucene similarities, a document would only get -1 as a norm if the field
had a number of tokens very close to Integer.MAX_VALUE.
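
The rule described above can be sketched in plain Java. This is only a simplified model of the expectation, not Lucene's actual CheckIndex code: the class name NormInvariant and the method noTermsNormIsValid are hypothetical, and an OptionalLong stands in for "the document may or may not have a norm for this field".

```java
import java.util.OptionalLong;

// Simplified model of the expectation for documents WITHOUT terms.
// Hypothetical names; this is not Lucene's CheckIndex implementation.
final class NormInvariant {

    /**
     * @param norm the stored norm for a document that has no terms in the
     *             field, or empty if the document has no norm at all
     * @return true if the norm is acceptable for a document with no terms
     */
    static boolean noTermsNormIsValid(OptionalLong norm) {
        // Acceptable: no norm value at all (no value for the field),
        // or a norm of exactly 0 (the token stream produced no tokens).
        return !norm.isPresent() || norm.getAsLong() == 0L;
    }

    public static void main(String[] args) {
        System.out.println(noTermsNormIsValid(OptionalLong.empty())); // prints true
        System.out.println(noTermsNormIsValid(OptionalLong.of(0L)));  // prints true
        System.out.println(noTermsNormIsValid(OptionalLong.of(-1L))); // prints false
    }
}
```

The last case is the one in the report: a document with no terms whose norm is -1.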

--
Adrien

Re: CheckIndex complaining about -1 for norms value

trejkaz at trypticon

Jun 10, 2020, 11:44 PM

Post #3 of 7 (1180 views)

Well,

We're using the default Lucene similarity. But as far as I know, we've
always disabled norms as well. So I'm surprised I'm even seeing norms
mentioned in the context of our own index, which is why I wondered
whether -1 might have been an older placeholder for "no value" which
later became 0 or something.

About the only thing I'm sure about at the moment is that whatever is
going on is weird.

TX

Re: CheckIndex complaining about -1 for norms value

jpountz at gmail

Jun 11, 2020, 12:00 AM

Post #4 of 7 (1180 views)

To my knowledge, -1 always represented the maximum supported length, both
before and after 7.0 (when we changed the norms encoding). One thing that
changed when we introduced sparse norms is that documents with no value
moved from having 0 as a norm to not having a norm at all, but I don't see
how this could explain what you are seeing either.

Do you know which Lucene version initially indexed this document (and
thus computed the norm value)?

--
Adrien

Re: CheckIndex complaining about -1 for norms value

lucene at mikemccandless

Jun 11, 2020, 6:26 AM

Post #5 of 7 (1176 views)

Maybe we should fix CheckIndex to print norms as unsigned integers?
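
To illustrate what printing a norm as unsigned would look like, here is a small plain-Java sketch. It assumes the norm is a single byte that has been sign-extended into a long; that is an assumption made for illustration, not something stated in this thread. The class name UnsignedNorm and the helper asUnsignedByte are hypothetical.

```java
// Sketch: how the same norm value reads when printed signed vs. unsigned.
// Assumption: the norm is one byte, sign-extended into a long.
final class UnsignedNorm {

    // Reinterpret the low 8 bits of the norm as an unsigned value (0..255).
    static int asUnsignedByte(long norm) {
        return Byte.toUnsignedInt((byte) norm);
    }

    public static void main(String[] args) {
        long norm = -1L;
        System.out.println(norm);                        // prints -1
        System.out.println(asUnsignedByte(norm));        // prints 255
        // Without the single-byte assumption, the full 64 bits read as:
        System.out.println(Long.toUnsignedString(norm)); // prints 18446744073709551615
    }
}
```

Under the single-byte assumption, -1 reads as 255, the largest value a byte can hold, which fits the description of -1 as the maximum supported length.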

Mike McCandless

http://blog.mikemccandless.com

Re: CheckIndex complaining about -1 for norms value

jpountz at gmail

Jun 11, 2020, 7:32 AM

Post #6 of 7 (1176 views)

+1

--
Adrien

Re: CheckIndex complaining about -1 for norms value

trejkaz at trypticon

Jun 14, 2020, 5:59 PM

Post #7 of 7 (1160 views)

The answer here might be "terrifyingly old" actually. We've been using
IndexUpgrader quite heavily. My best guess is Lucene 2.x.

What I can verify at least, by manual inspection, is that the docs
where the value is showing -1 are also docs where there is no value in
the field, where I presume it was supposed to be zero instead. What's
funny though is that some values _are_ zero...

Anyway, I can probably try writing a tricky migration to rewrite those
to 0. Or deleting them entirely, because like I previously said, I
thought we weren't using norms at all. I'll have to track down the
truth on that because I don't know anymore. This index clearly has
them but it's also very old.
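
The per-document decision in such a migration could look roughly like the sketch below. It is plain Java over arrays and only models the rewrite rule: the names NormRewrite, migratedNorm and migrate are hypothetical, and the hasTerms array stands in for whatever the postings report for each document. In a real migration this logic would presumably sit behind a wrapped reader (for example something built on FilterCodecReader) handed to addIndexes, which is not shown here.

```java
import java.util.Arrays;

// Sketch of the rewrite rule only; hypothetical names, no Lucene API involved.
final class NormRewrite {

    // Documents with terms keep their stored norm; documents without terms get 0.
    static long migratedNorm(boolean hasTerms, long storedNorm) {
        return hasTerms ? storedNorm : 0L;
    }

    // Apply the rule to every document; index i is the document number.
    static long[] migrate(boolean[] hasTerms, long[] norms) {
        if (hasTerms.length != norms.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        long[] out = new long[norms.length];
        for (int doc = 0; doc < norms.length; doc++) {
            out[doc] = migratedNorm(hasTerms[doc], norms[doc]);
        }
        return out;
    }

    public static void main(String[] args) {
        boolean[] hasTerms = {false, true, false};
        long[] norms = {-1L, 12L, 0L};
        System.out.println(Arrays.toString(migrate(hasTerms, norms))); // prints [0, 12, 0]
    }
}
```

Dropping the norm entirely for documents without terms would satisfy the same expectation; rewriting to 0 is just the simpler case to model with arrays.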

Only after migrating up to Lucene 8 did we get any complaints from
CheckIndex, though. So now I'm not sure whether I have broken this by
migrating to v8, or whether this is just a new check in v8 and the
index was already screwed. More investigation required. :(

TX
