Mailing List Archive: Are the new index consistency checks too strict?

Are the new index consistency checks too strict?

msokolov at gmail

Sep 1, 2021, 7:43 AM

Post #1 of 6 (401 views)

While upgrading I ran afoul of some inconsistencies in our schema
usage, and to fix them I've ended up having to add data to our index
that I'd rather not. Let me give a little context: We have a
parent/child document structure. Some fields are shared across parent
and child docs, others are not. Our index has a sort key, and in order
for all the parent/child docs to sort together correctly, we add the
same (docvalues) fields that are part of the sortkey to both parent
and child docs. Some of these fields are *also* indexed as postings
(StringField) of the same name, but we only index the postings field
on the parent document, since child documents are never searched for
on their own - always in conjunction with a parent.

The schema-checking code we added in Lucene 9 does not allow this: it
enforces that all documents having a field should have the same "index
options", and failing to index the postings gets interpreted as having
index options = NONE (because of the presence of the doc values field
of the same name, I think?)

Our current solution is to also index the postings for the child
document (but just with an empty string value). This seems gross, and
creates postings in the index that we will never use.

Another possibility would be to rename the fields so that the postings
and docvalues fields have different names. But in this case our
application-level schema diverges from our Lucene schema, adding a
layer of complexity we'd rather not introduce.

Finally, could we relax this constraint, always allowing index
options=NONE regardless of how other docs are indexed? Would it cause
problems?

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Are the new index consistency checks too strict?

jpountz at gmail

Sep 1, 2021, 9:23 AM

Post #2 of 6 (401 views)

This additional validation that we introduced in Lucene 9 feels like a
natural extension of the validation that we already had before, such as the
fact that you cannot have some docs that use SORTED doc values and other
docs that use NUMERIC doc values on the same field. Actually I would have
liked to go further by enforcing that all data structures record the exact
same information but this is challenging due to the fact that IndexingChain
only has access to the encoded data, e.g. with IntPoint it only sees a
byte[] rather than the original integer, so we'd have to make assumptions
about how the data is encoded, which doesn't feel right.

I do like this additional validation very much because I suspect that in
most cases where users get this error, it is because they made a mistake in
their indexing code. And this also helps make Lucene work better
out-of-the-box. For instance, thanks to this additional validation we
enabled dynamic pruning when sorting on numeric fields by default - this is
opt-in on 8.x since this optimization needs to look at both points and doc
values, so it's broken if not all documents have the same schema. And there
are other things we could do in the near future like rewriting
DocValuesFieldExistsQuery to a MatchAllDocsQuery when points/terms report
that docCount == maxDoc.
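
To illustrate why an optimization that consults both points and doc values depends on a consistent schema, here is a toy sketch in plain Java. It is not Lucene's implementation of dynamic pruning, which is more involved. The class `PruningModel`, its methods, and the array layout are hypothetical stand-ins, with `null` meaning that a document has no value in that structure.

```java
/** Toy model, not Lucene code: two per-document columns that are expected to agree. */
final class PruningModel {

    /** Returns the doc id holding the smallest non-null value, or -1 if the column has no values. */
    static int argMin(Long[] column) {
        int best = -1;
        for (int doc = 0; doc < column.length; doc++) {
            if (column[doc] != null && (best == -1 || column[doc] < column[best])) {
                best = doc;
            }
        }
        return best;
    }

    /** Reference answer: first hit of an ascending sort, reading doc values only. */
    static int topHitFromDocValues(Long[] docValues) {
        return argMin(docValues);
    }

    /** Shortcut answer: trusts the points structure to cover the same documents. */
    static int topHitFromPoints(Long[] points) {
        return argMin(points);
    }

    public static void main(String[] args) {
        Long[] docValues = {30L, 10L, 20L};
        Long[] consistentPoints = {30L, 10L, 20L};
        Long[] sparsePoints = {30L, null, 20L}; // doc 1 has doc values but no points

        System.out.println(topHitFromDocValues(docValues));
        System.out.println(topHitFromPoints(consistentPoints));
        System.out.println(topHitFromPoints(sparsePoints));
    }
}
```

With the sparse column, the shortcut picks doc 2 while the doc-values scan picks doc 1. That is the kind of disagreement a consistent schema rules out.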

In my opinion the correct solution for the problem you are facing would be
to have a way to make index sorting aware of the parent/child relationship
so that index sorting would read the sort key of the parent document
whenever it is on a child document, e.g. as done on LUCENE-5312
<https://issues.apache.org/jira/browse/LUCENE-5312>. This way you wouldn't
have to duplicate this sort key from your parent documents to your child
documents, so you wouldn't have any schema issues.

--
Adrien

Re: Are the new index consistency checks too strict?

msokolov at gmail

Sep 2, 2021, 4:45 AM

Post #3 of 6 (401 views)

Yes, I am also supportive of the idea of having a schema that is
enforced, and I like what it enables us to do. I just wonder if we
could relax the enforcement around IndexOptions.NONE (and
DocValuesType.NONE). Would it make sense to enable NONE to be "equal
to" any other IndexOptions, so that, e.g., if you index a field with
IndexOptions.DOCS_AND_TERMS then every document must have either
DOCS_AND_TERMS or NONE? In the case where a field is *only* indexed
as terms, and has no docvalues, this is already allowed. But if you
index a field as both docvalue and terms, then it is not (currently),
which seems weird. I guess the same is true of a field that has no
docvalues on some docs, and has them on others, but is also indexed as
terms everywhere. I think that ought to be allowed (since you can have
a sparse docvalues field that is not indexed with terms).

Re: Are the new index consistency checks too strict?

msokolov at gmail

Sep 2, 2021, 4:46 AM

Post #4 of 6 (401 views)

Oh, and also, I like the idea of making index sorting parent/child aware!

Re: Are the new index consistency checks too strict?

msokolov at gmail

Sep 2, 2021, 5:02 AM

Post #5 of 6 (401 views)

Hmm .. I guess I missed the implication of your comment about
requiring both points and docvalues for some cases, which I guess
could be violated if we relaxed this NONE != not NONE enforcement for
docvalues (or points)...

Re: Are the new index consistency checks too strict?

jpountz at gmail

Sep 2, 2021, 5:07 AM

Post #6 of 6 (401 views)

Yes. The idea behind these new enforcements is that all documents must have
a consistent schema, but we still support the case when some documents are
missing values for a field. Whenever a field gets added for the first time
to an index, we generate a FieldInfo for it. And further documents that
have this field must use exactly the same features on this field as the
ones that are configured on this initial FieldInfo.

For instance if you index a document with both terms and doc values on a
given field, then further documents must have both terms and doc values on
this field too, or nothing. They cannot have only terms or only doc
values; this is illegal.

Likewise if you index a document with only terms, then further documents
must have either terms, or nothing. They cannot have terms and doc values,
or even doc values only; this is illegal.
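
The rule in the two examples above can be sketched as a small plain-Java model. This is only an illustration under stated assumptions, not Lucene's FieldInfo code: the class `FieldSchemaModel`, the `Feature` enum, and the methods `addField` and `accepts` are hypothetical names invented for this sketch.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of the per-field consistency rule; it is not Lucene code. */
final class FieldSchemaModel {

    /** Hypothetical flags standing in for the data structures a field can use. */
    enum Feature { TERMS, DOC_VALUES, POINTS }

    // Feature set recorded the first time a field name is seen.
    private final Map<String, Set<Feature>> schema = new HashMap<>();

    /**
     * Registers one field of one document. A document that omits the field
     * never calls this method for it, so missing values are always allowed.
     */
    void addField(String name, Feature... features) {
        Set<Feature> used = EnumSet.noneOf(Feature.class);
        Collections.addAll(used, features);
        // putIfAbsent keeps the first-seen feature set and returns it on later calls.
        Set<Feature> expected = schema.putIfAbsent(name, used);
        if (expected != null && !expected.equals(used)) {
            throw new IllegalArgumentException(
                "field \"" + name + "\" was first added with " + expected + " but now uses " + used);
        }
    }

    /** Convenience wrapper that reports acceptance instead of throwing. */
    boolean accepts(String name, Feature... features) {
        try {
            addField(name, features);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        FieldSchemaModel model = new FieldSchemaModel();
        // The first document to use "key" fixes its feature set.
        model.addField("key", Feature.TERMS, Feature.DOC_VALUES);
        System.out.println(model.accepts("key", Feature.TERMS, Feature.DOC_VALUES)); // same features
        System.out.println(model.accepts("key", Feature.DOC_VALUES));                // doc values only
        System.out.println(model.accepts("key", Feature.TERMS));                     // terms only
    }
}
```

In this model the parent/child case from the start of the thread corresponds to a parent calling `addField("key", TERMS, DOC_VALUES)` and a child calling `addField("key", DOC_VALUES)`, which the second call rejects.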

--
Adrien