Mailing List Archive: RFC: N-2 compatibility for file formats

RFC: N-2 compatibility for file formats

simon.willnauer at gmail

Jan 6, 2021, 1:40 AM

Post #1 of 11 (584 views)

Hello all,

Currently Lucene supports reading and writing indices that have been
created with the current or previous (N-1) major version of Lucene.
Lucene refuses to open an index created by N-2 or earlier versions.
I would like to propose that Lucene add support for opening indices
created by version N-2 in read-only mode. Here's what I have in mind:

- Read-only support. You can open a reader on an index created by
version N-2, but you cannot open an IndexWriter on it, meaning that
you cannot delete, update, or add documents in N-2 indices, nor
force-merge them.

- File-format compatibility only. File-format compatibility enables
reading the content of old indices, but nothing more. Everything that
is done on top of file formats, like analysis or the encoding of
length normalization factors, is not guaranteed and is only supported
on a best-effort basis.

I came up with these limitations because I wanted to keep the scope
minimal in order to retain Lucene's ability to move forward. If there
is consensus to move forward with this, I would like to target Lucene
9.0 with this change.
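
To make the two rules concrete, here is a minimal sketch of the version
gate they imply. It is only an illustration: the class and method names
are hypothetical and are not Lucene APIs, and the current major version
of 9 is assumed for the example.

```java
// Hypothetical sketch of the proposed policy. None of these names are Lucene APIs.
final class CompatPolicySketch {

    /** Reading: the index may have been created by major version N, N-1, or N-2. */
    static boolean canOpenReader(int indexCreatedMajor, int currentMajor) {
        return indexCreatedMajor <= currentMajor
            && indexCreatedMajor >= currentMajor - 2;
    }

    /** Writing (delete, update, add, force-merge): N or N-1 only, as today. */
    static boolean canOpenWriter(int indexCreatedMajor, int currentMajor) {
        return indexCreatedMajor <= currentMajor
            && indexCreatedMajor >= currentMajor - 1;
    }

    public static void main(String[] args) {
        int current = 9; // assumed current major version, for illustration only
        for (int created = 6; created <= 9; created++) {
            System.out.println("index created by " + created + ".x: reader="
                + canOpenReader(created, current)
                + ", writer=" + canOpenWriter(created, current));
        }
    }
}
```

The only difference between the two checks is the lower bound, which is
the whole point of the proposal: the write path stays exactly as strict
as it is now.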

Simon

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: RFC: N-2 compatibility for file formats

ichattopadhyaya at gmail

Jan 6, 2021, 5:48 AM

Post #2 of 11 (584 views)

Sounds great, +1

Re: RFC: N-2 compatibility for file formats

dsmiley at apache

Jan 6, 2021, 6:00 AM

Post #3 of 11 (584 views)

+1 -- Lucene should not _prevent_ this.

I forget where things stood in the past conversations about this subject...
I think most recently raised by Erick Erickson. I recall that we don't want
to maintain the code to read older indices... which I sympathize with...
but I recall there is code that actively *blocks* you (end user) from
reading N-2 which I think goes too far, _forcing_ you to fork Lucene to
work around that. At least a user should be able to maintain however far
back if they have their own codecs that they maintain (as I do at work).

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

Re: RFC: N-2 compatibility for file formats

msokolov at gmail

Jan 6, 2021, 7:20 AM

Post #4 of 11 (584 views)

In practice what would this mean? We relax the restriction that David
mentions, and we keep old codecs around in backwards-codecs for two
major releases instead of one? Are there other implications? Suppose
we had a Query that relied on a specific index format, which gets
retired. We keep the index format code around - do we also need to
remember to maintain the old Query?

-Mike

Re: RFC: N-2 compatibility for file formats

dawid.weiss at gmail

Jan 6, 2021, 12:17 PM

Post #5 of 11 (584 views)

I see a more difficult problem in the opposite direction: say, a new
Query that requires something from the index that older indexes
(codecs) don't have. Then running such a query would result in, I
assume, an exception? Things get awkward when you have existing systems
that wish to upgrade gradually, so that some segments are in older
codecs and newer segments are in newer codecs.
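
A rough sketch of that failure mode, under the assumption that the check
is made per segment. Everything here is hypothetical (the Segment class,
the "feature since major version" idea, and the exception type); it is
not how Lucene models segments or queries.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; these types are illustrative and are not Lucene classes.
final class MixedSegmentsSketch {

    /** A segment, described only by its name and the major version that wrote it. */
    static final class Segment {
        final String name;
        final int createdMajor;

        Segment(String name, int createdMajor) {
            this.name = name;
            this.createdMajor = createdMajor;
        }
    }

    /**
     * Throws if any segment was written before the major version that
     * introduced the index structure the query depends on.
     */
    static void checkQuerySupported(List<Segment> segments, int featureSinceMajor) {
        for (Segment s : segments) {
            if (s.createdMajor < featureSinceMajor) {
                throw new UnsupportedOperationException(
                    "segment " + s.name + " (written by " + s.createdMajor
                    + ".x) lacks a structure introduced in " + featureSinceMajor + ".x");
            }
        }
    }

    public static void main(String[] args) {
        // A partially upgraded index: one old segment, one new segment.
        List<Segment> index = Arrays.asList(new Segment("_0", 8), new Segment("_1", 9));
        try {
            checkQuerySupported(index, 9);
        } catch (UnsupportedOperationException e) {
            System.out.println("query rejected: " + e.getMessage());
        }
    }
}
```

With a gradually upgraded index the query only becomes usable once the
last old segment has been merged away or rewritten.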

But in general I'm quite ok with keeping N-2 compatibility if it's not too
much trouble.

D.

Re: RFC: N-2 compatibility for file formats

yseeley at gmail

Jan 6, 2021, 12:53 PM

Post #6 of 11 (584 views)

On Wed, Jan 6, 2021 at 4:40 AM Simon Willnauer <simon.willnauer@gmail.com>
wrote:

> You can open a reader on an index created by
> version N-2, but you cannot open an IndexWriter on it
>

+1
There should definitely be more consideration given to back compat in
general... it's caused a ton of pain to users over time.

-Yonik

Re: RFC: N-2 compatibility for file formats

jim.ferenczi at gmail

Jan 7, 2021, 11:03 AM

Post #7 of 11 (584 views)

The proposal is only about keeping the ability to read file formats back to
N-2. Everything that is done on top of the file format is not guaranteed
and should be supported on a best-effort basis.
That's an important aspect if we don't want to block innovation. So in
practice that means that queries that require some specific file format, or
analyzers that change behavior across major versions, would not be part of
the extended guarantee.

Re: RFC: N-2 compatibility for file formats

simon.willnauer at gmail

Jan 9, 2021, 3:12 AM

Post #8 of 11 (584 views)

I can provide some examples of BWC issues and what we would do if it
happened in the future:

- negative offsets: in this case it would be best effort to add a
wrapper around the older formats that checks on the read side whether
offsets go backwards and throws an exception, so that consumers that
assume offsets only go forward don't fail or go OOM etc.
- norms encoding: in this case it would be best effort in the older
norms formats to convert to the newer encodings.
- the removal of numeric field queries would not fall under the
promises we make with N-2 compatibility, and it would be the
responsibility of the user to keep the code around that understands
the value of a field.

I hope this clarifies some of the aspects?

We would only do all this on the read side; for writing, we would
still reject indices that are older than N-1.
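
For the negative-offsets case, the read-side wrapper could look roughly
like the sketch below. This is not Lucene code: the wrapper works on a
plain iterator of start offsets rather than on a postings API, and the
class name and the exception type are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical read-side wrapper; it is not a Lucene class.
final class ForwardOnlyOffsets implements Iterator<Integer> {
    private final Iterator<Integer> in; // start offsets as decoded from an old format
    private int lastOffset = 0;         // offsets must be >= 0 and must never decrease

    ForwardOnlyOffsets(Iterator<Integer> in) {
        this.in = in;
    }

    @Override
    public boolean hasNext() {
        return in.hasNext();
    }

    @Override
    public Integer next() {
        int offset = in.next();
        if (offset < lastOffset) {
            // Fail early instead of letting a consumer misbehave later.
            throw new IllegalStateException(
                "offsets go backwards: " + offset + " < " + lastOffset);
        }
        lastOffset = offset;
        return offset;
    }

    /** Reads every offset through the wrapper and returns how many were read. */
    static int consumeAll(List<Integer> startOffsets) {
        Iterator<Integer> it = new ForwardOnlyOffsets(startOffsets.iterator());
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("read " + consumeAll(Arrays.asList(0, 4, 4, 9)) + " offsets");
        try {
            consumeAll(Arrays.asList(0, 7, 3));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because lastOffset starts at 0, a negative first offset is rejected by
the same check as an offset that moves backwards.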

simon

Re: RFC: N-2 compatibility for file formats

lucene at mikemccandless

Jan 11, 2021, 7:56 AM

Post #9 of 11 (584 views)

+1, I like the idea in general.

We will have to work out the details in practice as we come across "index
breaking" changes, and where/how to draw the line of "best effort". But I
think this is an improvement for our users over the hard check we now have
for "only N-1", and likely not so much development effort?

I think where it might get interesting is if we want to make a Codec API
change, maybe to optimize an interesting use case, and then we must do some
development to fix the N-2 BWC codec (as well as the N-1 BWC codec that we
already must fix in such a case today).

Some users seem to keep their indices alive for a very long time!

Mike McCandless

http://blog.mikemccandless.com

Re: RFC: N-2 compatibility for file formats

jpountz at gmail

Jan 13, 2021, 5:57 AM

Post #10 of 11 (584 views)

+1, this strikes me as a good balance between increasing backward
compatibility guarantees and still keeping room for innovation.

David, actually I would like to advocate in favor of still disallowing
opening N-2 indices by default, as they might not match Lucene's current
expectations (e.g. using a different encoding for norms due to
LUCENE-7730), and using Lucene's current analyzers/similarities/queries
might trigger surprising behavior. My preference would be to expose the
ability to open N-2 indices behind an expert API/flag that documents
limitations with N-2 indices.
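
A rough sketch of what such an opt-in could look like. The names are
hypothetical and this is not an existing Lucene API; it also assumes
that the expert path may not reach further back than N-2.

```java
// Hypothetical sketch; the class and method names are made up for illustration.
final class ExpertOpenSketch {

    /** Default entry point: keeps today's hard check (N-1 or newer). */
    static void checkCanOpen(int indexCreatedMajor, int currentMajor) {
        checkCanOpen(indexCreatedMajor, currentMajor, currentMajor - 1);
    }

    /**
     * Expert entry point: the caller explicitly lowers the minimum supported
     * major version and thereby accepts the documented N-2 limitations.
     */
    static void checkCanOpen(int indexCreatedMajor, int currentMajor, int minSupportedMajor) {
        if (minSupportedMajor < currentMajor - 2) {
            // Assumption: the opt-in never reaches further back than N-2.
            throw new IllegalArgumentException(
                "minSupportedMajor must be >= " + (currentMajor - 2));
        }
        if (indexCreatedMajor < minSupportedMajor) {
            throw new IllegalStateException(
                "index created by " + indexCreatedMajor
                + ".x is older than the minimum supported " + minSupportedMajor + ".x");
        }
    }

    public static void main(String[] args) {
        checkCanOpen(7, 9, 7); // explicit opt-in: an N-2 index is accepted
        try {
            checkCanOpen(7, 9); // default: the same index is still rejected
        } catch (IllegalStateException e) {
            System.out.println("default open rejected: " + e.getMessage());
        }
    }
}
```

Keeping the default overload strict means nobody ends up on an N-2 index
by accident; the caller has to ask for it.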

Mike, I wondered about this question too. As you pointed out, I think that
we will generally be ok given that the N-2 compatibility layer will very
likely be the same as the N-1 compatibility layer that we need to develop
anyway. I tried to think of examples where that wouldn't work but couldn't
find any (which doesn't mean that there are none, but hopefully it would be
rare).

--
Adrien

Re: RFC: N-2 compatibility for file formats

simon.willnauer at gmail

Jan 14, 2021, 7:52 AM

Post #11 of 11 (584 views)

Thanks for all the feedback. I opened
https://issues.apache.org/jira/browse/LUCENE-9669 to address this
further.
