Mailing List Archive: Best strategy migrate indexes

Best strategy migrate indexes

pablovb at gmail

Oct 28, 2022, 5:00 AM

Post #1 of 14 (952 views)

Hi all,

I have some indices that were created with Lucene 5.5.0. I have updated my
dependencies and code to Lucene 7 (my final goal is to use Lucene 9), and
when I try to work with them I get this exception:
org.apache.lucene.index.IndexFormatTooOldException: Format version is not
supported (resource
BufferedChecksumIndexInput(MMapIndexInput(path=".......\tests\segments_b"))):
this index is too old (version: 5.5.0). This version of Lucene only
supports indexes created with release 6.0 and later.

I want to migrate from Lucene 5.x to Lucene 9.x. What is the best
strategy? Is there any tool to migrate the indices? Is it mandatory to
reindex? If so, how can I deal with this when I do not have the source
documents that generated my current indices (I mean, I only have the
indices themselves)?

Thanks,

--
Pablo Vázquez
(pablovb@gmail.com)

Re: Best strategy migrate indexes

gus.heck at gmail

Oct 29, 2022, 11:16 AM

Post #2 of 14 (952 views)

Hi Pablo,

The deafening silence is probably nobody wanting to give you the bad news.
You are on a mission that may not be feasible, and even if you can get it
to "work", the end result is unlikely to be equivalent to indexing the
original data with Lucene 9.x. The indexing process is fundamentally lossy,
and information originally used to produce non-stored fields will have been
thrown out. A simple example is stopwords, or anything analyzed with
subclasses of FilteringTokenFilter: if the stop word list changed, or the
details of one of these filters changed (bugfix?), you will end up with a
different result than indexing with 9.x. That is just one example; another
is stemming, where the index likely contains only the stem, not the whole
word. Other folks who are more interested in the details of our codecs than
I am can probably provide further examples on a more fundamental level.
Lucene is not a database, and the source documents should always be
retained in a form that can be reindexed. If you have inherited a system
where source material has not been retained, you have a difficult project
and may have some potentially painful expectation setting to perform.
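
To make the lossiness concrete, here is a toy sketch in plain Java. It is
not Lucene code: the stop word list and the suffix-stripping "stemmer" are
made up for illustration. Two different inputs produce the same indexed
terms, so the original text cannot be recovered from the terms alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer: lowercases, drops stop words, strips a few suffixes. Not Lucene code. */
final class LossyAnalysisDemo {
    // Hypothetical stop word list, for illustration only.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "is", "are", "of");

    /** Naive suffix stripping standing in for a real stemmer. */
    static String stem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    /** Returns the terms that would end up in the inverted index. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue; // stop words never reach the index
            }
            terms.add(stem(raw));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Indexing the documents"));
        System.out.println(analyze("Index a document"));
    }
}
```

Both calls in main yield the terms index and document, so nothing in the
term list says which of the two sentences was indexed.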

Best,
Gus

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: Best strategy migrate indexes

baris.kazar at oracle

Oct 29, 2022, 11:29 AM

Post #3 of 14 (952 views)

It is always good practice to retain the original (non-indexed)
data: whenever the Lucene version changes,
even by a minor version, I always reindex.

Best regards

Re: Best strategy migrate indexes

kryptonics411 at gmail

Oct 29, 2022, 2:30 PM

Post #4 of 14 (952 views)

Inside the Zulia search engine, the object being indexed is always a
JSON/BSON object, and we store the BSON as a stored byte field in the
index. This allows easy internal reindexing when the searchable fields
change, and it also allows us to update to the latest Lucene version.
Combined with lucene-backward-codecs, an index older than the current
major version can be opened and reindexed. If you have stored all the
fields (or a JSON/BSON blob) in the index, it would be easy to reindex in
the new format. If you have not, maybe opening with lucene-backward-codecs
will be enough for your use case.
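
A minimal sketch of that pattern, using a toy in-memory index in plain Java
rather than the Lucene or Zulia APIs (the class SelfContainedIndex and its
methods are hypothetical names): the source of every document is kept as a
stored blob, so the whole index can be rebuilt from itself with a different
analyzer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

/** Toy index that keeps the source of every document as a stored blob. Not Lucene code. */
final class SelfContainedIndex {
    private final List<byte[]> storedSource = new ArrayList<>();            // docId -> original bytes
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>(); // term -> docIds

    /** Stores the raw source and indexes the terms the analyzer produces. */
    void add(String source, Function<String, List<String>> analyzer) {
        int docId = storedSource.size();
        storedSource.add(source.getBytes(StandardCharsets.UTF_8));
        for (String term : analyzer.apply(source)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Builds a fresh index from the stored blobs only, using a possibly different analyzer. */
    SelfContainedIndex reindex(Function<String, List<String>> newAnalyzer) {
        SelfContainedIndex fresh = new SelfContainedIndex();
        for (byte[] blob : storedSource) {
            fresh.add(new String(blob, StandardCharsets.UTF_8), newAnalyzer);
        }
        return fresh;
    }

    /** Returns the ids of the documents containing the exact term. */
    TreeSet<Integer> search(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        Function<String, List<String>> oldAnalyzer = s -> Arrays.asList(s.split("\\s+"));
        Function<String, List<String>> newAnalyzer =
                s -> Arrays.asList(s.toLowerCase(Locale.ROOT).split("\\s+"));

        SelfContainedIndex old = new SelfContainedIndex();
        old.add("Lucene Index", oldAnalyzer);
        old.add("index upgrade", oldAnalyzer);

        System.out.println(old.search("index"));                      // case-sensitive index
        System.out.println(old.reindex(newAnalyzer).search("index")); // rebuilt, lowercased
    }
}
```

The rebuild never touches the old postings, only the stored blobs. That is
why it keeps working when the analysis changes, and why it is not available
for fields that were indexed but never stored.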

Thanks,
Matt

Re: Best strategy migrate indexes

pablovb at gmail

Oct 31, 2022, 8:56 AM

Post #5 of 14 (951 views)

Hi all,

Thank you all for your responses.

So, when updating to a newer (major) Lucene version that modifies its
codecs, there is no way to ensure everything keeps working properly short
of re-indexing, right?

Apart from not having some of the original sources that were indexed (which
I will try to solve by using the IndexUpgrader tool), I have another
problem: I was using org.apache.lucene.uninverting.UninvertingReader to
perform queries against the index, mainly through the grouping API. But it
has been removed (since Lucene 7.0). So, again, do I have any alternative
apart from re-indexing to use docValues?

To give you more context, I am a developer of a tool that multiple
customers can use to index their data (currently, with Lucene 5.5.5). We
are planning to upgrade to Lucene 9 (because of some vulnerabilities
affecting Lucene 5.5.5) and I think asking them to reindex will not go down
well :(

Regards,

--
Pablo Vázquez
(pablovb@gmail.com)

Re: Best strategy migrate indexes

trejkaz at trypticon

Oct 31, 2022, 4:34 PM

Post #6 of 14 (944 views)

Well...

There's a way, but I wouldn't necessarily recommend it.

You can write custom migration code against some version of Lucene
which supports doc values, to create doc values fields. It's going to
involve writing a FilterCodecReader which wraps your real index and
then pretends to also have doc values, which you'll build in a custom
class which works similarly to UninvertingReader. Then you pass those
CodecReaders to IndexWriter.addIndexes to create a new index which
really has those doc values.
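
The idea of uninverting can be sketched without Lucene at all. The toy code
below is plain Java; UninvertDemo and its method are hypothetical names, and
a single-valued, untokenized field is assumed. It walks term -> docIds
postings and rebuilds the per-document column that doc values would hold. A
real migration would do the equivalent inside a FilterCodecReader, as
described above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Toy "uninversion": rebuilds a per-document column from term -> docIds postings. Not Lucene code. */
final class UninvertDemo {
    /**
     * @param postings term -> ids of the documents containing it (single-valued field assumed)
     * @param maxDoc   number of documents in the index
     * @return array where slot i holds the term of document i, or null if the doc has no value
     */
    static String[] uninvert(Map<String, List<Integer>> postings, int maxDoc) {
        String[] column = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                if (column[docId] != null) {
                    // A tokenized or multi-valued field cannot be flattened this way.
                    throw new IllegalStateException("doc " + docId + " has more than one term");
                }
                column[docId] = entry.getKey();
            }
        }
        return column;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = Map.of(
                "books", List.of(0, 2),
                "music", List.of(1));
        // Document 3 has no value for the field, so its slot stays null.
        System.out.println(Arrays.toString(uninvert(postings, 4)));
    }
}
```

Note that the recovered values are the indexed terms. They match the
original field values only if the field was indexed without analysis that
alters the text.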

We did that ourselves when we had the same issue. The only painful
thing about it is having to keep around older versions of Lucene to do
that migration. Forever. Luckily we were already using lucenemigrator,
which has the older versions baked into it with package prefixes. So
that library will get fatter and fatter over time, but at least our own
code only gets fatter at the rate migrations are added.

The same approach works for any other kind of ad-hoc migration you
might want to perform. For example, you might want to create points,
remove the index for a field, or add an index for a field.

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Best strategy migrate indexes

gus.heck at gmail

Oct 31, 2022, 8:04 PM

Post #7 of 14 (944 views)

You really should reindex for a four-version jump. The index upgrader tool
explicitly prohibits what you are proposing to do with it. See
https://issues.apache.org/jira/browse/LUCENE-9127 and
https://solr.apache.org/guide/8_1/indexupgrader-tool.html (It seems that
the javadoc for IndexUpgrader should perhaps be enhanced to clarify this.)
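
The rule behind the exception in the first post can be written down as a
small model. This is plain Java, not a Lucene API; the class and method
names are hypothetical, and the one-major-version window is taken from the
error message quoted in that post (Lucene 7 accepts indexes created with 6.0
and later). As far as the linked pages describe it, recent releases also
record the major version that created the index, so stepping through
IndexUpgrader one major version at a time does not widen that window.

```java
/** Simplified model of the index compatibility window. Not Lucene code. */
final class IndexCompatibility {
    /** Oldest creation major a reader accepts, assuming the N-1 rule from the error message. */
    static int oldestSupportedMajor(int readerMajor) {
        return readerMajor - 1;
    }

    /** True if a reader of the given major version would accept an index created by indexCreatedMajor. */
    static boolean canOpen(int readerMajor, int indexCreatedMajor) {
        return indexCreatedMajor >= oldestSupportedMajor(readerMajor)
                && indexCreatedMajor <= readerMajor;
    }

    public static void main(String[] args) {
        System.out.println(canOpen(6, 5)); // one major back: inside the window
        System.out.println(canOpen(7, 5)); // the case from the first post
        System.out.println(canOpen(9, 5)); // the stated goal of the migration
    }
}
```

Under this model only the first call returns true, which matches the
IndexFormatTooOldException reported for Lucene 7 opening a 5.5.0 index.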

A good read, with lots of discussion on this, starts a few comments
deep in this issue: https://issues.apache.org/jira/browse/LUCENE-8264

Another useful thing to listen to is Erick Erickson's Activate 2019
presentation https://www.youtube.com/watch?v=eaQBH_H3d3g - near the end he
tells you how to break the rules, but the caveats are important.

Getting folks to re-index is a matter of presentation much of the time
(huge corpuses with ridiculous cost, and cases where they didn't keep their
own documents excluded of course). You need to sell the benefits to justify
the cost. If you add a feature that goes along with it, you make it look
like part of the price for the feature... people might give you something
if they get something in return, but something for nothing goes down
sideways every time.

-Gus

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: Best strategy migrate indexes

pablovb at gmail

Nov 2, 2022, 1:13 PM

Post #8 of 14 (934 views)

Hi,

> Luckily we were already using lucenemigrator

What do you mean by "lucenemigrator"? Is it a public tool?

I am trying to create a tool that reads docs from a Lucene 5 index and
generates Lucene 9 documents from them (with docValues). That might work,
right? I am shading both Lucene 5 and Lucene 9 to avoid package conflicts.

Thanks!

El mar, 1 nov 2022 a las 0:35, Trejkaz (<trejkaz@trypticon.org>) escribió:

> Well...
>
> There's a way, but I wouldn't necessarily recommend it.
>
> You can write custom migration code against some version of Lucene
> which supports doc values, to create doc values fields. It's going to
> involve writing a FilterCodecReader which wraps your real index and
> then pretends to also have doc values, which you'll build in a custom
> class which works similarly to UninvertingReader. Then you pass those
> CodecReaders to IndexWriter.addIndexes to create a new index which
> really has those doc values.
>
> We did that ourselves when we had the same issue. The only painful
> thing about it is having to keep around older versions of lucene to do
> that migration. Forever. Luckily we were already using lucenemigrator,
> which has the older versions baked into it with package prefixes. So
> that library will get fatter and fatter over time but at least our own
> code only gets fatter at the rate migrations are added.
>
> The same approach works for any other kind of ad-hoc migration you
> might want to perform. e.g., you might want to create points. Or
> remove an index for a field. Or add an index for a field.
>
> TX

--
Pablo Vázquez
(pablovb@gmail.com)

Re: Best strategy migrate indexes [ In reply to ]

trejkaz at trypticon

Nov 2, 2022, 2:57 PM

Post #9 of 14 (932 views)

That was a typo; I meant to say luceneupgrader.

By itself, it won't do any work to convert fields between different
types. For that, you have to do what I described.

TX

On Thu, 3 Nov 2022 at 07:14, Pablo Vázquez Blázquez <pablovb@gmail.com> wrote:
>
> Hi,
>
> > Luckily we were already using lucenemigrator
>
> What do you mean with "lucenemigrator"? Is it a public tool?
>
> I am trying to create a tool to read docs from a lucene5 index and generate
> lucene9 documents from them (with docValues). That might work, right? I am
> shading both lucene5 and lucene9 to avoid package conflicts.
>
> Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Best strategy migrate indexes [ In reply to ]

pablovb at gmail

Nov 7, 2022, 3:08 AM

Post #10 of 14 (885 views)

Hi!

> I am trying to create a tool to read docs from a lucene5 index and
> generate lucene9 documents from them (with docValues). That might work,
> right? I am shading both lucene5 and lucene9 to avoid package conflicts.

I am doing the following steps:

- create IndexReader with lucene5 package over a lucene5 index
- create IndexWriter with lucene7 package
- iterate over reader.numDocs() to process each Document (lucene5)
   - convert each Document (lucene5) to lucene7 Document
      - for each IndexableField (lucene5) from Document (lucene5), convert
        it to create an IndexableField (lucene7)
         - create a SortedDocValuesField (lucene7) and add it to the
           Document (lucene7)
         - add the field to the Document (lucene7)
   - add each converted Document to the writer
- close IndexReader and IndexWriter
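
One detail in the steps above is worth checking. In Lucene, document IDs run
from 0 to maxDoc() - 1 and deleted documents keep their IDs, while numDocs()
counts only live documents, so a loop bounded by numDocs() can stop early when
the source index contains deletions. The toy model below illustrates the
difference. It uses plain arrays rather than Lucene classes, and every name in
it is hypothetical; in Lucene 5.x the corresponding calls would presumably be
IndexReader.maxDoc() and MultiFields.getLiveDocs(reader).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an index with deletions. This is NOT Lucene API. */
class LiveDocsSketch {
    // Slot i holds the stored value of doc ID i (hypothetical data).
    static final String[] STORED = {"a", "b", "c", "d", "e"};
    // LIVE[i] is false when doc ID i has been deleted.
    static final boolean[] LIVE = {true, false, true, false, true};

    /** One greater than the largest doc ID, deleted or not. */
    static int maxDoc() {
        return STORED.length;
    }

    /** Number of live (non-deleted) documents. */
    static int numDocs() {
        int n = 0;
        for (boolean live : LIVE) {
            if (live) n++;
        }
        return n;
    }

    /** Visits doc IDs 0..limit-1 and keeps only the live documents. */
    static List<String> collect(int limit) {
        List<String> out = new ArrayList<>();
        for (int docId = 0; docId < limit; docId++) {
            if (LIVE[docId]) out.add(STORED[docId]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(numDocs())); // prints [a, c]    (doc "e" is missed)
        System.out.println(collect(maxDoc()));  // prints [a, c, e]
    }
}
```

If the source index has no deletions, both bounds give the same result, so
this only matters for indexes that were updated or had documents removed.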

When I open the resulting migrated lucene7 index with Luke, I get an error:
org.apache.lucene.index.IndexFormatTooNewException: Format version is not
supported (resource
BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
9 (needs to be between 6 and 7)

When I run the tool "luceneupgrader
<https://github.com/hakanai/luceneupgrader>" on the same index, I get:
java -jar luceneupgrader-0.5.2-SNAPSHOT.jar info tests_small_index-7.x-migrator
Lucene index version: 7

What am I doing wrong or misunderstanding?

Thanks!

--
Pablo Vázquez
(pablovb@gmail.com)

Re: Best strategy migrate indexes [ In reply to ]

trejkaz at trypticon

Nov 7, 2022, 3:17 AM

Post #11 of 14 (885 views)

The process itself sounds like it should work (it's basically a
reindex so it should be safer than trying to migrate directly.)

I would check that the Luke version matches the Lucene version - if
the two match, it shouldn't be possible to get issues like this.
That is, the precise versions of Lucene each is using.

TX
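
For reference, the exception quoted in post #10 reports two things: the format
version found in the segments file (9) and the range the reading code accepts
(6 to 7). These numbers look like file-format versions rather than Lucene
release numbers, which would fit an index written by a later 7.x release than
the one bundled in the viewer; that reading is an assumption, not something
confirmed in this thread. A minimal sketch of such a range check, with
hypothetical names throughout (this is not Lucene's code):

```java
/** Toy model of a format-version range check. Names are hypothetical, NOT Lucene API. */
class FormatCheckSketch {

    /** Thrown when a file was written in a newer format than the reader supports. */
    static class FormatTooNewException extends RuntimeException {
        FormatTooNewException(String resource, int version, int min, int max) {
            super("Format version is not supported (resource " + resource + "): "
                    + version + " (needs to be between " + min + " and " + max + ")");
        }
    }

    /**
     * Returns the version if it lies in [min, max]; otherwise throws.
     * min and max describe the READER; version describes the FILE.
     */
    static int checkFormat(String resource, int version, int min, int max) {
        if (version > max) {
            throw new FormatTooNewException(resource, version, min, max);
        }
        if (version < min) {
            throw new IllegalStateException("format " + version + " is too old for " + resource);
        }
        return version;
    }

    public static void main(String[] args) {
        // A reader that accepts formats 6..7 opening a file written as format 7: fine.
        System.out.println(checkFormat("segments_1", 7, 6, 7)); // prints 7
        // The same reader opening a file written as format 9: rejected.
        try {
            checkFormat("segments_1", 9, 6, 7);
        } catch (FormatTooNewException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that the accepted range belongs to whichever
library opens the index, so the same index can open in one tool and fail in
another.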


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Best strategy migrate indexes [ In reply to ]

pablovb at gmail

Nov 7, 2022, 3:55 AM

Post #12 of 14 (885 views)

Thanks, TX, for your response.

> I would check that the Luke version matches the Lucene version - if
> the two match, it shouldn't be possible to get issues like this.
> That is, the precise versions of Lucene each is using.

Yes, I am using https://github.com/DmitryKey/luke/releases/tag/luke-7.1.0

It works fine with my newly generated indexes, but not with the
"migrated" ones.

--
Pablo Vázquez
(pablovb@gmail.com)

Re: Best strategy migrate indexes [ In reply to ]

msokolov at gmail

Nov 7, 2022, 4:24 PM

Post #13 of 14 (874 views)

The error you got

BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
9 (needs to be between 6 and 7)

indicates that the index you are reading was written by Lucene 9, so
things are not set up the way you described (writing using Lucene 7).

Re: Best strategy migrate indexes [ In reply to ]

pablovb at gmail

Nov 8, 2022, 1:06 AM

Post #14 of 14 (869 views)

Yes, it looks that way, but I am able to open that index with Luke 7.7.0,
and it shows version 7.7.3.

I have triple-checked my program, and all Lucene classes come from Lucene 5
and Lucene 7.

Although my final goal is to migrate to Lucene 9, I want to do it
progressively, so that I can test the API changes in my code and keep my
tests passing. So I am currently migrating to Lucene 7. Since Luke 7.7.0 can
open the migrated index, I am moving from Lucene 7.0.0 to Lucene 7.7.0 to see
whether that works.
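
A possible explanation, offered as an assumption rather than something
confirmed in this thread: the "9" in "9 (needs to be between 6 and 7)" may
be the format version of the segments_N file, not a Lucene major release.
As far as I recall from the SegmentInfos source, format 7 was introduced in
Lucene 7.0, 8 in 7.2 and 9 in 7.4, which would fit Luke 7.1.0 rejecting the
index while Luke 7.7.0 opens it. The sketch below reads that number straight
from the file header without Lucene on the classpath. The class and method
names are hypothetical, and the version table is from memory and should be
checked against the Lucene source you actually use.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: reads the format version from a segments_N header. */
final class SegmentsHeaderSketch {

    /** Magic number that opens a Lucene codec header (CodecUtil.CODEC_MAGIC). */
    static final int CODEC_MAGIC = 0x3FD76C17;

    /** Parses magic, codec name and format version from the first header bytes. */
    static int readFormatVersion(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header); // ByteBuffer is big-endian by default
        int magic = buf.getInt();
        if (magic != CODEC_MAGIC) {
            throw new IllegalArgumentException(
                "Not a codec header, magic=0x" + Integer.toHexString(magic));
        }
        int nameLength = buf.get() & 0xFF; // vInt length; one byte for short names
        if (nameLength > 127) {
            throw new IllegalArgumentException("Multi-byte name length not handled");
        }
        byte[] name = new byte[nameLength];
        buf.get(name);
        String codec = new String(name, StandardCharsets.UTF_8);
        if (!codec.equals("segments")) {
            throw new IllegalArgumentException("Unexpected codec name: " + codec);
        }
        return buf.getInt(); // the number reported in the exception message
    }

    /** Assumed mapping from format version to the release that introduced it. */
    static String describe(int formatVersion) {
        switch (formatVersion) {
            case 6:  return "format introduced in Lucene 5.3";
            case 7:  return "format introduced in Lucene 7.0";
            case 8:  return "format introduced in Lucene 7.2";
            case 9:  return "format introduced in Lucene 7.4";
            case 10: return "format introduced in Lucene 8.6";
            default: return "not covered by this sketch";
        }
    }

    /** Builds a synthetic header so the parser can be tried without an index. */
    static byte[] sampleHeader(int formatVersion) {
        byte[] name = "segments".getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 1 + name.length + 4);
        buf.putInt(CODEC_MAGIC);
        buf.put((byte) name.length);
        buf.put(name);
        buf.putInt(formatVersion);
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to a segments_N file, or run without arguments for a demo.
        byte[] bytes = args.length > 0
            ? Files.readAllBytes(Paths.get(args[0]))
            : sampleHeader(9);
        int version = readFormatVersion(bytes);
        System.out.println("segments format version: " + version
            + " (" + describe(version) + ")");
    }
}
```

Running it against tests_small_index-7.x-migrator\segments_1 would show
whether the file carries format 9. If it does, the likelier culprit is a
Lucene 7.4+ jar on the writer's classpath rather than Lucene 9.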

Regards.

On Tue, 8 Nov 2022 at 1:25, Michael Sokolov (<msokolov@gmail.com>)
wrote:

> The error you got
>
>
> BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
> 9 (needs to be between 6 and 7)
>
> indicates that the index you are reading was written by Lucene 9, so
> things are not set up the way you described (writing using Lucene 7)
>
>
> > Thanks TX for your response.
> >
> > I would check that the Luke version matches the Lucene version - if
> > > the two match, it shouldn't be possible to get issues like this.
> > > That is, the precise versions of Lucene each is using.
> >
> >
> > Yes, I am using
> https://github.com/DmitryKey/luke/releases/tag/luke-7.1.0
> >
> > It works ok with my new generated indexes, but it does not with the
> > "migrated" ones.
> >
> > El lun, 7 nov 2022 a las 12:18, Trejkaz (<trejkaz@trypticon.org>)
> escribió:
> >
> > > The process itself sounds like it should work (it's basically a
> > > reindex so it should be safer than trying to migrate directly.)
> > >
> > > I would check that the Luke version matches the Lucene version - if
> > > the two match, it shouldn't be possible to get issues like this.
> > > That is, the precise versions of Lucene each is using.
> > >
> > > TX
> > >
> > >
> > > On Mon, 7 Nov 2022 at 22:09, Pablo Vázquez Blázquez <pablovb@gmail.com
> >
> > > wrote:
> > > >
> > > > Hi!
> > > >
> > > > > I am trying to create a tool to read docs from a lucene5 index and
> > > > generate lucene9 documents from them (with docValues). That might
> work,
> > > > right? I am shading both lucene5 and lucene9 to avoid package
> conflicts.
> > > >
> > > > I am doing the following steps:
> > > >
> > > > - create IndexReader with lucene5 package over a lucene5 index
> > > > - create IndexWriter with lucene7 package
> > > > - iterate over reader.numDocs() to process each Document (lucene5)
> > > > - convert each Document (lucene5) to lucene7 Document
> > > > - for each IndexableField (lucene5) from Document (lucene5)
> > > convert
> > > > it to create an IndexableField (lucene7)
> > > > - create a SortedDocValuesField (lucene7) and add it to
> the
> > > > Document (lucene7)
> > > > - add the field to the Document (lucene7)
> > > > - add each converted Document to the writer
> > > > - close IndexReader and IndexWriter
> > > >
> > > > When I open the resulting migrated lucene7 index with Luke I got an
> > > error:
> > > > org.apache.lucene.index.IndexFormatTooNewException: Format version
> is not
> > > > supported (resource
> > > >
> > >
> BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
> > > > 9 (needs to be between 6 and 7)
> > > >
> > > > When I use the tool "luceneupgrader
> > > > <https://github.com/hakanai/luceneupgrader>", I got:
> > > > java -jar luceneupgrader-0.5.2-SNAPSHOT.jar info
> > > > tests_small_index-7.x-migrator
> > > > Lucene index version: 7
> > > >
> > > > What am I doing wrong or misleading?
> > > >
> > > > Thanks!
> > > >
> > > > El mié, 2 nov 2022 a las 21:13, Pablo Vázquez Blázquez (<
> > > pablovb@gmail.com>)
> > > > escribió:
> > > >
> > > > > Hi,
> > > > >
> > > > > Luckily we were already using lucenemigrator
> > > > >
> > > > >
> > > > > What do you mean with "lucenemigrator"? Is it a public tool?
> > > > >
> > > > > I am trying to create a tool to read docs from a lucene5 index and
> > > > > generate lucene9 documents from them (with docValues). That might
> work,
> > > > > right? I am shading both lucene5 and lucene9 to avoid package
> > > conflicts.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > El mar, 1 nov 2022 a las 0:35, Trejkaz (<trejkaz@trypticon.org>)
> > > escribió:
> > > > >
> > > > >> Well...
> > > > >>
> > > > >> There's a way, but I wouldn't necessarily recommend it.
> > > > >>
> > > > >> You can write custom migration code against some version of Lucene
> > > > >> which supports doc values, to create doc values fields. It's
> going to
> > > > >> involve writing a FilterCodecReader which wraps your real index
> and
> > > > >> then pretends to also have doc values, which you'll build in a
> custom
> > > > >> class which works similarly to UninvertingReader. Then you pass
> those
> > > > >> CodecReaders to IndexWriter.addIndexes to create a new index which
> > > > >> really has those doc values.
> > > > >>
> > > > >> We did that ourselves when we had the same issue. The only painful
> > > > >> thing about it is having to keep around older versions of lucene
> to do
> > > > >> that migration. Forever. Luckily we were already using
> lucenemigrator,
> > > > >> which has the older versions baked into it with package prefixes.
> So
> > > > >> that library will get fatter and fatter over time but at least
> our own
> > > > >> code only gets fatter at the rate migrations are added.
> > > > >>
> > > > >> The same approach works for any other kind of ad-hoc migration you
> > > > >> might want to perform. e.g., you might want to create points. Or
> > > > >> remove an index for a field. Or add an index for a field.
> > > > >>
> > > > >> TX
> > > > >>
> > > > >>
> > > > >> On Tue, 1 Nov 2022 at 02:57, Pablo Vázquez Blázquez <
> > > pablovb@gmail.com>
> > > > >> wrote:
> > > > >> >
> > > > >> > Hi all,
> > > > >> >
> > > > >> > Thank you all for your responses.
> > > > >> >
> > > > > >> > So, when updating to a newer (major) Lucene version that modifies its
> > > > > >> > codecs, there is no way to ensure everything keeps working properly,
> > > > > >> > unless re-indexing, right?
> > > > > >> >
> > > > > >> > Apart from not having some original sources that were indexed (which I
> > > > > >> > will try to solve by using the *IndexUpgrader* tool), I have another
> > > > > >> > problem: I was using the org.apache.lucene.uninverting.UninvertingReader
> > > > > >> > to perform queries against the index, mainly using the grouping api. But
> > > > > >> > currently, it was removed (since Lucene 7.0). So, again, do I have any
> > > > > >> > other alternative, apart from re-indexing to use docValues?
> > > > > >> >
> > > > > >> > To give you more context, I am a developer of a tool that multiple
> > > > > >> > customers can use to index their data (currently, with Lucene 5.5.5). We
> > > > > >> > are planning to upgrade to Lucene 9 (because of some vulnerabilities
> > > > > >> > affecting Lucene 5.5.5) and I think asking them to reindex will not go
> > > > > >> > down well :(
> > > > >> >
> > > > >> > Regards,
> > > > >> >
> > > > > >> > On Sat, 29 Oct 2022 at 23:31, Matt Davis (<kryptonics411@gmail.com>) wrote:
> > > > >> >
> > > > > >> > > Inside of Zulia search engine, the object being indexed is always a
> > > > > >> > > JSON/BSON object and we store the BSON as a stored byte field in the
> > > > > >> > > index. This allows easy internal reindexing when the searchable fields
> > > > > >> > > change but also allows us to update to the latest lucene version.
> > > > > >> > > Combined with using lucene-backward-codecs an older index than the current
> > > > > >> > > major version can be opened and reindexed. If you have stored all the
> > > > > >> > > fields (or a json/bson) in the index, it would be easy to reindex in the
> > > > > >> > > new format. If you have not, maybe opening with lucene-backward-codecs
> > > > > >> > > will be enough for your use case.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Matt
> > > > >> > >
> > > > > >> > > On Sat, Oct 29, 2022 at 2:30 PM Baris Kazar <baris.kazar@oracle.com> wrote:
> > > > >> > >
> > > > > >> > > > It is always great practice to retain the non-indexed
> > > > > >> > > > data: whenever Lucene changes version,
> > > > > >> > > > even a minor version, I always reindex.
> > > > >> > > >
> > > > >> > > > Best regards
> > > > >> > > > ________________________________
> > > > >> > > > From: Gus Heck <gus.heck@gmail.com>
> > > > >> > > > Sent: Saturday, October 29, 2022 2:17 PM
> > > > > >> > > > To: java-user@lucene.apache.org <java-user@lucene.apache.org>
> > > > >> > > > Subject: Re: Best strategy migrate indexes
> > > > >> > > >
> > > > >> > > > Hi Pablo,
> > > > >> > > >
> > > > > >> > > > The deafening silence is probably nobody wanting to give you the bad news.
> > > > > >> > > > You are on a mission that may not be feasible, and even if you can get it
> > > > > >> > > > to "work", the end result won't likely be equivalent to indexing the
> > > > > >> > > > original data with Lucene 9.x. The indexing process is fundamentally lossy
> > > > > >> > > > and information originally used to produce non-stored fields will have been
> > > > > >> > > > thrown out. A simple example is things like stopwords or anything analyzed
> > > > > >> > > > with subclasses of FilteringTokenFilter. If the stop word list changed, or
> > > > > >> > > > the details of one of these filters changed (bugfix?), you will end up with
> > > > > >> > > > a different result than indexing with 9.x. This is just one
> > > > > >> > > > example, another would be stemming where the index likely only contains the
> > > > > >> > > > stem, not the whole word. Other folks who are more interested in the
> > > > > >> > > > details of our codecs than I am can probably provide further examples on a
> > > > > >> > > > more fundamental level. Lucene is not a database, and the source documents
> > > > > >> > > > should always be retained in a form that can be reindexed. If you have
> > > > > >> > > > inherited a system where source material has not been retained, you have a
> > > > > >> > > > difficult project and may have some potentially painful expectation setting
> > > > > >> > > > to perform.
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> > > > Gus
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > > >> > > > On Fri, Oct 28, 2022 at 8:01 AM Pablo Vázquez Blázquez <pablovb@gmail.com> wrote:
> > > > >> > > >
> > > > >> > > > > Hi all,
> > > > >> > > > >
> > > > > >> > > > > I have some indices indexed with lucene 5.5.0. I have updated my
> > > > > >> > > > > dependencies and code to Lucene 7 (but my final goal is to use Lucene 9)
> > > > > >> > > > > and when trying to work with them I am having the exception:
> > > > > >> > > > > org.apache.lucene.index.IndexFormatTooOldException: Format version is not
> > > > > >> > > > > supported (resource
> > > > > >> > > > > BufferedChecksumIndexInput(MMapIndexInput(path=".......\tests\segments_b"))):
> > > > > >> > > > > this index is too old (version: 5.5.0). This version of Lucene only
> > > > > >> > > > > supports indexes created with release 6.0 and later.
> > > > > >> > > > >
> > > > > >> > > > > I want to migrate from Lucene 5.x to Lucene 9.x. Which is the best
> > > > > >> > > > > strategy? Is there any tool to migrate the indices? Is it mandatory to
> > > > > >> > > > > reindex? In this case, how can I deal with this when I do not have the
> > > > > >> > > > > sources of documents that generated my current indices (I mean, I just have
> > > > > >> > > > > the indices themselves)?
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > >
> > > > >> > > > > --
> > > > >> > > > > Pablo Vázquez
> > > > >> > > > > (pablovb@gmail.com)
> > > > >> > > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > --
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >>
> > >
> https://urldefense.com/v3/__http://www.needhamsoftware.com__;!!ACWV5N9M2RV99hQ!PVR-c0gAs5FpIrnotHWeo3sEWScxV8oFJrVpGdItGZictcDbRvnp5aZSqCRhglMCYqQsewQOuio4iIYARA$
> > > > >> > > > (work)
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >>
> > >
> https://urldefense.com/v3/__http://www.the111shift.com__;!!ACWV5N9M2RV99hQ!PVR-c0gAs5FpIrnotHWeo3sEWScxV8oFJrVpGdItGZictcDbRvnp5aZSqCRhglMCYqQsewQOuirxfFWpEQ$
> > > > >> > > > (play)
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Pablo Vázquez
> > > > >> > (pablovb@gmail.com)
> > > > >>
> > > > >>
> > > > > >> ---------------------------------------------------------------------
> > > > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >>
> > > > >>
> > > > >
> > > > > --
> > > > > Pablo Vázquez
> > > > > (pablovb@gmail.com)
> > > > >
> > > >
> > > >
> > > > --
> > > > Pablo Vázquez
> > > > (pablovb@gmail.com)
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> > --
> > Pablo Vázquez
> > (pablovb@gmail.com)
>
>
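
For anyone reading this thread later: the point quoted above about indexing being lossy (stop words dropped, only stems kept) can be shown with a toy example. This is only a sketch in plain Java, not Lucene code. The class name, the stop word list and the suffix rules are all made up for illustration and do not correspond to any real Lucene analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LossyAnalysisDemo {
    // Hypothetical stop word list, for this toy example only.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "the", "is", "are", "of", "and");

    // Crude suffix stripping standing in for a real stemmer.
    public static String stem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Lowercase, split on non-letters, drop stop words, stem what is left.
    // The returned terms are all an inverted index would keep for a
    // non-stored field.
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Two different source texts that end up as the same term list.
        System.out.println(analyze("The cats are sleeping"));
        System.out.println(analyze("A cat sleeps"));
    }
}
```

Both calls produce the same terms, so from the terms alone there is no way to tell which text was indexed. Changing the stop word list or the stemming rules afterwards would need the original text in order to produce the new terms, which is why the source documents (or stored fields) matter for any migration.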

--
Pablo Vázquez
(pablovb@gmail.com)