Mailing List Archive: Thinking about upgrading indexes to X+2

Nov 20, 2020, 7:02 AM

Post #1 of 4

So yet another iteration on the users list of going from X to X+2 got me to thinking (dangerous I know). I wanted to run this by folks to see if it’s worth a JIRA.

It _seems_ reasonable from a user’s perspective to create an index with, say, 6x, then upgrade to 7x and reindex all documents (without deleting the index first), then be able to upgrade to 8x and reindex all documents. Rinse, repeat.

The problem, of course, is that the 6x segments get merged by TMP (TieredMergePolicy) and the 6x stamp is preserved in the merged segment. (BTW, I’m going from hearsay here rather than code knowledge, so correct me if I’m wrong. I’ve assumed all along that these stamps are on each _segment_, not global to the entire index.)

I can think of a couple of options for, say, TMP that might work out to support the above (I’m not proposing both, and these are bad names…):
1 - onlyMergeSegmentsCreatedWithTheSameVersion
2 - neverMergeSegmentsCreatedWithAPriorVersion

Either of these would, if and only if _all_ docs were indeed indexed again, result in all the X-1 segments consisting entirely of deleted documents and being dropped. Now no segment has the X-1 marker and we could upgrade to X+1.
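
To make the two options concrete, here is a toy Python model of the scenario above (not Lucene code). It follows the assumption stated earlier that the version stamp lives on each segment and that a merge keeps the oldest stamp of its inputs. `Segment`, `merge_all`, and the `same_version_only` flag are hypothetical names standing in for option 1; real merge selection in TMP is far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    min_version: int                          # oldest major version that contributed docs
    live_docs: set = field(default_factory=set)


def merge(a, b):
    """A merge keeps the oldest stamp of its inputs (the assumed behavior)."""
    return Segment(min(a.min_version, b.min_version), a.live_docs | b.live_docs)


def reindex(segments, doc_ids, version):
    """Re-add docs under `version`: old copies become deleted, one new segment is written."""
    for seg in segments:
        seg.live_docs -= set(doc_ids)
    segments.append(Segment(version, set(doc_ids)))


def drop_empty(segments):
    """Segments consisting entirely of deleted docs are dropped."""
    return [s for s in segments if s.live_docs]


def merge_all(segments, same_version_only):
    """Toy merge: either merge everything, or merge only within each version (option 1)."""
    if not same_version_only:
        merged = segments[0]
        for seg in segments[1:]:
            merged = merge(merged, seg)
        return [merged]
    by_version = {}
    for seg in segments:
        if seg.min_version in by_version:
            by_version[seg.min_version] = merge(by_version[seg.min_version], seg)
        else:
            by_version[seg.min_version] = seg
    return list(by_version.values())


def scenario(same_version_only):
    segments = [Segment(6, {1, 2}), Segment(6, {3, 4})]   # index built on 6x
    reindex(segments, [1, 2, 3], version=7)               # reindexing on 7x is under way
    segments = merge_all(drop_empty(segments), same_version_only)  # a merge runs mid-way
    reindex(segments, [4], version=7)                     # the last doc is reindexed
    return min(s.min_version for s in drop_empty(segments))


print(scenario(same_version_only=False))  # 6
print(scenario(same_version_only=True))   # 7
```

Under these assumptions, the unrestricted merge leaves a 6x stamp behind even though every document was reindexed, while the version-restricted merge lets the 6x segment empty out and disappear.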

There are some edge cases of course:

- If even one X-1 doc wasn’t reindexed, it wouldn’t work. I can think of ways around this, e.g. a command deleteAllSegmentsCreatedWithPriorVersions, but since that’s indeterminate in terms of _which_ docs get deleted, I don’t like it at all. Handling this case sounds like a best-practice recommendation: people concerned with this index a field in each doc themselves (we could automate this) and do a delete-by-query.

- Disk space issues. If we used <1> above, this wouldn’t be much different from what we have now in terms of wasted space. There’d be some extra wasted space, but not much. <2> would cause greater disk space waste. <2> would probably be easier, but I don’t think <1> is much work either.

- Is it worth the effort? People have to reindex every doc anyway.

- How to test?

- ???
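
The first edge case above, a doc that was never reindexed, is where the per-doc field plus delete-by-query comes in. Here is a minimal Python sketch of that bookkeeping, with a plain dict standing in for the index. The field name `indexed_with` and the helper `delete_stale` are hypothetical, chosen only for illustration.

```python
def delete_stale(docs, current_major):
    """Delete every doc whose stamp predates current_major; return the deleted ids."""
    stale = sorted(doc_id for doc_id, doc in docs.items()
                   if doc["indexed_with"] < current_major)
    for doc_id in stale:
        del docs[doc_id]
    return stale


docs = {
    "a": {"indexed_with": 7},
    "b": {"indexed_with": 6},   # missed during the 7x reindex
    "c": {"indexed_with": 7},
}
deleted = delete_stale(docs, current_major=7)
print(deleted)        # ['b']
print(sorted(docs))   # ['a', 'c']
```

Unlike dropping whole segments by version, this makes it explicit which docs go away, so the missed ones can be logged or reindexed instead.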

I think the question of whether to pursue this or not comes down to two questions:

1> Does it really help end users enough to be worth the effort? How many users can _guarantee_ that they reindex every document?

2> Would something along these lines work at all? Like I said, I’m going from hearsay rather than deep knowledge of the X-2 mechanism.

All I’m looking for here is whether it’s interesting enough for me to create a JIRA and discuss details there...
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

RE: Thinking about upgrading indexes to X+2 [ In reply to ]

uwe at thetaphi

Nov 20, 2020, 10:18 AM

Post #2 of 4

Thanks for bringing this up again.

I tend to say: Let us just also allow IndexUpgrader beyond 2 versions! If somebody complains about incorrect offsets, oh man, it's their problem.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Erick Erickson <erickerickson@gmail.com>
> Sent: Friday, November 20, 2020 4:03 PM
> To: dev@lucene.apache.org
> Subject: Thinking about upgrading indexes to X+2

Re: Thinking about upgrading indexes to X+2 [ In reply to ]

bram.vandam at intix

Nov 24, 2020, 12:24 AM

Post #3 of 4

On 20/11/2020 19:18, Uwe Schindler wrote:
> Thanks for bringing this up again.
>
> I tend to say: Let us just also allow IndexUpgrader beyond 2 versions! If somebody complains about incorrect offsets, oh man, it's their problem.

Is there any particular reason why the IndexUpgrader couldn't simply
warn about non-upgradable changes? If you have a data type that's no
longer supported and there's no migration path: error out.

I understand the whole "x != f(x)" upgrade problem, but many scenarios
should be pretty straightforward to upgrade. If a field is stored or has
docvalues, f(x) can simply be recomputed, no?

Am I missing something glaringly obvious here?

- Bram

Re: Thinking about upgrading indexes to X+2 [ In reply to ]

rcmuir at gmail

Nov 24, 2020, 2:11 AM

Post #4 of 4

On Tue, Nov 24, 2020 at 3:24 AM Bram Van Dam <bram.vandam@intix.eu> wrote:

> Is there any particular reason why the IndexUpgrader couldn't simply
> warn about non-upgradable changes? If you have a data type that's no
> longer supported and there's no migration path: error out.
>
> I understand the whole "x != f(x)" upgrade problem, but many scenarios
> should be pretty straightforward to upgrade. If a field is stored or has
> docvalues, f(x) can simply be recomputed, no?
>
> Am I missing something glaringly obvious here?
>

No, you clearly don't understand the basics.

1. IndexUpgrader works via *merge*, not reindexing. To recompute, you
need to reindex.
2. There's no correlation between what's in any stored field or
anywhere else and what's in a posting list. You can set them to
whatever you want.
3. Lucene doesn't even know what analyzer or anything else you used to
index it. It's an index, not a database.
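
A toy Python model of points 1 and 2 (not Lucene code; `ToySegment`, `old_analyzer`, and `new_analyzer` are invented for illustration). The merge copies posting lists as they are and never calls an analyzer, and nothing ties the indexed terms to the stored value.

```python
from collections import defaultdict


def old_analyzer(text):
    return text.lower().split()


def new_analyzer(text):
    # Hypothetical behavior change in a later version: also strips trailing "s".
    return [tok.rstrip("s") for tok in text.lower().split()]


class ToySegment:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> doc ids
        self.stored = {}                   # doc id -> stored value

    def add(self, doc_id, stored_value, tokens):
        # The caller supplies stored value and tokens independently.
        self.stored[doc_id] = stored_value
        for tok in tokens:
            self.postings[tok].add(doc_id)


def merge(a, b):
    """Copy postings and stored values as-is; no analyzer is involved."""
    out = ToySegment()
    for seg in (a, b):
        out.stored.update(seg.stored)
        for term, ids in seg.postings.items():
            out.postings[term] |= ids
    return out


seg_a = ToySegment()
seg_a.add(1, "Red Cars", old_analyzer("Red Cars"))

seg_b = ToySegment()
seg_b.add(2, "see attachment", ["invoice", "2020"])   # terms unrelated to the stored value

merged = merge(seg_a, seg_b)
print(sorted(merged.postings))     # ['2020', 'cars', 'invoice', 'red']
print(new_analyzer("Red Cars"))    # ['red', 'car']
```

The merged segment still carries the term `cars` from the old analysis chain. Getting `car` would mean running the new analyzer over the original text, which is reindexing, and for doc 2 the stored value would not reproduce the indexed terms at all.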