Mailing List Archive

Live index upgrading
Hello,

I use Lucene with PyLucene on a public-facing web application. We have a moderately large index (~24M documents, ~11GB index data), with a constant stream of new documents.

I recently upgraded to PyLucene 7.

When trying to test the new release of PyLucene 8, I encountered an IndexFormatTooOldException because my index conversion from Lucene 6 to Lucene 7 was not complete.

I found IndexUpgrader and had a look at its implementation. I would very much like to avoid taking the service down during the index upgrade, so I believe I cannot use IndexUpgrader, because the web application needs to hold the write lock to index new documents.

So I figure I could get the desired result with an IndexWriter.forceMerge(1). But the documentation says "This is a horribly costly operation, especially when you pass a small maxNumSegments; usually you should only call this if the index is static (will no longer be changed)." https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/index/IndexWriter.html#forceMerge-int-

And indeed, forceMerge tends to be killed by the kernel OOM killer on my development VM. I want to avoid this failure mode in production. I could increase the VM's memory until it works, but I would rather have a less brutal approach to upgrading a live index: something that could run in the background with reasonable amounts of anonymous memory.

What is the recommended approach to upgrading a live index?

How can I know from the code that the index needs upgrading at all? I could add a manual knob to start an upgrade, but it would be better if it occurred transparently when I upgrade PyLucene.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Live index upgrading
Let’s back up a bit. What version of Lucene are you using? Starting with Lucene 8, any index that has ever been touched by Lucene 6 will not open. It does not matter if the index has been completely rewritten, or if it has been run through IndexUpgrader, which just does a forceMerge to 1 segment. A version marker is written when a segment is created, and the earliest marker is preserved across merges. So say you have two segments, one created with 6 and one with 7: the Lucene 6 marker is preserved when they are merged.

Now, if any segment has the Lucene 6 marker, the index will not be opened by Lucene 8.

If you’re using Lucene 7, then this error implies that one or more of your segments was created with Lucene 5 or earlier.

So you probably need to re-index from scratch on whatever version of Lucene you want to use.
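
A minimal sketch of how an application might detect this situation from code, assuming PyLucene 7 or later with the JVM already started through lucene.initVM(). To my understanding SegmentInfos.readLatestCommit and getIndexCreatedVersionMajor exist in Lucene 7.x, but check them against the version you run. The helper names read_created_major and needs_reindex are hypothetical, and the "current or previous major only" rule encoded below is an assumption drawn from the explanation above.

```python
def needs_reindex(created_major, target_major):
    """Return True if an index first created by Lucene `created_major`
    is expected to be rejected by Lucene `target_major`.

    Assumption: a given major version opens only indexes created by
    itself or by the immediately preceding major version.
    """
    return created_major < target_major - 1


def read_created_major(index_path):
    """Read the major version that originally created the index.

    Requires PyLucene 7+ and a prior call to lucene.initVM(). The imports
    are local so that needs_reindex() stays usable without PyLucene.
    """
    from java.nio.file import Paths
    from org.apache.lucene.index import SegmentInfos
    from org.apache.lucene.store import FSDirectory

    directory = FSDirectory.open(Paths.get(index_path))
    try:
        infos = SegmentInfos.readLatestCommit(directory)
        return infos.getIndexCreatedVersionMajor()
    finally:
        directory.close()
```

The check has to run while the application is still on a Lucene version that can read the index, for example needs_reindex(read_created_major(path), 8) under PyLucene 7, because the newer version would refuse to open it.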

Best,
Erick



> On Jun 17, 2019, at 8:41 AM, David Allouche <david@allouche.net> wrote:
> [...]


Re: Live index upgrading
Wow. That is annoying. What is the reason for this?

I assumed there was a smooth upgrade path, but apparently, by design, one has to rebuild the index at least once every two major releases.

So, my question becomes, what is the recommended way of dealing with reindex-from-scratch without service interruption?

So I guess the upgrade path looks something like:
- Create Lucene6 index
- Update Lucene6 index
- Create Lucene7 index
- Separately keep track of which documents are indexed in Lucene7 and Lucene6 indexes
- Make updates to Lucene6 index, concurrently build Lucene7 index from scratch, use Lucene6 index for search.
- When Lucene7 index is fully built, remove Lucene6 index and use Lucene7 index for search.

Rinse and repeat every major version.

Really, isn't there something simpler already to handle Lucene major version upgrades?
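
The upgrade path listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not a Lucene facility: DualIndex and FakeIndex are hypothetical names, and the index objects are only assumed to expose add() and search(). Backfilling the old documents into the new index is left to a separate rebuild job.

```python
import threading


class DualIndex:
    """Route searches to the active index and writes to every index
    that must stay current while a replacement is being built."""

    def __init__(self, active):
        self._lock = threading.Lock()
        self._active = active      # serves searches
        self._building = None      # new-version index under construction

    def start_rebuild(self, new_index):
        with self._lock:
            self._building = new_index

    def add(self, doc_id, text):
        # Incoming documents go to both indexes so the new one
        # does not fall behind during the bulk rebuild.
        with self._lock:
            targets = [self._active]
            if self._building is not None:
                targets.append(self._building)
        for index in targets:
            index.add(doc_id, text)

    def search(self, term):
        with self._lock:
            active = self._active
        return active.search(term)

    def promote(self):
        """Switch searches to the rebuilt index and return the old one
        so the caller can close and delete it."""
        with self._lock:
            if self._building is None:
                raise RuntimeError("no rebuild in progress")
            old, self._active = self._active, self._building
            self._building = None
        return old


class FakeIndex:
    """In-memory stand-in for a real index, for illustration only."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, term):
        return sorted(d for d, t in self.docs.items() if term in t)
```

Searches never see a half-built index because promote() swaps a single reference under the lock. The cost is that every update is indexed twice while the rebuild is running.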


> On 17 Jun 2019, at 18:04, Erick Erickson <erickerickson@gmail.com> wrote:
> [...]


Re: Live index upgrading
Assuming SolrCloud, reindex from scratch into a new collection, then use collection aliasing when you are ready to switch. You don’t need to stop your clients when you use CREATEALIAS.

Before Lucene started writing the marker, it would appear to work with older indexes, but there would be subtle errors because the information needed to score docs just wasn’t there.

Here are two quotes, from people who know, that crystallized for me the problem Lucene faces:
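
For illustration, a small sketch of building the Collections API request that repoints an alias. The function name create_alias_url, the base URL, the alias "docs" and the collection "docs_v8" are all hypothetical; only the CREATEALIAS action with its name and collections parameters comes from Solr.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base_url, alias, collection):
    """Build the Collections API URL that points `alias` at `collection`."""
    query = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": collection,
    })
    return f"{solr_base_url.rstrip('/')}/admin/collections?{query}"


# Hypothetical host and names; issue an HTTP GET on this URL once the
# new collection is fully indexed.
print(create_alias_url("http://localhost:8983/solr", "docs", "docs_v8"))
```

Clients that query the alias rather than the collection pick up the new collection without being restarted.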

From Robert Muir:

“I think the key issue here is Lucene is an index not a database. Because it is a lossy index and does not retain all of the user's data, its not possible to safely migrate some things automagically. In the norms case IndexWriter needs to re-analyze the text ("re-index") and compute stats to get back the value, so it can be re-encoded. The function is y = f(x) and if x is not available its not possible, so lucene can't do it.”

From Mike McCandless:

“This really is the difference between an index and a database: we do not store, precisely, the original documents. We store an efficient derived/computed index from them. Yes, Solr/ES can add database-like behavior where they hold the true original source of the document and use that to rebuild Lucene indices over time. But Lucene really is just a "search index" and we need to be free to make important improvements with time.”

Best,
Erick

> On Jun 21, 2019, at 7:10 AM, David Allouche <david@allouche.net> wrote:
> [...]


Re: Live index upgrading
Unfortunately, I cannot assume SolrCloud, because our software predates Solr.

So I would either need to switch to Solr or reimplement a work-around for the lack of index migration. I am reluctant to switch to Solr because it increases the operational complexity.

I understand the argument: if the algorithm f_n() used to derive index data i_n from the raw data r changes [i_n = f_n(r)], the index data i_{n+1} may not be derivable from i_n [∃n ∄g : i_{n+1} = g(i_n)].

At the application level, one could store the non-tokenized content in the index (I guess that's why ElasticSearch has .raw fields), then traverse the old index and use that content to build a new one. I already have index traversal code that I use for garbage collection of old entries. The progress of the conversion could then be recorded as the index into LeafReader.getLiveDocs().

https://lucene.apache.org/core/8_1_1/core/org/apache/lucene/index/LeafReader.html#getLiveDocs--

Alternatively, since I do not have all the non-tokenized content in the index now, I could use the external document id to retrieve the original document text.

Is there a convenient place to store the getLiveDocs index across process interruptions? Or should I use something stupid like a file to store the counter?
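
A plain file is a reasonable place for such a counter if it is replaced atomically. Below is a minimal sketch under several assumptions: save_checkpoint, load_checkpoint and migrate are hypothetical names, is_live and copy_document stand in for the real index access, and the position is only meaningful against one fixed reader snapshot, since Lucene document numbers can change when segments merge.

```python
import json
import os
import tempfile


def save_checkpoint(path, position):
    """Atomically record how many document slots have been processed."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump({"position": position}, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Rename over the old file so readers see either the old or
        # the new checkpoint, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved position, or 0 when no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["position"]
    except FileNotFoundError:
        return 0


def migrate(max_doc, is_live, copy_document, checkpoint_path, batch=1000):
    """Copy live documents from the checkpoint up to max_doc,
    saving progress every `batch` document slots."""
    position = load_checkpoint(checkpoint_path)
    for doc_id in range(position, max_doc):
        if is_live(doc_id):
            copy_document(doc_id)
        if (doc_id + 1) % batch == 0:
            save_checkpoint(checkpoint_path, doc_id + 1)
    save_checkpoint(checkpoint_path, max_doc)
```

After an interruption, up to one batch of documents is copied again, so copy_document should be idempotent, for example by replacing any existing entry that has the same external document id.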

That is still a lot of hassle, but I understand why it makes sense for Lucene to consider that index migration should be handled up the stack.


> On 21 Jun 2019, at 18:06, Erick Erickson <erickerickson@gmail.com> wrote:
> [...]


Re: Live index upgrading
The bottom line for me is that I am not going to upgrade to Lucene 8 for a while.

The index migration would either cause a service interruption, or would require a little while to implement.

I have more urgent technical debt to deal with.

> On 21 Jun 2019, at 19:11, David Allouche <david@allouche.net> wrote:
> [...]


Re: Live index upgrading
You’re exactly right that storing all the fields necessary to reconstruct the document is a way to not have to reindex from scratch. Of course that bloats your index, in large installations perhaps unacceptably.

> Is there a convenient place to store…

Lucene itself doesn’t preserve anything except the index across interruptions, so storing the counter somewhere is indicated.

So it sounds like you’re set with how to go forward. Good luck!
