Mailing List Archive

Multiple merge-runs from same set of segments
Hello,

We have a use case for rewriting a "frozen index", one to which no new
documents are added. It goes like this:

1. Get all segments of the index (base-segment-list).
2. Create a new segment from base-segment-list with a unique set of docs
(LiveDocs).
3. Repeat step 2 a fixed number of times, say 5 or 10.

Is something like this achievable via a MergePolicy? We can also disable
commits until the full run is completed.

Any help is appreciated.

Regards,
Ravi
Re: Multiple merge-runs from same set of segments
Are you trying to rewrite your already created index into a different
segment geometry?

Maybe have a look at the new IndexRearranger tool
<https://issues.apache.org/jira/browse/LUCENE-9694>? It is already doing
something like what you enumerated below, including mocking LiveDocs to get
the right documents into the right segments.

Mike McCandless

http://blog.mikemccandless.com
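
A toy model may make the "mock LiveDocs" idea more concrete. The sketch below does not use Lucene at all. It only assumes that each target segment is produced by merging a view of the source docs in which every doc not chosen for that target looks deleted. The class name, `buildMasks`, and the round-robin selector are illustrative assumptions, not part of IndexRearranger.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntUnaryOperator;

/** Toy model (no Lucene dependency) of splitting a frozen index by masking live docs. */
public class LiveDocsPartitionSketch {

    /**
     * Builds one BitSet per target segment over the doc-ID space [0, maxDoc).
     * A set bit means "this doc is live for that target"; every other doc
     * looks deleted in that target's view and would be dropped by a merge.
     */
    public static List<BitSet> buildMasks(int maxDoc, int targets, IntUnaryOperator selector) {
        List<BitSet> masks = new ArrayList<>();
        for (int t = 0; t < targets; t++) {
            masks.add(new BitSet(maxDoc));
        }
        for (int doc = 0; doc < maxDoc; doc++) {
            int target = selector.applyAsInt(doc);
            if (target < 0 || target >= targets) {
                throw new IllegalArgumentException(
                        "selector returned " + target + " for doc " + doc);
            }
            // Each doc is routed to exactly one target, so the masks are disjoint.
            masks.get(target).set(doc);
        }
        return masks;
    }

    public static void main(String[] args) {
        // Hypothetical selector: route 10 docs round-robin into 3 targets.
        List<BitSet> masks = buildMasks(10, 3, doc -> doc % 3);
        for (int t = 0; t < masks.size(); t++) {
            System.out.println("target " + t + " keeps docs " + masks.get(t));
        }
    }
}
```

In an actual rewrite, each mask would presumably back the live-docs view of a wrapped reader that is then handed to the writer, once per target segment.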


On Sat, May 22, 2021 at 3:50 PM Ravikumar Govindarajan <
ravikumar.govindarajan@gmail.com> wrote:

> Hello,
>
> We have a use-case for index-rewrite on a "frozen index" where no new
> documents are added. It goes like this..
>
> 1. Get all segments for the index (base-segment-list)
> 2. Create a new segment from base-segment-list with unique set of docs
> (LiveDocs)
> 3. Repeat step 2, for a fixed count. Like say 5 or 10 times
>
> Is something like this achievable via Merge Policy? We can disable commits
> too, till the full run is completed.
>
> Any help is appreciated
>
> Regards,
> Ravi
>
Re: Multiple merge-runs from same set of segments
Thanks Michael!

This is just what I was looking for! Just a couple of questions:

- When we call addIndexes(IndexReader...), does the merge go through the
MergePolicy? We use a SortingMergePolicy and would like to maintain the
sort order in the newly created segments too.
- Concurrency is a cool trick here. But if I understand the patch
correctly, don't we end up doing multiple passes over the term dictionary,
one for each Selector? Could loading it fully into memory help here?

--
Ravi

On Mon, May 24, 2021 at 7:37 PM Michael McCandless <
lucene@mikemccandless.com> wrote:

> Are you trying to rewrite your already created index into a different
> segment geometry?
>
> Maybe have a look at the new IndexRearranger tool
> <https://issues.apache.org/jira/browse/LUCENE-9694>? It is already doing
> something like what you enumerated below, including mocking LiveDocs to get
> the right documents into the right segments.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
Re: Multiple merge-runs from same set of segments
Hi Ravi,

1. May I know which Lucene version you're using? As far as I know,
SortingMergePolicy has been deprecated and replaced by
IndexWriterConfig.setIndexSort in newer Lucene versions. So if
setIndexSort is available, I would suggest using that to get a sorted
index (as you might have already figured out, the IndexRearranger lets
you pass in an IndexWriterConfig, so you can set it there). If it is not
available, I'm not sure whether the merge will go through the merge
policy; maybe you could check the source code and see?
2. Yeah, it's a good observation: we're doing multiple passes over one
segment! But I think the current default directory implementation is
MMapDirectory, which delegates caching to the operating system and should
already handle this situation well. Here's a great blog post explaining
MMapDirectory in Lucene:
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
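
To illustrate the point about repeated passes over memory-mapped data, here is a small standalone sketch using plain java.nio rather than Lucene. It assumes the file fits in a single mapping and that the operating system keeps the pages cached between passes, which depends on available memory. `MmapMultiPassSketch`, `writeTemp`, and `multiPass` are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Standalone illustration: map a file once, then scan it several times. */
public class MmapMultiPassSketch {

    /** Writes data to a temp file that is deleted when the JVM exits. */
    public static Path writeTemp(byte[] data) {
        try {
            Path file = Files.createTempFile("segment", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, data);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One "pass": sums every byte visible through the given view. */
    static long scan(ByteBuffer view) {
        long sum = 0;
        while (view.hasRemaining()) {
            sum += view.get() & 0xFF;
        }
        return sum;
    }

    /** Maps the file once and scans it `passes` times through independent views. */
    public static long[] multiPass(Path file, int passes) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long[] sums = new long[passes];
            for (int i = 0; i < passes; i++) {
                // duplicate() shares the mapped pages but has its own position,
                // so no bytes are copied per pass.
                sums[i] = scan(mapped.duplicate());
            }
            return sums;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        long[] sums = multiPass(writeTemp(data), 3);
        System.out.println("all passes agree: " + (sums[0] == sums[1] && sums[1] == sums[2]));
    }
}
```

The mapping only avoids re-reading bytes from disk. Any decoding done on top of those bytes is still repeated on every pass.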

Best
Patrick

On May 24, 2021 at 9:54, Ravikumar Govindarajan
<ravikumar.govindarajan@gmail.com> wrote:

> Thanks Michael!
>
> This was just what I was looking for!!. Just a couple of questions.
>
>
> - When we call addIndexes(IndexReader...), does the merge happen via
> MergePolicy? We use a SortingMergePolicy and would like to maintain the
> sort-order in newly created segments too
> - Concurrency is a cool-trick here. But if I understand the patch
> correctly, don't we end-up doing multiple passes over the Term Dict, one
> for each Selector? Loading it fully in memory could help here, possibly?
>
> --
> Ravi
Re: Multiple merge-runs from same set of segments
Thanks Patrick for the help!

> May I know what lucene version you're using?

We are currently using an older version of Lucene (4.7.x), and I believe
the FilterCodecReader of the current version is akin to FilterAtomicReader
and should do the job for us!

> If it is not available, I'm not sure whether the merge will happen via merge
> policy, maybe you could check the source code and see?

I checked and, AFAIK, our old version doesn't support it. But I guess it
should be fine to wrap a SortingAtomicReader and pass it to the API. I
think it can be done!

> But I think the current default directory implementation is MMapDirectory,
> which delegate the caching to the system and should have
> already optimized this situation

We do use the default MMapDirectory, but I was actually thinking about
unpacking/walking the term dictionary data (FST) repeatedly from various
threads, even if via mmap. Are there optimizations here (caching unpacked
blocks, etc.) that we could tap into?

--
Ravi

On Mon, May 24, 2021 at 11:09 PM Patrick Zhai <zhai7631@gmail.com> wrote:

> Hi Ravi,
>
> 1. May I know what lucene version you're using? As far as I know the
> SortingMergePolicy has been deprecated and replaced by
> IndexWriterConfig.setIndexSort in newer lucene version. So if the
> "setIndexSort" is available I would suggest using that to achieve the
> sorted index (as you might have already figured out, the IndexRearranger
> let you pass in an IndexWriterConfig so that you could set it there). If it
> is not available, I'm not sure whether the merge will happen via merge
> policy, maybe you could check the source code and see?
> 2. Yeah it's a good observation, we're doing multiple passes over one
> segment! But I think the current default directory implementation is
> MMapDirectory, which delegate the caching to the system and should have
> already optimized this situation. Here's a great blog explaining the
> MMapDirectory in lucene:
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Best
> Patrick
Re: Multiple merge-runs from same set of segments
Sorry for the delayed response. As for caching term dictionary data across
threads, I am not aware of any existing Lucene mechanism that can do that
(and it might be tricky, since it is across threads), but it may be worth
trying to see whether we can get some extra speed out of it!
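
As a rough idea of what such an experiment could look like, here is a minimal sketch of a cache of decoded blocks shared across threads. It is purely hypothetical and not an existing Lucene mechanism: the class name, the file-offset key, and the decoder callback are assumptions made for illustration. It relies only on `ConcurrentHashMap.computeIfAbsent`, which applies the mapping function at most once per key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongFunction;

/** Hypothetical cross-thread cache of decoded blocks, keyed by file offset. Not a Lucene API. */
public class DecodedBlockCacheSketch<B> {

    private final ConcurrentHashMap<Long, B> blocks = new ConcurrentHashMap<>();
    private final LongFunction<B> decoder;

    /** @param decoder stands in for the expensive "unpack the block at this offset" step */
    public DecodedBlockCacheSketch(LongFunction<B> decoder) {
        this.decoder = decoder;
    }

    /** Returns the decoded block, decoding it at most once even under contention. */
    public B get(long offset) {
        return blocks.computeIfAbsent(offset, decoder::apply);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger decodes = new AtomicInteger();
        DecodedBlockCacheSketch<String> cache = new DecodedBlockCacheSketch<>(offset -> {
            decodes.incrementAndGet(); // count how often the expensive step runs
            return "block@" + offset;
        });

        // Four threads walk the same eight block offsets.
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (long n = 0; n < 8; n++) {
                    cache.get(n * 4096);
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println("lookups = 32, decodes = " + decodes.get());
    }
}
```

This sketch never evicts anything and assumes decoded blocks are immutable once published. A real attempt would need a memory bound and would have to be measured against the extra contention it introduces.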

Patrick

On May 24, 2021 at 11:49, Ravikumar Govindarajan
<ravikumar.govindarajan@gmail.com> wrote:

> Thanks Patrick for the help!
>
> > May I know what lucene version you're using?
>
> We are using an older version of lucene as of now (4.7.x) and I believe the
> FilterCodecReader of current version is akin to FilterAtomicReader & should
> do the job for us!
>
> > If it is not available, I'm not sure whether the merge will happen via
> > merge policy, maybe you could check the source code and see?
>
> Checked & AFAIK, our old version isn't supporting it. But I guess it should
> be fine to wrap a SortingAtomicReader and pass it to the API. Guess, it can
> be done!
>
> > But I think the current default directory implementation is MMapDirectory,
> > which delegate the caching to the system and should have
> > already optimized this situation
>
> We do use the default MMap-dir but I was actually thinking about
> unpacking/walking Term-Dict data (FST) repeatedly from various
> threads, even if via MMap. Are there optimizations here (caching unpacked
> blocks etc..) that we could tap into?
>
> --
> Ravi
Re: Multiple merge-runs from same set of segments
Yeah, like you said, it looks quite tough to load the entire FST in memory.
Plus, we would have to address concurrent access.

I guess I should first see how the current patch performs before attempting
any optimisations.

Thanks for the help!


Ravi

On Thu, 27 May 2021 at 10:44 PM, Patrick Zhai <zhai7631@gmail.com> wrote:

> Sorry for the delayed response, as for caching termDict data across
> threads, I do not aware of any existing lucene mechanism could do that (and
> it might be tricky since it is across threads), but maybe worth trying to
> see whether we can get some extra speed based on that!
>
> Patrick