Mailing List Archive

Lucene index directory grows and shrinks
Hi all,

I'm using Jackrabbit 2.18.0 which uses lucene-core 3.6.0.

I'm working on an application whose index directory has reached 37 GB. A few days ago, disk usage quickly reached 100% and then returned to its pre-growth level.

I believe this was caused by rapid growth of the Lucene index directory. Searching for similar events, I found only this article, which describes something very similar: https://helpx.adobe.com/uk/experience-manager/kb/lucene-index-directory-growth.html

I would like to know more about this behaviour. First of all, can you confirm this growth and shrinkage?

Thanks in advance, best regards
Raffaele Gambelli
WebRainbow(r) Software Developer

Re: Lucene index directory grows and shrinks
These are typical symptoms of an index merge.

However, it is hard to say more without more data. What is
your segment size limit? Have you changed the default merge frequency
or max segments configuration? Do you have an estimate of the ratio of
segments that have reached the max limit to total segments?

Atri

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Lucene index directory grows and shrinks
Here’s a neat visualization: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

The short form is this:

- A “segment” is all the files with a particular prefix in your index directory, e.g. _12ey1* is one segment
- Segments are created as documents are indexed and commits occur.
- Periodically, segments are “merged”, that is some number of segments are combined into a single new segment and then the old segments are deleted.
- During the merge, both the old and new segments occupy index space.
- Deleted documents continue to occupy disk space until the segment containing them is merged. NOTE: updating a document deletes the old version and adds a new one, so the old version counts as a “deleted” document for this discussion.

So it’s quite common for deletes to accumulate until they are merged away. You have two sources of fluctuation:
1> deleted docs
2> the merging process.
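
To see how much space each segment takes, the "files with a particular prefix" rule above can be sketched in a few lines of Python. This is only an illustration, not something Lucene or Jackrabbit ships: the function names and the "(other)" bucket are made up here, and it assumes a flat Lucene directory in which every segment file name starts with the segment name (for example _12ey1.fdt or _12ey1_1.del).

```python
import os
import re
from collections import defaultdict

# Assumption: segment files start with "_" followed by lowercase letters/digits,
# e.g. "_12ey1.fdt" or "_12ey1_1.del". Anything else (such as "segments_4")
# is reported under the made-up bucket "(other)".
SEGMENT_PREFIX = re.compile(r"^(_[0-9a-z]+)")


def segment_sizes(file_sizes):
    """Sum file sizes in bytes per segment prefix.

    `file_sizes` maps a file name to its size in bytes.
    """
    totals = defaultdict(int)
    for name, size in file_sizes.items():
        match = SEGMENT_PREFIX.match(name)
        key = match.group(1) if match else "(other)"
        totals[key] += size
    return dict(totals)


def scan_index_dir(path):
    """Return {file name: size in bytes} for regular files directly inside `path`."""
    return {
        entry.name: entry.stat().st_size
        for entry in os.scandir(path)
        if entry.is_file()
    }
```

Calling segment_sizes(scan_index_dir("/path/to/index")) before and after a merge should show the old prefixes disappearing and a new, larger one taking their place. If the index is laid out as one sub-directory per prefix, as in the du listing quoted below, du already reports the same totals.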

And in your case, I see one segment around 25G. That indicates your index has been optimized at some point, and I’d guess you’re on a Lucene release prior to 7.5, so whenever you optimize again, _all_ segments will be merged into a single new segment, meaning your index will _at least_ double in size temporarily.
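
The disk arithmetic behind that can be sketched as follows. It is a rough model under stated assumptions, not Lucene's actual accounting: old and new segments coexist until the merge finishes, and `live_fraction` (a hypothetical parameter) is the share of the merged data that survives once deleted documents are dropped. Both function names are invented for this example.

```python
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def parse_du_size(text):
    """Convert a `du -h` style size such as "2.5G" or "40K" to bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1].upper()])
    return int(float(text))


def merge_disk_usage(segment_bytes, merging, live_fraction=1.0):
    """Return (before, peak, after) in bytes for one merge.

    segment_bytes: sizes of all segments currently in the index.
    merging: positions in `segment_bytes` of the segments being merged.
    live_fraction: assumed share of the merged data that is not deleted
                   (1.0 means nothing is reclaimed, the pessimistic case).
    """
    before = sum(segment_bytes)
    merged_old = sum(segment_bytes[i] for i in merging)
    merged_new = int(merged_old * live_fraction)
    peak = before + merged_new                # old and new segments coexist
    after = before - merged_old + merged_new  # old segments removed afterwards
    return before, peak, after
```

With live_fraction = 1.0 and every segment included, the peak is exactly twice the starting size, so under this model a 37G index would need roughly another 37G free while a full merge runs. Real merges can need more or less, so treat the number as a planning estimate only.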

As for how this gets triggered, you’d have to ask the Jackrabbit folks, since I don’t know that app.

For the gory details on optimize, see: https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/. Even though that’s labeled Solr, it’s really about Lucene, and it applies to anything that uses Lucene with the Tiered Merge Policy (which has been the default for some time). Whether Jackrabbit does anything with this, I don’t have a clue.

Best,
Erick


> On Nov 4, 2019, at 11:19 AM, Raffaele Gambelli <R.Gambelli@westpole.it> wrote:
>
> As far as you know, is this behaviour, which you called "typical", described in depth anywhere?
>
> It is fundamental for me to understand it better, and also to know how big an index can grow, so that I can allocate the right amount of disk space.
>
> Thank you very much
>
> -----Original Message-----
> From: Raffaele Gambelli <R.Gambelli@westpole.it>
> Sent: Monday, 4 November 2019 15:16
> To: java-user@lucene.apache.org
> Subject: RE: Lucene index directory grows and shrinks
>
> Thanks for your quick reply. I'm quite a beginner with Lucene concepts, and Jackrabbit hides almost everything about the way it uses Lucene internally.
>
> Anyway, here is the size of each sub-directory in my index. Please note the biggest one, 25G: is that normal?
>
> ...repository/workspaces/default/index$ du -h .
> 2.5G ./_12ey1
> 14M ./_1dr9s
> 20M ./_1dr8d
> 2.8G ./_1b9pj
> 5.8M ./_1drqc
> 19M ./_1dr4q
> 2.5G ./_17lmu
> 4.0M ./_1drmx
> 11M ./_1drbf
> 4.3M ./_1drok
> 13M ./_1drq1
> 40K ./_1drqe
> 11M ./_1drhc
> 260M ./_1dr3g
> 664M ./_1by44
> 2.5G ./_14tet
> 281M ./_1c4wj
> 25G ./_zzgq
> 274M ./_1d2nc
> 638M ./_1ctf0
> 580K ./_1drqf
> 304K ./_1drqd
> 6.5M ./_1dr6m
> 325M ./_1djfp
> 37G
>
> I also tried to download the index directory to my local machine to inspect it with Luke, which I know a bit, but because of network problems the download always gets interrupted.
>
>> What is your segment size limit?
>
> I don't know, where could I see that limit?
>
>> Have you changed the default merge frequency or max segments configuration?
>
> Is merge frequency the mergeFactor? If so, I'm using the default, which is 10; see https://jackrabbit.apache.org/archive/wiki/JCR/Search_115513504.html
>
> As for max segments, I don't know. Where can I see it?
>
> Bye