Mailing List Archive: https://issues.apache.org/jira/browse/LUCENE-8448

baris.kazar at oracle

Nov 12, 2020, 2:35 PM

Post #1 of 8 (1039 views)

https://issues.apache.org/jira/browse/LUCENE-8448

Hi,-

Is this issue fixed? Could you please help me figure it out?

Best regards

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: https://issues.apache.org/jira/browse/LUCENE-8448 [ In reply to ]

baris.kazar at oracle

Nov 12, 2020, 2:41 PM

Post #2 of 8 (1039 views)

On a related issue:

With version 7.7.2 I observed the following with two indexes built from the
same number of docs:

data is all lower case

vs

data is camel case, except the last word, which is always in capital letters

I used the lowercase filter in the indexer in both cases, so indexing is
done in all lower case. The index size for the first case was about 9.5GB,
but for the second case, with the same data size, it was 11GB.

What causes such a difference and increase in index size? The number of docs
is the same in both cases.

Best regards

Re: https://issues.apache.org/jira/browse/LUCENE-8448 [ In reply to ]

erickerickson at gmail

Nov 12, 2020, 8:12 PM

Post #3 of 8 (1039 views)

Yes, that issue is fixed. The “Resolution” tag is the key: it’s marked “fixed” and the version is 8.0.

As for your other question, index size is a very imprecise number. How many deleted documents are there
in each case? Deleted documents take up disk space until the segments containing them are merged away.

Best,
Erick

Re: https://issues.apache.org/jira/browse/LUCENE-8448 [ In reply to ]

baris.kazar at oracle

Nov 12, 2020, 9:16 PM

Post #4 of 8 (1039 views)

Hi,-
Thanks.
These are final finished sizes in both cases.
Best regards

Re: https://issues.apache.org/jira/browse/LUCENE-8448 [ In reply to ]

erickerickson at gmail

Nov 13, 2020, 4:39 AM

Post #5 of 8 (1039 views)

What does “final finished sizes” mean? After optimize, or just after finishing all indexing?
The former is what counts here.

And you provided no information on the number of deleted docs in the two cases. Is
the number of deletedDocs the same (or close)? And does the q=*:* query
return the same numFound?

Finally, are you absolutely and totally sure that no other options changed? For instance,
you specified docValues=true for some field in one but not the other. Or stored=true
etc. If you’re using the same schema.

And you also haven’t provided information on what versions of Solr you’re talking about.
You mention 7.7.2, but not the _other_ version of solr. If you’re going from one major
version to another, sometimes defaults change for docValues on primitive fields
especially. I’d consider firing up Luke and examining the field definitions in
detail.

Best,
Erick

Re: https://issues.apache.org/jira/browse/LUCENE-8448 [ In reply to ]

baris.kazar at oracle

Nov 13, 2020, 9:57 AM

Post #6 of 8 (1039 views)

Nothing changed between the two index generations except that the data
changed a bit, as I described.

What I am reporting is the size of the directory where all the index files
are stored, measured once Lucene is done generating the index.
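
For reference, the directory size described above could be measured along the lines of the sketch below. It uses only the Java standard library. The class name IndexDirSize and the method totalBytes are made up for illustration and are not part of Lucene.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not part of Lucene): sums the sizes of all regular
// files under a directory, e.g. the directory holding the index files.
final class IndexDirSize {

    static long totalBytes(Path dir) throws IOException {
        // Files.walk must be closed, hence the try-with-resources.
        try (Stream<Path> files = Files.walk(dir)) {
            return files
                .filter(Files::isRegularFile)
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // The index path is taken from the command line; defaults to ".".
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.printf("%s: %d bytes%n", dir, totalBytes(dir));
    }
}
```

A number obtained this way includes whatever segments happen to be on disk at that moment, so it is only comparable between two indexes once both are in the same merge state.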

I don't know about deleted docs. How do you trace that? Yes, the queries
run exactly the same way (same number of results). Most of the time only the
order changes, which is fine; occasionally a few different entries show
up, and I don't know why, since the lowercase filter should normalize the
text even if the original data casing changes.

Yes, I am absolutely sure nothing else changed. I kept all those things the
same across the two runs.

Actually, does the Lucene repository have these kinds of experiments across
versions (major or minor)?

If I were Lucene, I would run these experiments to see the impact on the
final index. This would help find potential unidentified bugs.

Methodology:

have a large dataset, e.g. 15 million docs

run indexing with very common settings each time a new version comes out.

I am not using Solr, just pure Lucene 7.7.2. That information was in my
other email in this thread.

Best regards

Re: https://issues.apache.org/jira/browse/LUCENE-8448 [ In reply to ]

msokolov at gmail

Nov 13, 2020, 10:48 AM

Post #7 of 8 (1039 views)

You can't directly compare disk usage across two indexes, even with
the same data. Try re-indexing one of your datasets, and you will see
that the disk size is not the same. Mostly this is due to the way
segments are merged, which varies with some randomness from one run to
another. Although the difference you report is pretty large, it is not
out of the question that it could occur, especially if you have a large
number of deletions or updates to existing documents.
If you want a more accurate idea of the amount of space taken
up by your index, you could try calling IndexWriter.forceMerge(1);
this will merge your index to a single segment, eliminating waste. It
is not generally recommended to do this for indexes you use for
querying, but it can be a useful tool for analysis.
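
As a rough illustration of the above (and of how deleted documents can be counted), here is a minimal sketch assuming Lucene 7.x is on the classpath. The class name IndexStats and the command-line handling are invented for this example. StandardAnalyzer is only a stand-in needed to construct the IndexWriterConfig, on the assumption that merging does not re-analyze any text.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical analysis tool; run it on a copy of the index, not on the
// index used for querying.
final class IndexStats {

    // Prints segment and document counts; deleted = maxDoc - numDocs.
    static void printStats(Directory dir) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            System.out.printf("segments=%d numDocs=%d maxDoc=%d deleted=%d%n",
                reader.leaves().size(), reader.numDocs(),
                reader.maxDoc(), reader.numDeletedDocs());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: IndexStats <path-to-index-copy>");
            return;
        }
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            printStats(dir); // state before merging

            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            // APPEND fails if no index exists, instead of creating an empty one.
            cfg.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                writer.forceMerge(1); // blocks until a single segment remains
                writer.commit();
            }

            printStats(dir); // expect segments=1 and deleted=0
        }
    }
}
```

Measuring the directory size of each index after this step gives two numbers that are more reasonable to compare. The merge needs temporary extra disk space while it runs.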

Re: https://issues.apache.org/jira/browse/LUCENE-8448 [ In reply to ]

baris.kazar at oracle

Nov 13, 2020, 11:09 AM

Post #8 of 8 (1039 views)

Great answer
Thanks Michael.

Yes, the difference was large: more than 1GB.
Best regards
