Mailing List Archive

Use DirectMonotonicWriter store sorted NumericDocValues
Hi,

In Lucene80DocValuesConsumer#writeValues(FieldInfo field, DocValuesProducer valuesProducer), all numeric doc values are already visited to compute the GCD. In the same pass we could also check whether the values are sorted. If they are, maybe we could use DirectMonotonicWriter to store them, since DirectMonotonicWriter compresses monotonic sequences very well.

In addition, when I use Elasticsearch to store numeric field types, at the Lucene level the data is always stored at least as NumericDocValues/SortedNumericDocValues. So when indexing sorted values such as IDs or timestamps, the optimization above may be applicable.

Could I have some suggestions?
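
To make the proposal concrete, here is a rough, self-contained sketch of the idea. It is not Lucene's code: the class and method names are made up for illustration, it works on a plain long[] instead of doc values, it assumes a non-empty array, and it ignores arithmetic overflow for extreme values. The "monotonic" size estimate stores each value's deviation from a straight line through the first and last value, which is my understanding of the general approach behind DirectMonotonicWriter; per-block metadata is not counted.

```java
/** Illustrative only: hypothetical names, not Lucene's implementation. */
final class MonotonicScanSketch {

  /** Statistics gathered in a single pass over the values. */
  static final class Stats {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long gcd = 0;          // GCD of (value - first value) over all values
    boolean sorted = true; // stays true while values are non-decreasing
  }

  static long gcd(long a, long b) {
    a = Math.abs(a);
    b = Math.abs(b);
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  /** One pass: min, max and GCD, plus the extra sortedness check. */
  static Stats scan(long[] values) {
    Stats s = new Stats();
    for (int i = 0; i < values.length; i++) {
      long v = values[i];
      s.min = Math.min(s.min, v);
      s.max = Math.max(s.max, v);
      if (i > 0) {
        s.gcd = gcd(s.gcd, v - values[0]);
        if (v < values[i - 1]) {
          s.sorted = false;
        }
      }
    }
    return s;
  }

  /** Number of bits needed to represent a non-negative range (0 for range 0). */
  static int bitsRequired(long range) {
    return 64 - Long.numberOfLeadingZeros(range);
  }

  /** Bits per value when storing (value - min) / gcd. */
  static int bitsPlain(Stats s) {
    long range = s.max - s.min;
    return bitsRequired(s.gcd == 0 ? range : range / s.gcd);
  }

  /** Bits per value when storing deviations from a line through first and last value. */
  static int bitsMonotonic(long[] values) {
    int n = values.length;
    double avgInc = n > 1 ? (double) (values[n - 1] - values[0]) / (n - 1) : 0;
    long minDev = Long.MAX_VALUE;
    long maxDev = Long.MIN_VALUE;
    for (int i = 0; i < n; i++) {
      long expected = values[0] + (long) (avgInc * i);
      long dev = values[i] - expected;
      minDev = Math.min(minDev, dev);
      maxDev = Math.max(maxDev, dev);
    }
    return bitsRequired(maxDev - minDev);
  }
}
```

For the five values 1000, 2001, 2999, 4000, 5002 this sketch reports a GCD of 1 and sorted = true, with 12 bits per value for the plain encoding and 2 bits per value for the line-deviation encoding.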
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Use DirectMonotonicWriter store sorted NumericDocValues
We did this monotonic detection/compression before in older times, but
had to remove it because it caused too many slowdowns.

I think it easily causes too much type pollution. For example, for a
typical large index with an unsorted docvalues field, big segments
won't be sorted, tiny segments with a few values might happen to be
sorted (depending on chance/luck), and tiny tiny ones with e.g. a single
document are always sorted. Now we have a mix of monotonic and
non-monotonic encodings over the same field.

On the other hand, the optimization is very fragile and rarely applies:
even for these log users actually sorting on that field at index-time,
it will just apply to one field out of the somehow typical
dozens/hundreds that they like to have. But it may destroy performance
of all the other fields, and overall cause more harm than good.
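
A small sketch of the segment mix described above, assuming a hypothetical writer that picks the encoding per segment purely from whether that segment's values happen to be sorted. The names are invented for illustration. For n distinct values in random order the chance of being in ascending order is 1/n!, so a single-value segment always qualifies and a two-value segment qualifies half the time, while a large segment practically never does.

```java
/** Illustrative only: a made-up per-segment encoding choice, not Lucene's code. */
final class SegmentMixSketch {

  enum Encoding { MONOTONIC, PLAIN }

  /** True if the values are non-decreasing. */
  static boolean isSorted(long[] values) {
    for (int i = 1; i < values.length; i++) {
      if (values[i] < values[i - 1]) {
        return false;
      }
    }
    return true;
  }

  /** Picks an encoding for each segment of the same field independently. */
  static Encoding[] chooseEncodings(long[][] segments) {
    Encoding[] out = new Encoding[segments.length];
    for (int i = 0; i < segments.length; i++) {
      out[i] = isSorted(segments[i]) ? Encoding.MONOTONIC : Encoding.PLAIN;
    }
    return out;
  }
}
```

With the segments {42}, {7, 19} and {5, 3, 9, 1, 8, 2} for one unsorted field, the first two come out as MONOTONIC and the third as PLAIN, so readers of that field have to deal with both encodings.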

On Tue, Jun 15, 2021 at 5:49 AM LuXugang <xuganglu@icloud.com.invalid> wrote:
Re: Use DirectMonotonicWriter store sorted NumericDocValues
I believe that this sort of optimization would be more effective and robust
if we made doc values look more like postings, with relatively small blocks
of values that would get compressed independently and decompressed in bulk.
This way, we wouldn't require data to be sorted across entire segments for
this optimization to kick in, and we would be less likely to slow down the
normal case.
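
As a rough sketch of what per-block compression could look like (an assumption-laden illustration, not a proposal for the actual format): values are cut into fixed-size blocks and each block independently picks the cheaper of two hypothetical encodings, a plain "value minus block minimum" encoding or, if the block happens to be sorted, a deviation-from-a-line encoding. All names and the block size are made up, and overflow is ignored.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical block-wise encoding choice, not Lucene's format. */
final class BlockwiseSketch {

  /** Number of bits needed to represent a non-negative range (0 for range 0). */
  static int bitsRequired(long range) {
    return 64 - Long.numberOfLeadingZeros(range);
  }

  /** Bits per value when storing value - min of the block. */
  static int plainBits(long[] block) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long v : block) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    return bitsRequired(max - min);
  }

  /** Bits per value for the line-deviation encoding, or Integer.MAX_VALUE if unsorted. */
  static int monotonicBits(long[] block) {
    int n = block.length;
    for (int i = 1; i < n; i++) {
      if (block[i] < block[i - 1]) {
        return Integer.MAX_VALUE; // not applicable to this block
      }
    }
    double avgInc = n > 1 ? (double) (block[n - 1] - block[0]) / (n - 1) : 0;
    long minDev = Long.MAX_VALUE;
    long maxDev = Long.MIN_VALUE;
    for (int i = 0; i < n; i++) {
      long dev = block[i] - (block[0] + (long) (avgInc * i));
      minDev = Math.min(minDev, dev);
      maxDev = Math.max(maxDev, dev);
    }
    return bitsRequired(maxDev - minDev);
  }

  /** Cheapest bits per value for each block; blockSize must be positive. */
  static int[] bitsPerBlock(long[] values, int blockSize) {
    int numBlocks = (values.length + blockSize - 1) / blockSize;
    int[] bits = new int[numBlocks];
    for (int b = 0; b < numBlocks; b++) {
      int from = b * blockSize;
      int to = Math.min(values.length, from + blockSize);
      long[] block = Arrays.copyOfRange(values, from, to);
      bits[b] = Math.min(plainBits(block), monotonicBits(block));
    }
    return bits;
  }
}
```

For the values 1000, 2000, 3000, 4000, 17, 900, 3, 512 and a block size of 4, the first block is sorted and perfectly linear (0 bits per value), while the second falls back to the plain encoding (10 bits per value). Treated as a single unsorted run, the same eight values would need 12 bits each with the plain encoding.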

On Tue, Jun 15, 2021 at 12:06 PM Robert Muir <rcmuir@gmail.com> wrote:

--
Adrien
Re: Use DirectMonotonicWriter store sorted NumericDocValues
+1 to that idea. Maybe a shorter-term possibility would be to only do
this compression on a field when the user has explicitly configured
index sorting on the field (can we hackishly peek at it and tell?)

On Tue, Jun 15, 2021 at 9:04 AM Adrien Grand <jpountz@gmail.com> wrote:
Re: Use DirectMonotonicWriter store sorted NumericDocValues
SegmentWriteState has a reference to SegmentInfo, which itself has the
index sort, so I believe that it would be possible.
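
A minimal sketch of such a check, using stand-in types so that it runs on its own. SortKey and isPrimarySortField are hypothetical names. In Lucene the same information would come from the index Sort reachable through SegmentWriteState, whose Sort#getSort() returns SortField objects exposing getField() and getReverse(). The sketch only accepts the first sort key, since only the primary key is ordered across the whole segment, and it leaves open how descending sorts and documents without a value would be handled.

```java
/** Illustrative only: stand-in types, not Lucene's Sort/SortField classes. */
final class IndexSortPeekSketch {

  /** Hypothetical stand-in for one key of an index sort. */
  static final class SortKey {
    final String field;
    final boolean reverse; // true for descending order

    SortKey(String field, boolean reverse) {
      this.field = field;
      this.reverse = reverse;
    }
  }

  /** True if an index sort is configured and fieldName is its first key. */
  static boolean isPrimarySortField(SortKey[] indexSort, String fieldName) {
    return indexSort != null
        && indexSort.length > 0
        && indexSort[0].field.equals(fieldName);
  }
}
```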

I wonder how useful it would be in practice. E.g. in the Elasticsearch
case, even though we store lots of time-based data and have been looking
into index sorting for storage/query efficiency reasons, the index sorts
that we are actually interested in look more like `host.name ASC,
@timestamp DESC` than just `@timestamp DESC`. The reason for sorting by
`host.name` first is that it helps a lot with storage/query efficiency of
metadata that is tied to the host (e.g. IP addresses, operating system,
etc.), and then because `host.name` is usually a low-cardinality field,
queries by descending timestamp remain super efficient thanks to LUCENE-9280
<https://issues.apache.org/jira/browse/LUCENE-9280>. So we'd be more
interested in an optimization that would support piecewise monotonic fields.
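
To illustrate what "piecewise monotonic" means here, the following sketch (made-up names and data, not Elasticsearch or Lucene code) orders a few documents by host ascending and then timestamp descending, and counts the non-increasing runs in the resulting timestamp column. There is at most one run per distinct host, so a low-cardinality host field yields a few long runs rather than one fully sorted column.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: hypothetical names and toy data. */
final class PiecewiseMonotonicSketch {

  /** Hypothetical document with the two sort fields. */
  static final class Doc {
    final String host;
    final long timestamp;

    Doc(String host, long timestamp) {
      this.host = host;
      this.timestamp = timestamp;
    }
  }

  /** Timestamps in the order given by a `host ASC, timestamp DESC` sort. */
  static long[] timestampsInIndexOrder(Doc[] docs) {
    Doc[] sorted = docs.clone();
    Arrays.sort(
        sorted,
        Comparator.comparing((Doc d) -> d.host)
            .thenComparing(Comparator.comparingLong((Doc d) -> d.timestamp).reversed()));
    long[] out = new long[sorted.length];
    for (int i = 0; i < sorted.length; i++) {
      out[i] = sorted[i].timestamp;
    }
    return out;
  }

  /** Number of maximal non-increasing runs in the sequence. */
  static int countNonIncreasingRuns(long[] values) {
    if (values.length == 0) {
      return 0;
    }
    int runs = 1;
    for (int i = 1; i < values.length; i++) {
      if (values[i] > values[i - 1]) {
        runs++; // an increase starts a new run
      }
    }
    return runs;
  }
}
```

With three documents each for hosts "a" and "b", the timestamp column comes out as 30, 20, 10, 40, 20, 10: two descending runs, one per host.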

On Tue, Jun 15, 2021 at 3:33 PM Robert Muir <rcmuir@gmail.com> wrote:

--
Adrien
Re: Use DirectMonotonicWriter store sorted NumericDocValues
Well it definitely wouldn't be as useful as changing to a
postings-style approach. That would bring a lot more benefits to
general cases, e.g. use of PFOR and so on.

But it is also easier to implement right now, to accelerate cases
where fields are sorted, without hurting other things.

On Tue, Jun 15, 2021 at 9:53 AM Adrien Grand <jpountz@gmail.com> wrote:
Re: Use DirectMonotonicWriter store sorted NumericDocValues
Thanks, Robert and Adrien. Your replies are helpful to me.

On Jun 15, 2021, at 10:19, Robert Muir <rcmuir@gmail.com> wrote: