SegmentWriteState has a reference to SegmentInfo, which itself has the
index sort, so I believe that it would be possible.
I wonder how useful it would be in practice. E.g. in the Elasticsearch
case, even though we store lots of time-based data and have been looking
into index sorting for storage/query efficiency reasons, the index sorts
that we are actually interested in look more like `host.name ASC,
@timestamp DESC` than just `@timestamp DESC`. The reason for sorting
by `host` first is that it helps a lot with storage/query efficiency of
metadata that is tied to the host (e.g. IP addresses, operating system,
etc.), and then because `host.name` is usually a low-cardinality field,
queries by descending timestamp remain super efficient thanks to LUCENE-9280
<https://issues.apache.org/jira/browse/LUCENE-9280>. So we'd be more
interested in an optimization that would support piecewise monotonic fields.
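
To make "piecewise monotonic" concrete, here is a minimal sketch in plain
Java. It does not use any Lucene API; the class name, method name and sample
data are made up for illustration. It assumes documents sorted by
`host.name ASC, @timestamp DESC` and finds where each non-increasing run of
timestamps starts, which is roughly the structure such an optimization would
have to exploit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not Lucene code.
public class PiecewiseMonotonic {

  /** Returns the start offsets of the maximal non-increasing runs in values. */
  public static List<Integer> descendingRunStarts(long[] values) {
    List<Integer> starts = new ArrayList<>();
    for (int i = 0; i < values.length; i++) {
      // A new run starts at the first value, or whenever a value goes up.
      if (i == 0 || values[i] > values[i - 1]) {
        starts.add(i);
      }
    }
    return starts;
  }

  public static void main(String[] args) {
    // Timestamps in doc ID order for three hosts (a, a, a, b, b, b, c, c),
    // i.e. sorted by host ascending, then timestamp descending.
    long[] timestamps = {50, 40, 10, 45, 30, 20, 60, 5};
    System.out.println(descendingRunStarts(timestamps)); // prints [0, 3, 6]
  }
}
```

With a low-cardinality leading sort field, the number of runs stays small
relative to the number of documents, so each run could in principle be
encoded as a monotonic sequence on its own.
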
On Tue, Jun 15, 2021 at 3:33 PM Robert Muir <rcmuir@gmail.com> wrote:
> +1 to that idea. Maybe a shorter-term possibility would be to only do
> this compression on a field when the user has explicitly configured
> index sorting on the field (can we hackishly peek at it and tell?)
>
> On Tue, Jun 15, 2021 at 9:04 AM Adrien Grand <jpountz@gmail.com> wrote:
> >
> > I believe that this sort of optimization would be more effective and
> > robust if we made doc values look more like postings, with relatively small
> > blocks of values that would get compressed independently and decompressed
> > in bulk. This way, we wouldn't require data to be sorted across entire
> > segments for this optimization to kick in, and we would be less likely to
> > slow down the normal case.
> >
> > On Tue, Jun 15, 2021 at 12:06 PM Robert Muir <rcmuir@gmail.com> wrote:
> >>
> >> We did this monotonic detection/compression before in older times, but
> >> had to remove it because it caused too many slowdowns.
> >>
> >> I think it easily causes too much type pollution. For example, for a
> >> typical large index with an unsorted docvalues field, big segments
> >> won't be sorted, tiny segments with a few values might happen to be
> >> sorted (depending on chance/luck), and tiny tiny ones with e.g. a single
> >> document are sorted. Now we have a mix of monotonic and non-monotonic
> >> over the same field.
> >>
> >> On the other hand, the optimization is very fragile and rare: even for
> >> these log users actually sorting on that field at index-time, it will
> >> just apply to one field out of the somehow typical dozens/hundreds
> >> that they like to have. But it may destroy performance of all the other
> >> fields and overall cause more harm than good.
> >>
> >> On Tue, Jun 15, 2021 at 5:49 AM LuXugang <xuganglu@icloud.com.invalid> wrote:
> >> >
> >> > Hi,
> >> >
> >> > In Lucene80DocValuesConsumer#writeValues(FieldInfo field,
> >> > DocValuesProducer valuesProducer), all numeric doc values are visited to
> >> > calculate the gcd. In the meantime, we could check whether all values
> >> > are sorted. If so, maybe we could use DirectMonotonicWriter to store them.
> >> > DirectMonotonicWriter can get impressive compression.
> >> >
> >> > In addition, when I use Elasticsearch to store numeric field types,
> >> > at the Lucene level the data is always stored at least as
> >> > NumericDocValues/SortedNumericDocValues. So when indexing sorted
> >> > values like ID or TIMESTAMP, maybe the above optimization is applicable.
> >> >
> >> > Could I have some suggestions?
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: dev-help@lucene.apache.org
> >> >
> >>
> >
> >
> > --
> > Adrien
>
--
Adrien