Mailing List Archive

Sort by numeric field, order missing values before anything else
Hello,

When sorting documents by a NumericDocValuesField, how can documents be
ordered such that those with missing values can come before anything else
in ascending sorts? SortField allows to set a missing value:

var sortField = new SortField("price", SortField.Type.LONG);
sortField.setMissingValue(null);

This null is however converted into a long 0 and documents with missing
values are considered equally ordered with documents with an actual 0
value. It's possible to set the missing value to Long.MIN_VALUE, but that
will have the same problem, just for a different long value.

Besides writing a custom comparator, is there any simpler and still
performant way to achieve this sort?

--Petko
Re: Sort by numeric field, order missing values before anything else [ In reply to ]
Hi Petko,

Lucene's comparators for numerics have this limitation indeed. We haven't
got many questions around that in the past, which I would guess is due to
the fact that most numeric fields do not use the entire long range,
specifically Long.MIN_VALUE and Long.MAX_VALUE, so using either of these
works as a way to sort missing values first or last. If you have a field
that may use Long.MIN_VALUE and long.MAX_VALUE, we do not have a comparator
that can easily sort missing values first or last reliably out of the box.

The easier option I can think of would consist of using the comparator for
longs with MIN_VALUE / MAX_VALUE for missing values depending on whether
you want missing values sorted first or last, and chain it with another
comparator (via a FieldComparatorSource) which would sort missing values
before/after existing values. The benefit of this approach is that you
would automatically benefit from some not-so-trivial features of Lucene's
comparator such as dynamic pruning.

On Wed, Nov 16, 2022 at 9:16 PM Petko Minkov <pminkov@gmail.com> wrote:

> Hello,
>
> When sorting documents by a NumericDocValuesField, how can documents be
> ordered such that those with missing values can come before anything else
> in ascending sorts? SortField allows to set a missing value:
>
> var sortField = new SortField("price", SortField.Type.LONG);
> sortField.setMissingValue(null);
>
> This null is however converted into a long 0 and documents with missing
> values are considered equally ordered with documents with an actual 0
> value. It's possible to set the missing value to Long.MIN_VALUE, but that
> will have the same problem, just for a different long value.
>
> Besides writing a custom comparator, is there any simpler and still
> performant way to achieve this sort?
>
> --Petko
>


--
Adrien
Re: Sort by numeric field, order missing values before anything else [ In reply to ]
Thanks Adrien, this way of doing it makes sense! I suppose another option
might be storing numbers in their byte array representations (maybe using
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/NumericUtils.html#longToSortableBytes(long,byte%5B%5D,int)
) in SortedDocValues and then using a missing value of STRING_FIRST (
https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/search/SortField.html#STRING_FIRST)?
Looks like TermOrdValComparator supports dynamic pruning too now, which you
implemented.

On Wed, Nov 16, 2022 at 11:47 PM Adrien Grand <jpountz@gmail.com> wrote:

> Hi Petko,
>
> Lucene's comparators for numerics have this limitation indeed. We haven't
> got many questions around that in the past, which I would guess is due to
> the fact that most numeric fields do not use the entire long range,
> specifically Long.MIN_VALUE and Long.MAX_VALUE, so using either of these
> works as a way to sort missing values first or last. If you have a field
> that may use Long.MIN_VALUE and long.MAX_VALUE, we do not have a comparator
> that can easily sort missing values first or last reliably out of the box.
>
> The easier option I can think of would consist of using the comparator for
> longs with MIN_VALUE / MAX_VALUE for missing values depending on whether
> you want missing values sorted first or last, and chain it with another
> comparator (via a FieldComparatorSource) which would sort missing values
> before/after existing values. The benefit of this approach is that you
> would automatically benefit from some not-so-trivial features of Lucene's
> comparator such as dynamic pruning.
>
> On Wed, Nov 16, 2022 at 9:16 PM Petko Minkov <pminkov@gmail.com> wrote:
>
> > Hello,
> >
> > When sorting documents by a NumericDocValuesField, how can documents be
> > ordered such that those with missing values can come before anything else
> > in ascending sorts? SortField allows to set a missing value:
> >
> > var sortField = new SortField("price", SortField.Type.LONG);
> > sortField.setMissingValue(null);
> >
> > This null is however converted into a long 0 and documents with missing
> > values are considered equally ordered with documents with an actual 0
> > value. It's possible to set the missing value to Long.MIN_VALUE, but that
> > will have the same problem, just for a different long value.
> >
> > Besides writing a custom comparator, is there any simpler and still
> > performant way to achieve this sort?
> >
> > --Petko
> >
>
>
> --
> Adrien
>
Re: Sort by numeric field, order missing values before anything else [ In reply to ]
Hi,

Long.MIN_VALUE and Long.MAX_VALUE are the correct way for longs to sort.
In fact if you have Long.MIN_VALUE in your collection, empty values are
treated the same, but still empty value will appear at the wanted place.
In contrast to the default "0", it is not somewhere in the middle.
Because there is no long that is smaller than Long.MIN_VALUE, the sort
order will be OK.

BTW, Apache Solr is using exactly those values to support missing values
automatically (see sortMissingFirst, sortMissingLast schema options).

In fact, string/bytes sorting has theoretically the same problem,
because NULL is still different that empty. WARNING: If you really want
to compare by byte[] as suggested in your last mail, keep in mind: When
you sort against the raw bytes (using NumericUtils) with SORTED_SET
docvalues type, there is a large overhead on indexing and sorting
performance, especially for the case where you have many different
values in your index (which is likely for numerics).

Uwe

Am 17.11.2022 um 08:47 schrieb Adrien Grand:
> Hi Petko,
>
> Lucene's comparators for numerics have this limitation indeed. We haven't
> got many questions around that in the past, which I would guess is due to
> the fact that most numeric fields do not use the entire long range,
> specifically Long.MIN_VALUE and Long.MAX_VALUE, so using either of these
> works as a way to sort missing values first or last. If you have a field
> that may use Long.MIN_VALUE and long.MAX_VALUE, we do not have a comparator
> that can easily sort missing values first or last reliably out of the box.
>
> The easier option I can think of would consist of using the comparator for
> longs with MIN_VALUE / MAX_VALUE for missing values depending on whether
> you want missing values sorted first or last, and chain it with another
> comparator (via a FieldComparatorSource) which would sort missing values
> before/after existing values. The benefit of this approach is that you
> would automatically benefit from some not-so-trivial features of Lucene's
> comparator such as dynamic pruning.
>
> On Wed, Nov 16, 2022 at 9:16 PM Petko Minkov <pminkov@gmail.com> wrote:
>
>> Hello,
>>
>> When sorting documents by a NumericDocValuesField, how can documents be
>> ordered such that those with missing values can come before anything else
>> in ascending sorts? SortField allows to set a missing value:
>>
>> var sortField = new SortField("price", SortField.Type.LONG);
>> sortField.setMissingValue(null);
>>
>> This null is however converted into a long 0 and documents with missing
>> values are considered equally ordered with documents with an actual 0
>> value. It's possible to set the missing value to Long.MIN_VALUE, but that
>> will have the same problem, just for a different long value.
>>
>> Besides writing a custom comparator, is there any simpler and still
>> performant way to achieve this sort?
>>
>> --Petko
>>
>
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Sort by numeric field, order missing values before anything else [ In reply to ]
Uwe, I think that Petko's question was about making sure that missing
values would be returned before non-missing values, even though some of
these non-missing values might be equal to Long.MIN_VALUE. Which isn't
possible today.

I agree with your recommendation against going with bytes given the
overhead in case of high cardinality.

On Mon, Nov 21, 2022 at 11:08 AM Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi,
>
> Long.MIN_VALUE and Long.MAX_VALUE are the correct way for longs to sort.
> In fact if you have Long.MIN_VALUE in your collection, empty values are
> treated the same, but still empty value will appear at the wanted place.
> In contrast to the default "0", it is not somewhere in the middle.
> Because there is no long that is smaller than Long.MIN_VALUE, the sort
> order will be OK.
>
> BTW, Apache Solr is using exactly those values to support missing values
> automatically (see sortMissingFirst, sortMissingLast schema options).
>
> In fact, string/bytes sorting has theoretically the same problem,
> because NULL is still different that empty. WARNING: If you really want
> to compare by byte[] as suggested in your last mail, keep in mind: When
> you sort against the raw bytes (using NumericUtils) with SORTED_SET
> docvalues type, there is a large overhead on indexing and sorting
> performance, especially for the case where you have many different
> values in your index (which is likely for numerics).
>
> Uwe
>
> Am 17.11.2022 um 08:47 schrieb Adrien Grand:
> > Hi Petko,
> >
> > Lucene's comparators for numerics have this limitation indeed. We haven't
> > got many questions around that in the past, which I would guess is due to
> > the fact that most numeric fields do not use the entire long range,
> > specifically Long.MIN_VALUE and Long.MAX_VALUE, so using either of these
> > works as a way to sort missing values first or last. If you have a field
> > that may use Long.MIN_VALUE and long.MAX_VALUE, we do not have a
> comparator
> > that can easily sort missing values first or last reliably out of the
> box.
> >
> > The easier option I can think of would consist of using the comparator
> for
> > longs with MIN_VALUE / MAX_VALUE for missing values depending on whether
> > you want missing values sorted first or last, and chain it with another
> > comparator (via a FieldComparatorSource) which would sort missing values
> > before/after existing values. The benefit of this approach is that you
> > would automatically benefit from some not-so-trivial features of Lucene's
> > comparator such as dynamic pruning.
> >
> > On Wed, Nov 16, 2022 at 9:16 PM Petko Minkov <pminkov@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> When sorting documents by a NumericDocValuesField, how can documents be
> >> ordered such that those with missing values can come before anything
> else
> >> in ascending sorts? SortField allows to set a missing value:
> >>
> >> var sortField = new SortField("price", SortField.Type.LONG);
> >> sortField.setMissingValue(null);
> >>
> >> This null is however converted into a long 0 and documents with missing
> >> values are considered equally ordered with documents with an actual 0
> >> value. It's possible to set the missing value to Long.MIN_VALUE, but
> that
> >> will have the same problem, just for a different long value.
> >>
> >> Besides writing a custom comparator, is there any simpler and still
> >> performant way to achieve this sort?
> >>
> >> --Petko
> >>
> >
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

--
Adrien
Re: Sort by numeric field, order missing values before anything else [ In reply to ]
Thanks for the help! I didn't know about the performance issues with byte
arrays vs numbers in this particular case.

On Mon, Nov 21, 2022 at 3:25 AM Adrien Grand <jpountz@gmail.com> wrote:

> Uwe, I think that Petko's question was about making sure that missing
> values would be returned before non-missing values, even though some of
> these non-missing values might be equal to Long.MIN_VALUE. Which isn't
> possible today.
>
> I agree with your recommendation against going with bytes given the
> overhead in case of high cardinality.
>
> On Mon, Nov 21, 2022 at 11:08 AM Uwe Schindler <uwe@thetaphi.de> wrote:
>
> > Hi,
> >
> > Long.MIN_VALUE and Long.MAX_VALUE are the correct way for longs to sort.
> > In fact if you have Long.MIN_VALUE in your collection, empty values are
> > treated the same, but still empty value will appear at the wanted place.
> > In contrast to the default "0", it is not somewhere in the middle.
> > Because there is no long that is smaller than Long.MIN_VALUE, the sort
> > order will be OK.
> >
> > BTW, Apache Solr is using exactly those values to support missing values
> > automatically (see sortMissingFirst, sortMissingLast schema options).
> >
> > In fact, string/bytes sorting has theoretically the same problem,
> > because NULL is still different that empty. WARNING: If you really want
> > to compare by byte[] as suggested in your last mail, keep in mind: When
> > you sort against the raw bytes (using NumericUtils) with SORTED_SET
> > docvalues type, there is a large overhead on indexing and sorting
> > performance, especially for the case where you have many different
> > values in your index (which is likely for numerics).
> >
> > Uwe
> >
> > Am 17.11.2022 um 08:47 schrieb Adrien Grand:
> > > Hi Petko,
> > >
> > > Lucene's comparators for numerics have this limitation indeed. We
> haven't
> > > got many questions around that in the past, which I would guess is due
> to
> > > the fact that most numeric fields do not use the entire long range,
> > > specifically Long.MIN_VALUE and Long.MAX_VALUE, so using either of
> these
> > > works as a way to sort missing values first or last. If you have a
> field
> > > that may use Long.MIN_VALUE and long.MAX_VALUE, we do not have a
> > comparator
> > > that can easily sort missing values first or last reliably out of the
> > box.
> > >
> > > The easier option I can think of would consist of using the comparator
> > for
> > > longs with MIN_VALUE / MAX_VALUE for missing values depending on
> whether
> > > you want missing values sorted first or last, and chain it with another
> > > comparator (via a FieldComparatorSource) which would sort missing
> values
> > > before/after existing values. The benefit of this approach is that you
> > > would automatically benefit from some not-so-trivial features of
> Lucene's
> > > comparator such as dynamic pruning.
> > >
> > > On Wed, Nov 16, 2022 at 9:16 PM Petko Minkov <pminkov@gmail.com>
> wrote:
> > >
> > >> Hello,
> > >>
> > >> When sorting documents by a NumericDocValuesField, how can documents
> be
> > >> ordered such that those with missing values can come before anything
> > else
> > >> in ascending sorts? SortField allows to set a missing value:
> > >>
> > >> var sortField = new SortField("price", SortField.Type.LONG);
> > >> sortField.setMissingValue(null);
> > >>
> > >> This null is however converted into a long 0 and documents with
> missing
> > >> values are considered equally ordered with documents with an actual 0
> > >> value. It's possible to set the missing value to Long.MIN_VALUE, but
> > that
> > >> will have the same problem, just for a different long value.
> > >>
> > >> Besides writing a custom comparator, is there any simpler and still
> > >> performant way to achieve this sort?
> > >>
> > >> --Petko
> > >>
> > >
> > --
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
> --
> Adrien
>