Mailing List Archive: Efficient sort on SortedDocValues

Efficient sort on SortedDocValues

Nov 5, 2022, 11:06 AM

Post #1 of 5 (311 views)

Hello Lucene community, while looking into how to efficiently sort on a field value, I came across a couple of things that I don't quite understand. My assumption was that if I execute a search and sort on a SortedDocValues field, lucene would only iterate over the docs in the order of the field values or at least collect only competitive docs (docs that made it into the topN queue). Neither of those things seems to be happening. Instead, the iteration is happening in index order and all matched docs are collected. Looking at the code, I see that the optimizations are only possible if the index is sorted in the field order to begin with, which is not possible for our use case. We may have dozens of such fields in our index, thus there isn't any one field that can be used to sort the index. So I guess my question if what I am trying to achieve is possible? I tried to look though Solr codebase, but so far couldn't come up with anything. Code example is here https://pastebin.com/i05E2wZy . I am using 9.4.1. Thanks in advance.

Andrei

This e-mail is for the sole use of the intended recipient and contains information that may be privileged and/or confidential. If you are not an intended recipient, please notify the sender by return e-mail and delete this e-mail and any attachments. Certain required legal entity disclosures can be accessed on our website: https://www.thomsonreuters.com/en/resources/disclosures.html

RE: Efficient sort on SortedDocValues [ In reply to ]

andrei.solodin at thomsonreuters

Nov 5, 2022, 3:17 PM

Post #2 of 5 (310 views)

Permalink

I just realized that the problem is that the field needs to be indexed as well. Now it works. But I noticed that this only works in Lucene 9. Does not work in Lucene 8 (specifically 8.11.2). This must be new functionality in Lucene 9?

Thanks

From: Solodin, Andrei (TR Technology)
Sent: Saturday, November 5, 2022 1:07 PM
To: java-user@lucene.apache.org
Subject: Efficient sort on SortedDocValues

Hello Lucene community, while looking into how to efficiently sort on a field value, I came across a couple of things that I don't quite understand. My assumption was that if I execute a search and sort on a SortedDocValues field, lucene would only iterate over the docs in the order of the field values or at least collect only competitive docs (docs that made it into the topN queue). Neither of those things seems to be happening. Instead, the iteration is happening in index order and all matched docs are collected. Looking at the code, I see that the optimizations are only possible if the index is sorted in the field order to begin with, which is not possible for our use case. We may have dozens of such fields in our index, thus there isn't any one field that can be used to sort the index. So I guess my question if what I am trying to achieve is possible? I tried to look though Solr codebase, but so far couldn't come up with anything. Code example is here https://pastebin.com/i05E2wZy . I am using 9.4.1. Thanks in advance.

Andrei

RE: Efficient sort on SortedDocValues [ In reply to ]

andrei.solodin at thomsonreuters

Nov 5, 2022, 3:34 PM

Post #3 of 5 (310 views)

Permalink

One more thing. While the test case passes now, it still iterates in index order. Which means that it still collects ~6.4K docs out of 10k matches. This is an improvement, but I am still wondering why it's not possible to iterate in the field older. Seems like that would provide substantial improvement.

From: Solodin, Andrei (TR Technology)
Sent: Saturday, November 5, 2022 5:18 PM
To: java-user@lucene.apache.org
Subject: RE: Efficient sort on SortedDocValues

I just realized that the problem is that the field needs to be indexed as well. Now it works. But I noticed that this only works in Lucene 9. Does not work in Lucene 8 (specifically 8.11.2). This must be new functionality in Lucene 9?

Thanks

From: Solodin, Andrei (TR Technology)
Sent: Saturday, November 5, 2022 1:07 PM
To: java-user@lucene.apache.org<mailto:java-user@lucene.apache.org>
Subject: Efficient sort on SortedDocValues

Hello Lucene community, while looking into how to efficiently sort on a field value, I came across a couple of things that I don't quite understand. My assumption was that if I execute a search and sort on a SortedDocValues field, lucene would only iterate over the docs in the order of the field values or at least collect only competitive docs (docs that made it into the topN queue). Neither of those things seems to be happening. Instead, the iteration is happening in index order and all matched docs are collected. Looking at the code, I see that the optimizations are only possible if the index is sorted in the field order to begin with, which is not possible for our use case. We may have dozens of such fields in our index, thus there isn't any one field that can be used to sort the index. So I guess my question if what I am trying to achieve is possible? I tried to look though Solr codebase, but so far couldn't come up with anything. Code example is here https://pastebin.com/i05E2wZy . I am using 9.4.1. Thanks in advance.

Andrei

Re: Efficient sort on SortedDocValues [ In reply to ]

mkhl at apache

Nov 7, 2022, 12:44 AM

Post #4 of 5 (309 views)

Permalink

Hello, Andrei.
Docs are scored in-order (see Weight.scoreAll(), scoreRange()), just
because underneath postings API is in-order. There are a few
shortcuts/optimizations, but they only omit some iterations/segments like
checking competitive scores and so one.

On Sun, Nov 6, 2022 at 1:35 AM Solodin, Andrei (TR Technology)
<andrei.solodin@thomsonreuters.com.invalid> wrote:

> One more thing. While the test case passes now, it still iterates in index
> order. Which means that it still collects ~6.4K docs out of 10k matches.
> This is an improvement, but I am still wondering why it's not possible to
> iterate in the field older. Seems like that would provide substantial
> improvement.
>
> From: Solodin, Andrei (TR Technology)
> Sent: Saturday, November 5, 2022 5:18 PM
> To: java-user@lucene.apache.org
> Subject: RE: Efficient sort on SortedDocValues
>
> I just realized that the problem is that the field needs to be indexed as
> well. Now it works. But I noticed that this only works in Lucene 9. Does
> not work in Lucene 8 (specifically 8.11.2). This must be new functionality
> in Lucene 9?
>
> Thanks
>
>
> From: Solodin, Andrei (TR Technology)
> Sent: Saturday, November 5, 2022 1:07 PM
> To: java-user@lucene.apache.org<mailto:java-user@lucene.apache.org>
> Subject: Efficient sort on SortedDocValues
>
> Hello Lucene community, while looking into how to efficiently sort on a
> field value, I came across a couple of things that I don't quite
> understand. My assumption was that if I execute a search and sort on a
> SortedDocValues field, lucene would only iterate over the docs in the order
> of the field values or at least collect only competitive docs (docs that
> made it into the topN queue). Neither of those things seems to be
> happening. Instead, the iteration is happening in index order and all
> matched docs are collected. Looking at the code, I see that the
> optimizations are only possible if the index is sorted in the field order
> to begin with, which is not possible for our use case. We may have dozens
> of such fields in our index, thus there isn't any one field that can be
> used to sort the index. So I guess my question if what I am trying to
> achieve is possible? I tried to look though Solr codebase, but so far
> couldn't come up with anything. Code example is here
> https://pastebin.com/i05E2wZy . I am using 9.4.1. Thanks in advance.
>
> Andrei
>
>

--
Sincerely yours
Mikhail Khludnev

Re: Efficient sort on SortedDocValues [ In reply to ]

jpountz at gmail

Nov 7, 2022, 2:42 AM

Post #5 of 5 (309 views)

Permalink

Hi Andrei,

The case that you are describing got optimized in Lucene 9.4.0 in the case
when your field is also indexed with a StringField:
https://github.com/apache/lucene/pull/1023. See annotation ER at
http://people.apache.org/~mikemccand/lucenebench/TermMonthSort.html.

The way it works is that Lucene will automatically leverage the inverted
index in order to only look at documents that compare better than the
current k-th document in the priority queue.

To make it work with your test case, you will need to:
- index a StringField with the same name and same value
- change values to be less random if possible, since this optimization
works better on low-cardinality fields than on high-cardinality fields

On Mon, Nov 7, 2022 at 9:45 AM Mikhail Khludnev <mkhl@apache.org> wrote:

> Hello, Andrei.
> Docs are scored in-order (see Weight.scoreAll(), scoreRange()), just
> because underneath postings API is in-order. There are a few
> shortcuts/optimizations, but they only omit some iterations/segments like
> checking competitive scores and so one.
>
> On Sun, Nov 6, 2022 at 1:35 AM Solodin, Andrei (TR Technology)
> <andrei.solodin@thomsonreuters.com.invalid> wrote:
>
> > One more thing. While the test case passes now, it still iterates in
> index
> > order. Which means that it still collects ~6.4K docs out of 10k matches.
> > This is an improvement, but I am still wondering why it's not possible to
> > iterate in the field older. Seems like that would provide substantial
> > improvement.
> >
> > From: Solodin, Andrei (TR Technology)
> > Sent: Saturday, November 5, 2022 5:18 PM
> > To: java-user@lucene.apache.org
> > Subject: RE: Efficient sort on SortedDocValues
> >
> > I just realized that the problem is that the field needs to be indexed as
> > well. Now it works. But I noticed that this only works in Lucene 9. Does
> > not work in Lucene 8 (specifically 8.11.2). This must be new
> functionality
> > in Lucene 9?
> >
> > Thanks
> >
> >
> > From: Solodin, Andrei (TR Technology)
> > Sent: Saturday, November 5, 2022 1:07 PM
> > To: java-user@lucene.apache.org<mailto:java-user@lucene.apache.org>
> > Subject: Efficient sort on SortedDocValues
> >
> > Hello Lucene community, while looking into how to efficiently sort on a
> > field value, I came across a couple of things that I don't quite
> > understand. My assumption was that if I execute a search and sort on a
> > SortedDocValues field, lucene would only iterate over the docs in the
> order
> > of the field values or at least collect only competitive docs (docs that
> > made it into the topN queue). Neither of those things seems to be
> > happening. Instead, the iteration is happening in index order and all
> > matched docs are collected. Looking at the code, I see that the
> > optimizations are only possible if the index is sorted in the field order
> > to begin with, which is not possible for our use case. We may have dozens
> > of such fields in our index, thus there isn't any one field that can be
> > used to sort the index. So I guess my question if what I am trying to
> > achieve is possible? I tried to look though Solr codebase, but so far
> > couldn't come up with anything. Code example is here
> > https://pastebin.com/i05E2wZy . I am using 9.4.1. Thanks in advance.
> >
> > Andrei
> >
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
>

--
Adrien