Mailing List Archive

TermInSetQuery: seekExact vs. seekCeil
Hi folks-

Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote
TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery
can now leverage the various "rewrite methods" available to MultiTermQuery,
allowing users to customize the query evaluation strategy (e.g., postings
vs. doc values, etc.), which was a nice win. In the benchmarks we ran, we
didn't see any performance issues.

In anticipation of 9.6 releasing, I've pulled this change into the Lucene
snapshot we use for Amazon product search, and started running some
additional benchmarks, which have surfaced an interesting issue. One
use-case we have for TermInSetQuery creates a term disjunction over a field
that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
bloom filtering can only help with seekExact and not seekCeil, we're seeing
a performance regression (primarily in red-line QPS).

One way I can think to address this is to move back to a seekExact approach
when creating the filtered TermsEnum used by MultiTermQuery (for the
TermInSetQuery implementation). Because TermInSetQuery can provide all of
its terms up-front, we can have a simpler term intersection implementation
that relies on seekExact over seekCeil. Here's a quick take on what I'm
thinking:
https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
I've tested this internally and confirmed it solves our QPS regression
problem.

I'm curious if anyone has an objection to moving back to a seekExact term
intersection approach for TermInSetQuery, or has alternative ideas. I
wonder if I'm overlooking some important factors and focusing too much on
this specific case where the bloom filter interaction is hurting
performance? It seems like seekCeil could provide benefits in some cases
over seekExact by skipping over multiple query terms at a time, so that's a
possible consideration. If we solve for the most common cases by default, I
suppose advanced users could always override TermInSetQuery#getTermsEnum as
necessary (we could take this approach internally for example to work with
our bloom filtering if the best default is to leverage seekCeil). I can
easily turn my quick solution into a PR, but before I do, I wanted to poll
this group for thoughts on the approach or other alternatives I might be
overlooking. Thanks in advance!

Cheers,
-Greg
Re: TermInSetQuery: seekExact vs. seekCeil [ In reply to ]
Hi Greg
IMO I still think the seekCeil is a better solution for the default posting
format, as it could potentially save time on traversing the FST by doing
the ping-pong skipping.
I can see that in the case of using bloom filter the seekExact might be
better but I'm not sure whether there is a better way than overriding the
`getTermsEnum`...

Patrick

On Fri, May 5, 2023 at 4:45?PM Greg Miller <gsmiller@gmail.com> wrote:

> Hi folks-
>
> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we
> rewrote TermInSetQuery to extend MultiTermQuery. With this change,
> TermInSetQuery can now leverage the various "rewrite methods" available to
> MultiTermQuery, allowing users to customize the query evaluation strategy
> (e.g., postings vs. doc values, etc.), which was a nice win. In the
> benchmarks we ran, we didn't see any performance issues.
>
> In anticipation of 9.6 releasing, I've pulled this change into the Lucene
> snapshot we use for Amazon product search, and started running some
> additional benchmarks, which have surfaced an interesting issue. One
> use-case we have for TermInSetQuery creates a term disjunction over a field
> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
> bloom filtering can only help with seekExact and not seekCeil, we're seeing
> a performance regression (primarily in red-line QPS).
>
> One way I can think to address this is to move back to a seekExact
> approach when creating the filtered TermsEnum used by MultiTermQuery (for
> the TermInSetQuery implementation). Because TermInSetQuery can provide all
> of its terms up-front, we can have a simpler term intersection
> implementation that relies on seekExact over seekCeil. Here's a quick take
> on what I'm thinking:
> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
> I've tested this internally and confirmed it solves our QPS regression
> problem.
>
> I'm curious if anyone has an objection to moving back to a seekExact term
> intersection approach for TermInSetQuery, or has alternative ideas. I
> wonder if I'm overlooking some important factors and focusing too much on
> this specific case where the bloom filter interaction is hurting
> performance? It seems like seekCeil could provide benefits in some cases
> over seekExact by skipping over multiple query terms at a time, so that's a
> possible consideration. If we solve for the most common cases by default, I
> suppose advanced users could always override TermInSetQuery#getTermsEnum as
> necessary (we could take this approach internally for example to work with
> our bloom filtering if the best default is to leverage seekCeil). I can
> easily turn my quick solution into a PR, but before I do, I wanted to poll
> this group for thoughts on the approach or other alternatives I might be
> overlooking. Thanks in advance!
>
> Cheers,
> -Greg
>
Re: TermInSetQuery: seekExact vs. seekCeil [ In reply to ]
Thanks Patrick. I tend to agree with you for the default behavior. Bloom
filter usage seems like a bit of a less-common case on the surface at least
(e.g., it's expected behavior for query terms to not be present in a given
segment with enough frequency to justify the additional codec layer). A
primary key-like field is sort of the exception here, where a
TermInSetQuery can be useful for allow/block-listing semantics—and where
bloom filtering can be helpful. As an aside, given that TermInSetQuery
already has some semi-special logic for recognizing primary key-like
fields, and with this additional consideration, it makes me wonder if a
special-purpose IDSet query or something might make sense at some point.

For now though, I like the idea of leaving TermInSetQuery as-is, since
users can extend it and change the behavior of getTerms if they really need
to.

Cheers,
-Greg

On Fri, May 5, 2023 at 6:33?PM Patrick Zhai <zhai7631@gmail.com> wrote:

> Hi Greg
> IMO I still think the seekCeil is a better solution for the default
> posting format, as it could potentially save time on traversing the FST by
> doing the ping-pong skipping.
> I can see that in the case of using bloom filter the seekExact might be
> better but I'm not sure whether there is a better way than overriding the
> `getTermsEnum`...
>
> Patrick
>
> On Fri, May 5, 2023 at 4:45?PM Greg Miller <gsmiller@gmail.com> wrote:
>
>> Hi folks-
>>
>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we
>> rewrote TermInSetQuery to extend MultiTermQuery. With this change,
>> TermInSetQuery can now leverage the various "rewrite methods" available to
>> MultiTermQuery, allowing users to customize the query evaluation strategy
>> (e.g., postings vs. doc values, etc.), which was a nice win. In the
>> benchmarks we ran, we didn't see any performance issues.
>>
>> In anticipation of 9.6 releasing, I've pulled this change into the Lucene
>> snapshot we use for Amazon product search, and started running some
>> additional benchmarks, which have surfaced an interesting issue. One
>> use-case we have for TermInSetQuery creates a term disjunction over a field
>> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
>> bloom filtering can only help with seekExact and not seekCeil, we're seeing
>> a performance regression (primarily in red-line QPS).
>>
>> One way I can think to address this is to move back to a seekExact
>> approach when creating the filtered TermsEnum used by MultiTermQuery (for
>> the TermInSetQuery implementation). Because TermInSetQuery can provide all
>> of its terms up-front, we can have a simpler term intersection
>> implementation that relies on seekExact over seekCeil. Here's a quick take
>> on what I'm thinking:
>> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
>> I've tested this internally and confirmed it solves our QPS regression
>> problem.
>>
>> I'm curious if anyone has an objection to moving back to a seekExact term
>> intersection approach for TermInSetQuery, or has alternative ideas. I
>> wonder if I'm overlooking some important factors and focusing too much on
>> this specific case where the bloom filter interaction is hurting
>> performance? It seems like seekCeil could provide benefits in some cases
>> over seekExact by skipping over multiple query terms at a time, so that's a
>> possible consideration. If we solve for the most common cases by default, I
>> suppose advanced users could always override TermInSetQuery#getTermsEnum as
>> necessary (we could take this approach internally for example to work with
>> our bloom filtering if the best default is to leverage seekCeil). I can
>> easily turn my quick solution into a PR, but before I do, I wanted to poll
>> this group for thoughts on the approach or other alternatives I might be
>> overlooking. Thanks in advance!
>>
>> Cheers,
>> -Greg
>>
>
Re: TermInSetQuery: seekExact vs. seekCeil [ In reply to ]
Besides not being able to use the bloom filter, seekCeil is also just more
costly than seekExact since it is essentially both .seekExact and .next in
a single operation.

Are either of the two approaches using the intersect method of TermsEnum?
It might be faster if the number of terms is over some threshold.

It would require building an Automaton out of the set of terms, which is
fast with DaciukMihovAutomatonBuilder. Hmm, I think we should rename this
class maybe. I'll open an issue. Naming is the hardest part!

The Codec can implement this quite efficiently since it can do the
ping-pong skipping Patrick is referring to on a byte-by-byte basis in each
of the sources of Term iteration.

Mike McCandless

http://blog.mikemccandless.com


On Fri, May 5, 2023 at 9:34?PM Patrick Zhai <zhai7631@gmail.com> wrote:

> Hi Greg
> IMO I still think the seekCeil is a better solution for the default
> posting format, as it could potentially save time on traversing the FST by
> doing the ping-pong skipping.
> I can see that in the case of using bloom filter the seekExact might be
> better but I'm not sure whether there is a better way than overriding the
> `getTermsEnum`...
>
> Patrick
>
> On Fri, May 5, 2023 at 4:45?PM Greg Miller <gsmiller@gmail.com> wrote:
>
>> Hi folks-
>>
>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we
>> rewrote TermInSetQuery to extend MultiTermQuery. With this change,
>> TermInSetQuery can now leverage the various "rewrite methods" available to
>> MultiTermQuery, allowing users to customize the query evaluation strategy
>> (e.g., postings vs. doc values, etc.), which was a nice win. In the
>> benchmarks we ran, we didn't see any performance issues.
>>
>> In anticipation of 9.6 releasing, I've pulled this change into the Lucene
>> snapshot we use for Amazon product search, and started running some
>> additional benchmarks, which have surfaced an interesting issue. One
>> use-case we have for TermInSetQuery creates a term disjunction over a field
>> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
>> bloom filtering can only help with seekExact and not seekCeil, we're seeing
>> a performance regression (primarily in red-line QPS).
>>
>> One way I can think to address this is to move back to a seekExact
>> approach when creating the filtered TermsEnum used by MultiTermQuery (for
>> the TermInSetQuery implementation). Because TermInSetQuery can provide all
>> of its terms up-front, we can have a simpler term intersection
>> implementation that relies on seekExact over seekCeil. Here's a quick take
>> on what I'm thinking:
>> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
>> I've tested this internally and confirmed it solves our QPS regression
>> problem.
>>
>> I'm curious if anyone has an objection to moving back to a seekExact term
>> intersection approach for TermInSetQuery, or has alternative ideas. I
>> wonder if I'm overlooking some important factors and focusing too much on
>> this specific case where the bloom filter interaction is hurting
>> performance? It seems like seekCeil could provide benefits in some cases
>> over seekExact by skipping over multiple query terms at a time, so that's a
>> possible consideration. If we solve for the most common cases by default, I
>> suppose advanced users could always override TermInSetQuery#getTermsEnum as
>> necessary (we could take this approach internally for example to work with
>> our bloom filtering if the best default is to leverage seekCeil). I can
>> easily turn my quick solution into a PR, but before I do, I wanted to poll
>> this group for thoughts on the approach or other alternatives I might be
>> overlooking. Thanks in advance!
>>
>> Cheers,
>> -Greg
>>
>
Re: TermInSetQuery: seekExact vs. seekCeil [ In reply to ]
The better solution is to use Terms.intersect. Then the postings
format can do the right thing. But this query doesn't use
Terms.intersect today, instead doing ping-ponging itself.

That's the problem.

We must *not* tune our algorithms for amazon's search but instead what
is the best for users (default postings format).

On Fri, May 5, 2023 at 9:34?PM Patrick Zhai <zhai7631@gmail.com> wrote:
>
> Hi Greg
> IMO I still think the seekCeil is a better solution for the default posting format, as it could potentially save time on traversing the FST by doing the ping-pong skipping.
> I can see that in the case of using bloom filter the seekExact might be better but I'm not sure whether there is a better way than overriding the `getTermsEnum`...
>
> Patrick
>
> On Fri, May 5, 2023 at 4:45?PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>> Hi folks-
>>
>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery can now leverage the various "rewrite methods" available to MultiTermQuery, allowing users to customize the query evaluation strategy (e.g., postings vs. doc values, etc.), which was a nice win. In the benchmarks we ran, we didn't see any performance issues.
>>
>> In anticipation of 9.6 releasing, I've pulled this change into the Lucene snapshot we use for Amazon product search, and started running some additional benchmarks, which have surfaced an interesting issue. One use-case we have for TermInSetQuery creates a term disjunction over a field that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because bloom filtering can only help with seekExact and not seekCeil, we're seeing a performance regression (primarily in red-line QPS).
>>
>> One way I can think to address this is to move back to a seekExact approach when creating the filtered TermsEnum used by MultiTermQuery (for the TermInSetQuery implementation). Because TermInSetQuery can provide all of its terms up-front, we can have a simpler term intersection implementation that relies on seekExact over seekCeil. Here's a quick take on what I'm thinking: https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5. I've tested this internally and confirmed it solves our QPS regression problem.
>>
>> I'm curious if anyone has an objection to moving back to a seekExact term intersection approach for TermInSetQuery, or has alternative ideas. I wonder if I'm overlooking some important factors and focusing too much on this specific case where the bloom filter interaction is hurting performance? It seems like seekCeil could provide benefits in some cases over seekExact by skipping over multiple query terms at a time, so that's a possible consideration. If we solve for the most common cases by default, I suppose advanced users could always override TermInSetQuery#getTermsEnum as necessary (we could take this approach internally for example to work with our bloom filtering if the best default is to leverage seekCeil). I can easily turn my quick solution into a PR, but before I do, I wanted to poll this group for thoughts on the approach or other alternatives I might be overlooking. Thanks in advance!
>>
>> Cheers,
>> -Greg

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: TermInSetQuery: seekExact vs. seekCeil [ In reply to ]
Thanks for the feedback Robert. This approach sounds like a better path to
follow. I'll explore it. I agree that we should provide default behavior
that is overall best for our users, and not for one specific use-case such
as Amazon search :).

Mike- TermInSetQuery used to use seekExact, and now uses seekCeil. We
haven't used intersect... yet.

Thanks again for the feedback.

Cheers,
-Greg

On Tue, May 9, 2023 at 11:09?AM Michael McCandless <
lucene@mikemccandless.com> wrote:

> Besides not being able to use the bloom filter, seekCeil is also just more
> costly than seekExact since it is essentially both .seekExact and .next in
> a single operation.
>
> Are either of the two approaches using the intersect method of TermsEnum?
> It might be faster if the number of terms is over some threshold.
>
> It would require building an Automaton out of the set of terms, which is
> fast with DaciukMihovAutomatonBuilder. Hmm, I think we should rename this
> class maybe. I'll open an issue. Naming is the hardest part!
>
> The Codec can implement this quite efficiently since it can do the
> ping-pong skipping Patrick is referring to on a byte-by-byte basis in each
> of the sources of Term iteration.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, May 5, 2023 at 9:34?PM Patrick Zhai <zhai7631@gmail.com> wrote:
>
>> Hi Greg
>> IMO I still think the seekCeil is a better solution for the default
>> posting format, as it could potentially save time on traversing the FST by
>> doing the ping-pong skipping.
>> I can see that in the case of using bloom filter the seekExact might be
>> better but I'm not sure whether there is a better way than overriding the
>> `getTermsEnum`...
>>
>> Patrick
>>
>> On Fri, May 5, 2023 at 4:45?PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>>> Hi folks-
>>>
>>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we
>>> rewrote TermInSetQuery to extend MultiTermQuery. With this change,
>>> TermInSetQuery can now leverage the various "rewrite methods" available to
>>> MultiTermQuery, allowing users to customize the query evaluation strategy
>>> (e.g., postings vs. doc values, etc.), which was a nice win. In the
>>> benchmarks we ran, we didn't see any performance issues.
>>>
>>> In anticipation of 9.6 releasing, I've pulled this change into the
>>> Lucene snapshot we use for Amazon product search, and started running some
>>> additional benchmarks, which have surfaced an interesting issue. One
>>> use-case we have for TermInSetQuery creates a term disjunction over a field
>>> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
>>> bloom filtering can only help with seekExact and not seekCeil, we're seeing
>>> a performance regression (primarily in red-line QPS).
>>>
>>> One way I can think to address this is to move back to a seekExact
>>> approach when creating the filtered TermsEnum used by MultiTermQuery (for
>>> the TermInSetQuery implementation). Because TermInSetQuery can provide all
>>> of its terms up-front, we can have a simpler term intersection
>>> implementation that relies on seekExact over seekCeil. Here's a quick take
>>> on what I'm thinking:
>>> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
>>> I've tested this internally and confirmed it solves our QPS regression
>>> problem.
>>>
>>> I'm curious if anyone has an objection to moving back to a seekExact
>>> term intersection approach for TermInSetQuery, or has alternative ideas. I
>>> wonder if I'm overlooking some important factors and focusing too much on
>>> this specific case where the bloom filter interaction is hurting
>>> performance? It seems like seekCeil could provide benefits in some cases
>>> over seekExact by skipping over multiple query terms at a time, so that's a
>>> possible consideration. If we solve for the most common cases by default, I
>>> suppose advanced users could always override TermInSetQuery#getTermsEnum as
>>> necessary (we could take this approach internally for example to work with
>>> our bloom filtering if the best default is to leverage seekCeil). I can
>>> easily turn my quick solution into a PR, but before I do, I wanted to poll
>>> this group for thoughts on the approach or other alternatives I might be
>>> overlooking. Thanks in advance!
>>>
>>> Cheers,
>>> -Greg
>>>
>>
Re: TermInSetQuery: seekExact vs. seekCeil [ In reply to ]
I remember the benefits from Terms.intersect being pretty huge. Rather
than simple ping-pong, the whole monster gets handed off directly to
the codec's term dictionary implementation. For the default terms
dictionary using blocktree, this saves time seeking to terms you don't
care about (because the postingsformat is aware of the blocktree
structure). It is probably worth just prototyping on its own enough,
to see if we can get the benefits. It may turn out, you dont need a
bloom anymore after that.

On Tue, May 9, 2023 at 3:24?PM Greg Miller <gsmiller@gmail.com> wrote:
>
> Thanks for the feedback Robert. This approach sounds like a better path to follow. I'll explore it. I agree that we should provide default behavior that is overall best for our users, and not for one specific use-case such as Amazon search :).
>
> Mike- TermInSetQuery used to use seekExact, and now uses seekCeil. We haven't used intersect... yet.
>
> Thanks again for the feedback.
>
> Cheers,
> -Greg
>
> On Tue, May 9, 2023 at 11:09?AM Michael McCandless <lucene@mikemccandless.com> wrote:
>>
>> Besides not being able to use the bloom filter, seekCeil is also just more costly than seekExact since it is essentially both .seekExact and .next in a single operation.
>>
>> Are either of the two approaches using the intersect method of TermsEnum? It might be faster if the number of terms is over some threshold.
>>
>> It would require building an Automaton out of the set of terms, which is fast with DaciukMihovAutomatonBuilder. Hmm, I think we should rename this class maybe. I'll open an issue. Naming is the hardest part!
>>
>> The Codec can implement this quite efficiently since it can do the ping-pong skipping Patrick is referring to on a byte-by-byte basis in each of the sources of Term iteration.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, May 5, 2023 at 9:34?PM Patrick Zhai <zhai7631@gmail.com> wrote:
>>>
>>> Hi Greg
>>> IMO I still think the seekCeil is a better solution for the default posting format, as it could potentially save time on traversing the FST by doing the ping-pong skipping.
>>> I can see that in the case of using bloom filter the seekExact might be better but I'm not sure whether there is a better way than overriding the `getTermsEnum`...
>>>
>>> Patrick
>>>
>>> On Fri, May 5, 2023 at 4:45?PM Greg Miller <gsmiller@gmail.com> wrote:
>>>>
>>>> Hi folks-
>>>>
>>>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery can now leverage the various "rewrite methods" available to MultiTermQuery, allowing users to customize the query evaluation strategy (e.g., postings vs. doc values, etc.), which was a nice win. In the benchmarks we ran, we didn't see any performance issues.
>>>>
>>>> In anticipation of 9.6 releasing, I've pulled this change into the Lucene snapshot we use for Amazon product search, and started running some additional benchmarks, which have surfaced an interesting issue. One use-case we have for TermInSetQuery creates a term disjunction over a field that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because bloom filtering can only help with seekExact and not seekCeil, we're seeing a performance regression (primarily in red-line QPS).
>>>>
>>>> One way I can think to address this is to move back to a seekExact approach when creating the filtered TermsEnum used by MultiTermQuery (for the TermInSetQuery implementation). Because TermInSetQuery can provide all of its terms up-front, we can have a simpler term intersection implementation that relies on seekExact over seekCeil. Here's a quick take on what I'm thinking: https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5. I've tested this internally and confirmed it solves our QPS regression problem.
>>>>
>>>> I'm curious if anyone has an objection to moving back to a seekExact term intersection approach for TermInSetQuery, or has alternative ideas. I wonder if I'm overlooking some important factors and focusing too much on this specific case where the bloom filter interaction is hurting performance? It seems like seekCeil could provide benefits in some cases over seekExact by skipping over multiple query terms at a time, so that's a possible consideration. If we solve for the most common cases by default, I suppose advanced users could always override TermInSetQuery#getTermsEnum as necessary (we could take this approach internally for example to work with our bloom filtering if the best default is to leverage seekCeil). I can easily turn my quick solution into a PR, but before I do, I wanted to poll this group for thoughts on the approach or other alternatives I might be overlooking. Thanks in advance!
>>>>
>>>> Cheers,
>>>> -Greg

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org