Ah. That makes sense. Thanks!
(I might re-run on a larger index just to learn how it works in more detail)
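
For my own reference, here is a rough sketch that converts the JMH throughput
numbers quoted below into time per operation. The class and method names are
made up for illustration and are not part of the benchmark. Both variants come
out at a few microseconds per call (roughly 5.2 and 9.0), which seems
consistent with fixed per-query overhead dominating on this small index.

```java
import java.util.Locale;

final class PerOpLatency {

    /** Converts a throughput in operations per second to microseconds per operation. */
    public static double microsPerOp(final double opsPerSecond) {
        return 1_000_000.0 / opsPerSecond;
    }

    public static void main(final String[] args) {
        // Throughput scores copied from the JMH output quoted below (ops/s).
        final double termsOps = 190820.510;     // benchmarkTerms
        final double termInSetOps = 110548.345; // benchmarkTermsInSet

        System.out.printf(Locale.ROOT, "Terms/SHOULD:   %.2f us/op%n", microsPerOp(termsOps));
        System.out.printf(Locale.ROOT, "TermInSetQuery: %.2f us/op%n", microsPerOp(termInSetOps));
        // Absolute gap per call; this says nothing about how either query scales.
        System.out.printf(Locale.ROOT, "Difference:     %.2f us/op%n",
                microsPerOp(termInSetOps) - microsPerOp(termsOps));
    }
}
```

A gap of a few microseconds per call would be negligible if actual query
execution took milliseconds, which is why a larger index should be more telling.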
On Tue, Oct 13, 2020 at 1:24 PM Adrien Grand <jpountz@gmail.com> wrote:
> 100,000+ requests per core per second is a lot. :) My initial reaction is
> that the query is likely so fast on that index that the bottleneck might be
> rewriting or the initialization of weights/scorers (which don't get more
> costly as the index gets larger) rather than actual query execution, which
> means that we can't really conclude that the boolean query is faster than
> the TermInSetQuery.
>
> Also beware that IndexSearcher#count will look at index statistics if your
> queries have a single term, which would no longer work if you use this
> query as a filter for another query.
>
> On Tue, Oct 13, 2020 at 12:51 PM Rob Audenaerde <rob.audenaerde@gmail.com>
> wrote:
>
> > I reduced the benchmark as far as I could, and now got these results,
> > TermsInSet being a lot slower compared to the Terms/SHOULD.
> >
> >
> > BenchmarkOrQuery.benchmarkTerms       thrpt  5  190820.510 ± 16667.411  ops/s
> > BenchmarkOrQuery.benchmarkTermsInSet  thrpt  5  110548.345 ±  7490.169  ops/s
> >
> >
> > @Fork(1)
> > @Measurement(iterations = 5, time = 10)
> > @OutputTimeUnit(TimeUnit.SECONDS)
> > @Warmup(iterations = 3, time = 1)
> > @Benchmark
> > public void benchmarkTerms(final MyState myState) {
> >     try {
> >         final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
> >         final BooleanQuery.Builder b = new BooleanQuery.Builder();
> >
> >         for (final String role : myState.user.getAdditionalRoles()) {
> >             b.add(new TermQuery(new Term(roles, new BytesRef(role))),
> >                     BooleanClause.Occur.SHOULD);
> >         }
> >         searcher.count(b.build());
> >
> >     } catch (final IOException e) {
> >         e.printStackTrace();
> >     }
> > }
> >
> > @Fork(1)
> > @Measurement(iterations = 5, time = 10)
> > @OutputTimeUnit(TimeUnit.SECONDS)
> > @Warmup(iterations = 3, time = 1)
> > @Benchmark
> > public void benchmarkTermsInSet(final MyState myState) {
> >     try {
> >         final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
> >         final Set<BytesRef> roles = myState.user.getAdditionalRoles().stream()
> >                 .map(BytesRef::new)
> >                 .collect(Collectors.toSet());
> >         searcher.count(new TermInSetQuery(BenchmarkOrQuery.roles, roles));
> >
> >     } catch (final IOException e) {
> >         e.printStackTrace();
> >     }
> > }
> >
> >
> > On Tue, Oct 13, 2020 at 11:56 AM Rob Audenaerde <rob.audenaerde@gmail.com>
> > wrote:
> >
> > > Hello Adrien,
> > >
> > > Thanks for the swift reply. I'll add the details:
> > >
> > > Lucene version: 8.6.2
> > >
> > > The restrictionQuery is indeed a conjunction; it also allows a document
> > > to be a hit if the 'roles' field is empty. It's used within a bigger
> > > query builder, so maybe I did something else wrong. I'll rewrite the
> > > benchmark to benchmark just the TermInSet and Terms variants.
> > >
> > > It never occurred (hah) to me to use Occur.FILTER; that is a good point
> > > to check as well.
> > >
> > > As you put it, I would expect the results to be very similar, as I do
> > > not reach the 16 terms in the TermInSetQuery. I'll let you know what I
> > > find.
> > >
> > > On Tue, Oct 13, 2020 at 11:48 AM Adrien Grand <jpountz@gmail.com> wrote:
> > >
> > >> Can you give us a few more details:
> > >> - What version of Lucene are you testing?
> > >> - Are you benchmarking "restrictionQuery" on its own, or its
> > >> conjunction with another query?
> > >>
> > >> You mentioned that you combine your "restrictionQuery" and the user
> > >> query with Occur.MUST. Occur.FILTER feels more appropriate for
> > >> "restrictionQuery" since it should not contribute to scoring.
> > >>
> > >> TermInSetQuery automatically executes like a BooleanQuery when the
> > >> number of clauses is less than 16, so I would not expect major
> > >> performance differences between a TermInSetQuery over fewer than 16
> > >> terms and a BooleanQuery wrapped in a ConstantScoreQuery.
> > >>
> > >> On Tue, Oct 13, 2020 at 11:35 AM Rob Audenaerde
> > >> <rob.audenaerde@gmail.com> wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > I'm benchmarking an application which implements security on Lucene
> > >> > by adding a multi-value field "roles". If the user has one of these
> > >> > roles, he can find the document.
> > >> >
> > >> > I implemented this as a Boolean AND query: I added the original
> > >> > query and the restriction, both with Occur.MUST.
> > >> >
> > >> > I'm having some performance issues when counting on the index (>60M
> > >> > docs), so I thought about tweaking this restriction implementation.
> > >> >
> > >> > I set up a benchmark like this:
> > >> >
> > >> > I generate 2M documents. Each document has a multi-value "roles"
> > >> > field. The "roles" field in each document has 4 values, taken from
> > >> > (2,2,1000,100) unique values.
> > >> > The user has (1,1,2,1) values for roles (so, 1 out of the 2 for the
> > >> > first role, 1 out of 2 for the second, 2 out of the 1000 for the
> > >> > third value, and 1 out of 100 for the fourth).
> > >> >
> > >> > I got a somewhat unexpected performance difference. At first, I
> > >> > implemented the restriction query like this:
> > >> >
> > >> > for (final String role : roles) {
> > >> >     restrictionQuery.add(
> > >> >             new TermQuery(new Term("roles", new BytesRef(role))),
> > >> >             Occur.SHOULD);
> > >> > }
> > >> >
> > >> > I then switched to a TermInSetQuery, which I thought would be faster
> > >> > as it is using constant-scores.
> > >> >
> > >> > final Set<BytesRef> rolesSet =
> > >> >         roles.stream().map(BytesRef::new).collect(Collectors.toSet());
> > >> > restrictionQuery.add(new TermInSetQuery("roles", rolesSet),
> > >> >         Occur.SHOULD);
> > >> >
> > >> >
> > >> > However, the TermInSetQuery achieves about 25% fewer ops/s. Is that
> > >> > to be expected? I did not expect it, as I thought the
> > >> > constant-scoring would be faster.
> > >> >
> > >>
> > >>
> > >> --
> > >> Adrien
> > >>
> > >
> >
>
>
> --
> Adrien
>