Hi all,
I am working on a query that takes a set of terms, finds all documents
containing at least one of those terms, computes a subset of candidate docs
with the most matching terms, and applies a user-provided scoring function
to each of the candidate docs.
Simple example of the query:
- query terms ("aaa", "bbb")
- indexed docs with terms:
docId 0 has terms ("aaa", "bbb")
docId 1 has terms ("aaa", "ccc")
- number of top candidates = 1
- simple scoring function score(docId) = docId + 10
The query first builds a count array [2, 1], because docId 0 contains two
matching terms and docId 1 contains one matching term.
Then it picks docId 0 as the candidate subset.
Then it applies the scoring function, returning a score of 10 for docId 0.
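To make the intended semantics concrete, here is a toy end-to-end sketch of the steps above (plain Java collections, not Lucene; all names here are made up for illustration): count matching terms per doc, keep the top-k docs by count, then apply the scoring function.

```java
import java.util.*;
import java.util.function.IntUnaryOperator;

public class MatchAndScore {
    // Toy version of the query: count matching terms per doc, keep the
    // k docs with the highest counts, then score each kept doc.
    static Map<Integer, Integer> run(List<Set<String>> docs,
                                     Set<String> queryTerms,
                                     int k,
                                     IntUnaryOperator score) {
        // Counting step: this produces the [2, 1] array from the example.
        int[] counts = new int[docs.size()];
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String t : queryTerms) {
                if (docs.get(docId).contains(t)) counts[docId]++;
            }
        }
        // Candidate step: pick the k docIds with the highest counts.
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) ids.add(i);
        ids.sort((a, b) -> counts[b] - counts[a]);
        // Scoring step: apply the user-provided function to each candidate.
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < Math.min(k, ids.size()); i++) {
            result.put(ids.get(i), score.applyAsInt(ids.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
            Set.of("aaa", "bbb"),  // docId 0: two matches -> count 2
            Set.of("aaa", "ccc")); // docId 1: one match   -> count 1
        // k = 1 and score(docId) = docId + 10, as in the example.
        System.out.println(run(docs, Set.of("aaa", "bbb"), 1, d -> d + 10));
        // prints {0=10}
    }
}
```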
The main bottleneck right now is doing the initial counting, i.e. the part
that returns the [2, 1] array.
I first started by using a BooleanQuery containing a SHOULD clause for every
term, so the returned score was the count. This was simple but very slow.
Then I got a substantial speedup by copying and modifying the
TermInSetQuery so that it tracks the number of times each docId contains a
query term. The main construct here seems to be PrefixCodedTerms.
At this point I'm not sure if there's a faster construct, or a better way
to use PrefixCodedTerms?
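For reference, the shape of the per-segment counting loop I mean is roughly the following (a stdlib-only simulation, not my actual code; the map stands in for the segment's term dictionary, where Lucene would use TermsEnum.seekExact and a PostingsEnum, and the sorted iteration mirrors what PrefixCodedTerms' TermIterator provides):

```java
import java.util.*;

public class SegmentCounter {
    // Simulated inverted index for one segment: term -> sorted docId
    // postings. In Lucene this would be the terms dictionary + postings.
    static short[] countHits(SortedMap<String, int[]> postings,
                             SortedSet<String> queryTerms, int maxDoc) {
        short[] counts = new short[maxDoc];
        // Both the query terms and the term dictionary are sorted, so the
        // term lookups advance monotonically -- the property that sorted
        // PrefixCodedTerms iteration gives you.
        for (String term : queryTerms) {
            int[] docs = postings.get(term);
            if (docs == null) continue; // term absent from this segment
            // Walk the term's postings, incrementing each doc's counter.
            for (int docId : docs) counts[docId]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("aaa", new int[] {0, 1});
        postings.put("bbb", new int[] {0});
        postings.put("ccc", new int[] {1});
        short[] counts =
            countHits(postings, new TreeSet<>(Set.of("aaa", "bbb")), 2);
        System.out.println(Arrays.toString(counts)); // prints [2, 1]
    }
}
```

The cost is one postings walk per query term, which is why this step dominates once the term set or the postings lists get large.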
Here is the query itself, with links to the relevant parts of the code:
- Build the PrefixCodedTerms (in my case the terms are called 'hashes'):
https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
- Count the matching terms in a segment (this is the main bottleneck in my
query):
https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
I appreciate any suggestions you might have.
- Alex