Hi all,
I am working on a query that takes a set of terms, finds all documents
containing at least one of those terms, computes a subset of candidate docs
with the most matching terms, and applies a user-provided scoring function
to each of the candidate docs.
Simple example of the query:
- query terms ("aaa", "bbb")
- indexed docs with terms:
docId 0 has terms ("aaa", "bbb")
docId 1 has terms ("aaa", "ccc")
- number of top candidates = 1
- simple scoring function score(docId) = docId + 10
The query first builds a count array [2, 1], because docId 0 contains two
matching terms and docId 1 contains one matching term.
Then it picks docId 0 as the candidate subset.
Then it applies the scoring function, returning a score of 10 for docId 0.
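To make the intended semantics concrete, here is a toy end-to-end sketch of the steps above (plain Java collections, not Lucene; all names here are made up for illustration): count matching terms per doc, keep the top-k docs by count, then apply the scoring function.

```java
import java.util.*;
import java.util.function.IntUnaryOperator;

public class MatchAndScore {
    // Toy version of the query: count matching terms per doc, keep the
    // k docs with the highest counts, then score each kept doc.
    static Map<Integer, Integer> run(List<Set<String>> docs,
                                     Set<String> queryTerms,
                                     int k,
                                     IntUnaryOperator score) {
        // Counting step: this produces the [2, 1] array from the example.
        int[] counts = new int[docs.size()];
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String t : queryTerms) {
                if (docs.get(docId).contains(t)) counts[docId]++;
            }
        }
        // Candidate step: pick the k docIds with the highest counts.
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) ids.add(i);
        ids.sort((a, b) -> counts[b] - counts[a]);
        // Scoring step: apply the user-provided function to each candidate.
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < Math.min(k, ids.size()); i++) {
            result.put(ids.get(i), score.applyAsInt(ids.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
            Set.of("aaa", "bbb"),  // docId 0: two matches -> count 2
            Set.of("aaa", "ccc")); // docId 1: one match   -> count 1
        // k = 1 and score(docId) = docId + 10, as in the example.
        System.out.println(run(docs, Set.of("aaa", "bbb"), 1, d -> d + 10));
        // prints {0=10}
    }
}
```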
The main bottleneck right now is doing the initial counting, i.e. the part
that returns the [2, 1] array.
I first started by using a BooleanQuery containing a SHOULD clause for every
term, so the returned score was the count. This was simple but very slow.
Then I got a substantial speedup by copying and modifying the
TermInSetQuery so that it tracks the number of times each docId contains a
query term. The main construct here seems to be PrefixCodedTerms.
At this point I'm not sure if there's a faster construct, or a better way
to use PrefixCodedTerms?
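For reference, the shape of the per-segment counting loop I mean is roughly the following (a stdlib-only simulation, not my actual code; the map stands in for the segment's term dictionary, where Lucene would use TermsEnum.seekExact and a PostingsEnum, and the sorted iteration mirrors what PrefixCodedTerms' TermIterator provides):

```java
import java.util.*;

public class SegmentCounter {
    // Simulated inverted index for one segment: term -> sorted docId
    // postings. In Lucene this would be the terms dictionary + postings.
    static short[] countHits(SortedMap<String, int[]> postings,
                             SortedSet<String> queryTerms, int maxDoc) {
        short[] counts = new short[maxDoc];
        // Both the query terms and the term dictionary are sorted, so the
        // term lookups advance monotonically -- the property that sorted
        // PrefixCodedTerms iteration gives you.
        for (String term : queryTerms) {
            int[] docs = postings.get(term);
            if (docs == null) continue; // term absent from this segment
            // Walk the term's postings, incrementing each doc's counter.
            for (int docId : docs) counts[docId]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("aaa", new int[] {0, 1});
        postings.put("bbb", new int[] {0});
        postings.put("ccc", new int[] {1});
        short[] counts =
            countHits(postings, new TreeSet<>(Set.of("aaa", "bbb")), 2);
        System.out.println(Arrays.toString(counts)); // prints [2, 1]
    }
}
```

The cost is one postings walk per query term, which is why this step dominates once the term set or the postings lists get large.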
Here is the query itself, with links to the relevant parts of the code:
- Build the PrefixCodedTerms (in my case the terms are called 'hashes'):
https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
- Count the matching terms in a segment (this is the main bottleneck in my
query):
https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
I appreciate any suggestions you might have.
- Alex