Hello all,
I'm working on an Elasticsearch plugin (using Lucene internally) that
allows users to index numerical vectors and run exact and approximate
k-nearest-neighbors similarity queries.
I'd like to get some feedback about my usage of BooleanQueries and
TermQueries, and see if there are any optimizations or performance tricks
for my use case.
An example use case for the plugin is reverse image search. A user can
store vectors representing images and run a nearest-neighbors query to
retrieve the 10 vectors with the smallest L2 distance to a query vector.
More detailed documentation here: http://elastiknn.klibisz.com/
The main method for indexing the vectors is based on Locality Sensitive
Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
The general pattern is:
1. When indexing a vector, apply a hash function to it, producing a set
of discrete hashes. Usually there are anywhere from 100 to 1000 hashes.
Similar vectors are more likely to share hashes (i.e., similar vectors
produce hash collisions).
2. Convert each hash to a byte array and store the byte array as a
Lucene Term at a specific field.
3. Store the complete vector (i.e. floating point numbers) in a binary
doc values field.
In other words, I'm converting each vector into a bag of words, though the
words have no semantic meaning.
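The hashing-to-terms pattern above can be sketched in plain Java. This is a minimal random-hyperplane LSH (one sign bit per hyperplane, each (hash index, sign) pair encoded as a small byte array "term"); the projection scheme, hash count, and byte encoding here are illustrative stand-ins, not the plugin's actual functions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal random-hyperplane LSH sketch: each hash is the sign of the
// vector's dot product with one hyperplane. Similar vectors fall on the
// same side of most hyperplanes, so they share most terms (collide).
public class LshSketch {

    // How the hyperplanes would be drawn in practice (shown for
    // completeness; the demo below uses fixed planes for reproducibility).
    static float[][] hyperplanes(int numHashes, int dims, long seed) {
        Random rng = new Random(seed);
        float[][] planes = new float[numHashes][dims];
        for (float[] p : planes)
            for (int i = 0; i < dims; i++) p[i] = (float) rng.nextGaussian();
        return planes;
    }

    // Hash a vector into one small byte-array "term" per hyperplane.
    static List<byte[]> hash(float[] vec, float[][] planes) {
        List<byte[]> terms = new ArrayList<>();
        for (int h = 0; h < planes.length; h++) {
            float dot = 0f;
            for (int i = 0; i < vec.length; i++) dot += vec[i] * planes[h][i];
            // Encode (hash index, sign bit) as the term's bytes.
            terms.add(new byte[] { (byte) h, (byte) (dot >= 0 ? 1 : 0) });
        }
        return terms;
    }

    // Count how many terms two hash sets have in common.
    static int shared(List<byte[]> x, List<byte[]> y) {
        int n = 0;
        for (int i = 0; i < x.size(); i++) if (x.get(i)[1] == y.get(i)[1]) n++;
        return n;
    }

    public static void main(String[] args) {
        // Fixed hyperplanes for a reproducible demo.
        float[][] planes = { {1f, 0f, 0f}, {0f, 1f, 0f}, {1f, 1f, 0f}, {1f, -1f, 0f} };
        List<byte[]> a = hash(new float[] {1f, 0.1f, 0f}, planes);   // reference vector
        List<byte[]> b = hash(new float[] {0.9f, 0.2f, 0f}, planes); // similar vector
        List<byte[]> c = hash(new float[] {-1f, -0.1f, 0f}, planes); // dissimilar vector
        System.out.println(shared(a, b) + " " + shared(a, c)); // prints "4 0"
    }
}
```

Each of those byte arrays becomes one indexed Term, so a vector with 400 hashes becomes a document with 400 terms.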
A query works as follows:
1. Given a query vector, apply the same hash function to produce a set
of hashes.
2. Convert each hash to a byte array and create a Term.
3. Build and run a BooleanQuery with a SHOULD clause for each Term. Each clause
looks like this: `new BooleanClause(new ConstantScoreQuery(new
TermQuery(new Term(field, new BytesRef(hashValue.toByteArray)))),
BooleanClause.Occur.SHOULD)`.
4. As the BooleanQuery produces results, maintain a fixed-size min-heap of
its scores. For any score exceeding the current minimum in the heap, load
that document's vector from the binary doc values, compute the exact
similarity, and update the heap. All other vectors get a score of 0.
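The heap maintenance in step 4 can be sketched as follows (plain Java; the doc-values lookup and exact similarity computation are elided, and this is a simplification of the plugin's actual logic in LshQuery.scala):

```java
import java.util.PriorityQueue;

// Sketch of step 4: a fixed-size min-heap keeps the k best scores seen so
// far. Only candidates that beat the current minimum would trigger the
// doc-values lookup and exact similarity computation; everything else is
// effectively scored 0.
public class TopKSketch {

    static PriorityQueue<Double> topK(double[] scores, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap by natural order
        for (double s : scores) {
            if (heap.size() < k) {
                heap.offer(s);
            } else if (s > heap.peek()) {
                // Only at this point would the real query load the stored
                // vector from binary doc values and compute the exact
                // similarity before updating the heap.
                heap.poll();
                heap.offer(s);
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        PriorityQueue<Double> heap = topK(new double[] {3, 1, 4, 1, 5, 9, 2, 6}, 3);
        System.out.println(heap.peek()); // prints 5.0: the smallest of the top 3 (5, 6, 9)
    }
}
```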
When profiling my benchmarks with VisualVM, I've found the Elasticsearch
search threads spend > 50% of the runtime in these two methods:
- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of runtime)
- org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8% of
runtime)
So the time seems to be dominated by collecting and ordering the results
produced by the BooleanQuery from step 3 above.
The exact similarity computation is only about 15% of the runtime. If I
disable it entirely, I still see the same bottlenecks in VisualVM.
Runtime scales roughly linearly with the number of hashes (e.g., 400
hashes take ~2x longer than 200 hashes).
The use case seems different from text search in that the terms have no
semantic meaning: their length, ordering, stems, etc. are irrelevant.
I basically just need the index to be a rudimentary HashMap, and I only
care about the scores for the top k results.
With that in mind, I've made the following optimizations:
- Disabled tokenization on the FieldType (setTokenized(false))
- Disabled norms on the FieldType (setOmitNorms(true))
- Set similarity to BooleanSimilarity on the Elasticsearch
MappedFieldType
- Set index options to IndexOptions.DOCS.
- Used the MoreLikeThis heuristic to pick a subset of terms. This
understandably only yields a speedup proportional to the number of
discarded terms.
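For reference, the FieldType settings above would look like this in plain Lucene (the plugin applies them through its Elasticsearch MappedFieldType, so this is a sketch of the equivalent configuration, not the plugin's actual code):

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

FieldType hashFieldType = new FieldType();
hashFieldType.setTokenized(false);                // hashes are opaque terms, not analyzable text
hashFieldType.setOmitNorms(true);                 // no length normalization for constant-score terms
hashFieldType.setIndexOptions(IndexOptions.DOCS); // doc ids only; no freqs, positions, or offsets
hashFieldType.freeze();
```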
I'm using Elasticsearch version 7.6.2 with Lucene 8.4.0.
The main query implementation is here:
<https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala>
The actual query that gets executed by Elasticsearch is instantiated on line 98:
<https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala#L98>
It's in Scala but all of the Java query classes should look familiar.
Maybe there are some settings that I'm not aware of?
Maybe I could optimize this by implementing a custom query or scorer?
Maybe there's just no way to speed this up?
I appreciate any input, examples, links, etc. :)
Also, let me know if I can provide any additional details.
Thanks,
Alex Klibisz