Mailing List Archive

Optimizing a boolean query for 100s of term clauses
Hello all,

I'm working on an Elasticsearch plugin (using Lucene internally) that
allows users to index numerical vectors and run exact and approximate
k-nearest-neighbors similarity queries.
I'd like to get some feedback about my usage of BooleanQueries and
TermQueries, and see if there are any optimizations or performance tricks
for my use case.

An example use case for the plugin is reverse image search. A user can
store vectors representing images and run a nearest-neighbors query to
retrieve the 10 vectors with the smallest L2 distance to a query vector.
More detailed documentation here: http://elastiknn.klibisz.com/

The main method for indexing the vectors is based on Locality Sensitive
Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
The general pattern is:

1. When indexing a vector, apply a hash function to it, producing a set
of discrete hashes. Usually there are anywhere from 100 to 1000 hashes.
Similar vectors are more likely to share hashes (i.e., similar vectors
produce hash collisions).
2. Convert each hash to a byte array and store the byte array as a
Lucene Term at a specific field.
3. Store the complete vector (i.e. floating point numbers) in a binary
doc values field.

In other words, I'm converting each vector into a bag of words, though the
words have no semantic meaning.

A query works as follows:

1. Given a query vector, apply the same hash function to produce a set
of hashes.
2. Convert each hash to a byte array and create a Term.
   3. Build and run a BooleanQuery with a clause for each Term. Each clause
   looks like this (see the sketch after this list): `new BooleanClause(new
   ConstantScoreQuery(new TermQuery(new Term(field, new
   BytesRef(hashValue.toByteArray)))), BooleanClause.Occur.SHOULD)`.
4. As the BooleanQuery produces results, maintain a fixed-size heap of
its scores. For any score exceeding the min in the heap, load its vector
from the binary doc values, compute the exact similarity, and update the
heap. Otherwise the vector gets a score of 0.
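Putting steps 2 and 3 together, the query construction looks roughly like this
(a simplified Java sketch; the field name and the `hashes` list are
placeholders, not the plugin's actual API):

```java
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

// Simplified sketch of steps 2-3: one constant-score SHOULD clause per hash.
final class HashQuerySketch {
  static Query build(String field, List<byte[]> hashes) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (byte[] hash : hashes) {
      Query term = new TermQuery(new Term(field, new BytesRef(hash)));
      builder.add(new ConstantScoreQuery(term), BooleanClause.Occur.SHOULD);
    }
    return builder.build();
  }
}
```

With up to ~1000 hashes per query this sits just under Lucene's default limit
of 1024 boolean clauses.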

When profiling my benchmarks with VisualVM, I've found the Elasticsearch
search threads spend > 50% of the runtime in these two methods:

- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of runtime)
- org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8% of
runtime)

So the time seems to be dominated by collecting and ordering the results
produced by the BooleanQuery from step 3 above.
The exact similarity computation is only about 15% of the runtime. If I
disable it entirely, I still see the same bottlenecks in VisualVM.
Reducing the number of hashes yields roughly linear scaling (i.e., 400
hashes take ~2x longer than 200 hashes).

The use case seems different to text search in that there's no semantic
meaning to the terms, their length, their ordering, their stems, etc.
I basically just need the index to be a rudimentary HashMap, and I only
care about the scores for the top k results.
With that in mind, I've made the following optimizations (a sketch of how they
map to plain Lucene follows the list):

- Disabled tokenization on the FieldType (setTokenized(false))
- Disabled norms on the FieldType (setOmitNorms(true))
   - Set similarity to BooleanSimilarity on the Elasticsearch
   MappedFieldType
   - Set index options to IndexOptions.DOCS.
- Used the MoreLikeThis heuristic to pick a subset of terms. This
understandably only yields a speedup proportional to the number of
discarded terms.
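Roughly how the indexing-side settings above translate to plain Lucene (a
sketch only; the field names and vector serialization are placeholders, and in
the plugin the BooleanSimilarity is configured on the Elasticsearch field type
rather than shown here):

```java
import java.util.List;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.util.BytesRef;

// Sketch of the field configuration described above.
final class HashFieldSketch {
  static final FieldType HASH_FIELD_TYPE = new FieldType();
  static {
    HASH_FIELD_TYPE.setTokenized(false);                // hashes are opaque terms, not analyzed text
    HASH_FIELD_TYPE.setOmitNorms(true);                 // no length normalization
    HASH_FIELD_TYPE.setIndexOptions(IndexOptions.DOCS); // doc ids only, no freqs/positions
    HASH_FIELD_TYPE.freeze();
  }

  static Document toDocument(List<byte[]> hashes, byte[] serializedVector) {
    Document doc = new Document();
    for (byte[] hash : hashes) {
      doc.add(new Field("vec_hashes", new BytesRef(hash), HASH_FIELD_TYPE));
    }
    doc.add(new BinaryDocValuesField("vec_raw", new BytesRef(serializedVector)));
    return doc;
  }
}
```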

I'm using Elasticsearch version 7.6.2 with Lucene 8.4.0.
The main query implementation is here:
https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala
The actual query that gets executed by Elasticsearch is instantiated on line 98:
https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala#L98
It's in Scala but all of the Java query classes should look familiar.

Maybe there are some settings that I'm not aware of?
Maybe I could optimize this by implementing a custom query or scorer?
Maybe there's just no way to speed this up?

I appreciate any input, examples, links, etc. :)
Also, let me know if I can provide any additional details.

Thanks,
Alex Klibisz
Re: Optimizing a boolean query for 100s of term clauses
You might consider using a TermInSetQuery in place of a BooleanQuery
for the hashes (since they are all in the same field).
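Roughly (just a sketch based on your description; the field name and hash
bytes are placeholders):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

// Sketch: a single TermInSetQuery over all hash terms in one field.
static Query hashesQuery(String field, List<byte[]> hashes) {
  List<BytesRef> terms = new ArrayList<>(hashes.size());
  for (byte[] hash : hashes) {
    terms.add(new BytesRef(hash));
  }
  return new TermInSetQuery(field, terms);
}
```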

I don't really understand why you are seeing so much cost in the heap
- it sounds as if you have a single heap with mixed scores - those
generated by the BooleanQuery and those generated by the vector
scoring operation. Maybe you can comment a little more on the interaction
there - are there really two heaps? Do you override the standard
collector?

Re: Optimizing a boolean query for 100s of term clauses
Hi Michael,
Thanks for the quick response!

I will look into the TermInSetQuery.

My usage of "heap" might've been confusing.
I'm using a FunctionScoreQuery from Elasticsearch.
This gets instantiated with a Lucene query, in this case the boolean query
as I described it, as well as a custom ScoreFunction object.
The ScoreFunction exposes a single method that takes a doc id and the
BooleanQuery score for that doc id, and returns another score.
In that method I use a MinMaxPriorityQueue from the Guava library to
maintain a fixed-capacity subset of the highest-scoring docs and evaluate
exact similarity on them.
Once the queue is at capacity, I just return 0 for any docs that had a
boolean query score smaller than the min in the queue.
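In sketch form, the ScoreFunction's method does something like this (using a
plain java.util.PriorityQueue here instead of Guava's MinMaxPriorityQueue, and
exactSimilarity() standing in for the doc-values read plus exact computation;
the real code is in the linked LshQuery.scala):

```java
import java.util.PriorityQueue;

// Sketch of the rescoring logic described above (not the plugin's actual code).
final class RescoreSketch {
  private final int capacity;
  private final PriorityQueue<Float> topApproxScores = new PriorityQueue<>(); // min-heap

  RescoreSketch(int capacity) {
    this.capacity = capacity;
  }

  float score(int docId, float approxScore) {
    if (topApproxScores.size() < capacity) {
      topApproxScores.add(approxScore);
      return exactSimilarity(docId);
    } else if (approxScore > topApproxScores.peek()) {
      topApproxScores.poll();           // evict the smallest approximate score
      topApproxScores.add(approxScore);
      return exactSimilarity(docId);
    } else {
      return 0f;                        // not competitive: skip the exact computation
    }
  }

  private float exactSimilarity(int docId) {
    // Placeholder: read the vector from binary doc values and compute the exact score.
    return 0f;
  }
}
```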

But you can actually forget entirely that this ScoreFunction exists. It
only contributes ~6% of the runtime.
Even if I only use the BooleanQuery by itself, I still see the same
behavior and bottlenecks.

Thanks
- AK


Re: Optimizing a boolean query for 100s of term clauses
The TermInSetQuery is definitely faster. Unfortunately it doesn't seem to
return the number of terms that matched in a given document. Rather it just
returns the boost value. I'll look into copying/modifying the internals to
return the number of matched terms.

Thanks
- AK

Re: Optimizing a boolean query for 100s of term clauses
Yeah, that will require some changes, since what it does currently is to
maintain a bitset and OR into it repeatedly (once for each term's
docs). To maintain counts, you'd need a counter per doc (rather than a
bit), and you might lose some of the speed...
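Conceptually something like this, instead of the OR-into-a-bitset loop (a
sketch, not the actual TermInSetQuery internals):

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

// Sketch: bump a per-doc counter for every query term whose postings contain the doc.
static short[] countMatches(LeafReader reader, String field, List<BytesRef> queryTerms) throws IOException {
  short[] counts = new short[reader.maxDoc()];
  Terms terms = reader.terms(field);
  if (terms == null) {
    return counts;
  }
  TermsEnum termsEnum = terms.iterator();
  PostingsEnum postings = null;
  for (BytesRef term : queryTerms) {
    if (termsEnum.seekExact(term)) {
      postings = termsEnum.postings(postings, PostingsEnum.NONE);
      for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
        counts[doc]++;
      }
    }
  }
  return counts;
}
```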

Re: Optimizing a boolean query for 100s of term clauses
On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> I'm working on an Elasticsearch plugin (using Lucene internally) that
> allows users to index numerical vectors and run exact and approximate
> k-nearest-neighbors similarity queries.

Quite a coincidence. I'm looking into the same thing :-)

> 1. When indexing a vector, apply a hash function to it, producing
> a set of discrete hashes. Usually there are anywhere from 100 to 1000
> hashes.

Is it important to have "infinite" scaling with inverted index or is it
acceptable to have a (fast) sequential scan through all documents? If
the use case is to combine the nearest neighbour search with other
filters, so that the effective search-space is relatively small, you
could go directly to computing the Euclidean distance (or whatever you
use to calculate the exact similarity score).

> 4. As the BooleanQuery produces results, maintain a fixed-size
> heap of its scores. For any score exceeding the min in the heap, load
> its vector from the binary doc values, compute the exact similarity,
> and update the heap.

I did something quite similar for a non-Lucene-based proof of concept,
except that I delayed the exact similarity calculation and over-
collected on the heap.

Fleshing that out: Instead of producing similarity hashes, I extracted
the top-X strongest signals (entries in the vector) and stored them as
indexes from the raw vector, so the top-3 signals from [10, 3, 6, 12,
1, 20] are [0, 3, 5]. The query was similar to your "match as many as
possible", just with indexes instead of hashes.

> - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> runtime)

This sounds strange. How large is your queue? Object-based priority
queues tend to become slow when they get large (100K+ values).

> Maybe I could optimize this by implementing a custom query or scorer?

My plan for a better implementation is to use an autoencoder to produce
a condensed representation of the raw vector for a document. In order
to do so, a network must be trained on (ideally) the full corpus, so it
will require a bootstrap process and will probably work poorly if
incoming vectors differ substantially in nature from the existing ones
(at least until the autoencoder is retrained and the condensed
representations are reindexed). As our domain is an ever-growing
image collection with fairly well-defined types of images (portraits, line
drawings, maps...) and since new types are introduced rarely, this is
acceptable for us.

Back to Lucene, the condensed representation is expected to be a bitmap
where the (coarse) similarity between two representations is simply the
number of set bits at the same locations: An AND and a POPCNT of the
bitmaps.
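Per document pair that comparison is just (sketch):

```java
// Sketch: coarse similarity = popcount of the AND of two bitmaps stored as long[].
static int coarseSimilarity(long[] a, long[] b) {
  int sharedBits = 0;
  for (int i = 0; i < a.length; i++) {
    sharedBits += Long.bitCount(a[i] & b[i]); // AND + POPCNT per 64-bit word
  }
  return sharedBits;
}
```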

This does imply a sequential pass of all potential documents, which
means that it won't scale well. On the other hand each comparison is a
fast check with very low memory overhead, so I hope it will work for
the few million images that we need it for.

- Toke Eskildsen, Royal Danish Library



Re: Optimizing a boolean query for 100s of term clauses
Thanks Michael. I managed to translate the TermInSetQuery into Scala
yesterday so now I can modify it in my codebase. This seems promising so
far. Fingers crossed there's a way to maintain scores without basically
converging to the BooleanQuery implementation.
- AK

Re: Optimizing a boolean query for 100s of term clauses
Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem
space!

My implementation isn't specific to any particular dataset or access
pattern (i.e. infinite vs. subset).
So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular
similarities with LSH variants for all but L1.
My exact implementation is generally faster than the approximate LSH
implementation, hence the thread.
You make a good point that this is valuable by itself if you're able to
filter down to a small subset of docs.
I put a lot of work into optimizing the vector serialization speed and the
exact query execution.
I imagine with my current implementation there is some breaking point where
LSH becomes faster than exact, but so far I've tested with ~1.2M
~300-dimensional vectors and exact is still faster, especially when
parallelized across many shards.
So speeding up LSH is the current engineering challenge.

Are you using Elasticsearch or Lucene directly?
If you're using ES and have the time, I'd love some feedback on my plugin.
It sounds like you want to compute Hamming similarity on your bitmaps?
If so that's currently supported.
There's an example here:
http://demo.elastiknn.klibisz.com/dataset/mnist-hamming?queryId=64121

Also I've compiled a small literature review on some related research here:
https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit
*Fast and Exact NNS in Hamming Space on Full-Text Search Engines* describes
some clever tricks to speed up Hamming similarity.
*Large Scale Image Retrieval with Elasticsearch* describes the idea of
using the largest absolute magnitude values instead of the full vector.
Perhaps you've already read them but I figured I'd share.

Cheers
- AK



Re: Optimizing a boolean query for 100s of term clauses
Hi Alex,

I had worked on a similar problem directly on Lucene (within the Anserini
toolkit) using LSH fingerprints of tokenized feature vector values.
You can find code at [1] and some information on the Anserini documentation
page [2] and in a short preprint [3].
As a side note, my current thinking is that it would be very cool if we
could leverage Lucene's N-dimensional point support by properly reducing the
dimensionality of the original vectors; however, that is hard to do without
losing important information.

My 2 cents,
Tommaso

[1] :
https://github.com/castorini/anserini/tree/master/src/main/java/io/anserini/ann
[2] :
https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md
[3] : https://arxiv.org/abs/1910.10208





Re: Optimizing a boolean query for 100s of term clauses
On Wed, 2020-06-24 at 13:46 -0400, Alex K wrote:
> My implementation isn't specific to any particular dataset or access
> pattern (i.e. infinite vs. subset).

Without a clearly defined use case, I would say that the sequential
scan approach is not the right one: as these things go, someone will
come along and ask for scaling into the billions of images. "Someone"
might be my organization BTW: We do have a web archive and finding
similar images in that would be quite useful.

> Are you using Elasticsearch or Lucene directly?

None at the moment, as the driving project is currently on hold until
fall (at the earliest), and it was paused when I was about to switch
from prototyping (https://github.com/kb-dk/fairly-similar) to real
implementation. Hopefully I can twist another project in the direction
of using the same technology. If not, I'll just have to do it on my own
time :-)

I was hoping to use it with Solr, with an expectation of introducing
the necessary lower-level mechanisms (AND & bit count of binary content)
at the Lucene level. Failing that, maybe Lucene directly. Using
Elasticsearch is a bit of a challenge, as we don't use it currently and
it would require it to be added to Operations' support list.

> If you're using ES and have the time, I'd love some feedback on my
> plugin.

Sorry, not at the moment. Too many balls in the air before summer
vacation starts. I hope to find the time in August. Your post was just
too relevant to ignore.

> Also I've compiled a small literature review on some related research
> here:
> https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit

You are clearly way ahead of us and I'll shamelessly piggyback on your
findings. I skimmed your notes and they look extremely useful.

> Fast and Exact NNS in Hamming Space on Full-Text Search Engines
> describes some clever tricks to speed up Hamming similarity.

The autoencoder approach produces bitmaps where each bit is a distinct
signal, so I guess comparison would be equivalent to binary Hamming
distance?
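
A minimal sketch of that comparison, assuming the condensed representations
are stored as long[] arrays (the class and method names below are made up for
illustration): popcount(a AND b) counts the shared set bits, while
popcount(a XOR b) is the binary Hamming distance. When every bitmap has the
same number of set bits, the two measures rank candidates identically.

final class BitmapSimilarity {

  // Coarse similarity: number of positions where both bitmaps have a set bit.
  static int sharedBits(long[] a, long[] b) {
    int count = 0;
    for (int i = 0; i < a.length; i++) {
      count += Long.bitCount(a[i] & b[i]);
    }
    return count;
  }

  // Binary Hamming distance: number of positions where the bitmaps differ.
  static int hammingDistance(long[] a, long[] b) {
    int count = 0;
    for (int i = 0; i < a.length; i++) {
      count += Long.bitCount(a[i] ^ b[i]);
    }
    return count;
  }
}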

> Large Scale Image Retrieval with Elasticsearch describes the idea of
> using the largest absolute magnitude values instead of the full
> vector.

That approach was very promising in our local proof of concept.
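
As a rough illustration of that trick (the class and method names are
hypothetical): keep only the indexes of the k entries with the largest
absolute values and index those as terms, e.g. [10, 3, 6, 12, 1, 20] with
k = 3 gives [0, 3, 5], matching the example quoted earlier in this thread.

import java.util.Arrays;

final class TopKIndexes {

  // Indexes of the k entries with the largest absolute values, in ascending index order.
  // For [10, 3, 6, 12, 1, 20] and k = 3 this returns [0, 3, 5].
  static int[] topKIndexes(float[] vector, int k) {
    Integer[] order = new Integer[vector.length];
    for (int i = 0; i < order.length; i++) order[i] = i;
    // Sort indexes by descending absolute value, keep the first k, restore index order.
    Arrays.sort(order, (a, b) -> Float.compare(Math.abs(vector[b]), Math.abs(vector[a])));
    int[] top = new int[Math.min(k, order.length)];
    for (int i = 0; i < top.length; i++) top[i] = order[i];
    Arrays.sort(top);
    return top;
  }
}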

> Perhaps you've already read them but I figured I'd share.

A few of them, but not all. And your notes on the articles are great.

Thanks,
Toke Eskildsen, Royal Danish Library



Re: Optimizing a boolean query for 100s of term clauses
Hi Tommaso, thanks for the input and links! I'll add your paper to my
literature review.

So far I've seen very promising results from modifying the TermInSetQuery.
It was pretty simple to keep a map of `doc id -> matched term count` and
then only evaluate the exact similarity on the top k doc ids.
On a small benchmark, I was able to drop the time for 1000 queries from 45
seconds to 14 seconds.
Now the bottleneck is back in my own code, which I'm happy with because I
can optimize that more easily.
Hopefully I can merge these changes in the next couple of days, and I'll post
the diff when I do.
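
For anyone curious, a minimal per-segment sketch of that counting idea might
look like the snippet below. The class, the field handling, and the top-k
selection / exact re-scoring around it are placeholders for illustration, not
the actual plugin diff.

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

final class HashMatchCounter {

  // For every doc in a segment, count how many of the query's hash terms it matches.
  // The caller then takes the k docs with the highest counts and re-scores them exactly.
  static short[] countMatches(LeafReader reader, String field, List<BytesRef> hashes) throws IOException {
    short[] counts = new short[reader.maxDoc()];
    Terms terms = reader.terms(field);
    if (terms == null) return counts;
    TermsEnum termsEnum = terms.iterator();
    PostingsEnum postings = null;
    for (BytesRef hash : hashes) {
      if (termsEnum.seekExact(hash)) {       // skip hashes absent from this segment
        postings = termsEnum.postings(postings, PostingsEnum.NONE);
        for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
          counts[doc]++;                     // one more matching hash term for this doc
        }
      }
    }
    return counts;
  }
}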

- AK



On Thu, Jun 25, 2020 at 5:07 AM Tommaso Teofili <tommaso.teofili@gmail.com>
wrote:

> hi Alex,
>
> I had worked on a similar problem directly on Lucene (within the Anserini
> toolkit) using LSH fingerprints of tokenized feature vector values.
> You can find code at [1] and some information on the Anserini documentation
> page [2] and in a short preprint [3].
> As a side note, my current thinking is that it would be very cool if we
> could leverage Lucene's N-dimensional point support by properly reducing the
> dimensionality of the original vectors; however, that is hard to do without
> losing important information.
>
> My 2 cents,
> Tommaso
>
> [1] :
>
> https://github.com/castorini/anserini/tree/master/src/main/java/io/anserini/ann
> [2] :
>
> https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md
> [3] : https://arxiv.org/abs/1910.10208
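
As a rough sketch of the point-based idea above, assuming the original
vectors can first be reduced to a handful of dimensions (indexed point
dimensions are capped at a small number, 8 if I recall correctly; the field
name and radius below are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FloatPoint;
import org.apache.lucene.search.Query;

final class ReducedVectorPoints {

  // Index the dimensionality-reduced vector as an n-dimensional point.
  static void addReducedVector(Document doc, float[] reduced) {
    doc.add(new FloatPoint("reduced_vec", reduced));
  }

  // Candidate retrieval becomes a box query around the reduced query vector;
  // exact similarity on the original vectors would still be needed to re-rank matches.
  static Query boxAround(float[] reducedQuery, float radius) {
    float[] lower = new float[reducedQuery.length];
    float[] upper = new float[reducedQuery.length];
    for (int i = 0; i < reducedQuery.length; i++) {
      lower[i] = reducedQuery[i] - radius;
      upper[i] = reducedQuery[i] + radius;
    }
    return FloatPoint.newRangeQuery("reduced_vec", lower, upper);
  }
}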
>
>
>
>
>
> On Wed, 24 Jun 2020 at 19:47, Alex K <aklibisz@gmail.com> wrote:
>
> > Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem
> > space!
> >
> > My implementation isn't specific to any particular dataset or access
> > pattern (i.e. infinite vs. subset).
> > So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular
> > similarities with LSH variants for all but L1.
> > My exact implementation is generally faster than the approximate LSH
> > implementation, hence the thread.
> > You make a good point that this is valuable by itself if you're able to
> > filter down to a small subset of docs.
> > I put a lot of work into optimizing the vector serialization speed and
> the
> > exact query execution.
> > I imagine with my current implementation there is some breaking point
> where
> > LSH becomes faster than exact, but so far I've tested with ~1.2M
> > ~300-dimensional vectors and exact is still faster, especially when
> > parallelized across many shards.
> > So speeding up LSH is the current engineering challenge.
> >
> > Are you using Elasticsearch or Lucene directly?
> > If you're using ES and have the time, I'd love some feedback on my
> plugin.
> > It sounds like you want to compute hamming similarity on your bitmaps?
> > If so that's currently supported.
> > There's an example here:
> > http://demo.elastiknn.klibisz.com/dataset/mnist-hamming?queryId=64121
> >
> > Also I've compiled a small literature review on some related research
> here:
> >
> >
> https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit
> > *Fast and Exact NNS in Hamming Space on Full-Text Search Engines*
> describes
> > some clever tricks to speed up Hamming similarity.
> > *Large Scale Image Retrieval with Elasticsearch* describes the idea of
> > using the largest absolute magnitude values instead of the full vector.
> > Perhaps you've already read them but I figured I'd share.
> >
> > Cheers
> > - AK
> >
> >
> >
> > On Wed, Jun 24, 2020 at 8:44 AM Toke Eskildsen <toes@kb.dk> wrote:
> >
> > > On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> > > > I'm working on an Elasticsearch plugin (using Lucene internally) that
> > > > allows users to index numerical vectors and run exact and approximate
> > > > k-nearest-neighbors similarity queries.
> > >
> > > Quite a coincidence. I'm looking into the same thing :-)
> > >
> > > > 1. When indexing a vector, apply a hash function to it, producing
> > > > a set of discrete hashes. Usually there are anywhere from 100 to 1000
> > > > hashes.
> > >
> > > Is it important to have "infinite" scaling with an inverted index or is it
> > > acceptable to have a (fast) sequential scan through all documents? If
> > > the use case is to combine the nearest neighbour search with other
> > > filters, so that the effective search-space is relatively small, you
> > > could go directly to computing the Euclidian distance (or whatever you
> > > use to calculate the exact similarity score).
> > >
> > > > 4. As the BooleanQuery produces results, maintain a fixed-size
> > > > heap of its scores. For any score exceeding the min in the heap, load
> > > > its vector from the binary doc values, compute the exact similarity,
> > > > and update the heap.
> > >
> > > I did something quite similar for a non-Lucene-based proof of concept,
> > > except that I delayed the exact similarity calculation and over-
> > > collected on the heap.
> > >
> > > Fleshing that out: Instead of producing similarity hashes, I extracted
> > > the top-X strongest signals (entries in the vector) and stored them as
> > > indexes from the raw vector, so the top-3 signals from [10, 3, 6, 12,
> > > 1, 20] are [0, 3, 5]. The query was similar to your "match as many as
> > > possible", just with indexes instead of hashes.
> > >
> > > > - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> > > > runtime)
> > >
> > > This sounds strange. How large is your queue? Object-based priority
> > > queues tend to become slow when they get large (100K+ values).
> > >
> > > > Maybe I could optimize this by implementing a custom query or scorer?
> > >
> > > My plan for a better implementation is to use an autoencoder to produce
> > > a condensed representation of the raw vector for a document. In order
> > > to do so, a network must be trained on (ideally) the full corpus, so it
> > > will require a bootstrap process and will probably work poorly if
> > > incoming vectors differ substantially in nature from the existing ones
> > > (at least until the autoencoder is retrained and the condensed
> > > representations are reindexed). As our domain is an ever-growing
> > > image collection with fairly well-defined types of images (portraits, line
> > > drawings, maps...) and since new types are introduced rarely, this is
> > > acceptable for us.
> > >
> > > Back to Lucene, the condensed representation is expected to be a bitmap
> > > where the (coarse) similarity between two representations is simply the
> > > number of set bits at the same locations: An AND and a POPCNT of the
> > > bitmaps.
> > >
> > > This does imply a sequential pass of all potential documents, which
> > > means that it won't scale well. On the other hand each comparison is a
> > > fast check with very low memory overhead, so I hope it will work for
> > > the few million images that we need it for.
> > >
> > > - Toke Eskildsen, Royal Danish Library
> > >
> > >
> > >
> > >
> > >
> >
>