Mailing List Archive

Concurrent HNSW index
Hi all,

I've created an HNSW index implementation that allows for concurrent build
and querying. On my i9-12900 (8 performance cores and 8 efficiency) I get
a bit less than 10x speedup of wall clock time for building and querying
the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/.
The small dataset is 10k vectors while the large is 1M. This speedup feels
pretty good for a data structure that isn't completely parallelizable, and
it's good to see that it's consistent as the dataset gets larger.

The concurrent classes achieve identical recall compared to the
non-concurrent versions within my ability to test it, and are drop-in
replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
to work around the places where the existing API assumes no concurrency.

The concurrent classes also pass the existing test suite with the exception
of the ram usage ones; the estimator doesn't know about AtomicReference
etc. (Big thanks to Michael Sokolov for testAknnDiverse which made it much
easier to track down subtle problems!)

My motivation is

1. It is faster to query a single on-heap hnsw index, than to query
multiple such indexes and combine the result.
2. Even with some contention necessarily occurring during building of the
index, we still come out way ahead in terms of total efficiency vs creating
per-thread indexes and combining them, since combining such indexes boils
down to "pick the largest and then add all the other nodes normally," you
don't really benefit from having computed the others previously.

I am currently adding this to Cassandra as code in our repo, but my
preference would be to upstream it. Is Lucene open to a pull request?

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: Concurrent HNSW index [ In reply to ]
+1, please contribute to Lucene. Thanks!

On Thu, 27 Apr, 2023, 10:59 pm Jonathan Ellis, <jbellis@gmail.com> wrote:

> Hi all,
>
> I've created an HNSW index implementation that allows for concurrent build
> and querying. On my i9-12900 (8 performance cores and 8 efficiency) I get
> a bit less than 10x speedup of wall clock time for building and querying
> the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/.
> The small dataset is 10k vectors while the large is 1M. This speedup feels
> pretty good for a data structure that isn't completely parallelizable, and
> it's good to see that it's consistent as the dataset gets larger.
>
> The concurrent classes achieve identical recall compared to the
> non-concurrent versions within my ability to test it, and are drop-in
> replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
> to work around the places where the existing API assumes no concurrency.
>
> The concurrent classes also pass the existing test suite with the
> exception of the ram usage ones; the estimator doesn't know about
> AtomicReference etc. (Big thanks to Michael Sokolov for testAknnDiverse
> which made it much easier to track down subtle problems!)
>
> My motivation is
>
> 1. It is faster to query a single on-heap hnsw index, than to query
> multiple such indexes and combine the result.
> 2. Even with some contention necessarily occurring during building of the
> index, we still come out way ahead in terms of total efficiency vs creating
> per-thread indexes and combining them, since combining such indexes boils
> down to "pick the largest and then add all the other nodes normally," you
> don't really benefit from having computed the others previously.
>
> I am currently adding this to Cassandra as code in our repo, but my
> preference would be to upstream it. Is Lucene open to a pull request?
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
Re: Concurrent HNSW index [ In reply to ]
+1 for a pull request

Thanks

Michael

Am 27.04.23 um 20:53 schrieb Ishan Chattopadhyaya:
> +1, please contribute to Lucene. Thanks!
>
> On Thu, 27 Apr, 2023, 10:59 pm Jonathan Ellis, <jbellis@gmail.com> wrote:
>
> Hi all,
>
> I've created an HNSW index implementation that allows for
> concurrent build and querying.? On my i9-12900 (8 performance
> cores and 8 efficiency) I get a bit less than 10x speedup of wall
> clock time for building and querying the "siftsmall" and "sift"
> datasets from http://corpus-texmex.irisa.fr/. The small dataset is
> 10k vectors while the large is 1M. This speedup feels pretty good
> for a data structure that isn't completely parallelizable, and
> it's good to see that it's consistent as the dataset gets larger.
>
> The concurrent classes achieve identical recall compared to the
> non-concurrent versions within my ability to test it, and are
> drop-in replacements for OnHeapHnswGraph and HnswGraphBuilder; I
> use threadlocals to work around the places where the existing API
> assumes no concurrency.
>
> The concurrent classes also pass the existing test suite with the
> exception of the ram usage ones; the estimator doesn't know about
> AtomicReference etc.? (Big thanks to Michael Sokolov for
> testAknnDiverse which made it much easier to track down subtle
> problems!)
>
> My motivation is
>
> 1. It is faster to query a single on-heap hnsw index, than to
> query multiple such indexes and combine the result.
> 2. Even with some contention necessarily occurring during building
> of the index, we still come out way ahead in terms of total
> efficiency vs creating per-thread indexes and combining them,
> since combining such indexes boils down to "pick the largest and
> then add all the other nodes normally," you don't really benefit
> from having computed the others previously.
>
> I am currently adding this to Cassandra as code in our repo, but
> my preference would be to upstream it.? Is Lucene open to a pull
> request?
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
Re: Concurrent HNSW index [ In reply to ]
That's great! And we were talking about this exactly here:
https://github.com/apache/lucene/pull/12169

It would also help with the new token filter :)
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Thu, 27 Apr 2023 at 19:29, Jonathan Ellis <jbellis@gmail.com> wrote:

> Hi all,
>
> I've created an HNSW index implementation that allows for concurrent build
> and querying. On my i9-12900 (8 performance cores and 8 efficiency) I get
> a bit less than 10x speedup of wall clock time for building and querying
> the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/.
> The small dataset is 10k vectors while the large is 1M. This speedup feels
> pretty good for a data structure that isn't completely parallelizable, and
> it's good to see that it's consistent as the dataset gets larger.
>
> The concurrent classes achieve identical recall compared to the
> non-concurrent versions within my ability to test it, and are drop-in
> replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
> to work around the places where the existing API assumes no concurrency.
>
> The concurrent classes also pass the existing test suite with the
> exception of the ram usage ones; the estimator doesn't know about
> AtomicReference etc. (Big thanks to Michael Sokolov for testAknnDiverse
> which made it much easier to track down subtle problems!)
>
> My motivation is
>
> 1. It is faster to query a single on-heap hnsw index, than to query
> multiple such indexes and combine the result.
> 2. Even with some contention necessarily occurring during building of the
> index, we still come out way ahead in terms of total efficiency vs creating
> per-thread indexes and combining them, since combining such indexes boils
> down to "pick the largest and then add all the other nodes normally," you
> don't really benefit from having computed the others previously.
>
> I am currently adding this to Cassandra as code in our repo, but my
> preference would be to upstream it. Is Lucene open to a pull request?
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
Re: Concurrent HNSW index [ In reply to ]
Great, I will work on squashing to get a clean PR.

One thing I am struggling with is the RamUsageTester. Here is the
stacktrace: https://gist.github.com/jbellis/20676b0e23f43751cbe8834a8def0d12

Apparently RamUsageTester tries to flip private fields to public so it can
introspect them, but the JVM modularization locks this down for internal
classes like ThreadLocal. Unclear to me why this is the first time this
problem has come up or how to fix it.

On Fri, Apr 28, 2023 at 2:18?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> That's great! And we were talking about this exactly here:
> https://github.com/apache/lucene/pull/12169
>
> It would also help with the new token filter :)
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Thu, 27 Apr 2023 at 19:29, Jonathan Ellis <jbellis@gmail.com> wrote:
>
>> Hi all,
>>
>> I've created an HNSW index implementation that allows for concurrent
>> build and querying. On my i9-12900 (8 performance cores and 8 efficiency)
>> I get a bit less than 10x speedup of wall clock time for building and
>> querying the "siftsmall" and "sift" datasets from
>> http://corpus-texmex.irisa.fr/. The small dataset is 10k vectors while
>> the large is 1M. This speedup feels pretty good for a data structure that
>> isn't completely parallelizable, and it's good to see that it's consistent
>> as the dataset gets larger.
>>
>> The concurrent classes achieve identical recall compared to the
>> non-concurrent versions within my ability to test it, and are drop-in
>> replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
>> to work around the places where the existing API assumes no concurrency.
>>
>> The concurrent classes also pass the existing test suite with the
>> exception of the ram usage ones; the estimator doesn't know about
>> AtomicReference etc. (Big thanks to Michael Sokolov for testAknnDiverse
>> which made it much easier to track down subtle problems!)
>>
>> My motivation is
>>
>> 1. It is faster to query a single on-heap hnsw index, than to query
>> multiple such indexes and combine the result.
>> 2. Even with some contention necessarily occurring during building of the
>> index, we still come out way ahead in terms of total efficiency vs creating
>> per-thread indexes and combining them, since combining such indexes boils
>> down to "pick the largest and then add all the other nodes normally," you
>> don't really benefit from having computed the others previously.
>>
>> I am currently adding this to Cassandra as code in our repo, but my
>> preference would be to upstream it. Is Lucene open to a pull request?
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: Concurrent HNSW index [ In reply to ]
Draft PR is posted here: https://github.com/apache/lucene/pull/12254

This depends on my PR to use HashMap in the non-concurrent OnHeapHnswGraph
(because that PR updates the tests to not assume sorted order of nodes in a
given level): https://github.com/apache/lucene/pull/12248

On Fri, Apr 28, 2023 at 8:14?AM Jonathan Ellis <jbellis@gmail.com> wrote:

> Great, I will work on squashing to get a clean PR.
>
> One thing I am struggling with is the RamUsageTester. Here is the
> stacktrace:
> https://gist.github.com/jbellis/20676b0e23f43751cbe8834a8def0d12
>
> Apparently RamUsageTester tries to flip private fields to public so it can
> introspect them, but the JVM modularization locks this down for internal
> classes like ThreadLocal. Unclear to me why this is the first time this
> problem has come up or how to fix it.
>
> On Fri, Apr 28, 2023 at 2:18?AM Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
>> That's great! And we were talking about this exactly here:
>> https://github.com/apache/lucene/pull/12169
>>
>> It would also help with the new token filter :)
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>> On Thu, 27 Apr 2023 at 19:29, Jonathan Ellis <jbellis@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I've created an HNSW index implementation that allows for concurrent
>>> build and querying. On my i9-12900 (8 performance cores and 8 efficiency)
>>> I get a bit less than 10x speedup of wall clock time for building and
>>> querying the "siftsmall" and "sift" datasets from
>>> http://corpus-texmex.irisa.fr/. The small dataset is 10k vectors while
>>> the large is 1M. This speedup feels pretty good for a data structure that
>>> isn't completely parallelizable, and it's good to see that it's consistent
>>> as the dataset gets larger.
>>>
>>> The concurrent classes achieve identical recall compared to the
>>> non-concurrent versions within my ability to test it, and are drop-in
>>> replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
>>> to work around the places where the existing API assumes no concurrency.
>>>
>>> The concurrent classes also pass the existing test suite with the
>>> exception of the ram usage ones; the estimator doesn't know about
>>> AtomicReference etc. (Big thanks to Michael Sokolov for testAknnDiverse
>>> which made it much easier to track down subtle problems!)
>>>
>>> My motivation is
>>>
>>> 1. It is faster to query a single on-heap hnsw index, than to query
>>> multiple such indexes and combine the result.
>>> 2. Even with some contention necessarily occurring during building of
>>> the index, we still come out way ahead in terms of total efficiency vs
>>> creating per-thread indexes and combining them, since combining such
>>> indexes boils down to "pick the largest and then add all the other nodes
>>> normally," you don't really benefit from having computed the others
>>> previously.
>>>
>>> I am currently adding this to Cassandra as code in our repo, but my
>>> preference would be to upstream it. Is Lucene open to a pull request?
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>>
>>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced