Mailing List Archive: Vector Search with OpenAI Embeddings: Lucene Is All You Need

Vector Search with OpenAI Embeddings: Lucene Is All You Need

Aug 31, 2023, 12:27 AM

Post #1 of 4 (188 views)

Hi all,

You might be interested in this paper / article:

https://arxiv.org/abs/2308.14963

Thanks

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Vector Search with OpenAI Embeddings: Lucene Is All You Need [ In reply to ]

lucene at mikemccandless

Aug 31, 2023, 2:49 AM

Post #2 of 4 (188 views)


Thanks Michael, very interesting! I of course agree that Lucene is all you
need, heh ;)

Jimmy Lin also tweeted about the strength of Lucene's HNSW:
https://twitter.com/lintool/status/1681333664431460353?s=20

Mike McCandless

http://blog.mikemccandless.com


Re: Vector Search with OpenAI Embeddings: Lucene Is All You Need [ In reply to ]

kent.fitch at gmail

Aug 31, 2023, 10:22 PM

Post #3 of 4 (188 views)


My testing shows Lucene's HNSW in a very positive light. The ability to
perform blended searches (vector/semantic and text) is valuable, even with
high-quality embeddings, and helps when the searcher's intent is to search
for specific words or phrases (such as a name, or exact concepts) which get
blurred out by semantics. I discussed blended searching using Lucene in
this Code4Lib article: https://journal.code4lib.org/articles/17443
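
A rough sketch of what blending can look like at the scoring level follows.
It is not the method from the Code4Lib article and not a Lucene API: the
function names, the min-max normalization, and the text_weight parameter are
all assumptions chosen for illustration. The point is only that a document
matching an exact name can still rank well on its text score even when its
embedding similarity is unremarkable.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so text and vector scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal: treat every hit as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blend(
    text_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    text_weight: float = 0.5,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized text and vector scores (hypothetical scheme).

    A document missing from one result list contributes 0.0 for that side.
    Ties are broken by document id so the ordering is deterministic.
    """
    text_norm = min_max_normalize(text_scores)
    vec_norm = min_max_normalize(vector_scores)
    blended = {
        doc_id: text_weight * text_norm.get(doc_id, 0.0)
        + (1.0 - text_weight) * vec_norm.get(doc_id, 0.0)
        for doc_id in set(text_norm) | set(vec_norm)
    }
    return sorted(blended.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    # Made-up scores: "a" matches the exact name, "b" and "c" are only
    # semantically close to the query.
    text_hits = {"a": 12.0, "b": 3.0}
    vector_hits = {"a": 0.5, "b": 0.9, "c": 0.8}
    for weight in (0.8, 0.2):
        print(weight, blend(text_hits, vector_hits, text_weight=weight))
```

With text_weight near 1 the exact-match hit dominates; near 0 the ranking
falls back to embedding similarity. In Lucene itself, one common way to get
a comparable effect is to combine a text query and a KNN vector query as
clauses of a single BooleanQuery, although the score combination there is
not the normalization shown here.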

And regarding performance, I have benchmarked Lucene's HNSW (circa Jan 2023
snapshot) on a test index of 192 million vectors of 1536 dimensions, each
reduced by PQ coding to 512 bytes and stored in HNSW. Building this index
was slow (lots of time merging...) but once it was built, it did fit
entirely in memory (Core i7-9800X (8 cores) with 128 GB DDR4 memory running
at 2400 MT/s) so no I/O was required at search time. (I modified the Lucene
similarity code to support expansion of each of the 512 PQ byte codes back
to 3 floats for the distance calculation.) I haven't updated this to take
advantage of the latest SIMD capability, but even so, once the HNSW
structure is in memory, a single topK=10 search thread achieves
2.4 queries/second. Two threads: 4.9 q/s, 4 threads: 7.2 q/s, maxing out at
8 threads: 9.4 q/s. I guess the non-linear scaling with threads is due to
competition for memory bandwidth and cache. Curiously, I'm not getting
nearly as good performance out of the box using Milvus 2.3's DiskANN, but I
need to find out why before condemning it.
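
For readers unfamiliar with PQ, here is a minimal sketch of the expansion
step described above: each of the 512 byte codes selects one 3-float
centroid, and the distance calculation runs over the expanded values. This
is not the modified Lucene similarity code. The 256-centroids-per-subspace
layout, the dot-product similarity, the random codebooks, and every name
below are assumptions made for illustration; real codebooks would be
trained, for example with k-means per subspace.

```python
import random
from typing import List, Sequence

DIMS = 1536                     # embedding dimensionality (from the post)
NUM_CODES = 512                 # PQ bytes per vector (from the post)
SUB_DIMS = DIMS // NUM_CODES    # 3 floats recovered per byte code
CENTROIDS = 256                 # one byte selects 1 of 256 centroids (assumed)

# codebooks[subspace][code] is a list of SUB_DIMS floats
Codebooks = List[List[List[float]]]


def make_random_codebooks(seed: int = 0) -> Codebooks:
    """Stand-in for trained codebooks, filled with random values."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(SUB_DIMS)] for _ in range(CENTROIDS)]
        for _ in range(NUM_CODES)
    ]


def decode(codes: bytes, codebooks: Codebooks) -> List[float]:
    """Expand 512 byte codes back to an approximate 1536-float vector."""
    vector: List[float] = []
    for subspace, code in enumerate(codes):
        vector.extend(codebooks[subspace][code])
    return vector


def dot_product(query: Sequence[float], codes: bytes, codebooks: Codebooks) -> float:
    """Similarity between a full-precision query and a PQ-coded vector.

    The stored vector is expanded on the fly, three floats at a time, so the
    full 1536-float vector is never materialized.
    """
    total = 0.0
    for subspace, code in enumerate(codes):
        centroid = codebooks[subspace][code]
        base = subspace * SUB_DIMS
        for j in range(SUB_DIMS):
            total += query[base + j] * centroid[j]
    return total


def code_bytes(num_vectors: int) -> int:
    """Bytes needed for the PQ codes alone, excluding HNSW neighbor lists."""
    return num_vectors * NUM_CODES


if __name__ == "__main__":
    books = make_random_codebooks()
    rng = random.Random(42)
    codes = bytes(rng.randrange(CENTROIDS) for _ in range(NUM_CODES))
    query = [rng.uniform(-1.0, 1.0) for _ in range(DIMS)]
    print(len(decode(codes, books)), dot_product(query, codes, books))
    print(code_bytes(192_000_000) / 1e9, "GB of PQ codes")
```

The arithmetic is consistent with the index fitting in memory: 192 million
vectors at 512 bytes each come to about 98.3 GB of codes (before the HNSW
neighbor lists), whereas the same vectors stored as 32-bit floats would need
roughly 1.18 TB.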

Kent Fitch


Re: Vector Search with OpenAI Embeddings: Lucene Is All You Need [ In reply to ]

mkhl at apache

Sep 1, 2023, 12:10 AM

Post #4 of 4 (188 views)


Thanks for sharing, Michael.
But isn't it fair to say that vector DBs can make use of GPUs, which is
hardly possible with Lucene today?


--
Sincerely yours
Mikhail Khludnev