Mailing List Archive

Lucene/Solr and BERT
Hi

I recently found the following articles re Lucene/Solr and BERT

https://dmitry-kan.medium.com/neural-search-with-bert-and-solr-ea5ead060b28
https://medium.com/swlh/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559

and would like to ask whether there might be more recent developments
within the Lucene/Solr community re BERT integration?

Also how these developments relate to

https://sbert.net/

?

Thanks very much for your insights!

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Lucene/Solr and BERT [ In reply to ]
There were a couple of additions recently merged into Lucene but not yet
released:
- A first-class vector codec
- An implementation of HNSW for approximate nearest neighbor search

They are however available in the snapshot releases. I started on a small
project to get the HNSW implementation into the ann-benchmarks project, but
had to set it aside. Here's the code:
https://github.com/alexklibisz/ann-benchmarks-lucene. There are some test
suites that index and search GloVe vectors. My first impression was that
indexing seems surprisingly slow, but it's entirely possible I'm doing
something wrong.

On Wed, Apr 21, 2021 at 9:31 AM Michael Wechner <michael.wechner@wyona.com>
wrote:

> Hi
>
> I recently found the following articles re Lucene/Solr and BERT
>
> https://dmitry-kan.medium.com/neural-search-with-bert-and-solr-ea5ead060b28
>
> https://medium.com/swlh/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
>
> and would like to ask whether there might be more recent developments
> within the Lucene/Solr community re BERT integration?
>
> Also how these developments relate to
>
> https://sbert.net/
>
> ?
>
> Thanks very much for your insights!
>
> Michael
Re: Lucene/Solr and BERT [ In reply to ]
Great, thank you very much for your feedback!

Will give it a try and get back probably with more questions :-)

Thanks

Michael

Re: Lucene/Solr and BERT [ In reply to ]
Hi Alex

Just to make sure I understand better what the additions are about

On 21.04.21 at 17:21, Alex K wrote:
> There were a couple additions recently merged into lucene but not yet
> released:
> - A first-class vector codec

do you mean the classes inside

https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90

and in particular

Lucene90HnswVectorFormat.java  Lucene90HnswVectorReader.java
Lucene90HnswVectorWriter.java

?

> - An implementation of HNSW for approximate nearest neighbor search

the HNSW implementation at

https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/util/hnsw

is similar to

https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch/

?
>
> They are however available in the snapshot releases. I started on a small
> project to get the HNSW implementation into the ann-benchmarks project, but
> had to set it aside.

Is there still something missing? Or what would be the next steps?

Thanks

Michael


Re: Lucene/Solr and BERT [ In reply to ]
Hi Michael, it is fully functional in the sense that Lucene will
build an HNSW graph for a vector-valued field, and you can then use the
VectorReader.search method to do KNN-based search. Next steps may
include some integration with lexical, inverted-index type search so
that you can retrieve the N closest vectors subject to other constraints.
Today you can approximate that by oversampling and filtering. There is
also interest in pursuing other KNN search algorithms, and we have
been working to make sure the VectorFormat API (which might still get
renamed due to confusion with other kinds of vectors existing in
Lucene) can support alternative KNN implementations.
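
[Editor's sketch] The "oversampling and filtering" workaround mentioned
above can be illustrated roughly as follows. This is a minimal sketch in
plain Python, not Lucene code: an exact brute-force search stands in for
the HNSW search, and the names knn, filtered_knn, oversample and the toy
documents are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, docs, k):
    """Stand-in for the vector search: exact top-k by cosine similarity."""
    ranked = sorted(docs, key=lambda d: cosine(query, d["vector"]), reverse=True)
    return ranked[:k]


def filtered_knn(query, docs, n, predicate, oversample=4):
    """Fetch n * oversample neighbours, then keep the first n that pass the filter."""
    candidates = knn(query, docs, n * oversample)
    return [d for d in candidates if predicate(d)][:n]


# Toy data: 2-dimensional vectors with a "lang" attribute to filter on.
docs = [
    {"id": 1, "lang": "en", "vector": [1.0, 0.0]},
    {"id": 2, "lang": "de", "vector": [0.9, 0.1]},
    {"id": 3, "lang": "en", "vector": [0.7, 0.7]},
    {"id": 4, "lang": "de", "vector": [0.0, 1.0]},
    {"id": 5, "lang": "en", "vector": [-1.0, 0.0]},
]


def is_english(doc):
    return doc["lang"] == "en"


hits = filtered_knn([1.0, 0.0], docs, n=2, predicate=is_english, oversample=2)
short = filtered_knn([1.0, 0.0], docs, n=2, predicate=is_english, oversample=1)
print([d["id"] for d in hits], [d["id"] for d in short])
```

With too small an oversampling factor the filter can leave fewer than n
hits (the second call above), which is why this only approximates a
properly constrained nearest-neighbour search.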

On Wed, May 19, 2021 at 12:22 PM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> Hi Alex
>
> Just to make sure I understand better what the additions are about
>
> On 21.04.21 at 17:21, Alex K wrote:
> > There were a couple additions recently merged into lucene but not yet
> > released:
> > - A first-class vector codec
>
> do you mean the classes inside
>
> https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90
>
> and in particular
>
> Lucene90HnswVectorFormat.java Lucene90HnswVectorReader.java
> Lucene90HnswVectorWriter.java
>
> ?
>
> > - An implementation of HNSW for approximate nearest neighbor search
>
> the HNSW implementation at
>
> https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/util/hnsw
>
> is similar to
>
> https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch/
>
> ?
> >
> > They are however available in the snapshot releases. I started on a small
> > project to get the HNSW implementation into the ann-benchmarks project, but
> > had to set it aside.
>
> Is there still something missing? Or what would be the next steps?
>
> Thanks
>
> Michael
Re: Lucene/Solr and BERT [ In reply to ]
For practical search using BERT on any reasonably sized dataset, they're
going to need ANN, which Lucene recently added. This won't work in practice
if the query and document are of a different size, which is where sentence
transformers see a lot of use for documents up to 500 words.

https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9004

https://github.com/UKPLab/sentence-transformers

Russ

On Sun, May 23, 2021 at 8:23 PM Michael Sokolov <msokolov@gmail.com> wrote:

> Hi Michael, that is fully-functional in the sense that Lucene will
> build an HNSW graph for a vector-valued field and you can then use the
> VectorReader.search method to do KNN-based search. Next steps may
> include some integration with lexical, inverted-index type search so
> that you can retrieve N-closest constrained by other constraints.
> Today you can approximate that by oversampling and filtering. There is
> also interest in pursuing other KNN search algorithms, and we have
> been working to make sure the VectorFormat API (might still get
> renamed due to confusion with other kinds of vectors existing in
> Lucene) can support alternative KNN implementations.

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jurney@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com
Re: Lucene/Solr and BERT [ In reply to ]
Hi Michael

Thank you for your explanations!

I am currently trying to implement it, learning from the code at

https://github.com/jtibshirani/lucene/blob/hnsw-bench/lucene/core/src/java/org/apache/lucene/search/PythonEntryPoint.java

Julie told me that this code is a bit out of date, but it should be
updated very soon.

It would be great to have some example code, similar to what is
otherwise available for Lucene

https://lucene.apache.org/core/8_8_2/core/overview-summary.html#overview.description

Does this already exist? If not, I could try to create some and
contribute it to the documentation.

Thanks

Michael


Re: Lucene/Solr and BERT [ In reply to ]
Hi Russ

I would like to use it for detecting duplicate questions. I am
currently using the sbert.net project you mention below to compute
embeddings of size 768 for indexing and querying.

sbert has an example using util.pytorch_cos_sim(A, B) as a
brute-force approach

https://sbert.net/docs/usage/semantic_textual_similarity.html

and a "paraphrase mining" approach for larger document collections

https://sbert.net/examples/applications/paraphrase-mining/README.html
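
[Editor's sketch] The brute-force approach amounts to comparing every
pair of embeddings by cosine similarity. The following is a minimal
sketch in plain Python, not the sbert code: find_duplicates and the 0.9
threshold are hypothetical, and toy 3-dimensional vectors stand in for
the 768-dimensional sentence embeddings.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_duplicates(embeddings, threshold=0.9):
    """Score every pair (quadratic in the number of questions), best match first."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(embeddings), 2):
        score = cosine(a, b)
        if score >= threshold:
            pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Questions 0 and 1 point in nearly the same direction; question 2 does not.
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]

dupes = find_duplicates(embeddings, threshold=0.9)
for i, j, score in dupes:
    print(f"questions {i} and {j} look like duplicates (cosine {score:.3f})")
```

The pairwise loop grows quadratically with the number of questions, which
is the reason an ANN index becomes attractive for larger collections.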

Re the Lucene ANN implementation(s) I think it would be very interesting
to participate in the ANN benchmarking challenge which Julie mentioned
on the dev list

http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCAKDq9%3D4rSuuczoK%2BcVg_N6Lwvh42E%2BXUoSGQ6m7BgqzuDvACew%40mail.gmail.com%3E

https://medium.com/big-ann-benchmarks/neurips-2021-announcement-the-billion-scale-approximate-nearest-neighbor-search-challenge-72858f768f69

Thanks

Michael



On 24.05.21 at 05:31, Russell Jurney wrote:
> For practical search using BERT on any reasonable sized dataset, they're
> going to need ANN, which Lucene recently added. This won't work in practice
> if the query and document are of a different size, which is where sentence
> transformers see a lot of use for documents up to 500 words.
>
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9004
>
> https://github.com/UKPLab/sentence-transformers
>
> Russ
Re: Lucene/Solr and BERT [ In reply to ]
Hi Michael and others,

Sorry for only now getting back to you. For your three original questions:

- Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
thorough response.
- As far as I know Opendistro is calling out to a C/C++ binary to run the
actual HNSW algorithm and store the HNSW part of the index. When they
implemented it about a year ago, Lucene did not have this yet. I assume the
Lucene HNSW implementation is solid, but would not be surprised if it's
slower than the C/C++ based implementation, given the JVM has some
disadvantages for these kinds of CPU-bound/number crunching algos.
- I just haven't had much time to invest into my benchmark recently. In
particular, I got stuck on why indexing was taking extremely long. Just
indexing the vectors would have easily exceeded the current time
limitations in the ANN-benchmarks project. Maybe I had some naive mistake
in my implementation, but I profiled and dug pretty deep to make it fast.

I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?
If so, another option you might try for ANN is the elastiknn-models
and elastiknn-lucene packages. elastiknn-models contains the Locality
Sensitive Hashing implementations of ANN used by Elastiknn, and
elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
query is the MatchHashesAndScoreQuery
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
There are a couple of Scala test suites that show how to use it:
MatchHashesAndScoreQuerySuite
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>
and MatchHashesAndScoreQueryPerformanceSuite
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
This is all designed to work independently from Elasticsearch and is
published on Maven: com.klibisz.elastiknn / lucene
<https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
and
com.klibisz.elastiknn / models
<https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
The tests are Scala but all of the implementation is in Java.
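
[Editor's sketch] For readers unfamiliar with Locality Sensitive Hashing,
the general idea can be shown with the classic random-hyperplane scheme
for cosine similarity. This is a minimal sketch in plain Python and is
not the Elastiknn implementation; make_hyperplanes, lsh_hash and hamming
are hypothetical names.

```python
import random


def make_hyperplanes(dim, bits, seed=0):
    """Draw one random Gaussian hyperplane normal per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def lsh_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of differing bits; a small distance suggests a small angle."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, bits=16)

v = [0.5, -1.0, 2.0, 0.25]
same_direction = [2 * x for x in v]  # same angle, different length
opposite = [-x for x in v]           # pointing the other way

print(hamming(lsh_hash(v, planes), lsh_hash(same_direction, planes)))
print(hamming(lsh_hash(v, planes), lsh_hash(opposite, planes)))
```

Vectors at a small angle to each other agree on most bits, so candidate
neighbours can be found by matching hash bits in an inverted index and
then re-scoring the candidates exactly.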

Thanks,
Alex

On Mon, May 24, 2021 at 3:06 AM Michael Wechner <michael.wechner@wyona.com>
wrote:

> Hi Russ
>
> I would like to use it for detecting duplicated questions, whereas I am
> currently using the project sbert.net you mention below to do the
> embedding with a size of 768 for indexing and querying.
>
> sbert has an example using util.pytorch_cos_sim(A, B) as a
> brute-force approach
>
> https://sbert.net/docs/usage/semantic_textual_similarity.html
>
> and "paraphrase mining" approach for larger document collections
>
> https://sbert.net/examples/applications/paraphrase-mining/README.html
>
> Re the Lucene ANN implementation(s) I think it would be very interesting
> to participate in the ANN benchmarking challenge which Julie mentioned
> on the dev list
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCAKDq9%3D4rSuuczoK%2BcVg_N6Lwvh42E%2BXUoSGQ6m7BgqzuDvACew%40mail.gmail.com%3E
>
>
> https://medium.com/big-ann-benchmarks/neurips-2021-announcement-the-billion-scale-approximate-nearest-neighbor-search-challenge-72858f768f69
>
> Thanks
>
> Michael
Re: Lucene/Solr and BERT [ In reply to ]
Hi Alex

Thank you very much for your feedback and the various insights!

On 26.05.21 at 04:41, Alex K wrote:
> Hi Michael and others,
>
> Sorry just now getting back to you. For your three original questions:
>
> - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> thorough response.
> - As far as I know Opendistro is calling out to a C/C++ binary to run the
> actual HNSW algorithm and store the HNSW part of the index. When they
> implemented it about a year ago, Lucene did not have this yet. I assume the
> Lucene HNSW implementation is solid, but would not be surprised if it's
> slower than the C/C++ based implementation, given the JVM has some
> disadvantages for these kinds of CPU-bound/number crunching algos.
> - I just haven't had much time to invest into my benchmark recently. In
> particular, I got stuck on why indexing was taking extremely long. Just
> indexing the vectors would have easily exceeded the current time
> limitations in the ANN-benchmarks project. Maybe I had some naive mistake
> in my implementation, but I profiled and dug pretty deep to make it fast.

I am trying to get Julie's branch running

https://github.com/jtibshirani/lucene/tree/hnsw-bench

Maybe this will help and give comparable results.


>
> I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?

Yes, for simpler setups I would like to use Lucene standalone, but
for setups that have to scale I would use either Elasticsearch or Solr.

Thanks

Michael



> If so, another option you might try for ANN is the elastiknn-models
> and elastiknn-lucene packages. elastiknn-models contains the Locality
> Sensitive Hashing implementations of ANN used by Elastiknn, and
> elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
> query is the MatchHashesAndScoreQuery
> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
> There are a couple of Scala test suites that show how to use it:
> MatchHashesAndScoreQuerySuite
> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>.
> MatchHashesAndScoreQueryPerformanceSuite
> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
> This is all designed to work independently from Elasticsearch and is
> published on Maven: com.klibisz.elastiknn / lucene
> <https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
> and
> com.klibisz.elastiknn / models
> <https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
> The tests are Scala but all of the implementation is in Java.
>
> Thanks,
> Alex
>
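
For readers unfamiliar with the Locality Sensitive Hashing idea mentioned in the quoted text above, here is a minimal sketch of one common variant (random hyperplanes for angular similarity). It is not Elastiknn's implementation; the class name HyperplaneLshSketch and all of its methods are hypothetical and exist only for illustration.

```java
import java.util.Random;

// Illustrative only: a generic random-hyperplane LSH, NOT the Elastiknn code.
public class HyperplaneLshSketch {
    private final float[][] hyperplanes;

    public HyperplaneLshSketch(int dims, int bits, long seed) {
        if (bits < 1 || bits > 32) {
            throw new IllegalArgumentException("bits must be in [1, 32]");
        }
        Random rng = new Random(seed);
        hyperplanes = new float[bits][dims];
        for (float[] h : hyperplanes) {
            for (int i = 0; i < dims; i++) {
                h[i] = (float) rng.nextGaussian();
            }
        }
    }

    // One bit per hyperplane: the bit is set when the vector lies on the
    // non-negative side of that hyperplane.
    public int hash(float[] v) {
        int code = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            float[] h = hyperplanes[b];
            if (v.length != h.length) {
                throw new IllegalArgumentException("unexpected vector length");
            }
            float dot = 0f;
            for (int i = 0; i < h.length; i++) {
                dot += h[i] * v[i];
            }
            if (dot >= 0f) {
                code |= 1 << b;
            }
        }
        return code;
    }

    // Number of differing bits between two hash codes. Vectors separated by a
    // small angle tend to differ in few bits.
    public static int hammingDistance(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        HyperplaneLshSketch lsh = new HyperplaneLshSketch(4, 16, 42L);
        float[] query = {1f, 2f, 3f, 4f};
        float[] near = {1.1f, 2f, 2.9f, 4f};
        float[] far = {-4f, 3f, -2f, 1f};
        System.out.println("near: " + hammingDistance(lsh.hash(query), lsh.hash(near)));
        System.out.println("far:  " + hammingDistance(lsh.hash(query), lsh.hash(far)));
    }
}
```

In an index, such hash codes would typically be stored as terms, so that candidate documents can be found by matching hashes and then re-scored with the exact similarity.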


Re: Lucene/Solr and BERT [ In reply to ]
This Java implementation will be slower than the C implementation. I
believe the algorithm is essentially the same; however, this is new and
there may be bugs! I (and I think Julie had similar results, IIRC)
measured something like 8x slower than hnswlib (using ann-benchmarks).
It is also surprising (to me) how much this varies with differently
learned vectors, so YMMV. I still think there is value here, and look
forward to improved performance, especially as JDK 16 has some improved
support for vectorized instructions.

Please also understand that the HNSW algorithm interacts with Lucene's
segmented architecture in a tricky way. Because we build a graph
*per-segment* when flushing/merging, the graphs must be rebuilt whenever
segments are merged. So your indexing performance can be heavily
influenced by how often you flush, as well as by your merge policy
settings. Also, when searching, there is a bigger-than-usual benefit
to searching across fewer segments, since the cost of searching an
HNSW graph scales more or less with log N (so searching a single large
graph is cheaper than searching the same documents divided among
smaller graphs). So I do recommend using a multithreaded collector in
order to get the best latency with HNSW-based search. To get the best
indexing and searching performance, you should generally index as
large a number of documents as possible before flushing.
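
As a rough illustration of the log N argument above, the following sketch compares an idealized search cost for one large graph against the same documents split evenly across several segments. It is only a toy model, not a measurement of Lucene: the class SegmentSearchCost and its methods are hypothetical, and the model ignores constant factors, result merging, and thread overhead.

```java
// Toy cost model, assuming the cost of searching one HNSW graph of n vectors
// is proportional to log2(n). None of these names come from Lucene.
public class SegmentSearchCost {

    // Idealized cost of searching a single graph of n vectors.
    public static double graphCost(long n) {
        return n <= 1 ? 1.0 : Math.log(n) / Math.log(2);
    }

    // Total work when totalDocs are split evenly across `segments` graphs and
    // every graph has to be searched, one after another.
    public static double totalCost(long totalDocs, int segments) {
        long perSegment = Math.max(1, totalDocs / segments);
        return segments * graphCost(perSegment);
    }

    // Idealized latency when each segment is searched on its own thread:
    // the segments run concurrently, so latency is that of one segment.
    public static double parallelLatency(long totalDocs, int segments) {
        long perSegment = Math.max(1, totalDocs / segments);
        return graphCost(perSegment);
    }

    public static void main(String[] args) {
        long docs = 1_000_000L;
        for (int segments : new int[] {1, 10, 100}) {
            System.out.printf(
                "%d segment(s): sequential cost %.1f, parallel latency %.1f%n",
                segments, totalCost(docs, segments), parallelLatency(docs, segments));
        }
    }
}
```

Under this model the sequential cost grows almost linearly with the number of segments, which is why fewer, larger segments help, and why searching segments concurrently can recover latency when the index does have many segments.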

-Mike

Re: Lucene/Solr and BERT [ In reply to ]
Thanks, Michael. IIRC, the thing that was taking so long was merging into a
single segment. Is there already benchmarking code for HNSW available
somewhere? I seem to remember someone posting benchmarking results on one
of the Jira tickets.

Thanks,
Alex

Re: Lucene/Solr and BERT [ In reply to ]
These JIRA issues contain results against two ann-benchmarks datasets. It'd
be great to get your thoughts/feedback if you have any:
* Searching: https://issues.apache.org/jira/browse/LUCENE-9937
* Indexing: https://issues.apache.org/jira/browse/LUCENE-9941

The benchmarks are based on the setup here:
https://github.com/jtibshirani/lucene/pull/1. I am happy to help if you run
into issues with it.

A note: my motivation for running ann-benchmarks was to understand how the
current performance compares to other approaches, and to research ideas for
improvements. The setup in the PR doesn't feel solid/maintainable as a
long-term approach to development benchmarks. My personal plan is to focus
on enhancing luceneutil and our nightly benchmarks (
https://github.com/mikemccand/luceneutil) instead of putting a lot of
effort into the ann-benchmarks setup.

Julie

Re: Lucene/Solr and BERT [ In reply to ]
Thank you very much for having done these benchmarks!

IIUC, one could summarize the results as follows:

- Indexing:
      Lucene is very roughly 10x slower than hnswlib/C++
- Searching (queries per second):
      Lucene is very roughly 8x slower than hnswlib/C++

Is that right, and should we double-check these results?

Also, it is not yet clear what causes this performance difference,
right?


Re: Lucene/Solr and BERT [ In reply to ]
Your summary sounds right to me. There are some ideas (being discussed on
the issue), but I don't think we have a detailed understanding yet of the
performance difference.

It would be great to get more eyes on the benchmark if you're interested in
double-checking the results. Mike mentioned that he saw a similar
performance difference in search (7-8x) when he ran his own benchmarks.

Julie




On Thu, May 27, 2021 at 12:55 AM Michael Wechner <michael.wechner@wyona.com>
wrote:

> Thank you very much for having done these benchmarks!
>
> IIUC one could state
>
> - Indexing:
> Lucene is slower than hnswlib/C++, very roughly 10x performance
> difference
> - Searching (Queries per second):
> Lucene is slower than hnswlib/C++, very roughly 8x performance
> difference
>
> right, but we should double-check these results?
>
> Also it is not clear at the moment why there is this performance
> difference, right?
>
>
> On 27.05.21 at 03:33, Julie Tibshirani wrote:
> > These JIRA issues contain results against two ann-benchmarks datasets.
> It'd
> > be great to get your thoughts/ feedback if you have any:
> > * Searching: https://issues.apache.org/jira/browse/LUCENE-9937
> > * Indexing: https://issues.apache.org/jira/browse/LUCENE-9941
> >
> > The benchmarks are based on the setup here:
> > https://github.com/jtibshirani/lucene/pull/1. I am happy to help if you
> run
> > into issues with it.
> >
> > A note: my motivation for running ann-benchmarks was to understand how
> the
> > current performance compares to other approaches, and to research ideas
> for
> > improvements. The setup in the PR doesn't feel solid/ maintainable as a
> > long term approach to development benchmarks. My personal plan is to
> focus
> > on enhancing luceneutil and our nightly benchmarks (
> > https://github.com/mikemccand/luceneutil) instead of putting a lot of
> > effort into the ann-benchmarks setup.
> >
> > Julie
> >
> > On Wed, May 26, 2021 at 1:04 PM Alex K <aklibisz@gmail.com> wrote:
> >
> >> Thanks Michael. IIRC, the thing that was taking so long was merging
> into a
> >> single segment. Is there already benchmarking code for HNSW
> >> available somewhere? I feel like I remember someone posting benchmarking
> >> results on one of the Jira tickets.
> >>
> >> Thanks,
> >> Alex
> >>
> >> On Wed, May 26, 2021 at 3:41 PM Michael Sokolov <msokolov@gmail.com>
> >> wrote:
> >>
> >>> This Java implementation will be slower than the C implementation. I
> >>> believe the algorithm is essentially the same, however this is new and
> >>> there may be bugs! I (and I think Julie had similar results IIRC)
> >>> measured something like 8x slower than hnswlib (using ann-benchmarks).
> >>> It is also surprising (to me) though how this varies with
> >>> differently-learned vectors so YMMV. I still think there is value
> >>> here, and look forward to improved performance, especially as JDK16
> >>> has some improved support for vectorized instructions.
> >>>
> >>> Please also understand that the HNSW algorithm interacts with Lucene's
> >>> segmented architecture in a tricky way. Because we build a graph
> >>> *per-segment* when flushing/merging, these must be rebuilt whenever
> >>> segments are merged. So your indexing performance can be heavily
> >>> influenced by how often you flush, as well as by your merge policy
> >>> settings. Also, when searching, there is a bigger than usual benefit
> >>> for searching across fewer segments, since the cost of searching an
> >>> HNSW graph scales more or less with log N (so searching a single large
> >>> graph is cheaper than searching the same documents divided among
> >>> smaller graphs). So I do recommend using a multithreaded collector in
> >>> order to get best latency with HNSW-based search. To get the best
> >>> indexing, and searching, performance, you should generally index as
> >>> large a number of documents as possible before flushing.
> >>>
> >>> -Mike
> >>>
> >>> On Wed, May 26, 2021 at 9:43 AM Michael Wechner
> >>> <michael.wechner@wyona.com> wrote:
> >>>> Hi Alex
> >>>>
> >>>> Thank you very much for your feedback and the various insights!
> >>>>
> >>>> On 26.05.21 at 04:41, Alex K wrote:
> >>>>> Hi Michael and others,
> >>>>>
> >>>>> Sorry just now getting back to you. For your three original
> >>>>> questions:
> >>>>> - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> >>>>> thorough response.
> >>>>> - As far as I know Opendistro is calling out to a C/C++ binary to run
> >>>>> the actual HNSW algorithm and store the HNSW part of the index. When
> >>>>> they implemented it about a year ago, Lucene did not have this yet. I
> >>>>> assume the Lucene HNSW implementation is solid, but would not be
> >>>>> surprised if it's slower than the C/C++ based implementation, given
> >>>>> the JVM has some disadvantages for these kinds of CPU-bound/number
> >>>>> crunching algos.
> >>>>> - I just haven't had much time to invest into my benchmark recently.
> >>>>> In particular, I got stuck on why indexing was taking extremely long.
> >>>>> Just indexing the vectors would have easily exceeded the current time
> >>>>> limitations in the ANN-benchmarks project. Maybe I had some naive
> >>>>> mistake in my implementation, but I profiled and dug pretty deep to
> >>>>> make it fast.
> >>>> I am trying to get Julie's branch running
> >>>>
> >>>> https://github.com/jtibshirani/lucene/tree/hnsw-bench
> >>>>
> >>>> Maybe this will help and is comparable
> >>>>
> >>>>
> >>>>> I'm assuming you want to use Lucene, but not necessarily via
> >>>>> Elasticsearch?
> >>>> Yes, for simpler setups I would like to use Lucene standalone, but
> >>>> for setups which have to scale I would use either Elasticsearch or
> >>>> Solr.
> >>>> Thanks
> >>>>
> >>>> Michael
> >>>>
> >>>>
> >>>>
> >>>>> If so, another option you might try for ANN is the elastiknn-models
> >>>>> and elastiknn-lucene packages. elastiknn-models contains the Locality
> >>>>> Sensitive Hashing implementations of ANN used by Elastiknn, and
> >>>>> elastiknn-lucene contains the Lucene queries used by Elastiknn. The
> >>>>> Lucene query is the MatchHashesAndScoreQuery
> >>>>> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
> >>>>> There are a couple of Scala test suites that show how to use it:
> >>>>> MatchHashesAndScoreQuerySuite
> >>>>> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>.
> >>>>> MatchHashesAndScoreQueryPerformanceSuite
> >>>>> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
> >>>>> This is all designed to work independently from Elasticsearch and is
> >>>>> published on Maven: com.klibisz.elastiknn / lucene
> >>>>> <https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
> >>>>> and com.klibisz.elastiknn / models
> >>>>> <https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
> >>>>> The tests are Scala but all of the implementation is in Java.
> >>>>>
> >>>>> Thanks,
> >>>>> Alex
> >>>>>
> >>>>
>
>