This Java implementation will be slower than the C/C++ implementation. I
believe the algorithm is essentially the same; however, this is new and
there may be bugs! I (and I think Julie had similar results, IIRC)
measured something like 8x slower than hnswlib (using ann-benchmarks).
It is also surprising (to me) how much this varies with
differently-learned vectors, so YMMV. I still think there is value
here, and look forward to improved performance, especially as JDK 16
has some improved support for vectorized instructions.

Please also understand that the HNSW algorithm interacts with Lucene's
segmented architecture in a tricky way. Because we build a graph
*per-segment* when flushing/merging, the graphs must be rebuilt whenever
segments are merged. So your indexing performance can be heavily
influenced by how often you flush, as well as by your merge policy
settings. Also, when searching, there is a bigger-than-usual benefit
to searching across fewer segments, since the cost of searching an
HNSW graph scales more or less with log N (so searching a single large
graph is cheaper than searching the same documents divided among
smaller graphs). When you do have multiple segments, I recommend using
a multithreaded collector in order to get the best latency with
HNSW-based search. To get the best indexing and search performance,
you should generally index as large a number of documents as possible
before flushing.
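
To put rough numbers on the log N point, here is a back-of-the-envelope
sketch in plain Java. It does not use Lucene at all: the class name
HnswSegmentCost, its methods, and the pure log2(N) cost model are
assumptions made up for illustration. Real costs also depend on the graph
parameters and on the vectors themselves, so treat the output as a trend,
not a prediction.

```java
public class HnswSegmentCost {

    // Assumed cost model (not measured): searching one HNSW graph holding
    // n vectors costs about log2(n) abstract units, clamped to at least 1
    // so that tiny graphs still cost something.
    static double graphCost(long n) {
        if (n <= 1) {
            return 1.0;
        }
        return Math.max(1.0, Math.log(n) / Math.log(2));
    }

    // Total work when the same documents are split evenly across
    // `segments` graphs and every graph has to be searched.
    static double totalCost(long totalDocs, int segments) {
        return segments * graphCost(totalDocs / segments);
    }

    // Idealized latency when each segment is searched on its own thread.
    // Ignores thread scheduling and the cost of merging per-segment results.
    static double parallelLatency(long totalDocs, int segments) {
        return graphCost(totalDocs / segments);
    }

    public static void main(String[] args) {
        long totalDocs = 1_000_000L;
        for (int segments : new int[] {1, 10, 100}) {
            System.out.printf(
                "segments=%d totalCost=%.1f parallelLatency=%.1f%n",
                segments,
                totalCost(totalDocs, segments),
                parallelLatency(totalDocs, segments));
        }
    }
}
```

Under this model the total work grows almost linearly with the number of
segments, while searching segments in parallel only brings the latency
back to roughly the single-graph level. That is why fewer, larger
segments help more than usual here.
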
-Mike
On Wed, May 26, 2021 at 9:43 AM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> Hi Alex
>
> Thank you very much for your feedback and the various insights!
>
> On 26.05.21 at 04:41, Alex K wrote:
> > Hi Michael and others,
> >
> > Sorry just now getting back to you. For your three original questions:
> >
> > - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> > thorough response.
> > - As far as I know Opendistro is calling out to a C/C++ binary to run the
> > actual HNSW algorithm and store the HNSW part of the index. When they
> > implemented it about a year ago, Lucene did not have this yet. I assume the
> > Lucene HNSW implementation is solid, but would not be surprised if it's
> > slower than the C/C++ based implementation, given the JVM has some
> > disadvantages for these kinds of CPU-bound/number crunching algos.
> > - I just haven't had much time to invest into my benchmark recently. In
> > particular, I got stuck on why indexing was taking extremely long. Just
> > indexing the vectors would have easily exceeded the current time
> > limitations in the ANN-benchmarks project. Maybe I had some naive mistake
> > in my implementation, but I profiled and dug pretty deep to make it fast.
>
> I am trying to get Julie's branch running
>
> https://github.com/jtibshirani/lucene/tree/hnsw-bench
>
> Maybe this will help and is comparable
>
>
> >
> > I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?
>
> Yes, for simpler setups I would like to use Lucene standalone, but
> for setups which have to scale I would use either Elasticsearch or Solr.
>
> Thanks
>
> Michael
>
>
>
> > If so, another option you might try for ANN is the elastiknn-models
> > and elastiknn-lucene packages. elastiknn-models contains the Locality
> > Sensitive Hashing implementations of ANN used by Elastiknn, and
> > elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
> > query is the MatchHashesAndScoreQuery
> > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
> > There are a couple of Scala test suites that show how to use it:
> > MatchHashesAndScoreQuerySuite
> > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>.
> > MatchHashesAndScoreQueryPerformanceSuite
> > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
> > This is all designed to work independently from Elasticsearch and is
> > published on Maven: com.klibisz.elastiknn / lucene
> > <https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
> > and
> > com.klibisz.elastiknn / models
> > <https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
> > The tests are Scala but all of the implementation is in Java.
> >
> > Thanks,
> > Alex
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>