Mailing List Archive

Re: Questions about the new vector API
This is a good point, and I agree it’s not a perfect fit for Lucene
testing/development. The repository indeed focuses on datasets that can be
held in memory -- by only looking at the results of methods on
ann-benchmarks, we might be missing important considerations like how well
the method scales to indexes that are larger than main memory, how well it
fits with the rest of the search framework, etc.

I’m thinking of it as a supplementary tool that can give some performance
insights. It reports search accuracy in addition to speed, which luceneutil
doesn’t do (yet :)). The datasets aren’t huge but also aren’t 'toy
datasets' -- they range from 1-10 million vectors with dimension 100+, and
are based on pretty diverse + realistic data. In the future we could extend
luceneutil to cover some of this, like reporting recall on a few datasets.

So ann-benchmarks can be useful within a certain scope:
* Comparing our implementation against reference libraries to
double-check/debug the algorithm
* Testing on diverse datasets, with the ability to measure search speed
*and* recall (see the sketch below)
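
To make the recall measurement concrete, here's a rough sketch of the
computation (hand-written for illustration; the method and variable names
are made up, not actual ann-benchmarks or luceneutil code):

import java.util.HashSet;
import java.util.Set;

// recall@k: the fraction of the true k nearest neighbors that the
// approximate search returned, averaged over all queries.
static double recallAtK(int[][] trueNeighbors, int[][] approxNeighbors, int k) {
  double total = 0;
  for (int q = 0; q < trueNeighbors.length; q++) {
    Set<Integer> truth = new HashSet<>();
    for (int i = 0; i < k; i++) {
      truth.add(trueNeighbors[q][i]);
    }
    int found = 0;
    for (int i = 0; i < Math.min(k, approxNeighbors[q].length); i++) {
      if (truth.contains(approxNeighbors[q][i])) {
        found++;
      }
    }
    total += (double) found / k;
  }
  return total / trueNeighbors.length;
}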

Julie

On Wed, Apr 28, 2021 at 7:16 AM Robert Muir <rcmuir@gmail.com> wrote:

> Thanks for doing this benchmarking. But I am very concerned about
> whether ann-benchmarks is a good one to be using.
>
> While it may be hip/trendy/popular, it clearly states that it is only
> for toy datasets that fit in RAM:
> https://github.com/erikbern/ann-benchmarks/blob/master/README.md#principles
>
>
>
> On Tue, Apr 27, 2021 at 4:46 PM Julie Tibshirani <julietibs@gmail.com>
> wrote:
> >
> > One last follow-up: Robert's comments got me interested in better
> quantifying the performance against other approaches. I hooked up Lucene
> HNSW to ann-benchmarks, a commonly used repo for benchmarking nearest
> neighbor search libraries against large datasets. These two issues describe
> the results:
> > * Search recall + QPS (https://issues.apache.org/jira/browse/LUCENE-9937)
> > * Index speed (https://issues.apache.org/jira/browse/LUCENE-9941)
> >
> > Thanks Mike for your insights so far on the search ticket.
> >
> > Julie
> >
> > On Tue, Apr 6, 2021 at 12:37 PM Julie Tibshirani <julietibs@gmail.com>
> wrote:
> >>
> >> I filed one more JIRA about the approach to specifying the NN
> algorithm: https://issues.apache.org/jira/browse/LUCENE-9905.
> >>
> >> As a summary, here's the current list of vector API issues we're
> tracking:
> >> * Reconsider the format name (https://issues.apache.org/jira/browse/LUCENE-9855)
> >> * Revise approach to specifying NN algorithm (https://issues.apache.org/jira/browse/LUCENE-9905)
> >> * Move VectorValues#search to VectorReader (https://issues.apache.org/jira/browse/LUCENE-9908)
> >> * Should VectorValues expose both iteration and random access? (https://issues.apache.org/jira/browse/LUCENE-9583)
> >>
> >> Julie
> >>
> >> On Tue, Apr 6, 2021 at 5:31 AM Adrien Grand <jpountz@gmail.com> wrote:
> >>>
> >>> I created a JIRA about moving VectorValues#search to VectorReader:
> https://issues.apache.org/jira/browse/LUCENE-9908.
> >>>
> >>> On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpountz@gmail.com>
> wrote:
> >>>>
> >>>> Hello Mike,
> >>>>
> >>>> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >>>>>
> >>>>> I think the reason we have search() on VectorValues is that we have
> >>>>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> >>>>> but no way to access the VectorReader. Do you think we should also
> >>>>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
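> >>>>>
> >>>>> For reference, the iterator-style access I mean looks roughly like
> >>>>> this (typed from memory, so treat it as a sketch rather than exact
> >>>>> current code):
> >>>>>
> >>>>> VectorValues values = leafReader.getVectorValues("vector_field");
> >>>>> int doc;
> >>>>> while ((doc = values.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
> >>>>>   // vectorValue() returns the float[] for the current document
> >>>>>   float[] vector = values.vectorValue();
> >>>>> }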
> >>>>
> >>>>
> >>>> I was more thinking of moving VectorValues#search to
> LeafReader#searchNearestVectors or something along those lines. I agree
> that VectorReader should only be exposed on CodecReader.
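> >>>>
> >>>> To make that concrete, something along these lines (a strawman to
> >>>> anchor the discussion, not a worked-out proposal):
> >>>>
> >>>> // Hypothetical addition to LeafReader; the name and parameters are
> >>>> // illustrative only.
> >>>> public abstract TopDocs searchNearestVectors(
> >>>>     String field, float[] target, int k) throws IOException;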
> >>>>
> >>>>>
> >>>>> Re: SearchStrategy.NONE; the idea is we support efficient access to
> >>>>> floating point values. Using BinaryDocValues for this will always
> >>>>> require an additional decoding step. I can see that the naming is
> >>>>> confusing there. The intent is that you index the vector values, but
> >>>>> no additional indexing data structure.
> >>>>
> >>>>
> >>>> I wonder if things would be simpler if we were more opinionated and
> made vectors specifically about nearest-neighbor search. Then we have a
> clearer message: use vectors for NN search and doc values otherwise. As far
> as I know, reinterpreting bytes as floats shouldn't add much overhead. The
> main problem I know of is that the JVM won't auto-vectorize if you read
> floats dynamically from a byte[], but this is something that should be
> alleviated by the JDK vector API?
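> >>>>
> >>>> For example, reinterpreting the bytes is just a buffer view plus a
> >>>> bulk copy, something like this (a sketch, assuming the floats were
> >>>> encoded little-endian):
> >>>>
> >>>> import java.nio.ByteBuffer;
> >>>> import java.nio.ByteOrder;
> >>>>
> >>>> // Decode a binary doc value back into a float[] without parsing
> >>>> // individual floats.
> >>>> static float[] decodeVector(byte[] bytes, int dimension) {
> >>>>   float[] vector = new float[dimension];
> >>>>   ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
> >>>>       .asFloatBuffer().get(vector); // bulk copy into the array
> >>>>   return vector;
> >>>> }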
> >>>>
> >>>>> Also: the reason HNSW is
> >>>>> mentioned in these SearchStrategy enums is to make room for other
> >>>>> vector indexing approaches, like LSH. There was a lot of discussion
> >>>>> that we wanted an API that allowed for experimenting with other
> >>>>> techniques for indexing and searching vector values.
> >>>>
> >>>>
> >>>> Actually this is the thing that feels odd to me: if we end up with
> constants for both LSH and HNSW, then we are adding the requirement that
> all vector formats must implement both LSH and HNSW as they will need to
> support all SearchStrategy constants? Would it be possible to have a single
> API and then two implementations of VectorsFormat, LSHVectorsFormat on the
> one hand and HNSWVectorsFormat on the other hand?
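> >>>>
> >>>> Concretely, I'm picturing the same pattern we use for postings
> >>>> (hypothetical class names, by analogy with PerFieldPostingsFormat):
> >>>>
> >>>> // Hypothetical: choose the ANN implementation per field rather than
> >>>> // through SearchStrategy constants every format must support.
> >>>> VectorsFormat format = new PerFieldVectorsFormat() {
> >>>>   @Override
> >>>>   public VectorsFormat getVectorsFormatForField(String field) {
> >>>>     return "title_embedding".equals(field)
> >>>>         ? new LSHVectorsFormat()
> >>>>         : new HNSWVectorsFormat();
> >>>>   }
> >>>> };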
> >>>>
> >>>>> Adrien, you made an analogy to PerFieldPostingsFormat (and
> DocValues),
> >>>>> but I think the situation is more akin to Points, where we have the
> >>>>> options on IndexableField. The metadata we store there (dimension and
> >>>>> score function) don't really result in different formats, ie code
> >>>>> paths for indexing and storage; they are more like parameters to the
> >>>>> format, in my mind. Perhaps the situation will look different when we
> >>>>> get our second vector indexing strategy (like LSH).
> >>>>
> >>>>
> >>>> Having the dimension count and the score function on the FieldType
> actually makes sense to me. I was more wondering whether maxConn and
> beamWidth actually belong to the FieldType, or if they should be made
> constructor arguments of Lucene90VectorFormat.
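> >>>>
> >>>> i.e. something like this (hypothetical, just to show where the
> >>>> parameters would live; 16 and 100 are illustrative values):
> >>>>
> >>>> // Index-wide HNSW tuning knobs move onto the codec's format...
> >>>> VectorsFormat vectorsFormat =
> >>>>     new Lucene90VectorFormat(/* maxConn */ 16, /* beamWidth */ 100);
> >>>> // ...while the dimension count and score function stay on FieldType.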
> >>>>
> >>>> --
> >>>> Adrien
> >>>
> >>>
> >>>
> >>> --
> >>> Adrien
Re: Questions about the new vector API
maybe one practical way to make this benchmark more interesting would be
to starve the machine of RAM (by mlock()ing a huge amount of available
RAM and leaving only a small amount left), so that datasets don't fit
in RAM during benchmarking.

there is a little C program to do this in the luceneutil repo:
https://github.com/mikemccand/luceneutil/blob/master/src/python/ramhog.c

On Wed, Apr 28, 2021 at 1:16 PM Julie Tibshirani <julietibs@gmail.com> wrote:
>
> This is a good point, and I agree it’s not a perfect fit for Lucene testing/development. The repository indeed focuses on datasets that can be held in memory -- by only looking at the results of methods on ann-benchmarks, we might be missing important considerations like how well the method scales to indexes that are larger than main memory, how well it fits with the rest of the search framework, etc.

Re: Questions about the new vector API
Thanks Robert, I haven't had the chance to try this out yet but it's on my
radar.

Relevant to this discussion, the NeurIPS conference recently announced a
large-scale ANN benchmarking challenge:
https://medium.com/big-ann-benchmarks/neurips-2021-announcement-the-billion-scale-approximate-nearest-neighbor-search-challenge-72858f768f69.
It's great to see they're focused on datasets that don't fit in RAM:

"Despite the broad range of algorithms and approaches for ANNS, most
empirical evaluations of algorithms have focused on smaller datasets (1
million points). However, deploying recent algorithmic advances in ANNS
techniques for search, recommendation and ranking at scale requires support
at billion, trillion or larger scale. Barring a few recent papers, there is
limited consensus on which algorithms are effective at this scale... In the
T1 track, we will constrain the size of RAM to 64GB."

It will be several months until results are published, but it will be
interesting to keep an eye on it.

Julie


On Wed, Apr 28, 2021 at 10:24 AM Robert Muir <rcmuir@gmail.com> wrote:

> maybe one practical way to make this benchmark more interesting would be
> to starve the machine of RAM (by mlock()ing a huge amount of available
> RAM and leaving only a small amount left), so that datasets don't fit
> in RAM during benchmarking.
