Mailing List Archive

Questions about the new vector API
Hello,

I've tried to catch up on the vector API and I have the following
questions. I've tried to read through discussions on JIRA first in case it
had been covered, but it's possible I missed some relevant ones.

Should VectorValues#search be on VectorReader instead? It felt a bit odd to
me to have the search logic on the iterator.

Do we need SearchStrategy.NONE? Documentation suggests that it allows
storing vectors but that NN search won't be supported. This looks like a
use-case for binary doc values to me? It also slightly caught me by
surprise due to the inconsistency with IndexOptions.NONE, which means "do
not index this field" (and likewise for DocValuesType.NONE), so I first
assumed that SearchStrategy.NONE also meant "do not index this field as a
vector".

While postings and doc-value formats allow per-field configuration via
PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors use a different
mechanism where VectorField#createHnswType sets attributes on the field
type that the vectors writer then reads. Should we have a
PerFieldVectorsFormat instead and configure these options via the vectors
format?
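
For illustration only, the per-field idea asked about above could look roughly like the following. This is a minimal sketch and not Lucene's API: the class names, the method getVectorsFormatForField (modeled on PerFieldPostingsFormat#getPostingsFormatForField) and the parameter names maxConn and beamWidth are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a vectors format that owns its HNSW parameters. */
final class HnswVectorsFormatSketch {
  final int maxConn;   // assumed name: maximum neighbors per graph node
  final int beamWidth; // assumed name: candidate list size during graph construction

  HnswVectorsFormatSketch(int maxConn, int beamWidth) {
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }
}

/** Hypothetical per-field dispatcher, by analogy with PerFieldPostingsFormat. */
final class PerFieldVectorsFormatSketch {
  private final Map<String, HnswVectorsFormatSketch> perField = new HashMap<>();
  private final HnswVectorsFormatSketch defaultFormat;

  PerFieldVectorsFormatSketch(HnswVectorsFormatSketch defaultFormat) {
    this.defaultFormat = defaultFormat;
  }

  /** Registers a format (and therefore its parameters) for one field. */
  void setFormatForField(String field, HnswVectorsFormatSketch format) {
    perField.put(field, format);
  }

  /** Returns the field's own format, or the default if none was registered. */
  HnswVectorsFormatSketch getVectorsFormatForField(String field) {
    return perField.getOrDefault(field, defaultFormat);
  }
}
```

With this shape the tuning parameters live with the format rather than in field-type attributes, so the writer never has to parse attributes back out of the field type.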

Should SearchStrategy constants avoid explicit references to HNSW? The rest
of the API seems to try to be agnostic of the way that NN search is
implemented. Could we make SearchStrategy only about the similarity metric
that is used for vectors? This particular point seems to have been
discussed on LUCENE-9322 <https://issues.apache.org/jira/browse/LUCENE-9322>
but I couldn't find the conclusion.
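
To make the question above concrete, here is a minimal sketch of an enum that describes only the similarity metric and carries no reference to HNSW or any other index structure. The name SimilarityMetric, its constants and the compare method are hypothetical and are not the existing SearchStrategy API.

```java
/** Hypothetical enum: says how two vectors are compared, not how they are indexed. */
enum SimilarityMetric {
  EUCLIDEAN {
    @Override
    float compare(float[] a, float[] b) {
      checkDims(a, b);
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        float d = a[i] - b[i];
        sum += d * d;
      }
      return sum; // squared Euclidean distance: smaller means closer
    }
  },
  DOT_PRODUCT {
    @Override
    float compare(float[] a, float[] b) {
      checkDims(a, b);
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
      }
      return sum; // dot product: larger means more similar
    }
  };

  /** Compares two vectors of equal dimension under this metric. */
  abstract float compare(float[] a, float[] b);

  /** Rejects vectors whose dimensions differ. */
  static void checkDims(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "dimension mismatch: " + a.length + " != " + b.length);
    }
  }
}
```

Under this split, the choice of index structure (HNSW, LSH, or none) would have to be configured somewhere else, for example on the format.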

Should we rename VectorFormat to VectorsFormat? This would be more
consistent with other file formats that use the plural, like
PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?

--
Adrien
Re: Questions about the new vector API
> Should we rename VectorFormat to VectorsFormat? This would be more
> consistent with other file formats that use the plural, like
> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?

+1 for using plural form for consistency - if we reconsider the names, how
about VectorValuesFormat so that it follows the naming convention for
XXXValues?

DocValuesFormat / DocValues
PointValuesFormat / PointValues
VectorValuesFormat / VectorValues (currently, VectorFormat / VectorValues)

> Should SearchStrategy constants avoid explicit references to HNSW?

Also +1 for decoupling HNSW-specific implementations from general vectors,
though I am not fully sure we can strictly separate the similarity
metrics and search algorithms for vectors.
LUCENE-9322 (unified vectors API) was resolved months ago; did it achieve
its goal? I haven't followed the issue in months because of my laziness...

Thanks,
Tomoko


On Tue, Mar 16, 2021 at 19:32 Adrien Grand <jpountz@gmail.com> wrote:

Re: Questions about the new vector API
Consistent plural naming makes sense to me. I think it ended up
singular because I am biased to avoid plural names unless there is a
useful distinction to be made. But consistency should trump my
predilections.

I think the reason we have search() on VectorValues is that we have
LeafReader.getVectorValues() (by analogy to the DocValues iterators),
but no way to access the VectorReader. Do you think we should also
have LeafReader.getVectorReader()? Today it's only on CodecReader.

Re: SearchStrategy.NONE; the idea is we support efficient access to
floating point values. Using BinaryDocValues for this will always
require an additional decoding step. I can see that the naming is
confusing there. The intent is that you index the vector values, but
build no additional indexing data structure. Also: the reason HNSW is
mentioned in these SearchStrategy enums is to make room for other
vector indexing approaches, like LSH. There was a lot of discussion
that we wanted an API that allowed for experimenting with other
techniques for indexing and searching vector values.
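
To illustrate the "additional decoding step" mentioned above: if float vectors are stored as opaque bytes, every read has to convert the bytes back to floats. The following is a minimal sketch, assuming the vector is packed as consecutive big-endian 4-byte floats; the class name and the encoding are assumptions for the example, not a description of what Lucene does.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: packs a float vector into bytes and unpacks it again. */
final class FloatVectorCodec {
  private FloatVectorCodec() {}

  /** Index-time step: serialize each float as 4 big-endian bytes. */
  static byte[] encode(float[] vector) {
    ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
    for (float v : vector) {
      buffer.putFloat(v);
    }
    return buffer.array();
  }

  /** Read-time step: this conversion is the extra work paid on every access. */
  static float[] decode(byte[] bytes) {
    if (bytes.length % Float.BYTES != 0) {
      throw new IllegalArgumentException("length is not a multiple of 4: " + bytes.length);
    }
    float[] vector = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes).asFloatBuffer().get(vector);
    return vector;
  }
}
```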

Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
but I think the situation is more akin to Points, where we have the
options on IndexableField. The metadata we store there (dimension and
score function) don't really result in different formats, i.e. code
paths for indexing and storage; they are more like parameters to the
format, in my mind. Perhaps the situation will look different when we
get our second vector indexing strategy (like LSH).


On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
<tomoko.uchida.1111@gmail.com> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Questions about the new vector API
Also, Tomoko, re: LUCENE-9322, did it succeed? I guess we won't know for
sure unless someone revives
https://issues.apache.org/jira/browse/LUCENE-9136 or something like
that.

On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <msokolov@gmail.com> wrote:
Re: Questions about the new vector API
There's also some good discussion on
https://issues.apache.org/jira/browse/LUCENE-9583 about random access
vs iterator pattern that never got fully resolved. We said we would
revisit after KNN (LUCENE-9004) landed, and now it has. The usage of
random access is pretty well-established there; maybe we should
abandon the iterator API since it is redundant (you can always iterate
over a random access API if you know the size)?
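
The redundancy argument above can be sketched as follows: given a random-access view with a known size, sequential iteration is just a loop over ordinals. The interface RandomAccessVectors and the helper class are hypothetical names for this example and are not the existing Lucene types.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical random-access view: vectors are addressed by ordinal in [0, size()). */
interface RandomAccessVectors {
  int size();

  float[] vectorValue(int ord);
}

final class VectorIteration {
  private VectorIteration() {}

  /** Array-backed implementation, used here only to make the sketch runnable. */
  static RandomAccessVectors of(float[][] data) {
    return new RandomAccessVectors() {
      @Override
      public int size() {
        return data.length;
      }

      @Override
      public float[] vectorValue(int ord) {
        return data[ord];
      }
    };
  }

  /** Derives sequential iteration from random access by counting ordinals up to size(). */
  static Iterator<float[]> iterate(RandomAccessVectors vectors) {
    return new Iterator<float[]>() {
      private int ord = 0;

      @Override
      public boolean hasNext() {
        return ord < vectors.size();
      }

      @Override
      public float[] next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return vectors.vectorValue(ord++);
      }
    };
  }
}
```

The reverse does not hold: a forward-only iterator cannot serve arbitrary ordinals without buffering or re-reading, which is why the two APIs are not symmetric.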

On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <msokolov@gmail.com> wrote:
Re: Questions about the new vector API
Where are the alternative algorithms that work on sequential iterators and
don't need random access?

Seems like these should be the ones we initially add to Lucene, and HNSW
should be put aside for now? (is it a toy, or can we do it without
jazillions of random accesses?)

On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <msokolov@gmail.com> wrote:

Re: Questions about the new vector API
ann-benchmarks.com maintains open benchmarks of a bunch of ANN
(approximate NN) algorithms. When we started this effort, HNSW was at
the top of the heap in most of the benchmarks.

On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcmuir@gmail.com> wrote:
Re: Questions about the new vector API [ In reply to ]
Maybe that is so, but we should factor in everything: large-scale
indexing, not requiring the whole data set to be in RAM, etc. Hey, it's Lucene!

Because HNSW has dominated the nightly benchmarks, I have been digging
through stacktraces and trying to figure out ways to make it work
efficiently, and I'm not sure what to do.
Especially merge is painful: it seems to cause a storm of page
faults/random accesses due to how it works, and I don't know yet how to
make it better.
It seems to rebuild the entire graph, spraying random accesses across a
"slow-wrapper" that binary searches each sub on every access.
I don't see any way to even amortize the pain with some kind of bulk merge
trick.

So if we find algorithms that scale better, I think we should give
preference to them. For example, algorithms that allow
per-segment/sequential index and merge.
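
The access pattern described above can be sketched roughly as follows. This is a hypothetical illustration, not Lucene's merge code: the class MergedVectorView and its methods are made up for this example. It contrasts random access through a merged view, where every lookup first binary-searches for the owning segment, with a sequential walk that needs no per-vector lookup.

```java
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical illustration only; this is not Lucene's merge code.
final class MergedVectorView {
    private final float[][][] subs; // subs[segment][localOrd] is one vector
    private final int[] starts;     // starts[i] is the first global ordinal of segment i
    int binarySearches = 0;         // counts segment lookups, for illustration

    MergedVectorView(float[][][] subs) {
        this.subs = subs;
        this.starts = new int[subs.length];
        int next = 0;
        for (int i = 0; i < subs.length; i++) {
            if (subs[i].length == 0) {
                throw new IllegalArgumentException("empty segments are not supported in this sketch");
            }
            starts[i] = next;
            next += subs[i].length;
        }
    }

    // Random access: each call must first locate the owning segment.
    float[] vectorAt(int globalOrd) {
        binarySearches++;
        int idx = Arrays.binarySearch(starts, globalOrd);
        if (idx < 0) {
            idx = -idx - 2; // last segment that starts at or before globalOrd
        }
        return subs[idx][globalOrd - starts[idx]];
    }

    // Sequential access: walk the segments in order, with no per-vector lookup.
    void forEachSequential(Consumer<float[]> consumer) {
        for (float[][] segment : subs) {
            for (float[] vector : segment) {
                consumer.accept(vector);
            }
        }
    }

    public static void main(String[] args) {
        MergedVectorView view = new MergedVectorView(new float[][][] {
            { {1f}, {2f} },
            { {3f} }
        });
        System.out.println(view.vectorAt(2)[0]);
        view.forEachSequential(v -> System.out.println(v[0]));
    }
}
```

A graph rebuild that visits neighbors in arbitrary order pays the lookup on every visit, whereas a format that can be merged by streaming each segment once would only need the sequential path.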

On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msokolov@gmail.com> wrote:

> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
> (approximate NN) algorithms. When we started this effort, HNSW was at
> the top of the heap in most of the benchmarks.
Re: Questions about the new vector API [ In reply to ]
Hello Mike,

On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msokolov@gmail.com> wrote:

> I think the reason we have search() on VectorValues is that we have
> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> but no way to access the VectorReader. Do you think we should also
> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>

I was more thinking of moving VectorValues#search to
LeafReader#searchNearestVectors or something along those lines. I agree
that VectorReader should only be exposed on CodecReader.


> Re: SearchStrategy.NONE; the idea is we support efficient access to
> floating point values. Using BinaryDocValues for this will always
> require an additional decoding step. I can see that the naming is
> confusing there. The intent is that you index the vector values, but
> no additional indexing data structure.


I wonder if things would be simpler if we were more opinionated and made
vectors specifically about nearest-neighbor search. Then we have a
clearer message, use vectors for NN search and doc values otherwise. As far
as I know, reinterpreting bytes as floats shouldn't add much overhead. The
main problem I know of is that the JVM won't auto-vectorize if you read
floats dynamically from a byte[], but this is something that should be
alleviated by the JDK vector API?
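
As a minimal sketch of the "reinterpret bytes as floats" step, assuming the vector is stored as packed little-endian 32-bit floats in a binary doc value (the byte order and the class name FloatVectorCodec are assumptions made for this example, not something the thread specifies):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper; the little-endian layout is an assumption of this sketch.
final class FloatVectorCodec {
    // Pack a float vector into bytes, as one might before storing it as a binary doc value.
    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        buf.asFloatBuffer().put(vector);
        return buf.array();
    }

    // The "additional decoding step": copy the bytes back out as floats.
    static float[] decode(byte[] bytes) {
        float[] out = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(out);
        return out;
    }

    // A plain scalar loop over float[]; this is the shape the JIT can usually vectorize.
    static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] stored = decode(encode(new float[] {1.5f, -2f, 0.25f}));
        System.out.println(dotProduct(stored, stored));
    }
}
```

Decoding once into a float[] and then scoring keeps the hot loop on a plain array; the cost in question is the copy in decode.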

> Also: the reason HNSW is
> mentioned in these SearchStrategy enums is to make room for other
> vector indexing approaches, like LSH. There was a lot of discussion
> that we wanted an API that allowed for experimenting with other
> techniques for indexing and searching vector values.
>

Actually this is the thing that feels odd to me: if we end up with
constants for both LSH and HNSW, then we are adding the requirement that
all vector formats must implement both LSH and HNSW as they will need to
support all SearchStrategy constants? Would it be possible to have a single
API and then two implementations of VectorsFormat, LSHVectorsFormat on the
one hand and HNSWVectorsFormat on the other hand?
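
One way to picture that split, as a rough sketch: the enum names only the similarity metric, and the search algorithm is chosen by picking a format implementation. All names below (SimilarityMetric, VectorsFormatSketch, ExactScanVectorsFormat) are hypothetical, and the exhaustive scan is only a placeholder marking where an HNSW-based or LSH-based format would plug in.

```java
import java.util.Arrays;
import java.util.Comparator;

// Metric only: nothing here says how the index is built. Higher score means more similar.
enum SimilarityMetric {
    EUCLIDEAN {
        float score(float[] a, float[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                float d = a[i] - b[i];
                sum += d * d;
            }
            return -sum; // negated squared distance
        }
    },
    DOT_PRODUCT {
        float score(float[] a, float[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * b[i];
            }
            return sum;
        }
    };

    abstract float score(float[] a, float[] b);
}

// Hypothetical single API; each algorithm would be a separate implementation.
interface VectorsFormatSketch {
    String name();
    int[] search(float[][] vectors, float[] query, int k, SimilarityMetric metric);
}

// Placeholder implementation: exhaustive scan, standing in for an HNSW or LSH format.
final class ExactScanVectorsFormat implements VectorsFormatSketch {
    public String name() {
        return "ExactScan";
    }

    public int[] search(float[][] vectors, float[] query, int k, SimilarityMetric metric) {
        Integer[] ords = new Integer[vectors.length];
        for (int i = 0; i < ords.length; i++) {
            ords[i] = i;
        }
        // Sort ordinals by descending score.
        Arrays.sort(ords, Comparator.comparingDouble((Integer o) -> -metric.score(vectors[o], query)));
        int n = Math.min(k, ords.length);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = ords[i];
        }
        return top;
    }
}
```

With this shape, adding a new algorithm means adding a class, not a new enum constant that every existing format would have to support.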

> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
> but I think the situation is more akin to Points, where we have the
> options on IndexableField. The metadata we store there (dimension and
> score function) don't really result in different formats, ie code
> paths for indexing and storage; they are more like parameters to the
> format, in my mind. Perhaps the situation will look different when we
> get our second vector indexing strategy (like LSH).


Having the dimension count and the score function on the FieldType actually
makes sense to me. I was more wondering whether maxConn and beamWidth
actually belong to the FieldType, or if they should be made constructor
arguments of Lucene90VectorFormat.
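
To make the two options concrete, here is a small sketch. HnswVectorsFormatSketch is a hypothetical stand-in, not the real Lucene90VectorFormat, and the attribute keys "maxConn" and "beamWidth" are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a vectors format; not the real Lucene90VectorFormat.
final class HnswVectorsFormatSketch {
    final int maxConn;
    final int beamWidth;

    // Option A: graph construction parameters are constructor arguments of the format.
    HnswVectorsFormatSketch(int maxConn, int beamWidth) {
        if (maxConn <= 0 || beamWidth <= 0) {
            throw new IllegalArgumentException("maxConn and beamWidth must be positive");
        }
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
    }

    // Option B: parameters travel as string attributes on the field type and the
    // writer parses them. The attribute keys are assumptions of this sketch.
    static HnswVectorsFormatSketch fromFieldAttributes(
            Map<String, String> attributes, HnswVectorsFormatSketch fallback) {
        int maxConn = attributes.containsKey("maxConn")
                ? Integer.parseInt(attributes.get("maxConn"))
                : fallback.maxConn;
        int beamWidth = attributes.containsKey("beamWidth")
                ? Integer.parseInt(attributes.get("beamWidth"))
                : fallback.beamWidth;
        return new HnswVectorsFormatSketch(maxConn, beamWidth);
    }

    public static void main(String[] args) {
        HnswVectorsFormatSketch viaConstructor = new HnswVectorsFormatSketch(16, 100);
        Map<String, String> attributes = new HashMap<>();
        attributes.put("beamWidth", "200");
        HnswVectorsFormatSketch viaAttributes = fromFieldAttributes(attributes, viaConstructor);
        System.out.println(viaAttributes.maxConn + " " + viaAttributes.beamWidth);
    }
}
```

Option A gives typed, validated parameters at codec construction time; option B keeps them next to the field definition at the cost of string parsing in the writer.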

--
Adrien
Re: Questions about the new vector API [ In reply to ]
If you click the GitHub link from here, it says in the README.md: "Focus on
datasets that fit in RAM. Out of core ANN could be the topic of a later
comparison."

But a quick Google search on some of these out-of-core ANN algorithms shows
some promise. Here is the summary of the first one I stumbled on:

We propose a novel approach to compute KNN on large datasets by leveraging
both disk and main memory efficiently. The main rationale of our approach
is to minimize random accesses to disk, maximize sequential accesses to
data and efficient usage of only the available memory. We evaluate our
approach on large datasets, in terms of performance and memory consumption.
The evaluation shows that our approach requires only 7% of the time needed
by an in-memory baseline to compute a KNN graph.

https://hal.inria.fr/hal-01336673/document

On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msokolov@gmail.com> wrote:

> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
> (approximate NN) algorithms. When we started this effort, HNSW was at
> the top of the heap in most of the benchmarks.
Re: Questions about the new vector API [ In reply to ]
Yeah, HNSW is problematic in a few ways: (1) merging is costly due to
the need to completely recreate the graph. (2) searching across a
segmented index sacrifices much of the performance benefit of HNSW
since the cost of searching HNSW graphs scales ~logarithmically with
the size of the graph, so splitting into multiple graphs and then
merge sorting results is pretty expensive. I guess the random access /
scan forward dynamic is another problematic area.
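
A back-of-the-envelope version of point (2), under an idealized model that is an assumption of this sketch: searching one graph of n nodes costs log2(n) units, segments are equally sized, and the final merge sort of results is ignored.

```java
// Idealized cost model; the log2(n) cost per graph is an assumption, not a measurement.
final class SegmentedSearchCost {
    // Cost of searching a single graph with n nodes.
    static double graphCost(long n) {
        return Math.log(n) / Math.log(2);
    }

    // Cost of searching every segment's graph when totalDocs is split evenly.
    static double segmentedCost(long totalDocs, int segments) {
        return segments * graphCost(totalDocs / segments);
    }

    public static void main(String[] args) {
        long docs = 1_000_000L;
        System.out.println("one graph:   " + graphCost(docs));
        System.out.println("10 segments: " + segmentedCost(docs, 10));
    }
}
```

Under this model, ten segments of 100,000 vectors cost 10 * log2(100,000) units against log2(1,000,000) for a single graph, roughly 8.3 times as much, which is why the per-segment structure gives up much of the logarithmic advantage.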

On Tue, Mar 16, 2021 at 1:28 PM Robert Muir <rcmuir@gmail.com> wrote:
>
> Maybe that is so, but we should factor in everything: such as large scale indexing, not requiring whole data set to be in RAM, etc. Hey, it's Lucene!
>
> Because HNSW has dominated the nightly benchmarks, I have been digging through stacktraces and trying to figure out ways to make it work efficiently, and I'm not sure what to do.
> Especially merge is painful: it seems to cause a storm of page faults/random accesses due to how it works, and I don't know yet how to make it better.
> It seems to rebuild the entire graph, spraying random accesses across a "slow-wrapper" that binary searches each sub on every access.
> I don't see any way to even amortize the pain with some kind of bulk merge trick.
>
> So if we find algorithms that scale better, I think we should lend a preference towards them. For example, algorithms that allow per-segment/sequential index and merge.
Re: Questions about the new vector API [ In reply to ]
> I was more thinking of moving VectorValues#search to LeafReader#searchNearestVectors or something along those lines. I agree that VectorReader should only be exposed on CodecReader.

Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to
add such visible API changes early on in the project.

> I wonder if things would be simpler if we were more opinionated and made vectors specifically about nearest-neighbor search. Then we have a clearer message, use vectors for NN search and doc values otherwise. As far as I know, reinterpreting bytes as floats shouldn't add much overhead. The main problem I know of is that the JVM won't auto-vectorize if you read floats dynamically from a byte[], but this is something that should be alleviated by the JDK vector API?

> Actually this is the thing that feels odd to me: if we end up with constants for both LSH and HNSW, then we are adding the requirement that all vector formats must implement both LSH and HNSW as they will need to support all SearchStrategy constants?

Hmm I see I didn't think this all the way through ... I guess I had it
in mind that there would probably only ever be a single format with
internal variants for different vector index types, but as I have
worked more with Lucene's index formats I see that is awkward, and I'm
certainly open to restructuring it in a more natural way. Similarly
for the NONE format - BinaryDocValues can be used for such
(non-searchable) vectors. Indeed we had such an implementation and
although we recently switched it to use the NONE format for
uniformity, it could easily be switched back.

Regarding the graph construction parameters (maxConn and beamWidth)
I'm not sure what the right approach is exactly. We struggled to find
the best API for this. I guess my concern about the PerField* approach
is (at least as I think I understand it) it needs to be configured in
code when creating a Codec. But we would like to be able to read such
parameters from a schema configuration. I think of them as in the same
spirit as an Analyzer. However I may not have fully appreciated the
intention of, or how to make the best use of PerField formats. It is
true we don't really need to write these parameters to the index;
we're free to use different values when merging for example.

-Mike

On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <jpountz@gmail.com> wrote:
>
> Hello Mike,
>
> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msokolov@gmail.com> wrote:
>>
>> I think the reason we have search() on VectorValues is that we have
>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> but no way to access the VectorReader. Do you think we should also
>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>
>
> I was more thinking of moving VectorValues#search to LeafReader#searchNearestVectors or something along those lines. I agree that VectorReader should only be exposed on CodecReader.
>
>>
>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>> floating point values. Using BinaryDocValues for this will always
>> require an additional decoding step. I can see that the naming is
>> confusing there. The intent is that you index the vector values, but
>> no additional indexing data structure.
>
>
> I wonder if things would be simpler if we were more opinionated and made vectors specifically about nearest-neighbor search. Then we have a clearer message, use vectors for NN search and doc values otherwise. As far as I know, reinterpreting bytes as floats shouldn't add much overhead. The main problem I know of is that the JVM won't auto-vectorize if you read floats dynamically from a byte[], but this is something that should be alleviated by the JDK vector API?
>
>> Also: the reason HNSW is
>> mentioned in these SearchStrategy enums is to make room for other
>> vector indexing approaches, like LSH. There was a lot of discussion
>> that we wanted an API that allowed for experimenting with other
>> techniques for indexing and searching vector values.
>
>
> Actually this is the thing that feels odd to me: if we end up with constants for both LSH and HNSW, then we are adding the requirement that all vector formats must implement both LSH and HNSW as they will need to support all SearchStrategy constants? Would it be possible to have a single API and then two implementations of VectorsFormat, LSHVectorsFormat on the one hand and HNSWVectorsFormat on the other hand?
>
>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> but I think the situation is more akin to Points, where we have the
>> options on IndexableField. The metadata we store there (dimension and
>> score function) don't really result in different formats, ie code
>> paths for indexing and storage; they are more like parameters to the
>> format, in my mind. Perhaps the situation will look different when we
>> get our second vector indexing strategy (like LSH).
>
>
> Having the dimension count and the score function on the FieldType actually makes sense to me. I was more wondering whether maxConn and beamWidth actually belong to the FieldType, or if they should be made constructor arguments of Lucene90VectorFormat.
>
> --
> Adrien
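
A minimal sketch of the "additional decoding step" discussed above, assuming vectors are stored as little-endian float bytes in a binary doc-values field. The names FloatDecoding, encode and decode are hypothetical and not part of Lucene; only java.nio is used.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, not a Lucene class.
final class FloatDecoding {
    // Encode a float vector as bytes, as one might before storing it in a binary field.
    static byte[] encode(float[] vector) {
        ByteBuffer buffer =
            ByteBuffer.allocate(vector.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        buffer.asFloatBuffer().put(vector);
        return buffer.array();
    }

    // The decoding step: reinterpret a slice of the stored bytes as floats.
    static float[] decode(byte[] bytes, int offset, int length) {
        float[] vector = new float[length / Float.BYTES];
        ByteBuffer.wrap(bytes, offset, length)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asFloatBuffer()
            .get(vector);
        return vector;
    }

    public static void main(String[] args) {
        float[] original = {0.25f, -1.5f, 3.0f};
        byte[] stored = encode(original);
        float[] decoded = decode(stored, 0, stored.length);
        System.out.println(Arrays.toString(decoded));
    }
}
```

Decoding this way copies the values into a float[]; whether that copy matters in practice would need to be measured.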

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Questions about the new vector API [ In reply to ]
Configuring the codec based on the schema is something that Solr does via
SchemaCodecFactory.
https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java

Would a similar approach work in your case?

On Tue, Mar 16, 2021 at 22:21, Michael Sokolov <msokolov@gmail.com> wrote:

> > I was more thinking of moving VectorValues#search to
> LeafReader#searchNearestVectors or something along those lines. I agree
> that VectorReader should only be exposed on CodecReader.
>
> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to
> add such visible API changes early on in the project.
>
> > I wonder if things would be simpler if we were more opinionated and made
> vectors specifically about nearest-neighbor search. Then we have a clearer
> message, use vectors for NN search and doc values otherwise. As far as I
> know, reinterpreting bytes as floats shouldn't add much overhead. The main
> problem I know of is that the JVM won't auto-vectorize if you read floats
> dynamically from a byte[], but this is something that should be alleviated
> by the JDK vector API?
>
> > Actually this is the thing that feels odd to me: if we end up with
> constants for both LSH and HNSW, then we are adding the requirement that
> all vector formats must implement both LSH and HNSW as they will need to
> support all SearchStrategy constants?
>
> Hmm I see I didn't think this all the way through ... I guess I had it
> in mind that there would probably only ever be a single format with
> internal variants for different vector index types, but as I have
> worked more with Lucene's index formats I see that is awkward, and I'm
> certainly open to restructuring it in a more natural way. Similarly
> for the NONE format - BinaryDocValues can be used for such
> (non-searchable) vectors. Indeed we had such an implementation and
> although we recently switched it to use the NONE format for
> uniformity, it could easily be switched back.
>
> Regarding the graph construction parameters (maxConn and beamWidth)
> I'm not sure what the right approach is exactly. We struggled to find
> the best API for this. I guess my concern about the PerField* approach
> is (at least as I think I understand it) it needs to be configured in
> code when creating a Codec. But we would like to be able to read such
> parameters from a schema configuration. I think of them as in the same
> spirit as an Analyzer. However I may not have fully appreciated the
> intention of, or how to make the best use of PerField formats. It is
> true we don't really need to write these parameters to the index;
> we're free to use different values when merging for example.
>
> -Mike
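
One way to picture the restructuring discussed above, as a sketch only: the similarity metric is kept separate from the index structure, so each format implements a single search approach and accepts any metric. All names here (Similarity, VectorsFormatSketch, ExactScanFormat) are hypothetical, and the brute-force scan merely stands in for a real HNSW or LSH implementation.

```java
import java.util.Arrays;
import java.util.List;

// Demo entry point; every type in this file is a hypothetical illustration, not Lucene API.
final class FormatSketchDemo {
    public static void main(String[] args) {
        List<float[]> vectors = Arrays.asList(
            new float[] {1f, 0f}, new float[] {0f, 1f}, new float[] {3f, 3f});
        float[] query = {0.9f, 0.1f};
        VectorsFormatSketch format = new ExactScanFormat();
        // The same format answers queries under either metric.
        System.out.println(format.nearest(vectors, query, Similarity.EUCLIDEAN));
        System.out.println(format.nearest(vectors, query, Similarity.DOT_PRODUCT));
    }
}

// The metric only says how two vectors are compared; higher scores mean "closer".
enum Similarity {
    EUCLIDEAN {
        @Override
        float score(float[] a, float[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                float d = a[i] - b[i];
                sum += d * d;
            }
            return -sum; // negated squared distance, so nearer vectors score higher
        }
    },
    DOT_PRODUCT {
        @Override
        float score(float[] a, float[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * b[i];
            }
            return sum;
        }
    };

    abstract float score(float[] a, float[] b);
}

// Each format owns exactly one index structure and takes the metric as a parameter.
interface VectorsFormatSketch {
    int nearest(List<float[]> vectors, float[] query, Similarity similarity);
}

// Brute-force stand-in for a real format such as an HNSW- or LSH-based one.
final class ExactScanFormat implements VectorsFormatSketch {
    @Override
    public int nearest(List<float[]> vectors, float[] query, Similarity similarity) {
        int best = -1;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < vectors.size(); i++) {
            float score = similarity.score(vectors.get(i), query);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }
}
```

With this split, adding a second index structure means adding a second implementation of the interface, without adding a constant that every existing format would have to support.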
Re: Questions about the new vector API [ In reply to ]
I see, right, we can create a Codec that applies the values taken from
the schema for a given field, sure, that works.
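
The idea of a Codec that applies per-field values taken from a schema could look roughly like the following. This is a sketch with hypothetical stand-in types (SchemaDrivenParams, HnswParams) rather than real Lucene or Solr classes, and the field names and default values are arbitrary placeholders.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a codec that resolves graph construction
// parameters per field from a schema; not a real Lucene or Solr class.
final class SchemaDrivenParams {
    private final Map<String, HnswParams> schema;
    private final HnswParams defaults;

    SchemaDrivenParams(Map<String, HnswParams> schema, HnswParams defaults) {
        this.schema = new HashMap<>(schema); // defensive copy
        this.defaults = defaults;
    }

    // A real codec would use the result to construct the vectors format for this field.
    HnswParams paramsForField(String field) {
        return schema.getOrDefault(field, defaults);
    }

    public static void main(String[] args) {
        Map<String, HnswParams> schema = new HashMap<>();
        schema.put("title_vector", new HnswParams(32, 200)); // would be parsed from a schema file
        SchemaDrivenParams codec = new SchemaDrivenParams(schema, new HnswParams(16, 100));
        System.out.println(codec.paramsForField("title_vector"));
        System.out.println(codec.paramsForField("body_vector")); // not in the schema: defaults apply
    }
}

// Hypothetical holder for the two parameters named in the thread.
final class HnswParams {
    final int maxConn;
    final int beamWidth;

    HnswParams(int maxConn, int beamWidth) {
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
    }

    @Override
    public String toString() {
        return "maxConn=" + maxConn + ", beamWidth=" + beamWidth;
    }
}
```

Because the parameters are resolved when the codec is built, nothing in this sketch needs to be written to the index.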

Re: Questions about the new vector API [ In reply to ]
I'm gonna toss out one last question while we are here: Is Vector(s)Format
really a good name to use?

We already have "term vectors API", and "vector highlighter" that uses it.
There's also the traditional "vector-space" scoring model. With Java 16, we
get a "Vector API" from Java itself, too.

I think the name is overloaded too many times already, and this one is the
straw that breaks the camel's back for me.

So I'm just throwing the idea out there: if this API is about ANN, maybe it
should claim its own name (NeighborsFormat?) that is less ambiguous.

Re: Questions about the new vector API [ In reply to ]
I think it makes sense to use "ANN" or "NearestNeighbor" for ANN-related
APIs; this may give them the proper level of abstraction.
On the other hand, it sounds slightly odd to me to use it as a Codec
name... Maybe we should use names that represent the data structure
instead of the methods/algorithms?
I'd propose "DenseVector" here if "Vector" is too obscure, but it is also
just an idea.

Tomoko


Re: Questions about the new vector API [ In reply to ]
I think the codec name is important, and the current naming does not seem
appropriate anyway.
I would like to try to reach consensus on that in LUCENE-9855
<https://issues.apache.org/jira/browse/LUCENE-9855>.



On Sat, Mar 20, 2021 at 16:04, Tomoko Uchida <tomoko.uchida.1111@gmail.com> wrote:

> I think it makes sense that we use "ANN" or "NearestNeighbor" for ann
> related APIs, this may give proper level of abstraction to them.
> On the other hand, it slightly sounds odd to me to use it as a Codec
> name... Maybe we should use names that represents its data structure,
> instead of methods/algorithms?
> I'd propose "DenseVector" here if "Vector" is too obscure, but it is also
> just an idea.
>
> Tomoko
>
>
> On Thu, Mar 18, 2021 at 5:34, Robert Muir <rcmuir@gmail.com> wrote:
>
>> I'm gonna toss out one last question while we are here: Is
>> Vector(s)Format really a good name to use?
>>
>> We already have "term vectors API", and "vector highlighter" that uses
>> it. There's also the traditional "vector-space" scoring model. With java
>> 16, we get a "vector api" from java itself, too.
>>
>> I think the name is overloaded too many times already, and this one is
>> the straw that breaks the camel's back for me.
>>
>> So I'm just throwing out there the idea: if this api is about ANN, maybe
>> it should claim its own name (NeighborsFormat?) that is less ambiguous.
>>
>> On Wed, Mar 17, 2021 at 9:51 AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>>
>>> I see, right, we can create a Codec that applies the values taken from
>>> the schema for a given field, sure, that works.
>>>
>>> On Wed, Mar 17, 2021 at 3:10 AM Adrien Grand <jpountz@gmail.com> wrote:
>>> >
>>> > Configuring the codec based on the schema is something that Solr does
>>> via SchemaCodecFactory.
>>> https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
>>> >
>>> > Would a similar approach work in your case?
>>> >
>>> > On Tue, Mar 16, 2021 at 22:21, Michael Sokolov <msokolov@gmail.com> wrote:
>>> >>
>>> >> > I was more thinking of moving VectorValues#search to
>>> LeafReader#searchNearestVectors or something along those lines. I agree
>>> that VectorReader should only be exposed on CodecReader.
>>> >>
>>> >> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to
>>> >> add such visible API changes early on in the project.
>>> >>
>>> >> > I wonder if things would be simpler if we were more opinionated and
>>> made vectors specifically about nearest-neighbor search. Then we have a
>>> clearer message, use vectors for NN search and doc values otherwise. As far
>>> as I know, reinterpreting bytes as floats shouldn't add much overhead. The
>>> main problem I know of is that the JVM won't auto-vectorize if you read
>>> floats dynamically from a byte[], but this is something that should be
>>> alleviated by the JDK vector API?
>>> >>
>>> >> > Actually this is the thing that feels odd to me: if we end up with
>>> constants for both LSH and HNSW, then we are adding the requirement that
>>> all vector formats must implement both LSH and HNSW as they will need to
>>> support all SearchStrategy constants?
>>> >>
>>> >> Hmm I see I didn't think this all the way through ... I guess I had it
>>> >> in mind that there would probably only ever be a single format with
>>> >> internal variants for different vector index types, but as I have
>>> >> worked more with Lucene's index formats I see that is awkward, and I'm
>>> >> certainly open to restructuring it in a more natural way. Similarly
>>> >> for the NONE format - BinaryDocValues can be used for such
>>> >> (non-searchable) vectors. Indeed we had such an implementation and
>>> >> although we recently switched it to use the NONE format for
>>> >> uniformity, it could easily be switched back.
>>> >>
>>> >> Regarding the graph construction parameters (maxConn and beamWidth)
>>> >> I'm not sure what the right approach is exactly. We struggled to find
>>> >> the best API for this. I guess my concern about the PerField* approach
>>> >> is (at least as I think I understand it) it needs to be configured in
>>> >> code when creating a Codec. But we would like to be able to read such
>>> >> parameters from a schema configuration. I think of them as in the same
>>> >> spirit as an Analyzer. However I may not have fully appreciated the
>>> >> intention of, or how to make the best use of PerField formats. It is
>>> >> true we don't really need to write these parameters to the index;
>>> >> we're free to use different values when merging for example.
>>> >>
>>> >> -Mike
>>> >>
>>> >> On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <jpountz@gmail.com>
>>> wrote:
>>> >> >
>>> >> > Hello Mike,
>>> >> >
>>> >> > On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>> >> >>
>>> >> >> I think the reason we have search() on VectorValues is that we have
>>> >> >> LeafReader.getVectorValues() (by analogy to the DocValues
>>> iterators),
>>> >> >> but no way to access the VectorReader. Do you think we should also
>>> >> >> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>> >> >
>>> >> >
>>> >> > I was more thinking of moving VectorValues#search to
>>> LeafReader#searchNearestVectors or something along those lines. I agree
>>> that VectorReader should only be exposed on CodecReader.
>>> >> >
>>> >> >>
>>> >> >> Re: SearchStrategy.NONE; the idea is we support efficient access to
>>> >> >> floating point values. Using BinaryDocValues for this will always
>>> >> >> require an additional decoding step. I can see that the naming is
>>> >> >> confusing there. The intent is that you index the vector values,
>>> but
>>> >> >> no additional indexing data structure.
>>> >> >
>>> >> >
>>> >> > I wonder if things would be simpler if we were more opinionated and
>>> made vectors specifically about nearest-neighbor search. Then we have a
>>> clearer message, use vectors for NN search and doc values otherwise. As far
>>> as I know, reinterpreting bytes as floats shouldn't add much overhead. The
>>> main problem I know of is that the JVM won't auto-vectorize if you read
>>> floats dynamically from a byte[], but this is something that should be
>>> alleviated by the JDK vector API?
>>> >> >
>>> >> >> Also: the reason HNSW is
>>> >> >> mentioned in these SearchStrategy enums is to make room for other
>>> >> >> vector indexing approaches, like LSH. There was a lot of discussion
>>> >> >> that we wanted an API that allowed for experimenting with other
>>> >> >> techniques for indexing and searching vector values.
>>> >> >
>>> >> >
>>> >> > Actually this is the thing that feels odd to me: if we end up with
>>> constants for both LSH and HNSW, then we are adding the requirement that
>>> all vector formats must implement both LSH and HNSW as they will need to
>>> support all SearchStrategy constants? Would it be possible to have a single
>>> API and then two implementations of VectorsFormat, LSHVectorsFormat on the
>>> one hand and HNSWVectorsFormat on the other hand?
>>> >> >
>>> >> >> Adrien, you made an analogy to PerFieldPostingsFormat (and
>>> DocValues),
>>> >> >> but I think the situation is more akin to Points, where we have the
>>> >> >> options on IndexableField. The metadata we store there (dimension
>>> and
>>> >> >> score function) don't really result in different formats, ie code
>>> >> >> paths for indexing and storage; they are more like parameters to
>>> the
>>> >> >> format, in my mind. Perhaps the situation will look different when
>>> we
>>> >> >> get our second vector indexing strategy (like LSH).
>>> >> >
>>> >> >
>>> >> > Having the dimension count and the score function on the FieldType
>>> actually makes sense to me. I was more wondering whether maxConn and
>>> beamWidth actually belong to the FieldType, or if they should be made
>>> constructor arguments of Lucene90VectorFormat.
>>> >> >
>>> >> > --
>>> >> > Adrien
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>
Re: Questions about the new vector API [ In reply to ]
Michael,

I have become interested in this area and have been doing a comparative
study of different KNN implementations and blogging about it.

Did you use nmslib for the HNSW implementation, or something else?

On Tue, 16 Mar 2021 at 22:47, Michael Sokolov <msokolov@gmail.com> wrote:

> Yeah, HNSW is problematic in a few ways: (1) merging is costly due to
> the need to completely recreate the graph. (2) searching across a
> segmented index sacrifices much of the performance benefit of HNSW
> since the cost of searching HNSW graphs scales ~logarithmically with
> the size of the graph, so splitting into multiple graphs and then
> merge sorting results is pretty expensive. I guess the random access /
> scan forward dynamic is another problematic area.
>
> On Tue, Mar 16, 2021 at 1:28 PM Robert Muir <rcmuir@gmail.com> wrote:
> >
> > Maybe that is so, but we should factor in everything: such as large
> scale indexing, not requiring whole data set to be in RAM, etc. Hey, it's
> Lucene!
> >
> > Because HNSW has dominated the nightly benchmarks, I have been digging
> through stacktraces and trying to figure out ways to make it work
> efficiently, and I'm not sure what to do.
> > Especially merge is painful: it seems to cause a storm of page
> faults/random accesses due to how it works, and I don't know yet how to
> make it better.
> > It seems to rebuild the entire graph, spraying random accesses across a
> "slow-wrapper" that binary searches each sub on every access.
> > I don't see any way to even amortize the pain with some kind of bulk
> merge trick.
> >
> > So if we find algorithms that scale better, I think we should lend a
> preference towards them. For example, algorithms that allow
> per-segment/sequential index and merge.
> >
> > On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >>
> >> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
> >> (approximate NN) algorithms. When we started this effort, HNSW was at
> >> the top of the heap in most of the benchmarks.
> >>
> >> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcmuir@gmail.com> wrote:
> >> >
> >> > Where are the alternative algorithms that work on sequential
> iterators and don't need random access?
> >> >
> >> > Seems like these should be the ones we initially add to lucene, and
> HNSW should be put aside for now? (is it a toy, or can we do it without
> jazillions of random accesses?)
> >> >
> >> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >> >>
> >> >> There's also some good discussion on
> >> >> https://issues.apache.org/jira/browse/LUCENE-9583 about random
> access
> >> >> vs iterator pattern that never got fully resolved. We said we would
> >> >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of
> >> >> random access is pretty well-established there, maybe we should
> >> >> abandon the iterator API since it is redundant (you can always
> iterate
> >> >> over a random access API if you know the size)?
> >> >>
> >> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >> >> >
> >> >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know
> for
> >> >> > sure unless someone revives
> >> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something
> like
> >> >> > that
> >> >> >
> >> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <
> msokolov@gmail.com> wrote:
> >> >> > >
> >> >> > > Consistent plural naming makes sense to me. I think it ended up
> >> >> > > singular because I am biased to avoid plural names unless there
> is a
> >> >> > > useful distinction to be made. But consistency should trump my
> >> >> > > predilections.
> >> >> > >
> >> >> > > I think the reason we have search() on VectorValues is that we
> have
> >> >> > > LeafReader.getVectorValues() (by analogy to the DocValues
> iterators),
> >> >> > > but no way to access the VectorReader. Do you think we should
> also
> >> >> > > have LeafReader.getVectorReader()? Today it's only on
> CodecReader.
> >> >> > >
> >> >> > > Re: SearchStrategy.NONE; the idea is we support efficient access
> to
> >> >> > > floating point values. Using BinaryDocValues for this will always
> >> >> > > require an additional decoding step. I can see that the naming is
> >> >> > > confusing there. The intent is that you index the vector values,
> but
> >> >> > > no additional indexing data structure. Also: the reason HNSW is
> >> >> > > mentioned in these SearchStrategy enums is to make room for other
> >> >> > > vector indexing approaches, like LSH. There was a lot of
> discussion
> >> >> > > that we wanted an API that allowed for experimenting with other
> >> >> > > techniques for indexing and searching vector values.
> >> >> > >
> >> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and
> DocValues),
> >> >> > > but I think the situation is more akin to Points, where we have
> the
> >> >> > > options on IndexableField. The metadata we store there
> (dimension and
> >> >> > > score function) don't really result in different formats, ie code
> >> >> > > paths for indexing and storage; they are more like parameters to
> the
> >> >> > > format, in my mind. Perhaps the situation will look different
> when we
> >> >> > > get our second vector indexing strategy (like LSH).
> >> >> > >
> >> >> > >
> >> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
> >> >> > > <tomoko.uchida.1111@gmail.com> wrote:
> >> >> > > >
> >> >> > > > > Should we rename VectorFormat to VectorsFormat? This would
> be more consistent with other file formats that use the plural, like
> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
> >> >> > > >
> >> >> > > > +1 for using plural form for consistency - if we reconsider
> the names, how about VectorValuesFormat so that it follows the naming
> convention for XXXValues?
> >> >> > > >
> >> >> > > > DocValuesFormat / DocValues
> >> >> > > > PointValuesFormat / PointValues
> >> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat /
> VectorValues)
> >> >> > > >
> >> >> > > > > Should SearchStrategy constants avoid explicit references to
> HNSW?
> >> >> > > >
> >> >> > > > Also +1 for decoupling HNSW specific implementations from
> general vectors, though I am not fully sure if we can strictly separate the
> similarity metrics and search algorithms for vectors.
> >> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago,
> does it achieve its goal? I haven't followed the issue in months because of
> my laziness...
> >> >> > > >
> >> >> > > > Thanks,
> >> >> > > > Tomoko
> >> >> > > >
> >> >> > > >
> >> >> > > > On Tue, Mar 16, 2021 at 19:32 Adrien Grand <jpountz@gmail.com> wrote:
> >> >> > > >>
> >> >> > > >> Hello,
> >> >> > > >>
> >> >> > > >> I've tried to catch up on the vector API and I have the
> following questions. I've tried to read through discussions on JIRA first
> in case it had been covered, but it's possible I missed some relevant ones.
> >> >> > > >>
> >> >> > > >> Should VectorValues#search be on VectorReader instead? It
> felt a bit odd to me to have the search logic on the iterator.
> >> >> > > >>
> >> >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that
> it allows storing vectors but that NN search won't be supported. This looks
> like a use-case for binary doc values to me? It also slightly caught me by
> surprise due to the inconsistency with IndexOptions.NONE, which means "do
> not index this field" (and likewise for DocValuesType.NONE), so I first
> assumed that SearchStrategy.NONE also meant "do not index this field as a
> vector".
> >> >> > > >>
> >> >> > > >> While postings and doc-value formats allow per-field
> configuration via PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors
> use a different mechanism where VectorField#createHnswType sets attributes
> on the field type that the vectors writer then reads. Should we have a
> PerFieldVectorsFormat instead and configure these options via the vectors
> format?
> >> >> > > >>
> >> >> > > >> Should SearchStrategy constants avoid explicit references to
> HNSW? The rest of the API seems to try to be agnostic of the way that NN
> search is implemented. Could we make SearchStrategy only about the
> similarity metric that is used for vectors? This particular point seems
> discussed on LUCENE-9322 but I couldn't find the conclusion.
> >> >> > > >>
> >> >> > > >> Should we rename VectorFormat to VectorsFormat? This would be
> more consistent with other file formats that use the plural, like
> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
> >> >> > > >>
> >> >> > > >> --
> >> >> > > >> Adrien

--
--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info
Re: Questions about the new vector API [ In reply to ]
Hi Dimitry, I worked initially from the papers cited in LUCENE-9004, which
I think is also what Tomoko was doing. Later I did refer to nmslib too.

Re: Questions about the new vector API [ In reply to ]
Ugh sorry for misspelling your name, I blame the phone!

On Sun, Mar 28, 2021, 6:50 AM Michael Sokolov <msokolov@gmail.com> wrote:

> Hi Dimitry, I worked initially from the papers cited in LUCENE-9004, which
> I think is also what Tomoko was doing. Later I did refer to nmslib too.
>
> On Sat, Mar 27, 2021, 6:01 AM Dmitry Kan <dmitry.lucene@gmail.com> wrote:
>
>> Michael,
>>
>> I have become interested in this area and have been doing a comparative
>> study of different KNN implementations and blogging about it.
>>
>> Did you use nmslib for the HNSW implementation, or something else?
>>
>> On Tue, 16 Mar 2021 at 22:47, Michael Sokolov <msokolov@gmail.com> wrote:
>>
>>> Yeah, HNSW is problematic in a few ways: (1) merging is costly due to
>>> the need to completely recreate the graph. (2) searching across a
>>> segmented index sacrifices much of the performance benefit of HNSW
>>> since the cost of searching HNSW graphs scales ~logarithmically with
>>> the size of the graph, so splitting into multiple graphs and then
>>> merge sorting results is pretty expensive. I guess the random access /
>>> scan forward dynamic is another problematic area.
>>>
>>> On Tue, Mar 16, 2021 at 1:28 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> >
>>> > Maybe that is so, but we should factor in everything: such as large
>>> scale indexing, not requiring whole data set to be in RAM, etc. Hey, it's
>>> Lucene!
>>> >
>>> > Because HNSW has dominated the nightly benchmarks, I have been digging
>>> through stacktraces and trying to figure out ways to make it work
>>> efficiently, and I'm not sure what to do.
>>> > Especially merge is painful: it seems to cause a storm of page
>>> faults/random accesses due to how it works, and I don't know yet how to
>>> make it better.
>>> > It seems to rebuild the entire graph, spraying random accesses across
>>> a "slow-wrapper" that binary searches each sub on every access.
>>> > I don't see any way to even amortize the pain with some kind of bulk
>>> merge trick.
>>> >
>>> > So if we find algorithms that scale better, I think we should lend a
>>> preference towards them. For example, algorithms that allow
>>> per-segment/sequential index and merge.
>>> >
>>> > On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>> >>
>>> >> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
>>> >> (approximate NN) algorithms. When we started this effort, HNSW was at
>>> >> the top of the heap in most of the benchmarks.
>>> >>
>>> >> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcmuir@gmail.com>
>>> wrote:
>>> >> >
>>> >> > Where are the alternative algorithms that work on sequential
>>> iterators and don't need random access?
>>> >> >
>>> >> > Seems like these should be the ones we initially add to lucene, and
>>> HNSW should be put aside for now? (is it a toy, or can we do it without
>>> jazillions of random accesses?)
>>> >> >
>>> >> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <
>>> msokolov@gmail.com> wrote:
>>> >> >>
>>> >> >> There's also some good discussion on
>>> >> >> https://issues.apache.org/jira/browse/LUCENE-9583 about random
>>> access
>>> >> >> vs iterator pattern that never got fully resolved. We said we would
>>> >> >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage
>>> of
>>> >> >> random access is pretty well-established there, maybe we should
>>> >> >> abandon the iterator API since it is redundant (you can always
>>> iterate
>>> >> >> over a random access API if you know the size)?
>>> >> >>
>>> >> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <
>>> msokolov@gmail.com> wrote:
>>> >> >> >
>>> >> >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't
>>> know for
>>> >> >> > sure unless someone revives
>>> >> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something
>>> like
>>> >> >> > that
>>> >> >> >
>>> >> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <
>>> msokolov@gmail.com> wrote:
>>> >> >> > >
>>> >> >> > > Consistent plural naming makes sense to me. I think it ended up
>>> >> >> > > singular because I am biased to avoid plural names unless
>>> there is a
>>> >> >> > > useful distinction to be made. But consistency should trump my
>>> >> >> > > predilections.
>>> >> >> > >
>>> >> >> > > I think the reason we have search() on VectorValues is that we
>>> have
>>> >> >> > > LeafReader.getVectorValues() (by analogy to the DocValues
>>> iterators),
>>> >> >> > > but no way to access the VectorReader. Do you think we should
>>> also
>>> >> >> > > have LeafReader.getVectorReader()? Today it's only on
>>> CodecReader.
>>> >> >> > >
>>> >> >> > > Re: SearchStrategy.NONE; the idea is we support efficient
>>> access to
>>> >> >> > > floating point values. Using BinaryDocValues for this will
>>> always
>>> >> >> > > require an additional decoding step. I can see that the naming
>>> is
>>> >> >> > > confusing there. The intent is that you index the vector
>>> values, but
>>> >> >> > > no additional indexing data structure. Also: the reason HNSW is
>>> >> >> > > mentioned in these SearchStrategy enums is to make room for
>>> other
>>> >> >> > > vector indexing approaches, like LSH. There was a lot of
>>> discussion
>>> >> >> > > that we wanted an API that allowed for experimenting with other
>>> >> >> > > techniques for indexing and searching vector values.
>>> >> >> > >
>>> >> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and
>>> DocValues),
>>> >> >> > > but I think the situation is more akin to Points, where we
>>> have the
>>> >> >> > > options on IndexableField. The metadata we store there
>>> (dimension and
>>> >> >> > > score function) don't really result in different formats, ie
>>> code
>>> >> >> > > paths for indexing and storage; they are more like parameters
>>> to the
>>> >> >> > > format, in my mind. Perhaps the situation will look different
>>> when we
>>> >> >> > > get our second vector indexing strategy (like LSH).
>>> >> >> > >
>>> >> >> > >
>>> >> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
>>> >> >> > > <tomoko.uchida.1111@gmail.com> wrote:
>>> >> >> > > >
>>> >> >> > > > > Should we rename VectorFormat to VectorsFormat? This would
>>> be more consistent with other file formats that use the plural, like
>>> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
>>> >> >> > > >
>>> >> >> > > > +1 for using plural form for consistency - if we reconsider
>>> the names, how about VectorValuesFormat so that it follows the naming
>>> convention for XXXValues?
>>> >> >> > > >
>>> >> >> > > > DocValuesFormat / DocValues
>>> >> >> > > > PointValuesFormat / PointValues
>>> >> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat /
>>> VectorValues)
>>> >> >> > > >
>>> >> >> > > > > Should SearchStrategy constants avoid explicit references
>>> to HNSW?
>>> >> >> > > >
>>> >> >> > > > Also +1 for decoupling HNSW specific implementations from
>>> general vectors, though I am not fully sure if we can strictly separate the
>>> similarity metrics and search algorithms for vectors.
>>> >> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago,
>>> does it achieve its goal? I haven't followed the issue in months because of
>>> my laziness...
>>> >> >> > > >
>>> >> >> > > > Thanks,
>>> >> >> > > > Tomoko
>>> >> >> > > >
>>> >> >> > > >
>>> >> >> > > > On Tue, Mar 16, 2021 at 19:32 Adrien Grand <jpountz@gmail.com> wrote:
>>> >> >> > > >>
>>> >> >> > > >> Hello,
>>> >> >> > > >>
>>> >> >> > > >> I've tried to catch up on the vector API and I have the
>>> following questions. I've tried to read through discussions on JIRA first
>>> in case it had been covered, but it's possible I missed some relevant ones.
>>> >> >> > > >>
>>> >> >> > > >> Should VectorValues#search be on VectorReader instead? It
>>> felt a bit odd to me to have the search logic on the iterator.
>>> >> >> > > >>
>>> >> >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that
>>> it allows storing vectors but that NN search won't be supported. This looks
>>> like a use-case for binary doc values to me? It also slightly caught me by
>>> surprise due to the inconsistency with IndexOptions.NONE, which means "do
>>> not index this field" (and likewise for DocValuesType.NONE), so I first
>>> assumed that SearchStrategy.NONE also meant "do not index this field as a
>>> vector".
>>> >> >> > > >>
>>> >> >> > > >> While postings and doc-value formats allow per-field
>>> configuration via PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors
>>> use a different mechanism where VectorField#createHnswType sets attributes
>>> on the field type that the vectors writer then reads. Should we have a
>>> PerFieldVectorsFormat instead and configure these options via the vectors
>>> format?
>>> >> >> > > >>
>>> >> >> > > >> Should SearchStrategy constants avoid explicit references
>>> to HNSW? The rest of the API seems to try to be agnostic of the way that NN
>>> search is implemented. Could we make SearchStrategy only about the
>>> similarity metric that is used for vectors? This particular point seems
>>> discussed on LUCENE-9322 but I couldn't find the conclusion.
>>> >> >> > > >>
>>> >> >> > > >> Should we rename VectorFormat to VectorsFormat? This would
>>> be more consistent with other file formats that use the plural, like
>>> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
>>> >> >> > > >>
>>> >> >> > > >> --
>>> >> >> > > >> Adrien
>>> >> >>
>>> >> >>
>>> ---------------------------------------------------------------------
>>> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >> >>
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>>
>> --
>> --
>> Dmitry Kan
>> Luke Toolbox: http://github.com/DmitryKey/luke
>> Blog: http://dmitrykan.blogspot.com
>> Twitter: http://twitter.com/dmitrykan
>> SemanticAnalyzer: www.semanticanalyzer.info
>>
>
Re: Questions about the new vector API [ In reply to ]
Hi Michael,

No worries about the misspelling. Living abroad, I see it happen
frequently, and frankly I've gotten used to it!

The main reason for asking which algorithms you used is that some of them
have hyper-parameters that can be exposed to users so that they can choose
their own recall / QPS tradeoff. Have you considered this? In particular,
nmslib offers such a capability:
https://opendistro.github.io/for-elasticsearch-docs/docs/knn/settings/#index-settings
(nmslib is integrated into Open Distro for Elasticsearch).

Also, if you are interested, please take a look at
https://towardsdatascience.com/speeding-up-bert-search-in-elasticsearch-750f1f34f455
which studies the impact of a KNN algorithm on indexing and querying
speeds, as well as on disk usage. I would be grateful for your feedback.

Is there documentation for how to use LUCENE-9004?

Best,
Dmitry

On Sun, 28 Mar 2021 at 13:51, Michael Sokolov <msokolov@gmail.com> wrote:

> Ugh sorry for misspelling your name, I blame the phone!
>
> On Sun, Mar 28, 2021, 6:50 AM Michael Sokolov <msokolov@gmail.com> wrote:
>
>> Hi Dimitry, I worked initially from the papers cited in LUCENE-9004,
>> which I think is also what Tomoko was doing. Later I did refer to nmslib
>> too.
>>
>> On Sat, Mar 27, 2021, 6:01 AM Dmitry Kan <dmitry.lucene@gmail.com> wrote:
>>
>>> Michael,
>>>
>>> I got some interest in this area and have been doing comparative study
>>> of different KNN implementations and blogging about it.
>>>
>>> Did you use nmslib for HNSW implementation or something else?
>>>
>>> On Tue, 16 Mar 2021 at 22:47, Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>>
>>>> Yeah, HNSW is problematic in a few ways: (1) merging is costly due to
>>>> the need to completely recreate the graph. (2) searching across a
>>>> segmented index sacrifices much of the performance benefit of HNSW
>>>> since the cost of searching HNSW graphs scales ~logarithmically with
>>>> the size of the graph, so splitting into multiple graphs and then
>>>> merge sorting results is pretty expensive. I guess the random access /
>>>> scan forward dynamic is another problematic area.

--
--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: https://dmitry-kan.medium.com/
Twitter: http://twitter.com/dmitrykan
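
Michael's remark quoted above, that a segmented index has to search one
graph per segment and then merge sort the results, can be made concrete
with a small sketch. The names below (SegmentMergeSketch, Hit, mergeTopK)
are hypothetical and are not Lucene classes; the per-segment lists stand in
for whatever each segment's graph search returns. The sketch covers only
the merge step, assuming higher scores are better.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of merging per-segment results; not Lucene code.
final class SegmentMergeSketch {

  /** One candidate: a doc id and a similarity score (assumed: higher is better). */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  private static final Comparator<Hit> BY_SCORE =
      Comparator.comparingDouble((Hit h) -> h.score);

  /** Keeps the k best hits across all per-segment result lists, best first. */
  static List<Hit> mergeTopK(List<List<Hit>> perSegment, int k) {
    // Min-heap on score: the weakest retained hit sits at the head.
    PriorityQueue<Hit> heap = new PriorityQueue<>(BY_SCORE);
    for (List<Hit> segmentHits : perSegment) {
      for (Hit hit : segmentHits) {
        heap.offer(hit);
        if (heap.size() > k) {
          heap.poll(); // drop the current weakest hit
        }
      }
    }
    List<Hit> merged = new ArrayList<>(heap);
    merged.sort(BY_SCORE.reversed());
    return merged;
  }
}
```

Every segment contributes its own candidates before the merge, so the
total work grows with the number of segments even though only k results
survive.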
Re: Questions about the new vector API [ In reply to ]
I created a JIRA about moving VectorValues#search to VectorReader:
https://issues.apache.org/jira/browse/LUCENE-9908.

On Tue, Mar 16, 2021 at 7:14 PM Adrien Grand <jpountz@gmail.com> wrote:

> Hello Mike,
>
> On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <msokolov@gmail.com>
> wrote:
>
>> I think the reason we have search() on VectorValues is that we have
>> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> but no way to access the VectorReader. Do you think we should also
>> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>>
>
> I was more thinking of moving VectorValues#search to
> LeafReader#searchNearestVectors or something along those lines. I agree
> that VectorReader should only be exposed on CodecReader.
>
>
>> Re: SearchStrategy.NONE; the idea is we support efficient access to
>> floating point values. Using BinaryDocValues for this will always
>> require an additional decoding step. I can see that the naming is
>> confusing there. The intent is that you index the vector values, but
>> no additional indexing data structure.
>
>
> I wonder if things would be simpler if we were more opinionated and made
> vectors specifically about nearest-neighbor search. Then we have a
> clearer message, use vectors for NN search and doc values otherwise. As far
> as I know, reinterpreting bytes as floats shouldn't add much overhead. The
> main problem I know of is that the JVM won't auto-vectorize if you read
> floats dynamically from a byte[], but this is something that should be
> alleviated by the JDK vector API?
>
>> Also: the reason HNSW is
>> mentioned in these SearchStrategy enums is to make room for other
>> vector indexing approaches, like LSH. There was a lot of discussion
>> that we wanted an API that allowed for experimenting with other
>> techniques for indexing and searching vector values.
>>
>
> Actually this is the thing that feels odd to me: if we end up with
> constants for both LSH and HNSW, then we are adding the requirement that
> all vector formats must implement both LSH and HNSW as they will need to
> support all SearchStrategy constants? Would it be possible to have a single
> API and then two implementations of VectorsFormat, LSHVectorsFormat on the
> one hand and HNSWVectorsFormat on the other hand?
>
>> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> but I think the situation is more akin to Points, where we have the
>> options on IndexableField. The metadata we store there (dimension and
>> score function) don't really result in different formats, ie code
>> paths for indexing and storage; they are more like parameters to the
>> format, in my mind. Perhaps the situation will look different when we
>> get our second vector indexing strategy (like LSH).
>
>
> Having the dimension count and the score function on the FieldType
> actually makes sense to me. I was more wondering whether maxConn
> and beamWidth actually belong to the FieldType, or if they should be made
> constructor arguments of Lucene90VectorFormat.
>
> --
> Adrien
>


--
Adrien
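
As a rough illustration of the "reinterpreting bytes as floats" point in
the message above, here is a minimal sketch of the extra decoding step
needed when a float vector is stored as a byte[], as it would be in a
binary doc values field. The class and method names are hypothetical, and
the little-endian, 4-bytes-per-value layout is an assumption made for the
example, not something the thread specifies.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of Lucene.
final class FloatVectorCodecSketch {

  /** Packs a float vector into bytes (assumed layout: little-endian, 4 bytes per value). */
  static byte[] encode(float[] vector) {
    ByteBuffer buffer =
        ByteBuffer.allocate(vector.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    // The float view writes through to the backing byte array.
    buffer.asFloatBuffer().put(vector);
    return buffer.array();
  }

  /** The decoding step: reinterprets each group of 4 bytes as one float. */
  static float[] decode(byte[] bytes) {
    float[] vector = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector);
    return vector;
  }
}
```

The decode is a single bulk copy whose cost is linear in the vector
length; whether that is negligible next to the distance computation is
the question the message leaves open.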
Re: Questions about the new vector API [ In reply to ]
I filed one more JIRA about the approach to specifying the NN algorithm:
https://issues.apache.org/jira/browse/LUCENE-9905.

As a summary, here's the current list of vector API issues we're tracking:
* Reconsider the format name (https://issues.apache.org/jira/browse/LUCENE-9855)
* Revise approach to specifying NN algorithm (https://issues.apache.org/jira/browse/LUCENE-9905)
* Move VectorValues#search to VectorReader (https://issues.apache.org/jira/browse/LUCENE-9908)
* Should VectorValues expose both iteration and random access? (https://issues.apache.org/jira/browse/LUCENE-9583)

Julie
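
One possible shape for the "approach to specifying NN algorithm" item,
following the suggestion made earlier in the thread: keep the dimension
and similarity on the field type, and let each format implement a single
algorithm whose tuning parameters are constructor arguments. Everything
below is a hypothetical sketch. None of these types exist in Lucene, and
numHashes is an invented LSH parameter; only the names maxConn and
beamWidth come from the thread.

```java
// Hypothetical sketch of the proposed split; none of these types exist in Lucene.
final class NnConfigSketch {

  /** What the field declares: only how vectors are compared. */
  enum Similarity { EUCLIDEAN, DOT_PRODUCT }

  /** Field-level metadata: dimension and similarity, no algorithm details. */
  static final class VectorFieldType {
    final int dimension;
    final Similarity similarity;

    VectorFieldType(int dimension, Similarity similarity) {
      if (dimension <= 0) {
        throw new IllegalArgumentException("dimension must be positive");
      }
      this.dimension = dimension;
      this.similarity = similarity;
    }
  }

  /** A format owns one NN algorithm and that algorithm's tuning knobs. */
  interface VectorsFormat {
    String algorithm();
  }

  static final class HnswVectorsFormat implements VectorsFormat {
    final int maxConn;
    final int beamWidth;

    HnswVectorsFormat(int maxConn, int beamWidth) {
      this.maxConn = maxConn;
      this.beamWidth = beamWidth;
    }

    @Override
    public String algorithm() {
      return "HNSW";
    }
  }

  static final class LshVectorsFormat implements VectorsFormat {
    final int numHashes; // invented parameter, for illustration only

    LshVectorsFormat(int numHashes) {
      this.numHashes = numHashes;
    }

    @Override
    public String algorithm() {
      return "LSH";
    }
  }
}
```

With this split, a format never has to support an algorithm it does not
implement, which is the concern raised about putting both LSH and HNSW
constants in SearchStrategy.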

Re: Questions about the new vector API [ In reply to ]
One last follow-up: Robert's comments got me interested in better
quantifying the performance against other approaches. I hooked up Lucene
HNSW to ann-benchmarks, a commonly used repo for benchmarking nearest
neighbor search libraries against large datasets. These two issues describe
the results:
* Search recall + QPS (https://issues.apache.org/jira/browse/LUCENE-9937)
* Index speed (https://issues.apache.org/jira/browse/LUCENE-9941)

Thanks Mike for your insights so far on the search ticket.

Julie
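
For readers unfamiliar with the "search recall" figure in the first
issue, here is a minimal sketch of recall as it is commonly defined for
approximate NN search: the fraction of the true nearest neighbors that
the approximate search returned. The class and method names are
hypothetical, and the exact neighbors are assumed to be distinct doc ids
computed separately, for example by brute force.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not taken from ann-benchmarks or Lucene.
final class RecallSketch {

  /**
   * Returns (number of exact neighbors found by the approximate search) / (number of
   * exact neighbors). Both arrays hold doc ids; exact ids are assumed to be distinct.
   */
  static double recall(int[] approximate, int[] exact) {
    if (exact.length == 0) {
      throw new IllegalArgumentException("exact neighbors must not be empty");
    }
    Set<Integer> truth = new HashSet<>();
    for (int doc : exact) {
      truth.add(doc);
    }
    int found = 0;
    for (int doc : approximate) {
      // remove() returns true only the first time, so a repeated id counts once
      if (truth.remove(doc)) {
        found++;
      }
    }
    return (double) found / exact.length;
  }
}
```

Recall is usually reported together with queries per second, since
tuning parameters typically trade one against the other.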

Re: Questions about the new vector API [ In reply to ]
Thanks for doing this benchmarking. But I am very concerned about whether
ann-benchmarks is a good one to be using.

While it may be hip/trendy/popular, it clearly states that it is only
for toy datasets that fit in RAM:
https://github.com/erikbern/ann-benchmarks/blob/master/README.md#principles



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
