Mailing List Archive

HNSW questions
HI all, a couple questions on how HNSW works:

1. What is driving the requirement for two copies of the input vectors? It
looks like the RAVV implementations do shallow copies, so the vector from A
is the same that would be returned by B. What am I missing?

2. What is the intended behavior when adding identical vectors to a HNSW?
It looks like when I supply 10 identical vectors, they all get added to the
graph, but when I search for the nearest neighbors, I only get one of them
in the result set.

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: HNSW questions [ In reply to ]
These vector values have internal buffers they use to return the vectors.
In order to compare two vectors we need to use two independent sources so
that one doesn't overwrite this internal state when fetching the second
vector.

Sorry I forgot the second question and can't see it on my phone. Brb

On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com> wrote:

> HI all, a couple questions on how HNSW works:
>
> 1. What is driving the requirement for two copies of the input vectors?
> It looks like the RAVV implementations do shallow copies, so the vector
> from A is the same that would be returned by B. What am I missing?
>
> 2. What is the intended behavior when adding identical vectors to a HNSW?
> It looks like when I supply 10 identical vectors, they all get added to the
> graph, but when I search for the nearest neighbors, I only get one of them
> in the result set.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
Re: HNSW questions [ In reply to ]
Oh identical vectors. Basically unsupported. If you create a large index
filled with identical vectors it leads to pathological behavior. Seems to
be a weakness in the algorithm. If you have any idea how to improve that,
it would be welcome. But in real world scenarios, it doesn't seem to arise?

On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com> wrote:

> HI all, a couple questions on how HNSW works:
>
> 1. What is driving the requirement for two copies of the input vectors?
> It looks like the RAVV implementations do shallow copies, so the vector
> from A is the same that would be returned by B. What am I missing?
>
> 2. What is the intended behavior when adding identical vectors to a HNSW?
> It looks like when I supply 10 identical vectors, they all get added to the
> graph, but when I search for the nearest neighbors, I only get one of them
> in the result set.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
Re: HNSW questions [ In reply to ]
Thanks, Michael!

Looking at the paper by Malkov and Yashunin, it looks like the algorithm
allows for building the hnsw graph incrementally. Why does our
implementation require specifying all the vectors up front to
HnswGraphBuilder.create?

On Wed, Apr 19, 2023 at 3:04?AM Michael Sokolov <msokolov@gmail.com> wrote:

> These vector values have internal buffers they use to return the vectors.
> In order to compare two vectors we need to use two independent sources so
> that one doesn't overwrite this internal state when fetching the second
> vector.
>
> Sorry I forgot the second question and can't see it on my phone. Brb
>
> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com> wrote:
>
>> HI all, a couple questions on how HNSW works:
>>
>> 1. What is driving the requirement for two copies of the input vectors?
>> It looks like the RAVV implementations do shallow copies, so the vector
>> from A is the same that would be returned by B. What am I missing?
>>
>> 2. What is the intended behavior when adding identical vectors to a
>> HNSW? It looks like when I supply 10 identical vectors, they all get added
>> to the graph, but when I search for the nearest neighbors, I only get one
>> of them in the result set.
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: HNSW questions [ In reply to ]
That class is intended for use by the Lucene index writer - it's not
designed as a general purpose class for re-use outside that context.
And IndexWriter writes documents to disk in bulk.

On Wed, Apr 19, 2023 at 3:54?PM Jonathan Ellis <jbellis@gmail.com> wrote:
>
> Thanks, Michael!
>
> Looking at the paper by Malkov and Yashunin, it looks like the algorithm allows for building the hnsw graph incrementally. Why does our implementation require specifying all the vectors up front to HnswGraphBuilder.create?
>
> On Wed, Apr 19, 2023 at 3:04?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>
>> These vector values have internal buffers they use to return the vectors. In order to compare two vectors we need to use two independent sources so that one doesn't overwrite this internal state when fetching the second vector.
>>
>> Sorry I forgot the second question and can't see it on my phone. Brb
>>
>> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com> wrote:
>>>
>>> HI all, a couple questions on how HNSW works:
>>>
>>> 1. What is driving the requirement for two copies of the input vectors? It looks like the RAVV implementations do shallow copies, so the vector from A is the same that would be returned by B. What am I missing?
>>>
>>> 2. What is the intended behavior when adding identical vectors to a HNSW? It looks like when I supply 10 identical vectors, they all get added to the graph, but when I search for the nearest neighbors, I only get one of them in the result set.
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: HNSW questions [ In reply to ]
I still don't understand, because RAVectorValues copy method is "return
this." When is it important to have a true separate copy?

On Wed, Apr 19, 2023 at 3:04?AM Michael Sokolov <msokolov@gmail.com> wrote:

> These vector values have internal buffers they use to return the vectors.
> In order to compare two vectors we need to use two independent sources so
> that one doesn't overwrite this internal state when fetching the second
> vector.
>
> Sorry I forgot the second question and can't see it on my phone. Brb
>
> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com> wrote:
>
>> HI all, a couple questions on how HNSW works:
>>
>> 1. What is driving the requirement for two copies of the input vectors?
>> It looks like the RAVV implementations do shallow copies, so the vector
>> from A is the same that would be returned by B. What am I missing?
>>
>> 2. What is the intended behavior when adding identical vectors to a
>> HNSW? It looks like when I supply 10 identical vectors, they all get added
>> to the graph, but when I search for the nearest neighbors, I only get one
>> of them in the result set.
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: HNSW questions [ In reply to ]
It looks like I misunderstood how the Builder works, and the RAVV provided
to the constructor does not need to contain any values up front.
Specifically, Lucene95HnswVectorsWriter.FieldWriter adds vectors
incrementally to the RAVV that it gives to the builder as addValue is
called.

On Wed, Apr 19, 2023 at 1:37?PM Michael Sokolov <msokolov@gmail.com> wrote:

> That class is intended for use by the Lucene index writer - it's not
> designed as a general purpose class for re-use outside that context.
> And IndexWriter writes documents to disk in bulk.
>
> On Wed, Apr 19, 2023 at 3:54?PM Jonathan Ellis <jbellis@gmail.com> wrote:
> >
> > Thanks, Michael!
> >
> > Looking at the paper by Malkov and Yashunin, it looks like the algorithm
> allows for building the hnsw graph incrementally. Why does our
> implementation require specifying all the vectors up front to
> HnswGraphBuilder.create?
> >
> > On Wed, Apr 19, 2023 at 3:04?AM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >>
> >> These vector values have internal buffers they use to return the
> vectors. In order to compare two vectors we need to use two independent
> sources so that one doesn't overwrite this internal state when fetching the
> second vector.
> >>
> >> Sorry I forgot the second question and can't see it on my phone. Brb
> >>
> >> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com>
> wrote:
> >>>
> >>> HI all, a couple questions on how HNSW works:
> >>>
> >>> 1. What is driving the requirement for two copies of the input
> vectors? It looks like the RAVV implementations do shallow copies, so the
> vector from A is the same that would be returned by B. What am I missing?
> >>>
> >>> 2. What is the intended behavior when adding identical vectors to a
> HNSW? It looks like when I supply 10 identical vectors, they all get added
> to the graph, but when I search for the nearest neighbors, I only get one
> of them in the result set.
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> co-founder, http://www.datastax.com
> >>> @spyced
> >
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: HNSW questions [ In reply to ]
Right RAVectorValues is just fronting an array of vectors and it
doesn't have any intermediate storage or other state (like a file
pointer) so it can support many simultaneous callers. Other
implementations of the interface work differently; see
OffHeapByteVectorValues, which is representing vectors in the index
and implemented using I/O calls.

If you shared some context about your interest here, we might be able
to help you better.

On Thu, Apr 20, 2023 at 1:22?PM Jonathan Ellis <jbellis@gmail.com> wrote:
>
> It looks like I misunderstood how the Builder works, and the RAVV provided to the constructor does not need to contain any values up front. Specifically, Lucene95HnswVectorsWriter.FieldWriter adds vectors incrementally to the RAVV that it gives to the builder as addValue is called.
>
> On Wed, Apr 19, 2023 at 1:37?PM Michael Sokolov <msokolov@gmail.com> wrote:
>>
>> That class is intended for use by the Lucene index writer - it's not
>> designed as a general purpose class for re-use outside that context.
>> And IndexWriter writes documents to disk in bulk.
>>
>> On Wed, Apr 19, 2023 at 3:54?PM Jonathan Ellis <jbellis@gmail.com> wrote:
>> >
>> > Thanks, Michael!
>> >
>> > Looking at the paper by Malkov and Yashunin, it looks like the algorithm allows for building the hnsw graph incrementally. Why does our implementation require specifying all the vectors up front to HnswGraphBuilder.create?
>> >
>> > On Wed, Apr 19, 2023 at 3:04?AM Michael Sokolov <msokolov@gmail.com> wrote:
>> >>
>> >> These vector values have internal buffers they use to return the vectors. In order to compare two vectors we need to use two independent sources so that one doesn't overwrite this internal state when fetching the second vector.
>> >>
>> >> Sorry I forgot the second question and can't see it on my phone. Brb
>> >>
>> >> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com> wrote:
>> >>>
>> >>> HI all, a couple questions on how HNSW works:
>> >>>
>> >>> 1. What is driving the requirement for two copies of the input vectors? It looks like the RAVV implementations do shallow copies, so the vector from A is the same that would be returned by B. What am I missing?
>> >>>
>> >>> 2. What is the intended behavior when adding identical vectors to a HNSW? It looks like when I supply 10 identical vectors, they all get added to the graph, but when I search for the nearest neighbors, I only get one of them in the result set.
>> >>>
>> >>> --
>> >>> Jonathan Ellis
>> >>> co-founder, http://www.datastax.com
>> >>> @spyced
>> >
>> >
>> >
>> > --
>> > Jonathan Ellis
>> > co-founder, http://www.datastax.com
>> > @spyced
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: HNSW questions [ In reply to ]
Sure, I'm adding HNSW support to Cassandra. (Lots more detail on the
dev@cassandra list.)

HnswGraph says "The graph may be searched by multiple threads
concurrently," but OnHeapHnswGraph has a field cur that gets modified by
seek, which is called by Searcher. Bug, or outdated comment?

On Thu, Apr 20, 2023 at 1:45?PM Michael Sokolov <msokolov@gmail.com> wrote:

> Right RAVectorValues is just fronting an array of vectors and it
> doesn't have any intermediate storage or other state (like a file
> pointer) so it can support many simultaneous callers. Other
> implementations of the interface work differently; see
> OffHeapByteVectorValues, which is representing vectors in the index
> and implemented using I/O calls.
>
> If you shared some context about your interest here, we might be able
> to help you better.
>
> On Thu, Apr 20, 2023 at 1:22?PM Jonathan Ellis <jbellis@gmail.com> wrote:
> >
> > It looks like I misunderstood how the Builder works, and the RAVV
> provided to the constructor does not need to contain any values up front.
> Specifically, Lucene95HnswVectorsWriter.FieldWriter adds vectors
> incrementally to the RAVV that it gives to the builder as addValue is
> called.
> >
> > On Wed, Apr 19, 2023 at 1:37?PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >>
> >> That class is intended for use by the Lucene index writer - it's not
> >> designed as a general purpose class for re-use outside that context.
> >> And IndexWriter writes documents to disk in bulk.
> >>
> >> On Wed, Apr 19, 2023 at 3:54?PM Jonathan Ellis <jbellis@gmail.com>
> wrote:
> >> >
> >> > Thanks, Michael!
> >> >
> >> > Looking at the paper by Malkov and Yashunin, it looks like the
> algorithm allows for building the hnsw graph incrementally. Why does our
> implementation require specifying all the vectors up front to
> HnswGraphBuilder.create?
> >> >
> >> > On Wed, Apr 19, 2023 at 3:04?AM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >> >>
> >> >> These vector values have internal buffers they use to return the
> vectors. In order to compare two vectors we need to use two independent
> sources so that one doesn't overwrite this internal state when fetching the
> second vector.
> >> >>
> >> >> Sorry I forgot the second question and can't see it on my phone. Brb
> >> >>
> >> >> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com>
> wrote:
> >> >>>
> >> >>> HI all, a couple questions on how HNSW works:
> >> >>>
> >> >>> 1. What is driving the requirement for two copies of the input
> vectors? It looks like the RAVV implementations do shallow copies, so the
> vector from A is the same that would be returned by B. What am I missing?
> >> >>>
> >> >>> 2. What is the intended behavior when adding identical vectors to a
> HNSW? It looks like when I supply 10 identical vectors, they all get added
> to the graph, but when I search for the nearest neighbors, I only get one
> of them in the result set.
> >> >>>
> >> >>> --
> >> >>> Jonathan Ellis
> >> >>> co-founder, http://www.datastax.com
> >> >>> @spyced
> >> >
> >> >
> >> >
> >> > --
> >> > Jonathan Ellis
> >> > co-founder, http://www.datastax.com
> >> > @spyced
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: HNSW questions [ In reply to ]
I don't see anything to make sure vectors are unique in IndexingChain down
to FieldWriter, is that handled somewhere else? Or is it just up to the
user to make sure no documents end up with duplicate vectors?

On Wed, Apr 19, 2023 at 5:07?AM Michael Sokolov <msokolov@gmail.com> wrote:

> Oh identical vectors. Basically unsupported. If you create a large index
> filled with identical vectors it leads to pathological behavior. Seems to
> be a weakness in the algorithm. If you have any idea how to improve that,
> it would be welcome. But in real world scenarios, it doesn't seem to arise?
>
> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com> wrote:
>
>> HI all, a couple questions on how HNSW works:
>>
>> 1. What is driving the requirement for two copies of the input vectors?
>> It looks like the RAVV implementations do shallow copies, so the vector
>> from A is the same that would be returned by B. What am I missing?
>>
>> 2. What is the intended behavior when adding identical vectors to a
>> HNSW? It looks like when I supply 10 identical vectors, they all get added
>> to the graph, but when I search for the nearest neighbors, I only get one
>> of them in the result set.
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: HNSW questions [ In reply to ]
I think the concurrency is across segments or slices (= multiple small
segments)? I.e. "thread per slice" model, not multiple threads in one
slice.

But your cool PR would fix that limitation!
https://github.com/apache/lucene/pull/12254

Mike McCandless

http://blog.mikemccandless.com


On Sun, Apr 23, 2023 at 6:53?PM Jonathan Ellis <jbellis@gmail.com> wrote:

> Sure, I'm adding HNSW support to Cassandra. (Lots more detail on the
> dev@cassandra list.)
>
> HnswGraph says "The graph may be searched by multiple threads
> concurrently," but OnHeapHnswGraph has a field cur that gets modified by
> seek, which is called by Searcher. Bug, or outdated comment?
>
> On Thu, Apr 20, 2023 at 1:45?PM Michael Sokolov <msokolov@gmail.com>
> wrote:
>
>> Right RAVectorValues is just fronting an array of vectors and it
>> doesn't have any intermediate storage or other state (like a file
>> pointer) so it can support many simultaneous callers. Other
>> implementations of the interface work differently; see
>> OffHeapByteVectorValues, which is representing vectors in the index
>> and implemented using I/O calls.
>>
>> If you shared some context about your interest here, we might be able
>> to help you better.
>>
>> On Thu, Apr 20, 2023 at 1:22?PM Jonathan Ellis <jbellis@gmail.com> wrote:
>> >
>> > It looks like I misunderstood how the Builder works, and the RAVV
>> provided to the constructor does not need to contain any values up front.
>> Specifically, Lucene95HnswVectorsWriter.FieldWriter adds vectors
>> incrementally to the RAVV that it gives to the builder as addValue is
>> called.
>> >
>> > On Wed, Apr 19, 2023 at 1:37?PM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >>
>> >> That class is intended for use by the Lucene index writer - it's not
>> >> designed as a general purpose class for re-use outside that context.
>> >> And IndexWriter writes documents to disk in bulk.
>> >>
>> >> On Wed, Apr 19, 2023 at 3:54?PM Jonathan Ellis <jbellis@gmail.com>
>> wrote:
>> >> >
>> >> > Thanks, Michael!
>> >> >
>> >> > Looking at the paper by Malkov and Yashunin, it looks like the
>> algorithm allows for building the hnsw graph incrementally. Why does our
>> implementation require specifying all the vectors up front to
>> HnswGraphBuilder.create?
>> >> >
>> >> > On Wed, Apr 19, 2023 at 3:04?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >> >>
>> >> >> These vector values have internal buffers they use to return the
>> vectors. In order to compare two vectors we need to use two independent
>> sources so that one doesn't overwrite this internal state when fetching the
>> second vector.
>> >> >>
>> >> >> Sorry I forgot the second question and can't see it on my phone. Brb
>> >> >>
>> >> >> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com>
>> wrote:
>> >> >>>
>> >> >>> HI all, a couple questions on how HNSW works:
>> >> >>>
>> >> >>> 1. What is driving the requirement for two copies of the input
>> vectors? It looks like the RAVV implementations do shallow copies, so the
>> vector from A is the same that would be returned by B. What am I missing?
>> >> >>>
>> >> >>> 2. What is the intended behavior when adding identical vectors to
>> a HNSW? It looks like when I supply 10 identical vectors, they all get
>> added to the graph, but when I search for the nearest neighbors, I only get
>> one of them in the result set.
>> >> >>>
>> >> >>> --
>> >> >>> Jonathan Ellis
>> >> >>> co-founder, http://www.datastax.com
>> >> >>> @spyced
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Jonathan Ellis
>> >> > co-founder, http://www.datastax.com
>> >> > @spyced
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>> >
>> > --
>> > Jonathan Ellis
>> > co-founder, http://www.datastax.com
>> > @spyced
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
Re: HNSW questions [ In reply to ]
Yes, it's up to the application. And it is definitely a pathological
case when it happens; https://github.com/apache/lucene/issues/11626

On Tue, May 9, 2023 at 1:30?PM Jonathan Ellis <jbellis@gmail.com> wrote:
>
> I don't see anything to make sure vectors are unique in IndexingChain down to FieldWriter, is that handled somewhere else? Or is it just up to the user to make sure no documents end up with duplicate vectors?
>
> On Wed, Apr 19, 2023 at 5:07?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>
>> Oh identical vectors. Basically unsupported. If you create a large index filled with identical vectors it leads to pathological behavior. Seems to be a weakness in the algorithm. If you have any idea how to improve that, it would be welcome. But in real world scenarios, it doesn't seem to arise?
>>
>> On Tue, Apr 18, 2023, 10:55 PM Jonathan Ellis <jbellis@gmail.com> wrote:
>>>
>>> HI all, a couple questions on how HNSW works:
>>>
>>> 1. What is driving the requirement for two copies of the input vectors? It looks like the RAVV implementations do shallow copies, so the vector from A is the same that would be returned by B. What am I missing?
>>>
>>> 2. What is the intended behavior when adding identical vectors to a HNSW? It looks like when I supply 10 identical vectors, they all get added to the graph, but when I search for the nearest neighbors, I only get one of them in the result set.
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org