Mailing List Archive: [Proposal] Remove max number of dimensions for KNN vectors

[Proposal] Remove max number of dimensions for KNN vectors

a.benedetti at sease

Mar 31, 2023, 2:43 AM

Post #1 of 99 (729 views)

I've been monitoring various discussions on Pull Requests about changing
the max number of dimensions allowed for Lucene HNSW vectors:

https://github.com/apache/lucene/pull/12191

https://github.com/apache/lucene/issues/11507

I would like to set up a discussion and potentially a vote about this.

I have seen some strong opposition from a few people but a majority of
favor in this direction.

*Motivation*

We were discussing in the Solr slack channel with Ishan Chattopadhyaya, Marcus
Eagan, and David Smiley about some neural search integrations in Solr:
https://github.com/openai/chatgpt-retrieval-plugin

*Proposal*

No hard limit at all.

As for many other Lucene areas, users will be allowed to push the system to
the limit of their resources and get terrible performances or crashes if
they want.

*What we are NOT discussing*

- Quality and scalability of the HNSW algorithm

- dimensionality reduction

- strategies to fit in an arbitrary self-imposed limit

*Benefits*

- users can use the models they want to generate vectors

- removal of an arbitrary limit that blocks some integrations

*Cons*

- if you go for vectors with high dimensions, there's no guarantee you get
acceptable performance for your use case

I want to keep it simple, right now in many Lucene areas, you can push the
system to not acceptable performance/ crashes.

For example, we don't limit the number of docs per index to an arbitrary
maximum of N, you push how many docs you like and if they are too much for
your system, you get terrible performance/crashes/whatever.

Limits caused by primitive java types will stay there behind the scene, and
that's acceptable, but I would prefer to not have arbitrary hard-coded ones
that may limit the software usability and integration which is extremely
important for a library.

I strongly encourage people to add benefits and cons, that I missed (I am
sure I missed some of them, but wanted to keep it simple)

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Mar 31, 2023, 5:38 AM

Post #2 of 99 (729 views)

Thanks Alessandro for summarizing the discussion below!

I understand that there is no clear reasoning re what is the best
embedding size, whereas I think heuristic approaches like described by
the following link can be helpful

https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter

Having said this, we see various embedding services providing higher
dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.

And it would be great if we could run benchmarks without having to
recompile Lucene ourselves.

Therefore I would to suggest to either increase the limit or even better
to remove the limit and add a disclaimer, that people should be aware of
possible crashes etc.

Thanks

Michael

Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>
> I've been monitoring various discussions on Pull Requests about
> changing the max number of dimensions allowed for Lucene HNSW vectors:
>
> https://github.com/apache/lucene/pull/12191
>
> https://github.com/apache/lucene/issues/11507
>
>
> I would like to set up a discussion and potentially a vote about this.
>
> I have seen some strong opposition from a few people but a majority of
> favor in this direction.
>
>
> *Motivation*
>
> We were discussing in the Solr slack channel with Ishan
> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural
> search integrations in Solr:
> https://github.com/openai/chatgpt-retrieval-plugin
>
>
> *Proposal*
>
> No hard limit at all.
>
> As for many other Lucene areas, users will be allowed to push the
> system to the limit of their resources and get terrible performances
> or crashes if they want.
>
>
> *What we are NOT discussing*
>
> - Quality and scalability of the HNSW algorithm
>
> - dimensionality reduction
>
> - strategies to fit in an arbitrary self-imposed limit
>
>
> *Benefits*
>
> - users can use the models they want to generate vectors
>
> - removal of an arbitrary limit that blocks some integrations
>
>
> *Cons*
>
> ?- if you go for vectors with high dimensions, there's no guarantee
> you get acceptable performance for your use case
>
> *
> *
>
> *
> *
>
> I want to keep it simple, right now in many Lucene areas, you can push
> the system to not acceptable performance/ crashes.
>
> For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they
> are too much for your system, you get terrible
> performance/crashes/whatever.
>
>
> Limits caused by primitive java types will stay there behind the
> scene, and that's acceptable, but I would prefer to not have arbitrary
> hard-coded ones that may limit the software usability and integration
> which is extremely important for a library.
>
>
> I strongly encourage people to add benefits and cons, that I missed (I
> am sure I missed some of them, but wanted to keep it simple)
>
>
> Cheers
>
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease*?- Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd>?| Twitter
> <https://twitter.com/seaseltd>?| Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ>?| Github
> <https://github.com/seaseltd>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jpountz at gmail

Mar 31, 2023, 5:45 AM

Post #3 of 99 (729 views)

I'm supportive of bumping the limit on the maximum dimension for
vectors to something that is above what the majority of users need,
but I'd like to keep a limit. We have limits for other things like the
max number of docs per index, the max term length, the max number of
dimensions of points, etc. and there are a few things that we don't
have limits on that I wish we had limits on. These limits allow us to
better tune our data structures, prevent overflows, help ensure we
have good test coverage, etc.

That said, these other limits we have in place are quite high. E.g.
the 32kB term limit, nobody would ever type a 32kB term in a text box.
Likewise for the max of 8 dimensions for points: a segment cannot
possibly have 2 splits per dimension on average if it doesn't have
512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
than 8 would likely defeat the point of indexing. In contrast, our
limit on the number of dimensions of vectors seems to be under what
some users would like, and while I understand the performance argument
against bumping the limit, it doesn't feel to me like something that
would be so bad that we need to prevent users from using numbers of
dimensions in the low thousands, e.g. top-k KNN searches would still
look at a very small subset of the full dataset.

So overall, my vote would be to bump the limit to 2048 as suggested by
Mayya on the issue that you linked.

On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> Thanks Alessandro for summarizing the discussion below!
>
> I understand that there is no clear reasoning re what is the best embedding size, whereas I think heuristic approaches like described by the following link can be helpful
>
> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>
> Having said this, we see various embedding services providing higher dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>
> And it would be great if we could run benchmarks without having to recompile Lucene ourselves.
>
> Therefore I would to suggest to either increase the limit or even better to remove the limit and add a disclaimer, that people should be aware of possible crashes etc.
>
> Thanks
>
> Michael
>
>
>
>
> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>
>
> I've been monitoring various discussions on Pull Requests about changing the max number of dimensions allowed for Lucene HNSW vectors:
>
> https://github.com/apache/lucene/pull/12191
>
> https://github.com/apache/lucene/issues/11507
>
>
> I would like to set up a discussion and potentially a vote about this.
>
> I have seen some strong opposition from a few people but a majority of favor in this direction.
>
>
> Motivation
>
> We were discussing in the Solr slack channel with Ishan Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>
>
> Proposal
>
> No hard limit at all.
>
> As for many other Lucene areas, users will be allowed to push the system to the limit of their resources and get terrible performances or crashes if they want.
>
>
> What we are NOT discussing
>
> - Quality and scalability of the HNSW algorithm
>
> - dimensionality reduction
>
> - strategies to fit in an arbitrary self-imposed limit
>
>
> Benefits
>
> - users can use the models they want to generate vectors
>
> - removal of an arbitrary limit that blocks some integrations
>
>
> Cons
>
> - if you go for vectors with high dimensions, there's no guarantee you get acceptable performance for your use case
>
>
>
> I want to keep it simple, right now in many Lucene areas, you can push the system to not acceptable performance/ crashes.
>
> For example, we don't limit the number of docs per index to an arbitrary maximum of N, you push how many docs you like and if they are too much for your system, you get terrible performance/crashes/whatever.
>
>
> Limits caused by primitive java types will stay there behind the scene, and that's acceptable, but I would prefer to not have arbitrary hard-coded ones that may limit the software usability and integration which is extremely important for a library.
>
>
> I strongly encourage people to add benefits and cons, that I missed (I am sure I missed some of them, but wanted to keep it simple)
>
>
> Cheers
>
> --------------------------
> Alessandro Benedetti
> Director @ Sease Ltd.
> Apache Lucene/Solr Committer
> Apache Solr PMC Member
>
> e-mail: a.benedetti@sease.io
>
>
> Sease - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io
> LinkedIn | Twitter | Youtube | Github
>
>

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Mar 31, 2023, 7:12 AM

Post #4 of 99 (729 views)

OpenAI reduced their size to 1536 dimensions

https://openai.com/blog/new-and-improved-embedding-model

so 2048 would work :-)

but other services do provide also higher dimensions with sometimes
slightly better accuracy

Thanks

Michael

Am 31.03.23 um 14:45 schrieb Adrien Grand:
> I'm supportive of bumping the limit on the maximum dimension for
> vectors to something that is above what the majority of users need,
> but I'd like to keep a limit. We have limits for other things like the
> max number of docs per index, the max term length, the max number of
> dimensions of points, etc. and there are a few things that we don't
> have limits on that I wish we had limits on. These limits allow us to
> better tune our data structures, prevent overflows, help ensure we
> have good test coverage, etc.
>
> That said, these other limits we have in place are quite high. E.g.
> the 32kB term limit, nobody would ever type a 32kB term in a text box.
> Likewise for the max of 8 dimensions for points: a segment cannot
> possibly have 2 splits per dimension on average if it doesn't have
> 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
> than 8 would likely defeat the point of indexing. In contrast, our
> limit on the number of dimensions of vectors seems to be under what
> some users would like, and while I understand the performance argument
> against bumping the limit, it doesn't feel to me like something that
> would be so bad that we need to prevent users from using numbers of
> dimensions in the low thousands, e.g. top-k KNN searches would still
> look at a very small subset of the full dataset.
>
> So overall, my vote would be to bump the limit to 2048 as suggested by
> Mayya on the issue that you linked.
>
> On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>> Thanks Alessandro for summarizing the discussion below!
>>
>> I understand that there is no clear reasoning re what is the best embedding size, whereas I think heuristic approaches like described by the following link can be helpful
>>
>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>
>> Having said this, we see various embedding services providing higher dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>>
>> And it would be great if we could run benchmarks without having to recompile Lucene ourselves.
>>
>> Therefore I would to suggest to either increase the limit or even better to remove the limit and add a disclaimer, that people should be aware of possible crashes etc.
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>>
>> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>
>>
>> I've been monitoring various discussions on Pull Requests about changing the max number of dimensions allowed for Lucene HNSW vectors:
>>
>> https://github.com/apache/lucene/pull/12191
>>
>> https://github.com/apache/lucene/issues/11507
>>
>>
>> I would like to set up a discussion and potentially a vote about this.
>>
>> I have seen some strong opposition from a few people but a majority of favor in this direction.
>>
>>
>> Motivation
>>
>> We were discussing in the Solr slack channel with Ishan Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>>
>>
>> Proposal
>>
>> No hard limit at all.
>>
>> As for many other Lucene areas, users will be allowed to push the system to the limit of their resources and get terrible performances or crashes if they want.
>>
>>
>> What we are NOT discussing
>>
>> - Quality and scalability of the HNSW algorithm
>>
>> - dimensionality reduction
>>
>> - strategies to fit in an arbitrary self-imposed limit
>>
>>
>> Benefits
>>
>> - users can use the models they want to generate vectors
>>
>> - removal of an arbitrary limit that blocks some integrations
>>
>>
>> Cons
>>
>> - if you go for vectors with high dimensions, there's no guarantee you get acceptable performance for your use case
>>
>>
>>
>> I want to keep it simple, right now in many Lucene areas, you can push the system to not acceptable performance/ crashes.
>>
>> For example, we don't limit the number of docs per index to an arbitrary maximum of N, you push how many docs you like and if they are too much for your system, you get terrible performance/crashes/whatever.
>>
>>
>> Limits caused by primitive java types will stay there behind the scene, and that's acceptable, but I would prefer to not have arbitrary hard-coded ones that may limit the software usability and integration which is extremely important for a library.
>>
>>
>> I strongly encourage people to add benefits and cons, that I missed (I am sure I missed some of them, but wanted to keep it simple)
>>
>>
>> Cheers
>>
>> --------------------------
>> Alessandro Benedetti
>> Director @ Sease Ltd.
>> Apache Lucene/Solr Committer
>> Apache Solr PMC Member
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> Sease - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io
>> LinkedIn | Twitter | Youtube | Github
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Mar 31, 2023, 8:56 AM

Post #5 of 99 (729 views)

I am also curious what would be the worst-case scenario if we remove the
constant at all (so automatically the limit becomes the Java
Integer.MAX_VALUE).
i.e.
right now if you exceed the limit you get:

> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
> throw new IllegalArgumentException(
> "cannot index vectors with dimension greater than " + ByteVectorValues.
> MAX_DIMENSIONS);
> }

in relation to:

> These limits allow us to
> better tune our data structures, prevent overflows, help ensure we
> have good test coverage, etc.

I agree 100% especially for typing stuff properly and avoiding resource
waste here and there, but I am not entirely sure this is the case for the
current implementation i.e. do we have optimizations in place that assume
the max dimension to be 1024?
If I missed that (and I likely have), I of course suggest the contribution
should not just blindly remove the limit, but do it appropriately.
I am not in favor of just doubling it as suggested by some people, I would
ideally prefer a solution that remains there to a decent extent, rather
than having to modifying it anytime someone requires a higher limit.

Cheers

--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wechner@wyona.com>
wrote:

> OpenAI reduced their size to 1536 dimensions
>
> https://openai.com/blog/new-and-improved-embedding-model
>
> so 2048 would work :-)
>
> but other services do provide also higher dimensions with sometimes
> slightly better accuracy
>
> Thanks
>
> Michael
>
>
> Am 31.03.23 um 14:45 schrieb Adrien Grand:
> > I'm supportive of bumping the limit on the maximum dimension for
> > vectors to something that is above what the majority of users need,
> > but I'd like to keep a limit. We have limits for other things like the
> > max number of docs per index, the max term length, the max number of
> > dimensions of points, etc. and there are a few things that we don't
> > have limits on that I wish we had limits on. These limits allow us to
> > better tune our data structures, prevent overflows, help ensure we
> > have good test coverage, etc.
> >
> > That said, these other limits we have in place are quite high. E.g.
> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
> > Likewise for the max of 8 dimensions for points: a segment cannot
> > possibly have 2 splits per dimension on average if it doesn't have
> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
> > than 8 would likely defeat the point of indexing. In contrast, our
> > limit on the number of dimensions of vectors seems to be under what
> > some users would like, and while I understand the performance argument
> > against bumping the limit, it doesn't feel to me like something that
> > would be so bad that we need to prevent users from using numbers of
> > dimensions in the low thousands, e.g. top-k KNN searches would still
> > look at a very small subset of the full dataset.
> >
> > So overall, my vote would be to bump the limit to 2048 as suggested by
> > Mayya on the issue that you linked.
> >
> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
> > <michael.wechner@wyona.com> wrote:
> >> Thanks Alessandro for summarizing the discussion below!
> >>
> >> I understand that there is no clear reasoning re what is the best
> embedding size, whereas I think heuristic approaches like described by the
> following link can be helpful
> >>
> >>
> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
> >>
> >> Having said this, we see various embedding services providing higher
> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
> >>
> >> And it would be great if we could run benchmarks without having to
> recompile Lucene ourselves.
> >>
> >> Therefore I would to suggest to either increase the limit or even
> better to remove the limit and add a disclaimer, that people should be
> aware of possible crashes etc.
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >>
> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
> >>
> >>
> >> I've been monitoring various discussions on Pull Requests about
> changing the max number of dimensions allowed for Lucene HNSW vectors:
> >>
> >> https://github.com/apache/lucene/pull/12191
> >>
> >> https://github.com/apache/lucene/issues/11507
> >>
> >>
> >> I would like to set up a discussion and potentially a vote about this.
> >>
> >> I have seen some strong opposition from a few people but a majority of
> favor in this direction.
> >>
> >>
> >> Motivation
> >>
> >> We were discussing in the Solr slack channel with Ishan Chattopadhyaya,
> Marcus Eagan, and David Smiley about some neural search integrations in
> Solr: https://github.com/openai/chatgpt-retrieval-plugin
> >>
> >>
> >> Proposal
> >>
> >> No hard limit at all.
> >>
> >> As for many other Lucene areas, users will be allowed to push the
> system to the limit of their resources and get terrible performances or
> crashes if they want.
> >>
> >>
> >> What we are NOT discussing
> >>
> >> - Quality and scalability of the HNSW algorithm
> >>
> >> - dimensionality reduction
> >>
> >> - strategies to fit in an arbitrary self-imposed limit
> >>
> >>
> >> Benefits
> >>
> >> - users can use the models they want to generate vectors
> >>
> >> - removal of an arbitrary limit that blocks some integrations
> >>
> >>
> >> Cons
> >>
> >> - if you go for vectors with high dimensions, there's no guarantee
> you get acceptable performance for your use case
> >>
> >>
> >>
> >> I want to keep it simple, right now in many Lucene areas, you can push
> the system to not acceptable performance/ crashes.
> >>
> >> For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they are too
> much for your system, you get terrible performance/crashes/whatever.
> >>
> >>
> >> Limits caused by primitive java types will stay there behind the scene,
> and that's acceptable, but I would prefer to not have arbitrary hard-coded
> ones that may limit the software usability and integration which is
> extremely important for a library.
> >>
> >>
> >> I strongly encourage people to add benefits and cons, that I missed (I
> am sure I missed some of them, but wanted to keep it simple)
> >>
> >>
> >> Cheers
> >>
> >> --------------------------
> >> Alessandro Benedetti
> >> Director @ Sease Ltd.
> >> Apache Lucene/Solr Committer
> >> Apache Solr PMC Member
> >>
> >> e-mail: a.benedetti@sease.io
> >>
> >>
> >> Sease - Information Retrieval Applied
> >> Consulting | Training | Open Source
> >>
> >> Website: Sease.io
> >> LinkedIn | Twitter | Youtube | Github
> >>
> >>
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 1, 2023, 5:47 AM

Post #6 of 99 (729 views)

I'm also in favor of raising this limit. We do see some datasets with
higher than 1024 dims. I also think we need to keep a limit. For example we
currently need to keep all the vectors in RAM while indexing and we want to
be able to support reasonable numbers of vectors in an index segment. Also
we don't know what innovations might come down the road. Maybe someday we
want to do product quantization and enforce that (k, m) both fit in a byte
-- we wouldn't be able to do that if a vector's dimension were to exceed
32K.

On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> I am also curious what would be the worst-case scenario if we remove the
> constant at all (so automatically the limit becomes the Java
> Integer.MAX_VALUE).
> i.e.
> right now if you exceed the limit you get:
>
>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>> throw new IllegalArgumentException(
>> "cannot index vectors with dimension greater than " + ByteVectorValues.
>> MAX_DIMENSIONS);
>> }
>
>
> in relation to:
>
>> These limits allow us to
>> better tune our data structures, prevent overflows, help ensure we
>> have good test coverage, etc.
>
>
> I agree 100% especially for typing stuff properly and avoiding resource
> waste here and there, but I am not entirely sure this is the case for the
> current implementation i.e. do we have optimizations in place that assume
> the max dimension to be 1024?
> If I missed that (and I likely have), I of course suggest the contribution
> should not just blindly remove the limit, but do it appropriately.
> I am not in favor of just doubling it as suggested by some people, I would
> ideally prefer a solution that remains there to a decent extent, rather
> than having to modifying it anytime someone requires a higher limit.
>
> Cheers
>
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wechner@wyona.com>
> wrote:
>
>> OpenAI reduced their size to 1536 dimensions
>>
>> https://openai.com/blog/new-and-improved-embedding-model
>>
>> so 2048 would work :-)
>>
>> but other services do provide also higher dimensions with sometimes
>> slightly better accuracy
>>
>> Thanks
>>
>> Michael
>>
>>
>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>> > I'm supportive of bumping the limit on the maximum dimension for
>> > vectors to something that is above what the majority of users need,
>> > but I'd like to keep a limit. We have limits for other things like the
>> > max number of docs per index, the max term length, the max number of
>> > dimensions of points, etc. and there are a few things that we don't
>> > have limits on that I wish we had limits on. These limits allow us to
>> > better tune our data structures, prevent overflows, help ensure we
>> > have good test coverage, etc.
>> >
>> > That said, these other limits we have in place are quite high. E.g.
>> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
>> > Likewise for the max of 8 dimensions for points: a segment cannot
>> > possibly have 2 splits per dimension on average if it doesn't have
>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>> > than 8 would likely defeat the point of indexing. In contrast, our
>> > limit on the number of dimensions of vectors seems to be under what
>> > some users would like, and while I understand the performance argument
>> > against bumping the limit, it doesn't feel to me like something that
>> > would be so bad that we need to prevent users from using numbers of
>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>> > look at a very small subset of the full dataset.
>> >
>> > So overall, my vote would be to bump the limit to 2048 as suggested by
>> > Mayya on the issue that you linked.
>> >
>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>> > <michael.wechner@wyona.com> wrote:
>> >> Thanks Alessandro for summarizing the discussion below!
>> >>
>> >> I understand that there is no clear reasoning re what is the best
>> embedding size, whereas I think heuristic approaches like described by the
>> following link can be helpful
>> >>
>> >>
>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>> >>
>> >> Having said this, we see various embedding services providing higher
>> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>> >>
>> >> And it would be great if we could run benchmarks without having to
>> recompile Lucene ourselves.
>> >>
>> >> Therefore I would to suggest to either increase the limit or even
>> better to remove the limit and add a disclaimer, that people should be
>> aware of possible crashes etc.
>> >>
>> >> Thanks
>> >>
>> >> Michael
>> >>
>> >>
>> >>
>> >>
>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>> >>
>> >>
>> >> I've been monitoring various discussions on Pull Requests about
>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>> >>
>> >> https://github.com/apache/lucene/pull/12191
>> >>
>> >> https://github.com/apache/lucene/issues/11507
>> >>
>> >>
>> >> I would like to set up a discussion and potentially a vote about this.
>> >>
>> >> I have seen some strong opposition from a few people but a majority of
>> favor in this direction.
>> >>
>> >>
>> >> Motivation
>> >>
>> >> We were discussing in the Solr slack channel with Ishan
>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>> integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>> >>
>> >>
>> >> Proposal
>> >>
>> >> No hard limit at all.
>> >>
>> >> As for many other Lucene areas, users will be allowed to push the
>> system to the limit of their resources and get terrible performances or
>> crashes if they want.
>> >>
>> >>
>> >> What we are NOT discussing
>> >>
>> >> - Quality and scalability of the HNSW algorithm
>> >>
>> >> - dimensionality reduction
>> >>
>> >> - strategies to fit in an arbitrary self-imposed limit
>> >>
>> >>
>> >> Benefits
>> >>
>> >> - users can use the models they want to generate vectors
>> >>
>> >> - removal of an arbitrary limit that blocks some integrations
>> >>
>> >>
>> >> Cons
>> >>
>> >> - if you go for vectors with high dimensions, there's no guarantee
>> you get acceptable performance for your use case
>> >>
>> >>
>> >>
>> >> I want to keep it simple, right now in many Lucene areas, you can push
>> the system to not acceptable performance/ crashes.
>> >>
>> >> For example, we don't limit the number of docs per index to an
>> arbitrary maximum of N, you push how many docs you like and if they are too
>> much for your system, you get terrible performance/crashes/whatever.
>> >>
>> >>
>> >> Limits caused by primitive java types will stay there behind the
>> scene, and that's acceptable, but I would prefer to not have arbitrary
>> hard-coded ones that may limit the software usability and integration which
>> is extremely important for a library.
>> >>
>> >>
>> >> I strongly encourage people to add benefits and cons, that I missed (I
>> am sure I missed some of them, but wanted to keep it simple)
>> >>
>> >>
>> >> Cheers
>> >>
>> >> --------------------------
>> >> Alessandro Benedetti
>> >> Director @ Sease Ltd.
>> >> Apache Lucene/Solr Committer
>> >> Apache Solr PMC Member
>> >>
>> >> e-mail: a.benedetti@sease.io
>> >>
>> >>
>> >> Sease - Information Retrieval Applied
>> >> Consulting | Training | Open Source
>> >>
>> >> Website: Sease.io
>> >> LinkedIn | Twitter | Youtube | Github
>> >>
>> >>
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

ichattopadhyaya at gmail

Apr 1, 2023, 7:01 AM

Post #7 of 99 (729 views)

+1 to raising the limit. Maybe in future performance problems can be
mitigated with optimisations or hardware acceleration (GPUs) etc.

On Sat, 1 Apr, 2023, 6:18 pm Michael Sokolov, <msokolov@gmail.com> wrote:

> I'm also in favor of raising this limit. We do see some datasets with
> higher than 1024 dims. I also think we need to keep a limit. For example we
> currently need to keep all the vectors in RAM while indexing and we want to
> be able to support reasonable numbers of vectors in an index segment. Also
> we don't know what innovations might come down the road. Maybe someday we
> want to do product quantization and enforce that (k, m) both fit in a byte
> -- we wouldn't be able to do that if a vector's dimension were to exceed
> 32K.
>
> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <
> a.benedetti@sease.io> wrote:
>
>> I am also curious what would be the worst-case scenario if we remove the
>> constant at all (so automatically the limit becomes the Java
>> Integer.MAX_VALUE).
>> i.e.
>> right now if you exceed the limit you get:
>>
>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>> throw new IllegalArgumentException(
>>> "cannot index vectors with dimension greater than " + ByteVectorValues.
>>> MAX_DIMENSIONS);
>>> }
>>
>>
>> in relation to:
>>
>>> These limits allow us to
>>> better tune our data structures, prevent overflows, help ensure we
>>> have good test coverage, etc.
>>
>>
>> I agree 100% especially for typing stuff properly and avoiding resource
>> waste here and there, but I am not entirely sure this is the case for the
>> current implementation i.e. do we have optimizations in place that assume
>> the max dimension to be 1024?
>> If I missed that (and I likely have), I of course suggest the
>> contribution should not just blindly remove the limit, but do it
>> appropriately.
>> I am not in favor of just doubling it as suggested by some people, I
>> would ideally prefer a solution that remains there to a decent extent,
>> rather than having to modifying it anytime someone requires a higher limit.
>>
>> Cheers
>>
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wechner@wyona.com>
>> wrote:
>>
>>> OpenAI reduced their size to 1536 dimensions
>>>
>>> https://openai.com/blog/new-and-improved-embedding-model
>>>
>>> so 2048 would work :-)
>>>
>>> but other services do provide also higher dimensions with sometimes
>>> slightly better accuracy
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>> > I'm supportive of bumping the limit on the maximum dimension for
>>> > vectors to something that is above what the majority of users need,
>>> > but I'd like to keep a limit. We have limits for other things like the
>>> > max number of docs per index, the max term length, the max number of
>>> > dimensions of points, etc. and there are a few things that we don't
>>> > have limits on that I wish we had limits on. These limits allow us to
>>> > better tune our data structures, prevent overflows, help ensure we
>>> > have good test coverage, etc.
>>> >
>>> > That said, these other limits we have in place are quite high. E.g.
>>> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>> > possibly have 2 splits per dimension on average if it doesn't have
>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>> > limit on the number of dimensions of vectors seems to be under what
>>> > some users would like, and while I understand the performance argument
>>> > against bumping the limit, it doesn't feel to me like something that
>>> > would be so bad that we need to prevent users from using numbers of
>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>> > look at a very small subset of the full dataset.
>>> >
>>> > So overall, my vote would be to bump the limit to 2048 as suggested by
>>> > Mayya on the issue that you linked.
>>> >
>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >> Thanks Alessandro for summarizing the discussion below!
>>> >>
>>> >> I understand that there is no clear reasoning re what is the best
>>> embedding size, whereas I think heuristic approaches like described by the
>>> following link can be helpful
>>> >>
>>> >>
>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>> >>
>>> >> Having said this, we see various embedding services providing higher
>>> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>>> >>
>>> >> And it would be great if we could run benchmarks without having to
>>> recompile Lucene ourselves.
>>> >>
>>> >> Therefore I would to suggest to either increase the limit or even
>>> better to remove the limit and add a disclaimer, that people should be
>>> aware of possible crashes etc.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>> >>
>>> >>
>>> >> I've been monitoring various discussions on Pull Requests about
>>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>> >>
>>> >> https://github.com/apache/lucene/pull/12191
>>> >>
>>> >> https://github.com/apache/lucene/issues/11507
>>> >>
>>> >>
>>> >> I would like to set up a discussion and potentially a vote about this.
>>> >>
>>> >> I have seen some strong opposition from a few people but a majority
>>> of favor in this direction.
>>> >>
>>> >>
>>> >> Motivation
>>> >>
>>> >> We were discussing in the Solr slack channel with Ishan
>>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>>> integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>>> >>
>>> >>
>>> >> Proposal
>>> >>
>>> >> No hard limit at all.
>>> >>
>>> >> As for many other Lucene areas, users will be allowed to push the
>>> system to the limit of their resources and get terrible performances or
>>> crashes if they want.
>>> >>
>>> >>
>>> >> What we are NOT discussing
>>> >>
>>> >> - Quality and scalability of the HNSW algorithm
>>> >>
>>> >> - dimensionality reduction
>>> >>
>>> >> - strategies to fit in an arbitrary self-imposed limit
>>> >>
>>> >>
>>> >> Benefits
>>> >>
>>> >> - users can use the models they want to generate vectors
>>> >>
>>> >> - removal of an arbitrary limit that blocks some integrations
>>> >>
>>> >>
>>> >> Cons
>>> >>
>>> >> - if you go for vectors with high dimensions, there's no guarantee
>>> you get acceptable performance for your use case
>>> >>
>>> >>
>>> >>
>>> >> I want to keep it simple, right now in many Lucene areas, you can
>>> push the system to not acceptable performance/ crashes.
>>> >>
>>> >> For example, we don't limit the number of docs per index to an
>>> arbitrary maximum of N, you push how many docs you like and if they are too
>>> much for your system, you get terrible performance/crashes/whatever.
>>> >>
>>> >>
>>> >> Limits caused by primitive java types will stay there behind the
>>> scene, and that's acceptable, but I would prefer to not have arbitrary
>>> hard-coded ones that may limit the software usability and integration which
>>> is extremely important for a library.
>>> >>
>>> >>
>>> >> I strongly encourage people to add benefits and cons, that I missed
>>> (I am sure I missed some of them, but wanted to keep it simple)
>>> >>
>>> >>
>>> >> Cheers
>>> >>
>>> >> --------------------------
>>> >> Alessandro Benedetti
>>> >> Director @ Sease Ltd.
>>> >> Apache Lucene/Solr Committer
>>> >> Apache Solr PMC Member
>>> >>
>>> >> e-mail: a.benedetti@sease.io
>>> >>
>>> >>
>>> >> Sease - Information Retrieval Applied
>>> >> Consulting | Training | Open Source
>>> >>
>>> >> Website: Sease.io
>>> >> LinkedIn | Twitter | Youtube | Github
>>> >>
>>> >>
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 1, 2023, 11:28 PM

Post #8 of 99 (729 views)

btw, what was the reasoning to set the current limit to 1024?

Thanks

Michael

Am 01.04.23 um 14:47 schrieb Michael Sokolov:
> I'm also in favor of raising this limit. We do see some datasets with
> higher than 1024 dims. I also think we need to keep a limit. For
> example we currently need to keep all the vectors in RAM while
> indexing and we want to be able to support reasonable numbers of
> vectors in an index segment. Also we don't know what innovations might
> come down the road. Maybe someday we want to do product quantization
> and enforce that (k, m) both fit in a byte -- we wouldn't be able to
> do that if a vector's dimension were to exceed 32K.
>
> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> I am also curious what would be the worst-case scenario if we
> remove the constant at all (so automatically the limit becomes the
> Java Integer.MAX_VALUE).
> i.e.
> right now if you exceed the limit you get:
>
> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
> throw new IllegalArgumentException(
> "cannot index vectors with dimension greater than " +
> ByteVectorValues.MAX_DIMENSIONS);
> }
>
>
> in relation to:
>
> These limits allow us to
> better tune our data structures, prevent overflows, help ensure we
> have good test coverage, etc.
>
> I agree 100% especially for typing stuff properly and avoiding
> resource waste here and there, but I am not entirely sure this is
> the case for the current implementation i.e. do we have
> optimizations in place that assume the max dimension to be 1024?
> If I missed that (and I likely have), I of course suggest the
> contribution should not just blindly remove the limit, but do it
> appropriately.
> I am not in favor of just doubling it as suggested by some people,
> I would ideally prefer a solution that remains there to a decent
> extent, rather than having to modifying it anytime someone
> requires a higher limit.
>
> Cheers
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
> Github <https://github.com/seaseltd>
>
>
> On Fri, 31 Mar 2023 at 16:12, Michael Wechner
> <michael.wechner@wyona.com> wrote:
>
> OpenAI reduced their size to 1536 dimensions
>
> https://openai.com/blog/new-and-improved-embedding-model
>
> so 2048 would work :-)
>
> but other services do provide also higher dimensions with
> sometimes
> slightly better accuracy
>
> Thanks
>
> Michael
>
>
> Am 31.03.23 um 14:45 schrieb Adrien Grand:
> > I'm supportive of bumping the limit on the maximum dimension for
> > vectors to something that is above what the majority of
> users need,
> > but I'd like to keep a limit. We have limits for other
> things like the
> > max number of docs per index, the max term length, the max
> number of
> > dimensions of points, etc. and there are a few things that
> we don't
> > have limits on that I wish we had limits on. These limits
> allow us to
> > better tune our data structures, prevent overflows, help
> ensure we
> > have good test coverage, etc.
> >
> > That said, these other limits we have in place are quite
> high. E.g.
> > the 32kB term limit, nobody would ever type a 32kB term in a
> text box.
> > Likewise for the max of 8 dimensions for points: a segment
> cannot
> > possibly have 2 splits per dimension on average if it
> doesn't have
> > 512*2^(8*2)=34M docs, a sizable dataset already, so more
> dimensions
> > than 8 would likely defeat the point of indexing. In
> contrast, our
> > limit on the number of dimensions of vectors seems to be
> under what
> > some users would like, and while I understand the
> performance argument
> > against bumping the limit, it doesn't feel to me like
> something that
> > would be so bad that we need to prevent users from using
> numbers of
> > dimensions in the low thousands, e.g. top-k KNN searches
> would still
> > look at a very small subset of the full dataset.
> >
> > So overall, my vote would be to bump the limit to 2048 as
> suggested by
> > Mayya on the issue that you linked.
> >
> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
> > <michael.wechner@wyona.com> wrote:
> >> Thanks Alessandro for summarizing the discussion below!
> >>
> >> I understand that there is no clear reasoning re what is
> the best embedding size, whereas I think heuristic approaches
> like described by the following link can be helpful
> >>
> >>
> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
> >>
> >> Having said this, we see various embedding services
> providing higher dimensions than 1024, like for example
> OpenAI, Cohere and Aleph Alpha.
> >>
> >> And it would be great if we could run benchmarks without
> having to recompile Lucene ourselves.
> >>
> >> Therefore I would to suggest to either increase the limit
> or even better to remove the limit and add a disclaimer, that
> people should be aware of possible crashes etc.
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >>
> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
> >>
> >>
> >> I've been monitoring various discussions on Pull Requests
> about changing the max number of dimensions allowed for Lucene
> HNSW vectors:
> >>
> >> https://github.com/apache/lucene/pull/12191
> >>
> >> https://github.com/apache/lucene/issues/11507
> >>
> >>
> >> I would like to set up a discussion and potentially a vote
> about this.
> >>
> >> I have seen some strong opposition from a few people but a
> majority of favor in this direction.
> >>
> >>
> >> Motivation
> >>
> >> We were discussing in the Solr slack channel with Ishan
> Chattopadhyaya, Marcus Eagan, and David Smiley about some
> neural search integrations in Solr:
> https://github.com/openai/chatgpt-retrieval-plugin
> >>
> >>
> >> Proposal
> >>
> >> No hard limit at all.
> >>
> >> As for many other Lucene areas, users will be allowed to
> push the system to the limit of their resources and get
> terrible performances or crashes if they want.
> >>
> >>
> >> What we are NOT discussing
> >>
> >> - Quality and scalability of the HNSW algorithm
> >>
> >> - dimensionality reduction
> >>
> >> - strategies to fit in an arbitrary self-imposed limit
> >>
> >>
> >> Benefits
> >>
> >> - users can use the models they want to generate vectors
> >>
> >> - removal of an arbitrary limit that blocks some integrations
> >>
> >>
> >> Cons
> >>
> >> - if you go for vectors with high dimensions, there's no
> guarantee you get acceptable performance for your use case
> >>
> >>
> >>
> >> I want to keep it simple, right now in many Lucene areas,
> you can push the system to not acceptable performance/ crashes.
> >>
> >> For example, we don't limit the number of docs per index to
> an arbitrary maximum of N, you push how many docs you like and
> if they are too much for your system, you get terrible
> performance/crashes/whatever.
> >>
> >>
> >> Limits caused by primitive java types will stay there
> behind the scene, and that's acceptable, but I would prefer to
> not have arbitrary hard-coded ones that may limit the software
> usability and integration which is extremely important for a
> library.
> >>
> >>
> >> I strongly encourage people to add benefits and cons, that
> I missed (I am sure I missed some of them, but wanted to keep
> it simple)
> >>
> >>
> >> Cheers
> >>
> >> --------------------------
> >> Alessandro Benedetti
> >> Director @ Sease Ltd.
> >> Apache Lucene/Solr Committer
> >> Apache Solr PMC Member
> >>
> >> e-mail: a.benedetti@sease.io
> >>
> >>
> >> Sease - Information Retrieval Applied
> >> Consulting | Training | Open Source
> >>
> >> Website: Sease.io
> >> LinkedIn | Twitter | Youtube | Github
> >>
> >>
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 3, 2023, 6:50 AM

Post #9 of 99 (729 views)

... and what would be the next limit?
I guess we'll need to motivate it better than the 1024 one.
I appreciate the fact that a limit is pretty much wanted by everyone but I
suspect we'll need some solid foundation for deciding the amount (and it
should be high enough to avoid continuous changes)

Cheers

On Sun, 2 Apr 2023, 07:29 Michael Wechner, <michael.wechner@wyona.com>
wrote:

> btw, what was the reasoning to set the current limit to 1024?
>
> Thanks
>
> Michael
>
> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>
> I'm also in favor of raising this limit. We do see some datasets with
> higher than 1024 dims. I also think we need to keep a limit. For example we
> currently need to keep all the vectors in RAM while indexing and we want to
> be able to support reasonable numbers of vectors in an index segment. Also
> we don't know what innovations might come down the road. Maybe someday we
> want to do product quantization and enforce that (k, m) both fit in a byte
> -- we wouldn't be able to do that if a vector's dimension were to exceed
> 32K.
>
> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <
> a.benedetti@sease.io> wrote:
>
>> I am also curious what would be the worst-case scenario if we remove the
>> constant at all (so automatically the limit becomes the Java
>> Integer.MAX_VALUE).
>> i.e.
>> right now if you exceed the limit you get:
>>
>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>> throw new IllegalArgumentException(
>>> "cannot index vectors with dimension greater than " + ByteVectorValues.
>>> MAX_DIMENSIONS);
>>> }
>>
>>
>> in relation to:
>>
>>> These limits allow us to
>>> better tune our data structures, prevent overflows, help ensure we
>>> have good test coverage, etc.
>>
>>
>> I agree 100% especially for typing stuff properly and avoiding resource
>> waste here and there, but I am not entirely sure this is the case for the
>> current implementation i.e. do we have optimizations in place that assume
>> the max dimension to be 1024?
>> If I missed that (and I likely have), I of course suggest the
>> contribution should not just blindly remove the limit, but do it
>> appropriately.
>> I am not in favor of just doubling it as suggested by some people, I
>> would ideally prefer a solution that remains there to a decent extent,
>> rather than having to modifying it anytime someone requires a higher limit.
>>
>> Cheers
>>
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wechner@wyona.com>
>> wrote:
>>
>>> OpenAI reduced their size to 1536 dimensions
>>>
>>> https://openai.com/blog/new-and-improved-embedding-model
>>>
>>> so 2048 would work :-)
>>>
>>> but other services do provide also higher dimensions with sometimes
>>> slightly better accuracy
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>> > I'm supportive of bumping the limit on the maximum dimension for
>>> > vectors to something that is above what the majority of users need,
>>> > but I'd like to keep a limit. We have limits for other things like the
>>> > max number of docs per index, the max term length, the max number of
>>> > dimensions of points, etc. and there are a few things that we don't
>>> > have limits on that I wish we had limits on. These limits allow us to
>>> > better tune our data structures, prevent overflows, help ensure we
>>> > have good test coverage, etc.
>>> >
>>> > That said, these other limits we have in place are quite high. E.g.
>>> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>> > possibly have 2 splits per dimension on average if it doesn't have
>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>> > limit on the number of dimensions of vectors seems to be under what
>>> > some users would like, and while I understand the performance argument
>>> > against bumping the limit, it doesn't feel to me like something that
>>> > would be so bad that we need to prevent users from using numbers of
>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>> > look at a very small subset of the full dataset.
>>> >
>>> > So overall, my vote would be to bump the limit to 2048 as suggested by
>>> > Mayya on the issue that you linked.
>>> >
>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >> Thanks Alessandro for summarizing the discussion below!
>>> >>
>>> >> I understand that there is no clear reasoning re what is the best
>>> embedding size, whereas I think heuristic approaches like described by the
>>> following link can be helpful
>>> >>
>>> >>
>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>> >>
>>> >> Having said this, we see various embedding services providing higher
>>> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>>> >>
>>> >> And it would be great if we could run benchmarks without having to
>>> recompile Lucene ourselves.
>>> >>
>>> >> Therefore I would to suggest to either increase the limit or even
>>> better to remove the limit and add a disclaimer, that people should be
>>> aware of possible crashes etc.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>> >>
>>> >>
>>> >> I've been monitoring various discussions on Pull Requests about
>>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>> >>
>>> >> https://github.com/apache/lucene/pull/12191
>>> >>
>>> >> https://github.com/apache/lucene/issues/11507
>>> >>
>>> >>
>>> >> I would like to set up a discussion and potentially a vote about this.
>>> >>
>>> >> I have seen some strong opposition from a few people but a majority
>>> of favor in this direction.
>>> >>
>>> >>
>>> >> Motivation
>>> >>
>>> >> We were discussing in the Solr slack channel with Ishan
>>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>>> integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>>> >>
>>> >>
>>> >> Proposal
>>> >>
>>> >> No hard limit at all.
>>> >>
>>> >> As for many other Lucene areas, users will be allowed to push the
>>> system to the limit of their resources and get terrible performances or
>>> crashes if they want.
>>> >>
>>> >>
>>> >> What we are NOT discussing
>>> >>
>>> >> - Quality and scalability of the HNSW algorithm
>>> >>
>>> >> - dimensionality reduction
>>> >>
>>> >> - strategies to fit in an arbitrary self-imposed limit
>>> >>
>>> >>
>>> >> Benefits
>>> >>
>>> >> - users can use the models they want to generate vectors
>>> >>
>>> >> - removal of an arbitrary limit that blocks some integrations
>>> >>
>>> >>
>>> >> Cons
>>> >>
>>> >> - if you go for vectors with high dimensions, there's no guarantee
>>> you get acceptable performance for your use case
>>> >>
>>> >>
>>> >>
>>> >> I want to keep it simple, right now in many Lucene areas, you can
>>> push the system to not acceptable performance/ crashes.
>>> >>
>>> >> For example, we don't limit the number of docs per index to an
>>> arbitrary maximum of N, you push how many docs you like and if they are too
>>> much for your system, you get terrible performance/crashes/whatever.
>>> >>
>>> >>
>>> >> Limits caused by primitive java types will stay there behind the
>>> scene, and that's acceptable, but I would prefer to not have arbitrary
>>> hard-coded ones that may limit the software usability and integration which
>>> is extremely important for a library.
>>> >>
>>> >>
>>> >> I strongly encourage people to add benefits and cons, that I missed
>>> (I am sure I missed some of them, but wanted to keep it simple)
>>> >>
>>> >>
>>> >> Cheers
>>> >>
>>> >> --------------------------
>>> >> Alessandro Benedetti
>>> >> Director @ Sease Ltd.
>>> >> Apache Lucene/Solr Committer
>>> >> Apache Solr PMC Member
>>> >>
>>> >> e-mail: a.benedetti@sease.io
>>> >>
>>> >>
>>> >> Sease - Information Retrieval Applied
>>> >> Consulting | Training | Open Source
>>> >>
>>> >> Website: Sease.io
>>> >> LinkedIn | Twitter | Youtube | Github
>>> >>
>>> >>
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

lucene at mikemccandless

Apr 4, 2023, 6:32 AM

Post #10 of 99 (728 views)

> I am not in favor of just doubling it as suggested by some people, I
would ideally prefer a solution that remains there to a decent extent,
rather than having to modifying it anytime someone requires a higher limit.

The problem with this approach is it is a one-way door, once released. We
would not be able to lower the limit again in the future without possibly
breaking some applications.

> For example, we don't limit the number of docs per index to an arbitrary
maximum of N, you push how many docs you like and if they are too much for
your system, you get terrible performance/crashes/whatever.

Correction: we do check this limit and throw a specific exception now:
https://github.com/apache/lucene/issues/6905

+1 to raise the limit, but not remove it.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Apr 3, 2023 at 9:51?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> ... and what would be the next limit?
> I guess we'll need to motivate it better than the 1024 one.
> I appreciate the fact that a limit is pretty much wanted by everyone but I
> suspect we'll need some solid foundation for deciding the amount (and it
> should be high enough to avoid continuous changes)
>
> Cheers
>
> On Sun, 2 Apr 2023, 07:29 Michael Wechner, <michael.wechner@wyona.com>
> wrote:
>
>> btw, what was the reasoning to set the current limit to 1024?
>>
>> Thanks
>>
>> Michael
>>
>> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>>
>> I'm also in favor of raising this limit. We do see some datasets with
>> higher than 1024 dims. I also think we need to keep a limit. For example we
>> currently need to keep all the vectors in RAM while indexing and we want to
>> be able to support reasonable numbers of vectors in an index segment. Also
>> we don't know what innovations might come down the road. Maybe someday we
>> want to do product quantization and enforce that (k, m) both fit in a byte
>> -- we wouldn't be able to do that if a vector's dimension were to exceed
>> 32K.
>>
>> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <
>> a.benedetti@sease.io> wrote:
>>
>>> I am also curious what would be the worst-case scenario if we remove the
>>> constant at all (so automatically the limit becomes the Java
>>> Integer.MAX_VALUE).
>>> i.e.
>>> right now if you exceed the limit you get:
>>>
>>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>>> throw new IllegalArgumentException(
>>>> "cannot index vectors with dimension greater than " + ByteVectorValues.
>>>> MAX_DIMENSIONS);
>>>> }
>>>
>>>
>>> in relation to:
>>>
>>>> These limits allow us to
>>>> better tune our data structures, prevent overflows, help ensure we
>>>> have good test coverage, etc.
>>>
>>>
>>> I agree 100% especially for typing stuff properly and avoiding resource
>>> waste here and there, but I am not entirely sure this is the case for the
>>> current implementation i.e. do we have optimizations in place that assume
>>> the max dimension to be 1024?
>>> If I missed that (and I likely have), I of course suggest the
>>> contribution should not just blindly remove the limit, but do it
>>> appropriately.
>>> I am not in favor of just doubling it as suggested by some people, I
>>> would ideally prefer a solution that remains there to a decent extent,
>>> rather than having to modifying it anytime someone requires a higher limit.
>>>
>>> Cheers
>>>
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benedetti@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>>
>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wechner@wyona.com>
>>> wrote:
>>>
>>>> OpenAI reduced their size to 1536 dimensions
>>>>
>>>> https://openai.com/blog/new-and-improved-embedding-model
>>>>
>>>> so 2048 would work :-)
>>>>
>>>> but other services do provide also higher dimensions with sometimes
>>>> slightly better accuracy
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>>
>>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>>> > I'm supportive of bumping the limit on the maximum dimension for
>>>> > vectors to something that is above what the majority of users need,
>>>> > but I'd like to keep a limit. We have limits for other things like the
>>>> > max number of docs per index, the max term length, the max number of
>>>> > dimensions of points, etc. and there are a few things that we don't
>>>> > have limits on that I wish we had limits on. These limits allow us to
>>>> > better tune our data structures, prevent overflows, help ensure we
>>>> > have good test coverage, etc.
>>>> >
>>>> > That said, these other limits we have in place are quite high. E.g.
>>>> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
>>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>>> > possibly have 2 splits per dimension on average if it doesn't have
>>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>>> > limit on the number of dimensions of vectors seems to be under what
>>>> > some users would like, and while I understand the performance argument
>>>> > against bumping the limit, it doesn't feel to me like something that
>>>> > would be so bad that we need to prevent users from using numbers of
>>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>>> > look at a very small subset of the full dataset.
>>>> >
>>>> > So overall, my vote would be to bump the limit to 2048 as suggested by
>>>> > Mayya on the issue that you linked.
>>>> >
>>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>>> > <michael.wechner@wyona.com> wrote:
>>>> >> Thanks Alessandro for summarizing the discussion below!
>>>> >>
>>>> >> I understand that there is no clear reasoning re what is the best
>>>> embedding size, whereas I think heuristic approaches like described by the
>>>> following link can be helpful
>>>> >>
>>>> >>
>>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>>> >>
>>>> >> Having said this, we see various embedding services providing higher
>>>> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>>>> >>
>>>> >> And it would be great if we could run benchmarks without having to
>>>> recompile Lucene ourselves.
>>>> >>
>>>> >> Therefore I would to suggest to either increase the limit or even
>>>> better to remove the limit and add a disclaimer, that people should be
>>>> aware of possible crashes etc.
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> Michael
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>>> >>
>>>> >>
>>>> >> I've been monitoring various discussions on Pull Requests about
>>>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>>> >>
>>>> >> https://github.com/apache/lucene/pull/12191
>>>> >>
>>>> >> https://github.com/apache/lucene/issues/11507
>>>> >>
>>>> >>
>>>> >> I would like to set up a discussion and potentially a vote about
>>>> this.
>>>> >>
>>>> >> I have seen some strong opposition from a few people but a majority
>>>> of favor in this direction.
>>>> >>
>>>> >>
>>>> >> Motivation
>>>> >>
>>>> >> We were discussing in the Solr slack channel with Ishan
>>>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>>>> integrations in Solr:
>>>> https://github.com/openai/chatgpt-retrieval-plugin
>>>> >>
>>>> >>
>>>> >> Proposal
>>>> >>
>>>> >> No hard limit at all.
>>>> >>
>>>> >> As for many other Lucene areas, users will be allowed to push the
>>>> system to the limit of their resources and get terrible performances or
>>>> crashes if they want.
>>>> >>
>>>> >>
>>>> >> What we are NOT discussing
>>>> >>
>>>> >> - Quality and scalability of the HNSW algorithm
>>>> >>
>>>> >> - dimensionality reduction
>>>> >>
>>>> >> - strategies to fit in an arbitrary self-imposed limit
>>>> >>
>>>> >>
>>>> >> Benefits
>>>> >>
>>>> >> - users can use the models they want to generate vectors
>>>> >>
>>>> >> - removal of an arbitrary limit that blocks some integrations
>>>> >>
>>>> >>
>>>> >> Cons
>>>> >>
>>>> >> - if you go for vectors with high dimensions, there's no guarantee
>>>> you get acceptable performance for your use case
>>>> >>
>>>> >>
>>>> >>
>>>> >> I want to keep it simple, right now in many Lucene areas, you can
>>>> push the system to not acceptable performance/ crashes.
>>>> >>
>>>> >> For example, we don't limit the number of docs per index to an
>>>> arbitrary maximum of N, you push how many docs you like and if they are too
>>>> much for your system, you get terrible performance/crashes/whatever.
>>>> >>
>>>> >>
>>>> >> Limits caused by primitive java types will stay there behind the
>>>> scene, and that's acceptable, but I would prefer to not have arbitrary
>>>> hard-coded ones that may limit the software usability and integration which
>>>> is extremely important for a library.
>>>> >>
>>>> >>
>>>> >> I strongly encourage people to add benefits and cons, that I missed
>>>> (I am sure I missed some of them, but wanted to keep it simple)
>>>> >>
>>>> >>
>>>> >> Cheers
>>>> >>
>>>> >> --------------------------
>>>> >> Alessandro Benedetti
>>>> >> Director @ Sease Ltd.
>>>> >> Apache Lucene/Solr Committer
>>>> >> Apache Solr PMC Member
>>>> >>
>>>> >> e-mail: a.benedetti@sease.io
>>>> >>
>>>> >>
>>>> >> Sease - Information Retrieval Applied
>>>> >> Consulting | Training | Open Source
>>>> >>
>>>> >> Website: Sease.io
>>>> >> LinkedIn | Twitter | Youtube | Github
>>>> >>
>>>> >>
>>>> >
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>
>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

serera at gmail

Apr 4, 2023, 6:40 AM

Post #11 of 99 (728 views)

I am not familiar with the internal implementation details, but is it
possible to refactor the code such that someone can provide an extension of
some VectorEncoder/Decoder and control the limits on their side? Rather
than Lucene committing to some arbitrary limit (which these days seems to
keep growing)?

If raising the limit only means changing some hard-coded constant, then I
assume such an abstraction can work. We can mark this extension as
@lucene.expert.

Shai

On Tue, Apr 4, 2023 at 4:33?PM Michael McCandless <lucene@mikemccandless.com>
wrote:

> > I am not in favor of just doubling it as suggested by some people, I
> would ideally prefer a solution that remains there to a decent extent,
> rather than having to modifying it anytime someone requires a higher limit.
>
> The problem with this approach is it is a one-way door, once released. We
> would not be able to lower the limit again in the future without possibly
> breaking some applications.
>
> > For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they are too
> much for your system, you get terrible performance/crashes/whatever.
>
> Correction: we do check this limit and throw a specific exception now:
> https://github.com/apache/lucene/issues/6905
>
> +1 to raise the limit, but not remove it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 3, 2023 at 9:51?AM Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
>> ... and what would be the next limit?
>> I guess we'll need to motivate it better than the 1024 one.
>> I appreciate the fact that a limit is pretty much wanted by everyone but
>> I suspect we'll need some solid foundation for deciding the amount (and it
>> should be high enough to avoid continuous changes)
>>
>> Cheers
>>
>> On Sun, 2 Apr 2023, 07:29 Michael Wechner, <michael.wechner@wyona.com>
>> wrote:
>>
>>> btw, what was the reasoning to set the current limit to 1024?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>>>
>>> I'm also in favor of raising this limit. We do see some datasets with
>>> higher than 1024 dims. I also think we need to keep a limit. For example we
>>> currently need to keep all the vectors in RAM while indexing and we want to
>>> be able to support reasonable numbers of vectors in an index segment. Also
>>> we don't know what innovations might come down the road. Maybe someday we
>>> want to do product quantization and enforce that (k, m) both fit in a byte
>>> -- we wouldn't be able to do that if a vector's dimension were to exceed
>>> 32K.
>>>
>>> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> I am also curious what would be the worst-case scenario if we remove
>>>> the constant at all (so automatically the limit becomes the Java
>>>> Integer.MAX_VALUE).
>>>> i.e.
>>>> right now if you exceed the limit you get:
>>>>
>>>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>>>> throw new IllegalArgumentException(
>>>>> "cannot index vectors with dimension greater than " + ByteVectorValues
>>>>> .MAX_DIMENSIONS);
>>>>> }
>>>>
>>>>
>>>> in relation to:
>>>>
>>>>> These limits allow us to
>>>>> better tune our data structures, prevent overflows, help ensure we
>>>>> have good test coverage, etc.
>>>>
>>>>
>>>> I agree 100% especially for typing stuff properly and avoiding resource
>>>> waste here and there, but I am not entirely sure this is the case for the
>>>> current implementation i.e. do we have optimizations in place that assume
>>>> the max dimension to be 1024?
>>>> If I missed that (and I likely have), I of course suggest the
>>>> contribution should not just blindly remove the limit, but do it
>>>> appropriately.
>>>> I am not in favor of just doubling it as suggested by some people, I
>>>> would ideally prefer a solution that remains there to a decent extent,
>>>> rather than having to modifying it anytime someone requires a higher limit.
>>>>
>>>> Cheers
>>>>
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>>
>>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <
>>>> michael.wechner@wyona.com> wrote:
>>>>
>>>>> OpenAI reduced their size to 1536 dimensions
>>>>>
>>>>> https://openai.com/blog/new-and-improved-embedding-model
>>>>>
>>>>> so 2048 would work :-)
>>>>>
>>>>> but other services do provide also higher dimensions with sometimes
>>>>> slightly better accuracy
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>>>> > I'm supportive of bumping the limit on the maximum dimension for
>>>>> > vectors to something that is above what the majority of users need,
>>>>> > but I'd like to keep a limit. We have limits for other things like
>>>>> the
>>>>> > max number of docs per index, the max term length, the max number of
>>>>> > dimensions of points, etc. and there are a few things that we don't
>>>>> > have limits on that I wish we had limits on. These limits allow us to
>>>>> > better tune our data structures, prevent overflows, help ensure we
>>>>> > have good test coverage, etc.
>>>>> >
>>>>> > That said, these other limits we have in place are quite high. E.g.
>>>>> > the 32kB term limit, nobody would ever type a 32kB term in a text
>>>>> box.
>>>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>>>> > possibly have 2 splits per dimension on average if it doesn't have
>>>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>>>> > limit on the number of dimensions of vectors seems to be under what
>>>>> > some users would like, and while I understand the performance
>>>>> argument
>>>>> > against bumping the limit, it doesn't feel to me like something that
>>>>> > would be so bad that we need to prevent users from using numbers of
>>>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>>>> > look at a very small subset of the full dataset.
>>>>> >
>>>>> > So overall, my vote would be to bump the limit to 2048 as suggested
>>>>> by
>>>>> > Mayya on the issue that you linked.
>>>>> >
>>>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>>>> > <michael.wechner@wyona.com> wrote:
>>>>> >> Thanks Alessandro for summarizing the discussion below!
>>>>> >>
>>>>> >> I understand that there is no clear reasoning re what is the best
>>>>> embedding size, whereas I think heuristic approaches like described by the
>>>>> following link can be helpful
>>>>> >>
>>>>> >>
>>>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>>>> >>
>>>>> >> Having said this, we see various embedding services providing
>>>>> higher dimensions than 1024, like for example OpenAI, Cohere and Aleph
>>>>> Alpha.
>>>>> >>
>>>>> >> And it would be great if we could run benchmarks without having to
>>>>> recompile Lucene ourselves.
>>>>> >>
>>>>> >> Therefore I would to suggest to either increase the limit or even
>>>>> better to remove the limit and add a disclaimer, that people should be
>>>>> aware of possible crashes etc.
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> Michael
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>>>> >>
>>>>> >>
>>>>> >> I've been monitoring various discussions on Pull Requests about
>>>>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>>>> >>
>>>>> >> https://github.com/apache/lucene/pull/12191
>>>>> >>
>>>>> >> https://github.com/apache/lucene/issues/11507
>>>>> >>
>>>>> >>
>>>>> >> I would like to set up a discussion and potentially a vote about
>>>>> this.
>>>>> >>
>>>>> >> I have seen some strong opposition from a few people but a majority
>>>>> of favor in this direction.
>>>>> >>
>>>>> >>
>>>>> >> Motivation
>>>>> >>
>>>>> >> We were discussing in the Solr slack channel with Ishan
>>>>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>>>>> integrations in Solr:
>>>>> https://github.com/openai/chatgpt-retrieval-plugin
>>>>> >>
>>>>> >>
>>>>> >> Proposal
>>>>> >>
>>>>> >> No hard limit at all.
>>>>> >>
>>>>> >> As for many other Lucene areas, users will be allowed to push the
>>>>> system to the limit of their resources and get terrible performances or
>>>>> crashes if they want.
>>>>> >>
>>>>> >>
>>>>> >> What we are NOT discussing
>>>>> >>
>>>>> >> - Quality and scalability of the HNSW algorithm
>>>>> >>
>>>>> >> - dimensionality reduction
>>>>> >>
>>>>> >> - strategies to fit in an arbitrary self-imposed limit
>>>>> >>
>>>>> >>
>>>>> >> Benefits
>>>>> >>
>>>>> >> - users can use the models they want to generate vectors
>>>>> >>
>>>>> >> - removal of an arbitrary limit that blocks some integrations
>>>>> >>
>>>>> >>
>>>>> >> Cons
>>>>> >>
>>>>> >> - if you go for vectors with high dimensions, there's no
>>>>> guarantee you get acceptable performance for your use case
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> I want to keep it simple, right now in many Lucene areas, you can
>>>>> push the system to not acceptable performance/ crashes.
>>>>> >>
>>>>> >> For example, we don't limit the number of docs per index to an
>>>>> arbitrary maximum of N, you push how many docs you like and if they are too
>>>>> much for your system, you get terrible performance/crashes/whatever.
>>>>> >>
>>>>> >>
>>>>> >> Limits caused by primitive java types will stay there behind the
>>>>> scene, and that's acceptable, but I would prefer to not have arbitrary
>>>>> hard-coded ones that may limit the software usability and integration which
>>>>> is extremely important for a library.
>>>>> >>
>>>>> >>
>>>>> >> I strongly encourage people to add benefits and cons, that I missed
>>>>> (I am sure I missed some of them, but wanted to keep it simple)
>>>>> >>
>>>>> >>
>>>>> >> Cheers
>>>>> >>
>>>>> >> --------------------------
>>>>> >> Alessandro Benedetti
>>>>> >> Director @ Sease Ltd.
>>>>> >> Apache Lucene/Solr Committer
>>>>> >> Apache Solr PMC Member
>>>>> >>
>>>>> >> e-mail: a.benedetti@sease.io
>>>>> >>
>>>>> >>
>>>>> >> Sease - Information Retrieval Applied
>>>>> >> Consulting | Training | Open Source
>>>>> >>
>>>>> >> Website: Sease.io
>>>>> >> LinkedIn | Twitter | Youtube | Github
>>>>> >>
>>>>> >>
>>>>> >
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>>>
>>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 4, 2023, 6:47 AM

Post #12 of 99 (728 views)

IIUC we all agree that the limit could be raised, but we need some solid
reasoning what limit makes sense, resp. why do we set this particular
limit (e.g. 2048), right?

Thanks

Michael

Am 04.04.23 um 15:32 schrieb Michael McCandless:
> > I am not in favor of just doubling it as suggested by some people, I
> would ideally prefer a solution that remains there to a decent extent,
> rather than having to modifying it anytime someone requires a higher
> limit.
>
> The problem with this approach is it is a one-way door, once
> released. We would not be able to lower the limit again in the future
> without possibly breaking some applications.
>
> > For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they
> are too much for your system, you get terrible
> performance/crashes/whatever.
>
> Correction: we do check this limit and throw a specific exception now:
> https://github.com/apache/lucene/issues/6905
>
> +1 to raise the limit, but not remove it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 3, 2023 at 9:51?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> ... and what would be the next limit?
> I guess we'll need to motivate it better than the 1024 one.
> I appreciate the fact that a limit is pretty much wanted by
> everyone but I suspect we'll need some solid foundation for
> deciding the amount (and it should be high enough to avoid
> continuous changes)
>
> Cheers
>
> On Sun, 2 Apr 2023, 07:29 Michael Wechner,
> <michael.wechner@wyona.com> wrote:
>
> btw, what was the reasoning to set the current limit to 1024?
>
> Thanks
>
> Michael
>
> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>> I'm also in favor of raising this limit. We do see some
>> datasets with higher than 1024 dims. I also think we need to
>> keep a limit. For example we currently need to keep all the
>> vectors in RAM while indexing and we want to be able to
>> support reasonable numbers of vectors in an index segment.
>> Also we don't know what innovations might come down the road.
>> Maybe someday we want to do product quantization and enforce
>> that (k, m) both fit in a byte -- we wouldn't be able to do
>> that if a vector's dimension were to exceed 32K.
>>
>> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>>
>> I am also curious what would be the worst-case scenario
>> if we remove the constant at all (so automatically the
>> limit becomes the Java Integer.MAX_VALUE).
>> i.e.
>> right now if you exceed the limit you get:
>>
>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>> throw new IllegalArgumentException(
>> "cannot index vectors with dimension greater than " +
>> ByteVectorValues.MAX_DIMENSIONS);
>> }
>>
>>
>> in relation to:
>>
>> These limits allow us to
>> better tune our data structures, prevent overflows,
>> help ensure we
>> have good test coverage, etc.
>>
>> I agree 100% especially for typing stuff properly and
>> avoiding resource waste here and there, but I am not
>> entirely sure this is the case for the current
>> implementation i.e. do we have optimizations in place
>> that assume the max dimension to be 1024?
>> If I missed that (and I likely have), I of course suggest
>> the contribution should not just blindly remove the
>> limit, but do it appropriately.
>> I am not in favor of just doubling it as suggested by
>> some people, I would ideally prefer a solution that
>> remains there to a decent extent, rather than having to
>> modifying it anytime someone requires a higher limit.
>>
>> Cheers
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> /Apache Lucene/Solr Committer/
>> /Apache Solr PMC Member/
>>
>> e-mail: a.benedetti@sease.io/
>> /
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> |
>> Twitter <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
>> Github <https://github.com/seaseltd>
>>
>>
>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner
>> <michael.wechner@wyona.com> wrote:
>>
>> OpenAI reduced their size to 1536 dimensions
>>
>> https://openai.com/blog/new-and-improved-embedding-model
>>
>> so 2048 would work :-)
>>
>> but other services do provide also higher dimensions
>> with sometimes
>> slightly better accuracy
>>
>> Thanks
>>
>> Michael
>>
>>
>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>> > I'm supportive of bumping the limit on the maximum
>> dimension for
>> > vectors to something that is above what the
>> majority of users need,
>> > but I'd like to keep a limit. We have limits for
>> other things like the
>> > max number of docs per index, the max term length,
>> the max number of
>> > dimensions of points, etc. and there are a few
>> things that we don't
>> > have limits on that I wish we had limits on. These
>> limits allow us to
>> > better tune our data structures, prevent overflows,
>> help ensure we
>> > have good test coverage, etc.
>> >
>> > That said, these other limits we have in place are
>> quite high. E.g.
>> > the 32kB term limit, nobody would ever type a 32kB
>> term in a text box.
>> > Likewise for the max of 8 dimensions for points: a
>> segment cannot
>> > possibly have 2 splits per dimension on average if
>> it doesn't have
>> > 512*2^(8*2)=34M docs, a sizable dataset already, so
>> more dimensions
>> > than 8 would likely defeat the point of indexing.
>> In contrast, our
>> > limit on the number of dimensions of vectors seems
>> to be under what
>> > some users would like, and while I understand the
>> performance argument
>> > against bumping the limit, it doesn't feel to me
>> like something that
>> > would be so bad that we need to prevent users from
>> using numbers of
>> > dimensions in the low thousands, e.g. top-k KNN
>> searches would still
>> > look at a very small subset of the full dataset.
>> >
>> > So overall, my vote would be to bump the limit to
>> 2048 as suggested by
>> > Mayya on the issue that you linked.
>> >
>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>> > <michael.wechner@wyona.com> wrote:
>> >> Thanks Alessandro for summarizing the discussion
>> below!
>> >>
>> >> I understand that there is no clear reasoning re
>> what is the best embedding size, whereas I think
>> heuristic approaches like described by the following
>> link can be helpful
>> >>
>> >>
>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>> >>
>> >> Having said this, we see various embedding
>> services providing higher dimensions than 1024, like
>> for example OpenAI, Cohere and Aleph Alpha.
>> >>
>> >> And it would be great if we could run benchmarks
>> without having to recompile Lucene ourselves.
>> >>
>> >> Therefore I would to suggest to either increase
>> the limit or even better to remove the limit and add
>> a disclaimer, that people should be aware of possible
>> crashes etc.
>> >>
>> >> Thanks
>> >>
>> >> Michael
>> >>
>> >>
>> >>
>> >>
>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>> >>
>> >>
>> >> I've been monitoring various discussions on Pull
>> Requests about changing the max number of dimensions
>> allowed for Lucene HNSW vectors:
>> >>
>> >> https://github.com/apache/lucene/pull/12191
>> >>
>> >> https://github.com/apache/lucene/issues/11507
>> >>
>> >>
>> >> I would like to set up a discussion and
>> potentially a vote about this.
>> >>
>> >> I have seen some strong opposition from a few
>> people but a majority of favor in this direction.
>> >>
>> >>
>> >> Motivation
>> >>
>> >> We were discussing in the Solr slack channel with
>> Ishan Chattopadhyaya, Marcus Eagan, and David Smiley
>> about some neural search integrations in Solr:
>> https://github.com/openai/chatgpt-retrieval-plugin
>> >>
>> >>
>> >> Proposal
>> >>
>> >> No hard limit at all.
>> >>
>> >> As for many other Lucene areas, users will be
>> allowed to push the system to the limit of their
>> resources and get terrible performances or crashes if
>> they want.
>> >>
>> >>
>> >> What we are NOT discussing
>> >>
>> >> - Quality and scalability of the HNSW algorithm
>> >>
>> >> - dimensionality reduction
>> >>
>> >> - strategies to fit in an arbitrary self-imposed limit
>> >>
>> >>
>> >> Benefits
>> >>
>> >> - users can use the models they want to generate
>> vectors
>> >>
>> >> - removal of an arbitrary limit that blocks some
>> integrations
>> >>
>> >>
>> >> Cons
>> >>
>> >> - if you go for vectors with high dimensions,
>> there's no guarantee you get acceptable performance
>> for your use case
>> >>
>> >>
>> >>
>> >> I want to keep it simple, right now in many Lucene
>> areas, you can push the system to not acceptable
>> performance/ crashes.
>> >>
>> >> For example, we don't limit the number of docs per
>> index to an arbitrary maximum of N, you push how many
>> docs you like and if they are too much for your
>> system, you get terrible performance/crashes/whatever.
>> >>
>> >>
>> >> Limits caused by primitive java types will stay
>> there behind the scene, and that's acceptable, but I
>> would prefer to not have arbitrary hard-coded ones
>> that may limit the software usability and integration
>> which is extremely important for a library.
>> >>
>> >>
>> >> I strongly encourage people to add benefits and
>> cons, that I missed (I am sure I missed some of them,
>> but wanted to keep it simple)
>> >>
>> >>
>> >> Cheers
>> >>
>> >> --------------------------
>> >> Alessandro Benedetti
>> >> Director @ Sease Ltd.
>> >> Apache Lucene/Solr Committer
>> >> Apache Solr PMC Member
>> >>
>> >> e-mail: a.benedetti@sease.io
>> >>
>> >>
>> >> Sease - Information Retrieval Applied
>> >> Consulting | Training | Open Source
>> >>
>> >> Website: Sease.io
>> >> LinkedIn | Twitter | Youtube | Github
>> >>
>> >>
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail:
>> dev-help@lucene.apache.org
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 5, 2023, 3:34 AM

Post #13 of 99 (727 views)

Thanks Mike for the insight!

What would be the next steps then?
I see agreement but also the necessity of identifying a candidate MAX.

Should create a VOTE thread, where we propose some values with a
justification and we vote?

In this way we can create a pull request and merge relatively soon.

Cheers

On Tue, 4 Apr 2023, 14:47 Michael Wechner, <michael.wechner@wyona.com>
wrote:

> IIUC we all agree that the limit could be raised, but we need some solid
> reasoning what limit makes sense, resp. why do we set this particular limit
> (e.g. 2048), right?
>
> Thanks
>
> Michael
>
>
> Am 04.04.23 um 15:32 schrieb Michael McCandless:
>
> > I am not in favor of just doubling it as suggested by some people, I
> would ideally prefer a solution that remains there to a decent extent,
> rather than having to modifying it anytime someone requires a higher limit.
>
> The problem with this approach is it is a one-way door, once released. We
> would not be able to lower the limit again in the future without possibly
> breaking some applications.
>
> > For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they are too
> much for your system, you get terrible performance/crashes/whatever.
>
> Correction: we do check this limit and throw a specific exception now:
> https://github.com/apache/lucene/issues/6905
>
> +1 to raise the limit, but not remove it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 3, 2023 at 9:51?AM Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
>> ... and what would be the next limit?
>> I guess we'll need to motivate it better than the 1024 one.
>> I appreciate the fact that a limit is pretty much wanted by everyone but
>> I suspect we'll need some solid foundation for deciding the amount (and it
>> should be high enough to avoid continuous changes)
>>
>> Cheers
>>
>> On Sun, 2 Apr 2023, 07:29 Michael Wechner, <michael.wechner@wyona.com>
>> wrote:
>>
>>> btw, what was the reasoning to set the current limit to 1024?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>>>
>>> I'm also in favor of raising this limit. We do see some datasets with
>>> higher than 1024 dims. I also think we need to keep a limit. For example we
>>> currently need to keep all the vectors in RAM while indexing and we want to
>>> be able to support reasonable numbers of vectors in an index segment. Also
>>> we don't know what innovations might come down the road. Maybe someday we
>>> want to do product quantization and enforce that (k, m) both fit in a byte
>>> -- we wouldn't be able to do that if a vector's dimension were to exceed
>>> 32K.
>>>
>>> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> I am also curious what would be the worst-case scenario if we remove
>>>> the constant at all (so automatically the limit becomes the Java
>>>> Integer.MAX_VALUE).
>>>> i.e.
>>>> right now if you exceed the limit you get:
>>>>
>>>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>>>> throw new IllegalArgumentException(
>>>>> "cannot index vectors with dimension greater than " + ByteVectorValues
>>>>> .MAX_DIMENSIONS);
>>>>> }
>>>>
>>>>
>>>> in relation to:
>>>>
>>>>> These limits allow us to
>>>>> better tune our data structures, prevent overflows, help ensure we
>>>>> have good test coverage, etc.
>>>>
>>>>
>>>> I agree 100% especially for typing stuff properly and avoiding resource
>>>> waste here and there, but I am not entirely sure this is the case for the
>>>> current implementation i.e. do we have optimizations in place that assume
>>>> the max dimension to be 1024?
>>>> If I missed that (and I likely have), I of course suggest the
>>>> contribution should not just blindly remove the limit, but do it
>>>> appropriately.
>>>> I am not in favor of just doubling it as suggested by some people, I
>>>> would ideally prefer a solution that remains there to a decent extent,
>>>> rather than having to modifying it anytime someone requires a higher limit.
>>>>
>>>> Cheers
>>>>
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>>
>>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <
>>>> michael.wechner@wyona.com> wrote:
>>>>
>>>>> OpenAI reduced their size to 1536 dimensions
>>>>>
>>>>> https://openai.com/blog/new-and-improved-embedding-model
>>>>>
>>>>> so 2048 would work :-)
>>>>>
>>>>> but other services do provide also higher dimensions with sometimes
>>>>> slightly better accuracy
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>>>> > I'm supportive of bumping the limit on the maximum dimension for
>>>>> > vectors to something that is above what the majority of users need,
>>>>> > but I'd like to keep a limit. We have limits for other things like
>>>>> the
>>>>> > max number of docs per index, the max term length, the max number of
>>>>> > dimensions of points, etc. and there are a few things that we don't
>>>>> > have limits on that I wish we had limits on. These limits allow us to
>>>>> > better tune our data structures, prevent overflows, help ensure we
>>>>> > have good test coverage, etc.
>>>>> >
>>>>> > That said, these other limits we have in place are quite high. E.g.
>>>>> > the 32kB term limit, nobody would ever type a 32kB term in a text
>>>>> box.
>>>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>>>> > possibly have 2 splits per dimension on average if it doesn't have
>>>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>>>> > limit on the number of dimensions of vectors seems to be under what
>>>>> > some users would like, and while I understand the performance
>>>>> argument
>>>>> > against bumping the limit, it doesn't feel to me like something that
>>>>> > would be so bad that we need to prevent users from using numbers of
>>>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>>>> > look at a very small subset of the full dataset.
>>>>> >
>>>>> > So overall, my vote would be to bump the limit to 2048 as suggested
>>>>> by
>>>>> > Mayya on the issue that you linked.
>>>>> >
>>>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>>>> > <michael.wechner@wyona.com> wrote:
>>>>> >> Thanks Alessandro for summarizing the discussion below!
>>>>> >>
>>>>> >> I understand that there is no clear reasoning re what is the best
>>>>> embedding size, whereas I think heuristic approaches like described by the
>>>>> following link can be helpful
>>>>> >>
>>>>> >>
>>>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>>>> >>
>>>>> >> Having said this, we see various embedding services providing
>>>>> higher dimensions than 1024, like for example OpenAI, Cohere and Aleph
>>>>> Alpha.
>>>>> >>
>>>>> >> And it would be great if we could run benchmarks without having to
>>>>> recompile Lucene ourselves.
>>>>> >>
>>>>> >> Therefore I would to suggest to either increase the limit or even
>>>>> better to remove the limit and add a disclaimer, that people should be
>>>>> aware of possible crashes etc.
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> Michael
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>>>> >>
>>>>> >>
>>>>> >> I've been monitoring various discussions on Pull Requests about
>>>>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>>>> >>
>>>>> >> https://github.com/apache/lucene/pull/12191
>>>>> >>
>>>>> >> https://github.com/apache/lucene/issues/11507
>>>>> >>
>>>>> >>
>>>>> >> I would like to set up a discussion and potentially a vote about
>>>>> this.
>>>>> >>
>>>>> >> I have seen some strong opposition from a few people but a majority
>>>>> of favor in this direction.
>>>>> >>
>>>>> >>
>>>>> >> Motivation
>>>>> >>
>>>>> >> We were discussing in the Solr slack channel with Ishan
>>>>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>>>>> integrations in Solr:
>>>>> https://github.com/openai/chatgpt-retrieval-plugin
>>>>> >>
>>>>> >>
>>>>> >> Proposal
>>>>> >>
>>>>> >> No hard limit at all.
>>>>> >>
>>>>> >> As for many other Lucene areas, users will be allowed to push the
>>>>> system to the limit of their resources and get terrible performances or
>>>>> crashes if they want.
>>>>> >>
>>>>> >>
>>>>> >> What we are NOT discussing
>>>>> >>
>>>>> >> - Quality and scalability of the HNSW algorithm
>>>>> >>
>>>>> >> - dimensionality reduction
>>>>> >>
>>>>> >> - strategies to fit in an arbitrary self-imposed limit
>>>>> >>
>>>>> >>
>>>>> >> Benefits
>>>>> >>
>>>>> >> - users can use the models they want to generate vectors
>>>>> >>
>>>>> >> - removal of an arbitrary limit that blocks some integrations
>>>>> >>
>>>>> >>
>>>>> >> Cons
>>>>> >>
>>>>> >> - if you go for vectors with high dimensions, there's no
>>>>> guarantee you get acceptable performance for your use case
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> I want to keep it simple, right now in many Lucene areas, you can
>>>>> push the system to not acceptable performance/ crashes.
>>>>> >>
>>>>> >> For example, we don't limit the number of docs per index to an
>>>>> arbitrary maximum of N, you push how many docs you like and if they are too
>>>>> much for your system, you get terrible performance/crashes/whatever.
>>>>> >>
>>>>> >>
>>>>> >> Limits caused by primitive java types will stay there behind the
>>>>> scene, and that's acceptable, but I would prefer to not have arbitrary
>>>>> hard-coded ones that may limit the software usability and integration which
>>>>> is extremely important for a library.
>>>>> >>
>>>>> >>
>>>>> >> I strongly encourage people to add benefits and cons, that I missed
>>>>> (I am sure I missed some of them, but wanted to keep it simple)
>>>>> >>
>>>>> >>
>>>>> >> Cheers
>>>>> >>
>>>>> >> --------------------------
>>>>> >> Alessandro Benedetti
>>>>> >> Director @ Sease Ltd.
>>>>> >> Apache Lucene/Solr Committer
>>>>> >> Apache Solr PMC Member
>>>>> >>
>>>>> >> e-mail: a.benedetti@sease.io
>>>>> >>
>>>>> >>
>>>>> >> Sease - Information Retrieval Applied
>>>>> >> Consulting | Training | Open Source
>>>>> >>
>>>>> >> Website: Sease.io
>>>>> >> LinkedIn | Twitter | Youtube | Github
>>>>> >>
>>>>> >>
>>>>> >
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>>>
>>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 5, 2023, 5:10 AM

Post #14 of 99 (727 views)

Am 05.04.23 um 12:34 schrieb Alessandro Benedetti:
> Thanks Mike for the insight!
>
> What would be the next steps then?
> I see agreement but also the necessity of identifying a candidate MAX.
>
> Should create a VOTE thread, where we propose some values with a
> justification and we vote?

+1

Thanks

Michael

>
> In this way we can create a pull request and merge relatively soon.
>
> Cheers
>
> On Tue, 4 Apr 2023, 14:47 Michael Wechner, <michael.wechner@wyona.com>
> wrote:
>
> IIUC we all agree that the limit could be raised, but we need some
> solid reasoning what limit makes sense, resp. why do we set this
> particular limit (e.g. 2048), right?
>
> Thanks
>
> Michael
>
>
> Am 04.04.23 um 15:32 schrieb Michael McCandless:
>> > I am not in favor of just doubling it as suggested by some
>> people, I would ideally prefer a solution that remains there to a
>> decent extent, rather than having to modifying it anytime someone
>> requires a higher limit.
>>
>> The problem with this approach is it is a one-way door, once
>> released. We would not be able to lower the limit again in the
>> future without possibly breaking some applications.
>>
>> > For example, we don't limit the number of docs per index to an
>> arbitrary maximum of N, you push how many docs you like and if
>> they are too much for your system, you get terrible
>> performance/crashes/whatever.
>>
>> Correction: we do check this limit and throw a specific exception
>> now: https://github.com/apache/lucene/issues/6905
>>
>> +1 to raise the limit, but not remove it.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Apr 3, 2023 at 9:51?AM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>>
>> ... and what would be the next limit?
>> I guess we'll need to motivate it better than the 1024 one.
>> I appreciate the fact that a limit is pretty much wanted by
>> everyone but I suspect we'll need some solid foundation for
>> deciding the amount (and it should be high enough to avoid
>> continuous changes)
>>
>> Cheers
>>
>> On Sun, 2 Apr 2023, 07:29 Michael Wechner,
>> <michael.wechner@wyona.com> wrote:
>>
>> btw, what was the reasoning to set the current limit to 1024?
>>
>> Thanks
>>
>> Michael
>>
>> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>>> I'm also in favor of raising this limit. We do see some
>>> datasets with higher than 1024 dims. I also think we
>>> need to keep a limit. For example we currently need to
>>> keep all the vectors in RAM while indexing and we want
>>> to be able to support reasonable numbers of vectors in
>>> an index segment. Also we don't know what innovations
>>> might come down the road. Maybe someday we want to do
>>> product quantization and enforce that (k, m) both fit in
>>> a byte -- we wouldn't be able to do that if a vector's
>>> dimension were to exceed 32K.
>>>
>>> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti
>>> <a.benedetti@sease.io> wrote:
>>>
>>> I am also curious what would be the worst-case
>>> scenario if we remove the constant at all (so
>>> automatically the limit becomes the Java
>>> Integer.MAX_VALUE).
>>> i.e.
>>> right now if you exceed the limit you get:
>>>
>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>> throw new IllegalArgumentException(
>>> "cannot index vectors with dimension greater
>>> than " + ByteVectorValues.MAX_DIMENSIONS);
>>> }
>>>
>>>
>>> in relation to:
>>>
>>> These limits allow us to
>>> better tune our data structures, prevent
>>> overflows, help ensure we
>>> have good test coverage, etc.
>>>
>>> I agree 100% especially for typing stuff properly
>>> and avoiding resource waste here and there, but I am
>>> not entirely sure this is the case for the current
>>> implementation i.e. do we have optimizations in
>>> place that assume the max dimension to be 1024?
>>> If I missed that (and I likely have), I of course
>>> suggest the contribution should not just blindly
>>> remove the limit, but do it appropriately.
>>> I am not in favor of just doubling it as suggested
>>> by some people, I would ideally prefer a solution
>>> that remains there to a decent extent, rather than
>>> having to modifying it anytime someone requires a
>>> higher limit.
>>>
>>> Cheers
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> /Apache Lucene/Solr Committer/
>>> /Apache Solr PMC Member/
>>>
>>> e-mail: a.benedetti@sease.io/
>>> /
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> |
>>> Twitter <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
>>> Github <https://github.com/seaseltd>
>>>
>>>
>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner
>>> <michael.wechner@wyona.com> wrote:
>>>
>>> OpenAI reduced their size to 1536 dimensions
>>>
>>> https://openai.com/blog/new-and-improved-embedding-model
>>>
>>> so 2048 would work :-)
>>>
>>> but other services do provide also higher
>>> dimensions with sometimes
>>> slightly better accuracy
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>> > I'm supportive of bumping the limit on the
>>> maximum dimension for
>>> > vectors to something that is above what the
>>> majority of users need,
>>> > but I'd like to keep a limit. We have limits
>>> for other things like the
>>> > max number of docs per index, the max term
>>> length, the max number of
>>> > dimensions of points, etc. and there are a few
>>> things that we don't
>>> > have limits on that I wish we had limits on.
>>> These limits allow us to
>>> > better tune our data structures, prevent
>>> overflows, help ensure we
>>> > have good test coverage, etc.
>>> >
>>> > That said, these other limits we have in place
>>> are quite high. E.g.
>>> > the 32kB term limit, nobody would ever type a
>>> 32kB term in a text box.
>>> > Likewise for the max of 8 dimensions for
>>> points: a segment cannot
>>> > possibly have 2 splits per dimension on
>>> average if it doesn't have
>>> > 512*2^(8*2)=34M docs, a sizable dataset
>>> already, so more dimensions
>>> > than 8 would likely defeat the point of
>>> indexing. In contrast, our
>>> > limit on the number of dimensions of vectors
>>> seems to be under what
>>> > some users would like, and while I understand
>>> the performance argument
>>> > against bumping the limit, it doesn't feel to
>>> me like something that
>>> > would be so bad that we need to prevent users
>>> from using numbers of
>>> > dimensions in the low thousands, e.g. top-k
>>> KNN searches would still
>>> > look at a very small subset of the full dataset.
>>> >
>>> > So overall, my vote would be to bump the limit
>>> to 2048 as suggested by
>>> > Mayya on the issue that you linked.
>>> >
>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >> Thanks Alessandro for summarizing the
>>> discussion below!
>>> >>
>>> >> I understand that there is no clear reasoning
>>> re what is the best embedding size, whereas I
>>> think heuristic approaches like described by the
>>> following link can be helpful
>>> >>
>>> >>
>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>> >>
>>> >> Having said this, we see various embedding
>>> services providing higher dimensions than 1024,
>>> like for example OpenAI, Cohere and Aleph Alpha.
>>> >>
>>> >> And it would be great if we could run
>>> benchmarks without having to recompile Lucene
>>> ourselves.
>>> >>
>>> >> Therefore I would to suggest to either
>>> increase the limit or even better to remove the
>>> limit and add a disclaimer, that people should
>>> be aware of possible crashes etc.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Am 31.03.23 um 11:43 schrieb Alessandro
>>> Benedetti:
>>> >>
>>> >>
>>> >> I've been monitoring various discussions on
>>> Pull Requests about changing the max number of
>>> dimensions allowed for Lucene HNSW vectors:
>>> >>
>>> >> https://github.com/apache/lucene/pull/12191
>>> >>
>>> >> https://github.com/apache/lucene/issues/11507
>>> >>
>>> >>
>>> >> I would like to set up a discussion and
>>> potentially a vote about this.
>>> >>
>>> >> I have seen some strong opposition from a few
>>> people but a majority of favor in this direction.
>>> >>
>>> >>
>>> >> Motivation
>>> >>
>>> >> We were discussing in the Solr slack channel
>>> with Ishan Chattopadhyaya, Marcus Eagan, and
>>> David Smiley about some neural search
>>> integrations in Solr:
>>> https://github.com/openai/chatgpt-retrieval-plugin
>>> >>
>>> >>
>>> >> Proposal
>>> >>
>>> >> No hard limit at all.
>>> >>
>>> >> As for many other Lucene areas, users will be
>>> allowed to push the system to the limit of their
>>> resources and get terrible performances or
>>> crashes if they want.
>>> >>
>>> >>
>>> >> What we are NOT discussing
>>> >>
>>> >> - Quality and scalability of the HNSW algorithm
>>> >>
>>> >> - dimensionality reduction
>>> >>
>>> >> - strategies to fit in an arbitrary
>>> self-imposed limit
>>> >>
>>> >>
>>> >> Benefits
>>> >>
>>> >> - users can use the models they want to
>>> generate vectors
>>> >>
>>> >> - removal of an arbitrary limit that blocks
>>> some integrations
>>> >>
>>> >>
>>> >> Cons
>>> >>
>>> >> - if you go for vectors with high
>>> dimensions, there's no guarantee you get
>>> acceptable performance for your use case
>>> >>
>>> >>
>>> >>
>>> >> I want to keep it simple, right now in many
>>> Lucene areas, you can push the system to not
>>> acceptable performance/ crashes.
>>> >>
>>> >> For example, we don't limit the number of
>>> docs per index to an arbitrary maximum of N, you
>>> push how many docs you like and if they are too
>>> much for your system, you get terrible
>>> performance/crashes/whatever.
>>> >>
>>> >>
>>> >> Limits caused by primitive java types will
>>> stay there behind the scene, and that's
>>> acceptable, but I would prefer to not have
>>> arbitrary hard-coded ones that may limit the
>>> software usability and integration which is
>>> extremely important for a library.
>>> >>
>>> >>
>>> >> I strongly encourage people to add benefits
>>> and cons, that I missed (I am sure I missed some
>>> of them, but wanted to keep it simple)
>>> >>
>>> >>
>>> >> Cheers
>>> >>
>>> >> --------------------------
>>> >> Alessandro Benedetti
>>> >> Director @ Sease Ltd.
>>> >> Apache Lucene/Solr Committer
>>> >> Apache Solr PMC Member
>>> >>
>>> >> e-mail: a.benedetti@sease.io
>>> >>
>>> >>
>>> >> Sease - Information Retrieval Applied
>>> >> Consulting | Training | Open Source
>>> >>
>>> >> Website: Sease.io
>>> >> LinkedIn | Twitter | Youtube | Github
>>> >>
>>> >>
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail:
>>> dev-help@lucene.apache.org
>>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

dawid.weiss at gmail

Apr 5, 2023, 6:49 AM

Post #15 of 99 (727 views)

> Should create a VOTE thread, where we propose some values with a
> justification and we vote?
>

Technically, a vote thread won't help much if there's no full consensus - a
single veto will make the patch unacceptable for merging.
https://www.apache.org/foundation/voting.html#Veto

Dawid

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 5, 2023, 8:22 AM

Post #16 of 99 (727 views)

Ok, so what should we do then?
This space is moving fast, and in my opinion we should act fast to release
and guarantee we attract as many users as possible.

At the same time I am not saying we should proceed blind, if there's
concrete evidence for setting a limit rather than another, or that a
certain limit is detrimental to the project, I think that veto should be
valid.

We shouldn't accept weakly/not scientifically motivated vetos anyway right?

The problem I see is that more than voting we should first decide this
limit and I don't know how we can operate.
I am imagining like a poll where each entry is a limit + motivation and
PMCs maybe vote/add entries?

Did anything similar happen in the past? How was the current limit added?

On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:

>
>
>> Should create a VOTE thread, where we propose some values with a
>> justification and we vote?
>>
>
> Technically, a vote thread won't help much if there's no full consensus -
> a single veto will make the patch unacceptable for merging.
> https://www.apache.org/foundation/voting.html#Veto
>
> Dawid
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

dawid.weiss at gmail

Apr 5, 2023, 8:48 AM

Post #17 of 99 (727 views)

> Ok, so what should we do then?

I don't know, Alessandro. I just wanted to point out the fact that by
Apache rules a committer's veto to a code change counts as a no-go. It
does not specify any way to "override" such a veto, perhaps counting
on disagreeing parties to resolve conflicting points of views in a
civil manner so that veto can be retracted (or a different solution
suggested).

I think Robert's point is not about a particular limit value but about
the algorithm itself - the current implementation does not scale. I
don't want to be an advocate for either side - I'm all for freedom of
choice but at the same time last time I tried indexing a few million
vectors, I couldn't get far before segment merging blew up with
OOMs...

> Did anything similar happen in the past? How was the current limit added?

I honestly don't know, you'd have to git blame or look at the mailing
list archives of the original contribution.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 5, 2023, 8:55 AM

Post #18 of 99 (727 views)

Hi Dawid

Can you describe your crash in more detail?

How many millions vectors exactly?
What was the vector dimension?
How much RAM?
etc.

Thanks

Michael

Am 05.04.23 um 17:48 schrieb Dawid Weiss:
>> Ok, so what should we do then?
> I don't know, Alessandro. I just wanted to point out the fact that by
> Apache rules a committer's veto to a code change counts as a no-go. It
> does not specify any way to "override" such a veto, perhaps counting
> on disagreeing parties to resolve conflicting points of views in a
> civil manner so that veto can be retracted (or a different solution
> suggested).
>
> I think Robert's point is not about a particular limit value but about
> the algorithm itself - the current implementation does not scale. I
> don't want to be an advocate for either side - I'm all for freedom of
> choice but at the same time last time I tried indexing a few million
> vectors, I couldn't get far before segment merging blew up with
> OOMs...
>
>> Did anything similar happen in the past? How was the current limit added?
> I honestly don't know, you'd have to git blame or look at the mailing
> list archives of the original contribution.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

dawid.weiss at gmail

Apr 5, 2023, 9:30 AM

Post #19 of 99 (727 views)

> Can you describe your crash in more detail?

I can't. That experiment was a while ago and a quick test to see if I
could index rather large-ish USPTO (patent office) data as vectors.
Couldn't do it then.

> How much RAM?

My indexing jobs run with rather smallish heaps to give space for I/O
buffers. Think 4-8GB at most. So yes, it could have been the problem.
I recall segment merging grew slower and slower and then simply
crashed. Lucene should work with low heap requirements, even if it
slows down. Throwing ram at the indexing/ segment merging problem
is... I don't know - not elegant?

Anyway. My main point was to remind folks about how Apache works -
code is merged in when there are no vetoes. If Rob (or anybody else)
remains unconvinced, he or she can block the change. (I didn't invent
those rules).

D.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 5, 2023, 9:51 AM

Post #20 of 99 (727 views)

Thanks for your feedback!

I agree, that it should not crash.

So far we did not experience crashes ourselves, but we did not index
millions of vectors.

I will try to reproduce the crash, maybe this will help us to move forward.

Thanks

Michael

Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> Can you describe your crash in more detail?
> I can't. That experiment was a while ago and a quick test to see if I
> could index rather large-ish USPTO (patent office) data as vectors.
> Couldn't do it then.
>
>> How much RAM?
> My indexing jobs run with rather smallish heaps to give space for I/O
> buffers. Think 4-8GB at most. So yes, it could have been the problem.
> I recall segment merging grew slower and slower and then simply
> crashed. Lucene should work with low heap requirements, even if it
> slows down. Throwing ram at the indexing/ segment merging problem
> is... I don't know - not elegant?
>
> Anyway. My main point was to remind folks about how Apache works -
> code is merged in when there are no vetoes. If Rob (or anybody else)
> remains unconvinced, he or she can block the change. (I didn't invent
> those rules).
>
> D.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 5, 2023, 10:54 AM

Post #21 of 99 (727 views)

I'd ask anyone voting +1 to raise this limit to at least try to index
a few million vectors with 756 or 1024, which is allowed today.

IMO based on how painful it is, it seems the limit is already too
high, I realize that will sound controversial but please at least try
it out!

voting +1 without at least doing this is really the
"weak/unscientifically minded" approach.

On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> Thanks for your feedback!
>
> I agree, that it should not crash.
>
> So far we did not experience crashes ourselves, but we did not index
> millions of vectors.
>
> I will try to reproduce the crash, maybe this will help us to move forward.
>
> Thanks
>
> Michael
>
> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >> Can you describe your crash in more detail?
> > I can't. That experiment was a while ago and a quick test to see if I
> > could index rather large-ish USPTO (patent office) data as vectors.
> > Couldn't do it then.
> >
> >> How much RAM?
> > My indexing jobs run with rather smallish heaps to give space for I/O
> > buffers. Think 4-8GB at most. So yes, it could have been the problem.
> > I recall segment merging grew slower and slower and then simply
> > crashed. Lucene should work with low heap requirements, even if it
> > slows down. Throwing ram at the indexing/ segment merging problem
> > is... I don't know - not elegant?
> >
> > Anyway. My main point was to remind folks about how Apache works -
> > code is merged in when there are no vetoes. If Rob (or anybody else)
> > remains unconvinced, he or she can block the change. (I didn't invent
> > those rules).
> >
> > D.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

osjdconrad at gmail

Apr 5, 2023, 1:50 PM

Post #22 of 99 (727 views)

I don't want to get too far off topic, but I think one of the problems here
is that HNSW doesn't really fit well as a Lucene data structure. The way it
behaves it would be better supported as a live, in-memory data structure
instead of segmented and written to disk for tiny graphs that then need to
be merged. I wonder if it may be a better approach to explore other
possible algorithms that are designed to be on-disk instead of in-memory
even if they require k-means clustering as a trade-off. Maybe with an
on-disk algorithm we could have good enough performance for a
higher-dimensional limit.

On Wed, Apr 5, 2023 at 10:54?AM Robert Muir <rcmuir@gmail.com> wrote:

> I'd ask anyone voting +1 to raise this limit to at least try to index
> a few million vectors with 756 or 1024, which is allowed today.
>
> IMO based on how painful it is, it seems the limit is already too
> high, I realize that will sound controversial but please at least try
> it out!
>
> voting +1 without at least doing this is really the
> "weak/unscientifically minded" approach.
>
> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
> >
> > Thanks for your feedback!
> >
> > I agree, that it should not crash.
> >
> > So far we did not experience crashes ourselves, but we did not index
> > millions of vectors.
> >
> > I will try to reproduce the crash, maybe this will help us to move
> forward.
> >
> > Thanks
> >
> > Michael
> >
> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> > >> Can you describe your crash in more detail?
> > > I can't. That experiment was a while ago and a quick test to see if I
> > > could index rather large-ish USPTO (patent office) data as vectors.
> > > Couldn't do it then.
> > >
> > >> How much RAM?
> > > My indexing jobs run with rather smallish heaps to give space for I/O
> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
> > > I recall segment merging grew slower and slower and then simply
> > > crashed. Lucene should work with low heap requirements, even if it
> > > slows down. Throwing ram at the indexing/ segment merging problem
> > > is... I don't know - not elegant?
> > >
> > > Anyway. My main point was to remind folks about how Apache works -
> > > code is merged in when there are no vetoes. If Rob (or anybody else)
> > > remains unconvinced, he or she can block the change. (I didn't invent
> > > those rules).
> > >
> > > D.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: dev-help@lucene.apache.org
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

gus.heck at gmail

Apr 5, 2023, 7:30 PM

Post #23 of 99 (727 views)

10 MB hard drive, wow I'll never need another floppy disk ever...
Neural nets... nice idea, but there will never be enough CPU power to run
them...

etc.

Is it possible to make it a configurable limit?

On Wed, Apr 5, 2023 at 4:51?PM Jack Conradson <osjdconrad@gmail.com> wrote:

> I don't want to get too far off topic, but I think one of the problems
> here is that HNSW doesn't really fit well as a Lucene data structure. The
> way it behaves it would be better supported as a live, in-memory data
> structure instead of segmented and written to disk for tiny graphs that
> then need to be merged. I wonder if it may be a better approach to explore
> other possible algorithms that are designed to be on-disk instead of
> in-memory even if they require k-means clustering as a trade-off. Maybe
> with an on-disk algorithm we could have good enough performance for a
> higher-dimensional limit.
>
> On Wed, Apr 5, 2023 at 10:54?AM Robert Muir <rcmuir@gmail.com> wrote:
>
>> I'd ask anyone voting +1 to raise this limit to at least try to index
>> a few million vectors with 756 or 1024, which is allowed today.
>>
>> IMO based on how painful it is, it seems the limit is already too
>> high, I realize that will sound controversial but please at least try
>> it out!
>>
>> voting +1 without at least doing this is really the
>> "weak/unscientifically minded" approach.
>>
>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>> <michael.wechner@wyona.com> wrote:
>> >
>> > Thanks for your feedback!
>> >
>> > I agree, that it should not crash.
>> >
>> > So far we did not experience crashes ourselves, but we did not index
>> > millions of vectors.
>> >
>> > I will try to reproduce the crash, maybe this will help us to move
>> forward.
>> >
>> > Thanks
>> >
>> > Michael
>> >
>> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> > >> Can you describe your crash in more detail?
>> > > I can't. That experiment was a while ago and a quick test to see if I
>> > > could index rather large-ish USPTO (patent office) data as vectors.
>> > > Couldn't do it then.
>> > >
>> > >> How much RAM?
>> > > My indexing jobs run with rather smallish heaps to give space for I/O
>> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
>> > > I recall segment merging grew slower and slower and then simply
>> > > crashed. Lucene should work with low heap requirements, even if it
>> > > slows down. Throwing ram at the indexing/ segment merging problem
>> > > is... I don't know - not elegant?
>> > >
>> > > Anyway. My main point was to remind folks about how Apache works -
>> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>> > > remains unconvinced, he or she can block the change. (I didn't invent
>> > > those rules).
>> > >
>> > > D.
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > > For additional commands, e-mail: dev-help@lucene.apache.org
>> > >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

lucene at mikemccandless

Apr 6, 2023, 2:47 AM

Post #24 of 99 (727 views)

> We shouldn't accept weakly/not scientifically motivated vetos anyway
right?

In fact we must accept all vetos by any committer as a veto, for a change
to Lucene's source code, regardless of that committer's reasoning. This is
the power of Apache's model.

Of course we all can and will work together to convince one another (this
is where the scientifically motivated part comes in) to change our votes,
one way or another.

> I'd ask anyone voting +1 to raise this limit to at least try to index a
few million vectors with 756 or 1024, which is allowed today.

+1, if the current implementation really does not scale / needs more and
more RAM for merging, let's understand what's going on here, first, before
increasing limits. I rescind my hasty +1 for now!

Mike McCandless

http://blog.mikemccandless.com

On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Ok, so what should we do then?
> This space is moving fast, and in my opinion we should act fast to release
> and guarantee we attract as many users as possible.
>
> At the same time I am not saying we should proceed blind, if there's
> concrete evidence for setting a limit rather than another, or that a
> certain limit is detrimental to the project, I think that veto should be
> valid.
>
> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>
> The problem I see is that more than voting we should first decide this
> limit and I don't know how we can operate.
> I am imagining like a poll where each entry is a limit + motivation and
> PMCs maybe vote/add entries?
>
> Did anything similar happen in the past? How was the current limit added?
>
>
> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>
>>
>>
>>> Should create a VOTE thread, where we propose some values with a
>>> justification and we vote?
>>>
>>
>> Technically, a vote thread won't help much if there's no full consensus -
>> a single veto will make the patch unacceptable for merging.
>> https://www.apache.org/foundation/voting.html#Veto
>>
>> Dawid
>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 6, 2023, 7:08 AM

Post #25 of 99 (727 views)

I think we should focus on testing where the limits are and what might
cause the limits.

Let's get out of this fog :-)

Thanks

Michael

Am 06.04.23 um 11:47 schrieb Michael McCandless:
> > We shouldn't accept weakly/not scientifically motivated vetos anyway
> right?
>
> In fact we must accept all vetos by any committer as a veto, for a
> change to Lucene's source code, regardless of that committer's
> reasoning. This is the power of Apache's model.
>
> Of course we all can and will work together to convince one another
> (this is where the scientifically motivated part comes in) to change
> our votes, one way or another.
>
> > I'd ask anyone voting +1 to raise this limit to at least try to
> index a few million vectors with 756 or 1024, which is allowed today.
>
> +1, if the current implementation really does not scale / needs more
> and more RAM for merging, let's understand what's going on here,
> first, before increasing limits. I rescind my hasty +1 for now!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> Ok, so what should we do then?
> This space is moving fast, and in my opinion we should act fast to
> release and guarantee we attract as many users as possible.
>
> At the same time I am not saying we should proceed blind, if
> there's concrete evidence for setting a limit rather than another,
> or that a certain limit is detrimental to the project, I think
> that veto should be valid.
>
> We shouldn't accept weakly/not scientifically motivated vetos
> anyway right?
>
> The problem I see is that more than voting we should first decide
> this limit and I don't know how we can operate.
> I am imagining like a poll where each entry is a limit +
> motivation and PMCs maybe vote/add entries?
>
> Did anything similar happen in the past? How was the current limit
> added?
>
>
> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>
> Should create a VOTE thread, where we propose some values
> with a justification and we vote?
>
>
> Technically, a vote thread won't help much if there's no full
> consensus - a single veto will make the patch unacceptable for
> merging.
> https://www.apache.org/foundation/voting.html#Veto
>
> Dawid
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 6, 2023, 7:11 AM

Post #26 of 99 (120 views)

re: how does this HNSW stuff scale - I think people are calling out
indexing memory usage here, so let's discuss some facts. During
initial indexing we hold in RAM all the vector data and the graph
constructed from the new documents, but this is accounted for and
limited by the size of IndexWriter's buffer; the document vectors and
their graph will be flushed to disk when this fills up, and at search
time, they are not read in wholesale to RAM. There is potentially
unbounded RAM usage during merging though, because the entire merged
graph will be built in RAM. I lost track of how we handle the vector
data now, but at least in theory it should be fairly straightforward
to write the merged vector data in chunks using only limited RAM. So
how much RAM does the graph use? It uses numdocs*fanout VInts.
Actually it doesn't really scale with the vector dimension at all -
rather it scales with the graph fanout (M) parameter and with the
total number of documents. So I think this focus on limiting the
vector dimension is not helping to address the concern about RAM usage
while merging.

The vector dimension does have a strong role in the search, and
indexing time, but the impact is linear in the dimension and won't
exhaust any limited resource.

On Thu, Apr 6, 2023 at 5:48?AM Michael McCandless
<lucene@mikemccandless.com> wrote:
>
> > We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>
> In fact we must accept all vetos by any committer as a veto, for a change to Lucene's source code, regardless of that committer's reasoning. This is the power of Apache's model.
>
> Of course we all can and will work together to convince one another (this is where the scientifically motivated part comes in) to change our votes, one way or another.
>
> > I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024, which is allowed today.
>
> +1, if the current implementation really does not scale / needs more and more RAM for merging, let's understand what's going on here, first, before increasing limits. I rescind my hasty +1 for now!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti <a.benedetti@sease.io> wrote:
>>
>> Ok, so what should we do then?
>> This space is moving fast, and in my opinion we should act fast to release and guarantee we attract as many users as possible.
>>
>> At the same time I am not saying we should proceed blind, if there's concrete evidence for setting a limit rather than another, or that a certain limit is detrimental to the project, I think that veto should be valid.
>>
>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>>
>> The problem I see is that more than voting we should first decide this limit and I don't know how we can operate.
>> I am imagining like a poll where each entry is a limit + motivation and PMCs maybe vote/add entries?
>>
>> Did anything similar happen in the past? How was the current limit added?
>>
>>
>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>>>
>>>
>>>>
>>>> Should create a VOTE thread, where we propose some values with a justification and we vote?
>>>
>>>
>>> Technically, a vote thread won't help much if there's no full consensus - a single veto will make the patch unacceptable for merging.
>>> https://www.apache.org/foundation/voting.html#Veto
>>>
>>> Dawid
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 6, 2023, 7:20 AM

Post #27 of 99 (120 views)

thanks very much for these insights!

Does it make a difference re RAM when I do a batch import, for example
import 1000 documents and close the IndexWriter and do a forceMerge or
import 1Mio documents at once?

I would expect so, or do I misunderstand this?

Thanks

Michael

Am 06.04.23 um 16:11 schrieb Michael Sokolov:
> re: how does this HNSW stuff scale - I think people are calling out
> indexing memory usage here, so let's discuss some facts. During
> initial indexing we hold in RAM all the vector data and the graph
> constructed from the new documents, but this is accounted for and
> limited by the size of IndexWriter's buffer; the document vectors and
> their graph will be flushed to disk when this fills up, and at search
> time, they are not read in wholesale to RAM. There is potentially
> unbounded RAM usage during merging though, because the entire merged
> graph will be built in RAM. I lost track of how we handle the vector
> data now, but at least in theory it should be fairly straightforward
> to write the merged vector data in chunks using only limited RAM. So
> how much RAM does the graph use? It uses numdocs*fanout VInts.
> Actually it doesn't really scale with the vector dimension at all -
> rather it scales with the graph fanout (M) parameter and with the
> total number of documents. So I think this focus on limiting the
> vector dimension is not helping to address the concern about RAM usage
> while merging.
>
> The vector dimension does have a strong role in the search, and
> indexing time, but the impact is linear in the dimension and won't
> exhaust any limited resource.
>
> On Thu, Apr 6, 2023 at 5:48?AM Michael McCandless
> <lucene@mikemccandless.com> wrote:
>>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>> In fact we must accept all vetos by any committer as a veto, for a change to Lucene's source code, regardless of that committer's reasoning. This is the power of Apache's model.
>>
>> Of course we all can and will work together to convince one another (this is where the scientifically motivated part comes in) to change our votes, one way or another.
>>
>>> I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024, which is allowed today.
>> +1, if the current implementation really does not scale / needs more and more RAM for merging, let's understand what's going on here, first, before increasing limits. I rescind my hasty +1 for now!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti <a.benedetti@sease.io> wrote:
>>> Ok, so what should we do then?
>>> This space is moving fast, and in my opinion we should act fast to release and guarantee we attract as many users as possible.
>>>
>>> At the same time I am not saying we should proceed blind, if there's concrete evidence for setting a limit rather than another, or that a certain limit is detrimental to the project, I think that veto should be valid.
>>>
>>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>>>
>>> The problem I see is that more than voting we should first decide this limit and I don't know how we can operate.
>>> I am imagining like a poll where each entry is a limit + motivation and PMCs maybe vote/add entries?
>>>
>>> Did anything similar happen in the past? How was the current limit added?
>>>
>>>
>>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>>>>
>>>>> Should create a VOTE thread, where we propose some values with a justification and we vote?
>>>>
>>>> Technically, a vote thread won't help much if there's no full consensus - a single veto will make the patch unacceptable for merging.
>>>> https://www.apache.org/foundation/voting.html#Veto
>>>>
>>>> Dawid
>>>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 6, 2023, 7:28 AM

Post #28 of 99 (120 views)

yes, it makes a difference. It will take less time and CPU to do it
all in one go, producing a single segment (assuming the data does not
exceed the IndexWriter RAM buffer size). If you index a lot of little
segments and then force merge them it will take longer, because it had
to build the graphs for the little segments, and then for the big one
when merging, and it will eventually use the same amount of RAM to
build the big graph, although I don't believe it will have to load the
vectors en masse into RAM while merging.

On Thu, Apr 6, 2023 at 10:20?AM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> thanks very much for these insights!
>
> Does it make a difference re RAM when I do a batch import, for example
> import 1000 documents and close the IndexWriter and do a forceMerge or
> import 1Mio documents at once?
>
> I would expect so, or do I misunderstand this?
>
> Thanks
>
> Michael
>
>
>
> Am 06.04.23 um 16:11 schrieb Michael Sokolov:
> > re: how does this HNSW stuff scale - I think people are calling out
> > indexing memory usage here, so let's discuss some facts. During
> > initial indexing we hold in RAM all the vector data and the graph
> > constructed from the new documents, but this is accounted for and
> > limited by the size of IndexWriter's buffer; the document vectors and
> > their graph will be flushed to disk when this fills up, and at search
> > time, they are not read in wholesale to RAM. There is potentially
> > unbounded RAM usage during merging though, because the entire merged
> > graph will be built in RAM. I lost track of how we handle the vector
> > data now, but at least in theory it should be fairly straightforward
> > to write the merged vector data in chunks using only limited RAM. So
> > how much RAM does the graph use? It uses numdocs*fanout VInts.
> > Actually it doesn't really scale with the vector dimension at all -
> > rather it scales with the graph fanout (M) parameter and with the
> > total number of documents. So I think this focus on limiting the
> > vector dimension is not helping to address the concern about RAM usage
> > while merging.
> >
> > The vector dimension does have a strong role in the search, and
> > indexing time, but the impact is linear in the dimension and won't
> > exhaust any limited resource.
> >
> > On Thu, Apr 6, 2023 at 5:48?AM Michael McCandless
> > <lucene@mikemccandless.com> wrote:
> >>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
> >> In fact we must accept all vetos by any committer as a veto, for a change to Lucene's source code, regardless of that committer's reasoning. This is the power of Apache's model.
> >>
> >> Of course we all can and will work together to convince one another (this is where the scientifically motivated part comes in) to change our votes, one way or another.
> >>
> >>> I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024, which is allowed today.
> >> +1, if the current implementation really does not scale / needs more and more RAM for merging, let's understand what's going on here, first, before increasing limits. I rescind my hasty +1 for now!
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti <a.benedetti@sease.io> wrote:
> >>> Ok, so what should we do then?
> >>> This space is moving fast, and in my opinion we should act fast to release and guarantee we attract as many users as possible.
> >>>
> >>> At the same time I am not saying we should proceed blind, if there's concrete evidence for setting a limit rather than another, or that a certain limit is detrimental to the project, I think that veto should be valid.
> >>>
> >>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
> >>>
> >>> The problem I see is that more than voting we should first decide this limit and I don't know how we can operate.
> >>> I am imagining like a poll where each entry is a limit + motivation and PMCs maybe vote/add entries?
> >>>
> >>> Did anything similar happen in the past? How was the current limit added?
> >>>
> >>>
> >>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
> >>>>
> >>>>> Should create a VOTE thread, where we propose some values with a justification and we vote?
> >>>>
> >>>> Technically, a vote thread won't help much if there's no full consensus - a single veto will make the patch unacceptable for merging.
> >>>> https://www.apache.org/foundation/voting.html#Veto
> >>>>
> >>>> Dawid
> >>>>
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 6, 2023, 7:33 AM

Post #29 of 99 (120 views)

Thanks!

I will try to run some tests to be on the safe side :-)

Am 06.04.23 um 16:28 schrieb Michael Sokolov:
> yes, it makes a difference. It will take less time and CPU to do it
> all in one go, producing a single segment (assuming the data does not
> exceed the IndexWriter RAM buffer size). If you index a lot of little
> segments and then force merge them it will take longer, because it had
> to build the graphs for the little segments, and then for the big one
> when merging, and it will eventually use the same amount of RAM to
> build the big graph, although I don't believe it will have to load the
> vectors en masse into RAM while merging.
>
> On Thu, Apr 6, 2023 at 10:20?AM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>> thanks very much for these insights!
>>
>> Does it make a difference re RAM when I do a batch import, for example
>> import 1000 documents and close the IndexWriter and do a forceMerge or
>> import 1Mio documents at once?
>>
>> I would expect so, or do I misunderstand this?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 06.04.23 um 16:11 schrieb Michael Sokolov:
>>> re: how does this HNSW stuff scale - I think people are calling out
>>> indexing memory usage here, so let's discuss some facts. During
>>> initial indexing we hold in RAM all the vector data and the graph
>>> constructed from the new documents, but this is accounted for and
>>> limited by the size of IndexWriter's buffer; the document vectors and
>>> their graph will be flushed to disk when this fills up, and at search
>>> time, they are not read in wholesale to RAM. There is potentially
>>> unbounded RAM usage during merging though, because the entire merged
>>> graph will be built in RAM. I lost track of how we handle the vector
>>> data now, but at least in theory it should be fairly straightforward
>>> to write the merged vector data in chunks using only limited RAM. So
>>> how much RAM does the graph use? It uses numdocs*fanout VInts.
>>> Actually it doesn't really scale with the vector dimension at all -
>>> rather it scales with the graph fanout (M) parameter and with the
>>> total number of documents. So I think this focus on limiting the
>>> vector dimension is not helping to address the concern about RAM usage
>>> while merging.
>>>
>>> The vector dimension does have a strong role in the search, and
>>> indexing time, but the impact is linear in the dimension and won't
>>> exhaust any limited resource.
>>>
>>> On Thu, Apr 6, 2023 at 5:48?AM Michael McCandless
>>> <lucene@mikemccandless.com> wrote:
>>>>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>>>> In fact we must accept all vetos by any committer as a veto, for a change to Lucene's source code, regardless of that committer's reasoning. This is the power of Apache's model.
>>>>
>>>> Of course we all can and will work together to convince one another (this is where the scientifically motivated part comes in) to change our votes, one way or another.
>>>>
>>>>> I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024, which is allowed today.
>>>> +1, if the current implementation really does not scale / needs more and more RAM for merging, let's understand what's going on here, first, before increasing limits. I rescind my hasty +1 for now!
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti <a.benedetti@sease.io> wrote:
>>>>> Ok, so what should we do then?
>>>>> This space is moving fast, and in my opinion we should act fast to release and guarantee we attract as many users as possible.
>>>>>
>>>>> At the same time I am not saying we should proceed blind, if there's concrete evidence for setting a limit rather than another, or that a certain limit is detrimental to the project, I think that veto should be valid.
>>>>>
>>>>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>>>>>
>>>>> The problem I see is that more than voting we should first decide this limit and I don't know how we can operate.
>>>>> I am imagining like a poll where each entry is a limit + motivation and PMCs maybe vote/add entries?
>>>>>
>>>>> Did anything similar happen in the past? How was the current limit added?
>>>>>
>>>>>
>>>>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>>>>>>> Should create a VOTE thread, where we propose some values with a justification and we vote?
>>>>>> Technically, a vote thread won't help much if there's no full consensus - a single veto will make the patch unacceptable for merging.
>>>>>> https://www.apache.org/foundation/voting.html#Veto
>>>>>>
>>>>>> Dawid
>>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 6, 2023, 8:38 AM

Post #30 of 99 (120 views)

As I said earlier, a max limit limits usability.
It's not forcing users with small vectors to pay the performance penalty of
big vectors, it's literally preventing some users to use
Lucene/Solr/Elasticsearch at all.
As far as I know, the max limit is used to raise an exception, it's not
used to initialise or optimise data structures (please correct me if I'm
wrong).

Improving the algorithm performance is a separate discussion.
I don't see a correlation with the fact that indexing billions of whatever
dimensioned vector is slow with a usability parameter.

What about potential users that need few high dimensional vectors?

As I said before, I am a big +1 for NOT just raise it blindly, but I
believe we need to remove the limit or size it in a way it's not a problem
for both users and internal data structure optimizations, if any.

On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:

> I'd ask anyone voting +1 to raise this limit to at least try to index
> a few million vectors with 756 or 1024, which is allowed today.
>
> IMO based on how painful it is, it seems the limit is already too
> high, I realize that will sound controversial but please at least try
> it out!
>
> voting +1 without at least doing this is really the
> "weak/unscientifically minded" approach.
>
> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
> >
> > Thanks for your feedback!
> >
> > I agree, that it should not crash.
> >
> > So far we did not experience crashes ourselves, but we did not index
> > millions of vectors.
> >
> > I will try to reproduce the crash, maybe this will help us to move
> forward.
> >
> > Thanks
> >
> > Michael
> >
> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> > >> Can you describe your crash in more detail?
> > > I can't. That experiment was a while ago and a quick test to see if I
> > > could index rather large-ish USPTO (patent office) data as vectors.
> > > Couldn't do it then.
> > >
> > >> How much RAM?
> > > My indexing jobs run with rather smallish heaps to give space for I/O
> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
> > > I recall segment merging grew slower and slower and then simply
> > > crashed. Lucene should work with low heap requirements, even if it
> > > slows down. Throwing ram at the indexing/ segment merging problem
> > > is... I don't know - not elegant?
> > >
> > > Anyway. My main point was to remind folks about how Apache works -
> > > code is merged in when there are no vetoes. If Rob (or anybody else)
> > > remains unconvinced, he or she can block the change. (I didn't invent
> > > those rules).
> > >
> > > D.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: dev-help@lucene.apache.org
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 6, 2023, 8:44 AM

Post #31 of 99 (120 views)

> I don't know, Alessandro. I just wanted to point out the fact that by
Apache rules a committer's veto to a code change counts as a no-go.

Yeah Dawid, I was not provocative, I was genuinely asking what should a
pragmatic approach be to choose a limit/remove it, because I don't know how
to proceed and I would love to progress rather than just leave this
discussion in a mail thread with no tangible results.

On Wed, 5 Apr 2023, 16:49 Dawid Weiss, <dawid.weiss@gmail.com> wrote:

> > Ok, so what should we do then?
>
> I don't know, Alessandro. I just wanted to point out the fact that by
> Apache rules a committer's veto to a code change counts as a no-go. It
> does not specify any way to "override" such a veto, perhaps counting
> on disagreeing parties to resolve conflicting points of views in a
> civil manner so that veto can be retracted (or a different solution
> suggested).
>
> I think Robert's point is not about a particular limit value but about
> the algorithm itself - the current implementation does not scale. I
> don't want to be an advocate for either side - I'm all for freedom of
> choice but at the same time last time I tried indexing a few million
> vectors, I couldn't get far before segment merging blew up with
> OOMs...
>
> > Did anything similar happen in the past? How was the current limit added?
>
> I honestly don't know, you'd have to git blame or look at the mailing
> list archives of the original contribution.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 6, 2023, 8:47 AM

Post #32 of 99 (120 views)

Well, I'm asking ppl actually try to test using such high dimensions.
Based on my own experience, I consider it unusable. It seems other
folks may have run into trouble too. If the project committers can't
even really use vectors with such high dimension counts, then its not
in an OK state for users, and we shouldn't bump the limit.

I'm happy to discuss/compromise etc, but simply bumping the limit
without addressing the underlying usability/scalability is a real
no-go, it is not really solving anything, nor is it giving users any
freedom or allowing them to do something they couldnt do before.
Because if it still doesnt work it still doesnt work.

We all need to be on the same page, grounded in reality, not fantasy,
where if we set a limit of 1024 or 2048, that you can actually index
vectors with that many dimensions and it actually works and scales.

On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
<a.benedetti@sease.io> wrote:
>
> As I said earlier, a max limit limits usability.
> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>
> Improving the algorithm performance is a separate discussion.
> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
>
> What about potential users that need few high dimensional vectors?
>
> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
>
>
> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>>
>> I'd ask anyone voting +1 to raise this limit to at least try to index
>> a few million vectors with 756 or 1024, which is allowed today.
>>
>> IMO based on how painful it is, it seems the limit is already too
>> high, I realize that will sound controversial but please at least try
>> it out!
>>
>> voting +1 without at least doing this is really the
>> "weak/unscientifically minded" approach.
>>
>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>> <michael.wechner@wyona.com> wrote:
>> >
>> > Thanks for your feedback!
>> >
>> > I agree, that it should not crash.
>> >
>> > So far we did not experience crashes ourselves, but we did not index
>> > millions of vectors.
>> >
>> > I will try to reproduce the crash, maybe this will help us to move forward.
>> >
>> > Thanks
>> >
>> > Michael
>> >
>> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> > >> Can you describe your crash in more detail?
>> > > I can't. That experiment was a while ago and a quick test to see if I
>> > > could index rather large-ish USPTO (patent office) data as vectors.
>> > > Couldn't do it then.
>> > >
>> > >> How much RAM?
>> > > My indexing jobs run with rather smallish heaps to give space for I/O
>> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
>> > > I recall segment merging grew slower and slower and then simply
>> > > crashed. Lucene should work with low heap requirements, even if it
>> > > slows down. Throwing ram at the indexing/ segment merging problem
>> > > is... I don't know - not elegant?
>> > >
>> > > Anyway. My main point was to remind folks about how Apache works -
>> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>> > > remains unconvinced, he or she can block the change. (I didn't invent
>> > > those rules).
>> > >
>> > > D.
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > > For additional commands, e-mail: dev-help@lucene.apache.org
>> > >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 6, 2023, 8:48 AM

Post #33 of 99 (120 views)

>10 MB hard drive, wow I'll never need another floppy disk ever...
Neural nets... nice idea, but there will never be enough CPU power to run
them...

etc.

Is it possible to make it a configurable limit?

I think Gus is on spot, agree 100%.

Vector dimension is already configurable, it's the max dinension which is
hard coded.

Just bear in mind that this MAX limit is not used in initializing data
structures, but only to raise an exception.
As far as I know if we change the limit, if you have small vectors you
won't be impacted at all.

On Thu, 6 Apr 2023, 03:31 Gus Heck, <gus.heck@gmail.com> wrote:

> 10 MB hard drive, wow I'll never need another floppy disk ever...
> Neural nets... nice idea, but there will never be enough CPU power to run
> them...
>
> etc.
>
> Is it possible to make it a configurable limit?
>
> On Wed, Apr 5, 2023 at 4:51?PM Jack Conradson <osjdconrad@gmail.com>
> wrote:
>
>> I don't want to get too far off topic, but I think one of the problems
>> here is that HNSW doesn't really fit well as a Lucene data structure. The
>> way it behaves it would be better supported as a live, in-memory data
>> structure instead of segmented and written to disk for tiny graphs that
>> then need to be merged. I wonder if it may be a better approach to explore
>> other possible algorithms that are designed to be on-disk instead of
>> in-memory even if they require k-means clustering as a trade-off. Maybe
>> with an on-disk algorithm we could have good enough performance for a
>> higher-dimensional limit.
>>
>> On Wed, Apr 5, 2023 at 10:54?AM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> I'd ask anyone voting +1 to raise this limit to at least try to index
>>> a few million vectors with 756 or 1024, which is allowed today.
>>>
>>> IMO based on how painful it is, it seems the limit is already too
>>> high, I realize that will sound controversial but please at least try
>>> it out!
>>>
>>> voting +1 without at least doing this is really the
>>> "weak/unscientifically minded" approach.
>>>
>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>>> <michael.wechner@wyona.com> wrote:
>>> >
>>> > Thanks for your feedback!
>>> >
>>> > I agree, that it should not crash.
>>> >
>>> > So far we did not experience crashes ourselves, but we did not index
>>> > millions of vectors.
>>> >
>>> > I will try to reproduce the crash, maybe this will help us to move
>>> forward.
>>> >
>>> > Thanks
>>> >
>>> > Michael
>>> >
>>> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>> > >> Can you describe your crash in more detail?
>>> > > I can't. That experiment was a while ago and a quick test to see if I
>>> > > could index rather large-ish USPTO (patent office) data as vectors.
>>> > > Couldn't do it then.
>>> > >
>>> > >> How much RAM?
>>> > > My indexing jobs run with rather smallish heaps to give space for I/O
>>> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
>>> > > I recall segment merging grew slower and slower and then simply
>>> > > crashed. Lucene should work with low heap requirements, even if it
>>> > > slows down. Throwing ram at the indexing/ segment merging problem
>>> > > is... I don't know - not elegant?
>>> > >
>>> > > Anyway. My main point was to remind folks about how Apache works -
>>> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>>> > > remains unconvinced, he or she can block the change. (I didn't invent
>>> > > those rules).
>>> > >
>>> > > D.
>>> > >
>>> > > ---------------------------------------------------------------------
>>> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > > For additional commands, e-mail: dev-help@lucene.apache.org
>>> > >
>>> >
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 6, 2023, 8:57 AM

Post #34 of 99 (120 views)

To be clear Robert, I agree with you in not bumping it just to 2048 or
whatever not motivated enough constant.

But I disagree on the performance perspective:
I mean I am absolutely positive in working to improve the current
performances, but I think this is disconnected from that limit.

Not all users need billions of vectors, maybe tomorrow a new chip is
released that speed up the processing 100x or whatever...

The limit as far as I know is not used to initialise or optimise any data
structure, it's only used to raise an exception.

I don't see a big problem in allowing 10k vectors for example but then
majority of people won't be able to use such vectors because slow on the
average computer.
If we just get 1 new user, it's better than 0.
Or well, if it's a reputation thing, than It's a completely different
discussion I guess.

On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcmuir@gmail.com> wrote:

> Well, I'm asking ppl actually try to test using such high dimensions.
> Based on my own experience, I consider it unusable. It seems other
> folks may have run into trouble too. If the project committers can't
> even really use vectors with such high dimension counts, then its not
> in an OK state for users, and we shouldn't bump the limit.
>
> I'm happy to discuss/compromise etc, but simply bumping the limit
> without addressing the underlying usability/scalability is a real
> no-go, it is not really solving anything, nor is it giving users any
> freedom or allowing them to do something they couldnt do before.
> Because if it still doesnt work it still doesnt work.
>
> We all need to be on the same page, grounded in reality, not fantasy,
> where if we set a limit of 1024 or 2048, that you can actually index
> vectors with that many dimensions and it actually works and scales.
>
> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
> >
> > As I said earlier, a max limit limits usability.
> > It's not forcing users with small vectors to pay the performance penalty
> of big vectors, it's literally preventing some users to use
> Lucene/Solr/Elasticsearch at all.
> > As far as I know, the max limit is used to raise an exception, it's not
> used to initialise or optimise data structures (please correct me if I'm
> wrong).
> >
> > Improving the algorithm performance is a separate discussion.
> > I don't see a correlation with the fact that indexing billions of
> whatever dimensioned vector is slow with a usability parameter.
> >
> > What about potential users that need few high dimensional vectors?
> >
> > As I said before, I am a big +1 for NOT just raise it blindly, but I
> believe we need to remove the limit or size it in a way it's not a problem
> for both users and internal data structure optimizations, if any.
> >
> >
> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> >>
> >> I'd ask anyone voting +1 to raise this limit to at least try to index
> >> a few million vectors with 756 or 1024, which is allowed today.
> >>
> >> IMO based on how painful it is, it seems the limit is already too
> >> high, I realize that will sound controversial but please at least try
> >> it out!
> >>
> >> voting +1 without at least doing this is really the
> >> "weak/unscientifically minded" approach.
> >>
> >> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> >> <michael.wechner@wyona.com> wrote:
> >> >
> >> > Thanks for your feedback!
> >> >
> >> > I agree, that it should not crash.
> >> >
> >> > So far we did not experience crashes ourselves, but we did not index
> >> > millions of vectors.
> >> >
> >> > I will try to reproduce the crash, maybe this will help us to move
> forward.
> >> >
> >> > Thanks
> >> >
> >> > Michael
> >> >
> >> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >> > >> Can you describe your crash in more detail?
> >> > > I can't. That experiment was a while ago and a quick test to see if
> I
> >> > > could index rather large-ish USPTO (patent office) data as vectors.
> >> > > Couldn't do it then.
> >> > >
> >> > >> How much RAM?
> >> > > My indexing jobs run with rather smallish heaps to give space for
> I/O
> >> > > buffers. Think 4-8GB at most. So yes, it could have been the
> problem.
> >> > > I recall segment merging grew slower and slower and then simply
> >> > > crashed. Lucene should work with low heap requirements, even if it
> >> > > slows down. Throwing ram at the indexing/ segment merging problem
> >> > > is... I don't know - not elegant?
> >> > >
> >> > > Anyway. My main point was to remind folks about how Apache works -
> >> > > code is merged in when there are no vetoes. If Rob (or anybody else)
> >> > > remains unconvinced, he or she can block the change. (I didn't
> invent
> >> > > those rules).
> >> > >
> >> > > D.
> >> > >
> >> > >
> ---------------------------------------------------------------------
> >> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> > > For additional commands, e-mail: dev-help@lucene.apache.org
> >> > >
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: dev-help@lucene.apache.org
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

wunder at wunderwood

Apr 6, 2023, 9:02 AM

Post #35 of 99 (120 views)

If we find issues with larger limits, maybe have a configurable limit like we do for maxBooleanClauses. Maybe somebody wants to run with a 100G heap and do one query per second.

Where I work (LexisNexis), we have high-value queries, but just not that many of them per second.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Apr 6, 2023, at 8:57 AM, Alessandro Benedetti <a.benedetti@sease.io> wrote:
>
> To be clear Robert, I agree with you in not bumping it just to 2048 or whatever not motivated enough constant.
>
> But I disagree on the performance perspective:
> I mean I am absolutely positive in working to improve the current performances, but I think this is disconnected from that limit.
>
> Not all users need billions of vectors, maybe tomorrow a new chip is released that speed up the processing 100x or whatever...
>
> The limit as far as I know is not used to initialise or optimise any data structure, it's only used to raise an exception.
>
> I don't see a big problem in allowing 10k vectors for example but then majority of people won't be able to use such vectors because slow on the average computer.
> If we just get 1 new user, it's better than 0.
> Or well, if it's a reputation thing, than It's a completely different discussion I guess.
>
>
> On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcmuir@gmail.com <mailto:rcmuir@gmail.com>> wrote:
>> Well, I'm asking ppl actually try to test using such high dimensions.
>> Based on my own experience, I consider it unusable. It seems other
>> folks may have run into trouble too. If the project committers can't
>> even really use vectors with such high dimension counts, then its not
>> in an OK state for users, and we shouldn't bump the limit.
>>
>> I'm happy to discuss/compromise etc, but simply bumping the limit
>> without addressing the underlying usability/scalability is a real
>> no-go, it is not really solving anything, nor is it giving users any
>> freedom or allowing them to do something they couldnt do before.
>> Because if it still doesnt work it still doesnt work.
>>
>> We all need to be on the same page, grounded in reality, not fantasy,
>> where if we set a limit of 1024 or 2048, that you can actually index
>> vectors with that many dimensions and it actually works and scales.
>>
>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>> <a.benedetti@sease.io <mailto:a.benedetti@sease.io>> wrote:
>> >
>> > As I said earlier, a max limit limits usability.
>> > It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
>> > As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>> >
>> > Improving the algorithm performance is a separate discussion.
>> > I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
>> >
>> > What about potential users that need few high dimensional vectors?
>> >
>> > As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
>> >
>> >
>> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com <mailto:rcmuir@gmail.com>> wrote:
>> >>
>> >> I'd ask anyone voting +1 to raise this limit to at least try to index
>> >> a few million vectors with 756 or 1024, which is allowed today.
>> >>
>> >> IMO based on how painful it is, it seems the limit is already too
>> >> high, I realize that will sound controversial but please at least try
>> >> it out!
>> >>
>> >> voting +1 without at least doing this is really the
>> >> "weak/unscientifically minded" approach.
>> >>
>> >> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>> >> <michael.wechner@wyona.com <mailto:michael.wechner@wyona.com>> wrote:
>> >> >
>> >> > Thanks for your feedback!
>> >> >
>> >> > I agree, that it should not crash.
>> >> >
>> >> > So far we did not experience crashes ourselves, but we did not index
>> >> > millions of vectors.
>> >> >
>> >> > I will try to reproduce the crash, maybe this will help us to move forward.
>> >> >
>> >> > Thanks
>> >> >
>> >> > Michael
>> >> >
>> >> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> >> > >> Can you describe your crash in more detail?
>> >> > > I can't. That experiment was a while ago and a quick test to see if I
>> >> > > could index rather large-ish USPTO (patent office) data as vectors.
>> >> > > Couldn't do it then.
>> >> > >
>> >> > >> How much RAM?
>> >> > > My indexing jobs run with rather smallish heaps to give space for I/O
>> >> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
>> >> > > I recall segment merging grew slower and slower and then simply
>> >> > > crashed. Lucene should work with low heap requirements, even if it
>> >> > > slows down. Throwing ram at the indexing/ segment merging problem
>> >> > > is... I don't know - not elegant?
>> >> > >
>> >> > > Anyway. My main point was to remind folks about how Apache works -
>> >> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>> >> > > remains unconvinced, he or she can block the change. (I didn't invent
>> >> > > those rules).
>> >> > >
>> >> > > D.
>> >> > >
>> >> > > ---------------------------------------------------------------------
>> >> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org <mailto:dev-unsubscribe@lucene.apache.org>
>> >> > > For additional commands, e-mail: dev-help@lucene.apache.org <mailto:dev-help@lucene.apache.org>
>> >> > >
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org <mailto:dev-unsubscribe@lucene.apache.org>
>> >> > For additional commands, e-mail: dev-help@lucene.apache.org <mailto:dev-help@lucene.apache.org>
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org <mailto:dev-unsubscribe@lucene.apache.org>
>> >> For additional commands, e-mail: dev-help@lucene.apache.org <mailto:dev-help@lucene.apache.org>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org <mailto:dev-unsubscribe@lucene.apache.org>
>> For additional commands, e-mail: dev-help@lucene.apache.org <mailto:dev-help@lucene.apache.org>
>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 6, 2023, 9:30 AM

Post #36 of 99 (120 views)

I am not sure I get the point to make the limit configurable:

1) if it is configurable, but default max to 1024, it means that we don't
enforce any limit aside the max integer behind the scenes.
So if you want to set a vector dimension for a field to 5000 you need to
first set a MAX compatible and then set the dimension to 5000 for a field.

2) if we remove the limit (just an example). The user can directly set the
dimension to 5000 for a field.

It seems to me that setting the max limit as a configurable constant brings
all the same (negative?) considerations of removing the limit at all +
additional operations needed by the users to achieve the same results.

I beg your pardon if I an missing something.

On Thu, 6 Apr 2023, 17:02 Walter Underwood, <wunder@wunderwood.org> wrote:

> If we find issues with larger limits, maybe have a configurable limit like
> we do for maxBooleanClauses. Maybe somebody wants to run with a 100G heap
> and do one query per second.
>
> Where I work (LexisNexis), we have high-value queries, but just not that
> many of them per second.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Apr 6, 2023, at 8:57 AM, Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
> To be clear Robert, I agree with you in not bumping it just to 2048 or
> whatever not motivated enough constant.
>
> But I disagree on the performance perspective:
> I mean I am absolutely positive in working to improve the current
> performances, but I think this is disconnected from that limit.
>
> Not all users need billions of vectors, maybe tomorrow a new chip is
> released that speed up the processing 100x or whatever...
>
> The limit as far as I know is not used to initialise or optimise any data
> structure, it's only used to raise an exception.
>
> I don't see a big problem in allowing 10k vectors for example but then
> majority of people won't be able to use such vectors because slow on the
> average computer.
> If we just get 1 new user, it's better than 0.
> Or well, if it's a reputation thing, than It's a completely different
> discussion I guess.
>
>
> On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcmuir@gmail.com> wrote:
>
>> Well, I'm asking ppl actually try to test using such high dimensions.
>> Based on my own experience, I consider it unusable. It seems other
>> folks may have run into trouble too. If the project committers can't
>> even really use vectors with such high dimension counts, then its not
>> in an OK state for users, and we shouldn't bump the limit.
>>
>> I'm happy to discuss/compromise etc, but simply bumping the limit
>> without addressing the underlying usability/scalability is a real
>> no-go, it is not really solving anything, nor is it giving users any
>> freedom or allowing them to do something they couldnt do before.
>> Because if it still doesnt work it still doesnt work.
>>
>> We all need to be on the same page, grounded in reality, not fantasy,
>> where if we set a limit of 1024 or 2048, that you can actually index
>> vectors with that many dimensions and it actually works and scales.
>>
>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>> >
>> > As I said earlier, a max limit limits usability.
>> > It's not forcing users with small vectors to pay the performance
>> penalty of big vectors, it's literally preventing some users to use
>> Lucene/Solr/Elasticsearch at all.
>> > As far as I know, the max limit is used to raise an exception, it's not
>> used to initialise or optimise data structures (please correct me if I'm
>> wrong).
>> >
>> > Improving the algorithm performance is a separate discussion.
>> > I don't see a correlation with the fact that indexing billions of
>> whatever dimensioned vector is slow with a usability parameter.
>> >
>> > What about potential users that need few high dimensional vectors?
>> >
>> > As I said before, I am a big +1 for NOT just raise it blindly, but I
>> believe we need to remove the limit or size it in a way it's not a problem
>> for both users and internal data structure optimizations, if any.
>> >
>> >
>> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>> >>
>> >> I'd ask anyone voting +1 to raise this limit to at least try to index
>> >> a few million vectors with 756 or 1024, which is allowed today.
>> >>
>> >> IMO based on how painful it is, it seems the limit is already too
>> >> high, I realize that will sound controversial but please at least try
>> >> it out!
>> >>
>> >> voting +1 without at least doing this is really the
>> >> "weak/unscientifically minded" approach.
>> >>
>> >> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>> >> <michael.wechner@wyona.com> wrote:
>> >> >
>> >> > Thanks for your feedback!
>> >> >
>> >> > I agree, that it should not crash.
>> >> >
>> >> > So far we did not experience crashes ourselves, but we did not index
>> >> > millions of vectors.
>> >> >
>> >> > I will try to reproduce the crash, maybe this will help us to move
>> forward.
>> >> >
>> >> > Thanks
>> >> >
>> >> > Michael
>> >> >
>> >> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> >> > >> Can you describe your crash in more detail?
>> >> > > I can't. That experiment was a while ago and a quick test to see
>> if I
>> >> > > could index rather large-ish USPTO (patent office) data as vectors.
>> >> > > Couldn't do it then.
>> >> > >
>> >> > >> How much RAM?
>> >> > > My indexing jobs run with rather smallish heaps to give space for
>> I/O
>> >> > > buffers. Think 4-8GB at most. So yes, it could have been the
>> problem.
>> >> > > I recall segment merging grew slower and slower and then simply
>> >> > > crashed. Lucene should work with low heap requirements, even if it
>> >> > > slows down. Throwing ram at the indexing/ segment merging problem
>> >> > > is... I don't know - not elegant?
>> >> > >
>> >> > > Anyway. My main point was to remind folks about how Apache works -
>> >> > > code is merged in when there are no vetoes. If Rob (or anybody
>> else)
>> >> > > remains unconvinced, he or she can block the change. (I didn't
>> invent
>> >> > > those rules).
>> >> > >
>> >> > > D.
>> >> > >
>> >> > >
>> ---------------------------------------------------------------------
>> >> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> > > For additional commands, e-mail: dev-help@lucene.apache.org
>> >> > >
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

marcuseagan at gmail

Apr 6, 2023, 10:46 AM

Post #37 of 99 (120 views)

> We all need to be on the same page, grounded in reality, not fantasy,
where if we set a limit of 1024 or 2048, that you can actually index
vectors with that many dimensions and it actually works and scales

This is something that I agree with. When we test it, I think we should go
in with the right expectations. For example, there's been an ask for use
cases. The most immediate use case that I see is LLM memory in the higher
range of vectors (up to 2048), rather than strictly search. I'm ok with
this use case because of its unprecedented growth. Consider the fact that a
lot of users and most of the capital expenditures on Lucene infrastructure
goes to use cases that are not strictly application search. Analytics, as
we all know, is a very common off-label use of Lucene. Another common use
case is index intersection across a variety of text fields.

It would be great if we could document a way to up the limit and test using
Open AI's embeddings. If there are performance problems they should be
understood as performance problems for a particular use case, and we as a
community should invest in making it better based on user feedback. This
leap of faith could have a very positive impact on all involved.

Best,

Marcus

On Thu, Apr 6, 2023 at 9:30?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> I am not sure I get the point to make the limit configurable:
>
> 1) if it is configurable, but default max to 1024, it means that we don't
> enforce any limit aside the max integer behind the scenes.
> So if you want to set a vector dimension for a field to 5000 you need to
> first set a MAX compatible and then set the dimension to 5000 for a field.
>
> 2) if we remove the limit (just an example). The user can directly set the
> dimension to 5000 for a field.
>
> It seems to me that setting the max limit as a configurable constant
> brings all the same (negative?) considerations of removing the limit at all
> + additional operations needed by the users to achieve the same results.
>
> I beg your pardon if I an missing something.
>
> On Thu, 6 Apr 2023, 17:02 Walter Underwood, <wunder@wunderwood.org> wrote:
>
>> If we find issues with larger limits, maybe have a configurable limit
>> like we do for maxBooleanClauses. Maybe somebody wants to run with a 100G
>> heap and do one query per second.
>>
>> Where I work (LexisNexis), we have high-value queries, but just not that
>> many of them per second.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>> On Apr 6, 2023, at 8:57 AM, Alessandro Benedetti <a.benedetti@sease.io>
>> wrote:
>>
>> To be clear Robert, I agree with you in not bumping it just to 2048 or
>> whatever not motivated enough constant.
>>
>> But I disagree on the performance perspective:
>> I mean I am absolutely positive in working to improve the current
>> performances, but I think this is disconnected from that limit.
>>
>> Not all users need billions of vectors, maybe tomorrow a new chip is
>> released that speed up the processing 100x or whatever...
>>
>> The limit as far as I know is not used to initialise or optimise any data
>> structure, it's only used to raise an exception.
>>
>> I don't see a big problem in allowing 10k vectors for example but then
>> majority of people won't be able to use such vectors because slow on the
>> average computer.
>> If we just get 1 new user, it's better than 0.
>> Or well, if it's a reputation thing, than It's a completely different
>> discussion I guess.
>>
>>
>> On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcmuir@gmail.com> wrote:
>>
>>> Well, I'm asking ppl actually try to test using such high dimensions.
>>> Based on my own experience, I consider it unusable. It seems other
>>> folks may have run into trouble too. If the project committers can't
>>> even really use vectors with such high dimension counts, then its not
>>> in an OK state for users, and we shouldn't bump the limit.
>>>
>>> I'm happy to discuss/compromise etc, but simply bumping the limit
>>> without addressing the underlying usability/scalability is a real
>>> no-go, it is not really solving anything, nor is it giving users any
>>> freedom or allowing them to do something they couldnt do before.
>>> Because if it still doesnt work it still doesnt work.
>>>
>>> We all need to be on the same page, grounded in reality, not fantasy,
>>> where if we set a limit of 1024 or 2048, that you can actually index
>>> vectors with that many dimensions and it actually works and scales.
>>>
>>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>>> <a.benedetti@sease.io> wrote:
>>> >
>>> > As I said earlier, a max limit limits usability.
>>> > It's not forcing users with small vectors to pay the performance
>>> penalty of big vectors, it's literally preventing some users to use
>>> Lucene/Solr/Elasticsearch at all.
>>> > As far as I know, the max limit is used to raise an exception, it's
>>> not used to initialise or optimise data structures (please correct me if
>>> I'm wrong).
>>> >
>>> > Improving the algorithm performance is a separate discussion.
>>> > I don't see a correlation with the fact that indexing billions of
>>> whatever dimensioned vector is slow with a usability parameter.
>>> >
>>> > What about potential users that need few high dimensional vectors?
>>> >
>>> > As I said before, I am a big +1 for NOT just raise it blindly, but I
>>> believe we need to remove the limit or size it in a way it's not a problem
>>> for both users and internal data structure optimizations, if any.
>>> >
>>> >
>>> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>>> >>
>>> >> I'd ask anyone voting +1 to raise this limit to at least try to index
>>> >> a few million vectors with 756 or 1024, which is allowed today.
>>> >>
>>> >> IMO based on how painful it is, it seems the limit is already too
>>> >> high, I realize that will sound controversial but please at least try
>>> >> it out!
>>> >>
>>> >> voting +1 without at least doing this is really the
>>> >> "weak/unscientifically minded" approach.
>>> >>
>>> >> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>>> >> <michael.wechner@wyona.com> wrote:
>>> >> >
>>> >> > Thanks for your feedback!
>>> >> >
>>> >> > I agree, that it should not crash.
>>> >> >
>>> >> > So far we did not experience crashes ourselves, but we did not index
>>> >> > millions of vectors.
>>> >> >
>>> >> > I will try to reproduce the crash, maybe this will help us to move
>>> forward.
>>> >> >
>>> >> > Thanks
>>> >> >
>>> >> > Michael
>>> >> >
>>> >> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>> >> > >> Can you describe your crash in more detail?
>>> >> > > I can't. That experiment was a while ago and a quick test to see
>>> if I
>>> >> > > could index rather large-ish USPTO (patent office) data as
>>> vectors.
>>> >> > > Couldn't do it then.
>>> >> > >
>>> >> > >> How much RAM?
>>> >> > > My indexing jobs run with rather smallish heaps to give space for
>>> I/O
>>> >> > > buffers. Think 4-8GB at most. So yes, it could have been the
>>> problem.
>>> >> > > I recall segment merging grew slower and slower and then simply
>>> >> > > crashed. Lucene should work with low heap requirements, even if it
>>> >> > > slows down. Throwing ram at the indexing/ segment merging problem
>>> >> > > is... I don't know - not elegant?
>>> >> > >
>>> >> > > Anyway. My main point was to remind folks about how Apache works -
>>> >> > > code is merged in when there are no vetoes. If Rob (or anybody
>>> else)
>>> >> > > remains unconvinced, he or she can block the change. (I didn't
>>> invent
>>> >> > > those rules).
>>> >> > >
>>> >> > > D.
>>> >> > >
>>> >> > >
>>> ---------------------------------------------------------------------
>>> >> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> > > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >> > >
>>> >> >
>>> >> >
>>> >> >
>>> ---------------------------------------------------------------------
>>> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >> >
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>>

--
Marcus Eagan

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 6, 2023, 12:51 PM

Post #38 of 99 (120 views)

Am 06.04.23 um 17:47 schrieb Robert Muir:
> Well, I'm asking ppl actually try to test using such high dimensions.
> Based on my own experience, I consider it unusable. It seems other
> folks may have run into trouble too. If the project committers can't
> even really use vectors with such high dimension counts, then its not
> in an OK state for users, and we shouldn't bump the limit.
>
> I'm happy to discuss/compromise etc, but simply bumping the limit
> without addressing the underlying usability/scalability is a real
> no-go,

I agree that this needs to be adressed

> it is not really solving anything, nor is it giving users any
> freedom or allowing them to do something they couldnt do before.
> Because if it still doesnt work it still doesnt work.

I disagree, because it *does work* with "smaller" document sets.

Currently we have to compile Lucene ourselves to not get the exception
when using a model with vector dimension greater than 1024,
which is of course possible, but not really convenient.

As I wrote before, to resolve this discussion, I think we should test
and address possible issues.

I will try to stop discussing now :-) and instead try to understand
better the actual issues. Would be great if others could join on this!

Thanks

Michael

>
> We all need to be on the same page, grounded in reality, not fantasy,
> where if we set a limit of 1024 or 2048, that you can actually index
> vectors with that many dimensions and it actually works and scales.
>
> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>> As I said earlier, a max limit limits usability.
>> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
>> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>>
>> Improving the algorithm performance is a separate discussion.
>> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
>>
>> What about potential users that need few high dimensional vectors?
>>
>> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
>>
>>
>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>>> I'd ask anyone voting +1 to raise this limit to at least try to index
>>> a few million vectors with 756 or 1024, which is allowed today.
>>>
>>> IMO based on how painful it is, it seems the limit is already too
>>> high, I realize that will sound controversial but please at least try
>>> it out!
>>>
>>> voting +1 without at least doing this is really the
>>> "weak/unscientifically minded" approach.
>>>
>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>>> <michael.wechner@wyona.com> wrote:
>>>> Thanks for your feedback!
>>>>
>>>> I agree, that it should not crash.
>>>>
>>>> So far we did not experience crashes ourselves, but we did not index
>>>> millions of vectors.
>>>>
>>>> I will try to reproduce the crash, maybe this will help us to move forward.
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>>>>> Can you describe your crash in more detail?
>>>>> I can't. That experiment was a while ago and a quick test to see if I
>>>>> could index rather large-ish USPTO (patent office) data as vectors.
>>>>> Couldn't do it then.
>>>>>
>>>>>> How much RAM?
>>>>> My indexing jobs run with rather smallish heaps to give space for I/O
>>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
>>>>> I recall segment merging grew slower and slower and then simply
>>>>> crashed. Lucene should work with low heap requirements, even if it
>>>>> slows down. Throwing ram at the indexing/ segment merging problem
>>>>> is... I don't know - not elegant?
>>>>>
>>>>> Anyway. My main point was to remind folks about how Apache works -
>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
>>>>> remains unconvinced, he or she can block the change. (I didn't invent
>>>>> those rules).
>>>>>
>>>>> D.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 6, 2023, 4:19 PM

Post #39 of 99 (120 views)

I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
minutes with a single thread. I have some 256K vectors, but only about
2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
vectors I can use for testing? If all else fails I can test with
noise, but that tends to lead to meaningless results

On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
>
>
> Am 06.04.23 um 17:47 schrieb Robert Muir:
> > Well, I'm asking ppl actually try to test using such high dimensions.
> > Based on my own experience, I consider it unusable. It seems other
> > folks may have run into trouble too. If the project committers can't
> > even really use vectors with such high dimension counts, then its not
> > in an OK state for users, and we shouldn't bump the limit.
> >
> > I'm happy to discuss/compromise etc, but simply bumping the limit
> > without addressing the underlying usability/scalability is a real
> > no-go,
>
> I agree that this needs to be adressed
>
>
>
> > it is not really solving anything, nor is it giving users any
> > freedom or allowing them to do something they couldnt do before.
> > Because if it still doesnt work it still doesnt work.
>
> I disagree, because it *does work* with "smaller" document sets.
>
> Currently we have to compile Lucene ourselves to not get the exception
> when using a model with vector dimension greater than 1024,
> which is of course possible, but not really convenient.
>
> As I wrote before, to resolve this discussion, I think we should test
> and address possible issues.
>
> I will try to stop discussing now :-) and instead try to understand
> better the actual issues. Would be great if others could join on this!
>
> Thanks
>
> Michael
>
>
>
> >
> > We all need to be on the same page, grounded in reality, not fantasy,
> > where if we set a limit of 1024 or 2048, that you can actually index
> > vectors with that many dimensions and it actually works and scales.
> >
> > On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> > <a.benedetti@sease.io> wrote:
> >> As I said earlier, a max limit limits usability.
> >> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
> >> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
> >>
> >> Improving the algorithm performance is a separate discussion.
> >> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
> >>
> >> What about potential users that need few high dimensional vectors?
> >>
> >> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
> >>
> >>
> >> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> >>> I'd ask anyone voting +1 to raise this limit to at least try to index
> >>> a few million vectors with 756 or 1024, which is allowed today.
> >>>
> >>> IMO based on how painful it is, it seems the limit is already too
> >>> high, I realize that will sound controversial but please at least try
> >>> it out!
> >>>
> >>> voting +1 without at least doing this is really the
> >>> "weak/unscientifically minded" approach.
> >>>
> >>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> >>> <michael.wechner@wyona.com> wrote:
> >>>> Thanks for your feedback!
> >>>>
> >>>> I agree, that it should not crash.
> >>>>
> >>>> So far we did not experience crashes ourselves, but we did not index
> >>>> millions of vectors.
> >>>>
> >>>> I will try to reproduce the crash, maybe this will help us to move forward.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Michael
> >>>>
> >>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >>>>>> Can you describe your crash in more detail?
> >>>>> I can't. That experiment was a while ago and a quick test to see if I
> >>>>> could index rather large-ish USPTO (patent office) data as vectors.
> >>>>> Couldn't do it then.
> >>>>>
> >>>>>> How much RAM?
> >>>>> My indexing jobs run with rather smallish heaps to give space for I/O
> >>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
> >>>>> I recall segment merging grew slower and slower and then simply
> >>>>> crashed. Lucene should work with low heap requirements, even if it
> >>>>> slows down. Throwing ram at the indexing/ segment merging problem
> >>>>> is... I don't know - not elegant?
> >>>>>
> >>>>> Anyway. My main point was to remind folks about how Apache works -
> >>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
> >>>>> remains unconvinced, he or she can block the change. (I didn't invent
> >>>>> those rules).
> >>>>>
> >>>>> D.
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 6, 2023, 9:42 PM

Post #40 of 99 (119 views)

Great, thank you!

How much RAM; etc. did you run this test on?

Do the vectors really have to be based on real data for testing the
indexing?
I understand, if you want to test the quality of the search results it
does matter, but for testing the scalability itself it should not matter
actually, right?

Thanks

Michael

Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> minutes with a single thread. I have some 256K vectors, but only about
> 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
> vectors I can use for testing? If all else fails I can test with
> noise, but that tends to lead to meaningless results
>
> On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>>
>>
>> Am 06.04.23 um 17:47 schrieb Robert Muir:
>>> Well, I'm asking ppl actually try to test using such high dimensions.
>>> Based on my own experience, I consider it unusable. It seems other
>>> folks may have run into trouble too. If the project committers can't
>>> even really use vectors with such high dimension counts, then its not
>>> in an OK state for users, and we shouldn't bump the limit.
>>>
>>> I'm happy to discuss/compromise etc, but simply bumping the limit
>>> without addressing the underlying usability/scalability is a real
>>> no-go,
>> I agree that this needs to be adressed
>>
>>
>>
>>> it is not really solving anything, nor is it giving users any
>>> freedom or allowing them to do something they couldnt do before.
>>> Because if it still doesnt work it still doesnt work.
>> I disagree, because it *does work* with "smaller" document sets.
>>
>> Currently we have to compile Lucene ourselves to not get the exception
>> when using a model with vector dimension greater than 1024,
>> which is of course possible, but not really convenient.
>>
>> As I wrote before, to resolve this discussion, I think we should test
>> and address possible issues.
>>
>> I will try to stop discussing now :-) and instead try to understand
>> better the actual issues. Would be great if others could join on this!
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>>> We all need to be on the same page, grounded in reality, not fantasy,
>>> where if we set a limit of 1024 or 2048, that you can actually index
>>> vectors with that many dimensions and it actually works and scales.
>>>
>>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>>> <a.benedetti@sease.io> wrote:
>>>> As I said earlier, a max limit limits usability.
>>>> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
>>>> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>>>>
>>>> Improving the algorithm performance is a separate discussion.
>>>> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
>>>>
>>>> What about potential users that need few high dimensional vectors?
>>>>
>>>> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
>>>>
>>>>
>>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
>>>>> a few million vectors with 756 or 1024, which is allowed today.
>>>>>
>>>>> IMO based on how painful it is, it seems the limit is already too
>>>>> high, I realize that will sound controversial but please at least try
>>>>> it out!
>>>>>
>>>>> voting +1 without at least doing this is really the
>>>>> "weak/unscientifically minded" approach.
>>>>>
>>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>>>>> <michael.wechner@wyona.com> wrote:
>>>>>> Thanks for your feedback!
>>>>>>
>>>>>> I agree, that it should not crash.
>>>>>>
>>>>>> So far we did not experience crashes ourselves, but we did not index
>>>>>> millions of vectors.
>>>>>>
>>>>>> I will try to reproduce the crash, maybe this will help us to move forward.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>>>>>>> Can you describe your crash in more detail?
>>>>>>> I can't. That experiment was a while ago and a quick test to see if I
>>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
>>>>>>> Couldn't do it then.
>>>>>>>
>>>>>>>> How much RAM?
>>>>>>> My indexing jobs run with rather smallish heaps to give space for I/O
>>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
>>>>>>> I recall segment merging grew slower and slower and then simply
>>>>>>> crashed. Lucene should work with low heap requirements, even if it
>>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
>>>>>>> is... I don't know - not elegant?
>>>>>>>
>>>>>>> Anyway. My main point was to remind folks about how Apache works -
>>>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
>>>>>>> remains unconvinced, he or she can block the change. (I didn't invent
>>>>>>> those rules).
>>>>>>>
>>>>>>> D.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

marcuseagan at gmail

Apr 7, 2023, 1:11 AM

Post #41 of 99 (119 views)

I've started to look on the internet, and surely someone will come, but the
challenge I suspect is that these vectors are expensive to generate so
people have not gone all in on generating such large vectors for large
datasets. They certainly have not made them easy to find. Here is the most
promising but it is too small, probably:
https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download

I'm still in and out of the office at the moment, but when I return, I can
ask my employer if they will sponsor a 10 million document collection so
that you can test with that. Or, maybe someone from work will see and ask
them on my behalf.

Alternatively, next week, I may get some time to set up a server with an
open source LLM to generate the vectors. It still won't be free, but it
would be 99% cheaper than paying the LLM companies if we can be slow.

On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <michael.wechner@wyona.com>
wrote:

> Great, thank you!
>
> How much RAM; etc. did you run this test on?
>
> Do the vectors really have to be based on real data for testing the
> indexing?
> I understand, if you want to test the quality of the search results it
> does matter, but for testing the scalability itself it should not matter
> actually, right?
>
> Thanks
>
> Michael
>
> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> > minutes with a single thread. I have some 256K vectors, but only about
> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
> > vectors I can use for testing? If all else fails I can test with
> > noise, but that tends to lead to meaningless results
> >
> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> > <michael.wechner@wyona.com> wrote:
> >>
> >>
> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> >>> Well, I'm asking ppl actually try to test using such high dimensions.
> >>> Based on my own experience, I consider it unusable. It seems other
> >>> folks may have run into trouble too. If the project committers can't
> >>> even really use vectors with such high dimension counts, then its not
> >>> in an OK state for users, and we shouldn't bump the limit.
> >>>
> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
> >>> without addressing the underlying usability/scalability is a real
> >>> no-go,
> >> I agree that this needs to be adressed
> >>
> >>
> >>
> >>> it is not really solving anything, nor is it giving users any
> >>> freedom or allowing them to do something they couldnt do before.
> >>> Because if it still doesnt work it still doesnt work.
> >> I disagree, because it *does work* with "smaller" document sets.
> >>
> >> Currently we have to compile Lucene ourselves to not get the exception
> >> when using a model with vector dimension greater than 1024,
> >> which is of course possible, but not really convenient.
> >>
> >> As I wrote before, to resolve this discussion, I think we should test
> >> and address possible issues.
> >>
> >> I will try to stop discussing now :-) and instead try to understand
> >> better the actual issues. Would be great if others could join on this!
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >>> We all need to be on the same page, grounded in reality, not fantasy,
> >>> where if we set a limit of 1024 or 2048, that you can actually index
> >>> vectors with that many dimensions and it actually works and scales.
> >>>
> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> >>> <a.benedetti@sease.io> wrote:
> >>>> As I said earlier, a max limit limits usability.
> >>>> It's not forcing users with small vectors to pay the performance
> penalty of big vectors, it's literally preventing some users to use
> Lucene/Solr/Elasticsearch at all.
> >>>> As far as I know, the max limit is used to raise an exception, it's
> not used to initialise or optimise data structures (please correct me if
> I'm wrong).
> >>>>
> >>>> Improving the algorithm performance is a separate discussion.
> >>>> I don't see a correlation with the fact that indexing billions of
> whatever dimensioned vector is slow with a usability parameter.
> >>>>
> >>>> What about potential users that need few high dimensional vectors?
> >>>>
> >>>> As I said before, I am a big +1 for NOT just raise it blindly, but I
> believe we need to remove the limit or size it in a way it's not a problem
> for both users and internal data structure optimizations, if any.
> >>>>
> >>>>
> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
> >>>>> a few million vectors with 756 or 1024, which is allowed today.
> >>>>>
> >>>>> IMO based on how painful it is, it seems the limit is already too
> >>>>> high, I realize that will sound controversial but please at least try
> >>>>> it out!
> >>>>>
> >>>>> voting +1 without at least doing this is really the
> >>>>> "weak/unscientifically minded" approach.
> >>>>>
> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> >>>>> <michael.wechner@wyona.com> wrote:
> >>>>>> Thanks for your feedback!
> >>>>>>
> >>>>>> I agree, that it should not crash.
> >>>>>>
> >>>>>> So far we did not experience crashes ourselves, but we did not index
> >>>>>> millions of vectors.
> >>>>>>
> >>>>>> I will try to reproduce the crash, maybe this will help us to move
> forward.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> Michael
> >>>>>>
> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >>>>>>>> Can you describe your crash in more detail?
> >>>>>>> I can't. That experiment was a while ago and a quick test to see
> if I
> >>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
> >>>>>>> Couldn't do it then.
> >>>>>>>
> >>>>>>>> How much RAM?
> >>>>>>> My indexing jobs run with rather smallish heaps to give space for
> I/O
> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the
> problem.
> >>>>>>> I recall segment merging grew slower and slower and then simply
> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
> >>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
> >>>>>>> is... I don't know - not elegant?
> >>>>>>>
> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody
> else)
> >>>>>>> remains unconvinced, he or she can block the change. (I didn't
> invent
> >>>>>>> those rules).
> >>>>>>>
> >>>>>>> D.
> >>>>>>>
> >>>>>>>
> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Marcus Eagan

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 7, 2023, 1:53 AM

Post #42 of 99 (119 views)

you might want to use SentenceBERT to generate vectors

https://sbert.net

whereas for example the model "all-mpnet-base-v2" generates vectors with
dimension 768

We have SentenceBERT running as a web service, which we could open for
these tests, but because of network latency it should be faster running
locally.

HTH

Michael

Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> I've started to look on the internet, and surely someone will come,
> but the challenge I suspect is that these vectors are expensive to
> generate so people have not gone all in on generating such large
> vectors for large datasets. They certainly have not made them easy to
> find. Here is the most promising but it is too small, probably:
> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
>
>
> I'm still in and out of the office at the moment, but when I return,
> I can ask my employer if they will sponsor a 10 million document
> collection so that you can test with that. Or, maybe someone from work
> will see and ask them on my behalf.
>
> Alternatively, next week, I may get some time to set up a server with
> an open source LLM to generate the vectors. It still won't be free,
> but it would be 99% cheaper than paying the LLM companies if we can be
> slow.
>
>
>
> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>
> Great, thank you!
>
> How much RAM; etc. did you run this test on?
>
> Do the vectors really have to be based on real data for testing the
> indexing?
> I understand, if you want to test the quality of the search
> results it
> does matter, but for testing the scalability itself it should not
> matter
> actually, right?
>
> Thanks
>
> Michael
>
> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> > minutes with a single thread. I have some 256K vectors, but only
> about
> > 2M of them. Can anybody point me to a large set (say 8M+) of
> 1024+ dim
> > vectors I can use for testing? If all else fails I can test with
> > noise, but that tends to lead to meaningless results
> >
> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> > <michael.wechner@wyona.com> wrote:
> >>
> >>
> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> >>> Well, I'm asking ppl actually try to test using such high
> dimensions.
> >>> Based on my own experience, I consider it unusable. It seems other
> >>> folks may have run into trouble too. If the project committers
> can't
> >>> even really use vectors with such high dimension counts, then
> its not
> >>> in an OK state for users, and we shouldn't bump the limit.
> >>>
> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
> >>> without addressing the underlying usability/scalability is a real
> >>> no-go,
> >> I agree that this needs to be adressed
> >>
> >>
> >>
> >>> it is not really solving anything, nor is it giving users any
> >>> freedom or allowing them to do something they couldnt do before.
> >>> Because if it still doesnt work it still doesnt work.
> >> I disagree, because it *does work* with "smaller" document sets.
> >>
> >> Currently we have to compile Lucene ourselves to not get the
> exception
> >> when using a model with vector dimension greater than 1024,
> >> which is of course possible, but not really convenient.
> >>
> >> As I wrote before, to resolve this discussion, I think we
> should test
> >> and address possible issues.
> >>
> >> I will try to stop discussing now :-) and instead try to understand
> >> better the actual issues. Would be great if others could join
> on this!
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >>> We all need to be on the same page, grounded in reality, not
> fantasy,
> >>> where if we set a limit of 1024 or 2048, that you can actually
> index
> >>> vectors with that many dimensions and it actually works and
> scales.
> >>>
> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> >>> <a.benedetti@sease.io> wrote:
> >>>> As I said earlier, a max limit limits usability.
> >>>> It's not forcing users with small vectors to pay the
> performance penalty of big vectors, it's literally preventing some
> users to use Lucene/Solr/Elasticsearch at all.
> >>>> As far as I know, the max limit is used to raise an
> exception, it's not used to initialise or optimise data structures
> (please correct me if I'm wrong).
> >>>>
> >>>> Improving the algorithm performance is a separate discussion.
> >>>> I don't see a correlation with the fact that indexing
> billions of whatever dimensioned vector is slow with a usability
> parameter.
> >>>>
> >>>> What about potential users that need few high dimensional
> vectors?
> >>>>
> >>>> As I said before, I am a big +1 for NOT just raise it
> blindly, but I believe we need to remove the limit or size it in a
> way it's not a problem for both users and internal data structure
> optimizations, if any.
> >>>>
> >>>>
> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> >>>>> I'd ask anyone voting +1 to raise this limit to at least try
> to index
> >>>>> a few million vectors with 756 or 1024, which is allowed today.
> >>>>>
> >>>>> IMO based on how painful it is, it seems the limit is
> already too
> >>>>> high, I realize that will sound controversial but please at
> least try
> >>>>> it out!
> >>>>>
> >>>>> voting +1 without at least doing this is really the
> >>>>> "weak/unscientifically minded" approach.
> >>>>>
> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> >>>>> <michael.wechner@wyona.com> wrote:
> >>>>>> Thanks for your feedback!
> >>>>>>
> >>>>>> I agree, that it should not crash.
> >>>>>>
> >>>>>> So far we did not experience crashes ourselves, but we did
> not index
> >>>>>> millions of vectors.
> >>>>>>
> >>>>>> I will try to reproduce the crash, maybe this will help us
> to move forward.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> Michael
> >>>>>>
> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >>>>>>>> Can you describe your crash in more detail?
> >>>>>>> I can't. That experiment was a while ago and a quick test
> to see if I
> >>>>>>> could index rather large-ish USPTO (patent office) data as
> vectors.
> >>>>>>> Couldn't do it then.
> >>>>>>>
> >>>>>>>> How much RAM?
> >>>>>>> My indexing jobs run with rather smallish heaps to give
> space for I/O
> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been
> the problem.
> >>>>>>> I recall segment merging grew slower and slower and then
> simply
> >>>>>>> crashed. Lucene should work with low heap requirements,
> even if it
> >>>>>>> slows down. Throwing ram at the indexing/ segment merging
> problem
> >>>>>>> is... I don't know - not elegant?
> >>>>>>>
> >>>>>>> Anyway. My main point was to remind folks about how Apache
> works -
> >>>>>>> code is merged in when there are no vetoes. If Rob (or
> anybody else)
> >>>>>>> remains unconvinced, he or she can block the change. (I
> didn't invent
> >>>>>>> those rules).
> >>>>>>>
> >>>>>>> D.
> >>>>>>>
> >>>>>>>
> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>
> >>>>>
> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>
> >>>
> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
> >>
> >>
> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
> --
> Marcus Eagan
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

kent.fitch at gmail

Apr 7, 2023, 2:18 AM

Post #43 of 99 (105 views)

Hi,
I have been testing Lucene with a custom vector similarity and loaded 192m
vectors of dim 512 bytes. (Yes, segment merges use a lot of java memory..).

As this was a performance test, the 192m vectors were derived by dithering
47k original vectors in such a way to allow realistic ANN evaluation of
HNSW. The original 47k vectors were generated by ada-002 on source
newspaper article text. After dithering, I used PQ to reduce their
dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a
1byte code, 512 code tables, each learnt to reduce total encoding error
using Lloyds algorithm (hence the need for the custom similarity). BTW,
HNSW retrieval was accurate and fast enough for the use case I was
investigating as long as a machine with 128gb memory was available as the
graph needs to be cached in memory for reasonable query rates.

Anyway, if you want them, you are welcome to those 47k vectors of 1532
floats which can be readily dithered to generate very large and realistic
test vector sets.

Best regards,

Kent Fitch

On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wechner@wyona.com>
wrote:

> you might want to use SentenceBERT to generate vectors
>
> https://sbert.net
>
> whereas for example the model "all-mpnet-base-v2" generates vectors with
> dimension 768
>
> We have SentenceBERT running as a web service, which we could open for
> these tests, but because of network latency it should be faster running
> locally.
>
> HTH
>
> Michael
>
>
> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
>
> I've started to look on the internet, and surely someone will come, but
> the challenge I suspect is that these vectors are expensive to generate so
> people have not gone all in on generating such large vectors for large
> datasets. They certainly have not made them easy to find. Here is the most
> promising but it is too small, probably:
> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
>
> I'm still in and out of the office at the moment, but when I return, I
> can ask my employer if they will sponsor a 10 million document collection
> so that you can test with that. Or, maybe someone from work will see and
> ask them on my behalf.
>
> Alternatively, next week, I may get some time to set up a server with an
> open source LLM to generate the vectors. It still won't be free, but it
> would be 99% cheaper than paying the LLM companies if we can be slow.
>
>
>
> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <michael.wechner@wyona.com>
> wrote:
>
>> Great, thank you!
>>
>> How much RAM; etc. did you run this test on?
>>
>> Do the vectors really have to be based on real data for testing the
>> indexing?
>> I understand, if you want to test the quality of the search results it
>> does matter, but for testing the scalability itself it should not matter
>> actually, right?
>>
>> Thanks
>>
>> Michael
>>
>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
>> > minutes with a single thread. I have some 256K vectors, but only about
>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
>> > vectors I can use for testing? If all else fails I can test with
>> > noise, but that tends to lead to meaningless results
>> >
>> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
>> > <michael.wechner@wyona.com> wrote:
>> >>
>> >>
>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
>> >>> Well, I'm asking ppl actually try to test using such high dimensions.
>> >>> Based on my own experience, I consider it unusable. It seems other
>> >>> folks may have run into trouble too. If the project committers can't
>> >>> even really use vectors with such high dimension counts, then its not
>> >>> in an OK state for users, and we shouldn't bump the limit.
>> >>>
>> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
>> >>> without addressing the underlying usability/scalability is a real
>> >>> no-go,
>> >> I agree that this needs to be adressed
>> >>
>> >>
>> >>
>> >>> it is not really solving anything, nor is it giving users any
>> >>> freedom or allowing them to do something they couldnt do before.
>> >>> Because if it still doesnt work it still doesnt work.
>> >> I disagree, because it *does work* with "smaller" document sets.
>> >>
>> >> Currently we have to compile Lucene ourselves to not get the exception
>> >> when using a model with vector dimension greater than 1024,
>> >> which is of course possible, but not really convenient.
>> >>
>> >> As I wrote before, to resolve this discussion, I think we should test
>> >> and address possible issues.
>> >>
>> >> I will try to stop discussing now :-) and instead try to understand
>> >> better the actual issues. Would be great if others could join on this!
>> >>
>> >> Thanks
>> >>
>> >> Michael
>> >>
>> >>
>> >>
>> >>> We all need to be on the same page, grounded in reality, not fantasy,
>> >>> where if we set a limit of 1024 or 2048, that you can actually index
>> >>> vectors with that many dimensions and it actually works and scales.
>> >>>
>> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>> >>> <a.benedetti@sease.io> wrote:
>> >>>> As I said earlier, a max limit limits usability.
>> >>>> It's not forcing users with small vectors to pay the performance
>> penalty of big vectors, it's literally preventing some users to use
>> Lucene/Solr/Elasticsearch at all.
>> >>>> As far as I know, the max limit is used to raise an exception, it's
>> not used to initialise or optimise data structures (please correct me if
>> I'm wrong).
>> >>>>
>> >>>> Improving the algorithm performance is a separate discussion.
>> >>>> I don't see a correlation with the fact that indexing billions of
>> whatever dimensioned vector is slow with a usability parameter.
>> >>>>
>> >>>> What about potential users that need few high dimensional vectors?
>> >>>>
>> >>>> As I said before, I am a big +1 for NOT just raise it blindly, but I
>> believe we need to remove the limit or size it in a way it's not a problem
>> for both users and internal data structure optimizations, if any.
>> >>>>
>> >>>>
>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to
>> index
>> >>>>> a few million vectors with 756 or 1024, which is allowed today.
>> >>>>>
>> >>>>> IMO based on how painful it is, it seems the limit is already too
>> >>>>> high, I realize that will sound controversial but please at least
>> try
>> >>>>> it out!
>> >>>>>
>> >>>>> voting +1 without at least doing this is really the
>> >>>>> "weak/unscientifically minded" approach.
>> >>>>>
>> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>> >>>>> <michael.wechner@wyona.com> wrote:
>> >>>>>> Thanks for your feedback!
>> >>>>>>
>> >>>>>> I agree, that it should not crash.
>> >>>>>>
>> >>>>>> So far we did not experience crashes ourselves, but we did not
>> index
>> >>>>>> millions of vectors.
>> >>>>>>
>> >>>>>> I will try to reproduce the crash, maybe this will help us to move
>> forward.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>>
>> >>>>>> Michael
>> >>>>>>
>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> >>>>>>>> Can you describe your crash in more detail?
>> >>>>>>> I can't. That experiment was a while ago and a quick test to see
>> if I
>> >>>>>>> could index rather large-ish USPTO (patent office) data as
>> vectors.
>> >>>>>>> Couldn't do it then.
>> >>>>>>>
>> >>>>>>>> How much RAM?
>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for
>> I/O
>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the
>> problem.
>> >>>>>>> I recall segment merging grew slower and slower and then simply
>> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
>> >>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
>> >>>>>>> is... I don't know - not elegant?
>> >>>>>>>
>> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
>> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody
>> else)
>> >>>>>>> remains unconvinced, he or she can block the change. (I didn't
>> invent
>> >>>>>>> those rules).
>> >>>>>>>
>> >>>>>>> D.
>> >>>>>>>
>> >>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>>
>> >>>>>>
>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>
>> >>>>>
>> ---------------------------------------------------------------------
>> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> --
> Marcus Eagan
>
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 7, 2023, 4:45 AM

Post #44 of 99 (102 views)

Thanks Kent - I tried something similar to what you did I think. Took
a set of 256d vectors I had and concatenated them to make bigger ones,
then shifted the dimensions to make more of them. Here are a few
single-threaded indexing test runs. I ran all tests with M=16.

8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
buffer size=1994)
8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)

increasing the vector dimension makes things take longer (scaling
*linearly*) but doesn't lead to RAM issues. I think we could get to
OOM while merging with a small heap and a large number of vectors, or
by increasing M, but none of this has anything to do with vector
dimensions. Also, if merge RAM usage is a problem I think we could
address it by adding accounting to the merge process and simply not
merging graphs when they exceed the buffer size (as we do with
flushing).

Robert, since you're the only on-the-record veto here, does this
change your thinking at all, or if not could you share some test
results that didn't go the way you expected? Maybe we can find some
mitigation if we focus on a specific issue.

On Fri, Apr 7, 2023 at 5:18?AM Kent Fitch <kent.fitch@gmail.com> wrote:
>
> Hi,
> I have been testing Lucene with a custom vector similarity and loaded 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of java memory..).
>
> As this was a performance test, the 192m vectors were derived by dithering 47k original vectors in such a way to allow realistic ANN evaluation of HNSW. The original 47k vectors were generated by ada-002 on source newspaper article text. After dithering, I used PQ to reduce their dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a 1byte code, 512 code tables, each learnt to reduce total encoding error using Lloyds algorithm (hence the need for the custom similarity). BTW, HNSW retrieval was accurate and fast enough for the use case I was investigating as long as a machine with 128gb memory was available as the graph needs to be cached in memory for reasonable query rates.
>
> Anyway, if you want them, you are welcome to those 47k vectors of 1532 floats which can be readily dithered to generate very large and realistic test vector sets.
>
> Best regards,
>
> Kent Fitch
>
>
> On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wechner@wyona.com> wrote:
>>
>> you might want to use SentenceBERT to generate vectors
>>
>> https://sbert.net
>>
>> whereas for example the model "all-mpnet-base-v2" generates vectors with dimension 768
>>
>> We have SentenceBERT running as a web service, which we could open for these tests, but because of network latency it should be faster running locally.
>>
>> HTH
>>
>> Michael
>>
>>
>> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
>>
>> I've started to look on the internet, and surely someone will come, but the challenge I suspect is that these vectors are expensive to generate so people have not gone all in on generating such large vectors for large datasets. They certainly have not made them easy to find. Here is the most promising but it is too small, probably: https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
>>
>> I'm still in and out of the office at the moment, but when I return, I can ask my employer if they will sponsor a 10 million document collection so that you can test with that. Or, maybe someone from work will see and ask them on my behalf.
>>
>> Alternatively, next week, I may get some time to set up a server with an open source LLM to generate the vectors. It still won't be free, but it would be 99% cheaper than paying the LLM companies if we can be slow.
>>
>>
>>
>> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <michael.wechner@wyona.com> wrote:
>>>
>>> Great, thank you!
>>>
>>> How much RAM; etc. did you run this test on?
>>>
>>> Do the vectors really have to be based on real data for testing the
>>> indexing?
>>> I understand, if you want to test the quality of the search results it
>>> does matter, but for testing the scalability itself it should not matter
>>> actually, right?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
>>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
>>> > minutes with a single thread. I have some 256K vectors, but only about
>>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
>>> > vectors I can use for testing? If all else fails I can test with
>>> > noise, but that tends to lead to meaningless results
>>> >
>>> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >>
>>> >>
>>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
>>> >>> Well, I'm asking ppl actually try to test using such high dimensions.
>>> >>> Based on my own experience, I consider it unusable. It seems other
>>> >>> folks may have run into trouble too. If the project committers can't
>>> >>> even really use vectors with such high dimension counts, then its not
>>> >>> in an OK state for users, and we shouldn't bump the limit.
>>> >>>
>>> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
>>> >>> without addressing the underlying usability/scalability is a real
>>> >>> no-go,
>>> >> I agree that this needs to be adressed
>>> >>
>>> >>
>>> >>
>>> >>> it is not really solving anything, nor is it giving users any
>>> >>> freedom or allowing them to do something they couldnt do before.
>>> >>> Because if it still doesnt work it still doesnt work.
>>> >> I disagree, because it *does work* with "smaller" document sets.
>>> >>
>>> >> Currently we have to compile Lucene ourselves to not get the exception
>>> >> when using a model with vector dimension greater than 1024,
>>> >> which is of course possible, but not really convenient.
>>> >>
>>> >> As I wrote before, to resolve this discussion, I think we should test
>>> >> and address possible issues.
>>> >>
>>> >> I will try to stop discussing now :-) and instead try to understand
>>> >> better the actual issues. Would be great if others could join on this!
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >>
>>> >>
>>> >>> We all need to be on the same page, grounded in reality, not fantasy,
>>> >>> where if we set a limit of 1024 or 2048, that you can actually index
>>> >>> vectors with that many dimensions and it actually works and scales.
>>> >>>
>>> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>>> >>> <a.benedetti@sease.io> wrote:
>>> >>>> As I said earlier, a max limit limits usability.
>>> >>>> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
>>> >>>> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>>> >>>>
>>> >>>> Improving the algorithm performance is a separate discussion.
>>> >>>> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
>>> >>>>
>>> >>>> What about potential users that need few high dimensional vectors?
>>> >>>>
>>> >>>> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
>>> >>>>
>>> >>>>
>>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
>>> >>>>> a few million vectors with 756 or 1024, which is allowed today.
>>> >>>>>
>>> >>>>> IMO based on how painful it is, it seems the limit is already too
>>> >>>>> high, I realize that will sound controversial but please at least try
>>> >>>>> it out!
>>> >>>>>
>>> >>>>> voting +1 without at least doing this is really the
>>> >>>>> "weak/unscientifically minded" approach.
>>> >>>>>
>>> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>>> >>>>> <michael.wechner@wyona.com> wrote:
>>> >>>>>> Thanks for your feedback!
>>> >>>>>>
>>> >>>>>> I agree, that it should not crash.
>>> >>>>>>
>>> >>>>>> So far we did not experience crashes ourselves, but we did not index
>>> >>>>>> millions of vectors.
>>> >>>>>>
>>> >>>>>> I will try to reproduce the crash, maybe this will help us to move forward.
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>>
>>> >>>>>> Michael
>>> >>>>>>
>>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>> >>>>>>>> Can you describe your crash in more detail?
>>> >>>>>>> I can't. That experiment was a while ago and a quick test to see if I
>>> >>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
>>> >>>>>>> Couldn't do it then.
>>> >>>>>>>
>>> >>>>>>>> How much RAM?
>>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for I/O
>>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
>>> >>>>>>> I recall segment merging grew slower and slower and then simply
>>> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
>>> >>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
>>> >>>>>>> is... I don't know - not elegant?
>>> >>>>>>>
>>> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
>>> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
>>> >>>>>>> remains unconvinced, he or she can block the change. (I didn't invent
>>> >>>>>>> those rules).
>>> >>>>>>>
>>> >>>>>>> D.
>>> >>>>>>>
>>> >>>>>>> ---------------------------------------------------------------------
>>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>>
>>> >>>>>> ---------------------------------------------------------------------
>>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>
>>> >>>>> ---------------------------------------------------------------------
>>> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>
>>
>> --
>> Marcus Eagan
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 7, 2023, 5:52 AM

Post #45 of 99 (102 views)

I also want to add that we do impose some other limits on graph
construction to help ensure that HNSW-based vector fields remain
manageable; M is limited to <= 512, and maximum segment size also
helps limit merge costs

On Fri, Apr 7, 2023 at 7:45?AM Michael Sokolov <msokolov@gmail.com> wrote:
>
> Thanks Kent - I tried something similar to what you did I think. Took
> a set of 256d vectors I had and concatenated them to make bigger ones,
> then shifted the dimensions to make more of them. Here are a few
> single-threaded indexing test runs. I ran all tests with M=16.
>
>
> 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> buffer size=1994)
> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>
> increasing the vector dimension makes things take longer (scaling
> *linearly*) but doesn't lead to RAM issues. I think we could get to
> OOM while merging with a small heap and a large number of vectors, or
> by increasing M, but none of this has anything to do with vector
> dimensions. Also, if merge RAM usage is a problem I think we could
> address it by adding accounting to the merge process and simply not
> merging graphs when they exceed the buffer size (as we do with
> flushing).
>
> Robert, since you're the only on-the-record veto here, does this
> change your thinking at all, or if not could you share some test
> results that didn't go the way you expected? Maybe we can find some
> mitigation if we focus on a specific issue.
>
> On Fri, Apr 7, 2023 at 5:18?AM Kent Fitch <kent.fitch@gmail.com> wrote:
> >
> > Hi,
> > I have been testing Lucene with a custom vector similarity and loaded 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of java memory..).
> >
> > As this was a performance test, the 192m vectors were derived by dithering 47k original vectors in such a way to allow realistic ANN evaluation of HNSW. The original 47k vectors were generated by ada-002 on source newspaper article text. After dithering, I used PQ to reduce their dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a 1byte code, 512 code tables, each learnt to reduce total encoding error using Lloyds algorithm (hence the need for the custom similarity). BTW, HNSW retrieval was accurate and fast enough for the use case I was investigating as long as a machine with 128gb memory was available as the graph needs to be cached in memory for reasonable query rates.
> >
> > Anyway, if you want them, you are welcome to those 47k vectors of 1532 floats which can be readily dithered to generate very large and realistic test vector sets.
> >
> > Best regards,
> >
> > Kent Fitch
> >
> >
> > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wechner@wyona.com> wrote:
> >>
> >> you might want to use SentenceBERT to generate vectors
> >>
> >> https://sbert.net
> >>
> >> whereas for example the model "all-mpnet-base-v2" generates vectors with dimension 768
> >>
> >> We have SentenceBERT running as a web service, which we could open for these tests, but because of network latency it should be faster running locally.
> >>
> >> HTH
> >>
> >> Michael
> >>
> >>
> >> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> >>
> >> I've started to look on the internet, and surely someone will come, but the challenge I suspect is that these vectors are expensive to generate so people have not gone all in on generating such large vectors for large datasets. They certainly have not made them easy to find. Here is the most promising but it is too small, probably: https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> >>
> >> I'm still in and out of the office at the moment, but when I return, I can ask my employer if they will sponsor a 10 million document collection so that you can test with that. Or, maybe someone from work will see and ask them on my behalf.
> >>
> >> Alternatively, next week, I may get some time to set up a server with an open source LLM to generate the vectors. It still won't be free, but it would be 99% cheaper than paying the LLM companies if we can be slow.
> >>
> >>
> >>
> >> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <michael.wechner@wyona.com> wrote:
> >>>
> >>> Great, thank you!
> >>>
> >>> How much RAM; etc. did you run this test on?
> >>>
> >>> Do the vectors really have to be based on real data for testing the
> >>> indexing?
> >>> I understand, if you want to test the quality of the search results it
> >>> does matter, but for testing the scalability itself it should not matter
> >>> actually, right?
> >>>
> >>> Thanks
> >>>
> >>> Michael
> >>>
> >>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> >>> > minutes with a single thread. I have some 256K vectors, but only about
> >>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
> >>> > vectors I can use for testing? If all else fails I can test with
> >>> > noise, but that tends to lead to meaningless results
> >>> >
> >>> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> >>> > <michael.wechner@wyona.com> wrote:
> >>> >>
> >>> >>
> >>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> >>> >>> Well, I'm asking ppl actually try to test using such high dimensions.
> >>> >>> Based on my own experience, I consider it unusable. It seems other
> >>> >>> folks may have run into trouble too. If the project committers can't
> >>> >>> even really use vectors with such high dimension counts, then its not
> >>> >>> in an OK state for users, and we shouldn't bump the limit.
> >>> >>>
> >>> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
> >>> >>> without addressing the underlying usability/scalability is a real
> >>> >>> no-go,
> >>> >> I agree that this needs to be adressed
> >>> >>
> >>> >>
> >>> >>
> >>> >>> it is not really solving anything, nor is it giving users any
> >>> >>> freedom or allowing them to do something they couldnt do before.
> >>> >>> Because if it still doesnt work it still doesnt work.
> >>> >> I disagree, because it *does work* with "smaller" document sets.
> >>> >>
> >>> >> Currently we have to compile Lucene ourselves to not get the exception
> >>> >> when using a model with vector dimension greater than 1024,
> >>> >> which is of course possible, but not really convenient.
> >>> >>
> >>> >> As I wrote before, to resolve this discussion, I think we should test
> >>> >> and address possible issues.
> >>> >>
> >>> >> I will try to stop discussing now :-) and instead try to understand
> >>> >> better the actual issues. Would be great if others could join on this!
> >>> >>
> >>> >> Thanks
> >>> >>
> >>> >> Michael
> >>> >>
> >>> >>
> >>> >>
> >>> >>> We all need to be on the same page, grounded in reality, not fantasy,
> >>> >>> where if we set a limit of 1024 or 2048, that you can actually index
> >>> >>> vectors with that many dimensions and it actually works and scales.
> >>> >>>
> >>> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> >>> >>> <a.benedetti@sease.io> wrote:
> >>> >>>> As I said earlier, a max limit limits usability.
> >>> >>>> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
> >>> >>>> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
> >>> >>>>
> >>> >>>> Improving the algorithm performance is a separate discussion.
> >>> >>>> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
> >>> >>>>
> >>> >>>> What about potential users that need few high dimensional vectors?
> >>> >>>>
> >>> >>>> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
> >>> >>>>
> >>> >>>>
> >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
> >>> >>>>> a few million vectors with 756 or 1024, which is allowed today.
> >>> >>>>>
> >>> >>>>> IMO based on how painful it is, it seems the limit is already too
> >>> >>>>> high, I realize that will sound controversial but please at least try
> >>> >>>>> it out!
> >>> >>>>>
> >>> >>>>> voting +1 without at least doing this is really the
> >>> >>>>> "weak/unscientifically minded" approach.
> >>> >>>>>
> >>> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> >>> >>>>> <michael.wechner@wyona.com> wrote:
> >>> >>>>>> Thanks for your feedback!
> >>> >>>>>>
> >>> >>>>>> I agree, that it should not crash.
> >>> >>>>>>
> >>> >>>>>> So far we did not experience crashes ourselves, but we did not index
> >>> >>>>>> millions of vectors.
> >>> >>>>>>
> >>> >>>>>> I will try to reproduce the crash, maybe this will help us to move forward.
> >>> >>>>>>
> >>> >>>>>> Thanks
> >>> >>>>>>
> >>> >>>>>> Michael
> >>> >>>>>>
> >>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >>> >>>>>>>> Can you describe your crash in more detail?
> >>> >>>>>>> I can't. That experiment was a while ago and a quick test to see if I
> >>> >>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
> >>> >>>>>>> Couldn't do it then.
> >>> >>>>>>>
> >>> >>>>>>>> How much RAM?
> >>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for I/O
> >>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
> >>> >>>>>>> I recall segment merging grew slower and slower and then simply
> >>> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
> >>> >>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
> >>> >>>>>>> is... I don't know - not elegant?
> >>> >>>>>>>
> >>> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
> >>> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
> >>> >>>>>>> remains unconvinced, he or she can block the change. (I didn't invent
> >>> >>>>>>> those rules).
> >>> >>>>>>>
> >>> >>>>>>> D.
> >>> >>>>>>>
> >>> >>>>>>> ---------------------------------------------------------------------
> >>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>>>>>>
> >>> >>>>>> ---------------------------------------------------------------------
> >>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>>>>>
> >>> >>>>> ---------------------------------------------------------------------
> >>> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>>>>
> >>> >>> ---------------------------------------------------------------------
> >>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>>
> >>> >>
> >>> >> ---------------------------------------------------------------------
> >>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>
> >>> > ---------------------------------------------------------------------
> >>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> > For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
> >>
> >>
> >> --
> >> Marcus Eagan
> >>
> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 7, 2023, 8:20 AM

Post #46 of 99 (101 views)

one more data point:

32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994, heap=4GB)

On Fri, Apr 7, 2023 at 8:52?AM Michael Sokolov <msokolov@gmail.com> wrote:
>
> I also want to add that we do impose some other limits on graph
> construction to help ensure that HNSW-based vector fields remain
> manageable; M is limited to <= 512, and maximum segment size also
> helps limit merge costs
>
> On Fri, Apr 7, 2023 at 7:45?AM Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > Thanks Kent - I tried something similar to what you did I think. Took
> > a set of 256d vectors I had and concatenated them to make bigger ones,
> > then shifted the dimensions to make more of them. Here are a few
> > single-threaded indexing test runs. I ran all tests with M=16.
> >
> >
> > 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> > buffer size=1994)
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > increasing the vector dimension makes things take longer (scaling
> > *linearly*) but doesn't lead to RAM issues. I think we could get to
> > OOM while merging with a small heap and a large number of vectors, or
> > by increasing M, but none of this has anything to do with vector
> > dimensions. Also, if merge RAM usage is a problem I think we could
> > address it by adding accounting to the merge process and simply not
> > merging graphs when they exceed the buffer size (as we do with
> > flushing).
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
> > On Fri, Apr 7, 2023 at 5:18?AM Kent Fitch <kent.fitch@gmail.com> wrote:
> > >
> > > Hi,
> > > I have been testing Lucene with a custom vector similarity and loaded 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of java memory..).
> > >
> > > As this was a performance test, the 192m vectors were derived by dithering 47k original vectors in such a way to allow realistic ANN evaluation of HNSW. The original 47k vectors were generated by ada-002 on source newspaper article text. After dithering, I used PQ to reduce their dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a 1byte code, 512 code tables, each learnt to reduce total encoding error using Lloyds algorithm (hence the need for the custom similarity). BTW, HNSW retrieval was accurate and fast enough for the use case I was investigating as long as a machine with 128gb memory was available as the graph needs to be cached in memory for reasonable query rates.
> > >
> > > Anyway, if you want them, you are welcome to those 47k vectors of 1532 floats which can be readily dithered to generate very large and realistic test vector sets.
> > >
> > > Best regards,
> > >
> > > Kent Fitch
> > >
> > >
> > > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wechner@wyona.com> wrote:
> > >>
> > >> you might want to use SentenceBERT to generate vectors
> > >>
> > >> https://sbert.net
> > >>
> > >> whereas for example the model "all-mpnet-base-v2" generates vectors with dimension 768
> > >>
> > >> We have SentenceBERT running as a web service, which we could open for these tests, but because of network latency it should be faster running locally.
> > >>
> > >> HTH
> > >>
> > >> Michael
> > >>
> > >>
> > >> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> > >>
> > >> I've started to look on the internet, and surely someone will come, but the challenge I suspect is that these vectors are expensive to generate so people have not gone all in on generating such large vectors for large datasets. They certainly have not made them easy to find. Here is the most promising but it is too small, probably: https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> > >>
> > >> I'm still in and out of the office at the moment, but when I return, I can ask my employer if they will sponsor a 10 million document collection so that you can test with that. Or, maybe someone from work will see and ask them on my behalf.
> > >>
> > >> Alternatively, next week, I may get some time to set up a server with an open source LLM to generate the vectors. It still won't be free, but it would be 99% cheaper than paying the LLM companies if we can be slow.
> > >>
> > >>
> > >>
> > >> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <michael.wechner@wyona.com> wrote:
> > >>>
> > >>> Great, thank you!
> > >>>
> > >>> How much RAM; etc. did you run this test on?
> > >>>
> > >>> Do the vectors really have to be based on real data for testing the
> > >>> indexing?
> > >>> I understand, if you want to test the quality of the search results it
> > >>> does matter, but for testing the scalability itself it should not matter
> > >>> actually, right?
> > >>>
> > >>> Thanks
> > >>>
> > >>> Michael
> > >>>
> > >>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> > >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> > >>> > minutes with a single thread. I have some 256K vectors, but only about
> > >>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
> > >>> > vectors I can use for testing? If all else fails I can test with
> > >>> > noise, but that tends to lead to meaningless results
> > >>> >
> > >>> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> > >>> > <michael.wechner@wyona.com> wrote:
> > >>> >>
> > >>> >>
> > >>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> > >>> >>> Well, I'm asking ppl actually try to test using such high dimensions.
> > >>> >>> Based on my own experience, I consider it unusable. It seems other
> > >>> >>> folks may have run into trouble too. If the project committers can't
> > >>> >>> even really use vectors with such high dimension counts, then its not
> > >>> >>> in an OK state for users, and we shouldn't bump the limit.
> > >>> >>>
> > >>> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
> > >>> >>> without addressing the underlying usability/scalability is a real
> > >>> >>> no-go,
> > >>> >> I agree that this needs to be adressed
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >>> it is not really solving anything, nor is it giving users any
> > >>> >>> freedom or allowing them to do something they couldnt do before.
> > >>> >>> Because if it still doesnt work it still doesnt work.
> > >>> >> I disagree, because it *does work* with "smaller" document sets.
> > >>> >>
> > >>> >> Currently we have to compile Lucene ourselves to not get the exception
> > >>> >> when using a model with vector dimension greater than 1024,
> > >>> >> which is of course possible, but not really convenient.
> > >>> >>
> > >>> >> As I wrote before, to resolve this discussion, I think we should test
> > >>> >> and address possible issues.
> > >>> >>
> > >>> >> I will try to stop discussing now :-) and instead try to understand
> > >>> >> better the actual issues. Would be great if others could join on this!
> > >>> >>
> > >>> >> Thanks
> > >>> >>
> > >>> >> Michael
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >>> We all need to be on the same page, grounded in reality, not fantasy,
> > >>> >>> where if we set a limit of 1024 or 2048, that you can actually index
> > >>> >>> vectors with that many dimensions and it actually works and scales.
> > >>> >>>
> > >>> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> > >>> >>> <a.benedetti@sease.io> wrote:
> > >>> >>>> As I said earlier, a max limit limits usability.
> > >>> >>>> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
> > >>> >>>> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
> > >>> >>>>
> > >>> >>>> Improving the algorithm performance is a separate discussion.
> > >>> >>>> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
> > >>> >>>>
> > >>> >>>> What about potential users that need few high dimensional vectors?
> > >>> >>>>
> > >>> >>>> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
> > >>> >>>>
> > >>> >>>>
> > >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> > >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
> > >>> >>>>> a few million vectors with 756 or 1024, which is allowed today.
> > >>> >>>>>
> > >>> >>>>> IMO based on how painful it is, it seems the limit is already too
> > >>> >>>>> high, I realize that will sound controversial but please at least try
> > >>> >>>>> it out!
> > >>> >>>>>
> > >>> >>>>> voting +1 without at least doing this is really the
> > >>> >>>>> "weak/unscientifically minded" approach.
> > >>> >>>>>
> > >>> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> > >>> >>>>> <michael.wechner@wyona.com> wrote:
> > >>> >>>>>> Thanks for your feedback!
> > >>> >>>>>>
> > >>> >>>>>> I agree, that it should not crash.
> > >>> >>>>>>
> > >>> >>>>>> So far we did not experience crashes ourselves, but we did not index
> > >>> >>>>>> millions of vectors.
> > >>> >>>>>>
> > >>> >>>>>> I will try to reproduce the crash, maybe this will help us to move forward.
> > >>> >>>>>>
> > >>> >>>>>> Thanks
> > >>> >>>>>>
> > >>> >>>>>> Michael
> > >>> >>>>>>
> > >>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> > >>> >>>>>>>> Can you describe your crash in more detail?
> > >>> >>>>>>> I can't. That experiment was a while ago and a quick test to see if I
> > >>> >>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
> > >>> >>>>>>> Couldn't do it then.
> > >>> >>>>>>>
> > >>> >>>>>>>> How much RAM?
> > >>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for I/O
> > >>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
> > >>> >>>>>>> I recall segment merging grew slower and slower and then simply
> > >>> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
> > >>> >>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
> > >>> >>>>>>> is... I don't know - not elegant?
> > >>> >>>>>>>
> > >>> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
> > >>> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
> > >>> >>>>>>> remains unconvinced, he or she can block the change. (I didn't invent
> > >>> >>>>>>> those rules).
> > >>> >>>>>>>
> > >>> >>>>>>> D.
> > >>> >>>>>>>
> > >>> >>>>>>> ---------------------------------------------------------------------
> > >>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >>>>>>>
> > >>> >>>>>> ---------------------------------------------------------------------
> > >>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >>>>>>
> > >>> >>>>> ---------------------------------------------------------------------
> > >>> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >>>>>
> > >>> >>> ---------------------------------------------------------------------
> > >>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >>>
> > >>> >>
> > >>> >> ---------------------------------------------------------------------
> > >>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >>
> > >>> > ---------------------------------------------------------------------
> > >>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> > For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >
> > >>>
> > >>>
> > >>> ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>>
> > >>
> > >>
> > >> --
> > >> Marcus Eagan
> > >>
> > >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

marcuseagan at gmail

Apr 7, 2023, 10:18 AM

Post #47 of 99 (101 views)

Important data point and it doesn't seem too bad or good. What is
acceptable performance should be decided by the user? What do you all think?

On Fri, Apr 7, 2023 at 8:20?AM Michael Sokolov <msokolov@gmail.com> wrote:

> one more data point:
>
> 32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994, heap=4GB)
>
> On Fri, Apr 7, 2023 at 8:52?AM Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > I also want to add that we do impose some other limits on graph
> > construction to help ensure that HNSW-based vector fields remain
> > manageable; M is limited to <= 512, and maximum segment size also
> > helps limit merge costs
> >
> > On Fri, Apr 7, 2023 at 7:45?AM Michael Sokolov <msokolov@gmail.com>
> wrote:
> > >
> > > Thanks Kent - I tried something similar to what you did I think. Took
> > > a set of 256d vectors I had and concatenated them to make bigger ones,
> > > then shifted the dimensions to make more of them. Here are a few
> > > single-threaded indexing test runs. I ran all tests with M=16.
> > >
> > >
> > > 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> > > buffer size=1994)
> > > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
> size=1994)
> > >
> > > increasing the vector dimension makes things take longer (scaling
> > > *linearly*) but doesn't lead to RAM issues. I think we could get to
> > > OOM while merging with a small heap and a large number of vectors, or
> > > by increasing M, but none of this has anything to do with vector
> > > dimensions. Also, if merge RAM usage is a problem I think we could
> > > address it by adding accounting to the merge process and simply not
> > > merging graphs when they exceed the buffer size (as we do with
> > > flushing).
> > >
> > > Robert, since you're the only on-the-record veto here, does this
> > > change your thinking at all, or if not could you share some test
> > > results that didn't go the way you expected? Maybe we can find some
> > > mitigation if we focus on a specific issue.
> > >
> > > On Fri, Apr 7, 2023 at 5:18?AM Kent Fitch <kent.fitch@gmail.com>
> wrote:
> > > >
> > > > Hi,
> > > > I have been testing Lucene with a custom vector similarity and
> loaded 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of
> java memory..).
> > > >
> > > > As this was a performance test, the 192m vectors were derived by
> dithering 47k original vectors in such a way to allow realistic ANN
> evaluation of HNSW. The original 47k vectors were generated by ada-002 on
> source newspaper article text. After dithering, I used PQ to reduce their
> dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a
> 1byte code, 512 code tables, each learnt to reduce total encoding error
> using Lloyds algorithm (hence the need for the custom similarity). BTW,
> HNSW retrieval was accurate and fast enough for the use case I was
> investigating as long as a machine with 128gb memory was available as the
> graph needs to be cached in memory for reasonable query rates.
> > > >
> > > > Anyway, if you want them, you are welcome to those 47k vectors of
> 1532 floats which can be readily dithered to generate very large and
> realistic test vector sets.
> > > >
> > > > Best regards,
> > > >
> > > > Kent Fitch
> > > >
> > > >
> > > > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <
> michael.wechner@wyona.com> wrote:
> > > >>
> > > >> you might want to use SentenceBERT to generate vectors
> > > >>
> > > >> https://sbert.net
> > > >>
> > > >> whereas for example the model "all-mpnet-base-v2" generates vectors
> with dimension 768
> > > >>
> > > >> We have SentenceBERT running as a web service, which we could open
> for these tests, but because of network latency it should be faster running
> locally.
> > > >>
> > > >> HTH
> > > >>
> > > >> Michael
> > > >>
> > > >>
> > > >> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> > > >>
> > > >> I've started to look on the internet, and surely someone will come,
> but the challenge I suspect is that these vectors are expensive to generate
> so people have not gone all in on generating such large vectors for large
> datasets. They certainly have not made them easy to find. Here is the most
> promising but it is too small, probably:
> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> > > >>
> > > >> I'm still in and out of the office at the moment, but when I
> return, I can ask my employer if they will sponsor a 10 million document
> collection so that you can test with that. Or, maybe someone from work will
> see and ask them on my behalf.
> > > >>
> > > >> Alternatively, next week, I may get some time to set up a server
> with an open source LLM to generate the vectors. It still won't be free,
> but it would be 99% cheaper than paying the LLM companies if we can be slow.
> > > >>
> > > >>
> > > >>
> > > >> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <
> michael.wechner@wyona.com> wrote:
> > > >>>
> > > >>> Great, thank you!
> > > >>>
> > > >>> How much RAM; etc. did you run this test on?
> > > >>>
> > > >>> Do the vectors really have to be based on real data for testing the
> > > >>> indexing?
> > > >>> I understand, if you want to test the quality of the search
> results it
> > > >>> does matter, but for testing the scalability itself it should not
> matter
> > > >>> actually, right?
> > > >>>
> > > >>> Thanks
> > > >>>
> > > >>> Michael
> > > >>>
> > > >>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> > > >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in
> ~20
> > > >>> > minutes with a single thread. I have some 256K vectors, but only
> about
> > > >>> > 2M of them. Can anybody point me to a large set (say 8M+) of
> 1024+ dim
> > > >>> > vectors I can use for testing? If all else fails I can test with
> > > >>> > noise, but that tends to lead to meaningless results
> > > >>> >
> > > >>> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> > > >>> > <michael.wechner@wyona.com> wrote:
> > > >>> >>
> > > >>> >>
> > > >>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> > > >>> >>> Well, I'm asking ppl actually try to test using such high
> dimensions.
> > > >>> >>> Based on my own experience, I consider it unusable. It seems
> other
> > > >>> >>> folks may have run into trouble too. If the project committers
> can't
> > > >>> >>> even really use vectors with such high dimension counts, then
> its not
> > > >>> >>> in an OK state for users, and we shouldn't bump the limit.
> > > >>> >>>
> > > >>> >>> I'm happy to discuss/compromise etc, but simply bumping the
> limit
> > > >>> >>> without addressing the underlying usability/scalability is a
> real
> > > >>> >>> no-go,
> > > >>> >> I agree that this needs to be adressed
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >>> it is not really solving anything, nor is it giving users
> any
> > > >>> >>> freedom or allowing them to do something they couldnt do
> before.
> > > >>> >>> Because if it still doesnt work it still doesnt work.
> > > >>> >> I disagree, because it *does work* with "smaller" document sets.
> > > >>> >>
> > > >>> >> Currently we have to compile Lucene ourselves to not get the
> exception
> > > >>> >> when using a model with vector dimension greater than 1024,
> > > >>> >> which is of course possible, but not really convenient.
> > > >>> >>
> > > >>> >> As I wrote before, to resolve this discussion, I think we
> should test
> > > >>> >> and address possible issues.
> > > >>> >>
> > > >>> >> I will try to stop discussing now :-) and instead try to
> understand
> > > >>> >> better the actual issues. Would be great if others could join
> on this!
> > > >>> >>
> > > >>> >> Thanks
> > > >>> >>
> > > >>> >> Michael
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >>> We all need to be on the same page, grounded in reality, not
> fantasy,
> > > >>> >>> where if we set a limit of 1024 or 2048, that you can actually
> index
> > > >>> >>> vectors with that many dimensions and it actually works and
> scales.
> > > >>> >>>
> > > >>> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> > > >>> >>> <a.benedetti@sease.io> wrote:
> > > >>> >>>> As I said earlier, a max limit limits usability.
> > > >>> >>>> It's not forcing users with small vectors to pay the
> performance penalty of big vectors, it's literally preventing some users to
> use Lucene/Solr/Elasticsearch at all.
> > > >>> >>>> As far as I know, the max limit is used to raise an
> exception, it's not used to initialise or optimise data structures (please
> correct me if I'm wrong).
> > > >>> >>>>
> > > >>> >>>> Improving the algorithm performance is a separate discussion.
> > > >>> >>>> I don't see a correlation with the fact that indexing
> billions of whatever dimensioned vector is slow with a usability parameter.
> > > >>> >>>>
> > > >>> >>>> What about potential users that need few high dimensional
> vectors?
> > > >>> >>>>
> > > >>> >>>> As I said before, I am a big +1 for NOT just raise it
> blindly, but I believe we need to remove the limit or size it in a way it's
> not a problem for both users and internal data structure optimizations, if
> any.
> > > >>> >>>>
> > > >>> >>>>
> > > >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com>
> wrote:
> > > >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try
> to index
> > > >>> >>>>> a few million vectors with 756 or 1024, which is allowed
> today.
> > > >>> >>>>>
> > > >>> >>>>> IMO based on how painful it is, it seems the limit is
> already too
> > > >>> >>>>> high, I realize that will sound controversial but please at
> least try
> > > >>> >>>>> it out!
> > > >>> >>>>>
> > > >>> >>>>> voting +1 without at least doing this is really the
> > > >>> >>>>> "weak/unscientifically minded" approach.
> > > >>> >>>>>
> > > >>> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> > > >>> >>>>> <michael.wechner@wyona.com> wrote:
> > > >>> >>>>>> Thanks for your feedback!
> > > >>> >>>>>>
> > > >>> >>>>>> I agree, that it should not crash.
> > > >>> >>>>>>
> > > >>> >>>>>> So far we did not experience crashes ourselves, but we did
> not index
> > > >>> >>>>>> millions of vectors.
> > > >>> >>>>>>
> > > >>> >>>>>> I will try to reproduce the crash, maybe this will help us
> to move forward.
> > > >>> >>>>>>
> > > >>> >>>>>> Thanks
> > > >>> >>>>>>
> > > >>> >>>>>> Michael
> > > >>> >>>>>>
> > > >>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> > > >>> >>>>>>>> Can you describe your crash in more detail?
> > > >>> >>>>>>> I can't. That experiment was a while ago and a quick test
> to see if I
> > > >>> >>>>>>> could index rather large-ish USPTO (patent office) data as
> vectors.
> > > >>> >>>>>>> Couldn't do it then.
> > > >>> >>>>>>>
> > > >>> >>>>>>>> How much RAM?
> > > >>> >>>>>>> My indexing jobs run with rather smallish heaps to give
> space for I/O
> > > >>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been
> the problem.
> > > >>> >>>>>>> I recall segment merging grew slower and slower and then
> simply
> > > >>> >>>>>>> crashed. Lucene should work with low heap requirements,
> even if it
> > > >>> >>>>>>> slows down. Throwing ram at the indexing/ segment merging
> problem
> > > >>> >>>>>>> is... I don't know - not elegant?
> > > >>> >>>>>>>
> > > >>> >>>>>>> Anyway. My main point was to remind folks about how Apache
> works -
> > > >>> >>>>>>> code is merged in when there are no vetoes. If Rob (or
> anybody else)
> > > >>> >>>>>>> remains unconvinced, he or she can block the change. (I
> didn't invent
> > > >>> >>>>>>> those rules).
> > > >>> >>>>>>>
> > > >>> >>>>>>> D.
> > > >>> >>>>>>>
> > > >>> >>>>>>>
> ---------------------------------------------------------------------
> > > >>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> >>>>>>> For additional commands, e-mail:
> dev-help@lucene.apache.org
> > > >>> >>>>>>>
> > > >>> >>>>>>
> ---------------------------------------------------------------------
> > > >>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>> >>>>>>
> > > >>> >>>>>
> ---------------------------------------------------------------------
> > > >>> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>> >>>>>
> > > >>> >>>
> ---------------------------------------------------------------------
> > > >>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>> >>>
> > > >>> >>
> > > >>> >>
> ---------------------------------------------------------------------
> > > >>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>> >>
> > > >>> >
> ---------------------------------------------------------------------
> > > >>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> > For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>> >
> > > >>>
> > > >>>
> > > >>>
> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>>
> > > >>
> > > >>
> > > >> --
> > > >> Marcus Eagan
> > > >>
> > > >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Marcus Eagan

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 7, 2023, 1:59 PM

Post #48 of 99 (101 views)

On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
>
> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>
> Robert, since you're the only on-the-record veto here, does this
> change your thinking at all, or if not could you share some test
> results that didn't go the way you expected? Maybe we can find some
> mitigation if we focus on a specific issue.
>

My scale concerns are both space and time. What does the execution
time look like if you don't set insanely large IW rambuffer? The
default is 16MB. Just concerned we're shoving some problems under the
rug :)

Even with the yuge RAMbuffer, we're still talking about almost 2 hours
to index 4M documents with these 2k vectors. Whereas you'd measure
this in seconds with typical lucene indexing, its nothing.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

ben.w.trent at gmail

Apr 7, 2023, 2:13 PM

Post #49 of 99 (101 views)

From all I have seen when hooking up JFR when indexing a medium number of
vectors(1M +), almost all the time is spent simply comparing the vectors
(e.g. dot_product).

This indicates to me that another algorithm won't really help index build
time tremendously. Unless others do dramatically fewer vector comparisons
(from what I can tell, this is at least not true for DiskAnn, unless some
fancy footwork is done when building the PQ codebook).

I would also say comparing vector index build time to indexing terms are
apples and oranges. Yeah, they both live in Lucene, but the number of
calculations required (no matter the data structure used), will be
magnitudes greater.

On Fri, Apr 7, 2023, 4:59 PM Robert Muir <rcmuir@gmail.com> wrote:

> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
>
> My scale concerns are both space and time. What does the execution
> time look like if you don't set insanely large IW rambuffer? The
> default is 16MB. Just concerned we're shoving some problems under the
> rug :)
>
> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
> to index 4M documents with these 2k vectors. Whereas you'd measure
> this in seconds with typical lucene indexing, its nothing.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 7, 2023, 2:16 PM

Post #50 of 99 (101 views)

On Fri, Apr 7, 2023 at 5:13?PM Benjamin Trent <ben.w.trent@gmail.com> wrote:
>
> From all I have seen when hooking up JFR when indexing a medium number of vectors(1M +), almost all the time is spent simply comparing the vectors (e.g. dot_product).
>
> This indicates to me that another algorithm won't really help index build time tremendously. Unless others do dramatically fewer vector comparisons (from what I can tell, this is at least not true for DiskAnn, unless some fancy footwork is done when building the PQ codebook).
>
> I would also say comparing vector index build time to indexing terms are apples and oranges. Yeah, they both live in Lucene, but the number of calculations required (no matter the data structure used), will be magnitudes greater.
>

I'm not sure, i think this slowness due to massive number of
comparisons is just another side effect of the unscalable algorithm.
It is designed to build an in-memory datastructure and "merge" means
"rebuild". And since we fully rebuild a new one when merging, you get
something like O(n^2) total indexing when you take merges into
account.
Some of the other algorithms... in fact support merging. The DiskANN
paper has like a "chapter" on this.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jim.ferenczi at gmail

Apr 7, 2023, 2:28 PM

Post #51 of 99 (175 views)

The inference time (and cost) to generate these big vectors must be quite
large too ;).
Regarding the ram buffer, we could drastically reduce the size by writing
the vectors on disk instead of keeping them in the heap. With 1k dimensions
the ram buffer is filled with these vectors quite rapidly.

On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:

> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
>
> My scale concerns are both space and time. What does the execution
> time look like if you don't set insanely large IW rambuffer? The
> default is 16MB. Just concerned we're shoving some problems under the
> rug :)
>
> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
> to index 4M documents with these 2k vectors. Whereas you'd measure
> this in seconds with typical lucene indexing, its nothing.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jim.ferenczi at gmail

Apr 7, 2023, 2:36 PM

Post #52 of 99 (175 views)

I am also not sure that diskann would solve the merging issue. The idea
describe in the paper is to run kmeans first to create multiple graphs, one
per cluster. In our case the vectors in each segment could belong to
different cluster so I don’t see how we could merge them efficiently.

On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:

> The inference time (and cost) to generate these big vectors must be quite
> large too ;).
> Regarding the ram buffer, we could drastically reduce the size by writing
> the vectors on disk instead of keeping them in the heap. With 1k dimensions
> the ram buffer is filled with these vectors quite rapidly.
>
> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>
>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >
>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
>> size=1994)
>> >
>> > Robert, since you're the only on-the-record veto here, does this
>> > change your thinking at all, or if not could you share some test
>> > results that didn't go the way you expected? Maybe we can find some
>> > mitigation if we focus on a specific issue.
>> >
>>
>> My scale concerns are both space and time. What does the execution
>> time look like if you don't set insanely large IW rambuffer? The
>> default is 16MB. Just concerned we're shoving some problems under the
>> rug :)
>>
>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>> to index 4M documents with these 2k vectors. Whereas you'd measure
>> this in seconds with typical lucene indexing, its nothing.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 7, 2023, 2:57 PM

Post #53 of 99 (175 views)

Personally i'd have to re-read the paper, but in general the merging
issue has to be addressed somehow to fix the overall indexing time
problem. It seems it gets "dodged" with huge rambuffers in the emails
here.
Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.

As an example, I'm most familiar with adding DEFLATE compression to
stored fields. Previously, we'd basically decompress and recompress
the stored fields on merge, and LZ4 is so fast that it wasn't
obviously a problem. But with DEFLATE it got slower/heavier (more
intense compression algorithm), something had to be done or indexing
would be unacceptably slow. Hence if you look at storedfields writer,
there is "dirtiness" logic etc so that recompression is amortized over
time and doesn't happen on every merge.

On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com> wrote:
>
> I am also not sure that diskann would solve the merging issue. The idea describe in the paper is to run kmeans first to create multiple graphs, one per cluster. In our case the vectors in each segment could belong to different cluster so I don’t see how we could merge them efficiently.
>
> On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>>
>> The inference time (and cost) to generate these big vectors must be quite large too ;).
>> Regarding the ram buffer, we could drastically reduce the size by writing the vectors on disk instead of keeping them in the heap. With 1k dimensions the ram buffer is filled with these vectors quite rapidly.
>>
>> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>> >
>>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>>> >
>>> > Robert, since you're the only on-the-record veto here, does this
>>> > change your thinking at all, or if not could you share some test
>>> > results that didn't go the way you expected? Maybe we can find some
>>> > mitigation if we focus on a specific issue.
>>> >
>>>
>>> My scale concerns are both space and time. What does the execution
>>> time look like if you don't set insanely large IW rambuffer? The
>>> default is 16MB. Just concerned we're shoving some problems under the
>>> rug :)
>>>
>>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>> this in seconds with typical lucene indexing, its nothing.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jim.ferenczi at gmail

Apr 7, 2023, 3:02 PM

Post #54 of 99 (175 views)

> It is designed to build an in-memory datastructure and "merge" means
"rebuild".

The main idea imo in the diskann paper is to build the graph with the full
dimensions to preserve the quality of the neighbors. At query time it uses
the reduced dimensions (using product quantization) to compute the
similarity and thus reducing the ram required by a large factor. This is
something we could do with the current implementation. I think that Michael
tested something similar with the quantization, but when applied at build
time too it reduces the quality of the graph and the overall recall.

On Fri, 7 Apr 2023 at 22:36, jim ferenczi <jim.ferenczi@gmail.com> wrote:

> I am also not sure that diskann would solve the merging issue. The idea
> describe in the paper is to run kmeans first to create multiple graphs, one
> per cluster. In our case the vectors in each segment could belong to
> different cluster so I don’t see how we could merge them efficiently.
>
> On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>
>> The inference time (and cost) to generate these big vectors must be quite
>> large too ;).
>> Regarding the ram buffer, we could drastically reduce the size by writing
>> the vectors on disk instead of keeping them in the heap. With 1k dimensions
>> the ram buffer is filled with these vectors quite rapidly.
>>
>> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>> >
>>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
>>> size=1994)
>>> >
>>> > Robert, since you're the only on-the-record veto here, does this
>>> > change your thinking at all, or if not could you share some test
>>> > results that didn't go the way you expected? Maybe we can find some
>>> > mitigation if we focus on a specific issue.
>>> >
>>>
>>> My scale concerns are both space and time. What does the execution
>>> time look like if you don't set insanely large IW rambuffer? The
>>> default is 16MB. Just concerned we're shoving some problems under the
>>> rug :)
>>>
>>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>> this in seconds with typical lucene indexing, its nothing.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jim.ferenczi at gmail

Apr 7, 2023, 3:15 PM

Post #55 of 99 (175 views)

> Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.

Yep I agree. Personally I don t see how we can solve this without prior
knowledge of the vectors. Faiss has a nice implementation that fits
naturally with Lucene called IVF (
https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
but if we want to avoid running kmeans on every merge we d require to
provide the clusters for the entire index before indexing the first vector.
It s a complex issue…

On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:

> Personally i'd have to re-read the paper, but in general the merging
> issue has to be addressed somehow to fix the overall indexing time
> problem. It seems it gets "dodged" with huge rambuffers in the emails
> here.
> Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> As an example, I'm most familiar with adding DEFLATE compression to
> stored fields. Previously, we'd basically decompress and recompress
> the stored fields on merge, and LZ4 is so fast that it wasn't
> obviously a problem. But with DEFLATE it got slower/heavier (more
> intense compression algorithm), something had to be done or indexing
> would be unacceptably slow. Hence if you look at storedfields writer,
> there is "dirtiness" logic etc so that recompression is amortized over
> time and doesn't happen on every merge.
>
> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com>
> wrote:
> >
> > I am also not sure that diskann would solve the merging issue. The idea
> describe in the paper is to run kmeans first to create multiple graphs, one
> per cluster. In our case the vectors in each segment could belong to
> different cluster so I don’t see how we could merge them efficiently.
> >
> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com>
> wrote:
> >>
> >> The inference time (and cost) to generate these big vectors must be
> quite large too ;).
> >> Regarding the ram buffer, we could drastically reduce the size by
> writing the vectors on disk instead of keeping them in the heap. With 1k
> dimensions the ram buffer is filled with these vectors quite rapidly.
> >>
> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
> >>>
> >>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >>> >
> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer
> size=1994)
> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
> size=1994)
> >>> >
> >>> > Robert, since you're the only on-the-record veto here, does this
> >>> > change your thinking at all, or if not could you share some test
> >>> > results that didn't go the way you expected? Maybe we can find some
> >>> > mitigation if we focus on a specific issue.
> >>> >
> >>>
> >>> My scale concerns are both space and time. What does the execution
> >>> time look like if you don't set insanely large IW rambuffer? The
> >>> default is 16MB. Just concerned we're shoving some problems under the
> >>> rug :)
> >>>
> >>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
> >>> this in seconds with typical lucene indexing, its nothing.
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 7, 2023, 7:57 PM

Post #56 of 99 (175 views)

sorry to interrupt, but I think we get side-tracked from the original
discussion to increase the vector dimension limit.

I think improving the vector indexing performance is one thing and
making sure Lucene does not crash when increasing the vector dimension
limit is another.

I think it is great to find better ways to index vectors, but I think
this should not prevent people from being able to use models with higher
vector dimensions than 1024.

The following comparison might not be perfect, but imagine we have
invented a combustion engine, which is strong enough to move a car in
the flat area, but when applying it to a truck to move things over
mountains it will fail, because it is not strong enough. Would you
prevent people from using the combustion engine for a car in the flat area?

Thanks

Michael

Am 08.04.23 um 00:15 schrieb jim ferenczi:
> > Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> Yep I agree. Personally I don t see how we can solve this without
> prior knowledge of the vectors. Faiss has a nice implementation that
> fits naturally with Lucene called IVF (
> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
> but if we want to avoid running kmeans on every merge we d require to
> provide the clusters for the entire index before indexing the first
> vector.
> It s a complex issue…
>
> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:
>
> Personally i'd have to re-read the paper, but in general the merging
> issue has to be addressed somehow to fix the overall indexing time
> problem. It seems it gets "dodged" with huge rambuffers in the emails
> here.
> Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> As an example, I'm most familiar with adding DEFLATE compression to
> stored fields. Previously, we'd basically decompress and recompress
> the stored fields on merge, and LZ4 is so fast that it wasn't
> obviously a problem. But with DEFLATE it got slower/heavier (more
> intense compression algorithm), something had to be done or indexing
> would be unacceptably slow. Hence if you look at storedfields writer,
> there is "dirtiness" logic etc so that recompression is amortized over
> time and doesn't happen on every merge.
>
> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi
> <jim.ferenczi@gmail.com> wrote:
> >
> > I am also not sure that diskann would solve the merging issue.
> The idea describe in the paper is to run kmeans first to create
> multiple graphs, one per cluster. In our case the vectors in each
> segment could belong to different cluster so I don’t see how we
> could merge them efficiently.
> >
> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi
> <jim.ferenczi@gmail.com> wrote:
> >>
> >> The inference time (and cost) to generate these big vectors
> must be quite large too ;).
> >> Regarding the ram buffer, we could drastically reduce the size
> by writing the vectors on disk instead of keeping them in the
> heap. With 1k dimensions the ram buffer is filled with these
> vectors quite rapidly.
> >>
> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
> >>>
> >>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov
> <msokolov@gmail.com> wrote:
> >>> >
> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer
> size=1994)
> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW
> buffer size=1994)
> >>> >
> >>> > Robert, since you're the only on-the-record veto here, does this
> >>> > change your thinking at all, or if not could you share some test
> >>> > results that didn't go the way you expected? Maybe we can
> find some
> >>> > mitigation if we focus on a specific issue.
> >>> >
> >>>
> >>> My scale concerns are both space and time. What does the execution
> >>> time look like if you don't set insanely large IW rambuffer? The
> >>> default is 16MB. Just concerned we're shoving some problems
> under the
> >>> rug :)
> >>>
> >>> Even with the yuge RAMbuffer, we're still talking about almost
> 2 hours
> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
> >>> this in seconds with typical lucene indexing, its nothing.
> >>>
> >>>
> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 8, 2023, 2:56 AM

Post #57 of 99 (175 views)

Yes, that was explicitly mentioned in the original mail, improving the
vector based search of Lucene is an interesting area, but off topic here.

Let's summarise:
- We want to at least increase the limit (or remove it)
- We proved that performance are ok to do it (and we can improve them more
in the future), no harm is given to users that intend to stick to low
dimensional vectors

What are the next steps?
What apache community tool can we use to agree on a new limit/no explicit
limit (max integer)?
I think we need some sort of place where each of us propose a limit with a
motivation and we vote the best option?
Any idea on how to do it?

Cheers

On Sat, 8 Apr 2023, 03:57 Michael Wechner, <michael.wechner@wyona.com>
wrote:

> sorry to interrupt, but I think we get side-tracked from the original
> discussion to increase the vector dimension limit.
>
> I think improving the vector indexing performance is one thing and making
> sure Lucene does not crash when increasing the vector dimension limit is
> another.
>
> I think it is great to find better ways to index vectors, but I think this
> should not prevent people from being able to use models with higher vector
> dimensions than 1024.
>
> The following comparison might not be perfect, but imagine we have
> invented a combustion engine, which is strong enough to move a car in the
> flat area, but when applying it to a truck to move things over mountains it
> will fail, because it is not strong enough. Would you prevent people from
> using the combustion engine for a car in the flat area?
>
> Thanks
>
> Michael
>
>
>
> Am 08.04.23 um 00:15 schrieb jim ferenczi:
>
> > Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> Yep I agree. Personally I don t see how we can solve this without prior
> knowledge of the vectors. Faiss has a nice implementation that fits
> naturally with Lucene called IVF (
> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
> but if we want to avoid running kmeans on every merge we d require to
> provide the clusters for the entire index before indexing the first vector.
> It s a complex issue…
>
> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:
>
>> Personally i'd have to re-read the paper, but in general the merging
>> issue has to be addressed somehow to fix the overall indexing time
>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>> here.
>> Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> As an example, I'm most familiar with adding DEFLATE compression to
>> stored fields. Previously, we'd basically decompress and recompress
>> the stored fields on merge, and LZ4 is so fast that it wasn't
>> obviously a problem. But with DEFLATE it got slower/heavier (more
>> intense compression algorithm), something had to be done or indexing
>> would be unacceptably slow. Hence if you look at storedfields writer,
>> there is "dirtiness" logic etc so that recompression is amortized over
>> time and doesn't happen on every merge.
>>
>> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com>
>> wrote:
>> >
>> > I am also not sure that diskann would solve the merging issue. The idea
>> describe in the paper is to run kmeans first to create multiple graphs, one
>> per cluster. In our case the vectors in each segment could belong to
>> different cluster so I don’t see how we could merge them efficiently.
>> >
>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com>
>> wrote:
>> >>
>> >> The inference time (and cost) to generate these big vectors must be
>> quite large too ;).
>> >> Regarding the ram buffer, we could drastically reduce the size by
>> writing the vectors on disk instead of keeping them in the heap. With 1k
>> dimensions the ram buffer is filled with these vectors quite rapidly.
>> >>
>> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>> >>>
>> >>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >>> >
>> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer
>> size=1994)
>> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
>> size=1994)
>> >>> >
>> >>> > Robert, since you're the only on-the-record veto here, does this
>> >>> > change your thinking at all, or if not could you share some test
>> >>> > results that didn't go the way you expected? Maybe we can find some
>> >>> > mitigation if we focus on a specific issue.
>> >>> >
>> >>>
>> >>> My scale concerns are both space and time. What does the execution
>> >>> time look like if you don't set insanely large IW rambuffer? The
>> >>> default is 16MB. Just concerned we're shoving some problems under the
>> >>> rug :)
>> >>>
>> >>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
>> >>> this in seconds with typical lucene indexing, its nothing.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 8, 2023, 4:04 AM

Post #58 of 99 (175 views)

I don't think we have. The performance needs to be reasonable in order
to bump this limit. Otherwise bumping this limit makes the worst-case
2x worse than it already is!

Moreover, its clear something needs to happen to address the
scalability/lack of performance. I'd hate for this limit to be in the
way of that. Because of backwards compatibility, it's a one-way,
permanent, irreversible change.

I'm not sold by any means in any way yet. My vote remains the same.

On Fri, Apr 7, 2023 at 10:57?PM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> sorry to interrupt, but I think we get side-tracked from the original discussion to increase the vector dimension limit.
>
> I think improving the vector indexing performance is one thing and making sure Lucene does not crash when increasing the vector dimension limit is another.
>
> I think it is great to find better ways to index vectors, but I think this should not prevent people from being able to use models with higher vector dimensions than 1024.
>
> The following comparison might not be perfect, but imagine we have invented a combustion engine, which is strong enough to move a car in the flat area, but when applying it to a truck to move things over mountains it will fail, because it is not strong enough. Would you prevent people from using the combustion engine for a car in the flat area?
>
> Thanks
>
> Michael
>
>
>
> Am 08.04.23 um 00:15 schrieb jim ferenczi:
>
> > Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> Yep I agree. Personally I don t see how we can solve this without prior knowledge of the vectors. Faiss has a nice implementation that fits naturally with Lucene called IVF (
> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
> but if we want to avoid running kmeans on every merge we d require to provide the clusters for the entire index before indexing the first vector.
> It s a complex issue…
>
> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:
>>
>> Personally i'd have to re-read the paper, but in general the merging
>> issue has to be addressed somehow to fix the overall indexing time
>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>> here.
>> Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> As an example, I'm most familiar with adding DEFLATE compression to
>> stored fields. Previously, we'd basically decompress and recompress
>> the stored fields on merge, and LZ4 is so fast that it wasn't
>> obviously a problem. But with DEFLATE it got slower/heavier (more
>> intense compression algorithm), something had to be done or indexing
>> would be unacceptably slow. Hence if you look at storedfields writer,
>> there is "dirtiness" logic etc so that recompression is amortized over
>> time and doesn't happen on every merge.
>>
>> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com> wrote:
>> >
>> > I am also not sure that diskann would solve the merging issue. The idea describe in the paper is to run kmeans first to create multiple graphs, one per cluster. In our case the vectors in each segment could belong to different cluster so I don’t see how we could merge them efficiently.
>> >
>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>> >>
>> >> The inference time (and cost) to generate these big vectors must be quite large too ;).
>> >> Regarding the ram buffer, we could drastically reduce the size by writing the vectors on disk instead of keeping them in the heap. With 1k dimensions the ram buffer is filled with these vectors quite rapidly.
>> >>
>> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>> >>>
>> >>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
>> >>> >
>> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>> >>> >
>> >>> > Robert, since you're the only on-the-record veto here, does this
>> >>> > change your thinking at all, or if not could you share some test
>> >>> > results that didn't go the way you expected? Maybe we can find some
>> >>> > mitigation if we focus on a specific issue.
>> >>> >
>> >>>
>> >>> My scale concerns are both space and time. What does the execution
>> >>> time look like if you don't set insanely large IW rambuffer? The
>> >>> default is 16MB. Just concerned we're shoving some problems under the
>> >>> rug :)
>> >>>
>> >>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
>> >>> this in seconds with typical lucene indexing, its nothing.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 8, 2023, 4:06 AM

Post #59 of 99 (175 views)

Great way to try to meet me in the middle and win me over, basically
just dismiss my concerns. This is not going to achieve what you want.

On Sat, Apr 8, 2023 at 5:56?AM Alessandro Benedetti
<a.benedetti@sease.io> wrote:
>
> Yes, that was explicitly mentioned in the original mail, improving the vector based search of Lucene is an interesting area, but off topic here.
>
> Let's summarise:
> - We want to at least increase the limit (or remove it)
> - We proved that performance are ok to do it (and we can improve them more in the future), no harm is given to users that intend to stick to low dimensional vectors
>
> What are the next steps?
> What apache community tool can we use to agree on a new limit/no explicit limit (max integer)?
> I think we need some sort of place where each of us propose a limit with a motivation and we vote the best option?
> Any idea on how to do it?
>
> Cheers
>
> On Sat, 8 Apr 2023, 03:57 Michael Wechner, <michael.wechner@wyona.com> wrote:
>>
>> sorry to interrupt, but I think we get side-tracked from the original discussion to increase the vector dimension limit.
>>
>> I think improving the vector indexing performance is one thing and making sure Lucene does not crash when increasing the vector dimension limit is another.
>>
>> I think it is great to find better ways to index vectors, but I think this should not prevent people from being able to use models with higher vector dimensions than 1024.
>>
>> The following comparison might not be perfect, but imagine we have invented a combustion engine, which is strong enough to move a car in the flat area, but when applying it to a truck to move things over mountains it will fail, because it is not strong enough. Would you prevent people from using the combustion engine for a car in the flat area?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 08.04.23 um 00:15 schrieb jim ferenczi:
>>
>> > Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> Yep I agree. Personally I don t see how we can solve this without prior knowledge of the vectors. Faiss has a nice implementation that fits naturally with Lucene called IVF (
>> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
>> but if we want to avoid running kmeans on every merge we d require to provide the clusters for the entire index before indexing the first vector.
>> It s a complex issue…
>>
>> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>> Personally i'd have to re-read the paper, but in general the merging
>>> issue has to be addressed somehow to fix the overall indexing time
>>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>>> here.
>>> Keep in mind, there may be other ways to do it. In general if merging
>>> something is going to be "heavyweight", we should think about it to
>>> prevent things from going really bad overall.
>>>
>>> As an example, I'm most familiar with adding DEFLATE compression to
>>> stored fields. Previously, we'd basically decompress and recompress
>>> the stored fields on merge, and LZ4 is so fast that it wasn't
>>> obviously a problem. But with DEFLATE it got slower/heavier (more
>>> intense compression algorithm), something had to be done or indexing
>>> would be unacceptably slow. Hence if you look at storedfields writer,
>>> there is "dirtiness" logic etc so that recompression is amortized over
>>> time and doesn't happen on every merge.
>>>
>>> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com> wrote:
>>> >
>>> > I am also not sure that diskann would solve the merging issue. The idea describe in the paper is to run kmeans first to create multiple graphs, one per cluster. In our case the vectors in each segment could belong to different cluster so I don’t see how we could merge them efficiently.
>>> >
>>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>>> >>
>>> >> The inference time (and cost) to generate these big vectors must be quite large too ;).
>>> >> Regarding the ram buffer, we could drastically reduce the size by writing the vectors on disk instead of keeping them in the heap. With 1k dimensions the ram buffer is filled with these vectors quite rapidly.
>>> >>
>>> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>>> >>>
>>> >>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>> >>> >
>>> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>>> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>>> >>> >
>>> >>> > Robert, since you're the only on-the-record veto here, does this
>>> >>> > change your thinking at all, or if not could you share some test
>>> >>> > results that didn't go the way you expected? Maybe we can find some
>>> >>> > mitigation if we focus on a specific issue.
>>> >>> >
>>> >>>
>>> >>> My scale concerns are both space and time. What does the execution
>>> >>> time look like if you don't set insanely large IW rambuffer? The
>>> >>> default is 16MB. Just concerned we're shoving some problems under the
>>> >>> rug :)
>>> >>>
>>> >>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>>> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>> >>> this in seconds with typical lucene indexing, its nothing.
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 8, 2023, 5:33 AM

Post #60 of 99 (175 views)

What exactly do you consider reasonable?

I think it would help if we could specify concrete requirements re
performance and scalability, because then we have a concrete goal which
we can work with.
Do such requirements already exist or what would be a good starting point?

Re 2x worse, I think Michael Sokolov already pointed out that things
take longer linearly with vector dimension, which is quite obvious for
example for a brute force implementation. I would argue this will be the
case for any implementation.

And last I would like to ask again, slightly different, do we want
people to use Lucene, which will give us an opportunity to learn from
and progress?

Thanks

Michael

Am 08.04.23 um 13:04 schrieb Robert Muir:
> I don't think we have. The performance needs to be reasonable in order
> to bump this limit. Otherwise bumping this limit makes the worst-case
> 2x worse than it already is!
>
> Moreover, its clear something needs to happen to address the
> scalability/lack of performance. I'd hate for this limit to be in the
> way of that. Because of backwards compatibility, it's a one-way,
> permanent, irreversible change.
>
> I'm not sold by any means in any way yet. My vote remains the same.
>
> On Fri, Apr 7, 2023 at 10:57?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>> sorry to interrupt, but I think we get side-tracked from the original discussion to increase the vector dimension limit.
>>
>> I think improving the vector indexing performance is one thing and making sure Lucene does not crash when increasing the vector dimension limit is another.
>>
>> I think it is great to find better ways to index vectors, but I think this should not prevent people from being able to use models with higher vector dimensions than 1024.
>>
>> The following comparison might not be perfect, but imagine we have invented a combustion engine, which is strong enough to move a car in the flat area, but when applying it to a truck to move things over mountains it will fail, because it is not strong enough. Would you prevent people from using the combustion engine for a car in the flat area?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 08.04.23 um 00:15 schrieb jim ferenczi:
>>
>>> Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> Yep I agree. Personally I don t see how we can solve this without prior knowledge of the vectors. Faiss has a nice implementation that fits naturally with Lucene called IVF (
>> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
>> but if we want to avoid running kmeans on every merge we d require to provide the clusters for the entire index before indexing the first vector.
>> It s a complex issue…
>>
>> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:
>>> Personally i'd have to re-read the paper, but in general the merging
>>> issue has to be addressed somehow to fix the overall indexing time
>>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>>> here.
>>> Keep in mind, there may be other ways to do it. In general if merging
>>> something is going to be "heavyweight", we should think about it to
>>> prevent things from going really bad overall.
>>>
>>> As an example, I'm most familiar with adding DEFLATE compression to
>>> stored fields. Previously, we'd basically decompress and recompress
>>> the stored fields on merge, and LZ4 is so fast that it wasn't
>>> obviously a problem. But with DEFLATE it got slower/heavier (more
>>> intense compression algorithm), something had to be done or indexing
>>> would be unacceptably slow. Hence if you look at storedfields writer,
>>> there is "dirtiness" logic etc so that recompression is amortized over
>>> time and doesn't happen on every merge.
>>>
>>> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com> wrote:
>>>> I am also not sure that diskann would solve the merging issue. The idea describe in the paper is to run kmeans first to create multiple graphs, one per cluster. In our case the vectors in each segment could belong to different cluster so I don’t see how we could merge them efficiently.
>>>>
>>>> On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>>>>> The inference time (and cost) to generate these big vectors must be quite large too ;).
>>>>> Regarding the ram buffer, we could drastically reduce the size by writing the vectors on disk instead of keeping them in the heap. With 1k dimensions the ram buffer is filled with these vectors quite rapidly.
>>>>>
>>>>> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>>>>>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>>>>>> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>>>>>>> 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>>>>>>>
>>>>>>> Robert, since you're the only on-the-record veto here, does this
>>>>>>> change your thinking at all, or if not could you share some test
>>>>>>> results that didn't go the way you expected? Maybe we can find some
>>>>>>> mitigation if we focus on a specific issue.
>>>>>>>
>>>>>> My scale concerns are both space and time. What does the execution
>>>>>> time look like if you don't set insanely large IW rambuffer? The
>>>>>> default is 16MB. Just concerned we're shoving some problems under the
>>>>>> rug :)
>>>>>>
>>>>>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>>>>>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>>>>> this in seconds with typical lucene indexing, its nothing.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 8, 2023, 9:39 AM

Post #61 of 99 (175 views)

On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> What exactly do you consider reasonable?

Let's begin a real discussion by being HONEST about the current
status. Please put politically correct or your own company's wishes
aside, we know it's not in a good state.

Current status is the one guy who wrote the code can set a
multi-gigabyte ram buffer and index a small dataset with 1024
dimensions in HOURS (i didn't ask what hardware).

My concerns are everyone else except the one guy, I want it to be
usable. Increasing dimensions just means even bigger multi-gigabyte
ram buffer and bigger heap to avoid OOM on merge.
It is also a permanent backwards compatibility decision, we have to
support it once we do this and we can't just say "oops" and flip it
back.

It is unclear to me, if the multi-gigabyte ram buffer is really to
avoid merges because they are so slow and it would be DAYS otherwise,
or if its to avoid merges so it doesn't hit OOM.
Also from personal experience, it takes trial and error (means
experiencing OOM on merge!!!) before you get those heap values correct
for your dataset. This usually means starting over which is
frustrating and wastes more time.

Jim mentioned some ideas about the memory usage in IndexWriter, seems
to me like its a good idea. maybe the multigigabyte ram buffer can be
avoided in this way and performance improved by writing bigger
segments with lucene's defaults. But this doesn't mean we can simply
ignore the horrors of what happens on merge. merging needs to scale so
that indexing really scales.

At least it shouldnt spike RAM on trivial data amounts and cause OOM,
and definitely it shouldnt burn hours and hours of CPU in O(n^2)
fashion when indexing.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 8, 2023, 9:50 AM

Post #62 of 99 (175 views)

What you said about increasing dimensions requiring a bigger ram buffer on
merge is wrong. That's the point I was trying to make. Your concerns about
merge costs are not wrong, but your conclusion that we need to limit
dimensions is not justified.

You complain that hnsw sucks it doesn't scale, but when I show it scales
linearly with dimension you just ignore that and complain about something
entirely different.

You demand that people run all kinds of tests to prove you wrong but when
they do, you don't listen and you won't put in the work yourself or
complain that it's too hard.

Then you complain about people not meeting you half way. Wow

On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:

> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> <michael.wechner@wyona.com> wrote:
> >
> > What exactly do you consider reasonable?
>
> Let's begin a real discussion by being HONEST about the current
> status. Please put politically correct or your own company's wishes
> aside, we know it's not in a good state.
>
> Current status is the one guy who wrote the code can set a
> multi-gigabyte ram buffer and index a small dataset with 1024
> dimensions in HOURS (i didn't ask what hardware).
>
> My concerns are everyone else except the one guy, I want it to be
> usable. Increasing dimensions just means even bigger multi-gigabyte
> ram buffer and bigger heap to avoid OOM on merge.
> It is also a permanent backwards compatibility decision, we have to
> support it once we do this and we can't just say "oops" and flip it
> back.
>
> It is unclear to me, if the multi-gigabyte ram buffer is really to
> avoid merges because they are so slow and it would be DAYS otherwise,
> or if its to avoid merges so it doesn't hit OOM.
> Also from personal experience, it takes trial and error (means
> experiencing OOM on merge!!!) before you get those heap values correct
> for your dataset. This usually means starting over which is
> frustrating and wastes more time.
>
> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> to me like its a good idea. maybe the multigigabyte ram buffer can be
> avoided in this way and performance improved by writing bigger
> segments with lucene's defaults. But this doesn't mean we can simply
> ignore the horrors of what happens on merge. merging needs to scale so
> that indexing really scales.
>
> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> fashion when indexing.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 8, 2023, 9:57 AM

Post #63 of 99 (175 views)

I disagree with your categorization. I put in plenty of work and
experienced plenty of pain myself, writing tests and fighting these
issues, after i saw that, two releases in a row, vector indexing fell
over and hit integer overflows etc on small datasets:

https://github.com/apache/lucene/pull/11905

Attacking me isn't helping the situation.

PS: when i said the "one guy who wrote the code" I didn't mean it in
any kind of demeaning fashion really. I meant to describe the current
state of usability with respect to indexing a few million docs with
high dimensions. You can scroll up the thread and see that at least
one other committer on the project experienced similar pain as me.
Then, think about users who aren't committers trying to use the
functionality!

On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>
> What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>
> You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>
> You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>
> Then you complain about people not meeting you half way. Wow
>
> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> <michael.wechner@wyona.com> wrote:
>> >
>> > What exactly do you consider reasonable?
>>
>> Let's begin a real discussion by being HONEST about the current
>> status. Please put politically correct or your own company's wishes
>> aside, we know it's not in a good state.
>>
>> Current status is the one guy who wrote the code can set a
>> multi-gigabyte ram buffer and index a small dataset with 1024
>> dimensions in HOURS (i didn't ask what hardware).
>>
>> My concerns are everyone else except the one guy, I want it to be
>> usable. Increasing dimensions just means even bigger multi-gigabyte
>> ram buffer and bigger heap to avoid OOM on merge.
>> It is also a permanent backwards compatibility decision, we have to
>> support it once we do this and we can't just say "oops" and flip it
>> back.
>>
>> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> avoid merges because they are so slow and it would be DAYS otherwise,
>> or if its to avoid merges so it doesn't hit OOM.
>> Also from personal experience, it takes trial and error (means
>> experiencing OOM on merge!!!) before you get those heap values correct
>> for your dataset. This usually means starting over which is
>> frustrating and wastes more time.
>>
>> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> avoided in this way and performance improved by writing bigger
>> segments with lucene's defaults. But this doesn't mean we can simply
>> ignore the horrors of what happens on merge. merging needs to scale so
>> that indexing really scales.
>>
>> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> fashion when indexing.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 8, 2023, 12:28 PM

Post #64 of 99 (175 views)

I am very attentive to listen opinions but I am un-convinced here and I an
not sure that a single person opinion should be allowed to be detrimental
for such an important project.

The limit as far as I know is literally just raising an exception.
Removing it won't alter in any way the current performance for users in low
dimensional space.
Removing it will just enable more users to use Lucene.

If new users in certain situations will be unhappy with the performance,
they may contribute improvements.
This is how you make progress.

If it's a reputation thing, trust me that not allowing users to play with
high dimensional space will equally damage it.

To me it's really a no brainer.
Removing the limit and enable people to use high dimensional vectors will
take minutes.
Improving the hnsw implementation can take months.
Pick one to begin with...

And there's no-one paying me here, no company interest whatsoever, actually
I pay people to contribute, I am just convinced it's a good idea.

On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:

> I disagree with your categorization. I put in plenty of work and
> experienced plenty of pain myself, writing tests and fighting these
> issues, after i saw that, two releases in a row, vector indexing fell
> over and hit integer overflows etc on small datasets:
>
> https://github.com/apache/lucene/pull/11905
>
> Attacking me isn't helping the situation.
>
> PS: when i said the "one guy who wrote the code" I didn't mean it in
> any kind of demeaning fashion really. I meant to describe the current
> state of usability with respect to indexing a few million docs with
> high dimensions. You can scroll up the thread and see that at least
> one other committer on the project experienced similar pain as me.
> Then, think about users who aren't committers trying to use the
> functionality!
>
> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >
> > What you said about increasing dimensions requiring a bigger ram buffer
> on merge is wrong. That's the point I was trying to make. Your concerns
> about merge costs are not wrong, but your conclusion that we need to limit
> dimensions is not justified.
> >
> > You complain that hnsw sucks it doesn't scale, but when I show it scales
> linearly with dimension you just ignore that and complain about something
> entirely different.
> >
> > You demand that people run all kinds of tests to prove you wrong but
> when they do, you don't listen and you won't put in the work yourself or
> complain that it's too hard.
> >
> > Then you complain about people not meeting you half way. Wow
> >
> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
> >>
> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> >> <michael.wechner@wyona.com> wrote:
> >> >
> >> > What exactly do you consider reasonable?
> >>
> >> Let's begin a real discussion by being HONEST about the current
> >> status. Please put politically correct or your own company's wishes
> >> aside, we know it's not in a good state.
> >>
> >> Current status is the one guy who wrote the code can set a
> >> multi-gigabyte ram buffer and index a small dataset with 1024
> >> dimensions in HOURS (i didn't ask what hardware).
> >>
> >> My concerns are everyone else except the one guy, I want it to be
> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> >> ram buffer and bigger heap to avoid OOM on merge.
> >> It is also a permanent backwards compatibility decision, we have to
> >> support it once we do this and we can't just say "oops" and flip it
> >> back.
> >>
> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> >> avoid merges because they are so slow and it would be DAYS otherwise,
> >> or if its to avoid merges so it doesn't hit OOM.
> >> Also from personal experience, it takes trial and error (means
> >> experiencing OOM on merge!!!) before you get those heap values correct
> >> for your dataset. This usually means starting over which is
> >> frustrating and wastes more time.
> >>
> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
> >> avoided in this way and performance improved by writing bigger
> >> segments with lucene's defaults. But this doesn't mean we can simply
> >> ignore the horrors of what happens on merge. merging needs to scale so
> >> that indexing really scales.
> >>
> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> >> fashion when indexing.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

ichattopadhyaya at gmail

Apr 8, 2023, 12:53 PM

Post #65 of 99 (175 views)

Can the limit be raised using Java reflection at run time? Or is there more
to it that needs to be changed?

On Sun, 9 Apr, 2023, 12:58 am Alessandro Benedetti, <a.benedetti@sease.io>
wrote:

> I am very attentive to listen opinions but I am un-convinced here and I an
> not sure that a single person opinion should be allowed to be detrimental
> for such an important project.
>
> The limit as far as I know is literally just raising an exception.
> Removing it won't alter in any way the current performance for users in
> low dimensional space.
> Removing it will just enable more users to use Lucene.
>
> If new users in certain situations will be unhappy with the performance,
> they may contribute improvements.
> This is how you make progress.
>
> If it's a reputation thing, trust me that not allowing users to play with
> high dimensional space will equally damage it.
>
> To me it's really a no brainer.
> Removing the limit and enable people to use high dimensional vectors will
> take minutes.
> Improving the hnsw implementation can take months.
> Pick one to begin with...
>
> And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
>
>
> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>
>> I disagree with your categorization. I put in plenty of work and
>> experienced plenty of pain myself, writing tests and fighting these
>> issues, after i saw that, two releases in a row, vector indexing fell
>> over and hit integer overflows etc on small datasets:
>>
>> https://github.com/apache/lucene/pull/11905
>>
>> Attacking me isn't helping the situation.
>>
>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> any kind of demeaning fashion really. I meant to describe the current
>> state of usability with respect to indexing a few million docs with
>> high dimensions. You can scroll up the thread and see that at least
>> one other committer on the project experienced similar pain as me.
>> Then, think about users who aren't committers trying to use the
>> functionality!
>>
>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >
>> > What you said about increasing dimensions requiring a bigger ram buffer
>> on merge is wrong. That's the point I was trying to make. Your concerns
>> about merge costs are not wrong, but your conclusion that we need to limit
>> dimensions is not justified.
>> >
>> > You complain that hnsw sucks it doesn't scale, but when I show it
>> scales linearly with dimension you just ignore that and complain about
>> something entirely different.
>> >
>> > You demand that people run all kinds of tests to prove you wrong but
>> when they do, you don't listen and you won't put in the work yourself or
>> complain that it's too hard.
>> >
>> > Then you complain about people not meeting you half way. Wow
>> >
>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>> >>
>> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> >> <michael.wechner@wyona.com> wrote:
>> >> >
>> >> > What exactly do you consider reasonable?
>> >>
>> >> Let's begin a real discussion by being HONEST about the current
>> >> status. Please put politically correct or your own company's wishes
>> >> aside, we know it's not in a good state.
>> >>
>> >> Current status is the one guy who wrote the code can set a
>> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> >> dimensions in HOURS (i didn't ask what hardware).
>> >>
>> >> My concerns are everyone else except the one guy, I want it to be
>> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>> >> ram buffer and bigger heap to avoid OOM on merge.
>> >> It is also a permanent backwards compatibility decision, we have to
>> >> support it once we do this and we can't just say "oops" and flip it
>> >> back.
>> >>
>> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> >> avoid merges because they are so slow and it would be DAYS otherwise,
>> >> or if its to avoid merges so it doesn't hit OOM.
>> >> Also from personal experience, it takes trial and error (means
>> >> experiencing OOM on merge!!!) before you get those heap values correct
>> >> for your dataset. This usually means starting over which is
>> >> frustrating and wastes more time.
>> >>
>> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> >> avoided in this way and performance improved by writing bigger
>> >> segments with lucene's defaults. But this doesn't mean we can simply
>> >> ignore the horrors of what happens on merge. merging needs to scale so
>> >> that indexing really scales.
>> >>
>> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> >> fashion when indexing.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 8, 2023, 2:01 PM

Post #66 of 99 (175 views)

well, it's a final variable. But you could maybe extend KnnVectorField
to get around this limit? I think that's the only place it's currently
enforced

On Sat, Apr 8, 2023 at 3:54?PM Ishan Chattopadhyaya
<ichattopadhyaya@gmail.com> wrote:
>
> Can the limit be raised using Java reflection at run time? Or is there more to it that needs to be changed?
>
> On Sun, 9 Apr, 2023, 12:58 am Alessandro Benedetti, <a.benedetti@sease.io> wrote:
>>
>> I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>>
>> The limit as far as I know is literally just raising an exception.
>> Removing it won't alter in any way the current performance for users in low dimensional space.
>> Removing it will just enable more users to use Lucene.
>>
>> If new users in certain situations will be unhappy with the performance, they may contribute improvements.
>> This is how you make progress.
>>
>> If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>>
>> To me it's really a no brainer.
>> Removing the limit and enable people to use high dimensional vectors will take minutes.
>> Improving the hnsw implementation can take months.
>> Pick one to begin with...
>>
>> And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>>
>>
>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>>
>>> I disagree with your categorization. I put in plenty of work and
>>> experienced plenty of pain myself, writing tests and fighting these
>>> issues, after i saw that, two releases in a row, vector indexing fell
>>> over and hit integer overflows etc on small datasets:
>>>
>>> https://github.com/apache/lucene/pull/11905
>>>
>>> Attacking me isn't helping the situation.
>>>
>>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>>> any kind of demeaning fashion really. I meant to describe the current
>>> state of usability with respect to indexing a few million docs with
>>> high dimensions. You can scroll up the thread and see that at least
>>> one other committer on the project experienced similar pain as me.
>>> Then, think about users who aren't committers trying to use the
>>> functionality!
>>>
>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>>> >
>>> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>> >
>>> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>> >
>>> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>>> >
>>> > Then you complain about people not meeting you half way. Wow
>>> >
>>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> >>
>>> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>> >> <michael.wechner@wyona.com> wrote:
>>> >> >
>>> >> > What exactly do you consider reasonable?
>>> >>
>>> >> Let's begin a real discussion by being HONEST about the current
>>> >> status. Please put politically correct or your own company's wishes
>>> >> aside, we know it's not in a good state.
>>> >>
>>> >> Current status is the one guy who wrote the code can set a
>>> >> multi-gigabyte ram buffer and index a small dataset with 1024
>>> >> dimensions in HOURS (i didn't ask what hardware).
>>> >>
>>> >> My concerns are everyone else except the one guy, I want it to be
>>> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>>> >> ram buffer and bigger heap to avoid OOM on merge.
>>> >> It is also a permanent backwards compatibility decision, we have to
>>> >> support it once we do this and we can't just say "oops" and flip it
>>> >> back.
>>> >>
>>> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>>> >> avoid merges because they are so slow and it would be DAYS otherwise,
>>> >> or if its to avoid merges so it doesn't hit OOM.
>>> >> Also from personal experience, it takes trial and error (means
>>> >> experiencing OOM on merge!!!) before you get those heap values correct
>>> >> for your dataset. This usually means starting over which is
>>> >> frustrating and wastes more time.
>>> >>
>>> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>>> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>>> >> avoided in this way and performance improved by writing bigger
>>> >> segments with lucene's defaults. But this doesn't mean we can simply
>>> >> ignore the horrors of what happens on merge. merging needs to scale so
>>> >> that indexing really scales.
>>> >>
>>> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>>> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>> >> fashion when indexing.
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jpountz at gmail

Apr 8, 2023, 2:29 PM

Post #67 of 99 (175 views)

As Dawid pointed out earlier on this thread, this is the rule for
Apache projects: a single -1 vote on a code change is a veto and
cannot be overridden. Furthermore, Robert is one of the people on this
project who worked the most on debugging subtle bugs, making Lucene
more robust and improving our test framework, so I'm listening when he
voices quality concerns.

The argument against removing/raising the limit that resonates with me
the most is that it is a one-way door. As MikeS highlighted earlier on
this thread, implementations may want to take advantage of the fact
that there is a limit at some point too. This is why I don't want to
remove the limit and would prefer a slight increase, such as 2048 as
suggested in the original issue, which would enable most of the things
that users who have been asking about raising the limit would like to
do.

I agree that the merge-time memory usage and slow indexing rate are
not great. But it's still possible to index multi-million vector
datasets with a 4GB heap without hitting OOMEs regardless of the
number of dimensions, and the feedback I'm seeing is that many users
are still interested in indexing multi-million vector datasets despite
the slow indexing rate. I wish we could do better, and vector indexing
is certainly more expert than text indexing, but it still is usable in
my opinion. I understand how giving Lucene more information about
vectors prior to indexing (e.g. clustering information as Jim pointed
out) could help make merging faster and more memory-efficient, but I
would really like to avoid making it a requirement for indexing
vectors as it also makes this feature much harder to use.

On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
<a.benedetti@sease.io> wrote:
>
> I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>
> The limit as far as I know is literally just raising an exception.
> Removing it won't alter in any way the current performance for users in low dimensional space.
> Removing it will just enable more users to use Lucene.
>
> If new users in certain situations will be unhappy with the performance, they may contribute improvements.
> This is how you make progress.
>
> If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>
> To me it's really a no brainer.
> Removing the limit and enable people to use high dimensional vectors will take minutes.
> Improving the hnsw implementation can take months.
> Pick one to begin with...
>
> And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>
>
> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>
>> I disagree with your categorization. I put in plenty of work and
>> experienced plenty of pain myself, writing tests and fighting these
>> issues, after i saw that, two releases in a row, vector indexing fell
>> over and hit integer overflows etc on small datasets:
>>
>> https://github.com/apache/lucene/pull/11905
>>
>> Attacking me isn't helping the situation.
>>
>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> any kind of demeaning fashion really. I meant to describe the current
>> state of usability with respect to indexing a few million docs with
>> high dimensions. You can scroll up the thread and see that at least
>> one other committer on the project experienced similar pain as me.
>> Then, think about users who aren't committers trying to use the
>> functionality!
>>
>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>> >
>> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>> >
>> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>> >
>> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>> >
>> > Then you complain about people not meeting you half way. Wow
>> >
>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>> >>
>> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> >> <michael.wechner@wyona.com> wrote:
>> >> >
>> >> > What exactly do you consider reasonable?
>> >>
>> >> Let's begin a real discussion by being HONEST about the current
>> >> status. Please put politically correct or your own company's wishes
>> >> aside, we know it's not in a good state.
>> >>
>> >> Current status is the one guy who wrote the code can set a
>> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> >> dimensions in HOURS (i didn't ask what hardware).
>> >>
>> >> My concerns are everyone else except the one guy, I want it to be
>> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>> >> ram buffer and bigger heap to avoid OOM on merge.
>> >> It is also a permanent backwards compatibility decision, we have to
>> >> support it once we do this and we can't just say "oops" and flip it
>> >> back.
>> >>
>> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> >> avoid merges because they are so slow and it would be DAYS otherwise,
>> >> or if its to avoid merges so it doesn't hit OOM.
>> >> Also from personal experience, it takes trial and error (means
>> >> experiencing OOM on merge!!!) before you get those heap values correct
>> >> for your dataset. This usually means starting over which is
>> >> frustrating and wastes more time.
>> >>
>> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> >> avoided in this way and performance improved by writing bigger
>> >> segments with lucene's defaults. But this doesn't mean we can simply
>> >> ignore the horrors of what happens on merge. merging needs to scale so
>> >> that indexing really scales.
>> >>
>> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> >> fashion when indexing.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

marcuseagan at gmail

Apr 8, 2023, 4:49 PM

Post #68 of 99 (175 views)

Given the massive amounts of funding going into the development and
investigation of the project, I think it would be good to at least have
Lucene be a part of the conversation. Simply because academics typically
focus on vectors <= 784 dimensions does not mean all users will. A large
swathe of very important users of the Lucene project never exceed 500k
documents, though they are shifting to other search engines to try out very
popular embeddings.

I think giving our users the opportunity to build chat bots or LLM memory
machines using Lucene is a positive development, even if some datasets
won't be able to work well. We don't limit the number of fields someone can
add in most cases, though we did just undeprecate that API to better
support multi-tenancy. But people still add so many fields and can crash
their clusters with mapping explosions when unlimited. The limit to vectors
feels similar. I expect more people to dig into Lucene due to its openness
and robustness as they run into problems. Today, they are forced to
consider other engines that are more permissive.

Not everyone important or valuable Lucene workload is in the millions of
documents. Many of them only have lots of queries or computationally
expensive access patterns for B-trees. We can document that it is very
ill-advised to make a deployment with vectors too large. What others will
do with it is on them.

On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:

> As Dawid pointed out earlier on this thread, this is the rule for
> Apache projects: a single -1 vote on a code change is a veto and
> cannot be overridden. Furthermore, Robert is one of the people on this
> project who worked the most on debugging subtle bugs, making Lucene
> more robust and improving our test framework, so I'm listening when he
> voices quality concerns.
>
> The argument against removing/raising the limit that resonates with me
> the most is that it is a one-way door. As MikeS highlighted earlier on
> this thread, implementations may want to take advantage of the fact
> that there is a limit at some point too. This is why I don't want to
> remove the limit and would prefer a slight increase, such as 2048 as
> suggested in the original issue, which would enable most of the things
> that users who have been asking about raising the limit would like to
> do.
>
> I agree that the merge-time memory usage and slow indexing rate are
> not great. But it's still possible to index multi-million vector
> datasets with a 4GB heap without hitting OOMEs regardless of the
> number of dimensions, and the feedback I'm seeing is that many users
> are still interested in indexing multi-million vector datasets despite
> the slow indexing rate. I wish we could do better, and vector indexing
> is certainly more expert than text indexing, but it still is usable in
> my opinion. I understand how giving Lucene more information about
> vectors prior to indexing (e.g. clustering information as Jim pointed
> out) could help make merging faster and more memory-efficient, but I
> would really like to avoid making it a requirement for indexing
> vectors as it also makes this feature much harder to use.
>
> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
> >
> > I am very attentive to listen opinions but I am un-convinced here and I
> an not sure that a single person opinion should be allowed to be
> detrimental for such an important project.
> >
> > The limit as far as I know is literally just raising an exception.
> > Removing it won't alter in any way the current performance for users in
> low dimensional space.
> > Removing it will just enable more users to use Lucene.
> >
> > If new users in certain situations will be unhappy with the performance,
> they may contribute improvements.
> > This is how you make progress.
> >
> > If it's a reputation thing, trust me that not allowing users to play
> with high dimensional space will equally damage it.
> >
> > To me it's really a no brainer.
> > Removing the limit and enable people to use high dimensional vectors
> will take minutes.
> > Improving the hnsw implementation can take months.
> > Pick one to begin with...
> >
> > And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
> >
> >
> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> >>
> >> I disagree with your categorization. I put in plenty of work and
> >> experienced plenty of pain myself, writing tests and fighting these
> >> issues, after i saw that, two releases in a row, vector indexing fell
> >> over and hit integer overflows etc on small datasets:
> >>
> >> https://github.com/apache/lucene/pull/11905
> >>
> >> Attacking me isn't helping the situation.
> >>
> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
> >> any kind of demeaning fashion really. I meant to describe the current
> >> state of usability with respect to indexing a few million docs with
> >> high dimensions. You can scroll up the thread and see that at least
> >> one other committer on the project experienced similar pain as me.
> >> Then, think about users who aren't committers trying to use the
> >> functionality!
> >>
> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >> >
> >> > What you said about increasing dimensions requiring a bigger ram
> buffer on merge is wrong. That's the point I was trying to make. Your
> concerns about merge costs are not wrong, but your conclusion that we need
> to limit dimensions is not justified.
> >> >
> >> > You complain that hnsw sucks it doesn't scale, but when I show it
> scales linearly with dimension you just ignore that and complain about
> something entirely different.
> >> >
> >> > You demand that people run all kinds of tests to prove you wrong but
> when they do, you don't listen and you won't put in the work yourself or
> complain that it's too hard.
> >> >
> >> > Then you complain about people not meeting you half way. Wow
> >> >
> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
> >> >>
> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> >> >> <michael.wechner@wyona.com> wrote:
> >> >> >
> >> >> > What exactly do you consider reasonable?
> >> >>
> >> >> Let's begin a real discussion by being HONEST about the current
> >> >> status. Please put politically correct or your own company's wishes
> >> >> aside, we know it's not in a good state.
> >> >>
> >> >> Current status is the one guy who wrote the code can set a
> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> >> >> dimensions in HOURS (i didn't ask what hardware).
> >> >>
> >> >> My concerns are everyone else except the one guy, I want it to be
> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> >> >> ram buffer and bigger heap to avoid OOM on merge.
> >> >> It is also a permanent backwards compatibility decision, we have to
> >> >> support it once we do this and we can't just say "oops" and flip it
> >> >> back.
> >> >>
> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
> >> >> or if its to avoid merges so it doesn't hit OOM.
> >> >> Also from personal experience, it takes trial and error (means
> >> >> experiencing OOM on merge!!!) before you get those heap values
> correct
> >> >> for your dataset. This usually means starting over which is
> >> >> frustrating and wastes more time.
> >> >>
> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
> >> >> avoided in this way and performance improved by writing bigger
> >> >> segments with lucene's defaults. But this doesn't mean we can simply
> >> >> ignore the horrors of what happens on merge. merging needs to scale
> so
> >> >> that indexing really scales.
> >> >>
> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> >> >> fashion when indexing.
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Marcus Eagan

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

dawid.weiss at gmail

Apr 8, 2023, 10:31 PM

Post #69 of 99 (175 views)

Can we set up a branch in which the limit is bumped to 2048, then have
a realistic, free data set (wikipedia sample or something) that has,
say, 5 million docs and vectors created using public data (glove
pre-trained embeddings or the like)? We then could run indexing on the
same hardware with 512, 1024 and 2048 and see what the numbers, limits
and behavior actually are.

I can help in writing this but not until after Easter.

Dawid

On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>
> As Dawid pointed out earlier on this thread, this is the rule for
> Apache projects: a single -1 vote on a code change is a veto and
> cannot be overridden. Furthermore, Robert is one of the people on this
> project who worked the most on debugging subtle bugs, making Lucene
> more robust and improving our test framework, so I'm listening when he
> voices quality concerns.
>
> The argument against removing/raising the limit that resonates with me
> the most is that it is a one-way door. As MikeS highlighted earlier on
> this thread, implementations may want to take advantage of the fact
> that there is a limit at some point too. This is why I don't want to
> remove the limit and would prefer a slight increase, such as 2048 as
> suggested in the original issue, which would enable most of the things
> that users who have been asking about raising the limit would like to
> do.
>
> I agree that the merge-time memory usage and slow indexing rate are
> not great. But it's still possible to index multi-million vector
> datasets with a 4GB heap without hitting OOMEs regardless of the
> number of dimensions, and the feedback I'm seeing is that many users
> are still interested in indexing multi-million vector datasets despite
> the slow indexing rate. I wish we could do better, and vector indexing
> is certainly more expert than text indexing, but it still is usable in
> my opinion. I understand how giving Lucene more information about
> vectors prior to indexing (e.g. clustering information as Jim pointed
> out) could help make merging faster and more memory-efficient, but I
> would really like to avoid making it a requirement for indexing
> vectors as it also makes this feature much harder to use.
>
> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
> >
> > I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
> >
> > The limit as far as I know is literally just raising an exception.
> > Removing it won't alter in any way the current performance for users in low dimensional space.
> > Removing it will just enable more users to use Lucene.
> >
> > If new users in certain situations will be unhappy with the performance, they may contribute improvements.
> > This is how you make progress.
> >
> > If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
> >
> > To me it's really a no brainer.
> > Removing the limit and enable people to use high dimensional vectors will take minutes.
> > Improving the hnsw implementation can take months.
> > Pick one to begin with...
> >
> > And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
> >
> >
> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> >>
> >> I disagree with your categorization. I put in plenty of work and
> >> experienced plenty of pain myself, writing tests and fighting these
> >> issues, after i saw that, two releases in a row, vector indexing fell
> >> over and hit integer overflows etc on small datasets:
> >>
> >> https://github.com/apache/lucene/pull/11905
> >>
> >> Attacking me isn't helping the situation.
> >>
> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
> >> any kind of demeaning fashion really. I meant to describe the current
> >> state of usability with respect to indexing a few million docs with
> >> high dimensions. You can scroll up the thread and see that at least
> >> one other committer on the project experienced similar pain as me.
> >> Then, think about users who aren't committers trying to use the
> >> functionality!
> >>
> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
> >> >
> >> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
> >> >
> >> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
> >> >
> >> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
> >> >
> >> > Then you complain about people not meeting you half way. Wow
> >> >
> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
> >> >>
> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> >> >> <michael.wechner@wyona.com> wrote:
> >> >> >
> >> >> > What exactly do you consider reasonable?
> >> >>
> >> >> Let's begin a real discussion by being HONEST about the current
> >> >> status. Please put politically correct or your own company's wishes
> >> >> aside, we know it's not in a good state.
> >> >>
> >> >> Current status is the one guy who wrote the code can set a
> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> >> >> dimensions in HOURS (i didn't ask what hardware).
> >> >>
> >> >> My concerns are everyone else except the one guy, I want it to be
> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> >> >> ram buffer and bigger heap to avoid OOM on merge.
> >> >> It is also a permanent backwards compatibility decision, we have to
> >> >> support it once we do this and we can't just say "oops" and flip it
> >> >> back.
> >> >>
> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
> >> >> or if its to avoid merges so it doesn't hit OOM.
> >> >> Also from personal experience, it takes trial and error (means
> >> >> experiencing OOM on merge!!!) before you get those heap values correct
> >> >> for your dataset. This usually means starting over which is
> >> >> frustrating and wastes more time.
> >> >>
> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
> >> >> avoided in this way and performance improved by writing bigger
> >> >> segments with lucene's defaults. But this doesn't mean we can simply
> >> >> ignore the horrors of what happens on merge. merging needs to scale so
> >> >> that indexing really scales.
> >> >>
> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> >> >> fashion when indexing.
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 9, 2023, 3:25 AM

Post #70 of 99 (175 views)

Yes, its very clear that folks on this thread are ignoring reason
entirely and completely swooned by chatgpt-hype.
And what happens when they make chatgpt-8 that uses even more dimensions?
backwards compatibility decisions can't be made by garbage hype such
as cryptocurrency or chatgpt.
Trying to convince me we should bump it because of chatgpt, well, i
think it has the opposite effect.

Please, lemme see real technical arguments why this limit needs to be
bumped. not including trash like chatgpt.

On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com> wrote:
>
> Given the massive amounts of funding going into the development and investigation of the project, I think it would be good to at least have Lucene be a part of the conversation. Simply because academics typically focus on vectors <= 784 dimensions does not mean all users will. A large swathe of very important users of the Lucene project never exceed 500k documents, though they are shifting to other search engines to try out very popular embeddings.
>
> I think giving our users the opportunity to build chat bots or LLM memory machines using Lucene is a positive development, even if some datasets won't be able to work well. We don't limit the number of fields someone can add in most cases, though we did just undeprecate that API to better support multi-tenancy. But people still add so many fields and can crash their clusters with mapping explosions when unlimited. The limit to vectors feels similar. I expect more people to dig into Lucene due to its openness and robustness as they run into problems. Today, they are forced to consider other engines that are more permissive.
>
> Not everyone important or valuable Lucene workload is in the millions of documents. Many of them only have lots of queries or computationally expensive access patterns for B-trees. We can document that it is very ill-advised to make a deployment with vectors too large. What others will do with it is on them.
>
>
> On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>>
>> As Dawid pointed out earlier on this thread, this is the rule for
>> Apache projects: a single -1 vote on a code change is a veto and
>> cannot be overridden. Furthermore, Robert is one of the people on this
>> project who worked the most on debugging subtle bugs, making Lucene
>> more robust and improving our test framework, so I'm listening when he
>> voices quality concerns.
>>
>> The argument against removing/raising the limit that resonates with me
>> the most is that it is a one-way door. As MikeS highlighted earlier on
>> this thread, implementations may want to take advantage of the fact
>> that there is a limit at some point too. This is why I don't want to
>> remove the limit and would prefer a slight increase, such as 2048 as
>> suggested in the original issue, which would enable most of the things
>> that users who have been asking about raising the limit would like to
>> do.
>>
>> I agree that the merge-time memory usage and slow indexing rate are
>> not great. But it's still possible to index multi-million vector
>> datasets with a 4GB heap without hitting OOMEs regardless of the
>> number of dimensions, and the feedback I'm seeing is that many users
>> are still interested in indexing multi-million vector datasets despite
>> the slow indexing rate. I wish we could do better, and vector indexing
>> is certainly more expert than text indexing, but it still is usable in
>> my opinion. I understand how giving Lucene more information about
>> vectors prior to indexing (e.g. clustering information as Jim pointed
>> out) could help make merging faster and more memory-efficient, but I
>> would really like to avoid making it a requirement for indexing
>> vectors as it also makes this feature much harder to use.
>>
>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>> >
>> > I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>> >
>> > The limit as far as I know is literally just raising an exception.
>> > Removing it won't alter in any way the current performance for users in low dimensional space.
>> > Removing it will just enable more users to use Lucene.
>> >
>> > If new users in certain situations will be unhappy with the performance, they may contribute improvements.
>> > This is how you make progress.
>> >
>> > If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>> >
>> > To me it's really a no brainer.
>> > Removing the limit and enable people to use high dimensional vectors will take minutes.
>> > Improving the hnsw implementation can take months.
>> > Pick one to begin with...
>> >
>> > And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>> >
>> >
>> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>> >>
>> >> I disagree with your categorization. I put in plenty of work and
>> >> experienced plenty of pain myself, writing tests and fighting these
>> >> issues, after i saw that, two releases in a row, vector indexing fell
>> >> over and hit integer overflows etc on small datasets:
>> >>
>> >> https://github.com/apache/lucene/pull/11905
>> >>
>> >> Attacking me isn't helping the situation.
>> >>
>> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> >> any kind of demeaning fashion really. I meant to describe the current
>> >> state of usability with respect to indexing a few million docs with
>> >> high dimensions. You can scroll up the thread and see that at least
>> >> one other committer on the project experienced similar pain as me.
>> >> Then, think about users who aren't committers trying to use the
>> >> functionality!
>> >>
>> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>> >> >
>> >> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>> >> >
>> >> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>> >> >
>> >> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>> >> >
>> >> > Then you complain about people not meeting you half way. Wow
>> >> >
>> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>> >> >>
>> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> >> >> <michael.wechner@wyona.com> wrote:
>> >> >> >
>> >> >> > What exactly do you consider reasonable?
>> >> >>
>> >> >> Let's begin a real discussion by being HONEST about the current
>> >> >> status. Please put politically correct or your own company's wishes
>> >> >> aside, we know it's not in a good state.
>> >> >>
>> >> >> Current status is the one guy who wrote the code can set a
>> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> >> >> dimensions in HOURS (i didn't ask what hardware).
>> >> >>
>> >> >> My concerns are everyone else except the one guy, I want it to be
>> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>> >> >> ram buffer and bigger heap to avoid OOM on merge.
>> >> >> It is also a permanent backwards compatibility decision, we have to
>> >> >> support it once we do this and we can't just say "oops" and flip it
>> >> >> back.
>> >> >>
>> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
>> >> >> or if its to avoid merges so it doesn't hit OOM.
>> >> >> Also from personal experience, it takes trial and error (means
>> >> >> experiencing OOM on merge!!!) before you get those heap values correct
>> >> >> for your dataset. This usually means starting over which is
>> >> >> frustrating and wastes more time.
>> >> >>
>> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> >> >> avoided in this way and performance improved by writing bigger
>> >> >> segments with lucene's defaults. But this doesn't mean we can simply
>> >> >> ignore the horrors of what happens on merge. merging needs to scale so
>> >> >> that indexing really scales.
>> >> >>
>> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> >> >> fashion when indexing.
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>>
>> --
>> Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>
> --
> Marcus Eagan
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 9, 2023, 3:58 AM

Post #71 of 99 (175 views)

Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
LIBRARY. not a vector database or whatever trash is being proposed
here.

i think we should table this and revisit it after chatgpt hype has dissipated.

this hype is causing ppl to behave irrationally, it is why i can't
converse with basically anyone on this thread because they are all
stating crazy things that don't make sense.

On Sun, Apr 9, 2023 at 6:25?AM Robert Muir <rcmuir@gmail.com> wrote:
>
> Yes, its very clear that folks on this thread are ignoring reason
> entirely and completely swooned by chatgpt-hype.
> And what happens when they make chatgpt-8 that uses even more dimensions?
> backwards compatibility decisions can't be made by garbage hype such
> as cryptocurrency or chatgpt.
> Trying to convince me we should bump it because of chatgpt, well, i
> think it has the opposite effect.
>
> Please, lemme see real technical arguments why this limit needs to be
> bumped. not including trash like chatgpt.
>
> On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com> wrote:
> >
> > Given the massive amounts of funding going into the development and investigation of the project, I think it would be good to at least have Lucene be a part of the conversation. Simply because academics typically focus on vectors <= 784 dimensions does not mean all users will. A large swathe of very important users of the Lucene project never exceed 500k documents, though they are shifting to other search engines to try out very popular embeddings.
> >
> > I think giving our users the opportunity to build chat bots or LLM memory machines using Lucene is a positive development, even if some datasets won't be able to work well. We don't limit the number of fields someone can add in most cases, though we did just undeprecate that API to better support multi-tenancy. But people still add so many fields and can crash their clusters with mapping explosions when unlimited. The limit to vectors feels similar. I expect more people to dig into Lucene due to its openness and robustness as they run into problems. Today, they are forced to consider other engines that are more permissive.
> >
> > Not everyone important or valuable Lucene workload is in the millions of documents. Many of them only have lots of queries or computationally expensive access patterns for B-trees. We can document that it is very ill-advised to make a deployment with vectors too large. What others will do with it is on them.
> >
> >
> > On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
> >>
> >> As Dawid pointed out earlier on this thread, this is the rule for
> >> Apache projects: a single -1 vote on a code change is a veto and
> >> cannot be overridden. Furthermore, Robert is one of the people on this
> >> project who worked the most on debugging subtle bugs, making Lucene
> >> more robust and improving our test framework, so I'm listening when he
> >> voices quality concerns.
> >>
> >> The argument against removing/raising the limit that resonates with me
> >> the most is that it is a one-way door. As MikeS highlighted earlier on
> >> this thread, implementations may want to take advantage of the fact
> >> that there is a limit at some point too. This is why I don't want to
> >> remove the limit and would prefer a slight increase, such as 2048 as
> >> suggested in the original issue, which would enable most of the things
> >> that users who have been asking about raising the limit would like to
> >> do.
> >>
> >> I agree that the merge-time memory usage and slow indexing rate are
> >> not great. But it's still possible to index multi-million vector
> >> datasets with a 4GB heap without hitting OOMEs regardless of the
> >> number of dimensions, and the feedback I'm seeing is that many users
> >> are still interested in indexing multi-million vector datasets despite
> >> the slow indexing rate. I wish we could do better, and vector indexing
> >> is certainly more expert than text indexing, but it still is usable in
> >> my opinion. I understand how giving Lucene more information about
> >> vectors prior to indexing (e.g. clustering information as Jim pointed
> >> out) could help make merging faster and more memory-efficient, but I
> >> would really like to avoid making it a requirement for indexing
> >> vectors as it also makes this feature much harder to use.
> >>
> >> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> >> <a.benedetti@sease.io> wrote:
> >> >
> >> > I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
> >> >
> >> > The limit as far as I know is literally just raising an exception.
> >> > Removing it won't alter in any way the current performance for users in low dimensional space.
> >> > Removing it will just enable more users to use Lucene.
> >> >
> >> > If new users in certain situations will be unhappy with the performance, they may contribute improvements.
> >> > This is how you make progress.
> >> >
> >> > If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
> >> >
> >> > To me it's really a no brainer.
> >> > Removing the limit and enable people to use high dimensional vectors will take minutes.
> >> > Improving the hnsw implementation can take months.
> >> > Pick one to begin with...
> >> >
> >> > And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
> >> >
> >> >
> >> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> >> >>
> >> >> I disagree with your categorization. I put in plenty of work and
> >> >> experienced plenty of pain myself, writing tests and fighting these
> >> >> issues, after i saw that, two releases in a row, vector indexing fell
> >> >> over and hit integer overflows etc on small datasets:
> >> >>
> >> >> https://github.com/apache/lucene/pull/11905
> >> >>
> >> >> Attacking me isn't helping the situation.
> >> >>
> >> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
> >> >> any kind of demeaning fashion really. I meant to describe the current
> >> >> state of usability with respect to indexing a few million docs with
> >> >> high dimensions. You can scroll up the thread and see that at least
> >> >> one other committer on the project experienced similar pain as me.
> >> >> Then, think about users who aren't committers trying to use the
> >> >> functionality!
> >> >>
> >> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
> >> >> >
> >> >> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
> >> >> >
> >> >> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
> >> >> >
> >> >> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
> >> >> >
> >> >> > Then you complain about people not meeting you half way. Wow
> >> >> >
> >> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
> >> >> >>
> >> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> >> >> >> <michael.wechner@wyona.com> wrote:
> >> >> >> >
> >> >> >> > What exactly do you consider reasonable?
> >> >> >>
> >> >> >> Let's begin a real discussion by being HONEST about the current
> >> >> >> status. Please put politically correct or your own company's wishes
> >> >> >> aside, we know it's not in a good state.
> >> >> >>
> >> >> >> Current status is the one guy who wrote the code can set a
> >> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> >> >> >> dimensions in HOURS (i didn't ask what hardware).
> >> >> >>
> >> >> >> My concerns are everyone else except the one guy, I want it to be
> >> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> >> >> >> ram buffer and bigger heap to avoid OOM on merge.
> >> >> >> It is also a permanent backwards compatibility decision, we have to
> >> >> >> support it once we do this and we can't just say "oops" and flip it
> >> >> >> back.
> >> >> >>
> >> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> >> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
> >> >> >> or if its to avoid merges so it doesn't hit OOM.
> >> >> >> Also from personal experience, it takes trial and error (means
> >> >> >> experiencing OOM on merge!!!) before you get those heap values correct
> >> >> >> for your dataset. This usually means starting over which is
> >> >> >> frustrating and wastes more time.
> >> >> >>
> >> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> >> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
> >> >> >> avoided in this way and performance improved by writing bigger
> >> >> >> segments with lucene's defaults. But this doesn't mean we can simply
> >> >> >> ignore the horrors of what happens on merge. merging needs to scale so
> >> >> >> that indexing really scales.
> >> >> >>
> >> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> >> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> >> >> >> fashion when indexing.
> >> >> >>
> >> >> >> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >>
> >>
> >> --
> >> Adrien
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
> > --
> > Marcus Eagan
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 9, 2023, 4:37 AM

Post #72 of 99 (175 views)

I don't think this tone and language is appropriate for a community of
volunteers and men of science.

I personally find offensive to generalise Lucene people here to be "crazy
people hyped about chatGPT".

I personally don't give a damn about chatGPT except the fact it is a very
interesting technology.

As usual I see very little motivation and a lot of "convince me".
We're discussing here about a limit that raises an exception.

Improving performance is absolutely important and no-one here is saying we
won't address it, it's just a separate discussion.

On Sun, 9 Apr 2023, 12:59 Robert Muir, <rcmuir@gmail.com> wrote:

> Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
> LIBRARY. not a vector database or whatever trash is being proposed
> here.
>
> i think we should table this and revisit it after chatgpt hype has
> dissipated.
>
> this hype is causing ppl to behave irrationally, it is why i can't
> converse with basically anyone on this thread because they are all
> stating crazy things that don't make sense.
>
> On Sun, Apr 9, 2023 at 6:25?AM Robert Muir <rcmuir@gmail.com> wrote:
> >
> > Yes, its very clear that folks on this thread are ignoring reason
> > entirely and completely swooned by chatgpt-hype.
> > And what happens when they make chatgpt-8 that uses even more dimensions?
> > backwards compatibility decisions can't be made by garbage hype such
> > as cryptocurrency or chatgpt.
> > Trying to convince me we should bump it because of chatgpt, well, i
> > think it has the opposite effect.
> >
> > Please, lemme see real technical arguments why this limit needs to be
> > bumped. not including trash like chatgpt.
> >
> > On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com>
> wrote:
> > >
> > > Given the massive amounts of funding going into the development and
> investigation of the project, I think it would be good to at least have
> Lucene be a part of the conversation. Simply because academics typically
> focus on vectors <= 784 dimensions does not mean all users will. A large
> swathe of very important users of the Lucene project never exceed 500k
> documents, though they are shifting to other search engines to try out very
> popular embeddings.
> > >
> > > I think giving our users the opportunity to build chat bots or LLM
> memory machines using Lucene is a positive development, even if some
> datasets won't be able to work well. We don't limit the number of fields
> someone can add in most cases, though we did just undeprecate that API to
> better support multi-tenancy. But people still add so many fields and can
> crash their clusters with mapping explosions when unlimited. The limit to
> vectors feels similar. I expect more people to dig into Lucene due to its
> openness and robustness as they run into problems. Today, they are forced
> to consider other engines that are more permissive.
> > >
> > > Not everyone important or valuable Lucene workload is in the millions
> of documents. Many of them only have lots of queries or computationally
> expensive access patterns for B-trees. We can document that it is very
> ill-advised to make a deployment with vectors too large. What others will
> do with it is on them.
> > >
> > >
> > > On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
> > >>
> > >> As Dawid pointed out earlier on this thread, this is the rule for
> > >> Apache projects: a single -1 vote on a code change is a veto and
> > >> cannot be overridden. Furthermore, Robert is one of the people on this
> > >> project who worked the most on debugging subtle bugs, making Lucene
> > >> more robust and improving our test framework, so I'm listening when he
> > >> voices quality concerns.
> > >>
> > >> The argument against removing/raising the limit that resonates with me
> > >> the most is that it is a one-way door. As MikeS highlighted earlier on
> > >> this thread, implementations may want to take advantage of the fact
> > >> that there is a limit at some point too. This is why I don't want to
> > >> remove the limit and would prefer a slight increase, such as 2048 as
> > >> suggested in the original issue, which would enable most of the things
> > >> that users who have been asking about raising the limit would like to
> > >> do.
> > >>
> > >> I agree that the merge-time memory usage and slow indexing rate are
> > >> not great. But it's still possible to index multi-million vector
> > >> datasets with a 4GB heap without hitting OOMEs regardless of the
> > >> number of dimensions, and the feedback I'm seeing is that many users
> > >> are still interested in indexing multi-million vector datasets despite
> > >> the slow indexing rate. I wish we could do better, and vector indexing
> > >> is certainly more expert than text indexing, but it still is usable in
> > >> my opinion. I understand how giving Lucene more information about
> > >> vectors prior to indexing (e.g. clustering information as Jim pointed
> > >> out) could help make merging faster and more memory-efficient, but I
> > >> would really like to avoid making it a requirement for indexing
> > >> vectors as it also makes this feature much harder to use.
> > >>
> > >> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> > >> <a.benedetti@sease.io> wrote:
> > >> >
> > >> > I am very attentive to listen opinions but I am un-convinced here
> and I an not sure that a single person opinion should be allowed to be
> detrimental for such an important project.
> > >> >
> > >> > The limit as far as I know is literally just raising an exception.
> > >> > Removing it won't alter in any way the current performance for
> users in low dimensional space.
> > >> > Removing it will just enable more users to use Lucene.
> > >> >
> > >> > If new users in certain situations will be unhappy with the
> performance, they may contribute improvements.
> > >> > This is how you make progress.
> > >> >
> > >> > If it's a reputation thing, trust me that not allowing users to
> play with high dimensional space will equally damage it.
> > >> >
> > >> > To me it's really a no brainer.
> > >> > Removing the limit and enable people to use high dimensional
> vectors will take minutes.
> > >> > Improving the hnsw implementation can take months.
> > >> > Pick one to begin with...
> > >> >
> > >> > And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
> > >> >
> > >> >
> > >> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> > >> >>
> > >> >> I disagree with your categorization. I put in plenty of work and
> > >> >> experienced plenty of pain myself, writing tests and fighting these
> > >> >> issues, after i saw that, two releases in a row, vector indexing
> fell
> > >> >> over and hit integer overflows etc on small datasets:
> > >> >>
> > >> >> https://github.com/apache/lucene/pull/11905
> > >> >>
> > >> >> Attacking me isn't helping the situation.
> > >> >>
> > >> >> PS: when i said the "one guy who wrote the code" I didn't mean it
> in
> > >> >> any kind of demeaning fashion really. I meant to describe the
> current
> > >> >> state of usability with respect to indexing a few million docs with
> > >> >> high dimensions. You can scroll up the thread and see that at least
> > >> >> one other committer on the project experienced similar pain as me.
> > >> >> Then, think about users who aren't committers trying to use the
> > >> >> functionality!
> > >> >>
> > >> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
> msokolov@gmail.com> wrote:
> > >> >> >
> > >> >> > What you said about increasing dimensions requiring a bigger ram
> buffer on merge is wrong. That's the point I was trying to make. Your
> concerns about merge costs are not wrong, but your conclusion that we need
> to limit dimensions is not justified.
> > >> >> >
> > >> >> > You complain that hnsw sucks it doesn't scale, but when I show
> it scales linearly with dimension you just ignore that and complain about
> something entirely different.
> > >> >> >
> > >> >> > You demand that people run all kinds of tests to prove you wrong
> but when they do, you don't listen and you won't put in the work yourself
> or complain that it's too hard.
> > >> >> >
> > >> >> > Then you complain about people not meeting you half way. Wow
> > >> >> >
> > >> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
> wrote:
> > >> >> >>
> > >> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> > >> >> >> <michael.wechner@wyona.com> wrote:
> > >> >> >> >
> > >> >> >> > What exactly do you consider reasonable?
> > >> >> >>
> > >> >> >> Let's begin a real discussion by being HONEST about the current
> > >> >> >> status. Please put politically correct or your own company's
> wishes
> > >> >> >> aside, we know it's not in a good state.
> > >> >> >>
> > >> >> >> Current status is the one guy who wrote the code can set a
> > >> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> > >> >> >> dimensions in HOURS (i didn't ask what hardware).
> > >> >> >>
> > >> >> >> My concerns are everyone else except the one guy, I want it to
> be
> > >> >> >> usable. Increasing dimensions just means even bigger
> multi-gigabyte
> > >> >> >> ram buffer and bigger heap to avoid OOM on merge.
> > >> >> >> It is also a permanent backwards compatibility decision, we
> have to
> > >> >> >> support it once we do this and we can't just say "oops" and
> flip it
> > >> >> >> back.
> > >> >> >>
> > >> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really
> to
> > >> >> >> avoid merges because they are so slow and it would be DAYS
> otherwise,
> > >> >> >> or if its to avoid merges so it doesn't hit OOM.
> > >> >> >> Also from personal experience, it takes trial and error (means
> > >> >> >> experiencing OOM on merge!!!) before you get those heap values
> correct
> > >> >> >> for your dataset. This usually means starting over which is
> > >> >> >> frustrating and wastes more time.
> > >> >> >>
> > >> >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
> seems
> > >> >> >> to me like its a good idea. maybe the multigigabyte ram buffer
> can be
> > >> >> >> avoided in this way and performance improved by writing bigger
> > >> >> >> segments with lucene's defaults. But this doesn't mean we can
> simply
> > >> >> >> ignore the horrors of what happens on merge. merging needs to
> scale so
> > >> >> >> that indexing really scales.
> > >> >> >>
> > >> >> >> At least it shouldnt spike RAM on trivial data amounts and
> cause OOM,
> > >> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> > >> >> >> fashion when indexing.
> > >> >> >>
> > >> >> >>
> ---------------------------------------------------------------------
> > >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >> >> >>
> > >> >>
> > >> >>
> ---------------------------------------------------------------------
> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >> >>
> > >>
> > >>
> > >> --
> > >> Adrien
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>
> > >
> > >
> > > --
> > > Marcus Eagan
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

serera at gmail

Apr 9, 2023, 4:45 AM

Post #73 of 99 (175 views)

Putting ChatGPT aside, what are the implications of (1) removing the limit,
or (2) increasing the limit, or (3) make it configurable at the app's
discretion? The configuration can even be in the form of a VectorEncoder
impl which will decide on the size of the vectors, thereby making it
clearer that this is an expert setting and puts it at the hands of the app
to decide how to handle those large vectors.

Will bigger vectors require an algorithmic change (I understand that it
might benefit from one, I ask if it's required besides performance gains)?
If not, then why do you object to making it an "app problem"? If 2048 dims
vectors require 2GB IW RAM buffer, what's wrong with documenting it and
letting the app choose whether they want/can do it or not?

Do we limit the size of stored fields, or binary doc values? Do we prevent
anyone from using a Codec which loads these big byte[] into memory?

> Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
> LIBRARY. not a vector database or whatever trash is being proposed
> here.

The problem being discussed *is* related to search, not lexical search, but
rather semantic search, or search in vector space, however you want to call
it. Are you saying that Lucene should only focus on lexical search
scenarios? Or are you saying that semantic search scenarios don't need to
index >1024 dimension vectors in order to produce high quality results?

I personally don't understand why not letting apps index bigger vectors, if
all that it takes is using bigger RAM buffers. Improvements will come
later, especially as more and more applications will try it. If we prevent
it, then we might never see these improvements cause no one will even
attempt to do it with Lucene, and IMO it's not a direction we want to head.
While ChatGPT itself might be a hype, I don't think that big vectors are,
and if the only technical reason we have for not supporting them is a
bigger RAM buffer, then I think we should allow it.

On Sun, Apr 9, 2023 at 1:59?PM Robert Muir <rcmuir@gmail.com> wrote:

> Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
> LIBRARY. not a vector database or whatever trash is being proposed
> here.
>
> i think we should table this and revisit it after chatgpt hype has
> dissipated.
>
> this hype is causing ppl to behave irrationally, it is why i can't
> converse with basically anyone on this thread because they are all
> stating crazy things that don't make sense.
>
> On Sun, Apr 9, 2023 at 6:25?AM Robert Muir <rcmuir@gmail.com> wrote:
> >
> > Yes, its very clear that folks on this thread are ignoring reason
> > entirely and completely swooned by chatgpt-hype.
> > And what happens when they make chatgpt-8 that uses even more dimensions?
> > backwards compatibility decisions can't be made by garbage hype such
> > as cryptocurrency or chatgpt.
> > Trying to convince me we should bump it because of chatgpt, well, i
> > think it has the opposite effect.
> >
> > Please, lemme see real technical arguments why this limit needs to be
> > bumped. not including trash like chatgpt.
> >
> > On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com>
> wrote:
> > >
> > > Given the massive amounts of funding going into the development and
> investigation of the project, I think it would be good to at least have
> Lucene be a part of the conversation. Simply because academics typically
> focus on vectors <= 784 dimensions does not mean all users will. A large
> swathe of very important users of the Lucene project never exceed 500k
> documents, though they are shifting to other search engines to try out very
> popular embeddings.
> > >
> > > I think giving our users the opportunity to build chat bots or LLM
> memory machines using Lucene is a positive development, even if some
> datasets won't be able to work well. We don't limit the number of fields
> someone can add in most cases, though we did just undeprecate that API to
> better support multi-tenancy. But people still add so many fields and can
> crash their clusters with mapping explosions when unlimited. The limit to
> vectors feels similar. I expect more people to dig into Lucene due to its
> openness and robustness as they run into problems. Today, they are forced
> to consider other engines that are more permissive.
> > >
> > > Not everyone important or valuable Lucene workload is in the millions
> of documents. Many of them only have lots of queries or computationally
> expensive access patterns for B-trees. We can document that it is very
> ill-advised to make a deployment with vectors too large. What others will
> do with it is on them.
> > >
> > >
> > > On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
> > >>
> > >> As Dawid pointed out earlier on this thread, this is the rule for
> > >> Apache projects: a single -1 vote on a code change is a veto and
> > >> cannot be overridden. Furthermore, Robert is one of the people on this
> > >> project who worked the most on debugging subtle bugs, making Lucene
> > >> more robust and improving our test framework, so I'm listening when he
> > >> voices quality concerns.
> > >>
> > >> The argument against removing/raising the limit that resonates with me
> > >> the most is that it is a one-way door. As MikeS highlighted earlier on
> > >> this thread, implementations may want to take advantage of the fact
> > >> that there is a limit at some point too. This is why I don't want to
> > >> remove the limit and would prefer a slight increase, such as 2048 as
> > >> suggested in the original issue, which would enable most of the things
> > >> that users who have been asking about raising the limit would like to
> > >> do.
> > >>
> > >> I agree that the merge-time memory usage and slow indexing rate are
> > >> not great. But it's still possible to index multi-million vector
> > >> datasets with a 4GB heap without hitting OOMEs regardless of the
> > >> number of dimensions, and the feedback I'm seeing is that many users
> > >> are still interested in indexing multi-million vector datasets despite
> > >> the slow indexing rate. I wish we could do better, and vector indexing
> > >> is certainly more expert than text indexing, but it still is usable in
> > >> my opinion. I understand how giving Lucene more information about
> > >> vectors prior to indexing (e.g. clustering information as Jim pointed
> > >> out) could help make merging faster and more memory-efficient, but I
> > >> would really like to avoid making it a requirement for indexing
> > >> vectors as it also makes this feature much harder to use.
> > >>
> > >> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> > >> <a.benedetti@sease.io> wrote:
> > >> >
> > >> > I am very attentive to listen opinions but I am un-convinced here
> and I an not sure that a single person opinion should be allowed to be
> detrimental for such an important project.
> > >> >
> > >> > The limit as far as I know is literally just raising an exception.
> > >> > Removing it won't alter in any way the current performance for
> users in low dimensional space.
> > >> > Removing it will just enable more users to use Lucene.
> > >> >
> > >> > If new users in certain situations will be unhappy with the
> performance, they may contribute improvements.
> > >> > This is how you make progress.
> > >> >
> > >> > If it's a reputation thing, trust me that not allowing users to
> play with high dimensional space will equally damage it.
> > >> >
> > >> > To me it's really a no brainer.
> > >> > Removing the limit and enable people to use high dimensional
> vectors will take minutes.
> > >> > Improving the hnsw implementation can take months.
> > >> > Pick one to begin with...
> > >> >
> > >> > And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
> > >> >
> > >> >
> > >> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> > >> >>
> > >> >> I disagree with your categorization. I put in plenty of work and
> > >> >> experienced plenty of pain myself, writing tests and fighting these
> > >> >> issues, after i saw that, two releases in a row, vector indexing
> fell
> > >> >> over and hit integer overflows etc on small datasets:
> > >> >>
> > >> >> https://github.com/apache/lucene/pull/11905
> > >> >>
> > >> >> Attacking me isn't helping the situation.
> > >> >>
> > >> >> PS: when i said the "one guy who wrote the code" I didn't mean it
> in
> > >> >> any kind of demeaning fashion really. I meant to describe the
> current
> > >> >> state of usability with respect to indexing a few million docs with
> > >> >> high dimensions. You can scroll up the thread and see that at least
> > >> >> one other committer on the project experienced similar pain as me.
> > >> >> Then, think about users who aren't committers trying to use the
> > >> >> functionality!
> > >> >>
> > >> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
> msokolov@gmail.com> wrote:
> > >> >> >
> > >> >> > What you said about increasing dimensions requiring a bigger ram
> buffer on merge is wrong. That's the point I was trying to make. Your
> concerns about merge costs are not wrong, but your conclusion that we need
> to limit dimensions is not justified.
> > >> >> >
> > >> >> > You complain that hnsw sucks it doesn't scale, but when I show
> it scales linearly with dimension you just ignore that and complain about
> something entirely different.
> > >> >> >
> > >> >> > You demand that people run all kinds of tests to prove you wrong
> but when they do, you don't listen and you won't put in the work yourself
> or complain that it's too hard.
> > >> >> >
> > >> >> > Then you complain about people not meeting you half way. Wow
> > >> >> >
> > >> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
> wrote:
> > >> >> >>
> > >> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> > >> >> >> <michael.wechner@wyona.com> wrote:
> > >> >> >> >
> > >> >> >> > What exactly do you consider reasonable?
> > >> >> >>
> > >> >> >> Let's begin a real discussion by being HONEST about the current
> > >> >> >> status. Please put politically correct or your own company's
> wishes
> > >> >> >> aside, we know it's not in a good state.
> > >> >> >>
> > >> >> >> Current status is the one guy who wrote the code can set a
> > >> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> > >> >> >> dimensions in HOURS (i didn't ask what hardware).
> > >> >> >>
> > >> >> >> My concerns are everyone else except the one guy, I want it to
> be
> > >> >> >> usable. Increasing dimensions just means even bigger
> multi-gigabyte
> > >> >> >> ram buffer and bigger heap to avoid OOM on merge.
> > >> >> >> It is also a permanent backwards compatibility decision, we
> have to
> > >> >> >> support it once we do this and we can't just say "oops" and
> flip it
> > >> >> >> back.
> > >> >> >>
> > >> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really
> to
> > >> >> >> avoid merges because they are so slow and it would be DAYS
> otherwise,
> > >> >> >> or if its to avoid merges so it doesn't hit OOM.
> > >> >> >> Also from personal experience, it takes trial and error (means
> > >> >> >> experiencing OOM on merge!!!) before you get those heap values
> correct
> > >> >> >> for your dataset. This usually means starting over which is
> > >> >> >> frustrating and wastes more time.
> > >> >> >>
> > >> >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
> seems
> > >> >> >> to me like its a good idea. maybe the multigigabyte ram buffer
> can be
> > >> >> >> avoided in this way and performance improved by writing bigger
> > >> >> >> segments with lucene's defaults. But this doesn't mean we can
> simply
> > >> >> >> ignore the horrors of what happens on merge. merging needs to
> scale so
> > >> >> >> that indexing really scales.
> > >> >> >>
> > >> >> >> At least it shouldnt spike RAM on trivial data amounts and
> cause OOM,
> > >> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> > >> >> >> fashion when indexing.
> > >> >> >>
> > >> >> >>
> ---------------------------------------------------------------------
> > >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >> >> >>
> > >> >>
> > >> >>
> ---------------------------------------------------------------------
> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >> >>
> > >>
> > >>
> > >> --
> > >> Adrien
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>
> > >
> > >
> > > --
> > > Marcus Eagan
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 9, 2023, 4:47 AM

Post #74 of 99 (175 views)

I don't care. you guys personally attacked me first. And then it turns
out, you were being dishonest the entire time and hiding your true
intent, which was not search at all but instead some chatgpt pyramid
scheme or similar.

i'm done with this thread.

On Sun, Apr 9, 2023 at 7:37?AM Alessandro Benedetti
<a.benedetti@sease.io> wrote:
>
> I don't think this tone and language is appropriate for a community of volunteers and men of science.
>
> I personally find offensive to generalise Lucene people here to be "crazy people hyped about chatGPT".
>
> I personally don't give a damn about chatGPT except the fact it is a very interesting technology.
>
> As usual I see very little motivation and a lot of "convince me".
> We're discussing here about a limit that raises an exception.
>
> Improving performance is absolutely important and no-one here is saying we won't address it, it's just a separate discussion.
>
>
> On Sun, 9 Apr 2023, 12:59 Robert Muir, <rcmuir@gmail.com> wrote:
>>
>> Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
>> LIBRARY. not a vector database or whatever trash is being proposed
>> here.
>>
>> i think we should table this and revisit it after chatgpt hype has dissipated.
>>
>> this hype is causing ppl to behave irrationally, it is why i can't
>> converse with basically anyone on this thread because they are all
>> stating crazy things that don't make sense.
>>
>> On Sun, Apr 9, 2023 at 6:25?AM Robert Muir <rcmuir@gmail.com> wrote:
>> >
>> > Yes, its very clear that folks on this thread are ignoring reason
>> > entirely and completely swooned by chatgpt-hype.
>> > And what happens when they make chatgpt-8 that uses even more dimensions?
>> > backwards compatibility decisions can't be made by garbage hype such
>> > as cryptocurrency or chatgpt.
>> > Trying to convince me we should bump it because of chatgpt, well, i
>> > think it has the opposite effect.
>> >
>> > Please, lemme see real technical arguments why this limit needs to be
>> > bumped. not including trash like chatgpt.
>> >
>> > On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com> wrote:
>> > >
>> > > Given the massive amounts of funding going into the development and investigation of the project, I think it would be good to at least have Lucene be a part of the conversation. Simply because academics typically focus on vectors <= 784 dimensions does not mean all users will. A large swathe of very important users of the Lucene project never exceed 500k documents, though they are shifting to other search engines to try out very popular embeddings.
>> > >
>> > > I think giving our users the opportunity to build chat bots or LLM memory machines using Lucene is a positive development, even if some datasets won't be able to work well. We don't limit the number of fields someone can add in most cases, though we did just undeprecate that API to better support multi-tenancy. But people still add so many fields and can crash their clusters with mapping explosions when unlimited. The limit to vectors feels similar. I expect more people to dig into Lucene due to its openness and robustness as they run into problems. Today, they are forced to consider other engines that are more permissive.
>> > >
>> > > Not everyone important or valuable Lucene workload is in the millions of documents. Many of them only have lots of queries or computationally expensive access patterns for B-trees. We can document that it is very ill-advised to make a deployment with vectors too large. What others will do with it is on them.
>> > >
>> > >
>> > > On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>> > >>
>> > >> As Dawid pointed out earlier on this thread, this is the rule for
>> > >> Apache projects: a single -1 vote on a code change is a veto and
>> > >> cannot be overridden. Furthermore, Robert is one of the people on this
>> > >> project who worked the most on debugging subtle bugs, making Lucene
>> > >> more robust and improving our test framework, so I'm listening when he
>> > >> voices quality concerns.
>> > >>
>> > >> The argument against removing/raising the limit that resonates with me
>> > >> the most is that it is a one-way door. As MikeS highlighted earlier on
>> > >> this thread, implementations may want to take advantage of the fact
>> > >> that there is a limit at some point too. This is why I don't want to
>> > >> remove the limit and would prefer a slight increase, such as 2048 as
>> > >> suggested in the original issue, which would enable most of the things
>> > >> that users who have been asking about raising the limit would like to
>> > >> do.
>> > >>
>> > >> I agree that the merge-time memory usage and slow indexing rate are
>> > >> not great. But it's still possible to index multi-million vector
>> > >> datasets with a 4GB heap without hitting OOMEs regardless of the
>> > >> number of dimensions, and the feedback I'm seeing is that many users
>> > >> are still interested in indexing multi-million vector datasets despite
>> > >> the slow indexing rate. I wish we could do better, and vector indexing
>> > >> is certainly more expert than text indexing, but it still is usable in
>> > >> my opinion. I understand how giving Lucene more information about
>> > >> vectors prior to indexing (e.g. clustering information as Jim pointed
>> > >> out) could help make merging faster and more memory-efficient, but I
>> > >> would really like to avoid making it a requirement for indexing
>> > >> vectors as it also makes this feature much harder to use.
>> > >>
>> > >> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>> > >> <a.benedetti@sease.io> wrote:
>> > >> >
>> > >> > I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>> > >> >
>> > >> > The limit as far as I know is literally just raising an exception.
>> > >> > Removing it won't alter in any way the current performance for users in low dimensional space.
>> > >> > Removing it will just enable more users to use Lucene.
>> > >> >
>> > >> > If new users in certain situations will be unhappy with the performance, they may contribute improvements.
>> > >> > This is how you make progress.
>> > >> >
>> > >> > If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>> > >> >
>> > >> > To me it's really a no brainer.
>> > >> > Removing the limit and enable people to use high dimensional vectors will take minutes.
>> > >> > Improving the hnsw implementation can take months.
>> > >> > Pick one to begin with...
>> > >> >
>> > >> > And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>> > >> >
>> > >> >
>> > >> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>> > >> >>
>> > >> >> I disagree with your categorization. I put in plenty of work and
>> > >> >> experienced plenty of pain myself, writing tests and fighting these
>> > >> >> issues, after i saw that, two releases in a row, vector indexing fell
>> > >> >> over and hit integer overflows etc on small datasets:
>> > >> >>
>> > >> >> https://github.com/apache/lucene/pull/11905
>> > >> >>
>> > >> >> Attacking me isn't helping the situation.
>> > >> >>
>> > >> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> > >> >> any kind of demeaning fashion really. I meant to describe the current
>> > >> >> state of usability with respect to indexing a few million docs with
>> > >> >> high dimensions. You can scroll up the thread and see that at least
>> > >> >> one other committer on the project experienced similar pain as me.
>> > >> >> Then, think about users who aren't committers trying to use the
>> > >> >> functionality!
>> > >> >>
>> > >> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>> > >> >> >
>> > >> >> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>> > >> >> >
>> > >> >> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>> > >> >> >
>> > >> >> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>> > >> >> >
>> > >> >> > Then you complain about people not meeting you half way. Wow
>> > >> >> >
>> > >> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>> > >> >> >>
>> > >> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> > >> >> >> <michael.wechner@wyona.com> wrote:
>> > >> >> >> >
>> > >> >> >> > What exactly do you consider reasonable?
>> > >> >> >>
>> > >> >> >> Let's begin a real discussion by being HONEST about the current
>> > >> >> >> status. Please put politically correct or your own company's wishes
>> > >> >> >> aside, we know it's not in a good state.
>> > >> >> >>
>> > >> >> >> Current status is the one guy who wrote the code can set a
>> > >> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> > >> >> >> dimensions in HOURS (i didn't ask what hardware).
>> > >> >> >>
>> > >> >> >> My concerns are everyone else except the one guy, I want it to be
>> > >> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>> > >> >> >> ram buffer and bigger heap to avoid OOM on merge.
>> > >> >> >> It is also a permanent backwards compatibility decision, we have to
>> > >> >> >> support it once we do this and we can't just say "oops" and flip it
>> > >> >> >> back.
>> > >> >> >>
>> > >> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> > >> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
>> > >> >> >> or if its to avoid merges so it doesn't hit OOM.
>> > >> >> >> Also from personal experience, it takes trial and error (means
>> > >> >> >> experiencing OOM on merge!!!) before you get those heap values correct
>> > >> >> >> for your dataset. This usually means starting over which is
>> > >> >> >> frustrating and wastes more time.
>> > >> >> >>
>> > >> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> > >> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> > >> >> >> avoided in this way and performance improved by writing bigger
>> > >> >> >> segments with lucene's defaults. But this doesn't mean we can simply
>> > >> >> >> ignore the horrors of what happens on merge. merging needs to scale so
>> > >> >> >> that indexing really scales.
>> > >> >> >>
>> > >> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>> > >> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> > >> >> >> fashion when indexing.
>> > >> >> >>
>> > >> >> >> ---------------------------------------------------------------------
>> > >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> > >> >> >>
>> > >> >>
>> > >> >> ---------------------------------------------------------------------
>> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> > >> >>
>> > >>
>> > >>
>> > >> --
>> > >> Adrien
>> > >>
>> > >> ---------------------------------------------------------------------
>> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > >> For additional commands, e-mail: dev-help@lucene.apache.org
>> > >>
>> > >
>> > >
>> > > --
>> > > Marcus Eagan
>> > >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 9, 2023, 4:55 AM

Post #75 of 99 (175 views)

We do have a dataset built from Wikipedia in luceneutil. It comes in 100
and 300 dimensional varieties and can easily enough generate large numbers
of vector documents from the articles data. To go higher we could
concatenate vectors from that and I believe the performance numbers would
be plausible.

On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:

> Can we set up a branch in which the limit is bumped to 2048, then have
> a realistic, free data set (wikipedia sample or something) that has,
> say, 5 million docs and vectors created using public data (glove
> pre-trained embeddings or the like)? We then could run indexing on the
> same hardware with 512, 1024 and 2048 and see what the numbers, limits
> and behavior actually are.
>
> I can help in writing this but not until after Easter.
>
>
> Dawid
>
> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
> >
> > As Dawid pointed out earlier on this thread, this is the rule for
> > Apache projects: a single -1 vote on a code change is a veto and
> > cannot be overridden. Furthermore, Robert is one of the people on this
> > project who worked the most on debugging subtle bugs, making Lucene
> > more robust and improving our test framework, so I'm listening when he
> > voices quality concerns.
> >
> > The argument against removing/raising the limit that resonates with me
> > the most is that it is a one-way door. As MikeS highlighted earlier on
> > this thread, implementations may want to take advantage of the fact
> > that there is a limit at some point too. This is why I don't want to
> > remove the limit and would prefer a slight increase, such as 2048 as
> > suggested in the original issue, which would enable most of the things
> > that users who have been asking about raising the limit would like to
> > do.
> >
> > I agree that the merge-time memory usage and slow indexing rate are
> > not great. But it's still possible to index multi-million vector
> > datasets with a 4GB heap without hitting OOMEs regardless of the
> > number of dimensions, and the feedback I'm seeing is that many users
> > are still interested in indexing multi-million vector datasets despite
> > the slow indexing rate. I wish we could do better, and vector indexing
> > is certainly more expert than text indexing, but it still is usable in
> > my opinion. I understand how giving Lucene more information about
> > vectors prior to indexing (e.g. clustering information as Jim pointed
> > out) could help make merging faster and more memory-efficient, but I
> > would really like to avoid making it a requirement for indexing
> > vectors as it also makes this feature much harder to use.
> >
> > On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> > <a.benedetti@sease.io> wrote:
> > >
> > > I am very attentive to listen opinions but I am un-convinced here and
> I an not sure that a single person opinion should be allowed to be
> detrimental for such an important project.
> > >
> > > The limit as far as I know is literally just raising an exception.
> > > Removing it won't alter in any way the current performance for users
> in low dimensional space.
> > > Removing it will just enable more users to use Lucene.
> > >
> > > If new users in certain situations will be unhappy with the
> performance, they may contribute improvements.
> > > This is how you make progress.
> > >
> > > If it's a reputation thing, trust me that not allowing users to play
> with high dimensional space will equally damage it.
> > >
> > > To me it's really a no brainer.
> > > Removing the limit and enable people to use high dimensional vectors
> will take minutes.
> > > Improving the hnsw implementation can take months.
> > > Pick one to begin with...
> > >
> > > And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
> > >
> > >
> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> > >>
> > >> I disagree with your categorization. I put in plenty of work and
> > >> experienced plenty of pain myself, writing tests and fighting these
> > >> issues, after i saw that, two releases in a row, vector indexing fell
> > >> over and hit integer overflows etc on small datasets:
> > >>
> > >> https://github.com/apache/lucene/pull/11905
> > >>
> > >> Attacking me isn't helping the situation.
> > >>
> > >> PS: when i said the "one guy who wrote the code" I didn't mean it in
> > >> any kind of demeaning fashion really. I meant to describe the current
> > >> state of usability with respect to indexing a few million docs with
> > >> high dimensions. You can scroll up the thread and see that at least
> > >> one other committer on the project experienced similar pain as me.
> > >> Then, think about users who aren't committers trying to use the
> > >> functionality!
> > >>
> > >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> > >> >
> > >> > What you said about increasing dimensions requiring a bigger ram
> buffer on merge is wrong. That's the point I was trying to make. Your
> concerns about merge costs are not wrong, but your conclusion that we need
> to limit dimensions is not justified.
> > >> >
> > >> > You complain that hnsw sucks it doesn't scale, but when I show it
> scales linearly with dimension you just ignore that and complain about
> something entirely different.
> > >> >
> > >> > You demand that people run all kinds of tests to prove you wrong
> but when they do, you don't listen and you won't put in the work yourself
> or complain that it's too hard.
> > >> >
> > >> > Then you complain about people not meeting you half way. Wow
> > >> >
> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
> > >> >>
> > >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> > >> >> <michael.wechner@wyona.com> wrote:
> > >> >> >
> > >> >> > What exactly do you consider reasonable?
> > >> >>
> > >> >> Let's begin a real discussion by being HONEST about the current
> > >> >> status. Please put politically correct or your own company's wishes
> > >> >> aside, we know it's not in a good state.
> > >> >>
> > >> >> Current status is the one guy who wrote the code can set a
> > >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> > >> >> dimensions in HOURS (i didn't ask what hardware).
> > >> >>
> > >> >> My concerns are everyone else except the one guy, I want it to be
> > >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> > >> >> ram buffer and bigger heap to avoid OOM on merge.
> > >> >> It is also a permanent backwards compatibility decision, we have to
> > >> >> support it once we do this and we can't just say "oops" and flip it
> > >> >> back.
> > >> >>
> > >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> > >> >> avoid merges because they are so slow and it would be DAYS
> otherwise,
> > >> >> or if its to avoid merges so it doesn't hit OOM.
> > >> >> Also from personal experience, it takes trial and error (means
> > >> >> experiencing OOM on merge!!!) before you get those heap values
> correct
> > >> >> for your dataset. This usually means starting over which is
> > >> >> frustrating and wastes more time.
> > >> >>
> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
> seems
> > >> >> to me like its a good idea. maybe the multigigabyte ram buffer can
> be
> > >> >> avoided in this way and performance improved by writing bigger
> > >> >> segments with lucene's defaults. But this doesn't mean we can
> simply
> > >> >> ignore the horrors of what happens on merge. merging needs to
> scale so
> > >> >> that indexing really scales.
> > >> >>
> > >> >> At least it shouldnt spike RAM on trivial data amounts and cause
> OOM,
> > >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> > >> >> fashion when indexing.
> > >> >>
> > >> >>
> ---------------------------------------------------------------------
> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >> >>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>
> >
> >
> > --
> > Adrien
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

gus.heck at gmail

Apr 9, 2023, 9:46 AM

Post #76 of 99 (950 views)

What I see so far:

1. Much positive support for raising the limit
2. Slightly less support for removing it or making it configurable
3. A single veto which argues that a (as yet undefined) performance
standard must be met before raising the limit
4. Hot tempers (various) making this discussion difficult

As I understand it, vetoes must have technical merit. I'm not sure that
this veto rises to "technical merit" on 2 counts:

1. No standard for the performance is given so it cannot be technically
met. Without hard criteria it's a moving target.
2. It appears to encode a valuation of the user's time, and that
valuation is really up to the user. Some users may consider 2hours useless
and not worth it, and others might happily wait 2 hours. This is not a
technical decision, it's a business decision regarding the relative value
of the time invested vs the value of the result. If I can cure cancer by
indexing for a year, that might be worth it... (hyperbole of course).

Things I would consider to have technical merit that I don't hear:

1. Impact on the speed of **other** indexing operations. (devaluation of
other functionality)
2. Actual scenarios that work when the limit is low and fail when the
limit is high (new failure on the same data with the limit raised).

One thing that might or might not have technical merit

1. If someone feels there is a lack of documentation of the
costs/performance implications of using large vectors, possibly including
reproducible benchmarks establishing the scaling behavior (there seems to
be disagreement on O(n) vs O(n^2)).

The users *should* know what they are getting into, but if the cost is
worth it to them, they should be able to pay it without forking the
project. If this veto causes a fork that's not good.

On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com> wrote:

> We do have a dataset built from Wikipedia in luceneutil. It comes in 100
> and 300 dimensional varieties and can easily enough generate large numbers
> of vector documents from the articles data. To go higher we could
> concatenate vectors from that and I believe the performance numbers would
> be plausible.
>
> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>
>> Can we set up a branch in which the limit is bumped to 2048, then have
>> a realistic, free data set (wikipedia sample or something) that has,
>> say, 5 million docs and vectors created using public data (glove
>> pre-trained embeddings or the like)? We then could run indexing on the
>> same hardware with 512, 1024 and 2048 and see what the numbers, limits
>> and behavior actually are.
>>
>> I can help in writing this but not until after Easter.
>>
>>
>> Dawid
>>
>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>> >
>> > As Dawid pointed out earlier on this thread, this is the rule for
>> > Apache projects: a single -1 vote on a code change is a veto and
>> > cannot be overridden. Furthermore, Robert is one of the people on this
>> > project who worked the most on debugging subtle bugs, making Lucene
>> > more robust and improving our test framework, so I'm listening when he
>> > voices quality concerns.
>> >
>> > The argument against removing/raising the limit that resonates with me
>> > the most is that it is a one-way door. As MikeS highlighted earlier on
>> > this thread, implementations may want to take advantage of the fact
>> > that there is a limit at some point too. This is why I don't want to
>> > remove the limit and would prefer a slight increase, such as 2048 as
>> > suggested in the original issue, which would enable most of the things
>> > that users who have been asking about raising the limit would like to
>> > do.
>> >
>> > I agree that the merge-time memory usage and slow indexing rate are
>> > not great. But it's still possible to index multi-million vector
>> > datasets with a 4GB heap without hitting OOMEs regardless of the
>> > number of dimensions, and the feedback I'm seeing is that many users
>> > are still interested in indexing multi-million vector datasets despite
>> > the slow indexing rate. I wish we could do better, and vector indexing
>> > is certainly more expert than text indexing, but it still is usable in
>> > my opinion. I understand how giving Lucene more information about
>> > vectors prior to indexing (e.g. clustering information as Jim pointed
>> > out) could help make merging faster and more memory-efficient, but I
>> > would really like to avoid making it a requirement for indexing
>> > vectors as it also makes this feature much harder to use.
>> >
>> > On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>> > <a.benedetti@sease.io> wrote:
>> > >
>> > > I am very attentive to listen opinions but I am un-convinced here and
>> I an not sure that a single person opinion should be allowed to be
>> detrimental for such an important project.
>> > >
>> > > The limit as far as I know is literally just raising an exception.
>> > > Removing it won't alter in any way the current performance for users
>> in low dimensional space.
>> > > Removing it will just enable more users to use Lucene.
>> > >
>> > > If new users in certain situations will be unhappy with the
>> performance, they may contribute improvements.
>> > > This is how you make progress.
>> > >
>> > > If it's a reputation thing, trust me that not allowing users to play
>> with high dimensional space will equally damage it.
>> > >
>> > > To me it's really a no brainer.
>> > > Removing the limit and enable people to use high dimensional vectors
>> will take minutes.
>> > > Improving the hnsw implementation can take months.
>> > > Pick one to begin with...
>> > >
>> > > And there's no-one paying me here, no company interest whatsoever,
>> actually I pay people to contribute, I am just convinced it's a good idea.
>> > >
>> > >
>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>> > >>
>> > >> I disagree with your categorization. I put in plenty of work and
>> > >> experienced plenty of pain myself, writing tests and fighting these
>> > >> issues, after i saw that, two releases in a row, vector indexing fell
>> > >> over and hit integer overflows etc on small datasets:
>> > >>
>> > >> https://github.com/apache/lucene/pull/11905
>> > >>
>> > >> Attacking me isn't helping the situation.
>> > >>
>> > >> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> > >> any kind of demeaning fashion really. I meant to describe the current
>> > >> state of usability with respect to indexing a few million docs with
>> > >> high dimensions. You can scroll up the thread and see that at least
>> > >> one other committer on the project experienced similar pain as me.
>> > >> Then, think about users who aren't committers trying to use the
>> > >> functionality!
>> > >>
>> > >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> > >> >
>> > >> > What you said about increasing dimensions requiring a bigger ram
>> buffer on merge is wrong. That's the point I was trying to make. Your
>> concerns about merge costs are not wrong, but your conclusion that we need
>> to limit dimensions is not justified.
>> > >> >
>> > >> > You complain that hnsw sucks it doesn't scale, but when I show it
>> scales linearly with dimension you just ignore that and complain about
>> something entirely different.
>> > >> >
>> > >> > You demand that people run all kinds of tests to prove you wrong
>> but when they do, you don't listen and you won't put in the work yourself
>> or complain that it's too hard.
>> > >> >
>> > >> > Then you complain about people not meeting you half way. Wow
>> > >> >
>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
>> wrote:
>> > >> >>
>> > >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> > >> >> <michael.wechner@wyona.com> wrote:
>> > >> >> >
>> > >> >> > What exactly do you consider reasonable?
>> > >> >>
>> > >> >> Let's begin a real discussion by being HONEST about the current
>> > >> >> status. Please put politically correct or your own company's
>> wishes
>> > >> >> aside, we know it's not in a good state.
>> > >> >>
>> > >> >> Current status is the one guy who wrote the code can set a
>> > >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> > >> >> dimensions in HOURS (i didn't ask what hardware).
>> > >> >>
>> > >> >> My concerns are everyone else except the one guy, I want it to be
>> > >> >> usable. Increasing dimensions just means even bigger
>> multi-gigabyte
>> > >> >> ram buffer and bigger heap to avoid OOM on merge.
>> > >> >> It is also a permanent backwards compatibility decision, we have
>> to
>> > >> >> support it once we do this and we can't just say "oops" and flip
>> it
>> > >> >> back.
>> > >> >>
>> > >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> > >> >> avoid merges because they are so slow and it would be DAYS
>> otherwise,
>> > >> >> or if its to avoid merges so it doesn't hit OOM.
>> > >> >> Also from personal experience, it takes trial and error (means
>> > >> >> experiencing OOM on merge!!!) before you get those heap values
>> correct
>> > >> >> for your dataset. This usually means starting over which is
>> > >> >> frustrating and wastes more time.
>> > >> >>
>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
>> seems
>> > >> >> to me like its a good idea. maybe the multigigabyte ram buffer
>> can be
>> > >> >> avoided in this way and performance improved by writing bigger
>> > >> >> segments with lucene's defaults. But this doesn't mean we can
>> simply
>> > >> >> ignore the horrors of what happens on merge. merging needs to
>> scale so
>> > >> >> that indexing really scales.
>> > >> >>
>> > >> >> At least it shouldnt spike RAM on trivial data amounts and cause
>> OOM,
>> > >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> > >> >> fashion when indexing.
>> > >> >>
>> > >> >>
>> ---------------------------------------------------------------------
>> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> > >> >>
>> > >>
>> > >> ---------------------------------------------------------------------
>> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > >> For additional commands, e-mail: dev-help@lucene.apache.org
>> > >>
>> >
>> >
>> > --
>> > Adrien
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

gus.heck at gmail

Apr 9, 2023, 10:02 AM

Post #77 of 99 (950 views)

Also technically, it's just the threat of a veto since we are not actually
in a vote thread....

On Sun, Apr 9, 2023 at 12:46?PM Gus Heck <gus.heck@gmail.com> wrote:

> What I see so far:
>
> 1. Much positive support for raising the limit
> 2. Slightly less support for removing it or making it configurable
> 3. A single veto which argues that a (as yet undefined) performance
> standard must be met before raising the limit
> 4. Hot tempers (various) making this discussion difficult
>
> As I understand it, vetoes must have technical merit. I'm not sure that
> this veto rises to "technical merit" on 2 counts:
>
> 1. No standard for the performance is given so it cannot be
> technically met. Without hard criteria it's a moving target.
> 2. It appears to encode a valuation of the user's time, and that
> valuation is really up to the user. Some users may consider 2hours useless
> and not worth it, and others might happily wait 2 hours. This is not a
> technical decision, it's a business decision regarding the relative value
> of the time invested vs the value of the result. If I can cure cancer by
> indexing for a year, that might be worth it... (hyperbole of course).
>
> Things I would consider to have technical merit that I don't hear:
>
> 1. Impact on the speed of **other** indexing operations. (devaluation
> of other functionality)
> 2. Actual scenarios that work when the limit is low and fail when the
> limit is high (new failure on the same data with the limit raised).
>
> One thing that might or might not have technical merit
>
> 1. If someone feels there is a lack of documentation of the
> costs/performance implications of using large vectors, possibly including
> reproducible benchmarks establishing the scaling behavior (there seems to
> be disagreement on O(n) vs O(n^2)).
>
> The users *should* know what they are getting into, but if the cost is
> worth it to them, they should be able to pay it without forking the
> project. If this veto causes a fork that's not good.
>
> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com> wrote:
>
>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100
>> and 300 dimensional varieties and can easily enough generate large numbers
>> of vector documents from the articles data. To go higher we could
>> concatenate vectors from that and I believe the performance numbers would
>> be plausible.
>>
>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>>
>>> Can we set up a branch in which the limit is bumped to 2048, then have
>>> a realistic, free data set (wikipedia sample or something) that has,
>>> say, 5 million docs and vectors created using public data (glove
>>> pre-trained embeddings or the like)? We then could run indexing on the
>>> same hardware with 512, 1024 and 2048 and see what the numbers, limits
>>> and behavior actually are.
>>>
>>> I can help in writing this but not until after Easter.
>>>
>>>
>>> Dawid
>>>
>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>>> >
>>> > As Dawid pointed out earlier on this thread, this is the rule for
>>> > Apache projects: a single -1 vote on a code change is a veto and
>>> > cannot be overridden. Furthermore, Robert is one of the people on this
>>> > project who worked the most on debugging subtle bugs, making Lucene
>>> > more robust and improving our test framework, so I'm listening when he
>>> > voices quality concerns.
>>> >
>>> > The argument against removing/raising the limit that resonates with me
>>> > the most is that it is a one-way door. As MikeS highlighted earlier on
>>> > this thread, implementations may want to take advantage of the fact
>>> > that there is a limit at some point too. This is why I don't want to
>>> > remove the limit and would prefer a slight increase, such as 2048 as
>>> > suggested in the original issue, which would enable most of the things
>>> > that users who have been asking about raising the limit would like to
>>> > do.
>>> >
>>> > I agree that the merge-time memory usage and slow indexing rate are
>>> > not great. But it's still possible to index multi-million vector
>>> > datasets with a 4GB heap without hitting OOMEs regardless of the
>>> > number of dimensions, and the feedback I'm seeing is that many users
>>> > are still interested in indexing multi-million vector datasets despite
>>> > the slow indexing rate. I wish we could do better, and vector indexing
>>> > is certainly more expert than text indexing, but it still is usable in
>>> > my opinion. I understand how giving Lucene more information about
>>> > vectors prior to indexing (e.g. clustering information as Jim pointed
>>> > out) could help make merging faster and more memory-efficient, but I
>>> > would really like to avoid making it a requirement for indexing
>>> > vectors as it also makes this feature much harder to use.
>>> >
>>> > On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>> > <a.benedetti@sease.io> wrote:
>>> > >
>>> > > I am very attentive to listen opinions but I am un-convinced here
>>> and I an not sure that a single person opinion should be allowed to be
>>> detrimental for such an important project.
>>> > >
>>> > > The limit as far as I know is literally just raising an exception.
>>> > > Removing it won't alter in any way the current performance for users
>>> in low dimensional space.
>>> > > Removing it will just enable more users to use Lucene.
>>> > >
>>> > > If new users in certain situations will be unhappy with the
>>> performance, they may contribute improvements.
>>> > > This is how you make progress.
>>> > >
>>> > > If it's a reputation thing, trust me that not allowing users to play
>>> with high dimensional space will equally damage it.
>>> > >
>>> > > To me it's really a no brainer.
>>> > > Removing the limit and enable people to use high dimensional vectors
>>> will take minutes.
>>> > > Improving the hnsw implementation can take months.
>>> > > Pick one to begin with...
>>> > >
>>> > > And there's no-one paying me here, no company interest whatsoever,
>>> actually I pay people to contribute, I am just convinced it's a good idea.
>>> > >
>>> > >
>>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>> > >>
>>> > >> I disagree with your categorization. I put in plenty of work and
>>> > >> experienced plenty of pain myself, writing tests and fighting these
>>> > >> issues, after i saw that, two releases in a row, vector indexing
>>> fell
>>> > >> over and hit integer overflows etc on small datasets:
>>> > >>
>>> > >> https://github.com/apache/lucene/pull/11905
>>> > >>
>>> > >> Attacking me isn't helping the situation.
>>> > >>
>>> > >> PS: when i said the "one guy who wrote the code" I didn't mean it in
>>> > >> any kind of demeaning fashion really. I meant to describe the
>>> current
>>> > >> state of usability with respect to indexing a few million docs with
>>> > >> high dimensions. You can scroll up the thread and see that at least
>>> > >> one other committer on the project experienced similar pain as me.
>>> > >> Then, think about users who aren't committers trying to use the
>>> > >> functionality!
>>> > >>
>>> > >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>> > >> >
>>> > >> > What you said about increasing dimensions requiring a bigger ram
>>> buffer on merge is wrong. That's the point I was trying to make. Your
>>> concerns about merge costs are not wrong, but your conclusion that we need
>>> to limit dimensions is not justified.
>>> > >> >
>>> > >> > You complain that hnsw sucks it doesn't scale, but when I show it
>>> scales linearly with dimension you just ignore that and complain about
>>> something entirely different.
>>> > >> >
>>> > >> > You demand that people run all kinds of tests to prove you wrong
>>> but when they do, you don't listen and you won't put in the work yourself
>>> or complain that it's too hard.
>>> > >> >
>>> > >> > Then you complain about people not meeting you half way. Wow
>>> > >> >
>>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
>>> wrote:
>>> > >> >>
>>> > >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>> > >> >> <michael.wechner@wyona.com> wrote:
>>> > >> >> >
>>> > >> >> > What exactly do you consider reasonable?
>>> > >> >>
>>> > >> >> Let's begin a real discussion by being HONEST about the current
>>> > >> >> status. Please put politically correct or your own company's
>>> wishes
>>> > >> >> aside, we know it's not in a good state.
>>> > >> >>
>>> > >> >> Current status is the one guy who wrote the code can set a
>>> > >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>>> > >> >> dimensions in HOURS (i didn't ask what hardware).
>>> > >> >>
>>> > >> >> My concerns are everyone else except the one guy, I want it to be
>>> > >> >> usable. Increasing dimensions just means even bigger
>>> multi-gigabyte
>>> > >> >> ram buffer and bigger heap to avoid OOM on merge.
>>> > >> >> It is also a permanent backwards compatibility decision, we have
>>> to
>>> > >> >> support it once we do this and we can't just say "oops" and flip
>>> it
>>> > >> >> back.
>>> > >> >>
>>> > >> >> It is unclear to me, if the multi-gigabyte ram buffer is really
>>> to
>>> > >> >> avoid merges because they are so slow and it would be DAYS
>>> otherwise,
>>> > >> >> or if its to avoid merges so it doesn't hit OOM.
>>> > >> >> Also from personal experience, it takes trial and error (means
>>> > >> >> experiencing OOM on merge!!!) before you get those heap values
>>> correct
>>> > >> >> for your dataset. This usually means starting over which is
>>> > >> >> frustrating and wastes more time.
>>> > >> >>
>>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
>>> seems
>>> > >> >> to me like its a good idea. maybe the multigigabyte ram buffer
>>> can be
>>> > >> >> avoided in this way and performance improved by writing bigger
>>> > >> >> segments with lucene's defaults. But this doesn't mean we can
>>> simply
>>> > >> >> ignore the horrors of what happens on merge. merging needs to
>>> scale so
>>> > >> >> that indexing really scales.
>>> > >> >>
>>> > >> >> At least it shouldnt spike RAM on trivial data amounts and cause
>>> OOM,
>>> > >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>> > >> >> fashion when indexing.
>>> > >> >>
>>> > >> >>
>>> ---------------------------------------------------------------------
>>> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> > >> >>
>>> > >>
>>> > >>
>>> ---------------------------------------------------------------------
>>> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> > >>
>>> >
>>> >
>>> > --
>>> > Adrien
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

vaiju1981 at gmail

Apr 9, 2023, 10:19 AM

Post #78 of 99 (950 views)

Joining this pretty late, But we do have a use case to use 2K dimensions
for embedding purposes.
We decided to break embeddings into high and low where first 1K goes into
high field and next 1K goes into low, while indexing this worked well. We
also add a few other fields like segment/cluster id to help us narrow
searches in subsequent.

While searching we then do a multiple pass to use both of these along with
other fields ( yes this might not scale for others ). It is working so far.

--Thanks and Regards
Vaijanath

On Sun, Apr 9, 2023 at 10:02?AM Gus Heck <gus.heck@gmail.com> wrote:

> Also technically, it's just the threat of a veto since we are not actually
> in a vote thread....
>
> On Sun, Apr 9, 2023 at 12:46?PM Gus Heck <gus.heck@gmail.com> wrote:
>
>> What I see so far:
>>
>> 1. Much positive support for raising the limit
>> 2. Slightly less support for removing it or making it configurable
>> 3. A single veto which argues that a (as yet undefined) performance
>> standard must be met before raising the limit
>> 4. Hot tempers (various) making this discussion difficult
>>
>> As I understand it, vetoes must have technical merit. I'm not sure that
>> this veto rises to "technical merit" on 2 counts:
>>
>> 1. No standard for the performance is given so it cannot be
>> technically met. Without hard criteria it's a moving target.
>> 2. It appears to encode a valuation of the user's time, and that
>> valuation is really up to the user. Some users may consider 2hours useless
>> and not worth it, and others might happily wait 2 hours. This is not a
>> technical decision, it's a business decision regarding the relative value
>> of the time invested vs the value of the result. If I can cure cancer by
>> indexing for a year, that might be worth it... (hyperbole of course).
>>
>> Things I would consider to have technical merit that I don't hear:
>>
>> 1. Impact on the speed of **other** indexing operations. (devaluation
>> of other functionality)
>> 2. Actual scenarios that work when the limit is low and fail when the
>> limit is high (new failure on the same data with the limit raised).
>>
>> One thing that might or might not have technical merit
>>
>> 1. If someone feels there is a lack of documentation of the
>> costs/performance implications of using large vectors, possibly including
>> reproducible benchmarks establishing the scaling behavior (there seems to
>> be disagreement on O(n) vs O(n^2)).
>>
>> The users *should* know what they are getting into, but if the cost is
>> worth it to them, they should be able to pay it without forking the
>> project. If this veto causes a fork that's not good.
>>
>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>>
>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100
>>> and 300 dimensional varieties and can easily enough generate large numbers
>>> of vector documents from the articles data. To go higher we could
>>> concatenate vectors from that and I believe the performance numbers would
>>> be plausible.
>>>
>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>>>
>>>> Can we set up a branch in which the limit is bumped to 2048, then have
>>>> a realistic, free data set (wikipedia sample or something) that has,
>>>> say, 5 million docs and vectors created using public data (glove
>>>> pre-trained embeddings or the like)? We then could run indexing on the
>>>> same hardware with 512, 1024 and 2048 and see what the numbers, limits
>>>> and behavior actually are.
>>>>
>>>> I can help in writing this but not until after Easter.
>>>>
>>>>
>>>> Dawid
>>>>
>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>>>> >
>>>> > As Dawid pointed out earlier on this thread, this is the rule for
>>>> > Apache projects: a single -1 vote on a code change is a veto and
>>>> > cannot be overridden. Furthermore, Robert is one of the people on this
>>>> > project who worked the most on debugging subtle bugs, making Lucene
>>>> > more robust and improving our test framework, so I'm listening when he
>>>> > voices quality concerns.
>>>> >
>>>> > The argument against removing/raising the limit that resonates with me
>>>> > the most is that it is a one-way door. As MikeS highlighted earlier on
>>>> > this thread, implementations may want to take advantage of the fact
>>>> > that there is a limit at some point too. This is why I don't want to
>>>> > remove the limit and would prefer a slight increase, such as 2048 as
>>>> > suggested in the original issue, which would enable most of the things
>>>> > that users who have been asking about raising the limit would like to
>>>> > do.
>>>> >
>>>> > I agree that the merge-time memory usage and slow indexing rate are
>>>> > not great. But it's still possible to index multi-million vector
>>>> > datasets with a 4GB heap without hitting OOMEs regardless of the
>>>> > number of dimensions, and the feedback I'm seeing is that many users
>>>> > are still interested in indexing multi-million vector datasets despite
>>>> > the slow indexing rate. I wish we could do better, and vector indexing
>>>> > is certainly more expert than text indexing, but it still is usable in
>>>> > my opinion. I understand how giving Lucene more information about
>>>> > vectors prior to indexing (e.g. clustering information as Jim pointed
>>>> > out) could help make merging faster and more memory-efficient, but I
>>>> > would really like to avoid making it a requirement for indexing
>>>> > vectors as it also makes this feature much harder to use.
>>>> >
>>>> > On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>>> > <a.benedetti@sease.io> wrote:
>>>> > >
>>>> > > I am very attentive to listen opinions but I am un-convinced here
>>>> and I an not sure that a single person opinion should be allowed to be
>>>> detrimental for such an important project.
>>>> > >
>>>> > > The limit as far as I know is literally just raising an exception.
>>>> > > Removing it won't alter in any way the current performance for
>>>> users in low dimensional space.
>>>> > > Removing it will just enable more users to use Lucene.
>>>> > >
>>>> > > If new users in certain situations will be unhappy with the
>>>> performance, they may contribute improvements.
>>>> > > This is how you make progress.
>>>> > >
>>>> > > If it's a reputation thing, trust me that not allowing users to
>>>> play with high dimensional space will equally damage it.
>>>> > >
>>>> > > To me it's really a no brainer.
>>>> > > Removing the limit and enable people to use high dimensional
>>>> vectors will take minutes.
>>>> > > Improving the hnsw implementation can take months.
>>>> > > Pick one to begin with...
>>>> > >
>>>> > > And there's no-one paying me here, no company interest whatsoever,
>>>> actually I pay people to contribute, I am just convinced it's a good idea.
>>>> > >
>>>> > >
>>>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>>> > >>
>>>> > >> I disagree with your categorization. I put in plenty of work and
>>>> > >> experienced plenty of pain myself, writing tests and fighting these
>>>> > >> issues, after i saw that, two releases in a row, vector indexing
>>>> fell
>>>> > >> over and hit integer overflows etc on small datasets:
>>>> > >>
>>>> > >> https://github.com/apache/lucene/pull/11905
>>>> > >>
>>>> > >> Attacking me isn't helping the situation.
>>>> > >>
>>>> > >> PS: when i said the "one guy who wrote the code" I didn't mean it
>>>> in
>>>> > >> any kind of demeaning fashion really. I meant to describe the
>>>> current
>>>> > >> state of usability with respect to indexing a few million docs with
>>>> > >> high dimensions. You can scroll up the thread and see that at least
>>>> > >> one other committer on the project experienced similar pain as me.
>>>> > >> Then, think about users who aren't committers trying to use the
>>>> > >> functionality!
>>>> > >>
>>>> > >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
>>>> msokolov@gmail.com> wrote:
>>>> > >> >
>>>> > >> > What you said about increasing dimensions requiring a bigger ram
>>>> buffer on merge is wrong. That's the point I was trying to make. Your
>>>> concerns about merge costs are not wrong, but your conclusion that we need
>>>> to limit dimensions is not justified.
>>>> > >> >
>>>> > >> > You complain that hnsw sucks it doesn't scale, but when I show
>>>> it scales linearly with dimension you just ignore that and complain about
>>>> something entirely different.
>>>> > >> >
>>>> > >> > You demand that people run all kinds of tests to prove you wrong
>>>> but when they do, you don't listen and you won't put in the work yourself
>>>> or complain that it's too hard.
>>>> > >> >
>>>> > >> > Then you complain about people not meeting you half way. Wow
>>>> > >> >
>>>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
>>>> wrote:
>>>> > >> >>
>>>> > >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>>> > >> >> <michael.wechner@wyona.com> wrote:
>>>> > >> >> >
>>>> > >> >> > What exactly do you consider reasonable?
>>>> > >> >>
>>>> > >> >> Let's begin a real discussion by being HONEST about the current
>>>> > >> >> status. Please put politically correct or your own company's
>>>> wishes
>>>> > >> >> aside, we know it's not in a good state.
>>>> > >> >>
>>>> > >> >> Current status is the one guy who wrote the code can set a
>>>> > >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>>>> > >> >> dimensions in HOURS (i didn't ask what hardware).
>>>> > >> >>
>>>> > >> >> My concerns are everyone else except the one guy, I want it to
>>>> be
>>>> > >> >> usable. Increasing dimensions just means even bigger
>>>> multi-gigabyte
>>>> > >> >> ram buffer and bigger heap to avoid OOM on merge.
>>>> > >> >> It is also a permanent backwards compatibility decision, we
>>>> have to
>>>> > >> >> support it once we do this and we can't just say "oops" and
>>>> flip it
>>>> > >> >> back.
>>>> > >> >>
>>>> > >> >> It is unclear to me, if the multi-gigabyte ram buffer is really
>>>> to
>>>> > >> >> avoid merges because they are so slow and it would be DAYS
>>>> otherwise,
>>>> > >> >> or if its to avoid merges so it doesn't hit OOM.
>>>> > >> >> Also from personal experience, it takes trial and error (means
>>>> > >> >> experiencing OOM on merge!!!) before you get those heap values
>>>> correct
>>>> > >> >> for your dataset. This usually means starting over which is
>>>> > >> >> frustrating and wastes more time.
>>>> > >> >>
>>>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
>>>> seems
>>>> > >> >> to me like its a good idea. maybe the multigigabyte ram buffer
>>>> can be
>>>> > >> >> avoided in this way and performance improved by writing bigger
>>>> > >> >> segments with lucene's defaults. But this doesn't mean we can
>>>> simply
>>>> > >> >> ignore the horrors of what happens on merge. merging needs to
>>>> scale so
>>>> > >> >> that indexing really scales.
>>>> > >> >>
>>>> > >> >> At least it shouldnt spike RAM on trivial data amounts and
>>>> cause OOM,
>>>> > >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>>> > >> >> fashion when indexing.
>>>> > >> >>
>>>> > >> >>
>>>> ---------------------------------------------------------------------
>>>> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>>> > >> >>
>>>> > >>
>>>> > >>
>>>> ---------------------------------------------------------------------
>>>> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> > >> For additional commands, e-mail: dev-help@lucene.apache.org
>>>> > >>
>>>> >
>>>> >
>>>> > --
>>>> > Adrien
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

--
I am feeling fine, healthier and Happier, what about you

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

dawid.weiss at gmail

Apr 9, 2023, 11:43 AM

Post #79 of 99 (950 views)

> We do have a dataset built from Wikipedia in luceneutil. It comes in 100 and 300 dimensional varieties and can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that and I believe the performance numbers would be plausible.

Apologies - I wasn't clear - I thought of building the 1k or 2k
vectors that would be realistic. Perhaps using glove or perhaps using
some other software but something that would reflect a true 2k
dimensional space accurately with "real" data underneath. I am not
familiar enough with the field to tell whether a simple concatenation
is a good enough simulation - perhaps it is.

I would really prefer to focus on doing this kind of assessment of
feasibility/ limitations rather than arguing back and forth. I did my
experiment a while ago and I can't really tell whether there have been
improvements in the indexing/ merging part - your email contradicts my
experience Mike, so I'm a bit intrigued and would like to revisit it.
But it'd be ideal to work with real vectors rather than a simulation.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 9, 2023, 12:14 PM

Post #80 of 99 (950 views)

Our use case is the following:

We have a dataset of several thousand questions and answers, for which
we generate vectors with various models resp. services, for example:

SentenceBERT, all-mpnet-base-v2
Aleph Alpha, luminous-base
Cohere, multilingual-22-12
OpenAI, text-similarity-ada-001 or text-similarity-davinci-001

These models produce vectors with different vector dimensions.
Depending on the model and on the benchmarks / datasets the accuracy is
higher when the vector dimension is higher.

We index these vectors with Lucene and we use Lucene to do a similarity
search.

I think this is exactly why Lucene KNN / ANN was developed, to do
similarity search, right?

Thanks

Michael W

Am 09.04.23 um 13:47 schrieb Robert Muir:
> I don't care. you guys personally attacked me first. And then it turns
> out, you were being dishonest the entire time and hiding your true
> intent, which was not search at all but instead some chatgpt pyramid
> scheme or similar.
>
> i'm done with this thread.
>
> On Sun, Apr 9, 2023 at 7:37?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>> I don't think this tone and language is appropriate for a community of volunteers and men of science.
>>
>> I personally find offensive to generalise Lucene people here to be "crazy people hyped about chatGPT".
>>
>> I personally don't give a damn about chatGPT except the fact it is a very interesting technology.
>>
>> As usual I see very little motivation and a lot of "convince me".
>> We're discussing here about a limit that raises an exception.
>>
>> Improving performance is absolutely important and no-one here is saying we won't address it, it's just a separate discussion.
>>
>>
>> On Sun, 9 Apr 2023, 12:59 Robert Muir, <rcmuir@gmail.com> wrote:
>>> Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
>>> LIBRARY. not a vector database or whatever trash is being proposed
>>> here.
>>>
>>> i think we should table this and revisit it after chatgpt hype has dissipated.
>>>
>>> this hype is causing ppl to behave irrationally, it is why i can't
>>> converse with basically anyone on this thread because they are all
>>> stating crazy things that don't make sense.
>>>
>>> On Sun, Apr 9, 2023 at 6:25?AM Robert Muir <rcmuir@gmail.com> wrote:
>>>> Yes, its very clear that folks on this thread are ignoring reason
>>>> entirely and completely swooned by chatgpt-hype.
>>>> And what happens when they make chatgpt-8 that uses even more dimensions?
>>>> backwards compatibility decisions can't be made by garbage hype such
>>>> as cryptocurrency or chatgpt.
>>>> Trying to convince me we should bump it because of chatgpt, well, i
>>>> think it has the opposite effect.
>>>>
>>>> Please, lemme see real technical arguments why this limit needs to be
>>>> bumped. not including trash like chatgpt.
>>>>
>>>> On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com> wrote:
>>>>> Given the massive amounts of funding going into the development and investigation of the project, I think it would be good to at least have Lucene be a part of the conversation. Simply because academics typically focus on vectors <= 784 dimensions does not mean all users will. A large swathe of very important users of the Lucene project never exceed 500k documents, though they are shifting to other search engines to try out very popular embeddings.
>>>>>
>>>>> I think giving our users the opportunity to build chat bots or LLM memory machines using Lucene is a positive development, even if some datasets won't be able to work well. We don't limit the number of fields someone can add in most cases, though we did just undeprecate that API to better support multi-tenancy. But people still add so many fields and can crash their clusters with mapping explosions when unlimited. The limit to vectors feels similar. I expect more people to dig into Lucene due to its openness and robustness as they run into problems. Today, they are forced to consider other engines that are more permissive.
>>>>>
>>>>> Not everyone important or valuable Lucene workload is in the millions of documents. Many of them only have lots of queries or computationally expensive access patterns for B-trees. We can document that it is very ill-advised to make a deployment with vectors too large. What others will do with it is on them.
>>>>>
>>>>>
>>>>> On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>>>>>> As Dawid pointed out earlier on this thread, this is the rule for
>>>>>> Apache projects: a single -1 vote on a code change is a veto and
>>>>>> cannot be overridden. Furthermore, Robert is one of the people on this
>>>>>> project who worked the most on debugging subtle bugs, making Lucene
>>>>>> more robust and improving our test framework, so I'm listening when he
>>>>>> voices quality concerns.
>>>>>>
>>>>>> The argument against removing/raising the limit that resonates with me
>>>>>> the most is that it is a one-way door. As MikeS highlighted earlier on
>>>>>> this thread, implementations may want to take advantage of the fact
>>>>>> that there is a limit at some point too. This is why I don't want to
>>>>>> remove the limit and would prefer a slight increase, such as 2048 as
>>>>>> suggested in the original issue, which would enable most of the things
>>>>>> that users who have been asking about raising the limit would like to
>>>>>> do.
>>>>>>
>>>>>> I agree that the merge-time memory usage and slow indexing rate are
>>>>>> not great. But it's still possible to index multi-million vector
>>>>>> datasets with a 4GB heap without hitting OOMEs regardless of the
>>>>>> number of dimensions, and the feedback I'm seeing is that many users
>>>>>> are still interested in indexing multi-million vector datasets despite
>>>>>> the slow indexing rate. I wish we could do better, and vector indexing
>>>>>> is certainly more expert than text indexing, but it still is usable in
>>>>>> my opinion. I understand how giving Lucene more information about
>>>>>> vectors prior to indexing (e.g. clustering information as Jim pointed
>>>>>> out) could help make merging faster and more memory-efficient, but I
>>>>>> would really like to avoid making it a requirement for indexing
>>>>>> vectors as it also makes this feature much harder to use.
>>>>>>
>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>>>>> <a.benedetti@sease.io> wrote:
>>>>>>> I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>>>>>>>
>>>>>>> The limit as far as I know is literally just raising an exception.
>>>>>>> Removing it won't alter in any way the current performance for users in low dimensional space.
>>>>>>> Removing it will just enable more users to use Lucene.
>>>>>>>
>>>>>>> If new users in certain situations will be unhappy with the performance, they may contribute improvements.
>>>>>>> This is how you make progress.
>>>>>>>
>>>>>>> If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>>>>>>>
>>>>>>> To me it's really a no brainer.
>>>>>>> Removing the limit and enable people to use high dimensional vectors will take minutes.
>>>>>>> Improving the hnsw implementation can take months.
>>>>>>> Pick one to begin with...
>>>>>>>
>>>>>>> And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>>>>>>> I disagree with your categorization. I put in plenty of work and
>>>>>>>> experienced plenty of pain myself, writing tests and fighting these
>>>>>>>> issues, after i saw that, two releases in a row, vector indexing fell
>>>>>>>> over and hit integer overflows etc on small datasets:
>>>>>>>>
>>>>>>>> https://github.com/apache/lucene/pull/11905
>>>>>>>>
>>>>>>>> Attacking me isn't helping the situation.
>>>>>>>>
>>>>>>>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>>>>>>>> any kind of demeaning fashion really. I meant to describe the current
>>>>>>>> state of usability with respect to indexing a few million docs with
>>>>>>>> high dimensions. You can scroll up the thread and see that at least
>>>>>>>> one other committer on the project experienced similar pain as me.
>>>>>>>> Then, think about users who aren't committers trying to use the
>>>>>>>> functionality!
>>>>>>>>
>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>>>>>>>>> What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>>>>>>>>
>>>>>>>>> You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>>>>>>>>
>>>>>>>>> You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>>>>>>>>>
>>>>>>>>> Then you complain about people not meeting you half way. Wow
>>>>>>>>>
>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>>>>>>>>> <michael.wechner@wyona.com> wrote:
>>>>>>>>>>> What exactly do you consider reasonable?
>>>>>>>>>> Let's begin a real discussion by being HONEST about the current
>>>>>>>>>> status. Please put politically correct or your own company's wishes
>>>>>>>>>> aside, we know it's not in a good state.
>>>>>>>>>>
>>>>>>>>>> Current status is the one guy who wrote the code can set a
>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset with 1024
>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
>>>>>>>>>>
>>>>>>>>>> My concerns are everyone else except the one guy, I want it to be
>>>>>>>>>> usable. Increasing dimensions just means even bigger multi-gigabyte
>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>>>>>>>>>> It is also a permanent backwards compatibility decision, we have to
>>>>>>>>>> support it once we do this and we can't just say "oops" and flip it
>>>>>>>>>> back.
>>>>>>>>>>
>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer is really to
>>>>>>>>>> avoid merges because they are so slow and it would be DAYS otherwise,
>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
>>>>>>>>>> Also from personal experience, it takes trial and error (means
>>>>>>>>>> experiencing OOM on merge!!!) before you get those heap values correct
>>>>>>>>>> for your dataset. This usually means starting over which is
>>>>>>>>>> frustrating and wastes more time.
>>>>>>>>>>
>>>>>>>>>> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>>>>>>>>>> to me like its a good idea. maybe the multigigabyte ram buffer can be
>>>>>>>>>> avoided in this way and performance improved by writing bigger
>>>>>>>>>> segments with lucene's defaults. But this doesn't mean we can simply
>>>>>>>>>> ignore the horrors of what happens on merge. merging needs to scale so
>>>>>>>>>> that indexing really scales.
>>>>>>>>>>
>>>>>>>>>> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>>>>>>>>>> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>>>>>>>>> fashion when indexing.
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Adrien
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>
>>>>>
>>>>> --
>>>>> Marcus Eagan
>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 9, 2023, 12:23 PM

Post #81 of 99 (950 views)

I think for testing the performance and scalability one can also use
synthetic data and it does not have to be real world data in the sense
of vectors generated from real world text.

But I think the more people revisit the testing of performance and
scalability the better and any help on this would be great!

Thanks

Michael W

Am 09.04.23 um 20:43 schrieb Dawid Weiss:
>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100 and 300 dimensional varieties and can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that and I believe the performance numbers would be plausible.
> Apologies - I wasn't clear - I thought of building the 1k or 2k
> vectors that would be realistic. Perhaps using glove or perhaps using
> some other software but something that would reflect a true 2k
> dimensional space accurately with "real" data underneath. I am not
> familiar enough with the field to tell whether a simple concatenation
> is a good enough simulation - perhaps it is.
>
> I would really prefer to focus on doing this kind of assessment of
> feasibility/ limitations rather than arguing back and forth. I did my
> experiment a while ago and I can't really tell whether there have been
> improvements in the indexing/ merging part - your email contradicts my
> experience Mike, so I'm a bit intrigued and would like to revisit it.
> But it'd be ideal to work with real vectors rather than a simulation.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 10, 2023, 6:54 AM

Post #82 of 99 (950 views)

I think concatenating word-embedding vectors is a reasonable thing to
do. It captures information about the sequence of tokens which is
being lost by the current approach (summing them). Random article I
found in a search
https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca
shows higher performance with a concatenative approach. So it seems to
me we could take the 300-dim Glove vectors and produce somewhat
meaningful (say) 1200- or 1500-dim vectors by running a sliding window
over the tokens in a document and concatenating the token-vectors

On Sun, Apr 9, 2023 at 2:44?PM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>
> > We do have a dataset built from Wikipedia in luceneutil. It comes in 100 and 300 dimensional varieties and can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that and I believe the performance numbers would be plausible.
>
> Apologies - I wasn't clear - I thought of building the 1k or 2k
> vectors that would be realistic. Perhaps using glove or perhaps using
> some other software but something that would reflect a true 2k
> dimensional space accurately with "real" data underneath. I am not
> familiar enough with the field to tell whether a simple concatenation
> is a good enough simulation - perhaps it is.
>
> I would really prefer to focus on doing this kind of assessment of
> feasibility/ limitations rather than arguing back and forth. I did my
> experiment a while ago and I can't really tell whether there have been
> improvements in the indexing/ merging part - your email contradicts my
> experience Mike, so I'm a bit intrigued and would like to revisit it.
> But it'd be ideal to work with real vectors rather than a simulation.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 10, 2023, 8:06 AM

Post #83 of 99 (950 views)

I poked around on huggingface looking at various models that are being
promoted there; this is the highest-performing text model they list,
which is expected to take sentences as input; it uses so-called
"attention" to capture the context of words:
https://huggingface.co/sentence-transformers/all-mpnet-base-v2 and it
is 768-dimensional. This is a list of models designed for "asymmetric
semantic search" ie short queries and long documents:
https://www.sbert.net/docs/pretrained-models/msmarco-v3.html. The
highest ranking one there also seems to be 768d
https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco

I did see some other larger-dimensional model, but they all seem to
involve images+text.

On Mon, Apr 10, 2023 at 9:54?AM Michael Sokolov <msokolov@gmail.com> wrote:
>
> I think concatenating word-embedding vectors is a reasonable thing to
> do. It captures information about the sequence of tokens which is
> being lost by the current approach (summing them). Random article I
> found in a search
> https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca
> shows higher performance with a concatenative approach. So it seems to
> me we could take the 300-dim Glove vectors and produce somewhat
> meaningful (say) 1200- or 1500-dim vectors by running a sliding window
> over the tokens in a document and concatenating the token-vectors
>
> On Sun, Apr 9, 2023 at 2:44?PM Dawid Weiss <dawid.weiss@gmail.com> wrote:
> >
> > > We do have a dataset built from Wikipedia in luceneutil. It comes in 100 and 300 dimensional varieties and can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that and I believe the performance numbers would be plausible.
> >
> > Apologies - I wasn't clear - I thought of building the 1k or 2k
> > vectors that would be realistic. Perhaps using glove or perhaps using
> > some other software but something that would reflect a true 2k
> > dimensional space accurately with "real" data underneath. I am not
> > familiar enough with the field to tell whether a simple concatenation
> > is a good enough simulation - perhaps it is.
> >
> > I would really prefer to focus on doing this kind of assessment of
> > feasibility/ limitations rather than arguing back and forth. I did my
> > experiment a while ago and I can't really tell whether there have been
> > improvements in the indexing/ merging part - your email contradicts my
> > experience Mike, so I'm a bit intrigued and would like to revisit it.
> > But it'd be ideal to work with real vectors rather than a simulation.
> >
> > Dawid
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 10, 2023, 3:41 PM

Post #84 of 99 (950 views)

I think Gus points are on target.

I recommend we move this forward in this way:
We stop any discussion and everyone interested proposes an option with a
motivation, then we aggregate the options and we create a Vote maybe?

I am also on the same page on the fact that a veto should come with a clear
and reasonable technical merit, which also in my opinion has not come yet.

I also apologise if any of my words sounded harsh or personal attacks,
never meant to do so.

My proposed option:

1) remove the limit and potentially make it configurable,
Motivation:
The system administrator can enforce a limit its users need to respect that
it's in line with whatever the admin decided to be acceptable for them.
Default can stay the current one.

That's my favourite at the moment, but I agree that potentially in the
future this may need to change, as we may optimise the data structures for
certain dimensions. I am a big fan of Yagni (you aren't going to need it)
so I am ok we'll face a different discussion if that happens in the future.

On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:

> What I see so far:
>
> 1. Much positive support for raising the limit
> 2. Slightly less support for removing it or making it configurable
> 3. A single veto which argues that a (as yet undefined) performance
> standard must be met before raising the limit
> 4. Hot tempers (various) making this discussion difficult
>
> As I understand it, vetoes must have technical merit. I'm not sure that
> this veto rises to "technical merit" on 2 counts:
>
> 1. No standard for the performance is given so it cannot be
> technically met. Without hard criteria it's a moving target.
> 2. It appears to encode a valuation of the user's time, and that
> valuation is really up to the user. Some users may consider 2hours useless
> and not worth it, and others might happily wait 2 hours. This is not a
> technical decision, it's a business decision regarding the relative value
> of the time invested vs the value of the result. If I can cure cancer by
> indexing for a year, that might be worth it... (hyperbole of course).
>
> Things I would consider to have technical merit that I don't hear:
>
> 1. Impact on the speed of **other** indexing operations. (devaluation
> of other functionality)
> 2. Actual scenarios that work when the limit is low and fail when the
> limit is high (new failure on the same data with the limit raised).
>
> One thing that might or might not have technical merit
>
> 1. If someone feels there is a lack of documentation of the
> costs/performance implications of using large vectors, possibly including
> reproducible benchmarks establishing the scaling behavior (there seems to
> be disagreement on O(n) vs O(n^2)).
>
> The users *should* know what they are getting into, but if the cost is
> worth it to them, they should be able to pay it without forking the
> project. If this veto causes a fork that's not good.
>
> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com> wrote:
>
>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100
>> and 300 dimensional varieties and can easily enough generate large numbers
>> of vector documents from the articles data. To go higher we could
>> concatenate vectors from that and I believe the performance numbers would
>> be plausible.
>>
>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>>
>>> Can we set up a branch in which the limit is bumped to 2048, then have
>>> a realistic, free data set (wikipedia sample or something) that has,
>>> say, 5 million docs and vectors created using public data (glove
>>> pre-trained embeddings or the like)? We then could run indexing on the
>>> same hardware with 512, 1024 and 2048 and see what the numbers, limits
>>> and behavior actually are.
>>>
>>> I can help in writing this but not until after Easter.
>>>
>>>
>>> Dawid
>>>
>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>>> >
>>> > As Dawid pointed out earlier on this thread, this is the rule for
>>> > Apache projects: a single -1 vote on a code change is a veto and
>>> > cannot be overridden. Furthermore, Robert is one of the people on this
>>> > project who worked the most on debugging subtle bugs, making Lucene
>>> > more robust and improving our test framework, so I'm listening when he
>>> > voices quality concerns.
>>> >
>>> > The argument against removing/raising the limit that resonates with me
>>> > the most is that it is a one-way door. As MikeS highlighted earlier on
>>> > this thread, implementations may want to take advantage of the fact
>>> > that there is a limit at some point too. This is why I don't want to
>>> > remove the limit and would prefer a slight increase, such as 2048 as
>>> > suggested in the original issue, which would enable most of the things
>>> > that users who have been asking about raising the limit would like to
>>> > do.
>>> >
>>> > I agree that the merge-time memory usage and slow indexing rate are
>>> > not great. But it's still possible to index multi-million vector
>>> > datasets with a 4GB heap without hitting OOMEs regardless of the
>>> > number of dimensions, and the feedback I'm seeing is that many users
>>> > are still interested in indexing multi-million vector datasets despite
>>> > the slow indexing rate. I wish we could do better, and vector indexing
>>> > is certainly more expert than text indexing, but it still is usable in
>>> > my opinion. I understand how giving Lucene more information about
>>> > vectors prior to indexing (e.g. clustering information as Jim pointed
>>> > out) could help make merging faster and more memory-efficient, but I
>>> > would really like to avoid making it a requirement for indexing
>>> > vectors as it also makes this feature much harder to use.
>>> >
>>> > On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>> > <a.benedetti@sease.io> wrote:
>>> > >
>>> > > I am very attentive to listen opinions but I am un-convinced here
>>> and I an not sure that a single person opinion should be allowed to be
>>> detrimental for such an important project.
>>> > >
>>> > > The limit as far as I know is literally just raising an exception.
>>> > > Removing it won't alter in any way the current performance for users
>>> in low dimensional space.
>>> > > Removing it will just enable more users to use Lucene.
>>> > >
>>> > > If new users in certain situations will be unhappy with the
>>> performance, they may contribute improvements.
>>> > > This is how you make progress.
>>> > >
>>> > > If it's a reputation thing, trust me that not allowing users to play
>>> with high dimensional space will equally damage it.
>>> > >
>>> > > To me it's really a no brainer.
>>> > > Removing the limit and enable people to use high dimensional vectors
>>> will take minutes.
>>> > > Improving the hnsw implementation can take months.
>>> > > Pick one to begin with...
>>> > >
>>> > > And there's no-one paying me here, no company interest whatsoever,
>>> actually I pay people to contribute, I am just convinced it's a good idea.
>>> > >
>>> > >
>>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>> > >>
>>> > >> I disagree with your categorization. I put in plenty of work and
>>> > >> experienced plenty of pain myself, writing tests and fighting these
>>> > >> issues, after i saw that, two releases in a row, vector indexing
>>> fell
>>> > >> over and hit integer overflows etc on small datasets:
>>> > >>
>>> > >> https://github.com/apache/lucene/pull/11905
>>> > >>
>>> > >> Attacking me isn't helping the situation.
>>> > >>
>>> > >> PS: when i said the "one guy who wrote the code" I didn't mean it in
>>> > >> any kind of demeaning fashion really. I meant to describe the
>>> current
>>> > >> state of usability with respect to indexing a few million docs with
>>> > >> high dimensions. You can scroll up the thread and see that at least
>>> > >> one other committer on the project experienced similar pain as me.
>>> > >> Then, think about users who aren't committers trying to use the
>>> > >> functionality!
>>> > >>
>>> > >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>> > >> >
>>> > >> > What you said about increasing dimensions requiring a bigger ram
>>> buffer on merge is wrong. That's the point I was trying to make. Your
>>> concerns about merge costs are not wrong, but your conclusion that we need
>>> to limit dimensions is not justified.
>>> > >> >
>>> > >> > You complain that hnsw sucks it doesn't scale, but when I show it
>>> scales linearly with dimension you just ignore that and complain about
>>> something entirely different.
>>> > >> >
>>> > >> > You demand that people run all kinds of tests to prove you wrong
>>> but when they do, you don't listen and you won't put in the work yourself
>>> or complain that it's too hard.
>>> > >> >
>>> > >> > Then you complain about people not meeting you half way. Wow
>>> > >> >
>>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
>>> wrote:
>>> > >> >>
>>> > >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>> > >> >> <michael.wechner@wyona.com> wrote:
>>> > >> >> >
>>> > >> >> > What exactly do you consider reasonable?
>>> > >> >>
>>> > >> >> Let's begin a real discussion by being HONEST about the current
>>> > >> >> status. Please put politically correct or your own company's
>>> wishes
>>> > >> >> aside, we know it's not in a good state.
>>> > >> >>
>>> > >> >> Current status is the one guy who wrote the code can set a
>>> > >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>>> > >> >> dimensions in HOURS (i didn't ask what hardware).
>>> > >> >>
>>> > >> >> My concerns are everyone else except the one guy, I want it to be
>>> > >> >> usable. Increasing dimensions just means even bigger
>>> multi-gigabyte
>>> > >> >> ram buffer and bigger heap to avoid OOM on merge.
>>> > >> >> It is also a permanent backwards compatibility decision, we have
>>> to
>>> > >> >> support it once we do this and we can't just say "oops" and flip
>>> it
>>> > >> >> back.
>>> > >> >>
>>> > >> >> It is unclear to me, if the multi-gigabyte ram buffer is really
>>> to
>>> > >> >> avoid merges because they are so slow and it would be DAYS
>>> otherwise,
>>> > >> >> or if its to avoid merges so it doesn't hit OOM.
>>> > >> >> Also from personal experience, it takes trial and error (means
>>> > >> >> experiencing OOM on merge!!!) before you get those heap values
>>> correct
>>> > >> >> for your dataset. This usually means starting over which is
>>> > >> >> frustrating and wastes more time.
>>> > >> >>
>>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
>>> seems
>>> > >> >> to me like its a good idea. maybe the multigigabyte ram buffer
>>> can be
>>> > >> >> avoided in this way and performance improved by writing bigger
>>> > >> >> segments with lucene's defaults. But this doesn't mean we can
>>> simply
>>> > >> >> ignore the horrors of what happens on merge. merging needs to
>>> scale so
>>> > >> >> that indexing really scales.
>>> > >> >>
>>> > >> >> At least it shouldnt spike RAM on trivial data amounts and cause
>>> OOM,
>>> > >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>> > >> >> fashion when indexing.
>>> > >> >>
>>> > >> >>
>>> ---------------------------------------------------------------------
>>> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> > >> >>
>>> > >>
>>> > >>
>>> ---------------------------------------------------------------------
>>> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> > >>
>>> >
>>> >
>>> > --
>>> > Adrien
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

lucene at mikemccandless

Apr 11, 2023, 5:29 AM

Post #85 of 99 (950 views)

+1 to test on real vector data -- if you test on synthetic data you draw
synthetic conclusions.

Can someone post the theoretical performance (CPU and RAM required) of HNSW
construction? Do we know/believe our HNSW implementation has achieved that
theoretical big-O performance? Maybe we have some silly performance bug
that's causing it not to?

As I understand it, HNSW makes the tradeoff of costly construction for
faster searching, which is typically the right tradeoff for search use
cases. We do this in other parts of the Lucene index too.

Lucene will do a logarithmic number of merges over time, i.e. each doc will
be merged O(log(N)) times in its lifetime in the index. We need to
multiply that by the cost of re-building the whole HNSW graph on each
merge. BTW, other things in Lucene, like BKD/dimensional points, also
rebuild the whole data structure on each merge, I think? But, as Rob
pointed out, stored fields merging do indeed do some sneaky tricks to avoid
excessive block decompress/recompress on each merge.

> As I understand it, vetoes must have technical merit. I'm not sure that
this veto rises to "technical merit" on 2 counts:

Actually I think Robert's veto stands on its technical merit already.
Robert's take on technical matters very much resonate with me, even if he
is sometimes prickly in how he expresses them ;)

His point is that we, as a dev community, are not paying enough attention
to the indexing performance of our KNN algo (HNSW) and implementation, and
that it is reckless to increase / remove limits in that state. It is
indeed a one-way door decision and one must confront such decisions with
caution, especially for such a widely used base infrastructure as Lucene.
We don't even advertise today in our javadocs that you need XXX heap if you
index vectors with dimension Y, fanout X, levels Z, etc.

RAM used during merging is unaffected by dimensionality, but is affected by
fanout, because the HNSW graph (not the raw vectors) is memory resident, I
think? Maybe we could move it off-heap and let the OS manage the memory
(and still document the RAM requirements)? Maybe merge RAM costs should be
accounted for in IW's RAM buffer accounting? It is not today, and there
are some other things that use non-trivial RAM, e.g. the doc mapping (to
compress docid space when deletions are reclaimed).

When we added KNN vector testing to Lucene's nightly benchmarks, the
indexing time massively increased -- see annotations DH and DP here:
https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly
benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of
course, that is using a single thread for indexing (on a box that has 128
cores!) so we produce a deterministic index every night ...

Stepping out (meta) a bit ... this discussion is precisely one of the
awesome benefits of the (informed) veto. It means risky changes to the
software, as determined by any single informed developer on the project,
can force a healthy discussion about the problem at hand. Robert is
legitimately concerned about a real issue and so we should use our creative
energies to characterize our HNSW implementation's performance, document it
clearly for users, and uncover ways to improve it.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> I think Gus points are on target.
>
> I recommend we move this forward in this way:
> We stop any discussion and everyone interested proposes an option with a
> motivation, then we aggregate the options and we create a Vote maybe?
>
> I am also on the same page on the fact that a veto should come with a
> clear and reasonable technical merit, which also in my opinion has not come
> yet.
>
> I also apologise if any of my words sounded harsh or personal attacks,
> never meant to do so.
>
> My proposed option:
>
> 1) remove the limit and potentially make it configurable,
> Motivation:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> Default can stay the current one.
>
> That's my favourite at the moment, but I agree that potentially in the
> future this may need to change, as we may optimise the data structures for
> certain dimensions. I am a big fan of Yagni (you aren't going to need it)
> so I am ok we'll face a different discussion if that happens in the future.
>
>
>
> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
>
>> What I see so far:
>>
>> 1. Much positive support for raising the limit
>> 2. Slightly less support for removing it or making it configurable
>> 3. A single veto which argues that a (as yet undefined) performance
>> standard must be met before raising the limit
>> 4. Hot tempers (various) making this discussion difficult
>>
>> As I understand it, vetoes must have technical merit. I'm not sure that
>> this veto rises to "technical merit" on 2 counts:
>>
>> 1. No standard for the performance is given so it cannot be
>> technically met. Without hard criteria it's a moving target.
>> 2. It appears to encode a valuation of the user's time, and that
>> valuation is really up to the user. Some users may consider 2hours useless
>> and not worth it, and others might happily wait 2 hours. This is not a
>> technical decision, it's a business decision regarding the relative value
>> of the time invested vs the value of the result. If I can cure cancer by
>> indexing for a year, that might be worth it... (hyperbole of course).
>>
>> Things I would consider to have technical merit that I don't hear:
>>
>> 1. Impact on the speed of **other** indexing operations. (devaluation
>> of other functionality)
>> 2. Actual scenarios that work when the limit is low and fail when the
>> limit is high (new failure on the same data with the limit raised).
>>
>> One thing that might or might not have technical merit
>>
>> 1. If someone feels there is a lack of documentation of the
>> costs/performance implications of using large vectors, possibly including
>> reproducible benchmarks establishing the scaling behavior (there seems to
>> be disagreement on O(n) vs O(n^2)).
>>
>> The users *should* know what they are getting into, but if the cost is
>> worth it to them, they should be able to pay it without forking the
>> project. If this veto causes a fork that's not good.
>>
>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>>
>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100
>>> and 300 dimensional varieties and can easily enough generate large numbers
>>> of vector documents from the articles data. To go higher we could
>>> concatenate vectors from that and I believe the performance numbers would
>>> be plausible.
>>>
>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>>>
>>>> Can we set up a branch in which the limit is bumped to 2048, then have
>>>> a realistic, free data set (wikipedia sample or something) that has,
>>>> say, 5 million docs and vectors created using public data (glove
>>>> pre-trained embeddings or the like)? We then could run indexing on the
>>>> same hardware with 512, 1024 and 2048 and see what the numbers, limits
>>>> and behavior actually are.
>>>>
>>>> I can help in writing this but not until after Easter.
>>>>
>>>>
>>>> Dawid
>>>>
>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>>>> >
>>>> > As Dawid pointed out earlier on this thread, this is the rule for
>>>> > Apache projects: a single -1 vote on a code change is a veto and
>>>> > cannot be overridden. Furthermore, Robert is one of the people on this
>>>> > project who worked the most on debugging subtle bugs, making Lucene
>>>> > more robust and improving our test framework, so I'm listening when he
>>>> > voices quality concerns.
>>>> >
>>>> > The argument against removing/raising the limit that resonates with me
>>>> > the most is that it is a one-way door. As MikeS highlighted earlier on
>>>> > this thread, implementations may want to take advantage of the fact
>>>> > that there is a limit at some point too. This is why I don't want to
>>>> > remove the limit and would prefer a slight increase, such as 2048 as
>>>> > suggested in the original issue, which would enable most of the things
>>>> > that users who have been asking about raising the limit would like to
>>>> > do.
>>>> >
>>>> > I agree that the merge-time memory usage and slow indexing rate are
>>>> > not great. But it's still possible to index multi-million vector
>>>> > datasets with a 4GB heap without hitting OOMEs regardless of the
>>>> > number of dimensions, and the feedback I'm seeing is that many users
>>>> > are still interested in indexing multi-million vector datasets despite
>>>> > the slow indexing rate. I wish we could do better, and vector indexing
>>>> > is certainly more expert than text indexing, but it still is usable in
>>>> > my opinion. I understand how giving Lucene more information about
>>>> > vectors prior to indexing (e.g. clustering information as Jim pointed
>>>> > out) could help make merging faster and more memory-efficient, but I
>>>> > would really like to avoid making it a requirement for indexing
>>>> > vectors as it also makes this feature much harder to use.
>>>> >
>>>> > On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>>> > <a.benedetti@sease.io> wrote:
>>>> > >
>>>> > > I am very attentive to listen opinions but I am un-convinced here
>>>> and I an not sure that a single person opinion should be allowed to be
>>>> detrimental for such an important project.
>>>> > >
>>>> > > The limit as far as I know is literally just raising an exception.
>>>> > > Removing it won't alter in any way the current performance for
>>>> users in low dimensional space.
>>>> > > Removing it will just enable more users to use Lucene.
>>>> > >
>>>> > > If new users in certain situations will be unhappy with the
>>>> performance, they may contribute improvements.
>>>> > > This is how you make progress.
>>>> > >
>>>> > > If it's a reputation thing, trust me that not allowing users to
>>>> play with high dimensional space will equally damage it.
>>>> > >
>>>> > > To me it's really a no brainer.
>>>> > > Removing the limit and enable people to use high dimensional
>>>> vectors will take minutes.
>>>> > > Improving the hnsw implementation can take months.
>>>> > > Pick one to begin with...
>>>> > >
>>>> > > And there's no-one paying me here, no company interest whatsoever,
>>>> actually I pay people to contribute, I am just convinced it's a good idea.
>>>> > >
>>>> > >
>>>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>>> > >>
>>>> > >> I disagree with your categorization. I put in plenty of work and
>>>> > >> experienced plenty of pain myself, writing tests and fighting these
>>>> > >> issues, after i saw that, two releases in a row, vector indexing
>>>> fell
>>>> > >> over and hit integer overflows etc on small datasets:
>>>> > >>
>>>> > >> https://github.com/apache/lucene/pull/11905
>>>> > >>
>>>> > >> Attacking me isn't helping the situation.
>>>> > >>
>>>> > >> PS: when i said the "one guy who wrote the code" I didn't mean it
>>>> in
>>>> > >> any kind of demeaning fashion really. I meant to describe the
>>>> current
>>>> > >> state of usability with respect to indexing a few million docs with
>>>> > >> high dimensions. You can scroll up the thread and see that at least
>>>> > >> one other committer on the project experienced similar pain as me.
>>>> > >> Then, think about users who aren't committers trying to use the
>>>> > >> functionality!
>>>> > >>
>>>> > >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
>>>> msokolov@gmail.com> wrote:
>>>> > >> >
>>>> > >> > What you said about increasing dimensions requiring a bigger ram
>>>> buffer on merge is wrong. That's the point I was trying to make. Your
>>>> concerns about merge costs are not wrong, but your conclusion that we need
>>>> to limit dimensions is not justified.
>>>> > >> >
>>>> > >> > You complain that hnsw sucks it doesn't scale, but when I show
>>>> it scales linearly with dimension you just ignore that and complain about
>>>> something entirely different.
>>>> > >> >
>>>> > >> > You demand that people run all kinds of tests to prove you wrong
>>>> but when they do, you don't listen and you won't put in the work yourself
>>>> or complain that it's too hard.
>>>> > >> >
>>>> > >> > Then you complain about people not meeting you half way. Wow
>>>> > >> >
>>>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
>>>> wrote:
>>>> > >> >>
>>>> > >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>>> > >> >> <michael.wechner@wyona.com> wrote:
>>>> > >> >> >
>>>> > >> >> > What exactly do you consider reasonable?
>>>> > >> >>
>>>> > >> >> Let's begin a real discussion by being HONEST about the current
>>>> > >> >> status. Please put politically correct or your own company's
>>>> wishes
>>>> > >> >> aside, we know it's not in a good state.
>>>> > >> >>
>>>> > >> >> Current status is the one guy who wrote the code can set a
>>>> > >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>>>> > >> >> dimensions in HOURS (i didn't ask what hardware).
>>>> > >> >>
>>>> > >> >> My concerns are everyone else except the one guy, I want it to
>>>> be
>>>> > >> >> usable. Increasing dimensions just means even bigger
>>>> multi-gigabyte
>>>> > >> >> ram buffer and bigger heap to avoid OOM on merge.
>>>> > >> >> It is also a permanent backwards compatibility decision, we
>>>> have to
>>>> > >> >> support it once we do this and we can't just say "oops" and
>>>> flip it
>>>> > >> >> back.
>>>> > >> >>
>>>> > >> >> It is unclear to me, if the multi-gigabyte ram buffer is really
>>>> to
>>>> > >> >> avoid merges because they are so slow and it would be DAYS
>>>> otherwise,
>>>> > >> >> or if its to avoid merges so it doesn't hit OOM.
>>>> > >> >> Also from personal experience, it takes trial and error (means
>>>> > >> >> experiencing OOM on merge!!!) before you get those heap values
>>>> correct
>>>> > >> >> for your dataset. This usually means starting over which is
>>>> > >> >> frustrating and wastes more time.
>>>> > >> >>
>>>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
>>>> seems
>>>> > >> >> to me like its a good idea. maybe the multigigabyte ram buffer
>>>> can be
>>>> > >> >> avoided in this way and performance improved by writing bigger
>>>> > >> >> segments with lucene's defaults. But this doesn't mean we can
>>>> simply
>>>> > >> >> ignore the horrors of what happens on merge. merging needs to
>>>> scale so
>>>> > >> >> that indexing really scales.
>>>> > >> >>
>>>> > >> >> At least it shouldnt spike RAM on trivial data amounts and
>>>> cause OOM,
>>>> > >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>>> > >> >> fashion when indexing.
>>>> > >> >>
>>>> > >> >>
>>>> ---------------------------------------------------------------------
>>>> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>>> > >> >>
>>>> > >>
>>>> > >>
>>>> ---------------------------------------------------------------------
>>>> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> > >> For additional commands, e-mail: dev-help@lucene.apache.org
>>>> > >>
>>>> >
>>>> >
>>>> > --
>>>> > Adrien
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 11, 2023, 6:30 AM

Post #86 of 99 (950 views)

What exactly do you consider real vector data? Vector data which is
based on texts written by humans?

I am asking, because I recently attended the following presentation by
Anastassia Shaitarova (UZH Institute for Computational Linguistics,
https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)

*----
*

*Can we Identify Machine-Generated Text? An Overview of Current Approaches*
by Anastassia Shaitarova (UZH Institute for Computational Linguistics)

/The detection of machine-generated text has become increasingly
important due to the prevalence of automated content generation and its
potential for misuse. In this talk, we will discuss the motivation for
automatic detection of generated text. We will present the currently
available methods, including feature-based classification as a “first
line-of-defense.” We will provide an overview of the detection tools
that have been made available so far and discuss their limitations.
Finally, we will reflect on some open problems associated with the
automatic discrimination of generated texts./

/----/

and her conclusion was that it has become basically impossible to
differentiate between text generated by humans and text generated by for
example ChatGPT.

Whereas others have a slightly different opinion, see for example

https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/

But I would argue that real world and synthetic have become close enough
that testing performance and scalability of indexing should be possible
with synthetic data.

I completely agree that we have to base our discussions and decisions on
scientific methods and that we have to make sure that Lucene performs
and scales well and that we understand the limits and what is going on
under the hood.

Thanks

Michael W

Am 11.04.23 um 14:29 schrieb Michael McCandless:
> +1 to test on real vector data -- if you test on synthetic data you
> draw synthetic conclusions.
>
> Can someone post the theoretical performance (CPU and RAM required) of
> HNSW construction? Do we know/believe our HNSW implementation has
> achieved that theoretical big-O performance? Maybe we have some silly
> performance bug that's causing it not to?
>
> As I understand it, HNSW makes the tradeoff of costly construction for
> faster searching, which is typically the right tradeoff for search use
> cases. We do this in other parts of the Lucene index too.
>
> Lucene will do a logarithmic number of merges over time, i.e. each doc
> will be merged O(log(N)) times in its lifetime in the index. We need
> to multiply that by the cost of re-building the whole HNSW graph on
> each merge. BTW, other things in Lucene, like BKD/dimensional points,
> also rebuild the whole data structure on each merge, I think? But, as
> Rob pointed out, stored fields merging do indeed do some sneaky tricks
> to avoid excessive block decompress/recompress on each merge.
>
> > As I understand it, vetoes must have technical merit. I'm not sure
> that this veto rises to "technical merit" on 2 counts:
>
> Actually I think Robert's veto stands on its technical merit already.
> Robert's take on technical matters very much resonate with me, even if
> he is sometimes prickly in how he expresses them ;)
>
> His point is that we, as a dev community, are not paying enough
> attention to the indexing performance of our KNN algo (HNSW) and
> implementation, and that it is reckless to increase / remove limits in
> that state. It is indeed a one-way door decision and one must
> confront such decisions with caution, especially for such a widely
> used base infrastructure as Lucene. We don't even advertise today in
> our javadocs that you need XXX heap if you index vectors with
> dimension Y, fanout X, levels Z, etc.
>
> RAM used during merging is unaffected by dimensionality, but is
> affected by fanout, because the HNSW graph (not the raw vectors) is
> memory resident, I think? Maybe we could move it off-heap and let the
> OS manage the memory (and still document the RAM requirements)? Maybe
> merge RAM costs should be accounted for in IW's RAM buffer
> accounting? It is not today, and there are some other things that use
> non-trivial RAM, e.g. the doc mapping (to compress docid space when
> deletions are reclaimed).
>
> When we added KNN vector testing to Lucene's nightly benchmarks, the
> indexing time massively increased -- see annotations DH and DP here:
> https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly
> benchmarks now start at 6 PM and don't finish until ~14.5 hours
> later. Of course, that is using a single thread for indexing (on a
> box that has 128 cores!) so we produce a deterministic index every
> night ...
>
> Stepping out (meta) a bit ... this discussion is precisely one of the
> awesome benefits of the (informed) veto. It means risky changes to
> the software, as determined by any single informed developer on the
> project, can force a healthy discussion about the problem at hand.
> Robert is legitimately concerned about a real issue and so we should
> use our creative energies to characterize our HNSW implementation's
> performance, document it clearly for users, and uncover ways to
> improve it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> I think Gus points are on target.
>
> I recommend we move this forward in this way:
> We stop any discussion and everyone interested proposes an option
> with a motivation, then we aggregate the options and we create a
> Vote maybe?
>
> I am also on the same page on the fact that a veto should come
> with a clear and reasonable technical merit, which also in my
> opinion has not come yet.
>
> I also apologise if any of my words sounded harsh or personal
> attacks, never meant to do so.
>
> My proposed option:
>
> 1) remove the limit and potentially make it configurable,
> Motivation:
> The system administrator can enforce a limit its users need to
> respect that it's in line with whatever the admin decided to be
> acceptable for them.
> Default can stay the current one.
>
> That's my favourite at the moment, but I agree that potentially in
> the future this may need to change, as we may optimise the data
> structures for certain dimensions. I am a big fan of Yagni (you
> aren't going to need it) so I am ok we'll face a different
> discussion if that happens in the future.
>
>
>
> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
>
> What I see so far:
>
> 1. Much positive support for raising the limit
> 2. Slightly less support for removing it or making it
> configurable
> 3. A single veto which argues that a (as yet undefined)
> performance standard must be met before raising the limit
> 4. Hot tempers (various) making this discussion difficult
>
> As I understand it, vetoes must have technical merit. I'm not
> sure that this veto rises to "technical merit" on 2 counts:
>
> 1. No standard for the performance is given so it cannot be
> technically met. Without hard criteria it's a moving target.
> 2. It appears to encode a valuation of the user's time, and
> that valuation is really up to the user. Some users may
> consider 2hours useless and not worth it, and others might
> happily wait 2 hours. This is not a technical decision,
> it's a business decision regarding the relative value of
> the time invested vs the value of the result. If I can
> cure cancer by indexing for a year, that might be worth
> it... (hyperbole of course).
>
> Things I would consider to have technical merit that I don't hear:
>
> 1. Impact on the speed of **other** indexing operations.
> (devaluation of other functionality)
> 2. Actual scenarios that work when the limit is low and fail
> when the limit is high (new failure on the same data with
> the limit raised).
>
> One thing that might or might not have technical merit
>
> 1. If someone feels there is a lack of documentation of the
> costs/performance implications of using large vectors,
> possibly including reproducible benchmarks establishing
> the scaling behavior (there seems to be disagreement on
> O(n) vs O(n^2)).
>
> The users *should* know what they are getting into, but if the
> cost is worth it to them, they should be able to pay it
> without forking the project. If this veto causes a fork that's
> not good.
>
> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov
> <msokolov@gmail.com> wrote:
>
> We do have a dataset built from Wikipedia in luceneutil.
> It comes in 100 and 300 dimensional varieties and can
> easily enough generate large numbers of vector documents
> from the articles data. To go higher we could concatenate
> vectors from that and I believe the performance numbers
> would be plausible.
>
> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss
> <dawid.weiss@gmail.com> wrote:
>
> Can we set up a branch in which the limit is bumped to
> 2048, then have
> a realistic, free data set (wikipedia sample or
> something) that has,
> say, 5 million docs and vectors created using public
> data (glove
> pre-trained embeddings or the like)? We then could run
> indexing on the
> same hardware with 512, 1024 and 2048 and see what the
> numbers, limits
> and behavior actually are.
>
> I can help in writing this but not until after Easter.
>
>
> Dawid
>
> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand
> <jpountz@gmail.com> wrote:
> >
> > As Dawid pointed out earlier on this thread, this is
> the rule for
> > Apache projects: a single -1 vote on a code change
> is a veto and
> > cannot be overridden. Furthermore, Robert is one of
> the people on this
> > project who worked the most on debugging subtle
> bugs, making Lucene
> > more robust and improving our test framework, so I'm
> listening when he
> > voices quality concerns.
> >
> > The argument against removing/raising the limit that
> resonates with me
> > the most is that it is a one-way door. As MikeS
> highlighted earlier on
> > this thread, implementations may want to take
> advantage of the fact
> > that there is a limit at some point too. This is why
> I don't want to
> > remove the limit and would prefer a slight increase,
> such as 2048 as
> > suggested in the original issue, which would enable
> most of the things
> > that users who have been asking about raising the
> limit would like to
> > do.
> >
> > I agree that the merge-time memory usage and slow
> indexing rate are
> > not great. But it's still possible to index
> multi-million vector
> > datasets with a 4GB heap without hitting OOMEs
> regardless of the
> > number of dimensions, and the feedback I'm seeing is
> that many users
> > are still interested in indexing multi-million
> vector datasets despite
> > the slow indexing rate. I wish we could do better,
> and vector indexing
> > is certainly more expert than text indexing, but it
> still is usable in
> > my opinion. I understand how giving Lucene more
> information about
> > vectors prior to indexing (e.g. clustering
> information as Jim pointed
> > out) could help make merging faster and more
> memory-efficient, but I
> > would really like to avoid making it a requirement
> for indexing
> > vectors as it also makes this feature much harder to
> use.
> >
> > On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> > <a.benedetti@sease.io> wrote:
> > >
> > > I am very attentive to listen opinions but I am
> un-convinced here and I an not sure that a single
> person opinion should be allowed to be detrimental for
> such an important project.
> > >
> > > The limit as far as I know is literally just
> raising an exception.
> > > Removing it won't alter in any way the current
> performance for users in low dimensional space.
> > > Removing it will just enable more users to use Lucene.
> > >
> > > If new users in certain situations will be unhappy
> with the performance, they may contribute improvements.
> > > This is how you make progress.
> > >
> > > If it's a reputation thing, trust me that not
> allowing users to play with high dimensional space
> will equally damage it.
> > >
> > > To me it's really a no brainer.
> > > Removing the limit and enable people to use high
> dimensional vectors will take minutes.
> > > Improving the hnsw implementation can take months.
> > > Pick one to begin with...
> > >
> > > And there's no-one paying me here, no company
> interest whatsoever, actually I pay people to
> contribute, I am just convinced it's a good idea.
> > >
> > >
> > > On Sat, 8 Apr 2023, 18:57 Robert Muir,
> <rcmuir@gmail.com> wrote:
> > >>
> > >> I disagree with your categorization. I put in
> plenty of work and
> > >> experienced plenty of pain myself, writing tests
> and fighting these
> > >> issues, after i saw that, two releases in a row,
> vector indexing fell
> > >> over and hit integer overflows etc on small datasets:
> > >>
> > >> https://github.com/apache/lucene/pull/11905
> > >>
> > >> Attacking me isn't helping the situation.
> > >>
> > >> PS: when i said the "one guy who wrote the code"
> I didn't mean it in
> > >> any kind of demeaning fashion really. I meant to
> describe the current
> > >> state of usability with respect to indexing a few
> million docs with
> > >> high dimensions. You can scroll up the thread and
> see that at least
> > >> one other committer on the project experienced
> similar pain as me.
> > >> Then, think about users who aren't committers
> trying to use the
> > >> functionality!
> > >>
> > >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov
> <msokolov@gmail.com> wrote:
> > >> >
> > >> > What you said about increasing dimensions
> requiring a bigger ram buffer on merge is wrong.
> That's the point I was trying to make. Your concerns
> about merge costs are not wrong, but your conclusion
> that we need to limit dimensions is not justified.
> > >> >
> > >> > You complain that hnsw sucks it doesn't scale,
> but when I show it scales linearly with dimension you
> just ignore that and complain about something entirely
> different.
> > >> >
> > >> > You demand that people run all kinds of tests
> to prove you wrong but when they do, you don't listen
> and you won't put in the work yourself or complain
> that it's too hard.
> > >> >
> > >> > Then you complain about people not meeting you
> half way. Wow
> > >> >
> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir
> <rcmuir@gmail.com> wrote:
> > >> >>
> > >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> > >> >> <michael.wechner@wyona.com> wrote:
> > >> >> >
> > >> >> > What exactly do you consider reasonable?
> > >> >>
> > >> >> Let's begin a real discussion by being HONEST
> about the current
> > >> >> status. Please put politically correct or your
> own company's wishes
> > >> >> aside, we know it's not in a good state.
> > >> >>
> > >> >> Current status is the one guy who wrote the
> code can set a
> > >> >> multi-gigabyte ram buffer and index a small
> dataset with 1024
> > >> >> dimensions in HOURS (i didn't ask what hardware).
> > >> >>
> > >> >> My concerns are everyone else except the one
> guy, I want it to be
> > >> >> usable. Increasing dimensions just means even
> bigger multi-gigabyte
> > >> >> ram buffer and bigger heap to avoid OOM on merge.
> > >> >> It is also a permanent backwards compatibility
> decision, we have to
> > >> >> support it once we do this and we can't just
> say "oops" and flip it
> > >> >> back.
> > >> >>
> > >> >> It is unclear to me, if the multi-gigabyte ram
> buffer is really to
> > >> >> avoid merges because they are so slow and it
> would be DAYS otherwise,
> > >> >> or if its to avoid merges so it doesn't hit OOM.
> > >> >> Also from personal experience, it takes trial
> and error (means
> > >> >> experiencing OOM on merge!!!) before you get
> those heap values correct
> > >> >> for your dataset. This usually means starting
> over which is
> > >> >> frustrating and wastes more time.
> > >> >>
> > >> >> Jim mentioned some ideas about the memory
> usage in IndexWriter, seems
> > >> >> to me like its a good idea. maybe the
> multigigabyte ram buffer can be
> > >> >> avoided in this way and performance improved
> by writing bigger
> > >> >> segments with lucene's defaults. But this
> doesn't mean we can simply
> > >> >> ignore the horrors of what happens on merge.
> merging needs to scale so
> > >> >> that indexing really scales.
> > >> >>
> > >> >> At least it shouldnt spike RAM on trivial data
> amounts and cause OOM,
> > >> >> and definitely it shouldnt burn hours and
> hours of CPU in O(n^2)
> > >> >> fashion when indexing.
> > >> >>
> > >> >>
> ---------------------------------------------------------------------
> > >> >> To unsubscribe, e-mail:
> dev-unsubscribe@lucene.apache.org
> > >> >> For additional commands, e-mail:
> dev-help@lucene.apache.org
> > >> >>
> > >>
> > >>
> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail:
> dev-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail:
> dev-help@lucene.apache.org
> > >>
> >
> >
> > --
> > Adrien
> >
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail:
> dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail:
> dev-help@lucene.apache.org
>
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 11, 2023, 10:05 AM

Post #87 of 99 (950 views)

> What exactly do you consider real vector data? Vector data which is based on texts written by humans?

We have plenty of text; the problem is coming up with a realistic
vector model that requires as many dimensions as people seem to be
demanding. As I said above, after surveying huggingface I couldn't
find any text-based model using more than 768 dimensions. So far we
have some ideas of generating higher-dimensional data by dithering or
concatenating existing data, but it seems artificial.

On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>
> I am asking, because I recently attended the following presentation by Anastassia Shaitarova (UZH Institute for Computational Linguistics, https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>
> ----
>
> Can we Identify Machine-Generated Text? An Overview of Current Approaches
> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>
> The detection of machine-generated text has become increasingly important due to the prevalence of automated content generation and its potential for misuse. In this talk, we will discuss the motivation for automatic detection of generated text. We will present the currently available methods, including feature-based classification as a “first line-of-defense.” We will provide an overview of the detection tools that have been made available so far and discuss their limitations. Finally, we will reflect on some open problems associated with the automatic discrimination of generated texts.
>
> ----
>
> and her conclusion was that it has become basically impossible to differentiate between text generated by humans and text generated by for example ChatGPT.
>
> Whereas others have a slightly different opinion, see for example
>
> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>
> But I would argue that real world and synthetic have become close enough that testing performance and scalability of indexing should be possible with synthetic data.
>
> I completely agree that we have to base our discussions and decisions on scientific methods and that we have to make sure that Lucene performs and scales well and that we understand the limits and what is going on under the hood.
>
> Thanks
>
> Michael W
>
>
>
>
>
> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>
> +1 to test on real vector data -- if you test on synthetic data you draw synthetic conclusions.
>
> Can someone post the theoretical performance (CPU and RAM required) of HNSW construction? Do we know/believe our HNSW implementation has achieved that theoretical big-O performance? Maybe we have some silly performance bug that's causing it not to?
>
> As I understand it, HNSW makes the tradeoff of costly construction for faster searching, which is typically the right tradeoff for search use cases. We do this in other parts of the Lucene index too.
>
> Lucene will do a logarithmic number of merges over time, i.e. each doc will be merged O(log(N)) times in its lifetime in the index. We need to multiply that by the cost of re-building the whole HNSW graph on each merge. BTW, other things in Lucene, like BKD/dimensional points, also rebuild the whole data structure on each merge, I think? But, as Rob pointed out, stored fields merging do indeed do some sneaky tricks to avoid excessive block decompress/recompress on each merge.
>
> > As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>
> Actually I think Robert's veto stands on its technical merit already. Robert's take on technical matters very much resonate with me, even if he is sometimes prickly in how he expresses them ;)
>
> His point is that we, as a dev community, are not paying enough attention to the indexing performance of our KNN algo (HNSW) and implementation, and that it is reckless to increase / remove limits in that state. It is indeed a one-way door decision and one must confront such decisions with caution, especially for such a widely used base infrastructure as Lucene. We don't even advertise today in our javadocs that you need XXX heap if you index vectors with dimension Y, fanout X, levels Z, etc.
>
> RAM used during merging is unaffected by dimensionality, but is affected by fanout, because the HNSW graph (not the raw vectors) is memory resident, I think? Maybe we could move it off-heap and let the OS manage the memory (and still document the RAM requirements)? Maybe merge RAM costs should be accounted for in IW's RAM buffer accounting? It is not today, and there are some other things that use non-trivial RAM, e.g. the doc mapping (to compress docid space when deletions are reclaimed).
>
> When we added KNN vector testing to Lucene's nightly benchmarks, the indexing time massively increased -- see annotations DH and DP here: https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of course, that is using a single thread for indexing (on a box that has 128 cores!) so we produce a deterministic index every night ...
>
> Stepping out (meta) a bit ... this discussion is precisely one of the awesome benefits of the (informed) veto. It means risky changes to the software, as determined by any single informed developer on the project, can force a healthy discussion about the problem at hand. Robert is legitimately concerned about a real issue and so we should use our creative energies to characterize our HNSW implementation's performance, document it clearly for users, and uncover ways to improve it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti <a.benedetti@sease.io> wrote:
>>
>> I think Gus points are on target.
>>
>> I recommend we move this forward in this way:
>> We stop any discussion and everyone interested proposes an option with a motivation, then we aggregate the options and we create a Vote maybe?
>>
>> I am also on the same page on the fact that a veto should come with a clear and reasonable technical merit, which also in my opinion has not come yet.
>>
>> I also apologise if any of my words sounded harsh or personal attacks, never meant to do so.
>>
>> My proposed option:
>>
>> 1) remove the limit and potentially make it configurable,
>> Motivation:
>> The system administrator can enforce a limit its users need to respect that it's in line with whatever the admin decided to be acceptable for them.
>> Default can stay the current one.
>>
>> That's my favourite at the moment, but I agree that potentially in the future this may need to change, as we may optimise the data structures for certain dimensions. I am a big fan of Yagni (you aren't going to need it) so I am ok we'll face a different discussion if that happens in the future.
>>
>>
>>
>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
>>>
>>> What I see so far:
>>>
>>> Much positive support for raising the limit
>>> Slightly less support for removing it or making it configurable
>>> A single veto which argues that a (as yet undefined) performance standard must be met before raising the limit
>>> Hot tempers (various) making this discussion difficult
>>>
>>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>>
>>> No standard for the performance is given so it cannot be technically met. Without hard criteria it's a moving target.
>>> It appears to encode a valuation of the user's time, and that valuation is really up to the user. Some users may consider 2hours useless and not worth it, and others might happily wait 2 hours. This is not a technical decision, it's a business decision regarding the relative value of the time invested vs the value of the result. If I can cure cancer by indexing for a year, that might be worth it... (hyperbole of course).
>>>
>>> Things I would consider to have technical merit that I don't hear:
>>>
>>> Impact on the speed of **other** indexing operations. (devaluation of other functionality)
>>> Actual scenarios that work when the limit is low and fail when the limit is high (new failure on the same data with the limit raised).
>>>
>>> One thing that might or might not have technical merit
>>>
>>> If someone feels there is a lack of documentation of the costs/performance implications of using large vectors, possibly including reproducible benchmarks establishing the scaling behavior (there seems to be disagreement on O(n) vs O(n^2)).
>>>
>>> The users *should* know what they are getting into, but if the cost is worth it to them, they should be able to pay it without forking the project. If this veto causes a fork that's not good.
>>>
>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>>>
>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100 and 300 dimensional varieties and can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that and I believe the performance numbers would be plausible.
>>>>
>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>>>>>
>>>>> Can we set up a branch in which the limit is bumped to 2048, then have
>>>>> a realistic, free data set (wikipedia sample or something) that has,
>>>>> say, 5 million docs and vectors created using public data (glove
>>>>> pre-trained embeddings or the like)? We then could run indexing on the
>>>>> same hardware with 512, 1024 and 2048 and see what the numbers, limits
>>>>> and behavior actually are.
>>>>>
>>>>> I can help in writing this but not until after Easter.
>>>>>
>>>>>
>>>>> Dawid
>>>>>
>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>>>>> >
>>>>> > As Dawid pointed out earlier on this thread, this is the rule for
>>>>> > Apache projects: a single -1 vote on a code change is a veto and
>>>>> > cannot be overridden. Furthermore, Robert is one of the people on this
>>>>> > project who worked the most on debugging subtle bugs, making Lucene
>>>>> > more robust and improving our test framework, so I'm listening when he
>>>>> > voices quality concerns.
>>>>> >
>>>>> > The argument against removing/raising the limit that resonates with me
>>>>> > the most is that it is a one-way door. As MikeS highlighted earlier on
>>>>> > this thread, implementations may want to take advantage of the fact
>>>>> > that there is a limit at some point too. This is why I don't want to
>>>>> > remove the limit and would prefer a slight increase, such as 2048 as
>>>>> > suggested in the original issue, which would enable most of the things
>>>>> > that users who have been asking about raising the limit would like to
>>>>> > do.
>>>>> >
>>>>> > I agree that the merge-time memory usage and slow indexing rate are
>>>>> > not great. But it's still possible to index multi-million vector
>>>>> > datasets with a 4GB heap without hitting OOMEs regardless of the
>>>>> > number of dimensions, and the feedback I'm seeing is that many users
>>>>> > are still interested in indexing multi-million vector datasets despite
>>>>> > the slow indexing rate. I wish we could do better, and vector indexing
>>>>> > is certainly more expert than text indexing, but it still is usable in
>>>>> > my opinion. I understand how giving Lucene more information about
>>>>> > vectors prior to indexing (e.g. clustering information as Jim pointed
>>>>> > out) could help make merging faster and more memory-efficient, but I
>>>>> > would really like to avoid making it a requirement for indexing
>>>>> > vectors as it also makes this feature much harder to use.
>>>>> >
>>>>> > On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>>>> > <a.benedetti@sease.io> wrote:
>>>>> > >
>>>>> > > I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>>>>> > >
>>>>> > > The limit as far as I know is literally just raising an exception.
>>>>> > > Removing it won't alter in any way the current performance for users in low dimensional space.
>>>>> > > Removing it will just enable more users to use Lucene.
>>>>> > >
>>>>> > > If new users in certain situations will be unhappy with the performance, they may contribute improvements.
>>>>> > > This is how you make progress.
>>>>> > >
>>>>> > > If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>>>>> > >
>>>>> > > To me it's really a no brainer.
>>>>> > > Removing the limit and enable people to use high dimensional vectors will take minutes.
>>>>> > > Improving the hnsw implementation can take months.
>>>>> > > Pick one to begin with...
>>>>> > >
>>>>> > > And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>>>>> > >
>>>>> > >
>>>>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>>>> > >>
>>>>> > >> I disagree with your categorization. I put in plenty of work and
>>>>> > >> experienced plenty of pain myself, writing tests and fighting these
>>>>> > >> issues, after i saw that, two releases in a row, vector indexing fell
>>>>> > >> over and hit integer overflows etc on small datasets:
>>>>> > >>
>>>>> > >> https://github.com/apache/lucene/pull/11905
>>>>> > >>
>>>>> > >> Attacking me isn't helping the situation.
>>>>> > >>
>>>>> > >> PS: when i said the "one guy who wrote the code" I didn't mean it in
>>>>> > >> any kind of demeaning fashion really. I meant to describe the current
>>>>> > >> state of usability with respect to indexing a few million docs with
>>>>> > >> high dimensions. You can scroll up the thread and see that at least
>>>>> > >> one other committer on the project experienced similar pain as me.
>>>>> > >> Then, think about users who aren't committers trying to use the
>>>>> > >> functionality!
>>>>> > >>
>>>>> > >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>>>>> > >> >
>>>>> > >> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>>>> > >> >
>>>>> > >> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>>>> > >> >
>>>>> > >> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>>>>> > >> >
>>>>> > >> > Then you complain about people not meeting you half way. Wow
>>>>> > >> >
>>>>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>>> > >> >>
>>>>> > >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>>>> > >> >> <michael.wechner@wyona.com> wrote:
>>>>> > >> >> >
>>>>> > >> >> > What exactly do you consider reasonable?
>>>>> > >> >>
>>>>> > >> >> Let's begin a real discussion by being HONEST about the current
>>>>> > >> >> status. Please put politically correct or your own company's wishes
>>>>> > >> >> aside, we know it's not in a good state.
>>>>> > >> >>
>>>>> > >> >> Current status is the one guy who wrote the code can set a
>>>>> > >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>>>>> > >> >> dimensions in HOURS (i didn't ask what hardware).
>>>>> > >> >>
>>>>> > >> >> My concerns are everyone else except the one guy, I want it to be
>>>>> > >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>>>>> > >> >> ram buffer and bigger heap to avoid OOM on merge.
>>>>> > >> >> It is also a permanent backwards compatibility decision, we have to
>>>>> > >> >> support it once we do this and we can't just say "oops" and flip it
>>>>> > >> >> back.
>>>>> > >> >>
>>>>> > >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>>>>> > >> >> avoid merges because they are so slow and it would be DAYS otherwise,
>>>>> > >> >> or if its to avoid merges so it doesn't hit OOM.
>>>>> > >> >> Also from personal experience, it takes trial and error (means
>>>>> > >> >> experiencing OOM on merge!!!) before you get those heap values correct
>>>>> > >> >> for your dataset. This usually means starting over which is
>>>>> > >> >> frustrating and wastes more time.
>>>>> > >> >>
>>>>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>>>>> > >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>>>>> > >> >> avoided in this way and performance improved by writing bigger
>>>>> > >> >> segments with lucene's defaults. But this doesn't mean we can simply
>>>>> > >> >> ignore the horrors of what happens on merge. merging needs to scale so
>>>>> > >> >> that indexing really scales.
>>>>> > >> >>
>>>>> > >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>>>>> > >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>>>> > >> >> fashion when indexing.
>>>>> > >> >>
>>>>> > >> >> ---------------------------------------------------------------------
>>>>> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>> > >> >>
>>>>> > >>
>>>>> > >> ---------------------------------------------------------------------
>>>>> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> > >> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>> > >>
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Adrien
>>>>> >
>>>>> > ---------------------------------------------------------------------
>>>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>>>> >
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 11, 2023, 12:07 PM

Post #88 of 99 (950 views)

I understand what you mean that it seems to be artificial, but I don't
understand why this matters to test performance and scalability of the
indexing?

Let's assume the limit of Lucene would be 4 instead of 1024 and there
are only open source models generating vectors with 4 dimensions, for
example

0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814

0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844

-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106

-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114

and now I concatenate them to vectors with 8 dimensions

0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844

-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114

and normalize them to length 1.

Why should this be any different to a model which is acting like a black
box generating vectors with 8 dimensions?

Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
> We have plenty of text; the problem is coming up with a realistic
> vector model that requires as many dimensions as people seem to be
> demanding. As I said above, after surveying huggingface I couldn't
> find any text-based model using more than 768 dimensions. So far we
> have some ideas of generating higher-dimensional data by dithering or
> concatenating existing data, but it seems artificial.
>
> On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>>
>> I am asking, because I recently attended the following presentation by Anastassia Shaitarova (UZH Institute for Computational Linguistics, https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>>
>> ----
>>
>> Can we Identify Machine-Generated Text? An Overview of Current Approaches
>> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>>
>> The detection of machine-generated text has become increasingly important due to the prevalence of automated content generation and its potential for misuse. In this talk, we will discuss the motivation for automatic detection of generated text. We will present the currently available methods, including feature-based classification as a “first line-of-defense.” We will provide an overview of the detection tools that have been made available so far and discuss their limitations. Finally, we will reflect on some open problems associated with the automatic discrimination of generated texts.
>>
>> ----
>>
>> and her conclusion was that it has become basically impossible to differentiate between text generated by humans and text generated by for example ChatGPT.
>>
>> Whereas others have a slightly different opinion, see for example
>>
>> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>
>> But I would argue that real world and synthetic have become close enough that testing performance and scalability of indexing should be possible with synthetic data.
>>
>> I completely agree that we have to base our discussions and decisions on scientific methods and that we have to make sure that Lucene performs and scales well and that we understand the limits and what is going on under the hood.
>>
>> Thanks
>>
>> Michael W
>>
>>
>>
>>
>>
>> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>>
>> +1 to test on real vector data -- if you test on synthetic data you draw synthetic conclusions.
>>
>> Can someone post the theoretical performance (CPU and RAM required) of HNSW construction? Do we know/believe our HNSW implementation has achieved that theoretical big-O performance? Maybe we have some silly performance bug that's causing it not to?
>>
>> As I understand it, HNSW makes the tradeoff of costly construction for faster searching, which is typically the right tradeoff for search use cases. We do this in other parts of the Lucene index too.
>>
>> Lucene will do a logarithmic number of merges over time, i.e. each doc will be merged O(log(N)) times in its lifetime in the index. We need to multiply that by the cost of re-building the whole HNSW graph on each merge. BTW, other things in Lucene, like BKD/dimensional points, also rebuild the whole data structure on each merge, I think? But, as Rob pointed out, stored fields merging do indeed do some sneaky tricks to avoid excessive block decompress/recompress on each merge.
>>
>>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>> Actually I think Robert's veto stands on its technical merit already. Robert's take on technical matters very much resonate with me, even if he is sometimes prickly in how he expresses them ;)
>>
>> His point is that we, as a dev community, are not paying enough attention to the indexing performance of our KNN algo (HNSW) and implementation, and that it is reckless to increase / remove limits in that state. It is indeed a one-way door decision and one must confront such decisions with caution, especially for such a widely used base infrastructure as Lucene. We don't even advertise today in our javadocs that you need XXX heap if you index vectors with dimension Y, fanout X, levels Z, etc.
>>
>> RAM used during merging is unaffected by dimensionality, but is affected by fanout, because the HNSW graph (not the raw vectors) is memory resident, I think? Maybe we could move it off-heap and let the OS manage the memory (and still document the RAM requirements)? Maybe merge RAM costs should be accounted for in IW's RAM buffer accounting? It is not today, and there are some other things that use non-trivial RAM, e.g. the doc mapping (to compress docid space when deletions are reclaimed).
>>
>> When we added KNN vector testing to Lucene's nightly benchmarks, the indexing time massively increased -- see annotations DH and DP here: https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of course, that is using a single thread for indexing (on a box that has 128 cores!) so we produce a deterministic index every night ...
>>
>> Stepping out (meta) a bit ... this discussion is precisely one of the awesome benefits of the (informed) veto. It means risky changes to the software, as determined by any single informed developer on the project, can force a healthy discussion about the problem at hand. Robert is legitimately concerned about a real issue and so we should use our creative energies to characterize our HNSW implementation's performance, document it clearly for users, and uncover ways to improve it.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti <a.benedetti@sease.io> wrote:
>>> I think Gus points are on target.
>>>
>>> I recommend we move this forward in this way:
>>> We stop any discussion and everyone interested proposes an option with a motivation, then we aggregate the options and we create a Vote maybe?
>>>
>>> I am also on the same page on the fact that a veto should come with a clear and reasonable technical merit, which also in my opinion has not come yet.
>>>
>>> I also apologise if any of my words sounded harsh or personal attacks, never meant to do so.
>>>
>>> My proposed option:
>>>
>>> 1) remove the limit and potentially make it configurable,
>>> Motivation:
>>> The system administrator can enforce a limit its users need to respect that it's in line with whatever the admin decided to be acceptable for them.
>>> Default can stay the current one.
>>>
>>> That's my favourite at the moment, but I agree that potentially in the future this may need to change, as we may optimise the data structures for certain dimensions. I am a big fan of Yagni (you aren't going to need it) so I am ok we'll face a different discussion if that happens in the future.
>>>
>>>
>>>
>>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
>>>> What I see so far:
>>>>
>>>> Much positive support for raising the limit
>>>> Slightly less support for removing it or making it configurable
>>>> A single veto which argues that a (as yet undefined) performance standard must be met before raising the limit
>>>> Hot tempers (various) making this discussion difficult
>>>>
>>>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>>>
>>>> No standard for the performance is given so it cannot be technically met. Without hard criteria it's a moving target.
>>>> It appears to encode a valuation of the user's time, and that valuation is really up to the user. Some users may consider 2hours useless and not worth it, and others might happily wait 2 hours. This is not a technical decision, it's a business decision regarding the relative value of the time invested vs the value of the result. If I can cure cancer by indexing for a year, that might be worth it... (hyperbole of course).
>>>>
>>>> Things I would consider to have technical merit that I don't hear:
>>>>
>>>> Impact on the speed of **other** indexing operations. (devaluation of other functionality)
>>>> Actual scenarios that work when the limit is low and fail when the limit is high (new failure on the same data with the limit raised).
>>>>
>>>> One thing that might or might not have technical merit
>>>>
>>>> If someone feels there is a lack of documentation of the costs/performance implications of using large vectors, possibly including reproducible benchmarks establishing the scaling behavior (there seems to be disagreement on O(n) vs O(n^2)).
>>>>
>>>> The users *should* know what they are getting into, but if the cost is worth it to them, they should be able to pay it without forking the project. If this veto causes a fork that's not good.
>>>>
>>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100 and 300 dimensional varieties and can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that and I believe the performance numbers would be plausible.
>>>>>
>>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>>>>>> Can we set up a branch in which the limit is bumped to 2048, then have
>>>>>> a realistic, free data set (wikipedia sample or something) that has,
>>>>>> say, 5 million docs and vectors created using public data (glove
>>>>>> pre-trained embeddings or the like)? We then could run indexing on the
>>>>>> same hardware with 512, 1024 and 2048 and see what the numbers, limits
>>>>>> and behavior actually are.
>>>>>>
>>>>>> I can help in writing this but not until after Easter.
>>>>>>
>>>>>>
>>>>>> Dawid
>>>>>>
>>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>>>>>>> As Dawid pointed out earlier on this thread, this is the rule for
>>>>>>> Apache projects: a single -1 vote on a code change is a veto and
>>>>>>> cannot be overridden. Furthermore, Robert is one of the people on this
>>>>>>> project who worked the most on debugging subtle bugs, making Lucene
>>>>>>> more robust and improving our test framework, so I'm listening when he
>>>>>>> voices quality concerns.
>>>>>>>
>>>>>>> The argument against removing/raising the limit that resonates with me
>>>>>>> the most is that it is a one-way door. As MikeS highlighted earlier on
>>>>>>> this thread, implementations may want to take advantage of the fact
>>>>>>> that there is a limit at some point too. This is why I don't want to
>>>>>>> remove the limit and would prefer a slight increase, such as 2048 as
>>>>>>> suggested in the original issue, which would enable most of the things
>>>>>>> that users who have been asking about raising the limit would like to
>>>>>>> do.
>>>>>>>
>>>>>>> I agree that the merge-time memory usage and slow indexing rate are
>>>>>>> not great. But it's still possible to index multi-million vector
>>>>>>> datasets with a 4GB heap without hitting OOMEs regardless of the
>>>>>>> number of dimensions, and the feedback I'm seeing is that many users
>>>>>>> are still interested in indexing multi-million vector datasets despite
>>>>>>> the slow indexing rate. I wish we could do better, and vector indexing
>>>>>>> is certainly more expert than text indexing, but it still is usable in
>>>>>>> my opinion. I understand how giving Lucene more information about
>>>>>>> vectors prior to indexing (e.g. clustering information as Jim pointed
>>>>>>> out) could help make merging faster and more memory-efficient, but I
>>>>>>> would really like to avoid making it a requirement for indexing
>>>>>>> vectors as it also makes this feature much harder to use.
>>>>>>>
>>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>>>>>> <a.benedetti@sease.io> wrote:
>>>>>>>> I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>>>>>>>>
>>>>>>>> The limit as far as I know is literally just raising an exception.
>>>>>>>> Removing it won't alter in any way the current performance for users in low dimensional space.
>>>>>>>> Removing it will just enable more users to use Lucene.
>>>>>>>>
>>>>>>>> If new users in certain situations will be unhappy with the performance, they may contribute improvements.
>>>>>>>> This is how you make progress.
>>>>>>>>
>>>>>>>> If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>>>>>>>>
>>>>>>>> To me it's really a no brainer.
>>>>>>>> Removing the limit and enable people to use high dimensional vectors will take minutes.
>>>>>>>> Improving the hnsw implementation can take months.
>>>>>>>> Pick one to begin with...
>>>>>>>>
>>>>>>>> And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>>>>>>>> I disagree with your categorization. I put in plenty of work and
>>>>>>>>> experienced plenty of pain myself, writing tests and fighting these
>>>>>>>>> issues, after i saw that, two releases in a row, vector indexing fell
>>>>>>>>> over and hit integer overflows etc on small datasets:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/lucene/pull/11905
>>>>>>>>>
>>>>>>>>> Attacking me isn't helping the situation.
>>>>>>>>>
>>>>>>>>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>>>>>>>>> any kind of demeaning fashion really. I meant to describe the current
>>>>>>>>> state of usability with respect to indexing a few million docs with
>>>>>>>>> high dimensions. You can scroll up the thread and see that at least
>>>>>>>>> one other committer on the project experienced similar pain as me.
>>>>>>>>> Then, think about users who aren't committers trying to use the
>>>>>>>>> functionality!
>>>>>>>>>
>>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>>>>>>>>>> What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>>>>>>>>>
>>>>>>>>>> You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>>>>>>>>>
>>>>>>>>>> You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>>>>>>>>>>
>>>>>>>>>> Then you complain about people not meeting you half way. Wow
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>>>>>>>>>> <michael.wechner@wyona.com> wrote:
>>>>>>>>>>>> What exactly do you consider reasonable?
>>>>>>>>>>> Let's begin a real discussion by being HONEST about the current
>>>>>>>>>>> status. Please put politically correct or your own company's wishes
>>>>>>>>>>> aside, we know it's not in a good state.
>>>>>>>>>>>
>>>>>>>>>>> Current status is the one guy who wrote the code can set a
>>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset with 1024
>>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
>>>>>>>>>>>
>>>>>>>>>>> My concerns are everyone else except the one guy, I want it to be
>>>>>>>>>>> usable. Increasing dimensions just means even bigger multi-gigabyte
>>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>>>>>>>>>>> It is also a permanent backwards compatibility decision, we have to
>>>>>>>>>>> support it once we do this and we can't just say "oops" and flip it
>>>>>>>>>>> back.
>>>>>>>>>>>
>>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer is really to
>>>>>>>>>>> avoid merges because they are so slow and it would be DAYS otherwise,
>>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
>>>>>>>>>>> Also from personal experience, it takes trial and error (means
>>>>>>>>>>> experiencing OOM on merge!!!) before you get those heap values correct
>>>>>>>>>>> for your dataset. This usually means starting over which is
>>>>>>>>>>> frustrating and wastes more time.
>>>>>>>>>>>
>>>>>>>>>>> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>>>>>>>>>>> to me like its a good idea. maybe the multigigabyte ram buffer can be
>>>>>>>>>>> avoided in this way and performance improved by writing bigger
>>>>>>>>>>> segments with lucene's defaults. But this doesn't mean we can simply
>>>>>>>>>>> ignore the horrors of what happens on merge. merging needs to scale so
>>>>>>>>>>> that indexing really scales.
>>>>>>>>>>>
>>>>>>>>>>> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>>>>>>>>>>> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>>>>>>>>>> fashion when indexing.
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Adrien
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>
>>>>
>>>> --
>>>> http://www.needhamsoftware.com (work)
>>>> http://www.the111shift.com (play)
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

kent.fitch at gmail

Apr 11, 2023, 7:51 PM

Post #89 of 99 (950 views)

I only know some characteristics of the openAI ada-002 vectors, although
they are a very popular as embeddings/text-characterisations as they allow
more accurate/"human meaningful" semantic search results with fewer
dimensions than their predecessors - I've evaluated a few different
embedding models, including some BERT variants, CLIP ViT-L-14 (with 768
dims, which was quite good), openAI's ada-001 (1024 dims) and babbage-001
(2048 dims), and ada-002 are qualitatively the best, although that will
certainly change!

In any case, ada-002 vectors have interesting characteristics that I think
mean you could confidently create synthetic vectors which would be hard to
distinguish from "real" vectors. I found this from looking at 47K ada-002
vectors generated across a full year (1994) of newspaper articles from the
Canberra Times and 200K wikipedia articles:
- there is no discernible/significant correlation between values in any
pair of dimensions
- all but 5 of the 1536 dimensions have an almost identical distribution of
values shown in the central blob on these graphs (that just show a few of
these 1531 dimensions with clumped values and the 5 "outlier" dimensions,
but all 1531 non-outlier dims are in there, which makes for some easy
quantisation from float to byte if you dont want to go the full
kmeans/clustering/Lloyds-algorithm approach):
https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
- the variance of the value of each dimension is characteristic:
https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228

This probably represents something significant about how the ada-002
embeddings are created, but I think it also means creating "realistic"
values is possible. I did not use this information when testing recall &
performance on Lucene's HNSW implementation on 192m documents, as I
slightly dithered the values of a "real" set on 47K docs and stored other
fields in the doc that referenced the "base" document that the dithers were
made from, and used different dithering magnitudes so that I could test
recall with different neighbour sizes ("M"), construction-beamwidth and
search-beamwidths.

best regards

Kent Fitch

On Wed, Apr 12, 2023 at 5:08?AM Michael Wechner <michael.wechner@wyona.com>
wrote:

> I understand what you mean that it seems to be artificial, but I don't
> understand why this matters to test performance and scalability of the
> indexing?
>
> Let's assume the limit of Lucene would be 4 instead of 1024 and there
> are only open source models generating vectors with 4 dimensions, for
> example
>
>
> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>
>
> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>
>
> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>
>
> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>
> and now I concatenate them to vectors with 8 dimensions
>
>
>
> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>
>
> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>
> and normalize them to length 1.
>
> Why should this be any different to a model which is acting like a black
> box generating vectors with 8 dimensions?
>
>
>
>
> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
> >> What exactly do you consider real vector data? Vector data which is
> based on texts written by humans?
> > We have plenty of text; the problem is coming up with a realistic
> > vector model that requires as many dimensions as people seem to be
> > demanding. As I said above, after surveying huggingface I couldn't
> > find any text-based model using more than 768 dimensions. So far we
> > have some ideas of generating higher-dimensional data by dithering or
> > concatenating existing data, but it seems artificial.
> >
> > On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
> > <michael.wechner@wyona.com> wrote:
> >> What exactly do you consider real vector data? Vector data which is
> based on texts written by humans?
> >>
> >> I am asking, because I recently attended the following presentation by
> Anastassia Shaitarova (UZH Institute for Computational Linguistics,
> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
> >>
> >> ----
> >>
> >> Can we Identify Machine-Generated Text? An Overview of Current
> Approaches
> >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
> >>
> >> The detection of machine-generated text has become increasingly
> important due to the prevalence of automated content generation and its
> potential for misuse. In this talk, we will discuss the motivation for
> automatic detection of generated text. We will present the currently
> available methods, including feature-based classification as a “first
> line-of-defense.” We will provide an overview of the detection tools that
> have been made available so far and discuss their limitations. Finally, we
> will reflect on some open problems associated with the automatic
> discrimination of generated texts.
> >>
> >> ----
> >>
> >> and her conclusion was that it has become basically impossible to
> differentiate between text generated by humans and text generated by for
> example ChatGPT.
> >>
> >> Whereas others have a slightly different opinion, see for example
> >>
> >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
> >>
> >> But I would argue that real world and synthetic have become close
> enough that testing performance and scalability of indexing should be
> possible with synthetic data.
> >>
> >> I completely agree that we have to base our discussions and decisions
> on scientific methods and that we have to make sure that Lucene performs
> and scales well and that we understand the limits and what is going on
> under the hood.
> >>
> >> Thanks
> >>
> >> Michael W
> >>
> >>
> >>
> >>
> >>
> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
> >>
> >> +1 to test on real vector data -- if you test on synthetic data you
> draw synthetic conclusions.
> >>
> >> Can someone post the theoretical performance (CPU and RAM required) of
> HNSW construction? Do we know/believe our HNSW implementation has achieved
> that theoretical big-O performance? Maybe we have some silly performance
> bug that's causing it not to?
> >>
> >> As I understand it, HNSW makes the tradeoff of costly construction for
> faster searching, which is typically the right tradeoff for search use
> cases. We do this in other parts of the Lucene index too.
> >>
> >> Lucene will do a logarithmic number of merges over time, i.e. each doc
> will be merged O(log(N)) times in its lifetime in the index. We need to
> multiply that by the cost of re-building the whole HNSW graph on each
> merge. BTW, other things in Lucene, like BKD/dimensional points, also
> rebuild the whole data structure on each merge, I think? But, as Rob
> pointed out, stored fields merging do indeed do some sneaky tricks to avoid
> excessive block decompress/recompress on each merge.
> >>
> >>> As I understand it, vetoes must have technical merit. I'm not sure
> that this veto rises to "technical merit" on 2 counts:
> >> Actually I think Robert's veto stands on its technical merit already.
> Robert's take on technical matters very much resonate with me, even if he
> is sometimes prickly in how he expresses them ;)
> >>
> >> His point is that we, as a dev community, are not paying enough
> attention to the indexing performance of our KNN algo (HNSW) and
> implementation, and that it is reckless to increase / remove limits in that
> state. It is indeed a one-way door decision and one must confront such
> decisions with caution, especially for such a widely used base
> infrastructure as Lucene. We don't even advertise today in our javadocs
> that you need XXX heap if you index vectors with dimension Y, fanout X,
> levels Z, etc.
> >>
> >> RAM used during merging is unaffected by dimensionality, but is
> affected by fanout, because the HNSW graph (not the raw vectors) is memory
> resident, I think? Maybe we could move it off-heap and let the OS manage
> the memory (and still document the RAM requirements)? Maybe merge RAM
> costs should be accounted for in IW's RAM buffer accounting? It is not
> today, and there are some other things that use non-trivial RAM, e.g. the
> doc mapping (to compress docid space when deletions are reclaimed).
> >>
> >> When we added KNN vector testing to Lucene's nightly benchmarks, the
> indexing time massively increased -- see annotations DH and DP here:
> https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly
> benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of
> course, that is using a single thread for indexing (on a box that has 128
> cores!) so we produce a deterministic index every night ...
> >>
> >> Stepping out (meta) a bit ... this discussion is precisely one of the
> awesome benefits of the (informed) veto. It means risky changes to the
> software, as determined by any single informed developer on the project,
> can force a healthy discussion about the problem at hand. Robert is
> legitimately concerned about a real issue and so we should use our creative
> energies to characterize our HNSW implementation's performance, document it
> clearly for users, and uncover ways to improve it.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti <
> a.benedetti@sease.io> wrote:
> >>> I think Gus points are on target.
> >>>
> >>> I recommend we move this forward in this way:
> >>> We stop any discussion and everyone interested proposes an option with
> a motivation, then we aggregate the options and we create a Vote maybe?
> >>>
> >>> I am also on the same page on the fact that a veto should come with a
> clear and reasonable technical merit, which also in my opinion has not come
> yet.
> >>>
> >>> I also apologise if any of my words sounded harsh or personal attacks,
> never meant to do so.
> >>>
> >>> My proposed option:
> >>>
> >>> 1) remove the limit and potentially make it configurable,
> >>> Motivation:
> >>> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for them.
> >>> Default can stay the current one.
> >>>
> >>> That's my favourite at the moment, but I agree that potentially in the
> future this may need to change, as we may optimise the data structures for
> certain dimensions. I am a big fan of Yagni (you aren't going to need it)
> so I am ok we'll face a different discussion if that happens in the future.
> >>>
> >>>
> >>>
> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
> >>>> What I see so far:
> >>>>
> >>>> Much positive support for raising the limit
> >>>> Slightly less support for removing it or making it configurable
> >>>> A single veto which argues that a (as yet undefined) performance
> standard must be met before raising the limit
> >>>> Hot tempers (various) making this discussion difficult
> >>>>
> >>>> As I understand it, vetoes must have technical merit. I'm not sure
> that this veto rises to "technical merit" on 2 counts:
> >>>>
> >>>> No standard for the performance is given so it cannot be technically
> met. Without hard criteria it's a moving target.
> >>>> It appears to encode a valuation of the user's time, and that
> valuation is really up to the user. Some users may consider 2hours useless
> and not worth it, and others might happily wait 2 hours. This is not a
> technical decision, it's a business decision regarding the relative value
> of the time invested vs the value of the result. If I can cure cancer by
> indexing for a year, that might be worth it... (hyperbole of course).
> >>>>
> >>>> Things I would consider to have technical merit that I don't hear:
> >>>>
> >>>> Impact on the speed of **other** indexing operations. (devaluation of
> other functionality)
> >>>> Actual scenarios that work when the limit is low and fail when the
> limit is high (new failure on the same data with the limit raised).
> >>>>
> >>>> One thing that might or might not have technical merit
> >>>>
> >>>> If someone feels there is a lack of documentation of the
> costs/performance implications of using large vectors, possibly including
> reproducible benchmarks establishing the scaling behavior (there seems to
> be disagreement on O(n) vs O(n^2)).
> >>>>
> >>>> The users *should* know what they are getting into, but if the cost
> is worth it to them, they should be able to pay it without forking the
> project. If this veto causes a fork that's not good.
> >>>>
> >>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in
> 100 and 300 dimensional varieties and can easily enough generate large
> numbers of vector documents from the articles data. To go higher we could
> concatenate vectors from that and I believe the performance numbers would
> be plausible.
> >>>>>
> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com>
> wrote:
> >>>>>> Can we set up a branch in which the limit is bumped to 2048, then
> have
> >>>>>> a realistic, free data set (wikipedia sample or something) that has,
> >>>>>> say, 5 million docs and vectors created using public data (glove
> >>>>>> pre-trained embeddings or the like)? We then could run indexing on
> the
> >>>>>> same hardware with 512, 1024 and 2048 and see what the numbers,
> limits
> >>>>>> and behavior actually are.
> >>>>>>
> >>>>>> I can help in writing this but not until after Easter.
> >>>>>>
> >>>>>>
> >>>>>> Dawid
> >>>>>>
> >>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com>
> wrote:
> >>>>>>> As Dawid pointed out earlier on this thread, this is the rule for
> >>>>>>> Apache projects: a single -1 vote on a code change is a veto and
> >>>>>>> cannot be overridden. Furthermore, Robert is one of the people on
> this
> >>>>>>> project who worked the most on debugging subtle bugs, making Lucene
> >>>>>>> more robust and improving our test framework, so I'm listening
> when he
> >>>>>>> voices quality concerns.
> >>>>>>>
> >>>>>>> The argument against removing/raising the limit that resonates
> with me
> >>>>>>> the most is that it is a one-way door. As MikeS highlighted
> earlier on
> >>>>>>> this thread, implementations may want to take advantage of the fact
> >>>>>>> that there is a limit at some point too. This is why I don't want
> to
> >>>>>>> remove the limit and would prefer a slight increase, such as 2048
> as
> >>>>>>> suggested in the original issue, which would enable most of the
> things
> >>>>>>> that users who have been asking about raising the limit would like
> to
> >>>>>>> do.
> >>>>>>>
> >>>>>>> I agree that the merge-time memory usage and slow indexing rate are
> >>>>>>> not great. But it's still possible to index multi-million vector
> >>>>>>> datasets with a 4GB heap without hitting OOMEs regardless of the
> >>>>>>> number of dimensions, and the feedback I'm seeing is that many
> users
> >>>>>>> are still interested in indexing multi-million vector datasets
> despite
> >>>>>>> the slow indexing rate. I wish we could do better, and vector
> indexing
> >>>>>>> is certainly more expert than text indexing, but it still is
> usable in
> >>>>>>> my opinion. I understand how giving Lucene more information about
> >>>>>>> vectors prior to indexing (e.g. clustering information as Jim
> pointed
> >>>>>>> out) could help make merging faster and more memory-efficient, but
> I
> >>>>>>> would really like to avoid making it a requirement for indexing
> >>>>>>> vectors as it also makes this feature much harder to use.
> >>>>>>>
> >>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> >>>>>>> <a.benedetti@sease.io> wrote:
> >>>>>>>> I am very attentive to listen opinions but I am un-convinced here
> and I an not sure that a single person opinion should be allowed to be
> detrimental for such an important project.
> >>>>>>>>
> >>>>>>>> The limit as far as I know is literally just raising an exception.
> >>>>>>>> Removing it won't alter in any way the current performance for
> users in low dimensional space.
> >>>>>>>> Removing it will just enable more users to use Lucene.
> >>>>>>>>
> >>>>>>>> If new users in certain situations will be unhappy with the
> performance, they may contribute improvements.
> >>>>>>>> This is how you make progress.
> >>>>>>>>
> >>>>>>>> If it's a reputation thing, trust me that not allowing users to
> play with high dimensional space will equally damage it.
> >>>>>>>>
> >>>>>>>> To me it's really a no brainer.
> >>>>>>>> Removing the limit and enable people to use high dimensional
> vectors will take minutes.
> >>>>>>>> Improving the hnsw implementation can take months.
> >>>>>>>> Pick one to begin with...
> >>>>>>>>
> >>>>>>>> And there's no-one paying me here, no company interest
> whatsoever, actually I pay people to contribute, I am just convinced it's a
> good idea.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> >>>>>>>>> I disagree with your categorization. I put in plenty of work and
> >>>>>>>>> experienced plenty of pain myself, writing tests and fighting
> these
> >>>>>>>>> issues, after i saw that, two releases in a row, vector indexing
> fell
> >>>>>>>>> over and hit integer overflows etc on small datasets:
> >>>>>>>>>
> >>>>>>>>> https://github.com/apache/lucene/pull/11905
> >>>>>>>>>
> >>>>>>>>> Attacking me isn't helping the situation.
> >>>>>>>>>
> >>>>>>>>> PS: when i said the "one guy who wrote the code" I didn't mean
> it in
> >>>>>>>>> any kind of demeaning fashion really. I meant to describe the
> current
> >>>>>>>>> state of usability with respect to indexing a few million docs
> with
> >>>>>>>>> high dimensions. You can scroll up the thread and see that at
> least
> >>>>>>>>> one other committer on the project experienced similar pain as
> me.
> >>>>>>>>> Then, think about users who aren't committers trying to use the
> >>>>>>>>> functionality!
> >>>>>>>>>
> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
> msokolov@gmail.com> wrote:
> >>>>>>>>>> What you said about increasing dimensions requiring a bigger
> ram buffer on merge is wrong. That's the point I was trying to make. Your
> concerns about merge costs are not wrong, but your conclusion that we need
> to limit dimensions is not justified.
> >>>>>>>>>>
> >>>>>>>>>> You complain that hnsw sucks it doesn't scale, but when I show
> it scales linearly with dimension you just ignore that and complain about
> something entirely different.
> >>>>>>>>>>
> >>>>>>>>>> You demand that people run all kinds of tests to prove you
> wrong but when they do, you don't listen and you won't put in the work
> yourself or complain that it's too hard.
> >>>>>>>>>>
> >>>>>>>>>> Then you complain about people not meeting you half way. Wow
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
> wrote:
> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> >>>>>>>>>>> <michael.wechner@wyona.com> wrote:
> >>>>>>>>>>>> What exactly do you consider reasonable?
> >>>>>>>>>>> Let's begin a real discussion by being HONEST about the current
> >>>>>>>>>>> status. Please put politically correct or your own company's
> wishes
> >>>>>>>>>>> aside, we know it's not in a good state.
> >>>>>>>>>>>
> >>>>>>>>>>> Current status is the one guy who wrote the code can set a
> >>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset with 1024
> >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
> >>>>>>>>>>>
> >>>>>>>>>>> My concerns are everyone else except the one guy, I want it to
> be
> >>>>>>>>>>> usable. Increasing dimensions just means even bigger
> multi-gigabyte
> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
> >>>>>>>>>>> It is also a permanent backwards compatibility decision, we
> have to
> >>>>>>>>>>> support it once we do this and we can't just say "oops" and
> flip it
> >>>>>>>>>>> back.
> >>>>>>>>>>>
> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer is
> really to
> >>>>>>>>>>> avoid merges because they are so slow and it would be DAYS
> otherwise,
> >>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
> >>>>>>>>>>> Also from personal experience, it takes trial and error (means
> >>>>>>>>>>> experiencing OOM on merge!!!) before you get those heap values
> correct
> >>>>>>>>>>> for your dataset. This usually means starting over which is
> >>>>>>>>>>> frustrating and wastes more time.
> >>>>>>>>>>>
> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in
> IndexWriter, seems
> >>>>>>>>>>> to me like its a good idea. maybe the multigigabyte ram buffer
> can be
> >>>>>>>>>>> avoided in this way and performance improved by writing bigger
> >>>>>>>>>>> segments with lucene's defaults. But this doesn't mean we can
> simply
> >>>>>>>>>>> ignore the horrors of what happens on merge. merging needs to
> scale so
> >>>>>>>>>>> that indexing really scales.
> >>>>>>>>>>>
> >>>>>>>>>>> At least it shouldnt spike RAM on trivial data amounts and
> cause OOM,
> >>>>>>>>>>> and definitely it shouldnt burn hours and hours of CPU in
> O(n^2)
> >>>>>>>>>>> fashion when indexing.
> >>>>>>>>>>>
> >>>>>>>>>>>
> ---------------------------------------------------------------------
> >>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>>>>>>
> >>>>>>>>>
> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Adrien
> >>>>>>>
> >>>>>>>
> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>
> >>>>
> >>>> --
> >>>> http://www.needhamsoftware.com (work)
> >>>> http://www.the111shift.com (play)
> >>
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

gus.heck at gmail

Apr 11, 2023, 10:41 PM

Post #90 of 99 (950 views)

His point is that we, as a dev community, are not paying enough attention
> to the indexing performance of our KNN algo (HNSW) and implementation, and
> that it is reckless to increase / remove limits in that state.
>

If the argument were... "Please hold off while I'm actively improving this,
it will be ready soon and then we can adjust the limit" that might have
technical merit. As it was presented it came across more like "I'm going to
hold this feature lots of folk want hostage until *someone else* does
something I think should be done"... I doubt that was actually what he
consciously thought (I don't think anyone on this project would have that
specific intention), but the context and manner have made it seem that way,
and the net effect seems to be trending in that direction.

If there's a way that raising the limit *prevents* working on performance
that of course would be a key thing to understand. It seems to me that the
exact person who's going to go on a performance crusade is the person who
has a technique that they can prove works, but it's just too darn slow....
Maybe not the first person, maybe not the fifth, but it's going to be
*someone* who needs it...

100% the user should know that they are "off the edge of the map" and "here
there be monsters." Document it well, issue a warning, whatever. Once
they've been told, and they set sail for the unknown, let them develop an
itch so that they can scratch it.

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 11, 2023, 11:36 PM

Post #91 of 99 (950 views)

thank you very much for your feedback!

In a previous post (April 7) you wrote you could make availlable the 47K
ada-002 vectors, which would be great!

Would it make sense to setup a public gitub repo, such that others could
use or also contribute vectors?

Thanks

Michael Wechner

Am 12.04.23 um 04:51 schrieb Kent Fitch:
> I only know some characteristics of the openAI ada-002 vectors,
> although they are a very popular as embeddings/text-characterisations
> as they allow more accurate/"human meaningful" semantic search results
> with fewer dimensions than their predecessors - I've evaluated a few
> different embedding models, including some BERT variants, CLIP
> ViT-L-14 (with 768 dims, which was quite good), openAI's ada-001 (1024
> dims) and babbage-001 (2048 dims), and ada-002 are qualitatively the
> best, although that will certainly change!
>
> In any case, ada-002 vectors have interesting characteristics that I
> think mean you could confidently create synthetic vectors which
> would be hard to distinguish from "real" vectors. I found this from
> looking at 47K ada-002 vectors generated across a full year (1994) of
> newspaper articles from the Canberra Times and 200K wikipedia articles:
> - there is no discernible/significant correlation between values in
> any pair of dimensions
> - all but 5 of the 1536 dimensions have an almost identical
> distribution of values shown in the central blob on these graphs (that
> just show a few of these 1531 dimensions with clumped values and the 5
> "outlier" dimensions, but all 1531 non-outlier dims are in there,
> which makes for some easy quantisation from float to byte if you dont
> want to go the full kmeans/clustering/Lloyds-algorithm approach):
> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
> - the variance of the value of each dimension is characteristic:
> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>
> This probably represents something significant about how the ada-002
> embeddings are created, but I think it also means creating "realistic"
> values is possible. I did not use this information when testing
> recall & performance on Lucene's HNSW implementation on 192m
> documents, as I slightly dithered the values of a "real" set on 47K
> docs and stored other fields in the doc that referenced the "base"
> document that the dithers were made from, and used different dithering
> magnitudes so that I could test recall with different neighbour sizes
> ("M"), construction-beamwidth and search-beamwidths.
>
> best regards
>
> Kent Fitch
>
>
>
>
> On Wed, Apr 12, 2023 at 5:08?AM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>
> I understand what you mean that it seems to be artificial, but I
> don't
> understand why this matters to test performance and scalability of
> the
> indexing?
>
> Let's assume the limit of Lucene would be 4 instead of 1024 and there
> are only open source models generating vectors with 4 dimensions, for
> example
>
> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>
> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>
> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>
> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>
> and now I concatenate them to vectors with 8 dimensions
>
>
> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>
> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>
> and normalize them to length 1.
>
> Why should this be any different to a model which is acting like a
> black
> box generating vectors with 8 dimensions?
>
>
>
>
> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
> >> What exactly do you consider real vector data? Vector data
> which is based on texts written by humans?
> > We have plenty of text; the problem is coming up with a realistic
> > vector model that requires as many dimensions as people seem to be
> > demanding. As I said above, after surveying huggingface I couldn't
> > find any text-based model using more than 768 dimensions. So far we
> > have some ideas of generating higher-dimensional data by
> dithering or
> > concatenating existing data, but it seems artificial.
> >
> > On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
> > <michael.wechner@wyona.com> wrote:
> >> What exactly do you consider real vector data? Vector data
> which is based on texts written by humans?
> >>
> >> I am asking, because I recently attended the following
> presentation by Anastassia Shaitarova (UZH Institute for
> Computational Linguistics,
> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
> >>
> >> ----
> >>
> >> Can we Identify Machine-Generated Text? An Overview of Current
> Approaches
> >> by Anastassia Shaitarova (UZH Institute for Computational
> Linguistics)
> >>
> >> The detection of machine-generated text has become increasingly
> important due to the prevalence of automated content generation
> and its potential for misuse. In this talk, we will discuss the
> motivation for automatic detection of generated text. We will
> present the currently available methods, including feature-based
> classification as a “first line-of-defense.” We will provide an
> overview of the detection tools that have been made available so
> far and discuss their limitations. Finally, we will reflect on
> some open problems associated with the automatic discrimination of
> generated texts.
> >>
> >> ----
> >>
> >> and her conclusion was that it has become basically impossible
> to differentiate between text generated by humans and text
> generated by for example ChatGPT.
> >>
> >> Whereas others have a slightly different opinion, see for example
> >>
> >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
> >>
> >> But I would argue that real world and synthetic have become
> close enough that testing performance and scalability of indexing
> should be possible with synthetic data.
> >>
> >> I completely agree that we have to base our discussions and
> decisions on scientific methods and that we have to make sure that
> Lucene performs and scales well and that we understand the limits
> and what is going on under the hood.
> >>
> >> Thanks
> >>
> >> Michael W
> >>
> >>
> >>
> >>
> >>
> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
> >>
> >> +1 to test on real vector data -- if you test on synthetic data
> you draw synthetic conclusions.
> >>
> >> Can someone post the theoretical performance (CPU and RAM
> required) of HNSW construction? Do we know/believe our HNSW
> implementation has achieved that theoretical big-O performance?
> Maybe we have some silly performance bug that's causing it not to?
> >>
> >> As I understand it, HNSW makes the tradeoff of costly
> construction for faster searching, which is typically the right
> tradeoff for search use cases. We do this in other parts of the
> Lucene index too.
> >>
> >> Lucene will do a logarithmic number of merges over time, i.e.
> each doc will be merged O(log(N)) times in its lifetime in the
> index. We need to multiply that by the cost of re-building the
> whole HNSW graph on each merge. BTW, other things in Lucene, like
> BKD/dimensional points, also rebuild the whole data structure on
> each merge, I think? But, as Rob pointed out, stored fields
> merging do indeed do some sneaky tricks to avoid excessive block
> decompress/recompress on each merge.
> >>
> >>> As I understand it, vetoes must have technical merit. I'm not
> sure that this veto rises to "technical merit" on 2 counts:
> >> Actually I think Robert's veto stands on its technical merit
> already. Robert's take on technical matters very much resonate
> with me, even if he is sometimes prickly in how he expresses them ;)
> >>
> >> His point is that we, as a dev community, are not paying enough
> attention to the indexing performance of our KNN algo (HNSW) and
> implementation, and that it is reckless to increase / remove
> limits in that state. It is indeed a one-way door decision and
> one must confront such decisions with caution, especially for such
> a widely used base infrastructure as Lucene. We don't even
> advertise today in our javadocs that you need XXX heap if you
> index vectors with dimension Y, fanout X, levels Z, etc.
> >>
> >> RAM used during merging is unaffected by dimensionality, but is
> affected by fanout, because the HNSW graph (not the raw vectors)
> is memory resident, I think? Maybe we could move it off-heap and
> let the OS manage the memory (and still document the RAM
> requirements)? Maybe merge RAM costs should be accounted for in
> IW's RAM buffer accounting? It is not today, and there are some
> other things that use non-trivial RAM, e.g. the doc mapping (to
> compress docid space when deletions are reclaimed).
> >>
> >> When we added KNN vector testing to Lucene's nightly
> benchmarks, the indexing time massively increased -- see
> annotations DH and DP here:
> https://home.apache.org/~mikemccand/lucenebench/indexing.html.
> Nightly benchmarks now start at 6 PM and don't finish until ~14.5
> hours later. Of course, that is using a single thread for
> indexing (on a box that has 128 cores!) so we produce a
> deterministic index every night ...
> >>
> >> Stepping out (meta) a bit ... this discussion is precisely one
> of the awesome benefits of the (informed) veto. It means risky
> changes to the software, as determined by any single informed
> developer on the project, can force a healthy discussion about the
> problem at hand. Robert is legitimately concerned about a real
> issue and so we should use our creative energies to characterize
> our HNSW implementation's performance, document it clearly for
> users, and uncover ways to improve it.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
> >>> I think Gus points are on target.
> >>>
> >>> I recommend we move this forward in this way:
> >>> We stop any discussion and everyone interested proposes an
> option with a motivation, then we aggregate the options and we
> create a Vote maybe?
> >>>
> >>> I am also on the same page on the fact that a veto should come
> with a clear and reasonable technical merit, which also in my
> opinion has not come yet.
> >>>
> >>> I also apologise if any of my words sounded harsh or personal
> attacks, never meant to do so.
> >>>
> >>> My proposed option:
> >>>
> >>> 1) remove the limit and potentially make it configurable,
> >>> Motivation:
> >>> The system administrator can enforce a limit its users need to
> respect that it's in line with whatever the admin decided to be
> acceptable for them.
> >>> Default can stay the current one.
> >>>
> >>> That's my favourite at the moment, but I agree that
> potentially in the future this may need to change, as we may
> optimise the data structures for certain dimensions. I am a big
> fan of Yagni (you aren't going to need it) so I am ok we'll face a
> different discussion if that happens in the future.
> >>>
> >>>
> >>>
> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
> >>>> What I see so far:
> >>>>
> >>>> Much positive support for raising the limit
> >>>> Slightly less support for removing it or making it configurable
> >>>> A single veto which argues that a (as yet undefined)
> performance standard must be met before raising the limit
> >>>> Hot tempers (various) making this discussion difficult
> >>>>
> >>>> As I understand it, vetoes must have technical merit. I'm not
> sure that this veto rises to "technical merit" on 2 counts:
> >>>>
> >>>> No standard for the performance is given so it cannot be
> technically met. Without hard criteria it's a moving target.
> >>>> It appears to encode a valuation of the user's time, and that
> valuation is really up to the user. Some users may consider 2hours
> useless and not worth it, and others might happily wait 2 hours.
> This is not a technical decision, it's a business decision
> regarding the relative value of the time invested vs the value of
> the result. If I can cure cancer by indexing for a year, that
> might be worth it... (hyperbole of course).
> >>>>
> >>>> Things I would consider to have technical merit that I don't
> hear:
> >>>>
> >>>> Impact on the speed of **other** indexing operations.
> (devaluation of other functionality)
> >>>> Actual scenarios that work when the limit is low and fail
> when the limit is high (new failure on the same data with the
> limit raised).
> >>>>
> >>>> One thing that might or might not have technical merit
> >>>>
> >>>> If someone feels there is a lack of documentation of the
> costs/performance implications of using large vectors, possibly
> including reproducible benchmarks establishing the scaling
> behavior (there seems to be disagreement on O(n) vs O(n^2)).
> >>>>
> >>>> The users *should* know what they are getting into, but if
> the cost is worth it to them, they should be able to pay it
> without forking the project. If this veto causes a fork that's not
> good.
> >>>>
> >>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov
> <msokolov@gmail.com> wrote:
> >>>>> We do have a dataset built from Wikipedia in luceneutil. It
> comes in 100 and 300 dimensional varieties and can easily enough
> generate large numbers of vector documents from the articles data.
> To go higher we could concatenate vectors from that and I believe
> the performance numbers would be plausible.
> >>>>>
> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss
> <dawid.weiss@gmail.com> wrote:
> >>>>>> Can we set up a branch in which the limit is bumped to
> 2048, then have
> >>>>>> a realistic, free data set (wikipedia sample or something)
> that has,
> >>>>>> say, 5 million docs and vectors created using public data
> (glove
> >>>>>> pre-trained embeddings or the like)? We then could run
> indexing on the
> >>>>>> same hardware with 512, 1024 and 2048 and see what the
> numbers, limits
> >>>>>> and behavior actually are.
> >>>>>>
> >>>>>> I can help in writing this but not until after Easter.
> >>>>>>
> >>>>>>
> >>>>>> Dawid
> >>>>>>
> >>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand
> <jpountz@gmail.com> wrote:
> >>>>>>> As Dawid pointed out earlier on this thread, this is the
> rule for
> >>>>>>> Apache projects: a single -1 vote on a code change is a
> veto and
> >>>>>>> cannot be overridden. Furthermore, Robert is one of the
> people on this
> >>>>>>> project who worked the most on debugging subtle bugs,
> making Lucene
> >>>>>>> more robust and improving our test framework, so I'm
> listening when he
> >>>>>>> voices quality concerns.
> >>>>>>>
> >>>>>>> The argument against removing/raising the limit that
> resonates with me
> >>>>>>> the most is that it is a one-way door. As MikeS
> highlighted earlier on
> >>>>>>> this thread, implementations may want to take advantage of
> the fact
> >>>>>>> that there is a limit at some point too. This is why I
> don't want to
> >>>>>>> remove the limit and would prefer a slight increase, such
> as 2048 as
> >>>>>>> suggested in the original issue, which would enable most
> of the things
> >>>>>>> that users who have been asking about raising the limit
> would like to
> >>>>>>> do.
> >>>>>>>
> >>>>>>> I agree that the merge-time memory usage and slow indexing
> rate are
> >>>>>>> not great. But it's still possible to index multi-million
> vector
> >>>>>>> datasets with a 4GB heap without hitting OOMEs regardless
> of the
> >>>>>>> number of dimensions, and the feedback I'm seeing is that
> many users
> >>>>>>> are still interested in indexing multi-million vector
> datasets despite
> >>>>>>> the slow indexing rate. I wish we could do better, and
> vector indexing
> >>>>>>> is certainly more expert than text indexing, but it still
> is usable in
> >>>>>>> my opinion. I understand how giving Lucene more
> information about
> >>>>>>> vectors prior to indexing (e.g. clustering information as
> Jim pointed
> >>>>>>> out) could help make merging faster and more
> memory-efficient, but I
> >>>>>>> would really like to avoid making it a requirement for
> indexing
> >>>>>>> vectors as it also makes this feature much harder to use.
> >>>>>>>
> >>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> >>>>>>> <a.benedetti@sease.io> wrote:
> >>>>>>>> I am very attentive to listen opinions but I am
> un-convinced here and I an not sure that a single person opinion
> should be allowed to be detrimental for such an important project.
> >>>>>>>>
> >>>>>>>> The limit as far as I know is literally just raising an
> exception.
> >>>>>>>> Removing it won't alter in any way the current
> performance for users in low dimensional space.
> >>>>>>>> Removing it will just enable more users to use Lucene.
> >>>>>>>>
> >>>>>>>> If new users in certain situations will be unhappy with
> the performance, they may contribute improvements.
> >>>>>>>> This is how you make progress.
> >>>>>>>>
> >>>>>>>> If it's a reputation thing, trust me that not allowing
> users to play with high dimensional space will equally damage it.
> >>>>>>>>
> >>>>>>>> To me it's really a no brainer.
> >>>>>>>> Removing the limit and enable people to use high
> dimensional vectors will take minutes.
> >>>>>>>> Improving the hnsw implementation can take months.
> >>>>>>>> Pick one to begin with...
> >>>>>>>>
> >>>>>>>> And there's no-one paying me here, no company interest
> whatsoever, actually I pay people to contribute, I am just
> convinced it's a good idea.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com>
> wrote:
> >>>>>>>>> I disagree with your categorization. I put in plenty of
> work and
> >>>>>>>>> experienced plenty of pain myself, writing tests and
> fighting these
> >>>>>>>>> issues, after i saw that, two releases in a row, vector
> indexing fell
> >>>>>>>>> over and hit integer overflows etc on small datasets:
> >>>>>>>>>
> >>>>>>>>> https://github.com/apache/lucene/pull/11905
> >>>>>>>>>
> >>>>>>>>> Attacking me isn't helping the situation.
> >>>>>>>>>
> >>>>>>>>> PS: when i said the "one guy who wrote the code" I
> didn't mean it in
> >>>>>>>>> any kind of demeaning fashion really. I meant to
> describe the current
> >>>>>>>>> state of usability with respect to indexing a few
> million docs with
> >>>>>>>>> high dimensions. You can scroll up the thread and see
> that at least
> >>>>>>>>> one other committer on the project experienced similar
> pain as me.
> >>>>>>>>> Then, think about users who aren't committers trying to
> use the
> >>>>>>>>> functionality!
> >>>>>>>>>
> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov
> <msokolov@gmail.com> wrote:
> >>>>>>>>>> What you said about increasing dimensions requiring a
> bigger ram buffer on merge is wrong. That's the point I was trying
> to make. Your concerns about merge costs are not wrong, but your
> conclusion that we need to limit dimensions is not justified.
> >>>>>>>>>>
> >>>>>>>>>> You complain that hnsw sucks it doesn't scale, but when
> I show it scales linearly with dimension you just ignore that and
> complain about something entirely different.
> >>>>>>>>>>
> >>>>>>>>>> You demand that people run all kinds of tests to prove
> you wrong but when they do, you don't listen and you won't put in
> the work yourself or complain that it's too hard.
> >>>>>>>>>>
> >>>>>>>>>> Then you complain about people not meeting you half
> way. Wow
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir
> <rcmuir@gmail.com> wrote:
> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> >>>>>>>>>>> <michael.wechner@wyona.com> wrote:
> >>>>>>>>>>>> What exactly do you consider reasonable?
> >>>>>>>>>>> Let's begin a real discussion by being HONEST about
> the current
> >>>>>>>>>>> status. Please put politically correct or your own
> company's wishes
> >>>>>>>>>>> aside, we know it's not in a good state.
> >>>>>>>>>>>
> >>>>>>>>>>> Current status is the one guy who wrote the code can set a
> >>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset
> with 1024
> >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
> >>>>>>>>>>>
> >>>>>>>>>>> My concerns are everyone else except the one guy, I
> want it to be
> >>>>>>>>>>> usable. Increasing dimensions just means even bigger
> multi-gigabyte
> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
> >>>>>>>>>>> It is also a permanent backwards compatibility
> decision, we have to
> >>>>>>>>>>> support it once we do this and we can't just say
> "oops" and flip it
> >>>>>>>>>>> back.
> >>>>>>>>>>>
> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer
> is really to
> >>>>>>>>>>> avoid merges because they are so slow and it would be
> DAYS otherwise,
> >>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
> >>>>>>>>>>> Also from personal experience, it takes trial and
> error (means
> >>>>>>>>>>> experiencing OOM on merge!!!) before you get those
> heap values correct
> >>>>>>>>>>> for your dataset. This usually means starting over
> which is
> >>>>>>>>>>> frustrating and wastes more time.
> >>>>>>>>>>>
> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in
> IndexWriter, seems
> >>>>>>>>>>> to me like its a good idea. maybe the multigigabyte
> ram buffer can be
> >>>>>>>>>>> avoided in this way and performance improved by
> writing bigger
> >>>>>>>>>>> segments with lucene's defaults. But this doesn't mean
> we can simply
> >>>>>>>>>>> ignore the horrors of what happens on merge. merging
> needs to scale so
> >>>>>>>>>>> that indexing really scales.
> >>>>>>>>>>>
> >>>>>>>>>>> At least it shouldnt spike RAM on trivial data amounts
> and cause OOM,
> >>>>>>>>>>> and definitely it shouldnt burn hours and hours of CPU
> in O(n^2)
> >>>>>>>>>>> fashion when indexing.
> >>>>>>>>>>>
> >>>>>>>>>>>
> ---------------------------------------------------------------------
> >>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>>>>>>> For additional commands, e-mail:
> dev-help@lucene.apache.org
> >>>>>>>>>>>
> >>>>>>>>>
> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Adrien
> >>>>>>>
> >>>>>>>
> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>
> >>>>
> >>>> --
> >>>> http://www.needhamsoftware.com (work)
> >>>> http://www.the111shift.com (play)
> >>
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 12, 2023, 5:09 AM

Post #92 of 99 (950 views)

My tentative of listing here only a set of proposals to then vote, has
unfortunately failed.

I appreciate the discussion on better benchmarking hnsw but my feeling is
that this discussion is orthogonal to the limit discussion itself, should
we create a separate mail thread/github jira issue for that?

At the moment I see at least three lines of activities as an outcome from
this (maybe too long) discussion:

1) [small task] there's a need from a good amount of people of
increasing/removing the max limit, as an enabler, to get more users to
Lucene and ease adoption for systems Lucene based (Apache Solr,
Elasticsearch, OpenSearch)

2) [medium task] we all want more benchmarks for Lucene vector-based
search, with a good variety of vector dimensions and encodings

3) [big task? ] some people would like to improve vector based search
peformance because currently not acceptable, it's not clear when and how

A question I have for point 1, does it really need to be a one way door?
Can't we reduce the max limit in the future if the implementation becomes
coupled with certain dimension sizes?
It's not ideal I agree, but is back-compatibility more important than
pragmatic benefits?
I. E.
Right now there's no implementation coupled with the max limit - > we
remove/increase the limit and get more Users

With Lucene X.Y a clever committer introduces a super nice implementation
improvements that unfortunately limit the max size to K.
Can't we just document it as a breaking change for such release? So at that
point we won't support >K vectors but for a reason?

Do we have similar precedents in Lucene?

On Wed, 12 Apr 2023, 08:36 Michael Wechner, <michael.wechner@wyona.com>
wrote:

> thank you very much for your feedback!
>
> In a previous post (April 7) you wrote you could make availlable the 47K
> ada-002 vectors, which would be great!
>
> Would it make sense to setup a public gitub repo, such that others could
> use or also contribute vectors?
>
> Thanks
>
> Michael Wechner
>
>
> Am 12.04.23 um 04:51 schrieb Kent Fitch:
>
> I only know some characteristics of the openAI ada-002 vectors, although
> they are a very popular as embeddings/text-characterisations as they allow
> more accurate/"human meaningful" semantic search results with fewer
> dimensions than their predecessors - I've evaluated a few different
> embedding models, including some BERT variants, CLIP ViT-L-14 (with 768
> dims, which was quite good), openAI's ada-001 (1024 dims) and babbage-001
> (2048 dims), and ada-002 are qualitatively the best, although that will
> certainly change!
>
> In any case, ada-002 vectors have interesting characteristics that I think
> mean you could confidently create synthetic vectors which would be hard to
> distinguish from "real" vectors. I found this from looking at 47K ada-002
> vectors generated across a full year (1994) of newspaper articles from the
> Canberra Times and 200K wikipedia articles:
> - there is no discernible/significant correlation between values in any
> pair of dimensions
> - all but 5 of the 1536 dimensions have an almost identical distribution
> of values shown in the central blob on these graphs (that just show a few
> of these 1531 dimensions with clumped values and the 5 "outlier"
> dimensions, but all 1531 non-outlier dims are in there, which makes for
> some easy quantisation from float to byte if you dont want to go the full
> kmeans/clustering/Lloyds-algorithm approach):
>
> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>
> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>
> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
> - the variance of the value of each dimension is characteristic:
>
> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>
> This probably represents something significant about how the ada-002
> embeddings are created, but I think it also means creating "realistic"
> values is possible. I did not use this information when testing recall &
> performance on Lucene's HNSW implementation on 192m documents, as I
> slightly dithered the values of a "real" set on 47K docs and stored other
> fields in the doc that referenced the "base" document that the dithers were
> made from, and used different dithering magnitudes so that I could test
> recall with different neighbour sizes ("M"), construction-beamwidth and
> search-beamwidths.
>
> best regards
>
> Kent Fitch
>
>
>
>
> On Wed, Apr 12, 2023 at 5:08?AM Michael Wechner <michael.wechner@wyona.com>
> wrote:
>
>> I understand what you mean that it seems to be artificial, but I don't
>> understand why this matters to test performance and scalability of the
>> indexing?
>>
>> Let's assume the limit of Lucene would be 4 instead of 1024 and there
>> are only open source models generating vectors with 4 dimensions, for
>> example
>>
>>
>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>
>>
>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>
>>
>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>
>>
>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>
>> and now I concatenate them to vectors with 8 dimensions
>>
>>
>>
>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>
>>
>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>
>> and normalize them to length 1.
>>
>> Why should this be any different to a model which is acting like a black
>> box generating vectors with 8 dimensions?
>>
>>
>>
>>
>> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>> >> What exactly do you consider real vector data? Vector data which is
>> based on texts written by humans?
>> > We have plenty of text; the problem is coming up with a realistic
>> > vector model that requires as many dimensions as people seem to be
>> > demanding. As I said above, after surveying huggingface I couldn't
>> > find any text-based model using more than 768 dimensions. So far we
>> > have some ideas of generating higher-dimensional data by dithering or
>> > concatenating existing data, but it seems artificial.
>> >
>> > On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
>> > <michael.wechner@wyona.com> wrote:
>> >> What exactly do you consider real vector data? Vector data which is
>> based on texts written by humans?
>> >>
>> >> I am asking, because I recently attended the following presentation by
>> Anastassia Shaitarova (UZH Institute for Computational Linguistics,
>> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>> >>
>> >> ----
>> >>
>> >> Can we Identify Machine-Generated Text? An Overview of Current
>> Approaches
>> >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>> >>
>> >> The detection of machine-generated text has become increasingly
>> important due to the prevalence of automated content generation and its
>> potential for misuse. In this talk, we will discuss the motivation for
>> automatic detection of generated text. We will present the currently
>> available methods, including feature-based classification as a “first
>> line-of-defense.” We will provide an overview of the detection tools that
>> have been made available so far and discuss their limitations. Finally, we
>> will reflect on some open problems associated with the automatic
>> discrimination of generated texts.
>> >>
>> >> ----
>> >>
>> >> and her conclusion was that it has become basically impossible to
>> differentiate between text generated by humans and text generated by for
>> example ChatGPT.
>> >>
>> >> Whereas others have a slightly different opinion, see for example
>> >>
>> >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>> >>
>> >> But I would argue that real world and synthetic have become close
>> enough that testing performance and scalability of indexing should be
>> possible with synthetic data.
>> >>
>> >> I completely agree that we have to base our discussions and decisions
>> on scientific methods and that we have to make sure that Lucene performs
>> and scales well and that we understand the limits and what is going on
>> under the hood.
>> >>
>> >> Thanks
>> >>
>> >> Michael W
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>> >>
>> >> +1 to test on real vector data -- if you test on synthetic data you
>> draw synthetic conclusions.
>> >>
>> >> Can someone post the theoretical performance (CPU and RAM required) of
>> HNSW construction? Do we know/believe our HNSW implementation has achieved
>> that theoretical big-O performance? Maybe we have some silly performance
>> bug that's causing it not to?
>> >>
>> >> As I understand it, HNSW makes the tradeoff of costly construction for
>> faster searching, which is typically the right tradeoff for search use
>> cases. We do this in other parts of the Lucene index too.
>> >>
>> >> Lucene will do a logarithmic number of merges over time, i.e. each doc
>> will be merged O(log(N)) times in its lifetime in the index. We need to
>> multiply that by the cost of re-building the whole HNSW graph on each
>> merge. BTW, other things in Lucene, like BKD/dimensional points, also
>> rebuild the whole data structure on each merge, I think? But, as Rob
>> pointed out, stored fields merging do indeed do some sneaky tricks to avoid
>> excessive block decompress/recompress on each merge.
>> >>
>> >>> As I understand it, vetoes must have technical merit. I'm not sure
>> that this veto rises to "technical merit" on 2 counts:
>> >> Actually I think Robert's veto stands on its technical merit already.
>> Robert's take on technical matters very much resonate with me, even if he
>> is sometimes prickly in how he expresses them ;)
>> >>
>> >> His point is that we, as a dev community, are not paying enough
>> attention to the indexing performance of our KNN algo (HNSW) and
>> implementation, and that it is reckless to increase / remove limits in that
>> state. It is indeed a one-way door decision and one must confront such
>> decisions with caution, especially for such a widely used base
>> infrastructure as Lucene. We don't even advertise today in our javadocs
>> that you need XXX heap if you index vectors with dimension Y, fanout X,
>> levels Z, etc.
>> >>
>> >> RAM used during merging is unaffected by dimensionality, but is
>> affected by fanout, because the HNSW graph (not the raw vectors) is memory
>> resident, I think? Maybe we could move it off-heap and let the OS manage
>> the memory (and still document the RAM requirements)? Maybe merge RAM
>> costs should be accounted for in IW's RAM buffer accounting? It is not
>> today, and there are some other things that use non-trivial RAM, e.g. the
>> doc mapping (to compress docid space when deletions are reclaimed).
>> >>
>> >> When we added KNN vector testing to Lucene's nightly benchmarks, the
>> indexing time massively increased -- see annotations DH and DP here:
>> https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly
>> benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of
>> course, that is using a single thread for indexing (on a box that has 128
>> cores!) so we produce a deterministic index every night ...
>> >>
>> >> Stepping out (meta) a bit ... this discussion is precisely one of the
>> awesome benefits of the (informed) veto. It means risky changes to the
>> software, as determined by any single informed developer on the project,
>> can force a healthy discussion about the problem at hand. Robert is
>> legitimately concerned about a real issue and so we should use our creative
>> energies to characterize our HNSW implementation's performance, document it
>> clearly for users, and uncover ways to improve it.
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti <
>> a.benedetti@sease.io> wrote:
>> >>> I think Gus points are on target.
>> >>>
>> >>> I recommend we move this forward in this way:
>> >>> We stop any discussion and everyone interested proposes an option
>> with a motivation, then we aggregate the options and we create a Vote maybe?
>> >>>
>> >>> I am also on the same page on the fact that a veto should come with a
>> clear and reasonable technical merit, which also in my opinion has not come
>> yet.
>> >>>
>> >>> I also apologise if any of my words sounded harsh or personal
>> attacks, never meant to do so.
>> >>>
>> >>> My proposed option:
>> >>>
>> >>> 1) remove the limit and potentially make it configurable,
>> >>> Motivation:
>> >>> The system administrator can enforce a limit its users need to
>> respect that it's in line with whatever the admin decided to be acceptable
>> for them.
>> >>> Default can stay the current one.
>> >>>
>> >>> That's my favourite at the moment, but I agree that potentially in
>> the future this may need to change, as we may optimise the data structures
>> for certain dimensions. I am a big fan of Yagni (you aren't going to need
>> it) so I am ok we'll face a different discussion if that happens in the
>> future.
>> >>>
>> >>>
>> >>>
>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
>> >>>> What I see so far:
>> >>>>
>> >>>> Much positive support for raising the limit
>> >>>> Slightly less support for removing it or making it configurable
>> >>>> A single veto which argues that a (as yet undefined) performance
>> standard must be met before raising the limit
>> >>>> Hot tempers (various) making this discussion difficult
>> >>>>
>> >>>> As I understand it, vetoes must have technical merit. I'm not sure
>> that this veto rises to "technical merit" on 2 counts:
>> >>>>
>> >>>> No standard for the performance is given so it cannot be technically
>> met. Without hard criteria it's a moving target.
>> >>>> It appears to encode a valuation of the user's time, and that
>> valuation is really up to the user. Some users may consider 2hours useless
>> and not worth it, and others might happily wait 2 hours. This is not a
>> technical decision, it's a business decision regarding the relative value
>> of the time invested vs the value of the result. If I can cure cancer by
>> indexing for a year, that might be worth it... (hyperbole of course).
>> >>>>
>> >>>> Things I would consider to have technical merit that I don't hear:
>> >>>>
>> >>>> Impact on the speed of **other** indexing operations. (devaluation
>> of other functionality)
>> >>>> Actual scenarios that work when the limit is low and fail when the
>> limit is high (new failure on the same data with the limit raised).
>> >>>>
>> >>>> One thing that might or might not have technical merit
>> >>>>
>> >>>> If someone feels there is a lack of documentation of the
>> costs/performance implications of using large vectors, possibly including
>> reproducible benchmarks establishing the scaling behavior (there seems to
>> be disagreement on O(n) vs O(n^2)).
>> >>>>
>> >>>> The users *should* know what they are getting into, but if the cost
>> is worth it to them, they should be able to pay it without forking the
>> project. If this veto causes a fork that's not good.
>> >>>>
>> >>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes
>> in 100 and 300 dimensional varieties and can easily enough generate large
>> numbers of vector documents from the articles data. To go higher we could
>> concatenate vectors from that and I believe the performance numbers would
>> be plausible.
>> >>>>>
>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com>
>> wrote:
>> >>>>>> Can we set up a branch in which the limit is bumped to 2048, then
>> have
>> >>>>>> a realistic, free data set (wikipedia sample or something) that
>> has,
>> >>>>>> say, 5 million docs and vectors created using public data (glove
>> >>>>>> pre-trained embeddings or the like)? We then could run indexing on
>> the
>> >>>>>> same hardware with 512, 1024 and 2048 and see what the numbers,
>> limits
>> >>>>>> and behavior actually are.
>> >>>>>>
>> >>>>>> I can help in writing this but not until after Easter.
>> >>>>>>
>> >>>>>>
>> >>>>>> Dawid
>> >>>>>>
>> >>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com>
>> wrote:
>> >>>>>>> As Dawid pointed out earlier on this thread, this is the rule for
>> >>>>>>> Apache projects: a single -1 vote on a code change is a veto and
>> >>>>>>> cannot be overridden. Furthermore, Robert is one of the people on
>> this
>> >>>>>>> project who worked the most on debugging subtle bugs, making
>> Lucene
>> >>>>>>> more robust and improving our test framework, so I'm listening
>> when he
>> >>>>>>> voices quality concerns.
>> >>>>>>>
>> >>>>>>> The argument against removing/raising the limit that resonates
>> with me
>> >>>>>>> the most is that it is a one-way door. As MikeS highlighted
>> earlier on
>> >>>>>>> this thread, implementations may want to take advantage of the
>> fact
>> >>>>>>> that there is a limit at some point too. This is why I don't want
>> to
>> >>>>>>> remove the limit and would prefer a slight increase, such as 2048
>> as
>> >>>>>>> suggested in the original issue, which would enable most of the
>> things
>> >>>>>>> that users who have been asking about raising the limit would
>> like to
>> >>>>>>> do.
>> >>>>>>>
>> >>>>>>> I agree that the merge-time memory usage and slow indexing rate
>> are
>> >>>>>>> not great. But it's still possible to index multi-million vector
>> >>>>>>> datasets with a 4GB heap without hitting OOMEs regardless of the
>> >>>>>>> number of dimensions, and the feedback I'm seeing is that many
>> users
>> >>>>>>> are still interested in indexing multi-million vector datasets
>> despite
>> >>>>>>> the slow indexing rate. I wish we could do better, and vector
>> indexing
>> >>>>>>> is certainly more expert than text indexing, but it still is
>> usable in
>> >>>>>>> my opinion. I understand how giving Lucene more information about
>> >>>>>>> vectors prior to indexing (e.g. clustering information as Jim
>> pointed
>> >>>>>>> out) could help make merging faster and more memory-efficient,
>> but I
>> >>>>>>> would really like to avoid making it a requirement for indexing
>> >>>>>>> vectors as it also makes this feature much harder to use.
>> >>>>>>>
>> >>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>> >>>>>>> <a.benedetti@sease.io> wrote:
>> >>>>>>>> I am very attentive to listen opinions but I am un-convinced
>> here and I an not sure that a single person opinion should be allowed to be
>> detrimental for such an important project.
>> >>>>>>>>
>> >>>>>>>> The limit as far as I know is literally just raising an
>> exception.
>> >>>>>>>> Removing it won't alter in any way the current performance for
>> users in low dimensional space.
>> >>>>>>>> Removing it will just enable more users to use Lucene.
>> >>>>>>>>
>> >>>>>>>> If new users in certain situations will be unhappy with the
>> performance, they may contribute improvements.
>> >>>>>>>> This is how you make progress.
>> >>>>>>>>
>> >>>>>>>> If it's a reputation thing, trust me that not allowing users to
>> play with high dimensional space will equally damage it.
>> >>>>>>>>
>> >>>>>>>> To me it's really a no brainer.
>> >>>>>>>> Removing the limit and enable people to use high dimensional
>> vectors will take minutes.
>> >>>>>>>> Improving the hnsw implementation can take months.
>> >>>>>>>> Pick one to begin with...
>> >>>>>>>>
>> >>>>>>>> And there's no-one paying me here, no company interest
>> whatsoever, actually I pay people to contribute, I am just convinced it's a
>> good idea.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>> >>>>>>>>> I disagree with your categorization. I put in plenty of work and
>> >>>>>>>>> experienced plenty of pain myself, writing tests and fighting
>> these
>> >>>>>>>>> issues, after i saw that, two releases in a row, vector
>> indexing fell
>> >>>>>>>>> over and hit integer overflows etc on small datasets:
>> >>>>>>>>>
>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>> >>>>>>>>>
>> >>>>>>>>> Attacking me isn't helping the situation.
>> >>>>>>>>>
>> >>>>>>>>> PS: when i said the "one guy who wrote the code" I didn't mean
>> it in
>> >>>>>>>>> any kind of demeaning fashion really. I meant to describe the
>> current
>> >>>>>>>>> state of usability with respect to indexing a few million docs
>> with
>> >>>>>>>>> high dimensions. You can scroll up the thread and see that at
>> least
>> >>>>>>>>> one other committer on the project experienced similar pain as
>> me.
>> >>>>>>>>> Then, think about users who aren't committers trying to use the
>> >>>>>>>>> functionality!
>> >>>>>>>>>
>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
>> msokolov@gmail.com> wrote:
>> >>>>>>>>>> What you said about increasing dimensions requiring a bigger
>> ram buffer on merge is wrong. That's the point I was trying to make. Your
>> concerns about merge costs are not wrong, but your conclusion that we need
>> to limit dimensions is not justified.
>> >>>>>>>>>>
>> >>>>>>>>>> You complain that hnsw sucks it doesn't scale, but when I show
>> it scales linearly with dimension you just ignore that and complain about
>> something entirely different.
>> >>>>>>>>>>
>> >>>>>>>>>> You demand that people run all kinds of tests to prove you
>> wrong but when they do, you don't listen and you won't put in the work
>> yourself or complain that it's too hard.
>> >>>>>>>>>>
>> >>>>>>>>>> Then you complain about people not meeting you half way. Wow
>> >>>>>>>>>>
>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
>> wrote:
>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> >>>>>>>>>>> <michael.wechner@wyona.com> wrote:
>> >>>>>>>>>>>> What exactly do you consider reasonable?
>> >>>>>>>>>>> Let's begin a real discussion by being HONEST about the
>> current
>> >>>>>>>>>>> status. Please put politically correct or your own company's
>> wishes
>> >>>>>>>>>>> aside, we know it's not in a good state.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Current status is the one guy who wrote the code can set a
>> >>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset with 1024
>> >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
>> >>>>>>>>>>>
>> >>>>>>>>>>> My concerns are everyone else except the one guy, I want it
>> to be
>> >>>>>>>>>>> usable. Increasing dimensions just means even bigger
>> multi-gigabyte
>> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>> >>>>>>>>>>> It is also a permanent backwards compatibility decision, we
>> have to
>> >>>>>>>>>>> support it once we do this and we can't just say "oops" and
>> flip it
>> >>>>>>>>>>> back.
>> >>>>>>>>>>>
>> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer is
>> really to
>> >>>>>>>>>>> avoid merges because they are so slow and it would be DAYS
>> otherwise,
>> >>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
>> >>>>>>>>>>> Also from personal experience, it takes trial and error (means
>> >>>>>>>>>>> experiencing OOM on merge!!!) before you get those heap
>> values correct
>> >>>>>>>>>>> for your dataset. This usually means starting over which is
>> >>>>>>>>>>> frustrating and wastes more time.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in
>> IndexWriter, seems
>> >>>>>>>>>>> to me like its a good idea. maybe the multigigabyte ram
>> buffer can be
>> >>>>>>>>>>> avoided in this way and performance improved by writing bigger
>> >>>>>>>>>>> segments with lucene's defaults. But this doesn't mean we can
>> simply
>> >>>>>>>>>>> ignore the horrors of what happens on merge. merging needs to
>> scale so
>> >>>>>>>>>>> that indexing really scales.
>> >>>>>>>>>>>
>> >>>>>>>>>>> At least it shouldnt spike RAM on trivial data amounts and
>> cause OOM,
>> >>>>>>>>>>> and definitely it shouldnt burn hours and hours of CPU in
>> O(n^2)
>> >>>>>>>>>>> fashion when indexing.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>>>>>>
>> >>>>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Adrien
>> >>>>>>>
>> >>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>>
>> >>>>>>
>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>
>> >>>>
>> >>>> --
>> >>>> http://www.needhamsoftware.com (work)
>> >>>> http://www.the111shift.com (play)
>> >>
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 12, 2023, 5:53 AM

Post #93 of 99 (950 views)

Just addressing [1] I believe there is a simple workaround. Here's a
unit test demonstrating:

public void testExcessivelyLargeVector() throws Exception {
IndexableFieldType vector2048 = new FieldType() {
@Override
public int vectorDimension() {
return 2048;
}

@Override
public VectorEncoding vectorEncoding() {
return VectorEncoding.FLOAT32;
}

@Override
public VectorSimilarityFunction vectorSimilarityFunction() {
return VectorSimilarityFunction.EUCLIDEAN;
}
};
try (Directory dir = newDirectory();
IndexWriter iw = new IndexWriter(dir,
newIndexWriterConfig(null).setCodec(codec))) {
Document doc = new Document();
FieldType type = new FieldType(vector2048);
doc.add(new KnnVectorField("vector2048", new float[2048], type));
iw.addDocument(doc);
}
}

On Wed, Apr 12, 2023 at 8:10?AM Alessandro Benedetti
<a.benedetti@sease.io> wrote:
>
> My tentative of listing here only a set of proposals to then vote, has unfortunately failed.
>
> I appreciate the discussion on better benchmarking hnsw but my feeling is that this discussion is orthogonal to the limit discussion itself, should we create a separate mail thread/github jira issue for that?
>
> At the moment I see at least three lines of activities as an outcome from this (maybe too long) discussion:
>
> 1) [small task] there's a need from a good amount of people of increasing/removing the max limit, as an enabler, to get more users to Lucene and ease adoption for systems Lucene based (Apache Solr, Elasticsearch, OpenSearch)
>
> 2) [medium task] we all want more benchmarks for Lucene vector-based search, with a good variety of vector dimensions and encodings
>
> 3) [big task? ] some people would like to improve vector based search peformance because currently not acceptable, it's not clear when and how
>
> A question I have for point 1, does it really need to be a one way door?
> Can't we reduce the max limit in the future if the implementation becomes coupled with certain dimension sizes?
> It's not ideal I agree, but is back-compatibility more important than pragmatic benefits?
> I. E.
> Right now there's no implementation coupled with the max limit - > we remove/increase the limit and get more Users
>
> With Lucene X.Y a clever committer introduces a super nice implementation improvements that unfortunately limit the max size to K.
> Can't we just document it as a breaking change for such release? So at that point we won't support >K vectors but for a reason?
>
> Do we have similar precedents in Lucene?
>
>
>
> On Wed, 12 Apr 2023, 08:36 Michael Wechner, <michael.wechner@wyona.com> wrote:
>>
>> thank you very much for your feedback!
>>
>> In a previous post (April 7) you wrote you could make availlable the 47K ada-002 vectors, which would be great!
>>
>> Would it make sense to setup a public gitub repo, such that others could use or also contribute vectors?
>>
>> Thanks
>>
>> Michael Wechner
>>
>>
>> Am 12.04.23 um 04:51 schrieb Kent Fitch:
>>
>> I only know some characteristics of the openAI ada-002 vectors, although they are a very popular as embeddings/text-characterisations as they allow more accurate/"human meaningful" semantic search results with fewer dimensions than their predecessors - I've evaluated a few different embedding models, including some BERT variants, CLIP ViT-L-14 (with 768 dims, which was quite good), openAI's ada-001 (1024 dims) and babbage-001 (2048 dims), and ada-002 are qualitatively the best, although that will certainly change!
>>
>> In any case, ada-002 vectors have interesting characteristics that I think mean you could confidently create synthetic vectors which would be hard to distinguish from "real" vectors. I found this from looking at 47K ada-002 vectors generated across a full year (1994) of newspaper articles from the Canberra Times and 200K wikipedia articles:
>> - there is no discernible/significant correlation between values in any pair of dimensions
>> - all but 5 of the 1536 dimensions have an almost identical distribution of values shown in the central blob on these graphs (that just show a few of these 1531 dimensions with clumped values and the 5 "outlier" dimensions, but all 1531 non-outlier dims are in there, which makes for some easy quantisation from float to byte if you dont want to go the full kmeans/clustering/Lloyds-algorithm approach):
>> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>> - the variance of the value of each dimension is characteristic:
>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>>
>> This probably represents something significant about how the ada-002 embeddings are created, but I think it also means creating "realistic" values is possible. I did not use this information when testing recall & performance on Lucene's HNSW implementation on 192m documents, as I slightly dithered the values of a "real" set on 47K docs and stored other fields in the doc that referenced the "base" document that the dithers were made from, and used different dithering magnitudes so that I could test recall with different neighbour sizes ("M"), construction-beamwidth and search-beamwidths.
>>
>> best regards
>>
>> Kent Fitch
>>
>>
>>
>>
>> On Wed, Apr 12, 2023 at 5:08?AM Michael Wechner <michael.wechner@wyona.com> wrote:
>>>
>>> I understand what you mean that it seems to be artificial, but I don't
>>> understand why this matters to test performance and scalability of the
>>> indexing?
>>>
>>> Let's assume the limit of Lucene would be 4 instead of 1024 and there
>>> are only open source models generating vectors with 4 dimensions, for
>>> example
>>>
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>>
>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>>
>>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>
>>> and now I concatenate them to vectors with 8 dimensions
>>>
>>>
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>
>>> and normalize them to length 1.
>>>
>>> Why should this be any different to a model which is acting like a black
>>> box generating vectors with 8 dimensions?
>>>
>>>
>>>
>>>
>>> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>>> >> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>>> > We have plenty of text; the problem is coming up with a realistic
>>> > vector model that requires as many dimensions as people seem to be
>>> > demanding. As I said above, after surveying huggingface I couldn't
>>> > find any text-based model using more than 768 dimensions. So far we
>>> > have some ideas of generating higher-dimensional data by dithering or
>>> > concatenating existing data, but it seems artificial.
>>> >
>>> > On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>>> >>
>>> >> I am asking, because I recently attended the following presentation by Anastassia Shaitarova (UZH Institute for Computational Linguistics, https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>>> >>
>>> >> ----
>>> >>
>>> >> Can we Identify Machine-Generated Text? An Overview of Current Approaches
>>> >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>>> >>
>>> >> The detection of machine-generated text has become increasingly important due to the prevalence of automated content generation and its potential for misuse. In this talk, we will discuss the motivation for automatic detection of generated text. We will present the currently available methods, including feature-based classification as a “first line-of-defense.” We will provide an overview of the detection tools that have been made available so far and discuss their limitations. Finally, we will reflect on some open problems associated with the automatic discrimination of generated texts.
>>> >>
>>> >> ----
>>> >>
>>> >> and her conclusion was that it has become basically impossible to differentiate between text generated by humans and text generated by for example ChatGPT.
>>> >>
>>> >> Whereas others have a slightly different opinion, see for example
>>> >>
>>> >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>> >>
>>> >> But I would argue that real world and synthetic have become close enough that testing performance and scalability of indexing should be possible with synthetic data.
>>> >>
>>> >> I completely agree that we have to base our discussions and decisions on scientific methods and that we have to make sure that Lucene performs and scales well and that we understand the limits and what is going on under the hood.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael W
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>>> >>
>>> >> +1 to test on real vector data -- if you test on synthetic data you draw synthetic conclusions.
>>> >>
>>> >> Can someone post the theoretical performance (CPU and RAM required) of HNSW construction? Do we know/believe our HNSW implementation has achieved that theoretical big-O performance? Maybe we have some silly performance bug that's causing it not to?
>>> >>
>>> >> As I understand it, HNSW makes the tradeoff of costly construction for faster searching, which is typically the right tradeoff for search use cases. We do this in other parts of the Lucene index too.
>>> >>
>>> >> Lucene will do a logarithmic number of merges over time, i.e. each doc will be merged O(log(N)) times in its lifetime in the index. We need to multiply that by the cost of re-building the whole HNSW graph on each merge. BTW, other things in Lucene, like BKD/dimensional points, also rebuild the whole data structure on each merge, I think? But, as Rob pointed out, stored fields merging do indeed do some sneaky tricks to avoid excessive block decompress/recompress on each merge.
>>> >>
>>> >>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>> >> Actually I think Robert's veto stands on its technical merit already. Robert's take on technical matters very much resonate with me, even if he is sometimes prickly in how he expresses them ;)
>>> >>
>>> >> His point is that we, as a dev community, are not paying enough attention to the indexing performance of our KNN algo (HNSW) and implementation, and that it is reckless to increase / remove limits in that state. It is indeed a one-way door decision and one must confront such decisions with caution, especially for such a widely used base infrastructure as Lucene. We don't even advertise today in our javadocs that you need XXX heap if you index vectors with dimension Y, fanout X, levels Z, etc.
>>> >>
>>> >> RAM used during merging is unaffected by dimensionality, but is affected by fanout, because the HNSW graph (not the raw vectors) is memory resident, I think? Maybe we could move it off-heap and let the OS manage the memory (and still document the RAM requirements)? Maybe merge RAM costs should be accounted for in IW's RAM buffer accounting? It is not today, and there are some other things that use non-trivial RAM, e.g. the doc mapping (to compress docid space when deletions are reclaimed).
>>> >>
>>> >> When we added KNN vector testing to Lucene's nightly benchmarks, the indexing time massively increased -- see annotations DH and DP here: https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of course, that is using a single thread for indexing (on a box that has 128 cores!) so we produce a deterministic index every night ...
>>> >>
>>> >> Stepping out (meta) a bit ... this discussion is precisely one of the awesome benefits of the (informed) veto. It means risky changes to the software, as determined by any single informed developer on the project, can force a healthy discussion about the problem at hand. Robert is legitimately concerned about a real issue and so we should use our creative energies to characterize our HNSW implementation's performance, document it clearly for users, and uncover ways to improve it.
>>> >>
>>> >> Mike McCandless
>>> >>
>>> >> http://blog.mikemccandless.com
>>> >>
>>> >>
>>> >> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti <a.benedetti@sease.io> wrote:
>>> >>> I think Gus points are on target.
>>> >>>
>>> >>> I recommend we move this forward in this way:
>>> >>> We stop any discussion and everyone interested proposes an option with a motivation, then we aggregate the options and we create a Vote maybe?
>>> >>>
>>> >>> I am also on the same page on the fact that a veto should come with a clear and reasonable technical merit, which also in my opinion has not come yet.
>>> >>>
>>> >>> I also apologise if any of my words sounded harsh or personal attacks, never meant to do so.
>>> >>>
>>> >>> My proposed option:
>>> >>>
>>> >>> 1) remove the limit and potentially make it configurable,
>>> >>> Motivation:
>>> >>> The system administrator can enforce a limit its users need to respect that it's in line with whatever the admin decided to be acceptable for them.
>>> >>> Default can stay the current one.
>>> >>>
>>> >>> That's my favourite at the moment, but I agree that potentially in the future this may need to change, as we may optimise the data structures for certain dimensions. I am a big fan of Yagni (you aren't going to need it) so I am ok we'll face a different discussion if that happens in the future.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
>>> >>>> What I see so far:
>>> >>>>
>>> >>>> Much positive support for raising the limit
>>> >>>> Slightly less support for removing it or making it configurable
>>> >>>> A single veto which argues that a (as yet undefined) performance standard must be met before raising the limit
>>> >>>> Hot tempers (various) making this discussion difficult
>>> >>>>
>>> >>>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>> >>>>
>>> >>>> No standard for the performance is given so it cannot be technically met. Without hard criteria it's a moving target.
>>> >>>> It appears to encode a valuation of the user's time, and that valuation is really up to the user. Some users may consider 2hours useless and not worth it, and others might happily wait 2 hours. This is not a technical decision, it's a business decision regarding the relative value of the time invested vs the value of the result. If I can cure cancer by indexing for a year, that might be worth it... (hyperbole of course).
>>> >>>>
>>> >>>> Things I would consider to have technical merit that I don't hear:
>>> >>>>
>>> >>>> Impact on the speed of **other** indexing operations. (devaluation of other functionality)
>>> >>>> Actual scenarios that work when the limit is low and fail when the limit is high (new failure on the same data with the limit raised).
>>> >>>>
>>> >>>> One thing that might or might not have technical merit
>>> >>>>
>>> >>>> If someone feels there is a lack of documentation of the costs/performance implications of using large vectors, possibly including reproducible benchmarks establishing the scaling behavior (there seems to be disagreement on O(n) vs O(n^2)).
>>> >>>>
>>> >>>> The users *should* know what they are getting into, but if the cost is worth it to them, they should be able to pay it without forking the project. If this veto causes a fork that's not good.
>>> >>>>
>>> >>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>> >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100 and 300 dimensional varieties and can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that and I believe the performance numbers would be plausible.
>>> >>>>>
>>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>>> >>>>>> Can we set up a branch in which the limit is bumped to 2048, then have
>>> >>>>>> a realistic, free data set (wikipedia sample or something) that has,
>>> >>>>>> say, 5 million docs and vectors created using public data (glove
>>> >>>>>> pre-trained embeddings or the like)? We then could run indexing on the
>>> >>>>>> same hardware with 512, 1024 and 2048 and see what the numbers, limits
>>> >>>>>> and behavior actually are.
>>> >>>>>>
>>> >>>>>> I can help in writing this but not until after Easter.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Dawid
>>> >>>>>>
>>> >>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>>> >>>>>>> As Dawid pointed out earlier on this thread, this is the rule for
>>> >>>>>>> Apache projects: a single -1 vote on a code change is a veto and
>>> >>>>>>> cannot be overridden. Furthermore, Robert is one of the people on this
>>> >>>>>>> project who worked the most on debugging subtle bugs, making Lucene
>>> >>>>>>> more robust and improving our test framework, so I'm listening when he
>>> >>>>>>> voices quality concerns.
>>> >>>>>>>
>>> >>>>>>> The argument against removing/raising the limit that resonates with me
>>> >>>>>>> the most is that it is a one-way door. As MikeS highlighted earlier on
>>> >>>>>>> this thread, implementations may want to take advantage of the fact
>>> >>>>>>> that there is a limit at some point too. This is why I don't want to
>>> >>>>>>> remove the limit and would prefer a slight increase, such as 2048 as
>>> >>>>>>> suggested in the original issue, which would enable most of the things
>>> >>>>>>> that users who have been asking about raising the limit would like to
>>> >>>>>>> do.
>>> >>>>>>>
>>> >>>>>>> I agree that the merge-time memory usage and slow indexing rate are
>>> >>>>>>> not great. But it's still possible to index multi-million vector
>>> >>>>>>> datasets with a 4GB heap without hitting OOMEs regardless of the
>>> >>>>>>> number of dimensions, and the feedback I'm seeing is that many users
>>> >>>>>>> are still interested in indexing multi-million vector datasets despite
>>> >>>>>>> the slow indexing rate. I wish we could do better, and vector indexing
>>> >>>>>>> is certainly more expert than text indexing, but it still is usable in
>>> >>>>>>> my opinion. I understand how giving Lucene more information about
>>> >>>>>>> vectors prior to indexing (e.g. clustering information as Jim pointed
>>> >>>>>>> out) could help make merging faster and more memory-efficient, but I
>>> >>>>>>> would really like to avoid making it a requirement for indexing
>>> >>>>>>> vectors as it also makes this feature much harder to use.
>>> >>>>>>>
>>> >>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>> >>>>>>> <a.benedetti@sease.io> wrote:
>>> >>>>>>>> I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>>> >>>>>>>>
>>> >>>>>>>> The limit as far as I know is literally just raising an exception.
>>> >>>>>>>> Removing it won't alter in any way the current performance for users in low dimensional space.
>>> >>>>>>>> Removing it will just enable more users to use Lucene.
>>> >>>>>>>>
>>> >>>>>>>> If new users in certain situations will be unhappy with the performance, they may contribute improvements.
>>> >>>>>>>> This is how you make progress.
>>> >>>>>>>>
>>> >>>>>>>> If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>>> >>>>>>>>
>>> >>>>>>>> To me it's really a no brainer.
>>> >>>>>>>> Removing the limit and enable people to use high dimensional vectors will take minutes.
>>> >>>>>>>> Improving the hnsw implementation can take months.
>>> >>>>>>>> Pick one to begin with...
>>> >>>>>>>>
>>> >>>>>>>> And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>> >>>>>>>>> I disagree with your categorization. I put in plenty of work and
>>> >>>>>>>>> experienced plenty of pain myself, writing tests and fighting these
>>> >>>>>>>>> issues, after i saw that, two releases in a row, vector indexing fell
>>> >>>>>>>>> over and hit integer overflows etc on small datasets:
>>> >>>>>>>>>
>>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>>> >>>>>>>>>
>>> >>>>>>>>> Attacking me isn't helping the situation.
>>> >>>>>>>>>
>>> >>>>>>>>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>>> >>>>>>>>> any kind of demeaning fashion really. I meant to describe the current
>>> >>>>>>>>> state of usability with respect to indexing a few million docs with
>>> >>>>>>>>> high dimensions. You can scroll up the thread and see that at least
>>> >>>>>>>>> one other committer on the project experienced similar pain as me.
>>> >>>>>>>>> Then, think about users who aren't committers trying to use the
>>> >>>>>>>>> functionality!
>>> >>>>>>>>>
>>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>>> >>>>>>>>>> What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>> >>>>>>>>>>
>>> >>>>>>>>>> You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>> >>>>>>>>>>
>>> >>>>>>>>>> You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Then you complain about people not meeting you half way. Wow
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>> >>>>>>>>>>> <michael.wechner@wyona.com> wrote:
>>> >>>>>>>>>>>> What exactly do you consider reasonable?
>>> >>>>>>>>>>> Let's begin a real discussion by being HONEST about the current
>>> >>>>>>>>>>> status. Please put politically correct or your own company's wishes
>>> >>>>>>>>>>> aside, we know it's not in a good state.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Current status is the one guy who wrote the code can set a
>>> >>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset with 1024
>>> >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> My concerns are everyone else except the one guy, I want it to be
>>> >>>>>>>>>>> usable. Increasing dimensions just means even bigger multi-gigabyte
>>> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>>> >>>>>>>>>>> It is also a permanent backwards compatibility decision, we have to
>>> >>>>>>>>>>> support it once we do this and we can't just say "oops" and flip it
>>> >>>>>>>>>>> back.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer is really to
>>> >>>>>>>>>>> avoid merges because they are so slow and it would be DAYS otherwise,
>>> >>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
>>> >>>>>>>>>>> Also from personal experience, it takes trial and error (means
>>> >>>>>>>>>>> experiencing OOM on merge!!!) before you get those heap values correct
>>> >>>>>>>>>>> for your dataset. This usually means starting over which is
>>> >>>>>>>>>>> frustrating and wastes more time.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>>> >>>>>>>>>>> to me like its a good idea. maybe the multigigabyte ram buffer can be
>>> >>>>>>>>>>> avoided in this way and performance improved by writing bigger
>>> >>>>>>>>>>> segments with lucene's defaults. But this doesn't mean we can simply
>>> >>>>>>>>>>> ignore the horrors of what happens on merge. merging needs to scale so
>>> >>>>>>>>>>> that indexing really scales.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>>> >>>>>>>>>>> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>> >>>>>>>>>>> fashion when indexing.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> ---------------------------------------------------------------------
>>> >>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>>>>>>
>>> >>>>>>>>> ---------------------------------------------------------------------
>>> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>>>>
>>> >>>>>>>
>>> >>>>>>> --
>>> >>>>>>> Adrien
>>> >>>>>>>
>>> >>>>>>> ---------------------------------------------------------------------
>>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>>
>>> >>>>>> ---------------------------------------------------------------------
>>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>
>>> >>>>
>>> >>>> --
>>> >>>> http://www.needhamsoftware.com (work)
>>> >>>> http://www.the111shift.com (play)
>>> >>
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

kent.fitch at gmail

Apr 12, 2023, 4:35 PM

Post #94 of 99 (950 views)

Hi Michael (and anyone else who wants just over 240K "real world" ada-002
vectors of dimension 1536),
you are welcome to retrieve a tar.gz file which contains:
- 47K embeddings of Canberra Times news article text from 1994
- 38K embeddings of the first paragraphs of wikipedia articles about
organisations
- 156.6K embeddings of the first paragraphs of wikipedia articles about
people

https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing

The file is about 1.7GB and will expand to about 4.4GB. This file will be
accessible for at least a week, and I hope you dont hit any google drive
download limits trying to retrieve it.

The embeddings were generated using my openAI account and you are welcome
to use them for any purpose you like.

best wishes,

Kent Fitch

On Wed, Apr 12, 2023 at 4:37?PM Michael Wechner <michael.wechner@wyona.com>
wrote:

> thank you very much for your feedback!
>
> In a previous post (April 7) you wrote you could make availlable the 47K
> ada-002 vectors, which would be great!
>
> Would it make sense to setup a public gitub repo, such that others could
> use or also contribute vectors?
>
> Thanks
>
> Michael Wechner
>
>
> Am 12.04.23 um 04:51 schrieb Kent Fitch:
>
> I only know some characteristics of the openAI ada-002 vectors, although
> they are a very popular as embeddings/text-characterisations as they allow
> more accurate/"human meaningful" semantic search results with fewer
> dimensions than their predecessors - I've evaluated a few different
> embedding models, including some BERT variants, CLIP ViT-L-14 (with 768
> dims, which was quite good), openAI's ada-001 (1024 dims) and babbage-001
> (2048 dims), and ada-002 are qualitatively the best, although that will
> certainly change!
>
> In any case, ada-002 vectors have interesting characteristics that I think
> mean you could confidently create synthetic vectors which would be hard to
> distinguish from "real" vectors. I found this from looking at 47K ada-002
> vectors generated across a full year (1994) of newspaper articles from the
> Canberra Times and 200K wikipedia articles:
> - there is no discernible/significant correlation between values in any
> pair of dimensions
> - all but 5 of the 1536 dimensions have an almost identical distribution
> of values shown in the central blob on these graphs (that just show a few
> of these 1531 dimensions with clumped values and the 5 "outlier"
> dimensions, but all 1531 non-outlier dims are in there, which makes for
> some easy quantisation from float to byte if you dont want to go the full
> kmeans/clustering/Lloyds-algorithm approach):
>
> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>
> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>
> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
> - the variance of the value of each dimension is characteristic:
>
> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>
> This probably represents something significant about how the ada-002
> embeddings are created, but I think it also means creating "realistic"
> values is possible. I did not use this information when testing recall &
> performance on Lucene's HNSW implementation on 192m documents, as I
> slightly dithered the values of a "real" set on 47K docs and stored other
> fields in the doc that referenced the "base" document that the dithers were
> made from, and used different dithering magnitudes so that I could test
> recall with different neighbour sizes ("M"), construction-beamwidth and
> search-beamwidths.
>
> best regards
>
> Kent Fitch
>
>
>
>
> On Wed, Apr 12, 2023 at 5:08?AM Michael Wechner <michael.wechner@wyona.com>
> wrote:
>
>> I understand what you mean that it seems to be artificial, but I don't
>> understand why this matters to test performance and scalability of the
>> indexing?
>>
>> Let's assume the limit of Lucene would be 4 instead of 1024 and there
>> are only open source models generating vectors with 4 dimensions, for
>> example
>>
>>
>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>
>>
>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>
>>
>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>
>>
>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>
>> and now I concatenate them to vectors with 8 dimensions
>>
>>
>>
>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>
>>
>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>
>> and normalize them to length 1.
>>
>> Why should this be any different to a model which is acting like a black
>> box generating vectors with 8 dimensions?
>>
>>
>>
>>
>> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>> >> What exactly do you consider real vector data? Vector data which is
>> based on texts written by humans?
>> > We have plenty of text; the problem is coming up with a realistic
>> > vector model that requires as many dimensions as people seem to be
>> > demanding. As I said above, after surveying huggingface I couldn't
>> > find any text-based model using more than 768 dimensions. So far we
>> > have some ideas of generating higher-dimensional data by dithering or
>> > concatenating existing data, but it seems artificial.
>> >
>> > On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
>> > <michael.wechner@wyona.com> wrote:
>> >> What exactly do you consider real vector data? Vector data which is
>> based on texts written by humans?
>> >>
>> >> I am asking, because I recently attended the following presentation by
>> Anastassia Shaitarova (UZH Institute for Computational Linguistics,
>> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>> >>
>> >> ----
>> >>
>> >> Can we Identify Machine-Generated Text? An Overview of Current
>> Approaches
>> >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>> >>
>> >> The detection of machine-generated text has become increasingly
>> important due to the prevalence of automated content generation and its
>> potential for misuse. In this talk, we will discuss the motivation for
>> automatic detection of generated text. We will present the currently
>> available methods, including feature-based classification as a “first
>> line-of-defense.” We will provide an overview of the detection tools that
>> have been made available so far and discuss their limitations. Finally, we
>> will reflect on some open problems associated with the automatic
>> discrimination of generated texts.
>> >>
>> >> ----
>> >>
>> >> and her conclusion was that it has become basically impossible to
>> differentiate between text generated by humans and text generated by for
>> example ChatGPT.
>> >>
>> >> Whereas others have a slightly different opinion, see for example
>> >>
>> >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>> >>
>> >> But I would argue that real world and synthetic have become close
>> enough that testing performance and scalability of indexing should be
>> possible with synthetic data.
>> >>
>> >> I completely agree that we have to base our discussions and decisions
>> on scientific methods and that we have to make sure that Lucene performs
>> and scales well and that we understand the limits and what is going on
>> under the hood.
>> >>
>> >> Thanks
>> >>
>> >> Michael W
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>> >>
>> >> +1 to test on real vector data -- if you test on synthetic data you
>> draw synthetic conclusions.
>> >>
>> >> Can someone post the theoretical performance (CPU and RAM required) of
>> HNSW construction? Do we know/believe our HNSW implementation has achieved
>> that theoretical big-O performance? Maybe we have some silly performance
>> bug that's causing it not to?
>> >>
>> >> As I understand it, HNSW makes the tradeoff of costly construction for
>> faster searching, which is typically the right tradeoff for search use
>> cases. We do this in other parts of the Lucene index too.
>> >>
>> >> Lucene will do a logarithmic number of merges over time, i.e. each doc
>> will be merged O(log(N)) times in its lifetime in the index. We need to
>> multiply that by the cost of re-building the whole HNSW graph on each
>> merge. BTW, other things in Lucene, like BKD/dimensional points, also
>> rebuild the whole data structure on each merge, I think? But, as Rob
>> pointed out, stored fields merging do indeed do some sneaky tricks to avoid
>> excessive block decompress/recompress on each merge.
>> >>
>> >>> As I understand it, vetoes must have technical merit. I'm not sure
>> that this veto rises to "technical merit" on 2 counts:
>> >> Actually I think Robert's veto stands on its technical merit already.
>> Robert's take on technical matters very much resonate with me, even if he
>> is sometimes prickly in how he expresses them ;)
>> >>
>> >> His point is that we, as a dev community, are not paying enough
>> attention to the indexing performance of our KNN algo (HNSW) and
>> implementation, and that it is reckless to increase / remove limits in that
>> state. It is indeed a one-way door decision and one must confront such
>> decisions with caution, especially for such a widely used base
>> infrastructure as Lucene. We don't even advertise today in our javadocs
>> that you need XXX heap if you index vectors with dimension Y, fanout X,
>> levels Z, etc.
>> >>
>> >> RAM used during merging is unaffected by dimensionality, but is
>> affected by fanout, because the HNSW graph (not the raw vectors) is memory
>> resident, I think? Maybe we could move it off-heap and let the OS manage
>> the memory (and still document the RAM requirements)? Maybe merge RAM
>> costs should be accounted for in IW's RAM buffer accounting? It is not
>> today, and there are some other things that use non-trivial RAM, e.g. the
>> doc mapping (to compress docid space when deletions are reclaimed).
>> >>
>> >> When we added KNN vector testing to Lucene's nightly benchmarks, the
>> indexing time massively increased -- see annotations DH and DP here:
>> https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly
>> benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of
>> course, that is using a single thread for indexing (on a box that has 128
>> cores!) so we produce a deterministic index every night ...
>> >>
>> >> Stepping out (meta) a bit ... this discussion is precisely one of the
>> awesome benefits of the (informed) veto. It means risky changes to the
>> software, as determined by any single informed developer on the project,
>> can force a healthy discussion about the problem at hand. Robert is
>> legitimately concerned about a real issue and so we should use our creative
>> energies to characterize our HNSW implementation's performance, document it
>> clearly for users, and uncover ways to improve it.
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti <
>> a.benedetti@sease.io> wrote:
>> >>> I think Gus points are on target.
>> >>>
>> >>> I recommend we move this forward in this way:
>> >>> We stop any discussion and everyone interested proposes an option
>> with a motivation, then we aggregate the options and we create a Vote maybe?
>> >>>
>> >>> I am also on the same page on the fact that a veto should come with a
>> clear and reasonable technical merit, which also in my opinion has not come
>> yet.
>> >>>
>> >>> I also apologise if any of my words sounded harsh or personal
>> attacks, never meant to do so.
>> >>>
>> >>> My proposed option:
>> >>>
>> >>> 1) remove the limit and potentially make it configurable,
>> >>> Motivation:
>> >>> The system administrator can enforce a limit its users need to
>> respect that it's in line with whatever the admin decided to be acceptable
>> for them.
>> >>> Default can stay the current one.
>> >>>
>> >>> That's my favourite at the moment, but I agree that potentially in
>> the future this may need to change, as we may optimise the data structures
>> for certain dimensions. I am a big fan of Yagni (you aren't going to need
>> it) so I am ok we'll face a different discussion if that happens in the
>> future.
>> >>>
>> >>>
>> >>>
>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
>> >>>> What I see so far:
>> >>>>
>> >>>> Much positive support for raising the limit
>> >>>> Slightly less support for removing it or making it configurable
>> >>>> A single veto which argues that a (as yet undefined) performance
>> standard must be met before raising the limit
>> >>>> Hot tempers (various) making this discussion difficult
>> >>>>
>> >>>> As I understand it, vetoes must have technical merit. I'm not sure
>> that this veto rises to "technical merit" on 2 counts:
>> >>>>
>> >>>> No standard for the performance is given so it cannot be technically
>> met. Without hard criteria it's a moving target.
>> >>>> It appears to encode a valuation of the user's time, and that
>> valuation is really up to the user. Some users may consider 2hours useless
>> and not worth it, and others might happily wait 2 hours. This is not a
>> technical decision, it's a business decision regarding the relative value
>> of the time invested vs the value of the result. If I can cure cancer by
>> indexing for a year, that might be worth it... (hyperbole of course).
>> >>>>
>> >>>> Things I would consider to have technical merit that I don't hear:
>> >>>>
>> >>>> Impact on the speed of **other** indexing operations. (devaluation
>> of other functionality)
>> >>>> Actual scenarios that work when the limit is low and fail when the
>> limit is high (new failure on the same data with the limit raised).
>> >>>>
>> >>>> One thing that might or might not have technical merit
>> >>>>
>> >>>> If someone feels there is a lack of documentation of the
>> costs/performance implications of using large vectors, possibly including
>> reproducible benchmarks establishing the scaling behavior (there seems to
>> be disagreement on O(n) vs O(n^2)).
>> >>>>
>> >>>> The users *should* know what they are getting into, but if the cost
>> is worth it to them, they should be able to pay it without forking the
>> project. If this veto causes a fork that's not good.
>> >>>>
>> >>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes
>> in 100 and 300 dimensional varieties and can easily enough generate large
>> numbers of vector documents from the articles data. To go higher we could
>> concatenate vectors from that and I believe the performance numbers would
>> be plausible.
>> >>>>>
>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com>
>> wrote:
>> >>>>>> Can we set up a branch in which the limit is bumped to 2048, then
>> have
>> >>>>>> a realistic, free data set (wikipedia sample or something) that
>> has,
>> >>>>>> say, 5 million docs and vectors created using public data (glove
>> >>>>>> pre-trained embeddings or the like)? We then could run indexing on
>> the
>> >>>>>> same hardware with 512, 1024 and 2048 and see what the numbers,
>> limits
>> >>>>>> and behavior actually are.
>> >>>>>>
>> >>>>>> I can help in writing this but not until after Easter.
>> >>>>>>
>> >>>>>>
>> >>>>>> Dawid
>> >>>>>>
>> >>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com>
>> wrote:
>> >>>>>>> As Dawid pointed out earlier on this thread, this is the rule for
>> >>>>>>> Apache projects: a single -1 vote on a code change is a veto and
>> >>>>>>> cannot be overridden. Furthermore, Robert is one of the people on
>> this
>> >>>>>>> project who worked the most on debugging subtle bugs, making
>> Lucene
>> >>>>>>> more robust and improving our test framework, so I'm listening
>> when he
>> >>>>>>> voices quality concerns.
>> >>>>>>>
>> >>>>>>> The argument against removing/raising the limit that resonates
>> with me
>> >>>>>>> the most is that it is a one-way door. As MikeS highlighted
>> earlier on
>> >>>>>>> this thread, implementations may want to take advantage of the
>> fact
>> >>>>>>> that there is a limit at some point too. This is why I don't want
>> to
>> >>>>>>> remove the limit and would prefer a slight increase, such as 2048
>> as
>> >>>>>>> suggested in the original issue, which would enable most of the
>> things
>> >>>>>>> that users who have been asking about raising the limit would
>> like to
>> >>>>>>> do.
>> >>>>>>>
>> >>>>>>> I agree that the merge-time memory usage and slow indexing rate
>> are
>> >>>>>>> not great. But it's still possible to index multi-million vector
>> >>>>>>> datasets with a 4GB heap without hitting OOMEs regardless of the
>> >>>>>>> number of dimensions, and the feedback I'm seeing is that many
>> users
>> >>>>>>> are still interested in indexing multi-million vector datasets
>> despite
>> >>>>>>> the slow indexing rate. I wish we could do better, and vector
>> indexing
>> >>>>>>> is certainly more expert than text indexing, but it still is
>> usable in
>> >>>>>>> my opinion. I understand how giving Lucene more information about
>> >>>>>>> vectors prior to indexing (e.g. clustering information as Jim
>> pointed
>> >>>>>>> out) could help make merging faster and more memory-efficient,
>> but I
>> >>>>>>> would really like to avoid making it a requirement for indexing
>> >>>>>>> vectors as it also makes this feature much harder to use.
>> >>>>>>>
>> >>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>> >>>>>>> <a.benedetti@sease.io> wrote:
>> >>>>>>>> I am very attentive to listen opinions but I am un-convinced
>> here and I an not sure that a single person opinion should be allowed to be
>> detrimental for such an important project.
>> >>>>>>>>
>> >>>>>>>> The limit as far as I know is literally just raising an
>> exception.
>> >>>>>>>> Removing it won't alter in any way the current performance for
>> users in low dimensional space.
>> >>>>>>>> Removing it will just enable more users to use Lucene.
>> >>>>>>>>
>> >>>>>>>> If new users in certain situations will be unhappy with the
>> performance, they may contribute improvements.
>> >>>>>>>> This is how you make progress.
>> >>>>>>>>
>> >>>>>>>> If it's a reputation thing, trust me that not allowing users to
>> play with high dimensional space will equally damage it.
>> >>>>>>>>
>> >>>>>>>> To me it's really a no brainer.
>> >>>>>>>> Removing the limit and enable people to use high dimensional
>> vectors will take minutes.
>> >>>>>>>> Improving the hnsw implementation can take months.
>> >>>>>>>> Pick one to begin with...
>> >>>>>>>>
>> >>>>>>>> And there's no-one paying me here, no company interest
>> whatsoever, actually I pay people to contribute, I am just convinced it's a
>> good idea.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>> >>>>>>>>> I disagree with your categorization. I put in plenty of work and
>> >>>>>>>>> experienced plenty of pain myself, writing tests and fighting
>> these
>> >>>>>>>>> issues, after i saw that, two releases in a row, vector
>> indexing fell
>> >>>>>>>>> over and hit integer overflows etc on small datasets:
>> >>>>>>>>>
>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>> >>>>>>>>>
>> >>>>>>>>> Attacking me isn't helping the situation.
>> >>>>>>>>>
>> >>>>>>>>> PS: when i said the "one guy who wrote the code" I didn't mean
>> it in
>> >>>>>>>>> any kind of demeaning fashion really. I meant to describe the
>> current
>> >>>>>>>>> state of usability with respect to indexing a few million docs
>> with
>> >>>>>>>>> high dimensions. You can scroll up the thread and see that at
>> least
>> >>>>>>>>> one other committer on the project experienced similar pain as
>> me.
>> >>>>>>>>> Then, think about users who aren't committers trying to use the
>> >>>>>>>>> functionality!
>> >>>>>>>>>
>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
>> msokolov@gmail.com> wrote:
>> >>>>>>>>>> What you said about increasing dimensions requiring a bigger
>> ram buffer on merge is wrong. That's the point I was trying to make. Your
>> concerns about merge costs are not wrong, but your conclusion that we need
>> to limit dimensions is not justified.
>> >>>>>>>>>>
>> >>>>>>>>>> You complain that hnsw sucks it doesn't scale, but when I show
>> it scales linearly with dimension you just ignore that and complain about
>> something entirely different.
>> >>>>>>>>>>
>> >>>>>>>>>> You demand that people run all kinds of tests to prove you
>> wrong but when they do, you don't listen and you won't put in the work
>> yourself or complain that it's too hard.
>> >>>>>>>>>>
>> >>>>>>>>>> Then you complain about people not meeting you half way. Wow
>> >>>>>>>>>>
>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
>> wrote:
>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> >>>>>>>>>>> <michael.wechner@wyona.com> wrote:
>> >>>>>>>>>>>> What exactly do you consider reasonable?
>> >>>>>>>>>>> Let's begin a real discussion by being HONEST about the
>> current
>> >>>>>>>>>>> status. Please put politically correct or your own company's
>> wishes
>> >>>>>>>>>>> aside, we know it's not in a good state.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Current status is the one guy who wrote the code can set a
>> >>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset with 1024
>> >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
>> >>>>>>>>>>>
>> >>>>>>>>>>> My concerns are everyone else except the one guy, I want it
>> to be
>> >>>>>>>>>>> usable. Increasing dimensions just means even bigger
>> multi-gigabyte
>> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>> >>>>>>>>>>> It is also a permanent backwards compatibility decision, we
>> have to
>> >>>>>>>>>>> support it once we do this and we can't just say "oops" and
>> flip it
>> >>>>>>>>>>> back.
>> >>>>>>>>>>>
>> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer is
>> really to
>> >>>>>>>>>>> avoid merges because they are so slow and it would be DAYS
>> otherwise,
>> >>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
>> >>>>>>>>>>> Also from personal experience, it takes trial and error (means
>> >>>>>>>>>>> experiencing OOM on merge!!!) before you get those heap
>> values correct
>> >>>>>>>>>>> for your dataset. This usually means starting over which is
>> >>>>>>>>>>> frustrating and wastes more time.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in
>> IndexWriter, seems
>> >>>>>>>>>>> to me like its a good idea. maybe the multigigabyte ram
>> buffer can be
>> >>>>>>>>>>> avoided in this way and performance improved by writing bigger
>> >>>>>>>>>>> segments with lucene's defaults. But this doesn't mean we can
>> simply
>> >>>>>>>>>>> ignore the horrors of what happens on merge. merging needs to
>> scale so
>> >>>>>>>>>>> that indexing really scales.
>> >>>>>>>>>>>
>> >>>>>>>>>>> At least it shouldnt spike RAM on trivial data amounts and
>> cause OOM,
>> >>>>>>>>>>> and definitely it shouldnt burn hours and hours of CPU in
>> O(n^2)
>> >>>>>>>>>>> fashion when indexing.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>>>>>>
>> >>>>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Adrien
>> >>>>>>>
>> >>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>>
>> >>>>>>
>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>
>> >>>>
>> >>>> --
>> >>>> http://www.needhamsoftware.com (work)
>> >>>> http://www.the111shift.com (play)
>> >>
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 12, 2023, 10:58 PM

Post #95 of 99 (950 views)

Hi Kent

Great, thank you very much!

Will download it later today :-)

All the best

Michael

Am 13.04.23 um 01:35 schrieb Kent Fitch:
> Hi Michael (and anyone else who wants just over 240K "real world"
> ada-002 vectors of dimension 1536),
> you are welcome to retrieve a tar.gz file which contains:
> - 47K embeddings of Canberra Times news article text from 1994
> - 38K embeddings of the first paragraphs of wikipedia articles about
> organisations
> - 156.6K embeddings of the first paragraphs of wikipedia articles
> about people
>
> https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing
>
> The file is about 1.7GB and will expand to about 4.4GB. This file will
> be accessible for at least a week, and I hope you dont hit any google
> drive download limits trying to retrieve it.
>
> The embeddings were generated using my openAI account and you are
> welcome to use them for any purpose you like.
>
> best wishes,
>
> Kent Fitch
>
> On Wed, Apr 12, 2023 at 4:37?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>
> thank you very much for your feedback!
>
> In a previous post (April 7) you wrote you could make availlable
> the 47K ada-002 vectors, which would be great!
>
> Would it make sense to setup a public gitub repo, such that others
> could use or also contribute vectors?
>
> Thanks
>
> Michael Wechner
>
>
> Am 12.04.23 um 04:51 schrieb Kent Fitch:
>> I only know some characteristics of the openAI ada-002 vectors,
>> although they are a very popular as
>> embeddings/text-characterisations as they allow more
>> accurate/"human meaningful" semantic search results with fewer
>> dimensions than their predecessors - I've evaluated a few
>> different embedding models, including some BERT variants, CLIP
>> ViT-L-14 (with 768 dims, which was quite good), openAI's ada-001
>> (1024 dims) and babbage-001 (2048 dims), and ada-002 are
>> qualitatively the best, although that will certainly change!
>>
>> In any case, ada-002 vectors have interesting characteristics
>> that I think mean you could confidently create synthetic vectors
>> which would be hard to distinguish from "real" vectors. I found
>> this from looking at 47K ada-002 vectors generated across a full
>> year (1994) of newspaper articles from the Canberra Times and
>> 200K wikipedia articles:
>> - there is no discernible/significant correlation between values
>> in any pair of dimensions
>> - all but 5 of the 1536 dimensions have an almost identical
>> distribution of values shown in the central blob on these graphs
>> (that just show a few of these 1531 dimensions with clumped
>> values and the 5 "outlier" dimensions, but all 1531 non-outlier
>> dims are in there, which makes for some easy quantisation from
>> float to byte if you dont want to go the full
>> kmeans/clustering/Lloyds-algorithm approach):
>> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>> - the variance of the value of each dimension is characteristic:
>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>>
>> This probably represents something significant about how the
>> ada-002 embeddings are created, but I think it also means
>> creating "realistic" values is possible. I did not use this
>> information when testing recall & performance on Lucene's HNSW
>> implementation on 192m documents, as I slightly dithered the
>> values of a "real" set on 47K docs and stored other fields in the
>> doc that referenced the "base" document that the dithers were
>> made from, and used different dithering magnitudes so that I
>> could test recall with different neighbour sizes ("M"),
>> construction-beamwidth and search-beamwidths.
>>
>> best regards
>>
>> Kent Fitch
>>
>>
>>
>>
>> On Wed, Apr 12, 2023 at 5:08?AM Michael Wechner
>> <michael.wechner@wyona.com> wrote:
>>
>> I understand what you mean that it seems to be artificial,
>> but I don't
>> understand why this matters to test performance and
>> scalability of the
>> indexing?
>>
>> Let's assume the limit of Lucene would be 4 instead of 1024
>> and there
>> are only open source models generating vectors with 4
>> dimensions, for
>> example
>>
>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>
>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>
>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>
>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>
>> and now I concatenate them to vectors with 8 dimensions
>>
>>
>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>
>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>
>> and normalize them to length 1.
>>
>> Why should this be any different to a model which is acting
>> like a black
>> box generating vectors with 8 dimensions?
>>
>>
>>
>>
>> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>> >> What exactly do you consider real vector data? Vector data
>> which is based on texts written by humans?
>> > We have plenty of text; the problem is coming up with a
>> realistic
>> > vector model that requires as many dimensions as people
>> seem to be
>> > demanding. As I said above, after surveying huggingface I
>> couldn't
>> > find any text-based model using more than 768 dimensions.
>> So far we
>> > have some ideas of generating higher-dimensional data by
>> dithering or
>> > concatenating existing data, but it seems artificial.
>> >
>> > On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
>> > <michael.wechner@wyona.com> wrote:
>> >> What exactly do you consider real vector data? Vector data
>> which is based on texts written by humans?
>> >>
>> >> I am asking, because I recently attended the following
>> presentation by Anastassia Shaitarova (UZH Institute for
>> Computational Linguistics,
>> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>> >>
>> >> ----
>> >>
>> >> Can we Identify Machine-Generated Text? An Overview of
>> Current Approaches
>> >> by Anastassia Shaitarova (UZH Institute for Computational
>> Linguistics)
>> >>
>> >> The detection of machine-generated text has become
>> increasingly important due to the prevalence of automated
>> content generation and its potential for misuse. In this
>> talk, we will discuss the motivation for automatic detection
>> of generated text. We will present the currently available
>> methods, including feature-based classification as a “first
>> line-of-defense.” We will provide an overview of the
>> detection tools that have been made available so far and
>> discuss their limitations. Finally, we will reflect on some
>> open problems associated with the automatic discrimination of
>> generated texts.
>> >>
>> >> ----
>> >>
>> >> and her conclusion was that it has become basically
>> impossible to differentiate between text generated by humans
>> and text generated by for example ChatGPT.
>> >>
>> >> Whereas others have a slightly different opinion, see for
>> example
>> >>
>> >>
>> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>> >>
>> >> But I would argue that real world and synthetic have
>> become close enough that testing performance and scalability
>> of indexing should be possible with synthetic data.
>> >>
>> >> I completely agree that we have to base our discussions
>> and decisions on scientific methods and that we have to make
>> sure that Lucene performs and scales well and that we
>> understand the limits and what is going on under the hood.
>> >>
>> >> Thanks
>> >>
>> >> Michael W
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>> >>
>> >> +1 to test on real vector data -- if you test on synthetic
>> data you draw synthetic conclusions.
>> >>
>> >> Can someone post the theoretical performance (CPU and RAM
>> required) of HNSW construction? Do we know/believe our HNSW
>> implementation has achieved that theoretical big-O
>> performance? Maybe we have some silly performance bug that's
>> causing it not to?
>> >>
>> >> As I understand it, HNSW makes the tradeoff of costly
>> construction for faster searching, which is typically the
>> right tradeoff for search use cases. We do this in other
>> parts of the Lucene index too.
>> >>
>> >> Lucene will do a logarithmic number of merges over time,
>> i.e. each doc will be merged O(log(N)) times in its lifetime
>> in the index. We need to multiply that by the cost of
>> re-building the whole HNSW graph on each merge. BTW, other
>> things in Lucene, like BKD/dimensional points, also rebuild
>> the whole data structure on each merge, I think? But, as Rob
>> pointed out, stored fields merging do indeed do some sneaky
>> tricks to avoid excessive block decompress/recompress on each
>> merge.
>> >>
>> >>> As I understand it, vetoes must have technical merit. I'm
>> not sure that this veto rises to "technical merit" on 2 counts:
>> >> Actually I think Robert's veto stands on its technical
>> merit already. Robert's take on technical matters very much
>> resonate with me, even if he is sometimes prickly in how he
>> expresses them ;)
>> >>
>> >> His point is that we, as a dev community, are not paying
>> enough attention to the indexing performance of our KNN algo
>> (HNSW) and implementation, and that it is reckless to
>> increase / remove limits in that state. It is indeed a
>> one-way door decision and one must confront such decisions
>> with caution, especially for such a widely used base
>> infrastructure as Lucene. We don't even advertise today in
>> our javadocs that you need XXX heap if you index vectors with
>> dimension Y, fanout X, levels Z, etc.
>> >>
>> >> RAM used during merging is unaffected by dimensionality,
>> but is affected by fanout, because the HNSW graph (not the
>> raw vectors) is memory resident, I think? Maybe we could
>> move it off-heap and let the OS manage the memory (and still
>> document the RAM requirements)? Maybe merge RAM costs should
>> be accounted for in IW's RAM buffer accounting? It is not
>> today, and there are some other things that use non-trivial
>> RAM, e.g. the doc mapping (to compress docid space when
>> deletions are reclaimed).
>> >>
>> >> When we added KNN vector testing to Lucene's nightly
>> benchmarks, the indexing time massively increased -- see
>> annotations DH and DP here:
>> https://home.apache.org/~mikemccand/lucenebench/indexing.html.
>> Nightly benchmarks now start at 6 PM and don't finish until
>> ~14.5 hours later. Of course, that is using a single thread
>> for indexing (on a box that has 128 cores!) so we produce a
>> deterministic index every night ...
>> >>
>> >> Stepping out (meta) a bit ... this discussion is precisely
>> one of the awesome benefits of the (informed) veto. It means
>> risky changes to the software, as determined by any single
>> informed developer on the project, can force a healthy
>> discussion about the problem at hand. Robert is legitimately
>> concerned about a real issue and so we should use our
>> creative energies to characterize our HNSW implementation's
>> performance, document it clearly for users, and uncover ways
>> to improve it.
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>> >>> I think Gus points are on target.
>> >>>
>> >>> I recommend we move this forward in this way:
>> >>> We stop any discussion and everyone interested proposes
>> an option with a motivation, then we aggregate the options
>> and we create a Vote maybe?
>> >>>
>> >>> I am also on the same page on the fact that a veto should
>> come with a clear and reasonable technical merit, which also
>> in my opinion has not come yet.
>> >>>
>> >>> I also apologise if any of my words sounded harsh or
>> personal attacks, never meant to do so.
>> >>>
>> >>> My proposed option:
>> >>>
>> >>> 1) remove the limit and potentially make it configurable,
>> >>> Motivation:
>> >>> The system administrator can enforce a limit its users
>> need to respect that it's in line with whatever the admin
>> decided to be acceptable for them.
>> >>> Default can stay the current one.
>> >>>
>> >>> That's my favourite at the moment, but I agree that
>> potentially in the future this may need to change, as we may
>> optimise the data structures for certain dimensions. I am a
>> big fan of Yagni (you aren't going to need it) so I am ok
>> we'll face a different discussion if that happens in the future.
>> >>>
>> >>>
>> >>>
>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com>
>> wrote:
>> >>>> What I see so far:
>> >>>>
>> >>>> Much positive support for raising the limit
>> >>>> Slightly less support for removing it or making it
>> configurable
>> >>>> A single veto which argues that a (as yet undefined)
>> performance standard must be met before raising the limit
>> >>>> Hot tempers (various) making this discussion difficult
>> >>>>
>> >>>> As I understand it, vetoes must have technical merit.
>> I'm not sure that this veto rises to "technical merit" on 2
>> counts:
>> >>>>
>> >>>> No standard for the performance is given so it cannot be
>> technically met. Without hard criteria it's a moving target.
>> >>>> It appears to encode a valuation of the user's time, and
>> that valuation is really up to the user. Some users may
>> consider 2hours useless and not worth it, and others might
>> happily wait 2 hours. This is not a technical decision, it's
>> a business decision regarding the relative value of the time
>> invested vs the value of the result. If I can cure cancer by
>> indexing for a year, that might be worth it... (hyperbole of
>> course).
>> >>>>
>> >>>> Things I would consider to have technical merit that I
>> don't hear:
>> >>>>
>> >>>> Impact on the speed of **other** indexing operations.
>> (devaluation of other functionality)
>> >>>> Actual scenarios that work when the limit is low and
>> fail when the limit is high (new failure on the same data
>> with the limit raised).
>> >>>>
>> >>>> One thing that might or might not have technical merit
>> >>>>
>> >>>> If someone feels there is a lack of documentation of the
>> costs/performance implications of using large vectors,
>> possibly including reproducible benchmarks establishing the
>> scaling behavior (there seems to be disagreement on O(n) vs
>> O(n^2)).
>> >>>>
>> >>>> The users *should* know what they are getting into, but
>> if the cost is worth it to them, they should be able to pay
>> it without forking the project. If this veto causes a fork
>> that's not good.
>> >>>>
>> >>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov
>> <msokolov@gmail.com> wrote:
>> >>>>> We do have a dataset built from Wikipedia in
>> luceneutil. It comes in 100 and 300 dimensional varieties and
>> can easily enough generate large numbers of vector documents
>> from the articles data. To go higher we could concatenate
>> vectors from that and I believe the performance numbers would
>> be plausible.
>> >>>>>
>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss
>> <dawid.weiss@gmail.com> wrote:
>> >>>>>> Can we set up a branch in which the limit is bumped to
>> 2048, then have
>> >>>>>> a realistic, free data set (wikipedia sample or
>> something) that has,
>> >>>>>> say, 5 million docs and vectors created using public
>> data (glove
>> >>>>>> pre-trained embeddings or the like)? We then could run
>> indexing on the
>> >>>>>> same hardware with 512, 1024 and 2048 and see what the
>> numbers, limits
>> >>>>>> and behavior actually are.
>> >>>>>>
>> >>>>>> I can help in writing this but not until after Easter.
>> >>>>>>
>> >>>>>>
>> >>>>>> Dawid
>> >>>>>>
>> >>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand
>> <jpountz@gmail.com> wrote:
>> >>>>>>> As Dawid pointed out earlier on this thread, this is
>> the rule for
>> >>>>>>> Apache projects: a single -1 vote on a code change is
>> a veto and
>> >>>>>>> cannot be overridden. Furthermore, Robert is one of
>> the people on this
>> >>>>>>> project who worked the most on debugging subtle bugs,
>> making Lucene
>> >>>>>>> more robust and improving our test framework, so I'm
>> listening when he
>> >>>>>>> voices quality concerns.
>> >>>>>>>
>> >>>>>>> The argument against removing/raising the limit that
>> resonates with me
>> >>>>>>> the most is that it is a one-way door. As MikeS
>> highlighted earlier on
>> >>>>>>> this thread, implementations may want to take
>> advantage of the fact
>> >>>>>>> that there is a limit at some point too. This is why
>> I don't want to
>> >>>>>>> remove the limit and would prefer a slight increase,
>> such as 2048 as
>> >>>>>>> suggested in the original issue, which would enable
>> most of the things
>> >>>>>>> that users who have been asking about raising the
>> limit would like to
>> >>>>>>> do.
>> >>>>>>>
>> >>>>>>> I agree that the merge-time memory usage and slow
>> indexing rate are
>> >>>>>>> not great. But it's still possible to index
>> multi-million vector
>> >>>>>>> datasets with a 4GB heap without hitting OOMEs
>> regardless of the
>> >>>>>>> number of dimensions, and the feedback I'm seeing is
>> that many users
>> >>>>>>> are still interested in indexing multi-million vector
>> datasets despite
>> >>>>>>> the slow indexing rate. I wish we could do better,
>> and vector indexing
>> >>>>>>> is certainly more expert than text indexing, but it
>> still is usable in
>> >>>>>>> my opinion. I understand how giving Lucene more
>> information about
>> >>>>>>> vectors prior to indexing (e.g. clustering
>> information as Jim pointed
>> >>>>>>> out) could help make merging faster and more
>> memory-efficient, but I
>> >>>>>>> would really like to avoid making it a requirement
>> for indexing
>> >>>>>>> vectors as it also makes this feature much harder to use.
>> >>>>>>>
>> >>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>> >>>>>>> <a.benedetti@sease.io> wrote:
>> >>>>>>>> I am very attentive to listen opinions but I am
>> un-convinced here and I an not sure that a single person
>> opinion should be allowed to be detrimental for such an
>> important project.
>> >>>>>>>>
>> >>>>>>>> The limit as far as I know is literally just raising
>> an exception.
>> >>>>>>>> Removing it won't alter in any way the current
>> performance for users in low dimensional space.
>> >>>>>>>> Removing it will just enable more users to use Lucene.
>> >>>>>>>>
>> >>>>>>>> If new users in certain situations will be unhappy
>> with the performance, they may contribute improvements.
>> >>>>>>>> This is how you make progress.
>> >>>>>>>>
>> >>>>>>>> If it's a reputation thing, trust me that not
>> allowing users to play with high dimensional space will
>> equally damage it.
>> >>>>>>>>
>> >>>>>>>> To me it's really a no brainer.
>> >>>>>>>> Removing the limit and enable people to use high
>> dimensional vectors will take minutes.
>> >>>>>>>> Improving the hnsw implementation can take months.
>> >>>>>>>> Pick one to begin with...
>> >>>>>>>>
>> >>>>>>>> And there's no-one paying me here, no company
>> interest whatsoever, actually I pay people to contribute, I
>> am just convinced it's a good idea.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir,
>> <rcmuir@gmail.com> wrote:
>> >>>>>>>>> I disagree with your categorization. I put in
>> plenty of work and
>> >>>>>>>>> experienced plenty of pain myself, writing tests
>> and fighting these
>> >>>>>>>>> issues, after i saw that, two releases in a row,
>> vector indexing fell
>> >>>>>>>>> over and hit integer overflows etc on small datasets:
>> >>>>>>>>>
>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>> >>>>>>>>>
>> >>>>>>>>> Attacking me isn't helping the situation.
>> >>>>>>>>>
>> >>>>>>>>> PS: when i said the "one guy who wrote the code" I
>> didn't mean it in
>> >>>>>>>>> any kind of demeaning fashion really. I meant to
>> describe the current
>> >>>>>>>>> state of usability with respect to indexing a few
>> million docs with
>> >>>>>>>>> high dimensions. You can scroll up the thread and
>> see that at least
>> >>>>>>>>> one other committer on the project experienced
>> similar pain as me.
>> >>>>>>>>> Then, think about users who aren't committers
>> trying to use the
>> >>>>>>>>> functionality!
>> >>>>>>>>>
>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov
>> <msokolov@gmail.com> wrote:
>> >>>>>>>>>> What you said about increasing dimensions
>> requiring a bigger ram buffer on merge is wrong. That's the
>> point I was trying to make. Your concerns about merge costs
>> are not wrong, but your conclusion that we need to limit
>> dimensions is not justified.
>> >>>>>>>>>>
>> >>>>>>>>>> You complain that hnsw sucks it doesn't scale, but
>> when I show it scales linearly with dimension you just ignore
>> that and complain about something entirely different.
>> >>>>>>>>>>
>> >>>>>>>>>> You demand that people run all kinds of tests to
>> prove you wrong but when they do, you don't listen and you
>> won't put in the work yourself or complain that it's too hard.
>> >>>>>>>>>>
>> >>>>>>>>>> Then you complain about people not meeting you
>> half way. Wow
>> >>>>>>>>>>
>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir
>> <rcmuir@gmail.com> wrote:
>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> >>>>>>>>>>> <michael.wechner@wyona.com> wrote:
>> >>>>>>>>>>>> What exactly do you consider reasonable?
>> >>>>>>>>>>> Let's begin a real discussion by being HONEST
>> about the current
>> >>>>>>>>>>> status. Please put politically correct or your
>> own company's wishes
>> >>>>>>>>>>> aside, we know it's not in a good state.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Current status is the one guy who wrote the code
>> can set a
>> >>>>>>>>>>> multi-gigabyte ram buffer and index a small
>> dataset with 1024
>> >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
>> >>>>>>>>>>>
>> >>>>>>>>>>> My concerns are everyone else except the one guy,
>> I want it to be
>> >>>>>>>>>>> usable. Increasing dimensions just means even
>> bigger multi-gigabyte
>> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>> >>>>>>>>>>> It is also a permanent backwards compatibility
>> decision, we have to
>> >>>>>>>>>>> support it once we do this and we can't just say
>> "oops" and flip it
>> >>>>>>>>>>> back.
>> >>>>>>>>>>>
>> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram
>> buffer is really to
>> >>>>>>>>>>> avoid merges because they are so slow and it
>> would be DAYS otherwise,
>> >>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
>> >>>>>>>>>>> Also from personal experience, it takes trial and
>> error (means
>> >>>>>>>>>>> experiencing OOM on merge!!!) before you get
>> those heap values correct
>> >>>>>>>>>>> for your dataset. This usually means starting
>> over which is
>> >>>>>>>>>>> frustrating and wastes more time.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage
>> in IndexWriter, seems
>> >>>>>>>>>>> to me like its a good idea. maybe the
>> multigigabyte ram buffer can be
>> >>>>>>>>>>> avoided in this way and performance improved by
>> writing bigger
>> >>>>>>>>>>> segments with lucene's defaults. But this doesn't
>> mean we can simply
>> >>>>>>>>>>> ignore the horrors of what happens on merge.
>> merging needs to scale so
>> >>>>>>>>>>> that indexing really scales.
>> >>>>>>>>>>>
>> >>>>>>>>>>> At least it shouldnt spike RAM on trivial data
>> amounts and cause OOM,
>> >>>>>>>>>>> and definitely it shouldnt burn hours and hours
>> of CPU in O(n^2)
>> >>>>>>>>>>> fashion when indexing.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>>>>>> To unsubscribe, e-mail:
>> dev-unsubscribe@lucene.apache.org
>> >>>>>>>>>>> For additional commands, e-mail:
>> dev-help@lucene.apache.org
>> >>>>>>>>>>>
>> >>>>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>>>> To unsubscribe, e-mail:
>> dev-unsubscribe@lucene.apache.org
>> >>>>>>>>> For additional commands, e-mail:
>> dev-help@lucene.apache.org
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Adrien
>> >>>>>>>
>> >>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>>> For additional commands, e-mail:
>> dev-help@lucene.apache.org
>> >>>>>>>
>> >>>>>>
>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>> For additional commands, e-mail:
>> dev-help@lucene.apache.org
>> >>>>>>
>> >>>>
>> >>>> --
>> >>>> http://www.needhamsoftware.com (work)
>> >>>> http://www.the111shift.com (play)
>> >>
>> >
>> ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 20, 2023, 11:17 PM

Post #96 of 99 (949 views)

Hi Together

Cohere just published approx. 100Mio embeddings based on Wikipedia content

https://txt.cohere.com/embedding-archives-wikipedia/

resp.

https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings
....

HTH

Michael

Am 13.04.23 um 07:58 schrieb Michael Wechner:
> Hi Kent
>
> Great, thank you very much!
>
> Will download it later today :-)
>
> All the best
>
> Michael
>
> Am 13.04.23 um 01:35 schrieb Kent Fitch:
>> Hi Michael (and anyone else who wants just over 240K "real world"
>> ada-002 vectors of dimension 1536),
>> you are welcome to retrieve a tar.gz file which contains:
>> - 47K embeddings of Canberra Times news article text from 1994
>> - 38K embeddings of the first paragraphs of wikipedia articles about
>> organisations
>> - 156.6K embeddings of the first paragraphs of wikipedia articles
>> about people
>>
>> https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing
>>
>> The file is about 1.7GB and will expand to about 4.4GB. This file
>> will be accessible for at least a week, and I hope you dont hit any
>> google drive download limits trying to retrieve it.
>>
>> The embeddings were generated using my openAI account and you are
>> welcome to use them for any purpose you like.
>>
>> best wishes,
>>
>> Kent Fitch
>>
>> On Wed, Apr 12, 2023 at 4:37?PM Michael Wechner
>> <michael.wechner@wyona.com> wrote:
>>
>> thank you very much for your feedback!
>>
>> In a previous post (April 7) you wrote you could make availlable
>> the 47K ada-002 vectors, which would be great!
>>
>> Would it make sense to setup a public gitub repo, such that
>> others could use or also contribute vectors?
>>
>> Thanks
>>
>> Michael Wechner
>>
>>
>> Am 12.04.23 um 04:51 schrieb Kent Fitch:
>>> I only know some characteristics of the openAI ada-002 vectors,
>>> although they are a very popular as
>>> embeddings/text-characterisations as they allow more
>>> accurate/"human meaningful" semantic search results with fewer
>>> dimensions than their predecessors - I've evaluated a few
>>> different embedding models, including some BERT variants, CLIP
>>> ViT-L-14 (with 768 dims, which was quite good), openAI's ada-001
>>> (1024 dims) and babbage-001 (2048 dims), and ada-002 are
>>> qualitatively the best, although that will certainly change!
>>>
>>> In any case, ada-002 vectors have interesting characteristics
>>> that I think mean you could confidently create synthetic vectors
>>> which would be hard to distinguish from "real" vectors. I found
>>> this from looking at 47K ada-002 vectors generated across a full
>>> year (1994) of newspaper articles from the Canberra Times and
>>> 200K wikipedia articles:
>>> - there is no discernible/significant correlation between values
>>> in any pair of dimensions
>>> - all but 5 of the 1536 dimensions have an almost identical
>>> distribution of values shown in the central blob on these graphs
>>> (that just show a few of these 1531 dimensions with clumped
>>> values and the 5 "outlier" dimensions, but all 1531 non-outlier
>>> dims are in there, which makes for some easy quantisation from
>>> float to byte if you dont want to go the full
>>> kmeans/clustering/Lloyds-algorithm approach):
>>> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>>> - the variance of the value of each dimension is characteristic:
>>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>>>
>>> This probably represents something significant about how the
>>> ada-002 embeddings are created, but I think it also means
>>> creating "realistic" values is possible. I did not use this
>>> information when testing recall & performance on Lucene's HNSW
>>> implementation on 192m documents, as I slightly dithered the
>>> values of a "real" set on 47K docs and stored other fields in
>>> the doc that referenced the "base" document that the dithers
>>> were made from, and used different dithering magnitudes so that
>>> I could test recall with different neighbour sizes ("M"),
>>> construction-beamwidth and search-beamwidths.
>>>
>>> best regards
>>>
>>> Kent Fitch
>>>
>>>
>>>
>>>
>>> On Wed, Apr 12, 2023 at 5:08?AM Michael Wechner
>>> <michael.wechner@wyona.com> wrote:
>>>
>>> I understand what you mean that it seems to be artificial,
>>> but I don't
>>> understand why this matters to test performance and
>>> scalability of the
>>> indexing?
>>>
>>> Let's assume the limit of Lucene would be 4 instead of 1024
>>> and there
>>> are only open source models generating vectors with 4
>>> dimensions, for
>>> example
>>>
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>>
>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>>
>>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>
>>> and now I concatenate them to vectors with 8 dimensions
>>>
>>>
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>
>>> and normalize them to length 1.
>>>
>>> Why should this be any different to a model which is acting
>>> like a black
>>> box generating vectors with 8 dimensions?
>>>
>>>
>>>
>>>
>>> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>>> >> What exactly do you consider real vector data? Vector
>>> data which is based on texts written by humans?
>>> > We have plenty of text; the problem is coming up with a
>>> realistic
>>> > vector model that requires as many dimensions as people
>>> seem to be
>>> > demanding. As I said above, after surveying huggingface I
>>> couldn't
>>> > find any text-based model using more than 768 dimensions.
>>> So far we
>>> > have some ideas of generating higher-dimensional data by
>>> dithering or
>>> > concatenating existing data, but it seems artificial.
>>> >
>>> > On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >> What exactly do you consider real vector data? Vector
>>> data which is based on texts written by humans?
>>> >>
>>> >> I am asking, because I recently attended the following
>>> presentation by Anastassia Shaitarova (UZH Institute for
>>> Computational Linguistics,
>>> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>>> >>
>>> >> ----
>>> >>
>>> >> Can we Identify Machine-Generated Text? An Overview of
>>> Current Approaches
>>> >> by Anastassia Shaitarova (UZH Institute for Computational
>>> Linguistics)
>>> >>
>>> >> The detection of machine-generated text has become
>>> increasingly important due to the prevalence of automated
>>> content generation and its potential for misuse. In this
>>> talk, we will discuss the motivation for automatic detection
>>> of generated text. We will present the currently available
>>> methods, including feature-based classification as a “first
>>> line-of-defense.” We will provide an overview of the
>>> detection tools that have been made available so far and
>>> discuss their limitations. Finally, we will reflect on some
>>> open problems associated with the automatic discrimination
>>> of generated texts.
>>> >>
>>> >> ----
>>> >>
>>> >> and her conclusion was that it has become basically
>>> impossible to differentiate between text generated by humans
>>> and text generated by for example ChatGPT.
>>> >>
>>> >> Whereas others have a slightly different opinion, see for
>>> example
>>> >>
>>> >>
>>> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>> >>
>>> >> But I would argue that real world and synthetic have
>>> become close enough that testing performance and scalability
>>> of indexing should be possible with synthetic data.
>>> >>
>>> >> I completely agree that we have to base our discussions
>>> and decisions on scientific methods and that we have to make
>>> sure that Lucene performs and scales well and that we
>>> understand the limits and what is going on under the hood.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael W
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>>> >>
>>> >> +1 to test on real vector data -- if you test on
>>> synthetic data you draw synthetic conclusions.
>>> >>
>>> >> Can someone post the theoretical performance (CPU and RAM
>>> required) of HNSW construction? Do we know/believe our HNSW
>>> implementation has achieved that theoretical big-O
>>> performance? Maybe we have some silly performance bug
>>> that's causing it not to?
>>> >>
>>> >> As I understand it, HNSW makes the tradeoff of costly
>>> construction for faster searching, which is typically the
>>> right tradeoff for search use cases. We do this in other
>>> parts of the Lucene index too.
>>> >>
>>> >> Lucene will do a logarithmic number of merges over time,
>>> i.e. each doc will be merged O(log(N)) times in its lifetime
>>> in the index. We need to multiply that by the cost of
>>> re-building the whole HNSW graph on each merge. BTW, other
>>> things in Lucene, like BKD/dimensional points, also rebuild
>>> the whole data structure on each merge, I think? But, as Rob
>>> pointed out, stored fields merging do indeed do some sneaky
>>> tricks to avoid excessive block decompress/recompress on
>>> each merge.
>>> >>
>>> >>> As I understand it, vetoes must have technical merit.
>>> I'm not sure that this veto rises to "technical merit" on 2
>>> counts:
>>> >> Actually I think Robert's veto stands on its technical
>>> merit already. Robert's take on technical matters very much
>>> resonate with me, even if he is sometimes prickly in how he
>>> expresses them ;)
>>> >>
>>> >> His point is that we, as a dev community, are not paying
>>> enough attention to the indexing performance of our KNN algo
>>> (HNSW) and implementation, and that it is reckless to
>>> increase / remove limits in that state. It is indeed a
>>> one-way door decision and one must confront such decisions
>>> with caution, especially for such a widely used base
>>> infrastructure as Lucene. We don't even advertise today in
>>> our javadocs that you need XXX heap if you index vectors
>>> with dimension Y, fanout X, levels Z, etc.
>>> >>
>>> >> RAM used during merging is unaffected by dimensionality,
>>> but is affected by fanout, because the HNSW graph (not the
>>> raw vectors) is memory resident, I think? Maybe we could
>>> move it off-heap and let the OS manage the memory (and still
>>> document the RAM requirements)? Maybe merge RAM costs
>>> should be accounted for in IW's RAM buffer accounting? It
>>> is not today, and there are some other things that use
>>> non-trivial RAM, e.g. the doc mapping (to compress docid
>>> space when deletions are reclaimed).
>>> >>
>>> >> When we added KNN vector testing to Lucene's nightly
>>> benchmarks, the indexing time massively increased -- see
>>> annotations DH and DP here:
>>> https://home.apache.org/~mikemccand/lucenebench/indexing.html.
>>> Nightly benchmarks now start at 6 PM and don't finish until
>>> ~14.5 hours later. Of course, that is using a single thread
>>> for indexing (on a box that has 128 cores!) so we produce a
>>> deterministic index every night ...
>>> >>
>>> >> Stepping out (meta) a bit ... this discussion is
>>> precisely one of the awesome benefits of the (informed)
>>> veto. It means risky changes to the software, as determined
>>> by any single informed developer on the project, can force a
>>> healthy discussion about the problem at hand. Robert is
>>> legitimately concerned about a real issue and so we should
>>> use our creative energies to characterize our HNSW
>>> implementation's performance, document it clearly for users,
>>> and uncover ways to improve it.
>>> >>
>>> >> Mike McCandless
>>> >>
>>> >> http://blog.mikemccandless.com
>>> >>
>>> >>
>>> >> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti
>>> <a.benedetti@sease.io> wrote:
>>> >>> I think Gus points are on target.
>>> >>>
>>> >>> I recommend we move this forward in this way:
>>> >>> We stop any discussion and everyone interested proposes
>>> an option with a motivation, then we aggregate the options
>>> and we create a Vote maybe?
>>> >>>
>>> >>> I am also on the same page on the fact that a veto
>>> should come with a clear and reasonable technical merit,
>>> which also in my opinion has not come yet.
>>> >>>
>>> >>> I also apologise if any of my words sounded harsh or
>>> personal attacks, never meant to do so.
>>> >>>
>>> >>> My proposed option:
>>> >>>
>>> >>> 1) remove the limit and potentially make it configurable,
>>> >>> Motivation:
>>> >>> The system administrator can enforce a limit its users
>>> need to respect that it's in line with whatever the admin
>>> decided to be acceptable for them.
>>> >>> Default can stay the current one.
>>> >>>
>>> >>> That's my favourite at the moment, but I agree that
>>> potentially in the future this may need to change, as we may
>>> optimise the data structures for certain dimensions. I am a
>>> big fan of Yagni (you aren't going to need it) so I am ok
>>> we'll face a different discussion if that happens in the future.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com>
>>> wrote:
>>> >>>> What I see so far:
>>> >>>>
>>> >>>> Much positive support for raising the limit
>>> >>>> Slightly less support for removing it or making it
>>> configurable
>>> >>>> A single veto which argues that a (as yet undefined)
>>> performance standard must be met before raising the limit
>>> >>>> Hot tempers (various) making this discussion difficult
>>> >>>>
>>> >>>> As I understand it, vetoes must have technical merit.
>>> I'm not sure that this veto rises to "technical merit" on 2
>>> counts:
>>> >>>>
>>> >>>> No standard for the performance is given so it cannot
>>> be technically met. Without hard criteria it's a moving target.
>>> >>>> It appears to encode a valuation of the user's time,
>>> and that valuation is really up to the user. Some users may
>>> consider 2hours useless and not worth it, and others might
>>> happily wait 2 hours. This is not a technical decision, it's
>>> a business decision regarding the relative value of the time
>>> invested vs the value of the result. If I can cure cancer by
>>> indexing for a year, that might be worth it... (hyperbole of
>>> course).
>>> >>>>
>>> >>>> Things I would consider to have technical merit that I
>>> don't hear:
>>> >>>>
>>> >>>> Impact on the speed of **other** indexing operations.
>>> (devaluation of other functionality)
>>> >>>> Actual scenarios that work when the limit is low and
>>> fail when the limit is high (new failure on the same data
>>> with the limit raised).
>>> >>>>
>>> >>>> One thing that might or might not have technical merit
>>> >>>>
>>> >>>> If someone feels there is a lack of documentation of
>>> the costs/performance implications of using large vectors,
>>> possibly including reproducible benchmarks establishing the
>>> scaling behavior (there seems to be disagreement on O(n) vs
>>> O(n^2)).
>>> >>>>
>>> >>>> The users *should* know what they are getting into, but
>>> if the cost is worth it to them, they should be able to pay
>>> it without forking the project. If this veto causes a fork
>>> that's not good.
>>> >>>>
>>> >>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov
>>> <msokolov@gmail.com> wrote:
>>> >>>>> We do have a dataset built from Wikipedia in
>>> luceneutil. It comes in 100 and 300 dimensional varieties
>>> and can easily enough generate large numbers of vector
>>> documents from the articles data. To go higher we could
>>> concatenate vectors from that and I believe the performance
>>> numbers would be plausible.
>>> >>>>>
>>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss
>>> <dawid.weiss@gmail.com> wrote:
>>> >>>>>> Can we set up a branch in which the limit is bumped
>>> to 2048, then have
>>> >>>>>> a realistic, free data set (wikipedia sample or
>>> something) that has,
>>> >>>>>> say, 5 million docs and vectors created using public
>>> data (glove
>>> >>>>>> pre-trained embeddings or the like)? We then could
>>> run indexing on the
>>> >>>>>> same hardware with 512, 1024 and 2048 and see what
>>> the numbers, limits
>>> >>>>>> and behavior actually are.
>>> >>>>>>
>>> >>>>>> I can help in writing this but not until after Easter.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Dawid
>>> >>>>>>
>>> >>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand
>>> <jpountz@gmail.com> wrote:
>>> >>>>>>> As Dawid pointed out earlier on this thread, this is
>>> the rule for
>>> >>>>>>> Apache projects: a single -1 vote on a code change
>>> is a veto and
>>> >>>>>>> cannot be overridden. Furthermore, Robert is one of
>>> the people on this
>>> >>>>>>> project who worked the most on debugging subtle
>>> bugs, making Lucene
>>> >>>>>>> more robust and improving our test framework, so I'm
>>> listening when he
>>> >>>>>>> voices quality concerns.
>>> >>>>>>>
>>> >>>>>>> The argument against removing/raising the limit that
>>> resonates with me
>>> >>>>>>> the most is that it is a one-way door. As MikeS
>>> highlighted earlier on
>>> >>>>>>> this thread, implementations may want to take
>>> advantage of the fact
>>> >>>>>>> that there is a limit at some point too. This is why
>>> I don't want to
>>> >>>>>>> remove the limit and would prefer a slight increase,
>>> such as 2048 as
>>> >>>>>>> suggested in the original issue, which would enable
>>> most of the things
>>> >>>>>>> that users who have been asking about raising the
>>> limit would like to
>>> >>>>>>> do.
>>> >>>>>>>
>>> >>>>>>> I agree that the merge-time memory usage and slow
>>> indexing rate are
>>> >>>>>>> not great. But it's still possible to index
>>> multi-million vector
>>> >>>>>>> datasets with a 4GB heap without hitting OOMEs
>>> regardless of the
>>> >>>>>>> number of dimensions, and the feedback I'm seeing is
>>> that many users
>>> >>>>>>> are still interested in indexing multi-million
>>> vector datasets despite
>>> >>>>>>> the slow indexing rate. I wish we could do better,
>>> and vector indexing
>>> >>>>>>> is certainly more expert than text indexing, but it
>>> still is usable in
>>> >>>>>>> my opinion. I understand how giving Lucene more
>>> information about
>>> >>>>>>> vectors prior to indexing (e.g. clustering
>>> information as Jim pointed
>>> >>>>>>> out) could help make merging faster and more
>>> memory-efficient, but I
>>> >>>>>>> would really like to avoid making it a requirement
>>> for indexing
>>> >>>>>>> vectors as it also makes this feature much harder to
>>> use.
>>> >>>>>>>
>>> >>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>> >>>>>>> <a.benedetti@sease.io> wrote:
>>> >>>>>>>> I am very attentive to listen opinions but I am
>>> un-convinced here and I an not sure that a single person
>>> opinion should be allowed to be detrimental for such an
>>> important project.
>>> >>>>>>>>
>>> >>>>>>>> The limit as far as I know is literally just
>>> raising an exception.
>>> >>>>>>>> Removing it won't alter in any way the current
>>> performance for users in low dimensional space.
>>> >>>>>>>> Removing it will just enable more users to use Lucene.
>>> >>>>>>>>
>>> >>>>>>>> If new users in certain situations will be unhappy
>>> with the performance, they may contribute improvements.
>>> >>>>>>>> This is how you make progress.
>>> >>>>>>>>
>>> >>>>>>>> If it's a reputation thing, trust me that not
>>> allowing users to play with high dimensional space will
>>> equally damage it.
>>> >>>>>>>>
>>> >>>>>>>> To me it's really a no brainer.
>>> >>>>>>>> Removing the limit and enable people to use high
>>> dimensional vectors will take minutes.
>>> >>>>>>>> Improving the hnsw implementation can take months.
>>> >>>>>>>> Pick one to begin with...
>>> >>>>>>>>
>>> >>>>>>>> And there's no-one paying me here, no company
>>> interest whatsoever, actually I pay people to contribute, I
>>> am just convinced it's a good idea.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir,
>>> <rcmuir@gmail.com> wrote:
>>> >>>>>>>>> I disagree with your categorization. I put in
>>> plenty of work and
>>> >>>>>>>>> experienced plenty of pain myself, writing tests
>>> and fighting these
>>> >>>>>>>>> issues, after i saw that, two releases in a row,
>>> vector indexing fell
>>> >>>>>>>>> over and hit integer overflows etc on small datasets:
>>> >>>>>>>>>
>>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>>> >>>>>>>>>
>>> >>>>>>>>> Attacking me isn't helping the situation.
>>> >>>>>>>>>
>>> >>>>>>>>> PS: when i said the "one guy who wrote the code" I
>>> didn't mean it in
>>> >>>>>>>>> any kind of demeaning fashion really. I meant to
>>> describe the current
>>> >>>>>>>>> state of usability with respect to indexing a few
>>> million docs with
>>> >>>>>>>>> high dimensions. You can scroll up the thread and
>>> see that at least
>>> >>>>>>>>> one other committer on the project experienced
>>> similar pain as me.
>>> >>>>>>>>> Then, think about users who aren't committers
>>> trying to use the
>>> >>>>>>>>> functionality!
>>> >>>>>>>>>
>>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov
>>> <msokolov@gmail.com> wrote:
>>> >>>>>>>>>> What you said about increasing dimensions
>>> requiring a bigger ram buffer on merge is wrong. That's the
>>> point I was trying to make. Your concerns about merge costs
>>> are not wrong, but your conclusion that we need to limit
>>> dimensions is not justified.
>>> >>>>>>>>>>
>>> >>>>>>>>>> You complain that hnsw sucks it doesn't scale,
>>> but when I show it scales linearly with dimension you just
>>> ignore that and complain about something entirely different.
>>> >>>>>>>>>>
>>> >>>>>>>>>> You demand that people run all kinds of tests to
>>> prove you wrong but when they do, you don't listen and you
>>> won't put in the work yourself or complain that it's too hard.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Then you complain about people not meeting you
>>> half way. Wow
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir
>>> <rcmuir@gmail.com> wrote:
>>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>> >>>>>>>>>>> <michael.wechner@wyona.com> wrote:
>>> >>>>>>>>>>>> What exactly do you consider reasonable?
>>> >>>>>>>>>>> Let's begin a real discussion by being HONEST
>>> about the current
>>> >>>>>>>>>>> status. Please put politically correct or your
>>> own company's wishes
>>> >>>>>>>>>>> aside, we know it's not in a good state.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Current status is the one guy who wrote the code
>>> can set a
>>> >>>>>>>>>>> multi-gigabyte ram buffer and index a small
>>> dataset with 1024
>>> >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> My concerns are everyone else except the one
>>> guy, I want it to be
>>> >>>>>>>>>>> usable. Increasing dimensions just means even
>>> bigger multi-gigabyte
>>> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>>> >>>>>>>>>>> It is also a permanent backwards compatibility
>>> decision, we have to
>>> >>>>>>>>>>> support it once we do this and we can't just say
>>> "oops" and flip it
>>> >>>>>>>>>>> back.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram
>>> buffer is really to
>>> >>>>>>>>>>> avoid merges because they are so slow and it
>>> would be DAYS otherwise,
>>> >>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
>>> >>>>>>>>>>> Also from personal experience, it takes trial
>>> and error (means
>>> >>>>>>>>>>> experiencing OOM on merge!!!) before you get
>>> those heap values correct
>>> >>>>>>>>>>> for your dataset. This usually means starting
>>> over which is
>>> >>>>>>>>>>> frustrating and wastes more time.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage
>>> in IndexWriter, seems
>>> >>>>>>>>>>> to me like its a good idea. maybe the
>>> multigigabyte ram buffer can be
>>> >>>>>>>>>>> avoided in this way and performance improved by
>>> writing bigger
>>> >>>>>>>>>>> segments with lucene's defaults. But this
>>> doesn't mean we can simply
>>> >>>>>>>>>>> ignore the horrors of what happens on merge.
>>> merging needs to scale so
>>> >>>>>>>>>>> that indexing really scales.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> At least it shouldnt spike RAM on trivial data
>>> amounts and cause OOM,
>>> >>>>>>>>>>> and definitely it shouldnt burn hours and hours
>>> of CPU in O(n^2)
>>> >>>>>>>>>>> fashion when indexing.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>>>>>>> To unsubscribe, e-mail:
>>> dev-unsubscribe@lucene.apache.org
>>> >>>>>>>>>>> For additional commands, e-mail:
>>> dev-help@lucene.apache.org
>>> >>>>>>>>>>>
>>> >>>>>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>>>>> To unsubscribe, e-mail:
>>> dev-unsubscribe@lucene.apache.org
>>> >>>>>>>>> For additional commands, e-mail:
>>> dev-help@lucene.apache.org
>>> >>>>>>>>>
>>> >>>>>>>
>>> >>>>>>> --
>>> >>>>>>> Adrien
>>> >>>>>>>
>>> >>>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>>> To unsubscribe, e-mail:
>>> dev-unsubscribe@lucene.apache.org
>>> >>>>>>> For additional commands, e-mail:
>>> dev-help@lucene.apache.org
>>> >>>>>>>
>>> >>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>> For additional commands, e-mail:
>>> dev-help@lucene.apache.org
>>> >>>>>>
>>> >>>>
>>> >>>> --
>>> >>>> http://www.needhamsoftware.com (work)
>>> >>>> http://www.the111shift.com (play)
>>> >>
>>> >
>>> ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

ichattopadhyaya at gmail

Apr 21, 2023, 12:24 AM

Post #97 of 99 (949 views)

Seems like they were all 768 dimensions.

On Fri, 21 Apr, 2023, 11:48 am Michael Wechner, <michael.wechner@wyona.com>
wrote:

> Hi Together
>
> Cohere just published approx. 100Mio embeddings based on Wikipedia content
>
> https://txt.cohere.com/embedding-archives-wikipedia/
>
> resp.
>
> https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
> https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings
> ....
>
> HTH
>
> Michael
>
>
>
> Am 13.04.23 um 07:58 schrieb Michael Wechner:
>
> Hi Kent
>
> Great, thank you very much!
>
> Will download it later today :-)
>
> All the best
>
> Michael
>
> Am 13.04.23 um 01:35 schrieb Kent Fitch:
>
> Hi Michael (and anyone else who wants just over 240K "real world" ada-002
> vectors of dimension 1536),
> you are welcome to retrieve a tar.gz file which contains:
> - 47K embeddings of Canberra Times news article text from 1994
> - 38K embeddings of the first paragraphs of wikipedia articles about
> organisations
> - 156.6K embeddings of the first paragraphs of wikipedia articles about
> people
>
>
> https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing
>
> The file is about 1.7GB and will expand to about 4.4GB. This file will be
> accessible for at least a week, and I hope you dont hit any google drive
> download limits trying to retrieve it.
>
> The embeddings were generated using my openAI account and you are welcome
> to use them for any purpose you like.
>
> best wishes,
>
> Kent Fitch
>
> On Wed, Apr 12, 2023 at 4:37?PM Michael Wechner <michael.wechner@wyona.com>
> wrote:
>
>> thank you very much for your feedback!
>>
>> In a previous post (April 7) you wrote you could make availlable the 47K
>> ada-002 vectors, which would be great!
>>
>> Would it make sense to setup a public gitub repo, such that others could
>> use or also contribute vectors?
>>
>> Thanks
>>
>> Michael Wechner
>>
>>
>> Am 12.04.23 um 04:51 schrieb Kent Fitch:
>>
>> I only know some characteristics of the openAI ada-002 vectors, although
>> they are a very popular as embeddings/text-characterisations as they allow
>> more accurate/"human meaningful" semantic search results with fewer
>> dimensions than their predecessors - I've evaluated a few different
>> embedding models, including some BERT variants, CLIP ViT-L-14 (with 768
>> dims, which was quite good), openAI's ada-001 (1024 dims) and babbage-001
>> (2048 dims), and ada-002 are qualitatively the best, although that will
>> certainly change!
>>
>> In any case, ada-002 vectors have interesting characteristics that I
>> think mean you could confidently create synthetic vectors which would be
>> hard to distinguish from "real" vectors. I found this from looking at 47K
>> ada-002 vectors generated across a full year (1994) of newspaper articles
>> from the Canberra Times and 200K wikipedia articles:
>> - there is no discernible/significant correlation between values in any
>> pair of dimensions
>> - all but 5 of the 1536 dimensions have an almost identical distribution
>> of values shown in the central blob on these graphs (that just show a few
>> of these 1531 dimensions with clumped values and the 5 "outlier"
>> dimensions, but all 1531 non-outlier dims are in there, which makes for
>> some easy quantisation from float to byte if you dont want to go the full
>> kmeans/clustering/Lloyds-algorithm approach):
>>
>> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>>
>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>>
>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>> - the variance of the value of each dimension is characteristic:
>>
>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>>
>> This probably represents something significant about how the ada-002
>> embeddings are created, but I think it also means creating "realistic"
>> values is possible. I did not use this information when testing recall &
>> performance on Lucene's HNSW implementation on 192m documents, as I
>> slightly dithered the values of a "real" set on 47K docs and stored other
>> fields in the doc that referenced the "base" document that the dithers were
>> made from, and used different dithering magnitudes so that I could test
>> recall with different neighbour sizes ("M"), construction-beamwidth and
>> search-beamwidths.
>>
>> best regards
>>
>> Kent Fitch
>>
>>
>>
>>
>> On Wed, Apr 12, 2023 at 5:08?AM Michael Wechner <
>> michael.wechner@wyona.com> wrote:
>>
>>> I understand what you mean that it seems to be artificial, but I don't
>>> understand why this matters to test performance and scalability of the
>>> indexing?
>>>
>>> Let's assume the limit of Lucene would be 4 instead of 1024 and there
>>> are only open source models generating vectors with 4 dimensions, for
>>> example
>>>
>>>
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>>
>>>
>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>>
>>>
>>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>
>>> and now I concatenate them to vectors with 8 dimensions
>>>
>>>
>>>
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>
>>> and normalize them to length 1.
>>>
>>> Why should this be any different to a model which is acting like a black
>>> box generating vectors with 8 dimensions?
>>>
>>>
>>>
>>>
>>> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>>> >> What exactly do you consider real vector data? Vector data which is
>>> based on texts written by humans?
>>> > We have plenty of text; the problem is coming up with a realistic
>>> > vector model that requires as many dimensions as people seem to be
>>> > demanding. As I said above, after surveying huggingface I couldn't
>>> > find any text-based model using more than 768 dimensions. So far we
>>> > have some ideas of generating higher-dimensional data by dithering or
>>> > concatenating existing data, but it seems artificial.
>>> >
>>> > On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >> What exactly do you consider real vector data? Vector data which is
>>> based on texts written by humans?
>>> >>
>>> >> I am asking, because I recently attended the following presentation
>>> by Anastassia Shaitarova (UZH Institute for Computational Linguistics,
>>> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>>> >>
>>> >> ----
>>> >>
>>> >> Can we Identify Machine-Generated Text? An Overview of Current
>>> Approaches
>>> >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>>> >>
>>> >> The detection of machine-generated text has become increasingly
>>> important due to the prevalence of automated content generation and its
>>> potential for misuse. In this talk, we will discuss the motivation for
>>> automatic detection of generated text. We will present the currently
>>> available methods, including feature-based classification as a “first
>>> line-of-defense.” We will provide an overview of the detection tools that
>>> have been made available so far and discuss their limitations. Finally, we
>>> will reflect on some open problems associated with the automatic
>>> discrimination of generated texts.
>>> >>
>>> >> ----
>>> >>
>>> >> and her conclusion was that it has become basically impossible to
>>> differentiate between text generated by humans and text generated by for
>>> example ChatGPT.
>>> >>
>>> >> Whereas others have a slightly different opinion, see for example
>>> >>
>>> >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>> >>
>>> >> But I would argue that real world and synthetic have become close
>>> enough that testing performance and scalability of indexing should be
>>> possible with synthetic data.
>>> >>
>>> >> I completely agree that we have to base our discussions and decisions
>>> on scientific methods and that we have to make sure that Lucene performs
>>> and scales well and that we understand the limits and what is going on
>>> under the hood.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael W
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>>> >>
>>> >> +1 to test on real vector data -- if you test on synthetic data you
>>> draw synthetic conclusions.
>>> >>
>>> >> Can someone post the theoretical performance (CPU and RAM required)
>>> of HNSW construction? Do we know/believe our HNSW implementation has
>>> achieved that theoretical big-O performance? Maybe we have some silly
>>> performance bug that's causing it not to?
>>> >>
>>> >> As I understand it, HNSW makes the tradeoff of costly construction
>>> for faster searching, which is typically the right tradeoff for search use
>>> cases. We do this in other parts of the Lucene index too.
>>> >>
>>> >> Lucene will do a logarithmic number of merges over time, i.e. each
>>> doc will be merged O(log(N)) times in its lifetime in the index. We need
>>> to multiply that by the cost of re-building the whole HNSW graph on each
>>> merge. BTW, other things in Lucene, like BKD/dimensional points, also
>>> rebuild the whole data structure on each merge, I think? But, as Rob
>>> pointed out, stored fields merging do indeed do some sneaky tricks to avoid
>>> excessive block decompress/recompress on each merge.
>>> >>
>>> >>> As I understand it, vetoes must have technical merit. I'm not sure
>>> that this veto rises to "technical merit" on 2 counts:
>>> >> Actually I think Robert's veto stands on its technical merit
>>> already. Robert's take on technical matters very much resonate with me,
>>> even if he is sometimes prickly in how he expresses them ;)
>>> >>
>>> >> His point is that we, as a dev community, are not paying enough
>>> attention to the indexing performance of our KNN algo (HNSW) and
>>> implementation, and that it is reckless to increase / remove limits in that
>>> state. It is indeed a one-way door decision and one must confront such
>>> decisions with caution, especially for such a widely used base
>>> infrastructure as Lucene. We don't even advertise today in our javadocs
>>> that you need XXX heap if you index vectors with dimension Y, fanout X,
>>> levels Z, etc.
>>> >>
>>> >> RAM used during merging is unaffected by dimensionality, but is
>>> affected by fanout, because the HNSW graph (not the raw vectors) is memory
>>> resident, I think? Maybe we could move it off-heap and let the OS manage
>>> the memory (and still document the RAM requirements)? Maybe merge RAM
>>> costs should be accounted for in IW's RAM buffer accounting? It is not
>>> today, and there are some other things that use non-trivial RAM, e.g. the
>>> doc mapping (to compress docid space when deletions are reclaimed).
>>> >>
>>> >> When we added KNN vector testing to Lucene's nightly benchmarks, the
>>> indexing time massively increased -- see annotations DH and DP here:
>>> https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly
>>> benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of
>>> course, that is using a single thread for indexing (on a box that has 128
>>> cores!) so we produce a deterministic index every night ...
>>> >>
>>> >> Stepping out (meta) a bit ... this discussion is precisely one of the
>>> awesome benefits of the (informed) veto. It means risky changes to the
>>> software, as determined by any single informed developer on the project,
>>> can force a healthy discussion about the problem at hand. Robert is
>>> legitimately concerned about a real issue and so we should use our creative
>>> energies to characterize our HNSW implementation's performance, document it
>>> clearly for users, and uncover ways to improve it.
>>> >>
>>> >> Mike McCandless
>>> >>
>>> >> http://blog.mikemccandless.com
>>> >>
>>> >>
>>> >> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>> >>> I think Gus points are on target.
>>> >>>
>>> >>> I recommend we move this forward in this way:
>>> >>> We stop any discussion and everyone interested proposes an option
>>> with a motivation, then we aggregate the options and we create a Vote maybe?
>>> >>>
>>> >>> I am also on the same page on the fact that a veto should come with
>>> a clear and reasonable technical merit, which also in my opinion has not
>>> come yet.
>>> >>>
>>> >>> I also apologise if any of my words sounded harsh or personal
>>> attacks, never meant to do so.
>>> >>>
>>> >>> My proposed option:
>>> >>>
>>> >>> 1) remove the limit and potentially make it configurable,
>>> >>> Motivation:
>>> >>> The system administrator can enforce a limit its users need to
>>> respect that it's in line with whatever the admin decided to be acceptable
>>> for them.
>>> >>> Default can stay the current one.
>>> >>>
>>> >>> That's my favourite at the moment, but I agree that potentially in
>>> the future this may need to change, as we may optimise the data structures
>>> for certain dimensions. I am a big fan of Yagni (you aren't going to need
>>> it) so I am ok we'll face a different discussion if that happens in the
>>> future.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
>>> >>>> What I see so far:
>>> >>>>
>>> >>>> Much positive support for raising the limit
>>> >>>> Slightly less support for removing it or making it configurable
>>> >>>> A single veto which argues that a (as yet undefined) performance
>>> standard must be met before raising the limit
>>> >>>> Hot tempers (various) making this discussion difficult
>>> >>>>
>>> >>>> As I understand it, vetoes must have technical merit. I'm not sure
>>> that this veto rises to "technical merit" on 2 counts:
>>> >>>>
>>> >>>> No standard for the performance is given so it cannot be
>>> technically met. Without hard criteria it's a moving target.
>>> >>>> It appears to encode a valuation of the user's time, and that
>>> valuation is really up to the user. Some users may consider 2hours useless
>>> and not worth it, and others might happily wait 2 hours. This is not a
>>> technical decision, it's a business decision regarding the relative value
>>> of the time invested vs the value of the result. If I can cure cancer by
>>> indexing for a year, that might be worth it... (hyperbole of course).
>>> >>>>
>>> >>>> Things I would consider to have technical merit that I don't hear:
>>> >>>>
>>> >>>> Impact on the speed of **other** indexing operations. (devaluation
>>> of other functionality)
>>> >>>> Actual scenarios that work when the limit is low and fail when the
>>> limit is high (new failure on the same data with the limit raised).
>>> >>>>
>>> >>>> One thing that might or might not have technical merit
>>> >>>>
>>> >>>> If someone feels there is a lack of documentation of the
>>> costs/performance implications of using large vectors, possibly including
>>> reproducible benchmarks establishing the scaling behavior (there seems to
>>> be disagreement on O(n) vs O(n^2)).
>>> >>>>
>>> >>>> The users *should* know what they are getting into, but if the cost
>>> is worth it to them, they should be able to pay it without forking the
>>> project. If this veto causes a fork that's not good.
>>> >>>>
>>> >>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>> >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes
>>> in 100 and 300 dimensional varieties and can easily enough generate large
>>> numbers of vector documents from the articles data. To go higher we could
>>> concatenate vectors from that and I believe the performance numbers would
>>> be plausible.
>>> >>>>>
>>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com>
>>> wrote:
>>> >>>>>> Can we set up a branch in which the limit is bumped to 2048, then
>>> have
>>> >>>>>> a realistic, free data set (wikipedia sample or something) that
>>> has,
>>> >>>>>> say, 5 million docs and vectors created using public data (glove
>>> >>>>>> pre-trained embeddings or the like)? We then could run indexing
>>> on the
>>> >>>>>> same hardware with 512, 1024 and 2048 and see what the numbers,
>>> limits
>>> >>>>>> and behavior actually are.
>>> >>>>>>
>>> >>>>>> I can help in writing this but not until after Easter.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Dawid
>>> >>>>>>
>>> >>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com>
>>> wrote:
>>> >>>>>>> As Dawid pointed out earlier on this thread, this is the rule for
>>> >>>>>>> Apache projects: a single -1 vote on a code change is a veto and
>>> >>>>>>> cannot be overridden. Furthermore, Robert is one of the people
>>> on this
>>> >>>>>>> project who worked the most on debugging subtle bugs, making
>>> Lucene
>>> >>>>>>> more robust and improving our test framework, so I'm listening
>>> when he
>>> >>>>>>> voices quality concerns.
>>> >>>>>>>
>>> >>>>>>> The argument against removing/raising the limit that resonates
>>> with me
>>> >>>>>>> the most is that it is a one-way door. As MikeS highlighted
>>> earlier on
>>> >>>>>>> this thread, implementations may want to take advantage of the
>>> fact
>>> >>>>>>> that there is a limit at some point too. This is why I don't
>>> want to
>>> >>>>>>> remove the limit and would prefer a slight increase, such as
>>> 2048 as
>>> >>>>>>> suggested in the original issue, which would enable most of the
>>> things
>>> >>>>>>> that users who have been asking about raising the limit would
>>> like to
>>> >>>>>>> do.
>>> >>>>>>>
>>> >>>>>>> I agree that the merge-time memory usage and slow indexing rate
>>> are
>>> >>>>>>> not great. But it's still possible to index multi-million vector
>>> >>>>>>> datasets with a 4GB heap without hitting OOMEs regardless of the
>>> >>>>>>> number of dimensions, and the feedback I'm seeing is that many
>>> users
>>> >>>>>>> are still interested in indexing multi-million vector datasets
>>> despite
>>> >>>>>>> the slow indexing rate. I wish we could do better, and vector
>>> indexing
>>> >>>>>>> is certainly more expert than text indexing, but it still is
>>> usable in
>>> >>>>>>> my opinion. I understand how giving Lucene more information about
>>> >>>>>>> vectors prior to indexing (e.g. clustering information as Jim
>>> pointed
>>> >>>>>>> out) could help make merging faster and more memory-efficient,
>>> but I
>>> >>>>>>> would really like to avoid making it a requirement for indexing
>>> >>>>>>> vectors as it also makes this feature much harder to use.
>>> >>>>>>>
>>> >>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>> >>>>>>> <a.benedetti@sease.io> wrote:
>>> >>>>>>>> I am very attentive to listen opinions but I am un-convinced
>>> here and I an not sure that a single person opinion should be allowed to be
>>> detrimental for such an important project.
>>> >>>>>>>>
>>> >>>>>>>> The limit as far as I know is literally just raising an
>>> exception.
>>> >>>>>>>> Removing it won't alter in any way the current performance for
>>> users in low dimensional space.
>>> >>>>>>>> Removing it will just enable more users to use Lucene.
>>> >>>>>>>>
>>> >>>>>>>> If new users in certain situations will be unhappy with the
>>> performance, they may contribute improvements.
>>> >>>>>>>> This is how you make progress.
>>> >>>>>>>>
>>> >>>>>>>> If it's a reputation thing, trust me that not allowing users to
>>> play with high dimensional space will equally damage it.
>>> >>>>>>>>
>>> >>>>>>>> To me it's really a no brainer.
>>> >>>>>>>> Removing the limit and enable people to use high dimensional
>>> vectors will take minutes.
>>> >>>>>>>> Improving the hnsw implementation can take months.
>>> >>>>>>>> Pick one to begin with...
>>> >>>>>>>>
>>> >>>>>>>> And there's no-one paying me here, no company interest
>>> whatsoever, actually I pay people to contribute, I am just convinced it's a
>>> good idea.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com>
>>> wrote:
>>> >>>>>>>>> I disagree with your categorization. I put in plenty of work
>>> and
>>> >>>>>>>>> experienced plenty of pain myself, writing tests and fighting
>>> these
>>> >>>>>>>>> issues, after i saw that, two releases in a row, vector
>>> indexing fell
>>> >>>>>>>>> over and hit integer overflows etc on small datasets:
>>> >>>>>>>>>
>>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>>> >>>>>>>>>
>>> >>>>>>>>> Attacking me isn't helping the situation.
>>> >>>>>>>>>
>>> >>>>>>>>> PS: when i said the "one guy who wrote the code" I didn't mean
>>> it in
>>> >>>>>>>>> any kind of demeaning fashion really. I meant to describe the
>>> current
>>> >>>>>>>>> state of usability with respect to indexing a few million docs
>>> with
>>> >>>>>>>>> high dimensions. You can scroll up the thread and see that at
>>> least
>>> >>>>>>>>> one other committer on the project experienced similar pain as
>>> me.
>>> >>>>>>>>> Then, think about users who aren't committers trying to use the
>>> >>>>>>>>> functionality!
>>> >>>>>>>>>
>>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
>>> msokolov@gmail.com> wrote:
>>> >>>>>>>>>> What you said about increasing dimensions requiring a bigger
>>> ram buffer on merge is wrong. That's the point I was trying to make. Your
>>> concerns about merge costs are not wrong, but your conclusion that we need
>>> to limit dimensions is not justified.
>>> >>>>>>>>>>
>>> >>>>>>>>>> You complain that hnsw sucks it doesn't scale, but when I
>>> show it scales linearly with dimension you just ignore that and complain
>>> about something entirely different.
>>> >>>>>>>>>>
>>> >>>>>>>>>> You demand that people run all kinds of tests to prove you
>>> wrong but when they do, you don't listen and you won't put in the work
>>> yourself or complain that it's too hard.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Then you complain about people not meeting you half way. Wow
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
>>> wrote:
>>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>> >>>>>>>>>>> <michael.wechner@wyona.com> wrote:
>>> >>>>>>>>>>>> What exactly do you consider reasonable?
>>> >>>>>>>>>>> Let's begin a real discussion by being HONEST about the
>>> current
>>> >>>>>>>>>>> status. Please put politically correct or your own company's
>>> wishes
>>> >>>>>>>>>>> aside, we know it's not in a good state.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Current status is the one guy who wrote the code can set a
>>> >>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset with 1024
>>> >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> My concerns are everyone else except the one guy, I want it
>>> to be
>>> >>>>>>>>>>> usable. Increasing dimensions just means even bigger
>>> multi-gigabyte
>>> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>>> >>>>>>>>>>> It is also a permanent backwards compatibility decision, we
>>> have to
>>> >>>>>>>>>>> support it once we do this and we can't just say "oops" and
>>> flip it
>>> >>>>>>>>>>> back.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer is
>>> really to
>>> >>>>>>>>>>> avoid merges because they are so slow and it would be DAYS
>>> otherwise,
>>> >>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
>>> >>>>>>>>>>> Also from personal experience, it takes trial and error
>>> (means
>>> >>>>>>>>>>> experiencing OOM on merge!!!) before you get those heap
>>> values correct
>>> >>>>>>>>>>> for your dataset. This usually means starting over which is
>>> >>>>>>>>>>> frustrating and wastes more time.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in
>>> IndexWriter, seems
>>> >>>>>>>>>>> to me like its a good idea. maybe the multigigabyte ram
>>> buffer can be
>>> >>>>>>>>>>> avoided in this way and performance improved by writing
>>> bigger
>>> >>>>>>>>>>> segments with lucene's defaults. But this doesn't mean we
>>> can simply
>>> >>>>>>>>>>> ignore the horrors of what happens on merge. merging needs
>>> to scale so
>>> >>>>>>>>>>> that indexing really scales.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> At least it shouldnt spike RAM on trivial data amounts and
>>> cause OOM,
>>> >>>>>>>>>>> and definitely it shouldnt burn hours and hours of CPU in
>>> O(n^2)
>>> >>>>>>>>>>> fashion when indexing.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>>>>>>
>>> >>>>>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>>>>
>>> >>>>>>>
>>> >>>>>>> --
>>> >>>>>>> Adrien
>>> >>>>>>>
>>> >>>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>>
>>> >>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>
>>> >>>>
>>> >>>> --
>>> >>>> http://www.needhamsoftware.com (work)
>>> >>>> http://www.the111shift.com (play)
>>> >>
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>>
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 21, 2023, 12:30 AM

Post #98 of 99 (949 views)

yes, they are, whereas it should help us to test performance and
scalability :-)

Am 21.04.23 um 09:24 schrieb Ishan Chattopadhyaya:
> Seems like they were all 768 dimensions.
>
> On Fri, 21 Apr, 2023, 11:48 am Michael Wechner,
> <michael.wechner@wyona.com> wrote:
>
> Hi Together
>
> Cohere just published approx. 100Mio embeddings based on Wikipedia
> content
>
> https://txt.cohere.com/embedding-archives-wikipedia/
>
> resp.
>
> https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
> https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings
> ....
>
> HTH
>
> Michael
>
>
>
> Am 13.04.23 um 07:58 schrieb Michael Wechner:
>> Hi Kent
>>
>> Great, thank you very much!
>>
>> Will download it later today :-)
>>
>> All the best
>>
>> Michael
>>
>> Am 13.04.23 um 01:35 schrieb Kent Fitch:
>>> Hi Michael (and anyone else who wants just over 240K "real
>>> world" ada-002 vectors of dimension 1536),
>>> you are welcome to retrieve a tar.gz file which contains:
>>> - 47K embeddings of Canberra Times news article text from 1994
>>> - 38K embeddings of the first paragraphs of wikipedia articles
>>> about organisations
>>> - 156.6K embeddings of the first paragraphs of wikipedia
>>> articles about people
>>>
>>> https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing
>>>
>>> The file is about 1.7GB and will expand to about 4.4GB. This
>>> file will be accessible for at least a week, and I hope you dont
>>> hit any google drive download limits trying to retrieve it.
>>>
>>> The embeddings were generated using my openAI account and you
>>> are welcome to use them for any purpose you like.
>>>
>>> best wishes,
>>>
>>> Kent Fitch
>>>
>>> On Wed, Apr 12, 2023 at 4:37?PM Michael Wechner
>>> <michael.wechner@wyona.com> wrote:
>>>
>>> thank you very much for your feedback!
>>>
>>> In a previous post (April 7) you wrote you could make
>>> availlable the 47K ada-002 vectors, which would be great!
>>>
>>> Would it make sense to setup a public gitub repo, such that
>>> others could use or also contribute vectors?
>>>
>>> Thanks
>>>
>>> Michael Wechner
>>>
>>>
>>> Am 12.04.23 um 04:51 schrieb Kent Fitch:
>>>> I only know some characteristics of the openAI ada-002
>>>> vectors, although they are a very popular as
>>>> embeddings/text-characterisations as they allow more
>>>> accurate/"human meaningful" semantic search results with
>>>> fewer dimensions than their predecessors - I've evaluated a
>>>> few different embedding models, including some BERT
>>>> variants, CLIP ViT-L-14 (with 768 dims, which was quite
>>>> good), openAI's ada-001 (1024 dims) and babbage-001 (2048
>>>> dims), and ada-002 are qualitatively the best, although
>>>> that will certainly change!
>>>>
>>>> In any case, ada-002 vectors have interesting
>>>> characteristics that I think mean you could confidently
>>>> create synthetic vectors which would be hard to distinguish
>>>> from "real" vectors. I found this from looking at 47K
>>>> ada-002 vectors generated across a full year (1994) of
>>>> newspaper articles from the Canberra Times and 200K
>>>> wikipedia articles:
>>>> - there is no discernible/significant correlation between
>>>> values in any pair of dimensions
>>>> - all but 5 of the 1536 dimensions have an almost identical
>>>> distribution of values shown in the central blob on these
>>>> graphs (that just show a few of these 1531 dimensions with
>>>> clumped values and the 5 "outlier" dimensions, but all 1531
>>>> non-outlier dims are in there, which makes for some easy
>>>> quantisation from float to byte if you dont want to go the
>>>> full kmeans/clustering/Lloyds-algorithm approach):
>>>> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>>>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>>>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>>>> - the variance of the value of each dimension is
>>>> characteristic:
>>>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>>>>
>>>> This probably represents something significant about how
>>>> the ada-002 embeddings are created, but I think it also
>>>> means creating "realistic" values is possible. I did not
>>>> use this information when testing recall & performance on
>>>> Lucene's HNSW implementation on 192m documents, as I
>>>> slightly dithered the values of a "real" set on 47K docs
>>>> and stored other fields in the doc that referenced the
>>>> "base" document that the dithers were made from, and used
>>>> different dithering magnitudes so that I could test recall
>>>> with different neighbour sizes ("M"),
>>>> construction-beamwidth and search-beamwidths.
>>>>
>>>> best regards
>>>>
>>>> Kent Fitch
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Apr 12, 2023 at 5:08?AM Michael Wechner
>>>> <michael.wechner@wyona.com> wrote:
>>>>
>>>> I understand what you mean that it seems to be
>>>> artificial, but I don't
>>>> understand why this matters to test performance and
>>>> scalability of the
>>>> indexing?
>>>>
>>>> Let's assume the limit of Lucene would be 4 instead of
>>>> 1024 and there
>>>> are only open source models generating vectors with 4
>>>> dimensions, for
>>>> example
>>>>
>>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>>>
>>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>>
>>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>>>
>>>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>>
>>>> and now I concatenate them to vectors with 8 dimensions
>>>>
>>>>
>>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>>
>>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>>
>>>> and normalize them to length 1.
>>>>
>>>> Why should this be any different to a model which is
>>>> acting like a black
>>>> box generating vectors with 8 dimensions?
>>>>
>>>>
>>>>
>>>>
>>>> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>>>> >> What exactly do you consider real vector data?
>>>> Vector data which is based on texts written by humans?
>>>> > We have plenty of text; the problem is coming up with
>>>> a realistic
>>>> > vector model that requires as many dimensions as
>>>> people seem to be
>>>> > demanding. As I said above, after surveying
>>>> huggingface I couldn't
>>>> > find any text-based model using more than 768
>>>> dimensions. So far we
>>>> > have some ideas of generating higher-dimensional data
>>>> by dithering or
>>>> > concatenating existing data, but it seems artificial.
>>>> >
>>>> > On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
>>>> > <michael.wechner@wyona.com> wrote:
>>>> >> What exactly do you consider real vector data?
>>>> Vector data which is based on texts written by humans?
>>>> >>
>>>> >> I am asking, because I recently attended the
>>>> following presentation by Anastassia Shaitarova (UZH
>>>> Institute for Computational Linguistics,
>>>> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>>>> >>
>>>> >> ----
>>>> >>
>>>> >> Can we Identify Machine-Generated Text? An Overview
>>>> of Current Approaches
>>>> >> by Anastassia Shaitarova (UZH Institute for
>>>> Computational Linguistics)
>>>> >>
>>>> >> The detection of machine-generated text has become
>>>> increasingly important due to the prevalence of
>>>> automated content generation and its potential for
>>>> misuse. In this talk, we will discuss the motivation
>>>> for automatic detection of generated text. We will
>>>> present the currently available methods, including
>>>> feature-based classification as a “first
>>>> line-of-defense.” We will provide an overview of the
>>>> detection tools that have been made available so far
>>>> and discuss their limitations. Finally, we will reflect
>>>> on some open problems associated with the automatic
>>>> discrimination of generated texts.
>>>> >>
>>>> >> ----
>>>> >>
>>>> >> and her conclusion was that it has become basically
>>>> impossible to differentiate between text generated by
>>>> humans and text generated by for example ChatGPT.
>>>> >>
>>>> >> Whereas others have a slightly different opinion,
>>>> see for example
>>>> >>
>>>> >>
>>>> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>>> >>
>>>> >> But I would argue that real world and synthetic have
>>>> become close enough that testing performance and
>>>> scalability of indexing should be possible with
>>>> synthetic data.
>>>> >>
>>>> >> I completely agree that we have to base our
>>>> discussions and decisions on scientific methods and
>>>> that we have to make sure that Lucene performs and
>>>> scales well and that we understand the limits and what
>>>> is going on under the hood.
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> Michael W
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>>>> >>
>>>> >> +1 to test on real vector data -- if you test on
>>>> synthetic data you draw synthetic conclusions.
>>>> >>
>>>> >> Can someone post the theoretical performance (CPU
>>>> and RAM required) of HNSW construction? Do we
>>>> know/believe our HNSW implementation has achieved that
>>>> theoretical big-O performance? Maybe we have some
>>>> silly performance bug that's causing it not to?
>>>> >>
>>>> >> As I understand it, HNSW makes the tradeoff of
>>>> costly construction for faster searching, which is
>>>> typically the right tradeoff for search use cases. We
>>>> do this in other parts of the Lucene index too.
>>>> >>
>>>> >> Lucene will do a logarithmic number of merges over
>>>> time, i.e. each doc will be merged O(log(N)) times in
>>>> its lifetime in the index. We need to multiply that by
>>>> the cost of re-building the whole HNSW graph on each
>>>> merge. BTW, other things in Lucene, like
>>>> BKD/dimensional points, also rebuild the whole data
>>>> structure on each merge, I think? But, as Rob pointed
>>>> out, stored fields merging do indeed do some sneaky
>>>> tricks to avoid excessive block decompress/recompress
>>>> on each merge.
>>>> >>
>>>> >>> As I understand it, vetoes must have technical
>>>> merit. I'm not sure that this veto rises to "technical
>>>> merit" on 2 counts:
>>>> >> Actually I think Robert's veto stands on its
>>>> technical merit already. Robert's take on technical
>>>> matters very much resonate with me, even if he is
>>>> sometimes prickly in how he expresses them ;)
>>>> >>
>>>> >> His point is that we, as a dev community, are not
>>>> paying enough attention to the indexing performance of
>>>> our KNN algo (HNSW) and implementation, and that it is
>>>> reckless to increase / remove limits in that state. It
>>>> is indeed a one-way door decision and one must confront
>>>> such decisions with caution, especially for such a
>>>> widely used base infrastructure as Lucene. We don't
>>>> even advertise today in our javadocs that you need XXX
>>>> heap if you index vectors with dimension Y, fanout X,
>>>> levels Z, etc.
>>>> >>
>>>> >> RAM used during merging is unaffected by
>>>> dimensionality, but is affected by fanout, because the
>>>> HNSW graph (not the raw vectors) is memory resident, I
>>>> think? Maybe we could move it off-heap and let the OS
>>>> manage the memory (and still document the RAM
>>>> requirements)? Maybe merge RAM costs should be
>>>> accounted for in IW's RAM buffer accounting? It is not
>>>> today, and there are some other things that use
>>>> non-trivial RAM, e.g. the doc mapping (to compress
>>>> docid space when deletions are reclaimed).
>>>> >>
>>>> >> When we added KNN vector testing to Lucene's nightly
>>>> benchmarks, the indexing time massively increased --
>>>> see annotations DH and DP here:
>>>> https://home.apache.org/~mikemccand/lucenebench/indexing.html.
>>>> Nightly benchmarks now start at 6 PM and don't finish
>>>> until ~14.5 hours later. Of course, that is using a
>>>> single thread for indexing (on a box that has 128
>>>> cores!) so we produce a deterministic index every night ...
>>>> >>
>>>> >> Stepping out (meta) a bit ... this discussion is
>>>> precisely one of the awesome benefits of the (informed)
>>>> veto. It means risky changes to the software, as
>>>> determined by any single informed developer on the
>>>> project, can force a healthy discussion about the
>>>> problem at hand. Robert is legitimately concerned
>>>> about a real issue and so we should use our creative
>>>> energies to characterize our HNSW implementation's
>>>> performance, document it clearly for users, and uncover
>>>> ways to improve it.
>>>> >>
>>>> >> Mike McCandless
>>>> >>
>>>> >> http://blog.mikemccandless.com
>>>> >>
>>>> >>
>>>> >> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti
>>>> <a.benedetti@sease.io> wrote:
>>>> >>> I think Gus points are on target.
>>>> >>>
>>>> >>> I recommend we move this forward in this way:
>>>> >>> We stop any discussion and everyone interested
>>>> proposes an option with a motivation, then we aggregate
>>>> the options and we create a Vote maybe?
>>>> >>>
>>>> >>> I am also on the same page on the fact that a veto
>>>> should come with a clear and reasonable technical
>>>> merit, which also in my opinion has not come yet.
>>>> >>>
>>>> >>> I also apologise if any of my words sounded harsh
>>>> or personal attacks, never meant to do so.
>>>> >>>
>>>> >>> My proposed option:
>>>> >>>
>>>> >>> 1) remove the limit and potentially make it
>>>> configurable,
>>>> >>> Motivation:
>>>> >>> The system administrator can enforce a limit its
>>>> users need to respect that it's in line with whatever
>>>> the admin decided to be acceptable for them.
>>>> >>> Default can stay the current one.
>>>> >>>
>>>> >>> That's my favourite at the moment, but I agree that
>>>> potentially in the future this may need to change, as
>>>> we may optimise the data structures for certain
>>>> dimensions. I am a big fan of Yagni (you aren't going
>>>> to need it) so I am ok we'll face a different
>>>> discussion if that happens in the future.
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck,
>>>> <gus.heck@gmail.com> wrote:
>>>> >>>> What I see so far:
>>>> >>>>
>>>> >>>> Much positive support for raising the limit
>>>> >>>> Slightly less support for removing it or making it
>>>> configurable
>>>> >>>> A single veto which argues that a (as yet
>>>> undefined) performance standard must be met before
>>>> raising the limit
>>>> >>>> Hot tempers (various) making this discussion difficult
>>>> >>>>
>>>> >>>> As I understand it, vetoes must have technical
>>>> merit. I'm not sure that this veto rises to "technical
>>>> merit" on 2 counts:
>>>> >>>>
>>>> >>>> No standard for the performance is given so it
>>>> cannot be technically met. Without hard criteria it's a
>>>> moving target.
>>>> >>>> It appears to encode a valuation of the user's
>>>> time, and that valuation is really up to the user. Some
>>>> users may consider 2hours useless and not worth it, and
>>>> others might happily wait 2 hours. This is not a
>>>> technical decision, it's a business decision regarding
>>>> the relative value of the time invested vs the value of
>>>> the result. If I can cure cancer by indexing for a
>>>> year, that might be worth it... (hyperbole of course).
>>>> >>>>
>>>> >>>> Things I would consider to have technical merit
>>>> that I don't hear:
>>>> >>>>
>>>> >>>> Impact on the speed of **other** indexing
>>>> operations. (devaluation of other functionality)
>>>> >>>> Actual scenarios that work when the limit is low
>>>> and fail when the limit is high (new failure on the
>>>> same data with the limit raised).
>>>> >>>>
>>>> >>>> One thing that might or might not have technical merit
>>>> >>>>
>>>> >>>> If someone feels there is a lack of documentation
>>>> of the costs/performance implications of using large
>>>> vectors, possibly including reproducible benchmarks
>>>> establishing the scaling behavior (there seems to be
>>>> disagreement on O(n) vs O(n^2)).
>>>> >>>>
>>>> >>>> The users *should* know what they are getting
>>>> into, but if the cost is worth it to them, they should
>>>> be able to pay it without forking the project. If this
>>>> veto causes a fork that's not good.
>>>> >>>>
>>>> >>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov
>>>> <msokolov@gmail.com> wrote:
>>>> >>>>> We do have a dataset built from Wikipedia in
>>>> luceneutil. It comes in 100 and 300 dimensional
>>>> varieties and can easily enough generate large numbers
>>>> of vector documents from the articles data. To go
>>>> higher we could concatenate vectors from that and I
>>>> believe the performance numbers would be plausible.
>>>> >>>>>
>>>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss
>>>> <dawid.weiss@gmail.com> wrote:
>>>> >>>>>> Can we set up a branch in which the limit is
>>>> bumped to 2048, then have
>>>> >>>>>> a realistic, free data set (wikipedia sample or
>>>> something) that has,
>>>> >>>>>> say, 5 million docs and vectors created using
>>>> public data (glove
>>>> >>>>>> pre-trained embeddings or the like)? We then
>>>> could run indexing on the
>>>> >>>>>> same hardware with 512, 1024 and 2048 and see
>>>> what the numbers, limits
>>>> >>>>>> and behavior actually are.
>>>> >>>>>>
>>>> >>>>>> I can help in writing this but not until after
>>>> Easter.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Dawid
>>>> >>>>>>
>>>> >>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand
>>>> <jpountz@gmail.com> wrote:
>>>> >>>>>>> As Dawid pointed out earlier on this thread,
>>>> this is the rule for
>>>> >>>>>>> Apache projects: a single -1 vote on a code
>>>> change is a veto and
>>>> >>>>>>> cannot be overridden. Furthermore, Robert is
>>>> one of the people on this
>>>> >>>>>>> project who worked the most on debugging subtle
>>>> bugs, making Lucene
>>>> >>>>>>> more robust and improving our test framework,
>>>> so I'm listening when he
>>>> >>>>>>> voices quality concerns.
>>>> >>>>>>>
>>>> >>>>>>> The argument against removing/raising the limit
>>>> that resonates with me
>>>> >>>>>>> the most is that it is a one-way door. As MikeS
>>>> highlighted earlier on
>>>> >>>>>>> this thread, implementations may want to take
>>>> advantage of the fact
>>>> >>>>>>> that there is a limit at some point too. This
>>>> is why I don't want to
>>>> >>>>>>> remove the limit and would prefer a slight
>>>> increase, such as 2048 as
>>>> >>>>>>> suggested in the original issue, which would
>>>> enable most of the things
>>>> >>>>>>> that users who have been asking about raising
>>>> the limit would like to
>>>> >>>>>>> do.
>>>> >>>>>>>
>>>> >>>>>>> I agree that the merge-time memory usage and
>>>> slow indexing rate are
>>>> >>>>>>> not great. But it's still possible to index
>>>> multi-million vector
>>>> >>>>>>> datasets with a 4GB heap without hitting OOMEs
>>>> regardless of the
>>>> >>>>>>> number of dimensions, and the feedback I'm
>>>> seeing is that many users
>>>> >>>>>>> are still interested in indexing multi-million
>>>> vector datasets despite
>>>> >>>>>>> the slow indexing rate. I wish we could do
>>>> better, and vector indexing
>>>> >>>>>>> is certainly more expert than text indexing,
>>>> but it still is usable in
>>>> >>>>>>> my opinion. I understand how giving Lucene more
>>>> information about
>>>> >>>>>>> vectors prior to indexing (e.g. clustering
>>>> information as Jim pointed
>>>> >>>>>>> out) could help make merging faster and more
>>>> memory-efficient, but I
>>>> >>>>>>> would really like to avoid making it a
>>>> requirement for indexing
>>>> >>>>>>> vectors as it also makes this feature much
>>>> harder to use.
>>>> >>>>>>>
>>>> >>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>>> >>>>>>> <a.benedetti@sease.io> wrote:
>>>> >>>>>>>> I am very attentive to listen opinions but I
>>>> am un-convinced here and I an not sure that a single
>>>> person opinion should be allowed to be detrimental for
>>>> such an important project.
>>>> >>>>>>>>
>>>> >>>>>>>> The limit as far as I know is literally just
>>>> raising an exception.
>>>> >>>>>>>> Removing it won't alter in any way the current
>>>> performance for users in low dimensional space.
>>>> >>>>>>>> Removing it will just enable more users to use
>>>> Lucene.
>>>> >>>>>>>>
>>>> >>>>>>>> If new users in certain situations will be
>>>> unhappy with the performance, they may contribute
>>>> improvements.
>>>> >>>>>>>> This is how you make progress.
>>>> >>>>>>>>
>>>> >>>>>>>> If it's a reputation thing, trust me that not
>>>> allowing users to play with high dimensional space will
>>>> equally damage it.
>>>> >>>>>>>>
>>>> >>>>>>>> To me it's really a no brainer.
>>>> >>>>>>>> Removing the limit and enable people to use
>>>> high dimensional vectors will take minutes.
>>>> >>>>>>>> Improving the hnsw implementation can take months.
>>>> >>>>>>>> Pick one to begin with...
>>>> >>>>>>>>
>>>> >>>>>>>> And there's no-one paying me here, no company
>>>> interest whatsoever, actually I pay people to
>>>> contribute, I am just convinced it's a good idea.
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir,
>>>> <rcmuir@gmail.com> wrote:
>>>> >>>>>>>>> I disagree with your categorization. I put in
>>>> plenty of work and
>>>> >>>>>>>>> experienced plenty of pain myself, writing
>>>> tests and fighting these
>>>> >>>>>>>>> issues, after i saw that, two releases in a
>>>> row, vector indexing fell
>>>> >>>>>>>>> over and hit integer overflows etc on small
>>>> datasets:
>>>> >>>>>>>>>
>>>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>>>> >>>>>>>>>
>>>> >>>>>>>>> Attacking me isn't helping the situation.
>>>> >>>>>>>>>
>>>> >>>>>>>>> PS: when i said the "one guy who wrote the
>>>> code" I didn't mean it in
>>>> >>>>>>>>> any kind of demeaning fashion really. I meant
>>>> to describe the current
>>>> >>>>>>>>> state of usability with respect to indexing a
>>>> few million docs with
>>>> >>>>>>>>> high dimensions. You can scroll up the thread
>>>> and see that at least
>>>> >>>>>>>>> one other committer on the project
>>>> experienced similar pain as me.
>>>> >>>>>>>>> Then, think about users who aren't committers
>>>> trying to use the
>>>> >>>>>>>>> functionality!
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael
>>>> Sokolov <msokolov@gmail.com> wrote:
>>>> >>>>>>>>>> What you said about increasing dimensions
>>>> requiring a bigger ram buffer on merge is wrong. That's
>>>> the point I was trying to make. Your concerns about
>>>> merge costs are not wrong, but your conclusion that we
>>>> need to limit dimensions is not justified.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> You complain that hnsw sucks it doesn't
>>>> scale, but when I show it scales linearly with
>>>> dimension you just ignore that and complain about
>>>> something entirely different.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> You demand that people run all kinds of
>>>> tests to prove you wrong but when they do, you don't
>>>> listen and you won't put in the work yourself or
>>>> complain that it's too hard.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Then you complain about people not meeting
>>>> you half way. Wow
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir
>>>> <rcmuir@gmail.com> wrote:
>>>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>>> >>>>>>>>>>> <michael.wechner@wyona.com> wrote:
>>>> >>>>>>>>>>>> What exactly do you consider reasonable?
>>>> >>>>>>>>>>> Let's begin a real discussion by being
>>>> HONEST about the current
>>>> >>>>>>>>>>> status. Please put politically correct or
>>>> your own company's wishes
>>>> >>>>>>>>>>> aside, we know it's not in a good state.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Current status is the one guy who wrote the
>>>> code can set a
>>>> >>>>>>>>>>> multi-gigabyte ram buffer and index a small
>>>> dataset with 1024
>>>> >>>>>>>>>>> dimensions in HOURS (i didn't ask what
>>>> hardware).
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> My concerns are everyone else except the
>>>> one guy, I want it to be
>>>> >>>>>>>>>>> usable. Increasing dimensions just means
>>>> even bigger multi-gigabyte
>>>> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on
>>>> merge.
>>>> >>>>>>>>>>> It is also a permanent backwards
>>>> compatibility decision, we have to
>>>> >>>>>>>>>>> support it once we do this and we can't
>>>> just say "oops" and flip it
>>>> >>>>>>>>>>> back.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte
>>>> ram buffer is really to
>>>> >>>>>>>>>>> avoid merges because they are so slow and
>>>> it would be DAYS otherwise,
>>>> >>>>>>>>>>> or if its to avoid merges so it doesn't hit
>>>> OOM.
>>>> >>>>>>>>>>> Also from personal experience, it takes
>>>> trial and error (means
>>>> >>>>>>>>>>> experiencing OOM on merge!!!) before you
>>>> get those heap values correct
>>>> >>>>>>>>>>> for your dataset. This usually means
>>>> starting over which is
>>>> >>>>>>>>>>> frustrating and wastes more time.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Jim mentioned some ideas about the memory
>>>> usage in IndexWriter, seems
>>>> >>>>>>>>>>> to me like its a good idea. maybe the
>>>> multigigabyte ram buffer can be
>>>> >>>>>>>>>>> avoided in this way and performance
>>>> improved by writing bigger
>>>> >>>>>>>>>>> segments with lucene's defaults. But this
>>>> doesn't mean we can simply
>>>> >>>>>>>>>>> ignore the horrors of what happens on
>>>> merge. merging needs to scale so
>>>> >>>>>>>>>>> that indexing really scales.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> At least it shouldnt spike RAM on trivial
>>>> data amounts and cause OOM,
>>>> >>>>>>>>>>> and definitely it shouldnt burn hours and
>>>> hours of CPU in O(n^2)
>>>> >>>>>>>>>>> fashion when indexing.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>>>>>>> To unsubscribe, e-mail:
>>>> dev-unsubscribe@lucene.apache.org
>>>> >>>>>>>>>>> For additional commands, e-mail:
>>>> dev-help@lucene.apache.org
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>>>>> To unsubscribe, e-mail:
>>>> dev-unsubscribe@lucene.apache.org
>>>> >>>>>>>>> For additional commands, e-mail:
>>>> dev-help@lucene.apache.org
>>>> >>>>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> --
>>>> >>>>>>> Adrien
>>>> >>>>>>>
>>>> >>>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>>> To unsubscribe, e-mail:
>>>> dev-unsubscribe@lucene.apache.org
>>>> >>>>>>> For additional commands, e-mail:
>>>> dev-help@lucene.apache.org
>>>> >>>>>>>
>>>> >>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>> To unsubscribe, e-mail:
>>>> dev-unsubscribe@lucene.apache.org
>>>> >>>>>> For additional commands, e-mail:
>>>> dev-help@lucene.apache.org
>>>> >>>>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> http://www.needhamsoftware.com (work)
>>>> >>>> http://www.the111shift.com (play)
>>>> >>
>>>> >
>>>> ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> > For additional commands, e-mail:
>>>> dev-help@lucene.apache.org
>>>> >
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

May 9, 2023, 5:11 AM

Post #99 of 99 (920 views)

To proceed in a pragmatic way I opened this new thread: *Dimensions Limit
for KNN vectors - Next Steps* .
This is meant to address the main point in this discussion.

For the following points:

2) [medium task] We all want more benchmarks for Lucene vector-based
search, with a good variety of vector dimensions and encodings

3) [big task? ] Some people would like to improve vector-based search
performance because currently not acceptable, it's not clear when and how

Feel free to create the discussion threads if you believe they are an
immediate priority.

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

On Fri, 21 Apr 2023 at 08:30, Michael Wechner <michael.wechner@wyona.com>
wrote:

> yes, they are, whereas it should help us to test performance and
> scalability :-)
>
> Am 21.04.23 um 09:24 schrieb Ishan Chattopadhyaya:
>
> Seems like they were all 768 dimensions.
>
> On Fri, 21 Apr, 2023, 11:48 am Michael Wechner, <michael.wechner@wyona.com>
> wrote:
>
>> Hi Together
>>
>> Cohere just published approx. 100Mio embeddings based on Wikipedia content
>>
>> https://txt.cohere.com/embedding-archives-wikipedia/
>>
>> resp.
>>
>> https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
>> https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings
>> ....
>>
>> HTH
>>
>> Michael
>>
>>
>>
>> Am 13.04.23 um 07:58 schrieb Michael Wechner:
>>
>> Hi Kent
>>
>> Great, thank you very much!
>>
>> Will download it later today :-)
>>
>> All the best
>>
>> Michael
>>
>> Am 13.04.23 um 01:35 schrieb Kent Fitch:
>>
>> Hi Michael (and anyone else who wants just over 240K "real world" ada-002
>> vectors of dimension 1536),
>> you are welcome to retrieve a tar.gz file which contains:
>> - 47K embeddings of Canberra Times news article text from 1994
>> - 38K embeddings of the first paragraphs of wikipedia articles about
>> organisations
>> - 156.6K embeddings of the first paragraphs of wikipedia articles about
>> people
>>
>>
>> https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing
>>
>> The file is about 1.7GB and will expand to about 4.4GB. This file will be
>> accessible for at least a week, and I hope you dont hit any google drive
>> download limits trying to retrieve it.
>>
>> The embeddings were generated using my openAI account and you are welcome
>> to use them for any purpose you like.
>>
>> best wishes,
>>
>> Kent Fitch
>>
>> On Wed, Apr 12, 2023 at 4:37?PM Michael Wechner <
>> michael.wechner@wyona.com> wrote:
>>
>>> thank you very much for your feedback!
>>>
>>> In a previous post (April 7) you wrote you could make availlable the 47K
>>> ada-002 vectors, which would be great!
>>>
>>> Would it make sense to setup a public gitub repo, such that others could
>>> use or also contribute vectors?
>>>
>>> Thanks
>>>
>>> Michael Wechner
>>>
>>>
>>> Am 12.04.23 um 04:51 schrieb Kent Fitch:
>>>
>>> I only know some characteristics of the openAI ada-002 vectors, although
>>> they are a very popular as embeddings/text-characterisations as they allow
>>> more accurate/"human meaningful" semantic search results with fewer
>>> dimensions than their predecessors - I've evaluated a few different
>>> embedding models, including some BERT variants, CLIP ViT-L-14 (with 768
>>> dims, which was quite good), openAI's ada-001 (1024 dims) and babbage-001
>>> (2048 dims), and ada-002 are qualitatively the best, although that will
>>> certainly change!
>>>
>>> In any case, ada-002 vectors have interesting characteristics that I
>>> think mean you could confidently create synthetic vectors which would be
>>> hard to distinguish from "real" vectors. I found this from looking at 47K
>>> ada-002 vectors generated across a full year (1994) of newspaper articles
>>> from the Canberra Times and 200K wikipedia articles:
>>> - there is no discernible/significant correlation between values in any
>>> pair of dimensions
>>> - all but 5 of the 1536 dimensions have an almost identical distribution
>>> of values shown in the central blob on these graphs (that just show a few
>>> of these 1531 dimensions with clumped values and the 5 "outlier"
>>> dimensions, but all 1531 non-outlier dims are in there, which makes for
>>> some easy quantisation from float to byte if you dont want to go the full
>>> kmeans/clustering/Lloyds-algorithm approach):
>>>
>>> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>>>
>>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>>>
>>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>>> - the variance of the value of each dimension is characteristic:
>>>
>>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>>>
>>> This probably represents something significant about how the ada-002
>>> embeddings are created, but I think it also means creating "realistic"
>>> values is possible. I did not use this information when testing recall &
>>> performance on Lucene's HNSW implementation on 192m documents, as I
>>> slightly dithered the values of a "real" set on 47K docs and stored other
>>> fields in the doc that referenced the "base" document that the dithers were
>>> made from, and used different dithering magnitudes so that I could test
>>> recall with different neighbour sizes ("M"), construction-beamwidth and
>>> search-beamwidths.
>>>
>>> best regards
>>>
>>> Kent Fitch
>>>
>>>
>>>
>>>
>>> On Wed, Apr 12, 2023 at 5:08?AM Michael Wechner <
>>> michael.wechner@wyona.com> wrote:
>>>
>>>> I understand what you mean that it seems to be artificial, but I don't
>>>> understand why this matters to test performance and scalability of the
>>>> indexing?
>>>>
>>>> Let's assume the limit of Lucene would be 4 instead of 1024 and there
>>>> are only open source models generating vectors with 4 dimensions, for
>>>> example
>>>>
>>>>
>>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>>>
>>>>
>>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>>
>>>>
>>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>>>
>>>>
>>>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>>
>>>> and now I concatenate them to vectors with 8 dimensions
>>>>
>>>>
>>>>
>>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>>
>>>>
>>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>>
>>>> and normalize them to length 1.
>>>>
>>>> Why should this be any different to a model which is acting like a
>>>> black
>>>> box generating vectors with 8 dimensions?
>>>>
>>>>
>>>>
>>>>
>>>> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>>>> >> What exactly do you consider real vector data? Vector data which is
>>>> based on texts written by humans?
>>>> > We have plenty of text; the problem is coming up with a realistic
>>>> > vector model that requires as many dimensions as people seem to be
>>>> > demanding. As I said above, after surveying huggingface I couldn't
>>>> > find any text-based model using more than 768 dimensions. So far we
>>>> > have some ideas of generating higher-dimensional data by dithering or
>>>> > concatenating existing data, but it seems artificial.
>>>> >
>>>> > On Tue, Apr 11, 2023 at 9:31?AM Michael Wechner
>>>> > <michael.wechner@wyona.com> wrote:
>>>> >> What exactly do you consider real vector data? Vector data which is
>>>> based on texts written by humans?
>>>> >>
>>>> >> I am asking, because I recently attended the following presentation
>>>> by Anastassia Shaitarova (UZH Institute for Computational Linguistics,
>>>> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>>>> >>
>>>> >> ----
>>>> >>
>>>> >> Can we Identify Machine-Generated Text? An Overview of Current
>>>> Approaches
>>>> >> by Anastassia Shaitarova (UZH Institute for Computational
>>>> Linguistics)
>>>> >>
>>>> >> The detection of machine-generated text has become increasingly
>>>> important due to the prevalence of automated content generation and its
>>>> potential for misuse. In this talk, we will discuss the motivation for
>>>> automatic detection of generated text. We will present the currently
>>>> available methods, including feature-based classification as a “first
>>>> line-of-defense.” We will provide an overview of the detection tools that
>>>> have been made available so far and discuss their limitations. Finally, we
>>>> will reflect on some open problems associated with the automatic
>>>> discrimination of generated texts.
>>>> >>
>>>> >> ----
>>>> >>
>>>> >> and her conclusion was that it has become basically impossible to
>>>> differentiate between text generated by humans and text generated by for
>>>> example ChatGPT.
>>>> >>
>>>> >> Whereas others have a slightly different opinion, see for example
>>>> >>
>>>> >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>>> >>
>>>> >> But I would argue that real world and synthetic have become close
>>>> enough that testing performance and scalability of indexing should be
>>>> possible with synthetic data.
>>>> >>
>>>> >> I completely agree that we have to base our discussions and
>>>> decisions on scientific methods and that we have to make sure that Lucene
>>>> performs and scales well and that we understand the limits and what is
>>>> going on under the hood.
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> Michael W
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>>>> >>
>>>> >> +1 to test on real vector data -- if you test on synthetic data you
>>>> draw synthetic conclusions.
>>>> >>
>>>> >> Can someone post the theoretical performance (CPU and RAM required)
>>>> of HNSW construction? Do we know/believe our HNSW implementation has
>>>> achieved that theoretical big-O performance? Maybe we have some silly
>>>> performance bug that's causing it not to?
>>>> >>
>>>> >> As I understand it, HNSW makes the tradeoff of costly construction
>>>> for faster searching, which is typically the right tradeoff for search use
>>>> cases. We do this in other parts of the Lucene index too.
>>>> >>
>>>> >> Lucene will do a logarithmic number of merges over time, i.e. each
>>>> doc will be merged O(log(N)) times in its lifetime in the index. We need
>>>> to multiply that by the cost of re-building the whole HNSW graph on each
>>>> merge. BTW, other things in Lucene, like BKD/dimensional points, also
>>>> rebuild the whole data structure on each merge, I think? But, as Rob
>>>> pointed out, stored fields merging do indeed do some sneaky tricks to avoid
>>>> excessive block decompress/recompress on each merge.
>>>> >>
>>>> >>> As I understand it, vetoes must have technical merit. I'm not sure
>>>> that this veto rises to "technical merit" on 2 counts:
>>>> >> Actually I think Robert's veto stands on its technical merit
>>>> already. Robert's take on technical matters very much resonate with me,
>>>> even if he is sometimes prickly in how he expresses them ;)
>>>> >>
>>>> >> His point is that we, as a dev community, are not paying enough
>>>> attention to the indexing performance of our KNN algo (HNSW) and
>>>> implementation, and that it is reckless to increase / remove limits in that
>>>> state. It is indeed a one-way door decision and one must confront such
>>>> decisions with caution, especially for such a widely used base
>>>> infrastructure as Lucene. We don't even advertise today in our javadocs
>>>> that you need XXX heap if you index vectors with dimension Y, fanout X,
>>>> levels Z, etc.
>>>> >>
>>>> >> RAM used during merging is unaffected by dimensionality, but is
>>>> affected by fanout, because the HNSW graph (not the raw vectors) is memory
>>>> resident, I think? Maybe we could move it off-heap and let the OS manage
>>>> the memory (and still document the RAM requirements)? Maybe merge RAM
>>>> costs should be accounted for in IW's RAM buffer accounting? It is not
>>>> today, and there are some other things that use non-trivial RAM, e.g. the
>>>> doc mapping (to compress docid space when deletions are reclaimed).
>>>> >>
>>>> >> When we added KNN vector testing to Lucene's nightly benchmarks, the
>>>> indexing time massively increased -- see annotations DH and DP here:
>>>> https://home.apache.org/~mikemccand/lucenebench/indexing.html.
>>>> Nightly benchmarks now start at 6 PM and don't finish until ~14.5 hours
>>>> later. Of course, that is using a single thread for indexing (on a box
>>>> that has 128 cores!) so we produce a deterministic index every night ...
>>>> >>
>>>> >> Stepping out (meta) a bit ... this discussion is precisely one of
>>>> the awesome benefits of the (informed) veto. It means risky changes to the
>>>> software, as determined by any single informed developer on the project,
>>>> can force a healthy discussion about the problem at hand. Robert is
>>>> legitimately concerned about a real issue and so we should use our creative
>>>> energies to characterize our HNSW implementation's performance, document it
>>>> clearly for users, and uncover ways to improve it.
>>>> >>
>>>> >> Mike McCandless
>>>> >>
>>>> >> http://blog.mikemccandless.com
>>>> >>
>>>> >>
>>>> >> On Mon, Apr 10, 2023 at 6:41?PM Alessandro Benedetti <
>>>> a.benedetti@sease.io> wrote:
>>>> >>> I think Gus points are on target.
>>>> >>>
>>>> >>> I recommend we move this forward in this way:
>>>> >>> We stop any discussion and everyone interested proposes an option
>>>> with a motivation, then we aggregate the options and we create a Vote maybe?
>>>> >>>
>>>> >>> I am also on the same page on the fact that a veto should come with
>>>> a clear and reasonable technical merit, which also in my opinion has not
>>>> come yet.
>>>> >>>
>>>> >>> I also apologise if any of my words sounded harsh or personal
>>>> attacks, never meant to do so.
>>>> >>>
>>>> >>> My proposed option:
>>>> >>>
>>>> >>> 1) remove the limit and potentially make it configurable,
>>>> >>> Motivation:
>>>> >>> The system administrator can enforce a limit its users need to
>>>> respect that it's in line with whatever the admin decided to be acceptable
>>>> for them.
>>>> >>> Default can stay the current one.
>>>> >>>
>>>> >>> That's my favourite at the moment, but I agree that potentially in
>>>> the future this may need to change, as we may optimise the data structures
>>>> for certain dimensions. I am a big fan of Yagni (you aren't going to need
>>>> it) so I am ok we'll face a different discussion if that happens in the
>>>> future.
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.heck@gmail.com> wrote:
>>>> >>>> What I see so far:
>>>> >>>>
>>>> >>>> Much positive support for raising the limit
>>>> >>>> Slightly less support for removing it or making it configurable
>>>> >>>> A single veto which argues that a (as yet undefined) performance
>>>> standard must be met before raising the limit
>>>> >>>> Hot tempers (various) making this discussion difficult
>>>> >>>>
>>>> >>>> As I understand it, vetoes must have technical merit. I'm not sure
>>>> that this veto rises to "technical merit" on 2 counts:
>>>> >>>>
>>>> >>>> No standard for the performance is given so it cannot be
>>>> technically met. Without hard criteria it's a moving target.
>>>> >>>> It appears to encode a valuation of the user's time, and that
>>>> valuation is really up to the user. Some users may consider 2hours useless
>>>> and not worth it, and others might happily wait 2 hours. This is not a
>>>> technical decision, it's a business decision regarding the relative value
>>>> of the time invested vs the value of the result. If I can cure cancer by
>>>> indexing for a year, that might be worth it... (hyperbole of course).
>>>> >>>>
>>>> >>>> Things I would consider to have technical merit that I don't hear:
>>>> >>>>
>>>> >>>> Impact on the speed of **other** indexing operations. (devaluation
>>>> of other functionality)
>>>> >>>> Actual scenarios that work when the limit is low and fail when the
>>>> limit is high (new failure on the same data with the limit raised).
>>>> >>>>
>>>> >>>> One thing that might or might not have technical merit
>>>> >>>>
>>>> >>>> If someone feels there is a lack of documentation of the
>>>> costs/performance implications of using large vectors, possibly including
>>>> reproducible benchmarks establishing the scaling behavior (there seems to
>>>> be disagreement on O(n) vs O(n^2)).
>>>> >>>>
>>>> >>>> The users *should* know what they are getting into, but if the
>>>> cost is worth it to them, they should be able to pay it without forking the
>>>> project. If this veto causes a fork that's not good.
>>>> >>>>
>>>> >>>> On Sun, Apr 9, 2023 at 7:55?AM Michael Sokolov <msokolov@gmail.com>
>>>> wrote:
>>>> >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes
>>>> in 100 and 300 dimensional varieties and can easily enough generate large
>>>> numbers of vector documents from the articles data. To go higher we could
>>>> concatenate vectors from that and I believe the performance numbers would
>>>> be plausible.
>>>> >>>>>
>>>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com>
>>>> wrote:
>>>> >>>>>> Can we set up a branch in which the limit is bumped to 2048,
>>>> then have
>>>> >>>>>> a realistic, free data set (wikipedia sample or something) that
>>>> has,
>>>> >>>>>> say, 5 million docs and vectors created using public data (glove
>>>> >>>>>> pre-trained embeddings or the like)? We then could run indexing
>>>> on the
>>>> >>>>>> same hardware with 512, 1024 and 2048 and see what the numbers,
>>>> limits
>>>> >>>>>> and behavior actually are.
>>>> >>>>>>
>>>> >>>>>> I can help in writing this but not until after Easter.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Dawid
>>>> >>>>>>
>>>> >>>>>> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com>
>>>> wrote:
>>>> >>>>>>> As Dawid pointed out earlier on this thread, this is the rule
>>>> for
>>>> >>>>>>> Apache projects: a single -1 vote on a code change is a veto and
>>>> >>>>>>> cannot be overridden. Furthermore, Robert is one of the people
>>>> on this
>>>> >>>>>>> project who worked the most on debugging subtle bugs, making
>>>> Lucene
>>>> >>>>>>> more robust and improving our test framework, so I'm listening
>>>> when he
>>>> >>>>>>> voices quality concerns.
>>>> >>>>>>>
>>>> >>>>>>> The argument against removing/raising the limit that resonates
>>>> with me
>>>> >>>>>>> the most is that it is a one-way door. As MikeS highlighted
>>>> earlier on
>>>> >>>>>>> this thread, implementations may want to take advantage of the
>>>> fact
>>>> >>>>>>> that there is a limit at some point too. This is why I don't
>>>> want to
>>>> >>>>>>> remove the limit and would prefer a slight increase, such as
>>>> 2048 as
>>>> >>>>>>> suggested in the original issue, which would enable most of the
>>>> things
>>>> >>>>>>> that users who have been asking about raising the limit would
>>>> like to
>>>> >>>>>>> do.
>>>> >>>>>>>
>>>> >>>>>>> I agree that the merge-time memory usage and slow indexing rate
>>>> are
>>>> >>>>>>> not great. But it's still possible to index multi-million vector
>>>> >>>>>>> datasets with a 4GB heap without hitting OOMEs regardless of the
>>>> >>>>>>> number of dimensions, and the feedback I'm seeing is that many
>>>> users
>>>> >>>>>>> are still interested in indexing multi-million vector datasets
>>>> despite
>>>> >>>>>>> the slow indexing rate. I wish we could do better, and vector
>>>> indexing
>>>> >>>>>>> is certainly more expert than text indexing, but it still is
>>>> usable in
>>>> >>>>>>> my opinion. I understand how giving Lucene more information
>>>> about
>>>> >>>>>>> vectors prior to indexing (e.g. clustering information as Jim
>>>> pointed
>>>> >>>>>>> out) could help make merging faster and more memory-efficient,
>>>> but I
>>>> >>>>>>> would really like to avoid making it a requirement for indexing
>>>> >>>>>>> vectors as it also makes this feature much harder to use.
>>>> >>>>>>>
>>>> >>>>>>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>>>> >>>>>>> <a.benedetti@sease.io> wrote:
>>>> >>>>>>>> I am very attentive to listen opinions but I am un-convinced
>>>> here and I an not sure that a single person opinion should be allowed to be
>>>> detrimental for such an important project.
>>>> >>>>>>>>
>>>> >>>>>>>> The limit as far as I know is literally just raising an
>>>> exception.
>>>> >>>>>>>> Removing it won't alter in any way the current performance for
>>>> users in low dimensional space.
>>>> >>>>>>>> Removing it will just enable more users to use Lucene.
>>>> >>>>>>>>
>>>> >>>>>>>> If new users in certain situations will be unhappy with the
>>>> performance, they may contribute improvements.
>>>> >>>>>>>> This is how you make progress.
>>>> >>>>>>>>
>>>> >>>>>>>> If it's a reputation thing, trust me that not allowing users
>>>> to play with high dimensional space will equally damage it.
>>>> >>>>>>>>
>>>> >>>>>>>> To me it's really a no brainer.
>>>> >>>>>>>> Removing the limit and enable people to use high dimensional
>>>> vectors will take minutes.
>>>> >>>>>>>> Improving the hnsw implementation can take months.
>>>> >>>>>>>> Pick one to begin with...
>>>> >>>>>>>>
>>>> >>>>>>>> And there's no-one paying me here, no company interest
>>>> whatsoever, actually I pay people to contribute, I am just convinced it's a
>>>> good idea.
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com>
>>>> wrote:
>>>> >>>>>>>>> I disagree with your categorization. I put in plenty of work
>>>> and
>>>> >>>>>>>>> experienced plenty of pain myself, writing tests and fighting
>>>> these
>>>> >>>>>>>>> issues, after i saw that, two releases in a row, vector
>>>> indexing fell
>>>> >>>>>>>>> over and hit integer overflows etc on small datasets:
>>>> >>>>>>>>>
>>>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>>>> >>>>>>>>>
>>>> >>>>>>>>> Attacking me isn't helping the situation.
>>>> >>>>>>>>>
>>>> >>>>>>>>> PS: when i said the "one guy who wrote the code" I didn't
>>>> mean it in
>>>> >>>>>>>>> any kind of demeaning fashion really. I meant to describe the
>>>> current
>>>> >>>>>>>>> state of usability with respect to indexing a few million
>>>> docs with
>>>> >>>>>>>>> high dimensions. You can scroll up the thread and see that at
>>>> least
>>>> >>>>>>>>> one other committer on the project experienced similar pain
>>>> as me.
>>>> >>>>>>>>> Then, think about users who aren't committers trying to use
>>>> the
>>>> >>>>>>>>> functionality!
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
>>>> msokolov@gmail.com> wrote:
>>>> >>>>>>>>>> What you said about increasing dimensions requiring a bigger
>>>> ram buffer on merge is wrong. That's the point I was trying to make. Your
>>>> concerns about merge costs are not wrong, but your conclusion that we need
>>>> to limit dimensions is not justified.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> You complain that hnsw sucks it doesn't scale, but when I
>>>> show it scales linearly with dimension you just ignore that and complain
>>>> about something entirely different.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> You demand that people run all kinds of tests to prove you
>>>> wrong but when they do, you don't listen and you won't put in the work
>>>> yourself or complain that it's too hard.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Then you complain about people not meeting you half way. Wow
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
>>>> wrote:
>>>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>>> >>>>>>>>>>> <michael.wechner@wyona.com> wrote:
>>>> >>>>>>>>>>>> What exactly do you consider reasonable?
>>>> >>>>>>>>>>> Let's begin a real discussion by being HONEST about the
>>>> current
>>>> >>>>>>>>>>> status. Please put politically correct or your own
>>>> company's wishes
>>>> >>>>>>>>>>> aside, we know it's not in a good state.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Current status is the one guy who wrote the code can set a
>>>> >>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset with
>>>> 1024
>>>> >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> My concerns are everyone else except the one guy, I want it
>>>> to be
>>>> >>>>>>>>>>> usable. Increasing dimensions just means even bigger
>>>> multi-gigabyte
>>>> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>>>> >>>>>>>>>>> It is also a permanent backwards compatibility decision, we
>>>> have to
>>>> >>>>>>>>>>> support it once we do this and we can't just say "oops" and
>>>> flip it
>>>> >>>>>>>>>>> back.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer is
>>>> really to
>>>> >>>>>>>>>>> avoid merges because they are so slow and it would be DAYS
>>>> otherwise,
>>>> >>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM.
>>>> >>>>>>>>>>> Also from personal experience, it takes trial and error
>>>> (means
>>>> >>>>>>>>>>> experiencing OOM on merge!!!) before you get those heap
>>>> values correct
>>>> >>>>>>>>>>> for your dataset. This usually means starting over which is
>>>> >>>>>>>>>>> frustrating and wastes more time.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in
>>>> IndexWriter, seems
>>>> >>>>>>>>>>> to me like its a good idea. maybe the multigigabyte ram
>>>> buffer can be
>>>> >>>>>>>>>>> avoided in this way and performance improved by writing
>>>> bigger
>>>> >>>>>>>>>>> segments with lucene's defaults. But this doesn't mean we
>>>> can simply
>>>> >>>>>>>>>>> ignore the horrors of what happens on merge. merging needs
>>>> to scale so
>>>> >>>>>>>>>>> that indexing really scales.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> At least it shouldnt spike RAM on trivial data amounts and
>>>> cause OOM,
>>>> >>>>>>>>>>> and definitely it shouldnt burn hours and hours of CPU in
>>>> O(n^2)
>>>> >>>>>>>>>>> fashion when indexing.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> >>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> >>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>> >>>>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> --
>>>> >>>>>>> Adrien
>>>> >>>>>>>
>>>> >>>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>> >>>>>>>
>>>> >>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>> >>>>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> http://www.needhamsoftware.com (work)
>>>> >>>> http://www.the111shift.com (play)
>>>> >>
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>>> >
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>
>>>
>>
>>
>