Mailing List Archive: [Proposal] Remove max number of dimensions for KNN vectors

[Proposal] Remove max number of dimensions for KNN vectors

a.benedetti at sease

Mar 31, 2023, 2:43 AM

Post #1 of 99 (675 views)

I've been monitoring various discussions on Pull Requests about changing
the max number of dimensions allowed for Lucene HNSW vectors:

https://github.com/apache/lucene/pull/12191

https://github.com/apache/lucene/issues/11507

I would like to set up a discussion and potentially a vote about this.

I have seen some strong opposition from a few people but a majority of
favor in this direction.

*Motivation*

We were discussing in the Solr slack channel with Ishan Chattopadhyaya, Marcus
Eagan, and David Smiley about some neural search integrations in Solr:
https://github.com/openai/chatgpt-retrieval-plugin

*Proposal*

No hard limit at all.

As for many other Lucene areas, users will be allowed to push the system to
the limit of their resources and get terrible performances or crashes if
they want.

*What we are NOT discussing*

- Quality and scalability of the HNSW algorithm

- dimensionality reduction

- strategies to fit in an arbitrary self-imposed limit

*Benefits*

- users can use the models they want to generate vectors

- removal of an arbitrary limit that blocks some integrations

*Cons*

- if you go for vectors with high dimensions, there's no guarantee you get
acceptable performance for your use case

I want to keep it simple, right now in many Lucene areas, you can push the
system to not acceptable performance/ crashes.

For example, we don't limit the number of docs per index to an arbitrary
maximum of N, you push how many docs you like and if they are too much for
your system, you get terrible performance/crashes/whatever.

Limits caused by primitive java types will stay there behind the scene, and
that's acceptable, but I would prefer to not have arbitrary hard-coded ones
that may limit the software usability and integration which is extremely
important for a library.

I strongly encourage people to add benefits and cons, that I missed (I am
sure I missed some of them, but wanted to keep it simple)

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Mar 31, 2023, 5:38 AM

Post #2 of 99 (675 views)

Thanks Alessandro for summarizing the discussion below!

I understand that there is no clear reasoning re what is the best
embedding size, whereas I think heuristic approaches like described by
the following link can be helpful

https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter

Having said this, we see various embedding services providing higher
dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.

And it would be great if we could run benchmarks without having to
recompile Lucene ourselves.

Therefore I would to suggest to either increase the limit or even better
to remove the limit and add a disclaimer, that people should be aware of
possible crashes etc.

Thanks

Michael

Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>
> I've been monitoring various discussions on Pull Requests about
> changing the max number of dimensions allowed for Lucene HNSW vectors:
>
> https://github.com/apache/lucene/pull/12191
>
> https://github.com/apache/lucene/issues/11507
>
>
> I would like to set up a discussion and potentially a vote about this.
>
> I have seen some strong opposition from a few people but a majority of
> favor in this direction.
>
>
> *Motivation*
>
> We were discussing in the Solr slack channel with Ishan
> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural
> search integrations in Solr:
> https://github.com/openai/chatgpt-retrieval-plugin
>
>
> *Proposal*
>
> No hard limit at all.
>
> As for many other Lucene areas, users will be allowed to push the
> system to the limit of their resources and get terrible performances
> or crashes if they want.
>
>
> *What we are NOT discussing*
>
> - Quality and scalability of the HNSW algorithm
>
> - dimensionality reduction
>
> - strategies to fit in an arbitrary self-imposed limit
>
>
> *Benefits*
>
> - users can use the models they want to generate vectors
>
> - removal of an arbitrary limit that blocks some integrations
>
>
> *Cons*
>
> ?- if you go for vectors with high dimensions, there's no guarantee
> you get acceptable performance for your use case
>
> *
> *
>
> *
> *
>
> I want to keep it simple, right now in many Lucene areas, you can push
> the system to not acceptable performance/ crashes.
>
> For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they
> are too much for your system, you get terrible
> performance/crashes/whatever.
>
>
> Limits caused by primitive java types will stay there behind the
> scene, and that's acceptable, but I would prefer to not have arbitrary
> hard-coded ones that may limit the software usability and integration
> which is extremely important for a library.
>
>
> I strongly encourage people to add benefits and cons, that I missed (I
> am sure I missed some of them, but wanted to keep it simple)
>
>
> Cheers
>
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease*?- Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd>?| Twitter
> <https://twitter.com/seaseltd>?| Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ>?| Github
> <https://github.com/seaseltd>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jpountz at gmail

Mar 31, 2023, 5:45 AM

Post #3 of 99 (675 views)

I'm supportive of bumping the limit on the maximum dimension for
vectors to something that is above what the majority of users need,
but I'd like to keep a limit. We have limits for other things like the
max number of docs per index, the max term length, the max number of
dimensions of points, etc. and there are a few things that we don't
have limits on that I wish we had limits on. These limits allow us to
better tune our data structures, prevent overflows, help ensure we
have good test coverage, etc.

That said, these other limits we have in place are quite high. E.g.
the 32kB term limit, nobody would ever type a 32kB term in a text box.
Likewise for the max of 8 dimensions for points: a segment cannot
possibly have 2 splits per dimension on average if it doesn't have
512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
than 8 would likely defeat the point of indexing. In contrast, our
limit on the number of dimensions of vectors seems to be under what
some users would like, and while I understand the performance argument
against bumping the limit, it doesn't feel to me like something that
would be so bad that we need to prevent users from using numbers of
dimensions in the low thousands, e.g. top-k KNN searches would still
look at a very small subset of the full dataset.

So overall, my vote would be to bump the limit to 2048 as suggested by
Mayya on the issue that you linked.

On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> Thanks Alessandro for summarizing the discussion below!
>
> I understand that there is no clear reasoning re what is the best embedding size, whereas I think heuristic approaches like described by the following link can be helpful
>
> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>
> Having said this, we see various embedding services providing higher dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>
> And it would be great if we could run benchmarks without having to recompile Lucene ourselves.
>
> Therefore I would to suggest to either increase the limit or even better to remove the limit and add a disclaimer, that people should be aware of possible crashes etc.
>
> Thanks
>
> Michael
>
>
>
>
> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>
>
> I've been monitoring various discussions on Pull Requests about changing the max number of dimensions allowed for Lucene HNSW vectors:
>
> https://github.com/apache/lucene/pull/12191
>
> https://github.com/apache/lucene/issues/11507
>
>
> I would like to set up a discussion and potentially a vote about this.
>
> I have seen some strong opposition from a few people but a majority of favor in this direction.
>
>
> Motivation
>
> We were discussing in the Solr slack channel with Ishan Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>
>
> Proposal
>
> No hard limit at all.
>
> As for many other Lucene areas, users will be allowed to push the system to the limit of their resources and get terrible performances or crashes if they want.
>
>
> What we are NOT discussing
>
> - Quality and scalability of the HNSW algorithm
>
> - dimensionality reduction
>
> - strategies to fit in an arbitrary self-imposed limit
>
>
> Benefits
>
> - users can use the models they want to generate vectors
>
> - removal of an arbitrary limit that blocks some integrations
>
>
> Cons
>
> - if you go for vectors with high dimensions, there's no guarantee you get acceptable performance for your use case
>
>
>
> I want to keep it simple, right now in many Lucene areas, you can push the system to not acceptable performance/ crashes.
>
> For example, we don't limit the number of docs per index to an arbitrary maximum of N, you push how many docs you like and if they are too much for your system, you get terrible performance/crashes/whatever.
>
>
> Limits caused by primitive java types will stay there behind the scene, and that's acceptable, but I would prefer to not have arbitrary hard-coded ones that may limit the software usability and integration which is extremely important for a library.
>
>
> I strongly encourage people to add benefits and cons, that I missed (I am sure I missed some of them, but wanted to keep it simple)
>
>
> Cheers
>
> --------------------------
> Alessandro Benedetti
> Director @ Sease Ltd.
> Apache Lucene/Solr Committer
> Apache Solr PMC Member
>
> e-mail: a.benedetti@sease.io
>
>
> Sease - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io
> LinkedIn | Twitter | Youtube | Github
>
>

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Mar 31, 2023, 7:12 AM

Post #4 of 99 (675 views)

OpenAI reduced their size to 1536 dimensions

https://openai.com/blog/new-and-improved-embedding-model

so 2048 would work :-)

but other services do provide also higher dimensions with sometimes
slightly better accuracy

Thanks

Michael

Am 31.03.23 um 14:45 schrieb Adrien Grand:
> I'm supportive of bumping the limit on the maximum dimension for
> vectors to something that is above what the majority of users need,
> but I'd like to keep a limit. We have limits for other things like the
> max number of docs per index, the max term length, the max number of
> dimensions of points, etc. and there are a few things that we don't
> have limits on that I wish we had limits on. These limits allow us to
> better tune our data structures, prevent overflows, help ensure we
> have good test coverage, etc.
>
> That said, these other limits we have in place are quite high. E.g.
> the 32kB term limit, nobody would ever type a 32kB term in a text box.
> Likewise for the max of 8 dimensions for points: a segment cannot
> possibly have 2 splits per dimension on average if it doesn't have
> 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
> than 8 would likely defeat the point of indexing. In contrast, our
> limit on the number of dimensions of vectors seems to be under what
> some users would like, and while I understand the performance argument
> against bumping the limit, it doesn't feel to me like something that
> would be so bad that we need to prevent users from using numbers of
> dimensions in the low thousands, e.g. top-k KNN searches would still
> look at a very small subset of the full dataset.
>
> So overall, my vote would be to bump the limit to 2048 as suggested by
> Mayya on the issue that you linked.
>
> On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>> Thanks Alessandro for summarizing the discussion below!
>>
>> I understand that there is no clear reasoning re what is the best embedding size, whereas I think heuristic approaches like described by the following link can be helpful
>>
>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>
>> Having said this, we see various embedding services providing higher dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>>
>> And it would be great if we could run benchmarks without having to recompile Lucene ourselves.
>>
>> Therefore I would to suggest to either increase the limit or even better to remove the limit and add a disclaimer, that people should be aware of possible crashes etc.
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>>
>> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>
>>
>> I've been monitoring various discussions on Pull Requests about changing the max number of dimensions allowed for Lucene HNSW vectors:
>>
>> https://github.com/apache/lucene/pull/12191
>>
>> https://github.com/apache/lucene/issues/11507
>>
>>
>> I would like to set up a discussion and potentially a vote about this.
>>
>> I have seen some strong opposition from a few people but a majority of favor in this direction.
>>
>>
>> Motivation
>>
>> We were discussing in the Solr slack channel with Ishan Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>>
>>
>> Proposal
>>
>> No hard limit at all.
>>
>> As for many other Lucene areas, users will be allowed to push the system to the limit of their resources and get terrible performances or crashes if they want.
>>
>>
>> What we are NOT discussing
>>
>> - Quality and scalability of the HNSW algorithm
>>
>> - dimensionality reduction
>>
>> - strategies to fit in an arbitrary self-imposed limit
>>
>>
>> Benefits
>>
>> - users can use the models they want to generate vectors
>>
>> - removal of an arbitrary limit that blocks some integrations
>>
>>
>> Cons
>>
>> - if you go for vectors with high dimensions, there's no guarantee you get acceptable performance for your use case
>>
>>
>>
>> I want to keep it simple, right now in many Lucene areas, you can push the system to not acceptable performance/ crashes.
>>
>> For example, we don't limit the number of docs per index to an arbitrary maximum of N, you push how many docs you like and if they are too much for your system, you get terrible performance/crashes/whatever.
>>
>>
>> Limits caused by primitive java types will stay there behind the scene, and that's acceptable, but I would prefer to not have arbitrary hard-coded ones that may limit the software usability and integration which is extremely important for a library.
>>
>>
>> I strongly encourage people to add benefits and cons, that I missed (I am sure I missed some of them, but wanted to keep it simple)
>>
>>
>> Cheers
>>
>> --------------------------
>> Alessandro Benedetti
>> Director @ Sease Ltd.
>> Apache Lucene/Solr Committer
>> Apache Solr PMC Member
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> Sease - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io
>> LinkedIn | Twitter | Youtube | Github
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Mar 31, 2023, 8:56 AM

Post #5 of 99 (675 views)

I am also curious what would be the worst-case scenario if we remove the
constant at all (so automatically the limit becomes the Java
Integer.MAX_VALUE).
i.e.
right now if you exceed the limit you get:

> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
> throw new IllegalArgumentException(
> "cannot index vectors with dimension greater than " + ByteVectorValues.
> MAX_DIMENSIONS);
> }

in relation to:

> These limits allow us to
> better tune our data structures, prevent overflows, help ensure we
> have good test coverage, etc.

I agree 100% especially for typing stuff properly and avoiding resource
waste here and there, but I am not entirely sure this is the case for the
current implementation i.e. do we have optimizations in place that assume
the max dimension to be 1024?
If I missed that (and I likely have), I of course suggest the contribution
should not just blindly remove the limit, but do it appropriately.
I am not in favor of just doubling it as suggested by some people, I would
ideally prefer a solution that remains there to a decent extent, rather
than having to modifying it anytime someone requires a higher limit.

Cheers

--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wechner@wyona.com>
wrote:

> OpenAI reduced their size to 1536 dimensions
>
> https://openai.com/blog/new-and-improved-embedding-model
>
> so 2048 would work :-)
>
> but other services do provide also higher dimensions with sometimes
> slightly better accuracy
>
> Thanks
>
> Michael
>
>
> Am 31.03.23 um 14:45 schrieb Adrien Grand:
> > I'm supportive of bumping the limit on the maximum dimension for
> > vectors to something that is above what the majority of users need,
> > but I'd like to keep a limit. We have limits for other things like the
> > max number of docs per index, the max term length, the max number of
> > dimensions of points, etc. and there are a few things that we don't
> > have limits on that I wish we had limits on. These limits allow us to
> > better tune our data structures, prevent overflows, help ensure we
> > have good test coverage, etc.
> >
> > That said, these other limits we have in place are quite high. E.g.
> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
> > Likewise for the max of 8 dimensions for points: a segment cannot
> > possibly have 2 splits per dimension on average if it doesn't have
> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
> > than 8 would likely defeat the point of indexing. In contrast, our
> > limit on the number of dimensions of vectors seems to be under what
> > some users would like, and while I understand the performance argument
> > against bumping the limit, it doesn't feel to me like something that
> > would be so bad that we need to prevent users from using numbers of
> > dimensions in the low thousands, e.g. top-k KNN searches would still
> > look at a very small subset of the full dataset.
> >
> > So overall, my vote would be to bump the limit to 2048 as suggested by
> > Mayya on the issue that you linked.
> >
> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
> > <michael.wechner@wyona.com> wrote:
> >> Thanks Alessandro for summarizing the discussion below!
> >>
> >> I understand that there is no clear reasoning re what is the best
> embedding size, whereas I think heuristic approaches like described by the
> following link can be helpful
> >>
> >>
> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
> >>
> >> Having said this, we see various embedding services providing higher
> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
> >>
> >> And it would be great if we could run benchmarks without having to
> recompile Lucene ourselves.
> >>
> >> Therefore I would to suggest to either increase the limit or even
> better to remove the limit and add a disclaimer, that people should be
> aware of possible crashes etc.
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >>
> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
> >>
> >>
> >> I've been monitoring various discussions on Pull Requests about
> changing the max number of dimensions allowed for Lucene HNSW vectors:
> >>
> >> https://github.com/apache/lucene/pull/12191
> >>
> >> https://github.com/apache/lucene/issues/11507
> >>
> >>
> >> I would like to set up a discussion and potentially a vote about this.
> >>
> >> I have seen some strong opposition from a few people but a majority of
> favor in this direction.
> >>
> >>
> >> Motivation
> >>
> >> We were discussing in the Solr slack channel with Ishan Chattopadhyaya,
> Marcus Eagan, and David Smiley about some neural search integrations in
> Solr: https://github.com/openai/chatgpt-retrieval-plugin
> >>
> >>
> >> Proposal
> >>
> >> No hard limit at all.
> >>
> >> As for many other Lucene areas, users will be allowed to push the
> system to the limit of their resources and get terrible performances or
> crashes if they want.
> >>
> >>
> >> What we are NOT discussing
> >>
> >> - Quality and scalability of the HNSW algorithm
> >>
> >> - dimensionality reduction
> >>
> >> - strategies to fit in an arbitrary self-imposed limit
> >>
> >>
> >> Benefits
> >>
> >> - users can use the models they want to generate vectors
> >>
> >> - removal of an arbitrary limit that blocks some integrations
> >>
> >>
> >> Cons
> >>
> >> - if you go for vectors with high dimensions, there's no guarantee
> you get acceptable performance for your use case
> >>
> >>
> >>
> >> I want to keep it simple, right now in many Lucene areas, you can push
> the system to not acceptable performance/ crashes.
> >>
> >> For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they are too
> much for your system, you get terrible performance/crashes/whatever.
> >>
> >>
> >> Limits caused by primitive java types will stay there behind the scene,
> and that's acceptable, but I would prefer to not have arbitrary hard-coded
> ones that may limit the software usability and integration which is
> extremely important for a library.
> >>
> >>
> >> I strongly encourage people to add benefits and cons, that I missed (I
> am sure I missed some of them, but wanted to keep it simple)
> >>
> >>
> >> Cheers
> >>
> >> --------------------------
> >> Alessandro Benedetti
> >> Director @ Sease Ltd.
> >> Apache Lucene/Solr Committer
> >> Apache Solr PMC Member
> >>
> >> e-mail: a.benedetti@sease.io
> >>
> >>
> >> Sease - Information Retrieval Applied
> >> Consulting | Training | Open Source
> >>
> >> Website: Sease.io
> >> LinkedIn | Twitter | Youtube | Github
> >>
> >>
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 1, 2023, 5:47 AM

Post #6 of 99 (675 views)

I'm also in favor of raising this limit. We do see some datasets with
higher than 1024 dims. I also think we need to keep a limit. For example we
currently need to keep all the vectors in RAM while indexing and we want to
be able to support reasonable numbers of vectors in an index segment. Also
we don't know what innovations might come down the road. Maybe someday we
want to do product quantization and enforce that (k, m) both fit in a byte
-- we wouldn't be able to do that if a vector's dimension were to exceed
32K.

On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> I am also curious what would be the worst-case scenario if we remove the
> constant at all (so automatically the limit becomes the Java
> Integer.MAX_VALUE).
> i.e.
> right now if you exceed the limit you get:
>
>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>> throw new IllegalArgumentException(
>> "cannot index vectors with dimension greater than " + ByteVectorValues.
>> MAX_DIMENSIONS);
>> }
>
>
> in relation to:
>
>> These limits allow us to
>> better tune our data structures, prevent overflows, help ensure we
>> have good test coverage, etc.
>
>
> I agree 100% especially for typing stuff properly and avoiding resource
> waste here and there, but I am not entirely sure this is the case for the
> current implementation i.e. do we have optimizations in place that assume
> the max dimension to be 1024?
> If I missed that (and I likely have), I of course suggest the contribution
> should not just blindly remove the limit, but do it appropriately.
> I am not in favor of just doubling it as suggested by some people, I would
> ideally prefer a solution that remains there to a decent extent, rather
> than having to modifying it anytime someone requires a higher limit.
>
> Cheers
>
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wechner@wyona.com>
> wrote:
>
>> OpenAI reduced their size to 1536 dimensions
>>
>> https://openai.com/blog/new-and-improved-embedding-model
>>
>> so 2048 would work :-)
>>
>> but other services do provide also higher dimensions with sometimes
>> slightly better accuracy
>>
>> Thanks
>>
>> Michael
>>
>>
>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>> > I'm supportive of bumping the limit on the maximum dimension for
>> > vectors to something that is above what the majority of users need,
>> > but I'd like to keep a limit. We have limits for other things like the
>> > max number of docs per index, the max term length, the max number of
>> > dimensions of points, etc. and there are a few things that we don't
>> > have limits on that I wish we had limits on. These limits allow us to
>> > better tune our data structures, prevent overflows, help ensure we
>> > have good test coverage, etc.
>> >
>> > That said, these other limits we have in place are quite high. E.g.
>> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
>> > Likewise for the max of 8 dimensions for points: a segment cannot
>> > possibly have 2 splits per dimension on average if it doesn't have
>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>> > than 8 would likely defeat the point of indexing. In contrast, our
>> > limit on the number of dimensions of vectors seems to be under what
>> > some users would like, and while I understand the performance argument
>> > against bumping the limit, it doesn't feel to me like something that
>> > would be so bad that we need to prevent users from using numbers of
>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>> > look at a very small subset of the full dataset.
>> >
>> > So overall, my vote would be to bump the limit to 2048 as suggested by
>> > Mayya on the issue that you linked.
>> >
>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>> > <michael.wechner@wyona.com> wrote:
>> >> Thanks Alessandro for summarizing the discussion below!
>> >>
>> >> I understand that there is no clear reasoning re what is the best
>> embedding size, whereas I think heuristic approaches like described by the
>> following link can be helpful
>> >>
>> >>
>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>> >>
>> >> Having said this, we see various embedding services providing higher
>> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>> >>
>> >> And it would be great if we could run benchmarks without having to
>> recompile Lucene ourselves.
>> >>
>> >> Therefore I would to suggest to either increase the limit or even
>> better to remove the limit and add a disclaimer, that people should be
>> aware of possible crashes etc.
>> >>
>> >> Thanks
>> >>
>> >> Michael
>> >>
>> >>
>> >>
>> >>
>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>> >>
>> >>
>> >> I've been monitoring various discussions on Pull Requests about
>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>> >>
>> >> https://github.com/apache/lucene/pull/12191
>> >>
>> >> https://github.com/apache/lucene/issues/11507
>> >>
>> >>
>> >> I would like to set up a discussion and potentially a vote about this.
>> >>
>> >> I have seen some strong opposition from a few people but a majority of
>> favor in this direction.
>> >>
>> >>
>> >> Motivation
>> >>
>> >> We were discussing in the Solr slack channel with Ishan
>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>> integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>> >>
>> >>
>> >> Proposal
>> >>
>> >> No hard limit at all.
>> >>
>> >> As for many other Lucene areas, users will be allowed to push the
>> system to the limit of their resources and get terrible performances or
>> crashes if they want.
>> >>
>> >>
>> >> What we are NOT discussing
>> >>
>> >> - Quality and scalability of the HNSW algorithm
>> >>
>> >> - dimensionality reduction
>> >>
>> >> - strategies to fit in an arbitrary self-imposed limit
>> >>
>> >>
>> >> Benefits
>> >>
>> >> - users can use the models they want to generate vectors
>> >>
>> >> - removal of an arbitrary limit that blocks some integrations
>> >>
>> >>
>> >> Cons
>> >>
>> >> - if you go for vectors with high dimensions, there's no guarantee
>> you get acceptable performance for your use case
>> >>
>> >>
>> >>
>> >> I want to keep it simple, right now in many Lucene areas, you can push
>> the system to not acceptable performance/ crashes.
>> >>
>> >> For example, we don't limit the number of docs per index to an
>> arbitrary maximum of N, you push how many docs you like and if they are too
>> much for your system, you get terrible performance/crashes/whatever.
>> >>
>> >>
>> >> Limits caused by primitive java types will stay there behind the
>> scene, and that's acceptable, but I would prefer to not have arbitrary
>> hard-coded ones that may limit the software usability and integration which
>> is extremely important for a library.
>> >>
>> >>
>> >> I strongly encourage people to add benefits and cons, that I missed (I
>> am sure I missed some of them, but wanted to keep it simple)
>> >>
>> >>
>> >> Cheers
>> >>
>> >> --------------------------
>> >> Alessandro Benedetti
>> >> Director @ Sease Ltd.
>> >> Apache Lucene/Solr Committer
>> >> Apache Solr PMC Member
>> >>
>> >> e-mail: a.benedetti@sease.io
>> >>
>> >>
>> >> Sease - Information Retrieval Applied
>> >> Consulting | Training | Open Source
>> >>
>> >> Website: Sease.io
>> >> LinkedIn | Twitter | Youtube | Github
>> >>
>> >>
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

ichattopadhyaya at gmail

Apr 1, 2023, 7:01 AM

Post #7 of 99 (675 views)

+1 to raising the limit. Maybe in future performance problems can be
mitigated with optimisations or hardware acceleration (GPUs) etc.

On Sat, 1 Apr, 2023, 6:18 pm Michael Sokolov, <msokolov@gmail.com> wrote:

> I'm also in favor of raising this limit. We do see some datasets with
> higher than 1024 dims. I also think we need to keep a limit. For example we
> currently need to keep all the vectors in RAM while indexing and we want to
> be able to support reasonable numbers of vectors in an index segment. Also
> we don't know what innovations might come down the road. Maybe someday we
> want to do product quantization and enforce that (k, m) both fit in a byte
> -- we wouldn't be able to do that if a vector's dimension were to exceed
> 32K.
>
> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <
> a.benedetti@sease.io> wrote:
>
>> I am also curious what would be the worst-case scenario if we remove the
>> constant at all (so automatically the limit becomes the Java
>> Integer.MAX_VALUE).
>> i.e.
>> right now if you exceed the limit you get:
>>
>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>> throw new IllegalArgumentException(
>>> "cannot index vectors with dimension greater than " + ByteVectorValues.
>>> MAX_DIMENSIONS);
>>> }
>>
>>
>> in relation to:
>>
>>> These limits allow us to
>>> better tune our data structures, prevent overflows, help ensure we
>>> have good test coverage, etc.
>>
>>
>> I agree 100% especially for typing stuff properly and avoiding resource
>> waste here and there, but I am not entirely sure this is the case for the
>> current implementation i.e. do we have optimizations in place that assume
>> the max dimension to be 1024?
>> If I missed that (and I likely have), I of course suggest the
>> contribution should not just blindly remove the limit, but do it
>> appropriately.
>> I am not in favor of just doubling it as suggested by some people, I
>> would ideally prefer a solution that remains there to a decent extent,
>> rather than having to modifying it anytime someone requires a higher limit.
>>
>> Cheers
>>
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wechner@wyona.com>
>> wrote:
>>
>>> OpenAI reduced their size to 1536 dimensions
>>>
>>> https://openai.com/blog/new-and-improved-embedding-model
>>>
>>> so 2048 would work :-)
>>>
>>> but other services do provide also higher dimensions with sometimes
>>> slightly better accuracy
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>> > I'm supportive of bumping the limit on the maximum dimension for
>>> > vectors to something that is above what the majority of users need,
>>> > but I'd like to keep a limit. We have limits for other things like the
>>> > max number of docs per index, the max term length, the max number of
>>> > dimensions of points, etc. and there are a few things that we don't
>>> > have limits on that I wish we had limits on. These limits allow us to
>>> > better tune our data structures, prevent overflows, help ensure we
>>> > have good test coverage, etc.
>>> >
>>> > That said, these other limits we have in place are quite high. E.g.
>>> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>> > possibly have 2 splits per dimension on average if it doesn't have
>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>> > limit on the number of dimensions of vectors seems to be under what
>>> > some users would like, and while I understand the performance argument
>>> > against bumping the limit, it doesn't feel to me like something that
>>> > would be so bad that we need to prevent users from using numbers of
>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>> > look at a very small subset of the full dataset.
>>> >
>>> > So overall, my vote would be to bump the limit to 2048 as suggested by
>>> > Mayya on the issue that you linked.
>>> >
>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >> Thanks Alessandro for summarizing the discussion below!
>>> >>
>>> >> I understand that there is no clear reasoning re what is the best
>>> embedding size, whereas I think heuristic approaches like described by the
>>> following link can be helpful
>>> >>
>>> >>
>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>> >>
>>> >> Having said this, we see various embedding services providing higher
>>> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>>> >>
>>> >> And it would be great if we could run benchmarks without having to
>>> recompile Lucene ourselves.
>>> >>
>>> >> Therefore I would to suggest to either increase the limit or even
>>> better to remove the limit and add a disclaimer, that people should be
>>> aware of possible crashes etc.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>> >>
>>> >>
>>> >> I've been monitoring various discussions on Pull Requests about
>>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>> >>
>>> >> https://github.com/apache/lucene/pull/12191
>>> >>
>>> >> https://github.com/apache/lucene/issues/11507
>>> >>
>>> >>
>>> >> I would like to set up a discussion and potentially a vote about this.
>>> >>
>>> >> I have seen some strong opposition from a few people but a majority
>>> of favor in this direction.
>>> >>
>>> >>
>>> >> Motivation
>>> >>
>>> >> We were discussing in the Solr slack channel with Ishan
>>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>>> integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>>> >>
>>> >>
>>> >> Proposal
>>> >>
>>> >> No hard limit at all.
>>> >>
>>> >> As for many other Lucene areas, users will be allowed to push the
>>> system to the limit of their resources and get terrible performances or
>>> crashes if they want.
>>> >>
>>> >>
>>> >> What we are NOT discussing
>>> >>
>>> >> - Quality and scalability of the HNSW algorithm
>>> >>
>>> >> - dimensionality reduction
>>> >>
>>> >> - strategies to fit in an arbitrary self-imposed limit
>>> >>
>>> >>
>>> >> Benefits
>>> >>
>>> >> - users can use the models they want to generate vectors
>>> >>
>>> >> - removal of an arbitrary limit that blocks some integrations
>>> >>
>>> >>
>>> >> Cons
>>> >>
>>> >> - if you go for vectors with high dimensions, there's no guarantee
>>> you get acceptable performance for your use case
>>> >>
>>> >>
>>> >>
>>> >> I want to keep it simple, right now in many Lucene areas, you can
>>> push the system to not acceptable performance/ crashes.
>>> >>
>>> >> For example, we don't limit the number of docs per index to an
>>> arbitrary maximum of N, you push how many docs you like and if they are too
>>> much for your system, you get terrible performance/crashes/whatever.
>>> >>
>>> >>
>>> >> Limits caused by primitive java types will stay there behind the
>>> scene, and that's acceptable, but I would prefer to not have arbitrary
>>> hard-coded ones that may limit the software usability and integration which
>>> is extremely important for a library.
>>> >>
>>> >>
>>> >> I strongly encourage people to add benefits and cons, that I missed
>>> (I am sure I missed some of them, but wanted to keep it simple)
>>> >>
>>> >>
>>> >> Cheers
>>> >>
>>> >> --------------------------
>>> >> Alessandro Benedetti
>>> >> Director @ Sease Ltd.
>>> >> Apache Lucene/Solr Committer
>>> >> Apache Solr PMC Member
>>> >>
>>> >> e-mail: a.benedetti@sease.io
>>> >>
>>> >>
>>> >> Sease - Information Retrieval Applied
>>> >> Consulting | Training | Open Source
>>> >>
>>> >> Website: Sease.io
>>> >> LinkedIn | Twitter | Youtube | Github
>>> >>
>>> >>
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 1, 2023, 11:28 PM

Post #8 of 99 (675 views)

btw, what was the reasoning to set the current limit to 1024?

Thanks

Michael

Am 01.04.23 um 14:47 schrieb Michael Sokolov:
> I'm also in favor of raising this limit. We do see some datasets with
> higher than 1024 dims. I also think we need to keep a limit. For
> example we currently need to keep all the vectors in RAM while
> indexing and we want to be able to support reasonable numbers of
> vectors in an index segment. Also we don't know what innovations might
> come down the road. Maybe someday we want to do product quantization
> and enforce that (k, m) both fit in a byte -- we wouldn't be able to
> do that if a vector's dimension were to exceed 32K.
>
> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> I am also curious what would be the worst-case scenario if we
> remove the constant at all (so automatically the limit becomes the
> Java Integer.MAX_VALUE).
> i.e.
> right now if you exceed the limit you get:
>
> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
> throw new IllegalArgumentException(
> "cannot index vectors with dimension greater than " +
> ByteVectorValues.MAX_DIMENSIONS);
> }
>
>
> in relation to:
>
> These limits allow us to
> better tune our data structures, prevent overflows, help ensure we
> have good test coverage, etc.
>
> I agree 100% especially for typing stuff properly and avoiding
> resource waste here and there, but I am not entirely sure this is
> the case for the current implementation i.e. do we have
> optimizations in place that assume the max dimension to be 1024?
> If I missed that (and I likely have), I of course suggest the
> contribution should not just blindly remove the limit, but do it
> appropriately.
> I am not in favor of just doubling it as suggested by some people,
> I would ideally prefer a solution that remains there to a decent
> extent, rather than having to modifying it anytime someone
> requires a higher limit.
>
> Cheers
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
> Github <https://github.com/seaseltd>
>
>
> On Fri, 31 Mar 2023 at 16:12, Michael Wechner
> <michael.wechner@wyona.com> wrote:
>
> OpenAI reduced their size to 1536 dimensions
>
> https://openai.com/blog/new-and-improved-embedding-model
>
> so 2048 would work :-)
>
> but other services do provide also higher dimensions with
> sometimes
> slightly better accuracy
>
> Thanks
>
> Michael
>
>
> Am 31.03.23 um 14:45 schrieb Adrien Grand:
> > I'm supportive of bumping the limit on the maximum dimension for
> > vectors to something that is above what the majority of
> users need,
> > but I'd like to keep a limit. We have limits for other
> things like the
> > max number of docs per index, the max term length, the max
> number of
> > dimensions of points, etc. and there are a few things that
> we don't
> > have limits on that I wish we had limits on. These limits
> allow us to
> > better tune our data structures, prevent overflows, help
> ensure we
> > have good test coverage, etc.
> >
> > That said, these other limits we have in place are quite
> high. E.g.
> > the 32kB term limit, nobody would ever type a 32kB term in a
> text box.
> > Likewise for the max of 8 dimensions for points: a segment
> cannot
> > possibly have 2 splits per dimension on average if it
> doesn't have
> > 512*2^(8*2)=34M docs, a sizable dataset already, so more
> dimensions
> > than 8 would likely defeat the point of indexing. In
> contrast, our
> > limit on the number of dimensions of vectors seems to be
> under what
> > some users would like, and while I understand the
> performance argument
> > against bumping the limit, it doesn't feel to me like
> something that
> > would be so bad that we need to prevent users from using
> numbers of
> > dimensions in the low thousands, e.g. top-k KNN searches
> would still
> > look at a very small subset of the full dataset.
> >
> > So overall, my vote would be to bump the limit to 2048 as
> suggested by
> > Mayya on the issue that you linked.
> >
> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
> > <michael.wechner@wyona.com> wrote:
> >> Thanks Alessandro for summarizing the discussion below!
> >>
> >> I understand that there is no clear reasoning re what is
> the best embedding size, whereas I think heuristic approaches
> like described by the following link can be helpful
> >>
> >>
> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
> >>
> >> Having said this, we see various embedding services
> providing higher dimensions than 1024, like for example
> OpenAI, Cohere and Aleph Alpha.
> >>
> >> And it would be great if we could run benchmarks without
> having to recompile Lucene ourselves.
> >>
> >> Therefore I would to suggest to either increase the limit
> or even better to remove the limit and add a disclaimer, that
> people should be aware of possible crashes etc.
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >>
> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
> >>
> >>
> >> I've been monitoring various discussions on Pull Requests
> about changing the max number of dimensions allowed for Lucene
> HNSW vectors:
> >>
> >> https://github.com/apache/lucene/pull/12191
> >>
> >> https://github.com/apache/lucene/issues/11507
> >>
> >>
> >> I would like to set up a discussion and potentially a vote
> about this.
> >>
> >> I have seen some strong opposition from a few people but a
> majority of favor in this direction.
> >>
> >>
> >> Motivation
> >>
> >> We were discussing in the Solr slack channel with Ishan
> Chattopadhyaya, Marcus Eagan, and David Smiley about some
> neural search integrations in Solr:
> https://github.com/openai/chatgpt-retrieval-plugin
> >>
> >>
> >> Proposal
> >>
> >> No hard limit at all.
> >>
> >> As for many other Lucene areas, users will be allowed to
> push the system to the limit of their resources and get
> terrible performances or crashes if they want.
> >>
> >>
> >> What we are NOT discussing
> >>
> >> - Quality and scalability of the HNSW algorithm
> >>
> >> - dimensionality reduction
> >>
> >> - strategies to fit in an arbitrary self-imposed limit
> >>
> >>
> >> Benefits
> >>
> >> - users can use the models they want to generate vectors
> >>
> >> - removal of an arbitrary limit that blocks some integrations
> >>
> >>
> >> Cons
> >>
> >> - if you go for vectors with high dimensions, there's no
> guarantee you get acceptable performance for your use case
> >>
> >>
> >>
> >> I want to keep it simple, right now in many Lucene areas,
> you can push the system to not acceptable performance/ crashes.
> >>
> >> For example, we don't limit the number of docs per index to
> an arbitrary maximum of N, you push how many docs you like and
> if they are too much for your system, you get terrible
> performance/crashes/whatever.
> >>
> >>
> >> Limits caused by primitive java types will stay there
> behind the scene, and that's acceptable, but I would prefer to
> not have arbitrary hard-coded ones that may limit the software
> usability and integration which is extremely important for a
> library.
> >>
> >>
> >> I strongly encourage people to add benefits and cons, that
> I missed (I am sure I missed some of them, but wanted to keep
> it simple)
> >>
> >>
> >> Cheers
> >>
> >> --------------------------
> >> Alessandro Benedetti
> >> Director @ Sease Ltd.
> >> Apache Lucene/Solr Committer
> >> Apache Solr PMC Member
> >>
> >> e-mail: a.benedetti@sease.io
> >>
> >>
> >> Sease - Information Retrieval Applied
> >> Consulting | Training | Open Source
> >>
> >> Website: Sease.io
> >> LinkedIn | Twitter | Youtube | Github
> >>
> >>
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 3, 2023, 6:50 AM

Post #9 of 99 (675 views)

... and what would be the next limit?
I guess we'll need to motivate it better than the 1024 one.
I appreciate the fact that a limit is pretty much wanted by everyone but I
suspect we'll need some solid foundation for deciding the amount (and it
should be high enough to avoid continuous changes)

Cheers

On Sun, 2 Apr 2023, 07:29 Michael Wechner, <michael.wechner@wyona.com>
wrote:

> btw, what was the reasoning to set the current limit to 1024?
>
> Thanks
>
> Michael
>
> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>
> I'm also in favor of raising this limit. We do see some datasets with
> higher than 1024 dims. I also think we need to keep a limit. For example we
> currently need to keep all the vectors in RAM while indexing and we want to
> be able to support reasonable numbers of vectors in an index segment. Also
> we don't know what innovations might come down the road. Maybe someday we
> want to do product quantization and enforce that (k, m) both fit in a byte
> -- we wouldn't be able to do that if a vector's dimension were to exceed
> 32K.
>
> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <
> a.benedetti@sease.io> wrote:
>
>> I am also curious what would be the worst-case scenario if we remove the
>> constant at all (so automatically the limit becomes the Java
>> Integer.MAX_VALUE).
>> i.e.
>> right now if you exceed the limit you get:
>>
>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>> throw new IllegalArgumentException(
>>> "cannot index vectors with dimension greater than " + ByteVectorValues.
>>> MAX_DIMENSIONS);
>>> }
>>
>>
>> in relation to:
>>
>>> These limits allow us to
>>> better tune our data structures, prevent overflows, help ensure we
>>> have good test coverage, etc.
>>
>>
>> I agree 100% especially for typing stuff properly and avoiding resource
>> waste here and there, but I am not entirely sure this is the case for the
>> current implementation i.e. do we have optimizations in place that assume
>> the max dimension to be 1024?
>> If I missed that (and I likely have), I of course suggest the
>> contribution should not just blindly remove the limit, but do it
>> appropriately.
>> I am not in favor of just doubling it as suggested by some people, I
>> would ideally prefer a solution that remains there to a decent extent,
>> rather than having to modifying it anytime someone requires a higher limit.
>>
>> Cheers
>>
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wechner@wyona.com>
>> wrote:
>>
>>> OpenAI reduced their size to 1536 dimensions
>>>
>>> https://openai.com/blog/new-and-improved-embedding-model
>>>
>>> so 2048 would work :-)
>>>
>>> but other services do provide also higher dimensions with sometimes
>>> slightly better accuracy
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>> > I'm supportive of bumping the limit on the maximum dimension for
>>> > vectors to something that is above what the majority of users need,
>>> > but I'd like to keep a limit. We have limits for other things like the
>>> > max number of docs per index, the max term length, the max number of
>>> > dimensions of points, etc. and there are a few things that we don't
>>> > have limits on that I wish we had limits on. These limits allow us to
>>> > better tune our data structures, prevent overflows, help ensure we
>>> > have good test coverage, etc.
>>> >
>>> > That said, these other limits we have in place are quite high. E.g.
>>> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>> > possibly have 2 splits per dimension on average if it doesn't have
>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>> > limit on the number of dimensions of vectors seems to be under what
>>> > some users would like, and while I understand the performance argument
>>> > against bumping the limit, it doesn't feel to me like something that
>>> > would be so bad that we need to prevent users from using numbers of
>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>> > look at a very small subset of the full dataset.
>>> >
>>> > So overall, my vote would be to bump the limit to 2048 as suggested by
>>> > Mayya on the issue that you linked.
>>> >
>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >> Thanks Alessandro for summarizing the discussion below!
>>> >>
>>> >> I understand that there is no clear reasoning re what is the best
>>> embedding size, whereas I think heuristic approaches like described by the
>>> following link can be helpful
>>> >>
>>> >>
>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>> >>
>>> >> Having said this, we see various embedding services providing higher
>>> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>>> >>
>>> >> And it would be great if we could run benchmarks without having to
>>> recompile Lucene ourselves.
>>> >>
>>> >> Therefore I would to suggest to either increase the limit or even
>>> better to remove the limit and add a disclaimer, that people should be
>>> aware of possible crashes etc.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>> >>
>>> >>
>>> >> I've been monitoring various discussions on Pull Requests about
>>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>> >>
>>> >> https://github.com/apache/lucene/pull/12191
>>> >>
>>> >> https://github.com/apache/lucene/issues/11507
>>> >>
>>> >>
>>> >> I would like to set up a discussion and potentially a vote about this.
>>> >>
>>> >> I have seen some strong opposition from a few people but a majority
>>> of favor in this direction.
>>> >>
>>> >>
>>> >> Motivation
>>> >>
>>> >> We were discussing in the Solr slack channel with Ishan
>>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>>> integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>>> >>
>>> >>
>>> >> Proposal
>>> >>
>>> >> No hard limit at all.
>>> >>
>>> >> As for many other Lucene areas, users will be allowed to push the
>>> system to the limit of their resources and get terrible performances or
>>> crashes if they want.
>>> >>
>>> >>
>>> >> What we are NOT discussing
>>> >>
>>> >> - Quality and scalability of the HNSW algorithm
>>> >>
>>> >> - dimensionality reduction
>>> >>
>>> >> - strategies to fit in an arbitrary self-imposed limit
>>> >>
>>> >>
>>> >> Benefits
>>> >>
>>> >> - users can use the models they want to generate vectors
>>> >>
>>> >> - removal of an arbitrary limit that blocks some integrations
>>> >>
>>> >>
>>> >> Cons
>>> >>
>>> >> - if you go for vectors with high dimensions, there's no guarantee
>>> you get acceptable performance for your use case
>>> >>
>>> >>
>>> >>
>>> >> I want to keep it simple, right now in many Lucene areas, you can
>>> push the system to not acceptable performance/ crashes.
>>> >>
>>> >> For example, we don't limit the number of docs per index to an
>>> arbitrary maximum of N, you push how many docs you like and if they are too
>>> much for your system, you get terrible performance/crashes/whatever.
>>> >>
>>> >>
>>> >> Limits caused by primitive java types will stay there behind the
>>> scene, and that's acceptable, but I would prefer to not have arbitrary
>>> hard-coded ones that may limit the software usability and integration which
>>> is extremely important for a library.
>>> >>
>>> >>
>>> >> I strongly encourage people to add benefits and cons, that I missed
>>> (I am sure I missed some of them, but wanted to keep it simple)
>>> >>
>>> >>
>>> >> Cheers
>>> >>
>>> >> --------------------------
>>> >> Alessandro Benedetti
>>> >> Director @ Sease Ltd.
>>> >> Apache Lucene/Solr Committer
>>> >> Apache Solr PMC Member
>>> >>
>>> >> e-mail: a.benedetti@sease.io
>>> >>
>>> >>
>>> >> Sease - Information Retrieval Applied
>>> >> Consulting | Training | Open Source
>>> >>
>>> >> Website: Sease.io
>>> >> LinkedIn | Twitter | Youtube | Github
>>> >>
>>> >>
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

lucene at mikemccandless

Apr 4, 2023, 6:32 AM

Post #10 of 99 (674 views)

> I am not in favor of just doubling it as suggested by some people, I
would ideally prefer a solution that remains there to a decent extent,
rather than having to modifying it anytime someone requires a higher limit.

The problem with this approach is it is a one-way door, once released. We
would not be able to lower the limit again in the future without possibly
breaking some applications.

> For example, we don't limit the number of docs per index to an arbitrary
maximum of N, you push how many docs you like and if they are too much for
your system, you get terrible performance/crashes/whatever.

Correction: we do check this limit and throw a specific exception now:
https://github.com/apache/lucene/issues/6905

+1 to raise the limit, but not remove it.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Apr 3, 2023 at 9:51?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> ... and what would be the next limit?
> I guess we'll need to motivate it better than the 1024 one.
> I appreciate the fact that a limit is pretty much wanted by everyone but I
> suspect we'll need some solid foundation for deciding the amount (and it
> should be high enough to avoid continuous changes)
>
> Cheers
>
> On Sun, 2 Apr 2023, 07:29 Michael Wechner, <michael.wechner@wyona.com>
> wrote:
>
>> btw, what was the reasoning to set the current limit to 1024?
>>
>> Thanks
>>
>> Michael
>>
>> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>>
>> I'm also in favor of raising this limit. We do see some datasets with
>> higher than 1024 dims. I also think we need to keep a limit. For example we
>> currently need to keep all the vectors in RAM while indexing and we want to
>> be able to support reasonable numbers of vectors in an index segment. Also
>> we don't know what innovations might come down the road. Maybe someday we
>> want to do product quantization and enforce that (k, m) both fit in a byte
>> -- we wouldn't be able to do that if a vector's dimension were to exceed
>> 32K.
>>
>> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <
>> a.benedetti@sease.io> wrote:
>>
>>> I am also curious what would be the worst-case scenario if we remove the
>>> constant at all (so automatically the limit becomes the Java
>>> Integer.MAX_VALUE).
>>> i.e.
>>> right now if you exceed the limit you get:
>>>
>>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>>> throw new IllegalArgumentException(
>>>> "cannot index vectors with dimension greater than " + ByteVectorValues.
>>>> MAX_DIMENSIONS);
>>>> }
>>>
>>>
>>> in relation to:
>>>
>>>> These limits allow us to
>>>> better tune our data structures, prevent overflows, help ensure we
>>>> have good test coverage, etc.
>>>
>>>
>>> I agree 100% especially for typing stuff properly and avoiding resource
>>> waste here and there, but I am not entirely sure this is the case for the
>>> current implementation i.e. do we have optimizations in place that assume
>>> the max dimension to be 1024?
>>> If I missed that (and I likely have), I of course suggest the
>>> contribution should not just blindly remove the limit, but do it
>>> appropriately.
>>> I am not in favor of just doubling it as suggested by some people, I
>>> would ideally prefer a solution that remains there to a decent extent,
>>> rather than having to modifying it anytime someone requires a higher limit.
>>>
>>> Cheers
>>>
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benedetti@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>>
>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wechner@wyona.com>
>>> wrote:
>>>
>>>> OpenAI reduced their size to 1536 dimensions
>>>>
>>>> https://openai.com/blog/new-and-improved-embedding-model
>>>>
>>>> so 2048 would work :-)
>>>>
>>>> but other services do provide also higher dimensions with sometimes
>>>> slightly better accuracy
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>>
>>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>>> > I'm supportive of bumping the limit on the maximum dimension for
>>>> > vectors to something that is above what the majority of users need,
>>>> > but I'd like to keep a limit. We have limits for other things like the
>>>> > max number of docs per index, the max term length, the max number of
>>>> > dimensions of points, etc. and there are a few things that we don't
>>>> > have limits on that I wish we had limits on. These limits allow us to
>>>> > better tune our data structures, prevent overflows, help ensure we
>>>> > have good test coverage, etc.
>>>> >
>>>> > That said, these other limits we have in place are quite high. E.g.
>>>> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
>>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>>> > possibly have 2 splits per dimension on average if it doesn't have
>>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>>> > limit on the number of dimensions of vectors seems to be under what
>>>> > some users would like, and while I understand the performance argument
>>>> > against bumping the limit, it doesn't feel to me like something that
>>>> > would be so bad that we need to prevent users from using numbers of
>>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>>> > look at a very small subset of the full dataset.
>>>> >
>>>> > So overall, my vote would be to bump the limit to 2048 as suggested by
>>>> > Mayya on the issue that you linked.
>>>> >
>>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>>> > <michael.wechner@wyona.com> wrote:
>>>> >> Thanks Alessandro for summarizing the discussion below!
>>>> >>
>>>> >> I understand that there is no clear reasoning re what is the best
>>>> embedding size, whereas I think heuristic approaches like described by the
>>>> following link can be helpful
>>>> >>
>>>> >>
>>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>>> >>
>>>> >> Having said this, we see various embedding services providing higher
>>>> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>>>> >>
>>>> >> And it would be great if we could run benchmarks without having to
>>>> recompile Lucene ourselves.
>>>> >>
>>>> >> Therefore I would to suggest to either increase the limit or even
>>>> better to remove the limit and add a disclaimer, that people should be
>>>> aware of possible crashes etc.
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> Michael
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>>> >>
>>>> >>
>>>> >> I've been monitoring various discussions on Pull Requests about
>>>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>>> >>
>>>> >> https://github.com/apache/lucene/pull/12191
>>>> >>
>>>> >> https://github.com/apache/lucene/issues/11507
>>>> >>
>>>> >>
>>>> >> I would like to set up a discussion and potentially a vote about
>>>> this.
>>>> >>
>>>> >> I have seen some strong opposition from a few people but a majority
>>>> of favor in this direction.
>>>> >>
>>>> >>
>>>> >> Motivation
>>>> >>
>>>> >> We were discussing in the Solr slack channel with Ishan
>>>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>>>> integrations in Solr:
>>>> https://github.com/openai/chatgpt-retrieval-plugin
>>>> >>
>>>> >>
>>>> >> Proposal
>>>> >>
>>>> >> No hard limit at all.
>>>> >>
>>>> >> As for many other Lucene areas, users will be allowed to push the
>>>> system to the limit of their resources and get terrible performances or
>>>> crashes if they want.
>>>> >>
>>>> >>
>>>> >> What we are NOT discussing
>>>> >>
>>>> >> - Quality and scalability of the HNSW algorithm
>>>> >>
>>>> >> - dimensionality reduction
>>>> >>
>>>> >> - strategies to fit in an arbitrary self-imposed limit
>>>> >>
>>>> >>
>>>> >> Benefits
>>>> >>
>>>> >> - users can use the models they want to generate vectors
>>>> >>
>>>> >> - removal of an arbitrary limit that blocks some integrations
>>>> >>
>>>> >>
>>>> >> Cons
>>>> >>
>>>> >> - if you go for vectors with high dimensions, there's no guarantee
>>>> you get acceptable performance for your use case
>>>> >>
>>>> >>
>>>> >>
>>>> >> I want to keep it simple, right now in many Lucene areas, you can
>>>> push the system to not acceptable performance/ crashes.
>>>> >>
>>>> >> For example, we don't limit the number of docs per index to an
>>>> arbitrary maximum of N, you push how many docs you like and if they are too
>>>> much for your system, you get terrible performance/crashes/whatever.
>>>> >>
>>>> >>
>>>> >> Limits caused by primitive java types will stay there behind the
>>>> scene, and that's acceptable, but I would prefer to not have arbitrary
>>>> hard-coded ones that may limit the software usability and integration which
>>>> is extremely important for a library.
>>>> >>
>>>> >>
>>>> >> I strongly encourage people to add benefits and cons, that I missed
>>>> (I am sure I missed some of them, but wanted to keep it simple)
>>>> >>
>>>> >>
>>>> >> Cheers
>>>> >>
>>>> >> --------------------------
>>>> >> Alessandro Benedetti
>>>> >> Director @ Sease Ltd.
>>>> >> Apache Lucene/Solr Committer
>>>> >> Apache Solr PMC Member
>>>> >>
>>>> >> e-mail: a.benedetti@sease.io
>>>> >>
>>>> >>
>>>> >> Sease - Information Retrieval Applied
>>>> >> Consulting | Training | Open Source
>>>> >>
>>>> >> Website: Sease.io
>>>> >> LinkedIn | Twitter | Youtube | Github
>>>> >>
>>>> >>
>>>> >
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>
>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

serera at gmail

Apr 4, 2023, 6:40 AM

Post #11 of 99 (674 views)

I am not familiar with the internal implementation details, but is it
possible to refactor the code such that someone can provide an extension of
some VectorEncoder/Decoder and control the limits on their side? Rather
than Lucene committing to some arbitrary limit (which these days seems to
keep growing)?

If raising the limit only means changing some hard-coded constant, then I
assume such an abstraction can work. We can mark this extension as
@lucene.expert.

Shai

On Tue, Apr 4, 2023 at 4:33?PM Michael McCandless <lucene@mikemccandless.com>
wrote:

> > I am not in favor of just doubling it as suggested by some people, I
> would ideally prefer a solution that remains there to a decent extent,
> rather than having to modifying it anytime someone requires a higher limit.
>
> The problem with this approach is it is a one-way door, once released. We
> would not be able to lower the limit again in the future without possibly
> breaking some applications.
>
> > For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they are too
> much for your system, you get terrible performance/crashes/whatever.
>
> Correction: we do check this limit and throw a specific exception now:
> https://github.com/apache/lucene/issues/6905
>
> +1 to raise the limit, but not remove it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 3, 2023 at 9:51?AM Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
>> ... and what would be the next limit?
>> I guess we'll need to motivate it better than the 1024 one.
>> I appreciate the fact that a limit is pretty much wanted by everyone but
>> I suspect we'll need some solid foundation for deciding the amount (and it
>> should be high enough to avoid continuous changes)
>>
>> Cheers
>>
>> On Sun, 2 Apr 2023, 07:29 Michael Wechner, <michael.wechner@wyona.com>
>> wrote:
>>
>>> btw, what was the reasoning to set the current limit to 1024?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>>>
>>> I'm also in favor of raising this limit. We do see some datasets with
>>> higher than 1024 dims. I also think we need to keep a limit. For example we
>>> currently need to keep all the vectors in RAM while indexing and we want to
>>> be able to support reasonable numbers of vectors in an index segment. Also
>>> we don't know what innovations might come down the road. Maybe someday we
>>> want to do product quantization and enforce that (k, m) both fit in a byte
>>> -- we wouldn't be able to do that if a vector's dimension were to exceed
>>> 32K.
>>>
>>> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> I am also curious what would be the worst-case scenario if we remove
>>>> the constant at all (so automatically the limit becomes the Java
>>>> Integer.MAX_VALUE).
>>>> i.e.
>>>> right now if you exceed the limit you get:
>>>>
>>>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>>>> throw new IllegalArgumentException(
>>>>> "cannot index vectors with dimension greater than " + ByteVectorValues
>>>>> .MAX_DIMENSIONS);
>>>>> }
>>>>
>>>>
>>>> in relation to:
>>>>
>>>>> These limits allow us to
>>>>> better tune our data structures, prevent overflows, help ensure we
>>>>> have good test coverage, etc.
>>>>
>>>>
>>>> I agree 100% especially for typing stuff properly and avoiding resource
>>>> waste here and there, but I am not entirely sure this is the case for the
>>>> current implementation i.e. do we have optimizations in place that assume
>>>> the max dimension to be 1024?
>>>> If I missed that (and I likely have), I of course suggest the
>>>> contribution should not just blindly remove the limit, but do it
>>>> appropriately.
>>>> I am not in favor of just doubling it as suggested by some people, I
>>>> would ideally prefer a solution that remains there to a decent extent,
>>>> rather than having to modifying it anytime someone requires a higher limit.
>>>>
>>>> Cheers
>>>>
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>>
>>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <
>>>> michael.wechner@wyona.com> wrote:
>>>>
>>>>> OpenAI reduced their size to 1536 dimensions
>>>>>
>>>>> https://openai.com/blog/new-and-improved-embedding-model
>>>>>
>>>>> so 2048 would work :-)
>>>>>
>>>>> but other services do provide also higher dimensions with sometimes
>>>>> slightly better accuracy
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>>>> > I'm supportive of bumping the limit on the maximum dimension for
>>>>> > vectors to something that is above what the majority of users need,
>>>>> > but I'd like to keep a limit. We have limits for other things like
>>>>> the
>>>>> > max number of docs per index, the max term length, the max number of
>>>>> > dimensions of points, etc. and there are a few things that we don't
>>>>> > have limits on that I wish we had limits on. These limits allow us to
>>>>> > better tune our data structures, prevent overflows, help ensure we
>>>>> > have good test coverage, etc.
>>>>> >
>>>>> > That said, these other limits we have in place are quite high. E.g.
>>>>> > the 32kB term limit, nobody would ever type a 32kB term in a text
>>>>> box.
>>>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>>>> > possibly have 2 splits per dimension on average if it doesn't have
>>>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>>>> > limit on the number of dimensions of vectors seems to be under what
>>>>> > some users would like, and while I understand the performance
>>>>> argument
>>>>> > against bumping the limit, it doesn't feel to me like something that
>>>>> > would be so bad that we need to prevent users from using numbers of
>>>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>>>> > look at a very small subset of the full dataset.
>>>>> >
>>>>> > So overall, my vote would be to bump the limit to 2048 as suggested
>>>>> by
>>>>> > Mayya on the issue that you linked.
>>>>> >
>>>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>>>> > <michael.wechner@wyona.com> wrote:
>>>>> >> Thanks Alessandro for summarizing the discussion below!
>>>>> >>
>>>>> >> I understand that there is no clear reasoning re what is the best
>>>>> embedding size, whereas I think heuristic approaches like described by the
>>>>> following link can be helpful
>>>>> >>
>>>>> >>
>>>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>>>> >>
>>>>> >> Having said this, we see various embedding services providing
>>>>> higher dimensions than 1024, like for example OpenAI, Cohere and Aleph
>>>>> Alpha.
>>>>> >>
>>>>> >> And it would be great if we could run benchmarks without having to
>>>>> recompile Lucene ourselves.
>>>>> >>
>>>>> >> Therefore I would to suggest to either increase the limit or even
>>>>> better to remove the limit and add a disclaimer, that people should be
>>>>> aware of possible crashes etc.
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> Michael
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>>>> >>
>>>>> >>
>>>>> >> I've been monitoring various discussions on Pull Requests about
>>>>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>>>> >>
>>>>> >> https://github.com/apache/lucene/pull/12191
>>>>> >>
>>>>> >> https://github.com/apache/lucene/issues/11507
>>>>> >>
>>>>> >>
>>>>> >> I would like to set up a discussion and potentially a vote about
>>>>> this.
>>>>> >>
>>>>> >> I have seen some strong opposition from a few people but a majority
>>>>> of favor in this direction.
>>>>> >>
>>>>> >>
>>>>> >> Motivation
>>>>> >>
>>>>> >> We were discussing in the Solr slack channel with Ishan
>>>>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>>>>> integrations in Solr:
>>>>> https://github.com/openai/chatgpt-retrieval-plugin
>>>>> >>
>>>>> >>
>>>>> >> Proposal
>>>>> >>
>>>>> >> No hard limit at all.
>>>>> >>
>>>>> >> As for many other Lucene areas, users will be allowed to push the
>>>>> system to the limit of their resources and get terrible performances or
>>>>> crashes if they want.
>>>>> >>
>>>>> >>
>>>>> >> What we are NOT discussing
>>>>> >>
>>>>> >> - Quality and scalability of the HNSW algorithm
>>>>> >>
>>>>> >> - dimensionality reduction
>>>>> >>
>>>>> >> - strategies to fit in an arbitrary self-imposed limit
>>>>> >>
>>>>> >>
>>>>> >> Benefits
>>>>> >>
>>>>> >> - users can use the models they want to generate vectors
>>>>> >>
>>>>> >> - removal of an arbitrary limit that blocks some integrations
>>>>> >>
>>>>> >>
>>>>> >> Cons
>>>>> >>
>>>>> >> - if you go for vectors with high dimensions, there's no
>>>>> guarantee you get acceptable performance for your use case
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> I want to keep it simple, right now in many Lucene areas, you can
>>>>> push the system to not acceptable performance/ crashes.
>>>>> >>
>>>>> >> For example, we don't limit the number of docs per index to an
>>>>> arbitrary maximum of N, you push how many docs you like and if they are too
>>>>> much for your system, you get terrible performance/crashes/whatever.
>>>>> >>
>>>>> >>
>>>>> >> Limits caused by primitive java types will stay there behind the
>>>>> scene, and that's acceptable, but I would prefer to not have arbitrary
>>>>> hard-coded ones that may limit the software usability and integration which
>>>>> is extremely important for a library.
>>>>> >>
>>>>> >>
>>>>> >> I strongly encourage people to add benefits and cons, that I missed
>>>>> (I am sure I missed some of them, but wanted to keep it simple)
>>>>> >>
>>>>> >>
>>>>> >> Cheers
>>>>> >>
>>>>> >> --------------------------
>>>>> >> Alessandro Benedetti
>>>>> >> Director @ Sease Ltd.
>>>>> >> Apache Lucene/Solr Committer
>>>>> >> Apache Solr PMC Member
>>>>> >>
>>>>> >> e-mail: a.benedetti@sease.io
>>>>> >>
>>>>> >>
>>>>> >> Sease - Information Retrieval Applied
>>>>> >> Consulting | Training | Open Source
>>>>> >>
>>>>> >> Website: Sease.io
>>>>> >> LinkedIn | Twitter | Youtube | Github
>>>>> >>
>>>>> >>
>>>>> >
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>>>
>>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 4, 2023, 6:47 AM

Post #12 of 99 (674 views)

IIUC we all agree that the limit could be raised, but we need some solid
reasoning what limit makes sense, resp. why do we set this particular
limit (e.g. 2048), right?

Thanks

Michael

Am 04.04.23 um 15:32 schrieb Michael McCandless:
> > I am not in favor of just doubling it as suggested by some people, I
> would ideally prefer a solution that remains there to a decent extent,
> rather than having to modifying it anytime someone requires a higher
> limit.
>
> The problem with this approach is it is a one-way door, once
> released. We would not be able to lower the limit again in the future
> without possibly breaking some applications.
>
> > For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they
> are too much for your system, you get terrible
> performance/crashes/whatever.
>
> Correction: we do check this limit and throw a specific exception now:
> https://github.com/apache/lucene/issues/6905
>
> +1 to raise the limit, but not remove it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 3, 2023 at 9:51?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> ... and what would be the next limit?
> I guess we'll need to motivate it better than the 1024 one.
> I appreciate the fact that a limit is pretty much wanted by
> everyone but I suspect we'll need some solid foundation for
> deciding the amount (and it should be high enough to avoid
> continuous changes)
>
> Cheers
>
> On Sun, 2 Apr 2023, 07:29 Michael Wechner,
> <michael.wechner@wyona.com> wrote:
>
> btw, what was the reasoning to set the current limit to 1024?
>
> Thanks
>
> Michael
>
> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>> I'm also in favor of raising this limit. We do see some
>> datasets with higher than 1024 dims. I also think we need to
>> keep a limit. For example we currently need to keep all the
>> vectors in RAM while indexing and we want to be able to
>> support reasonable numbers of vectors in an index segment.
>> Also we don't know what innovations might come down the road.
>> Maybe someday we want to do product quantization and enforce
>> that (k, m) both fit in a byte -- we wouldn't be able to do
>> that if a vector's dimension were to exceed 32K.
>>
>> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>>
>> I am also curious what would be the worst-case scenario
>> if we remove the constant at all (so automatically the
>> limit becomes the Java Integer.MAX_VALUE).
>> i.e.
>> right now if you exceed the limit you get:
>>
>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>> throw new IllegalArgumentException(
>> "cannot index vectors with dimension greater than " +
>> ByteVectorValues.MAX_DIMENSIONS);
>> }
>>
>>
>> in relation to:
>>
>> These limits allow us to
>> better tune our data structures, prevent overflows,
>> help ensure we
>> have good test coverage, etc.
>>
>> I agree 100% especially for typing stuff properly and
>> avoiding resource waste here and there, but I am not
>> entirely sure this is the case for the current
>> implementation i.e. do we have optimizations in place
>> that assume the max dimension to be 1024?
>> If I missed that (and I likely have), I of course suggest
>> the contribution should not just blindly remove the
>> limit, but do it appropriately.
>> I am not in favor of just doubling it as suggested by
>> some people, I would ideally prefer a solution that
>> remains there to a decent extent, rather than having to
>> modifying it anytime someone requires a higher limit.
>>
>> Cheers
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> /Apache Lucene/Solr Committer/
>> /Apache Solr PMC Member/
>>
>> e-mail: a.benedetti@sease.io/
>> /
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> |
>> Twitter <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
>> Github <https://github.com/seaseltd>
>>
>>
>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner
>> <michael.wechner@wyona.com> wrote:
>>
>> OpenAI reduced their size to 1536 dimensions
>>
>> https://openai.com/blog/new-and-improved-embedding-model
>>
>> so 2048 would work :-)
>>
>> but other services do provide also higher dimensions
>> with sometimes
>> slightly better accuracy
>>
>> Thanks
>>
>> Michael
>>
>>
>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>> > I'm supportive of bumping the limit on the maximum
>> dimension for
>> > vectors to something that is above what the
>> majority of users need,
>> > but I'd like to keep a limit. We have limits for
>> other things like the
>> > max number of docs per index, the max term length,
>> the max number of
>> > dimensions of points, etc. and there are a few
>> things that we don't
>> > have limits on that I wish we had limits on. These
>> limits allow us to
>> > better tune our data structures, prevent overflows,
>> help ensure we
>> > have good test coverage, etc.
>> >
>> > That said, these other limits we have in place are
>> quite high. E.g.
>> > the 32kB term limit, nobody would ever type a 32kB
>> term in a text box.
>> > Likewise for the max of 8 dimensions for points: a
>> segment cannot
>> > possibly have 2 splits per dimension on average if
>> it doesn't have
>> > 512*2^(8*2)=34M docs, a sizable dataset already, so
>> more dimensions
>> > than 8 would likely defeat the point of indexing.
>> In contrast, our
>> > limit on the number of dimensions of vectors seems
>> to be under what
>> > some users would like, and while I understand the
>> performance argument
>> > against bumping the limit, it doesn't feel to me
>> like something that
>> > would be so bad that we need to prevent users from
>> using numbers of
>> > dimensions in the low thousands, e.g. top-k KNN
>> searches would still
>> > look at a very small subset of the full dataset.
>> >
>> > So overall, my vote would be to bump the limit to
>> 2048 as suggested by
>> > Mayya on the issue that you linked.
>> >
>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>> > <michael.wechner@wyona.com> wrote:
>> >> Thanks Alessandro for summarizing the discussion
>> below!
>> >>
>> >> I understand that there is no clear reasoning re
>> what is the best embedding size, whereas I think
>> heuristic approaches like described by the following
>> link can be helpful
>> >>
>> >>
>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>> >>
>> >> Having said this, we see various embedding
>> services providing higher dimensions than 1024, like
>> for example OpenAI, Cohere and Aleph Alpha.
>> >>
>> >> And it would be great if we could run benchmarks
>> without having to recompile Lucene ourselves.
>> >>
>> >> Therefore I would to suggest to either increase
>> the limit or even better to remove the limit and add
>> a disclaimer, that people should be aware of possible
>> crashes etc.
>> >>
>> >> Thanks
>> >>
>> >> Michael
>> >>
>> >>
>> >>
>> >>
>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>> >>
>> >>
>> >> I've been monitoring various discussions on Pull
>> Requests about changing the max number of dimensions
>> allowed for Lucene HNSW vectors:
>> >>
>> >> https://github.com/apache/lucene/pull/12191
>> >>
>> >> https://github.com/apache/lucene/issues/11507
>> >>
>> >>
>> >> I would like to set up a discussion and
>> potentially a vote about this.
>> >>
>> >> I have seen some strong opposition from a few
>> people but a majority of favor in this direction.
>> >>
>> >>
>> >> Motivation
>> >>
>> >> We were discussing in the Solr slack channel with
>> Ishan Chattopadhyaya, Marcus Eagan, and David Smiley
>> about some neural search integrations in Solr:
>> https://github.com/openai/chatgpt-retrieval-plugin
>> >>
>> >>
>> >> Proposal
>> >>
>> >> No hard limit at all.
>> >>
>> >> As for many other Lucene areas, users will be
>> allowed to push the system to the limit of their
>> resources and get terrible performances or crashes if
>> they want.
>> >>
>> >>
>> >> What we are NOT discussing
>> >>
>> >> - Quality and scalability of the HNSW algorithm
>> >>
>> >> - dimensionality reduction
>> >>
>> >> - strategies to fit in an arbitrary self-imposed limit
>> >>
>> >>
>> >> Benefits
>> >>
>> >> - users can use the models they want to generate
>> vectors
>> >>
>> >> - removal of an arbitrary limit that blocks some
>> integrations
>> >>
>> >>
>> >> Cons
>> >>
>> >> - if you go for vectors with high dimensions,
>> there's no guarantee you get acceptable performance
>> for your use case
>> >>
>> >>
>> >>
>> >> I want to keep it simple, right now in many Lucene
>> areas, you can push the system to not acceptable
>> performance/ crashes.
>> >>
>> >> For example, we don't limit the number of docs per
>> index to an arbitrary maximum of N, you push how many
>> docs you like and if they are too much for your
>> system, you get terrible performance/crashes/whatever.
>> >>
>> >>
>> >> Limits caused by primitive java types will stay
>> there behind the scene, and that's acceptable, but I
>> would prefer to not have arbitrary hard-coded ones
>> that may limit the software usability and integration
>> which is extremely important for a library.
>> >>
>> >>
>> >> I strongly encourage people to add benefits and
>> cons, that I missed (I am sure I missed some of them,
>> but wanted to keep it simple)
>> >>
>> >>
>> >> Cheers
>> >>
>> >> --------------------------
>> >> Alessandro Benedetti
>> >> Director @ Sease Ltd.
>> >> Apache Lucene/Solr Committer
>> >> Apache Solr PMC Member
>> >>
>> >> e-mail: a.benedetti@sease.io
>> >>
>> >>
>> >> Sease - Information Retrieval Applied
>> >> Consulting | Training | Open Source
>> >>
>> >> Website: Sease.io
>> >> LinkedIn | Twitter | Youtube | Github
>> >>
>> >>
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail:
>> dev-help@lucene.apache.org
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 5, 2023, 3:34 AM

Post #13 of 99 (673 views)

Thanks Mike for the insight!

What would be the next steps then?
I see agreement but also the necessity of identifying a candidate MAX.

Should create a VOTE thread, where we propose some values with a
justification and we vote?

In this way we can create a pull request and merge relatively soon.

Cheers

On Tue, 4 Apr 2023, 14:47 Michael Wechner, <michael.wechner@wyona.com>
wrote:

> IIUC we all agree that the limit could be raised, but we need some solid
> reasoning what limit makes sense, resp. why do we set this particular limit
> (e.g. 2048), right?
>
> Thanks
>
> Michael
>
>
> Am 04.04.23 um 15:32 schrieb Michael McCandless:
>
> > I am not in favor of just doubling it as suggested by some people, I
> would ideally prefer a solution that remains there to a decent extent,
> rather than having to modifying it anytime someone requires a higher limit.
>
> The problem with this approach is it is a one-way door, once released. We
> would not be able to lower the limit again in the future without possibly
> breaking some applications.
>
> > For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they are too
> much for your system, you get terrible performance/crashes/whatever.
>
> Correction: we do check this limit and throw a specific exception now:
> https://github.com/apache/lucene/issues/6905
>
> +1 to raise the limit, but not remove it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 3, 2023 at 9:51?AM Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
>> ... and what would be the next limit?
>> I guess we'll need to motivate it better than the 1024 one.
>> I appreciate the fact that a limit is pretty much wanted by everyone but
>> I suspect we'll need some solid foundation for deciding the amount (and it
>> should be high enough to avoid continuous changes)
>>
>> Cheers
>>
>> On Sun, 2 Apr 2023, 07:29 Michael Wechner, <michael.wechner@wyona.com>
>> wrote:
>>
>>> btw, what was the reasoning to set the current limit to 1024?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>>>
>>> I'm also in favor of raising this limit. We do see some datasets with
>>> higher than 1024 dims. I also think we need to keep a limit. For example we
>>> currently need to keep all the vectors in RAM while indexing and we want to
>>> be able to support reasonable numbers of vectors in an index segment. Also
>>> we don't know what innovations might come down the road. Maybe someday we
>>> want to do product quantization and enforce that (k, m) both fit in a byte
>>> -- we wouldn't be able to do that if a vector's dimension were to exceed
>>> 32K.
>>>
>>> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> I am also curious what would be the worst-case scenario if we remove
>>>> the constant at all (so automatically the limit becomes the Java
>>>> Integer.MAX_VALUE).
>>>> i.e.
>>>> right now if you exceed the limit you get:
>>>>
>>>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>>>> throw new IllegalArgumentException(
>>>>> "cannot index vectors with dimension greater than " + ByteVectorValues
>>>>> .MAX_DIMENSIONS);
>>>>> }
>>>>
>>>>
>>>> in relation to:
>>>>
>>>>> These limits allow us to
>>>>> better tune our data structures, prevent overflows, help ensure we
>>>>> have good test coverage, etc.
>>>>
>>>>
>>>> I agree 100% especially for typing stuff properly and avoiding resource
>>>> waste here and there, but I am not entirely sure this is the case for the
>>>> current implementation i.e. do we have optimizations in place that assume
>>>> the max dimension to be 1024?
>>>> If I missed that (and I likely have), I of course suggest the
>>>> contribution should not just blindly remove the limit, but do it
>>>> appropriately.
>>>> I am not in favor of just doubling it as suggested by some people, I
>>>> would ideally prefer a solution that remains there to a decent extent,
>>>> rather than having to modifying it anytime someone requires a higher limit.
>>>>
>>>> Cheers
>>>>
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>>
>>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <
>>>> michael.wechner@wyona.com> wrote:
>>>>
>>>>> OpenAI reduced their size to 1536 dimensions
>>>>>
>>>>> https://openai.com/blog/new-and-improved-embedding-model
>>>>>
>>>>> so 2048 would work :-)
>>>>>
>>>>> but other services do provide also higher dimensions with sometimes
>>>>> slightly better accuracy
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>>>> > I'm supportive of bumping the limit on the maximum dimension for
>>>>> > vectors to something that is above what the majority of users need,
>>>>> > but I'd like to keep a limit. We have limits for other things like
>>>>> the
>>>>> > max number of docs per index, the max term length, the max number of
>>>>> > dimensions of points, etc. and there are a few things that we don't
>>>>> > have limits on that I wish we had limits on. These limits allow us to
>>>>> > better tune our data structures, prevent overflows, help ensure we
>>>>> > have good test coverage, etc.
>>>>> >
>>>>> > That said, these other limits we have in place are quite high. E.g.
>>>>> > the 32kB term limit, nobody would ever type a 32kB term in a text
>>>>> box.
>>>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>>>> > possibly have 2 splits per dimension on average if it doesn't have
>>>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>>>> > limit on the number of dimensions of vectors seems to be under what
>>>>> > some users would like, and while I understand the performance
>>>>> argument
>>>>> > against bumping the limit, it doesn't feel to me like something that
>>>>> > would be so bad that we need to prevent users from using numbers of
>>>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>>>> > look at a very small subset of the full dataset.
>>>>> >
>>>>> > So overall, my vote would be to bump the limit to 2048 as suggested
>>>>> by
>>>>> > Mayya on the issue that you linked.
>>>>> >
>>>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>>>> > <michael.wechner@wyona.com> wrote:
>>>>> >> Thanks Alessandro for summarizing the discussion below!
>>>>> >>
>>>>> >> I understand that there is no clear reasoning re what is the best
>>>>> embedding size, whereas I think heuristic approaches like described by the
>>>>> following link can be helpful
>>>>> >>
>>>>> >>
>>>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>>>> >>
>>>>> >> Having said this, we see various embedding services providing
>>>>> higher dimensions than 1024, like for example OpenAI, Cohere and Aleph
>>>>> Alpha.
>>>>> >>
>>>>> >> And it would be great if we could run benchmarks without having to
>>>>> recompile Lucene ourselves.
>>>>> >>
>>>>> >> Therefore I would to suggest to either increase the limit or even
>>>>> better to remove the limit and add a disclaimer, that people should be
>>>>> aware of possible crashes etc.
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> Michael
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>>>> >>
>>>>> >>
>>>>> >> I've been monitoring various discussions on Pull Requests about
>>>>> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>>>> >>
>>>>> >> https://github.com/apache/lucene/pull/12191
>>>>> >>
>>>>> >> https://github.com/apache/lucene/issues/11507
>>>>> >>
>>>>> >>
>>>>> >> I would like to set up a discussion and potentially a vote about
>>>>> this.
>>>>> >>
>>>>> >> I have seen some strong opposition from a few people but a majority
>>>>> of favor in this direction.
>>>>> >>
>>>>> >>
>>>>> >> Motivation
>>>>> >>
>>>>> >> We were discussing in the Solr slack channel with Ishan
>>>>> Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search
>>>>> integrations in Solr:
>>>>> https://github.com/openai/chatgpt-retrieval-plugin
>>>>> >>
>>>>> >>
>>>>> >> Proposal
>>>>> >>
>>>>> >> No hard limit at all.
>>>>> >>
>>>>> >> As for many other Lucene areas, users will be allowed to push the
>>>>> system to the limit of their resources and get terrible performances or
>>>>> crashes if they want.
>>>>> >>
>>>>> >>
>>>>> >> What we are NOT discussing
>>>>> >>
>>>>> >> - Quality and scalability of the HNSW algorithm
>>>>> >>
>>>>> >> - dimensionality reduction
>>>>> >>
>>>>> >> - strategies to fit in an arbitrary self-imposed limit
>>>>> >>
>>>>> >>
>>>>> >> Benefits
>>>>> >>
>>>>> >> - users can use the models they want to generate vectors
>>>>> >>
>>>>> >> - removal of an arbitrary limit that blocks some integrations
>>>>> >>
>>>>> >>
>>>>> >> Cons
>>>>> >>
>>>>> >> - if you go for vectors with high dimensions, there's no
>>>>> guarantee you get acceptable performance for your use case
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> I want to keep it simple, right now in many Lucene areas, you can
>>>>> push the system to not acceptable performance/ crashes.
>>>>> >>
>>>>> >> For example, we don't limit the number of docs per index to an
>>>>> arbitrary maximum of N, you push how many docs you like and if they are too
>>>>> much for your system, you get terrible performance/crashes/whatever.
>>>>> >>
>>>>> >>
>>>>> >> Limits caused by primitive java types will stay there behind the
>>>>> scene, and that's acceptable, but I would prefer to not have arbitrary
>>>>> hard-coded ones that may limit the software usability and integration which
>>>>> is extremely important for a library.
>>>>> >>
>>>>> >>
>>>>> >> I strongly encourage people to add benefits and cons, that I missed
>>>>> (I am sure I missed some of them, but wanted to keep it simple)
>>>>> >>
>>>>> >>
>>>>> >> Cheers
>>>>> >>
>>>>> >> --------------------------
>>>>> >> Alessandro Benedetti
>>>>> >> Director @ Sease Ltd.
>>>>> >> Apache Lucene/Solr Committer
>>>>> >> Apache Solr PMC Member
>>>>> >>
>>>>> >> e-mail: a.benedetti@sease.io
>>>>> >>
>>>>> >>
>>>>> >> Sease - Information Retrieval Applied
>>>>> >> Consulting | Training | Open Source
>>>>> >>
>>>>> >> Website: Sease.io
>>>>> >> LinkedIn | Twitter | Youtube | Github
>>>>> >>
>>>>> >>
>>>>> >
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>>>
>>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 5, 2023, 5:10 AM

Post #14 of 99 (673 views)

Am 05.04.23 um 12:34 schrieb Alessandro Benedetti:
> Thanks Mike for the insight!
>
> What would be the next steps then?
> I see agreement but also the necessity of identifying a candidate MAX.
>
> Should create a VOTE thread, where we propose some values with a
> justification and we vote?

+1

Thanks

Michael

>
> In this way we can create a pull request and merge relatively soon.
>
> Cheers
>
> On Tue, 4 Apr 2023, 14:47 Michael Wechner, <michael.wechner@wyona.com>
> wrote:
>
> IIUC we all agree that the limit could be raised, but we need some
> solid reasoning what limit makes sense, resp. why do we set this
> particular limit (e.g. 2048), right?
>
> Thanks
>
> Michael
>
>
> Am 04.04.23 um 15:32 schrieb Michael McCandless:
>> > I am not in favor of just doubling it as suggested by some
>> people, I would ideally prefer a solution that remains there to a
>> decent extent, rather than having to modifying it anytime someone
>> requires a higher limit.
>>
>> The problem with this approach is it is a one-way door, once
>> released. We would not be able to lower the limit again in the
>> future without possibly breaking some applications.
>>
>> > For example, we don't limit the number of docs per index to an
>> arbitrary maximum of N, you push how many docs you like and if
>> they are too much for your system, you get terrible
>> performance/crashes/whatever.
>>
>> Correction: we do check this limit and throw a specific exception
>> now: https://github.com/apache/lucene/issues/6905
>>
>> +1 to raise the limit, but not remove it.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Apr 3, 2023 at 9:51?AM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>>
>> ... and what would be the next limit?
>> I guess we'll need to motivate it better than the 1024 one.
>> I appreciate the fact that a limit is pretty much wanted by
>> everyone but I suspect we'll need some solid foundation for
>> deciding the amount (and it should be high enough to avoid
>> continuous changes)
>>
>> Cheers
>>
>> On Sun, 2 Apr 2023, 07:29 Michael Wechner,
>> <michael.wechner@wyona.com> wrote:
>>
>> btw, what was the reasoning to set the current limit to 1024?
>>
>> Thanks
>>
>> Michael
>>
>> Am 01.04.23 um 14:47 schrieb Michael Sokolov:
>>> I'm also in favor of raising this limit. We do see some
>>> datasets with higher than 1024 dims. I also think we
>>> need to keep a limit. For example we currently need to
>>> keep all the vectors in RAM while indexing and we want
>>> to be able to support reasonable numbers of vectors in
>>> an index segment. Also we don't know what innovations
>>> might come down the road. Maybe someday we want to do
>>> product quantization and enforce that (k, m) both fit in
>>> a byte -- we wouldn't be able to do that if a vector's
>>> dimension were to exceed 32K.
>>>
>>> On Fri, Mar 31, 2023 at 11:57?AM Alessandro Benedetti
>>> <a.benedetti@sease.io> wrote:
>>>
>>> I am also curious what would be the worst-case
>>> scenario if we remove the constant at all (so
>>> automatically the limit becomes the Java
>>> Integer.MAX_VALUE).
>>> i.e.
>>> right now if you exceed the limit you get:
>>>
>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>> throw new IllegalArgumentException(
>>> "cannot index vectors with dimension greater
>>> than " + ByteVectorValues.MAX_DIMENSIONS);
>>> }
>>>
>>>
>>> in relation to:
>>>
>>> These limits allow us to
>>> better tune our data structures, prevent
>>> overflows, help ensure we
>>> have good test coverage, etc.
>>>
>>> I agree 100% especially for typing stuff properly
>>> and avoiding resource waste here and there, but I am
>>> not entirely sure this is the case for the current
>>> implementation i.e. do we have optimizations in
>>> place that assume the max dimension to be 1024?
>>> If I missed that (and I likely have), I of course
>>> suggest the contribution should not just blindly
>>> remove the limit, but do it appropriately.
>>> I am not in favor of just doubling it as suggested
>>> by some people, I would ideally prefer a solution
>>> that remains there to a decent extent, rather than
>>> having to modifying it anytime someone requires a
>>> higher limit.
>>>
>>> Cheers
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> /Apache Lucene/Solr Committer/
>>> /Apache Solr PMC Member/
>>>
>>> e-mail: a.benedetti@sease.io/
>>> /
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> |
>>> Twitter <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
>>> Github <https://github.com/seaseltd>
>>>
>>>
>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner
>>> <michael.wechner@wyona.com> wrote:
>>>
>>> OpenAI reduced their size to 1536 dimensions
>>>
>>> https://openai.com/blog/new-and-improved-embedding-model
>>>
>>> so 2048 would work :-)
>>>
>>> but other services do provide also higher
>>> dimensions with sometimes
>>> slightly better accuracy
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>> > I'm supportive of bumping the limit on the
>>> maximum dimension for
>>> > vectors to something that is above what the
>>> majority of users need,
>>> > but I'd like to keep a limit. We have limits
>>> for other things like the
>>> > max number of docs per index, the max term
>>> length, the max number of
>>> > dimensions of points, etc. and there are a few
>>> things that we don't
>>> > have limits on that I wish we had limits on.
>>> These limits allow us to
>>> > better tune our data structures, prevent
>>> overflows, help ensure we
>>> > have good test coverage, etc.
>>> >
>>> > That said, these other limits we have in place
>>> are quite high. E.g.
>>> > the 32kB term limit, nobody would ever type a
>>> 32kB term in a text box.
>>> > Likewise for the max of 8 dimensions for
>>> points: a segment cannot
>>> > possibly have 2 splits per dimension on
>>> average if it doesn't have
>>> > 512*2^(8*2)=34M docs, a sizable dataset
>>> already, so more dimensions
>>> > than 8 would likely defeat the point of
>>> indexing. In contrast, our
>>> > limit on the number of dimensions of vectors
>>> seems to be under what
>>> > some users would like, and while I understand
>>> the performance argument
>>> > against bumping the limit, it doesn't feel to
>>> me like something that
>>> > would be so bad that we need to prevent users
>>> from using numbers of
>>> > dimensions in the low thousands, e.g. top-k
>>> KNN searches would still
>>> > look at a very small subset of the full dataset.
>>> >
>>> > So overall, my vote would be to bump the limit
>>> to 2048 as suggested by
>>> > Mayya on the issue that you linked.
>>> >
>>> > On Fri, Mar 31, 2023 at 2:38?PM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >> Thanks Alessandro for summarizing the
>>> discussion below!
>>> >>
>>> >> I understand that there is no clear reasoning
>>> re what is the best embedding size, whereas I
>>> think heuristic approaches like described by the
>>> following link can be helpful
>>> >>
>>> >>
>>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>> >>
>>> >> Having said this, we see various embedding
>>> services providing higher dimensions than 1024,
>>> like for example OpenAI, Cohere and Aleph Alpha.
>>> >>
>>> >> And it would be great if we could run
>>> benchmarks without having to recompile Lucene
>>> ourselves.
>>> >>
>>> >> Therefore I would to suggest to either
>>> increase the limit or even better to remove the
>>> limit and add a disclaimer, that people should
>>> be aware of possible crashes etc.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Am 31.03.23 um 11:43 schrieb Alessandro
>>> Benedetti:
>>> >>
>>> >>
>>> >> I've been monitoring various discussions on
>>> Pull Requests about changing the max number of
>>> dimensions allowed for Lucene HNSW vectors:
>>> >>
>>> >> https://github.com/apache/lucene/pull/12191
>>> >>
>>> >> https://github.com/apache/lucene/issues/11507
>>> >>
>>> >>
>>> >> I would like to set up a discussion and
>>> potentially a vote about this.
>>> >>
>>> >> I have seen some strong opposition from a few
>>> people but a majority of favor in this direction.
>>> >>
>>> >>
>>> >> Motivation
>>> >>
>>> >> We were discussing in the Solr slack channel
>>> with Ishan Chattopadhyaya, Marcus Eagan, and
>>> David Smiley about some neural search
>>> integrations in Solr:
>>> https://github.com/openai/chatgpt-retrieval-plugin
>>> >>
>>> >>
>>> >> Proposal
>>> >>
>>> >> No hard limit at all.
>>> >>
>>> >> As for many other Lucene areas, users will be
>>> allowed to push the system to the limit of their
>>> resources and get terrible performances or
>>> crashes if they want.
>>> >>
>>> >>
>>> >> What we are NOT discussing
>>> >>
>>> >> - Quality and scalability of the HNSW algorithm
>>> >>
>>> >> - dimensionality reduction
>>> >>
>>> >> - strategies to fit in an arbitrary
>>> self-imposed limit
>>> >>
>>> >>
>>> >> Benefits
>>> >>
>>> >> - users can use the models they want to
>>> generate vectors
>>> >>
>>> >> - removal of an arbitrary limit that blocks
>>> some integrations
>>> >>
>>> >>
>>> >> Cons
>>> >>
>>> >> - if you go for vectors with high
>>> dimensions, there's no guarantee you get
>>> acceptable performance for your use case
>>> >>
>>> >>
>>> >>
>>> >> I want to keep it simple, right now in many
>>> Lucene areas, you can push the system to not
>>> acceptable performance/ crashes.
>>> >>
>>> >> For example, we don't limit the number of
>>> docs per index to an arbitrary maximum of N, you
>>> push how many docs you like and if they are too
>>> much for your system, you get terrible
>>> performance/crashes/whatever.
>>> >>
>>> >>
>>> >> Limits caused by primitive java types will
>>> stay there behind the scene, and that's
>>> acceptable, but I would prefer to not have
>>> arbitrary hard-coded ones that may limit the
>>> software usability and integration which is
>>> extremely important for a library.
>>> >>
>>> >>
>>> >> I strongly encourage people to add benefits
>>> and cons, that I missed (I am sure I missed some
>>> of them, but wanted to keep it simple)
>>> >>
>>> >>
>>> >> Cheers
>>> >>
>>> >> --------------------------
>>> >> Alessandro Benedetti
>>> >> Director @ Sease Ltd.
>>> >> Apache Lucene/Solr Committer
>>> >> Apache Solr PMC Member
>>> >>
>>> >> e-mail: a.benedetti@sease.io
>>> >>
>>> >>
>>> >> Sease - Information Retrieval Applied
>>> >> Consulting | Training | Open Source
>>> >>
>>> >> Website: Sease.io
>>> >> LinkedIn | Twitter | Youtube | Github
>>> >>
>>> >>
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail:
>>> dev-help@lucene.apache.org
>>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

dawid.weiss at gmail

Apr 5, 2023, 6:49 AM

Post #15 of 99 (673 views)

> Should create a VOTE thread, where we propose some values with a
> justification and we vote?
>

Technically, a vote thread won't help much if there's no full consensus - a
single veto will make the patch unacceptable for merging.
https://www.apache.org/foundation/voting.html#Veto

Dawid

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 5, 2023, 8:22 AM

Post #16 of 99 (673 views)

Ok, so what should we do then?
This space is moving fast, and in my opinion we should act fast to release
and guarantee we attract as many users as possible.

At the same time I am not saying we should proceed blind, if there's
concrete evidence for setting a limit rather than another, or that a
certain limit is detrimental to the project, I think that veto should be
valid.

We shouldn't accept weakly/not scientifically motivated vetos anyway right?

The problem I see is that more than voting we should first decide this
limit and I don't know how we can operate.
I am imagining like a poll where each entry is a limit + motivation and
PMCs maybe vote/add entries?

Did anything similar happen in the past? How was the current limit added?

On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:

>
>
>> Should create a VOTE thread, where we propose some values with a
>> justification and we vote?
>>
>
> Technically, a vote thread won't help much if there's no full consensus -
> a single veto will make the patch unacceptable for merging.
> https://www.apache.org/foundation/voting.html#Veto
>
> Dawid
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

dawid.weiss at gmail

Apr 5, 2023, 8:48 AM

Post #17 of 99 (673 views)

> Ok, so what should we do then?

I don't know, Alessandro. I just wanted to point out the fact that by
Apache rules a committer's veto to a code change counts as a no-go. It
does not specify any way to "override" such a veto, perhaps counting
on disagreeing parties to resolve conflicting points of views in a
civil manner so that veto can be retracted (or a different solution
suggested).

I think Robert's point is not about a particular limit value but about
the algorithm itself - the current implementation does not scale. I
don't want to be an advocate for either side - I'm all for freedom of
choice but at the same time last time I tried indexing a few million
vectors, I couldn't get far before segment merging blew up with
OOMs...

> Did anything similar happen in the past? How was the current limit added?

I honestly don't know, you'd have to git blame or look at the mailing
list archives of the original contribution.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 5, 2023, 8:55 AM

Post #18 of 99 (673 views)

Hi Dawid

Can you describe your crash in more detail?

How many millions vectors exactly?
What was the vector dimension?
How much RAM?
etc.

Thanks

Michael

Am 05.04.23 um 17:48 schrieb Dawid Weiss:
>> Ok, so what should we do then?
> I don't know, Alessandro. I just wanted to point out the fact that by
> Apache rules a committer's veto to a code change counts as a no-go. It
> does not specify any way to "override" such a veto, perhaps counting
> on disagreeing parties to resolve conflicting points of views in a
> civil manner so that veto can be retracted (or a different solution
> suggested).
>
> I think Robert's point is not about a particular limit value but about
> the algorithm itself - the current implementation does not scale. I
> don't want to be an advocate for either side - I'm all for freedom of
> choice but at the same time last time I tried indexing a few million
> vectors, I couldn't get far before segment merging blew up with
> OOMs...
>
>> Did anything similar happen in the past? How was the current limit added?
> I honestly don't know, you'd have to git blame or look at the mailing
> list archives of the original contribution.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

dawid.weiss at gmail

Apr 5, 2023, 9:30 AM

Post #19 of 99 (673 views)

> Can you describe your crash in more detail?

I can't. That experiment was a while ago and a quick test to see if I
could index rather large-ish USPTO (patent office) data as vectors.
Couldn't do it then.

> How much RAM?

My indexing jobs run with rather smallish heaps to give space for I/O
buffers. Think 4-8GB at most. So yes, it could have been the problem.
I recall segment merging grew slower and slower and then simply
crashed. Lucene should work with low heap requirements, even if it
slows down. Throwing ram at the indexing/ segment merging problem
is... I don't know - not elegant?

Anyway. My main point was to remind folks about how Apache works -
code is merged in when there are no vetoes. If Rob (or anybody else)
remains unconvinced, he or she can block the change. (I didn't invent
those rules).

D.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 5, 2023, 9:51 AM

Post #20 of 99 (673 views)

Thanks for your feedback!

I agree, that it should not crash.

So far we did not experience crashes ourselves, but we did not index
millions of vectors.

I will try to reproduce the crash, maybe this will help us to move forward.

Thanks

Michael

Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> Can you describe your crash in more detail?
> I can't. That experiment was a while ago and a quick test to see if I
> could index rather large-ish USPTO (patent office) data as vectors.
> Couldn't do it then.
>
>> How much RAM?
> My indexing jobs run with rather smallish heaps to give space for I/O
> buffers. Think 4-8GB at most. So yes, it could have been the problem.
> I recall segment merging grew slower and slower and then simply
> crashed. Lucene should work with low heap requirements, even if it
> slows down. Throwing ram at the indexing/ segment merging problem
> is... I don't know - not elegant?
>
> Anyway. My main point was to remind folks about how Apache works -
> code is merged in when there are no vetoes. If Rob (or anybody else)
> remains unconvinced, he or she can block the change. (I didn't invent
> those rules).
>
> D.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 5, 2023, 10:54 AM

Post #21 of 99 (673 views)

I'd ask anyone voting +1 to raise this limit to at least try to index
a few million vectors with 756 or 1024, which is allowed today.

IMO based on how painful it is, it seems the limit is already too
high, I realize that will sound controversial but please at least try
it out!

voting +1 without at least doing this is really the
"weak/unscientifically minded" approach.

On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> Thanks for your feedback!
>
> I agree, that it should not crash.
>
> So far we did not experience crashes ourselves, but we did not index
> millions of vectors.
>
> I will try to reproduce the crash, maybe this will help us to move forward.
>
> Thanks
>
> Michael
>
> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >> Can you describe your crash in more detail?
> > I can't. That experiment was a while ago and a quick test to see if I
> > could index rather large-ish USPTO (patent office) data as vectors.
> > Couldn't do it then.
> >
> >> How much RAM?
> > My indexing jobs run with rather smallish heaps to give space for I/O
> > buffers. Think 4-8GB at most. So yes, it could have been the problem.
> > I recall segment merging grew slower and slower and then simply
> > crashed. Lucene should work with low heap requirements, even if it
> > slows down. Throwing ram at the indexing/ segment merging problem
> > is... I don't know - not elegant?
> >
> > Anyway. My main point was to remind folks about how Apache works -
> > code is merged in when there are no vetoes. If Rob (or anybody else)
> > remains unconvinced, he or she can block the change. (I didn't invent
> > those rules).
> >
> > D.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

osjdconrad at gmail

Apr 5, 2023, 1:50 PM

Post #22 of 99 (673 views)

I don't want to get too far off topic, but I think one of the problems here
is that HNSW doesn't really fit well as a Lucene data structure. The way it
behaves it would be better supported as a live, in-memory data structure
instead of segmented and written to disk for tiny graphs that then need to
be merged. I wonder if it may be a better approach to explore other
possible algorithms that are designed to be on-disk instead of in-memory
even if they require k-means clustering as a trade-off. Maybe with an
on-disk algorithm we could have good enough performance for a
higher-dimensional limit.

On Wed, Apr 5, 2023 at 10:54?AM Robert Muir <rcmuir@gmail.com> wrote:

> I'd ask anyone voting +1 to raise this limit to at least try to index
> a few million vectors with 756 or 1024, which is allowed today.
>
> IMO based on how painful it is, it seems the limit is already too
> high, I realize that will sound controversial but please at least try
> it out!
>
> voting +1 without at least doing this is really the
> "weak/unscientifically minded" approach.
>
> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
> >
> > Thanks for your feedback!
> >
> > I agree, that it should not crash.
> >
> > So far we did not experience crashes ourselves, but we did not index
> > millions of vectors.
> >
> > I will try to reproduce the crash, maybe this will help us to move
> forward.
> >
> > Thanks
> >
> > Michael
> >
> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> > >> Can you describe your crash in more detail?
> > > I can't. That experiment was a while ago and a quick test to see if I
> > > could index rather large-ish USPTO (patent office) data as vectors.
> > > Couldn't do it then.
> > >
> > >> How much RAM?
> > > My indexing jobs run with rather smallish heaps to give space for I/O
> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
> > > I recall segment merging grew slower and slower and then simply
> > > crashed. Lucene should work with low heap requirements, even if it
> > > slows down. Throwing ram at the indexing/ segment merging problem
> > > is... I don't know - not elegant?
> > >
> > > Anyway. My main point was to remind folks about how Apache works -
> > > code is merged in when there are no vetoes. If Rob (or anybody else)
> > > remains unconvinced, he or she can block the change. (I didn't invent
> > > those rules).
> > >
> > > D.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: dev-help@lucene.apache.org
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

gus.heck at gmail

Apr 5, 2023, 7:30 PM

Post #23 of 99 (673 views)

10 MB hard drive, wow I'll never need another floppy disk ever...
Neural nets... nice idea, but there will never be enough CPU power to run
them...

etc.

Is it possible to make it a configurable limit?

On Wed, Apr 5, 2023 at 4:51?PM Jack Conradson <osjdconrad@gmail.com> wrote:

> I don't want to get too far off topic, but I think one of the problems
> here is that HNSW doesn't really fit well as a Lucene data structure. The
> way it behaves it would be better supported as a live, in-memory data
> structure instead of segmented and written to disk for tiny graphs that
> then need to be merged. I wonder if it may be a better approach to explore
> other possible algorithms that are designed to be on-disk instead of
> in-memory even if they require k-means clustering as a trade-off. Maybe
> with an on-disk algorithm we could have good enough performance for a
> higher-dimensional limit.
>
> On Wed, Apr 5, 2023 at 10:54?AM Robert Muir <rcmuir@gmail.com> wrote:
>
>> I'd ask anyone voting +1 to raise this limit to at least try to index
>> a few million vectors with 756 or 1024, which is allowed today.
>>
>> IMO based on how painful it is, it seems the limit is already too
>> high, I realize that will sound controversial but please at least try
>> it out!
>>
>> voting +1 without at least doing this is really the
>> "weak/unscientifically minded" approach.
>>
>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>> <michael.wechner@wyona.com> wrote:
>> >
>> > Thanks for your feedback!
>> >
>> > I agree, that it should not crash.
>> >
>> > So far we did not experience crashes ourselves, but we did not index
>> > millions of vectors.
>> >
>> > I will try to reproduce the crash, maybe this will help us to move
>> forward.
>> >
>> > Thanks
>> >
>> > Michael
>> >
>> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> > >> Can you describe your crash in more detail?
>> > > I can't. That experiment was a while ago and a quick test to see if I
>> > > could index rather large-ish USPTO (patent office) data as vectors.
>> > > Couldn't do it then.
>> > >
>> > >> How much RAM?
>> > > My indexing jobs run with rather smallish heaps to give space for I/O
>> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
>> > > I recall segment merging grew slower and slower and then simply
>> > > crashed. Lucene should work with low heap requirements, even if it
>> > > slows down. Throwing ram at the indexing/ segment merging problem
>> > > is... I don't know - not elegant?
>> > >
>> > > Anyway. My main point was to remind folks about how Apache works -
>> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>> > > remains unconvinced, he or she can block the change. (I didn't invent
>> > > those rules).
>> > >
>> > > D.
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > > For additional commands, e-mail: dev-help@lucene.apache.org
>> > >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

lucene at mikemccandless

Apr 6, 2023, 2:47 AM

Post #24 of 99 (673 views)

> We shouldn't accept weakly/not scientifically motivated vetos anyway
right?

In fact we must accept all vetos by any committer as a veto, for a change
to Lucene's source code, regardless of that committer's reasoning. This is
the power of Apache's model.

Of course we all can and will work together to convince one another (this
is where the scientifically motivated part comes in) to change our votes,
one way or another.

> I'd ask anyone voting +1 to raise this limit to at least try to index a
few million vectors with 756 or 1024, which is allowed today.

+1, if the current implementation really does not scale / needs more and
more RAM for merging, let's understand what's going on here, first, before
increasing limits. I rescind my hasty +1 for now!

Mike McCandless

http://blog.mikemccandless.com

On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Ok, so what should we do then?
> This space is moving fast, and in my opinion we should act fast to release
> and guarantee we attract as many users as possible.
>
> At the same time I am not saying we should proceed blind, if there's
> concrete evidence for setting a limit rather than another, or that a
> certain limit is detrimental to the project, I think that veto should be
> valid.
>
> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>
> The problem I see is that more than voting we should first decide this
> limit and I don't know how we can operate.
> I am imagining like a poll where each entry is a limit + motivation and
> PMCs maybe vote/add entries?
>
> Did anything similar happen in the past? How was the current limit added?
>
>
> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>
>>
>>
>>> Should create a VOTE thread, where we propose some values with a
>>> justification and we vote?
>>>
>>
>> Technically, a vote thread won't help much if there's no full consensus -
>> a single veto will make the patch unacceptable for merging.
>> https://www.apache.org/foundation/voting.html#Veto
>>
>> Dawid
>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 6, 2023, 7:08 AM

Post #25 of 99 (673 views)

I think we should focus on testing where the limits are and what might
cause the limits.

Let's get out of this fog :-)

Thanks

Michael

Am 06.04.23 um 11:47 schrieb Michael McCandless:
> > We shouldn't accept weakly/not scientifically motivated vetos anyway
> right?
>
> In fact we must accept all vetos by any committer as a veto, for a
> change to Lucene's source code, regardless of that committer's
> reasoning. This is the power of Apache's model.
>
> Of course we all can and will work together to convince one another
> (this is where the scientifically motivated part comes in) to change
> our votes, one way or another.
>
> > I'd ask anyone voting +1 to raise this limit to at least try to
> index a few million vectors with 756 or 1024, which is allowed today.
>
> +1, if the current implementation really does not scale / needs more
> and more RAM for merging, let's understand what's going on here,
> first, before increasing limits. I rescind my hasty +1 for now!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> Ok, so what should we do then?
> This space is moving fast, and in my opinion we should act fast to
> release and guarantee we attract as many users as possible.
>
> At the same time I am not saying we should proceed blind, if
> there's concrete evidence for setting a limit rather than another,
> or that a certain limit is detrimental to the project, I think
> that veto should be valid.
>
> We shouldn't accept weakly/not scientifically motivated vetos
> anyway right?
>
> The problem I see is that more than voting we should first decide
> this limit and I don't know how we can operate.
> I am imagining like a poll where each entry is a limit +
> motivation and PMCs maybe vote/add entries?
>
> Did anything similar happen in the past? How was the current limit
> added?
>
>
> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>
> Should create a VOTE thread, where we propose some values
> with a justification and we vote?
>
>
> Technically, a vote thread won't help much if there's no full
> consensus - a single veto will make the patch unacceptable for
> merging.
> https://www.apache.org/foundation/voting.html#Veto
>
> Dawid
>