Mailing List Archive

Re: [VOTE] Dimension Limit for KNN Vectors
We agree backwards compatibility with the index should be maintained and
that checkIndex should work. And we agree on a number of other things, but
I want to focus on configurability.
As long as the index records the number of dimensions actually used in a
specific segment & field, why couldn't checkIndex work if the dimension
*limit* is configurable? It's not checkIndex's job to enforce the limit,
only to check that the data appears consistent / valid, irrespective of how
the number of dimensions came to be specified originally.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
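The point above can be sketched in code. This is a hypothetical illustration, not Lucene's actual CheckIndex implementation: the class and record here are invented for the example, and the only assumption is that each segment records the declared dimension per vector field.

```java
// Hypothetical sketch (not Lucene's CheckIndex code) of the argument above:
// index validation only needs the per-field dimension recorded in the
// segment, never the write-time limit under which the data was created.
final class VectorFieldCheck {

    // Assumed shape of what a segment records for one vector field:
    // the declared dimension and the stored vectors.
    record FieldVectors(String name, int declaredDims, float[][] vectors) {}

    // CheckIndex-style consistency test: every stored vector must match
    // the declared dimension. No global limit is consulted, so the check
    // holds regardless of how the limit was configured at write time.
    static boolean isConsistent(FieldVectors field) {
        for (float[] v : field.vectors) {
            if (v.length != field.declaredDims) {
                return false;
            }
        }
        return true;
    }
}
```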


On Tue, May 16, 2023 at 10:58 PM Robert Muir <rcmuir@gmail.com> wrote:

> My problem is that it impacts the default codec which is supported by our
> backwards compatibility policy for many years. We can't just let the user
> determine backwards compatibility with a sysprop. how will checkindex work?
> We have to have bounds and also allow for more performant implementations
> that might have different limitations. And I'm pretty sure we want a faster
> implementation than what we have in the future, and it will probably have
> different limits.
>
> For other codecs, it is fine to have a different limit as I already said,
> as it is implementation dependent. And honestly the stuff in lucene/codecs
> can be more "Fast and loose" because it doesn't require the extensive index
> back compat guarantee.
>
> Again, penultimate concern is that index back compat guarantee. When it
> comes to limits, the proper way is not to just keep bumping them without
> technical reasons, instead the correct approach is to fix the technical
> problems and make them irrelevant. Great example here (merged this
> morning):
> https://github.com/apache/lucene/commit/f53eb28af053d7612f7e4d1b2de05d33dc410645
>
>
> On Tue, May 16, 2023 at 10:49 PM David Smiley <dsmiley@apache.org> wrote:
>
>> Robert, I have not heard from you (or anyone) an argument against System
>> property based configurability (as I described in Option 4 via a System
>> property). Uwe notes wisely some care must be taken to ensure it actually
>> works. Sure, of course. What concerns do you have with this?
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Tue, May 16, 2023 at 9:50 PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> by the way, i agree with the idea to MOVE THE LIMIT UNCHANGED to the
>>> hsnw-specific code.
>>>
>>> This way, someone can write alternative codec with vectors using some
>>> other completely different approach that incorporates a different more
>>> appropriate limit (maybe lower, maybe higher) depending upon their
>>> tradeoffs. We should encourage this as I think it is the "only true fix" to
>>> the scalability issues: use a scalable algorithm! Also, alternative codecs
>>> don't force the project into many years of index backwards compatibility,
>>> which is really my penultimate concern. We can lock ourselves into a truly
>>> bad place and become irrelevant (especially with scalar code implementing
>>> all this vector stuff, it is really senseless). In the meantime I suggest
>>> we try to reduce pain for the default codec with the current implementation
>>> if possible. If it is not possible, we need a new codec that performs.
>>>
>>> On Tue, May 16, 2023 at 8:53 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>>> Gus, I think i explained myself multiple times on issues and in this
>>>> thread. the performance is unacceptable, everyone knows it, but nobody is
>>>> talking about.
>>>> I don't need to explain myself time and time again here.
>>>> You don't seem to understand the technical issues (at least you sure as
>>>> fuck don't know how service loading works or you wouldnt have opened
>>>> https://github.com/apache/lucene/issues/12300)
>>>>
>>>> I'm just the only one here completely unconstrained by any of silicon
>>>> valley's influences to speak my true mind, without any repercussions, so I
>>>> do it. Don't give any fucks about ChatGPT.
>>>>
>>>> I'm standing by my technical veto. If you bypass it, I'll revert the
>>>> offending commit.
>>>>
>>>> As far as fixing the technical performance, I just opened an issue with
>>>> some ideas to at least improve cpu usage by a factor of N. It does not help
>>>> with the crazy heap memory usage or other issues of KNN implementation
>>>> causing shit like OOM on merge. But it is one step:
>>>> https://github.com/apache/lucene/issues/12302
>>>>
>>>>
>>>>
>>>> On Tue, May 16, 2023 at 7:45 AM Gus Heck <gus.heck@gmail.com> wrote:
>>>>
>>>>> Robert,
>>>>>
>>>>> Can you explain in clear technical terms the standard that must be met
>>>>> for performance? A benchmark that must run in X time on Y hardware for
>>>>> example (and why that test is suitable)? Or some other reproducible
>>>>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>>>>> that's not a technical criteria, others may have a different concept of
>>>>> what is usable to them.
>>>>>
>>>>> Forgive me if I misunderstand, but the essence of your argument has
>>>>> seemed to be
>>>>>
>>>>> "Performance isn't good enough, therefore we should force anyone who
>>>>> wants to experiment with something bigger to fork the code base to do it"
>>>>>
>>>>> Thus, it is necessary to have a clear unambiguous standard that anyone
>>>>> can verify for "good enough". A clear standard would also focus efforts at
>>>>> improvement.
>>>>>
>>>>> Where are the goal posts?
>>>>>
>>>>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard
>>>>> limit is fundamentally counterproductive in an open source setting, as it
>>>>> will lead to *fewer people* pushing the limits. Extremely few people
>>>>> are going to get into the nitty-gritty of optimizing things unless they are
>>>>> staring at code that they can prove does something interesting, but doesn't
>>>>> run fast enough for their purposes. If people hit a hard limit, more of
>>>>> them give up and never develop the code that will motivate them to look for
>>>>> optimizations.
>>>>>
>>>>> -Gus
>>>>>
>>>>> On Tue, May 16, 2023 at 6:04 AM Robert Muir <rcmuir@gmail.com> wrote:
>>>>>
>>>>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>>>>> does not change the technical facts or make the veto go away.
>>>>>>
>>>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>>>> a.benedetti@sease.io> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> we have finalized all the options proposed by the community and we
>>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>>> implementation.
>>>>>>>
>>>>>>> *Option 1*
>>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>>> *Motivation*:
>>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>>> most active stewards of the project, I think we should keep working toward
>>>>>>> improving the feature as is and move to up the limit after we can
>>>>>>> demonstrate improvement unambiguously.
>>>>>>>
>>>>>>> *Option 2*
>>>>>>> make the limit configurable, for example through a system property
>>>>>>> *Motivation*:
>>>>>>> The system administrator can enforce a limit its users need to
>>>>>>> respect that it's in line with whatever the admin decided to be acceptable
>>>>>>> for them.
>>>>>>> The default can stay the current one.
>>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>>> OpenSearch, and any sort of plugin development
>>>>>>>
>>>>>>> *Option 3*
>>>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>>> vector engine alternative/evolution.
>>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>>>> performance ok, some not, and it depends on the target data set and use
>>>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>>> other use-cases) to be based on a lower limit.
>>>>>>>
>>>>>>> *Option 4*
>>>>>>> Make it configurable and move it to an appropriate place.
>>>>>>> In particular, a
>>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>>> enough.
>>>>>>> *Motivation*:
>>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>>> order.
>>>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>>
>>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>>> implementation.
>>>>>>> --------------------------
>>>>>>> *Alessandro Benedetti*
>>>>>>> Director @ Sease Ltd.
>>>>>>> *Apache Lucene/Solr Committer*
>>>>>>> *Apache Solr PMC Member*
>>>>>>>
>>>>>>> e-mail: a.benedetti@sease.io
>>>>>>>
>>>>>>>
>>>>>>> *Sease* - Information Retrieval Applied
>>>>>>> Consulting | Training | Open Source
>>>>>>>
>>>>>>> Website: Sease.io <http://sease.io/>
>>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>>> <https://github.com/seaseltd>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> http://www.needhamsoftware.com (work)
>>>>> http://www.the111shift.com (play)
>>>>>
>>>>
Re: [VOTE] Dimension Limit for KNN Vectors
Robert,
A gentle reminder of the
https://www.apache.org/foundation/policies/conduct.html.
I've read many e-mails about this topic that ended up in a tone that is not
up to the standard of a healthy community.
To be specific and pragmatic: the way you addressed Gus here, the way you
addressed the rest of our community by mocking us as a sort of "ChatGPT
minions", and the use of profanity (the f-word) do not make sense and are
not acceptable here.
Even if you feel heated, I recommend separating such emotions from what you
write and always being respectful of other people with different ideas.
You are an intelligent person; don't ruin your time (and others' time) on a
wonderful project such as Lucene, blinded by excessive emotion.
Please remember that the vast majority of us participate in this community
purely on a volunteer basis.
So when I spend time on this, I like to see respect, thoughtful
discussions, and intellectual challenges; the time we spend together must
be peaceful and positive.

The community comes first, and here we are collecting what the community
would like for a feature.
Your vote and opinion are extremely valuable, but at this stage we are
here to listen to the community rather than impose a personal idea.
Once we observe the dominant need, we'll proceed with a contribution.
If you disagree with such a contribution and bring technical evidence that
supports a convincing veto, we (the Lucene community) will listen and
improve/change the contribution.
If you disagree with such a contribution and bring an unconvincing veto, we
(the Lucene community) will proceed with steps that are appropriate for the
situation.
Let's also remember that the project and the community come first; Lucene
is an Apache project, not mine or yours, for that matter.

Cheers

--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Wed, 17 May 2023 at 01:54, Robert Muir <rcmuir@gmail.com> wrote:

> Gus, I think i explained myself multiple times on issues and in this
> thread. the performance is unacceptable, everyone knows it, but nobody is
> talking about.
> I don't need to explain myself time and time again here.
> You don't seem to understand the technical issues (at least you sure as
> fuck don't know how service loading works or you wouldnt have opened
> https://github.com/apache/lucene/issues/12300)
>
> I'm just the only one here completely unconstrained by any of silicon
> valley's influences to speak my true mind, without any repercussions, so I
> do it. Don't give any fucks about ChatGPT.
>
> I'm standing by my technical veto. If you bypass it, I'll revert the
> offending commit.
>
> As far as fixing the technical performance, I just opened an issue with
> some ideas to at least improve cpu usage by a factor of N. It does not help
> with the crazy heap memory usage or other issues of KNN implementation
> causing shit like OOM on merge. But it is one step:
> https://github.com/apache/lucene/issues/12302
>
>
>
> On Tue, May 16, 2023 at 7:45 AM Gus Heck <gus.heck@gmail.com> wrote:
>
>> Robert,
>>
>> Can you explain in clear technical terms the standard that must be met
>> for performance? A benchmark that must run in X time on Y hardware for
>> example (and why that test is suitable)? Or some other reproducible
>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>> that's not a technical criteria, others may have a different concept of
>> what is usable to them.
>>
>> Forgive me if I misunderstand, but the essence of your argument has
>> seemed to be
>>
>> "Performance isn't good enough, therefore we should force anyone who
>> wants to experiment with something bigger to fork the code base to do it"
>>
>> Thus, it is necessary to have a clear unambiguous standard that anyone
>> can verify for "good enough". A clear standard would also focus efforts at
>> improvement.
>>
>> Where are the goal posts?
>>
>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>> is fundamentally counterproductive in an open source setting, as it will
>> lead to *fewer people* pushing the limits. Extremely few people are
>> going to get into the nitty-gritty of optimizing things unless they are
>> staring at code that they can prove does something interesting, but doesn't
>> run fast enough for their purposes. If people hit a hard limit, more of
>> them give up and never develop the code that will motivate them to look for
>> optimizations.
>>
>> -Gus
>>
>> On Tue, May 16, 2023 at 6:04 AM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>> does not change the technical facts or make the veto go away.
>>>
>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> Hi all,
>>>> we have finalized all the options proposed by the community and we are
>>>> ready to vote for the preferred one and then proceed with the
>>>> implementation.
>>>>
>>>> *Option 1*
>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>> *Motivation*:
>>>> We are close to improving on many fronts. Given the criticality of
>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>> most active stewards of the project, I think we should keep working toward
>>>> improving the feature as is and move to up the limit after we can
>>>> demonstrate improvement unambiguously.
>>>>
>>>> *Option 2*
>>>> make the limit configurable, for example through a system property
>>>> *Motivation*:
>>>> The system administrator can enforce a limit its users need to respect
>>>> that it's in line with whatever the admin decided to be acceptable for
>>>> them.
>>>> The default can stay the current one.
>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>> and any sort of plugin development
>>>>
>>>> *Option 3*
>>>> Move the max dimension limit lower level to a HNSW specific
>>>> implementation. Once there, this limit would not bind any other potential
>>>> vector engine alternative/evolution.
>>>> *Motivation:* There seem to be contradictory performance
>>>> interpretations about the current HNSW implementation. Some consider its
>>>> performance ok, some not, and it depends on the target data set and use
>>>> case. Increasing the max dimension limit where it is currently (in top
>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>> other use-cases) to be based on a lower limit.
>>>>
>>>> *Option 4*
>>>> Make it configurable and move it to an appropriate place.
>>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>>> 1024) should be enough.
>>>> *Motivation*:
>>>> Both are good and not mutually exclusive and could happen in any order.
>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>> I've not seen an argument _against_ configurability. Especially in this
>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>
>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>> implementation.
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>
Re: [VOTE] Dimension Limit for KNN Vectors
As a reminder this isn't the Disney Plus channel and I'll use strong
language if I fucking want to.



On Wed, May 17, 2023, 4:45 AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Robert,
> A gentle reminder of the
> https://www.apache.org/foundation/policies/conduct.html.
> I've read many e-mails about this topic that ended up in a tone that is
> not up to the standard of a healthy community.
> To be specific and pragmatic how you addressed Gus here, how you addressed
> the rest of our community mocking us as sort of "ChatGPT minions" and the
> usage of bad words in English (f* word), does not make sense and it's not
> acceptable here.
> Even if you feel heated, I recommend separating such emotions from what
> you write and always being respectful of other people with different ideas.
> You are an intelligent person, don't ruin your time (and others' time) on
> a wonderful project such as Lucene, blinded by excessive emotion.
> Please remember that the vast majority of us participate in this community
> purely on a volunteering basis.
> So when I spend time on this, I like to see respect,
> thoughtful discussions, and intellectual challenges, the time we spend
> together must be peaceful and positive.
>
> The community comes first and here we are collecting what the community
> would like for a feature.
> Your vote and opinion are extremely valuable, but at this stage, we are
> here to listen to the community rather than imposing a personal idea.
> Once we observe the dominant need, we'll proceed with a contribution.
> If you disagree with such a contribution and bring technical evidence that
> supports a convincing veto, we (the Lucene community) will listen and
> improve/change the contribution.
> If you disagree with such a contribution and bring an unconvincing veto,
> we (the Lucene community) will proceed with steps that are appropriate for
> the situation.
> Let's also remember that the project and the community come first, Lucene
> is an Apache project, not mine or yours for that matters.
>
> Cheers
>
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Wed, 17 May 2023 at 01:54, Robert Muir <rcmuir@gmail.com> wrote:
>
>> Gus, I think i explained myself multiple times on issues and in this
>> thread. the performance is unacceptable, everyone knows it, but nobody is
>> talking about.
>> I don't need to explain myself time and time again here.
>> You don't seem to understand the technical issues (at least you sure as
>> fuck don't know how service loading works or you wouldnt have opened
>> https://github.com/apache/lucene/issues/12300)
>>
>> I'm just the only one here completely unconstrained by any of silicon
>> valley's influences to speak my true mind, without any repercussions, so I
>> do it. Don't give any fucks about ChatGPT.
>>
>> I'm standing by my technical veto. If you bypass it, I'll revert the
>> offending commit.
>>
>> As far as fixing the technical performance, I just opened an issue with
>> some ideas to at least improve cpu usage by a factor of N. It does not help
>> with the crazy heap memory usage or other issues of KNN implementation
>> causing shit like OOM on merge. But it is one step:
>> https://github.com/apache/lucene/issues/12302
>>
>>
>>
>> On Tue, May 16, 2023 at 7:45 AM Gus Heck <gus.heck@gmail.com> wrote:
>>
>>> Robert,
>>>
>>> Can you explain in clear technical terms the standard that must be met
>>> for performance? A benchmark that must run in X time on Y hardware for
>>> example (and why that test is suitable)? Or some other reproducible
>>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>>> that's not a technical criteria, others may have a different concept of
>>> what is usable to them.
>>>
>>> Forgive me if I misunderstand, but the essence of your argument has
>>> seemed to be
>>>
>>> "Performance isn't good enough, therefore we should force anyone who
>>> wants to experiment with something bigger to fork the code base to do it"
>>>
>>> Thus, it is necessary to have a clear unambiguous standard that anyone
>>> can verify for "good enough". A clear standard would also focus efforts at
>>> improvement.
>>>
>>> Where are the goal posts?
>>>
>>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>>> is fundamentally counterproductive in an open source setting, as it will
>>> lead to *fewer people* pushing the limits. Extremely few people are
>>> going to get into the nitty-gritty of optimizing things unless they are
>>> staring at code that they can prove does something interesting, but doesn't
>>> run fast enough for their purposes. If people hit a hard limit, more of
>>> them give up and never develop the code that will motivate them to look for
>>> optimizations.
>>>
>>> -Gus
>>>
>>> On Tue, May 16, 2023 at 6:04 AM Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>>> does not change the technical facts or make the veto go away.
>>>>
>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>> a.benedetti@sease.io> wrote:
>>>>
>>>>> Hi all,
>>>>> we have finalized all the options proposed by the community and we are
>>>>> ready to vote for the preferred one and then proceed with the
>>>>> implementation.
>>>>>
>>>>> *Option 1*
>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>> *Motivation*:
>>>>> We are close to improving on many fronts. Given the criticality of
>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>> most active stewards of the project, I think we should keep working toward
>>>>> improving the feature as is and move to up the limit after we can
>>>>> demonstrate improvement unambiguously.
>>>>>
>>>>> *Option 2*
>>>>> make the limit configurable, for example through a system property
>>>>> *Motivation*:
>>>>> The system administrator can enforce a limit its users need to respect
>>>>> that it's in line with whatever the admin decided to be acceptable for
>>>>> them.
>>>>> The default can stay the current one.
>>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>>> and any sort of plugin development
>>>>>
>>>>> *Option 3*
>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>> implementation. Once there, this limit would not bind any other potential
>>>>> vector engine alternative/evolution.
>>>>> *Motivation:* There seem to be contradictory performance
>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>> performance ok, some not, and it depends on the target data set and use
>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>> other use-cases) to be based on a lower limit.
>>>>>
>>>>> *Option 4*
>>>>> Make it configurable and move it to an appropriate place.
>>>>> In particular, a
>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>> enough.
>>>>> *Motivation*:
>>>>> Both are good and not mutually exclusive and could happen in any order.
>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>
>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>> implementation.
>>>>> --------------------------
>>>>> *Alessandro Benedetti*
>>>>> Director @ Sease Ltd.
>>>>> *Apache Lucene/Solr Committer*
>>>>> *Apache Solr PMC Member*
>>>>>
>>>>> e-mail: a.benedetti@sease.io
>>>>>
>>>>>
>>>>> *Sease* - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>>
>>>>> Website: Sease.io <http://sease.io/>
>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>> <https://github.com/seaseltd>
>>>>>
>>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>>>
>>
Re: [VOTE] Dimension Limit for KNN Vectors
I think I've said before on this list that we don't actually enforce the
limit in any way that can't easily be circumvented by a user. The codec
already supports any size vector; it doesn't impose any limit. The way the
API is written, you can *already today* create an index with max-int-sized
vectors, and we are committed to supporting that going forward by our
backwards compatibility policy, as Robert points out. This wasn't
intentional, I think, but those are the facts.

Given that, I think this whole discussion is not really necessary.
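For reference, the Option 4 mechanism quoted below amounts to something like the following sketch. Only the property name "lucene.hnsw.maxDimensions" and the 1024 default come from the proposal itself; the class and method names here are hypothetical, assumed for illustration.

```java
// Hypothetical sketch of Option 4 from the vote: read the maximum vector
// dimension from a system property, defaulting to the current 1024.
// Class and method names are illustrative, not Lucene's actual API.
public final class HnswLimits {

    // Integer.getInteger looks up a system property and parses it as an
    // int, returning the supplied default when the property is unset.
    public static final int MAX_DIMENSIONS =
        Integer.getInteger("lucene.hnsw.maxDimensions", 1024);

    private HnswLimits() {}

    // Write-time guard: reject vectors outside the configured limit.
    public static void checkDimension(int dim) {
        if (dim <= 0 || dim > MAX_DIMENSIONS) {
            throw new IllegalArgumentException(
                "vector dimension " + dim + " is outside (0, " + MAX_DIMENSIONS + "]");
        }
    }
}
```

Run with e.g. `-Dlucene.hnsw.maxDimensions=2048` to raise the limit; the default behavior is unchanged, which is the "toggle that doesn't bind Lucene's APIs" described in the motivation.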

On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other potential
> vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but I've
> not seen an argument _against_ configurability. Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
Re: [VOTE] Dimension Limit for KNN Vectors
> easily be circumvented by a user

This is a revelation to me and others, if true. Michael, please then point
to a test or code snippet that shows the Lucene user community what they
want to see so they are unblocked from their explorations of vector search.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msokolov@gmail.com> wrote:

> I think I've said before on this list we don't actually enforce the limit
> in any way that can't easily be circumvented by a user. The codec already
> supports any size vector - it doesn't impose any limit. The way the API is
> written you can *already today* create an index with max-int sized vectors
> and we are committed to supporting that going forward by our backwards
> compatibility policy as Robert points out. This wasn't intentional, I
> think, but it is the facts.
>
> Given that, I think this whole discussion is not really necessary.
>
> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
>> Hi all,
>> we have finalized all the options proposed by the community and we are
>> ready to vote for the preferred one and then proceed with the
>> implementation.
>>
>> *Option 1*
>> Keep it as it is (dimension limit hardcoded to 1024)
>> *Motivation*:
>> We are close to improving on many fronts. Given the criticality of Lucene
>> in computing infrastructure and the concerns raised by one of the most
>> active stewards of the project, I think we should keep working toward
>> improving the feature as is and move to up the limit after we can
>> demonstrate improvement unambiguously.
>>
>> *Option 2*
>> make the limit configurable, for example through a system property
>> *Motivation*:
>> The system administrator can enforce a limit its users need to respect
>> that it's in line with whatever the admin decided to be acceptable for
>> them.
>> The default can stay the current one.
>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>> and any sort of plugin development
>>
>> *Option 3*
>> Move the max dimension limit down to an HNSW-specific implementation.
>> Once there, this limit would not bind any other potential vector engine
>> alternative/evolution.
>> *Motivation:* There seem to be contradictory performance interpretations
>> about the current HNSW implementation. Some consider its performance ok,
>> some not, and it depends on the target data set and use case. Increasing
>> the max dimension limit where it is currently (in top level
>> FloatVectorValues) would not allow potential alternatives (e.g. for other
>> use-cases) to be based on a lower limit.
>>
>> *Option 4*
>> Make it configurable and move it to an appropriate place.
>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>> 1024) should be enough.
>> *Motivation*:
>> Both are good and not mutually exclusive and could happen in any order.
>> Someone suggested perfecting what the _default_ limit should be, but I've
>> not seen an argument _against_ configurability -- especially in this form:
>> a toggle that doesn't bind Lucene's APIs in any way.
>>
>> I'll keep this [VOTE] open for a week and then proceed to the
>> implementation.
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
see https://markmail.org/message/kf4nzoqyhwacb7ri

On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmiley@apache.org> wrote:

> > easily be circumvented by a user
>
> If true, this is a revelation to me and others. Michael, please point to
> a test or code snippet that shows the Lucene user community how to do
> this, so they are unblocked in their explorations of vector search.
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
Thanks, Michael,
that example backs even more strongly the need to clean this up and make
the limit configurable without requiring custom field types. Taking another
look at the code, the limit also seems to be checked twice: in
org.apache.lucene.document.KnnByteVectorField#createType and then in
org.apache.lucene.document.FieldType#setVectorAttributes (for both the byte
and float variants).
This should help people vote, great!
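For voters who want to see concretely what "easily circumvented" means: since the checks live in the field classes rather than in the codec, a FieldType that overrides the vector accessors never triggers them. A rough sketch, assuming the Lucene 9.x API (the class and method names here are illustrative, and this reconstructs the pattern being discussed rather than quoting the linked snippet):

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.index.VectorSimilarityFunction;

final class BigVectors {
    // Build a KnnFloatVectorField of arbitrary dimension. The MAX_DIMENSIONS
    // check happens in FieldType#setVectorAttributes and in the field's
    // createType helper; by overriding the accessors directly, neither is
    // ever called, and the codec itself imposes no limit.
    static KnnFloatVectorField bigVectorField(
            String name, float[] vector, VectorSimilarityFunction sim) {
        FieldType ft = new FieldType() {
            @Override public int vectorDimension() { return vector.length; }
            @Override public VectorEncoding vectorEncoding() { return VectorEncoding.FLOAT32; }
            @Override public VectorSimilarityFunction vectorSimilarityFunction() { return sim; }
        };
        ft.freeze();
        // The (name, vector, FieldType) constructor validates the vector
        // against the FieldType but does not appear to re-check MAX_DIMENSIONS.
        return new KnnFloatVectorField(name, vector, ft);
    }
}
```

Whether indexing and search then behave acceptably at such dimensions is, of course, exactly what this thread disputes.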

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Wed, 17 May 2023 at 15:42, Michael Sokolov <msokolov@gmail.com> wrote:

> see https://markmail.org/message/kf4nzoqyhwacb7ri
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
Alessandro,
Thanks for raising the code of conduct; it is very discouraging and
intimidating to participate in discussions where such language is used,
especially by senior members.

Michael S.,
thanks for your suggestion; that's what we used in Elasticsearch to raise
the dims limit. Alessandro, perhaps you can use it in Solr as well for the
time being.

On Wed, May 17, 2023 at 11:03 AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Thanks, Michael,
> that example backs even more strongly the need of cleaning it up and
> making the limit configurable without the need for custom field types I
> guess (I was taking a look at the code again, and it seems the limit is
> also checked twice:
> in org.apache.lucene.document.KnnByteVectorField#createType and then
> in org.apache.lucene.document.FieldType#setVectorAttributes (for both byte
> and float variants).
> This should help people vote, great!
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
I am trying to better understand the code. IIUC, vector MAX_DIMENSIONS is
currently used inside:

lucene/core/src/java/org/apache/lucene/document/FieldType.java
lucene/core/src/java/org/apache/lucene/document/KnnFloatVectorField.java
lucene/core/src/java/org/apache/lucene/document/KnnByteVectorField.java
lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java
public static final int MAX_DIMENSIONS = 1024;
lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.java
public static final int MAX_DIMENSIONS = 1024;

and when you write that it should be moved to the HNSW-specific code, do
you mean somewhere in

lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapByteVectorValues.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapFloatVectorValues.java
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
lucene/core/src/java/org/apache/lucene/util/hnsw/RandomAccessVectorValues.java

?

Thanks

Michael




On 17.05.23 at 03:50, Robert Muir wrote:
> by the way, i agree with the idea to MOVE THE LIMIT UNCHANGED to the
> hnsw-specific code.
>
> This way, someone can write alternative codec with vectors using some
> other completely different approach that incorporates a different more
> appropriate limit (maybe lower, maybe higher) depending upon their
> tradeoffs. We should encourage this as I think it is the "only true
> fix" to the scalability issues: use a scalable algorithm! Also,
> alternative codecs don't force the project into many years of index
> backwards compatibility, which is really my penultimate concern. We
> can lock ourselves into a truly bad place and become irrelevant
> (especially with scalar code implementing all this vector stuff, it is
> really senseless). In the meantime I suggest we try to reduce pain for
> the default codec with the current implementation if possible. If it
> is not possible, we need a new codec that performs.
>
> On Tue, May 16, 2023 at 8:53 PM Robert Muir <rcmuir@gmail.com> wrote:
>
> Gus, I think I explained myself multiple times on issues and in
> this thread. The performance is unacceptable, everyone knows it,
> but nobody is talking about it.
> I don't need to explain myself time and time again here.
> You don't seem to understand the technical issues (at least you
> sure as fuck don't know how service loading works or you wouldn't
> have opened https://github.com/apache/lucene/issues/12300)
>
> I'm just the only one here completely unconstrained by any of
> silicon valley's influences to speak my true mind, without any
> repercussions, so I do it. Don't give any fucks about ChatGPT.
>
> I'm standing by my technical veto. If you bypass it, I'll revert
> the offending commit.
>
> As far as fixing the technical performance, I just opened an issue
> with some ideas to at least improve cpu usage by a factor of N. It
> does not help with the crazy heap memory usage or other issues of
> KNN implementation causing shit like OOM on merge. But it is one
> step: https://github.com/apache/lucene/issues/12302
>
>
>
> On Tue, May 16, 2023 at 7:45 AM Gus Heck <gus.heck@gmail.com> wrote:
>
> Robert,
>
> Can you explain in clear technical terms the standard that
> must be met for performance? A benchmark that must run in X
> time on Y hardware for example (and why that test is
> suitable)? Or some other reproducible criteria? So far I've
> heard you give an *opinion* that it's unusable, but that's not
> a technical criterion; others may have a different concept of
> what is usable to them.
>
> Forgive me if I misunderstand, but the essence of your
> argument has seemed to be
>
> "Performance isn't good enough, therefore we should force
> anyone who wants to experiment with something bigger to fork
> the code base to do it"
>
> Thus, it is necessary to have a clear unambiguous standard
> that anyone can verify for "good enough". A clear standard
> would also focus efforts at improvement.
>
> Where are the goal posts?
>
> FWIW I'm +1 on any of 2-4 since I believe the existence of a
> hard limit is fundamentally counterproductive in an open
> source setting, as it will lead to *fewer people* pushing
> the limits. Extremely few people are going to get into the
> nitty-gritty of optimizing things unless they are staring at
> code that they can prove does something interesting,
> but doesn't run fast enough for their purposes. If people hit
> a hard limit, more of them give up and never develop the code
> that will motivate them to look for optimizations.
>
> -Gus
>
> On Tue, May 16, 2023 at 6:04 AM Robert Muir <rcmuir@gmail.com>
> wrote:
>
> i still feel -1 (veto) on increasing this limit. sending
> more emails does not change the technical facts or make
> the veto go away.
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
Thanks Michael for sharing your code snippet on how to circumvent the
limit. My reaction to this is the same as Alessandro.

I just created a PR to make the limit configurable:
https://github.com/apache/lucene/pull/12306
If there is to be a veto presented to the PR, it should include technical
reasons specific to the PR and be raised on the PR itself.

Afterwards, I leave it to others to move the limit, with its
configurability, to be enforced in a codec-specific way.
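For readers following along, the toggle in question could be as small as the snippet below (a sketch of Option 4's approach; the property name comes from the vote text above, and the final shape of the PR may differ):

```java
// Sketch of Option 4: read the max-dimension limit once from a system
// property, defaulting to the current hardcoded value of 1024.
// "lucene.hnsw.maxDimensions" is the name proposed in the vote, not
// necessarily what the PR ships.
final class HnswLimits {
    static final int MAX_DIMENSIONS =
            Integer.getInteger("lucene.hnsw.maxDimensions", 1024);

    private HnswLimits() {}
}
```

An operator would then raise the limit with e.g. `java -Dlucene.hnsw.maxDimensions=2048 ...`, without any API change.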

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 12:58 PM Mayya Sharipova
<mayya.sharipova@elastic.co.invalid> wrote:

> Alessandro,
> Thanks for raising the code of conduct; it is very discouraging and
> intimidating to participate in discussions where such language is used
> especially by senior members.
>
> Michael S.,
> thanks for your suggestion and that's what we used in Elasticsearch to
> raise dims limit, and Alessandro, perhaps, you can use it as well in Solr
> for the time being.
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
IIUC KnnVectorField is deprecated and one is supposed to use
KnnFloatVectorField when using float as vector values, right?
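For reference, usage of the typed replacement looks roughly like this (a sketch assuming the Lucene 9.x API; the field name and values are illustrative, and KnnByteVectorField is the byte-valued analogue):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

final class VectorDocExample {
    // Build a document carrying a float vector via the typed field class.
    static Document vectorDoc() {
        Document doc = new Document();
        doc.add(new KnnFloatVectorField(
                "embedding",                      // illustrative field name
                new float[] {0.1f, 0.2f, 0.3f},   // 3-dim toy vector
                VectorSimilarityFunction.COSINE));
        return doc;
    }
}
```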

On 17.05.23 at 16:41, Michael Sokolov wrote:
> see https://markmail.org/message/kf4nzoqyhwacb7ri
>
> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org> wrote:
>
> > easily be circumvented by a user
>
> This is a revelation to me and others, if true. Michael, please
> then point to a test or code snippet that shows the Lucene user
> community what they want to see so they are unblocked from their
> explorations of vector search.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
> <msokolov@gmail.com> wrote:
>
> I think I've said before on this list we don't actually
> enforce the limit in any way that can't easily be circumvented
> by a user. The codec already supports any size vector - it
> doesn't impose any limit. The way the API is written you can
> *already today* create an index with max-int sized vectors and
> we are committed to supporting that going forward by our
> backwards compatibility policy as Robert points out. This
> wasn't intentional, I think, but it is the facts.
>
> Given that, I think this whole discussion is not really necessary.
>
> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> Hi all,
> we have finalized all the options proposed by the
> community and we are ready to vote for the preferred one
> and then proceed with the implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the
> criticality of Lucene in computing infrastructure and the
> concerns raised by one of the most active stewards of the
> project, I think we should keep working toward improving
> the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system
> property
> *Motivation*:
> The system administrator can enforce a limit its users
> need to respect that it's in line with whatever the admin
> decided to be acceptable for them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch,
> OpenSearch, and any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW
> specific implementation. Once there, this limit would not
> bind any other potential vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance
> interpretations about the current HNSW implementation.
> Some consider its performance ok, some not, and it depends
> on the target data set and use case. Increasing the max
> dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives
> (e.g. for other use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a
> simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen
> in any order.
> Someone suggested to perfect what the _default_ limit
> should be, but I've not seen an argument _against_
> configurability.  Especially in this way -- a toggle that
> doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to
> the implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> |
> Twitter <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
> Github <https://github.com/seaseltd>
>
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
I just implemented it and tested it with OpenAI's
text-embedding-ada-002, which is using 1536 dimensions and it works very
fine :-)

Thanks

Michael



Am 18.05.23 um 00:29 schrieb Michael Wechner:
> IIUC KnnVectorField is deprecated and one is supposed to use
> KnnFloatVectorField when using float as vector values, right?
>
> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>
>> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmiley@apache.org> wrote:
>>
>> > easily be circumvented by a user
>>
>> This is a revelation to me and others, if true. Michael, please
>> then point to a test or code snippet that shows the Lucene user
>> community what they want to see so they are unblocked from their
>> explorations of vector search.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
>> <msokolov@gmail.com> wrote:
>>
>> I think I've said before on this list we don't actually
>> enforce the limit in any way that can't easily be
>> circumvented by a user. The codec already supports any size
>> vector - it doesn't impose any limit. The way the API is
>> written you can *already today* create an index with max-int
>> sized vectors and we are committed to supporting that going
>> forward by our backwards compatibility policy as Robert
>> points out. This wasn't intentional, I think, but it is the
>> facts.
>>
>> Given that, I think this whole discussion is not really
>> necessary.
>>
>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>>
>> Hi all,
>> we have finalized all the options proposed by the
>> community and we are ready to vote for the preferred one
>> and then proceed with the implementation.
>>
>> *Option 1*
>> Keep it as it is (dimension limit hardcoded to 1024)
>> *Motivation*:
>> We are close to improving on many fronts. Given the
>> criticality of Lucene in computing infrastructure and the
>> concerns raised by one of the most active stewards of the
>> project, I think we should keep working toward improving
>> the feature as is and move to up the limit after we can
>> demonstrate improvement unambiguously.
>>
>> *Option 2*
>> make the limit configurable, for example through a system
>> property
>> *Motivation*:
>> The system administrator can enforce a limit its users
>> need to respect that it's in line with whatever the admin
>> decided to be acceptable for them.
>> The default can stay the current one.
>> This should open the doors for Apache Solr,
>> Elasticsearch, OpenSearch, and any sort of plugin development
>>
>> *Option 3*
>> Move the max dimension limit lower level to a HNSW
>> specific implementation. Once there, this limit would not
>> bind any other potential vector engine
>> alternative/evolution.
>> *Motivation:* There seem to be contradictory performance
>> interpretations about the current HNSW implementation.
>> Some consider its performance ok, some not, and it
>> depends on the target data set and use case. Increasing
>> the max dimension limit where it is currently (in top
>> level FloatVectorValues) would not allow
>> potential alternatives (e.g. for other use-cases) to be
>> based on a lower limit.
>>
>> *Option 4*
>> Make it configurable and move it to an appropriate place.
>> In particular, a
>> simple Integer.getInteger("lucene.hnsw.maxDimensions",
>> 1024) should be enough.
>> *Motivation*:
>> Both are good and not mutually exclusive and could happen
>> in any order.
>> Someone suggested to perfect what the _default_ limit
>> should be, but I've not seen an argument _against_
>> configurability.  Especially in this way -- a toggle that
>> doesn't bind Lucene's APIs in any way.
>>
>> I'll keep this [VOTE] open for a week and then proceed to
>> the implementation.
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> /Apache Lucene/Solr Committer/
>> /Apache Solr PMC Member/
>>
>> e-mail: a.benedetti@sease.io
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> |
>> Twitter <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
>> Github <https://github.com/seaseltd>
>>
>
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
That sounds promising, Michael. Can you share scripts/steps/code to
reproduce this?

On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <michael.wechner@wyona.com>
wrote:

> I just implemented it and tested it with OpenAI's text-embedding-ada-002,
> which is using 1536 dimensions and it works very fine :-)
>
> Thanks
>
> Michael
>
>
>
> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>
> IIUC KnnVectorField is deprecated and one is supposed to use
> KnnFloatVectorField when using float as vector values, right?
>
> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>
> see https://markmail.org/message/kf4nzoqyhwacb7ri
>
> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmiley@apache.org> wrote:
>
>> > easily be circumvented by a user
>>
>> This is a revelation to me and others, if true. Michael, please then
>> point to a test or code snippet that shows the Lucene user community what
>> they want to see so they are unblocked from their explorations of vector
>> search.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>>
>>> I think I've said before on this list we don't actually enforce the
>>> limit in any way that can't easily be circumvented by a user. The codec
>>> already supports any size vector - it doesn't impose any limit. The way the
>>> API is written you can *already today* create an index with max-int sized
>>> vectors and we are committed to supporting that going forward by our
>>> backwards compatibility policy as Robert points out. This wasn't
>>> intentional, I think, but it is the facts.
>>>
>>> Given that, I think this whole discussion is not really necessary.
>>>
>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> Hi all,
>>>> we have finalized all the options proposed by the community and we are
>>>> ready to vote for the preferred one and then proceed with the
>>>> implementation.
>>>>
>>>> *Option 1*
>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>> *Motivation*:
>>>> We are close to improving on many fronts. Given the criticality of
>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>> most active stewards of the project, I think we should keep working toward
>>>> improving the feature as is and move to up the limit after we can
>>>> demonstrate improvement unambiguously.
>>>>
>>>> *Option 2*
>>>> make the limit configurable, for example through a system property
>>>> *Motivation*:
>>>> The system administrator can enforce a limit its users need to respect
>>>> that it's in line with whatever the admin decided to be acceptable for
>>>> them.
>>>> The default can stay the current one.
>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>> and any sort of plugin development
>>>>
>>>> *Option 3*
>>>> Move the max dimension limit lower level to a HNSW specific
>>>> implementation. Once there, this limit would not bind any other potential
>>>> vector engine alternative/evolution.
>>>> *Motivation:* There seem to be contradictory performance
>>>> interpretations about the current HNSW implementation. Some consider its
>>>> performance ok, some not, and it depends on the target data set and use
>>>> case. Increasing the max dimension limit where it is currently (in top
>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>> other use-cases) to be based on a lower limit.
>>>>
>>>> *Option 4*
>>>> Make it configurable and move it to an appropriate place.
>>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>>> 1024) should be enough.
>>>> *Motivation*:
>>>> Both are good and not mutually exclusive and could happen in any order.
>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>> I've not seen an argument _against_ configurability. Especially in this
>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>
>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>> implementation.
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>
>
>
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
That's great and a good plan B, but let's try to keep this thread focused on
collecting votes for a week (let's keep discussions on the nice PR opened
by David or on the discussion thread we already have in the mailing list :)

On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <ichattopadhyaya@gmail.com>
wrote:

> That sounds promising, Michael. Can you share scripts/steps/code to
> reproduce this?
>
> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <michael.wechner@wyona.com>
> wrote:
>
>> I just implemented it and tested it with OpenAI's text-embedding-ada-002,
>> which is using 1536 dimensions and it works very fine :-)
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>>
>> IIUC KnnVectorField is deprecated and one is supposed to use
>> KnnFloatVectorField when using float as vector values, right?
>>
>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>
>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>
>> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmiley@apache.org> wrote:
>>
>>> > easily be circumvented by a user
>>>
>>> This is a revelation to me and others, if true. Michael, please then
>>> point to a test or code snippet that shows the Lucene user community what
>>> they want to see so they are unblocked from their explorations of vector
>>> search.
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>>
>>>> I think I've said before on this list we don't actually enforce the
>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>> API is written you can *already today* create an index with max-int sized
>>>> vectors and we are committed to supporting that going forward by our
>>>> backwards compatibility policy as Robert points out. This wasn't
>>>> intentional, I think, but it is the facts.
>>>>
>>>> Given that, I think this whole discussion is not really necessary.
>>>>
>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>> a.benedetti@sease.io> wrote:
>>>>
>>>>> Hi all,
>>>>> we have finalized all the options proposed by the community and we are
>>>>> ready to vote for the preferred one and then proceed with the
>>>>> implementation.
>>>>>
>>>>> *Option 1*
>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>> *Motivation*:
>>>>> We are close to improving on many fronts. Given the criticality of
>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>> most active stewards of the project, I think we should keep working toward
>>>>> improving the feature as is and move to up the limit after we can
>>>>> demonstrate improvement unambiguously.
>>>>>
>>>>> *Option 2*
>>>>> make the limit configurable, for example through a system property
>>>>> *Motivation*:
>>>>> The system administrator can enforce a limit its users need to respect
>>>>> that it's in line with whatever the admin decided to be acceptable for
>>>>> them.
>>>>> The default can stay the current one.
>>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>>> and any sort of plugin development
>>>>>
>>>>> *Option 3*
>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>> implementation. Once there, this limit would not bind any other potential
>>>>> vector engine alternative/evolution.
>>>>> *Motivation:* There seem to be contradictory performance
>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>> performance ok, some not, and it depends on the target data set and use
>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>> other use-cases) to be based on a lower limit.
>>>>>
>>>>> *Option 4*
>>>>> Make it configurable and move it to an appropriate place.
>>>>> In particular, a
>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>> enough.
>>>>> *Motivation*:
>>>>> Both are good and not mutually exclusive and could happen in any order.
>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>
>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>> implementation.
>>>>> --------------------------
>>>>> *Alessandro Benedetti*
>>>>> Director @ Sease Ltd.
>>>>> *Apache Lucene/Solr Committer*
>>>>> *Apache Solr PMC Member*
>>>>>
>>>>> e-mail: a.benedetti@sease.io
>>>>>
>>>>>
>>>>> *Sease* - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>>
>>>>> Website: Sease.io <http://sease.io/>
>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>> <https://github.com/seaseltd>
>>>>>
>>>>
>>
>>
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
This isn't really a VOTE (no specific code change is being proposed), but
rather a poll?

Anyway, I would prefer Option 3: put the limit check into the HNSW
algorithm itself. This is the right place for the limit check, since HNSW
has its own scaling behaviour. It might have other limits, like max
fanout, etc. And we really should fix the loophole Mike S posted -- that's
just a dangerous long-term trap for users, thinking they have the back
compat promise of Lucene, when in fact they do not.

I love all the energy and passion going into debating all the ways to poke
at this limit, but please let's also spend some of this passion on actually
improving the scalability of our aKNN implementation! E.g. Robert opened
an exciting "Plan B" ( https://github.com/apache/lucene/issues/12302 ) to
work around OpenJDK's crazy slowness in enabling access to vectorized SIMD
CPU instructions (the Java Vector API, JEP 426: https://openjdk.org/jeps/426
). This could help postings and doc values performance too!

Mike McCandless

http://blog.mikemccandless.com


On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> That's great and a good plan B, but let's try to focus this thread of
> collecting votes for a week (let's keep discussions on the nice PR opened
> by David or the discussion thread we have in the mailing list already :)
>
> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <
> ichattopadhyaya@gmail.com> wrote:
>
>> That sounds promising, Michael. Can you share scripts/steps/code to
>> reproduce this?
>>
>> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <michael.wechner@wyona.com>
>> wrote:
>>
>>> I just implemented it and tested it with OpenAI's
>>> text-embedding-ada-002, which is using 1536 dimensions and it works very
>>> fine :-)
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>>
>>> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>>>
>>> IIUC KnnVectorField is deprecated and one is supposed to use
>>> KnnFloatVectorField when using float as vector values, right?
>>>
>>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>>
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmiley@apache.org>
>>> wrote:
>>>
>>>> > easily be circumvented by a user
>>>>
>>>> This is a revelation to me and others, if true. Michael, please then
>>>> point to a test or code snippet that shows the Lucene user community what
>>>> they want to see so they are unblocked from their explorations of vector
>>>> search.
>>>>
>>>> ~ David Smiley
>>>> Apache Lucene/Solr Search Developer
>>>> http://www.linkedin.com/in/davidwsmiley
>>>>
>>>>
>>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msokolov@gmail.com>
>>>> wrote:
>>>>
>>>>> I think I've said before on this list we don't actually enforce the
>>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>>> API is written you can *already today* create an index with max-int sized
>>>>> vectors and we are committed to supporting that going forward by our
>>>>> backwards compatibility policy as Robert points out. This wasn't
>>>>> intentional, I think, but it is the facts.
>>>>>
>>>>> Given that, I think this whole discussion is not really necessary.
>>>>>
>>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>>> a.benedetti@sease.io> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> we have finalized all the options proposed by the community and we
>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>> implementation.
>>>>>>
>>>>>> *Option 1*
>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>> *Motivation*:
>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>> most active stewards of the project, I think we should keep working toward
>>>>>> improving the feature as is and move to up the limit after we can
>>>>>> demonstrate improvement unambiguously.
>>>>>>
>>>>>> *Option 2*
>>>>>> make the limit configurable, for example through a system property
>>>>>> *Motivation*:
>>>>>> The system administrator can enforce a limit its users need to
>>>>>> respect that it's in line with whatever the admin decided to be acceptable
>>>>>> for them.
>>>>>> The default can stay the current one.
>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>> OpenSearch, and any sort of plugin development
>>>>>>
>>>>>> *Option 3*
>>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>> vector engine alternative/evolution.
>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>>> performance ok, some not, and it depends on the target data set and use
>>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>> other use-cases) to be based on a lower limit.
>>>>>>
>>>>>> *Option 4*
>>>>>> Make it configurable and move it to an appropriate place.
>>>>>> In particular, a
>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>> enough.
>>>>>> *Motivation*:
>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>> order.
>>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>
>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>> implementation.
>>>>>> --------------------------
>>>>>> *Alessandro Benedetti*
>>>>>> Director @ Sease Ltd.
>>>>>> *Apache Lucene/Solr Committer*
>>>>>> *Apache Solr PMC Member*
>>>>>>
>>>>>> e-mail: a.benedetti@sease.io
>>>>>>
>>>>>>
>>>>>> *Sease* - Information Retrieval Applied
>>>>>> Consulting | Training | Open Source
>>>>>>
>>>>>> Website: Sease.io <http://sease.io/>
>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>> <https://github.com/seaseltd>
>>>>>>
>>>>>
>>>
>>>
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
It is basically the code which Michael Sokolov posted at

https://markmail.org/message/kf4nzoqyhwacb7ri

except
 - that I have replaced KnnVectorField by KnnFloatVectorField, because
KnnVectorField is deprecated.
 - that I don't hard-code the dimension as 2048 and the metric as
EUCLIDEAN, but take the dimension and metric (VectorSimilarityFunction)
used by the model, which in the case of, for example,
text-embedding-ada-002 are 1536 and COSINE
(https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use)

HTH

Michael
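For reference, the KnnVectorField -> KnnFloatVectorField change amounts to
something like the following. This is only a sketch under the assumptions
above (Lucene 9.x on the classpath, and the limit-bypass part of Michael
Sokolov's original snippet already in place); the field name "embedding" is
illustrative:

```java
// Sketch of the adaptation described above, assuming Lucene 9.x.
// Only the field construction is shown.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

public class AdaEmbeddingDoc {
    static Document toDocument(float[] embedding) {
        // text-embedding-ada-002 produces 1536-dimensional vectors and is
        // intended for cosine similarity, so pass COSINE rather than
        // hard-coding EUCLIDEAN and 2048 as in the original snippet.
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("embedding", embedding,
                VectorSimilarityFunction.COSINE));
        return doc;
    }
}
```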



Am 18.05.23 um 11:10 schrieb Ishan Chattopadhyaya:
> That sounds promising, Michael. Can you share scripts/steps/code to
> reproduce this?
>
> On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
> <michael.wechner@wyona.com> wrote:
>
> I just implemented it and tested it with OpenAI's
> text-embedding-ada-002, which is using 1536 dimensions and it
> works very fine :-)
>
> Thanks
>
> Michael
>
>
>
> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>> IIUC KnnVectorField is deprecated and one is supposed to use
>> KnnFloatVectorField when using float as vector values, right?
>>
>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09 AM David Smiley
>>> <dsmiley@apache.org> wrote:
>>>
>>> > easily be circumvented by a user
>>>
>>> This is a revelation to me and others, if true.  Michael,
>>> please then point to a test or code snippet that shows the
>>> Lucene user community what they want to see so they are
>>> unblocked from their explorations of vector search.
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
>>> <msokolov@gmail.com> wrote:
>>>
>>> I think I've said before on this list we don't actually
>>> enforce the limit in any way that can't easily be
>>> circumvented by a user. The codec already supports any
>>> size vector - it doesn't impose any limit. The way the
>>> API is written you can *already today* create an index
>>> with max-int sized vectors and we are committed to
>>> supporting that going forward by our backwards
>>> compatibility policy as Robert points out. This wasn't
>>> intentional, I think, but it is the facts.
>>>
>>> Given that, I think this whole discussion is not really
>>> necessary.
>>>
>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
>>> <a.benedetti@sease.io> wrote:
>>>
>>> Hi all,
>>> we have finalized all the options proposed by the
>>> community and we are ready to vote for the preferred
>>> one and then proceed with the implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the
>>> criticality of Lucene in computing infrastructure
>>> and the concerns raised by one of the most active
>>> stewards of the project, I think we should keep
>>> working toward improving the feature as is and move
>>> to up the limit after we can demonstrate improvement
>>> unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a
>>> system property
>>> *Motivation*:
>>> The system administrator can enforce a limit its
>>> users need to respect that it's in line with
>>> whatever the admin decided to be acceptable for them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr,
>>> Elasticsearch, OpenSearch, and any sort of plugin
>>> development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW
>>> specific implementation. Once there, this limit
>>> would not bind any other potential vector engine
>>> alternative/evolution.
>>> *Motivation:* There seem to be contradictory
>>> performance interpretations about the current HNSW
>>> implementation. Some consider its performance ok,
>>> some not, and it depends on the target data set and
>>> use case. Increasing the max dimension limit where
>>> it is currently (in top level FloatVectorValues)
>>> would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate
>>> place.
>>> In particular, a
>>> simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and could
>>> happen in any order.
>>> Someone suggested to perfect what the _default_
>>> limit should be, but I've not seen an argument
>>> _against_ configurability.  Especially in this way
>>> -- a toggle that doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and then
>>> proceed to the implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> /Apache Lucene/Solr Committer/
>>> /Apache Solr PMC Member/
>>>
>>> e-mail: a.benedetti@sease.io
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> |
>>> Twitter <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
>>> Github <https://github.com/seaseltd>
>>>
>>
>
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
Am 18.05.23 um 12:22 schrieb Michael McCandless:
>
> I love all the energy and passion going into debating all the ways to
> poke at this limit, but please let's also spend some of this passion
> on actually improving the scalability of our aKNN implementation! 
> E.g. Robert opened an exciting "Plan B" (
> https://github.com/apache/lucene/issues/12302 ) to workaround
> OpenJDK's crazy slowness on enabling access to vectorized SIMD CPU
> instructions (the Java Vector API, JEP 426:
> https://openjdk.org/jeps/426 ).  This could help postings and doc
> values performance too!


agreed, but I do not think the MAX_DIMENSIONS decision should depend on
this, because whatever improvements are eventually accomplished, there
will very likely always be some limit.

Thanks

Michael

>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> That's great and a good plan B, but let's try to focus this thread
> of collecting votes for a week (let's keep discussions on the nice
> PR opened by David or the discussion thread we have in the mailing
> list already :)
>
> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya,
> <ichattopadhyaya@gmail.com> wrote:
>
> That sounds promising, Michael. Can you share
> scripts/steps/code to reproduce this?
>
> On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
> <michael.wechner@wyona.com> wrote:
>
> I just implemented it and tested it with OpenAI's
> text-embedding-ada-002, which is using 1536 dimensions and
> it works very fine :-)
>
> Thanks
>
> Michael
>
>
>
> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>> IIUC KnnVectorField is deprecated and one is supposed to
>> use KnnFloatVectorField when using float as vector
>> values, right?
>>
>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09 AM David Smiley
>>> <dsmiley@apache.org> wrote:
>>>
>>> > easily be circumvented by a user
>>>
>>> This is a revelation to me and others, if true. 
>>> Michael, please then point to a test or code snippet
>>> that shows the Lucene user community what they want
>>> to see so they are unblocked from their explorations
>>> of vector search.
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
>>> <msokolov@gmail.com> wrote:
>>>
>>> I think I've said before on this list we don't
>>> actually enforce the limit in any way that can't
>>> easily be circumvented by a user. The codec
>>> already supports any size vector - it doesn't
>>> impose any limit. The way the API is written you
>>> can *already today* create an index with max-int
>>> sized vectors and we are committed to supporting
>>> that going forward by our backwards
>>> compatibility policy as Robert points out. This
>>> wasn't intentional, I think, but it is the facts.
>>>
>>> Given that, I think this whole discussion is not
>>> really necessary.
>>>
>>> On Tue, May 16, 2023 at 4:50 AM Alessandro
>>> Benedetti <a.benedetti@sease.io> wrote:
>>>
>>> Hi all,
>>> we have finalized all the options proposed
>>> by the community and we are ready to vote
>>> for the preferred one and then proceed with
>>> the implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded
>>> to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts.
>>> Given the criticality of Lucene in computing
>>> infrastructure and the concerns raised by
>>> one of the most active stewards of the
>>> project, I think we should keep working
>>> toward improving the feature as is, and raise
>>> the limit only after we can demonstrate
>>> improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example
>>> through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit
>>> that its users must respect, in line with
>>> whatever the admin considers acceptable.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr,
>>> Elasticsearch, OpenSearch, and any sort of
>>> plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to
>>> a HNSW specific implementation. Once there,
>>> this limit would not bind any other
>>> potential vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory
>>> performance interpretations about the
>>> current HNSW implementation. Some consider
>>> its performance ok, some not, and it depends
>>> on the target data set and use case.
>>> Increasing the max dimension limit where it
>>> is currently (in top level
>>> FloatVectorValues) would not allow
>>> potential alternatives (e.g. for other
>>> use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an
>>> appropriate place.
>>> In particular, a
>>> simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and
>>> could happen in any order.
>>> Someone suggested perfecting what the
>>> _default_ limit should be, but I've not seen
>>> an argument _against_ configurability.
>>> Especially in this way -- a toggle that
>>> doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and
>>> then proceed to the implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> /Apache Lucene/Solr Committer/
>>> /Apache Solr PMC Member/
>>>
>>> e-mail: a.benedetti@sease.io
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn
>>> <https://linkedin.com/company/sease-ltd> |
>>> Twitter <https://twitter.com/seaseltd> |
>>> Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
>>> Github <https://github.com/seaseltd>
>>>
>>
>
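Option 4's one-liner is concrete enough to sketch. The following is a minimal, hypothetical illustration of the proposed toggle (the property name comes from the Option 4 text above; none of this is actual Lucene code): the limit is read once from a system property, falling back to the current hardcoded 1024.

```java
// Hypothetical sketch of the Option 2/4 proposal: a configurable dimension
// limit read once from a system property, defaulting to the current 1024.
// Not actual Lucene code.
public class MaxDimensionsSketch {
    static final int DEFAULT_MAX_DIMENSIONS = 1024;

    // Integer.getInteger returns the property value if set and parseable,
    // otherwise the supplied default.
    static final int MAX_DIMENSIONS =
        Integer.getInteger("lucene.hnsw.maxDimensions", DEFAULT_MAX_DIMENSIONS);

    static void checkDimensions(float[] vector) {
        if (vector.length > MAX_DIMENSIONS) {
            throw new IllegalArgumentException(
                "vector has " + vector.length + " dimensions; limit is " + MAX_DIMENSIONS);
        }
    }

    public static void main(String[] args) {
        checkDimensions(new float[1024]); // accepted under the default limit
        boolean rejected = false;
        try {
            checkDimensions(new float[1536]); // e.g. text-embedding-ada-002
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        System.out.println("limit = " + MAX_DIMENSIONS + ", 1536 rejected = " + rejected);
    }
}
```

With the property unset, 1536-dimensional vectors are rejected; running with `-Dlucene.hnsw.maxDimensions=2048` would accept them, which is the "toggle that doesn't bind Lucene's APIs" argued for above.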
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
Difficult to keep up with this topic when it's spread across issues, PRs,
and email lists. My poll response is option 3. -1 to option 2; I think the
configuration should be moved to the HNSW-specific implementation. At this
point of technical maturity, it doesn't make sense (to me) for the config
to be a global system property.

Given the conversation fragmentation I'll ask here what I asked in my
comment on the github issue
<https://github.com/apache/lucene/issues/11507#issuecomment-1548612414>.

"Can anyone smart here post their benchmarks to substantiate their claims?"

For as enthusiastic a topic as vector dimensionality is, it sure is
discouraging there isn't empirical data to help make an informed decision
around what the recommended limit should be. I've only seen broad benchmark
claims like "We benchmarked a patched Lucene/Solr. We fully understand (we
measured it :-P)." It sure would be useful to see these benchmarks! Not
having them to help improve these arbitrary limits seems like a serious
disservice to the Lucene/Solr user community. I think until trustworthy
numbers are made available all we'll have is conjecture and opinions.

IMHO, given Java's lag in SIMD vector support I'd rather see equal energy
put into Robert's Vector API Integration, Plan B
<https://github.com/apache/lucene/issues/12302> proposal. I'm not trying to
minimize the importance of adding a configuration to the HNSW
dimensionality; I just think we have the requisite expertise on this
project to fix the larger performance issues that are a direct result of
Java's vector performance deficiencies.

Nicholas Knize, Ph.D., GISP
Principal Engineer - Search | Amazon
Apache Lucene PMC Member and Committer
nknize@apache.org
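The SIMD gap that Robert's "Plan B" and JEP 426 target is easiest to see in the hot loop of vector search: the similarity computation. Below is an illustrative scalar dot-product baseline (not Lucene's actual implementation) of the kind of loop the Java Vector API would vectorize; its cost grows linearly with dimensionality, which is why the dimension limit and SIMD performance are intertwined.

```java
// Scalar dot product: the kind of per-comparison hot loop in vector
// similarity that the Panama Vector API (JEP 426) aims to accelerate with
// SIMD. Illustrative baseline only, not Lucene's actual code.
public class DotProductSketch {
    static float dotProduct(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) { // one multiply-add per dimension
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {4f, 5f, 6f};
        System.out.println(dotProduct(a, b)); // 1*4 + 2*5 + 3*6 = 32.0
    }
}
```

A SIMD implementation would process several lanes of this loop per instruction, which is exactly the speedup the hardcoded dimension limit is implicitly budgeting for today.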


Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
Thanks to everyone involved so far!
I confirm that a proper subject should have been [POLL] rather than [VOTE],
apologies for the confusion.

We are in the middle of the poll and this is the summary so far (ordered by
preference):

Option 2-4: 9 votes
make the limit configurable, potentially moving the limit to the
appropriate place

Option 3: 4 votes
keep the limit as it is (1024) but move it down into the HNSW-specific
implementation

Option 1: 0 votes
keep it as it is (1024)

I've also seen many people responding in the mail thread without
indicating their preference.
I believe it would be very useful if everyone interested expressed their
preference.

Have a good day!
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
I vote for option 3.
Then, as follow-up work, add a simple extension codec in the "codecs"
package which is 1) not backward compatible, and 2) has a higher or
configurable limit. That way users can directly use this codec without any
additional code.
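The suggestion above can be sketched in plain Java. All class names below are hypothetical and real Lucene codecs are far more involved; the point is only the shape of the idea: the default format keeps its hardcoded limit and back-compat guarantee, while an opt-in extension format in the "codecs" package carries its own, higher limit.

```java
// Hypothetical sketch of the extension-codec idea: a separate vectors
// format with its own (higher or configurable) dimension limit, leaving
// the default codec and its back-compat guarantee untouched.
// Illustrative only; these are not actual Lucene APIs.
public class ExtensionCodecSketch {
    interface VectorsFormat {
        int maxDimensions();

        default void validate(float[] vector) {
            if (vector.length > maxDimensions()) {
                throw new IllegalArgumentException(
                    vector.length + " dimensions exceeds limit " + maxDimensions());
            }
        }
    }

    // Default format: keeps the hardcoded 1024 limit.
    static class DefaultVectorsFormat implements VectorsFormat {
        public int maxDimensions() { return 1024; }
    }

    // Opt-in extension format: higher limit, explicitly not back compatible.
    static class HighDimVectorsFormat implements VectorsFormat {
        private final int limit;
        HighDimVectorsFormat(int limit) { this.limit = limit; }
        public int maxDimensions() { return limit; }
    }

    public static void main(String[] args) {
        float[] ada002 = new float[1536]; // e.g. OpenAI text-embedding-ada-002
        new HighDimVectorsFormat(4096).validate(ada002); // accepted
        boolean rejectedByDefault = false;
        try {
            new DefaultVectorsFormat().validate(ada002);
        } catch (IllegalArgumentException e) {
            rejectedByDefault = true;
        }
        System.out.println("rejected by default format: " + rejectedByDefault);
    }
}
```

Users who need larger vectors would select the extension format explicitly, without any change to the default codec.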
Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]
Closing the poll after one week, these are the results:

Option 2-4: 9 votes
make the limit configurable, potentially moving the limit to the
appropriate place

Option 3: 5 votes
keep the limit as it is (1024) but move it down into the HNSW-specific
implementation

Option 1: 0 votes
keep it as it is (1024)

-----
I was expecting more people to express their preferences; unfortunately,
many digressed into discussions without expressing any.
Given that, it seems clear that we want one of the most-voted options, so
let's continue the discussions under the related Pull Requests and then
proceed to merge when agreement is found!

Thanks to everyone involved!


--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


