Mailing List Archive: [VOTE] Dimension Limit for KNN Vectors

[VOTE] Dimension Limit for KNN Vectors

a.benedetti at sease

May 16, 2023, 1:50 AM

Post #1 of 46 (858 views)

Hi all,
we have finalized all the options proposed by the community and we are
ready to vote for the preferred one and then proceed with the
implementation.

*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the criticality of Lucene
in computing infrastructure and the concerns raised by one of the most
active stewards of the project, I think we should keep working toward
improving the feature as is and move to up the limit after we can
demonstrate improvement unambiguously.

*Option 2*
make the limit configurable, for example through a system property
*Motivation*:
The system administrator can enforce a limit its users need to respect that
it's in line with whatever the admin decided to be acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
any sort of plugin development

*Option 3*
Move the max dimension limit lower level to a HNSW specific implementation.
Once there, this limit would not bind any other potential vector engine
alternative/evolution.
*Motivation:* There seem to be contradictory performance interpretations
about the current HNSW implementation. Some consider its performance ok,
some not, and it depends on the target data set and use case. Increasing
the max dimension limit where it is currently (in top level
FloatVectorValues) would not allow potential alternatives (e.g. for other
use-cases) to be based on a lower limit.

*Option 4*
Make it configurable and move it to an appropriate place.
In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
1024) should be enough.
*Motivation*:
Both are good and not mutually exclusive and could happen in any order.
Someone suggested to perfect what the _default_ limit should be, but I've
not seen an argument _against_ configurability. Especially in this way --
a toggle that doesn't bind Lucene's APIs in any way.

I'll keep this [VOTE] open for a week and then proceed to the
implementation.
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

a.benedetti at sease

May 16, 2023, 1:51 AM

Post #2 of 46 (858 views)

My vote goes to *Option 4*.
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

On Tue, 16 May 2023 at 09:50, Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other potential
> vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but I've
> not seen an argument _against_ configurability. Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

michael.wechner at wyona

May 16, 2023, 2:04 AM

Post #3 of 46 (858 views)

Hi Alessandro

Thank you very much for summarizing and starting the vote.

I am not sure whether I really understand the difference between Option
2 and Option 4, or is it just about implementation details?

Thanks

Michael

Am 16.05.23 um 10:50 schrieb Alessandro Benedetti:
> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given?the criticality of
> Lucene in computing infrastructure and the concerns raised by one of
> the most active stewards of the project, I think we should keep
> working toward improving the feature as is and move to up the limit
> after we can demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
> and any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other
> potential vector engine alternative/evolution.*
> *
> *Motivation:*There seem to be contradictory performance
> interpretations about the current HNSW implementation. Some consider
> its performance ok, some not, and it depends on the target data set
> and use case. Increasing the max dimension limit where it is currently
> (in top level FloatVectorValues) would not allow
> potential?alternatives (e.g. for other use-cases) to be based on a
> lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a
> simple?Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
> enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but
> I've not?seen an argument _against_ configurability.? Especially in
> this way -- a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease*?- Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd>?| Twitter
> <https://twitter.com/seaseltd>?| Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ>?| Github
> <https://github.com/seaseltd>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

a.benedetti at sease

May 16, 2023, 2:24 AM

Post #4 of 46 (858 views)

Option 4 also aim to refactor the limit in an appropriate place for the
code (short answer is Yes, implementation details)

Cheers

On Tue, 16 May 2023, 10:04 Michael Wechner, <michael.wechner@wyona.com>
wrote:

> Hi Alessandro
>
> Thank you very much for summarizing and starting the vote.
>
> I am not sure whether I really understand the difference between Option 2
> and Option 4, or is it just about implementation details?
>
> Thanks
>
> Michael
>
>
>
> Am 16.05.23 um 10:50 schrieb Alessandro Benedetti:
>
> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other potential
> vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but I've
> not seen an argument _against_ configurability. Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

a.benedetti at sease

May 16, 2023, 2:28 AM

Post #5 of 46 (858 views)

For simplicity's sake, let's consider Option 2 and 4 as equivalent as they
are not mutually exclusive and just differ on a minor implementation
detail.

On Tue, 16 May 2023, 10:24 Alessandro Benedetti, <a.benedetti@sease.io>
wrote:

> Option 4 also aim to refactor the limit in an appropriate place for the
> code (short answer is Yes, implementation details)
>
> Cheers
>
> On Tue, 16 May 2023, 10:04 Michael Wechner, <michael.wechner@wyona.com>
> wrote:
>
>> Hi Alessandro
>>
>> Thank you very much for summarizing and starting the vote.
>>
>> I am not sure whether I really understand the difference between Option 2
>> and Option 4, or is it just about implementation details?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 16.05.23 um 10:50 schrieb Alessandro Benedetti:
>>
>> Hi all,
>> we have finalized all the options proposed by the community and we are
>> ready to vote for the preferred one and then proceed with the
>> implementation.
>>
>> *Option 1*
>> Keep it as it is (dimension limit hardcoded to 1024)
>> *Motivation*:
>> We are close to improving on many fronts. Given the criticality of Lucene
>> in computing infrastructure and the concerns raised by one of the most
>> active stewards of the project, I think we should keep working toward
>> improving the feature as is and move to up the limit after we can
>> demonstrate improvement unambiguously.
>>
>> *Option 2*
>> make the limit configurable, for example through a system property
>> *Motivation*:
>> The system administrator can enforce a limit its users need to respect
>> that it's in line with whatever the admin decided to be acceptable for
>> them.
>> The default can stay the current one.
>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>> and any sort of plugin development
>>
>> *Option 3*
>> Move the max dimension limit lower level to a HNSW specific
>> implementation. Once there, this limit would not bind any other potential
>> vector engine alternative/evolution.
>> *Motivation:* There seem to be contradictory performance interpretations
>> about the current HNSW implementation. Some consider its performance ok,
>> some not, and it depends on the target data set and use case. Increasing
>> the max dimension limit where it is currently (in top level
>> FloatVectorValues) would not allow potential alternatives (e.g. for other
>> use-cases) to be based on a lower limit.
>>
>> *Option 4*
>> Make it configurable and move it to an appropriate place.
>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>> 1024) should be enough.
>> *Motivation*:
>> Both are good and not mutually exclusive and could happen in any order.
>> Someone suggested to perfect what the _default_ limit should be, but I've
>> not seen an argument _against_ configurability. Especially in this way --
>> a toggle that doesn't bind Lucene's APIs in any way.
>>
>> I'll keep this [VOTE] open for a week and then proceed to the
>> implementation.
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

rcmuir at gmail

May 16, 2023, 3:03 AM

Post #6 of 46 (858 views)

i still feel -1 (veto) on increasing this limit. sending more emails does
not change the technical facts or make the veto go away.

On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other potential
> vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but I've
> not seen an argument _against_ configurability. Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

michael.wechner at wyona

May 16, 2023, 3:05 AM

Post #7 of 46 (858 views)

my non-binding vote goes to Option 2 resp. Option 4

Thanks

Michael Wechner

Am 16.05.23 um 10:51 schrieb Alessandro Benedetti:
> My vote goes to *Option 4*.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease*?- Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd>?| Twitter
> <https://twitter.com/seaseltd>?| Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ>?| Github
> <https://github.com/seaseltd>
>
>
> On Tue, 16 May 2023 at 09:50, Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> Hi all,
> we have finalized all the options proposed by the community and we
> are ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given?the criticality of
> Lucene in computing infrastructure and the concerns raised by one
> of the most active stewards of the project, I think we should keep
> working toward improving the feature as is and move to up the
> limit after we can demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to
> respect that it's in line with whatever the admin decided to be
> acceptable for them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch,
> OpenSearch, and any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other
> potential vector engine alternative/evolution.*
> *
> *Motivation:*There seem to be contradictory performance
> interpretations about the current HNSW implementation. Some
> consider its performance ok, some not, and it depends on the
> target data set and use case. Increasing the max dimension limit
> where it is currently (in top level FloatVectorValues) would not
> allow potential?alternatives (e.g. for other use-cases) to be
> based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a
> simple?Integer.getInteger("lucene.hnsw.maxDimensions", 1024)
> should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any
> order.
> Someone suggested to perfect what the _default_ limit should be,
> but I've not?seen an argument _against_ configurability.?
> Especially in this way -- a toggle that doesn't bind Lucene's APIs
> in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease*?- Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd>?| Twitter
> <https://twitter.com/seaseltd>?| Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ>?|
> Github <https://github.com/seaseltd>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

gus.heck at gmail

May 16, 2023, 4:45 AM

Post #8 of 46 (858 views)

Robert,

Can you explain in clear technical terms the standard that must be met for
performance? A benchmark that must run in X time on Y hardware for example
(and why that test is suitable)? Or some other reproducible criteria? So
far I've heard you give an *opinion* that it's unusable, but that's not a
technical criteria, others may have a different concept of what is usable
to them.

Forgive me if I misunderstand, but the essence of your argument has seemed
to be

"Performance isn't good enough, therefore we should force anyone who wants
to experiment with something bigger to fork the code base to do it"

Thus, it is necessary to have a clear unambiguous standard that anyone can
verify for "good enough". A clear standard would also focus efforts at
improvement.

Where are the goal posts?

FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit is
fundamentally counterproductive in an open source setting, as it will lead
to *fewer people* pushing the limits. Extremely few people are going to get
into the nitty-gritty of optimizing things unless they are staring at code
that they can prove does something interesting, but doesn't run fast enough
for their purposes. If people hit a hard limit, more of them give up and
never develop the code that will motivate them to look for optimizations.

-Gus

On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:

> i still feel -1 (veto) on increasing this limit. sending more emails does
> not change the technical facts or make the veto go away.
>
> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
>> Hi all,
>> we have finalized all the options proposed by the community and we are
>> ready to vote for the preferred one and then proceed with the
>> implementation.
>>
>> *Option 1*
>> Keep it as it is (dimension limit hardcoded to 1024)
>> *Motivation*:
>> We are close to improving on many fronts. Given the criticality of Lucene
>> in computing infrastructure and the concerns raised by one of the most
>> active stewards of the project, I think we should keep working toward
>> improving the feature as is and move to up the limit after we can
>> demonstrate improvement unambiguously.
>>
>> *Option 2*
>> make the limit configurable, for example through a system property
>> *Motivation*:
>> The system administrator can enforce a limit its users need to respect
>> that it's in line with whatever the admin decided to be acceptable for
>> them.
>> The default can stay the current one.
>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>> and any sort of plugin development
>>
>> *Option 3*
>> Move the max dimension limit lower level to a HNSW specific
>> implementation. Once there, this limit would not bind any other potential
>> vector engine alternative/evolution.
>> *Motivation:* There seem to be contradictory performance interpretations
>> about the current HNSW implementation. Some consider its performance ok,
>> some not, and it depends on the target data set and use case. Increasing
>> the max dimension limit where it is currently (in top level
>> FloatVectorValues) would not allow potential alternatives (e.g. for other
>> use-cases) to be based on a lower limit.
>>
>> *Option 4*
>> Make it configurable and move it to an appropriate place.
>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>> 1024) should be enough.
>> *Motivation*:
>> Both are good and not mutually exclusive and could happen in any order.
>> Someone suggested to perfect what the _default_ limit should be, but I've
>> not seen an argument _against_ configurability. Especially in this way --
>> a toggle that doesn't bind Lucene's APIs in any way.
>>
>> I'll keep this [VOTE] open for a week and then proceed to the
>> implementation.
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

michael.wechner at wyona

May 16, 2023, 5:01 AM

Post #9 of 46 (858 views)

+1 to Gus' reply.

I think that Robert's veto or anyone else's veto is fair enough, but I
also think that anyone who is vetoing should be very clear about the
objectives / goals to be achieved, in order to get a +1.

If no clear objectives / goals can be defined and agreed on, then the
whole thing becomes arbitrary.

Therefore I would also be interested to know the objectives / goals to
be met that there will be a +1 re this vote?

Thanks

Michael

Am 16.05.23 um 13:45 schrieb Gus Heck:
> Robert,
>
> Can you explain in clear technical terms the standard that must be met
> for performance? A benchmark that must run in X time on Y hardware for
> example (and why that test is suitable)? Or some other reproducible
> criteria? So far I've heard you give an *opinion* that it's unusable,
> but that's not a technical criteria, others may have a different
> concept of what is usable to them.
>
> Forgive me if I misunderstand, but the essence of your argument has
> seemed to be
>
> "Performance isn't good enough, therefore we should force anyone who
> wants to experiment with something bigger to fork the code base to do it"
>
> Thus, it is necessary to have a clear unambiguous standard that anyone
> can verify for "good enough". A clear standard would also focus
> efforts at improvement.
>
> Where are the goal posts?
>
> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard
> limit is fundamentally counterproductive in an open source setting, as
> it will lead to *fewer people* pushing the limits. Extremely few
> people are going to get into the nitty-gritty of optimizing things
> unless they are staring at code that they can prove does something
> interesting, but doesn't run fast enough for their purposes. If people
> hit a hard limit, more of them give up and never develop the code that
> will motivate them to look for optimizations.
>
> -Gus
>
> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:
>
> i still feel -1 (veto) on increasing this limit. sending more
> emails does not change the technical facts or make the veto go away.
>
> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> Hi all,
> we have finalized all the options proposed by the community
> and we are ready to vote for the preferred one and then
> proceed with the implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the
> criticality of Lucene in computing infrastructure and the
> concerns raised by one of the most active stewards of the
> project, I think we should keep working toward improving the
> feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to
> respect that it's in line with whatever the admin decided to
> be acceptable for them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch,
> OpenSearch, and any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any
> other potential vector engine alternative/evolution.*
> *
> *Motivation:*There seem to be contradictory performance
> interpretations about the current HNSW implementation. Some
> consider its performance ok, some not, and it depends on the
> target data set and use case. Increasing the max dimension
> limit where it is currently (in top level FloatVectorValues)
> would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a
> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024)
> should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in
> any order.
> Someone suggested to perfect what the _default_ limit should
> be, but I've not seen an argument _against_ configurability.
> Especially in this way -- a toggle that doesn't bind Lucene's
> APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
> Github <https://github.com/seaseltd>
>
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

ben.w.trent at gmail

May 16, 2023, 5:26 AM

Post #10 of 46 (858 views)

My vote is for option 3. Prevents Lucene from having the limit increased.
Allows others who implement a different codec to set a limit of their
choosing.

Though I don't know the historical reasons for putting specific
configuration items at the codec level. This limit is performance related
and various codec implementations would have different performance concerns.

On Tue, May 16, 2023, 8:02 AM Michael Wechner <michael.wechner@wyona.com>
wrote:

> +1 to Gus' reply.
>
> I think that Robert's veto or anyone else's veto is fair enough, but I
> also think that anyone who is vetoing should be very clear about the
> objectives / goals to be achieved, in order to get a +1.
>
> If no clear objectives / goals can be defined and agreed on, then the
> whole thing becomes arbitrary.
>
> Therefore I would also be interested to know the objectives / goals to be
> met that there will be a +1 re this vote?
>
> Thanks
>
> Michael
>
>
>
> Am 16.05.23 um 13:45 schrieb Gus Heck:
>
> Robert,
>
> Can you explain in clear technical terms the standard that must be met for
> performance? A benchmark that must run in X time on Y hardware for example
> (and why that test is suitable)? Or some other reproducible criteria? So
> far I've heard you give an *opinion* that it's unusable, but that's not a
> technical criteria, others may have a different concept of what is usable
> to them.
>
> Forgive me if I misunderstand, but the essence of your argument has seemed
> to be
>
> "Performance isn't good enough, therefore we should force anyone who wants
> to experiment with something bigger to fork the code base to do it"
>
> Thus, it is necessary to have a clear unambiguous standard that anyone can
> verify for "good enough". A clear standard would also focus efforts at
> improvement.
>
> Where are the goal posts?
>
> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit is
> fundamentally counterproductive in an open source setting, as it will lead
> to *fewer people* pushing the limits. Extremely few people are going to
> get into the nitty-gritty of optimizing things unless they are staring at
> code that they can prove does something interesting, but doesn't run fast
> enough for their purposes. If people hit a hard limit, more of them give up
> and never develop the code that will motivate them to look for
> optimizations.
>
> -Gus
>
> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:
>
>> i still feel -1 (veto) on increasing this limit. sending more emails does
>> not change the technical facts or make the veto go away.
>>
>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>> a.benedetti@sease.io> wrote:
>>
>>> Hi all,
>>> we have finalized all the options proposed by the community and we are
>>> ready to vote for the preferred one and then proceed with the
>>> implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the criticality of
>>> Lucene in computing infrastructure and the concerns raised by one of the
>>> most active stewards of the project, I think we should keep working toward
>>> improving the feature as is and move to up the limit after we can
>>> demonstrate improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit its users need to respect
>>> that it's in line with whatever the admin decided to be acceptable for
>>> them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>> and any sort of plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW specific
>>> implementation. Once there, this limit would not bind any other potential
>>> vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory performance
>>> interpretations about the current HNSW implementation. Some consider its
>>> performance ok, some not, and it depends on the target data set and use
>>> case. Increasing the max dimension limit where it is currently (in top
>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate place.
>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and could happen in any order.
>>> Someone suggested to perfect what the _default_ limit should be, but
>>> I've not seen an argument _against_ configurability. Especially in this
>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and then proceed to the
>>> implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benedetti@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>
>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

dawid.weiss at gmail

May 16, 2023, 6:37 AM

Post #11 of 46 (858 views)

I'm for option 3 (limit at algorithm level), with the default there tunable
via property (option 4).

I understand Robert's concerns and I'd love to contribute a faster
implementation but the reality is - I can't do it at the moment. I feel
like experiments are good though and we shouldn't just ban people from
trying - if somebody changes the (sane) default and gets burned by
performance, perhaps it'll be an itch to work on speeding things up (much
like it's already happening with Jonathan's patch).

Dawid

On Tue, May 16, 2023 at 10:50?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other potential
> vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but I've
> not seen an argument _against_ configurability. Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

uwe at thetaphi

May 16, 2023, 6:56 AM

Post #12 of 46 (858 views)

I agree with Dawid,

I am +1 for those two options in combination:

* option 3 (make limit an HNSW specific thing). New formats may use
other limits (lower or higher).
* option 4 (make a system property with HNSW prefix). Adding the
system property must be done in same way like new properties for
MMAP directory (including access controller) so it can be denied by
system admin to be set in code (see
https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346
for example). Care has to be taken that the static initializers
won't fail is system properties cannot be read/set (system
adminitrator enforces default -> see mmap code). It also has to be
made sure that an index written with raised limit can still be read
without the limit, so the limit should not be glued into the file
format. Otherwise I disagree with option 4.

In short: I am fine with making it configurable only for HNSW if the
limit is not glued into index format. The default should only be there
to by default prevent people from doing wrong things, but changing
default should not break reading/modifiying those indexes.

Uwe

Am 16.05.2023 um 15:37 schrieb Dawid Weiss:
>
> I'm for option 3 (limit at algorithm level), with the default there
> tunable via property (option 4).
>
> I understand Robert's concerns and I'd love to contribute a faster
> implementation but the reality is - I can't do it at the moment. I
> feel like experiments are good though and we shouldn't just ban people
> from trying - if somebody changes the (sane) default and gets burned
> by performance, perhaps it'll be an itch to work on speeding things up
> (much like it's already happening with Jonathan's patch).
>
> Dawid
>
> On Tue, May 16, 2023 at 10:50?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> Hi all,
> we have finalized all the options proposed by the community and we
> are ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of
> Lucene in computing infrastructure and the concerns raised by one
> of the most active stewards of the project, I think we should keep
> working toward improving the feature as is and move to up the
> limit after we can demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to
> respect that it's in line with whatever the admin decided to be
> acceptable for them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch,
> OpenSearch, and any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other
> potential vector engine alternative/evolution.*
> *
> *Motivation:*There seem to be contradictory performance
> interpretations about the current HNSW implementation. Some
> consider its performance ok, some not, and it depends on the
> target data set and use case. Increasing the max dimension limit
> where it is currently (in top level FloatVectorValues) would not
> allow potential alternatives (e.g. for other use-cases) to be
> based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a
> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024)
> should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any
> order.
> Someone suggested to perfect what the _default_ limit should be,
> but I've not seen an argument _against_ configurability.
> Especially in this way -- a toggle that doesn't bind Lucene's APIs
> in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
> Github <https://github.com/seaseltd>
>
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:uwe@thetaphi.de

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

houston at apache

May 16, 2023, 7:09 AM

Post #13 of 46 (858 views)

+1 on the combination of #3 and #4.

Also good things to make sure of Uwe, thanks for calling those out.
(Especially about the limit only being used on write, not on read).

- Houston

On Tue, May 16, 2023 at 9:57?AM Uwe Schindler <uwe@thetaphi.de> wrote:

> I agree with Dawid,
>
> I am +1 for those two options in combination:
>
> - option 3 (make limit an HNSW specific thing). New formats may use
> other limits (lower or higher).
> - option 4 (make a system property with HNSW prefix). Adding the
> system property must be done in same way like new properties for MMAP
> directory (including access controller) so it can be denied by system admin
> to be set in code (see
> https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346
> for example). Care has to be taken that the static initializers won't fail
> is system properties cannot be read/set (system adminitrator enforces
> default -> see mmap code). It also has to be made sure that an index
> written with raised limit can still be read without the limit, so the limit
> should not be glued into the file format. Otherwise I disagree with option
> 4.
>
> In short: I am fine with making it configurable only for HNSW if the limit
> is not glued into index format. The default should only be there to by
> default prevent people from doing wrong things, but changing default should
> not break reading/modifiying those indexes.
>
> Uwe
> Am 16.05.2023 um 15:37 schrieb Dawid Weiss:
>
>
> I'm for option 3 (limit at algorithm level), with the default there
> tunable via property (option 4).
>
> I understand Robert's concerns and I'd love to contribute a faster
> implementation but the reality is - I can't do it at the moment. I feel
> like experiments are good though and we shouldn't just ban people from
> trying - if somebody changes the (sane) default and gets burned by
> performance, perhaps it'll be an itch to work on speeding things up (much
> like it's already happening with Jonathan's patch).
>
> Dawid
>
> On Tue, May 16, 2023 at 10:50?AM Alessandro Benedetti <
> a.benedetti@sease.io> wrote:
>
>> Hi all,
>> we have finalized all the options proposed by the community and we are
>> ready to vote for the preferred one and then proceed with the
>> implementation.
>>
>> *Option 1*
>> Keep it as it is (dimension limit hardcoded to 1024)
>> *Motivation*:
>> We are close to improving on many fronts. Given the criticality of Lucene
>> in computing infrastructure and the concerns raised by one of the most
>> active stewards of the project, I think we should keep working toward
>> improving the feature as is and move to up the limit after we can
>> demonstrate improvement unambiguously.
>>
>> *Option 2*
>> make the limit configurable, for example through a system property
>> *Motivation*:
>> The system administrator can enforce a limit its users need to respect
>> that it's in line with whatever the admin decided to be acceptable for
>> them.
>> The default can stay the current one.
>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>> and any sort of plugin development
>>
>> *Option 3*
>> Move the max dimension limit lower level to a HNSW specific
>> implementation. Once there, this limit would not bind any other potential
>> vector engine alternative/evolution.
>> *Motivation:* There seem to be contradictory performance interpretations
>> about the current HNSW implementation. Some consider its performance ok,
>> some not, and it depends on the target data set and use case. Increasing
>> the max dimension limit where it is currently (in top level
>> FloatVectorValues) would not allow potential alternatives (e.g. for other
>> use-cases) to be based on a lower limit.
>>
>> *Option 4*
>> Make it configurable and move it to an appropriate place.
>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>> 1024) should be enough.
>> *Motivation*:
>> Both are good and not mutually exclusive and could happen in any order.
>> Someone suggested to perfect what the _default_ limit should be, but I've
>> not seen an argument _against_ configurability. Especially in this way --
>> a toggle that doesn't bind Lucene's APIs in any way.
>>
>> I'll keep this [VOTE] open for a week and then proceed to the
>> implementation.
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremenhttps://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

marcuseagan at gmail

May 16, 2023, 7:26 AM

Post #14 of 46 (858 views)

Given that Robert has put in his veto, aren’t we clear on what we need to
do for him to change his mind? He’s been pretty clear and the rules of veto
are cut and dry.

Most of the people that have contributed to kNN vectors recently are not
even on the thread. I think improving the feature should be the focus of
the Lucene community at this juncture.

On Tue, May 16, 2023 at 7:09 AM Houston Putman <houston@apache.org> wrote:

> +1 on the combination of #3 and #4.
>
> Also good things to make sure of Uwe, thanks for calling those out.
> (Especially about the limit only being used on write, not on read).
>
> - Houston
>
> On Tue, May 16, 2023 at 9:57?AM Uwe Schindler <uwe@thetaphi.de> wrote:
>
>> I agree with Dawid,
>>
>> I am +1 for those two options in combination:
>>
>> - option 3 (make limit an HNSW specific thing). New formats may use
>> other limits (lower or higher).
>> - option 4 (make a system property with HNSW prefix). Adding the
>> system property must be done in same way like new properties for MMAP
>> directory (including access controller) so it can be denied by system admin
>> to be set in code (see
>> https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346
>> for example). Care has to be taken that the static initializers won't fail
>> is system properties cannot be read/set (system adminitrator enforces
>> default -> see mmap code). It also has to be made sure that an index
>> written with raised limit can still be read without the limit, so the limit
>> should not be glued into the file format. Otherwise I disagree with option
>> 4.
>>
>> In short: I am fine with making it configurable only for HNSW if the
>> limit is not glued into index format. The default should only be there to
>> by default prevent people from doing wrong things, but changing default
>> should not break reading/modifiying those indexes.
>>
>> Uwe
>> Am 16.05.2023 um 15:37 schrieb Dawid Weiss:
>>
>>
>> I'm for option 3 (limit at algorithm level), with the default there
>> tunable via property (option 4).
>>
>> I understand Robert's concerns and I'd love to contribute a faster
>> implementation but the reality is - I can't do it at the moment. I feel
>> like experiments are good though and we shouldn't just ban people from
>> trying - if somebody changes the (sane) default and gets burned by
>> performance, perhaps it'll be an itch to work on speeding things up (much
>> like it's already happening with Jonathan's patch).
>>
>> Dawid
>>
>> On Tue, May 16, 2023 at 10:50?AM Alessandro Benedetti <
>> a.benedetti@sease.io> wrote:
>>
>>> Hi all,
>>> we have finalized all the options proposed by the community and we are
>>> ready to vote for the preferred one and then proceed with the
>>> implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the criticality of
>>> Lucene in computing infrastructure and the concerns raised by one of the
>>> most active stewards of the project, I think we should keep working toward
>>> improving the feature as is and move to up the limit after we can
>>> demonstrate improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit its users need to respect
>>> that it's in line with whatever the admin decided to be acceptable for
>>> them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>> and any sort of plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW specific
>>> implementation. Once there, this limit would not bind any other potential
>>> vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory performance
>>> interpretations about the current HNSW implementation. Some consider its
>>> performance ok, some not, and it depends on the target data set and use
>>> case. Increasing the max dimension limit where it is currently (in top
>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate place.
>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and could happen in any order.
>>> Someone suggested to perfect what the _default_ limit should be, but
>>> I've not seen an argument _against_ configurability. Especially in this
>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and then proceed to the
>>> implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benedetti@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>> --
>> Uwe SchindlerAchterdiek 19, D-28357 Bremen <https://www.google.com/maps/search/Achterdiek+19,+D-28357+Bremen?entry=gmail&source=g>https://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>> --
Marcus Eagan

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

a.benedetti at sease

May 16, 2023, 7:46 AM

Post #15 of 46 (858 views)

Hi Marcus,
I am afraid at this stage Robert's opinion counts just as any other
opinion, a single vote for option 1.
We are collecting a community's feedback here, we are not changing any code
nor voting for a yes/no.
Once the voting is finished, we'll operate an action depending on the
community's choice.
If the action involves making a change and someone(Robert or whoever) feels
to veto it, he/she will need to motivate the veto with technical merit.

In response to Uwe point:

>
>> On Tue, May 16, 2023 at 9:57?AM Uwe Schindler <uwe@thetaphi.de> wrote:
>>
>>> I agree with Dawid,
>>>
>>> I am +1 for those two options in combination:
>>>
>>> - option 3 (make limit an HNSW specific thing). New formats may use
>>> other limits (lower or higher).
>>> - option 4 (make a system property with HNSW prefix). Adding the
>>> system property must be done in same way like new properties for MMAP
>>> directory (including access controller) so it can be denied by system admin
>>> to be set in code (see
>>> https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346
>>> for example). Care has to be taken that the static initializers won't fail
>>> is system properties cannot be read/set (system adminitrator enforces
>>> default -> see mmap code). It also has to be made sure that an index
>>> written with raised limit can still be read without the limit, so the limit
>>> should not be glued into the file format. Otherwise I disagree with option
>>> 4.
>>>
>>> In short: I am fine with making it configurable only for HNSW if the
>>> limit is not glued into index format. The default should only be there to
>>> by default prevent people from doing wrong things, but changing default
>>> should not break reading/modifiying those indexes.
>>>
>>> Uwe
>>>
>>> Thanks Uwe, that's very useful!
Just to fully understand it, right now the limit is not written in any file
format, so you just want this behavior to be maintained right?

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

gus.heck at gmail

May 16, 2023, 7:51 AM

Post #16 of 46 (858 views)

Actually, I had wondered if this is a proper vote thread or not, normally
those are yes/no on a single option.

On Tue, May 16, 2023 at 10:47?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Hi Marcus,
> I am afraid at this stage Robert's opinion counts just as any other
> opinion, a single vote for option 1.
> We are collecting a community's feedback here, we are not changing any
> code nor voting for a yes/no.
> Once the voting is finished, we'll operate an action depending on the
> community's choice.
> If the action involves making a change and someone(Robert or whoever)
> feels to veto it, he/she will need to motivate the veto with technical
> merit.
>
> In response to Uwe point:
>
>>
>>> On Tue, May 16, 2023 at 9:57?AM Uwe Schindler <uwe@thetaphi.de> wrote:
>>>
>>>> I agree with Dawid,
>>>>
>>>> I am +1 for those two options in combination:
>>>>
>>>> - option 3 (make limit an HNSW specific thing). New formats may use
>>>> other limits (lower or higher).
>>>> - option 4 (make a system property with HNSW prefix). Adding the
>>>> system property must be done in same way like new properties for MMAP
>>>> directory (including access controller) so it can be denied by system admin
>>>> to be set in code (see
>>>> https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346
>>>> for example). Care has to be taken that the static initializers won't fail
>>>> is system properties cannot be read/set (system adminitrator enforces
>>>> default -> see mmap code). It also has to be made sure that an index
>>>> written with raised limit can still be read without the limit, so the limit
>>>> should not be glued into the file format. Otherwise I disagree with option
>>>> 4.
>>>>
>>>> In short: I am fine with making it configurable only for HNSW if the
>>>> limit is not glued into index format. The default should only be there to
>>>> by default prevent people from doing wrong things, but changing default
>>>> should not break reading/modifiying those indexes.
>>>>
>>>> Uwe
>>>>
>>>> Thanks Uwe, that's very useful!
> Just to fully understand it, right now the limit is not written in any
> file format, so you just want this behavior to be maintained right?
>
>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

a.benedetti at sease

May 16, 2023, 8:06 AM

Post #17 of 46 (858 views)

Even if the options can be basically summarised in two groups: make it
configurable VS not making it configurable and leave it be, when I
collected the options from people I ended up with these four and I didn't
want to collapse any of them (potentially making the proposer feel
diminished).

--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

On Tue, 16 May 2023 at 15:54, Gus Heck <gus.heck@gmail.com> wrote:

> Actually, I had wondered if this is a proper vote thread or not, normally
> those are yes/no on a single option.
>
> On Tue, May 16, 2023 at 10:47?AM Alessandro Benedetti <
> a.benedetti@sease.io> wrote:
>
>> Hi Marcus,
>> I am afraid at this stage Robert's opinion counts just as any other
>> opinion, a single vote for option 1.
>> We are collecting a community's feedback here, we are not changing any
>> code nor voting for a yes/no.
>> Once the voting is finished, we'll operate an action depending on the
>> community's choice.
>> If the action involves making a change and someone(Robert or whoever)
>> feels to veto it, he/she will need to motivate the veto with technical
>> merit.
>>
>> In response to Uwe point:
>>
>>>
>>>> On Tue, May 16, 2023 at 9:57?AM Uwe Schindler <uwe@thetaphi.de> wrote:
>>>>
>>>>> I agree with Dawid,
>>>>>
>>>>> I am +1 for those two options in combination:
>>>>>
>>>>> - option 3 (make limit an HNSW specific thing). New formats may
>>>>> use other limits (lower or higher).
>>>>> - option 4 (make a system property with HNSW prefix). Adding the
>>>>> system property must be done in same way like new properties for MMAP
>>>>> directory (including access controller) so it can be denied by system admin
>>>>> to be set in code (see
>>>>> https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346
>>>>> for example). Care has to be taken that the static initializers won't fail
>>>>> is system properties cannot be read/set (system adminitrator enforces
>>>>> default -> see mmap code). It also has to be made sure that an index
>>>>> written with raised limit can still be read without the limit, so the limit
>>>>> should not be glued into the file format. Otherwise I disagree with option
>>>>> 4.
>>>>>
>>>>> In short: I am fine with making it configurable only for HNSW if the
>>>>> limit is not glued into index format. The default should only be there to
>>>>> by default prevent people from doing wrong things, but changing default
>>>>> should not break reading/modifiying those indexes.
>>>>>
>>>>> Uwe
>>>>>
>>>>> Thanks Uwe, that's very useful!
>> Just to fully understand it, right now the limit is not written in any
>> file format, so you just want this behavior to be maintained right?
>>
>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

jbellis at gmail

May 16, 2023, 4:22 PM

Post #18 of 46 (858 views)

My non-binding vote:

Option 2 = Option 4 > Option 1 > Option 3

Explanation: Lucene's somewhat arbitrary limit of 1024 does not currently
affect the raw, low-level HNSW, which is what I am plugging into
Cassandra. The only option that would break this code is option 3.

P.S. I mentioned this in another thread, but I'm happily throwing OpenAI's
ada embeddings of dimension 1536 at it and I've tested more.

On Tue, May 16, 2023 at 3:50?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other potential
> vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but I've
> not seen an argument _against_ configurability. Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

RE: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

pandu at s2search

May 16, 2023, 4:27 PM

Post #19 of 46 (858 views)

Hi all,

Great to have this discussion!

My votes are for 2 and 4!

Best,

Pandu

On 2023/05/16 08:50:24 Alessandro Benedetti wrote:
> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect that
> it's in line with whatever the admin decided to be acceptable for them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific implementation.
> Once there, this limit would not bind any other potential vector engine
> alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but I've
> not seen an argument _against_ configurability. Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

RE: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

pandu at s2search

May 16, 2023, 4:38 PM

Post #20 of 46 (858 views)

Hi all,

Great to have this discussion!

My votes are for 2 and 4!

Best,

Pandu

On 2023/05/16 08:50:24 Alessandro Benedetti wrote:
> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect that
> it's in line with whatever the admin decided to be acceptable for them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific implementation.
> Once there, this limit would not bind any other potential vector engine
> alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but I've
> not seen an argument _against_ configurability. Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

rcmuir at gmail

May 16, 2023, 5:53 PM

Post #21 of 46 (858 views)

Gus, I think i explained myself multiple times on issues and in this
thread. the performance is unacceptable, everyone knows it, but nobody is
talking about.
I don't need to explain myself time and time again here.
You don't seem to understand the technical issues (at least you sure as
fuck don't know how service loading works or you wouldnt have opened
https://github.com/apache/lucene/issues/12300 ????)

I'm just the only one here completely unconstrained by any of silicon
valley's influences to speak my true mind, without any repercussions, so I
do it. Don't give any fucks about ChatGPT.

I'm standing by my technical veto. If you bypass it, I'll revert the
offending commit.

As far as fixing the technical performance, I just opened an issue with
some ideas to at least improve cpu usage by a factor of N. It does not help
with the crazy heap memory usage or other issues of KNN implementation
causing shit like OOM on merge. But it is one step:
https://github.com/apache/lucene/issues/12302

On Tue, May 16, 2023 at 7:45?AM Gus Heck <gus.heck@gmail.com> wrote:

> Robert,
>
> Can you explain in clear technical terms the standard that must be met for
> performance? A benchmark that must run in X time on Y hardware for example
> (and why that test is suitable)? Or some other reproducible criteria? So
> far I've heard you give an *opinion* that it's unusable, but that's not a
> technical criteria, others may have a different concept of what is usable
> to them.
>
> Forgive me if I misunderstand, but the essence of your argument has seemed
> to be
>
> "Performance isn't good enough, therefore we should force anyone who wants
> to experiment with something bigger to fork the code base to do it"
>
> Thus, it is necessary to have a clear unambiguous standard that anyone can
> verify for "good enough". A clear standard would also focus efforts at
> improvement.
>
> Where are the goal posts?
>
> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit is
> fundamentally counterproductive in an open source setting, as it will lead
> to *fewer people* pushing the limits. Extremely few people are going to
> get into the nitty-gritty of optimizing things unless they are staring at
> code that they can prove does something interesting, but doesn't run fast
> enough for their purposes. If people hit a hard limit, more of them give up
> and never develop the code that will motivate them to look for
> optimizations.
>
> -Gus
>
> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:
>
>> i still feel -1 (veto) on increasing this limit. sending more emails does
>> not change the technical facts or make the veto go away.
>>
>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>> a.benedetti@sease.io> wrote:
>>
>>> Hi all,
>>> we have finalized all the options proposed by the community and we are
>>> ready to vote for the preferred one and then proceed with the
>>> implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the criticality of
>>> Lucene in computing infrastructure and the concerns raised by one of the
>>> most active stewards of the project, I think we should keep working toward
>>> improving the feature as is and move to up the limit after we can
>>> demonstrate improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit its users need to respect
>>> that it's in line with whatever the admin decided to be acceptable for
>>> them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>> and any sort of plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW specific
>>> implementation. Once there, this limit would not bind any other potential
>>> vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory performance
>>> interpretations about the current HNSW implementation. Some consider its
>>> performance ok, some not, and it depends on the target data set and use
>>> case. Increasing the max dimension limit where it is currently (in top
>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate place.
>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and could happen in any order.
>>> Someone suggested to perfect what the _default_ limit should be, but
>>> I've not seen an argument _against_ configurability. Especially in this
>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and then proceed to the
>>> implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benedetti@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

rcmuir at gmail

May 16, 2023, 6:50 PM

Post #22 of 46 (858 views)

by the way, i agree with the idea to MOVE THE LIMIT UNCHANGED to the
hsnw-specific code.

This way, someone can write alternative codec with vectors using some other
completely different approach that incorporates a different more
appropriate limit (maybe lower, maybe higher) depending upon their
tradeoffs. We should encourage this as I think it is the "only true fix" to
the scalability issues: use a scalable algorithm! Also, alternative codecs
don't force the project into many years of index backwards compatibility,
which is really my penultimate concern. We can lock ourselves into a truly
bad place and become irrelevant (especially with scalar code implementing
all this vector stuff, it is really senseless). In the meantime I suggest
we try to reduce pain for the default codec with the current implementation
if possible. If it is not possible, we need a new codec that performs.

On Tue, May 16, 2023 at 8:53?PM Robert Muir <rcmuir@gmail.com> wrote:

> Gus, I think i explained myself multiple times on issues and in this
> thread. the performance is unacceptable, everyone knows it, but nobody is
> talking about.
> I don't need to explain myself time and time again here.
> You don't seem to understand the technical issues (at least you sure as
> fuck don't know how service loading works or you wouldnt have opened
> https://github.com/apache/lucene/issues/12300 ????)
>
> I'm just the only one here completely unconstrained by any of silicon
> valley's influences to speak my true mind, without any repercussions, so I
> do it. Don't give any fucks about ChatGPT.
>
> I'm standing by my technical veto. If you bypass it, I'll revert the
> offending commit.
>
> As far as fixing the technical performance, I just opened an issue with
> some ideas to at least improve cpu usage by a factor of N. It does not help
> with the crazy heap memory usage or other issues of KNN implementation
> causing shit like OOM on merge. But it is one step:
> https://github.com/apache/lucene/issues/12302
>
>
>
> On Tue, May 16, 2023 at 7:45?AM Gus Heck <gus.heck@gmail.com> wrote:
>
>> Robert,
>>
>> Can you explain in clear technical terms the standard that must be met
>> for performance? A benchmark that must run in X time on Y hardware for
>> example (and why that test is suitable)? Or some other reproducible
>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>> that's not a technical criteria, others may have a different concept of
>> what is usable to them.
>>
>> Forgive me if I misunderstand, but the essence of your argument has
>> seemed to be
>>
>> "Performance isn't good enough, therefore we should force anyone who
>> wants to experiment with something bigger to fork the code base to do it"
>>
>> Thus, it is necessary to have a clear unambiguous standard that anyone
>> can verify for "good enough". A clear standard would also focus efforts at
>> improvement.
>>
>> Where are the goal posts?
>>
>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>> is fundamentally counterproductive in an open source setting, as it will
>> lead to *fewer people* pushing the limits. Extremely few people are
>> going to get into the nitty-gritty of optimizing things unless they are
>> staring at code that they can prove does something interesting, but doesn't
>> run fast enough for their purposes. If people hit a hard limit, more of
>> them give up and never develop the code that will motivate them to look for
>> optimizations.
>>
>> -Gus
>>
>> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>> does not change the technical facts or make the veto go away.
>>>
>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> Hi all,
>>>> we have finalized all the options proposed by the community and we are
>>>> ready to vote for the preferred one and then proceed with the
>>>> implementation.
>>>>
>>>> *Option 1*
>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>> *Motivation*:
>>>> We are close to improving on many fronts. Given the criticality of
>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>> most active stewards of the project, I think we should keep working toward
>>>> improving the feature as is and move to up the limit after we can
>>>> demonstrate improvement unambiguously.
>>>>
>>>> *Option 2*
>>>> make the limit configurable, for example through a system property
>>>> *Motivation*:
>>>> The system administrator can enforce a limit its users need to respect
>>>> that it's in line with whatever the admin decided to be acceptable for
>>>> them.
>>>> The default can stay the current one.
>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>> and any sort of plugin development
>>>>
>>>> *Option 3*
>>>> Move the max dimension limit lower level to a HNSW specific
>>>> implementation. Once there, this limit would not bind any other potential
>>>> vector engine alternative/evolution.
>>>> *Motivation:* There seem to be contradictory performance
>>>> interpretations about the current HNSW implementation. Some consider its
>>>> performance ok, some not, and it depends on the target data set and use
>>>> case. Increasing the max dimension limit where it is currently (in top
>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>> other use-cases) to be based on a lower limit.
>>>>
>>>> *Option 4*
>>>> Make it configurable and move it to an appropriate place.
>>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>>> 1024) should be enough.
>>>> *Motivation*:
>>>> Both are good and not mutually exclusive and could happen in any order.
>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>> I've not seen an argument _against_ configurability. Especially in this
>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>
>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>> implementation.
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

dsmiley at apache

May 16, 2023, 7:49 PM

Post #23 of 46 (858 views)

Robert, I have not heard from you (or anyone) an argument against System
property based configurability (as I described in Option 4 via a System
property). Uwe notes wisely some care must be taken to ensure it actually
works. Sure, of course. What concerns do you have with this?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Tue, May 16, 2023 at 9:50?PM Robert Muir <rcmuir@gmail.com> wrote:

> by the way, i agree with the idea to MOVE THE LIMIT UNCHANGED to the
> hsnw-specific code.
>
> This way, someone can write alternative codec with vectors using some
> other completely different approach that incorporates a different more
> appropriate limit (maybe lower, maybe higher) depending upon their
> tradeoffs. We should encourage this as I think it is the "only true fix" to
> the scalability issues: use a scalable algorithm! Also, alternative codecs
> don't force the project into many years of index backwards compatibility,
> which is really my penultimate concern. We can lock ourselves into a truly
> bad place and become irrelevant (especially with scalar code implementing
> all this vector stuff, it is really senseless). In the meantime I suggest
> we try to reduce pain for the default codec with the current implementation
> if possible. If it is not possible, we need a new codec that performs.
>
> On Tue, May 16, 2023 at 8:53?PM Robert Muir <rcmuir@gmail.com> wrote:
>
>> Gus, I think i explained myself multiple times on issues and in this
>> thread. the performance is unacceptable, everyone knows it, but nobody is
>> talking about.
>> I don't need to explain myself time and time again here.
>> You don't seem to understand the technical issues (at least you sure as
>> fuck don't know how service loading works or you wouldnt have opened
>> https://github.com/apache/lucene/issues/12300 ????)
>>
>> I'm just the only one here completely unconstrained by any of silicon
>> valley's influences to speak my true mind, without any repercussions, so I
>> do it. Don't give any fucks about ChatGPT.
>>
>> I'm standing by my technical veto. If you bypass it, I'll revert the
>> offending commit.
>>
>> As far as fixing the technical performance, I just opened an issue with
>> some ideas to at least improve cpu usage by a factor of N. It does not help
>> with the crazy heap memory usage or other issues of KNN implementation
>> causing shit like OOM on merge. But it is one step:
>> https://github.com/apache/lucene/issues/12302
>>
>>
>>
>> On Tue, May 16, 2023 at 7:45?AM Gus Heck <gus.heck@gmail.com> wrote:
>>
>>> Robert,
>>>
>>> Can you explain in clear technical terms the standard that must be met
>>> for performance? A benchmark that must run in X time on Y hardware for
>>> example (and why that test is suitable)? Or some other reproducible
>>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>>> that's not a technical criteria, others may have a different concept of
>>> what is usable to them.
>>>
>>> Forgive me if I misunderstand, but the essence of your argument has
>>> seemed to be
>>>
>>> "Performance isn't good enough, therefore we should force anyone who
>>> wants to experiment with something bigger to fork the code base to do it"
>>>
>>> Thus, it is necessary to have a clear unambiguous standard that anyone
>>> can verify for "good enough". A clear standard would also focus efforts at
>>> improvement.
>>>
>>> Where are the goal posts?
>>>
>>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>>> is fundamentally counterproductive in an open source setting, as it will
>>> lead to *fewer people* pushing the limits. Extremely few people are
>>> going to get into the nitty-gritty of optimizing things unless they are
>>> staring at code that they can prove does something interesting, but doesn't
>>> run fast enough for their purposes. If people hit a hard limit, more of
>>> them give up and never develop the code that will motivate them to look for
>>> optimizations.
>>>
>>> -Gus
>>>
>>> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>>> does not change the technical facts or make the veto go away.
>>>>
>>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>>> a.benedetti@sease.io> wrote:
>>>>
>>>>> Hi all,
>>>>> we have finalized all the options proposed by the community and we are
>>>>> ready to vote for the preferred one and then proceed with the
>>>>> implementation.
>>>>>
>>>>> *Option 1*
>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>> *Motivation*:
>>>>> We are close to improving on many fronts. Given the criticality of
>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>> most active stewards of the project, I think we should keep working toward
>>>>> improving the feature as is and move to up the limit after we can
>>>>> demonstrate improvement unambiguously.
>>>>>
>>>>> *Option 2*
>>>>> make the limit configurable, for example through a system property
>>>>> *Motivation*:
>>>>> The system administrator can enforce a limit its users need to respect
>>>>> that it's in line with whatever the admin decided to be acceptable for
>>>>> them.
>>>>> The default can stay the current one.
>>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>>> and any sort of plugin development
>>>>>
>>>>> *Option 3*
>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>> implementation. Once there, this limit would not bind any other potential
>>>>> vector engine alternative/evolution.
>>>>> *Motivation:* There seem to be contradictory performance
>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>> performance ok, some not, and it depends on the target data set and use
>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>> other use-cases) to be based on a lower limit.
>>>>>
>>>>> *Option 4*
>>>>> Make it configurable and move it to an appropriate place.
>>>>> In particular, a
>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>> enough.
>>>>> *Motivation*:
>>>>> Both are good and not mutually exclusive and could happen in any order.
>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>
>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>> implementation.
>>>>> --------------------------
>>>>> *Alessandro Benedetti*
>>>>> Director @ Sease Ltd.
>>>>> *Apache Lucene/Solr Committer*
>>>>> *Apache Solr PMC Member*
>>>>>
>>>>> e-mail: a.benedetti@sease.io
>>>>>
>>>>>
>>>>> *Sease* - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>>
>>>>> Website: Sease.io <http://sease.io/>
>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>> <https://github.com/seaseltd>
>>>>>
>>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>>>
>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

rcmuir at gmail

May 16, 2023, 7:58 PM

Post #24 of 46 (858 views)

My problem is that it impacts the default codec which is supported by our
backwards compatibility policy for many years. We can't just let the user
determine backwards compatibility with a sysprop. how will checkindex work?
We have to have bounds and also allow for more performant implementations
that might have different limitations. And I'm pretty sure we want a faster
implementation than what we have in the future, and it will probably have
different limits.

For other codecs, it is fine to have a different limit as I already said,
as it is implementation dependent. And honestly the stuff in lucene/codecs
can be more "Fast and loose" because it doesn't require the extensive index
back compat guarantee.

Again, penultimate concern is that index back compat guarantee. When it
comes to limits, the proper way is not to just keep bumping them without
technical reasons, instead the correct approach is to fix the technical
problems and make them irrelevant. Great example here (merged this
morning):
https://github.com/apache/lucene/commit/f53eb28af053d7612f7e4d1b2de05d33dc410645

On Tue, May 16, 2023 at 10:49?PM David Smiley <dsmiley@apache.org> wrote:

> Robert, I have not heard from you (or anyone) an argument against System
> property based configurability (as I described in Option 4 via a System
> property). Uwe notes wisely some care must be taken to ensure it actually
> works. Sure, of course. What concerns do you have with this?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Tue, May 16, 2023 at 9:50?PM Robert Muir <rcmuir@gmail.com> wrote:
>
>> by the way, i agree with the idea to MOVE THE LIMIT UNCHANGED to the
>> hsnw-specific code.
>>
>> This way, someone can write alternative codec with vectors using some
>> other completely different approach that incorporates a different more
>> appropriate limit (maybe lower, maybe higher) depending upon their
>> tradeoffs. We should encourage this as I think it is the "only true fix" to
>> the scalability issues: use a scalable algorithm! Also, alternative codecs
>> don't force the project into many years of index backwards compatibility,
>> which is really my penultimate concern. We can lock ourselves into a truly
>> bad place and become irrelevant (especially with scalar code implementing
>> all this vector stuff, it is really senseless). In the meantime I suggest
>> we try to reduce pain for the default codec with the current implementation
>> if possible. If it is not possible, we need a new codec that performs.
>>
>> On Tue, May 16, 2023 at 8:53?PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> Gus, I think i explained myself multiple times on issues and in this
>>> thread. the performance is unacceptable, everyone knows it, but nobody is
>>> talking about.
>>> I don't need to explain myself time and time again here.
>>> You don't seem to understand the technical issues (at least you sure as
>>> fuck don't know how service loading works or you wouldnt have opened
>>> https://github.com/apache/lucene/issues/12300 ????)
>>>
>>> I'm just the only one here completely unconstrained by any of silicon
>>> valley's influences to speak my true mind, without any repercussions, so I
>>> do it. Don't give any fucks about ChatGPT.
>>>
>>> I'm standing by my technical veto. If you bypass it, I'll revert the
>>> offending commit.
>>>
>>> As far as fixing the technical performance, I just opened an issue with
>>> some ideas to at least improve cpu usage by a factor of N. It does not help
>>> with the crazy heap memory usage or other issues of KNN implementation
>>> causing shit like OOM on merge. But it is one step:
>>> https://github.com/apache/lucene/issues/12302
>>>
>>>
>>>
>>> On Tue, May 16, 2023 at 7:45?AM Gus Heck <gus.heck@gmail.com> wrote:
>>>
>>>> Robert,
>>>>
>>>> Can you explain in clear technical terms the standard that must be met
>>>> for performance? A benchmark that must run in X time on Y hardware for
>>>> example (and why that test is suitable)? Or some other reproducible
>>>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>>>> that's not a technical criteria, others may have a different concept of
>>>> what is usable to them.
>>>>
>>>> Forgive me if I misunderstand, but the essence of your argument has
>>>> seemed to be
>>>>
>>>> "Performance isn't good enough, therefore we should force anyone who
>>>> wants to experiment with something bigger to fork the code base to do it"
>>>>
>>>> Thus, it is necessary to have a clear unambiguous standard that anyone
>>>> can verify for "good enough". A clear standard would also focus efforts at
>>>> improvement.
>>>>
>>>> Where are the goal posts?
>>>>
>>>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>>>> is fundamentally counterproductive in an open source setting, as it will
>>>> lead to *fewer people* pushing the limits. Extremely few people are
>>>> going to get into the nitty-gritty of optimizing things unless they are
>>>> staring at code that they can prove does something interesting, but doesn't
>>>> run fast enough for their purposes. If people hit a hard limit, more of
>>>> them give up and never develop the code that will motivate them to look for
>>>> optimizations.
>>>>
>>>> -Gus
>>>>
>>>> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:
>>>>
>>>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>>>> does not change the technical facts or make the veto go away.
>>>>>
>>>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>>>> a.benedetti@sease.io> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> we have finalized all the options proposed by the community and we
>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>> implementation.
>>>>>>
>>>>>> *Option 1*
>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>> *Motivation*:
>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>> most active stewards of the project, I think we should keep working toward
>>>>>> improving the feature as is and move to up the limit after we can
>>>>>> demonstrate improvement unambiguously.
>>>>>>
>>>>>> *Option 2*
>>>>>> make the limit configurable, for example through a system property
>>>>>> *Motivation*:
>>>>>> The system administrator can enforce a limit its users need to
>>>>>> respect that it's in line with whatever the admin decided to be acceptable
>>>>>> for them.
>>>>>> The default can stay the current one.
>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>> OpenSearch, and any sort of plugin development
>>>>>>
>>>>>> *Option 3*
>>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>> vector engine alternative/evolution.
>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>>> performance ok, some not, and it depends on the target data set and use
>>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>> other use-cases) to be based on a lower limit.
>>>>>>
>>>>>> *Option 4*
>>>>>> Make it configurable and move it to an appropriate place.
>>>>>> In particular, a
>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>> enough.
>>>>>> *Motivation*:
>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>> order.
>>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>
>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>> implementation.
>>>>>> --------------------------
>>>>>> *Alessandro Benedetti*
>>>>>> Director @ Sease Ltd.
>>>>>> *Apache Lucene/Solr Committer*
>>>>>> *Apache Solr PMC Member*
>>>>>>
>>>>>> e-mail: a.benedetti@sease.io
>>>>>>
>>>>>>
>>>>>> *Sease* - Information Retrieval Applied
>>>>>> Consulting | Training | Open Source
>>>>>>
>>>>>> Website: Sease.io <http://sease.io/>
>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>> <https://github.com/seaseltd>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> http://www.needhamsoftware.com (work)
>>>> http://www.the111shift.com (play)
>>>>
>>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

gus.heck at gmail

May 16, 2023, 7:59 PM

Post #25 of 46 (858 views)

Hi Robert,

If you read the issue I opened more carefully you'll see I had all the
service loading stuff sorted just fine. It's the silent eating of the
security exceptions by URLClassPath that I think is a useful thing to point
out. If anything, that ticket is more about being surprised by Security
manager behavior than service loading. I thought it would be good if anyone
else who doesn't know that bit of (IMHO obscure) trivia didn't have to
spend a long time hunting down the same thing I did if they encounter a
misconfigured security policy. If you think it could be worded better I'm
all ears.

Also, I didn't question a single thing you said or ask you to repeat
anything. I asked for a very specific detail you have not yet provided, and
that's what's your goal post. When is it good enough?

I do disagree with you at a "how open source works best" level, favoring
enablement. I don't think I've disagreed with a single one of your
technical claims or reported experiences.

Awesome if you've found an improvement. :) If it works as well as you
expect, is it enough to change your mind?

-Gus

On Tue, May 16, 2023 at 8:54?PM Robert Muir <rcmuir@gmail.com> wrote:

> Gus, I think i explained myself multiple times on issues and in this
> thread. the performance is unacceptable, everyone knows it, but nobody is
> talking about.
> I don't need to explain myself time and time again here.
> You don't seem to understand the technical issues (at least you sure as
> fuck don't know how service loading works or you wouldnt have opened
> https://github.com/apache/lucene/issues/12300 ????)
>
> I'm just the only one here completely unconstrained by any of silicon
> valley's influences to speak my true mind, without any repercussions, so I
> do it. Don't give any fucks about ChatGPT.
>
> I'm standing by my technical veto. If you bypass it, I'll revert the
> offending commit.
>
> As far as fixing the technical performance, I just opened an issue with
> some ideas to at least improve cpu usage by a factor of N. It does not help
> with the crazy heap memory usage or other issues of KNN implementation
> causing shit like OOM on merge. But it is one step:
> https://github.com/apache/lucene/issues/12302
>
>
>
> On Tue, May 16, 2023 at 7:45?AM Gus Heck <gus.heck@gmail.com> wrote:
>
>> Robert,
>>
>> Can you explain in clear technical terms the standard that must be met
>> for performance? A benchmark that must run in X time on Y hardware for
>> example (and why that test is suitable)? Or some other reproducible
>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>> that's not a technical criteria, others may have a different concept of
>> what is usable to them.
>>
>> Forgive me if I misunderstand, but the essence of your argument has
>> seemed to be
>>
>> "Performance isn't good enough, therefore we should force anyone who
>> wants to experiment with something bigger to fork the code base to do it"
>>
>> Thus, it is necessary to have a clear unambiguous standard that anyone
>> can verify for "good enough". A clear standard would also focus efforts at
>> improvement.
>>
>> Where are the goal posts?
>>
>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>> is fundamentally counterproductive in an open source setting, as it will
>> lead to *fewer people* pushing the limits. Extremely few people are
>> going to get into the nitty-gritty of optimizing things unless they are
>> staring at code that they can prove does something interesting, but doesn't
>> run fast enough for their purposes. If people hit a hard limit, more of
>> them give up and never develop the code that will motivate them to look for
>> optimizations.
>>
>> -Gus
>>
>> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>> does not change the technical facts or make the veto go away.
>>>
>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> Hi all,
>>>> we have finalized all the options proposed by the community and we are
>>>> ready to vote for the preferred one and then proceed with the
>>>> implementation.
>>>>
>>>> *Option 1*
>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>> *Motivation*:
>>>> We are close to improving on many fronts. Given the criticality of
>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>> most active stewards of the project, I think we should keep working toward
>>>> improving the feature as is and move to up the limit after we can
>>>> demonstrate improvement unambiguously.
>>>>
>>>> *Option 2*
>>>> make the limit configurable, for example through a system property
>>>> *Motivation*:
>>>> The system administrator can enforce a limit its users need to respect
>>>> that it's in line with whatever the admin decided to be acceptable for
>>>> them.
>>>> The default can stay the current one.
>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>> and any sort of plugin development
>>>>
>>>> *Option 3*
>>>> Move the max dimension limit lower level to a HNSW specific
>>>> implementation. Once there, this limit would not bind any other potential
>>>> vector engine alternative/evolution.
>>>> *Motivation:* There seem to be contradictory performance
>>>> interpretations about the current HNSW implementation. Some consider its
>>>> performance ok, some not, and it depends on the target data set and use
>>>> case. Increasing the max dimension limit where it is currently (in top
>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>> other use-cases) to be based on a lower limit.
>>>>
>>>> *Option 4*
>>>> Make it configurable and move it to an appropriate place.
>>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>>> 1024) should be enough.
>>>> *Motivation*:
>>>> Both are good and not mutually exclusive and could happen in any order.
>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>> I've not seen an argument _against_ configurability. Especially in this
>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>
>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>> implementation.
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

dsmiley at apache

May 16, 2023, 10:04 PM

Post #26 of 46 (454 views)

We agree backwards compatibility with the index should be maintained and
that checkIndex should work. And we agree on a number of other things, but
I want to focus on configurability.
As long as the index contains the number of dimensions actually used in a
specific segment & field, why couldn't checkIndex work if the dimension
*limit* is configurable? It's not checkindex's job to enforce the limit,
only to check that the data appears consistent / valid, irrespective of how
the number of dimensions came to be specified originally.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Tue, May 16, 2023 at 10:58?PM Robert Muir <rcmuir@gmail.com> wrote:

> My problem is that it impacts the default codec which is supported by our
> backwards compatibility policy for many years. We can't just let the user
> determine backwards compatibility with a sysprop. how will checkindex work?
> We have to have bounds and also allow for more performant implementations
> that might have different limitations. And I'm pretty sure we want a faster
> implementation than what we have in the future, and it will probably have
> different limits.
>
> For other codecs, it is fine to have a different limit as I already said,
> as it is implementation dependent. And honestly the stuff in lucene/codecs
> can be more "Fast and loose" because it doesn't require the extensive index
> back compat guarantee.
>
> Again, penultimate concern is that index back compat guarantee. When it
> comes to limits, the proper way is not to just keep bumping them without
> technical reasons, instead the correct approach is to fix the technical
> problems and make them irrelevant. Great example here (merged this
> morning):
> https://github.com/apache/lucene/commit/f53eb28af053d7612f7e4d1b2de05d33dc410645
>
>
> On Tue, May 16, 2023 at 10:49?PM David Smiley <dsmiley@apache.org> wrote:
>
>> Robert, I have not heard from you (or anyone) an argument against System
>> property based configurability (as I described in Option 4 via a System
>> property). Uwe notes wisely some care must be taken to ensure it actually
>> works. Sure, of course. What concerns do you have with this?
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Tue, May 16, 2023 at 9:50?PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> by the way, i agree with the idea to MOVE THE LIMIT UNCHANGED to the
>>> hsnw-specific code.
>>>
>>> This way, someone can write alternative codec with vectors using some
>>> other completely different approach that incorporates a different more
>>> appropriate limit (maybe lower, maybe higher) depending upon their
>>> tradeoffs. We should encourage this as I think it is the "only true fix" to
>>> the scalability issues: use a scalable algorithm! Also, alternative codecs
>>> don't force the project into many years of index backwards compatibility,
>>> which is really my penultimate concern. We can lock ourselves into a truly
>>> bad place and become irrelevant (especially with scalar code implementing
>>> all this vector stuff, it is really senseless). In the meantime I suggest
>>> we try to reduce pain for the default codec with the current implementation
>>> if possible. If it is not possible, we need a new codec that performs.
>>>
>>> On Tue, May 16, 2023 at 8:53?PM Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>>> Gus, I think i explained myself multiple times on issues and in this
>>>> thread. the performance is unacceptable, everyone knows it, but nobody is
>>>> talking about.
>>>> I don't need to explain myself time and time again here.
>>>> You don't seem to understand the technical issues (at least you sure as
>>>> fuck don't know how service loading works or you wouldnt have opened
>>>> https://github.com/apache/lucene/issues/12300 ????)
>>>>
>>>> I'm just the only one here completely unconstrained by any of silicon
>>>> valley's influences to speak my true mind, without any repercussions, so I
>>>> do it. Don't give any fucks about ChatGPT.
>>>>
>>>> I'm standing by my technical veto. If you bypass it, I'll revert the
>>>> offending commit.
>>>>
>>>> As far as fixing the technical performance, I just opened an issue with
>>>> some ideas to at least improve cpu usage by a factor of N. It does not help
>>>> with the crazy heap memory usage or other issues of KNN implementation
>>>> causing shit like OOM on merge. But it is one step:
>>>> https://github.com/apache/lucene/issues/12302
>>>>
>>>>
>>>>
>>>> On Tue, May 16, 2023 at 7:45?AM Gus Heck <gus.heck@gmail.com> wrote:
>>>>
>>>>> Robert,
>>>>>
>>>>> Can you explain in clear technical terms the standard that must be met
>>>>> for performance? A benchmark that must run in X time on Y hardware for
>>>>> example (and why that test is suitable)? Or some other reproducible
>>>>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>>>>> that's not a technical criteria, others may have a different concept of
>>>>> what is usable to them.
>>>>>
>>>>> Forgive me if I misunderstand, but the essence of your argument has
>>>>> seemed to be
>>>>>
>>>>> "Performance isn't good enough, therefore we should force anyone who
>>>>> wants to experiment with something bigger to fork the code base to do it"
>>>>>
>>>>> Thus, it is necessary to have a clear unambiguous standard that anyone
>>>>> can verify for "good enough". A clear standard would also focus efforts at
>>>>> improvement.
>>>>>
>>>>> Where are the goal posts?
>>>>>
>>>>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard
>>>>> limit is fundamentally counterproductive in an open source setting, as it
>>>>> will lead to *fewer people* pushing the limits. Extremely few people
>>>>> are going to get into the nitty-gritty of optimizing things unless they are
>>>>> staring at code that they can prove does something interesting, but doesn't
>>>>> run fast enough for their purposes. If people hit a hard limit, more of
>>>>> them give up and never develop the code that will motivate them to look for
>>>>> optimizations.
>>>>>
>>>>> -Gus
>>>>>
>>>>> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:
>>>>>
>>>>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>>>>> does not change the technical facts or make the veto go away.
>>>>>>
>>>>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>>>>> a.benedetti@sease.io> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> we have finalized all the options proposed by the community and we
>>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>>> implementation.
>>>>>>>
>>>>>>> *Option 1*
>>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>>> *Motivation*:
>>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>>> most active stewards of the project, I think we should keep working toward
>>>>>>> improving the feature as is and move to up the limit after we can
>>>>>>> demonstrate improvement unambiguously.
>>>>>>>
>>>>>>> *Option 2*
>>>>>>> make the limit configurable, for example through a system property
>>>>>>> *Motivation*:
>>>>>>> The system administrator can enforce a limit its users need to
>>>>>>> respect that it's in line with whatever the admin decided to be acceptable
>>>>>>> for them.
>>>>>>> The default can stay the current one.
>>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>>> OpenSearch, and any sort of plugin development
>>>>>>>
>>>>>>> *Option 3*
>>>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>>> vector engine alternative/evolution.
>>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>>>> performance ok, some not, and it depends on the target data set and use
>>>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>>> other use-cases) to be based on a lower limit.
>>>>>>>
>>>>>>> *Option 4*
>>>>>>> Make it configurable and move it to an appropriate place.
>>>>>>> In particular, a
>>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>>> enough.
>>>>>>> *Motivation*:
>>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>>> order.
>>>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>>
>>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>>> implementation.
>>>>>>> --------------------------
>>>>>>> *Alessandro Benedetti*
>>>>>>> Director @ Sease Ltd.
>>>>>>> *Apache Lucene/Solr Committer*
>>>>>>> *Apache Solr PMC Member*
>>>>>>>
>>>>>>> e-mail: a.benedetti@sease.io
>>>>>>>
>>>>>>>
>>>>>>> *Sease* - Information Retrieval Applied
>>>>>>> Consulting | Training | Open Source
>>>>>>>
>>>>>>> Website: Sease.io <http://sease.io/>
>>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>>> <https://github.com/seaseltd>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> http://www.needhamsoftware.com (work)
>>>>> http://www.the111shift.com (play)
>>>>>
>>>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

a.benedetti at sease

May 17, 2023, 1:45 AM

Post #27 of 46 (454 views)

Robert,
A gentle reminder of the
https://www.apache.org/foundation/policies/conduct.html.
I've read many e-mails about this topic that ended up in a tone that is not
up to the standard of a healthy community.
To be specific and pragmatic how you addressed Gus here, how you addressed
the rest of our community mocking us as sort of "ChatGPT minions" and the
usage of bad words in English (f* word), does not make sense and it's not
acceptable here.
Even if you feel heated, I recommend separating such emotions from what you
write and always being respectful of other people with different ideas.
You are an intelligent person, don't ruin your time (and others' time) on a
wonderful project such as Lucene, blinded by excessive emotion.
Please remember that the vast majority of us participate in this community
purely on a volunteering basis.
So when I spend time on this, I like to see respect,
thoughtful discussions, and intellectual challenges, the time we spend
together must be peaceful and positive.

The community comes first and here we are collecting what the community
would like for a feature.
Your vote and opinion are extremely valuable, but at this stage, we are
here to listen to the community rather than imposing a personal idea.
Once we observe the dominant need, we'll proceed with a contribution.
If you disagree with such a contribution and bring technical evidence that
supports a convincing veto, we (the Lucene community) will listen and
improve/change the contribution.
If you disagree with such a contribution and bring an unconvincing veto, we
(the Lucene community) will proceed with steps that are appropriate for the
situation.
Let's also remember that the project and the community come first, Lucene
is an Apache project, not mine or yours for that matters.

Cheers

--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

On Wed, 17 May 2023 at 01:54, Robert Muir <rcmuir@gmail.com> wrote:

> Gus, I think i explained myself multiple times on issues and in this
> thread. the performance is unacceptable, everyone knows it, but nobody is
> talking about.
> I don't need to explain myself time and time again here.
> You don't seem to understand the technical issues (at least you sure as
> fuck don't know how service loading works or you wouldnt have opened
> https://github.com/apache/lucene/issues/12300 ????)
>
> I'm just the only one here completely unconstrained by any of silicon
> valley's influences to speak my true mind, without any repercussions, so I
> do it. Don't give any fucks about ChatGPT.
>
> I'm standing by my technical veto. If you bypass it, I'll revert the
> offending commit.
>
> As far as fixing the technical performance, I just opened an issue with
> some ideas to at least improve cpu usage by a factor of N. It does not help
> with the crazy heap memory usage or other issues of KNN implementation
> causing shit like OOM on merge. But it is one step:
> https://github.com/apache/lucene/issues/12302
>
>
>
> On Tue, May 16, 2023 at 7:45?AM Gus Heck <gus.heck@gmail.com> wrote:
>
>> Robert,
>>
>> Can you explain in clear technical terms the standard that must be met
>> for performance? A benchmark that must run in X time on Y hardware for
>> example (and why that test is suitable)? Or some other reproducible
>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>> that's not a technical criteria, others may have a different concept of
>> what is usable to them.
>>
>> Forgive me if I misunderstand, but the essence of your argument has
>> seemed to be
>>
>> "Performance isn't good enough, therefore we should force anyone who
>> wants to experiment with something bigger to fork the code base to do it"
>>
>> Thus, it is necessary to have a clear unambiguous standard that anyone
>> can verify for "good enough". A clear standard would also focus efforts at
>> improvement.
>>
>> Where are the goal posts?
>>
>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>> is fundamentally counterproductive in an open source setting, as it will
>> lead to *fewer people* pushing the limits. Extremely few people are
>> going to get into the nitty-gritty of optimizing things unless they are
>> staring at code that they can prove does something interesting, but doesn't
>> run fast enough for their purposes. If people hit a hard limit, more of
>> them give up and never develop the code that will motivate them to look for
>> optimizations.
>>
>> -Gus
>>
>> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>> does not change the technical facts or make the veto go away.
>>>
>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> Hi all,
>>>> we have finalized all the options proposed by the community and we are
>>>> ready to vote for the preferred one and then proceed with the
>>>> implementation.
>>>>
>>>> *Option 1*
>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>> *Motivation*:
>>>> We are close to improving on many fronts. Given the criticality of
>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>> most active stewards of the project, I think we should keep working toward
>>>> improving the feature as is and move to up the limit after we can
>>>> demonstrate improvement unambiguously.
>>>>
>>>> *Option 2*
>>>> make the limit configurable, for example through a system property
>>>> *Motivation*:
>>>> The system administrator can enforce a limit its users need to respect
>>>> that it's in line with whatever the admin decided to be acceptable for
>>>> them.
>>>> The default can stay the current one.
>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>> and any sort of plugin development
>>>>
>>>> *Option 3*
>>>> Move the max dimension limit lower level to a HNSW specific
>>>> implementation. Once there, this limit would not bind any other potential
>>>> vector engine alternative/evolution.
>>>> *Motivation:* There seem to be contradictory performance
>>>> interpretations about the current HNSW implementation. Some consider its
>>>> performance ok, some not, and it depends on the target data set and use
>>>> case. Increasing the max dimension limit where it is currently (in top
>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>> other use-cases) to be based on a lower limit.
>>>>
>>>> *Option 4*
>>>> Make it configurable and move it to an appropriate place.
>>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>>> 1024) should be enough.
>>>> *Motivation*:
>>>> Both are good and not mutually exclusive and could happen in any order.
>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>> I've not seen an argument _against_ configurability. Especially in this
>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>
>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>> implementation.
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

rcmuir at gmail

May 17, 2023, 2:15 AM

Post #28 of 46 (454 views)

As a reminder this isn't the Disney Plus channel and I'll use strong
language if I fucking want to.

On Wed, May 17, 2023, 4:45 AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Robert,
> A gentle reminder of the
> https://www.apache.org/foundation/policies/conduct.html.
> I've read many e-mails about this topic that ended up in a tone that is
> not up to the standard of a healthy community.
> To be specific and pragmatic how you addressed Gus here, how you addressed
> the rest of our community mocking us as sort of "ChatGPT minions" and the
> usage of bad words in English (f* word), does not make sense and it's not
> acceptable here.
> Even if you feel heated, I recommend separating such emotions from what
> you write and always being respectful of other people with different ideas.
> You are an intelligent person, don't ruin your time (and others' time) on
> a wonderful project such as Lucene, blinded by excessive emotion.
> Please remember that the vast majority of us participate in this community
> purely on a volunteering basis.
> So when I spend time on this, I like to see respect,
> thoughtful discussions, and intellectual challenges, the time we spend
> together must be peaceful and positive.
>
> The community comes first and here we are collecting what the community
> would like for a feature.
> Your vote and opinion are extremely valuable, but at this stage, we are
> here to listen to the community rather than imposing a personal idea.
> Once we observe the dominant need, we'll proceed with a contribution.
> If you disagree with such a contribution and bring technical evidence that
> supports a convincing veto, we (the Lucene community) will listen and
> improve/change the contribution.
> If you disagree with such a contribution and bring an unconvincing veto,
> we (the Lucene community) will proceed with steps that are appropriate for
> the situation.
> Let's also remember that the project and the community come first, Lucene
> is an Apache project, not mine or yours for that matters.
>
> Cheers
>
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Wed, 17 May 2023 at 01:54, Robert Muir <rcmuir@gmail.com> wrote:
>
>> Gus, I think i explained myself multiple times on issues and in this
>> thread. the performance is unacceptable, everyone knows it, but nobody is
>> talking about.
>> I don't need to explain myself time and time again here.
>> You don't seem to understand the technical issues (at least you sure as
>> fuck don't know how service loading works or you wouldnt have opened
>> https://github.com/apache/lucene/issues/12300 ????)
>>
>> I'm just the only one here completely unconstrained by any of silicon
>> valley's influences to speak my true mind, without any repercussions, so I
>> do it. Don't give any fucks about ChatGPT.
>>
>> I'm standing by my technical veto. If you bypass it, I'll revert the
>> offending commit.
>>
>> As far as fixing the technical performance, I just opened an issue with
>> some ideas to at least improve cpu usage by a factor of N. It does not help
>> with the crazy heap memory usage or other issues of KNN implementation
>> causing shit like OOM on merge. But it is one step:
>> https://github.com/apache/lucene/issues/12302
>>
>>
>>
>> On Tue, May 16, 2023 at 7:45?AM Gus Heck <gus.heck@gmail.com> wrote:
>>
>>> Robert,
>>>
>>> Can you explain in clear technical terms the standard that must be met
>>> for performance? A benchmark that must run in X time on Y hardware for
>>> example (and why that test is suitable)? Or some other reproducible
>>> criteria? So far I've heard you give an *opinion* that it's unusable, but
>>> that's not a technical criteria, others may have a different concept of
>>> what is usable to them.
>>>
>>> Forgive me if I misunderstand, but the essence of your argument has
>>> seemed to be
>>>
>>> "Performance isn't good enough, therefore we should force anyone who
>>> wants to experiment with something bigger to fork the code base to do it"
>>>
>>> Thus, it is necessary to have a clear unambiguous standard that anyone
>>> can verify for "good enough". A clear standard would also focus efforts at
>>> improvement.
>>>
>>> Where are the goal posts?
>>>
>>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>>> is fundamentally counterproductive in an open source setting, as it will
>>> lead to *fewer people* pushing the limits. Extremely few people are
>>> going to get into the nitty-gritty of optimizing things unless they are
>>> staring at code that they can prove does something interesting, but doesn't
>>> run fast enough for their purposes. If people hit a hard limit, more of
>>> them give up and never develop the code that will motivate them to look for
>>> optimizations.
>>>
>>> -Gus
>>>
>>> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>>> i still feel -1 (veto) on increasing this limit. sending more emails
>>>> does not change the technical facts or make the veto go away.
>>>>
>>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>>> a.benedetti@sease.io> wrote:
>>>>
>>>>> Hi all,
>>>>> we have finalized all the options proposed by the community and we are
>>>>> ready to vote for the preferred one and then proceed with the
>>>>> implementation.
>>>>>
>>>>> *Option 1*
>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>> *Motivation*:
>>>>> We are close to improving on many fronts. Given the criticality of
>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>> most active stewards of the project, I think we should keep working toward
>>>>> improving the feature as is and move to up the limit after we can
>>>>> demonstrate improvement unambiguously.
>>>>>
>>>>> *Option 2*
>>>>> make the limit configurable, for example through a system property
>>>>> *Motivation*:
>>>>> The system administrator can enforce a limit its users need to respect
>>>>> that it's in line with whatever the admin decided to be acceptable for
>>>>> them.
>>>>> The default can stay the current one.
>>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>>> and any sort of plugin development
>>>>>
>>>>> *Option 3*
>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>> implementation. Once there, this limit would not bind any other potential
>>>>> vector engine alternative/evolution.
>>>>> *Motivation:* There seem to be contradictory performance
>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>> performance ok, some not, and it depends on the target data set and use
>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>> other use-cases) to be based on a lower limit.
>>>>>
>>>>> *Option 4*
>>>>> Make it configurable and move it to an appropriate place.
>>>>> In particular, a
>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>> enough.
>>>>> *Motivation*:
>>>>> Both are good and not mutually exclusive and could happen in any order.
>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>
>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>> implementation.
>>>>> --------------------------
>>>>> *Alessandro Benedetti*
>>>>> Director @ Sease Ltd.
>>>>> *Apache Lucene/Solr Committer*
>>>>> *Apache Solr PMC Member*
>>>>>
>>>>> e-mail: a.benedetti@sease.io
>>>>>
>>>>>
>>>>> *Sease* - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>>
>>>>> Website: Sease.io <http://sease.io/>
>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>> <https://github.com/seaseltd>
>>>>>
>>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>>>
>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

msokolov at gmail

May 17, 2023, 4:50 AM

Post #29 of 46 (454 views)

I think I've said before on this list we don't actually enforce the limit
in any way that can't easily be circumvented by a user. The codec already
supports any size vector - it doesn't impose any limit. The way the API is
written you can *already today* create an index with max-int sized vectors
and we are committed to supporting that going forward by our backwards
compatibility policy as Robert points out. This wasn't intentional, I
think, but it is the facts.

Given that, I think this whole discussion is not really necessary.

On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of Lucene
> in computing infrastructure and the concerns raised by one of the most
> active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and
> any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other potential
> vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance interpretations
> about the current HNSW implementation. Some consider its performance ok,
> some not, and it depends on the target data set and use case. Increasing
> the max dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but I've
> not seen an argument _against_ configurability. Especially in this way --
> a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

dsmiley at apache

May 17, 2023, 7:09 AM

Post #30 of 46 (454 views)

> easily be circumvented by a user

This is a revelation to me and others, if true. Michael, please then point
to a test or code snippet that shows the Lucene user community what they
want to see so they are unblocked from their explorations of vector search.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Wed, May 17, 2023 at 7:51?AM Michael Sokolov <msokolov@gmail.com> wrote:

> I think I've said before on this list we don't actually enforce the limit
> in any way that can't easily be circumvented by a user. The codec already
> supports any size vector - it doesn't impose any limit. The way the API is
> written you can *already today* create an index with max-int sized vectors
> and we are committed to supporting that going forward by our backwards
> compatibility policy as Robert points out. This wasn't intentional, I
> think, but it is the facts.
>
> Given that, I think this whole discussion is not really necessary.
>
> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
>> Hi all,
>> we have finalized all the options proposed by the community and we are
>> ready to vote for the preferred one and then proceed with the
>> implementation.
>>
>> *Option 1*
>> Keep it as it is (dimension limit hardcoded to 1024)
>> *Motivation*:
>> We are close to improving on many fronts. Given the criticality of Lucene
>> in computing infrastructure and the concerns raised by one of the most
>> active stewards of the project, I think we should keep working toward
>> improving the feature as is and move to up the limit after we can
>> demonstrate improvement unambiguously.
>>
>> *Option 2*
>> make the limit configurable, for example through a system property
>> *Motivation*:
>> The system administrator can enforce a limit its users need to respect
>> that it's in line with whatever the admin decided to be acceptable for
>> them.
>> The default can stay the current one.
>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>> and any sort of plugin development
>>
>> *Option 3*
>> Move the max dimension limit lower level to a HNSW specific
>> implementation. Once there, this limit would not bind any other potential
>> vector engine alternative/evolution.
>> *Motivation:* There seem to be contradictory performance interpretations
>> about the current HNSW implementation. Some consider its performance ok,
>> some not, and it depends on the target data set and use case. Increasing
>> the max dimension limit where it is currently (in top level
>> FloatVectorValues) would not allow potential alternatives (e.g. for other
>> use-cases) to be based on a lower limit.
>>
>> *Option 4*
>> Make it configurable and move it to an appropriate place.
>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>> 1024) should be enough.
>> *Motivation*:
>> Both are good and not mutually exclusive and could happen in any order.
>> Someone suggested to perfect what the _default_ limit should be, but I've
>> not seen an argument _against_ configurability. Especially in this way --
>> a toggle that doesn't bind Lucene's APIs in any way.
>>
>> I'll keep this [VOTE] open for a week and then proceed to the
>> implementation.
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

msokolov at gmail

May 17, 2023, 7:41 AM

Post #31 of 46 (454 views)

see https://markmail.org/message/kf4nzoqyhwacb7ri

On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org> wrote:

> > easily be circumvented by a user
>
> This is a revelation to me and others, if true. Michael, please then
> point to a test or code snippet that shows the Lucene user community what
> they want to see so they are unblocked from their explorations of vector
> search.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov <msokolov@gmail.com>
> wrote:
>
>> I think I've said before on this list we don't actually enforce the limit
>> in any way that can't easily be circumvented by a user. The codec already
>> supports any size vector - it doesn't impose any limit. The way the API is
>> written you can *already today* create an index with max-int sized vectors
>> and we are committed to supporting that going forward by our backwards
>> compatibility policy as Robert points out. This wasn't intentional, I
>> think, but it is the facts.
>>
>> Given that, I think this whole discussion is not really necessary.
>>
>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>> a.benedetti@sease.io> wrote:
>>
>>> Hi all,
>>> we have finalized all the options proposed by the community and we are
>>> ready to vote for the preferred one and then proceed with the
>>> implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the criticality of
>>> Lucene in computing infrastructure and the concerns raised by one of the
>>> most active stewards of the project, I think we should keep working toward
>>> improving the feature as is and move to up the limit after we can
>>> demonstrate improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit its users need to respect
>>> that it's in line with whatever the admin decided to be acceptable for
>>> them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>> and any sort of plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW specific
>>> implementation. Once there, this limit would not bind any other potential
>>> vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory performance
>>> interpretations about the current HNSW implementation. Some consider its
>>> performance ok, some not, and it depends on the target data set and use
>>> case. Increasing the max dimension limit where it is currently (in top
>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate place.
>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and could happen in any order.
>>> Someone suggested to perfect what the _default_ limit should be, but
>>> I've not seen an argument _against_ configurability. Especially in this
>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and then proceed to the
>>> implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benedetti@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

a.benedetti at sease

May 17, 2023, 8:02 AM

Post #32 of 46 (454 views)

Thanks, Michael,
that example backs even more strongly the need of cleaning it up and making
the limit configurable without the need for custom field types I guess (I
was taking a look at the code again, and it seems the limit is also checked
twice: in org.apache.lucene.document.KnnByteVectorField#createType and then
in org.apache.lucene.document.FieldType#setVectorAttributes (for both byte
and float variants).
This should help people vote, great!

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

On Wed, 17 May 2023 at 15:42, Michael Sokolov <msokolov@gmail.com> wrote:

> see https://markmail.org/message/kf4nzoqyhwacb7ri
>
> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org> wrote:
>
>> > easily be circumvented by a user
>>
>> This is a revelation to me and others, if true. Michael, please then
>> point to a test or code snippet that shows the Lucene user community what
>> they want to see so they are unblocked from their explorations of vector
>> search.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>>
>>> I think I've said before on this list we don't actually enforce the
>>> limit in any way that can't easily be circumvented by a user. The codec
>>> already supports any size vector - it doesn't impose any limit. The way the
>>> API is written you can *already today* create an index with max-int sized
>>> vectors and we are committed to supporting that going forward by our
>>> backwards compatibility policy as Robert points out. This wasn't
>>> intentional, I think, but it is the facts.
>>>
>>> Given that, I think this whole discussion is not really necessary.
>>>
>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> Hi all,
>>>> we have finalized all the options proposed by the community and we are
>>>> ready to vote for the preferred one and then proceed with the
>>>> implementation.
>>>>
>>>> *Option 1*
>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>> *Motivation*:
>>>> We are close to improving on many fronts. Given the criticality of
>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>> most active stewards of the project, I think we should keep working toward
>>>> improving the feature as is and move to up the limit after we can
>>>> demonstrate improvement unambiguously.
>>>>
>>>> *Option 2*
>>>> make the limit configurable, for example through a system property
>>>> *Motivation*:
>>>> The system administrator can enforce a limit its users need to respect
>>>> that it's in line with whatever the admin decided to be acceptable for
>>>> them.
>>>> The default can stay the current one.
>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>> and any sort of plugin development
>>>>
>>>> *Option 3*
>>>> Move the max dimension limit lower level to a HNSW specific
>>>> implementation. Once there, this limit would not bind any other potential
>>>> vector engine alternative/evolution.
>>>> *Motivation:* There seem to be contradictory performance
>>>> interpretations about the current HNSW implementation. Some consider its
>>>> performance ok, some not, and it depends on the target data set and use
>>>> case. Increasing the max dimension limit where it is currently (in top
>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>> other use-cases) to be based on a lower limit.
>>>>
>>>> *Option 4*
>>>> Make it configurable and move it to an appropriate place.
>>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>>> 1024) should be enough.
>>>> *Motivation*:
>>>> Both are good and not mutually exclusive and could happen in any order.
>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>> I've not seen an argument _against_ configurability. Especially in this
>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>
>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>> implementation.
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

mayya.sharipova at elastic

May 17, 2023, 9:58 AM

Post #33 of 46 (454 views)

Alessandro,
Thanks for raising the code of conduct; it is very discouraging and
intimidating to participate in discussions where such language is used
especially by senior members.

Michael S.,
thanks for your suggestion and that's what we used in Elasticsearch to
raise dims limit, and Alessandro, perhaps, you can use it as well in Solr
for the time being.

On Wed, May 17, 2023 at 11:03?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Thanks, Michael,
> that example backs even more strongly the need of cleaning it up and
> making the limit configurable without the need for custom field types I
> guess (I was taking a look at the code again, and it seems the limit is
> also checked twice:
> in org.apache.lucene.document.KnnByteVectorField#createType and then
> in org.apache.lucene.document.FieldType#setVectorAttributes (for both byte
> and float variants).
> This should help people vote, great!
>
> Cheers
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benedetti@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Wed, 17 May 2023 at 15:42, Michael Sokolov <msokolov@gmail.com> wrote:
>
>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>
>> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org> wrote:
>>
>>> > easily be circumvented by a user
>>>
>>> This is a revelation to me and others, if true. Michael, please then
>>> point to a test or code snippet that shows the Lucene user community what
>>> they want to see so they are unblocked from their explorations of vector
>>> search.
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>>
>>>> I think I've said before on this list we don't actually enforce the
>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>> API is written you can *already today* create an index with max-int sized
>>>> vectors and we are committed to supporting that going forward by our
>>>> backwards compatibility policy as Robert points out. This wasn't
>>>> intentional, I think, but it is the facts.
>>>>
>>>> Given that, I think this whole discussion is not really necessary.
>>>>
>>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>>> a.benedetti@sease.io> wrote:
>>>>
>>>>> Hi all,
>>>>> we have finalized all the options proposed by the community and we are
>>>>> ready to vote for the preferred one and then proceed with the
>>>>> implementation.
>>>>>
>>>>> *Option 1*
>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>> *Motivation*:
>>>>> We are close to improving on many fronts. Given the criticality of
>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>> most active stewards of the project, I think we should keep working toward
>>>>> improving the feature as is and move to up the limit after we can
>>>>> demonstrate improvement unambiguously.
>>>>>
>>>>> *Option 2*
>>>>> make the limit configurable, for example through a system property
>>>>> *Motivation*:
>>>>> The system administrator can enforce a limit its users need to respect
>>>>> that it's in line with whatever the admin decided to be acceptable for
>>>>> them.
>>>>> The default can stay the current one.
>>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>>> and any sort of plugin development
>>>>>
>>>>> *Option 3*
>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>> implementation. Once there, this limit would not bind any other potential
>>>>> vector engine alternative/evolution.
>>>>> *Motivation:* There seem to be contradictory performance
>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>> performance ok, some not, and it depends on the target data set and use
>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>> other use-cases) to be based on a lower limit.
>>>>>
>>>>> *Option 4*
>>>>> Make it configurable and move it to an appropriate place.
>>>>> In particular, a
>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>> enough.
>>>>> *Motivation*:
>>>>> Both are good and not mutually exclusive and could happen in any order.
>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>
>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>> implementation.
>>>>> --------------------------
>>>>> *Alessandro Benedetti*
>>>>> Director @ Sease Ltd.
>>>>> *Apache Lucene/Solr Committer*
>>>>> *Apache Solr PMC Member*
>>>>>
>>>>> e-mail: a.benedetti@sease.io
>>>>>
>>>>>
>>>>> *Sease* - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>>
>>>>> Website: Sease.io <http://sease.io/>
>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>> <https://github.com/seaseltd>
>>>>>
>>>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

michael.wechner at wyona

May 17, 2023, 2:58 PM

Post #34 of 46 (454 views)

I try to better understand the code, so IIUC vector MAX_DIMENSIONS is
currently used inside

lucene/core/src/java/org/apache/lucene/document/FieldType.java
lucene/core/src/java/org/apache/lucene/document/KnnFloatVectorField.java
lucene/core/src/java/org/apache/lucene/document/KnnByteVectorField.java
lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java
public static final int MAX_DIMENSIONS = 1024;
lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.java
public static final int MAX_DIMENSIONS = 1024;

and when you are writing that it should be moved to the hnsw-specific
code, then you mean somewhere to

lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapByteVectorValues.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapFloatVectorValues.java
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
lucene/core/src/java/org/apache/lucene/util/hnsw/RandomAccessVectorValues.java

?

Thanks

Michael

Am 17.05.23 um 03:50 schrieb Robert Muir:
> by the way, i agree with the idea to MOVE THE LIMIT UNCHANGED to the
> hsnw-specific code.
>
> This way, someone can write alternative codec with vectors using some
> other completely different approach that incorporates a different more
> appropriate limit (maybe lower, maybe higher) depending upon their
> tradeoffs. We should encourage this as I think it is the "only true
> fix" to the scalability issues: use a scalable algorithm! Also,
> alternative codecs don't force the project into many years of index
> backwards compatibility, which is really my penultimate concern. We
> can lock ourselves into a truly bad place and become irrelevant
> (especially with scalar code implementing all this vector stuff, it is
> really senseless). In the meantime I suggest we try to reduce pain for
> the default codec with the current implementation if possible. If it
> is not possible, we need a new codec that performs.
>
> On Tue, May 16, 2023 at 8:53?PM Robert Muir <rcmuir@gmail.com> wrote:
>
> Gus, I think i explained myself multiple times on issues and in
> this thread. the performance is unacceptable, everyone knows it,
> but nobody is talking about.
> I don't need to explain myself time and time again here.
> You don't seem to understand the technical issues (at least you
> sure as fuck don't know how service loading works or you wouldnt
> have opened https://github.com/apache/lucene/issues/12300 ????)
>
> I'm just the only one here completely unconstrained by any of
> silicon valley's influences to speak my true mind, without any
> repercussions, so I do it. Don't give any fucks about ChatGPT.
>
> I'm standing by my technical veto. If you bypass it, I'll revert
> the offending commit.
>
> As far as fixing the technical performance, I just opened an issue
> with some ideas to at least improve cpu usage by a factor of N. It
> does not help with the crazy heap memory usage or other issues of
> KNN implementation causing shit like OOM on merge. But it is one
> step: https://github.com/apache/lucene/issues/12302
>
>
>
> On Tue, May 16, 2023 at 7:45?AM Gus Heck <gus.heck@gmail.com> wrote:
>
> Robert,
>
> Can you explain in clear technical terms the standard that
> must be met for performance? A benchmark that must run in X
> time on Y hardware for example (and why that test is
> suitable)? Or some other reproducible criteria? So far I've
> heard you give an *opinion* that it's unusable, but that's not
> a technical criteria, others may have a different concept of
> what is usable to them.
>
> Forgive me if I misunderstand, but the essence of your
> argument has seemed to be
>
> "Performance isn't good enough, therefore we should force
> anyone who wants to experiment with something bigger to fork
> the code base to do it"
>
> Thus, it is necessary to have a clear unambiguous standard
> that anyone can verify for "good enough". A clear standard
> would also focus efforts at improvement.
>
> Where are the goal posts?
>
> FWIW I'm +1 on any of 2-4 since I believe the existence of a
> hard limit is fundamentally counterproductive in an open
> source setting, as it will lead to *fewer people* pushing
> the limits. Extremely few people are going to get into the
> nitty-gritty of optimizing things unless they are staring at
> code that they can prove does something interesting,
> but doesn't run fast enough for their purposes. If people hit
> a hard limit, more of them give up and never develop the code
> that will motivate them to look for optimizations.
>
> -Gus
>
> On Tue, May 16, 2023 at 6:04?AM Robert Muir <rcmuir@gmail.com>
> wrote:
>
> i still feel -1 (veto) on increasing this limit. sending
> more emails does not change the technical facts or make
> the veto go away.
>
> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> Hi all,
> we have finalized all the options proposed by the
> community and we are ready to vote for the preferred
> one and then proceed with the implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the
> criticality of Lucene in computing infrastructure and
> the concerns raised by one of the most active stewards
> of the project, I think we should keep working toward
> improving the feature as is and move to up the limit
> after we can demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a
> system property
> *Motivation*:
> The system administrator can enforce a limit its users
> need to respect that it's in line with whatever the
> admin decided to be acceptable for them.
> The default can stay the current one.
> This should open the doors for Apache Solr,
> Elasticsearch, OpenSearch, and any sort of plugin
> development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW
> specific implementation. Once there, this limit would
> not bind any other potential vector engine
> alternative/evolution.*
> *
> *Motivation:*There seem to be contradictory
> performance interpretations about the current HNSW
> implementation. Some consider its performance ok, some
> not, and it depends on the target data set and use
> case. Increasing the max dimension limit where it is
> currently (in top level FloatVectorValues) would not
> allow potential alternatives (e.g. for other
> use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a
> simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could
> happen in any order.
> Someone suggested to perfect what the _default_ limit
> should be, but I've not seen an argument _against_
> configurability. Especially in this way -- a toggle
> that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed
> to the implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> |
> Twitter <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
> Github <https://github.com/seaseltd>
>
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

dsmiley at apache

May 17, 2023, 3:05 PM

Post #35 of 46 (454 views)

Thanks Michael for sharing your code snippet on how to circumvent the
limit. My reaction to this is the same as Alessandro.

I just created a PR to make the limit configurable:
https://github.com/apache/lucene/pull/12306
If there is to be a veto presented to the PR, it should include technical
reasons specific to the PR and be raised on the PR itself.

Afterwards, I leave it to others to move the limit with its configurability
to be enforced in a codec specific way.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Wed, May 17, 2023 at 12:58?PM Mayya Sharipova
<mayya.sharipova@elastic.co.invalid> wrote:

> Alessandro,
> Thanks for raising the code of conduct; it is very discouraging and
> intimidating to participate in discussions where such language is used
> especially by senior members.
>
> Michael S.,
> thanks for your suggestion and that's what we used in Elasticsearch to
> raise dims limit, and Alessandro, perhaps, you can use it as well in Solr
> for the time being.
>
> On Wed, May 17, 2023 at 11:03?AM Alessandro Benedetti <
> a.benedetti@sease.io> wrote:
>
>> Thanks, Michael,
>> that example backs even more strongly the need of cleaning it up and
>> making the limit configurable without the need for custom field types I
>> guess (I was taking a look at the code again, and it seems the limit is
>> also checked twice:
>> in org.apache.lucene.document.KnnByteVectorField#createType and then
>> in org.apache.lucene.document.FieldType#setVectorAttributes (for both byte
>> and float variants).
>> This should help people vote, great!
>>
>> Cheers
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benedetti@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>> On Wed, 17 May 2023 at 15:42, Michael Sokolov <msokolov@gmail.com> wrote:
>>
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org>
>>> wrote:
>>>
>>>> > easily be circumvented by a user
>>>>
>>>> This is a revelation to me and others, if true. Michael, please then
>>>> point to a test or code snippet that shows the Lucene user community what
>>>> they want to see so they are unblocked from their explorations of vector
>>>> search.
>>>>
>>>> ~ David Smiley
>>>> Apache Lucene/Solr Search Developer
>>>> http://www.linkedin.com/in/davidwsmiley
>>>>
>>>>
>>>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov <msokolov@gmail.com>
>>>> wrote:
>>>>
>>>>> I think I've said before on this list we don't actually enforce the
>>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>>> API is written you can *already today* create an index with max-int sized
>>>>> vectors and we are committed to supporting that going forward by our
>>>>> backwards compatibility policy as Robert points out. This wasn't
>>>>> intentional, I think, but it is the facts.
>>>>>
>>>>> Given that, I think this whole discussion is not really necessary.
>>>>>
>>>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>>>> a.benedetti@sease.io> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> we have finalized all the options proposed by the community and we
>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>> implementation.
>>>>>>
>>>>>> *Option 1*
>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>> *Motivation*:
>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>> most active stewards of the project, I think we should keep working toward
>>>>>> improving the feature as is and move to up the limit after we can
>>>>>> demonstrate improvement unambiguously.
>>>>>>
>>>>>> *Option 2*
>>>>>> make the limit configurable, for example through a system property
>>>>>> *Motivation*:
>>>>>> The system administrator can enforce a limit its users need to
>>>>>> respect that it's in line with whatever the admin decided to be acceptable
>>>>>> for them.
>>>>>> The default can stay the current one.
>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>> OpenSearch, and any sort of plugin development
>>>>>>
>>>>>> *Option 3*
>>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>> vector engine alternative/evolution.
>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>>> performance ok, some not, and it depends on the target data set and use
>>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>> other use-cases) to be based on a lower limit.
>>>>>>
>>>>>> *Option 4*
>>>>>> Make it configurable and move it to an appropriate place.
>>>>>> In particular, a
>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>> enough.
>>>>>> *Motivation*:
>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>> order.
>>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>
>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>> implementation.
>>>>>> --------------------------
>>>>>> *Alessandro Benedetti*
>>>>>> Director @ Sease Ltd.
>>>>>> *Apache Lucene/Solr Committer*
>>>>>> *Apache Solr PMC Member*
>>>>>>
>>>>>> e-mail: a.benedetti@sease.io
>>>>>>
>>>>>>
>>>>>> *Sease* - Information Retrieval Applied
>>>>>> Consulting | Training | Open Source
>>>>>>
>>>>>> Website: Sease.io <http://sease.io/>
>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>> <https://github.com/seaseltd>
>>>>>>
>>>>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

michael.wechner at wyona

May 17, 2023, 3:29 PM

Post #36 of 46 (454 views)

IIUC KnnVectorField is deprecated and one is supposed to use
KnnFloatVectorField when using float as vector values, right?

Am 17.05.23 um 16:41 schrieb Michael Sokolov:
> see https://markmail.org/message/kf4nzoqyhwacb7ri
>
> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org> wrote:
>
> > easily be circumvented by a user
>
> This is a revelation to me and others, if true. Michael, please
> then point to a test or code snippet that shows the Lucene user
> community what they want to see so they are unblocked from their
> explorations of vector search.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov
> <msokolov@gmail.com> wrote:
>
> I think I've said before on this list we don't actually
> enforce the limit in any way that can't easily be circumvented
> by a user. The codec already supports any size vector - it
> doesn't impose any limit. The way the API is written you can
> *already today* create an index with max-int sized vectors and
> we are committed to supporting that going forward by our
> backwards compatibility policy as Robert points out. This
> wasn't intentional, I think, but it is the facts.
>
> Given that, I think this whole discussion is not really necessary.
>
> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> Hi all,
> we have finalized all the options proposed by the
> community and we are ready to vote for the preferred one
> and then proceed with the implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the
> criticality of Lucene in computing infrastructure and the
> concerns raised by one of the most active stewards of the
> project, I think we should keep working toward improving
> the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system
> property
> *Motivation*:
> The system administrator can enforce a limit its users
> need to respect that it's in line with whatever the admin
> decided to be acceptable for them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch,
> OpenSearch, and any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW
> specific implementation. Once there, this limit would not
> bind any other potential vector engine alternative/evolution.*
> *
> *Motivation:*There seem to be contradictory performance
> interpretations about the current HNSW implementation.
> Some consider its performance ok, some not, and it depends
> on the target data set and use case. Increasing the max
> dimension limit where it is currently (in top level
> FloatVectorValues) would not allow potential alternatives
> (e.g. for other use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a
> simple Integer.getInteger("lucene.hnsw.maxDimensions",
> 1024) should be enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen
> in any order.
> Someone suggested to perfect what the _default_ limit
> should be, but I've not seen an argument _against_
> configurability. Especially in this way -- a toggle that
> doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to
> the implementation.
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> /Apache Lucene/Solr Committer/
> /Apache Solr PMC Member/
>
> e-mail: a.benedetti@sease.io/
> /
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> |
> Twitter <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
> Github <https://github.com/seaseltd>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

michael.wechner at wyona

May 18, 2023, 12:45 AM

Post #37 of 46 (454 views)

I just implemented it and tested it with OpenAI's
text-embedding-ada-002, which is using 1536 dimensions and it works very
fine :-)

Thanks

Michael

Am 18.05.23 um 00:29 schrieb Michael Wechner:
> IIUC KnnVectorField is deprecated and one is supposed to use
> KnnFloatVectorField when using float as vector values, right?
>
> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>
>> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org> wrote:
>>
>> > easily be circumvented by a user
>>
>> This is a revelation to me and others, if true. Michael, please
>> then point to a test or code snippet that shows the Lucene user
>> community what they want to see so they are unblocked from their
>> explorations of vector search.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov
>> <msokolov@gmail.com> wrote:
>>
>> I think I've said before on this list we don't actually
>> enforce the limit in any way that can't easily be
>> circumvented by a user. The codec already supports any size
>> vector - it doesn't impose any limit. The way the API is
>> written you can *already today* create an index with max-int
>> sized vectors and we are committed to supporting that going
>> forward by our backwards compatibility policy as Robert
>> points out. This wasn't intentional, I think, but it is the
>> facts.
>>
>> Given that, I think this whole discussion is not really
>> necessary.
>>
>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>>
>> Hi all,
>> we have finalized all the options proposed by the
>> community and we are ready to vote for the preferred one
>> and then proceed with the implementation.
>>
>> *Option 1*
>> Keep it as it is (dimension limit hardcoded to 1024)
>> *Motivation*:
>> We are close to improving on many fronts. Given the
>> criticality of Lucene in computing infrastructure and the
>> concerns raised by one of the most active stewards of the
>> project, I think we should keep working toward improving
>> the feature as is and move to up the limit after we can
>> demonstrate improvement unambiguously.
>>
>> *Option 2*
>> make the limit configurable, for example through a system
>> property
>> *Motivation*:
>> The system administrator can enforce a limit its users
>> need to respect that it's in line with whatever the admin
>> decided to be acceptable for them.
>> The default can stay the current one.
>> This should open the doors for Apache Solr,
>> Elasticsearch, OpenSearch, and any sort of plugin development
>>
>> *Option 3*
>> Move the max dimension limit lower level to a HNSW
>> specific implementation. Once there, this limit would not
>> bind any other potential vector engine
>> alternative/evolution.*
>> *
>> *Motivation:*There seem to be contradictory performance
>> interpretations about the current HNSW implementation.
>> Some consider its performance ok, some not, and it
>> depends on the target data set and use case. Increasing
>> the max dimension limit where it is currently (in top
>> level FloatVectorValues) would not allow
>> potential alternatives (e.g. for other use-cases) to be
>> based on a lower limit.
>>
>> *Option 4*
>> Make it configurable and move it to an appropriate place.
>> In particular, a
>> simple Integer.getInteger("lucene.hnsw.maxDimensions",
>> 1024) should be enough.
>> *Motivation*:
>> Both are good and not mutually exclusive and could happen
>> in any order.
>> Someone suggested to perfect what the _default_ limit
>> should be, but I've not seen an argument _against_
>> configurability. Especially in this way -- a toggle that
>> doesn't bind Lucene's APIs in any way.
>>
>> I'll keep this [VOTE] open for a week and then proceed to
>> the implementation.
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> /Apache Lucene/Solr Committer/
>> /Apache Solr PMC Member/
>>
>> e-mail: a.benedetti@sease.io/
>> /
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> |
>> Twitter <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
>> Github <https://github.com/seaseltd>
>>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

ichattopadhyaya at gmail

May 18, 2023, 2:10 AM

Post #38 of 46 (454 views)

That sounds promising, Michael. Can you share scripts/steps/code to
reproduce this?

On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <michael.wechner@wyona.com>
wrote:

> I just implemented it and tested it with OpenAI's text-embedding-ada-002,
> which is using 1536 dimensions and it works very fine :-)
>
> Thanks
>
> Michael
>
>
>
> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>
> IIUC KnnVectorField is deprecated and one is supposed to use
> KnnFloatVectorField when using float as vector values, right?
>
> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>
> see https://markmail.org/message/kf4nzoqyhwacb7ri
>
> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org> wrote:
>
>> > easily be circumvented by a user
>>
>> This is a revelation to me and others, if true. Michael, please then
>> point to a test or code snippet that shows the Lucene user community what
>> they want to see so they are unblocked from their explorations of vector
>> search.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>>
>>> I think I've said before on this list we don't actually enforce the
>>> limit in any way that can't easily be circumvented by a user. The codec
>>> already supports any size vector - it doesn't impose any limit. The way the
>>> API is written you can *already today* create an index with max-int sized
>>> vectors and we are committed to supporting that going forward by our
>>> backwards compatibility policy as Robert points out. This wasn't
>>> intentional, I think, but it is the facts.
>>>
>>> Given that, I think this whole discussion is not really necessary.
>>>
>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>> a.benedetti@sease.io> wrote:
>>>
>>>> Hi all,
>>>> we have finalized all the options proposed by the community and we are
>>>> ready to vote for the preferred one and then proceed with the
>>>> implementation.
>>>>
>>>> *Option 1*
>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>> *Motivation*:
>>>> We are close to improving on many fronts. Given the criticality of
>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>> most active stewards of the project, I think we should keep working toward
>>>> improving the feature as is and move to up the limit after we can
>>>> demonstrate improvement unambiguously.
>>>>
>>>> *Option 2*
>>>> make the limit configurable, for example through a system property
>>>> *Motivation*:
>>>> The system administrator can enforce a limit its users need to respect
>>>> that it's in line with whatever the admin decided to be acceptable for
>>>> them.
>>>> The default can stay the current one.
>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>> and any sort of plugin development
>>>>
>>>> *Option 3*
>>>> Move the max dimension limit lower level to a HNSW specific
>>>> implementation. Once there, this limit would not bind any other potential
>>>> vector engine alternative/evolution.
>>>> *Motivation:* There seem to be contradictory performance
>>>> interpretations about the current HNSW implementation. Some consider its
>>>> performance ok, some not, and it depends on the target data set and use
>>>> case. Increasing the max dimension limit where it is currently (in top
>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>> other use-cases) to be based on a lower limit.
>>>>
>>>> *Option 4*
>>>> Make it configurable and move it to an appropriate place.
>>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>>> 1024) should be enough.
>>>> *Motivation*:
>>>> Both are good and not mutually exclusive and could happen in any order.
>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>> I've not seen an argument _against_ configurability. Especially in this
>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>
>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>> implementation.
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benedetti@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>
>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

a.benedetti at sease

May 18, 2023, 2:24 AM

Post #39 of 46 (454 views)

That's great and a good plan B, but let's try to focus this thread of
collecting votes for a week (let's keep discussions on the nice PR opened
by David or the discussion thread we have in the mailing list already :)

On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <ichattopadhyaya@gmail.com>
wrote:

> That sounds promising, Michael. Can you share scripts/steps/code to
> reproduce this?
>
> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <michael.wechner@wyona.com>
> wrote:
>
>> I just implemented it and tested it with OpenAI's text-embedding-ada-002,
>> which is using 1536 dimensions and it works very fine :-)
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>>
>> IIUC KnnVectorField is deprecated and one is supposed to use
>> KnnFloatVectorField when using float as vector values, right?
>>
>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>
>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>
>> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org> wrote:
>>
>>> > easily be circumvented by a user
>>>
>>> This is a revelation to me and others, if true. Michael, please then
>>> point to a test or code snippet that shows the Lucene user community what
>>> they want to see so they are unblocked from their explorations of vector
>>> search.
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>>
>>>> I think I've said before on this list we don't actually enforce the
>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>> API is written you can *already today* create an index with max-int sized
>>>> vectors and we are committed to supporting that going forward by our
>>>> backwards compatibility policy as Robert points out. This wasn't
>>>> intentional, I think, but it is the facts.
>>>>
>>>> Given that, I think this whole discussion is not really necessary.
>>>>
>>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>>> a.benedetti@sease.io> wrote:
>>>>
>>>>> Hi all,
>>>>> we have finalized all the options proposed by the community and we are
>>>>> ready to vote for the preferred one and then proceed with the
>>>>> implementation.
>>>>>
>>>>> *Option 1*
>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>> *Motivation*:
>>>>> We are close to improving on many fronts. Given the criticality of
>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>> most active stewards of the project, I think we should keep working toward
>>>>> improving the feature as is and move to up the limit after we can
>>>>> demonstrate improvement unambiguously.
>>>>>
>>>>> *Option 2*
>>>>> make the limit configurable, for example through a system property
>>>>> *Motivation*:
>>>>> The system administrator can enforce a limit its users need to respect
>>>>> that it's in line with whatever the admin decided to be acceptable for
>>>>> them.
>>>>> The default can stay the current one.
>>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>>> and any sort of plugin development
>>>>>
>>>>> *Option 3*
>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>> implementation. Once there, this limit would not bind any other potential
>>>>> vector engine alternative/evolution.
>>>>> *Motivation:* There seem to be contradictory performance
>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>> performance ok, some not, and it depends on the target data set and use
>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>> other use-cases) to be based on a lower limit.
>>>>>
>>>>> *Option 4*
>>>>> Make it configurable and move it to an appropriate place.
>>>>> In particular, a
>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>> enough.
>>>>> *Motivation*:
>>>>> Both are good and not mutually exclusive and could happen in any order.
>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>
>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>> implementation.
>>>>> --------------------------
>>>>> *Alessandro Benedetti*
>>>>> Director @ Sease Ltd.
>>>>> *Apache Lucene/Solr Committer*
>>>>> *Apache Solr PMC Member*
>>>>>
>>>>> e-mail: a.benedetti@sease.io
>>>>>
>>>>>
>>>>> *Sease* - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>>
>>>>> Website: Sease.io <http://sease.io/>
>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>> <https://github.com/seaseltd>
>>>>>
>>>>
>>
>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

lucene at mikemccandless

May 18, 2023, 3:22 AM

Post #40 of 46 (454 views)

This isn't really a VOTE (no specific code change is being proposed), but
rather a poll?

Anyway, I would prefer Option 3: put the limit check into the HNSW
algorithm itself. This is the right place for the limit check, since HNSW
has its own scaling behaviour. It might have other limits, like max
fanout, etc. And we really should fix the loophole Mike S posted -- that's
just a dangerous long-term trap for users, thinking they have the back
compat promise of Lucene, when in fact they do not.

I love all the energy and passion going into debating all the ways to poke
at this limit, but please let's also spend some of this passion on actually
improving the scalability of our aKNN implementation! E.g. Robert opened
an exciting "Plan B" ( https://github.com/apache/lucene/issues/12302 ) to
workaround OpenJDK's crazy slowness on enabling access to vectorized SIMD
CPU instructions (the Java Vector API, JEP 426: https://openjdk.org/jeps/426
). This could help postings and doc values performance too!

Mike McCandless

http://blog.mikemccandless.com

On Thu, May 18, 2023 at 5:24?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> That's great and a good plan B, but let's try to focus this thread of
> collecting votes for a week (let's keep discussions on the nice PR opened
> by David or the discussion thread we have in the mailing list already :)
>
> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <
> ichattopadhyaya@gmail.com> wrote:
>
>> That sounds promising, Michael. Can you share scripts/steps/code to
>> reproduce this?
>>
>> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <michael.wechner@wyona.com>
>> wrote:
>>
>>> I just implemented it and tested it with OpenAI's
>>> text-embedding-ada-002, which is using 1536 dimensions and it works very
>>> fine :-)
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>>
>>> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>>>
>>> IIUC KnnVectorField is deprecated and one is supposed to use
>>> KnnFloatVectorField when using float as vector values, right?
>>>
>>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>>
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org>
>>> wrote:
>>>
>>>> > easily be circumvented by a user
>>>>
>>>> This is a revelation to me and others, if true. Michael, please then
>>>> point to a test or code snippet that shows the Lucene user community what
>>>> they want to see so they are unblocked from their explorations of vector
>>>> search.
>>>>
>>>> ~ David Smiley
>>>> Apache Lucene/Solr Search Developer
>>>> http://www.linkedin.com/in/davidwsmiley
>>>>
>>>>
>>>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov <msokolov@gmail.com>
>>>> wrote:
>>>>
>>>>> I think I've said before on this list we don't actually enforce the
>>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>>> API is written you can *already today* create an index with max-int sized
>>>>> vectors and we are committed to supporting that going forward by our
>>>>> backwards compatibility policy as Robert points out. This wasn't
>>>>> intentional, I think, but it is the facts.
>>>>>
>>>>> Given that, I think this whole discussion is not really necessary.
>>>>>
>>>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>>>> a.benedetti@sease.io> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> we have finalized all the options proposed by the community and we
>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>> implementation.
>>>>>>
>>>>>> *Option 1*
>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>> *Motivation*:
>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>> most active stewards of the project, I think we should keep working toward
>>>>>> improving the feature as is and move to up the limit after we can
>>>>>> demonstrate improvement unambiguously.
>>>>>>
>>>>>> *Option 2*
>>>>>> make the limit configurable, for example through a system property
>>>>>> *Motivation*:
>>>>>> The system administrator can enforce a limit its users need to
>>>>>> respect that it's in line with whatever the admin decided to be acceptable
>>>>>> for them.
>>>>>> The default can stay the current one.
>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>> OpenSearch, and any sort of plugin development
>>>>>>
>>>>>> *Option 3*
>>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>> vector engine alternative/evolution.
>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>>> performance ok, some not, and it depends on the target data set and use
>>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>> other use-cases) to be based on a lower limit.
>>>>>>
>>>>>> *Option 4*
>>>>>> Make it configurable and move it to an appropriate place.
>>>>>> In particular, a
>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>> enough.
>>>>>> *Motivation*:
>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>> order.
>>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>
>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>> implementation.
>>>>>> --------------------------
>>>>>> *Alessandro Benedetti*
>>>>>> Director @ Sease Ltd.
>>>>>> *Apache Lucene/Solr Committer*
>>>>>> *Apache Solr PMC Member*
>>>>>>
>>>>>> e-mail: a.benedetti@sease.io
>>>>>>
>>>>>>
>>>>>> *Sease* - Information Retrieval Applied
>>>>>> Consulting | Training | Open Source
>>>>>>
>>>>>> Website: Sease.io <http://sease.io/>
>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>> <https://github.com/seaseltd>
>>>>>>
>>>>>
>>>
>>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

michael.wechner at wyona

May 18, 2023, 4:42 AM

Post #41 of 46 (454 views)

It is basically the code which Michael Sokolov posted at

https://markmail.org/message/kf4nzoqyhwacb7ri

except
- that I have replaced KnnVectorField by KnnFloatVectorField, because
KnnVectorField is deprecated.
- that I don't hard code the dimension as 2048 and the metric as
EUCLIDEAN, but take the dimension and metric (VectorSimilarityFunction)
used by the model. which are in the case of for example
text-embedding-ada-002: 1536 and COSINE
(https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use)

HTH

Michael

Am 18.05.23 um 11:10 schrieb Ishan Chattopadhyaya:
> That sounds promising, Michael. Can you share scripts/steps/code to
> reproduce this?
>
> On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
> <michael.wechner@wyona.com> wrote:
>
> I just implemented it and tested it with OpenAI's
> text-embedding-ada-002, which is using 1536 dimensions and it
> works very fine :-)
>
> Thanks
>
> Michael
>
>
>
> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>> IIUC KnnVectorField is deprecated and one is supposed to use
>> KnnFloatVectorField when using float as vector values, right?
>>
>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09?AM David Smiley
>>> <dsmiley@apache.org> wrote:
>>>
>>> > easily be circumvented by a user
>>>
>>> This is a revelation to me and others, if true. Michael,
>>> please then point to a test or code snippet that shows the
>>> Lucene user community what they want to see so they are
>>> unblocked from their explorations of vector search.
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov
>>> <msokolov@gmail.com> wrote:
>>>
>>> I think I've said before on this list we don't actually
>>> enforce the limit in any way that can't easily be
>>> circumvented by a user. The codec already supports any
>>> size vector - it doesn't impose any limit. The way the
>>> API is written you can *already today* create an index
>>> with max-int sized vectors and we are committed to
>>> supporting that going forward by our backwards
>>> compatibility policy as Robert points out. This wasn't
>>> intentional, I think, but it is the facts.
>>>
>>> Given that, I think this whole discussion is not really
>>> necessary.
>>>
>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti
>>> <a.benedetti@sease.io> wrote:
>>>
>>> Hi all,
>>> we have finalized all the options proposed by the
>>> community and we are ready to vote for the preferred
>>> one and then proceed with the implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the
>>> criticality of Lucene in computing infrastructure
>>> and the concerns raised by one of the most active
>>> stewards of the project, I think we should keep
>>> working toward improving the feature as is and move
>>> to up the limit after we can demonstrate improvement
>>> unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a
>>> system property
>>> *Motivation*:
>>> The system administrator can enforce a limit its
>>> users need to respect that it's in line with
>>> whatever the admin decided to be acceptable for them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr,
>>> Elasticsearch, OpenSearch, and any sort of plugin
>>> development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW
>>> specific implementation. Once there, this limit
>>> would not bind any other potential vector engine
>>> alternative/evolution.*
>>> *
>>> *Motivation:*There seem to be contradictory
>>> performance interpretations about the current HNSW
>>> implementation. Some consider its performance ok,
>>> some not, and it depends on the target data set and
>>> use case. Increasing the max dimension limit where
>>> it is currently (in top level FloatVectorValues)
>>> would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate
>>> place.
>>> In particular, a
>>> simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and could
>>> happen in any order.
>>> Someone suggested to perfect what the _default_
>>> limit should be, but I've not seen an argument
>>> _against_ configurability. Especially in this way
>>> -- a toggle that doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and then
>>> proceed to the implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> /Apache Lucene/Solr Committer/
>>> /Apache Solr PMC Member/
>>>
>>> e-mail: a.benedetti@sease.io/
>>> /
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> |
>>> Twitter <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
>>> Github <https://github.com/seaseltd>
>>>
>>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

michael.wechner at wyona

May 18, 2023, 5:07 AM

Post #42 of 46 (454 views)

Am 18.05.23 um 12:22 schrieb Michael McCandless:
>
> I love all the energy and passion going into debating all the ways to
> poke at this limit, but please let's also spend some of this passion
> on actually improving the scalability of our aKNN implementation!
> E.g. Robert opened an exciting "Plan B" (
> https://github.com/apache/lucene/issues/12302 ) to workaround
> OpenJDK's crazy slowness on enabling access to vectorized SIMD CPU
> instructions (the Java Vector API, JEP 426:
> https://openjdk.org/jeps/426 ). This could help postings and doc
> values performance too!

agreed, but I do not think the MAX_DIMENSIONS decision should depend on
this, because I think whatever improvements can be accomplished
eventually, very likely there will always be some limit.

Thanks

Michael

>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, May 18, 2023 at 5:24?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>
> That's great and a good plan B, but let's try to focus this thread
> of collecting votes for a week (let's keep discussions on the nice
> PR opened by David or the discussion thread we have in the mailing
> list already :)
>
> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya,
> <ichattopadhyaya@gmail.com> wrote:
>
> That sounds promising, Michael. Can you share
> scripts/steps/code to reproduce this?
>
> On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
> <michael.wechner@wyona.com> wrote:
>
> I just implemented it and tested it with OpenAI's
> text-embedding-ada-002, which is using 1536 dimensions and
> it works very fine :-)
>
> Thanks
>
> Michael
>
>
>
> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>> IIUC KnnVectorField is deprecated and one is supposed to
>> use KnnFloatVectorField when using float as vector
>> values, right?
>>
>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09?AM David Smiley
>>> <dsmiley@apache.org> wrote:
>>>
>>> > easily be circumvented by a user
>>>
>>> This is a revelation to me and others, if true.
>>> Michael, please then point to a test or code snippet
>>> that shows the Lucene user community what they want
>>> to see so they are unblocked from their explorations
>>> of vector search.
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov
>>> <msokolov@gmail.com> wrote:
>>>
>>> I think I've said before on this list we don't
>>> actually enforce the limit in any way that can't
>>> easily be circumvented by a user. The codec
>>> already supports any size vector - it doesn't
>>> impose any limit. The way the API is written you
>>> can *already today* create an index with max-int
>>> sized vectors and we are committed to supporting
>>> that going forward by our backwards
>>> compatibility policy as Robert points out. This
>>> wasn't intentional, I think, but it is the facts.
>>>
>>> Given that, I think this whole discussion is not
>>> really necessary.
>>>
>>> On Tue, May 16, 2023 at 4:50?AM Alessandro
>>> Benedetti <a.benedetti@sease.io> wrote:
>>>
>>> Hi all,
>>> we have finalized all the options proposed
>>> by the community and we are ready to vote
>>> for the preferred one and then proceed with
>>> the implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded
>>> to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts.
>>> Given the criticality of Lucene in computing
>>> infrastructure and the concerns raised by
>>> one of the most active stewards of the
>>> project, I think we should keep working
>>> toward improving the feature as is and move
>>> to up the limit after we can demonstrate
>>> improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example
>>> through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit
>>> its users need to respect that it's in line
>>> with whatever the admin decided to be
>>> acceptable for them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr,
>>> Elasticsearch, OpenSearch, and any sort of
>>> plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to
>>> a HNSW specific implementation. Once there,
>>> this limit would not bind any other
>>> potential vector engine alternative/evolution.*
>>> *
>>> *Motivation:*There seem to be contradictory
>>> performance interpretations about the
>>> current HNSW implementation. Some consider
>>> its performance ok, some not, and it depends
>>> on the target data set and use case.
>>> Increasing the max dimension limit where it
>>> is currently (in top level
>>> FloatVectorValues) would not allow
>>> potential alternatives (e.g. for other
>>> use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an
>>> appropriate place.
>>> In particular, a
>>> simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and
>>> could happen in any order.
>>> Someone suggested to perfect what the
>>> _default_ limit should be, but I've not seen
>>> an argument _against_ configurability.
>>> Especially in this way -- a toggle that
>>> doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and
>>> then proceed to the implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> /Apache Lucene/Solr Committer/
>>> /Apache Solr PMC Member/
>>>
>>> e-mail: a.benedetti@sease.io/
>>> /
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn
>>> <https://linkedin.com/company/sease-ltd> |
>>> Twitter <https://twitter.com/seaseltd> |
>>> Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
>>> Github <https://github.com/seaseltd>
>>>
>>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

nknize at gmail

May 18, 2023, 6:33 AM

Post #43 of 46 (454 views)

Difficult to keep up with this topic when it's spread across issues, PRs,
and email lists. My poll response is option 3. -1 to option 2, I think the
configuration should be moved to the HNSW specific implementation. At this
point of technical maturity, it doesn't make sense (to me) to have the
config be a global system property.

Given the conversation fragmentation I'll ask here what I asked in my
comment on the github issue
<https://github.com/apache/lucene/issues/11507#issuecomment-1548612414>.

"Can anyone smart here post their benchmarks to substantiate their claims?"

For as enthusiastic a topic as vector dimensionality is, it sure is
discouraging there isn't empirical data to help make an informed decision
around what the recommended limit should be. I've only seen broad benchmark
claims like "We benchmarked a patched Lucene/Solr. We fully understand (we
measured it :-P)" It sure would be useful to see these benchmarks! Not
having them to help improve these arbitrary limits seems like a serious
disservice to the Lucene/Solr user community. I think until trustworthy
numbers are made available all we'll have is conjecture and opinions.

IMHO, given Java's lag in SIMD Vector support I'd rather see equal energy
put into Robert's Vector API Integration, Plan B
<https://github.com/apache/lucene/issues/12302> proposal. I'm not trying to
minimize the importance of adding a configuration to the HNSW
dimensionality, I just think we have the requisite expertise on this
project to fix the bigger performance issues that are a direct result of
Java's bigger vector performance deficiencies.

Nicholas Knize, Ph.D., GISP
Principal Engineer - Search | Amazon
Apache Lucene PMC Member and Committer
nknize@apache.org

On Thu, May 18, 2023 at 7:07?AM Michael Wechner <michael.wechner@wyona.com>
wrote:

>
>
> Am 18.05.23 um 12:22 schrieb Michael McCandless:
>
>
> I love all the energy and passion going into debating all the ways to poke
> at this limit, but please let's also spend some of this passion on actually
> improving the scalability of our aKNN implementation! E.g. Robert opened
> an exciting "Plan B" ( https://github.com/apache/lucene/issues/12302 ) to
> workaround OpenJDK's crazy slowness on enabling access to vectorized SIMD
> CPU instructions (the Java Vector API, JEP 426:
> https://openjdk.org/jeps/426 ). This could help postings and doc values
> performance too!
>
>
>
> agreed, but I do not think the MAX_DIMENSIONS decision should depend on
> this, because I think whatever improvements can be accomplished eventually,
> very likely there will always be some limit.
>
> Thanks
>
> Michael
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, May 18, 2023 at 5:24?AM Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
>> That's great and a good plan B, but let's try to focus this thread of
>> collecting votes for a week (let's keep discussions on the nice PR opened
>> by David or the discussion thread we have in the mailing list already :)
>>
>> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <
>> ichattopadhyaya@gmail.com> wrote:
>>
>>> That sounds promising, Michael. Can you share scripts/steps/code to
>>> reproduce this?
>>>
>>> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <
>>> michael.wechner@wyona.com> wrote:
>>>
>>>> I just implemented it and tested it with OpenAI's
>>>> text-embedding-ada-002, which is using 1536 dimensions and it works very
>>>> fine :-)
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>>>>
>>>> IIUC KnnVectorField is deprecated and one is supposed to use
>>>> KnnFloatVectorField when using float as vector values, right?
>>>>
>>>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>>>
>>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>>
>>>> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org>
>>>> wrote:
>>>>
>>>>> > easily be circumvented by a user
>>>>>
>>>>> This is a revelation to me and others, if true. Michael, please then
>>>>> point to a test or code snippet that shows the Lucene user community what
>>>>> they want to see so they are unblocked from their explorations of vector
>>>>> search.
>>>>>
>>>>> ~ David Smiley
>>>>> Apache Lucene/Solr Search Developer
>>>>> http://www.linkedin.com/in/davidwsmiley
>>>>>
>>>>>
>>>>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov <msokolov@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think I've said before on this list we don't actually enforce the
>>>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>>>> API is written you can *already today* create an index with max-int sized
>>>>>> vectors and we are committed to supporting that going forward by our
>>>>>> backwards compatibility policy as Robert points out. This wasn't
>>>>>> intentional, I think, but it is the facts.
>>>>>>
>>>>>> Given that, I think this whole discussion is not really necessary.
>>>>>>
>>>>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>>>>> a.benedetti@sease.io> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> we have finalized all the options proposed by the community and we
>>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>>> implementation.
>>>>>>>
>>>>>>> *Option 1*
>>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>>> *Motivation*:
>>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>>> most active stewards of the project, I think we should keep working toward
>>>>>>> improving the feature as is and move to up the limit after we can
>>>>>>> demonstrate improvement unambiguously.
>>>>>>>
>>>>>>> *Option 2*
>>>>>>> make the limit configurable, for example through a system property
>>>>>>> *Motivation*:
>>>>>>> The system administrator can enforce a limit its users need to
>>>>>>> respect that it's in line with whatever the admin decided to be acceptable
>>>>>>> for them.
>>>>>>> The default can stay the current one.
>>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>>> OpenSearch, and any sort of plugin development
>>>>>>>
>>>>>>> *Option 3*
>>>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>>> vector engine alternative/evolution.
>>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>>>> performance ok, some not, and it depends on the target data set and use
>>>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>>> other use-cases) to be based on a lower limit.
>>>>>>>
>>>>>>> *Option 4*
>>>>>>> Make it configurable and move it to an appropriate place.
>>>>>>> In particular, a
>>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>>> enough.
>>>>>>> *Motivation*:
>>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>>> order.
>>>>>>> Someone suggested to perfect what the _default_ limit should be, but
>>>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>>
>>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>>> implementation.
>>>>>>> --------------------------
>>>>>>> *Alessandro Benedetti*
>>>>>>> Director @ Sease Ltd.
>>>>>>> *Apache Lucene/Solr Committer*
>>>>>>> *Apache Solr PMC Member*
>>>>>>>
>>>>>>> e-mail: a.benedetti@sease.io
>>>>>>>
>>>>>>>
>>>>>>> *Sease* - Information Retrieval Applied
>>>>>>> Consulting | Training | Open Source
>>>>>>>
>>>>>>> Website: Sease.io <http://sease.io/>
>>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>>> <https://github.com/seaseltd>
>>>>>>>
>>>>>>
>>>>
>>>>
>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

a.benedetti at sease

May 19, 2023, 2:57 AM

Post #44 of 46 (454 views)

Thanks to everyone involved so far!
I confirm that a proper subject should have been [POLL] rather than [VOTE],
apologies for the confusion.

We are in the middle of the poll and this is the summary so far (ordered by
preference):

Option 2-4: 9 votes
make the limit configurable, potentially moving the limit to the
appropriate place

Option 3: 4 votes
keep it as it is (1024) but move it lower level in HNSW-specific
implementation

Option 1: 0 votes
keep it as it is (1024)

I've also seen many people responding in the mail thread, but not
indicating their preference.
I believe it would be very useful if everyone interested, expresses their
preference.

Have a good day!
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

On Thu, 18 May 2023 at 14:34, Nicholas Knize <nknize@gmail.com> wrote:

> Difficult to keep up with this topic when it's spread across issues, PRs,
> and email lists. My poll response is option 3. -1 to option 2, I think the
> configuration should be moved to the HNSW specific implementation. At this
> point of technical maturity, it doesn't make sense (to me) to have the
> config be a global system property.
>
> Given the conversation fragmentation I'll ask here what I asked in my
> comment on the github issue
> <https://github.com/apache/lucene/issues/11507#issuecomment-1548612414>.
>
> "Can anyone smart here post their benchmarks to substantiate their
> claims?"
>
> For as enthusiastic a topic as vector dimensionality is, it sure is
> discouraging there isn't empirical data to help make an informed decision
> around what the recommended limit should be. I've only seen broad benchmark
> claims like "We benchmarked a patched Lucene/Solr. We fully understand (we
> measured it :-P)" It sure would be useful to see these benchmarks! Not
> having them to help improve these arbitrary limits seems like a serious
> disservice to the Lucene/Solr user community. I think until trustworthy
> numbers are made available all we'll have is conjecture and opinions.
>
> IMHO, given Java's lag in SIMD Vector support I'd rather see equal energy
> put into Robert's Vector API Integration, Plan B
> <https://github.com/apache/lucene/issues/12302> proposal. I'm not trying
> to minimize the importance of adding a configuration to the HNSW
> dimensionality, I just think we have the requisite expertise on this
> project to fix the bigger performance issues that are a direct result of
> Java's bigger vector performance deficiencies.
>
> Nicholas Knize, Ph.D., GISP
> Principal Engineer - Search | Amazon
> Apache Lucene PMC Member and Committer
> nknize@apache.org
>
>
> On Thu, May 18, 2023 at 7:07?AM Michael Wechner <michael.wechner@wyona.com>
> wrote:
>
>>
>>
>> Am 18.05.23 um 12:22 schrieb Michael McCandless:
>>
>>
>> I love all the energy and passion going into debating all the ways to
>> poke at this limit, but please let's also spend some of this passion on
>> actually improving the scalability of our aKNN implementation! E.g. Robert
>> opened an exciting "Plan B" (
>> https://github.com/apache/lucene/issues/12302 ) to workaround
>> OpenJDK's crazy slowness on enabling access to vectorized SIMD CPU
>> instructions (the Java Vector API, JEP 426: https://openjdk.org/jeps/426
>> ). This could help postings and doc values performance too!
>>
>>
>>
>> agreed, but I do not think the MAX_DIMENSIONS decision should depend on
>> this, because I think whatever improvements can be accomplished eventually,
>> very likely there will always be some limit.
>>
>> Thanks
>>
>> Michael
>>
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, May 18, 2023 at 5:24?AM Alessandro Benedetti <
>> a.benedetti@sease.io> wrote:
>>
>>> That's great and a good plan B, but let's try to focus this thread of
>>> collecting votes for a week (let's keep discussions on the nice PR opened
>>> by David or the discussion thread we have in the mailing list already :)
>>>
>>> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <
>>> ichattopadhyaya@gmail.com> wrote:
>>>
>>>> That sounds promising, Michael. Can you share scripts/steps/code to
>>>> reproduce this?
>>>>
>>>> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <
>>>> michael.wechner@wyona.com> wrote:
>>>>
>>>>> I just implemented it and tested it with OpenAI's
>>>>> text-embedding-ada-002, which is using 1536 dimensions and it works very
>>>>> fine :-)
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>> Am 18.05.23 um 00:29 schrieb Michael Wechner:
>>>>>
>>>>> IIUC KnnVectorField is deprecated and one is supposed to use
>>>>> KnnFloatVectorField when using float as vector values, right?
>>>>>
>>>>> Am 17.05.23 um 16:41 schrieb Michael Sokolov:
>>>>>
>>>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>>>
>>>>> On Wed, May 17, 2023 at 10:09?AM David Smiley <dsmiley@apache.org>
>>>>> wrote:
>>>>>
>>>>>> > easily be circumvented by a user
>>>>>>
>>>>>> This is a revelation to me and others, if true. Michael, please then
>>>>>> point to a test or code snippet that shows the Lucene user community what
>>>>>> they want to see so they are unblocked from their explorations of vector
>>>>>> search.
>>>>>>
>>>>>> ~ David Smiley
>>>>>> Apache Lucene/Solr Search Developer
>>>>>> http://www.linkedin.com/in/davidwsmiley
>>>>>>
>>>>>>
>>>>>> On Wed, May 17, 2023 at 7:51?AM Michael Sokolov <msokolov@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think I've said before on this list we don't actually enforce the
>>>>>>> limit in any way that can't easily be circumvented by a user. The codec
>>>>>>> already supports any size vector - it doesn't impose any limit. The way the
>>>>>>> API is written you can *already today* create an index with max-int sized
>>>>>>> vectors and we are committed to supporting that going forward by our
>>>>>>> backwards compatibility policy as Robert points out. This wasn't
>>>>>>> intentional, I think, but it is the facts.
>>>>>>>
>>>>>>> Given that, I think this whole discussion is not really necessary.
>>>>>>>
>>>>>>> On Tue, May 16, 2023 at 4:50?AM Alessandro Benedetti <
>>>>>>> a.benedetti@sease.io> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> we have finalized all the options proposed by the community and we
>>>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>>>> implementation.
>>>>>>>>
>>>>>>>> *Option 1*
>>>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>>>> *Motivation*:
>>>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>>>> most active stewards of the project, I think we should keep working toward
>>>>>>>> improving the feature as is and move to up the limit after we can
>>>>>>>> demonstrate improvement unambiguously.
>>>>>>>>
>>>>>>>> *Option 2*
>>>>>>>> make the limit configurable, for example through a system property
>>>>>>>> *Motivation*:
>>>>>>>> The system administrator can enforce a limit its users need to
>>>>>>>> respect that it's in line with whatever the admin decided to be acceptable
>>>>>>>> for them.
>>>>>>>> The default can stay the current one.
>>>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>>>> OpenSearch, and any sort of plugin development
>>>>>>>>
>>>>>>>> *Option 3*
>>>>>>>> Move the max dimension limit lower level to a HNSW specific
>>>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>>>> vector engine alternative/evolution.
>>>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>>>> interpretations about the current HNSW implementation. Some consider its
>>>>>>>> performance ok, some not, and it depends on the target data set and use
>>>>>>>> case. Increasing the max dimension limit where it is currently (in top
>>>>>>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>>>>>>> other use-cases) to be based on a lower limit.
>>>>>>>>
>>>>>>>> *Option 4*
>>>>>>>> Make it configurable and move it to an appropriate place.
>>>>>>>> In particular, a
>>>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>>>> enough.
>>>>>>>> *Motivation*:
>>>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>>>> order.
>>>>>>>> Someone suggested to perfect what the _default_ limit should be,
>>>>>>>> but I've not seen an argument _against_ configurability. Especially in
>>>>>>>> this way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>>>
>>>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>>>> implementation.
>>>>>>>> --------------------------
>>>>>>>> *Alessandro Benedetti*
>>>>>>>> Director @ Sease Ltd.
>>>>>>>> *Apache Lucene/Solr Committer*
>>>>>>>> *Apache Solr PMC Member*
>>>>>>>>
>>>>>>>> e-mail: a.benedetti@sease.io
>>>>>>>>
>>>>>>>>
>>>>>>>> *Sease* - Information Retrieval Applied
>>>>>>>> Consulting | Training | Open Source
>>>>>>>>
>>>>>>>> Website: Sease.io <http://sease.io/>
>>>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>>>> <https://github.com/seaseltd>
>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

bruno.roustant at gmail

May 22, 2023, 1:17 AM

Post #45 of 46 (452 views)

I vote for option 3.
Then with a follow up work to have a simple extension codec in the "codecs"
package which is
1- not backward compatible, and 2- has a higher or configurable limit. That
way users can directly use this codec without any additional code.

Re: [VOTE] Dimension Limit for KNN Vectors [ In reply to ]

a.benedetti at sease

May 23, 2023, 2:27 AM

Post #46 of 46 (452 views)

Closing the poll after one week, these are the results:

Option 2-4: 9 votes
make the limit configurable, potentially moving the limit to the
appropriate place

Option 3: 5 votes
keep it as it is (1024) but move it lower level in HNSW-specific
implementation

Option 1: 0 votes
keep it as it is (1024)

-----
I was expecting more people to express their preferences, unfortunately,
many digressed to discussions without expressing any.
Given that, it seems clear that we want one of the most voted options, so
let's continue the discussions under the related Pull Requests and then
proceed to merges when agreement if found!

Thanks to everyone involved!

--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>

On Mon, 22 May 2023 at 09:17, Bruno Roustant <bruno.roustant@gmail.com>
wrote:

> I vote for option 3.
> Then with a follow up work to have a simple extension codec in the
> "codecs" package which is
> 1- not backward compatible, and 2- has a higher or configurable limit.
> That way users can directly use this codec without any additional code.
>