We agree that backwards compatibility with the index should be maintained
and that checkIndex should work. We agree on a number of other things too,
but I want to focus on configurability.
As long as the index records the number of dimensions actually used in a
specific segment & field, why couldn't checkIndex work if the dimension
*limit* is configurable? It's not checkIndex's job to enforce the limit,
only to check that the data appears consistent / valid, irrespective of how
the number of dimensions came to be specified originally.
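To make that concrete, here is a rough sketch of the kind of consistency
check I mean (hypothetical method name, not the actual CheckIndex code; it
assumes the current FloatVectorValues iterator API):

  import java.io.IOException;
  import org.apache.lucene.index.CorruptIndexException;
  import org.apache.lucene.index.FieldInfo;
  import org.apache.lucene.index.FloatVectorValues;
  import org.apache.lucene.search.DocIdSetIterator;

  // Sketch: validate each stored vector against the dimension recorded
  // in the segment's FieldInfo; no global limit is consulted at all.
  static void checkVectors(FloatVectorValues values, FieldInfo field)
      throws IOException {
    int expected = field.getVectorDimension();
    for (int doc = values.nextDoc();
        doc != DocIdSetIterator.NO_MORE_DOCS;
        doc = values.nextDoc()) {
      float[] v = values.vectorValue();
      if (v.length != expected) {
        throw new CorruptIndexException(
            "vector dimension " + v.length + " != " + expected, "vectors");
      }
    }
  }

The check compares what is in the index against what the segment says
should be there; the configured limit never enters into it.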
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
On Tue, May 16, 2023 at 10:58 PM Robert Muir <rcmuir@gmail.com> wrote:
> My problem is that it impacts the default codec, which is supported by our
> backwards compatibility policy for many years. We can't just let the user
> determine backwards compatibility with a sysprop. How will checkIndex work?
> We have to have bounds, and also allow for more performant implementations
> that might have different limitations. And I'm pretty sure we will want a
> faster implementation than what we have today, and it will probably have
> different limits.
>
> For other codecs, it is fine to have a different limit, as I already said,
> since it is implementation dependent. And honestly the stuff in lucene/codecs
> can be more "fast and loose" because it doesn't require the extensive index
> back compat guarantee.
>
> Again, my primary concern is that index back compat guarantee. When it
> comes to limits, the proper way is not to just keep bumping them without
> technical reasons; instead, the correct approach is to fix the technical
> problems and make the limits irrelevant. Great example here (merged this
> morning):
> https://github.com/apache/lucene/commit/f53eb28af053d7612f7e4d1b2de05d33dc410645
>
>
> On Tue, May 16, 2023 at 10:49 PM David Smiley <dsmiley@apache.org> wrote:
>
>> Robert, I have not heard from you (or anyone) an argument against
>> configurability via a system property (as I described in Option 4). Uwe
>> wisely notes that some care must be taken to ensure it actually works.
>> Sure, of course. What concerns do you have with this?
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Tue, May 16, 2023 at 9:50 PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> By the way, I agree with the idea to MOVE THE LIMIT UNCHANGED to the
>>> hnsw-specific code.
>>>
>>> This way, someone can write an alternative codec for vectors using some
>>> completely different approach that incorporates a different, more
>>> appropriate limit (maybe lower, maybe higher) depending upon its
>>> tradeoffs. We should encourage this, as I think it is the "only true fix"
>>> to the scalability issues: use a scalable algorithm! Also, alternative
>>> codecs don't force the project into many years of index backwards
>>> compatibility, which is really my primary concern. We can lock ourselves
>>> into a truly bad place and become irrelevant (especially with scalar code
>>> implementing all this vector stuff, it is really senseless). In the
>>> meantime I suggest we try to reduce pain for the default codec with the
>>> current implementation if possible. If that is not possible, we need a
>>> new codec that performs.
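>>>
>>> As a rough sketch of what a codec-owned limit could look like (the
>>> getMaxDimensions hook below is hypothetical, not an existing API; the
>>> wrapper simply delegates to the current HNSW format):
>>>
>>>   import java.io.IOException;
>>>   import org.apache.lucene.codecs.KnnVectorsFormat;
>>>   import org.apache.lucene.codecs.KnnVectorsReader;
>>>   import org.apache.lucene.codecs.KnnVectorsWriter;
>>>   import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
>>>   import org.apache.lucene.index.SegmentReadState;
>>>   import org.apache.lucene.index.SegmentWriteState;
>>>
>>>   public class MyVectorsFormat extends KnnVectorsFormat {
>>>     private final KnnVectorsFormat delegate = new Lucene95HnswVectorsFormat();
>>>
>>>     public MyVectorsFormat() {
>>>       super("MyVectorsFormat");
>>>     }
>>>
>>>     // Hypothetical hook: each format declares the limit it can support,
>>>     // based on its own algorithm and tradeoffs, instead of inheriting
>>>     // one global constant from FloatVectorValues.
>>>     public int getMaxDimensions(String fieldName) {
>>>       return 4096;
>>>     }
>>>
>>>     @Override
>>>     public KnnVectorsWriter fieldsWriter(SegmentWriteState state)
>>>         throws IOException {
>>>       return delegate.fieldsWriter(state);
>>>     }
>>>
>>>     @Override
>>>     public KnnVectorsReader fieldsReader(SegmentReadState state)
>>>         throws IOException {
>>>       return delegate.fieldsReader(state);
>>>     }
>>>   }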
>>>
>>> On Tue, May 16, 2023 at 8:53 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>>> Gus, I think I explained myself multiple times, on issues and in this
>>>> thread. The performance is unacceptable, everyone knows it, but nobody is
>>>> talking about it.
>>>> I don't need to explain myself time and time again here.
>>>> You don't seem to understand the technical issues (at least you sure as
>>>> fuck don't know how service loading works, or you wouldn't have opened
>>>> https://github.com/apache/lucene/issues/12300)
>>>>
>>>> I'm just the only one here completely unconstrained by any of Silicon
>>>> Valley's influences to speak my true mind, without any repercussions, so I
>>>> do it. I don't give any fucks about ChatGPT.
>>>>
>>>> I'm standing by my technical veto. If you bypass it, I'll revert the
>>>> offending commit.
>>>>
>>>> As far as fixing the technical performance, I just opened an issue with
>>>> some ideas to at least improve CPU usage by a factor of N. It does not help
>>>> with the crazy heap memory usage or other issues of the KNN implementation
>>>> causing shit like OOM on merge. But it is one step:
>>>> https://github.com/apache/lucene/issues/12302
>>>>
>>>>
>>>>
>>>> On Tue, May 16, 2023 at 7:45 AM Gus Heck <gus.heck@gmail.com> wrote:
>>>>
>>>>> Robert,
>>>>>
>>>>> Can you explain in clear technical terms the standard that must be met
>>>>> for performance? A benchmark that must run in X time on Y hardware, for
>>>>> example (and why that test is suitable)? Or some other reproducible
>>>>> criterion? So far I've heard you give an *opinion* that it's unusable, but
>>>>> that's not a technical criterion; others may have a different concept of
>>>>> what is usable to them.
>>>>>
>>>>> Forgive me if I misunderstand, but the essence of your argument has
>>>>> seemed to be
>>>>>
>>>>> "Performance isn't good enough, therefore we should force anyone who
>>>>> wants to experiment with something bigger to fork the code base to do it"
>>>>>
>>>>> Thus, it is necessary to have a clear, unambiguous standard that anyone
>>>>> can verify for "good enough". A clear standard would also focus efforts
>>>>> at improvement.
>>>>>
>>>>> Where are the goal posts?
>>>>>
>>>>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard
>>>>> limit is fundamentally counterproductive in an open source setting, as it
>>>>> will lead to *fewer people* pushing the limits. Extremely few people
>>>>> are going to get into the nitty-gritty of optimizing things unless they are
>>>>> staring at code that they can prove does something interesting, but doesn't
>>>>> run fast enough for their purposes. If people hit a hard limit, more of
>>>>> them give up and never develop the code that will motivate them to look for
>>>>> optimizations.
>>>>>
>>>>> -Gus
>>>>>
>>>>> On Tue, May 16, 2023 at 6:04 AM Robert Muir <rcmuir@gmail.com> wrote:
>>>>>
>>>>>> I still feel -1 (veto) on increasing this limit. Sending more emails
>>>>>> does not change the technical facts or make the veto go away.
>>>>>>
>>>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>>>> a.benedetti@sease.io> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> we have finalized all the options proposed by the community and we
>>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>>> implementation.
>>>>>>>
>>>>>>> *Option 1*
>>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>>> *Motivation*:
>>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>>> most active stewards of the project, I think we should keep working toward
>>>>>>> improving the feature as is, and move to raise the limit only after we can
>>>>>>> demonstrate improvement unambiguously.
>>>>>>>
>>>>>>> *Option 2*
>>>>>>> Make the limit configurable, for example through a system property.
>>>>>>> *Motivation*:
>>>>>>> The system administrator can enforce a limit that their users need to
>>>>>>> respect, in line with whatever the admin has decided is acceptable for
>>>>>>> them.
>>>>>>> The default can stay the current one.
>>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>>> OpenSearch, and any sort of plugin development.
>>>>>>>
>>>>>>> *Option 3*
>>>>>>> Move the max dimension limit to a lower level, in an HNSW-specific
>>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>>> vector engine alternative/evolution.
>>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>>> interpretations of the current HNSW implementation. Some consider its
>>>>>>> performance OK, some do not, and it depends on the target data set and use
>>>>>>> case. Increasing the max dimension limit where it currently lives (in the
>>>>>>> top-level FloatVectorValues) would not allow potential alternatives (e.g.
>>>>>>> for other use cases) to be based on a lower limit.
>>>>>>>
>>>>>>> *Option 4*
>>>>>>> Make it configurable and move it to an appropriate place.
>>>>>>> In particular, a
>>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>>> enough.
>>>>>>> *Motivation*:
>>>>>>> Both are good, are not mutually exclusive, and could happen in any
>>>>>>> order.
>>>>>>> Someone suggested perfecting what the _default_ limit should be, but
>>>>>>> I've not seen an argument _against_ configurability. Especially in this
>>>>>>> way -- a toggle that doesn't bind Lucene's APIs in any way (see the
>>>>>>> sketch below).
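>>>>>>>
>>>>>>> As a minimal sketch of what this could look like (class, field, and
>>>>>>> property names here are illustrative, not a final API):
>>>>>>>
>>>>>>>   // Read once at class-load time; leaving the property unset keeps
>>>>>>>   // the long-standing default of 1024.
>>>>>>>   static final int MAX_DIMENSIONS =
>>>>>>>       Integer.getInteger("lucene.hnsw.maxDimensions", 1024);
>>>>>>>
>>>>>>>   static void checkDimension(int dim) {
>>>>>>>     if (dim <= 0 || dim > MAX_DIMENSIONS) {
>>>>>>>       throw new IllegalArgumentException(
>>>>>>>           "vector dimension must be in (0, " + MAX_DIMENSIONS
>>>>>>>               + "], got " + dim);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>> An operator could then raise the limit at JVM startup with
>>>>>>> -Dlucene.hnsw.maxDimensions=2048, without any API change.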
>>>>>>>
>>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>>> implementation.
>>>>>>> --------------------------
>>>>>>> *Alessandro Benedetti*
>>>>>>> Director @ Sease Ltd.
>>>>>>> *Apache Lucene/Solr Committer*
>>>>>>> *Apache Solr PMC Member*
>>>>>>>
>>>>>>> e-mail: a.benedetti@sease.io
>>>>>>>
>>>>>>>
>>>>>>> *Sease* - Information Retrieval Applied
>>>>>>> Consulting | Training | Open Source
>>>>>>>
>>>>>>> Website: Sease.io <http://sease.io/>
>>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>>> <https://github.com/seaseltd>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> http://www.needhamsoftware.com (work)
>>>>> http://www.the111shift.com (play)
>>>>>
>>>>