Mailing List Archive

1 2 3 4  View All
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
re: how does this HNSW stuff scale - I think people are calling out
indexing memory usage here, so let's discuss some facts. During
initial indexing we hold in RAM all the vector data and the graph
constructed from the new documents, but this is accounted for and
limited by the size of IndexWriter's buffer; the document vectors and
their graph will be flushed to disk when this fills up, and at search
time, they are not read in wholesale to RAM. There is potentially
unbounded RAM usage during merging though, because the entire merged
graph will be built in RAM. I lost track of how we handle the vector
data now, but at least in theory it should be fairly straightforward
to write the merged vector data in chunks using only limited RAM. So
how much RAM does the graph use? It uses numdocs*fanout VInts.
Actually it doesn't really scale with the vector dimension at all -
rather it scales with the graph fanout (M) parameter and with the
total number of documents. So I think this focus on limiting the
vector dimension is not helping to address the concern about RAM usage
while merging.

The vector dimension does have a strong role in the search, and
indexing time, but the impact is linear in the dimension and won't
exhaust any limited resource.

On Thu, Apr 6, 2023 at 5:48?AM Michael McCandless
<lucene@mikemccandless.com> wrote:
>
> > We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>
> In fact we must accept all vetos by any committer as a veto, for a change to Lucene's source code, regardless of that committer's reasoning. This is the power of Apache's model.
>
> Of course we all can and will work together to convince one another (this is where the scientifically motivated part comes in) to change our votes, one way or another.
>
> > I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024, which is allowed today.
>
> +1, if the current implementation really does not scale / needs more and more RAM for merging, let's understand what's going on here, first, before increasing limits. I rescind my hasty +1 for now!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti <a.benedetti@sease.io> wrote:
>>
>> Ok, so what should we do then?
>> This space is moving fast, and in my opinion we should act fast to release and guarantee we attract as many users as possible.
>>
>> At the same time I am not saying we should proceed blind, if there's concrete evidence for setting a limit rather than another, or that a certain limit is detrimental to the project, I think that veto should be valid.
>>
>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>>
>> The problem I see is that more than voting we should first decide this limit and I don't know how we can operate.
>> I am imagining like a poll where each entry is a limit + motivation and PMCs maybe vote/add entries?
>>
>> Did anything similar happen in the past? How was the current limit added?
>>
>>
>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>>>
>>>
>>>>
>>>> Should create a VOTE thread, where we propose some values with a justification and we vote?
>>>
>>>
>>> Technically, a vote thread won't help much if there's no full consensus - a single veto will make the patch unacceptable for merging.
>>> https://www.apache.org/foundation/voting.html#Veto
>>>
>>> Dawid
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
thanks very much for these insights!

Does it make a difference re RAM when I do a batch import, for example
import 1000 documents and close the IndexWriter and do a forceMerge or
import 1Mio documents at once?

I would expect so, or do I misunderstand this?

Thanks

Michael



Am 06.04.23 um 16:11 schrieb Michael Sokolov:
> re: how does this HNSW stuff scale - I think people are calling out
> indexing memory usage here, so let's discuss some facts. During
> initial indexing we hold in RAM all the vector data and the graph
> constructed from the new documents, but this is accounted for and
> limited by the size of IndexWriter's buffer; the document vectors and
> their graph will be flushed to disk when this fills up, and at search
> time, they are not read in wholesale to RAM. There is potentially
> unbounded RAM usage during merging though, because the entire merged
> graph will be built in RAM. I lost track of how we handle the vector
> data now, but at least in theory it should be fairly straightforward
> to write the merged vector data in chunks using only limited RAM. So
> how much RAM does the graph use? It uses numdocs*fanout VInts.
> Actually it doesn't really scale with the vector dimension at all -
> rather it scales with the graph fanout (M) parameter and with the
> total number of documents. So I think this focus on limiting the
> vector dimension is not helping to address the concern about RAM usage
> while merging.
>
> The vector dimension does have a strong role in the search, and
> indexing time, but the impact is linear in the dimension and won't
> exhaust any limited resource.
>
> On Thu, Apr 6, 2023 at 5:48?AM Michael McCandless
> <lucene@mikemccandless.com> wrote:
>>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>> In fact we must accept all vetos by any committer as a veto, for a change to Lucene's source code, regardless of that committer's reasoning. This is the power of Apache's model.
>>
>> Of course we all can and will work together to convince one another (this is where the scientifically motivated part comes in) to change our votes, one way or another.
>>
>>> I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024, which is allowed today.
>> +1, if the current implementation really does not scale / needs more and more RAM for merging, let's understand what's going on here, first, before increasing limits. I rescind my hasty +1 for now!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti <a.benedetti@sease.io> wrote:
>>> Ok, so what should we do then?
>>> This space is moving fast, and in my opinion we should act fast to release and guarantee we attract as many users as possible.
>>>
>>> At the same time I am not saying we should proceed blind, if there's concrete evidence for setting a limit rather than another, or that a certain limit is detrimental to the project, I think that veto should be valid.
>>>
>>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>>>
>>> The problem I see is that more than voting we should first decide this limit and I don't know how we can operate.
>>> I am imagining like a poll where each entry is a limit + motivation and PMCs maybe vote/add entries?
>>>
>>> Did anything similar happen in the past? How was the current limit added?
>>>
>>>
>>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>>>>
>>>>> Should create a VOTE thread, where we propose some values with a justification and we vote?
>>>>
>>>> Technically, a vote thread won't help much if there's no full consensus - a single veto will make the patch unacceptable for merging.
>>>> https://www.apache.org/foundation/voting.html#Veto
>>>>
>>>> Dawid
>>>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
yes, it makes a difference. It will take less time and CPU to do it
all in one go, producing a single segment (assuming the data does not
exceed the IndexWriter RAM buffer size). If you index a lot of little
segments and then force merge them it will take longer, because it had
to build the graphs for the little segments, and then for the big one
when merging, and it will eventually use the same amount of RAM to
build the big graph, although I don't believe it will have to load the
vectors en masse into RAM while merging.

On Thu, Apr 6, 2023 at 10:20?AM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> thanks very much for these insights!
>
> Does it make a difference re RAM when I do a batch import, for example
> import 1000 documents and close the IndexWriter and do a forceMerge or
> import 1Mio documents at once?
>
> I would expect so, or do I misunderstand this?
>
> Thanks
>
> Michael
>
>
>
> Am 06.04.23 um 16:11 schrieb Michael Sokolov:
> > re: how does this HNSW stuff scale - I think people are calling out
> > indexing memory usage here, so let's discuss some facts. During
> > initial indexing we hold in RAM all the vector data and the graph
> > constructed from the new documents, but this is accounted for and
> > limited by the size of IndexWriter's buffer; the document vectors and
> > their graph will be flushed to disk when this fills up, and at search
> > time, they are not read in wholesale to RAM. There is potentially
> > unbounded RAM usage during merging though, because the entire merged
> > graph will be built in RAM. I lost track of how we handle the vector
> > data now, but at least in theory it should be fairly straightforward
> > to write the merged vector data in chunks using only limited RAM. So
> > how much RAM does the graph use? It uses numdocs*fanout VInts.
> > Actually it doesn't really scale with the vector dimension at all -
> > rather it scales with the graph fanout (M) parameter and with the
> > total number of documents. So I think this focus on limiting the
> > vector dimension is not helping to address the concern about RAM usage
> > while merging.
> >
> > The vector dimension does have a strong role in the search, and
> > indexing time, but the impact is linear in the dimension and won't
> > exhaust any limited resource.
> >
> > On Thu, Apr 6, 2023 at 5:48?AM Michael McCandless
> > <lucene@mikemccandless.com> wrote:
> >>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
> >> In fact we must accept all vetos by any committer as a veto, for a change to Lucene's source code, regardless of that committer's reasoning. This is the power of Apache's model.
> >>
> >> Of course we all can and will work together to convince one another (this is where the scientifically motivated part comes in) to change our votes, one way or another.
> >>
> >>> I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024, which is allowed today.
> >> +1, if the current implementation really does not scale / needs more and more RAM for merging, let's understand what's going on here, first, before increasing limits. I rescind my hasty +1 for now!
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti <a.benedetti@sease.io> wrote:
> >>> Ok, so what should we do then?
> >>> This space is moving fast, and in my opinion we should act fast to release and guarantee we attract as many users as possible.
> >>>
> >>> At the same time I am not saying we should proceed blind, if there's concrete evidence for setting a limit rather than another, or that a certain limit is detrimental to the project, I think that veto should be valid.
> >>>
> >>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
> >>>
> >>> The problem I see is that more than voting we should first decide this limit and I don't know how we can operate.
> >>> I am imagining like a poll where each entry is a limit + motivation and PMCs maybe vote/add entries?
> >>>
> >>> Did anything similar happen in the past? How was the current limit added?
> >>>
> >>>
> >>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
> >>>>
> >>>>> Should create a VOTE thread, where we propose some values with a justification and we vote?
> >>>>
> >>>> Technically, a vote thread won't help much if there's no full consensus - a single veto will make the patch unacceptable for merging.
> >>>> https://www.apache.org/foundation/voting.html#Veto
> >>>>
> >>>> Dawid
> >>>>
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
Thanks!

I will try to run some tests to be on the safe side :-)

Am 06.04.23 um 16:28 schrieb Michael Sokolov:
> yes, it makes a difference. It will take less time and CPU to do it
> all in one go, producing a single segment (assuming the data does not
> exceed the IndexWriter RAM buffer size). If you index a lot of little
> segments and then force merge them it will take longer, because it had
> to build the graphs for the little segments, and then for the big one
> when merging, and it will eventually use the same amount of RAM to
> build the big graph, although I don't believe it will have to load the
> vectors en masse into RAM while merging.
>
> On Thu, Apr 6, 2023 at 10:20?AM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>> thanks very much for these insights!
>>
>> Does it make a difference re RAM when I do a batch import, for example
>> import 1000 documents and close the IndexWriter and do a forceMerge or
>> import 1Mio documents at once?
>>
>> I would expect so, or do I misunderstand this?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 06.04.23 um 16:11 schrieb Michael Sokolov:
>>> re: how does this HNSW stuff scale - I think people are calling out
>>> indexing memory usage here, so let's discuss some facts. During
>>> initial indexing we hold in RAM all the vector data and the graph
>>> constructed from the new documents, but this is accounted for and
>>> limited by the size of IndexWriter's buffer; the document vectors and
>>> their graph will be flushed to disk when this fills up, and at search
>>> time, they are not read in wholesale to RAM. There is potentially
>>> unbounded RAM usage during merging though, because the entire merged
>>> graph will be built in RAM. I lost track of how we handle the vector
>>> data now, but at least in theory it should be fairly straightforward
>>> to write the merged vector data in chunks using only limited RAM. So
>>> how much RAM does the graph use? It uses numdocs*fanout VInts.
>>> Actually it doesn't really scale with the vector dimension at all -
>>> rather it scales with the graph fanout (M) parameter and with the
>>> total number of documents. So I think this focus on limiting the
>>> vector dimension is not helping to address the concern about RAM usage
>>> while merging.
>>>
>>> The vector dimension does have a strong role in the search, and
>>> indexing time, but the impact is linear in the dimension and won't
>>> exhaust any limited resource.
>>>
>>> On Thu, Apr 6, 2023 at 5:48?AM Michael McCandless
>>> <lucene@mikemccandless.com> wrote:
>>>>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>>>> In fact we must accept all vetos by any committer as a veto, for a change to Lucene's source code, regardless of that committer's reasoning. This is the power of Apache's model.
>>>>
>>>> Of course we all can and will work together to convince one another (this is where the scientifically motivated part comes in) to change our votes, one way or another.
>>>>
>>>>> I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024, which is allowed today.
>>>> +1, if the current implementation really does not scale / needs more and more RAM for merging, let's understand what's going on here, first, before increasing limits. I rescind my hasty +1 for now!
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Wed, Apr 5, 2023 at 11:22?AM Alessandro Benedetti <a.benedetti@sease.io> wrote:
>>>>> Ok, so what should we do then?
>>>>> This space is moving fast, and in my opinion we should act fast to release and guarantee we attract as many users as possible.
>>>>>
>>>>> At the same time I am not saying we should proceed blind, if there's concrete evidence for setting a limit rather than another, or that a certain limit is detrimental to the project, I think that veto should be valid.
>>>>>
>>>>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>>>>>
>>>>> The problem I see is that more than voting we should first decide this limit and I don't know how we can operate.
>>>>> I am imagining like a poll where each entry is a limit + motivation and PMCs maybe vote/add entries?
>>>>>
>>>>> Did anything similar happen in the past? How was the current limit added?
>>>>>
>>>>>
>>>>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>>>>>>> Should create a VOTE thread, where we propose some values with a justification and we vote?
>>>>>> Technically, a vote thread won't help much if there's no full consensus - a single veto will make the patch unacceptable for merging.
>>>>>> https://www.apache.org/foundation/voting.html#Veto
>>>>>>
>>>>>> Dawid
>>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
As I said earlier, a max limit limits usability.
It's not forcing users with small vectors to pay the performance penalty of
big vectors, it's literally preventing some users to use
Lucene/Solr/Elasticsearch at all.
As far as I know, the max limit is used to raise an exception, it's not
used to initialise or optimise data structures (please correct me if I'm
wrong).

Improving the algorithm performance is a separate discussion.
I don't see a correlation with the fact that indexing billions of whatever
dimensioned vector is slow with a usability parameter.

What about potential users that need few high dimensional vectors?

As I said before, I am a big +1 for NOT just raise it blindly, but I
believe we need to remove the limit or size it in a way it's not a problem
for both users and internal data structure optimizations, if any.


On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:

> I'd ask anyone voting +1 to raise this limit to at least try to index
> a few million vectors with 756 or 1024, which is allowed today.
>
> IMO based on how painful it is, it seems the limit is already too
> high, I realize that will sound controversial but please at least try
> it out!
>
> voting +1 without at least doing this is really the
> "weak/unscientifically minded" approach.
>
> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
> >
> > Thanks for your feedback!
> >
> > I agree, that it should not crash.
> >
> > So far we did not experience crashes ourselves, but we did not index
> > millions of vectors.
> >
> > I will try to reproduce the crash, maybe this will help us to move
> forward.
> >
> > Thanks
> >
> > Michael
> >
> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> > >> Can you describe your crash in more detail?
> > > I can't. That experiment was a while ago and a quick test to see if I
> > > could index rather large-ish USPTO (patent office) data as vectors.
> > > Couldn't do it then.
> > >
> > >> How much RAM?
> > > My indexing jobs run with rather smallish heaps to give space for I/O
> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
> > > I recall segment merging grew slower and slower and then simply
> > > crashed. Lucene should work with low heap requirements, even if it
> > > slows down. Throwing ram at the indexing/ segment merging problem
> > > is... I don't know - not elegant?
> > >
> > > Anyway. My main point was to remind folks about how Apache works -
> > > code is merged in when there are no vetoes. If Rob (or anybody else)
> > > remains unconvinced, he or she can block the change. (I didn't invent
> > > those rules).
> > >
> > > D.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: dev-help@lucene.apache.org
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
> I don't know, Alessandro. I just wanted to point out the fact that by
Apache rules a committer's veto to a code change counts as a no-go.

Yeah Dawid, I was not provocative, I was genuinely asking what should a
pragmatic approach be to choose a limit/remove it, because I don't know how
to proceed and I would love to progress rather than just leave this
discussion in a mail thread with no tangible results.

On Wed, 5 Apr 2023, 16:49 Dawid Weiss, <dawid.weiss@gmail.com> wrote:

> > Ok, so what should we do then?
>
> I don't know, Alessandro. I just wanted to point out the fact that by
> Apache rules a committer's veto to a code change counts as a no-go. It
> does not specify any way to "override" such a veto, perhaps counting
> on disagreeing parties to resolve conflicting points of views in a
> civil manner so that veto can be retracted (or a different solution
> suggested).
>
> I think Robert's point is not about a particular limit value but about
> the algorithm itself - the current implementation does not scale. I
> don't want to be an advocate for either side - I'm all for freedom of
> choice but at the same time last time I tried indexing a few million
> vectors, I couldn't get far before segment merging blew up with
> OOMs...
>
> > Did anything similar happen in the past? How was the current limit added?
>
> I honestly don't know, you'd have to git blame or look at the mailing
> list archives of the original contribution.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
Well, I'm asking ppl actually try to test using such high dimensions.
Based on my own experience, I consider it unusable. It seems other
folks may have run into trouble too. If the project committers can't
even really use vectors with such high dimension counts, then its not
in an OK state for users, and we shouldn't bump the limit.

I'm happy to discuss/compromise etc, but simply bumping the limit
without addressing the underlying usability/scalability is a real
no-go, it is not really solving anything, nor is it giving users any
freedom or allowing them to do something they couldnt do before.
Because if it still doesnt work it still doesnt work.

We all need to be on the same page, grounded in reality, not fantasy,
where if we set a limit of 1024 or 2048, that you can actually index
vectors with that many dimensions and it actually works and scales.

On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
<a.benedetti@sease.io> wrote:
>
> As I said earlier, a max limit limits usability.
> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>
> Improving the algorithm performance is a separate discussion.
> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
>
> What about potential users that need few high dimensional vectors?
>
> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
>
>
> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>>
>> I'd ask anyone voting +1 to raise this limit to at least try to index
>> a few million vectors with 756 or 1024, which is allowed today.
>>
>> IMO based on how painful it is, it seems the limit is already too
>> high, I realize that will sound controversial but please at least try
>> it out!
>>
>> voting +1 without at least doing this is really the
>> "weak/unscientifically minded" approach.
>>
>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>> <michael.wechner@wyona.com> wrote:
>> >
>> > Thanks for your feedback!
>> >
>> > I agree, that it should not crash.
>> >
>> > So far we did not experience crashes ourselves, but we did not index
>> > millions of vectors.
>> >
>> > I will try to reproduce the crash, maybe this will help us to move forward.
>> >
>> > Thanks
>> >
>> > Michael
>> >
>> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> > >> Can you describe your crash in more detail?
>> > > I can't. That experiment was a while ago and a quick test to see if I
>> > > could index rather large-ish USPTO (patent office) data as vectors.
>> > > Couldn't do it then.
>> > >
>> > >> How much RAM?
>> > > My indexing jobs run with rather smallish heaps to give space for I/O
>> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
>> > > I recall segment merging grew slower and slower and then simply
>> > > crashed. Lucene should work with low heap requirements, even if it
>> > > slows down. Throwing ram at the indexing/ segment merging problem
>> > > is... I don't know - not elegant?
>> > >
>> > > Anyway. My main point was to remind folks about how Apache works -
>> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>> > > remains unconvinced, he or she can block the change. (I didn't invent
>> > > those rules).
>> > >
>> > > D.
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > > For additional commands, e-mail: dev-help@lucene.apache.org
>> > >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
>10 MB hard drive, wow I'll never need another floppy disk ever...
Neural nets... nice idea, but there will never be enough CPU power to run
them...

etc.

Is it possible to make it a configurable limit?

I think Gus is on spot, agree 100%.

Vector dimension is already configurable, it's the max dinension which is
hard coded.

Just bear in mind that this MAX limit is not used in initializing data
structures, but only to raise an exception.
As far as I know if we change the limit, if you have small vectors you
won't be impacted at all.

On Thu, 6 Apr 2023, 03:31 Gus Heck, <gus.heck@gmail.com> wrote:

> 10 MB hard drive, wow I'll never need another floppy disk ever...
> Neural nets... nice idea, but there will never be enough CPU power to run
> them...
>
> etc.
>
> Is it possible to make it a configurable limit?
>
> On Wed, Apr 5, 2023 at 4:51?PM Jack Conradson <osjdconrad@gmail.com>
> wrote:
>
>> I don't want to get too far off topic, but I think one of the problems
>> here is that HNSW doesn't really fit well as a Lucene data structure. The
>> way it behaves it would be better supported as a live, in-memory data
>> structure instead of segmented and written to disk for tiny graphs that
>> then need to be merged. I wonder if it may be a better approach to explore
>> other possible algorithms that are designed to be on-disk instead of
>> in-memory even if they require k-means clustering as a trade-off. Maybe
>> with an on-disk algorithm we could have good enough performance for a
>> higher-dimensional limit.
>>
>> On Wed, Apr 5, 2023 at 10:54?AM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> I'd ask anyone voting +1 to raise this limit to at least try to index
>>> a few million vectors with 756 or 1024, which is allowed today.
>>>
>>> IMO based on how painful it is, it seems the limit is already too
>>> high, I realize that will sound controversial but please at least try
>>> it out!
>>>
>>> voting +1 without at least doing this is really the
>>> "weak/unscientifically minded" approach.
>>>
>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>>> <michael.wechner@wyona.com> wrote:
>>> >
>>> > Thanks for your feedback!
>>> >
>>> > I agree, that it should not crash.
>>> >
>>> > So far we did not experience crashes ourselves, but we did not index
>>> > millions of vectors.
>>> >
>>> > I will try to reproduce the crash, maybe this will help us to move
>>> forward.
>>> >
>>> > Thanks
>>> >
>>> > Michael
>>> >
>>> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>> > >> Can you describe your crash in more detail?
>>> > > I can't. That experiment was a while ago and a quick test to see if I
>>> > > could index rather large-ish USPTO (patent office) data as vectors.
>>> > > Couldn't do it then.
>>> > >
>>> > >> How much RAM?
>>> > > My indexing jobs run with rather smallish heaps to give space for I/O
>>> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
>>> > > I recall segment merging grew slower and slower and then simply
>>> > > crashed. Lucene should work with low heap requirements, even if it
>>> > > slows down. Throwing ram at the indexing/ segment merging problem
>>> > > is... I don't know - not elegant?
>>> > >
>>> > > Anyway. My main point was to remind folks about how Apache works -
>>> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>>> > > remains unconvinced, he or she can block the change. (I didn't invent
>>> > > those rules).
>>> > >
>>> > > D.
>>> > >
>>> > > ---------------------------------------------------------------------
>>> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > > For additional commands, e-mail: dev-help@lucene.apache.org
>>> > >
>>> >
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
To be clear Robert, I agree with you in not bumping it just to 2048 or
whatever not motivated enough constant.

But I disagree on the performance perspective:
I mean I am absolutely positive in working to improve the current
performances, but I think this is disconnected from that limit.

Not all users need billions of vectors, maybe tomorrow a new chip is
released that speed up the processing 100x or whatever...

The limit as far as I know is not used to initialise or optimise any data
structure, it's only used to raise an exception.

I don't see a big problem in allowing 10k vectors for example but then
majority of people won't be able to use such vectors because slow on the
average computer.
If we just get 1 new user, it's better than 0.
Or well, if it's a reputation thing, than It's a completely different
discussion I guess.


On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcmuir@gmail.com> wrote:

> Well, I'm asking ppl actually try to test using such high dimensions.
> Based on my own experience, I consider it unusable. It seems other
> folks may have run into trouble too. If the project committers can't
> even really use vectors with such high dimension counts, then its not
> in an OK state for users, and we shouldn't bump the limit.
>
> I'm happy to discuss/compromise etc, but simply bumping the limit
> without addressing the underlying usability/scalability is a real
> no-go, it is not really solving anything, nor is it giving users any
> freedom or allowing them to do something they couldnt do before.
> Because if it still doesnt work it still doesnt work.
>
> We all need to be on the same page, grounded in reality, not fantasy,
> where if we set a limit of 1024 or 2048, that you can actually index
> vectors with that many dimensions and it actually works and scales.
>
> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
> >
> > As I said earlier, a max limit limits usability.
> > It's not forcing users with small vectors to pay the performance penalty
> of big vectors, it's literally preventing some users to use
> Lucene/Solr/Elasticsearch at all.
> > As far as I know, the max limit is used to raise an exception, it's not
> used to initialise or optimise data structures (please correct me if I'm
> wrong).
> >
> > Improving the algorithm performance is a separate discussion.
> > I don't see a correlation with the fact that indexing billions of
> whatever dimensioned vector is slow with a usability parameter.
> >
> > What about potential users that need few high dimensional vectors?
> >
> > As I said before, I am a big +1 for NOT just raise it blindly, but I
> believe we need to remove the limit or size it in a way it's not a problem
> for both users and internal data structure optimizations, if any.
> >
> >
> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> >>
> >> I'd ask anyone voting +1 to raise this limit to at least try to index
> >> a few million vectors with 756 or 1024, which is allowed today.
> >>
> >> IMO based on how painful it is, it seems the limit is already too
> >> high, I realize that will sound controversial but please at least try
> >> it out!
> >>
> >> voting +1 without at least doing this is really the
> >> "weak/unscientifically minded" approach.
> >>
> >> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> >> <michael.wechner@wyona.com> wrote:
> >> >
> >> > Thanks for your feedback!
> >> >
> >> > I agree, that it should not crash.
> >> >
> >> > So far we did not experience crashes ourselves, but we did not index
> >> > millions of vectors.
> >> >
> >> > I will try to reproduce the crash, maybe this will help us to move
> forward.
> >> >
> >> > Thanks
> >> >
> >> > Michael
> >> >
> >> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >> > >> Can you describe your crash in more detail?
> >> > > I can't. That experiment was a while ago and a quick test to see if
> I
> >> > > could index rather large-ish USPTO (patent office) data as vectors.
> >> > > Couldn't do it then.
> >> > >
> >> > >> How much RAM?
> >> > > My indexing jobs run with rather smallish heaps to give space for
> I/O
> >> > > buffers. Think 4-8GB at most. So yes, it could have been the
> problem.
> >> > > I recall segment merging grew slower and slower and then simply
> >> > > crashed. Lucene should work with low heap requirements, even if it
> >> > > slows down. Throwing ram at the indexing/ segment merging problem
> >> > > is... I don't know - not elegant?
> >> > >
> >> > > Anyway. My main point was to remind folks about how Apache works -
> >> > > code is merged in when there are no vetoes. If Rob (or anybody else)
> >> > > remains unconvinced, he or she can block the change. (I didn't
> invent
> >> > > those rules).
> >> > >
> >> > > D.
> >> > >
> >> > >
> ---------------------------------------------------------------------
> >> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> > > For additional commands, e-mail: dev-help@lucene.apache.org
> >> > >
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: dev-help@lucene.apache.org
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
If we find issues with larger limits, maybe have a configurable limit like we do for maxBooleanClauses. Maybe somebody wants to run with a 100G heap and do one query per second.

Where I work (LexisNexis), we have high-value queries, but just not that many of them per second.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Apr 6, 2023, at 8:57 AM, Alessandro Benedetti <a.benedetti@sease.io> wrote:
>
> To be clear Robert, I agree with you in not bumping it just to 2048 or whatever not motivated enough constant.
>
> But I disagree on the performance perspective:
> I mean I am absolutely positive in working to improve the current performances, but I think this is disconnected from that limit.
>
> Not all users need billions of vectors, maybe tomorrow a new chip is released that speed up the processing 100x or whatever...
>
> The limit as far as I know is not used to initialise or optimise any data structure, it's only used to raise an exception.
>
> I don't see a big problem in allowing 10k vectors for example but then majority of people won't be able to use such vectors because slow on the average computer.
> If we just get 1 new user, it's better than 0.
> Or well, if it's a reputation thing, than It's a completely different discussion I guess.
>
>
> On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcmuir@gmail.com <mailto:rcmuir@gmail.com>> wrote:
>> Well, I'm asking ppl actually try to test using such high dimensions.
>> Based on my own experience, I consider it unusable. It seems other
>> folks may have run into trouble too. If the project committers can't
>> even really use vectors with such high dimension counts, then its not
>> in an OK state for users, and we shouldn't bump the limit.
>>
>> I'm happy to discuss/compromise etc, but simply bumping the limit
>> without addressing the underlying usability/scalability is a real
>> no-go, it is not really solving anything, nor is it giving users any
>> freedom or allowing them to do something they couldnt do before.
>> Because if it still doesnt work it still doesnt work.
>>
>> We all need to be on the same page, grounded in reality, not fantasy,
>> where if we set a limit of 1024 or 2048, that you can actually index
>> vectors with that many dimensions and it actually works and scales.
>>
>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>> <a.benedetti@sease.io <mailto:a.benedetti@sease.io>> wrote:
>> >
>> > As I said earlier, a max limit limits usability.
>> > It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
>> > As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>> >
>> > Improving the algorithm performance is a separate discussion.
>> > I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
>> >
>> > What about potential users that need few high dimensional vectors?
>> >
>> > As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
>> >
>> >
>> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com <mailto:rcmuir@gmail.com>> wrote:
>> >>
>> >> I'd ask anyone voting +1 to raise this limit to at least try to index
>> >> a few million vectors with 756 or 1024, which is allowed today.
>> >>
>> >> IMO based on how painful it is, it seems the limit is already too
>> >> high, I realize that will sound controversial but please at least try
>> >> it out!
>> >>
>> >> voting +1 without at least doing this is really the
>> >> "weak/unscientifically minded" approach.
>> >>
>> >> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>> >> <michael.wechner@wyona.com <mailto:michael.wechner@wyona.com>> wrote:
>> >> >
>> >> > Thanks for your feedback!
>> >> >
>> >> > I agree, that it should not crash.
>> >> >
>> >> > So far we did not experience crashes ourselves, but we did not index
>> >> > millions of vectors.
>> >> >
>> >> > I will try to reproduce the crash, maybe this will help us to move forward.
>> >> >
>> >> > Thanks
>> >> >
>> >> > Michael
>> >> >
>> >> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> >> > >> Can you describe your crash in more detail?
>> >> > > I can't. That experiment was a while ago and a quick test to see if I
>> >> > > could index rather large-ish USPTO (patent office) data as vectors.
>> >> > > Couldn't do it then.
>> >> > >
>> >> > >> How much RAM?
>> >> > > My indexing jobs run with rather smallish heaps to give space for I/O
>> >> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
>> >> > > I recall segment merging grew slower and slower and then simply
>> >> > > crashed. Lucene should work with low heap requirements, even if it
>> >> > > slows down. Throwing ram at the indexing/ segment merging problem
>> >> > > is... I don't know - not elegant?
>> >> > >
>> >> > > Anyway. My main point was to remind folks about how Apache works -
>> >> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>> >> > > remains unconvinced, he or she can block the change. (I didn't invent
>> >> > > those rules).
>> >> > >
>> >> > > D.
>> >> > >
>> >> > > ---------------------------------------------------------------------
>> >> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org <mailto:dev-unsubscribe@lucene.apache.org>
>> >> > > For additional commands, e-mail: dev-help@lucene.apache.org <mailto:dev-help@lucene.apache.org>
>> >> > >
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org <mailto:dev-unsubscribe@lucene.apache.org>
>> >> > For additional commands, e-mail: dev-help@lucene.apache.org <mailto:dev-help@lucene.apache.org>
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org <mailto:dev-unsubscribe@lucene.apache.org>
>> >> For additional commands, e-mail: dev-help@lucene.apache.org <mailto:dev-help@lucene.apache.org>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org <mailto:dev-unsubscribe@lucene.apache.org>
>> For additional commands, e-mail: dev-help@lucene.apache.org <mailto:dev-help@lucene.apache.org>
>>
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
I am not sure I get the point to make the limit configurable:

1) if it is configurable, but default max to 1024, it means that we don't
enforce any limit aside the max integer behind the scenes.
So if you want to set a vector dimension for a field to 5000 you need to
first set a MAX compatible and then set the dimension to 5000 for a field.

2) if we remove the limit (just an example). The user can directly set the
dimension to 5000 for a field.

It seems to me that setting the max limit as a configurable constant brings
all the same (negative?) considerations of removing the limit at all +
additional operations needed by the users to achieve the same results.

I beg your pardon if I an missing something.

On Thu, 6 Apr 2023, 17:02 Walter Underwood, <wunder@wunderwood.org> wrote:

> If we find issues with larger limits, maybe have a configurable limit like
> we do for maxBooleanClauses. Maybe somebody wants to run with a 100G heap
> and do one query per second.
>
> Where I work (LexisNexis), we have high-value queries, but just not that
> many of them per second.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Apr 6, 2023, at 8:57 AM, Alessandro Benedetti <a.benedetti@sease.io>
> wrote:
>
> To be clear Robert, I agree with you in not bumping it just to 2048 or
> whatever not motivated enough constant.
>
> But I disagree on the performance perspective:
> I mean I am absolutely positive in working to improve the current
> performances, but I think this is disconnected from that limit.
>
> Not all users need billions of vectors, maybe tomorrow a new chip is
> released that speed up the processing 100x or whatever...
>
> The limit as far as I know is not used to initialise or optimise any data
> structure, it's only used to raise an exception.
>
> I don't see a big problem in allowing 10k vectors for example but then
> majority of people won't be able to use such vectors because slow on the
> average computer.
> If we just get 1 new user, it's better than 0.
> Or well, if it's a reputation thing, than It's a completely different
> discussion I guess.
>
>
> On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcmuir@gmail.com> wrote:
>
>> Well, I'm asking ppl actually try to test using such high dimensions.
>> Based on my own experience, I consider it unusable. It seems other
>> folks may have run into trouble too. If the project committers can't
>> even really use vectors with such high dimension counts, then its not
>> in an OK state for users, and we shouldn't bump the limit.
>>
>> I'm happy to discuss/compromise etc, but simply bumping the limit
>> without addressing the underlying usability/scalability is a real
>> no-go, it is not really solving anything, nor is it giving users any
>> freedom or allowing them to do something they couldnt do before.
>> Because if it still doesnt work it still doesnt work.
>>
>> We all need to be on the same page, grounded in reality, not fantasy,
>> where if we set a limit of 1024 or 2048, that you can actually index
>> vectors with that many dimensions and it actually works and scales.
>>
>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>> >
>> > As I said earlier, a max limit limits usability.
>> > It's not forcing users with small vectors to pay the performance
>> penalty of big vectors, it's literally preventing some users to use
>> Lucene/Solr/Elasticsearch at all.
>> > As far as I know, the max limit is used to raise an exception, it's not
>> used to initialise or optimise data structures (please correct me if I'm
>> wrong).
>> >
>> > Improving the algorithm performance is a separate discussion.
>> > I don't see a correlation with the fact that indexing billions of
>> whatever dimensioned vector is slow with a usability parameter.
>> >
>> > What about potential users that need few high dimensional vectors?
>> >
>> > As I said before, I am a big +1 for NOT just raise it blindly, but I
>> believe we need to remove the limit or size it in a way it's not a problem
>> for both users and internal data structure optimizations, if any.
>> >
>> >
>> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>> >>
>> >> I'd ask anyone voting +1 to raise this limit to at least try to index
>> >> a few million vectors with 756 or 1024, which is allowed today.
>> >>
>> >> IMO based on how painful it is, it seems the limit is already too
>> >> high, I realize that will sound controversial but please at least try
>> >> it out!
>> >>
>> >> voting +1 without at least doing this is really the
>> >> "weak/unscientifically minded" approach.
>> >>
>> >> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>> >> <michael.wechner@wyona.com> wrote:
>> >> >
>> >> > Thanks for your feedback!
>> >> >
>> >> > I agree, that it should not crash.
>> >> >
>> >> > So far we did not experience crashes ourselves, but we did not index
>> >> > millions of vectors.
>> >> >
>> >> > I will try to reproduce the crash, maybe this will help us to move
>> forward.
>> >> >
>> >> > Thanks
>> >> >
>> >> > Michael
>> >> >
>> >> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> >> > >> Can you describe your crash in more detail?
>> >> > > I can't. That experiment was a while ago and a quick test to see
>> if I
>> >> > > could index rather large-ish USPTO (patent office) data as vectors.
>> >> > > Couldn't do it then.
>> >> > >
>> >> > >> How much RAM?
>> >> > > My indexing jobs run with rather smallish heaps to give space for
>> I/O
>> >> > > buffers. Think 4-8GB at most. So yes, it could have been the
>> problem.
>> >> > > I recall segment merging grew slower and slower and then simply
>> >> > > crashed. Lucene should work with low heap requirements, even if it
>> >> > > slows down. Throwing ram at the indexing/ segment merging problem
>> >> > > is... I don't know - not elegant?
>> >> > >
>> >> > > Anyway. My main point was to remind folks about how Apache works -
>> >> > > code is merged in when there are no vetoes. If Rob (or anybody
>> else)
>> >> > > remains unconvinced, he or she can block the change. (I didn't
>> invent
>> >> > > those rules).
>> >> > >
>> >> > > D.
>> >> > >
>> >> > >
>> ---------------------------------------------------------------------
>> >> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> > > For additional commands, e-mail: dev-help@lucene.apache.org
>> >> > >
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
> We all need to be on the same page, grounded in reality, not fantasy,
where if we set a limit of 1024 or 2048, that you can actually index
vectors with that many dimensions and it actually works and scales

This is something that I agree with. When we test it, I think we should go
in with the right expectations. For example, there's been an ask for use
cases. The most immediate use case that I see is LLM memory in the higher
range of vectors (up to 2048), rather than strictly search. I'm ok with
this use case because of its unprecedented growth. Consider the fact that a
lot of users and most of the capital expenditures on Lucene infrastructure
goes to use cases that are not strictly application search. Analytics, as
we all know, is a very common off-label use of Lucene. Another common use
case is index intersection across a variety of text fields.

It would be great if we could document a way to up the limit and test using
Open AI's embeddings. If there are performance problems they should be
understood as performance problems for a particular use case, and we as a
community should invest in making it better based on user feedback. This
leap of faith could have a very positive impact on all involved.

Best,

Marcus




On Thu, Apr 6, 2023 at 9:30?AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> I am not sure I get the point to make the limit configurable:
>
> 1) if it is configurable, but default max to 1024, it means that we don't
> enforce any limit aside the max integer behind the scenes.
> So if you want to set a vector dimension for a field to 5000 you need to
> first set a MAX compatible and then set the dimension to 5000 for a field.
>
> 2) if we remove the limit (just an example). The user can directly set the
> dimension to 5000 for a field.
>
> It seems to me that setting the max limit as a configurable constant
> brings all the same (negative?) considerations of removing the limit at all
> + additional operations needed by the users to achieve the same results.
>
> I beg your pardon if I an missing something.
>
> On Thu, 6 Apr 2023, 17:02 Walter Underwood, <wunder@wunderwood.org> wrote:
>
>> If we find issues with larger limits, maybe have a configurable limit
>> like we do for maxBooleanClauses. Maybe somebody wants to run with a 100G
>> heap and do one query per second.
>>
>> Where I work (LexisNexis), we have high-value queries, but just not that
>> many of them per second.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>> On Apr 6, 2023, at 8:57 AM, Alessandro Benedetti <a.benedetti@sease.io>
>> wrote:
>>
>> To be clear Robert, I agree with you in not bumping it just to 2048 or
>> whatever not motivated enough constant.
>>
>> But I disagree on the performance perspective:
>> I mean I am absolutely positive in working to improve the current
>> performances, but I think this is disconnected from that limit.
>>
>> Not all users need billions of vectors, maybe tomorrow a new chip is
>> released that speed up the processing 100x or whatever...
>>
>> The limit as far as I know is not used to initialise or optimise any data
>> structure, it's only used to raise an exception.
>>
>> I don't see a big problem in allowing 10k vectors for example but then
>> majority of people won't be able to use such vectors because slow on the
>> average computer.
>> If we just get 1 new user, it's better than 0.
>> Or well, if it's a reputation thing, than It's a completely different
>> discussion I guess.
>>
>>
>> On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcmuir@gmail.com> wrote:
>>
>>> Well, I'm asking ppl actually try to test using such high dimensions.
>>> Based on my own experience, I consider it unusable. It seems other
>>> folks may have run into trouble too. If the project committers can't
>>> even really use vectors with such high dimension counts, then its not
>>> in an OK state for users, and we shouldn't bump the limit.
>>>
>>> I'm happy to discuss/compromise etc, but simply bumping the limit
>>> without addressing the underlying usability/scalability is a real
>>> no-go, it is not really solving anything, nor is it giving users any
>>> freedom or allowing them to do something they couldnt do before.
>>> Because if it still doesnt work it still doesnt work.
>>>
>>> We all need to be on the same page, grounded in reality, not fantasy,
>>> where if we set a limit of 1024 or 2048, that you can actually index
>>> vectors with that many dimensions and it actually works and scales.
>>>
>>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>>> <a.benedetti@sease.io> wrote:
>>> >
>>> > As I said earlier, a max limit limits usability.
>>> > It's not forcing users with small vectors to pay the performance
>>> penalty of big vectors, it's literally preventing some users to use
>>> Lucene/Solr/Elasticsearch at all.
>>> > As far as I know, the max limit is used to raise an exception, it's
>>> not used to initialise or optimise data structures (please correct me if
>>> I'm wrong).
>>> >
>>> > Improving the algorithm performance is a separate discussion.
>>> > I don't see a correlation with the fact that indexing billions of
>>> whatever dimensioned vector is slow with a usability parameter.
>>> >
>>> > What about potential users that need few high dimensional vectors?
>>> >
>>> > As I said before, I am a big +1 for NOT just raise it blindly, but I
>>> believe we need to remove the limit or size it in a way it's not a problem
>>> for both users and internal data structure optimizations, if any.
>>> >
>>> >
>>> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>>> >>
>>> >> I'd ask anyone voting +1 to raise this limit to at least try to index
>>> >> a few million vectors with 756 or 1024, which is allowed today.
>>> >>
>>> >> IMO based on how painful it is, it seems the limit is already too
>>> >> high, I realize that will sound controversial but please at least try
>>> >> it out!
>>> >>
>>> >> voting +1 without at least doing this is really the
>>> >> "weak/unscientifically minded" approach.
>>> >>
>>> >> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>>> >> <michael.wechner@wyona.com> wrote:
>>> >> >
>>> >> > Thanks for your feedback!
>>> >> >
>>> >> > I agree, that it should not crash.
>>> >> >
>>> >> > So far we did not experience crashes ourselves, but we did not index
>>> >> > millions of vectors.
>>> >> >
>>> >> > I will try to reproduce the crash, maybe this will help us to move
>>> forward.
>>> >> >
>>> >> > Thanks
>>> >> >
>>> >> > Michael
>>> >> >
>>> >> > Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>> >> > >> Can you describe your crash in more detail?
>>> >> > > I can't. That experiment was a while ago and a quick test to see
>>> if I
>>> >> > > could index rather large-ish USPTO (patent office) data as
>>> vectors.
>>> >> > > Couldn't do it then.
>>> >> > >
>>> >> > >> How much RAM?
>>> >> > > My indexing jobs run with rather smallish heaps to give space for
>>> I/O
>>> >> > > buffers. Think 4-8GB at most. So yes, it could have been the
>>> problem.
>>> >> > > I recall segment merging grew slower and slower and then simply
>>> >> > > crashed. Lucene should work with low heap requirements, even if it
>>> >> > > slows down. Throwing ram at the indexing/ segment merging problem
>>> >> > > is... I don't know - not elegant?
>>> >> > >
>>> >> > > Anyway. My main point was to remind folks about how Apache works -
>>> >> > > code is merged in when there are no vetoes. If Rob (or anybody
>>> else)
>>> >> > > remains unconvinced, he or she can block the change. (I didn't
>>> invent
>>> >> > > those rules).
>>> >> > >
>>> >> > > D.
>>> >> > >
>>> >> > >
>>> ---------------------------------------------------------------------
>>> >> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> > > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >> > >
>>> >> >
>>> >> >
>>> >> >
>>> ---------------------------------------------------------------------
>>> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >> >
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>>

--
Marcus Eagan
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
Am 06.04.23 um 17:47 schrieb Robert Muir:
> Well, I'm asking ppl actually try to test using such high dimensions.
> Based on my own experience, I consider it unusable. It seems other
> folks may have run into trouble too. If the project committers can't
> even really use vectors with such high dimension counts, then its not
> in an OK state for users, and we shouldn't bump the limit.
>
> I'm happy to discuss/compromise etc, but simply bumping the limit
> without addressing the underlying usability/scalability is a real
> no-go,

I agree that this needs to be adressed



> it is not really solving anything, nor is it giving users any
> freedom or allowing them to do something they couldnt do before.
> Because if it still doesnt work it still doesnt work.

I disagree, because it *does work* with "smaller" document sets.

Currently we have to compile Lucene ourselves to not get the exception
when using a model with vector dimension greater than 1024,
which is of course possible, but not really convenient.

As I wrote before, to resolve this discussion, I think we should test
and address possible issues.

I will try to stop discussing now :-) and instead try to understand
better the actual issues. Would be great if others could join on this!

Thanks

Michael



>
> We all need to be on the same page, grounded in reality, not fantasy,
> where if we set a limit of 1024 or 2048, that you can actually index
> vectors with that many dimensions and it actually works and scales.
>
> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
>> As I said earlier, a max limit limits usability.
>> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
>> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>>
>> Improving the algorithm performance is a separate discussion.
>> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
>>
>> What about potential users that need few high dimensional vectors?
>>
>> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
>>
>>
>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>>> I'd ask anyone voting +1 to raise this limit to at least try to index
>>> a few million vectors with 756 or 1024, which is allowed today.
>>>
>>> IMO based on how painful it is, it seems the limit is already too
>>> high, I realize that will sound controversial but please at least try
>>> it out!
>>>
>>> voting +1 without at least doing this is really the
>>> "weak/unscientifically minded" approach.
>>>
>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>>> <michael.wechner@wyona.com> wrote:
>>>> Thanks for your feedback!
>>>>
>>>> I agree, that it should not crash.
>>>>
>>>> So far we did not experience crashes ourselves, but we did not index
>>>> millions of vectors.
>>>>
>>>> I will try to reproduce the crash, maybe this will help us to move forward.
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>>>>> Can you describe your crash in more detail?
>>>>> I can't. That experiment was a while ago and a quick test to see if I
>>>>> could index rather large-ish USPTO (patent office) data as vectors.
>>>>> Couldn't do it then.
>>>>>
>>>>>> How much RAM?
>>>>> My indexing jobs run with rather smallish heaps to give space for I/O
>>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
>>>>> I recall segment merging grew slower and slower and then simply
>>>>> crashed. Lucene should work with low heap requirements, even if it
>>>>> slows down. Throwing ram at the indexing/ segment merging problem
>>>>> is... I don't know - not elegant?
>>>>>
>>>>> Anyway. My main point was to remind folks about how Apache works -
>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
>>>>> remains unconvinced, he or she can block the change. (I didn't invent
>>>>> those rules).
>>>>>
>>>>> D.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
minutes with a single thread. I have some 256K vectors, but only about
2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
vectors I can use for testing? If all else fails I can test with
noise, but that tends to lead to meaningless results

On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
>
>
> Am 06.04.23 um 17:47 schrieb Robert Muir:
> > Well, I'm asking ppl actually try to test using such high dimensions.
> > Based on my own experience, I consider it unusable. It seems other
> > folks may have run into trouble too. If the project committers can't
> > even really use vectors with such high dimension counts, then its not
> > in an OK state for users, and we shouldn't bump the limit.
> >
> > I'm happy to discuss/compromise etc, but simply bumping the limit
> > without addressing the underlying usability/scalability is a real
> > no-go,
>
> I agree that this needs to be adressed
>
>
>
> > it is not really solving anything, nor is it giving users any
> > freedom or allowing them to do something they couldnt do before.
> > Because if it still doesnt work it still doesnt work.
>
> I disagree, because it *does work* with "smaller" document sets.
>
> Currently we have to compile Lucene ourselves to not get the exception
> when using a model with vector dimension greater than 1024,
> which is of course possible, but not really convenient.
>
> As I wrote before, to resolve this discussion, I think we should test
> and address possible issues.
>
> I will try to stop discussing now :-) and instead try to understand
> better the actual issues. Would be great if others could join on this!
>
> Thanks
>
> Michael
>
>
>
> >
> > We all need to be on the same page, grounded in reality, not fantasy,
> > where if we set a limit of 1024 or 2048, that you can actually index
> > vectors with that many dimensions and it actually works and scales.
> >
> > On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> > <a.benedetti@sease.io> wrote:
> >> As I said earlier, a max limit limits usability.
> >> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
> >> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
> >>
> >> Improving the algorithm performance is a separate discussion.
> >> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
> >>
> >> What about potential users that need few high dimensional vectors?
> >>
> >> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
> >>
> >>
> >> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> >>> I'd ask anyone voting +1 to raise this limit to at least try to index
> >>> a few million vectors with 756 or 1024, which is allowed today.
> >>>
> >>> IMO based on how painful it is, it seems the limit is already too
> >>> high, I realize that will sound controversial but please at least try
> >>> it out!
> >>>
> >>> voting +1 without at least doing this is really the
> >>> "weak/unscientifically minded" approach.
> >>>
> >>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> >>> <michael.wechner@wyona.com> wrote:
> >>>> Thanks for your feedback!
> >>>>
> >>>> I agree, that it should not crash.
> >>>>
> >>>> So far we did not experience crashes ourselves, but we did not index
> >>>> millions of vectors.
> >>>>
> >>>> I will try to reproduce the crash, maybe this will help us to move forward.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Michael
> >>>>
> >>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >>>>>> Can you describe your crash in more detail?
> >>>>> I can't. That experiment was a while ago and a quick test to see if I
> >>>>> could index rather large-ish USPTO (patent office) data as vectors.
> >>>>> Couldn't do it then.
> >>>>>
> >>>>>> How much RAM?
> >>>>> My indexing jobs run with rather smallish heaps to give space for I/O
> >>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
> >>>>> I recall segment merging grew slower and slower and then simply
> >>>>> crashed. Lucene should work with low heap requirements, even if it
> >>>>> slows down. Throwing ram at the indexing/ segment merging problem
> >>>>> is... I don't know - not elegant?
> >>>>>
> >>>>> Anyway. My main point was to remind folks about how Apache works -
> >>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
> >>>>> remains unconvinced, he or she can block the change. (I didn't invent
> >>>>> those rules).
> >>>>>
> >>>>> D.
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
Great, thank you!

How much RAM; etc. did you run this test on?

Do the vectors really have to be based on real data for testing the
indexing?
I understand, if you want to test the quality of the search results it
does matter, but for testing the scalability itself it should not matter
actually, right?

Thanks

Michael

Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> minutes with a single thread. I have some 256K vectors, but only about
> 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
> vectors I can use for testing? If all else fails I can test with
> noise, but that tends to lead to meaningless results
>
> On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>>
>>
>> Am 06.04.23 um 17:47 schrieb Robert Muir:
>>> Well, I'm asking ppl actually try to test using such high dimensions.
>>> Based on my own experience, I consider it unusable. It seems other
>>> folks may have run into trouble too. If the project committers can't
>>> even really use vectors with such high dimension counts, then its not
>>> in an OK state for users, and we shouldn't bump the limit.
>>>
>>> I'm happy to discuss/compromise etc, but simply bumping the limit
>>> without addressing the underlying usability/scalability is a real
>>> no-go,
>> I agree that this needs to be adressed
>>
>>
>>
>>> it is not really solving anything, nor is it giving users any
>>> freedom or allowing them to do something they couldnt do before.
>>> Because if it still doesnt work it still doesnt work.
>> I disagree, because it *does work* with "smaller" document sets.
>>
>> Currently we have to compile Lucene ourselves to not get the exception
>> when using a model with vector dimension greater than 1024,
>> which is of course possible, but not really convenient.
>>
>> As I wrote before, to resolve this discussion, I think we should test
>> and address possible issues.
>>
>> I will try to stop discussing now :-) and instead try to understand
>> better the actual issues. Would be great if others could join on this!
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>>> We all need to be on the same page, grounded in reality, not fantasy,
>>> where if we set a limit of 1024 or 2048, that you can actually index
>>> vectors with that many dimensions and it actually works and scales.
>>>
>>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>>> <a.benedetti@sease.io> wrote:
>>>> As I said earlier, a max limit limits usability.
>>>> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
>>>> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>>>>
>>>> Improving the algorithm performance is a separate discussion.
>>>> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
>>>>
>>>> What about potential users that need few high dimensional vectors?
>>>>
>>>> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
>>>>
>>>>
>>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
>>>>> a few million vectors with 756 or 1024, which is allowed today.
>>>>>
>>>>> IMO based on how painful it is, it seems the limit is already too
>>>>> high, I realize that will sound controversial but please at least try
>>>>> it out!
>>>>>
>>>>> voting +1 without at least doing this is really the
>>>>> "weak/unscientifically minded" approach.
>>>>>
>>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>>>>> <michael.wechner@wyona.com> wrote:
>>>>>> Thanks for your feedback!
>>>>>>
>>>>>> I agree, that it should not crash.
>>>>>>
>>>>>> So far we did not experience crashes ourselves, but we did not index
>>>>>> millions of vectors.
>>>>>>
>>>>>> I will try to reproduce the crash, maybe this will help us to move forward.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>>>>>>> Can you describe your crash in more detail?
>>>>>>> I can't. That experiment was a while ago and a quick test to see if I
>>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
>>>>>>> Couldn't do it then.
>>>>>>>
>>>>>>>> How much RAM?
>>>>>>> My indexing jobs run with rather smallish heaps to give space for I/O
>>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
>>>>>>> I recall segment merging grew slower and slower and then simply
>>>>>>> crashed. Lucene should work with low heap requirements, even if it
>>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
>>>>>>> is... I don't know - not elegant?
>>>>>>>
>>>>>>> Anyway. My main point was to remind folks about how Apache works -
>>>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
>>>>>>> remains unconvinced, he or she can block the change. (I didn't invent
>>>>>>> those rules).
>>>>>>>
>>>>>>> D.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
I've started to look on the internet, and surely someone will come, but the
challenge I suspect is that these vectors are expensive to generate so
people have not gone all in on generating such large vectors for large
datasets. They certainly have not made them easy to find. Here is the most
promising but it is too small, probably:
https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download

I'm still in and out of the office at the moment, but when I return, I can
ask my employer if they will sponsor a 10 million document collection so
that you can test with that. Or, maybe someone from work will see and ask
them on my behalf.

Alternatively, next week, I may get some time to set up a server with an
open source LLM to generate the vectors. It still won't be free, but it
would be 99% cheaper than paying the LLM companies if we can be slow.



On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <michael.wechner@wyona.com>
wrote:

> Great, thank you!
>
> How much RAM; etc. did you run this test on?
>
> Do the vectors really have to be based on real data for testing the
> indexing?
> I understand, if you want to test the quality of the search results it
> does matter, but for testing the scalability itself it should not matter
> actually, right?
>
> Thanks
>
> Michael
>
> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> > minutes with a single thread. I have some 256K vectors, but only about
> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
> > vectors I can use for testing? If all else fails I can test with
> > noise, but that tends to lead to meaningless results
> >
> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> > <michael.wechner@wyona.com> wrote:
> >>
> >>
> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> >>> Well, I'm asking ppl actually try to test using such high dimensions.
> >>> Based on my own experience, I consider it unusable. It seems other
> >>> folks may have run into trouble too. If the project committers can't
> >>> even really use vectors with such high dimension counts, then its not
> >>> in an OK state for users, and we shouldn't bump the limit.
> >>>
> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
> >>> without addressing the underlying usability/scalability is a real
> >>> no-go,
> >> I agree that this needs to be adressed
> >>
> >>
> >>
> >>> it is not really solving anything, nor is it giving users any
> >>> freedom or allowing them to do something they couldnt do before.
> >>> Because if it still doesnt work it still doesnt work.
> >> I disagree, because it *does work* with "smaller" document sets.
> >>
> >> Currently we have to compile Lucene ourselves to not get the exception
> >> when using a model with vector dimension greater than 1024,
> >> which is of course possible, but not really convenient.
> >>
> >> As I wrote before, to resolve this discussion, I think we should test
> >> and address possible issues.
> >>
> >> I will try to stop discussing now :-) and instead try to understand
> >> better the actual issues. Would be great if others could join on this!
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >>> We all need to be on the same page, grounded in reality, not fantasy,
> >>> where if we set a limit of 1024 or 2048, that you can actually index
> >>> vectors with that many dimensions and it actually works and scales.
> >>>
> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> >>> <a.benedetti@sease.io> wrote:
> >>>> As I said earlier, a max limit limits usability.
> >>>> It's not forcing users with small vectors to pay the performance
> penalty of big vectors, it's literally preventing some users to use
> Lucene/Solr/Elasticsearch at all.
> >>>> As far as I know, the max limit is used to raise an exception, it's
> not used to initialise or optimise data structures (please correct me if
> I'm wrong).
> >>>>
> >>>> Improving the algorithm performance is a separate discussion.
> >>>> I don't see a correlation with the fact that indexing billions of
> whatever dimensioned vector is slow with a usability parameter.
> >>>>
> >>>> What about potential users that need few high dimensional vectors?
> >>>>
> >>>> As I said before, I am a big +1 for NOT just raise it blindly, but I
> believe we need to remove the limit or size it in a way it's not a problem
> for both users and internal data structure optimizations, if any.
> >>>>
> >>>>
> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
> >>>>> a few million vectors with 756 or 1024, which is allowed today.
> >>>>>
> >>>>> IMO based on how painful it is, it seems the limit is already too
> >>>>> high, I realize that will sound controversial but please at least try
> >>>>> it out!
> >>>>>
> >>>>> voting +1 without at least doing this is really the
> >>>>> "weak/unscientifically minded" approach.
> >>>>>
> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> >>>>> <michael.wechner@wyona.com> wrote:
> >>>>>> Thanks for your feedback!
> >>>>>>
> >>>>>> I agree, that it should not crash.
> >>>>>>
> >>>>>> So far we did not experience crashes ourselves, but we did not index
> >>>>>> millions of vectors.
> >>>>>>
> >>>>>> I will try to reproduce the crash, maybe this will help us to move
> forward.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> Michael
> >>>>>>
> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >>>>>>>> Can you describe your crash in more detail?
> >>>>>>> I can't. That experiment was a while ago and a quick test to see
> if I
> >>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
> >>>>>>> Couldn't do it then.
> >>>>>>>
> >>>>>>>> How much RAM?
> >>>>>>> My indexing jobs run with rather smallish heaps to give space for
> I/O
> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the
> problem.
> >>>>>>> I recall segment merging grew slower and slower and then simply
> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
> >>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
> >>>>>>> is... I don't know - not elegant?
> >>>>>>>
> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody
> else)
> >>>>>>> remains unconvinced, he or she can block the change. (I didn't
> invent
> >>>>>>> those rules).
> >>>>>>>
> >>>>>>> D.
> >>>>>>>
> >>>>>>>
> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Marcus Eagan
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
you might want to use SentenceBERT to generate vectors

https://sbert.net

whereas for example the model "all-mpnet-base-v2" generates vectors with
dimension 768

We have SentenceBERT running as a web service, which we could open for
these tests, but because of network latency it should be faster running
locally.

HTH

Michael


Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> I've started to look on the internet, and surely someone will come,
> but the challenge I suspect is that these vectors are expensive to
> generate so people have not gone all in on generating such large
> vectors for large datasets. They certainly have not made them easy to
> find. Here is the most promising but it is too small, probably:
> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
>
>
>  I'm still in and out of the office at the moment, but when I return,
> I can ask my employer if they will sponsor a 10 million document
> collection so that you can test with that. Or, maybe someone from work
> will see and ask them on my behalf.
>
> Alternatively, next week, I may get some time to set up a server with
> an open source LLM to generate the vectors. It still won't be free,
> but it would be 99% cheaper than paying the LLM companies if we can be
> slow.
>
>
>
> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>
> Great, thank you!
>
> How much RAM; etc. did you run this test on?
>
> Do the vectors really have to be based on real data for testing the
> indexing?
> I understand, if you want to test the quality of the search
> results it
> does matter, but for testing the scalability itself it should not
> matter
> actually, right?
>
> Thanks
>
> Michael
>
> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> > minutes with a single thread. I have some 256K vectors, but only
> about
> > 2M of them. Can anybody point me to a large set (say 8M+) of
> 1024+ dim
> > vectors I can use for testing? If all else fails I can test with
> > noise, but that tends to lead to meaningless results
> >
> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> > <michael.wechner@wyona.com> wrote:
> >>
> >>
> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> >>> Well, I'm asking ppl actually try to test using such high
> dimensions.
> >>> Based on my own experience, I consider it unusable. It seems other
> >>> folks may have run into trouble too. If the project committers
> can't
> >>> even really use vectors with such high dimension counts, then
> its not
> >>> in an OK state for users, and we shouldn't bump the limit.
> >>>
> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
> >>> without addressing the underlying usability/scalability is a real
> >>> no-go,
> >> I agree that this needs to be adressed
> >>
> >>
> >>
> >>>    it is not really solving anything, nor is it giving users any
> >>> freedom or allowing them to do something they couldnt do before.
> >>> Because if it still doesnt work it still doesnt work.
> >> I disagree, because it *does work* with "smaller" document sets.
> >>
> >> Currently we have to compile Lucene ourselves to not get the
> exception
> >> when using a model with vector dimension greater than 1024,
> >> which is of course possible, but not really convenient.
> >>
> >> As I wrote before, to resolve this discussion, I think we
> should test
> >> and address possible issues.
> >>
> >> I will try to stop discussing now :-) and instead try to understand
> >> better the actual issues. Would be great if others could join
> on this!
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >>> We all need to be on the same page, grounded in reality, not
> fantasy,
> >>> where if we set a limit of 1024 or 2048, that you can actually
> index
> >>> vectors with that many dimensions and it actually works and
> scales.
> >>>
> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> >>> <a.benedetti@sease.io> wrote:
> >>>> As I said earlier, a max limit limits usability.
> >>>> It's not forcing users with small vectors to pay the
> performance penalty of big vectors, it's literally preventing some
> users to use Lucene/Solr/Elasticsearch at all.
> >>>> As far as I know, the max limit is used to raise an
> exception, it's not used to initialise or optimise data structures
> (please correct me if I'm wrong).
> >>>>
> >>>> Improving the algorithm performance is a separate discussion.
> >>>> I don't see a correlation with the fact that indexing
> billions of whatever dimensioned vector is slow with a usability
> parameter.
> >>>>
> >>>> What about potential users that need few high dimensional
> vectors?
> >>>>
> >>>> As I said before, I am a big +1 for NOT just raise it
> blindly, but I believe we need to remove the limit or size it in a
> way it's not a problem for both users and internal data structure
> optimizations, if any.
> >>>>
> >>>>
> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> >>>>> I'd ask anyone voting +1 to raise this limit to at least try
> to index
> >>>>> a few million vectors with 756 or 1024, which is allowed today.
> >>>>>
> >>>>> IMO based on how painful it is, it seems the limit is
> already too
> >>>>> high, I realize that will sound controversial but please at
> least try
> >>>>> it out!
> >>>>>
> >>>>> voting +1 without at least doing this is really the
> >>>>> "weak/unscientifically minded" approach.
> >>>>>
> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> >>>>> <michael.wechner@wyona.com> wrote:
> >>>>>> Thanks for your feedback!
> >>>>>>
> >>>>>> I agree, that it should not crash.
> >>>>>>
> >>>>>> So far we did not experience crashes ourselves, but we did
> not index
> >>>>>> millions of vectors.
> >>>>>>
> >>>>>> I will try to reproduce the crash, maybe this will help us
> to move forward.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> Michael
> >>>>>>
> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >>>>>>>> Can you describe your crash in more detail?
> >>>>>>> I can't. That experiment was a while ago and a quick test
> to see if I
> >>>>>>> could index rather large-ish USPTO (patent office) data as
> vectors.
> >>>>>>> Couldn't do it then.
> >>>>>>>
> >>>>>>>> How much RAM?
> >>>>>>> My indexing jobs run with rather smallish heaps to give
> space for I/O
> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been
> the problem.
> >>>>>>> I recall segment merging grew slower and slower and then
> simply
> >>>>>>> crashed. Lucene should work with low heap requirements,
> even if it
> >>>>>>> slows down. Throwing ram at the indexing/ segment merging
> problem
> >>>>>>> is... I don't know - not elegant?
> >>>>>>>
> >>>>>>> Anyway. My main point was to remind folks about how Apache
> works -
> >>>>>>> code is merged in when there are no vetoes. If Rob (or
> anybody else)
> >>>>>>> remains unconvinced, he or she can block the change. (I
> didn't invent
> >>>>>>> those rules).
> >>>>>>>
> >>>>>>> D.
> >>>>>>>
> >>>>>>>
> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>>
> >>>>>
> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>>
> >>>
> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
> >>
> >>
> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
> --
> Marcus Eagan
>
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
Hi,
I have been testing Lucene with a custom vector similarity and loaded 192m
vectors of dim 512 bytes. (Yes, segment merges use a lot of java memory..).

As this was a performance test, the 192m vectors were derived by dithering
47k original vectors in such a way to allow realistic ANN evaluation of
HNSW. The original 47k vectors were generated by ada-002 on source
newspaper article text. After dithering, I used PQ to reduce their
dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a
1byte code, 512 code tables, each learnt to reduce total encoding error
using Lloyds algorithm (hence the need for the custom similarity). BTW,
HNSW retrieval was accurate and fast enough for the use case I was
investigating as long as a machine with 128gb memory was available as the
graph needs to be cached in memory for reasonable query rates.

Anyway, if you want them, you are welcome to those 47k vectors of 1532
floats which can be readily dithered to generate very large and realistic
test vector sets.

Best regards,

Kent Fitch


On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wechner@wyona.com>
wrote:

> you might want to use SentenceBERT to generate vectors
>
> https://sbert.net
>
> whereas for example the model "all-mpnet-base-v2" generates vectors with
> dimension 768
>
> We have SentenceBERT running as a web service, which we could open for
> these tests, but because of network latency it should be faster running
> locally.
>
> HTH
>
> Michael
>
>
> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
>
> I've started to look on the internet, and surely someone will come, but
> the challenge I suspect is that these vectors are expensive to generate so
> people have not gone all in on generating such large vectors for large
> datasets. They certainly have not made them easy to find. Here is the most
> promising but it is too small, probably:
> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
>
> I'm still in and out of the office at the moment, but when I return, I
> can ask my employer if they will sponsor a 10 million document collection
> so that you can test with that. Or, maybe someone from work will see and
> ask them on my behalf.
>
> Alternatively, next week, I may get some time to set up a server with an
> open source LLM to generate the vectors. It still won't be free, but it
> would be 99% cheaper than paying the LLM companies if we can be slow.
>
>
>
> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <michael.wechner@wyona.com>
> wrote:
>
>> Great, thank you!
>>
>> How much RAM; etc. did you run this test on?
>>
>> Do the vectors really have to be based on real data for testing the
>> indexing?
>> I understand, if you want to test the quality of the search results it
>> does matter, but for testing the scalability itself it should not matter
>> actually, right?
>>
>> Thanks
>>
>> Michael
>>
>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
>> > minutes with a single thread. I have some 256K vectors, but only about
>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
>> > vectors I can use for testing? If all else fails I can test with
>> > noise, but that tends to lead to meaningless results
>> >
>> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
>> > <michael.wechner@wyona.com> wrote:
>> >>
>> >>
>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
>> >>> Well, I'm asking ppl actually try to test using such high dimensions.
>> >>> Based on my own experience, I consider it unusable. It seems other
>> >>> folks may have run into trouble too. If the project committers can't
>> >>> even really use vectors with such high dimension counts, then its not
>> >>> in an OK state for users, and we shouldn't bump the limit.
>> >>>
>> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
>> >>> without addressing the underlying usability/scalability is a real
>> >>> no-go,
>> >> I agree that this needs to be adressed
>> >>
>> >>
>> >>
>> >>> it is not really solving anything, nor is it giving users any
>> >>> freedom or allowing them to do something they couldnt do before.
>> >>> Because if it still doesnt work it still doesnt work.
>> >> I disagree, because it *does work* with "smaller" document sets.
>> >>
>> >> Currently we have to compile Lucene ourselves to not get the exception
>> >> when using a model with vector dimension greater than 1024,
>> >> which is of course possible, but not really convenient.
>> >>
>> >> As I wrote before, to resolve this discussion, I think we should test
>> >> and address possible issues.
>> >>
>> >> I will try to stop discussing now :-) and instead try to understand
>> >> better the actual issues. Would be great if others could join on this!
>> >>
>> >> Thanks
>> >>
>> >> Michael
>> >>
>> >>
>> >>
>> >>> We all need to be on the same page, grounded in reality, not fantasy,
>> >>> where if we set a limit of 1024 or 2048, that you can actually index
>> >>> vectors with that many dimensions and it actually works and scales.
>> >>>
>> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>> >>> <a.benedetti@sease.io> wrote:
>> >>>> As I said earlier, a max limit limits usability.
>> >>>> It's not forcing users with small vectors to pay the performance
>> penalty of big vectors, it's literally preventing some users to use
>> Lucene/Solr/Elasticsearch at all.
>> >>>> As far as I know, the max limit is used to raise an exception, it's
>> not used to initialise or optimise data structures (please correct me if
>> I'm wrong).
>> >>>>
>> >>>> Improving the algorithm performance is a separate discussion.
>> >>>> I don't see a correlation with the fact that indexing billions of
>> whatever dimensioned vector is slow with a usability parameter.
>> >>>>
>> >>>> What about potential users that need few high dimensional vectors?
>> >>>>
>> >>>> As I said before, I am a big +1 for NOT just raise it blindly, but I
>> believe we need to remove the limit or size it in a way it's not a problem
>> for both users and internal data structure optimizations, if any.
>> >>>>
>> >>>>
>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to
>> index
>> >>>>> a few million vectors with 756 or 1024, which is allowed today.
>> >>>>>
>> >>>>> IMO based on how painful it is, it seems the limit is already too
>> >>>>> high, I realize that will sound controversial but please at least
>> try
>> >>>>> it out!
>> >>>>>
>> >>>>> voting +1 without at least doing this is really the
>> >>>>> "weak/unscientifically minded" approach.
>> >>>>>
>> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>> >>>>> <michael.wechner@wyona.com> wrote:
>> >>>>>> Thanks for your feedback!
>> >>>>>>
>> >>>>>> I agree, that it should not crash.
>> >>>>>>
>> >>>>>> So far we did not experience crashes ourselves, but we did not
>> index
>> >>>>>> millions of vectors.
>> >>>>>>
>> >>>>>> I will try to reproduce the crash, maybe this will help us to move
>> forward.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>>
>> >>>>>> Michael
>> >>>>>>
>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>> >>>>>>>> Can you describe your crash in more detail?
>> >>>>>>> I can't. That experiment was a while ago and a quick test to see
>> if I
>> >>>>>>> could index rather large-ish USPTO (patent office) data as
>> vectors.
>> >>>>>>> Couldn't do it then.
>> >>>>>>>
>> >>>>>>>> How much RAM?
>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for
>> I/O
>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the
>> problem.
>> >>>>>>> I recall segment merging grew slower and slower and then simply
>> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
>> >>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
>> >>>>>>> is... I don't know - not elegant?
>> >>>>>>>
>> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
>> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody
>> else)
>> >>>>>>> remains unconvinced, he or she can block the change. (I didn't
>> invent
>> >>>>>>> those rules).
>> >>>>>>>
>> >>>>>>> D.
>> >>>>>>>
>> >>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>>
>> >>>>>>
>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>>
>> >>>>>
>> ---------------------------------------------------------------------
>> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> --
> Marcus Eagan
>
>
>
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
Thanks Kent - I tried something similar to what you did I think. Took
a set of 256d vectors I had and concatenated them to make bigger ones,
then shifted the dimensions to make more of them. Here are a few
single-threaded indexing test runs. I ran all tests with M=16.


8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
buffer size=1994)
8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)

increasing the vector dimension makes things take longer (scaling
*linearly*) but doesn't lead to RAM issues. I think we could get to
OOM while merging with a small heap and a large number of vectors, or
by increasing M, but none of this has anything to do with vector
dimensions. Also, if merge RAM usage is a problem I think we could
address it by adding accounting to the merge process and simply not
merging graphs when they exceed the buffer size (as we do with
flushing).

Robert, since you're the only on-the-record veto here, does this
change your thinking at all, or if not could you share some test
results that didn't go the way you expected? Maybe we can find some
mitigation if we focus on a specific issue.

On Fri, Apr 7, 2023 at 5:18?AM Kent Fitch <kent.fitch@gmail.com> wrote:
>
> Hi,
> I have been testing Lucene with a custom vector similarity and loaded 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of java memory..).
>
> As this was a performance test, the 192m vectors were derived by dithering 47k original vectors in such a way to allow realistic ANN evaluation of HNSW. The original 47k vectors were generated by ada-002 on source newspaper article text. After dithering, I used PQ to reduce their dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a 1byte code, 512 code tables, each learnt to reduce total encoding error using Lloyds algorithm (hence the need for the custom similarity). BTW, HNSW retrieval was accurate and fast enough for the use case I was investigating as long as a machine with 128gb memory was available as the graph needs to be cached in memory for reasonable query rates.
>
> Anyway, if you want them, you are welcome to those 47k vectors of 1532 floats which can be readily dithered to generate very large and realistic test vector sets.
>
> Best regards,
>
> Kent Fitch
>
>
> On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wechner@wyona.com> wrote:
>>
>> you might want to use SentenceBERT to generate vectors
>>
>> https://sbert.net
>>
>> whereas for example the model "all-mpnet-base-v2" generates vectors with dimension 768
>>
>> We have SentenceBERT running as a web service, which we could open for these tests, but because of network latency it should be faster running locally.
>>
>> HTH
>>
>> Michael
>>
>>
>> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
>>
>> I've started to look on the internet, and surely someone will come, but the challenge I suspect is that these vectors are expensive to generate so people have not gone all in on generating such large vectors for large datasets. They certainly have not made them easy to find. Here is the most promising but it is too small, probably: https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
>>
>> I'm still in and out of the office at the moment, but when I return, I can ask my employer if they will sponsor a 10 million document collection so that you can test with that. Or, maybe someone from work will see and ask them on my behalf.
>>
>> Alternatively, next week, I may get some time to set up a server with an open source LLM to generate the vectors. It still won't be free, but it would be 99% cheaper than paying the LLM companies if we can be slow.
>>
>>
>>
>> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <michael.wechner@wyona.com> wrote:
>>>
>>> Great, thank you!
>>>
>>> How much RAM; etc. did you run this test on?
>>>
>>> Do the vectors really have to be based on real data for testing the
>>> indexing?
>>> I understand, if you want to test the quality of the search results it
>>> does matter, but for testing the scalability itself it should not matter
>>> actually, right?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
>>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
>>> > minutes with a single thread. I have some 256K vectors, but only about
>>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
>>> > vectors I can use for testing? If all else fails I can test with
>>> > noise, but that tends to lead to meaningless results
>>> >
>>> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
>>> > <michael.wechner@wyona.com> wrote:
>>> >>
>>> >>
>>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
>>> >>> Well, I'm asking ppl actually try to test using such high dimensions.
>>> >>> Based on my own experience, I consider it unusable. It seems other
>>> >>> folks may have run into trouble too. If the project committers can't
>>> >>> even really use vectors with such high dimension counts, then its not
>>> >>> in an OK state for users, and we shouldn't bump the limit.
>>> >>>
>>> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
>>> >>> without addressing the underlying usability/scalability is a real
>>> >>> no-go,
>>> >> I agree that this needs to be adressed
>>> >>
>>> >>
>>> >>
>>> >>> it is not really solving anything, nor is it giving users any
>>> >>> freedom or allowing them to do something they couldnt do before.
>>> >>> Because if it still doesnt work it still doesnt work.
>>> >> I disagree, because it *does work* with "smaller" document sets.
>>> >>
>>> >> Currently we have to compile Lucene ourselves to not get the exception
>>> >> when using a model with vector dimension greater than 1024,
>>> >> which is of course possible, but not really convenient.
>>> >>
>>> >> As I wrote before, to resolve this discussion, I think we should test
>>> >> and address possible issues.
>>> >>
>>> >> I will try to stop discussing now :-) and instead try to understand
>>> >> better the actual issues. Would be great if others could join on this!
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >>
>>> >>
>>> >>> We all need to be on the same page, grounded in reality, not fantasy,
>>> >>> where if we set a limit of 1024 or 2048, that you can actually index
>>> >>> vectors with that many dimensions and it actually works and scales.
>>> >>>
>>> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
>>> >>> <a.benedetti@sease.io> wrote:
>>> >>>> As I said earlier, a max limit limits usability.
>>> >>>> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
>>> >>>> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>>> >>>>
>>> >>>> Improving the algorithm performance is a separate discussion.
>>> >>>> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
>>> >>>>
>>> >>>> What about potential users that need few high dimensional vectors?
>>> >>>>
>>> >>>> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
>>> >>>>
>>> >>>>
>>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
>>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
>>> >>>>> a few million vectors with 756 or 1024, which is allowed today.
>>> >>>>>
>>> >>>>> IMO based on how painful it is, it seems the limit is already too
>>> >>>>> high, I realize that will sound controversial but please at least try
>>> >>>>> it out!
>>> >>>>>
>>> >>>>> voting +1 without at least doing this is really the
>>> >>>>> "weak/unscientifically minded" approach.
>>> >>>>>
>>> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
>>> >>>>> <michael.wechner@wyona.com> wrote:
>>> >>>>>> Thanks for your feedback!
>>> >>>>>>
>>> >>>>>> I agree, that it should not crash.
>>> >>>>>>
>>> >>>>>> So far we did not experience crashes ourselves, but we did not index
>>> >>>>>> millions of vectors.
>>> >>>>>>
>>> >>>>>> I will try to reproduce the crash, maybe this will help us to move forward.
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>>
>>> >>>>>> Michael
>>> >>>>>>
>>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>> >>>>>>>> Can you describe your crash in more detail?
>>> >>>>>>> I can't. That experiment was a while ago and a quick test to see if I
>>> >>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
>>> >>>>>>> Couldn't do it then.
>>> >>>>>>>
>>> >>>>>>>> How much RAM?
>>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for I/O
>>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
>>> >>>>>>> I recall segment merging grew slower and slower and then simply
>>> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
>>> >>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
>>> >>>>>>> is... I don't know - not elegant?
>>> >>>>>>>
>>> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
>>> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
>>> >>>>>>> remains unconvinced, he or she can block the change. (I didn't invent
>>> >>>>>>> those rules).
>>> >>>>>>>
>>> >>>>>>> D.
>>> >>>>>>>
>>> >>>>>>> ---------------------------------------------------------------------
>>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>>
>>> >>>>>> ---------------------------------------------------------------------
>>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>>
>>> >>>>> ---------------------------------------------------------------------
>>> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>
>>
>> --
>> Marcus Eagan
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
I also want to add that we do impose some other limits on graph
construction to help ensure that HNSW-based vector fields remain
manageable; M is limited to <= 512, and maximum segment size also
helps limit merge costs

On Fri, Apr 7, 2023 at 7:45?AM Michael Sokolov <msokolov@gmail.com> wrote:
>
> Thanks Kent - I tried something similar to what you did I think. Took
> a set of 256d vectors I had and concatenated them to make bigger ones,
> then shifted the dimensions to make more of them. Here are a few
> single-threaded indexing test runs. I ran all tests with M=16.
>
>
> 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> buffer size=1994)
> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>
> increasing the vector dimension makes things take longer (scaling
> *linearly*) but doesn't lead to RAM issues. I think we could get to
> OOM while merging with a small heap and a large number of vectors, or
> by increasing M, but none of this has anything to do with vector
> dimensions. Also, if merge RAM usage is a problem I think we could
> address it by adding accounting to the merge process and simply not
> merging graphs when they exceed the buffer size (as we do with
> flushing).
>
> Robert, since you're the only on-the-record veto here, does this
> change your thinking at all, or if not could you share some test
> results that didn't go the way you expected? Maybe we can find some
> mitigation if we focus on a specific issue.
>
> On Fri, Apr 7, 2023 at 5:18?AM Kent Fitch <kent.fitch@gmail.com> wrote:
> >
> > Hi,
> > I have been testing Lucene with a custom vector similarity and loaded 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of java memory..).
> >
> > As this was a performance test, the 192m vectors were derived by dithering 47k original vectors in such a way to allow realistic ANN evaluation of HNSW. The original 47k vectors were generated by ada-002 on source newspaper article text. After dithering, I used PQ to reduce their dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a 1byte code, 512 code tables, each learnt to reduce total encoding error using Lloyds algorithm (hence the need for the custom similarity). BTW, HNSW retrieval was accurate and fast enough for the use case I was investigating as long as a machine with 128gb memory was available as the graph needs to be cached in memory for reasonable query rates.
> >
> > Anyway, if you want them, you are welcome to those 47k vectors of 1532 floats which can be readily dithered to generate very large and realistic test vector sets.
> >
> > Best regards,
> >
> > Kent Fitch
> >
> >
> > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wechner@wyona.com> wrote:
> >>
> >> you might want to use SentenceBERT to generate vectors
> >>
> >> https://sbert.net
> >>
> >> whereas for example the model "all-mpnet-base-v2" generates vectors with dimension 768
> >>
> >> We have SentenceBERT running as a web service, which we could open for these tests, but because of network latency it should be faster running locally.
> >>
> >> HTH
> >>
> >> Michael
> >>
> >>
> >> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> >>
> >> I've started to look on the internet, and surely someone will come, but the challenge I suspect is that these vectors are expensive to generate so people have not gone all in on generating such large vectors for large datasets. They certainly have not made them easy to find. Here is the most promising but it is too small, probably: https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> >>
> >> I'm still in and out of the office at the moment, but when I return, I can ask my employer if they will sponsor a 10 million document collection so that you can test with that. Or, maybe someone from work will see and ask them on my behalf.
> >>
> >> Alternatively, next week, I may get some time to set up a server with an open source LLM to generate the vectors. It still won't be free, but it would be 99% cheaper than paying the LLM companies if we can be slow.
> >>
> >>
> >>
> >> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <michael.wechner@wyona.com> wrote:
> >>>
> >>> Great, thank you!
> >>>
> >>> How much RAM; etc. did you run this test on?
> >>>
> >>> Do the vectors really have to be based on real data for testing the
> >>> indexing?
> >>> I understand, if you want to test the quality of the search results it
> >>> does matter, but for testing the scalability itself it should not matter
> >>> actually, right?
> >>>
> >>> Thanks
> >>>
> >>> Michael
> >>>
> >>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> >>> > minutes with a single thread. I have some 256K vectors, but only about
> >>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
> >>> > vectors I can use for testing? If all else fails I can test with
> >>> > noise, but that tends to lead to meaningless results
> >>> >
> >>> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> >>> > <michael.wechner@wyona.com> wrote:
> >>> >>
> >>> >>
> >>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> >>> >>> Well, I'm asking ppl actually try to test using such high dimensions.
> >>> >>> Based on my own experience, I consider it unusable. It seems other
> >>> >>> folks may have run into trouble too. If the project committers can't
> >>> >>> even really use vectors with such high dimension counts, then its not
> >>> >>> in an OK state for users, and we shouldn't bump the limit.
> >>> >>>
> >>> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
> >>> >>> without addressing the underlying usability/scalability is a real
> >>> >>> no-go,
> >>> >> I agree that this needs to be adressed
> >>> >>
> >>> >>
> >>> >>
> >>> >>> it is not really solving anything, nor is it giving users any
> >>> >>> freedom or allowing them to do something they couldnt do before.
> >>> >>> Because if it still doesnt work it still doesnt work.
> >>> >> I disagree, because it *does work* with "smaller" document sets.
> >>> >>
> >>> >> Currently we have to compile Lucene ourselves to not get the exception
> >>> >> when using a model with vector dimension greater than 1024,
> >>> >> which is of course possible, but not really convenient.
> >>> >>
> >>> >> As I wrote before, to resolve this discussion, I think we should test
> >>> >> and address possible issues.
> >>> >>
> >>> >> I will try to stop discussing now :-) and instead try to understand
> >>> >> better the actual issues. Would be great if others could join on this!
> >>> >>
> >>> >> Thanks
> >>> >>
> >>> >> Michael
> >>> >>
> >>> >>
> >>> >>
> >>> >>> We all need to be on the same page, grounded in reality, not fantasy,
> >>> >>> where if we set a limit of 1024 or 2048, that you can actually index
> >>> >>> vectors with that many dimensions and it actually works and scales.
> >>> >>>
> >>> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> >>> >>> <a.benedetti@sease.io> wrote:
> >>> >>>> As I said earlier, a max limit limits usability.
> >>> >>>> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
> >>> >>>> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
> >>> >>>>
> >>> >>>> Improving the algorithm performance is a separate discussion.
> >>> >>>> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
> >>> >>>>
> >>> >>>> What about potential users that need few high dimensional vectors?
> >>> >>>>
> >>> >>>> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
> >>> >>>>
> >>> >>>>
> >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
> >>> >>>>> a few million vectors with 756 or 1024, which is allowed today.
> >>> >>>>>
> >>> >>>>> IMO based on how painful it is, it seems the limit is already too
> >>> >>>>> high, I realize that will sound controversial but please at least try
> >>> >>>>> it out!
> >>> >>>>>
> >>> >>>>> voting +1 without at least doing this is really the
> >>> >>>>> "weak/unscientifically minded" approach.
> >>> >>>>>
> >>> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> >>> >>>>> <michael.wechner@wyona.com> wrote:
> >>> >>>>>> Thanks for your feedback!
> >>> >>>>>>
> >>> >>>>>> I agree, that it should not crash.
> >>> >>>>>>
> >>> >>>>>> So far we did not experience crashes ourselves, but we did not index
> >>> >>>>>> millions of vectors.
> >>> >>>>>>
> >>> >>>>>> I will try to reproduce the crash, maybe this will help us to move forward.
> >>> >>>>>>
> >>> >>>>>> Thanks
> >>> >>>>>>
> >>> >>>>>> Michael
> >>> >>>>>>
> >>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >>> >>>>>>>> Can you describe your crash in more detail?
> >>> >>>>>>> I can't. That experiment was a while ago and a quick test to see if I
> >>> >>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
> >>> >>>>>>> Couldn't do it then.
> >>> >>>>>>>
> >>> >>>>>>>> How much RAM?
> >>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for I/O
> >>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
> >>> >>>>>>> I recall segment merging grew slower and slower and then simply
> >>> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
> >>> >>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
> >>> >>>>>>> is... I don't know - not elegant?
> >>> >>>>>>>
> >>> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
> >>> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
> >>> >>>>>>> remains unconvinced, he or she can block the change. (I didn't invent
> >>> >>>>>>> those rules).
> >>> >>>>>>>
> >>> >>>>>>> D.
> >>> >>>>>>>
> >>> >>>>>>> ---------------------------------------------------------------------
> >>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>>>>>>
> >>> >>>>>> ---------------------------------------------------------------------
> >>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>>>>>
> >>> >>>>> ---------------------------------------------------------------------
> >>> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>>>>
> >>> >>> ---------------------------------------------------------------------
> >>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>>
> >>> >>
> >>> >> ---------------------------------------------------------------------
> >>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >>
> >>> > ---------------------------------------------------------------------
> >>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> > For additional commands, e-mail: dev-help@lucene.apache.org
> >>> >
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
> >>
> >>
> >> --
> >> Marcus Eagan
> >>
> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
one more data point:

32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994, heap=4GB)

On Fri, Apr 7, 2023 at 8:52?AM Michael Sokolov <msokolov@gmail.com> wrote:
>
> I also want to add that we do impose some other limits on graph
> construction to help ensure that HNSW-based vector fields remain
> manageable; M is limited to <= 512, and maximum segment size also
> helps limit merge costs
>
> On Fri, Apr 7, 2023 at 7:45?AM Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > Thanks Kent - I tried something similar to what you did I think. Took
> > a set of 256d vectors I had and concatenated them to make bigger ones,
> > then shifted the dimensions to make more of them. Here are a few
> > single-threaded indexing test runs. I ran all tests with M=16.
> >
> >
> > 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> > buffer size=1994)
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > increasing the vector dimension makes things take longer (scaling
> > *linearly*) but doesn't lead to RAM issues. I think we could get to
> > OOM while merging with a small heap and a large number of vectors, or
> > by increasing M, but none of this has anything to do with vector
> > dimensions. Also, if merge RAM usage is a problem I think we could
> > address it by adding accounting to the merge process and simply not
> > merging graphs when they exceed the buffer size (as we do with
> > flushing).
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
> > On Fri, Apr 7, 2023 at 5:18?AM Kent Fitch <kent.fitch@gmail.com> wrote:
> > >
> > > Hi,
> > > I have been testing Lucene with a custom vector similarity and loaded 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of java memory..).
> > >
> > > As this was a performance test, the 192m vectors were derived by dithering 47k original vectors in such a way to allow realistic ANN evaluation of HNSW. The original 47k vectors were generated by ada-002 on source newspaper article text. After dithering, I used PQ to reduce their dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a 1byte code, 512 code tables, each learnt to reduce total encoding error using Lloyds algorithm (hence the need for the custom similarity). BTW, HNSW retrieval was accurate and fast enough for the use case I was investigating as long as a machine with 128gb memory was available as the graph needs to be cached in memory for reasonable query rates.
> > >
> > > Anyway, if you want them, you are welcome to those 47k vectors of 1532 floats which can be readily dithered to generate very large and realistic test vector sets.
> > >
> > > Best regards,
> > >
> > > Kent Fitch
> > >
> > >
> > > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wechner@wyona.com> wrote:
> > >>
> > >> you might want to use SentenceBERT to generate vectors
> > >>
> > >> https://sbert.net
> > >>
> > >> whereas for example the model "all-mpnet-base-v2" generates vectors with dimension 768
> > >>
> > >> We have SentenceBERT running as a web service, which we could open for these tests, but because of network latency it should be faster running locally.
> > >>
> > >> HTH
> > >>
> > >> Michael
> > >>
> > >>
> > >> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> > >>
> > >> I've started to look on the internet, and surely someone will come, but the challenge I suspect is that these vectors are expensive to generate so people have not gone all in on generating such large vectors for large datasets. They certainly have not made them easy to find. Here is the most promising but it is too small, probably: https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> > >>
> > >> I'm still in and out of the office at the moment, but when I return, I can ask my employer if they will sponsor a 10 million document collection so that you can test with that. Or, maybe someone from work will see and ask them on my behalf.
> > >>
> > >> Alternatively, next week, I may get some time to set up a server with an open source LLM to generate the vectors. It still won't be free, but it would be 99% cheaper than paying the LLM companies if we can be slow.
> > >>
> > >>
> > >>
> > >> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <michael.wechner@wyona.com> wrote:
> > >>>
> > >>> Great, thank you!
> > >>>
> > >>> How much RAM; etc. did you run this test on?
> > >>>
> > >>> Do the vectors really have to be based on real data for testing the
> > >>> indexing?
> > >>> I understand, if you want to test the quality of the search results it
> > >>> does matter, but for testing the scalability itself it should not matter
> > >>> actually, right?
> > >>>
> > >>> Thanks
> > >>>
> > >>> Michael
> > >>>
> > >>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> > >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> > >>> > minutes with a single thread. I have some 256K vectors, but only about
> > >>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
> > >>> > vectors I can use for testing? If all else fails I can test with
> > >>> > noise, but that tends to lead to meaningless results
> > >>> >
> > >>> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> > >>> > <michael.wechner@wyona.com> wrote:
> > >>> >>
> > >>> >>
> > >>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> > >>> >>> Well, I'm asking ppl actually try to test using such high dimensions.
> > >>> >>> Based on my own experience, I consider it unusable. It seems other
> > >>> >>> folks may have run into trouble too. If the project committers can't
> > >>> >>> even really use vectors with such high dimension counts, then its not
> > >>> >>> in an OK state for users, and we shouldn't bump the limit.
> > >>> >>>
> > >>> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
> > >>> >>> without addressing the underlying usability/scalability is a real
> > >>> >>> no-go,
> > >>> >> I agree that this needs to be adressed
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >>> it is not really solving anything, nor is it giving users any
> > >>> >>> freedom or allowing them to do something they couldnt do before.
> > >>> >>> Because if it still doesnt work it still doesnt work.
> > >>> >> I disagree, because it *does work* with "smaller" document sets.
> > >>> >>
> > >>> >> Currently we have to compile Lucene ourselves to not get the exception
> > >>> >> when using a model with vector dimension greater than 1024,
> > >>> >> which is of course possible, but not really convenient.
> > >>> >>
> > >>> >> As I wrote before, to resolve this discussion, I think we should test
> > >>> >> and address possible issues.
> > >>> >>
> > >>> >> I will try to stop discussing now :-) and instead try to understand
> > >>> >> better the actual issues. Would be great if others could join on this!
> > >>> >>
> > >>> >> Thanks
> > >>> >>
> > >>> >> Michael
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >>> We all need to be on the same page, grounded in reality, not fantasy,
> > >>> >>> where if we set a limit of 1024 or 2048, that you can actually index
> > >>> >>> vectors with that many dimensions and it actually works and scales.
> > >>> >>>
> > >>> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> > >>> >>> <a.benedetti@sease.io> wrote:
> > >>> >>>> As I said earlier, a max limit limits usability.
> > >>> >>>> It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users to use Lucene/Solr/Elasticsearch at all.
> > >>> >>>> As far as I know, the max limit is used to raise an exception, it's not used to initialise or optimise data structures (please correct me if I'm wrong).
> > >>> >>>>
> > >>> >>>> Improving the algorithm performance is a separate discussion.
> > >>> >>>> I don't see a correlation with the fact that indexing billions of whatever dimensioned vector is slow with a usability parameter.
> > >>> >>>>
> > >>> >>>> What about potential users that need few high dimensional vectors?
> > >>> >>>>
> > >>> >>>> As I said before, I am a big +1 for NOT just raise it blindly, but I believe we need to remove the limit or size it in a way it's not a problem for both users and internal data structure optimizations, if any.
> > >>> >>>>
> > >>> >>>>
> > >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com> wrote:
> > >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
> > >>> >>>>> a few million vectors with 756 or 1024, which is allowed today.
> > >>> >>>>>
> > >>> >>>>> IMO based on how painful it is, it seems the limit is already too
> > >>> >>>>> high, I realize that will sound controversial but please at least try
> > >>> >>>>> it out!
> > >>> >>>>>
> > >>> >>>>> voting +1 without at least doing this is really the
> > >>> >>>>> "weak/unscientifically minded" approach.
> > >>> >>>>>
> > >>> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> > >>> >>>>> <michael.wechner@wyona.com> wrote:
> > >>> >>>>>> Thanks for your feedback!
> > >>> >>>>>>
> > >>> >>>>>> I agree, that it should not crash.
> > >>> >>>>>>
> > >>> >>>>>> So far we did not experience crashes ourselves, but we did not index
> > >>> >>>>>> millions of vectors.
> > >>> >>>>>>
> > >>> >>>>>> I will try to reproduce the crash, maybe this will help us to move forward.
> > >>> >>>>>>
> > >>> >>>>>> Thanks
> > >>> >>>>>>
> > >>> >>>>>> Michael
> > >>> >>>>>>
> > >>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> > >>> >>>>>>>> Can you describe your crash in more detail?
> > >>> >>>>>>> I can't. That experiment was a while ago and a quick test to see if I
> > >>> >>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
> > >>> >>>>>>> Couldn't do it then.
> > >>> >>>>>>>
> > >>> >>>>>>>> How much RAM?
> > >>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for I/O
> > >>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
> > >>> >>>>>>> I recall segment merging grew slower and slower and then simply
> > >>> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
> > >>> >>>>>>> slows down. Throwing ram at the indexing/ segment merging problem
> > >>> >>>>>>> is... I don't know - not elegant?
> > >>> >>>>>>>
> > >>> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
> > >>> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
> > >>> >>>>>>> remains unconvinced, he or she can block the change. (I didn't invent
> > >>> >>>>>>> those rules).
> > >>> >>>>>>>
> > >>> >>>>>>> D.
> > >>> >>>>>>>
> > >>> >>>>>>> ---------------------------------------------------------------------
> > >>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> >>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >>>>>>>
> > >>> >>>>>> ---------------------------------------------------------------------
> > >>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >>>>>>
> > >>> >>>>> ---------------------------------------------------------------------
> > >>> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >>>>>
> > >>> >>> ---------------------------------------------------------------------
> > >>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >>>
> > >>> >>
> > >>> >> ---------------------------------------------------------------------
> > >>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >>
> > >>> > ---------------------------------------------------------------------
> > >>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> > For additional commands, e-mail: dev-help@lucene.apache.org
> > >>> >
> > >>>
> > >>>
> > >>> ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >>> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>>
> > >>
> > >>
> > >> --
> > >> Marcus Eagan
> > >>
> > >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
Important data point and it doesn't seem too bad or good. What is
acceptable performance should be decided by the user? What do you all think?

On Fri, Apr 7, 2023 at 8:20?AM Michael Sokolov <msokolov@gmail.com> wrote:

> one more data point:
>
> 32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994, heap=4GB)
>
> On Fri, Apr 7, 2023 at 8:52?AM Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > I also want to add that we do impose some other limits on graph
> > construction to help ensure that HNSW-based vector fields remain
> > manageable; M is limited to <= 512, and maximum segment size also
> > helps limit merge costs
> >
> > On Fri, Apr 7, 2023 at 7:45?AM Michael Sokolov <msokolov@gmail.com>
> wrote:
> > >
> > > Thanks Kent - I tried something similar to what you did I think. Took
> > > a set of 256d vectors I had and concatenated them to make bigger ones,
> > > then shifted the dimensions to make more of them. Here are a few
> > > single-threaded indexing test runs. I ran all tests with M=16.
> > >
> > >
> > > 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> > > buffer size=1994)
> > > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
> size=1994)
> > >
> > > increasing the vector dimension makes things take longer (scaling
> > > *linearly*) but doesn't lead to RAM issues. I think we could get to
> > > OOM while merging with a small heap and a large number of vectors, or
> > > by increasing M, but none of this has anything to do with vector
> > > dimensions. Also, if merge RAM usage is a problem I think we could
> > > address it by adding accounting to the merge process and simply not
> > > merging graphs when they exceed the buffer size (as we do with
> > > flushing).
> > >
> > > Robert, since you're the only on-the-record veto here, does this
> > > change your thinking at all, or if not could you share some test
> > > results that didn't go the way you expected? Maybe we can find some
> > > mitigation if we focus on a specific issue.
> > >
> > > On Fri, Apr 7, 2023 at 5:18?AM Kent Fitch <kent.fitch@gmail.com>
> wrote:
> > > >
> > > > Hi,
> > > > I have been testing Lucene with a custom vector similarity and
> loaded 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of
> java memory..).
> > > >
> > > > As this was a performance test, the 192m vectors were derived by
> dithering 47k original vectors in such a way to allow realistic ANN
> evaluation of HNSW. The original 47k vectors were generated by ada-002 on
> source newspaper article text. After dithering, I used PQ to reduce their
> dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a
> 1byte code, 512 code tables, each learnt to reduce total encoding error
> using Lloyds algorithm (hence the need for the custom similarity). BTW,
> HNSW retrieval was accurate and fast enough for the use case I was
> investigating as long as a machine with 128gb memory was available as the
> graph needs to be cached in memory for reasonable query rates.
> > > >
> > > > Anyway, if you want them, you are welcome to those 47k vectors of
> 1532 floats which can be readily dithered to generate very large and
> realistic test vector sets.
> > > >
> > > > Best regards,
> > > >
> > > > Kent Fitch
> > > >
> > > >
> > > > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <
> michael.wechner@wyona.com> wrote:
> > > >>
> > > >> you might want to use SentenceBERT to generate vectors
> > > >>
> > > >> https://sbert.net
> > > >>
> > > >> whereas for example the model "all-mpnet-base-v2" generates vectors
> with dimension 768
> > > >>
> > > >> We have SentenceBERT running as a web service, which we could open
> for these tests, but because of network latency it should be faster running
> locally.
> > > >>
> > > >> HTH
> > > >>
> > > >> Michael
> > > >>
> > > >>
> > > >> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> > > >>
> > > >> I've started to look on the internet, and surely someone will come,
> but the challenge I suspect is that these vectors are expensive to generate
> so people have not gone all in on generating such large vectors for large
> datasets. They certainly have not made them easy to find. Here is the most
> promising but it is too small, probably:
> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> > > >>
> > > >> I'm still in and out of the office at the moment, but when I
> return, I can ask my employer if they will sponsor a 10 million document
> collection so that you can test with that. Or, maybe someone from work will
> see and ask them on my behalf.
> > > >>
> > > >> Alternatively, next week, I may get some time to set up a server
> with an open source LLM to generate the vectors. It still won't be free,
> but it would be 99% cheaper than paying the LLM companies if we can be slow.
> > > >>
> > > >>
> > > >>
> > > >> On Thu, Apr 6, 2023 at 9:42?PM Michael Wechner <
> michael.wechner@wyona.com> wrote:
> > > >>>
> > > >>> Great, thank you!
> > > >>>
> > > >>> How much RAM; etc. did you run this test on?
> > > >>>
> > > >>> Do the vectors really have to be based on real data for testing the
> > > >>> indexing?
> > > >>> I understand, if you want to test the quality of the search
> results it
> > > >>> does matter, but for testing the scalability itself it should not
> matter
> > > >>> actually, right?
> > > >>>
> > > >>> Thanks
> > > >>>
> > > >>> Michael
> > > >>>
> > > >>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> > > >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in
> ~20
> > > >>> > minutes with a single thread. I have some 256K vectors, but only
> about
> > > >>> > 2M of them. Can anybody point me to a large set (say 8M+) of
> 1024+ dim
> > > >>> > vectors I can use for testing? If all else fails I can test with
> > > >>> > noise, but that tends to lead to meaningless results
> > > >>> >
> > > >>> > On Thu, Apr 6, 2023 at 3:52?PM Michael Wechner
> > > >>> > <michael.wechner@wyona.com> wrote:
> > > >>> >>
> > > >>> >>
> > > >>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> > > >>> >>> Well, I'm asking ppl actually try to test using such high
> dimensions.
> > > >>> >>> Based on my own experience, I consider it unusable. It seems
> other
> > > >>> >>> folks may have run into trouble too. If the project committers
> can't
> > > >>> >>> even really use vectors with such high dimension counts, then
> its not
> > > >>> >>> in an OK state for users, and we shouldn't bump the limit.
> > > >>> >>>
> > > >>> >>> I'm happy to discuss/compromise etc, but simply bumping the
> limit
> > > >>> >>> without addressing the underlying usability/scalability is a
> real
> > > >>> >>> no-go,
> > > >>> >> I agree that this needs to be adressed
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >>> it is not really solving anything, nor is it giving users
> any
> > > >>> >>> freedom or allowing them to do something they couldnt do
> before.
> > > >>> >>> Because if it still doesnt work it still doesnt work.
> > > >>> >> I disagree, because it *does work* with "smaller" document sets.
> > > >>> >>
> > > >>> >> Currently we have to compile Lucene ourselves to not get the
> exception
> > > >>> >> when using a model with vector dimension greater than 1024,
> > > >>> >> which is of course possible, but not really convenient.
> > > >>> >>
> > > >>> >> As I wrote before, to resolve this discussion, I think we
> should test
> > > >>> >> and address possible issues.
> > > >>> >>
> > > >>> >> I will try to stop discussing now :-) and instead try to
> understand
> > > >>> >> better the actual issues. Would be great if others could join
> on this!
> > > >>> >>
> > > >>> >> Thanks
> > > >>> >>
> > > >>> >> Michael
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >>> We all need to be on the same page, grounded in reality, not
> fantasy,
> > > >>> >>> where if we set a limit of 1024 or 2048, that you can actually
> index
> > > >>> >>> vectors with that many dimensions and it actually works and
> scales.
> > > >>> >>>
> > > >>> >>> On Thu, Apr 6, 2023 at 11:38?AM Alessandro Benedetti
> > > >>> >>> <a.benedetti@sease.io> wrote:
> > > >>> >>>> As I said earlier, a max limit limits usability.
> > > >>> >>>> It's not forcing users with small vectors to pay the
> performance penalty of big vectors, it's literally preventing some users to
> use Lucene/Solr/Elasticsearch at all.
> > > >>> >>>> As far as I know, the max limit is used to raise an
> exception, it's not used to initialise or optimise data structures (please
> correct me if I'm wrong).
> > > >>> >>>>
> > > >>> >>>> Improving the algorithm performance is a separate discussion.
> > > >>> >>>> I don't see a correlation with the fact that indexing
> billions of whatever dimensioned vector is slow with a usability parameter.
> > > >>> >>>>
> > > >>> >>>> What about potential users that need few high dimensional
> vectors?
> > > >>> >>>>
> > > >>> >>>> As I said before, I am a big +1 for NOT just raise it
> blindly, but I believe we need to remove the limit or size it in a way it's
> not a problem for both users and internal data structure optimizations, if
> any.
> > > >>> >>>>
> > > >>> >>>>
> > > >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcmuir@gmail.com>
> wrote:
> > > >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try
> to index
> > > >>> >>>>> a few million vectors with 756 or 1024, which is allowed
> today.
> > > >>> >>>>>
> > > >>> >>>>> IMO based on how painful it is, it seems the limit is
> already too
> > > >>> >>>>> high, I realize that will sound controversial but please at
> least try
> > > >>> >>>>> it out!
> > > >>> >>>>>
> > > >>> >>>>> voting +1 without at least doing this is really the
> > > >>> >>>>> "weak/unscientifically minded" approach.
> > > >>> >>>>>
> > > >>> >>>>> On Wed, Apr 5, 2023 at 12:52?PM Michael Wechner
> > > >>> >>>>> <michael.wechner@wyona.com> wrote:
> > > >>> >>>>>> Thanks for your feedback!
> > > >>> >>>>>>
> > > >>> >>>>>> I agree, that it should not crash.
> > > >>> >>>>>>
> > > >>> >>>>>> So far we did not experience crashes ourselves, but we did
> not index
> > > >>> >>>>>> millions of vectors.
> > > >>> >>>>>>
> > > >>> >>>>>> I will try to reproduce the crash, maybe this will help us
> to move forward.
> > > >>> >>>>>>
> > > >>> >>>>>> Thanks
> > > >>> >>>>>>
> > > >>> >>>>>> Michael
> > > >>> >>>>>>
> > > >>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> > > >>> >>>>>>>> Can you describe your crash in more detail?
> > > >>> >>>>>>> I can't. That experiment was a while ago and a quick test
> to see if I
> > > >>> >>>>>>> could index rather large-ish USPTO (patent office) data as
> vectors.
> > > >>> >>>>>>> Couldn't do it then.
> > > >>> >>>>>>>
> > > >>> >>>>>>>> How much RAM?
> > > >>> >>>>>>> My indexing jobs run with rather smallish heaps to give
> space for I/O
> > > >>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been
> the problem.
> > > >>> >>>>>>> I recall segment merging grew slower and slower and then
> simply
> > > >>> >>>>>>> crashed. Lucene should work with low heap requirements,
> even if it
> > > >>> >>>>>>> slows down. Throwing ram at the indexing/ segment merging
> problem
> > > >>> >>>>>>> is... I don't know - not elegant?
> > > >>> >>>>>>>
> > > >>> >>>>>>> Anyway. My main point was to remind folks about how Apache
> works -
> > > >>> >>>>>>> code is merged in when there are no vetoes. If Rob (or
> anybody else)
> > > >>> >>>>>>> remains unconvinced, he or she can block the change. (I
> didn't invent
> > > >>> >>>>>>> those rules).
> > > >>> >>>>>>>
> > > >>> >>>>>>> D.
> > > >>> >>>>>>>
> > > >>> >>>>>>>
> ---------------------------------------------------------------------
> > > >>> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> >>>>>>> For additional commands, e-mail:
> dev-help@lucene.apache.org
> > > >>> >>>>>>>
> > > >>> >>>>>>
> ---------------------------------------------------------------------
> > > >>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> >>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>> >>>>>>
> > > >>> >>>>>
> ---------------------------------------------------------------------
> > > >>> >>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> >>>>> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>> >>>>>
> > > >>> >>>
> ---------------------------------------------------------------------
> > > >>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>> >>>
> > > >>> >>
> > > >>> >>
> ---------------------------------------------------------------------
> > > >>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>> >>
> > > >>> >
> ---------------------------------------------------------------------
> > > >>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> > For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>> >
> > > >>>
> > > >>>
> > > >>>
> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >>> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>>
> > > >>
> > > >>
> > > >> --
> > > >> Marcus Eagan
> > > >>
> > > >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Marcus Eagan
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
>
> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>
> Robert, since you're the only on-the-record veto here, does this
> change your thinking at all, or if not could you share some test
> results that didn't go the way you expected? Maybe we can find some
> mitigation if we focus on a specific issue.
>

My scale concerns are both space and time. What does the execution
time look like if you don't set insanely large IW rambuffer? The
default is 16MB. Just concerned we're shoving some problems under the
rug :)

Even with the yuge RAMbuffer, we're still talking about almost 2 hours
to index 4M documents with these 2k vectors. Whereas you'd measure
this in seconds with typical lucene indexing, its nothing.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
From all I have seen when hooking up JFR when indexing a medium number of
vectors(1M +), almost all the time is spent simply comparing the vectors
(e.g. dot_product).

This indicates to me that another algorithm won't really help index build
time tremendously. Unless others do dramatically fewer vector comparisons
(from what I can tell, this is at least not true for DiskAnn, unless some
fancy footwork is done when building the PQ codebook).

I would also say comparing vector index build time to indexing terms are
apples and oranges. Yeah, they both live in Lucene, but the number of
calculations required (no matter the data structure used), will be
magnitudes greater.


On Fri, Apr 7, 2023, 4:59 PM Robert Muir <rcmuir@gmail.com> wrote:

> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
>
> My scale concerns are both space and time. What does the execution
> time look like if you don't set insanely large IW rambuffer? The
> default is 16MB. Just concerned we're shoving some problems under the
> rug :)
>
> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
> to index 4M documents with these 2k vectors. Whereas you'd measure
> this in seconds with typical lucene indexing, its nothing.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]
On Fri, Apr 7, 2023 at 5:13?PM Benjamin Trent <ben.w.trent@gmail.com> wrote:
>
> From all I have seen when hooking up JFR when indexing a medium number of vectors(1M +), almost all the time is spent simply comparing the vectors (e.g. dot_product).
>
> This indicates to me that another algorithm won't really help index build time tremendously. Unless others do dramatically fewer vector comparisons (from what I can tell, this is at least not true for DiskAnn, unless some fancy footwork is done when building the PQ codebook).
>
> I would also say comparing vector index build time to indexing terms are apples and oranges. Yeah, they both live in Lucene, but the number of calculations required (no matter the data structure used), will be magnitudes greater.
>

I'm not sure, i think this slowness due to massive number of
comparisons is just another side effect of the unscalable algorithm.
It is designed to build an in-memory datastructure and "merge" means
"rebuild". And since we fully rebuild a new one when merging, you get
something like O(n^2) total indexing when you take merges into
account.
Some of the other algorithms... in fact support merging. The DiskANN
paper has like a "chapter" on this.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

1 2 3 4  View All