Mailing List Archive: [Proposal] Remove max number of dimensions for KNN vectors

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

Apr 7, 2023, 2:28 PM

Post #51 of 99 (176 views)

The inference time (and cost) to generate these big vectors must be quite
large too ;).
Regarding the ram buffer, we could drastically reduce the size by writing
the vectors on disk instead of keeping them in the heap. With 1k dimensions
the ram buffer is filled with these vectors quite rapidly.

On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:

> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
>
> My scale concerns are both space and time. What does the execution
> time look like if you don't set insanely large IW rambuffer? The
> default is 16MB. Just concerned we're shoving some problems under the
> rug :)
>
> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
> to index 4M documents with these 2k vectors. Whereas you'd measure
> this in seconds with typical lucene indexing, its nothing.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jim.ferenczi at gmail

Apr 7, 2023, 2:36 PM

Post #52 of 99 (176 views)

Permalink

I am also not sure that diskann would solve the merging issue. The idea
describe in the paper is to run kmeans first to create multiple graphs, one
per cluster. In our case the vectors in each segment could belong to
different cluster so I don’t see how we could merge them efficiently.

On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:

> The inference time (and cost) to generate these big vectors must be quite
> large too ;).
> Regarding the ram buffer, we could drastically reduce the size by writing
> the vectors on disk instead of keeping them in the heap. With 1k dimensions
> the ram buffer is filled with these vectors quite rapidly.
>
> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>
>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >
>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
>> size=1994)
>> >
>> > Robert, since you're the only on-the-record veto here, does this
>> > change your thinking at all, or if not could you share some test
>> > results that didn't go the way you expected? Maybe we can find some
>> > mitigation if we focus on a specific issue.
>> >
>>
>> My scale concerns are both space and time. What does the execution
>> time look like if you don't set insanely large IW rambuffer? The
>> default is 16MB. Just concerned we're shoving some problems under the
>> rug :)
>>
>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>> to index 4M documents with these 2k vectors. Whereas you'd measure
>> this in seconds with typical lucene indexing, its nothing.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 7, 2023, 2:57 PM

Post #53 of 99 (176 views)

Permalink

Personally i'd have to re-read the paper, but in general the merging
issue has to be addressed somehow to fix the overall indexing time
problem. It seems it gets "dodged" with huge rambuffers in the emails
here.
Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.

As an example, I'm most familiar with adding DEFLATE compression to
stored fields. Previously, we'd basically decompress and recompress
the stored fields on merge, and LZ4 is so fast that it wasn't
obviously a problem. But with DEFLATE it got slower/heavier (more
intense compression algorithm), something had to be done or indexing
would be unacceptably slow. Hence if you look at storedfields writer,
there is "dirtiness" logic etc so that recompression is amortized over
time and doesn't happen on every merge.

On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com> wrote:
>
> I am also not sure that diskann would solve the merging issue. The idea describe in the paper is to run kmeans first to create multiple graphs, one per cluster. In our case the vectors in each segment could belong to different cluster so I don’t see how we could merge them efficiently.
>
> On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>>
>> The inference time (and cost) to generate these big vectors must be quite large too ;).
>> Regarding the ram buffer, we could drastically reduce the size by writing the vectors on disk instead of keeping them in the heap. With 1k dimensions the ram buffer is filled with these vectors quite rapidly.
>>
>> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>> >
>>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>>> >
>>> > Robert, since you're the only on-the-record veto here, does this
>>> > change your thinking at all, or if not could you share some test
>>> > results that didn't go the way you expected? Maybe we can find some
>>> > mitigation if we focus on a specific issue.
>>> >
>>>
>>> My scale concerns are both space and time. What does the execution
>>> time look like if you don't set insanely large IW rambuffer? The
>>> default is 16MB. Just concerned we're shoving some problems under the
>>> rug :)
>>>
>>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>> this in seconds with typical lucene indexing, its nothing.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jim.ferenczi at gmail

Apr 7, 2023, 3:02 PM

Post #54 of 99 (176 views)

Permalink

> It is designed to build an in-memory datastructure and "merge" means
"rebuild".

The main idea imo in the diskann paper is to build the graph with the full
dimensions to preserve the quality of the neighbors. At query time it uses
the reduced dimensions (using product quantization) to compute the
similarity and thus reducing the ram required by a large factor. This is
something we could do with the current implementation. I think that Michael
tested something similar with the quantization, but when applied at build
time too it reduces the quality of the graph and the overall recall.

On Fri, 7 Apr 2023 at 22:36, jim ferenczi <jim.ferenczi@gmail.com> wrote:

> I am also not sure that diskann would solve the merging issue. The idea
> describe in the paper is to run kmeans first to create multiple graphs, one
> per cluster. In our case the vectors in each segment could belong to
> different cluster so I don’t see how we could merge them efficiently.
>
> On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>
>> The inference time (and cost) to generate these big vectors must be quite
>> large too ;).
>> Regarding the ram buffer, we could drastically reduce the size by writing
>> the vectors on disk instead of keeping them in the heap. With 1k dimensions
>> the ram buffer is filled with these vectors quite rapidly.
>>
>> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com>
>>> wrote:
>>> >
>>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
>>> size=1994)
>>> >
>>> > Robert, since you're the only on-the-record veto here, does this
>>> > change your thinking at all, or if not could you share some test
>>> > results that didn't go the way you expected? Maybe we can find some
>>> > mitigation if we focus on a specific issue.
>>> >
>>>
>>> My scale concerns are both space and time. What does the execution
>>> time look like if you don't set insanely large IW rambuffer? The
>>> default is 16MB. Just concerned we're shoving some problems under the
>>> rug :)
>>>
>>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>> this in seconds with typical lucene indexing, its nothing.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jim.ferenczi at gmail

Apr 7, 2023, 3:15 PM

Post #55 of 99 (176 views)

Permalink

> Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.

Yep I agree. Personally I don t see how we can solve this without prior
knowledge of the vectors. Faiss has a nice implementation that fits
naturally with Lucene called IVF (
https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
but if we want to avoid running kmeans on every merge we d require to
provide the clusters for the entire index before indexing the first vector.
It s a complex issue…

On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:

> Personally i'd have to re-read the paper, but in general the merging
> issue has to be addressed somehow to fix the overall indexing time
> problem. It seems it gets "dodged" with huge rambuffers in the emails
> here.
> Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> As an example, I'm most familiar with adding DEFLATE compression to
> stored fields. Previously, we'd basically decompress and recompress
> the stored fields on merge, and LZ4 is so fast that it wasn't
> obviously a problem. But with DEFLATE it got slower/heavier (more
> intense compression algorithm), something had to be done or indexing
> would be unacceptably slow. Hence if you look at storedfields writer,
> there is "dirtiness" logic etc so that recompression is amortized over
> time and doesn't happen on every merge.
>
> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com>
> wrote:
> >
> > I am also not sure that diskann would solve the merging issue. The idea
> describe in the paper is to run kmeans first to create multiple graphs, one
> per cluster. In our case the vectors in each segment could belong to
> different cluster so I don’t see how we could merge them efficiently.
> >
> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com>
> wrote:
> >>
> >> The inference time (and cost) to generate these big vectors must be
> quite large too ;).
> >> Regarding the ram buffer, we could drastically reduce the size by
> writing the vectors on disk instead of keeping them in the heap. With 1k
> dimensions the ram buffer is filled with these vectors quite rapidly.
> >>
> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
> >>>
> >>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >>> >
> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer
> size=1994)
> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
> size=1994)
> >>> >
> >>> > Robert, since you're the only on-the-record veto here, does this
> >>> > change your thinking at all, or if not could you share some test
> >>> > results that didn't go the way you expected? Maybe we can find some
> >>> > mitigation if we focus on a specific issue.
> >>> >
> >>>
> >>> My scale concerns are both space and time. What does the execution
> >>> time look like if you don't set insanely large IW rambuffer? The
> >>> default is 16MB. Just concerned we're shoving some problems under the
> >>> rug :)
> >>>
> >>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
> >>> this in seconds with typical lucene indexing, its nothing.
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 7, 2023, 7:57 PM

Post #56 of 99 (176 views)

Permalink

sorry to interrupt, but I think we get side-tracked from the original
discussion to increase the vector dimension limit.

I think improving the vector indexing performance is one thing and
making sure Lucene does not crash when increasing the vector dimension
limit is another.

I think it is great to find better ways to index vectors, but I think
this should not prevent people from being able to use models with higher
vector dimensions than 1024.

The following comparison might not be perfect, but imagine we have
invented a combustion engine, which is strong enough to move a car in
the flat area, but when applying it to a truck to move things over
mountains it will fail, because it is not strong enough. Would you
prevent people from using the combustion engine for a car in the flat area?

Thanks

Michael

Am 08.04.23 um 00:15 schrieb jim ferenczi:
> > Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> Yep I agree. Personally I don t see how we can solve this without
> prior knowledge of the vectors. Faiss has a nice implementation that
> fits naturally with Lucene called IVF (
> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
> but if we want to avoid running kmeans on every merge we d require to
> provide the clusters for the entire index before indexing the first
> vector.
> It s a complex issue…
>
> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:
>
> Personally i'd have to re-read the paper, but in general the merging
> issue has to be addressed somehow to fix the overall indexing time
> problem. It seems it gets "dodged" with huge rambuffers in the emails
> here.
> Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> As an example, I'm most familiar with adding DEFLATE compression to
> stored fields. Previously, we'd basically decompress and recompress
> the stored fields on merge, and LZ4 is so fast that it wasn't
> obviously a problem. But with DEFLATE it got slower/heavier (more
> intense compression algorithm), something had to be done or indexing
> would be unacceptably slow. Hence if you look at storedfields writer,
> there is "dirtiness" logic etc so that recompression is amortized over
> time and doesn't happen on every merge.
>
> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi
> <jim.ferenczi@gmail.com> wrote:
> >
> > I am also not sure that diskann would solve the merging issue.
> The idea describe in the paper is to run kmeans first to create
> multiple graphs, one per cluster. In our case the vectors in each
> segment could belong to different cluster so I don’t see how we
> could merge them efficiently.
> >
> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi
> <jim.ferenczi@gmail.com> wrote:
> >>
> >> The inference time (and cost) to generate these big vectors
> must be quite large too ;).
> >> Regarding the ram buffer, we could drastically reduce the size
> by writing the vectors on disk instead of keeping them in the
> heap. With 1k dimensions the ram buffer is filled with these
> vectors quite rapidly.
> >>
> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
> >>>
> >>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov
> <msokolov@gmail.com> wrote:
> >>> >
> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer
> size=1994)
> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW
> buffer size=1994)
> >>> >
> >>> > Robert, since you're the only on-the-record veto here, does this
> >>> > change your thinking at all, or if not could you share some test
> >>> > results that didn't go the way you expected? Maybe we can
> find some
> >>> > mitigation if we focus on a specific issue.
> >>> >
> >>>
> >>> My scale concerns are both space and time. What does the execution
> >>> time look like if you don't set insanely large IW rambuffer? The
> >>> default is 16MB. Just concerned we're shoving some problems
> under the
> >>> rug :)
> >>>
> >>> Even with the yuge RAMbuffer, we're still talking about almost
> 2 hours
> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
> >>> this in seconds with typical lucene indexing, its nothing.
> >>>
> >>>
> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 8, 2023, 2:56 AM

Post #57 of 99 (176 views)

Permalink

Yes, that was explicitly mentioned in the original mail, improving the
vector based search of Lucene is an interesting area, but off topic here.

Let's summarise:
- We want to at least increase the limit (or remove it)
- We proved that performance are ok to do it (and we can improve them more
in the future), no harm is given to users that intend to stick to low
dimensional vectors

What are the next steps?
What apache community tool can we use to agree on a new limit/no explicit
limit (max integer)?
I think we need some sort of place where each of us propose a limit with a
motivation and we vote the best option?
Any idea on how to do it?

Cheers

On Sat, 8 Apr 2023, 03:57 Michael Wechner, <michael.wechner@wyona.com>
wrote:

> sorry to interrupt, but I think we get side-tracked from the original
> discussion to increase the vector dimension limit.
>
> I think improving the vector indexing performance is one thing and making
> sure Lucene does not crash when increasing the vector dimension limit is
> another.
>
> I think it is great to find better ways to index vectors, but I think this
> should not prevent people from being able to use models with higher vector
> dimensions than 1024.
>
> The following comparison might not be perfect, but imagine we have
> invented a combustion engine, which is strong enough to move a car in the
> flat area, but when applying it to a truck to move things over mountains it
> will fail, because it is not strong enough. Would you prevent people from
> using the combustion engine for a car in the flat area?
>
> Thanks
>
> Michael
>
>
>
> Am 08.04.23 um 00:15 schrieb jim ferenczi:
>
> > Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> Yep I agree. Personally I don t see how we can solve this without prior
> knowledge of the vectors. Faiss has a nice implementation that fits
> naturally with Lucene called IVF (
> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
> but if we want to avoid running kmeans on every merge we d require to
> provide the clusters for the entire index before indexing the first vector.
> It s a complex issue…
>
> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:
>
>> Personally i'd have to re-read the paper, but in general the merging
>> issue has to be addressed somehow to fix the overall indexing time
>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>> here.
>> Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> As an example, I'm most familiar with adding DEFLATE compression to
>> stored fields. Previously, we'd basically decompress and recompress
>> the stored fields on merge, and LZ4 is so fast that it wasn't
>> obviously a problem. But with DEFLATE it got slower/heavier (more
>> intense compression algorithm), something had to be done or indexing
>> would be unacceptably slow. Hence if you look at storedfields writer,
>> there is "dirtiness" logic etc so that recompression is amortized over
>> time and doesn't happen on every merge.
>>
>> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com>
>> wrote:
>> >
>> > I am also not sure that diskann would solve the merging issue. The idea
>> describe in the paper is to run kmeans first to create multiple graphs, one
>> per cluster. In our case the vectors in each segment could belong to
>> different cluster so I don’t see how we could merge them efficiently.
>> >
>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com>
>> wrote:
>> >>
>> >> The inference time (and cost) to generate these big vectors must be
>> quite large too ;).
>> >> Regarding the ram buffer, we could drastically reduce the size by
>> writing the vectors on disk instead of keeping them in the heap. With 1k
>> dimensions the ram buffer is filled with these vectors quite rapidly.
>> >>
>> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>> >>>
>> >>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >>> >
>> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer
>> size=1994)
>> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
>> size=1994)
>> >>> >
>> >>> > Robert, since you're the only on-the-record veto here, does this
>> >>> > change your thinking at all, or if not could you share some test
>> >>> > results that didn't go the way you expected? Maybe we can find some
>> >>> > mitigation if we focus on a specific issue.
>> >>> >
>> >>>
>> >>> My scale concerns are both space and time. What does the execution
>> >>> time look like if you don't set insanely large IW rambuffer? The
>> >>> default is 16MB. Just concerned we're shoving some problems under the
>> >>> rug :)
>> >>>
>> >>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
>> >>> this in seconds with typical lucene indexing, its nothing.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 8, 2023, 4:04 AM

Post #58 of 99 (176 views)

Permalink

I don't think we have. The performance needs to be reasonable in order
to bump this limit. Otherwise bumping this limit makes the worst-case
2x worse than it already is!

Moreover, its clear something needs to happen to address the
scalability/lack of performance. I'd hate for this limit to be in the
way of that. Because of backwards compatibility, it's a one-way,
permanent, irreversible change.

I'm not sold by any means in any way yet. My vote remains the same.

On Fri, Apr 7, 2023 at 10:57?PM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> sorry to interrupt, but I think we get side-tracked from the original discussion to increase the vector dimension limit.
>
> I think improving the vector indexing performance is one thing and making sure Lucene does not crash when increasing the vector dimension limit is another.
>
> I think it is great to find better ways to index vectors, but I think this should not prevent people from being able to use models with higher vector dimensions than 1024.
>
> The following comparison might not be perfect, but imagine we have invented a combustion engine, which is strong enough to move a car in the flat area, but when applying it to a truck to move things over mountains it will fail, because it is not strong enough. Would you prevent people from using the combustion engine for a car in the flat area?
>
> Thanks
>
> Michael
>
>
>
> Am 08.04.23 um 00:15 schrieb jim ferenczi:
>
> > Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> Yep I agree. Personally I don t see how we can solve this without prior knowledge of the vectors. Faiss has a nice implementation that fits naturally with Lucene called IVF (
> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
> but if we want to avoid running kmeans on every merge we d require to provide the clusters for the entire index before indexing the first vector.
> It s a complex issue…
>
> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:
>>
>> Personally i'd have to re-read the paper, but in general the merging
>> issue has to be addressed somehow to fix the overall indexing time
>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>> here.
>> Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> As an example, I'm most familiar with adding DEFLATE compression to
>> stored fields. Previously, we'd basically decompress and recompress
>> the stored fields on merge, and LZ4 is so fast that it wasn't
>> obviously a problem. But with DEFLATE it got slower/heavier (more
>> intense compression algorithm), something had to be done or indexing
>> would be unacceptably slow. Hence if you look at storedfields writer,
>> there is "dirtiness" logic etc so that recompression is amortized over
>> time and doesn't happen on every merge.
>>
>> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com> wrote:
>> >
>> > I am also not sure that diskann would solve the merging issue. The idea describe in the paper is to run kmeans first to create multiple graphs, one per cluster. In our case the vectors in each segment could belong to different cluster so I don’t see how we could merge them efficiently.
>> >
>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>> >>
>> >> The inference time (and cost) to generate these big vectors must be quite large too ;).
>> >> Regarding the ram buffer, we could drastically reduce the size by writing the vectors on disk instead of keeping them in the heap. With 1k dimensions the ram buffer is filled with these vectors quite rapidly.
>> >>
>> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>> >>>
>> >>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
>> >>> >
>> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>> >>> >
>> >>> > Robert, since you're the only on-the-record veto here, does this
>> >>> > change your thinking at all, or if not could you share some test
>> >>> > results that didn't go the way you expected? Maybe we can find some
>> >>> > mitigation if we focus on a specific issue.
>> >>> >
>> >>>
>> >>> My scale concerns are both space and time. What does the execution
>> >>> time look like if you don't set insanely large IW rambuffer? The
>> >>> default is 16MB. Just concerned we're shoving some problems under the
>> >>> rug :)
>> >>>
>> >>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
>> >>> this in seconds with typical lucene indexing, its nothing.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 8, 2023, 4:06 AM

Post #59 of 99 (176 views)

Permalink

Great way to try to meet me in the middle and win me over, basically
just dismiss my concerns. This is not going to achieve what you want.

On Sat, Apr 8, 2023 at 5:56?AM Alessandro Benedetti
<a.benedetti@sease.io> wrote:
>
> Yes, that was explicitly mentioned in the original mail, improving the vector based search of Lucene is an interesting area, but off topic here.
>
> Let's summarise:
> - We want to at least increase the limit (or remove it)
> - We proved that performance are ok to do it (and we can improve them more in the future), no harm is given to users that intend to stick to low dimensional vectors
>
> What are the next steps?
> What apache community tool can we use to agree on a new limit/no explicit limit (max integer)?
> I think we need some sort of place where each of us propose a limit with a motivation and we vote the best option?
> Any idea on how to do it?
>
> Cheers
>
> On Sat, 8 Apr 2023, 03:57 Michael Wechner, <michael.wechner@wyona.com> wrote:
>>
>> sorry to interrupt, but I think we get side-tracked from the original discussion to increase the vector dimension limit.
>>
>> I think improving the vector indexing performance is one thing and making sure Lucene does not crash when increasing the vector dimension limit is another.
>>
>> I think it is great to find better ways to index vectors, but I think this should not prevent people from being able to use models with higher vector dimensions than 1024.
>>
>> The following comparison might not be perfect, but imagine we have invented a combustion engine, which is strong enough to move a car in the flat area, but when applying it to a truck to move things over mountains it will fail, because it is not strong enough. Would you prevent people from using the combustion engine for a car in the flat area?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 08.04.23 um 00:15 schrieb jim ferenczi:
>>
>> > Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> Yep I agree. Personally I don t see how we can solve this without prior knowledge of the vectors. Faiss has a nice implementation that fits naturally with Lucene called IVF (
>> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
>> but if we want to avoid running kmeans on every merge we d require to provide the clusters for the entire index before indexing the first vector.
>> It s a complex issue…
>>
>> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:
>>>
>>> Personally i'd have to re-read the paper, but in general the merging
>>> issue has to be addressed somehow to fix the overall indexing time
>>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>>> here.
>>> Keep in mind, there may be other ways to do it. In general if merging
>>> something is going to be "heavyweight", we should think about it to
>>> prevent things from going really bad overall.
>>>
>>> As an example, I'm most familiar with adding DEFLATE compression to
>>> stored fields. Previously, we'd basically decompress and recompress
>>> the stored fields on merge, and LZ4 is so fast that it wasn't
>>> obviously a problem. But with DEFLATE it got slower/heavier (more
>>> intense compression algorithm), something had to be done or indexing
>>> would be unacceptably slow. Hence if you look at storedfields writer,
>>> there is "dirtiness" logic etc so that recompression is amortized over
>>> time and doesn't happen on every merge.
>>>
>>> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com> wrote:
>>> >
>>> > I am also not sure that diskann would solve the merging issue. The idea describe in the paper is to run kmeans first to create multiple graphs, one per cluster. In our case the vectors in each segment could belong to different cluster so I don’t see how we could merge them efficiently.
>>> >
>>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>>> >>
>>> >> The inference time (and cost) to generate these big vectors must be quite large too ;).
>>> >> Regarding the ram buffer, we could drastically reduce the size by writing the vectors on disk instead of keeping them in the heap. With 1k dimensions the ram buffer is filled with these vectors quite rapidly.
>>> >>
>>> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>>> >>>
>>> >>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>> >>> >
>>> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>>> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>>> >>> >
>>> >>> > Robert, since you're the only on-the-record veto here, does this
>>> >>> > change your thinking at all, or if not could you share some test
>>> >>> > results that didn't go the way you expected? Maybe we can find some
>>> >>> > mitigation if we focus on a specific issue.
>>> >>> >
>>> >>>
>>> >>> My scale concerns are both space and time. What does the execution
>>> >>> time look like if you don't set insanely large IW rambuffer? The
>>> >>> default is 16MB. Just concerned we're shoving some problems under the
>>> >>> rug :)
>>> >>>
>>> >>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>>> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>> >>> this in seconds with typical lucene indexing, its nothing.
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

michael.wechner at wyona

Apr 8, 2023, 5:33 AM

Post #60 of 99 (176 views)

Permalink

What exactly do you consider reasonable?

I think it would help if we could specify concrete requirements re
performance and scalability, because then we have a concrete goal which
we can work with.
Do such requirements already exist or what would be a good starting point?

Re 2x worse, I think Michael Sokolov already pointed out that things
take longer linearly with vector dimension, which is quite obvious for
example for a brute force implementation. I would argue this will be the
case for any implementation.

And last I would like to ask again, slightly different, do we want
people to use Lucene, which will give us an opportunity to learn from
and progress?

Thanks

Michael

Am 08.04.23 um 13:04 schrieb Robert Muir:
> I don't think we have. The performance needs to be reasonable in order
> to bump this limit. Otherwise bumping this limit makes the worst-case
> 2x worse than it already is!
>
> Moreover, its clear something needs to happen to address the
> scalability/lack of performance. I'd hate for this limit to be in the
> way of that. Because of backwards compatibility, it's a one-way,
> permanent, irreversible change.
>
> I'm not sold by any means in any way yet. My vote remains the same.
>
> On Fri, Apr 7, 2023 at 10:57?PM Michael Wechner
> <michael.wechner@wyona.com> wrote:
>> sorry to interrupt, but I think we get side-tracked from the original discussion to increase the vector dimension limit.
>>
>> I think improving the vector indexing performance is one thing and making sure Lucene does not crash when increasing the vector dimension limit is another.
>>
>> I think it is great to find better ways to index vectors, but I think this should not prevent people from being able to use models with higher vector dimensions than 1024.
>>
>> The following comparison might not be perfect, but imagine we have invented a combustion engine, which is strong enough to move a car in the flat area, but when applying it to a truck to move things over mountains it will fail, because it is not strong enough. Would you prevent people from using the combustion engine for a car in the flat area?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 08.04.23 um 00:15 schrieb jim ferenczi:
>>
>>> Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> Yep I agree. Personally I don t see how we can solve this without prior knowledge of the vectors. Faiss has a nice implementation that fits naturally with Lucene called IVF (
>> https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
>> but if we want to avoid running kmeans on every merge we d require to provide the clusters for the entire index before indexing the first vector.
>> It s a complex issue…
>>
>> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcmuir@gmail.com> wrote:
>>> Personally i'd have to re-read the paper, but in general the merging
>>> issue has to be addressed somehow to fix the overall indexing time
>>> problem. It seems it gets "dodged" with huge rambuffers in the emails
>>> here.
>>> Keep in mind, there may be other ways to do it. In general if merging
>>> something is going to be "heavyweight", we should think about it to
>>> prevent things from going really bad overall.
>>>
>>> As an example, I'm most familiar with adding DEFLATE compression to
>>> stored fields. Previously, we'd basically decompress and recompress
>>> the stored fields on merge, and LZ4 is so fast that it wasn't
>>> obviously a problem. But with DEFLATE it got slower/heavier (more
>>> intense compression algorithm), something had to be done or indexing
>>> would be unacceptably slow. Hence if you look at storedfields writer,
>>> there is "dirtiness" logic etc so that recompression is amortized over
>>> time and doesn't happen on every merge.
>>>
>>> On Fri, Apr 7, 2023 at 5:38?PM jim ferenczi <jim.ferenczi@gmail.com> wrote:
>>>> I am also not sure that diskann would solve the merging issue. The idea describe in the paper is to run kmeans first to create multiple graphs, one per cluster. In our case the vectors in each segment could belong to different cluster so I don’t see how we could merge them efficiently.
>>>>
>>>> On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>>>>> The inference time (and cost) to generate these big vectors must be quite large too ;).
>>>>> Regarding the ram buffer, we could drastically reduce the size by writing the vectors on disk instead of keeping them in the heap. With 1k dimensions the ram buffer is filled with these vectors quite rapidly.
>>>>>
>>>>> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcmuir@gmail.com> wrote:
>>>>>> On Fri, Apr 7, 2023 at 7:47?AM Michael Sokolov <msokolov@gmail.com> wrote:
>>>>>>> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>>>>>>> 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>>>>>>>
>>>>>>> Robert, since you're the only on-the-record veto here, does this
>>>>>>> change your thinking at all, or if not could you share some test
>>>>>>> results that didn't go the way you expected? Maybe we can find some
>>>>>>> mitigation if we focus on a specific issue.
>>>>>>>
>>>>>> My scale concerns are both space and time. What does the execution
>>>>>> time look like if you don't set insanely large IW rambuffer? The
>>>>>> default is 16MB. Just concerned we're shoving some problems under the
>>>>>> rug :)
>>>>>>
>>>>>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>>>>>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>>>>> this in seconds with typical lucene indexing, its nothing.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 8, 2023, 9:39 AM

Post #61 of 99 (176 views)

Permalink

On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
<michael.wechner@wyona.com> wrote:
>
> What exactly do you consider reasonable?

Let's begin a real discussion by being HONEST about the current
status. Please put politically correct or your own company's wishes
aside, we know it's not in a good state.

Current status is the one guy who wrote the code can set a
multi-gigabyte ram buffer and index a small dataset with 1024
dimensions in HOURS (i didn't ask what hardware).

My concerns are everyone else except the one guy, I want it to be
usable. Increasing dimensions just means even bigger multi-gigabyte
ram buffer and bigger heap to avoid OOM on merge.
It is also a permanent backwards compatibility decision, we have to
support it once we do this and we can't just say "oops" and flip it
back.

It is unclear to me, if the multi-gigabyte ram buffer is really to
avoid merges because they are so slow and it would be DAYS otherwise,
or if its to avoid merges so it doesn't hit OOM.
Also from personal experience, it takes trial and error (means
experiencing OOM on merge!!!) before you get those heap values correct
for your dataset. This usually means starting over which is
frustrating and wastes more time.

Jim mentioned some ideas about the memory usage in IndexWriter, seems
to me like its a good idea. maybe the multigigabyte ram buffer can be
avoided in this way and performance improved by writing bigger
segments with lucene's defaults. But this doesn't mean we can simply
ignore the horrors of what happens on merge. merging needs to scale so
that indexing really scales.

At least it shouldnt spike RAM on trivial data amounts and cause OOM,
and definitely it shouldnt burn hours and hours of CPU in O(n^2)
fashion when indexing.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 8, 2023, 9:50 AM

Post #62 of 99 (176 views)

Permalink

What you said about increasing dimensions requiring a bigger ram buffer on
merge is wrong. That's the point I was trying to make. Your concerns about
merge costs are not wrong, but your conclusion that we need to limit
dimensions is not justified.

You complain that hnsw sucks it doesn't scale, but when I show it scales
linearly with dimension you just ignore that and complain about something
entirely different.

You demand that people run all kinds of tests to prove you wrong but when
they do, you don't listen and you won't put in the work yourself or
complain that it's too hard.

Then you complain about people not meeting you half way. Wow

On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:

> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> <michael.wechner@wyona.com> wrote:
> >
> > What exactly do you consider reasonable?
>
> Let's begin a real discussion by being HONEST about the current
> status. Please put politically correct or your own company's wishes
> aside, we know it's not in a good state.
>
> Current status is the one guy who wrote the code can set a
> multi-gigabyte ram buffer and index a small dataset with 1024
> dimensions in HOURS (i didn't ask what hardware).
>
> My concerns are everyone else except the one guy, I want it to be
> usable. Increasing dimensions just means even bigger multi-gigabyte
> ram buffer and bigger heap to avoid OOM on merge.
> It is also a permanent backwards compatibility decision, we have to
> support it once we do this and we can't just say "oops" and flip it
> back.
>
> It is unclear to me, if the multi-gigabyte ram buffer is really to
> avoid merges because they are so slow and it would be DAYS otherwise,
> or if its to avoid merges so it doesn't hit OOM.
> Also from personal experience, it takes trial and error (means
> experiencing OOM on merge!!!) before you get those heap values correct
> for your dataset. This usually means starting over which is
> frustrating and wastes more time.
>
> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> to me like its a good idea. maybe the multigigabyte ram buffer can be
> avoided in this way and performance improved by writing bigger
> segments with lucene's defaults. But this doesn't mean we can simply
> ignore the horrors of what happens on merge. merging needs to scale so
> that indexing really scales.
>
> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> fashion when indexing.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 8, 2023, 9:57 AM

Post #63 of 99 (176 views)

Permalink

I disagree with your categorization. I put in plenty of work and
experienced plenty of pain myself, writing tests and fighting these
issues, after i saw that, two releases in a row, vector indexing fell
over and hit integer overflows etc on small datasets:

https://github.com/apache/lucene/pull/11905

Attacking me isn't helping the situation.

PS: when i said the "one guy who wrote the code" I didn't mean it in
any kind of demeaning fashion really. I meant to describe the current
state of usability with respect to indexing a few million docs with
high dimensions. You can scroll up the thread and see that at least
one other committer on the project experienced similar pain as me.
Then, think about users who aren't committers trying to use the
functionality!

On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>
> What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>
> You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>
> You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>
> Then you complain about people not meeting you half way. Wow
>
> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> <michael.wechner@wyona.com> wrote:
>> >
>> > What exactly do you consider reasonable?
>>
>> Let's begin a real discussion by being HONEST about the current
>> status. Please put politically correct or your own company's wishes
>> aside, we know it's not in a good state.
>>
>> Current status is the one guy who wrote the code can set a
>> multi-gigabyte ram buffer and index a small dataset with 1024
>> dimensions in HOURS (i didn't ask what hardware).
>>
>> My concerns are everyone else except the one guy, I want it to be
>> usable. Increasing dimensions just means even bigger multi-gigabyte
>> ram buffer and bigger heap to avoid OOM on merge.
>> It is also a permanent backwards compatibility decision, we have to
>> support it once we do this and we can't just say "oops" and flip it
>> back.
>>
>> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> avoid merges because they are so slow and it would be DAYS otherwise,
>> or if its to avoid merges so it doesn't hit OOM.
>> Also from personal experience, it takes trial and error (means
>> experiencing OOM on merge!!!) before you get those heap values correct
>> for your dataset. This usually means starting over which is
>> frustrating and wastes more time.
>>
>> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> avoided in this way and performance improved by writing bigger
>> segments with lucene's defaults. But this doesn't mean we can simply
>> ignore the horrors of what happens on merge. merging needs to scale so
>> that indexing really scales.
>>
>> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> fashion when indexing.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 8, 2023, 12:28 PM

Post #64 of 99 (176 views)

Permalink

I am very attentive to listen opinions but I am un-convinced here and I an
not sure that a single person opinion should be allowed to be detrimental
for such an important project.

The limit as far as I know is literally just raising an exception.
Removing it won't alter in any way the current performance for users in low
dimensional space.
Removing it will just enable more users to use Lucene.

If new users in certain situations will be unhappy with the performance,
they may contribute improvements.
This is how you make progress.

If it's a reputation thing, trust me that not allowing users to play with
high dimensional space will equally damage it.

To me it's really a no brainer.
Removing the limit and enable people to use high dimensional vectors will
take minutes.
Improving the hnsw implementation can take months.
Pick one to begin with...

And there's no-one paying me here, no company interest whatsoever, actually
I pay people to contribute, I am just convinced it's a good idea.

On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:

> I disagree with your categorization. I put in plenty of work and
> experienced plenty of pain myself, writing tests and fighting these
> issues, after i saw that, two releases in a row, vector indexing fell
> over and hit integer overflows etc on small datasets:
>
> https://github.com/apache/lucene/pull/11905
>
> Attacking me isn't helping the situation.
>
> PS: when i said the "one guy who wrote the code" I didn't mean it in
> any kind of demeaning fashion really. I meant to describe the current
> state of usability with respect to indexing a few million docs with
> high dimensions. You can scroll up the thread and see that at least
> one other committer on the project experienced similar pain as me.
> Then, think about users who aren't committers trying to use the
> functionality!
>
> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >
> > What you said about increasing dimensions requiring a bigger ram buffer
> on merge is wrong. That's the point I was trying to make. Your concerns
> about merge costs are not wrong, but your conclusion that we need to limit
> dimensions is not justified.
> >
> > You complain that hnsw sucks it doesn't scale, but when I show it scales
> linearly with dimension you just ignore that and complain about something
> entirely different.
> >
> > You demand that people run all kinds of tests to prove you wrong but
> when they do, you don't listen and you won't put in the work yourself or
> complain that it's too hard.
> >
> > Then you complain about people not meeting you half way. Wow
> >
> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
> >>
> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> >> <michael.wechner@wyona.com> wrote:
> >> >
> >> > What exactly do you consider reasonable?
> >>
> >> Let's begin a real discussion by being HONEST about the current
> >> status. Please put politically correct or your own company's wishes
> >> aside, we know it's not in a good state.
> >>
> >> Current status is the one guy who wrote the code can set a
> >> multi-gigabyte ram buffer and index a small dataset with 1024
> >> dimensions in HOURS (i didn't ask what hardware).
> >>
> >> My concerns are everyone else except the one guy, I want it to be
> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> >> ram buffer and bigger heap to avoid OOM on merge.
> >> It is also a permanent backwards compatibility decision, we have to
> >> support it once we do this and we can't just say "oops" and flip it
> >> back.
> >>
> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> >> avoid merges because they are so slow and it would be DAYS otherwise,
> >> or if its to avoid merges so it doesn't hit OOM.
> >> Also from personal experience, it takes trial and error (means
> >> experiencing OOM on merge!!!) before you get those heap values correct
> >> for your dataset. This usually means starting over which is
> >> frustrating and wastes more time.
> >>
> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
> >> avoided in this way and performance improved by writing bigger
> >> segments with lucene's defaults. But this doesn't mean we can simply
> >> ignore the horrors of what happens on merge. merging needs to scale so
> >> that indexing really scales.
> >>
> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> >> fashion when indexing.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

ichattopadhyaya at gmail

Apr 8, 2023, 12:53 PM

Post #65 of 99 (176 views)

Permalink

Can the limit be raised using Java reflection at run time? Or is there more
to it that needs to be changed?

On Sun, 9 Apr, 2023, 12:58 am Alessandro Benedetti, <a.benedetti@sease.io>
wrote:

> I am very attentive to listen opinions but I am un-convinced here and I an
> not sure that a single person opinion should be allowed to be detrimental
> for such an important project.
>
> The limit as far as I know is literally just raising an exception.
> Removing it won't alter in any way the current performance for users in
> low dimensional space.
> Removing it will just enable more users to use Lucene.
>
> If new users in certain situations will be unhappy with the performance,
> they may contribute improvements.
> This is how you make progress.
>
> If it's a reputation thing, trust me that not allowing users to play with
> high dimensional space will equally damage it.
>
> To me it's really a no brainer.
> Removing the limit and enable people to use high dimensional vectors will
> take minutes.
> Improving the hnsw implementation can take months.
> Pick one to begin with...
>
> And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
>
>
> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>
>> I disagree with your categorization. I put in plenty of work and
>> experienced plenty of pain myself, writing tests and fighting these
>> issues, after i saw that, two releases in a row, vector indexing fell
>> over and hit integer overflows etc on small datasets:
>>
>> https://github.com/apache/lucene/pull/11905
>>
>> Attacking me isn't helping the situation.
>>
>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> any kind of demeaning fashion really. I meant to describe the current
>> state of usability with respect to indexing a few million docs with
>> high dimensions. You can scroll up the thread and see that at least
>> one other committer on the project experienced similar pain as me.
>> Then, think about users who aren't committers trying to use the
>> functionality!
>>
>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>> >
>> > What you said about increasing dimensions requiring a bigger ram buffer
>> on merge is wrong. That's the point I was trying to make. Your concerns
>> about merge costs are not wrong, but your conclusion that we need to limit
>> dimensions is not justified.
>> >
>> > You complain that hnsw sucks it doesn't scale, but when I show it
>> scales linearly with dimension you just ignore that and complain about
>> something entirely different.
>> >
>> > You demand that people run all kinds of tests to prove you wrong but
>> when they do, you don't listen and you won't put in the work yourself or
>> complain that it's too hard.
>> >
>> > Then you complain about people not meeting you half way. Wow
>> >
>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>> >>
>> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> >> <michael.wechner@wyona.com> wrote:
>> >> >
>> >> > What exactly do you consider reasonable?
>> >>
>> >> Let's begin a real discussion by being HONEST about the current
>> >> status. Please put politically correct or your own company's wishes
>> >> aside, we know it's not in a good state.
>> >>
>> >> Current status is the one guy who wrote the code can set a
>> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> >> dimensions in HOURS (i didn't ask what hardware).
>> >>
>> >> My concerns are everyone else except the one guy, I want it to be
>> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>> >> ram buffer and bigger heap to avoid OOM on merge.
>> >> It is also a permanent backwards compatibility decision, we have to
>> >> support it once we do this and we can't just say "oops" and flip it
>> >> back.
>> >>
>> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> >> avoid merges because they are so slow and it would be DAYS otherwise,
>> >> or if its to avoid merges so it doesn't hit OOM.
>> >> Also from personal experience, it takes trial and error (means
>> >> experiencing OOM on merge!!!) before you get those heap values correct
>> >> for your dataset. This usually means starting over which is
>> >> frustrating and wastes more time.
>> >>
>> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> >> avoided in this way and performance improved by writing bigger
>> >> segments with lucene's defaults. But this doesn't mean we can simply
>> >> ignore the horrors of what happens on merge. merging needs to scale so
>> >> that indexing really scales.
>> >>
>> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> >> fashion when indexing.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 8, 2023, 2:01 PM

Post #66 of 99 (176 views)

Permalink

well, it's a final variable. But you could maybe extend KnnVectorField
to get around this limit? I think that's the only place it's currently
enforced

On Sat, Apr 8, 2023 at 3:54?PM Ishan Chattopadhyaya
<ichattopadhyaya@gmail.com> wrote:
>
> Can the limit be raised using Java reflection at run time? Or is there more to it that needs to be changed?
>
> On Sun, 9 Apr, 2023, 12:58 am Alessandro Benedetti, <a.benedetti@sease.io> wrote:
>>
>> I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>>
>> The limit as far as I know is literally just raising an exception.
>> Removing it won't alter in any way the current performance for users in low dimensional space.
>> Removing it will just enable more users to use Lucene.
>>
>> If new users in certain situations will be unhappy with the performance, they may contribute improvements.
>> This is how you make progress.
>>
>> If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>>
>> To me it's really a no brainer.
>> Removing the limit and enable people to use high dimensional vectors will take minutes.
>> Improving the hnsw implementation can take months.
>> Pick one to begin with...
>>
>> And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>>
>>
>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>>
>>> I disagree with your categorization. I put in plenty of work and
>>> experienced plenty of pain myself, writing tests and fighting these
>>> issues, after i saw that, two releases in a row, vector indexing fell
>>> over and hit integer overflows etc on small datasets:
>>>
>>> https://github.com/apache/lucene/pull/11905
>>>
>>> Attacking me isn't helping the situation.
>>>
>>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>>> any kind of demeaning fashion really. I meant to describe the current
>>> state of usability with respect to indexing a few million docs with
>>> high dimensions. You can scroll up the thread and see that at least
>>> one other committer on the project experienced similar pain as me.
>>> Then, think about users who aren't committers trying to use the
>>> functionality!
>>>
>>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>>> >
>>> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>> >
>>> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>> >
>>> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>>> >
>>> > Then you complain about people not meeting you half way. Wow
>>> >
>>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> >>
>>> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>>> >> <michael.wechner@wyona.com> wrote:
>>> >> >
>>> >> > What exactly do you consider reasonable?
>>> >>
>>> >> Let's begin a real discussion by being HONEST about the current
>>> >> status. Please put politically correct or your own company's wishes
>>> >> aside, we know it's not in a good state.
>>> >>
>>> >> Current status is the one guy who wrote the code can set a
>>> >> multi-gigabyte ram buffer and index a small dataset with 1024
>>> >> dimensions in HOURS (i didn't ask what hardware).
>>> >>
>>> >> My concerns are everyone else except the one guy, I want it to be
>>> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>>> >> ram buffer and bigger heap to avoid OOM on merge.
>>> >> It is also a permanent backwards compatibility decision, we have to
>>> >> support it once we do this and we can't just say "oops" and flip it
>>> >> back.
>>> >>
>>> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>>> >> avoid merges because they are so slow and it would be DAYS otherwise,
>>> >> or if its to avoid merges so it doesn't hit OOM.
>>> >> Also from personal experience, it takes trial and error (means
>>> >> experiencing OOM on merge!!!) before you get those heap values correct
>>> >> for your dataset. This usually means starting over which is
>>> >> frustrating and wastes more time.
>>> >>
>>> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>>> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>>> >> avoided in this way and performance improved by writing bigger
>>> >> segments with lucene's defaults. But this doesn't mean we can simply
>>> >> ignore the horrors of what happens on merge. merging needs to scale so
>>> >> that indexing really scales.
>>> >>
>>> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>>> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>> >> fashion when indexing.
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

jpountz at gmail

Apr 8, 2023, 2:29 PM

Post #67 of 99 (176 views)

Permalink

As Dawid pointed out earlier on this thread, this is the rule for
Apache projects: a single -1 vote on a code change is a veto and
cannot be overridden. Furthermore, Robert is one of the people on this
project who worked the most on debugging subtle bugs, making Lucene
more robust and improving our test framework, so I'm listening when he
voices quality concerns.

The argument against removing/raising the limit that resonates with me
the most is that it is a one-way door. As MikeS highlighted earlier on
this thread, implementations may want to take advantage of the fact
that there is a limit at some point too. This is why I don't want to
remove the limit and would prefer a slight increase, such as 2048 as
suggested in the original issue, which would enable most of the things
that users who have been asking about raising the limit would like to
do.

I agree that the merge-time memory usage and slow indexing rate are
not great. But it's still possible to index multi-million vector
datasets with a 4GB heap without hitting OOMEs regardless of the
number of dimensions, and the feedback I'm seeing is that many users
are still interested in indexing multi-million vector datasets despite
the slow indexing rate. I wish we could do better, and vector indexing
is certainly more expert than text indexing, but it still is usable in
my opinion. I understand how giving Lucene more information about
vectors prior to indexing (e.g. clustering information as Jim pointed
out) could help make merging faster and more memory-efficient, but I
would really like to avoid making it a requirement for indexing
vectors as it also makes this feature much harder to use.

On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
<a.benedetti@sease.io> wrote:
>
> I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>
> The limit as far as I know is literally just raising an exception.
> Removing it won't alter in any way the current performance for users in low dimensional space.
> Removing it will just enable more users to use Lucene.
>
> If new users in certain situations will be unhappy with the performance, they may contribute improvements.
> This is how you make progress.
>
> If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>
> To me it's really a no brainer.
> Removing the limit and enable people to use high dimensional vectors will take minutes.
> Improving the hnsw implementation can take months.
> Pick one to begin with...
>
> And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>
>
> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>>
>> I disagree with your categorization. I put in plenty of work and
>> experienced plenty of pain myself, writing tests and fighting these
>> issues, after i saw that, two releases in a row, vector indexing fell
>> over and hit integer overflows etc on small datasets:
>>
>> https://github.com/apache/lucene/pull/11905
>>
>> Attacking me isn't helping the situation.
>>
>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> any kind of demeaning fashion really. I meant to describe the current
>> state of usability with respect to indexing a few million docs with
>> high dimensions. You can scroll up the thread and see that at least
>> one other committer on the project experienced similar pain as me.
>> Then, think about users who aren't committers trying to use the
>> functionality!
>>
>> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>> >
>> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>> >
>> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>> >
>> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>> >
>> > Then you complain about people not meeting you half way. Wow
>> >
>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>> >>
>> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> >> <michael.wechner@wyona.com> wrote:
>> >> >
>> >> > What exactly do you consider reasonable?
>> >>
>> >> Let's begin a real discussion by being HONEST about the current
>> >> status. Please put politically correct or your own company's wishes
>> >> aside, we know it's not in a good state.
>> >>
>> >> Current status is the one guy who wrote the code can set a
>> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> >> dimensions in HOURS (i didn't ask what hardware).
>> >>
>> >> My concerns are everyone else except the one guy, I want it to be
>> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>> >> ram buffer and bigger heap to avoid OOM on merge.
>> >> It is also a permanent backwards compatibility decision, we have to
>> >> support it once we do this and we can't just say "oops" and flip it
>> >> back.
>> >>
>> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> >> avoid merges because they are so slow and it would be DAYS otherwise,
>> >> or if its to avoid merges so it doesn't hit OOM.
>> >> Also from personal experience, it takes trial and error (means
>> >> experiencing OOM on merge!!!) before you get those heap values correct
>> >> for your dataset. This usually means starting over which is
>> >> frustrating and wastes more time.
>> >>
>> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> >> avoided in this way and performance improved by writing bigger
>> >> segments with lucene's defaults. But this doesn't mean we can simply
>> >> ignore the horrors of what happens on merge. merging needs to scale so
>> >> that indexing really scales.
>> >>
>> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> >> fashion when indexing.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

marcuseagan at gmail

Apr 8, 2023, 4:49 PM

Post #68 of 99 (176 views)

Permalink

Given the massive amounts of funding going into the development and
investigation of the project, I think it would be good to at least have
Lucene be a part of the conversation. Simply because academics typically
focus on vectors <= 784 dimensions does not mean all users will. A large
swathe of very important users of the Lucene project never exceed 500k
documents, though they are shifting to other search engines to try out very
popular embeddings.

I think giving our users the opportunity to build chat bots or LLM memory
machines using Lucene is a positive development, even if some datasets
won't be able to work well. We don't limit the number of fields someone can
add in most cases, though we did just undeprecate that API to better
support multi-tenancy. But people still add so many fields and can crash
their clusters with mapping explosions when unlimited. The limit to vectors
feels similar. I expect more people to dig into Lucene due to its openness
and robustness as they run into problems. Today, they are forced to
consider other engines that are more permissive.

Not everyone important or valuable Lucene workload is in the millions of
documents. Many of them only have lots of queries or computationally
expensive access patterns for B-trees. We can document that it is very
ill-advised to make a deployment with vectors too large. What others will
do with it is on them.

On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:

> As Dawid pointed out earlier on this thread, this is the rule for
> Apache projects: a single -1 vote on a code change is a veto and
> cannot be overridden. Furthermore, Robert is one of the people on this
> project who worked the most on debugging subtle bugs, making Lucene
> more robust and improving our test framework, so I'm listening when he
> voices quality concerns.
>
> The argument against removing/raising the limit that resonates with me
> the most is that it is a one-way door. As MikeS highlighted earlier on
> this thread, implementations may want to take advantage of the fact
> that there is a limit at some point too. This is why I don't want to
> remove the limit and would prefer a slight increase, such as 2048 as
> suggested in the original issue, which would enable most of the things
> that users who have been asking about raising the limit would like to
> do.
>
> I agree that the merge-time memory usage and slow indexing rate are
> not great. But it's still possible to index multi-million vector
> datasets with a 4GB heap without hitting OOMEs regardless of the
> number of dimensions, and the feedback I'm seeing is that many users
> are still interested in indexing multi-million vector datasets despite
> the slow indexing rate. I wish we could do better, and vector indexing
> is certainly more expert than text indexing, but it still is usable in
> my opinion. I understand how giving Lucene more information about
> vectors prior to indexing (e.g. clustering information as Jim pointed
> out) could help make merging faster and more memory-efficient, but I
> would really like to avoid making it a requirement for indexing
> vectors as it also makes this feature much harder to use.
>
> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
> >
> > I am very attentive to listen opinions but I am un-convinced here and I
> an not sure that a single person opinion should be allowed to be
> detrimental for such an important project.
> >
> > The limit as far as I know is literally just raising an exception.
> > Removing it won't alter in any way the current performance for users in
> low dimensional space.
> > Removing it will just enable more users to use Lucene.
> >
> > If new users in certain situations will be unhappy with the performance,
> they may contribute improvements.
> > This is how you make progress.
> >
> > If it's a reputation thing, trust me that not allowing users to play
> with high dimensional space will equally damage it.
> >
> > To me it's really a no brainer.
> > Removing the limit and enable people to use high dimensional vectors
> will take minutes.
> > Improving the hnsw implementation can take months.
> > Pick one to begin with...
> >
> > And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
> >
> >
> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> >>
> >> I disagree with your categorization. I put in plenty of work and
> >> experienced plenty of pain myself, writing tests and fighting these
> >> issues, after i saw that, two releases in a row, vector indexing fell
> >> over and hit integer overflows etc on small datasets:
> >>
> >> https://github.com/apache/lucene/pull/11905
> >>
> >> Attacking me isn't helping the situation.
> >>
> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
> >> any kind of demeaning fashion really. I meant to describe the current
> >> state of usability with respect to indexing a few million docs with
> >> high dimensions. You can scroll up the thread and see that at least
> >> one other committer on the project experienced similar pain as me.
> >> Then, think about users who aren't committers trying to use the
> >> functionality!
> >>
> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >> >
> >> > What you said about increasing dimensions requiring a bigger ram
> buffer on merge is wrong. That's the point I was trying to make. Your
> concerns about merge costs are not wrong, but your conclusion that we need
> to limit dimensions is not justified.
> >> >
> >> > You complain that hnsw sucks it doesn't scale, but when I show it
> scales linearly with dimension you just ignore that and complain about
> something entirely different.
> >> >
> >> > You demand that people run all kinds of tests to prove you wrong but
> when they do, you don't listen and you won't put in the work yourself or
> complain that it's too hard.
> >> >
> >> > Then you complain about people not meeting you half way. Wow
> >> >
> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
> >> >>
> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> >> >> <michael.wechner@wyona.com> wrote:
> >> >> >
> >> >> > What exactly do you consider reasonable?
> >> >>
> >> >> Let's begin a real discussion by being HONEST about the current
> >> >> status. Please put politically correct or your own company's wishes
> >> >> aside, we know it's not in a good state.
> >> >>
> >> >> Current status is the one guy who wrote the code can set a
> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> >> >> dimensions in HOURS (i didn't ask what hardware).
> >> >>
> >> >> My concerns are everyone else except the one guy, I want it to be
> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> >> >> ram buffer and bigger heap to avoid OOM on merge.
> >> >> It is also a permanent backwards compatibility decision, we have to
> >> >> support it once we do this and we can't just say "oops" and flip it
> >> >> back.
> >> >>
> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
> >> >> or if its to avoid merges so it doesn't hit OOM.
> >> >> Also from personal experience, it takes trial and error (means
> >> >> experiencing OOM on merge!!!) before you get those heap values
> correct
> >> >> for your dataset. This usually means starting over which is
> >> >> frustrating and wastes more time.
> >> >>
> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
> >> >> avoided in this way and performance improved by writing bigger
> >> >> segments with lucene's defaults. But this doesn't mean we can simply
> >> >> ignore the horrors of what happens on merge. merging needs to scale
> so
> >> >> that indexing really scales.
> >> >>
> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> >> >> fashion when indexing.
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Marcus Eagan

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

dawid.weiss at gmail

Apr 8, 2023, 10:31 PM

Post #69 of 99 (176 views)

Permalink

Can we set up a branch in which the limit is bumped to 2048, then have
a realistic, free data set (wikipedia sample or something) that has,
say, 5 million docs and vectors created using public data (glove
pre-trained embeddings or the like)? We then could run indexing on the
same hardware with 512, 1024 and 2048 and see what the numbers, limits
and behavior actually are.

I can help in writing this but not until after Easter.

Dawid

On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>
> As Dawid pointed out earlier on this thread, this is the rule for
> Apache projects: a single -1 vote on a code change is a veto and
> cannot be overridden. Furthermore, Robert is one of the people on this
> project who worked the most on debugging subtle bugs, making Lucene
> more robust and improving our test framework, so I'm listening when he
> voices quality concerns.
>
> The argument against removing/raising the limit that resonates with me
> the most is that it is a one-way door. As MikeS highlighted earlier on
> this thread, implementations may want to take advantage of the fact
> that there is a limit at some point too. This is why I don't want to
> remove the limit and would prefer a slight increase, such as 2048 as
> suggested in the original issue, which would enable most of the things
> that users who have been asking about raising the limit would like to
> do.
>
> I agree that the merge-time memory usage and slow indexing rate are
> not great. But it's still possible to index multi-million vector
> datasets with a 4GB heap without hitting OOMEs regardless of the
> number of dimensions, and the feedback I'm seeing is that many users
> are still interested in indexing multi-million vector datasets despite
> the slow indexing rate. I wish we could do better, and vector indexing
> is certainly more expert than text indexing, but it still is usable in
> my opinion. I understand how giving Lucene more information about
> vectors prior to indexing (e.g. clustering information as Jim pointed
> out) could help make merging faster and more memory-efficient, but I
> would really like to avoid making it a requirement for indexing
> vectors as it also makes this feature much harder to use.
>
> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> <a.benedetti@sease.io> wrote:
> >
> > I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
> >
> > The limit as far as I know is literally just raising an exception.
> > Removing it won't alter in any way the current performance for users in low dimensional space.
> > Removing it will just enable more users to use Lucene.
> >
> > If new users in certain situations will be unhappy with the performance, they may contribute improvements.
> > This is how you make progress.
> >
> > If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
> >
> > To me it's really a no brainer.
> > Removing the limit and enable people to use high dimensional vectors will take minutes.
> > Improving the hnsw implementation can take months.
> > Pick one to begin with...
> >
> > And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
> >
> >
> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> >>
> >> I disagree with your categorization. I put in plenty of work and
> >> experienced plenty of pain myself, writing tests and fighting these
> >> issues, after i saw that, two releases in a row, vector indexing fell
> >> over and hit integer overflows etc on small datasets:
> >>
> >> https://github.com/apache/lucene/pull/11905
> >>
> >> Attacking me isn't helping the situation.
> >>
> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
> >> any kind of demeaning fashion really. I meant to describe the current
> >> state of usability with respect to indexing a few million docs with
> >> high dimensions. You can scroll up the thread and see that at least
> >> one other committer on the project experienced similar pain as me.
> >> Then, think about users who aren't committers trying to use the
> >> functionality!
> >>
> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
> >> >
> >> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
> >> >
> >> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
> >> >
> >> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
> >> >
> >> > Then you complain about people not meeting you half way. Wow
> >> >
> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
> >> >>
> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> >> >> <michael.wechner@wyona.com> wrote:
> >> >> >
> >> >> > What exactly do you consider reasonable?
> >> >>
> >> >> Let's begin a real discussion by being HONEST about the current
> >> >> status. Please put politically correct or your own company's wishes
> >> >> aside, we know it's not in a good state.
> >> >>
> >> >> Current status is the one guy who wrote the code can set a
> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> >> >> dimensions in HOURS (i didn't ask what hardware).
> >> >>
> >> >> My concerns are everyone else except the one guy, I want it to be
> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> >> >> ram buffer and bigger heap to avoid OOM on merge.
> >> >> It is also a permanent backwards compatibility decision, we have to
> >> >> support it once we do this and we can't just say "oops" and flip it
> >> >> back.
> >> >>
> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
> >> >> or if its to avoid merges so it doesn't hit OOM.
> >> >> Also from personal experience, it takes trial and error (means
> >> >> experiencing OOM on merge!!!) before you get those heap values correct
> >> >> for your dataset. This usually means starting over which is
> >> >> frustrating and wastes more time.
> >> >>
> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
> >> >> avoided in this way and performance improved by writing bigger
> >> >> segments with lucene's defaults. But this doesn't mean we can simply
> >> >> ignore the horrors of what happens on merge. merging needs to scale so
> >> >> that indexing really scales.
> >> >>
> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> >> >> fashion when indexing.
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 9, 2023, 3:25 AM

Post #70 of 99 (176 views)

Permalink

Yes, its very clear that folks on this thread are ignoring reason
entirely and completely swooned by chatgpt-hype.
And what happens when they make chatgpt-8 that uses even more dimensions?
backwards compatibility decisions can't be made by garbage hype such
as cryptocurrency or chatgpt.
Trying to convince me we should bump it because of chatgpt, well, i
think it has the opposite effect.

Please, lemme see real technical arguments why this limit needs to be
bumped. not including trash like chatgpt.

On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com> wrote:
>
> Given the massive amounts of funding going into the development and investigation of the project, I think it would be good to at least have Lucene be a part of the conversation. Simply because academics typically focus on vectors <= 784 dimensions does not mean all users will. A large swathe of very important users of the Lucene project never exceed 500k documents, though they are shifting to other search engines to try out very popular embeddings.
>
> I think giving our users the opportunity to build chat bots or LLM memory machines using Lucene is a positive development, even if some datasets won't be able to work well. We don't limit the number of fields someone can add in most cases, though we did just undeprecate that API to better support multi-tenancy. But people still add so many fields and can crash their clusters with mapping explosions when unlimited. The limit to vectors feels similar. I expect more people to dig into Lucene due to its openness and robustness as they run into problems. Today, they are forced to consider other engines that are more permissive.
>
> Not everyone important or valuable Lucene workload is in the millions of documents. Many of them only have lots of queries or computationally expensive access patterns for B-trees. We can document that it is very ill-advised to make a deployment with vectors too large. What others will do with it is on them.
>
>
> On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>>
>> As Dawid pointed out earlier on this thread, this is the rule for
>> Apache projects: a single -1 vote on a code change is a veto and
>> cannot be overridden. Furthermore, Robert is one of the people on this
>> project who worked the most on debugging subtle bugs, making Lucene
>> more robust and improving our test framework, so I'm listening when he
>> voices quality concerns.
>>
>> The argument against removing/raising the limit that resonates with me
>> the most is that it is a one-way door. As MikeS highlighted earlier on
>> this thread, implementations may want to take advantage of the fact
>> that there is a limit at some point too. This is why I don't want to
>> remove the limit and would prefer a slight increase, such as 2048 as
>> suggested in the original issue, which would enable most of the things
>> that users who have been asking about raising the limit would like to
>> do.
>>
>> I agree that the merge-time memory usage and slow indexing rate are
>> not great. But it's still possible to index multi-million vector
>> datasets with a 4GB heap without hitting OOMEs regardless of the
>> number of dimensions, and the feedback I'm seeing is that many users
>> are still interested in indexing multi-million vector datasets despite
>> the slow indexing rate. I wish we could do better, and vector indexing
>> is certainly more expert than text indexing, but it still is usable in
>> my opinion. I understand how giving Lucene more information about
>> vectors prior to indexing (e.g. clustering information as Jim pointed
>> out) could help make merging faster and more memory-efficient, but I
>> would really like to avoid making it a requirement for indexing
>> vectors as it also makes this feature much harder to use.
>>
>> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>> <a.benedetti@sease.io> wrote:
>> >
>> > I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>> >
>> > The limit as far as I know is literally just raising an exception.
>> > Removing it won't alter in any way the current performance for users in low dimensional space.
>> > Removing it will just enable more users to use Lucene.
>> >
>> > If new users in certain situations will be unhappy with the performance, they may contribute improvements.
>> > This is how you make progress.
>> >
>> > If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>> >
>> > To me it's really a no brainer.
>> > Removing the limit and enable people to use high dimensional vectors will take minutes.
>> > Improving the hnsw implementation can take months.
>> > Pick one to begin with...
>> >
>> > And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>> >
>> >
>> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>> >>
>> >> I disagree with your categorization. I put in plenty of work and
>> >> experienced plenty of pain myself, writing tests and fighting these
>> >> issues, after i saw that, two releases in a row, vector indexing fell
>> >> over and hit integer overflows etc on small datasets:
>> >>
>> >> https://github.com/apache/lucene/pull/11905
>> >>
>> >> Attacking me isn't helping the situation.
>> >>
>> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> >> any kind of demeaning fashion really. I meant to describe the current
>> >> state of usability with respect to indexing a few million docs with
>> >> high dimensions. You can scroll up the thread and see that at least
>> >> one other committer on the project experienced similar pain as me.
>> >> Then, think about users who aren't committers trying to use the
>> >> functionality!
>> >>
>> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>> >> >
>> >> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>> >> >
>> >> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>> >> >
>> >> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>> >> >
>> >> > Then you complain about people not meeting you half way. Wow
>> >> >
>> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>> >> >>
>> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> >> >> <michael.wechner@wyona.com> wrote:
>> >> >> >
>> >> >> > What exactly do you consider reasonable?
>> >> >>
>> >> >> Let's begin a real discussion by being HONEST about the current
>> >> >> status. Please put politically correct or your own company's wishes
>> >> >> aside, we know it's not in a good state.
>> >> >>
>> >> >> Current status is the one guy who wrote the code can set a
>> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> >> >> dimensions in HOURS (i didn't ask what hardware).
>> >> >>
>> >> >> My concerns are everyone else except the one guy, I want it to be
>> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>> >> >> ram buffer and bigger heap to avoid OOM on merge.
>> >> >> It is also a permanent backwards compatibility decision, we have to
>> >> >> support it once we do this and we can't just say "oops" and flip it
>> >> >> back.
>> >> >>
>> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
>> >> >> or if its to avoid merges so it doesn't hit OOM.
>> >> >> Also from personal experience, it takes trial and error (means
>> >> >> experiencing OOM on merge!!!) before you get those heap values correct
>> >> >> for your dataset. This usually means starting over which is
>> >> >> frustrating and wastes more time.
>> >> >>
>> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> >> >> avoided in this way and performance improved by writing bigger
>> >> >> segments with lucene's defaults. But this doesn't mean we can simply
>> >> >> ignore the horrors of what happens on merge. merging needs to scale so
>> >> >> that indexing really scales.
>> >> >>
>> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> >> >> fashion when indexing.
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>>
>> --
>> Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>
> --
> Marcus Eagan
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 9, 2023, 3:58 AM

Post #71 of 99 (176 views)

Permalink

Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
LIBRARY. not a vector database or whatever trash is being proposed
here.

i think we should table this and revisit it after chatgpt hype has dissipated.

this hype is causing ppl to behave irrationally, it is why i can't
converse with basically anyone on this thread because they are all
stating crazy things that don't make sense.

On Sun, Apr 9, 2023 at 6:25?AM Robert Muir <rcmuir@gmail.com> wrote:
>
> Yes, its very clear that folks on this thread are ignoring reason
> entirely and completely swooned by chatgpt-hype.
> And what happens when they make chatgpt-8 that uses even more dimensions?
> backwards compatibility decisions can't be made by garbage hype such
> as cryptocurrency or chatgpt.
> Trying to convince me we should bump it because of chatgpt, well, i
> think it has the opposite effect.
>
> Please, lemme see real technical arguments why this limit needs to be
> bumped. not including trash like chatgpt.
>
> On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com> wrote:
> >
> > Given the massive amounts of funding going into the development and investigation of the project, I think it would be good to at least have Lucene be a part of the conversation. Simply because academics typically focus on vectors <= 784 dimensions does not mean all users will. A large swathe of very important users of the Lucene project never exceed 500k documents, though they are shifting to other search engines to try out very popular embeddings.
> >
> > I think giving our users the opportunity to build chat bots or LLM memory machines using Lucene is a positive development, even if some datasets won't be able to work well. We don't limit the number of fields someone can add in most cases, though we did just undeprecate that API to better support multi-tenancy. But people still add so many fields and can crash their clusters with mapping explosions when unlimited. The limit to vectors feels similar. I expect more people to dig into Lucene due to its openness and robustness as they run into problems. Today, they are forced to consider other engines that are more permissive.
> >
> > Not everyone important or valuable Lucene workload is in the millions of documents. Many of them only have lots of queries or computationally expensive access patterns for B-trees. We can document that it is very ill-advised to make a deployment with vectors too large. What others will do with it is on them.
> >
> >
> > On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
> >>
> >> As Dawid pointed out earlier on this thread, this is the rule for
> >> Apache projects: a single -1 vote on a code change is a veto and
> >> cannot be overridden. Furthermore, Robert is one of the people on this
> >> project who worked the most on debugging subtle bugs, making Lucene
> >> more robust and improving our test framework, so I'm listening when he
> >> voices quality concerns.
> >>
> >> The argument against removing/raising the limit that resonates with me
> >> the most is that it is a one-way door. As MikeS highlighted earlier on
> >> this thread, implementations may want to take advantage of the fact
> >> that there is a limit at some point too. This is why I don't want to
> >> remove the limit and would prefer a slight increase, such as 2048 as
> >> suggested in the original issue, which would enable most of the things
> >> that users who have been asking about raising the limit would like to
> >> do.
> >>
> >> I agree that the merge-time memory usage and slow indexing rate are
> >> not great. But it's still possible to index multi-million vector
> >> datasets with a 4GB heap without hitting OOMEs regardless of the
> >> number of dimensions, and the feedback I'm seeing is that many users
> >> are still interested in indexing multi-million vector datasets despite
> >> the slow indexing rate. I wish we could do better, and vector indexing
> >> is certainly more expert than text indexing, but it still is usable in
> >> my opinion. I understand how giving Lucene more information about
> >> vectors prior to indexing (e.g. clustering information as Jim pointed
> >> out) could help make merging faster and more memory-efficient, but I
> >> would really like to avoid making it a requirement for indexing
> >> vectors as it also makes this feature much harder to use.
> >>
> >> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> >> <a.benedetti@sease.io> wrote:
> >> >
> >> > I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
> >> >
> >> > The limit as far as I know is literally just raising an exception.
> >> > Removing it won't alter in any way the current performance for users in low dimensional space.
> >> > Removing it will just enable more users to use Lucene.
> >> >
> >> > If new users in certain situations will be unhappy with the performance, they may contribute improvements.
> >> > This is how you make progress.
> >> >
> >> > If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
> >> >
> >> > To me it's really a no brainer.
> >> > Removing the limit and enable people to use high dimensional vectors will take minutes.
> >> > Improving the hnsw implementation can take months.
> >> > Pick one to begin with...
> >> >
> >> > And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
> >> >
> >> >
> >> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> >> >>
> >> >> I disagree with your categorization. I put in plenty of work and
> >> >> experienced plenty of pain myself, writing tests and fighting these
> >> >> issues, after i saw that, two releases in a row, vector indexing fell
> >> >> over and hit integer overflows etc on small datasets:
> >> >>
> >> >> https://github.com/apache/lucene/pull/11905
> >> >>
> >> >> Attacking me isn't helping the situation.
> >> >>
> >> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
> >> >> any kind of demeaning fashion really. I meant to describe the current
> >> >> state of usability with respect to indexing a few million docs with
> >> >> high dimensions. You can scroll up the thread and see that at least
> >> >> one other committer on the project experienced similar pain as me.
> >> >> Then, think about users who aren't committers trying to use the
> >> >> functionality!
> >> >>
> >> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
> >> >> >
> >> >> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
> >> >> >
> >> >> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
> >> >> >
> >> >> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
> >> >> >
> >> >> > Then you complain about people not meeting you half way. Wow
> >> >> >
> >> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
> >> >> >>
> >> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> >> >> >> <michael.wechner@wyona.com> wrote:
> >> >> >> >
> >> >> >> > What exactly do you consider reasonable?
> >> >> >>
> >> >> >> Let's begin a real discussion by being HONEST about the current
> >> >> >> status. Please put politically correct or your own company's wishes
> >> >> >> aside, we know it's not in a good state.
> >> >> >>
> >> >> >> Current status is the one guy who wrote the code can set a
> >> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> >> >> >> dimensions in HOURS (i didn't ask what hardware).
> >> >> >>
> >> >> >> My concerns are everyone else except the one guy, I want it to be
> >> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> >> >> >> ram buffer and bigger heap to avoid OOM on merge.
> >> >> >> It is also a permanent backwards compatibility decision, we have to
> >> >> >> support it once we do this and we can't just say "oops" and flip it
> >> >> >> back.
> >> >> >>
> >> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> >> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
> >> >> >> or if its to avoid merges so it doesn't hit OOM.
> >> >> >> Also from personal experience, it takes trial and error (means
> >> >> >> experiencing OOM on merge!!!) before you get those heap values correct
> >> >> >> for your dataset. This usually means starting over which is
> >> >> >> frustrating and wastes more time.
> >> >> >>
> >> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> >> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
> >> >> >> avoided in this way and performance improved by writing bigger
> >> >> >> segments with lucene's defaults. But this doesn't mean we can simply
> >> >> >> ignore the horrors of what happens on merge. merging needs to scale so
> >> >> >> that indexing really scales.
> >> >> >>
> >> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> >> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> >> >> >> fashion when indexing.
> >> >> >>
> >> >> >> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >>
> >>
> >> --
> >> Adrien
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
> > --
> > Marcus Eagan
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

a.benedetti at sease

Apr 9, 2023, 4:37 AM

Post #72 of 99 (176 views)

Permalink

I don't think this tone and language is appropriate for a community of
volunteers and men of science.

I personally find offensive to generalise Lucene people here to be "crazy
people hyped about chatGPT".

I personally don't give a damn about chatGPT except the fact it is a very
interesting technology.

As usual I see very little motivation and a lot of "convince me".
We're discussing here about a limit that raises an exception.

Improving performance is absolutely important and no-one here is saying we
won't address it, it's just a separate discussion.

On Sun, 9 Apr 2023, 12:59 Robert Muir, <rcmuir@gmail.com> wrote:

> Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
> LIBRARY. not a vector database or whatever trash is being proposed
> here.
>
> i think we should table this and revisit it after chatgpt hype has
> dissipated.
>
> this hype is causing ppl to behave irrationally, it is why i can't
> converse with basically anyone on this thread because they are all
> stating crazy things that don't make sense.
>
> On Sun, Apr 9, 2023 at 6:25?AM Robert Muir <rcmuir@gmail.com> wrote:
> >
> > Yes, its very clear that folks on this thread are ignoring reason
> > entirely and completely swooned by chatgpt-hype.
> > And what happens when they make chatgpt-8 that uses even more dimensions?
> > backwards compatibility decisions can't be made by garbage hype such
> > as cryptocurrency or chatgpt.
> > Trying to convince me we should bump it because of chatgpt, well, i
> > think it has the opposite effect.
> >
> > Please, lemme see real technical arguments why this limit needs to be
> > bumped. not including trash like chatgpt.
> >
> > On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com>
> wrote:
> > >
> > > Given the massive amounts of funding going into the development and
> investigation of the project, I think it would be good to at least have
> Lucene be a part of the conversation. Simply because academics typically
> focus on vectors <= 784 dimensions does not mean all users will. A large
> swathe of very important users of the Lucene project never exceed 500k
> documents, though they are shifting to other search engines to try out very
> popular embeddings.
> > >
> > > I think giving our users the opportunity to build chat bots or LLM
> memory machines using Lucene is a positive development, even if some
> datasets won't be able to work well. We don't limit the number of fields
> someone can add in most cases, though we did just undeprecate that API to
> better support multi-tenancy. But people still add so many fields and can
> crash their clusters with mapping explosions when unlimited. The limit to
> vectors feels similar. I expect more people to dig into Lucene due to its
> openness and robustness as they run into problems. Today, they are forced
> to consider other engines that are more permissive.
> > >
> > > Not everyone important or valuable Lucene workload is in the millions
> of documents. Many of them only have lots of queries or computationally
> expensive access patterns for B-trees. We can document that it is very
> ill-advised to make a deployment with vectors too large. What others will
> do with it is on them.
> > >
> > >
> > > On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
> > >>
> > >> As Dawid pointed out earlier on this thread, this is the rule for
> > >> Apache projects: a single -1 vote on a code change is a veto and
> > >> cannot be overridden. Furthermore, Robert is one of the people on this
> > >> project who worked the most on debugging subtle bugs, making Lucene
> > >> more robust and improving our test framework, so I'm listening when he
> > >> voices quality concerns.
> > >>
> > >> The argument against removing/raising the limit that resonates with me
> > >> the most is that it is a one-way door. As MikeS highlighted earlier on
> > >> this thread, implementations may want to take advantage of the fact
> > >> that there is a limit at some point too. This is why I don't want to
> > >> remove the limit and would prefer a slight increase, such as 2048 as
> > >> suggested in the original issue, which would enable most of the things
> > >> that users who have been asking about raising the limit would like to
> > >> do.
> > >>
> > >> I agree that the merge-time memory usage and slow indexing rate are
> > >> not great. But it's still possible to index multi-million vector
> > >> datasets with a 4GB heap without hitting OOMEs regardless of the
> > >> number of dimensions, and the feedback I'm seeing is that many users
> > >> are still interested in indexing multi-million vector datasets despite
> > >> the slow indexing rate. I wish we could do better, and vector indexing
> > >> is certainly more expert than text indexing, but it still is usable in
> > >> my opinion. I understand how giving Lucene more information about
> > >> vectors prior to indexing (e.g. clustering information as Jim pointed
> > >> out) could help make merging faster and more memory-efficient, but I
> > >> would really like to avoid making it a requirement for indexing
> > >> vectors as it also makes this feature much harder to use.
> > >>
> > >> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> > >> <a.benedetti@sease.io> wrote:
> > >> >
> > >> > I am very attentive to listen opinions but I am un-convinced here
> and I an not sure that a single person opinion should be allowed to be
> detrimental for such an important project.
> > >> >
> > >> > The limit as far as I know is literally just raising an exception.
> > >> > Removing it won't alter in any way the current performance for
> users in low dimensional space.
> > >> > Removing it will just enable more users to use Lucene.
> > >> >
> > >> > If new users in certain situations will be unhappy with the
> performance, they may contribute improvements.
> > >> > This is how you make progress.
> > >> >
> > >> > If it's a reputation thing, trust me that not allowing users to
> play with high dimensional space will equally damage it.
> > >> >
> > >> > To me it's really a no brainer.
> > >> > Removing the limit and enable people to use high dimensional
> vectors will take minutes.
> > >> > Improving the hnsw implementation can take months.
> > >> > Pick one to begin with...
> > >> >
> > >> > And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
> > >> >
> > >> >
> > >> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> > >> >>
> > >> >> I disagree with your categorization. I put in plenty of work and
> > >> >> experienced plenty of pain myself, writing tests and fighting these
> > >> >> issues, after i saw that, two releases in a row, vector indexing
> fell
> > >> >> over and hit integer overflows etc on small datasets:
> > >> >>
> > >> >> https://github.com/apache/lucene/pull/11905
> > >> >>
> > >> >> Attacking me isn't helping the situation.
> > >> >>
> > >> >> PS: when i said the "one guy who wrote the code" I didn't mean it
> in
> > >> >> any kind of demeaning fashion really. I meant to describe the
> current
> > >> >> state of usability with respect to indexing a few million docs with
> > >> >> high dimensions. You can scroll up the thread and see that at least
> > >> >> one other committer on the project experienced similar pain as me.
> > >> >> Then, think about users who aren't committers trying to use the
> > >> >> functionality!
> > >> >>
> > >> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
> msokolov@gmail.com> wrote:
> > >> >> >
> > >> >> > What you said about increasing dimensions requiring a bigger ram
> buffer on merge is wrong. That's the point I was trying to make. Your
> concerns about merge costs are not wrong, but your conclusion that we need
> to limit dimensions is not justified.
> > >> >> >
> > >> >> > You complain that hnsw sucks it doesn't scale, but when I show
> it scales linearly with dimension you just ignore that and complain about
> something entirely different.
> > >> >> >
> > >> >> > You demand that people run all kinds of tests to prove you wrong
> but when they do, you don't listen and you won't put in the work yourself
> or complain that it's too hard.
> > >> >> >
> > >> >> > Then you complain about people not meeting you half way. Wow
> > >> >> >
> > >> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
> wrote:
> > >> >> >>
> > >> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> > >> >> >> <michael.wechner@wyona.com> wrote:
> > >> >> >> >
> > >> >> >> > What exactly do you consider reasonable?
> > >> >> >>
> > >> >> >> Let's begin a real discussion by being HONEST about the current
> > >> >> >> status. Please put politically correct or your own company's
> wishes
> > >> >> >> aside, we know it's not in a good state.
> > >> >> >>
> > >> >> >> Current status is the one guy who wrote the code can set a
> > >> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> > >> >> >> dimensions in HOURS (i didn't ask what hardware).
> > >> >> >>
> > >> >> >> My concerns are everyone else except the one guy, I want it to
> be
> > >> >> >> usable. Increasing dimensions just means even bigger
> multi-gigabyte
> > >> >> >> ram buffer and bigger heap to avoid OOM on merge.
> > >> >> >> It is also a permanent backwards compatibility decision, we
> have to
> > >> >> >> support it once we do this and we can't just say "oops" and
> flip it
> > >> >> >> back.
> > >> >> >>
> > >> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really
> to
> > >> >> >> avoid merges because they are so slow and it would be DAYS
> otherwise,
> > >> >> >> or if its to avoid merges so it doesn't hit OOM.
> > >> >> >> Also from personal experience, it takes trial and error (means
> > >> >> >> experiencing OOM on merge!!!) before you get those heap values
> correct
> > >> >> >> for your dataset. This usually means starting over which is
> > >> >> >> frustrating and wastes more time.
> > >> >> >>
> > >> >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
> seems
> > >> >> >> to me like its a good idea. maybe the multigigabyte ram buffer
> can be
> > >> >> >> avoided in this way and performance improved by writing bigger
> > >> >> >> segments with lucene's defaults. But this doesn't mean we can
> simply
> > >> >> >> ignore the horrors of what happens on merge. merging needs to
> scale so
> > >> >> >> that indexing really scales.
> > >> >> >>
> > >> >> >> At least it shouldnt spike RAM on trivial data amounts and
> cause OOM,
> > >> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> > >> >> >> fashion when indexing.
> > >> >> >>
> > >> >> >>
> ---------------------------------------------------------------------
> > >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >> >> >>
> > >> >>
> > >> >>
> ---------------------------------------------------------------------
> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >> >>
> > >>
> > >>
> > >> --
> > >> Adrien
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>
> > >
> > >
> > > --
> > > Marcus Eagan
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

serera at gmail

Apr 9, 2023, 4:45 AM

Post #73 of 99 (176 views)

Permalink

Putting ChatGPT aside, what are the implications of (1) removing the limit,
or (2) increasing the limit, or (3) make it configurable at the app's
discretion? The configuration can even be in the form of a VectorEncoder
impl which will decide on the size of the vectors, thereby making it
clearer that this is an expert setting and puts it at the hands of the app
to decide how to handle those large vectors.

Will bigger vectors require an algorithmic change (I understand that it
might benefit from one, I ask if it's required besides performance gains)?
If not, then why do you object to making it an "app problem"? If 2048 dims
vectors require 2GB IW RAM buffer, what's wrong with documenting it and
letting the app choose whether they want/can do it or not?

Do we limit the size of stored fields, or binary doc values? Do we prevent
anyone from using a Codec which loads these big byte[] into memory?

> Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
> LIBRARY. not a vector database or whatever trash is being proposed
> here.

The problem being discussed *is* related to search, not lexical search, but
rather semantic search, or search in vector space, however you want to call
it. Are you saying that Lucene should only focus on lexical search
scenarios? Or are you saying that semantic search scenarios don't need to
index >1024 dimension vectors in order to produce high quality results?

I personally don't understand why not letting apps index bigger vectors, if
all that it takes is using bigger RAM buffers. Improvements will come
later, especially as more and more applications will try it. If we prevent
it, then we might never see these improvements cause no one will even
attempt to do it with Lucene, and IMO it's not a direction we want to head.
While ChatGPT itself might be a hype, I don't think that big vectors are,
and if the only technical reason we have for not supporting them is a
bigger RAM buffer, then I think we should allow it.

On Sun, Apr 9, 2023 at 1:59?PM Robert Muir <rcmuir@gmail.com> wrote:

> Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
> LIBRARY. not a vector database or whatever trash is being proposed
> here.
>
> i think we should table this and revisit it after chatgpt hype has
> dissipated.
>
> this hype is causing ppl to behave irrationally, it is why i can't
> converse with basically anyone on this thread because they are all
> stating crazy things that don't make sense.
>
> On Sun, Apr 9, 2023 at 6:25?AM Robert Muir <rcmuir@gmail.com> wrote:
> >
> > Yes, its very clear that folks on this thread are ignoring reason
> > entirely and completely swooned by chatgpt-hype.
> > And what happens when they make chatgpt-8 that uses even more dimensions?
> > backwards compatibility decisions can't be made by garbage hype such
> > as cryptocurrency or chatgpt.
> > Trying to convince me we should bump it because of chatgpt, well, i
> > think it has the opposite effect.
> >
> > Please, lemme see real technical arguments why this limit needs to be
> > bumped. not including trash like chatgpt.
> >
> > On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com>
> wrote:
> > >
> > > Given the massive amounts of funding going into the development and
> investigation of the project, I think it would be good to at least have
> Lucene be a part of the conversation. Simply because academics typically
> focus on vectors <= 784 dimensions does not mean all users will. A large
> swathe of very important users of the Lucene project never exceed 500k
> documents, though they are shifting to other search engines to try out very
> popular embeddings.
> > >
> > > I think giving our users the opportunity to build chat bots or LLM
> memory machines using Lucene is a positive development, even if some
> datasets won't be able to work well. We don't limit the number of fields
> someone can add in most cases, though we did just undeprecate that API to
> better support multi-tenancy. But people still add so many fields and can
> crash their clusters with mapping explosions when unlimited. The limit to
> vectors feels similar. I expect more people to dig into Lucene due to its
> openness and robustness as they run into problems. Today, they are forced
> to consider other engines that are more permissive.
> > >
> > > Not everyone important or valuable Lucene workload is in the millions
> of documents. Many of them only have lots of queries or computationally
> expensive access patterns for B-trees. We can document that it is very
> ill-advised to make a deployment with vectors too large. What others will
> do with it is on them.
> > >
> > >
> > > On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
> > >>
> > >> As Dawid pointed out earlier on this thread, this is the rule for
> > >> Apache projects: a single -1 vote on a code change is a veto and
> > >> cannot be overridden. Furthermore, Robert is one of the people on this
> > >> project who worked the most on debugging subtle bugs, making Lucene
> > >> more robust and improving our test framework, so I'm listening when he
> > >> voices quality concerns.
> > >>
> > >> The argument against removing/raising the limit that resonates with me
> > >> the most is that it is a one-way door. As MikeS highlighted earlier on
> > >> this thread, implementations may want to take advantage of the fact
> > >> that there is a limit at some point too. This is why I don't want to
> > >> remove the limit and would prefer a slight increase, such as 2048 as
> > >> suggested in the original issue, which would enable most of the things
> > >> that users who have been asking about raising the limit would like to
> > >> do.
> > >>
> > >> I agree that the merge-time memory usage and slow indexing rate are
> > >> not great. But it's still possible to index multi-million vector
> > >> datasets with a 4GB heap without hitting OOMEs regardless of the
> > >> number of dimensions, and the feedback I'm seeing is that many users
> > >> are still interested in indexing multi-million vector datasets despite
> > >> the slow indexing rate. I wish we could do better, and vector indexing
> > >> is certainly more expert than text indexing, but it still is usable in
> > >> my opinion. I understand how giving Lucene more information about
> > >> vectors prior to indexing (e.g. clustering information as Jim pointed
> > >> out) could help make merging faster and more memory-efficient, but I
> > >> would really like to avoid making it a requirement for indexing
> > >> vectors as it also makes this feature much harder to use.
> > >>
> > >> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> > >> <a.benedetti@sease.io> wrote:
> > >> >
> > >> > I am very attentive to listen opinions but I am un-convinced here
> and I an not sure that a single person opinion should be allowed to be
> detrimental for such an important project.
> > >> >
> > >> > The limit as far as I know is literally just raising an exception.
> > >> > Removing it won't alter in any way the current performance for
> users in low dimensional space.
> > >> > Removing it will just enable more users to use Lucene.
> > >> >
> > >> > If new users in certain situations will be unhappy with the
> performance, they may contribute improvements.
> > >> > This is how you make progress.
> > >> >
> > >> > If it's a reputation thing, trust me that not allowing users to
> play with high dimensional space will equally damage it.
> > >> >
> > >> > To me it's really a no brainer.
> > >> > Removing the limit and enable people to use high dimensional
> vectors will take minutes.
> > >> > Improving the hnsw implementation can take months.
> > >> > Pick one to begin with...
> > >> >
> > >> > And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
> > >> >
> > >> >
> > >> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> > >> >>
> > >> >> I disagree with your categorization. I put in plenty of work and
> > >> >> experienced plenty of pain myself, writing tests and fighting these
> > >> >> issues, after i saw that, two releases in a row, vector indexing
> fell
> > >> >> over and hit integer overflows etc on small datasets:
> > >> >>
> > >> >> https://github.com/apache/lucene/pull/11905
> > >> >>
> > >> >> Attacking me isn't helping the situation.
> > >> >>
> > >> >> PS: when i said the "one guy who wrote the code" I didn't mean it
> in
> > >> >> any kind of demeaning fashion really. I meant to describe the
> current
> > >> >> state of usability with respect to indexing a few million docs with
> > >> >> high dimensions. You can scroll up the thread and see that at least
> > >> >> one other committer on the project experienced similar pain as me.
> > >> >> Then, think about users who aren't committers trying to use the
> > >> >> functionality!
> > >> >>
> > >> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <
> msokolov@gmail.com> wrote:
> > >> >> >
> > >> >> > What you said about increasing dimensions requiring a bigger ram
> buffer on merge is wrong. That's the point I was trying to make. Your
> concerns about merge costs are not wrong, but your conclusion that we need
> to limit dimensions is not justified.
> > >> >> >
> > >> >> > You complain that hnsw sucks it doesn't scale, but when I show
> it scales linearly with dimension you just ignore that and complain about
> something entirely different.
> > >> >> >
> > >> >> > You demand that people run all kinds of tests to prove you wrong
> but when they do, you don't listen and you won't put in the work yourself
> or complain that it's too hard.
> > >> >> >
> > >> >> > Then you complain about people not meeting you half way. Wow
> > >> >> >
> > >> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com>
> wrote:
> > >> >> >>
> > >> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> > >> >> >> <michael.wechner@wyona.com> wrote:
> > >> >> >> >
> > >> >> >> > What exactly do you consider reasonable?
> > >> >> >>
> > >> >> >> Let's begin a real discussion by being HONEST about the current
> > >> >> >> status. Please put politically correct or your own company's
> wishes
> > >> >> >> aside, we know it's not in a good state.
> > >> >> >>
> > >> >> >> Current status is the one guy who wrote the code can set a
> > >> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> > >> >> >> dimensions in HOURS (i didn't ask what hardware).
> > >> >> >>
> > >> >> >> My concerns are everyone else except the one guy, I want it to
> be
> > >> >> >> usable. Increasing dimensions just means even bigger
> multi-gigabyte
> > >> >> >> ram buffer and bigger heap to avoid OOM on merge.
> > >> >> >> It is also a permanent backwards compatibility decision, we
> have to
> > >> >> >> support it once we do this and we can't just say "oops" and
> flip it
> > >> >> >> back.
> > >> >> >>
> > >> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really
> to
> > >> >> >> avoid merges because they are so slow and it would be DAYS
> otherwise,
> > >> >> >> or if its to avoid merges so it doesn't hit OOM.
> > >> >> >> Also from personal experience, it takes trial and error (means
> > >> >> >> experiencing OOM on merge!!!) before you get those heap values
> correct
> > >> >> >> for your dataset. This usually means starting over which is
> > >> >> >> frustrating and wastes more time.
> > >> >> >>
> > >> >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
> seems
> > >> >> >> to me like its a good idea. maybe the multigigabyte ram buffer
> can be
> > >> >> >> avoided in this way and performance improved by writing bigger
> > >> >> >> segments with lucene's defaults. But this doesn't mean we can
> simply
> > >> >> >> ignore the horrors of what happens on merge. merging needs to
> scale so
> > >> >> >> that indexing really scales.
> > >> >> >>
> > >> >> >> At least it shouldnt spike RAM on trivial data amounts and
> cause OOM,
> > >> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> > >> >> >> fashion when indexing.
> > >> >> >>
> > >> >> >>
> ---------------------------------------------------------------------
> > >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >> >> >>
> > >> >>
> > >> >>
> ---------------------------------------------------------------------
> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >> >>
> > >>
> > >>
> > >> --
> > >> Adrien
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>
> > >
> > >
> > > --
> > > Marcus Eagan
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

rcmuir at gmail

Apr 9, 2023, 4:47 AM

Post #74 of 99 (176 views)

Permalink

I don't care. you guys personally attacked me first. And then it turns
out, you were being dishonest the entire time and hiding your true
intent, which was not search at all but instead some chatgpt pyramid
scheme or similar.

i'm done with this thread.

On Sun, Apr 9, 2023 at 7:37?AM Alessandro Benedetti
<a.benedetti@sease.io> wrote:
>
> I don't think this tone and language is appropriate for a community of volunteers and men of science.
>
> I personally find offensive to generalise Lucene people here to be "crazy people hyped about chatGPT".
>
> I personally don't give a damn about chatGPT except the fact it is a very interesting technology.
>
> As usual I see very little motivation and a lot of "convince me".
> We're discussing here about a limit that raises an exception.
>
> Improving performance is absolutely important and no-one here is saying we won't address it, it's just a separate discussion.
>
>
> On Sun, 9 Apr 2023, 12:59 Robert Muir, <rcmuir@gmail.com> wrote:
>>
>> Also, please let's only disucss SEARCH. lucene is a SEARCH ENGINE
>> LIBRARY. not a vector database or whatever trash is being proposed
>> here.
>>
>> i think we should table this and revisit it after chatgpt hype has dissipated.
>>
>> this hype is causing ppl to behave irrationally, it is why i can't
>> converse with basically anyone on this thread because they are all
>> stating crazy things that don't make sense.
>>
>> On Sun, Apr 9, 2023 at 6:25?AM Robert Muir <rcmuir@gmail.com> wrote:
>> >
>> > Yes, its very clear that folks on this thread are ignoring reason
>> > entirely and completely swooned by chatgpt-hype.
>> > And what happens when they make chatgpt-8 that uses even more dimensions?
>> > backwards compatibility decisions can't be made by garbage hype such
>> > as cryptocurrency or chatgpt.
>> > Trying to convince me we should bump it because of chatgpt, well, i
>> > think it has the opposite effect.
>> >
>> > Please, lemme see real technical arguments why this limit needs to be
>> > bumped. not including trash like chatgpt.
>> >
>> > On Sat, Apr 8, 2023 at 7:50?PM Marcus Eagan <marcuseagan@gmail.com> wrote:
>> > >
>> > > Given the massive amounts of funding going into the development and investigation of the project, I think it would be good to at least have Lucene be a part of the conversation. Simply because academics typically focus on vectors <= 784 dimensions does not mean all users will. A large swathe of very important users of the Lucene project never exceed 500k documents, though they are shifting to other search engines to try out very popular embeddings.
>> > >
>> > > I think giving our users the opportunity to build chat bots or LLM memory machines using Lucene is a positive development, even if some datasets won't be able to work well. We don't limit the number of fields someone can add in most cases, though we did just undeprecate that API to better support multi-tenancy. But people still add so many fields and can crash their clusters with mapping explosions when unlimited. The limit to vectors feels similar. I expect more people to dig into Lucene due to its openness and robustness as they run into problems. Today, they are forced to consider other engines that are more permissive.
>> > >
>> > > Not everyone important or valuable Lucene workload is in the millions of documents. Many of them only have lots of queries or computationally expensive access patterns for B-trees. We can document that it is very ill-advised to make a deployment with vectors too large. What others will do with it is on them.
>> > >
>> > >
>> > > On Sat, Apr 8, 2023 at 2:29?PM Adrien Grand <jpountz@gmail.com> wrote:
>> > >>
>> > >> As Dawid pointed out earlier on this thread, this is the rule for
>> > >> Apache projects: a single -1 vote on a code change is a veto and
>> > >> cannot be overridden. Furthermore, Robert is one of the people on this
>> > >> project who worked the most on debugging subtle bugs, making Lucene
>> > >> more robust and improving our test framework, so I'm listening when he
>> > >> voices quality concerns.
>> > >>
>> > >> The argument against removing/raising the limit that resonates with me
>> > >> the most is that it is a one-way door. As MikeS highlighted earlier on
>> > >> this thread, implementations may want to take advantage of the fact
>> > >> that there is a limit at some point too. This is why I don't want to
>> > >> remove the limit and would prefer a slight increase, such as 2048 as
>> > >> suggested in the original issue, which would enable most of the things
>> > >> that users who have been asking about raising the limit would like to
>> > >> do.
>> > >>
>> > >> I agree that the merge-time memory usage and slow indexing rate are
>> > >> not great. But it's still possible to index multi-million vector
>> > >> datasets with a 4GB heap without hitting OOMEs regardless of the
>> > >> number of dimensions, and the feedback I'm seeing is that many users
>> > >> are still interested in indexing multi-million vector datasets despite
>> > >> the slow indexing rate. I wish we could do better, and vector indexing
>> > >> is certainly more expert than text indexing, but it still is usable in
>> > >> my opinion. I understand how giving Lucene more information about
>> > >> vectors prior to indexing (e.g. clustering information as Jim pointed
>> > >> out) could help make merging faster and more memory-efficient, but I
>> > >> would really like to avoid making it a requirement for indexing
>> > >> vectors as it also makes this feature much harder to use.
>> > >>
>> > >> On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
>> > >> <a.benedetti@sease.io> wrote:
>> > >> >
>> > >> > I am very attentive to listen opinions but I am un-convinced here and I an not sure that a single person opinion should be allowed to be detrimental for such an important project.
>> > >> >
>> > >> > The limit as far as I know is literally just raising an exception.
>> > >> > Removing it won't alter in any way the current performance for users in low dimensional space.
>> > >> > Removing it will just enable more users to use Lucene.
>> > >> >
>> > >> > If new users in certain situations will be unhappy with the performance, they may contribute improvements.
>> > >> > This is how you make progress.
>> > >> >
>> > >> > If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>> > >> >
>> > >> > To me it's really a no brainer.
>> > >> > Removing the limit and enable people to use high dimensional vectors will take minutes.
>> > >> > Improving the hnsw implementation can take months.
>> > >> > Pick one to begin with...
>> > >> >
>> > >> > And there's no-one paying me here, no company interest whatsoever, actually I pay people to contribute, I am just convinced it's a good idea.
>> > >> >
>> > >> >
>> > >> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
>> > >> >>
>> > >> >> I disagree with your categorization. I put in plenty of work and
>> > >> >> experienced plenty of pain myself, writing tests and fighting these
>> > >> >> issues, after i saw that, two releases in a row, vector indexing fell
>> > >> >> over and hit integer overflows etc on small datasets:
>> > >> >>
>> > >> >> https://github.com/apache/lucene/pull/11905
>> > >> >>
>> > >> >> Attacking me isn't helping the situation.
>> > >> >>
>> > >> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> > >> >> any kind of demeaning fashion really. I meant to describe the current
>> > >> >> state of usability with respect to indexing a few million docs with
>> > >> >> high dimensions. You can scroll up the thread and see that at least
>> > >> >> one other committer on the project experienced similar pain as me.
>> > >> >> Then, think about users who aren't committers trying to use the
>> > >> >> functionality!
>> > >> >>
>> > >> >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com> wrote:
>> > >> >> >
>> > >> >> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>> > >> >> >
>> > >> >> > You complain that hnsw sucks it doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>> > >> >> >
>> > >> >> > You demand that people run all kinds of tests to prove you wrong but when they do, you don't listen and you won't put in the work yourself or complain that it's too hard.
>> > >> >> >
>> > >> >> > Then you complain about people not meeting you half way. Wow
>> > >> >> >
>> > >> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
>> > >> >> >>
>> > >> >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
>> > >> >> >> <michael.wechner@wyona.com> wrote:
>> > >> >> >> >
>> > >> >> >> > What exactly do you consider reasonable?
>> > >> >> >>
>> > >> >> >> Let's begin a real discussion by being HONEST about the current
>> > >> >> >> status. Please put politically correct or your own company's wishes
>> > >> >> >> aside, we know it's not in a good state.
>> > >> >> >>
>> > >> >> >> Current status is the one guy who wrote the code can set a
>> > >> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> > >> >> >> dimensions in HOURS (i didn't ask what hardware).
>> > >> >> >>
>> > >> >> >> My concerns are everyone else except the one guy, I want it to be
>> > >> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
>> > >> >> >> ram buffer and bigger heap to avoid OOM on merge.
>> > >> >> >> It is also a permanent backwards compatibility decision, we have to
>> > >> >> >> support it once we do this and we can't just say "oops" and flip it
>> > >> >> >> back.
>> > >> >> >>
>> > >> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>> > >> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
>> > >> >> >> or if its to avoid merges so it doesn't hit OOM.
>> > >> >> >> Also from personal experience, it takes trial and error (means
>> > >> >> >> experiencing OOM on merge!!!) before you get those heap values correct
>> > >> >> >> for your dataset. This usually means starting over which is
>> > >> >> >> frustrating and wastes more time.
>> > >> >> >>
>> > >> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
>> > >> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
>> > >> >> >> avoided in this way and performance improved by writing bigger
>> > >> >> >> segments with lucene's defaults. But this doesn't mean we can simply
>> > >> >> >> ignore the horrors of what happens on merge. merging needs to scale so
>> > >> >> >> that indexing really scales.
>> > >> >> >>
>> > >> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
>> > >> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>> > >> >> >> fashion when indexing.
>> > >> >> >>
>> > >> >> >> ---------------------------------------------------------------------
>> > >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> > >> >> >>
>> > >> >>
>> > >> >> ---------------------------------------------------------------------
>> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> > >> >>
>> > >>
>> > >>
>> > >> --
>> > >> Adrien
>> > >>
>> > >> ---------------------------------------------------------------------
>> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > >> For additional commands, e-mail: dev-help@lucene.apache.org
>> > >>
>> > >
>> > >
>> > > --
>> > > Marcus Eagan
>> > >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors [ In reply to ]

msokolov at gmail

Apr 9, 2023, 4:55 AM

Post #75 of 99 (176 views)

Permalink

We do have a dataset built from Wikipedia in luceneutil. It comes in 100
and 300 dimensional varieties and can easily enough generate large numbers
of vector documents from the articles data. To go higher we could
concatenate vectors from that and I believe the performance numbers would
be plausible.

On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:

> Can we set up a branch in which the limit is bumped to 2048, then have
> a realistic, free data set (wikipedia sample or something) that has,
> say, 5 million docs and vectors created using public data (glove
> pre-trained embeddings or the like)? We then could run indexing on the
> same hardware with 512, 1024 and 2048 and see what the numbers, limits
> and behavior actually are.
>
> I can help in writing this but not until after Easter.
>
>
> Dawid
>
> On Sat, Apr 8, 2023 at 11:29?PM Adrien Grand <jpountz@gmail.com> wrote:
> >
> > As Dawid pointed out earlier on this thread, this is the rule for
> > Apache projects: a single -1 vote on a code change is a veto and
> > cannot be overridden. Furthermore, Robert is one of the people on this
> > project who worked the most on debugging subtle bugs, making Lucene
> > more robust and improving our test framework, so I'm listening when he
> > voices quality concerns.
> >
> > The argument against removing/raising the limit that resonates with me
> > the most is that it is a one-way door. As MikeS highlighted earlier on
> > this thread, implementations may want to take advantage of the fact
> > that there is a limit at some point too. This is why I don't want to
> > remove the limit and would prefer a slight increase, such as 2048 as
> > suggested in the original issue, which would enable most of the things
> > that users who have been asking about raising the limit would like to
> > do.
> >
> > I agree that the merge-time memory usage and slow indexing rate are
> > not great. But it's still possible to index multi-million vector
> > datasets with a 4GB heap without hitting OOMEs regardless of the
> > number of dimensions, and the feedback I'm seeing is that many users
> > are still interested in indexing multi-million vector datasets despite
> > the slow indexing rate. I wish we could do better, and vector indexing
> > is certainly more expert than text indexing, but it still is usable in
> > my opinion. I understand how giving Lucene more information about
> > vectors prior to indexing (e.g. clustering information as Jim pointed
> > out) could help make merging faster and more memory-efficient, but I
> > would really like to avoid making it a requirement for indexing
> > vectors as it also makes this feature much harder to use.
> >
> > On Sat, Apr 8, 2023 at 9:28?PM Alessandro Benedetti
> > <a.benedetti@sease.io> wrote:
> > >
> > > I am very attentive to listen opinions but I am un-convinced here and
> I an not sure that a single person opinion should be allowed to be
> detrimental for such an important project.
> > >
> > > The limit as far as I know is literally just raising an exception.
> > > Removing it won't alter in any way the current performance for users
> in low dimensional space.
> > > Removing it will just enable more users to use Lucene.
> > >
> > > If new users in certain situations will be unhappy with the
> performance, they may contribute improvements.
> > > This is how you make progress.
> > >
> > > If it's a reputation thing, trust me that not allowing users to play
> with high dimensional space will equally damage it.
> > >
> > > To me it's really a no brainer.
> > > Removing the limit and enable people to use high dimensional vectors
> will take minutes.
> > > Improving the hnsw implementation can take months.
> > > Pick one to begin with...
> > >
> > > And there's no-one paying me here, no company interest whatsoever,
> actually I pay people to contribute, I am just convinced it's a good idea.
> > >
> > >
> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcmuir@gmail.com> wrote:
> > >>
> > >> I disagree with your categorization. I put in plenty of work and
> > >> experienced plenty of pain myself, writing tests and fighting these
> > >> issues, after i saw that, two releases in a row, vector indexing fell
> > >> over and hit integer overflows etc on small datasets:
> > >>
> > >> https://github.com/apache/lucene/pull/11905
> > >>
> > >> Attacking me isn't helping the situation.
> > >>
> > >> PS: when i said the "one guy who wrote the code" I didn't mean it in
> > >> any kind of demeaning fashion really. I meant to describe the current
> > >> state of usability with respect to indexing a few million docs with
> > >> high dimensions. You can scroll up the thread and see that at least
> > >> one other committer on the project experienced similar pain as me.
> > >> Then, think about users who aren't committers trying to use the
> > >> functionality!
> > >>
> > >> On Sat, Apr 8, 2023 at 12:51?PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> > >> >
> > >> > What you said about increasing dimensions requiring a bigger ram
> buffer on merge is wrong. That's the point I was trying to make. Your
> concerns about merge costs are not wrong, but your conclusion that we need
> to limit dimensions is not justified.
> > >> >
> > >> > You complain that hnsw sucks it doesn't scale, but when I show it
> scales linearly with dimension you just ignore that and complain about
> something entirely different.
> > >> >
> > >> > You demand that people run all kinds of tests to prove you wrong
> but when they do, you don't listen and you won't put in the work yourself
> or complain that it's too hard.
> > >> >
> > >> > Then you complain about people not meeting you half way. Wow
> > >> >
> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcmuir@gmail.com> wrote:
> > >> >>
> > >> >> On Sat, Apr 8, 2023 at 8:33?AM Michael Wechner
> > >> >> <michael.wechner@wyona.com> wrote:
> > >> >> >
> > >> >> > What exactly do you consider reasonable?
> > >> >>
> > >> >> Let's begin a real discussion by being HONEST about the current
> > >> >> status. Please put politically correct or your own company's wishes
> > >> >> aside, we know it's not in a good state.
> > >> >>
> > >> >> Current status is the one guy who wrote the code can set a
> > >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> > >> >> dimensions in HOURS (i didn't ask what hardware).
> > >> >>
> > >> >> My concerns are everyone else except the one guy, I want it to be
> > >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> > >> >> ram buffer and bigger heap to avoid OOM on merge.
> > >> >> It is also a permanent backwards compatibility decision, we have to
> > >> >> support it once we do this and we can't just say "oops" and flip it
> > >> >> back.
> > >> >>
> > >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> > >> >> avoid merges because they are so slow and it would be DAYS
> otherwise,
> > >> >> or if its to avoid merges so it doesn't hit OOM.
> > >> >> Also from personal experience, it takes trial and error (means
> > >> >> experiencing OOM on merge!!!) before you get those heap values
> correct
> > >> >> for your dataset. This usually means starting over which is
> > >> >> frustrating and wastes more time.
> > >> >>
> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter,
> seems
> > >> >> to me like its a good idea. maybe the multigigabyte ram buffer can
> be
> > >> >> avoided in this way and performance improved by writing bigger
> > >> >> segments with lucene's defaults. But this doesn't mean we can
> simply
> > >> >> ignore the horrors of what happens on merge. merging needs to
> scale so
> > >> >> that indexing really scales.
> > >> >>
> > >> >> At least it shouldnt spike RAM on trivial data amounts and cause
> OOM,
> > >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> > >> >> fashion when indexing.
> > >> >>
> > >> >>
> ---------------------------------------------------------------------
> > >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >> >>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: dev-help@lucene.apache.org
> > >>
> >
> >
> > --
> > Adrien
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>