re: how does this HNSW stuff scale - I think people are calling out
indexing memory usage here, so let's discuss some facts. During
initial indexing we hold in RAM all the vector data and the graph
constructed from the new documents, but this is accounted for and
limited by the size of IndexWriter's buffer; the document vectors and
their graph will be flushed to disk when this fills up, and at search
time, they are not read wholesale into RAM. There is potentially
unbounded RAM usage during merging though, because the entire merged
graph will be built in RAM. I lost track of how we handle the vector
data now, but at least in theory it should be fairly straightforward
to write the merged vector data in chunks using only limited RAM. So
how much RAM does the graph use? Roughly numDocs * fanout VInts - one
VInt-encoded neighbor id per graph edge.
Actually it doesn't really scale with the vector dimension at all -
rather it scales with the graph fanout (M) parameter and with the
total number of documents. So I think this focus on limiting the
vector dimension is not helping to address the concern about RAM usage
while merging.
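To make the claim above concrete, here is a rough back-of-envelope sketch (not Lucene's actual accounting; the M value, document count, and the ~3-byte average VInt size are illustrative assumptions):

```python
def hnsw_merge_graph_ram_bytes(num_docs: int, m: int,
                               avg_vint_bytes: float = 3.0) -> float:
    """Rough estimate of RAM held by a merged HNSW graph.

    Each document keeps up to `m` neighbor links (the graph fanout
    parameter M), and each neighbor id is VInt-encoded, averaging a
    few bytes. Note the estimate is independent of vector dimension.
    """
    return num_docs * m * avg_vint_bytes

# 10M docs with M=16: 10M * 16 * 3 bytes = ~480 MB, at any dimension
print(hnsw_merge_graph_ram_bytes(10_000_000, 16) / 1e6, "MB")
```

The point of the sketch is that the dimension never appears in the formula: doubling dimension from 512 to 1024 leaves this number unchanged, while doubling numDocs or M doubles it.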
The vector dimension does have a strong role in the search, and
indexing time, but the impact is linear in the dimension and won't
exhaust any limited resource.
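A minimal sketch of that linear relationship (the candidate count here is a hypothetical stand-in for however many graph nodes a search actually visits):

```python
def search_float_ops(num_candidates_visited: int, dim: int) -> int:
    """Per-query distance work during an HNSW search.

    Each visited candidate costs one distance computation of `dim`
    multiply-adds, so total work grows linearly with the vector
    dimension - a per-query CPU cost, not a resource that accumulates
    the way merge-time graph RAM does.
    """
    return num_candidates_visited * dim

# doubling the dimension doubles the search work, nothing worse
assert search_float_ops(1000, 1536) == 2 * search_float_ops(1000, 768)
```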
On Thu, Apr 6, 2023 at 5:48 AM Michael McCandless
<lucene@mikemccandless.com> wrote:
>
> > We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>
> In fact we must accept all vetos by any committer as a veto, for a change to Lucene's source code, regardless of that committer's reasoning. This is the power of Apache's model.
>
> Of course we all can and will work together to convince one another (this is where the scientifically motivated part comes in) to change our votes, one way or another.
>
> > I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024, which is allowed today.
>
> +1, if the current implementation really does not scale / needs more and more RAM for merging, let's understand what's going on here, first, before increasing limits. I rescind my hasty +1 for now!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Apr 5, 2023 at 11:22 AM Alessandro Benedetti <a.benedetti@sease.io> wrote:
>>
>> Ok, so what should we do then?
>> This space is moving fast, and in my opinion we should act fast to release and guarantee we attract as many users as possible.
>>
>> At the same time, I am not saying we should proceed blindly: if there's concrete evidence for setting one limit rather than another, or that a certain limit is detrimental to the project, I think that veto should be valid.
>>
>> We shouldn't accept weakly/not scientifically motivated vetos anyway right?
>>
>> The problem I see is that, more than voting, we first need to decide on this limit, and I don't know how we should operate.
>> I am imagining like a poll where each entry is a limit + motivation and PMCs maybe vote/add entries?
>>
>> Did anything similar happen in the past? How was the current limit added?
>>
>>
>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.weiss@gmail.com> wrote:
>>>
>>>
>>>>
>>>> Should we create a VOTE thread, where we propose some values with a justification and we vote?
>>>
>>>
>>> Technically, a vote thread won't help much if there's no full consensus - a single veto will make the patch unacceptable for merging.
>>> https://www.apache.org/foundation/voting.html#Veto
>>>
>>> Dawid
>>>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org