Mailing List Archive

Idea about faster vector format merge
Hi Folks

I've talked with Mike Sokolov and learnt some KNN knowledge from him (thank
you!) during ApacheCon and one thing I learnt was that our KNN
implementation was kind of suffering from long merging time because we
currently rebuild the graph from scratch every time we merge. I noticed
there's one effort that is trying to reuse a graph from one segment to save
part of the time: https://github.com/apache/lucene/issues/11354.

But I wonder whether it makes sense for us to take a step even further: to
be able to delay the HNSW graph merge or only do partial merge and allow
multiple HNSW graphs stay in one segment? For example, if we're merging 8
equal sized segments and we can tolerate up to 4 hnsw graphs, then we only
need to re-insert half of the documents (after we're able to reuse old
graphs). This could slow down the search within the segment by a factor of
logK, but could potentially save a lot of merging time, especially when the
merge policy is aggressive?

Just want to throw this idea out and please feel free to comment!

Best
Patrick
Re: Idea about faster vector format merge [ In reply to ]
Hi Patrick, this seems like an important topic to explore!

Speeding up merging by allowing multiple graphs per segment could be an
acceptable compromise. We'd probably have to run a large-scale experiment
to get a good handle on the performance trade-off. My intuition though is
that it could significantly hurt search latency: the HNSW algorithm has
roughly logarithmic complexity, so it's always better to search a single
large graph rather than several smaller ones sequentially. For example, if
a single large graph gave ~log n, then searching 4 small ones would be ~4 *
log(n/4), which is roughly 4 times the cost.

Another option is to still merge all graphs into one, but in a way that
mostly just reuses the existing segment graphs. I'm mentioning this option
for completeness, but don't have a clear idea of how it would work. There
is limited academic work on merging nearest neighbor graphs (I only found
https://arxiv.org/pdf/1908.00814.pdf, which is interesting but not clear if
it could be adapted to HNSW?)

Julie

On Tue, Oct 18, 2022 at 9:43 PM Patrick Zhai <zhai7631@gmail.com> wrote:

> Hi Folks
>
> I've talked with Mike Sokolov and learnt some KNN knowledge from him
> (thank you!) during ApacheCon and one thing I learnt was that our KNN
> implementation was kind of suffering from long merging time because we
> currently rebuild the graph from scratch every time we merge. I noticed
> there's one effort that is trying to reuse a graph from one segment to save
> part of the time: https://github.com/apache/lucene/issues/11354.
>
> But I wonder whether it makes sense for us to take a step even further: to
> be able to delay the HNSW graph merge or only do partial merge and allow
> multiple HNSW graphs stay in one segment? For example, if we're merging 8
> equal sized segments and we can tolerate up to 4 hnsw graphs, then we only
> need to re-insert half of the documents (after we're able to reuse old
> graphs). This could slow down the search within the segment by a factor of
> logK, but could potentially save a lot of merging time, especially when the
> merge policy is aggressive?
>
> Just want to throw this idea out and please feel free to comment!
>
> Best
> Patrick
>