Mailing List Archive: Idea about faster vector format merge

Hi Folks

I've talked with Mike Sokolov and learnt some KNN knowledge from him (thank
you!) during ApacheCon and one thing I learnt was that our KNN
implementation was kind of suffering from long merging time because we
currently rebuild the graph from scratch every time we merge. I noticed
there's one effort that is trying to reuse a graph from one segment to save
part of the time: https://github.com/apache/lucene/issues/11354.

But I wonder whether it makes sense for us to take a step even further: to
be able to delay the HNSW graph merge or only do partial merge and allow
multiple HNSW graphs stay in one segment? For example, if we're merging 8
equal sized segments and we can tolerate up to 4 hnsw graphs, then we only
need to re-insert half of the documents (after we're able to reuse old
graphs). This could slow down the search within the segment by a factor of
logK, but could potentially save a lot of merging time, especially when the
merge policy is aggressive?

Just want to throw this idea out and please feel free to comment!

Best
Patrick

Hi Patrick, this seems like an important topic to explore!

Speeding up merging by allowing multiple graphs per segment could be an
acceptable compromise. We'd probably have to run a large-scale experiment
to get a good handle on the performance trade-off. My intuition though is
that it could significantly hurt search latency: the HNSW algorithm has
roughly logarithmic complexity, so it's always better to search a single
large graph rather than several smaller ones sequentially. For example, if
a single large graph gave ~log n, then searching 4 small ones would be ~4 *
log(n/4), which is roughly 4 times the cost.

Another option is to still merge all graphs into one, but in a way that
mostly just reuses the existing segment graphs. I'm mentioning this option
for completeness, but don't have a clear idea of how it would work. There
is limited academic work on merging nearest neighbor graphs (I only found
https://arxiv.org/pdf/1908.00814.pdf, which is interesting but not clear if
it could be adapted to HNSW?)

Julie

On Tue, Oct 18, 2022 at 9:43 PM Patrick Zhai <zhai7631@gmail.com> wrote:

> Hi Folks
>
> I've talked with Mike Sokolov and learnt some KNN knowledge from him
> (thank you!) during ApacheCon and one thing I learnt was that our KNN
> implementation was kind of suffering from long merging time because we
> currently rebuild the graph from scratch every time we merge. I noticed
> there's one effort that is trying to reuse a graph from one segment to save
> part of the time: https://github.com/apache/lucene/issues/11354.
>
> But I wonder whether it makes sense for us to take a step even further: to
> be able to delay the HNSW graph merge or only do partial merge and allow
> multiple HNSW graphs stay in one segment? For example, if we're merging 8
> equal sized segments and we can tolerate up to 4 hnsw graphs, then we only
> need to re-insert half of the documents (after we're able to reuse old
> graphs). This could slow down the search within the segment by a factor of
> logK, but could potentially save a lot of merging time, especially when the
> merge policy is aggressive?
>
> Just want to throw this idea out and please feel free to comment!
>
> Best
> Patrick
>