Mailing List Archive

Re: Multi-segments and HNSW
Hi, MyCoy.
I think these questions belong on the dev@ list. Please join it.

On Wed, Nov 2, 2022 at 12:57 AM MyCoy Z <mycoy.zhang@gmail.com> wrote:

> Hi:
>
> I'm studying the HNSW source code and have some questions regarding
> Lucene's multi-segments and HNSW.
>
> First, some of my understanding:
> 1. During indexing, when two segments are merged, the HNSW graph can be
> rebuilt from the docs and vectors in the two segments.
> 2. But while reading the index, each segment's graph is loaded separately.
> There is no way to merge multiple graphs.
> The search iterates over each segment separately.
> Please let me know if I have misunderstood anything.
>
>
> Since HNSW is a graph, the connections between the nodes could matter a
> lot.
> I can imagine some pros and cons here.
> 1. Splitting the docs into multiple separate graphs could improve
> diversity by retrieving more docs.
> For example, with just a single graph, some docs could be too far down the
> neighbor list to be retrieved. One way to mitigate this is to divide
> the docs into multiple graphs.
> It could also help performance.
>
> 2. However, too many segments could cause other issues.
> For example, too many irrelevant docs could be retrieved, especially if
> each segment contains only a few docs.
>
>
> So, I think the number of segments and the size of the graphs could have a
> real impact on retrieval quality and performance.
>
> I'm wondering if there is any best practice, e.g. how many docs should be
> in a single graph?
> Or does anyone have some production experience to share?
>
> Thanks & Regards
> MyCoy
>
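
To make point 2 of the quoted message concrete, here is a minimal Python sketch of searching each segment separately and then merging the hits. It is not Lucene code. It assumes each segment is a plain dict from doc id to vector, it uses an exact dot-product scan as a stand-in for the per-segment HNSW graph search, and the names search_segment and search_index are hypothetical.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors; higher means more similar here."""
    return sum(x * y for x, y in zip(a, b))


def search_segment(segment, query, k):
    """Return the k best (score, doc_id) hits within one segment.

    Assumption: an exact scan stands in for the approximate graph search.
    """
    scored = ((dot(vec, query), doc_id) for doc_id, vec in segment.items())
    return heapq.nlargest(k, scored)


def search_index(segments, query, k):
    """Search every segment on its own, then keep the k best hits overall."""
    candidates = []
    for segment in segments:
        # Each segment contributes up to k candidates of its own.
        candidates.extend(search_segment(segment, query, k))
    return heapq.nlargest(k, candidates)


# Two toy segments holding two docs each.
segments = [
    {0: (1.0, 0.0), 1: (0.0, 1.0)},
    {2: (0.9, 0.1), 3: (0.5, 0.5)},
]
top = search_index(segments, (1.0, 0.0), k=2)
print(top)
```

In this sketch, S segments produce up to S * k candidates before the final merge, which is one way to picture the "retrieving more docs" effect described in the quoted message. The search work is also repeated once per segment.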


--
Sincerely yours
Mikhail Khludnev