Multi-segments and HNSW
Hi:

I'm studying the HNSW source code and have some questions about how
Lucene's multiple segments interact with HNSW.

First, my understanding so far:
1. At index time, when two segments are merged, the HNSW graph can be
rebuilt from the docs and vectors in the two segments.
2. At search time, each segment's graph is loaded separately. There is no
way to merge multiple graphs, so the search visits each segment in turn.
Please let me know if I have misunderstood anything.
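
To make point 2 concrete, here is a minimal sketch of the search flow as I
understand it. It is plain Java, not Lucene code: the Segment and Hit classes
and every method name are made up for illustration, and the per-segment
search is an exhaustive scan standing in for the HNSW graph traversal. Each
segment returns its own top k, and the per-segment results are then merged
into a global top k.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class MultiSegmentKnnSketch {

    /** Hypothetical hit: global doc id plus squared Euclidean distance to the query. */
    static final class Hit {
        final int docId;
        final double distance;
        Hit(int docId, double distance) { this.docId = docId; this.distance = distance; }
    }

    /** Hypothetical stand-in for one segment: a doc id base and the segment's own vectors. */
    static final class Segment {
        final int docBase;
        final float[][] vectors;
        Segment(int docBase, float[][] vectors) { this.docBase = docBase; this.vectors = vectors; }
    }

    static double squaredDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Keeps the k closest hits, returned nearest first. */
    static List<Hit> topK(List<Hit> hits, int k) {
        // Max-heap on distance: the head is always the farthest hit kept so far.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>((x, y) -> Double.compare(y.distance, x.distance));
        for (Hit h : hits) {
            heap.add(h);
            if (heap.size() > k) {
                heap.poll(); // drop the farthest hit
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort((x, y) -> Double.compare(x.distance, y.distance));
        return result;
    }

    /** Searches one segment in isolation. Assumption: exhaustive scan instead of a graph walk. */
    static List<Hit> searchSegment(Segment seg, float[] query, int k) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < seg.vectors.length; i++) {
            hits.add(new Hit(seg.docBase + i, squaredDistance(seg.vectors[i], query)));
        }
        return topK(hits, k);
    }

    /** Visits every segment separately, then merges the per-segment top k into a global top k. */
    static List<Hit> search(List<Segment> segments, float[] query, int k) {
        List<Hit> candidates = new ArrayList<>();
        for (Segment seg : segments) {
            candidates.addAll(searchSegment(seg, query, k));
        }
        return topK(candidates, k);
    }

    /** Small fixed example: two segments, query at the origin, k = 2. */
    static int[] demoDocIds() {
        Segment a = new Segment(0, new float[][] {{0f, 0f}, {1f, 0f}, {5f, 5f}});
        Segment b = new Segment(3, new float[][] {{0.1f, 0f}, {9f, 9f}});
        List<Hit> hits = search(Arrays.asList(a, b), new float[] {0f, 0f}, 2);
        int[] ids = new int[hits.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = hits.get(i).docId;
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demoDocIds())); // prints [0, 3]
    }
}
```

In this sketch, S segments contribute up to S * k candidates before the
merge, but only the k closest overall are returned. The far-away doc that
the second segment has to include in its own top 2 is dropped at the merge
step.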


Since HNSW is a graph, the connections between nodes could matter a lot.
I can imagine some pros and cons here.
1. Splitting the docs into multiple separate graphs could improve
diversity, because more docs get retrieved.
For example, with a single graph, some docs could be too far down the
neighbor list to be retrieved. One way to mitigate this is to divide
the docs into multiple graphs.
It could also improve performance.

2. However, too many segments could cause other issues.
For example, too many irrelevant docs could be retrieved, especially if a
segment contains only a few docs.


So I think the number of segments and the size of the graphs could have a
real impact on retrieval quality and performance.

Is there any best practice, e.g. how many docs should be in a single graph?
Or does anyone have production experience to share?

Thanks & Regards
MyCoy
Re: Multi-segments and HNSW
Hi, MyCoy.
I suppose these questions should go to the dev@ list. Please join it.

On Wed, Nov 2, 2022 at 12:57 AM MyCoy Z <mycoy.zhang@gmail.com> wrote:

> [...]


--
Sincerely yours
Mikhail Khludnev