Mailing List Archive: HNSW and Multi-segments

Hi, Lucene Developers:

I'm studying the HNSW source code and have some questions about how
Lucene's multiple segments interact with HNSW.

First, my current understanding:
1. At indexing time, when two segments are merged, the HNSW graph can be
rebuilt from the docs and vectors of both segments.
2. At search time, each segment's graph is loaded separately.
There is no way to merge graphs from multiple segments while reading
the index.
Please let me know if I have misunderstood anything.

Since HNSW is a graph, the connections between nodes could matter a lot.
I can imagine some pros and cons here.
1. Splitting the docs into multiple separate graphs could improve
diversity, because more docs are retrieved.
For example, with just a single graph, some docs could be too far down the
neighbor list to be retrieved. One way to mitigate this is to divide
the docs into multiple graphs.
It could also help performance.

2. However, too many segments could cause other issues,
for example retrieving too many irrelevant docs, especially if a segment
contains only a few docs.

So I think the number of segments and the size of the graphs could have a
real impact on retrieval quality and performance.
I'm wondering whether there are any best practices, e.g. how many docs
should be in a single graph?
Or does anyone have production experience to share?

Thanks & Regards
MyCoy

The way I think of this is that segmenting the graph will generally
lead to higher recall and higher costs (at query time) for a given set
of HNSW parameters. Indexing costs will tend to be lower for multiple
segmented graphs. I don't think that an increase in irrelevant docs
should be a concern, since after collecting from multiple segments (which,
by the way, can be done concurrently), the results are merged and sorted
by score.
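
To make the merge step concrete, here is a minimal sketch of combining
per-segment top-k hits into one global top-k by score. It is plain Java
and does not use Lucene's own classes; SegmentMerge, Hit, and mergeTopK
are hypothetical names chosen for illustration, and a higher score is
assumed to mean a closer vector.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustration only: not Lucene's implementation.
final class SegmentMerge {
    /** One hit from one segment: a doc id and its similarity score (higher is better). */
    record Hit(int docId, float score) {}

    /** Merges the per-segment result lists into a single top-k list, best score first. */
    static List<Hit> mergeTopK(List<List<Hit>> perSegment, int k) {
        Comparator<Hit> byScoreAscending = (a, b) -> Float.compare(a.score(), b.score());
        // Min-heap on score: the head is always the weakest hit currently kept.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScoreAscending);
        for (List<Hit> segmentHits : perSegment) {
            for (Hit hit : segmentHits) {
                heap.offer(hit);
                if (heap.size() > k) {
                    heap.poll(); // drop the lowest-scoring hit seen so far
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(byScoreAscending.reversed()); // best score first
        return merged;
    }
}
```

With S segments each returning up to k hits, the merge sees up to S * k
candidates but keeps only the k best, so a low-scoring hit from a small
segment is dropped whenever better hits exist elsewhere.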

On Thu, Nov 3, 2022 at 2:38 PM MyCoy Z <mycoy.zhang@gmail.com> wrote:
> [...]
