Hi Lucene developers,
I'm studying Lucene's HNSW source code and have some questions about how
HNSW interacts with Lucene's multi-segment index structure.
First, here is my current understanding:
1. At index time, when two segments are merged, the HNSW graph can be
rebuilt from the docs and vectors of both segments.
2. At search time, however, each segment's graph is loaded separately.
There is no way to merge the graphs of multiple segments while reading
the index.
Please let me know if I have misunderstood anything.
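To make point 2 concrete, here is a minimal sketch of how I picture a
multi-segment search. It is plain Python, not Lucene code: the function
names and the toy data are hypothetical, and a brute-force scan stands in
for the per-segment HNSW graph traversal.

```python
import heapq
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_segment(segment, query, k):
    """Return the k nearest (distance, doc_id) pairs within ONE segment.

    ASSUMPTION: brute force stands in for the segment's HNSW graph search.
    """
    scored = ((distance(vec, query), doc_id) for doc_id, vec in segment)
    return heapq.nsmallest(k, scored)


def search_index(segments, query, k):
    """Search every segment separately, then merge the per-segment hits."""
    per_segment_hits = [search_segment(seg, query, k) for seg in segments]
    all_hits = (hit for hits in per_segment_hits for hit in hits)
    return heapq.nsmallest(k, all_hits)


# Two hypothetical segments of (doc_id, vector) pairs.
segment_a = [(0, (0.0, 0.0)), (1, (1.0, 1.0)), (2, (5.0, 5.0))]
segment_b = [(3, (0.5, 0.5)), (4, (4.0, 4.0))]

top = search_index([segment_a, segment_b], query=(0.4, 0.4), k=2)
print([doc_id for _, doc_id in top])  # prints [3, 0]
```

With an exact per-segment search like this one, the merged result is the
same as searching all docs in one place. So I would expect any quality
difference between one graph and many graphs to come from the approximate
graph traversal, not from the merge step itself.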
Since HNSW is a graph, the connections between nodes could matter a lot,
so I can imagine some pros and cons of splitting it across segments.
1. Splitting the docs into multiple separate graphs could help diversity
by retrieving more docs.
For example, with only a single graph, some docs could be too far down
the neighbor lists to be reached. One way to mitigate this is to divide
the docs into multiple graphs.
It could also help performance.
2. However, too many segments could cause other issues.
For example, too many irrelevant docs could be retrieved, especially if
a segment contains only a few docs.
So I think the number of segments and the size of each graph could have a
real impact on retrieval quality and performance.
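As a back-of-the-envelope illustration of the performance side, here is a
rough sketch. It ASSUMES that searching one graph of n docs costs about
log2(n) node visits (a common rule of thumb for HNSW, not something I have
measured in Lucene) and that docs are spread evenly across segments. The
function name is hypothetical.

```python
import math


def estimated_visits(total_docs, num_segments):
    """Rough number of graph nodes visited for one query.

    ASSUMPTION: one graph of n docs costs about log2(n) visits, and the
    docs are spread evenly across segments. Not measured in Lucene.
    """
    docs_per_segment = total_docs / num_segments
    return num_segments * math.log2(docs_per_segment)


# Compare the same 1M docs held in 1, 10, or 100 segments.
for segments in (1, 10, 100):
    print(segments, round(estimated_visits(1_000_000, segments), 1))
```

If that assumption holds, the total work per query grows with the number
of segments, although the segments could be searched concurrently, so
latency and total cost may behave differently.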
Is there any best practice here, e.g. how many docs should be in a single
graph?
Or does anyone have production experience to share?
Thanks & Regards
MyCoy