Mailing List Archive

Disconnectedness in HNSW graphs in Lucene
Hi Lucene developers,

I work for Amazon Retail Product search and we are using Lucene KNN for
semantic search of products. We index product embeddings (vectors) into
lucene (hnsw graph) and search them by generating query embedding at
runtime. The product embeddings also receive regular updates and the index
geometry keeps changing because of merges.
We recently noticed that the hnsw graphs generated are not always strongly
connected and in worst case scenario some products may be undiscoverable.
Connectedness of Hierarchical graph can be complicated, so below I am
mentioning my experiment details.

- Experiment:
I took the Lucene indexes from our production servers and for each segment
(hnsw graph) I did following test.
At each level graph I took the same entry point, the entry point of HNSW
graph, checked how many nodes are reachable from this entrypoint. Note that
connectedness at each level was checked independently of other levels.
Sample code attached. My observations are as below.

- Observation :
1. Of all the graphs across all the segments, across 100s of indexes that I
considered, one graph for each "level" of HNSW, almost 18% of the graphs
had some disconnectedness.
2. Disconnectedness was observed at all the levels of HNSW graphs. We have
at most 3 levels in HNSW graphs.
3. percentage disconnectedness ranged from small fractions 0.000386% (1
disconnected out of 259342) to 3.7% (eg. 87 disconnected out of 2308).
In some extreme case the entry-point in zeroth level graph was disconnected
from rest of the graph making the %age disconnected as high as 99.9% (65
reachable nodes from EP out of 252275). But this does not necessarily mean
that the 99.9% of nodes were not discoverable, it just means that if
unluckily we end up on EP in the 0th level graph for a query, there can at
max be 65 nodes that can be reached. But had we diverted our path from EP
to some other node in the upper level graphs then may be more nodes be
discoverable via that node.

- What I Not Checked :
What I have not checked till now is the connectedness for the whole HNSW
graph including edges of all the levels.
Also, I have not checked the number of disconnected components in a graph.
I have just checked the number of connected nodes to the entry-point.

But irrespective of that, I think graphs should be strongly connected at
each level and disconnectedness if at all should be very very rare.

Thanks Kaival Parikh for discovering the issue in the first place and the
script for checking connectedness.

What do others think about this?

public class CheckHNSWConnectedness {
private static int getReachableNodes(HnswGraph graph, int level)
throws IOException {
Set<Integer> visited = new HashSet<>();
Stack<Integer> candidates = new Stack<>();
candidates.push(graph.entryNode());

while (!candidates.isEmpty()) {
int node = candidates.pop();

if (visited.contains(node)) {
continue;
}

visited.add(node);
graph.seek(level, node);

int friendOrd;
while ((friendOrd = graph.nextNeighbor()) != NO_MORE_DOCS) {
candidates.push(friendOrd);
}
}
return visited.size();
}

public static void checkConnected(String index, String hnswField)
throws IOException, NoSuchFieldException, IllegalAccessException {
try (FSDirectory dir = FSDirectory.open(Paths.get(index));
IndexReader indexReader = DirectoryReader.open(dir)) {
for (LeafReaderContext ctx : indexReader.leaves() ) {
KnnVectorsReader reader =
((PerFieldKnnVectorsFormat.FieldsReader) ((SegmentReader)
ctx.reader()).getVectorReader()).getFieldReader(hnswField);

if (reader != null) {
HnswGraph graph = ((Lucene95HnswVectorsReader)
reader).getGraph(hnswField);
for (int l = 0; l < graph.numLevels(); l++){
int reachableNodes = getReachableNodes(graph, l);
// int graphSize = graph.size(); // this gives
nodes at 0th level
int graphSize = graph.getNodesOnLevel(l).size();
System.out.printf("Level: %d, Actual Size:
%d, Reachable: %d, Not Reachable : %d\n", l, graphSize,
reachableNodes, (graphSize - reachableNodes));
}
}
}
}
}

public static void main(String[] args) throws IOException,
NoSuchFieldException, IllegalAccessException {
String index = args[0];
String field = args[1];
System.out.println("For index " + index + " field : " + field);
checkConnected(index, field);
}




--
Regards,
Nitiraj
Re: Disconnectedness in HNSW graphs in Lucene [ In reply to ]
Nitiraj,

Good experimentation! Connectedness within layers is indeed important. The
algorithm itself should ensure connectedness of disjoint NSWs as it
mutually connects nodes (selected over diversity).

However, if the data is extremely clustered, this can cause connectedness
to drop (few densely packed clusters may not connect to other densely
packed clusters).

For your extreme examples, is the data densely clustered?

What would you suggest as an improvement in Lucene regarding the algorithm
implementation?

An interesting experiment would be to see if `hnswlib` has the same
connected issues if it indexes the same vectors in the same order.

Thanks!

Ben Trent



On Wed, Aug 23, 2023 at 5:07?AM Nitiraj Singh Rathore <
nitiraj.rathore@gmail.com> wrote:

> Hi Lucene developers,
>
> I work for Amazon Retail Product search and we are using Lucene KNN for
> semantic search of products. We index product embeddings (vectors) into
> lucene (hnsw graph) and search them by generating query embedding at
> runtime. The product embeddings also receive regular updates and the index
> geometry keeps changing because of merges.
> We recently noticed that the hnsw graphs generated are not always strongly
> connected and in worst case scenario some products may be undiscoverable.
> Connectedness of Hierarchical graph can be complicated, so below I am
> mentioning my experiment details.
>
> - Experiment:
> I took the Lucene indexes from our production servers and for each segment
> (hnsw graph) I did following test.
> At each level graph I took the same entry point, the entry point of HNSW
> graph, checked how many nodes are reachable from this entrypoint. Note that
> connectedness at each level was checked independently of other levels.
> Sample code attached. My observations are as below.
>
> - Observation :
> 1. Of all the graphs across all the segments, across 100s of indexes
> that I considered, one graph for each "level" of HNSW, almost 18% of the
> graphs had some disconnectedness.
> 2. Disconnectedness was observed at all the levels of HNSW graphs. We have
> at most 3 levels in HNSW graphs.
> 3. percentage disconnectedness ranged from small fractions 0.000386% (1
> disconnected out of 259342) to 3.7% (eg. 87 disconnected out of 2308).
> In some extreme case the entry-point in zeroth level graph was
> disconnected from rest of the graph making the %age disconnected as high as 99.9%
> (65 reachable nodes from EP out of 252275). But this does not necessarily
> mean that the 99.9% of nodes were not discoverable, it just means that if
> unluckily we end up on EP in the 0th level graph for a query, there can at
> max be 65 nodes that can be reached. But had we diverted our path from EP
> to some other node in the upper level graphs then may be more nodes be
> discoverable via that node.
>
> - What I Not Checked :
> What I have not checked till now is the connectedness for the whole HNSW
> graph including edges of all the levels.
> Also, I have not checked the number of disconnected components in a graph.
> I have just checked the number of connected nodes to the entry-point.
>
> But irrespective of that, I think graphs should be strongly connected at
> each level and disconnectedness if at all should be very very rare.
>
> Thanks Kaival Parikh for discovering the issue in the first place and the
> script for checking connectedness.
>
> What do others think about this?
>
> public class CheckHNSWConnectedness {
> private static int getReachableNodes(HnswGraph graph, int level) throws IOException {
> Set<Integer> visited = new HashSet<>();
> Stack<Integer> candidates = new Stack<>();
> candidates.push(graph.entryNode());
>
> while (!candidates.isEmpty()) {
> int node = candidates.pop();
>
> if (visited.contains(node)) {
> continue;
> }
>
> visited.add(node);
> graph.seek(level, node);
>
> int friendOrd;
> while ((friendOrd = graph.nextNeighbor()) != NO_MORE_DOCS) {
> candidates.push(friendOrd);
> }
> }
> return visited.size();
> }
>
> public static void checkConnected(String index, String hnswField) throws IOException, NoSuchFieldException, IllegalAccessException {
> try (FSDirectory dir = FSDirectory.open(Paths.get(index));
> IndexReader indexReader = DirectoryReader.open(dir)) {
> for (LeafReaderContext ctx : indexReader.leaves() ) {
> KnnVectorsReader reader = ((PerFieldKnnVectorsFormat.FieldsReader) ((SegmentReader) ctx.reader()).getVectorReader()).getFieldReader(hnswField);
>
> if (reader != null) {
> HnswGraph graph = ((Lucene95HnswVectorsReader) reader).getGraph(hnswField);
> for (int l = 0; l < graph.numLevels(); l++){
> int reachableNodes = getReachableNodes(graph, l);
> // int graphSize = graph.size(); // this gives nodes at 0th level
> int graphSize = graph.getNodesOnLevel(l).size();
> System.out.printf("Level: %d, Actual Size: %d, Reachable: %d, Not Reachable : %d\n", l, graphSize, reachableNodes, (graphSize - reachableNodes));
> }
> }
> }
> }
> }
>
> public static void main(String[] args) throws IOException, NoSuchFieldException, IllegalAccessException {
> String index = args[0];
> String field = args[1];
> System.out.println("For index " + index + " field : " + field);
> checkConnected(index, field);
> }
>
>
>
>
> --
> Regards,
> Nitiraj
>
Re: Disconnectedness in HNSW graphs in Lucene [ In reply to ]
Nitiraj,

Good experimentation! Connectedness within layers is indeed important. The
algorithm itself should ensure connectedness of disjoint NSWs as it
mutually connects nodes (selected over diversity).

However, if the data is extremely clustered, this can cause connectedness
to drop (few densely packed clusters may not connect to other densely
packed clusters).

For your extreme examples, is the data densely clustered?

What would you suggest as an improvement in Lucene regarding the algorithm
implementation?

An interesting experiment would be to see if `hnswlib` has the same
connected issues if it indexes the same vectors in the same order.

Thanks!

Ben Trent



On Wed, Aug 23, 2023 at 5:07?AM Nitiraj Singh Rathore <
nitiraj.rathore@gmail.com> wrote:

> Hi Lucene developers,
>
> I work for Amazon Retail Product search and we are using Lucene KNN for
> semantic search of products. We index product embeddings (vectors) into
> lucene (hnsw graph) and search them by generating query embedding at
> runtime. The product embeddings also receive regular updates and the index
> geometry keeps changing because of merges.
> We recently noticed that the hnsw graphs generated are not always strongly
> connected and in worst case scenario some products may be undiscoverable.
> Connectedness of Hierarchical graph can be complicated, so below I am
> mentioning my experiment details.
>
> - Experiment:
> I took the Lucene indexes from our production servers and for each segment
> (hnsw graph) I did following test.
> At each level graph I took the same entry point, the entry point of HNSW
> graph, checked how many nodes are reachable from this entrypoint. Note that
> connectedness at each level was checked independently of other levels.
> Sample code attached. My observations are as below.
>
> - Observation :
> 1. Of all the graphs across all the segments, across 100s of indexes
> that I considered, one graph for each "level" of HNSW, almost 18% of the
> graphs had some disconnectedness.
> 2. Disconnectedness was observed at all the levels of HNSW graphs. We have
> at most 3 levels in HNSW graphs.
> 3. percentage disconnectedness ranged from small fractions 0.000386% (1
> disconnected out of 259342) to 3.7% (eg. 87 disconnected out of 2308).
> In some extreme case the entry-point in zeroth level graph was
> disconnected from rest of the graph making the %age disconnected as high as 99.9%
> (65 reachable nodes from EP out of 252275). But this does not necessarily
> mean that the 99.9% of nodes were not discoverable, it just means that if
> unluckily we end up on EP in the 0th level graph for a query, there can at
> max be 65 nodes that can be reached. But had we diverted our path from EP
> to some other node in the upper level graphs then may be more nodes be
> discoverable via that node.
>
> - What I Not Checked :
> What I have not checked till now is the connectedness for the whole HNSW
> graph including edges of all the levels.
> Also, I have not checked the number of disconnected components in a graph.
> I have just checked the number of connected nodes to the entry-point.
>
> But irrespective of that, I think graphs should be strongly connected at
> each level and disconnectedness if at all should be very very rare.
>
> Thanks Kaival Parikh for discovering the issue in the first place and the
> script for checking connectedness.
>
> What do others think about this?
>
> public class CheckHNSWConnectedness {
> private static int getReachableNodes(HnswGraph graph, int level) throws IOException {
> Set<Integer> visited = new HashSet<>();
> Stack<Integer> candidates = new Stack<>();
> candidates.push(graph.entryNode());
>
> while (!candidates.isEmpty()) {
> int node = candidates.pop();
>
> if (visited.contains(node)) {
> continue;
> }
>
> visited.add(node);
> graph.seek(level, node);
>
> int friendOrd;
> while ((friendOrd = graph.nextNeighbor()) != NO_MORE_DOCS) {
> candidates.push(friendOrd);
> }
> }
> return visited.size();
> }
>
> public static void checkConnected(String index, String hnswField) throws IOException, NoSuchFieldException, IllegalAccessException {
> try (FSDirectory dir = FSDirectory.open(Paths.get(index));
> IndexReader indexReader = DirectoryReader.open(dir)) {
> for (LeafReaderContext ctx : indexReader.leaves() ) {
> KnnVectorsReader reader = ((PerFieldKnnVectorsFormat.FieldsReader) ((SegmentReader) ctx.reader()).getVectorReader()).getFieldReader(hnswField);
>
> if (reader != null) {
> HnswGraph graph = ((Lucene95HnswVectorsReader) reader).getGraph(hnswField);
> for (int l = 0; l < graph.numLevels(); l++){
> int reachableNodes = getReachableNodes(graph, l);
> // int graphSize = graph.size(); // this gives nodes at 0th level
> int graphSize = graph.getNodesOnLevel(l).size();
> System.out.printf("Level: %d, Actual Size: %d, Reachable: %d, Not Reachable : %d\n", l, graphSize, reachableNodes, (graphSize - reachableNodes));
> }
> }
> }
> }
> }
>
> public static void main(String[] args) throws IOException, NoSuchFieldException, IllegalAccessException {
> String index = args[0];
> String field = args[1];
> System.out.println("For index " + index + " field : " + field);
> checkConnected(index, field);
> }
>
>
>
>
> --
> Regards,
> Nitiraj
>
Re: Disconnectedness in HNSW graphs in Lucene [ In reply to ]
Thanks Benjamin for the reply and confirming that connected can be issue.( btw I am same Nitiraj, just using my apache.org email id from now on to communicate).

I will do some more experiment to reproduce the issue and see the connectedness across the graph and not just with the Entry point. But since these are our production indexes which receive fast updates and re-merges happens frequently I am not sure if it will be easy to reproduce. In that case I will just work with some particular indexes that show this behaviour.

I will also try to see if `hnswlib` shows such behaviour and if not what measures have been taken to ensure connectedness. Although, the paper does talk that heuristic helps in keeping the clustered graph connected (excerpt below), but it seems some improvement might be required. I will check the implementations in the `addDiverseNeighbors()`, `findWorstNonDiverse()` and `selectAndLinkDiverse()` method of lucene code and come up with something. At the same time may be just not removing some connections from the graph that can result in disconnectedness can help, but such a check can be very expensive.

Can I create a github issue for this and continue updating there?

-- Excerpt from Paper : https://arxiv.org/pdf/1603.09320.pdf
```
The relative neighborhood graph allows easily keeping the global connected component, even in case of highly clustered data (see Fig. 2 for illustration). Note that the heuristic creates extra edges compared to the exact relative neighborhood graphs, allowing controlling the number of the connections which is important for search performance.
```

On 2023/08/23 16:07:55 Benjamin Trent wrote:
> Nitiraj,
>
> Good experimentation! Connectedness within layers is indeed important. The
> algorithm itself should ensure connectedness of disjoint NSWs as it
> mutually connects nodes (selected over diversity).
>
> However, if the data is extremely clustered, this can cause connectedness
> to drop (few densely packed clusters may not connect to other densely
> packed clusters).
>
> For your extreme examples, is the data densely clustered?
>
> What would you suggest as an improvement in Lucene regarding the algorithm
> implementation?
>
> An interesting experiment would be to see if `hnswlib` has the same
> connected issues if it indexes the same vectors in the same order.
>
> Thanks!
>
> Ben Trent
>
>
>
> On Wed, Aug 23, 2023 at 5:07?AM Nitiraj Singh Rathore <
> nitiraj.rathore@gmail.com> wrote:
>
> > Hi Lucene developers,
> >
> > I work for Amazon Retail Product search and we are using Lucene KNN for
> > semantic search of products. We index product embeddings (vectors) into
> > lucene (hnsw graph) and search them by generating query embedding at
> > runtime. The product embeddings also receive regular updates and the index
> > geometry keeps changing because of merges.
> > We recently noticed that the hnsw graphs generated are not always strongly
> > connected and in worst case scenario some products may be undiscoverable.
> > Connectedness of Hierarchical graph can be complicated, so below I am
> > mentioning my experiment details.
> >
> > - Experiment:
> > I took the Lucene indexes from our production servers and for each segment
> > (hnsw graph) I did following test.
> > At each level graph I took the same entry point, the entry point of HNSW
> > graph, checked how many nodes are reachable from this entrypoint. Note that
> > connectedness at each level was checked independently of other levels.
> > Sample code attached. My observations are as below.
> >
> > - Observation :
> > 1. Of all the graphs across all the segments, across 100s of indexes
> > that I considered, one graph for each "level" of HNSW, almost 18% of the
> > graphs had some disconnectedness.
> > 2. Disconnectedness was observed at all the levels of HNSW graphs. We have
> > at most 3 levels in HNSW graphs.
> > 3. percentage disconnectedness ranged from small fractions 0.000386% (1
> > disconnected out of 259342) to 3.7% (eg. 87 disconnected out of 2308).
> > In some extreme case the entry-point in zeroth level graph was
> > disconnected from rest of the graph making the %age disconnected as high as 99.9%
> > (65 reachable nodes from EP out of 252275). But this does not necessarily
> > mean that the 99.9% of nodes were not discoverable, it just means that if
> > unluckily we end up on EP in the 0th level graph for a query, there can at
> > max be 65 nodes that can be reached. But had we diverted our path from EP
> > to some other node in the upper level graphs then may be more nodes be
> > discoverable via that node.
> >
> > - What I Not Checked :
> > What I have not checked till now is the connectedness for the whole HNSW
> > graph including edges of all the levels.
> > Also, I have not checked the number of disconnected components in a graph.
> > I have just checked the number of connected nodes to the entry-point.
> >
> > But irrespective of that, I think graphs should be strongly connected at
> > each level and disconnectedness if at all should be very very rare.
> >
> > Thanks Kaival Parikh for discovering the issue in the first place and the
> > script for checking connectedness.
> >
> > What do others think about this?
> >
> > public class CheckHNSWConnectedness {
> > private static int getReachableNodes(HnswGraph graph, int level) throws IOException {
> > Set<Integer> visited = new HashSet<>();
> > Stack<Integer> candidates = new Stack<>();
> > candidates.push(graph.entryNode());
> >
> > while (!candidates.isEmpty()) {
> > int node = candidates.pop();
> >
> > if (visited.contains(node)) {
> > continue;
> > }
> >
> > visited.add(node);
> > graph.seek(level, node);
> >
> > int friendOrd;
> > while ((friendOrd = graph.nextNeighbor()) != NO_MORE_DOCS) {
> > candidates.push(friendOrd);
> > }
> > }
> > return visited.size();
> > }
> >
> > public static void checkConnected(String index, String hnswField) throws IOException, NoSuchFieldException, IllegalAccessException {
> > try (FSDirectory dir = FSDirectory.open(Paths.get(index));
> > IndexReader indexReader = DirectoryReader.open(dir)) {
> > for (LeafReaderContext ctx : indexReader.leaves() ) {
> > KnnVectorsReader reader = ((PerFieldKnnVectorsFormat.FieldsReader) ((SegmentReader) ctx.reader()).getVectorReader()).getFieldReader(hnswField);
> >
> > if (reader != null) {
> > HnswGraph graph = ((Lucene95HnswVectorsReader) reader).getGraph(hnswField);
> > for (int l = 0; l < graph.numLevels(); l++){
> > int reachableNodes = getReachableNodes(graph, l);
> > // int graphSize = graph.size(); // this gives nodes at 0th level
> > int graphSize = graph.getNodesOnLevel(l).size();
> > System.out.printf("Level: %d, Actual Size: %d, Reachable: %d, Not Reachable : %d\n", l, graphSize, reachableNodes, (graphSize - reachableNodes));
> > }
> > }
> > }
> > }
> > }
> >
> > public static void main(String[] args) throws IOException, NoSuchFieldException, IllegalAccessException {
> > String index = args[0];
> > String field = args[1];
> > System.out.println("For index " + index + " field : " + field);
> > checkConnected(index, field);
> > }
> >
> >
> >
> >
> > --
> > Regards,
> > Nitiraj
> >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Disconnectedness in HNSW graphs in Lucene [ In reply to ]
> Can I create a github issue for this and continue updating there?

I think that would be great. If y'all are suffering from this and
discovered an issue, others are unknowingly having the same issue. Having
tools to discover it and fix it will make Lucene better. Hopefully we find
a bug and can address it!

One thing I know we do is prune connections to ensure we have at most
`maxConn` (`m` in the paper) connections. Even after diverse addition of
backlinks and neighbors. The paper hints at a "keep pruned connections"
option, but I don't think any HNSW implementation actually has that option.


Another experiment would be increasing y'alls `maxConn` (default of `16`) &
`beamWidth` (default of `100`). Since this is a production system, you may
want a fix sooner rather than later. Increasing these might alleviate some
of the disconnectedness, but it would increase indexing cost.

I realize such experiments might be expensive in light of y'alls search &
indexing load.

Thanks!

Ben Trent

On Thu, Aug 24, 2023 at 4:03?AM Nitiraj Rathore <nitiraj@apache.org> wrote:

> Thanks Benjamin for the reply and confirming that connected can be issue.(
> btw I am same Nitiraj, just using my apache.org email id from now on to
> communicate).
>
> I will do some more experiment to reproduce the issue and see the
> connectedness across the graph and not just with the Entry point. But since
> these are our production indexes which receive fast updates and re-merges
> happens frequently I am not sure if it will be easy to reproduce. In that
> case I will just work with some particular indexes that show this behaviour.
>
> I will also try to see if `hnswlib` shows such behaviour and if not what
> measures have been taken to ensure connectedness. Although, the paper does
> talk that heuristic helps in keeping the clustered graph connected (excerpt
> below), but it seems some improvement might be required. I will check the
> implementations in the `addDiverseNeighbors()`, `findWorstNonDiverse()` and
> `selectAndLinkDiverse()` method of lucene code and come up with something.
> At the same time may be just not removing some connections from the graph
> that can result in disconnectedness can help, but such a check can be very
> expensive.
>
> Can I create a github issue for this and continue updating there?
>
> -- Excerpt from Paper : https://arxiv.org/pdf/1603.09320.pdf
> ```
> The relative neighborhood graph allows easily keeping the global connected
> component, even in case of highly clustered data (see Fig. 2 for
> illustration). Note that the heuristic creates extra edges compared to the
> exact relative neighborhood graphs, allowing controlling the number of the
> connections which is important for search performance.
> ```
>
> On 2023/08/23 16:07:55 Benjamin Trent wrote:
> > Nitiraj,
> >
> > Good experimentation! Connectedness within layers is indeed important.
> The
> > algorithm itself should ensure connectedness of disjoint NSWs as it
> > mutually connects nodes (selected over diversity).
> >
> > However, if the data is extremely clustered, this can cause connectedness
> > to drop (few densely packed clusters may not connect to other densely
> > packed clusters).
> >
> > For your extreme examples, is the data densely clustered?
> >
> > What would you suggest as an improvement in Lucene regarding the
> algorithm
> > implementation?
> >
> > An interesting experiment would be to see if `hnswlib` has the same
> > connected issues if it indexes the same vectors in the same order.
> >
> > Thanks!
> >
> > Ben Trent
> >
> >
> >
> > On Wed, Aug 23, 2023 at 5:07?AM Nitiraj Singh Rathore <
> > nitiraj.rathore@gmail.com> wrote:
> >
> > > Hi Lucene developers,
> > >
> > > I work for Amazon Retail Product search and we are using Lucene KNN for
> > > semantic search of products. We index product embeddings (vectors) into
> > > lucene (hnsw graph) and search them by generating query embedding at
> > > runtime. The product embeddings also receive regular updates and the
> index
> > > geometry keeps changing because of merges.
> > > We recently noticed that the hnsw graphs generated are not always
> strongly
> > > connected and in worst case scenario some products may be
> undiscoverable.
> > > Connectedness of Hierarchical graph can be complicated, so below I am
> > > mentioning my experiment details.
> > >
> > > - Experiment:
> > > I took the Lucene indexes from our production servers and for each
> segment
> > > (hnsw graph) I did following test.
> > > At each level graph I took the same entry point, the entry point of
> HNSW
> > > graph, checked how many nodes are reachable from this entrypoint. Note
> that
> > > connectedness at each level was checked independently of other levels.
> > > Sample code attached. My observations are as below.
> > >
> > > - Observation :
> > > 1. Of all the graphs across all the segments, across 100s of indexes
> > > that I considered, one graph for each "level" of HNSW, almost 18% of
> the
> > > graphs had some disconnectedness.
> > > 2. Disconnectedness was observed at all the levels of HNSW graphs. We
> have
> > > at most 3 levels in HNSW graphs.
> > > 3. percentage disconnectedness ranged from small fractions 0.000386% (1
> > > disconnected out of 259342) to 3.7% (eg. 87 disconnected out of 2308).
> > > In some extreme case the entry-point in zeroth level graph was
> > > disconnected from rest of the graph making the %age disconnected as
> high as 99.9%
> > > (65 reachable nodes from EP out of 252275). But this does not
> necessarily
> > > mean that the 99.9% of nodes were not discoverable, it just means that
> if
> > > unluckily we end up on EP in the 0th level graph for a query, there
> can at
> > > max be 65 nodes that can be reached. But had we diverted our path from
> EP
> > > to some other node in the upper level graphs then may be more nodes be
> > > discoverable via that node.
> > >
> > > - What I Not Checked :
> > > What I have not checked till now is the connectedness for the whole
> HNSW
> > > graph including edges of all the levels.
> > > Also, I have not checked the number of disconnected components in a
> graph.
> > > I have just checked the number of connected nodes to the entry-point.
> > >
> > > But irrespective of that, I think graphs should be strongly connected
> at
> > > each level and disconnectedness if at all should be very very rare.
> > >
> > > Thanks Kaival Parikh for discovering the issue in the first place and
> the
> > > script for checking connectedness.
> > >
> > > What do others think about this?
> > >
> > > public class CheckHNSWConnectedness {
> > > private static int getReachableNodes(HnswGraph graph, int level)
> throws IOException {
> > > Set<Integer> visited = new HashSet<>();
> > > Stack<Integer> candidates = new Stack<>();
> > > candidates.push(graph.entryNode());
> > >
> > > while (!candidates.isEmpty()) {
> > > int node = candidates.pop();
> > >
> > > if (visited.contains(node)) {
> > > continue;
> > > }
> > >
> > > visited.add(node);
> > > graph.seek(level, node);
> > >
> > > int friendOrd;
> > > while ((friendOrd = graph.nextNeighbor()) != NO_MORE_DOCS)
> {
> > > candidates.push(friendOrd);
> > > }
> > > }
> > > return visited.size();
> > > }
> > >
> > > public static void checkConnected(String index, String hnswField)
> throws IOException, NoSuchFieldException, IllegalAccessException {
> > > try (FSDirectory dir = FSDirectory.open(Paths.get(index));
> > > IndexReader indexReader = DirectoryReader.open(dir)) {
> > > for (LeafReaderContext ctx : indexReader.leaves() ) {
> > > KnnVectorsReader reader =
> ((PerFieldKnnVectorsFormat.FieldsReader) ((SegmentReader)
> ctx.reader()).getVectorReader()).getFieldReader(hnswField);
> > >
> > > if (reader != null) {
> > > HnswGraph graph = ((Lucene95HnswVectorsReader)
> reader).getGraph(hnswField);
> > > for (int l = 0; l < graph.numLevels(); l++){
> > > int reachableNodes = getReachableNodes(graph,
> l);
> > > // int graphSize = graph.size(); // this gives
> nodes at 0th level
> > > int graphSize =
> graph.getNodesOnLevel(l).size();
> > > System.out.printf("Level: %d, Actual Size:
> %d, Reachable: %d, Not Reachable : %d\n", l, graphSize, reachableNodes,
> (graphSize - reachableNodes));
> > > }
> > > }
> > > }
> > > }
> > > }
> > >
> > > public static void main(String[] args) throws IOException,
> NoSuchFieldException, IllegalAccessException {
> > > String index = args[0];
> > > String field = args[1];
> > > System.out.println("For index " + index + " field : " + field);
> > > checkConnected(index, field);
> > > }
> > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > > Nitiraj
> > >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>