Hi folks,
I've been looking carefully at the heartbeat protocol and authentication
methods in the process of writing them up.
Although the current heartbeat code has some protection against replay
attacks, it appears that some fairly heavy-handed replay attacks will
succeed against it when combined with some method to cause nodes in the
cluster to crash on demand.
Even though this attack is somewhat obscure, I am interested in fixing it.
I went through a few iterations of potential fixes, and came up with
this challenge-response based approach which I would like for this
august body to critique...
-----------------------------------------------------------------------
In addition to the current non-persistent sequence number, each machine
will manage a persistent generation number which it increments every
time it restarts.
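
As a rough illustration of the persistent generation number, here is a minimal Python sketch. The file location, the function name bump_generation, and the write-then-rename strategy are all assumptions made for this example; nothing here is existing heartbeat code.

```python
import os
import tempfile


def bump_generation(path):
    """Increment the on-disk generation number and return the new value.

    Hypothetical helper: meant to be called once at startup, before any
    packets are sent. A missing or unreadable file counts as generation 0.
    """
    try:
        with open(path, "r") as f:
            current = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        current = 0

    new_value = current + 1

    # Write to a temporary file in the same directory, then rename it over
    # the old file, so a crash mid-write cannot leave a truncated number.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(f"{new_value}\n")
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
    return new_value
```

The point of persisting the number before sending anything is that a node can never come back up advertising a generation it has already used.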
When a new node is observed on the network, it is sent a unique
challenge based on both systems' generation numbers and the local time of
day. Until a good response is received, non-challenge/response packets
from that machine are dropped. Once an appropriate response is
received, normal message reception from the given machine is unblocked,
and the response will percolate up through the mythical API we've been
designing.
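
The exchange described above might look something like the following Python sketch. The proposal only says that the challenge is built from both generation numbers and the local time of day; the packet layout, the use of HMAC-SHA256 keyed with a shared cluster key for the response, and all function names are assumptions made for illustration.

```python
import hashlib
import hmac
import struct
import time


def make_challenge(local_gen, peer_gen, now=None):
    """Build a challenge from both generation numbers and the local time.

    Assumed layout: three unsigned 64-bit big-endian integers
    (local generation, peer generation, time in microseconds).
    """
    if now is None:
        now = time.time()
    return struct.pack("!QQQ", local_gen, peer_gen, int(now * 1_000_000))


def make_response(shared_key, challenge):
    """Answer a challenge (assumption: keyed HMAC over the challenge)."""
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()


def check_response(shared_key, challenge, response):
    """Return True if the response matches the challenge we issued."""
    expected = make_response(shared_key, challenge)
    return hmac.compare_digest(expected, response)
```

Because the challenge changes with every restart and with the clock, a response recorded from an earlier exchange should not match a freshly issued challenge, which is what blocks the replay.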
A "new" node is defined as one which satisfies any of the following
conditions:
  - Has never been heard from before now
  - Has an incremented generation number
  - Is currently marked dead
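
The three conditions above can be written as a single predicate. This is only a sketch: the PeerState record and the peers table are hypothetical data structures invented for the example, not the ones heartbeat actually uses.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PeerState:
    """What we remember about a peer (hypothetical structure)."""
    generation: int      # last generation number seen from this peer
    dead: bool = False   # True if the peer is currently marked dead


def is_new_node(peers: Dict[str, PeerState], name: str, generation: int) -> bool:
    """Return True if a packet's sender must be challenged before it is trusted."""
    peer = peers.get(name)
    if peer is None:
        return True                      # never heard from before now
    if generation > peer.generation:
        return True                      # incremented generation number
    return peer.dead                     # currently marked dead
```

A receiver would call this on every incoming packet and, when it returns True, send a challenge and drop everything except challenge/response traffic from that peer until a good response arrives.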
-----------------------------------------------------------------------
I believe this approach should be immune to replay attacks. As a note,
a machine which is marked dead but reappears without an incremented
gen# has not restarted, so its reappearance is the result of a cluster
merge (the aftermath of a cluster partition).
Clearly, this implementation will need extensive testing before being
put into the production code. We'll make sure the nice_failback code
and the API code go in before this does...
Thanks to Jerome for providing the impetus to look at the code again...
-- Alan Robertson
alanr@suse.com