The conversation below is taken from off-list email. It seemed
generally interesting, though...
Horms wrote:
> ... I have however noticed that as heartbeat keeps state of
> nodes, and not resource allocations it is possible to get
> into a state where no nodes/more than one node have a
> resource. In particular if there is a communication medium
> failure, or if heartbeat is started up on more than one node
> simultaneously. I have been thinking of some fairly simple
> mechanisms to resolve this, vis a vis nodes requesting
> ownership of a resource. I am wondering what your thoughts
> are. I am most concerned about the (simple) two-node case,
> though something that extends beyond that would be nice.
The folks from Conectiva are doing something in a related area. In the
current code, the assumption is that if the master for a resource is up,
it has control of the resources it is listed as master for. They break
that assumption with a new feature (nice_failover?). It would be good
to add your thoughts and observations to that, and think about the right
way of thinking about this stuff. Once one has the right mental model,
the code is easy :-)
There is a mechanism right now for a node to make a cluster-request to
get ownership of a resource group. There is a way to tell if a node
owns a particular resource, but there is no cluster-request to ask the
cluster which node owns a particular resource. Obviously there is an
auditing problem that goes with it as well. For auditing, every node
should answer "yes" or "no", rather than having only the owning node
answer "yes" while everyone else stays silent.
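To make the auditing idea concrete, here is a minimal sketch of how the replies to such an ownership query might be tallied. It is not heartbeat code: the function name audit_resource, the status labels, and the convention that a node which never answered is recorded as None are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def audit_resource(replies: Dict[str, Optional[bool]]) -> Tuple[str, List[str]]:
    """Classify the ownership of one resource from per-node replies.

    replies maps node name -> True ("yes, I own it"), False ("no"),
    or None (hypothetical marker for a node that gave no answer).
    Returns (status, owners), where status is one of:
      "conflict"   - more than one node claims the resource
      "incomplete" - some node was silent, so the result cannot be trusted
      "orphaned"   - every node answered "no"
      "ok"         - exactly one node answered "yes", all others "no"
    """
    owners = sorted(node for node, reply in replies.items() if reply is True)
    silent = [node for node, reply in replies.items() if reply is None]

    if len(owners) > 1:
        # Two "yes" answers are a conflict no matter who else is silent.
        return "conflict", owners
    if silent:
        # A silent node might also hold the resource (e.g. after a
        # communication failure), so silence cannot be read as "no".
        return "incomplete", owners
    if not owners:
        return "orphaned", owners
    return "ok", owners


if __name__ == "__main__":
    print(audit_resource({"node1": True, "node2": False}))
    print(audit_resource({"node1": True, "node2": True}))
    print(audit_resource({"node1": False, "node2": False}))
    print(audit_resource({"node1": True, "node2": None}))
```

The point of requiring explicit "no" answers shows up in the "incomplete" case: with only "yes" answers on the wire, a silent node is indistinguishable from a node that does not own the resource, which is exactly the situation in the two-node communication failure described above.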
This is also related to the cluster partitioning problem, in that you
need resource auditing to recover from a partitioned cluster. So, these
three things are related to each other and the concept of resource
ownership.
More thoughts?
-- Alan Robertson
alanr@suse.com