We're proposing this as an outline for how to organize
cluster i/o fencing. My immediate personal goal is to
prototype a natalie nfs agent. We've left quite
a number of open issues, for which I hope to see some
discussion. All comments are appreciated.
thanks!
-dB
I/O Fencing for Clusters
Generic Resource Intervention Tool Service (GRITS)
and
NFS Admin Tool And Location Intervention Extension (NATALIE)
----------------
David Brower (mailto:dbrower@us.oracle.com)
John Leys (mailto:jleys@us.oracle.com)
Gary Young (mailto:gdyoung@us.oracle.com)
History
Public version 0.1 1-Mar-00
Abstract
Cluster systems with shared resources, such as disk, need
"fencing" of those resources during and after membership
reconfigurations. There are no general solutions to providing a
mechanism for fencing in the Open Standards world, with existing
solutions tied tightly to particular membership services and i/o
systems. This note outlines the architecture of a generic service
(GRITS) for organizing fencing interactions between membership
services and resources, and a mechanism (NATALIE) by which NFS
services may be extended to become a safely fenceable resource
under GRITS. Other resources, such as shared-scsi disk drivers,
SAN switches and the like should be equally capable of becoming
GRITS-able partners. Because the solution is openly released, it
is hoped that system providers, SAN vendors, and the purveyors of
storage systems will incorporate appropriate agents, allowing for
reliable clusters with shared, fence-able resources.
GRITS Architecture
A GRITS cluster consists of:
Some number of nodes;
Some number of membership group/services, maintaining
quorum for the group. Groups have generation numbers of
at least 16 bits; wraps are allowed, and handled.
Some number of resources used by groups.
Nodes are identified by either IP address or resolvable name.
Resources are identified the same way, and are assumed to have at
least an IP capable proxy -- something that will respond to IP,
even if it needs to take some other path to an actual resource.
Each GRITS membership group has a configuration identifying the
resources that may be used by the group, including the destination
for GRITS control messages to the resource. The group service also
provides multiple access points for resources to query the group
when that is necessary.
Each GRITS resource has a configured list of groups and hosts
that may possibly access it. The configuration identifies the
destinations for queries by the resource to the group. The
resource itself has at least one identified access point for
control messages to it from the group services.
(The configurations of groups and resources are expected to be
slowly changing, and their control is not defined by GRITS.)
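For illustration only, the two configurations might be modeled as
records like the Python sketch below; the field names are placeholders
of ours, not anything GRITS defines.

    # Illustrative sketch of the group-side and resource-side configuration
    # records described above; field names are placeholders, not GRITS-defined.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class GroupResourceEntry:
        resource_name: str      # e.g. "N"
        control_url: str        # destination for GRITS control messages

    @dataclass
    class GroupConfig:
        group_name: str
        resources: List[GroupResourceEntry] = field(default_factory=list)

    @dataclass
    class ResourceAccessPoint:
        node_name: str          # potential member this entry covers
        query_url: str          # where the resource may query for current state
        boot_policy: str        # e.g. "r" or "none"; never writable

    @dataclass
    class ResourceConfig:
        resource_name: str
        export_path: str
        access_points: List[ResourceAccessPoint] = field(default_factory=list)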
GRITS controls access to resources by nodes depending on the group
membership. Access is either permitted or denied to whole nodes,
with no finer granularity. (While more control might be
desirable, it is hard to achieve, and is not addressed so as to
provide the simplest possible solution.)
Resources that must remain writable to all during cluster
transition, perhaps because they are used as part of the
membership quorum resolution, should not be under GRITS control.
At resource boot time, the resource examines the configuration,
and adopts an access posture towards the potential members of all
the groups. It does this first by consulting the configured boot
policy associated with each group member. Then it may use
GRITS-defined messaging to communicate with the membership service
to set the correct current access rights. Plausible initial
policies are "read only" and "no access"; some resources
may only be able to enforce "no access". Obviously, a "writable"
boot policy would be insane.
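As a sketch of that boot sequence, assuming the configuration records
sketched above and two caller-supplied callbacks (query_group to ask a
group access point for current state, apply_access to enforce a posture
at the resource), a resource agent might do roughly the following. The
mode strings ("r", "rw", "none") are an assumed vocabulary, not a GRITS
definition.

    # Sketch only: adopt the boot posture, then refine it from the group.
    def resource_boot(config, query_group, apply_access):
        # 1. Adopt the configured boot policy toward every potential member.
        for ap in config.access_points:
            apply_access(ap.node_name, ap.boot_policy)      # e.g. "r" or "none"
        # 2. Ask the configured access points, in order, for the current state.
        for ap in config.access_points:
            settings = query_group(ap.query_url)            # None if unreachable
            if settings is not None:
                for node, allowed in settings.items():
                    apply_access(node, "rw" if allowed else "none")
                return
        # No group answered: remain in the restrictive boot posture.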
Once an initial posture is established by a resource, membership
change events in the group drive GRITS control messages to all the
resources configured for the group. This will either deny access
to departing members, or allow access to continuing or joining
members.
At group boot time, the membership service must establish a quorum
without the use of any GRITS-controlled resources. Then it uses
GRITS to grant the members access to all the resources. The group
cannot proceed out of its reconfiguration until the unfencing of
all resources has been accomplished.
It is intended that "gritty" agents can be written and put in
place for:
- directly attached disks on shared SCSI. The agent would
communicate with some kernel-level code to manipulate the
SCSI reset and SCSI reserve to arbitrate access to the resource;
GRITS would talk to both sides to force the resource to a known state.
- SAN attached storage, where the agent could program tokens or
domains of the fabric to control access;
- NFS attached storage, where an agent could use NATALIE
capabilities to narrow access below that of the basic exports;
- SMB attached storage, where an agent could communicate to the
software doing the "sharing" to control access.
- General network attached storage, where control may be achieved
by filtering in a router or proxy between the nodes and the
resources.
- Worst Case group members, who will have a third party wired
to their reset buttons to force them to be "fenced." This
is an always correct final solution.
Mixtures of these agents may be needed, depending on the needs
and topology of the cluster in question. The resource providers
may or may not be on hosts that are part of any group in question.
Protocols
At the architectural level, GRITS is agnostic about the control
protocols. The service could be provided using a variety of
communication mechanisms. The messages are defined in terms of
verbs that may be bound to different techniques. In practice,
there will need to be some commonality of protocol. It will not
do to have a resource attempt to query a group using a protocol
the group does not support, nor can a group meaningfully send
a membership change to a resource without common ground.
Potential protocols include:
ONC RPC
CORBA
HTTP
HTTPS
COM/DCOM
SMB extensions
Exact bindings and support are an area for open discussion.
Security
Only "authorized" parties may be allowed to invoke the verbs.
This is handled, pitifully, by a "cookie", a shared secret between
resources and group services. A secure protocol will protect the
contents of the cookie, but is not an essential part of the
architecture. As is traditional in cluster discussions, we
presume for the moment that traffic between nodes and resources is
on a secure network.
It is also the case that only the current quorum-holding membership
service should be invoking commands. The approach taken is to
have GRITS have some knowledge about cluster epochs. (A cluster
epoch is an ever-increasing number bumped at the time the
membership is changed, sometimes called the cluster generation, or
cluster id.) Only messages from the latest epoch should be
obeyed. To do this, GRITS needs to be able to establish an
ordering on epochs. The protocol must also handle wraps of the
epoch, from a high value back to zero. This is not worked out.
There are also issues regarding the need for stable storage for
the epoch in resource agents. What epoch should they obey at
resource boot?
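One possible approach to the ordering question, offered only as a
sketch, is serial-number arithmetic over the generation space (16 bits
here, matching the "at least 16 bits" requirement above): an epoch
supersedes the stored one if it is ahead by less than half the space.
This settles ordering and wrap, though not the stable-storage question.

    # Sketch: wrap-tolerant ordering of cluster epochs (serial-number style).
    # The 16-bit width is an assumption matching the minimum stated above.
    EPOCH_BITS = 16
    EPOCH_MOD = 1 << EPOCH_BITS

    def epoch_is_newer(candidate, current):
        """True if candidate should supersede current, even across a wrap to zero."""
        return 0 < ((candidate - current) % EPOCH_MOD) < (EPOCH_MOD // 2)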
Resource Settings
A resourceSetting is a combination of
{ resource, node, allow|deny }
ResourceSettings are a list or array of resourceSettings to
cover a set of resource/node bindings.
Verbs
Resource to GroupService
SetMeSeymour( in resourceName, out resourceSettings );
GroupService to Resource
Set( in cookie, in groupGeneration, in resourceName,
in nodeName, in boolean allow );
SetAll( in cookie, in groupGeneration, in resourceName,
in boolean allow );
GetSettings( in resourceName, out resourceSettings );
The GRITS agent must remember the highest generation used, and
refuse operations from older generations. If the cookie provided
does not match the configured cookie, the set request will be
rejected. FIXME -- deal with generation wrap.
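A resource-side sketch of those checks follows, with one possible
answer to the wrap FIXME borrowed from the ordering sketch in the
Security section. The enforce() callback stands in for whatever
mechanism the agent actually controls; all names are illustrative.

    # Sketch of a GRITS resource agent's Set handling; names are illustrative.
    class GritsAgent:
        EPOCH_MOD = 1 << 16            # assumed generation width, as above

        def __init__(self, cookie, enforce):
            self.cookie = cookie       # shared secret from the resource configuration
            self.enforce = enforce     # enforce(node, allow) at the real resource
            self.highest_gen = None    # highest generation obeyed so far

        def _acceptable(self, gen):
            if self.highest_gen is None:
                return True
            diff = (gen - self.highest_gen) % self.EPOCH_MOD
            return diff == 0 or diff < self.EPOCH_MOD // 2   # same or newer epoch

        def set(self, cookie, generation, resource_name, node_name, allow):
            # resource_name says which export/path this applies to; a
            # single-resource agent may simply ignore it.
            if cookie != self.cookie:
                return "rejected: bad cookie"
            if not self._acceptable(generation):
                return "rejected: stale generation"
            self.highest_gen = generation
            # Must not return success until i/o from node_name is really cut
            # off (or granted), to the degree possible at this agent's level.
            self.enforce(node_name, allow)
            return "ok"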
When a Set operation completes, denying i/o from a host, it must
be guaranteed that all i/o from that host is over, and absolutely
no more will be done -- to the degree possible at the level of the
agent invoked. For example, an agent manipulating the disk driver
of an operating system kernel can rarely stop i/o that is queued in
the drive electronics.
Expected use
When the group service detects the death of a member, it uses
GRITS to fence it off, by calling Set for all resources in use by
the group, denying access by the deceased member.
When the member comes back to life, and is granted access back
into the group, the group uses another GRITS Set to re-enable
access to the resources.
It is up to the agent associated with the group to determine the
exact access needed to particular resources. It may be necessary
to leave write access available to some resource that is used as
part of group membership establishment, and/or quorum
determination.
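A group-side sketch of that expected use, assuming the GroupConfig
record sketched under GRITS Architecture and a send_set() helper bound
to whichever protocol is in use:

    # Sketch: group-service reaction to a membership change (illustrative only).
    def on_membership_change(group_config, generation, cookie,
                             departed, rejoined, send_set):
        # send_set(url, cookie, generation, resource, node, allow) issues one Set.
        for res in group_config.resources:
            # Fence the departed members; the reconfiguration should not
            # complete until every one of these Sets has been acknowledged.
            for node in departed:
                send_set(res.control_url, cookie, generation,
                         res.resource_name, node, False)
            # Re-admit members that have been granted access back into the group.
            for node in rejoined:
                send_set(res.control_url, cookie, generation,
                         res.resource_name, node, True)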
NATALIE
The NATALIE extensions to NFS are additional RPC interfaces to
perform manipulation of the live state of the NFS servers. In
particular, NATALIE supports forcible eviction of mounting
clients. This is principally useful for cluster "fence off", but
is administratively useful on its own merits in non-cluster
environments.
The main verbs are almost those used for the GRITS GroupService
to Resource direction, with the following exceptions: (1) the generation is
not needed, as NATALIE is not specific to group membership, and
(2) the mode is not allow or deny, but an NFS export access right,
such as "rw". The GRITS agent translating to NATALIE must do the
appropriate mapping.
Set( in cookie, in resourceName, in nodeName, in mode );
SetAll( in cookie, in resourceName, in mode );
GetSettings( in resourceName, out resourceSettings );
Only nodes covered by entries in the NFS exports will be allowed
to mount at all. It is not specified whether NATALIE can grant
more permission than is present in the exports configuration,
as this would constrain the NATALIE implementation.
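The translation the GRITS agent performs is then small. In the sketch
below, "rw" and "none" are assumed mode strings; a real agent would use
whatever export rights the NATALIE server actually accepts.

    # Sketch: mapping a GRITS allow/deny Set onto NATALIE's mode-based Set.
    def grits_to_natalie(natalie_set, cookie, resource_name, node_name, allow):
        mode = "rw" if allow else "none"      # assumed mode vocabulary
        # NATALIE takes no generation; the GRITS agent has already rejected
        # requests from stale cluster epochs.
        natalie_set(cookie, resource_name, node_name, mode)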
When there are multiple and dynamic NFS servers on a machine, the
NATALIE settings need to be coherently replicated across them all.
That is, when a Set operation completes disallowing i/o from a
host, it must be guaranteed that all i/o from that host is over,
and no more will be done.
Three plausible implementations of NATALIE follow:
1. At the time of a Set that changes current values, all nfsds
and mountds are killed. The exports file is updated to
reflect the permissions being set, and the daemons are
restarted.
The error propagation of this is questionable; survivors
may receive errors that they ought not see.
2. Add control hooks into the nfsds to reject operations
from fenced hosts. This requires memory of the nodes
being fenced. Hooks into mountd may also be wanted.
3. Create an NFS proxy, which implements the NATALIE filtering
but otherwise forwards the request on to an unmodified
NFS server. This is inefficient, but totally generic.
The NATALIE service need not even reside on the same
node as the NFS service. It should not, however,
reside on any of the nodes being fenced!
If a NATALIE service of type 2 or 3 comes up configured only by
exports, and does not make the GRITS query, then a retried write
from an evicted/frozen node might be allowed. This would be
unacceptable. One solution is to put the NATALIE state in
persistent storage on the server. Another is to have the NATALIE
service started by the GRITS agent after it queries the group about
the boot posture.
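The persistent-storage option could be as small as the sketch below:
the set of fenced hosts is forced to disk before any Set is
acknowledged, and reloaded before nfsd/mountd are allowed to serve.
The file location and one-host-per-line format are placeholders, not
part of NATALIE.

    # Sketch: persisting NATALIE fence state across restarts (placeholder format).
    import os

    STATE_FILE = "/var/run/natalie.fenced"    # hypothetical location

    def save_fenced(fenced_hosts):
        tmp = STATE_FILE + ".tmp"
        with open(tmp, "w") as f:
            f.write("\n".join(sorted(fenced_hosts)) + "\n")
            f.flush()
            os.fsync(f.fileno())               # on stable storage before the Set returns
        os.rename(tmp, STATE_FILE)             # atomic replace

    def load_fenced():
        try:
            with open(STATE_FILE) as f:
                return {line.strip() for line in f if line.strip()}
        except FileNotFoundError:
            return set()                       # no saved state: rely on boot posture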
Examples
These examples pretend to use ONC as the communication
mechanism. Others may do as well.
NFS/NATALIE Cluster
-------------------
A three node cluster A, B, C uses shared storage provided by
service N, mounted from export point /path. Together, the quorum
of A, B, C form a virtual IP host V.
Whenever there is a membership change in V, an agent receives
the change and issues GRITS Set commands to all the resources.
On each of the nodes A, B or C, there is a configuration of GRITS
resources used, perhaps in the following form:
% cat /etc/grits.groups
# GRITS resources used by group V
V N onc://N/grits/resource/path
This tells the group-side code that to issue commands to the
resource for N, one uses onc to that location. It also
says that at boot, only read access is required.
On the resource providing N, the GRITS agent is configured:
% cat /etc/grits.resources
# GRITS groups using resource N
N /path onc://V/grits/group/V(none) \
onc://V/grits/group/A(r) \
onc://V/grits/group/B(r) \
onc://V/grits/group/C(r)
This tells the resource side of GRITS that on booting, it may
contact any of V, A, B, or C for current state, and that until it
makes successful contact, it should allow only r access to the
actual nodes. There should never be requests from the virtual
node V, so it is given no access.
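To make the file format above concrete, a resource-side agent might
parse /etc/grits.resources along these lines. The backslash
continuations and the "(policy)" suffix are taken from the example;
everything else is an assumption of this sketch.

    # Sketch: parse /etc/grits.resources lines of the form shown above, e.g.
    #   N /path onc://V/grits/group/A(r) \
    #   onc://V/grits/group/B(r)
    import re

    def parse_grits_resources(text):
        entries, logical = [], ""
        for raw in text.splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            if line.endswith("\\"):            # continuation: keep accumulating
                logical += line[:-1] + " "
                continue
            logical += line
            fields = logical.split()
            resource, path, points = fields[0], fields[1], fields[2:]
            access_points = []
            for p in points:
                m = re.match(r"(.*)\((\w+)\)$", p)
                if m:
                    access_points.append({"url": m.group(1),
                                           "boot_policy": m.group(2)})
            entries.append({"resource": resource, "path": path,
                            "access_points": access_points})
            logical = ""
        return entries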
Shared SCSI Cluster
-------------------
Two nodes A and B have a shared SCSI bus arbitrating disk D,
forming group G. Each node runs both the membership service and the
SCSI GRITS agent; there is no shared IP.
% cat /etc/grits.groups
# GRITS resources used by group G
G D onc://A/grits/resource/path
G D onc://B/grits/resource/path
% cat /etc/grits.resources
# GRITS groups using resources
D /path onc://A/grits/group/A(r) \
onc://B/grits/group/B(r)