
I/O Fencing Proposal, Draft 0.4 (Grits/Natalie)
Here is a new draft of the proposal for providing a framework for
I/O fencing, and for implementation with NFS storage.

(Nobody saw 0.3x.)

I am particularly interested in feedback on the use of HTTP for NATALIE,
and the encoding suggested.


thanks!

-dB

============================================================================

I/O Fencing for Clusters

Generic Resource Intervention Tool Service (GRITS)
and
NFS Admin Tool And Location Intervention Extension (NATALIE)

----------------

David Brower (mailto:dbrower@us.oracle.com)
John Leys (mailto:jleys@us.oracle.com)
Gary Young (mailto:gdyoung@us.oracle.com)

History

public version 0.4 27-Mar-00

- Verbs now deal with errors, quorum checks, and generation wrap.
- GRITS for quorum resolution discussed.
- http/s for Natalie

Discussion Archives:

http://lists.tummy.com/pipermail/linux-ha-dev/

Abstract

Cluster systems with shared resources, such as disk, need
"fencing" of those resources during and after membership
reconfigurations. There are no general solutions to providing a
mechanism for fencing in the Open Standards world, with existing
solutions tied tightly to particular membership services and i/o
systems. This note outlines the architecture of a generic service
(GRITS) for organizing fencing interactions between membership
services and resources. It also describes a mechanism (NATALIE)
by which NFS services may be extended to become a safely fenceable
resource under GRITS. Other resources, such as shared-scsi disk
drivers, SAN switches and the like, should be equally capable of
becoming GRITS-able partners. Because the solution is openly
released, it is hoped that system providers, SAN vendors, and the
purveyors of storage systems will incorporate appropriate agents,
allowing for reliable clusters with shared, fence-able resources.

GRITS Architecture

A GRITS cluster consists of:

Some number of nodes;

Some number of membership quorum group services.
Groups have quorum generation numbers of
at least 16 bits; wraps are allowed, and handled.

Some number of resources used by quorum groups.

Nodes are identified by either IP address or resolvable name.
Resources are identified the same way, and are assumed to have at
least an IP capable proxy -- something that will respond to IP,
even if it needs to take some other path to an actual resource.

Each GRITS membership quorum group has a configuration identifying
the resources that may be used by the group, including the
destination for GRITS control messages to the resource. The
quorum group service also provides multiple access points for
resources to query the group when that is necessary. Each GRITS
group issuing commands and responding to queries is required to
have established quorum. Each has a generation number, which is
only seen outside the membership service once quorum has been
established.

Each GRITS resource has a configured list of quorum groups and
hosts that may possibly access it. The configuration identifies
the destinations for queries from the resource to the group. The
resource itself has at least one identified access point for
control messages to it from the group services.

(The configurations of groups and resources are expected to be
slowly changing, and their control is not defined by GRITS.)

GRITS controls access to resources by nodes depending on the
quorum group membership. Access is either permitted or denied to
whole nodes, with no finer granularity.

[Finer granularity is desirable, but hard to achieve. It would
seem to be necessary to associate groups with processes, and to
have the group, or a group cookie or key, carried along with
requests made on behalf of processes. For instance, the key
associated with a fibre-channel persistent reservation might be an
excellent way to allow/disallow members. It may be very difficult
to arrange for the key sent by the driver for an i/o on behalf of
one process to be different from the key used for i/o by another
process.]

Resources that must remain writable to all during cluster
transition, perhaps because they are used as part of the
membership quorum resolution, should not be under GRITS control.

--> Fencing can be used as part of quorum resolution. This can be
done with a separate "quorum" group, arbitrating access to the
quorum resource. This is discussed in the Appendix.

At resource boot time, the resource examines the configuration,
and adopts an access posture towards the potential members of all
the groups. First, it applies the configured boot policy associated
with each group member. Then it may also use GRITS-defined
messaging to communicate with the configured membership groups to
set the correct current access rights. At a true cold boot, there
may be no groups to respond, so the configured boot posture
remains in effect until a quorum group is formed and issues
commands to the resources. The plausible initial policies are
"read only" and "no access"; some resources may only be able to
enforce "no access". A "writable" boot policy would be defeat
the purpose of the fence.
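
As a concrete illustration, here is a minimal Python sketch of the
resource-side boot sequence just described. The configuration shape
and the grits_setme() and apply_access() helpers are placeholders for
whatever binding an implementation chooses; none of these names are
part of the proposal.

BOOT_POLICY = "deny"    # configured boot posture; "read only" where supported

class NoQuorum(Exception):
    """Raised by the transport when a group answers 'no quorum'."""

def resource_boot(config, grits_setme, apply_access):
    # 1. Adopt the configured boot policy toward every potential member.
    for node in config["potential_members"]:
        apply_access(node, BOOT_POLICY)

    # 2. Optionally query each configured group for the current settings.
    for group_url in config["groups"]:
        try:
            settings, generation = grits_setme(group_url, config["resource"])
        except (NoQuorum, OSError):
            continue    # cold boot, or group not formed: keep boot posture
        for resource, node, access in settings:
            apply_access(node, access)
        return generation    # remembered for later Set generation checks
    return None              # nobody answered; boot posture stays in effect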

Once an initial posture is established by a resource, membership
change events in the quorum group drive GRITS control messages to
all the resources configured for the group. These will deny
access to departing members and allow access to continuing or
joining members. The quorum group cannot proceed out of its
reconfiguration stage until the correct fencing of all resources
has been accomplished.
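
In the same illustrative Python, the group-side agent might drive this
as follows; grits_set() stands for the chosen transport, and the
failure handling is deliberately a stub (see the open issues below).

def fence_for_membership(config, grits_set, generation, members, departed):
    # Build a complete resourceSettings list for each resource: deny the
    # departing members, allow continuing and joining members.
    for resource in config["resources"]:
        settings = ([(resource, node, "allow") for node in members] +
                    [(resource, node, "deny") for node in departed])
        status = grits_set(resource, config["cookie"], generation, settings)
        if status != "OK":
            # Error propagation is an open issue; here we simply refuse
            # to let reconfiguration proceed.
            raise RuntimeError("fencing via %s failed: %s" % (resource, status))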

It is intended that "gritty" agents can be written and put in
place for:

- directly attached disks on shared SCSI. The agent would
communicate with some kernel-level code to manipulate the
SCSI reset and SCSI reserve to arbitrate access to the resource;
GRITS would talk to both sides to force access to a known state.

- SAN attached storage, where the agent could program tokens or
domains of the fabric to control access;

- NFS attached storage, where an agent could use NATALIE
capabilities to narrow access below that of the basic exports;

- SMB attached storage, where an agent could communicate to the
software doing the "sharing" to control access.

- General network attached storage, where control may be achieved
by filtering in a router or proxy between the nodes and the
resources.

- Worst Case group members, who will have a third party wired
to their reset buttons to force them to be "fenced." This
is an always correct final solution. The "X10" system can
be used to turn off the power to particularly non-cooperative
entities.

Mixtures of these agencies may be needed, depending on the needs
and topology of the cluster in question. The resource providers
may or may not be on hosts that are part of any group in question.

OPEN: The current proposal does not address layers of fencing or
escalation and isolation. It might be useful to identify levels
at which fencing may be stopped without doing higher levels. For
instance, if all disk i/o may be stopped by frobbing the
fibrechannel switch, then turning off the power may not be
necessary.

Protocols

At the architectural level, GRITS is agnostic about the control
protocols. The service could be provided using a variety of
communication mechanisms. The messages are defined in terms of
verbs that may be bound to different techniques. In practice,
there will need to be some commonality of protocol. It will not
do to have a resource attempt to query a group using a protocol
the group does not support, nor can a group meaningfully send
a membership change to a resource without common ground.

The following protocols are likely candidates:

ONC (Sun) RPC
HTTP
HTTPS

Exact bindings and support are an area for discussion, decision,
and documentation for interoperability.

Security

Only "authorized" parties may be allowed to invoke the verbs.
This is handled, pitifully, by a "cookie", a shared secret between
resources and group services. A secure protocol would protect the
contents of the cookie, but is not an essential part of the
architecture. As is traditional in cluster discussions, we may
presume for the moment that traffic between nodes and resources is
on a secure network.

Only current quorum holding membership services should be invoking
commands, except that a member may always fence itself from
resources. (It may not unfence without obtaining quorum.)

To enforce this, GRITS has some knowledge about quorum
generations. A quorum generation is an ever increasing number
bumped at the time the membership is changed and confirmed. This
is distinct from a raw cluster, which may exist without the
presence of quorum. For purposes of GRITS, only quorum
generations exist, and cluster generations are never seen. For
example, a cluster with a quorum generation of 10 experiences a
partition, which drives reconfiguration. Several partitions may
each decide to have generation 11 as they seek quorum. All but
one of these will lose the quorum determination, and their
existence at generation 11 will never be seen by GRITS. Only the
surviving quorum holder at generation 11 may issue GRITS commands.
Therefore, a GRITS resource need only obey commands from the
latest quorum generation. It may challenge, discard, or return an
error to late-arriving commands from earlier generations. The
challenge protocol nests a "SetMe" call back from a resource
to a group that is issuing a "Set" command.

Resource Settings

A resourceSetting is a combination of

{ resource, node, allow|deny }

ResourceSettings are a list or array of resourceSettings to
cover a set of resource/node bindings.
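
For example, an agent written in Python might represent these as
follows (the field names are illustrative only):

from collections import namedtuple

# One resourceSetting: allow or deny one node's access to one resource.
ResourceSetting = namedtuple("ResourceSetting", ["resource", "node", "access"])

# A resourceSettings value is simply a list of these, covering a set of
# resource/node bindings.
settings = [
    ResourceSetting("N:/path", "nodeA", "allow"),
    ResourceSetting("N:/path", "nodeB", "allow"),
    ResourceSetting("N:/path", "nodeC", "deny"),
]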

Verbs

Resource to GroupService

errstat SetMe( in resourceName, out resourceSettings,
out groupGeneration );

The resource may optionally invoke a SetMe operation
against the group after it establishes boot posture.
If it succeeds, it may prevent surviving members from
seeing i/o errors that would result in more recovery
and cluster transition activities.

When a group member receives this call, it -MUST-
revalidate that it does currently hold the quorum. If it
no longer has quorum, it returns that as an error status.

The resource may issue SetMe in response to a Set
operation if it suspects the invoker of the Set may not be
the quorum holder. To prevent races, the SetMe is issued
-before- the return from the Set is made.

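A sketch of the group side of SetMe, continuing the illustrative
Python; the membership object is a placeholder for whatever quorum
service is in use.

def handle_setme(resource_name, membership, current_settings):
    # Revalidate quorum at the time of the call; a cached answer is
    # not good enough.
    if not membership.has_quorum():
        return ("ENOQUORUM", None, None)
    settings = [s for s in current_settings if s[0] == resource_name]
    return ("OK", settings, membership.quorum_generation())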

GroupService to Resource

errstat Set( in cookie, in groupGeneration, in resourceSettings );

The set operation is used by a quorum holding group to
fence out hosts to be excluded, and to allow operations by
hosts that are members of the group. It will only be
obeyed if issued by the group holding quorum. The set of
members in the resourceSettings list should be complete,
covering all hosts to be excluded, and all those to be
included.

If the cookie provided does not match the configured
cookie, the Set request will be rejected and return error
status.

If the resource suspects the caller does not have quorum,
the resource may call the invoker back with a SetMe
operation. The caller of Set must be able to respond to
the SetMe while the Set call is in progress. Depending on
the response to the SetMe, the Set call will return with
different status. This nested invocation prevents error
handling races on the group side, as the group may only
progress once the Set is completed successfully.

There are three situations where the resource may suspect
the quorum state of the caller:

(1) The resource has booted, and has no previous
cluster generation to check against.
(2) The group generation supplied is less than
the one remembered, signalling either wrap or
complete cluster reset.
(3) A Set for a generation that has already been done,
which may be a duplicate of a split-brained
reconfiguration.

When a Set operation completes, denying i/o from some
hosts, it must be guaranteed that all i/o from those hosts
is over, and absolutely no more will be done. This may
mean waiting for the completion of all outstanding i/o
requests if true cancellation is not possible.
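
Putting the above together, the resource side of Set might look like
the following sketch. Here setme_back() stands for the nested call
back to the invoker, and drain_io() for whatever the resource does to
guarantee that i/o from a denied host is over; all names are
illustrative, not part of the proposal.

def handle_set(state, cookie, generation, settings,
               setme_back, apply_access, drain_io):
    if cookie != state["cookie"]:
        return "EBADCOOKIE"

    last = state.get("generation")        # None right after boot
    suspicious = (last is None            # (1) no previous generation
                  or generation < last    # (2) wrap, or full cluster reset
                  or generation == last)  # (3) possible split-brain duplicate
    if suspicious:
        # Nested SetMe challenge, issued -before- Set returns.
        status, fresh_settings, fresh_generation = setme_back(state["resource"])
        if status != "OK":
            return "ENOQUORUM"            # caller could not prove quorum
        settings, generation = fresh_settings, fresh_generation

    for resource, node, access in settings:
        apply_access(node, access)
        if access == "deny":
            drain_io(node)   # no return until all i/o from the node is over
    state["generation"] = generation
    return "OK"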

errstat Get( in resourceName, out resourceSettings );

The get operation is a management convenience, to query
the state of a resource. It may be issued at any time
by anyone.

FIXME -- should Get be restricted by cookie?

OPEN: the current proposal does not address hung Set operations.

Expected use

When the quorum group detects the death of a member, it uses GRITS
to fence it off, by calling Set for all resources in use by the
group, denying access by the deceased member.

When the member comes back to life, and is granted access back
into the group, the group uses another GRITS Set to re-enable
access to the resources.

It is up to the agent associated with the group to determine the
exact access needed to particular resources. It may be necessary
to leave write access available to some resource that is used as
part of group membership establishment, and/or quorum
determination.

NATALIE - NFS Administration Extensions

The NATALIE extensions to NFS provide additional network
interfaces to perform manipulation of the live state of the NFS
servers. In particular, NATALIE supports forced eviction of
clients that have directories mounted, and the cancellation of
operations they may have in progress. This is principally useful
for cluster "fence off", but is administratively useful on its own
merits in non-cluster environments.

The main verbs are almost those used with the GRITS GroupService
to Resource, with the following exceptions: (1) The generation is
not needed, as NATALIE is not specific to group membership, and
(2) the mode is not allow or deny, but an NFS export access right,
such as "rw". The GRITS agent translating to NATALIE must do the
appropriate mapping.

We propose to use HTTP/S for NATALIE communication rather than
ONC/RPC. The two operations of interest would be encoded as a
GET or POST operation. The following variables may be sent
with the query:

secret # the NFS control cookie
sa # sent action, "Allow Changes" "Get Current" "Change"

If the returned page contains "<H2>ERROR", then an error was
encountered during processing of the request. If the page
contains "<H2>Success", then the operation succeeded.

If "Change" is the sent action, the following variables describe
the directories and access rights to be set.

dir1 # directory name 1
acc1 # access rights for dir1
dir2 # directory name 2
acc2 # access rights for dir2
...
dirN
accN

Where the accN variable is of the form:

accspec ::= nodespec | accspec : nodespec
nodespec ::= node = rights
rights ::= rw | ro

Example:

node1=rw:node2=ro:node3=rw
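
A small sketch of building and parsing accspec strings of this form,
again in illustrative Python:

def build_accspec(rights):
    # rights: {"node1": "rw", "node2": "ro"} -> "node1=rw:node2=ro"
    assert all(r in ("rw", "ro") for r in rights.values())
    return ":".join("%s=%s" % (node, r) for node, r in sorted(rights.items()))

def parse_accspec(spec):
    # "node1=rw:node2=ro" -> {"node1": "rw", "node2": "ro"}
    rights = {}
    for nodespec in spec.split(":"):
        node, _, r = nodespec.partition("=")
        if r not in ("rw", "ro"):
            raise ValueError("bad nodespec %r" % nodespec)
        rights[node] = r
    return rights

print(build_accspec({"node1": "rw", "node2": "ro", "node3": "rw"}))
# -> node1=rw:node2=ro:node3=rw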

If "Get Current" is the sent action, then the result page will
contain a table having two columns, whose first column is a
directory, and whose second is an access spec as shown above.

Only these two sent actions, "Change" and "Get Current", need be
recognized by a Natalie implementation. The actual pages returned
by a Natalie may have other actions at the discretion of the
implementor.
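
To make the HTTP binding concrete, here is a sketch of a "Change"
request using Python's standard library. Only the variable names
(secret, sa, dirN, accN) and the <H2> result markers come from this
proposal; the URL, the choice of POST, and the form encoding are
assumptions about the eventual binding.

import urllib.parse
import urllib.request

def natalie_change(url, secret, changes):
    # changes: {"/path": "nodeA=rw:nodeB=ro", ...}
    variables = {"secret": secret, "sa": "Change"}
    for i, (directory, accspec) in enumerate(sorted(changes.items()), start=1):
        variables["dir%d" % i] = directory
        variables["acc%d" % i] = accspec
    body = urllib.parse.urlencode(variables).encode("ascii")
    with urllib.request.urlopen(url, data=body) as response:   # POST
        page = response.read().decode("latin-1", "replace")
    if "<H2>ERROR" in page:
        raise RuntimeError("NATALIE reported an error")
    if "<H2>Success" not in page:
        raise RuntimeError("unrecognized NATALIE response")
    return page

# Hypothetical usage:
# natalie_change("http://nfs-server/natalie", "cookie", {"/path": "A=rw:B=ro"})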

It is not specified whether NATALIE can grant more permissions than
are present in some static exports configuration, as this would
constrain the NATALIE implementation. Enforcing such a limit is
desirable from a security perspective.

When there are multiple and dynamic NFS servers on a machine, the
NATALIE settings need to be coherently replicated across them all.
That is, when a Set operation completes disallowing i/o from a
host, it must be guaranteed that all i/o from that host is over,
and no more will be done.

Four plausible implementations of NATALIE follow:

1. At the time of a Set that changes current values, all nfsds
and mountds are killed. The exports file (or equivalent) is
updated to reflect the permissions being set, and the daemons
are restarted.

The error propagation of this is questionable; survivors
may receive errors that they ought not see.

It is important that Natalie not be able to allow permissions
that were not tolerated by the NFS configuration. It is also
important that the proper boot posture is established. For
use with GRITS, there would need to be some intertwining of
the NFS shutdown with the SetMe query.

2. Put control hooks into the nfsds to reject operations
from fenced hosts. This requires memory of the nodes
being fenced. You may also need hooks into mountd.

3. Create an NFS proxy, which implements the NATALIE filtering
but otherwise forwards the request on to an unmodified
NFS server. This is inefficient, but totally generic.
The NATALIE service need not even reside on the same
node as the NFS service. It should not, however,
reside on any of the nodes being fenced!

4. Change the exports, then reboot the node.

If a NATALIE service of type 2 or 3 comes up as configured by
exports, and does not make the GRITS query, then a retried write
from an evicted/frozen node might be allowed. This would be bad. One
solution is to put the NATALIE state in persistent storage on the
server. Another is to have NATALIE started by the GRITS
agent after it queries the group about the boot posture.

Examples

These examples pretend to use ONC as the communication
mechanism. Others may do as well.

NFS/NATALIE Cluster
-------------------

A three node cluster A, B, C uses shared storage provided by
service N, mounted from export point /path. Together, the quorum
of A, B, C form a virtual IP host V.

Whenever there is a membership change in V, an agent receives
the change and issues GRITS Set commands to all the resources.


On each of the nodes A, B or C, there is a configuration of GRITS
resources used, perhaps in the following form:

% cat /etc/grits.groups

# GRITS resources used by group V
V N onc://N/grits/resource/path

This tells the group-side code that to issue commands to the
resource for N, one uses onc to that location. It also
says that at boot, only read access is required.

On the resource providing N, the GRITS agent is configured:

% cat /etc/grits.resources

# GRITS groups using resource N

N /path onc://V/grits/group/V(none) \
onc://V/grits/group/A(r) \
onc://V/grits/group/B(r) \
onc://V/grits/group/C(r)

This tells the resource side of GRITS that on booting, it may
contact any of V, A, B, or C for current state, and that until it
makes successful contact, it should allow only r access to the
actual nodes. There should never be requests from the virtual
node V, so it is given no access.
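
For illustration, a minimal Python parser for the /etc/grits.resources
format shown above, assuming whitespace-separated fields, backslash
continuations, and the parenthesized boot access; the format itself is
not fixed by this proposal.

def parse_grits_resources(text):
    # Join backslash-continued lines, dropping comments and blank lines.
    logical, pending = [], ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            pending += line[:-1].strip() + " "
        else:
            logical.append((pending + line).strip())
            pending = ""
    if pending:
        logical.append(pending.strip())

    # {resource: (path, [(group_access_point, boot_access), ...])}
    table = {}
    for line in logical:
        fields = line.split()
        resource, path = fields[0], fields[1]
        contacts = []
        for entry in fields[2:]:
            url, _, rest = entry.partition("(")
            contacts.append((url, rest.rstrip(")")))
        table[resource] = (path, contacts)
    return table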

Shared SCSI Cluster
-------------------

Two nodes A and B have a shared SCSI bus arbitrating disk D,
forming group G. Each node runs both the membership and the scsi
GRITS agent; there is no shared IP.

% cat /etc/grits.groups

# GRITS resources used by group G
G D onc://A/grits/resource/path
G D onc://B/grits/resource/path

% cat /etc/grits.resources

# GRITS groups using resources

D /path onc://A/grits/group/A(r) \
onc://B/grits/group/B(r)


Summary of open areas and problems
----------------------------------

OPEN 1. Can we use fencing to resolve quorum? How would that
actually work? This is supported by having resources
challenge quorum state when they think a Set is
"suspicious", and discussed in the Appendix below.

CLOSED 2. Do we need persistence in the resource agent to determine
the correct cluster generation to listen to? This gets
particularly complicated during nested reconfigs, with
some delayed member believing it has quorum, when the
actual cluster has moved on beyond.

We think persistence is not needed, given the SetMe challenge
protocol. Persistence will enhance performance by reducing
the number of unnecessary challenges.

OPEN 3. It may be desirable to support configuration of explicit
hierarchies of fencing points, stopping at the lowest one that
will work rather than going all the way up to "shoot the node".

OPEN 4. Error reporting and propagation have not been addressed. If
an attempt to fence fails, what do we do? This leads to the
hierarchies above.

OPEN 5. Finer granularity than node may be extremely desirable.
Doing so is difficult, seemingly requiring kernel level
hooks to attach "group" attributes to processes to be
attached to requests for resources; this involves getting
into device drivers, and gets very messy.

CLOSED 6. Performance reasons forced a change to resourceSettings
in the Set command, so we can batch a bunch of requests
in one shot. But we may still need to do Sets in parallel
if there are a lot of them to talk to, and we don't want
to serialize on their potentially lengthy responses.

Set now takes the full set of hosts in question.

CLOSED 7. To resolve wrap, we were tempted to use some arithmetic
subtleties. This isn't necessary with the challenge
protocol, and lets us be agnostic about the number of
bits in the generation number.

OPEN 8. What ought to happen, globally, if some resource or
group hangs or doesn't respond quickly to a command
or challenge?

Appendix - Use of GRITS to resolve quorum
-----------------------------------------

In some cases, the group membership service may have a hard
time determining quorum. One possible resolution is to use
access to a quorum resource to resolve the dispute. Access
can be controlled with fencing, but this presents recursion
problems, as GRITS insists the group have quorum before doing
any set operations.

It should be clear that the quorum device is very special --
it is used to resolve quorum only, and that this is not the
same thing as fencing off a large number of resources that
need protection to prevent corruption.

To model this situation in GRITS, we will say that there
are two groups, "group", accessing the actual protected
resources, and "group-manager", which decides meta-access
to the quorum device.

# GRITS groups using data1, data2, data3, ...

group /data1 onc://V/grits/group/V(none) \
onc://V/grits/group/A(r) \
onc://V/grits/group/B(r)

group /data2 onc://V/grits/group/V(none) \
onc://V/grits/group/A(r) \
onc://V/grits/group/B(r)

group /data3 onc://V/grits/group/V(none) \
onc://V/grits/group/A(r) \
onc://V/grits/group/B(r)

group-manager /quorum onc://V/grits/group-manager/V(none) \
onc://V/grits/group-manager/A(r) \
onc://V/grits/group-manager/B(r)

When there is a quorum dispute, the group service will attempt to
write the /quorum device. If it has access, it is in the quorum
group.

During a split-brain reconfiguration, nodes A and B want to claim
quorum. Not knowing of the other's existence, each attempts to
fence out the other. The first one in, say node A, will succeed
in fencing B out. Node B may think it has quorum and issue
its own Set, but this will receive a SetMe challenge in
response. On receiving this, B will try to access the quorum
device, and fail because of the fencing put in place by node A.

Now say the fence of B is in place, and A dies. If the partition
still exists, B won't know, and the system is dead. If the
partition is resolved, and B is notified, it must grab the quorum
device in some way. It will need to establish on its own that A
is dead, issue a Set, and when challenged by SetMe, it will claim
the device even though it is currently fenced out, and fence A out
in the process.

It is very difficult for B to distinguish between the cases where
A is truly dead (and that B needs to step in), and those where the
network partition remains in place, A is working, and B needs to
stay dead. These problems are left for group membership service
writers to resolve; GRITS does not provide an easy solution.

Acknowledgements
----------------

We'd like to thank the readers of the gfs-devel@borg.umn.edu list
and those on linux-ha-dev@lists.tummy.com for their indulgence
in hosting discussions of this proposal. We'd also like to thank
the following individuals who have provided significant feedback:

Stephen C. Tweedie (sct@redhat.com)
Phil Larson (Phil.Larson@netapp.com)


--
Butterflies tell me to say:
"The statements and opinions expressed here are my own and do not necessarily
represent those of Oracle Corporation."