
[Fwd: Re: draft of updated GRITS note]
A co-author nails me

-dB

-------- Original Message --------
From: "Gary D. Young" <gdyoung@us.oracle.com>
Subject: Re: draft of updated GRITS note
To: David Brower <dbrower@us.oracle.com>, jleys@us.oracle.com


Add another space before my name.
>
<snip>
> remains in effect until a quorum group is formed and issues
> commands to the resources. The plausible initial policies are
> "read only" and "no access"; some resources may only be able to
> enforce "no access". A "writable" boot policy would be defeat
> the purpose of the fence.

Fix that last sentence. Perhaps strike the word "be".

> - Worst Case group members, who will have a third party wired
> to their reset buttons to force them to be "fenced." This
> is an always correct final solution. The "X10" system can
> be used to turn off the power to particularly non-cooperative
> entities.

"This is an always correct final solution." Sounds kind of awkward to
me. Perhaps "This final recourse solution has guaranteed correctness."

> OPEN: The current proposal does not address layers of fencing or
> escalation and isolation. It might be useful to identify levels
> at which fencing may be stopped without doing higher levels. For
> instance, if all disk i/o may be stopped by frobbing the
> fibrechannel switch, then turning off the power may not be
> necessary.

Remind me please: what's frobbing?

> Potential protocols include:
>
> ONC RPC
> CORBA
> HTTP
> HTTPS
> COM/DCOM
> SMB extensions

Was there some reason why TCP/IP was not included? It's most likely
going to be the communications module used for the first version,
so it seems kinda silly to leave it out. All of the above probably
make use of TCP/IP in some fashion.... perhaps "sockets" or
something would be the appropriate terminology here?

> To establish ordering of quorum generations, GRITS must consider
> the possibility of wraparound. It is suggested that something like
>
> static inline int x_before_y (unsigned x, unsigned y)
> {
>     return ((signed) (x - y)) < 0;
> }
>
> will suffice.
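
Sanity-checking that for myself near the wrap, assuming a 32-bit
unsigned and the function above (this is just me, not the draft):

    #include <assert.h>

    /* old generation 0xFFFFFFFF has wrapped around to new generation 2 */
    int main(void)
    {
        assert(x_before_y(0xFFFFFFFFu, 2u) == 1);  /* 0xFFFFFFFF - 2 wraps
                                                      to 0xFFFFFFFD, which
                                                      is negative as signed */
        assert(x_before_y(2u, 0xFFFFFFFFu) == 0);  /* 2 - 0xFFFFFFFF wraps
                                                      to 3, positive */
        return 0;
    }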

Certainly there will be some default generation number that is used
when you initially request the forming of a cluster. So if you happen
to wrap around to it, any new nodes will either assume they are
already part of the cluster, or the storage unit may assume they are
since they have the correct generation number.

Potential solutions could be:
1) skip the "magic number" assigned as default when incrementing.
2) arrange the protocol so that you must have the correct
generation number AND be considered part of the group by GRITS.

Hm. Is this point so obvious that you just glossed over it, or
am I hitting something meritorious here?
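
For (1), a minimal sketch of what I mean, assuming generation 0 is
the reserved boot default (the names are my invention):

    /* Generation 0 is reserved as the "magic" boot default, so the
       increment skips over it when the counter wraps around. */
    #define GEN_BOOT_DEFAULT 0u

    static inline unsigned next_generation(unsigned gen)
    {
        gen++;
        if (gen == GEN_BOOT_DEFAULT)   /* wrapped onto the reserved value */
            gen++;                     /* skip it */
        return gen;
    }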

> FIXME - For this to work, we need to fix the width of the
> generation, to 16, 32, or 64 bits. My inclination is to make it
> 64 bits.

Putting it at 64 bits would make it rather unlikely that the ceiling
is ever hit: at one generation per millisecond, burning through 2^64
of them would take roughly 584 million years. Some machines may have
to do their own 64-bit arithmetic if they don't have 64-bit libraries
(or 64-bit processors), but that's not a serious issue.

> FIXME - There are also issues regarding the need for stable
> storage for the epoch in resource agents. What epoch should they
> obey at resource boot?
>
> Resource Settings
>
> A resourceSetting is a combination of
>
> { resource, node, allow|deny }
>
> ResourceSettings are a list or array of resourceSettings to
> cover a set of resource/node bindings.
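
In C terms I read a resourceSetting as something like this (the field
names and types are my guesses, not the draft's):

    struct resourceSetting {
        unsigned resource;               /* resource id */
        unsigned node;                   /* node id */
        enum { ALLOW, DENY } access;
    };

    /* ResourceSettings: an array of the above covering a set of
       resource/node bindings */
    struct resourceSettings {
        unsigned n;
        struct resourceSetting *settings;
    };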

I'm still keen on putting "administrate" as one of the attributes.
Then fencing "administrate" access from all the rest of the nodes
would result in quorum being established. Suppose everyone seeking
quorum follows the protocol "fence everyone else off, and if all
of those fences succeed then you're the reconfig leader". (If two
people were vying, obviously one of them would fence the other
first, and then at least one of the other's fencing commands would
fail. Whoever else they fence in the process would be redundant.)
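
Roughly, with every name below invented for illustration, and
fence_administrate() assumed to fail once the caller has itself
been fenced off:

    extern int fence_administrate(int node);   /* hypothetical */

    /* Try to become reconfig leader by fencing "administrate" access
       of every other node; all fences succeeding means we won the race. */
    int try_become_reconfig_leader(int self, const int *nodes, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            if (nodes[i] == self)
                continue;
            if (fence_administrate(nodes[i]) != 0)
                return 0;   /* someone beat us to it; back off */
        }
        return 1;           /* we are the reconfig leader */
    }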

This has issues when the reconfig leader dies, though.... but
I suppose any quorum service has to be able to recover from
the token being lost.

> OPEN: the current proposal does not address errors or timeouts
> that could be returned from Set operations.

Ack/Nack-ing each set request would have its merits. The option of
querying after each request to verify that it went through has a
flaw: someone could interfere between your set and your
verification request.

> NATALIE
>
> The NATALIE extensions to NFS are additional RPC interfaces to
> perform manipulation of the live state of the NFS servers. In
> particular, NATALIE supports forcible eviction of mounting
> clients. This is principally useful for cluster "fence off", but
> is administratively useful on its own merits in non-cluster
> environments.

Gee... that second sentence is producing too many mental images. :)

> Summary of open areas and problems
> ----------------------------------
>
> 1. Can we use fencing to resolve quorum? How would that
> actually work?

See above. I think it will work.

> 4. Error reporting and propagation have not been addressed. If
> an attempt to fence fails, what do we do? This leads to
> hierarchies above.

My gut feeling says to either acknowledge every fence request or to
acknowledge none.

Re: [Fwd: Re: draft of updated GRITS note]

> -------- Original Message --------
> From: "Gary D. Young" <gdyoung@us.oracle.com>
> Subject: Re: draft of updated GRITS note
> To: David Brower <dbrower@us.oracle.com>, jleys@us.oracle.com

> > OPEN: The current proposal does not address layers of fencing or
> > escalation and isolation. It might be useful to identify levels
> > at which fencing may be stopped without doing higher levels. For
> > instance, if all disk i/o may be stopped by frobbing the
> > fibrechannel switch, then turning off the power may not be
> > necessary.
>
> Remind me please: what's frobbing?

The Jargon File is of mixed help. Its entry for "frob" only gives
"frobnicate" as the verb form. Helpfully, the entry for "molly-guard"
uses it in context:

molly-guard /mol'ee-gard/ /n./ [University of Illinois] A shield to
prevent tripping of some Big Red Switch by clumsy or ignorant hands.
Originally used of the plexiglass covers improvised for the BRS on
an IBM 4341 after a programmer's toddler daughter (named Molly)
frobbed it twice in one day. Later generalized to covers over
stop/reset switches on disk drives and networking equipment.

The usual use of "frobbing a knob" is making some adjustment
on some piece of equipment. It's sort of like tweaking, only
it is generally an "official" adjustment, where tweaking can
be done out-of-band with metal cutting tools and impact devices :-)

> > Potential protocols include:
> >
> > ONC RPC
> > CORBA
> > HTTP
> > HTTPS
> > COM/DCOM
> > SMB extensions
>
> Was there some reason why TCP/IP was not included? It's most likely
> going to be the communications module used for the first version,
> so it seems kinda silly to leave it out. All of the above probably
> make use of TCP/IP in some fashion.... perhaps "sockets" or
> something would be the appropriate terminology here?

I suppose if pressed, I'd say HTTP is tcp with ascii, er, ISO 8859 syntax,
and that I hate rolling my own message formats. Would anyone like to
argue for raw tcp?

> > To establish ordering of quorum generations, GRITS must consider
> > the possibility of wraparound. It is suggested that something like
> >
> > static inline int x_before_y (unsigned x, unsigned y)
> > {
> >     return ((signed) (x - y)) < 0;
> > }
> >
> > will suffice.
>
> Certainly there will be some default generation number that is used
> when you initially request the forming of a cluster. So if you happen
> to wrap around to it, any new nodes will either assume they are
> already part of the cluster, or the storage unit may assume they are
> since they have the correct generation number.
>
> Potential solutions could be:
> 1) skip the "magic number" assigned as default when incrementing.
> 2) arrange the protocol so that you must have the correct
> generation number AND be considered part of the group by GRITS.
>
> Hm. Is this point so obvious that you just glossed over it, or
> am I hitting something meritorious here?

Maybe, or maybe something different.

I was talking with John Leys, and we came up with some other problem
cases, and maybe a solution to the "stable store in the resource
for quorum generation problem."

Here is a scenario. The cluster is at generation 1 and it partitions;
subset Alpha forms and gains quorum at gen 2. It starts sending fencing
commands out, but has a problem and stalls. Another reconfig is done,
forming generation 3, and it sends fences successfully to all resources.
Then the sleepy quorum master from gen 2 comes back to life; at the same
time, resource R dies. R reboots, receives a message from sleepy 2, and
fences out the members of gen 3. Or R sends its "set me" request to 2,
who cheerfully responds with the same membership; same result.

This would be solved if we remembered generations in the resource, because
R would know not to talk to generation 2. But we'd like not to require that.

We can solve one part if we insist that every time a quorum group gets a
"set me" query from a booting resource, it explicitly checks to make sure
it still holds quorum. We can solve the other part if a booting resource
without a memory challenges the first command it receives, forcing the
same explicit quorum check from the commanding node. This way there
is no need for persistent store in the resource, but having persistent
store eliminates the need for a challenge.
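
In rough C, with every name below invented for illustration (a sketch
of the challenge, not a worked-out interface):

    extern int have_stable_generation(void);   /* resource has memory? */
    extern unsigned remembered_generation(void);
    extern int challenge_quorum(int sender);   /* forces explicit check */

    static inline int x_before_y(unsigned x, unsigned y)
    {
        return ((signed) (x - y)) < 0;
    }

    /* A booting resource deciding whether to obey its first command. */
    int obey_first_command(int sender, unsigned generation)
    {
        if (!have_stable_generation()) {
            /* no memory: challenge, forcing the commanding node to
               re-verify that it still holds quorum */
            if (!challenge_quorum(sender))
                return 0;   /* the sleepy gen 2 master fails here */
        } else if (x_before_y(generation, remembered_generation())) {
            return 0;       /* command is from a stale generation */
        }
        return 1;
    }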

How does that sound?

Still unsettled is the need for generation memory in the group itself;
if the whole group dies, must it start past the previous generation,
or can it start at 0 or 1 again? With a challenge, it might work
to let the quorum generation float, and be negotiated to the max of
the remembered generations from all the members and the available
resources. An alternative is Gary's "magic epoch" value, skipped
on wraparound.

Any opinions?

> > FIXME - For this to work, we need to fix the width of the
> > generation, to 16, 32, or 64 bits. My inclination is to make it
> > 64 bits.
>
> Putting it at 64 bits would make it rather unlikely that the ceiling
> is ever hit: at one generation per millisecond, burning through 2^64
> of them would take roughly 584 million years. Some machines may have
> to do their own 64-bit arithmetic if they don't have 64-bit libraries
> (or 64-bit processors), but that's not a serious issue.

Yes, but arguing against myself, I'm not sure I want to -require- 64
bits, esp. if I'm trying to work with an existing group manager that
has a smaller native width. GRITS/NATALIE is not supposed to be
Linux specific, so I kind of don't like wiring in things like
generation size.

Any opinions?

> > Resource Settings
> >
> > A resourceSetting is a combination of
> >
> > { resource, node, allow|deny }
> >
> > ResourceSettings are a list or array of resourceSettings to
> > cover a set of resource/node bindings.
>
> I'm still keen on putting "administrate" as one of the attributes.
> Then fencing "administrate" access from all the rest of the nodes
> would result in quorum being established. Suppose everyone seeking
> quorum follows the protocol "fence everyone else off, and if all
> of those fences succeed then you're the reconfig leader". (If two
> people were vying, obviously one of them would fence the other
> first, and then at least one of the other's fencing commands would
> fail. Whoever else they fence in the process would be redundant.)
>
> This has issues when the reconfig leader dies, though.... but
> I suppose any quorum service has to be able to recover from
> the token being lost.

I'm leery of semantics beyond allow/deny, because I think that
there will be some very crude switches we'd like to use as our
enforcers. I don't see how to implement "administrate" using
an access decision in a switch, for instance. It would be neat
if we knew how "administrate" could be made to work everywhere.

Does it sound possible to anyone else?

> > OPEN: the current proposal does not address errors or timeouts
> > that could be returned from Set operations.
>
> Ack/Nack-ing each set request would have its merits. The option of
> querying after each request to verify that it went through has a
> flaw: someone could interfere between your set and your
> verification request.

I think synchronous error reporting is the best thing to do; explicitly
supporting asynchrony with completion status checking is ugly, and I'm
willing to assume threads for parallelism.
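
For instance, a minimal sketch with POSIX threads, where set_resource()
and struct set_req are stand-ins rather than a proposed interface:

    #include <pthread.h>
    #include <stdlib.h>

    extern int set_resource(unsigned resource, unsigned node, int allow);

    struct set_req {
        unsigned resource, node;
        int allow;
        int status;                    /* synchronous result lands here */
    };

    static void *do_set(void *arg)
    {
        struct set_req *r = arg;
        r->status = set_resource(r->resource, r->node, r->allow);
        return NULL;
    }

    /* Issue all Sets in parallel, then collect synchronous status. */
    int apply_settings(struct set_req *reqs, int n)
    {
        pthread_t *tids = malloc(n * sizeof *tids);
        int i, ok = 1;
        for (i = 0; i < n; i++)
            pthread_create(&tids[i], NULL, do_set, &reqs[i]);
        for (i = 0; i < n; i++) {
            pthread_join(tids[i], NULL);
            if (reqs[i].status != 0)
                ok = 0;                /* a fence failed; caller escalates */
        }
        free(tids);
        return ok;
    }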

thanks!

-dB