Mailing List Archive

STONITH implementations
Folks,

Does anyone know of any existing code for operating one or more kinds of
remote power-off/reset devices suitable for a STONITH/STOMITH approach?

STONITH/STOMITH:

Shoot
The
Other
Node/Machine
In
The
Head

Thanks!

-- Alan Robertson
alanr@suse.com
Re: STONITH implementations
On Thu, Apr 27, 2000 at 10:42:07AM -0600, Alan Robertson wrote:
> Folks,
>
> Does anyone know of any existing code for operating one or more kinds of
> remote power-off/reset devices suitable for a STONITH/STOMITH approach?
>
> STONITH/STOMITH:
>
> Shoot
> The
> Other
> Node/Machine
> In
> The
> Head

VACM http://vacm.sourceforge.net/ has support for controlling
BayTech power strips.

--
Horms
Re: STONITH implementations
Horms wrote:
>
> On Thu, Apr 27, 2000 at 10:42:07AM -0600, Alan Robertson wrote:
> > Folks,
> >
> > Does anyone know of any existing code for operating one or more kinds of
> > remote power-off/reset devices suitable for a STONITH/STOMITH approach?
> >
> > STONITH/STOMITH:
> >
> > Shoot
> > The
> > Other
> > Node/Machine
> > In
> > The
> > Head
>
> VACM http://vacm.sourceforge.net/ has support for controlling
> BayTech power strips.

Thanks Horms!

It's good information, but it looks like it'll take a little tweaking
for use with heartbeat or FailSafe. It has lots of tie-ins to the VA
clustering infrastructure, and it is only set up to work with serial
communication. Serial isn't suitable for STONITH, because all nodes
need to be able to power each other off independently.

However, the code clearly shows how to operate the switches, and the
BayTech hardware seems reasonably nice. They have models that support
having each machine be on its own UPS, and some models provide telnet
support. Unfortunately, they seem to cost around $150/port.

One nice thing about telnet: if the switches only support one caller at a
time, then the machines can't shoot each other in the head
simultaneously. Unfortunately, it also means that when the hub goes
out, STONITH won't work. This could be a real problem unless you have
redundant heartbeat mechanisms to minimize the possibility of a split
cluster occurring for this reason... Hmmm...
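
A minimal sketch of the one-caller-at-a-time point, in Python. It is a toy
model only: the class and method names are hypothetical, it does not speak
any real power switch's protocol, and a lock stands in for the single
telnet session. It shows that when requests are serialized, a node that
has already been shot cannot shoot back, so exactly one node survives.

```python
import threading


class SingleSessionSwitch:
    """Toy model (hypothetical) of a power switch with one session at a time."""

    def __init__(self, outlets):
        self._session = threading.Lock()  # stands in for the single telnet session
        self._powered = {name: True for name in outlets}

    def power_off(self, caller, target):
        """Return True if `caller` managed to power off `target`."""
        with self._session:  # a second caller waits until the session is free
            if not self._powered[caller]:
                return False  # the caller was shot while waiting for its turn
            self._powered[target] = False
            return True

    def is_powered(self, node):
        return self._powered[node]


if __name__ == "__main__":
    switch = SingleSessionSwitch(["node1", "node2"])
    results = {}

    def shoot(me, other):
        results[me] = switch.power_off(me, other)

    threads = [
        threading.Thread(target=shoot, args=("node1", "node2")),
        threading.Thread(target=shoot, args=("node2", "node1")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)  # exactly one of the two values is True
```

Which node wins depends on scheduling; the sketch only guarantees that
both cannot win.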

-- Alan Robertson
alanr@suse.com
Re: STONITH implementations
Hi there,

> Does anyone know of any existing code for operating one or more kinds of
> remote power-off/reset devices suitable for a STONITH/STOMITH approach?

A good possibility would be having a kernel module waiting for a specially
formatted (read crypto, auth, whatever to make it unspoofable) packet to
arrive on the net and then panic the kernel.

Or maybe having it require packets from >50% of the cluster (remember USS
Enterprise auto-destruction activation?:) in order to activate this
code. The code would then acknowledge the fact that it is going down and
panic.
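
A rough sketch of the ">50% of the cluster" idea, written as user-space
Python rather than a kernel module. Everything here is an assumption made
for illustration: the message layout, the per-peer shared keys, the use of
HMAC-SHA256 as the "unspoofable" part, and the names sign and
ShootVoteCounter. It only counts authenticated votes; what to do once the
majority is reached (acknowledge, panic) is left out.

```python
import hashlib
import hmac


def sign(key: bytes, sender: str, target: str, nonce: str) -> str:
    """Authentication tag for one 'shoot' vote (hypothetical message layout)."""
    msg = f"{sender}|{target}|{nonce}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


class ShootVoteCounter:
    """Runs on the node that may be shot; counts authenticated votes."""

    def __init__(self, me, peer_keys, nonce):
        self.me = me
        self.peer_keys = dict(peer_keys)  # assumed: one shared key per peer
        self.nonce = nonce                # identifies this round, rejects old votes
        self.cluster_size = len(self.peer_keys) + 1  # peers plus this node
        self.voters = set()

    def receive(self, sender, target, nonce, tag) -> bool:
        """Record a vote if it is authentic; return True once a majority agrees."""
        key = self.peer_keys.get(sender)
        if key is None or target != self.me or nonce != self.nonce:
            return self.majority_reached()
        expected = sign(key, sender, target, nonce)
        if hmac.compare_digest(expected, tag):
            self.voters.add(sender)  # a set, so duplicates count once
        return self.majority_reached()

    def majority_reached(self) -> bool:
        return 2 * len(self.voters) > self.cluster_size


if __name__ == "__main__":
    keys = {f"n{i}": f"secret-{i}".encode() for i in range(2, 6)}
    counter = ShootVoteCounter("n1", keys, nonce="round-7")
    for peer in ("n2", "n3", "n4"):
        tag = sign(keys[peer], peer, "n1", "round-7")
        print(peer, counter.receive(peer, "n1", "round-7", tag))
```

In a five-node cluster this needs three distinct, correctly signed votes
before the node would take itself down.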

IMHO this should be done in the kernel, for if it has enough functionality
left to mess up the network/cluster, it should also be able to receive/send
packets and panic... :)

This is just a thought, should not be difficult to implement if one has
the time. I'm interested in doing it, but at the moment lack the time.

On #kernelnewbies it was proposed that this would be better done with serial
ports, which are more reliable and easier to spoof-protect. I just can't
foresee how the serial connections would be done, as there should be a way
for everyone to talk to everyone (ring?).

See ya!
Fábio
( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
Re: STONITH implementations
This doesn't totally work. One of the things you are trying
to protect against is the temporarily hung kernel. The other
nodes may send their 'shoot him' messages, but it is hard to
know if it is listening.

At the same time, if you are willing to go down the path of
an active kernel agent, then you should also be trying to
do something more intelligent than panicking. It becomes reasonable
and appropriate to consider it the agent of a generic resource
fencing protocol. For the sake of argument, the GRITS protocol we
have started to discuss on linux-ha-dev.

-dB

Fábio Olivé Leite wrote:
>
> Hi there,
>
> > Does anyone know of any existing code for operating one or more kinds of
> > remote power-off/reset devices suitable for a STONITH/STOMITH approach?
>
> A good possibility would be having a kernel module waiting for a specially
> formatted (read crypto, auth, whatever to make it unspoofable) packet to
> arrive on the net and then panic the kernel.
>
> Or maybe having it require packets from >50% of the cluster (remember USS
> Enterprise auto-destruction activation?:) in order to activate this
> code. The code would then acknowledge the fact that it is going down and
> panic.
>
> IMHO this should be done in the kernel, for if it has enough functionality
> left to mess up the network/cluster, it should also be able to receive/send
> packets and panic... :)
>
> This is just a thought, should not be difficult to implement if one has
> the time. I'm interested in doing it, but at the moment lack the time.
>
> On #kernelnewbies it was proposed that this would be better done with serial
> ports, which are more reliable and easier to spoof-protect. I just can't
> foresee how the serial connections would be done, as there should be a way
> for everyone to talk to everyone (ring?).
>
> See ya!
> Fábio
> ( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
> ( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
> ( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
>
> ------------------------------------------------------------------------------
> Linux HA Web Site:
> http://linux-ha.org/
> Linux HA HOWTO:
> http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
> ------------------------------------------------------------------------------

--
Butterflies tell me to say:
"The statements and opinions expressed here are my own and do not necessarily
represent those of Oracle Corporation."
Re: STONITH implementations
David Brower wrote:
>
> This doesn't totally work. One of the things you are trying
> to protect against is the temporarily hung kernel. The other
> nodes may send their 'shoot him' messages, but it is hard to
> know if it is listening.

I assume that an ACK could be used to solve this. The problem with this
would be that you really need to keep the protocol up and running long
enough to be sure that the ack itself didn't get lost. A hardware
safeguard is probably more reliable than a software safeguard (sigh).
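
To make the ACK idea concrete, here is a small sketch in Python, under
stated assumptions: fence and its three callables (send_shoot,
wait_for_ack, hardware_power_off) are hypothetical hooks, not part of
heartbeat or any existing tool. The shoot request is retried a bounded
number of times, and if no ACK ever arrives the hardware switch is used
instead.

```python
def fence(send_shoot, wait_for_ack, hardware_power_off, retries=3, timeout=1.0):
    """Ask a node to go down; cut its power if no ACK is ever seen.

    All three callables are hypothetical hooks supplied by the caller:
      send_shoot()          sends the software shutdown request
      wait_for_ack(timeout) returns True if an ACK arrived within `timeout`
      hardware_power_off()  operates the remote power switch
    """
    for _ in range(retries):
        send_shoot()
        if wait_for_ack(timeout):
            return "acknowledged"
    # No ACK: either the node is hung or the ACK was lost. Powering off a
    # node that is already down is harmless, so fall back to hardware.
    hardware_power_off()
    return "powered-off"


if __name__ == "__main__":
    log = []
    outcome = fence(
        send_shoot=lambda: log.append("shoot"),
        wait_for_ack=lambda timeout: False,  # simulate a hung node
        hardware_power_off=lambda: log.append("power-off"),
        retries=2,
        timeout=0.0,
    )
    print(outcome, log)
```

A lost ACK and a hung node look the same to the sender, which is why the
sketch treats both the same way. The more paranoid could call the hardware
hook even after an ACK.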

If you changed heartbeat to lock itself in memory and use one of the
more realtime scheduling methods, then the kernel is probably only a
little more reliable than heartbeat. If you limit yourself to only one
media type in the kernel, then it's probably more reliable. I've
thought about a shutdown signal or message. This would still have to be
backed up by a hardware method for the more paranoid IMHO.

> At the same time, if you are willing to go down the path of
> an active kernel agent, then you should also be trying to
> do something more intelligent than panicking. It becomes reasonable
> and appropriate to consider it the agent of a generic resource
> fencing protocol. For the sake of argument, the GRITS protocol we
> have started to discuss on linux-ha-dev.
>
> -dB

Neither of these protects very well against loss of communications
media. Heartbeat is configured to talk over as many media as possible.

Have you thought about whether it would make sense to use the
heartbeat comm layer rather than implement a kernel agent?

It would probably be slightly less reliable than doing it all in the
kernel, but you could ride its coattails (so to speak) in terms of
redundancy, and multiple media types. If you're really paranoid, you
still need to have a hardware safeguard anyway.
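
As an illustration of what redundancy across media buys, here is a short
Python sketch. It is not heartbeat's actual API: send_redundant and the
media mapping are hypothetical, and each medium is reduced to a callable
that raises OSError when its link is down. The same message goes out over
every configured medium, and it counts as delivered if any one of them
succeeds.

```python
def send_redundant(message, media):
    """Send `message` over every medium; return the names that succeeded.

    `media` maps a name to a send callable (hypothetical interface). A
    callable is assumed to raise OSError if its link is unavailable.
    """
    delivered = []
    for name, send in media.items():
        try:
            send(message)
            delivered.append(name)
        except OSError:
            continue  # this medium is down; the others may still work
    return delivered


if __name__ == "__main__":
    received = []

    def dead_serial(message):
        raise OSError("serial link down")

    ok = send_redundant(b"shoot node2", {"serial": dead_serial, "udp": received.append})
    print(ok, received)
```

The message is lost only if the returned list is empty, meaning every
medium failed at once.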

Heartbeat is already doing strong authentication as well, and implements
serial ring protocols, so Fábio's major concerns are all addressed.

It is this kind of application, where you need extremely high
reliability, bounded latency, and low bandwidth, that made me architect
the heartbeat comm layer the way I did. This kind of application is
exactly what it's designed for.

Comments?

-- Alan Robertson
alanr@suse.com