Mailing List Archive

remote reset
[.by remote reset I mean what others call STONITH, but I prefer remote
reset because it isn't an acronym. Acronyms often make things harder
to understand, so I try to avoid them when possible.]

I thought a bit about the problem of performing a remote reset without
dedicated hardware. I see a big problem in a software emulation
of this feature: stopping the computer is trivial, but how do we
restart it?

Starting a computer implies it is off, and when the computer is off
it isn't supposed to run any software, so how do we emulate this feature?

One possibility is to run software that does nothing but wait for a reboot
order. A kludge in the Linux kernel? A modified boot process like
bootp (see the etherboot problem)?

PS: I sketched a protocol which, I believe, is secure. It is able to stop or
reboot a computer, but obviously that doesn't solve this question.
remote reset [ In reply to ]
Jerome Etienne wrote:
>
> [.by remote reset I mean what others call STONITH, but I prefer remote
> reset because it isn't an acronym. Acronyms often make things harder
> to understand, so I try to avoid them when possible.]
>
> I thought a bit about the problem of performing a remote reset without
> dedicated hardware. I see a big problem in a software emulation
> of this feature: stopping the computer is trivial, but how do we
> restart it?
>
> Starting a computer implies it is off, and when the computer is off
> it isn't supposed to run any software, so how do we emulate this feature?
>
> One possibility is to run software that does nothing but wait for a reboot
> order. A kludge in the Linux kernel? A modified boot process like
> bootp (see the etherboot problem)?
>
> PS: I sketched a protocol which, I believe, is secure. It is able to stop or
> reboot a computer, but obviously that doesn't solve this question.

The issue is to guarantee that the following thing occurs:

That the system stops doing disk I/O within a short, known period
of time, like 10 seconds, and doesn't do any more I/O
until it goes through a reset/reboot message first.

It is not necessary that the system actually reboot automatically, just
that it not do any more disk I/O if it doesn't reboot.

So, if it's hung, it just needs to *stay* hung.

The easiest way to do this is to guarantee that it reboots - quickly.

If one wanted to crank the priority of heartbeat up enough, then only
broken hardware, hung drivers (or other kernel or real-time events)
could keep it from seeing a reboot message and acting on it. If you
have sufficiently well-behaved hardware and software, it is improbable
that the reset message would be ignored, and yet the system would do
more disk I/O afterwards.

However, if you're running funky drivers, or ill-behaved real time
software (especially if it uses SCHED_FIFO), or have flaky hardware,
this becomes a more dicey proposition.

As always, it depends a lot on the level of paranoia you have. If
you're using actual shared media, then it might not be paranoid enough.
If you're using LAN mirroring (for example, using DRBD), and have
redundant independent heartbeat media, and are using some kind of quorum
device, then you may be in fine shape in practice.

I don't know of any reliable, reputable way of computing the
probabilities of these events.

-- Alan Robertson
alanr@suse.com
remote reset [ In reply to ]
On Fri, May 05, 2000 at 10:03:50PM -0600, Alan Robertson wrote:
> The issue is to guarantee that the following thing occurs:
>
> That the system stops doing disk I/O within a short, known period
> of time, like 10 seconds, and doesn't do any more I/O
> until it goes through a reset/reboot message first.

1. On the delay: here I assume that 10s is the delay between the
host receiving the order and the disk stopping, not from when somebody
sends the order. There is no way to guarantee a delay on the network.

2. Why do you talk only about the disk?

3. To use a guarantee with a timer would be a bad idea: suppose that within
the 10s the fs isn't synced; then we freeze it in an unstable state with
possible loss of data.

> It is not necessary that the system actually reboot automatically, just
> that it not do any more disk I/O if it doesn't reboot.

I think it is. Not doing it means a human intervention, which should
be avoided when possible.
remote reset [ In reply to ]
Jerome Etienne wrote:
>
> On Fri, May 05, 2000 at 10:03:50PM -0600, Alan Robertson wrote:
> > The issue is to guarantee that the following thing occurs:
> >
> > That the system stops doing disk I/O within a short, known period
> > of time, like 10 seconds, and doesn't do any more I/O
> > until it goes through a reset/reboot message first.
>
> 1. On the delay: here I assume that 10s is the delay between the
> host receiving the order and the disk stopping, not from when somebody
> sends the order. There is no way to guarantee a delay on the network.

What I was referring to here is that if you send a command to reboot to
a machine and it doesn't respond, after a certain amount of time, you
have to timeout and go on - either assuming that it is still writing the
disk or that it has stopped. If you can't guarantee that it has stopped
writing to the disk, then you cannot proceed.

>
> 2. Why do you talk only about the disk?

That's a good question. I'm not sure that my answer is as good as your
question. First of all, the consequences of two machines writing to a
shared disk simultaneously are catastrophic, so it's a good example. As
far as I know there are basically two reasons for the two nodes to both
be up and each think the other is down: One is a system hang, and the
other is loss of communications. In the case of loss of communications,
the use of a quorum device will make the other machine give up and stop
providing service, in the same time interval, and not persist. System
hangs are only a problem if they recover and begin functioning again
after some period of time. Even then, they'll notice the loss of quorum
after a bit, and then also shut themselves down. Until they notice a
loss of quorum, they are a potential danger to cluster integrity. The
only really damaging thing I can think of that they could do during a
short (let's say 5-second) interval is write to disk. If, for example,
they respond to a few packets at an IP address that no longer belongs to
them, those responses will most likely just be ignored.

Can anyone think of any other examples of things which it can do during
a short interval which would be damaging to cluster or data integrity?

> 3. To use a guarantee with a timer would be a bad idea: suppose that within
> the 10s the fs isn't synced; then we freeze it in an unstable state with
> possible loss of data.

Well, you'd better be using a journalling filesystem anyway. The
reason for the 10-second timer is the assumption that the other machine
doesn't respond at all during that time.

>
> > It is not necessary that the system actually reboot automatically, just
> > that it not do any more disk I/O if it doesn't reboot.
>
> I think it is. Not doing it means a human intervention, which should
> be avoided when possible.

I didn't mean that it wasn't desirable, just that it didn't have to
actually be guaranteed from the point of view of data integrity.


-- Alan Robertson
alanr@suse.com
remote reset [ In reply to ]
On Sat, May 06, 2000 at 08:27:27AM -0600, Alan Robertson wrote:
> > 2. why do you talk only about the disk ?
[snip]
> Can anyone think of any other examples of things which it can do during
> a short interval which would be damaging to cluster or data integrity?

I thought more about the problem, assuming we choose to solve it with a
kludge in the kernel. We want to halt the computer and have it wait for
a new command (probably to reboot).

I think that if we stop the disk I/O and stop scheduling processes, we
could consider the computer 'halted'. Some problems with things like
khttpd/nfsd will occur; they have to be stopped too.

We can't stop the network I/O because we want to be able to receive the
new order, but as processes (and kernel network daemons) are stopped,
problems will occur with automatic retransmission mechanisms like TCP's.
I'm not sure that is actually a problem, though.

[minor topics follows]

> > 1. On the delay: here I assume that 10s is the delay between the
> > host receiving the order and the disk stopping, not from when somebody
> > sends the order. There is no way to guarantee a delay on the network.
>
> What I was referring to here is that if you send a command to reboot to
> a machine and it doesn't respond, after a certain amount of time, you
> have to timeout and go on - either assuming that it is still writing the
> disk or that it has stopped. If you can't guarantee that it has stopped
> writing to the disk, then you cannot proceed.

I'm lost: 'have to timeout and go on' and 'no guarantee so cannot proceed'
are opposites.

> > 3. To use a guarantee with a timer would be a bad idea: suppose that within
> > the 10s the fs isn't synced; then we freeze it in an unstable state with
> > possible loss of data.
>
> Well, you'd better be using a journalling filesystem anyway.

A journalling fs 'only' guarantees the integrity of the fs under a
crash; it doesn't prevent losing data. But let's forget this topic,
because the timer was a network timeout.
remote reset [ In reply to ]
Hi there,

) I thought more about the problem, assuming we choose to solve it with a
) kludge in the kernel. We want to halt the computer and have it wait for
) a new command (probably to reboot).
)
) I think that if we stop the disk I/O and stop scheduling processes, we
) could consider the computer 'halted'. Some problems with things like
) khttpd/nfsd will occur; they have to be stopped too.

I think a small kernel module that could trigger SysRq key sequences upon
receiving special packets is the best option. With that, we can terminate
all processes, sync disks, umount partitions and even halt the machine,
all with code that is already there - we will only need to add a new
trigger mechanism, that could even be an optional module.
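
As a rough illustration of this packet-triggered SysRq idea, here is a userspace sketch rather than a kernel module. The command names and the HMAC framing are my own assumptions for illustration; only the SysRq letters themselves are the standard Linux ones, and writing them to /proc/sysrq-trigger needs root on a real system.

```python
# Userspace sketch: map authenticated remote commands to SysRq actions.
# Command names and HMAC framing are illustrative assumptions; the SysRq
# letters are the standard Linux ones.
import hmac
import hashlib

SYSRQ = {
    "term-all": "e",   # send SIGTERM to all processes
    "sync":     "s",   # sync mounted filesystems
    "readonly": "u",   # remount filesystems read-only
    "poweroff": "o",
    "reboot":   "b",
}

def verify(key: bytes, payload: bytes, tag: bytes) -> bool:
    """Reject packets whose HMAC-SHA256 tag doesn't match."""
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

def handle_packet(key, payload, tag, trigger="/proc/sysrq-trigger",
                  dry_run=True):
    """Return the SysRq letter for an authenticated command, else None."""
    if not verify(key, payload, tag):
        return None
    letter = SYSRQ.get(payload.decode())
    if letter and not dry_run:
        with open(trigger, "w") as f:   # needs root on a real system
            f.write(letter)
    return letter
```

The point of the HMAC check is exactly the security concern raised later in the thread: an unauthenticated "special packet" that can halt a machine is a denial-of-service gift.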

The good thing is that the only dependency is on a running kernel and
network card. If the network card is down, we don't have to worry, since
it won't pollute the network and any running cluster manager will
eventually time out, lose quorum and shut itself down. With a broken
kernel, the machine is halted anyway.

If people use a shared SCSI disk and fear a controller going nuts, then a
hardware shutdown solution is required, IMHO. If DRBD or even ODR is being
used, then the software solution above is enough.

What do you think?

Fábio
( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
remote reset [ In reply to ]
On Wed, May 10, 2000 at 11:35:33AM -0300, Fábio Olivé Leite wrote:
> I think a small kernel module that could trigger SysRq key sequences upon
> receiving special packets is the best option. With that, we can terminate
> all processes, sync disks, umount partitions and even halt the machine,

All of that is there, but if we halt the machine, a human intervention
is required to bring it back. I would like to avoid that, so as to have
a machine which doesn't do anything but wait for a restart order.
So can we reach this goal by simply stopping process scheduling and
disk I/O?

I think reusing the shutdown code is a good idea, but some modifications
may be needed: e.g. if the shutdown procedure removes the network
interface structure, the kernel no longer knows its IP addresses and
can't receive the packet. This one can be easily fixed, but the needed
modifications have to be listed to determine whether there are more
annoying cases.

> The good thing is that the only dependency is on a running kernel and
> network card. If the network card is down, we don't have to worry, since
> it won't pollute the network and any running cluster manager will
> eventually time out, lose quorum and shut itself down. With a broken
> kernel, the machine is halted anyway.

Not necessarily: a broken kernel may send bogus network packets or write
wrong data to local disks. With a software solution, I don't think we can
avoid this, but we need to be aware of it.

Some software checks the integrity of its code (ARPANET routers) or
data (OSPF) by periodically checksumming it. Nevertheless, doing this
in the Linux kernel may not be practical: there are too many data
structures, and (according to a short conversation with Andy Kleen and
Alan Cox) the Linux kernel relies on the ability to modify its own code.
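
The periodic-checksum idea can be sketched like this. The guarded structure and the audit call are placeholders of my own; nothing here reflects actual ARPANET or OSPF code.

```python
# Sketch of periodic integrity checking: checksum a critical data
# structure and flag modifications that bypassed the legitimate update
# path (i.e. silent corruption). The structure is a placeholder.
import hashlib

def checksum(state: bytes) -> bytes:
    return hashlib.sha256(state).digest()

class Guarded:
    def __init__(self, state: bytes):
        self.state = bytearray(state)
        self.digest = checksum(bytes(self.state))

    def update(self, state: bytes):
        """Legitimate updates refresh the checksum."""
        self.state = bytearray(state)
        self.digest = checksum(bytes(self.state))

    def intact(self) -> bool:
        """Periodic audit: recompute and compare against the stored digest."""
        return checksum(bytes(self.state)) == self.digest
```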

> If people use a shared SCSI disk and fear a controller going nuts, then a
> hardware shutdown solution is required, IMHO. If DRBD or even ODR is being
> used, then the software solution above is enough.

agreed.
remote reset [ In reply to ]
Fábio Olivé Leite wrote:
>
> Hi there,
>
> ) I thought more about the problem, assuming we choose to solve it with a
> ) kludge in the kernel. We want to halt the computer and have it wait for
> ) a new command (probably to reboot).
> )
> ) I think that if we stop the disk I/O and stop scheduling processes, we
> ) could consider the computer 'halted'. Some problems with things like
> ) khttpd/nfsd will occur; they have to be stopped too.
>
> I think a small kernel module that could trigger SysRq key sequences upon
> receiving special packets is the best option. With that, we can terminate
> all processes, sync disks, umount partitions and even halt the machine,
> all with code that is already there - we will only need to add a new
> trigger mechanism, that could even be an optional module.
>
> The good thing is that the only dependency is on a running kernel and
> network card. If the network card is down, we don't have to worry, since
> it won't pollute the network and any running cluster manager will
> eventually time out, lose quorum and shut itself down. With a broken
> kernel, the machine is halted anyway.

I would offer one additional thing: I believe that it might be best to
implement this trigger from heartbeat, rather than strictly in the
kernel, for the following reasons: 1) Heartbeat already has a highly
available communications mechanism, including serial ports, so that if
someone accidentally disconnects your ethernet, you can get the order to
shut down over the serial port. 2) Heartbeat already does strong
authentication, so the considerable security risks associated with this
solution are minimized. I understand that heartbeat in theory will fail
slightly more often than a kernel implementation, but I suspect that it
differs in the 3rd significant digit, because heartbeat locks itself in
memory and raises its own priority. My guess is that this is more
than made up in practice by the robust and redundant communications
mechanism that heartbeat already implements.

> If people use a shared SCSI disk and fear a controller going nuts, then a
> hardware shutdown solution is required, IMHO. If DRBD or even ODR is being
> used, then the software solution above is enough.

What's ODR?

I would go even farther. I believe that with DRBD, quorum alone is
enough, and even software remote reset isn't needed.

Remote reset is used to deal with two conditions:
1) broken communications
2) hung kernels that either continue to sputter along intermittently
or recover completely after an excessive pause

1) Is taken care of by quorum.

When one uses DRBD, loss of quorum will result in the local partition
being marked invalid, and not being used as a source of TrueBits.
Therefore whatever writes were done to it will never be used. This
renders 2) irrelevant.

To have the highest degree of data integrity, one must configure drbd to
not report I/O complete until the slave node reports that the data is on
the slave disk. Otherwise, you'll report write complete to applications
for data which may not really be on the "real" disk. Here's how that
can happen:

Machine "A" is the drbd master, and "B" is the slave. The switch (or
hub) "S" is connected to both "A" and "B" and is used as a quorum
device, and the network connecting A, B, and S is used for drbd traffic.

Someone unplugs the ethernet cable on A. "A" writes to the local disk,
schedules the data to be written to B, and acknowledges write complete
to the application. The application then reports success to the user.
However the write never gets mirrored to B. Machine A notices that it
has lost quorum and dutifully stops providing services, marking its own
local partition as not containing TrueBits. Machine B notices that A
has gone, sees that it can talk to the quorum device, decides machine A
has died, and takes over as drbd master.

It then goes on and happily updates the disk as it should, including the
same blocks that A never wrote to it. When someone reconnects the
ethernet on "A", it accepts updates from B, overwriting the data it
never copied to B before the network went down.

This is bad news - because I/O operations formerly acknowledged as
complete are effectively never done. The only way to solve this is to
report I/O success upwards only when either the slave machine has
written the bits to disk, or you complete a transition and decide to go
on without the slave.
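
The ordering problem described above can be shown with a toy model. The classes below are illustrative stand-ins of my own, not DRBD's actual code: acknowledging a write to the application before the slave has it means a cable pull plus failover can lose acknowledged data.

```python
# Toy model of master/slave mirroring to illustrate ack ordering.
# sync=True mimics "don't report I/O complete until the slave has the
# data"; sync=False acks early and queues the mirror write.
class Slave:
    def __init__(self):
        self.disk = {}

    def replicate(self, block, data):
        self.disk[block] = data

class Master:
    def __init__(self, slave, sync):
        self.disk, self.slave, self.sync = {}, slave, sync
        self.pending = []                 # writes queued but not mirrored

    def write(self, block, data):
        self.disk[block] = data
        if self.sync:
            self.slave.replicate(block, data)  # wait for slave first
        else:
            self.pending.append((block, data)) # ack now, mirror "later"
        return "ack"                           # reported to the application

def failover_after_cable_pull(sync):
    """Pull the cable right after an acked write; return what B has."""
    slave = Slave()
    master = Master(slave, sync)
    master.write("block7", "data")   # application sees "ack"
    # cable pulled here: pending writes are never mirrored; B takes over
    return slave.disk.get("block7")
```

With sync=True the surviving node has the acknowledged block; with sync=False it silently doesn't, which is exactly the "formerly acknowledged I/O never done" failure.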

Nevertheless, stronger I/O fencing is of no help in this case at all -
even if Machine B had successfully reset "A", the result would have been
the same.

-- Alan Robertson
alanr@suse.com
remote reset [ In reply to ]
On Wed, May 10, 2000 at 09:34:34AM -0600, Alan Robertson wrote:
> I would offer one additional thing: I believe that it might be best to
> implement trigger this from heartbeat, rather than strictly in the
> kernel

Heartbeat is userspace code, so it is less stable than the kernel (it
can be killed by the kernel if there is no more memory, or swapped out
with the machine too loaded to get it back into RAM, etc.).

Basically, since this remote reset is used in an emergency, it should
rely on as few elements as possible, and userspace processes rely on
more elements than the kernel does.

> for the following reasons: 1) Heartbeat already has a highly
> available communications mechanism

In which way is it more available than usual IP communication?

> including serial ports so that if someone accidentally disconnects
> your ethernet, you can get the order to shut down by the serial port.

Here you assume machines are linked with a serial link.
I think this assumption can be avoided, because any IP link (over serial
or other media) would deliver the packet.

> 2) Heartbeat already does strong
> authentication, so the considerable security risks associated with this
> solution are minimized.

As far as I know, its security is undocumented, and therefore
unreviewable. It is commonly advised (e.g. in the sci.crypt FAQ) to seek
review in order to avoid holes. In cryptography, an algorithm isn't
trusted if it hasn't been seriously reviewed; I think the same applies
in this case.

If there is a text about heartbeat's security, please give a pointer.

> I understand that heartbeat in theory will fail
> slightly more often than a kernel implementation, but I suspect that it
> differs in the 3rd significant digit, because heartbeat locks itself in
> memory, and raises its own priority. My guess is that this is more
> than made up in practice by the robust and redundant communications
> mechanism that heartbeat already implements.

How does heartbeat implement robust and redundant communications better
than the usual IP protocols?

I re-ask this question because I don't see the need to reimplement a
complete network stack in userspace when IP has been designed and
implemented by experienced people over the years. I think it would be a
mistake not to use it.
remote reset [ In reply to ]
On Wed, May 10, 2000 at 09:34:34AM -0600, Alan Robertson wrote:
> I would offer one additional thing: I believe that it might be best to
> implement trigger this from heartbeat, rather than strictly in the
> kernel for the following reasons

Another one I forgot :) How do you plan to restart the computer?

If it is handled by heartbeat, it means that processes must still be
scheduled, and so the other processes may do undesirable things
while heartbeat waits for the order to restart.
remote reset [ In reply to ]
Jerome Etienne wrote:
>
> On Wed, May 10, 2000 at 09:34:34AM -0600, Alan Robertson wrote:
> > I would offer one additional thing: I believe that it might be best to
> > implement trigger this from heartbeat, rather than strictly in the
> > kernel for the following reasons
>
> Another one I forgot :) How do you plan to restart the computer?

There are well-known system calls for rebooting.

> If it is handled by heartbeat, it means that processes must still be
> scheduled, and so the other processes may do undesirable things
> while heartbeat waits for the order to restart.

Heartbeat runs as a soft real-time process at a priority higher than all
normal user processes. It is locked into memory.

-- Alan Robertson
alanr@suse.com
remote reset [ In reply to ]
On Wed, May 10, 2000 at 11:45:33AM -0600, Alan Robertson wrote:
> There are well-known system calls for rebooting.

Here we need two actions: to stop the computer and to restart it.
The syscalls I know either halt or reboot the computer, but none of them
'halt, then wait for an order to reboot without doing anything else'.

> > If it is handled by heartbeat, it means that processes must still be
> > scheduled, and so the other processes may do undesirable things
> > while heartbeat waits for the order to restart.
>
> Heartbeat runs as a soft real-time process at a priority higher than all
> normal user processes. It is locked into memory.

Assuming heartbeat still runs isn't enough; you must stop all the other
processes. How do you do that?
remote reset [ In reply to ]
The party issuing the kill order needs to know if it was successful
or not. It will need a strong guarantee that successful completion
means that no more writes to disk are going to happen from that
node. The timer is an awkward heuristic -- all it can do after
the timer expires is return an "i don't know" status. Which is
fine -- but the survivor will need to get human intervention
at that point in order to proceed safely. Needing human intervention
is sad, but necessary when status is in question.
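
The issuing side described above might look like the sketch below. The transport is a placeholder queue; only the three-way outcome (confirmed, versus timer-expired "I don't know") is the point.

```python
# Sketch of the survivor issuing a kill order: send it, wait for
# confirmation, and report "unknown" when the timer expires. The
# send_order callable and ack queue are placeholder transport.
import queue

def issue_kill(send_order, acks: "queue.Queue", timeout: float = 10.0) -> str:
    """Return 'fenced' on confirmation, 'unknown' on timeout."""
    send_order()
    try:
        acks.get(timeout=timeout)
        return "fenced"      # strong guarantee: safe to take over
    except queue.Empty:
        return "unknown"     # do NOT proceed without human intervention
```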

Conceptually, a halt/reboot sequence causes complete fencing
of i/o from the node; and, the reboot does an unfence, with the
requirement that the code on the node is smart enough to obtain
quorum membership before it does any potentially damaging writes.

I see no compelling reasons for the execution-agent on the node
to be in either kernel or user space. Implementation convenience
seems as good a reason as any other to me.

The problem, of course, is that you'd really like to have a way
of forcing the operation when there is no cooperative software
running on the machine at all, because it has hung or otherwise
gone insane. This is where you need mechanisms like the r/c
power switch, or access to reservations in the storage fabric
to turn i/o capability off w/o relying on the node in question.

-dB

--
Butterflies tell me to say:
"The statements and opinions expressed here are my own and do not necessarily
represent those of Oracle Corporation."
remote reset [ In reply to ]
Also, in the interests of recovery in the survivors,
it is best if the shot takes the node down instantly.
You do -not- want to be waiting for any part of a
graceful shutdown, -especially- syncing buffers.

The system must always do fast recovery from a truly
crashed node. There is probably nothing you can do
in this case to make the recovery any faster than that,
and much to delay the start of that recovery (eg: syncing).

Systems that rely on orderly shutdown for fast restart
are broken. (That's a little dogmatically overstated, but
not by much.)

Anything that cared about its data integrity wasn't relying
on buffer cache flush anyway. It needed real transactional
integrity.

-dB
remote reset [ In reply to ]
On Wed, May 10, 2000 at 11:42:27AM -0700, David Brower wrote:
> Also, in the interests of recovery in the survivors,
> it is best if the shot takes the node down instantly.
> You do -not- want to be waiting for any part of a
> graceful shutdown

Agreed.
At the beginning, I was confused. Here, we are trying to emulate a
physical device which powers down the computer. The emulation must
be as close as possible to the original, so no graceful shutdown.
remote reset [ In reply to ]
David Brower wrote:
>
> The party issuing the kill order needs to know if it was successful
> or not. It will need a strong guarantee that successful completion
> means that no more writes to disk are going to happen from that
> node. The timer is an awkward heuristic -- all it can do after
> the timer expires is return an "i don't know" status. Which is
> fine -- but the survivor will need to get human intervention
> at that point in order to proceed safely. Needing human intervention
> is sad, but necessary when status is in question.

And that applies where a single copy of the data is involved, as when
using physically shared media. "Logically shared" media (drbd or
raid+nbd) has different constraints.

There are two reasons why one wants to fence off the other node:
Lost communication
Sick computer

Properly designed redundant communication can largely eliminate the
first cause. Sick computers are problems. They are the least likely to
reliably report their own demise. Exactly when you need them to die,
they won't die in a way that you can count on.

For this state, and physically shared media, an external fencing
mechanism (like a reset) is the best choice.

> Conceptually, a halt/reboot sequence causes complete fencing
> of i/o from the node; and, the reboot does an unfence, with the
> requirement that the code on the node is smart enough to obtain
> quorum membership before it does any potentially damaging writes.

Exactly.

> I see no compelling reasons for the execution-agent on the node
> to be in either kernel or user space. Implementation convenience
> seems as good a reason as any other to me.

That's obviously my take as well ;-)

> The problem, of course, is that you'd really like to have a way
> of forcing the operation when there is no cooperative software
> running on the machine at all, because it has hung or otherwise
> gone insane. This is where you need mechanisms like the r/c
> power switch, or access to reservations in the storage fabric
> to turn i/o capability off w/o relying on the node in question.

The only time you don't need this is if you have a replication agent
like DRBD involved, so that both parties can write the same logical disk
without interfering with each other. If the node eventually
regains sanity on its own, it will realize that it's been fenced off
by the other node and can invalidate its copy of the data. One can
postulate that the node will either regain sanity through some
mechanism, or eventually be removed from the cluster ;-)

I *really* like the DRBD mechanism. This is one of the reasons why.

-- Alan Robertson
alanr@suse.com
remote reset [ In reply to ]
David Brower wrote:
>
> Also, in the interests of recovery in the survivors,
> it is best if the shot takes the node down instantly.
> You do -not- want to be waiting for any part of a
> graceful shutdown, -especially- syncing buffers.
>
> The system must always do fast recovery from a truly
> crashed node. There is probably nothing you can do
> in this case to make the recovery any faster than that,
> and much to delay the start of that recovery (eg: syncing).

Depending on why the system is sick, a graceful shutdown attempt could
cause it to become more hung. A *reliable* mechanism is vastly more
important than a graceful one.

> Systems that rely on orderly shutdown for fast restart
> are broken. (That's a little dogmatically overstated, but
> not by much.)

This is why journalling filesystems are so important.

> Anything that cared about its data integrity wasn't relying
> on buffer cache flush anyway. It needed real transactional
> integrity.

Good summary.

-- Alan Robertson
alanr@suse.com
remote reset [ In reply to ]
Jerome Etienne wrote:
>
> On Wed, May 10, 2000 at 11:45:33AM -0600, Alan Robertson wrote:
> > There are well-known system calls for rebooting.
>
> Here we need two actions: to stop the computer and to restart it.
> The syscalls I know either halt or reboot the computer, but none of them
> 'halt, then wait for an order to reboot without doing anything else'.
>
> > If it is handled by heartbeat, it means that processes must still be
> > scheduled, and so the other processes may do undesirable things
> > while heartbeat waits for the order to restart.
> >
> > Heartbeat runs as a soft real-time process at a priority higher than all
> > normal user processes. It is locked into memory.
>
> Assuming heartbeat still runs isn't enough; you must stop all the other
> processes. How do you do that?

For any software system to work, something on the machine has to work.
Even kernel systems need working interrupts and scheduling of timeouts
to work. What heartbeat adds to that is that it has to be able to get
scheduled for long enough to issue the system call to reboot the
computer.

As David points out, all we need is a reboot call (i.e. a processor
reset). That is completely sufficient.

-- Alan Robertson
alanr@suse.com
remote reset [ In reply to ]
Jerome Etienne wrote:
>
> On Wed, May 10, 2000 at 11:42:27AM -0700, David Brower wrote:
> > Also, in the interests of recovery in the survivors,
> > it is best if the shot takes the node down instantly.
> > You do -not- want to be waiting for any part of a
> > graceful shutdown
>
> Agreed.
> At the beginning, I was confused. Here, we are trying to emulate a
> physical device which powers down the computer. The emulation must
> be as close as possible to the original, so no graceful shutdown.

I feel the need to pound on a dogmatic point here, so even though
you are agreeing, I'm going to take issue with how you're agreeing.

It's not that we are choosing to go be plug compatible with something
that happens to be in place. It is trying to get to the point where
the sick node is out of our misery as fast as possible. Because of
all the other things that can go wrong, our highly available system
has to meet its data integrity and minimum recovery times in the face
of power off to a node. Therefore, anything that slows down the
exit compared to a power off is a Bad Thing.

Let me go deeper: it's my opinion that running any scripts on
shutdown, other than those that issue nice warnings, ought not be
needed by a truly robust system. Shutdown ought not take more than
a few seconds longer than hitting the power switch. Seeking orderly
shutdown is a mistake. Look at all the scripts that are run down
during your Linux shutdown -- how many of them ought to be critical
to system integrity, or speed boot time much? Hardly any of them.
The only step that is of practical use is the sync step, which is
only needed because of the lack of journalled file system. Once
those are available, you ought to be able to pull the plug anytime
you please, and nuts to the orderly shutdown. The software-controlled
off switch is the work of the devil.

pedantically,

-dB

PS,

It's also my opinion that any program ought to be able to safely
call exit() anywhere and have the rest of the world deal. Hardly
anybody agrees with me about this, and I relent myself for the
purposes of getting good Purify data. But if you adopt this
point of view, you don't have shutdown bugs, only startup bugs.
Code you don't write for cleanup you don't do at shutdown
executes fast, fast, fast and never segvs. It certainly
provides a limit value for program shutdown performance.
remote reset [ In reply to ]
On Wed, May 10, 2000 at 01:28:14PM -0700, David Brower wrote:
> > Agreed.
> > At the beginning, I was confused. Here, we are trying to emulate a
> > physical device which powers down the computer. The emulation must
> > be as close as possible to the original, so no graceful shutdown.
>
> I feel the need to pound on a dogmatic point here, so even though
> you are agreeing, I'm going to take issue with how you're agreeing.
>
> It's not that we are choosing to go be plug compatible with something
> that happens to be in place.

I think we are.

In http://lists.tummy.com/pipermail/linux-ha-dev/2000-May/000548.html,
you can find a message from Alan Robertson:
"Since FailSafe depends on STONITH, I'm writing a little "white paper" so
to speak on STONITH, and how it can be implemented."

I talk about emulation because of this message.
remote reset [ In reply to ]
Whatever; the point is that it is a correctness
issue, and the semantics of machine kill are best
done with instantaneous dispatch. The mechanism
is to some degree irrelevant, as long as the
proper results are achieved.

-dB

Jerome Etienne wrote:
>
> On Wed, May 10, 2000 at 01:28:14PM -0700, David Brower wrote:
> > > agreed.
> > > At the begining, i was confused. Here, we are trying to emulate a
> > > physical device which powerdown the computer. The emulation must
> > > be as close as possible of the original, so no gracefull shutdown.
> >
> > I feel the need to pound on a dogmatic point here, so even though
> > you are agreeing, I'm going to take issue with how you're agreeing.
> >
> > It's not that we are choosing to go be plug compatible with something
> > that happens to be in place.
>
> i think we are.
>
> In http://lists.tummy.com/pipermail/linux-ha-dev/2000-May/000548.html,
> you can find a message from alan robertson.
> "Since FailSafe depends on STONITH, I'm writing a little "white paper" so
> to speak on STONITH, and how it can be implemented."
>
> I talk about emulation because of this message.
>
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.tummy.com
> http://lists.tummy.com/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/

--
Butterflies tell me to say:
"The statements and opinions expressed here are my own and do not necessarily
represent those of Oracle Corporation."
remote reset [ In reply to ]
Hi there,

) As David points out, all we need is a reboot call (i.e. a processor
) reset). That is completely sufficient.

OK, agreed. Now the only thing to do seems to be creating another message
type for heartbeat, that when received will have it call

reboot(now, don't even blink);

and, of course, defining the situation in which it will be sent. Using
heartbeat's communication code is obvious, since it does all that needs to be done.

Jerome, the IP protocols are nothing compared to a reliable multicast
implementation. When I proposed the "NetSysRq" module, I was certainly
thinking of a lot more than UDP or TCP... :)

See ya!
Fábio
( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
remote reset [ In reply to ]
On Wed, May 10, 2000 at 02:30:41PM -0700, David Brower wrote:
> Whatever; the point is that it is a correctness
> issue, and the semantics of machine kill are best
> done with instantaneous dispatch. The mechanism
> is to some degree irrelevant, as long as the
> proper results are achieved.

We agreed on the main topic (i.e. should we sync the disk before stopping
the machine? No): "the proper results are achieved".
The fact that I disagree with you on some minor issues of why we should
do it "is to some degree irrelevant".

Let's not argue for the sake of it. There is too much work to do to waste
time.
remote reset [ In reply to ]
On Wed, May 10, 2000 at 06:38:18PM -0300, Fábio Olivé Leite wrote:
> Jerome, the IP protocols are nothing compared to a reliable multicast
> implementation.

Here are 2 things:
1. Why do you need a reliable multicast implementation?
2. The IP protocols already include several reliable multicast protocols,
implemented and heavily tested.

> When I proposed the "NetSysRq" module, I was certainly
> thinking of a lot more than UDP or TCP... :)

me too
remote reset [ In reply to ]
On Wed, May 10, 2000 at 06:38:18PM -0300, Fábio Olivé Leite wrote:
> ) As David points out, all we need is a reboot call (i.e. a processor
> ) reset). That is completely sufficient.
>
> OK, agreed. Now the only thing to do seems to be creating another message
> type for heartbeat, that when received will have it call
>
> reboot(now, don't even blink);

OK, apparently I am the only one with this opinion, so I will ask 2
questions, hopefully clear and succinct; please answer the same way.

1. If we halt (stop without restart) the computer, a human or dedicated
hardware is required to restart it.
This increases the price of an HA solution: only people who can pay for
a 24h/7d operator or for dedicated hardware can use it.
Does that seem a good solution?

2. If we reboot (stop and immediately restart) the computer, we do it
because the computer is faulty, and it may still be faulty after the
reset. So we enter a loop in which the computer keeps rebooting over
and over, possibly doing damage while it is up.
Does that seem a good solution?
