Mailing List Archive

Reserve/Release for NBD/RAID: PLEASE COMMENT!
Hi,

Sorry to send this to both lists. I couldn't decide which.

This is an attempt to solve the problem of network partitioning possibly causing
an application to report that an I/O operation was a success when in fact it
shouldn't have been, because we are in a "partitioned cluster" mode.

I have been mulling over an idea which no doubt still has some holes in it.
Let's see if we can make them very small, or better yet, make them go completely
away.

I propose a modified version of the NBD and the mirroring code. Perhaps the
changes will be small. Perhaps they won't, and the RAID driver has to fork.
Perhaps we'll decide it isn't practical. Let's find out.

As in all mirroring schemes, each write has to go both to the local disk and the
remote disk. In this scheme, the RAID code would then wait for the remote disk
write to complete before attempting the local disk write.

If the remote machine has falsely declared the local machine "down", then it
will make the remote disk "busy" (effectively reserved), and the local machine
will then get an error when it tries to write to the remote disk. It would then
treat this error as a special case, and refuse to write to the local disk as
well - propagating this error back to the caller.

This obviously has some potential performance (latency) issues. I'm not sure
they're worse than the "normal" network RAID case, since (I think?) the writer
has to wait for both writes anyway. You could always save the old block from
the disk, and then put it back if the remote write gets the partitioned cluster
error. It is essential that you not return success to the user before the
remote responds or times out.
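The ordering constraint in the last two paragraphs can be sketched as follows (a minimal model with invented names and test-double writes, not actual md/NBD code):

```c
#include <errno.h>

/* Status codes for the sketch; IO_PARTITIONED models the "busy"
 * (reserved) error returned by a remote disk whose owner believes
 * we are dead. */
enum io_status { IO_OK = 0, IO_ERR = -EIO, IO_PARTITIONED = -ENOLCK };

/* Test doubles standing in for the real block-layer writes. */
static enum io_status remote_status = IO_OK;
static int local_writes = 0;

static enum io_status remote_write(long block) { (void)block; return remote_status; }
static enum io_status local_write(long block)  { (void)block; local_writes++; return IO_OK; }

/* Remote-first mirrored write: the local disk is touched only after the
 * remote write succeeds, so a partitioned-cluster error suppresses the
 * local write and propagates back to the caller.  Success must not be
 * reported before the remote responds or times out. */
enum io_status mirror_write(long block)
{
    enum io_status rc = remote_write(block);
    if (rc != IO_OK)
        return rc;
    return local_write(block);
}
```

The point of the sketch is only the ordering: the caller never sees success, and the local disk never diverges, while the remote side holds the reservation.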

If the local machine has declared the remote machine down, then the process is
in some sense simpler, since this means that the "owning" machine simply has to
resync the mirror on the remote machine. No potential of scrogging data state
here...

The worst case is probably where each machine thinks the other one is down.

There are several cases I haven't considered here, and there are also questions
about how to involve alternative (i.e., serial) communication media in this, so
that you can handle some cases like where various kinds of network failures
occur.

It's a little late for me to make this bulletproof before going to bed, but I
thought I'd throw it out for you to tear up and improve.

-- Alan Robertson
alanr@bell-labs.com
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
On Tue, 07 Dec 1999, Alan Robertson wrote:
>Hi,
>
>Sorry to send this to both lists. I couldn't decide which.
>
>This is an attempt to solve the problem of network partitioning possibly causing
>an application to report that an I/O operation was a success when in fact it
>shouldn't have been, because we are in a "partitioned cluster" mode.
>
>I have been mulling over an idea which no doubt still has some holes in it.
>Let's see if we can make them very small, or better yet, make them go
>completely away.
>
>I propose a modified version of the NBD and the mirroring code. Perhaps the
>changes will be small. Perhaps they won't, and the RAID driver has to fork.
>Perhaps we'll decide it isn't practical. Let's find out.
>

That is what I already did with drbd. It is independent of the RAID code.

>As in all mirroring schemes, each write has to go both to the local disk and the
>remote disk. In this scheme, the RAID code would then wait for the remote disk
>write to complete before attempting the local disk write.
>
>If the remote machine has falsely declared the local machine "down", then it
>will make the remote disk "busy" (effectively reserved), and the local machine
>will then get an error when it tries to write to the remote disk. It would then
>treat this error as a special case, and refuse to write to the local disk as
>well - propagating this error back to the caller.
>

I do not understand. If I lose contact with the other machine, I return
an error to the caller, right? But where is the high availability?
This doubles my chance of an outage of service.

First: the "active" node fails; the other takes over but cannot
run the service (say, a Samba server) because it cannot write...
Second: the "inactive" node fails, and then the active node
cannot write anymore??

I do not know how I would explain this to the users of my HA samba server.

***

My approach to the problem is:

There are two major parts:
*) The mirror code (which is integrated into a block device in my case (drbd))
*) The cluster membership code (maybe heartbeat ... or something else)

If drbd loses contact with the other node, ...
...on the "active" node, it continues operation by writing/reading to/from the
local disk (and recording the block numbers of the written blocks in a
log).
...on the "stand-by" node, it just sits there and hopes that the "active"
node will reappear at some point.

To switch a node from the "stand-by" state to the "active" state, there
is the cluster membership code. It is placed in user space and is better
able to judge the cluster state.
Maybe it controls an additional serial link, or redundant NICs, or whatever.
Since it is placed in user space, it is easier to modify it to fit a
particular application.
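The block-number log mentioned above can be sketched as a dirty-block bitmap (an illustration of the idea only, not drbd's actual format; all names are invented):

```c
#include <stdint.h>

#define NBLOCKS 4096u

/* One bit per block, set for every write made while the peer is
 * unreachable; only these blocks need copying at resync time. */
static uint8_t dirty[NBLOCKS / 8];

static void mark_dirty(unsigned b)  { dirty[b / 8] |= (uint8_t)(1u << (b % 8)); }
static int  is_dirty(unsigned b)    { return (dirty[b / 8] >> (b % 8)) & 1; }
static void clear_dirty(unsigned b) { dirty[b / 8] &= (uint8_t)~(1u << (b % 8)); }

/* When the stand-by node reappears, copy only the dirty blocks and
 * clear the log.  Returns the number of blocks resynced. */
static unsigned resync(void (*copy_block)(unsigned))
{
    unsigned b, n = 0;
    for (b = 0; b < NBLOCKS; b++)
        if (is_dirty(b)) {
            copy_block(b);
            clear_dirty(b);
            n++;
        }
    return n;
}
```

Rewriting the same block twice costs only one log entry, which is why a resync after a short disconnection is much cheaper than a full mirror rebuild.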

***

The more I write, the more I think that our fundamental problem is
that we should *first* say which failures we want to mask, and *then*
discuss the solutions.

-Philipp

>This obviously has some potential performance (latency) issues. I'm not sure
>they're worse than the "normal" network RAID case, since (I think?) the writer
>has to wait for both writes anyway. You could always save the old block from
>the disk, and then put it back if the remote write gets the partitioned cluster
>error. It is essential that you not return success to the user before the
>remote responds or times out.
>
>If the local machine has declared the remote machine down, then the process is
>in some sense simpler, since this means that the "owning" machine simply has to
>resync the mirror on the remote machine. No potential of scrogging data state
>here...
>
>The worst case is probably where each machine thinks the other one is down.
>
>There are several cases I haven't considered here, and there are also questions
>about how to involve alternative (i.e., serial) communication media in this, so
>that you can handle some cases like where various kinds of network failures
>occur.
>
>It's a little late for me to make this bulletproof before going to bed, but I
>thought I'd throw it out for you to tear up and improve.

--
Want to try something new? Are you a Linux hacker?
Volunteer in testing mergemem!
(Get it from http://das.ist.org/mergemem)
-----
Philipp Reisner PGP: http://der.ist.org/~kde/pgp.asc
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
Hi,

On Mon, 06 Dec 1999 23:38:53 -0700, Alan Robertson <alanr@bell-labs.com>
said:

> If the remote machine has falsely declared the local machine "down",
> then it will make the remote disk "busy" (effectively reserved), and
> the local machine will then get an error when it tries to write to the
> remote disk. It would then treat this error as a special case, and
> refuse to write to the local disk as well - propagating this error
> back to the caller.

Umm, how on earth do you do failover when the remote machine dies? In
cases where the remote machine has declared the local machine down, the
local machine almost certainly no longer has network connectivity to the
remote one. How, then, do you know whether to fail over to the local
node or just propagate the error upwards?

> The worst case is probably where each machine thinks the other one is
> down.

Exactly, this is the whole point: the cluster partition is the hard
bit. Pretty much everything else is simple in comparison: the rest is
just implementation detail.

--Stephen
Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
Alan Robertson wrote:

> As in all mirroring schemes, each write has to go both to the local disk and the
> remote disk. In this scheme, the RAID code would then wait for the remote disk
> write to complete before attempting the local disk write.
>
> If the remote machine has falsely declared the local machine "down", then it
> will make the remote disk "busy" (effectively reserved), and the local machine
> will then get an error when it tries to write to the remote disk. It would then
> treat this error as a special case, and refuse to write to the local disk as
> well - propagating this error back to the caller.

I think the local machine shouldn't be affected by a failure in the remote node
(because it is an HA cluster); the local machine should write the data regardless
of whether some other node fails, and this should be done in parallel if
possible. (I believe the philosophy should be to contain every failure without
propagating it, and only advertise it to the other nodes.)

>
> This obviously has some potential performance (latency) issues. I'm not sure
> they're worse than the "normal" network RAID case, since (I think?) the writer
> has to wait for both writes anyway. You could always save the old block from
> the disk, and then put it back if the remote write gets the partitioned cluster
> error. It is essential that you not return success to the user before the
> remote responds or times out.
>
Is it really essential? I think if the master node is still running and OK, then
it shouldn't report any error to the user. Otherwise, you are propagating
failures, and this may collapse the cluster :-(

> If the local machine has declared the remote machine down, then the process is
> in some sense simpler, since this means that the "owning" machine simply has to
> resync the mirror on the remote machine. No potential of scrogging data state
> here...
> The worst case is probably where each machine thinks the other one is down.

Yes, this may be chaotic, but it only occurs if both machines are really
isolated, and the probability is very low because we use redundant links (I
think). Is this correct?

I think Philipp's considerations are a proper way to do this.

David Martinez
--------------------------------------------------------
CEINTEC i+d i+d@ceintec.drago.net
Investigation & Development division
48011 Bilbao SPAIN (34) 94 441 44 97
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
"Stephen C. Tweedie" wrote:

> Exactly, this is the whole point: the cluster partition is the hard
> bit. Pretty much everything else is simple in comparison: the rest is
> just implementation detail in relation.

There are two types of installations we should ultimately address:
Where you can't tolerate a cluster partition

Where you have to be prepared to heal after a partition

I'm discussing the first case here -- and primarily the 2-node version...

As you have no doubt noticed, I'm thrashing around here trying to find a way to
avoid a special lock device (like a disk, or some other piece of hardware). As
you will notice below, I haven't entirely succeeded :-)

With network RAID, there is no obvious lock device available. You would have to
invent one. Then you lose some of the simplicity and reliability advantages
that this "shared nothing" scheme has.

As has been discussed, every scheme has holes associated with it, and NOTHING is
absolutely fail-proof.

With a shared SCSI disk scheme, you can use the SCSI disk as the lock device.
Of course, then the SCSI bus itself becomes an SPOF - if you use it for disk
data, but not for the lock device (as was pointed out earlier). You also get
expense and constraints (distance, etc) as part of the bargain. If you mirror
the disk data over the LAN, then the disk isn't an SPOF, but you may have
electrical issues replacing the disk without taking down both systems. You can
use some other kind of locking device if you wish. Simple, single-purpose
electrical circuitry could also accomplish this even more reliably - without the
SCSI electrical issues. Clearly this device need not cost more than about $20
to manufacture even in small quantities. This means you could buy it for ~ $200
:-)

I suppose what you really want is a lock device where you get an interrupt if
someone steals your lock. You can then put in a block pseudo-device which will
refuse to write unless you own the lock. I guess you don't really even need an
interrupt (just read the register each time).

Does someone know where to buy this device?

If not, maybe I can talk to some of the hardware designers here about it...

On the other hand:
With a true shared-nothing approach, you MUST have ultra-reliable
communications.
- You must be willing to reboot any time you have failed to send out heartbeats,
a little more ruthlessly than you would react to not receiving them from
another machine.
- You must lock the pages of the heartbeat code in memory
- You must set the priority "sufficiently high"
- Send a "DIE!" packet to machines marked dead (to be executed by their
heartbeat code)
- Have redundant, reliable heartbeat media (2 serial, 1 ethernet)
- Note and diagnose heartbeat media failures
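The "lock the pages" and "set the priority" bullets map directly onto two Linux calls. A sketch, assuming a userspace heartbeat daemon; both calls normally need root, so this version reports failure rather than aborting:

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Pin the daemon's pages and move it to a real-time scheduling class,
 * so that paging or CPU starvation cannot delay heartbeats.
 * Returns 0 if both calls succeeded, -1 otherwise. */
int harden_heartbeat(void)
{
    int ok = 0;
    struct sched_param sp;

    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        fprintf(stderr, "mlockall: %s\n", strerror(errno));
        ok = -1;
    }

    memset(&sp, 0, sizeof sp);
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        ok = -1;
    }
    return ok;
}
```

A real daemon would treat a failure of either call as fatal at startup, since a heartbeat sender that can be paged out defeats the whole scheme.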

I believe that together, these will give you an extremely high degree of
certainty that you cannot get your cluster split.

However, some applications and some people demand more paranoia than this.
If you find yourself in that situation, you can always add X-10 (or similar)
remote reset control. Using the current code, one could create a resource script
which would reset the machine which used to own the resource group (if it is now
"dead"). Including this resource in a resource group would declare that group
to be effectively a high-integrity resource group. Is a system reset
sufficient, or do you really need to power off the other machine (assuming
you're a "shared nothing" configuration)?

Does someone sell kits that do this for ISA or PCI cards, or should we use X10?

I believe the 2-node case is the single most important case. It's important to
get this right because there will be more 2-node systems than any other type, if
we can get it right. The highly paranoid case doesn't have to be ultra cheap,
but the "more reasonably paranoid" case should be cheap to buy and easy to
install.


Comments?

-- Alan Robertson
alanr@bell-labs.com
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
Alan Robertson wrote:
> However, some applications and some people demand more paranoia than this.
> If you find yourself in that situation, you can always add X-10 (or similar)
> remote reset control. Using the current code, one could create a resource script
> which would reset the machine which used to own the resource group (if it is now
> "dead"). Including this resource in a resource group would declare that group
> to be effectively a high-integrity resource group. Is a system reset
> sufficient, or do you really need to power off the other machine (assuming
> you're a "shared nothing" configuration)?
>
> Does someone sell kits that do this for ISA or PCI cards, or should we use X10?

Funny story--I was doing an IRIX 6.5 upgrade on an SGI Challenge DM and
I noticed those things actually have jacks, like a mono headphone jack,
which output the signals for the various interrupts. I don't know if you
can also use those jacks to trigger an interrupt externally, but it was
interesting.

I'm not sure, but I think I have a few leads for you. What exactly is
it that you want to lock again? What is a ballpark time delay for
action to take place, or what is the frequency or number of switches
required within a specified unit of time? There are lots of devices for
triggering some sort of "output" but I think the harder part will be to
find a corresponding receiver for "input." Also, I imagine it would be
preferable to find something that could work more or less "out of the
box," where most of the bits and pieces necessary are already in existence,
rather than having to write some monster device driver(s).

Black Box http://www.blackbox.com has loads of automated
switching/signalling/gang-switching type equipment with varying response
times, for example a fairly slow response time would be having a modem
dial out to a switch which accepts a touch-tone code to implement some
action vs. a moderately fast response would be a switching device which
connects to a serial port. I've used a few of their wire SCSI repeaters
to locate tape drives where they are more accessible for general use vs.
having to intrude into someone's cubicle. They also have fiber SCSI
extenders, which might be of interest for connecting servers located in
different buildings or on different floors of a building.

Berkshire Products http://www.berkprod.com/ makes several devices
including ISA and serial which may be useful and used to be fairly
inexpensive. Most importantly, they have a couple devices which also
accept some sort of input. I have 3 of their PC Watchdog with
temperature probe in use at my site--the one in my workstation has come
in handy on a few occasions when I was mucking around with experimental
kernels and locked up the system at night when I was out of the
office--the PCWDs have 2 relays connected to external screw-terminal
contacts. The PCWD can be set to monitor whatever IRQ or I/O
address(es) you choose, which might be good for something.

ICS Advent http://www.icsadvent.com/ (formerly Industrial Computer
Source) makes devices similar to those offered by Berkshire but I've had
0 luck finding a supplier where I could actually order just 1 or 2--I
think they would not accept a University of Maryland PO, or they didn't
want to deal with us unless we wanted to buy in quantity, or something
like that, but it was awhile ago.

As for X10, I use X10 extensively in my house, and I can foresee a few
potential problems. If I really wanted to do High Availability "right,"
I would want both nodes on their own UPS. If you're using X10 through
the wire, as I understand it you're using the 60 Hz alternating current
as a carrier, with the X10 signal riding on it between the nodes. If the
UPSes are "switched," the X10 signals might make it through both as long as
there's line power to provide a carrier, and as long as there isn't too
much "power conditioning" going on, but I can't see how the X10 signal
could survive the AC-DC-AC conversion in a UPS with a full-time
rectifier. I suppose each server could be plugged in to its own UPS,
but have the X10 computer interfaces on one UPS, or another circuit
altogether, where they would be able to "talk" to each other. There's a
new "Firecracker" X10 computer interface which can send out RF signals over
the air, but it can't receive; besides which, I hear that it sucks in terms
of range and reliability, and there might not be any linux drivers yet.

If the servers are too far apart electrically, in terms of the path back
to the common circuit panel or transformer, I'm not sure what method
could be used to transmit the X10 signal, for example if the 2 servers
were in different buildings. You could have a modem dial up an X10
telephone responder, but obviously that's not going to give you very
fast switching.

> I believe the 2-node case is the single most important case. It's important to
> get this right because there will be more 2-node systems than any other type, if
> we can get it right. The highly paranoid case doesn't have to be ultra cheap,
> but the "more reasonably paranoid" case should be cheap to buy and easy to
> install.

--
"Jonathan F. Dill" (jfdill@jfdill.suite.net)
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
On Tue, 7 Dec 1999, Alan Robertson wrote:
> "Stephen C. Tweedie" wrote:
>
> > Exactly, this is the whole point: the cluster partition is the hard
> > bit. Pretty much everything else is simple in comparison: the rest is
> > just implementation detail in relation.
>
> There are two types of installations we should ultimately address:
> Where you can't tolerate a cluster paritition
>
> Where you have to be prepared to heal after a partition
>
> I'm discussing the first case here -- and primarily the 2-node version...
>
> As you have no doubt noticed, I'm thrashing around here trying to find a way to
> avoid a special lock device (like a disk, or some other piece of hardware). As
> you will notice below, I haven't entirely succeeded :-)

I think there's another way to approach the problem. You could also make one
of the systems an automatic winner in the case of a cluster partition by
giving it more "votes".

E.g. as I recall, DEC VAXclusters used to give a disk a quorum vote on
_one_ of the systems in a 2-node cluster... at least I think that's how it
worked. If that were the case, then one system would still have two members
to achieve quorum after cluster partitioning.

If we're willing to stipulate that partitioning shall only be dealt with in
and of itself (i.e. not at the same time as some other availability fault),
it seems to make sense to me. Feel free to tell me that I'm clueless and
why.

Even if we're not willing to stipulate that, it gives you a one-in-two
chance of availability in the worst case scenario... :-) :-)
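The votes idea reduces to simple arithmetic. A sketch (illustrative numbers only): with 3 total votes and node A holding 2, a partition leaves A quorate and the other node not; the flip side is that if A itself dies, the survivor can never reach quorum alone:

```c
/* Strict-majority quorum test: a partition member may operate only if
 * it holds more than half of the configured votes. */
static int has_quorum(int votes_held, int total_votes)
{
    return 2 * votes_held > total_votes;
}
```

This makes the asymmetry of the scheme explicit: the extra vote breaks the tie in a partition, at the cost of making one node's survival mandatory.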


> Does someone sell kits that do this for ISA or PCI cards, or should we use X10?

I don't think X-10 is really reliable enough...

-Andy
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
Hi,

On Wed, 08 Dec 1999 02:35:27 -0500, "Jonathan F. Dill"
<jfdill@jfdill.suite.net> said:

> Alan Robertson wrote:
>> However, some applications and some people demand more paranoia than
>> this. If you find yourself in that situation, you can always add
>> X-10 (or similar) remote reset control. Using the current code, one
>> could create a resource script which would reset the machine which
>> used to own the resource group (if it is now "dead").

>> Does someone sell kits that do this for ISA or PCI cards, or should
>> we use X10?

> I'm not sure, but I think I have a few leads for you. What exactly is
> it that you want to lock again? What is a ballpark time delay for
> action to take place, or what is the frequency or number of switches
> required within a specified unit of time?

The clustering software can adapt to the hardware timeouts: as long as
you know that a disconnected node will reset itself within X seconds,
you can delay completion of the cluster transition for that time.

PC hardware watchdog cards could achieve a lot of this.

--Stephen
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
Hi,

On Wed, 8 Dec 1999 03:12:21 -0500 (EST), Andy Poling
<andy@globalauctions.com> said:

> I think there's another way to approach the problem. You could also
> make one of the systems an automatic winner in the case of a cluster
> partition by giving it more "votes".

Won't work: the automatic winner will carry on in case of a cluster
partition, but if that machine dies, the other won't be able to take
over because it doesn't have enough votes. A workable solution *has* to
be symmetric, unfortunately.

> E.g. as I recall, DEC VAXclusters used to give a disk a quorum vote
> on _one_ of the systems in a 2-node cluster...

No. The quorum vote was associated with a *shared* disk, and that vote
could be stolen by either node on completion of a SCSI reservation.
It's using shared hardware to act as a tie-breaker.

--Stephen
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
Andy Poling wrote:
>
> On Tue, 7 Dec 1999, Alan Robertson wrote:
> > "Stephen C. Tweedie" wrote:
> >
> > > Exactly, this is the whole point: the cluster partition is the hard
> > > bit. Pretty much everything else is simple in comparison: the rest is
> > > just implementation detail in relation.
> >
> > There are two types of installations we should ultimately address:
> > Where you can't tolerate a cluster paritition
> >
> > Where you have to be prepared to heal after a partition
> >
> > I'm discussing the first case here -- and primarily the 2-node version...
> >
> > As you have no doubt noticed, I'm thrashing around here trying to find a way to
> > avoid a special lock device (like a disk, or some other piece of hardware). As
> > you will notice below, I haven't entirely succeeded :-)
>
> I think there's another way to approach the problem. You could also make one
> of the systems an automatic winner in the case of a cluster partition by
> giving it more "votes".
>
> E.g. as I recall, DEC VAXclusters used to give a disk a quorum vote on
> _one_ of the systems in a 2-node cluster... at least I think that's how it
> worked. If that were the case, then one system would still have two members
> to achieve quorum after cluster partitioning.
> If we're willing to stipulate that partitioning shall only be dealt with in
> and of itself (i.e. not at the same time as some other availability fault),
> it seems to make sense to me. Feel free to tell me that I'm clueless and
> why.

It is almost certain that partitioning was caused by a failure of some kind.
With multiple reliable heartbeat media, partitioning is a very unlikely event.
Nevertheless, you can expect it to happen somewhere -- and it will be caused by
a very low-probability set of events -- perhaps a multiple failure scenario.

If I understood your scenario correctly, if that one system ever went down, the
other system would never take over. You basically have no high-availability.
If you used the disk for reserve/release, then that's the case I originally
described, but I think you have electrical problems with replacing the disk when
it fails, which require both machines to be taken down. If not, then I don't
understand...

> Even if we're not willing to stipulate that, it gives you a one-in-two
> chance of availability in the worst case scenario... :-) :-)
>
> > Does someone sell kits that do this for ISA or PCI cards, or should we use X10?
>
> I don't think X-10 is really reliable enough...

Probably not, but you *are* talking about an extraordinarily unusual event in
the first place. From my experiences with X-10, I'd say it's about 99%
reliable. It has the huge advantage of being cheap and readily available.

Again, it depends on your level of paranoia (which has been postulated as at
least above average for this case).


-- Alan Robertson
alanr@bell-labs.com
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Wed, 08 Dec 1999 02:35:27 -0500, "Jonathan F. Dill"
> <jfdill@jfdill.suite.net> said:
>
> > Alan Robertson wrote:
> >> However, some applications and some people demand more paranoia than
> >> this. If you find yourself in that situation, you can always add
> >> X-10 (or similar) remote reset control. Using the current code, one
> >> could create a resource script which would reset the machine which
> >> used to own the resource group (if it is now "dead").
>
> >> Does someone sell kits that do this for ISA or PCI cards, or should
> >> we use X10?
>
> > I'm not sure, but I think I have a few leads for you. What exactly is
> > it that you want to lock again? What is a ballpark time delay for
> > action to take place, or what is the frequency or number of switches
> > required within a specified unit of time?
>
> The clustering software can adapt to the hardware timeouts: as long as
> you know that a disconnected node will reset itself within X seconds,
> you can delay completion of the cluster transition for that time.
>
> PC hardware watchdog cards could achieve a lot of this.

Which I agree with (since I also said something similar in my original mail ;-)
"You must be willing to reboot any time you have failed to send out heartbeats,
a little more ruthlessly than you would react to not receiving them from
another machine"

However, having said that, heard Stephen say it, and having agreed with both of
us :-), it is worth remembering that for this to work "correctly", you need to
have the conditions I originally described. And even then, the most extremely
paranoid among you will wonder if that's really enough...

If you are among the Most Extremely Paranoid, you can still use a disk with
reserve/release, and say that some scheduled downtime will be needed to replace
or repair the other computer and/or the disk. Surely if you're extremely
paranoid, you're not going to replace things "hot" even if you hardly ever use
the bus... Maybe you could unload the device driver. Is that enough? I
suppose you could use PCMCIA SCSI and pop the device out. *That* should satisfy
even the Most Extremely Paranoid and Fastidious among us (if it's well
tested...).


-- Alan Robertson
alanr@bell-labs.com
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Wed, 08 Dec 1999 02:35:27 -0500, "Jonathan F. Dill"
> <jfdill@jfdill.suite.net> said:
>
> > Alan Robertson wrote:
> >> However, some applications and some people demand more paranoia than
> >> this. If you find yourself in that situation, you can always add
> >> X-10 (or similar) remote reset control. Using the current code, one
> >> could create a resource script which would reset the machine which
> >> used to own the resource group (if it is now "dead").
>
> >> Does someone sell kits that do this for ISA or PCI cards, or should
> >> we use X10?
>
> > I'm not sure, but I think I have a few leads for you. What exactly is
> > it that you want to lock again? What is a ballpark time delay for
> > action to take place, or what is the frequency or number of switches
> > required within a specified unit of time?
>
> The clustering software can adapt to the hardware timeouts: as long as
> you know that a disconnected node will reset itself within X seconds,
> you can delay completion of the cluster transition for that time.
>
> PC hardware watchdog cards could achieve a lot of this.

In that case, I think the Berkshire Products PCWD may do what you want.

Just to describe briefly, in case not everybody knows how one of these
things works...The watchdog (wd) has a "countdown timer" which gets
reset each time the wd detects activity on some I/O address. The timer
interval and I/O port can be configured via dip switches or jumpers on
the wd. When the timer runs out, the relays on the wd are toggled--In
the case of PCWD, there are 2 relays, one which is switched momentarily,
and the other which is "latched on" until the system is powered down.
There are internal and external contacts for connecting to both the NO
and NC contacts of both relays.

The normal operation under linux is to connect the internal NO contacts
of the relay which gets toggled momentarily to the motherboard reset
connector so that if the wd timer runs out, the equivalent of pressing
the reset button will occur. A unique I/O address is used for the
PCWD, and the kernel pcwd driver is set to trigger that I/O address
every so many seconds, a sort of "heartbeat," if you will--hopefully,
you were smart enough to make sure that the pcwd driver sends out the signal
frequently enough that the timer does not expire during normal
operation.
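The countdown-timer behaviour described above can be modelled in a few lines (a software simulation of the mechanism, not the pcwd driver itself):

```c
/* Software model of a watchdog card: each tick decrements a counter,
 * a "pat" (activity on the watched I/O address) reloads it, and
 * reaching zero toggles the reset relay. */
struct watchdog {
    int interval;    /* reload value, in ticks */
    int remaining;
    int fired;       /* 1 once the reset relay has toggled */
};

static void wd_init(struct watchdog *wd, int interval)
{
    wd->interval = interval;
    wd->remaining = interval;
    wd->fired = 0;
}

static void wd_pat(struct watchdog *wd)
{
    wd->remaining = wd->interval;
}

static void wd_tick(struct watchdog *wd)
{
    if (!wd->fired && --wd->remaining <= 0)
        wd->fired = 1;    /* timer ran out: "press the reset button" */
}
```

The tuning problem discussed below (SCSI timeouts, long fscks) is exactly the choice of `interval` relative to the worst-case gap between pats.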

I usually use a delay of 5 minutes, 10 minutes, or even 20 minutes
because I have had a few problems with pcwd. First, if you have SCSI
timeouts, there may be many seconds between pcwd signals getting out, so
if you have the timer set to less than a couple minutes, a trigger is
likely to occur. Second, the reset does not trigger a clean shutdown,
so if you're using non-journaled filesystems like ext2, the boot up may
get as far as "An unexpected inconsistency has occurred" during the fsck
and you may have to enter the root passwd at the console to run fsck
"manually" and reboot before the system will come up. Also, if you have
several very large filesystems, or slow disks, you may have to disable
the pcwd during boot up or else the timer may run out while you're doing
the fsck, and enable the pcwd after fsck has finished. Ideally, the
pcwd driver should start sending signals at the very beginning of the
bootup process, but I'm not sure how you would do that.

In one application, I had several large disks with large filesystems, so
I set those disks to "noauto" in /etc/fstab and disabled their boot-time
fsck, to let the system come up completely before checking those
disks. I had a script
that ran after the system booted up to run fsck on the large
filesystems, and then mount the disks. However, this workaround would
not handle the case where the root filesystem has to be fsck'd
"manually." If you use ext3 or another journaled fs, these precautions
should not be needed.

For HA, I suspect a shorter time interval of less than 1 minute would be
desirable. For the I/O address to monitor, you might think that an I/O
address for the SCSI controller or a NIC would be a good idea, but what
happens when the system is "idle?" You want some I/O address that is
definitely going to get triggered during normal operation, but there are
also cases where you definitely want the card to trigger a reset even
though I/O on some other channel might continue to work.

I suppose a good approach would be to trigger on the unique I/O address
as per the "normal" use of the pcwd under linux, and have additional
mechanisms to externally trigger the I/O address when certain things are
working correctly, and mechanisms to externally stop triggering the I/O
address when a condition occurs that you definitely want to reboot.

It would also be nice if you could first try to trigger a "soft" reboot
with a clean shutdown before you try a cold reboot i.e. the equivalent
of hitting Ctrl-Alt-Del or issuing the reboot command.

--
"Jonathan F. Dill" (jfdill@jfdill.suite.net)
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
"Jonathan F. Dill" wrote:
> I suppose a good approach would be to trigger on the unique I/O address
> as per the "normal" use of the pcwd under linux, and have additional
> mechanisms to externally trigger the I/O address when certain things are
> working correctly, and mechanisms to externally stop triggering the I/O
> address when a condition occurs that you definitely want to reboot.

OK I think I have a partial answer to this. Use the unique I/O for the
wd card as I mentioned before.
The watchdog can be triggered by writing anything to /dev/watchdog (char
major 10 minor 130). Rather than having a watchdog daemon send to
/dev/watchdog, for an HA application I think it makes sense to have the
HA control software send to the device. Then when certain conditions
occur, the HA software can stop sending the signal to the wd to cause a
reset.
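
That idea can be sketched as follows: the HA software writes to the
watchdog device only while all its sanity checks pass, and simply stops
writing otherwise. This is a minimal sketch; the device path is
parameterizable for illustration, and a real loop would repeat this more
often than the hardware timer interval:

```python
import os

def tickle_watchdog(sanity_checks, device="/dev/watchdog"):
    """Write one byte to the watchdog device, but only if every
    sanity check passes.  If any check fails, we stop tickling and
    let the hardware timer run out, forcing a reset."""
    if all(check() for check in sanity_checks):
        fd = os.open(device, os.O_WRONLY)
        try:
            os.write(fd, b".")   # any write resets the countdown
        finally:
            os.close(fd)
        return True
    return False
```

In a real deployment this would be called in a loop with a sleep well
shorter than the watchdog's timeout.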

Thinking more about this, I think it would be useful to have some sort
of "disk heartbeat" as well. If you're super paranoid, I think the
thing to do would be to tune the fs buffers to flush more frequently,
though this approach will likely degrade fs I/O performance--it may
also help decrease the extent and likelihood of fs corruption if you're
using any non-journaled filesystems. Then the HA software could keep an
eye on disk I/O via any of several entries in /proc, e.g.
/proc/interrupts or /proc/scsi/scsi.

A less paranoid approach would be to force a teeny tiny bit of disk I/O
when the disk is otherwise "idle" to maintain a "disk heartbeat" and let
the fs buffers flush normally. Simple file I/O won't do it because that
is likely to be buffered. One method would be to read a garbage char or
two from a raw partition, or possibly you could trigger a disk seek, I
suppose preferably to the middle of the platter so as not to diminish
subsequent real disk I/O. Alternatively, I suppose you could
artificially trigger an interrupt, although I think it makes sense to
use some type of real disk I/O to test if the disk is really working.
For multiple disks, it might be a good idea to have independent
heartbeats for each disk.
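
A disk heartbeat of the "read a garbage char or two" kind might look
like the sketch below. The device path is an assumption for
illustration; on a real block device you would want to open with
os.O_DIRECT (and a properly aligned buffer) so the read cannot be
satisfied from the page cache:

```python
import os

SECTOR = 512

def disk_heartbeat(device, offset_sectors=0):
    """Force a tiny read from a device to confirm it still answers.
    Plain O_RDONLY is used here for simplicity; a real probe would
    use O_DIRECT to guarantee the read actually hits the disk."""
    fd = os.open(device, os.O_RDONLY)
    try:
        os.lseek(fd, offset_sectors * SECTOR, os.SEEK_SET)
        data = os.read(fd, 2)   # a garbage char or two
        return len(data) > 0
    finally:
        os.close(fd)
```

For multiple disks, one would run an independent instance of this probe
per device, as suggested above.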

> It would also be nice if you could first try to trigger a "soft" reboot
> with a clean shutdown before you try a cold reboot i.e. the equivalent
> of hitting Ctrl-Alt-Del or issuing the reboot command.

Is there any way to do something with ATX and BIOS settings eg. in the
APM/ACPI configuration for "Power Button pressed for less than 4
seconds" to trigger a soft reboot rather than hard? Then you could hook
up the wd to the motherboard soft power switch connector rather than the
reset.

--
"Jonathan F. Dill" (jfdill@jfdill.suite.net)
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
"Jonathan F. Dill" wrote:
>
> "Jonathan F. Dill" wrote:
> > I suppose a good approach would be to trigger on the unique I/O address
> > as per the "normal" use of the pcwd under linux, and have additional
> > mechanisms to externally trigger the I/O address when certain things are
> > working correctly, and mechanisms to externally stop triggering the I/O
> > address when a condition occurs that you definitely want to reboot.
>
> OK I think I have a partial answer to this. Use the unique I/O for the
> wd card as I mentioned before.
> The watchdog can be triggered by writing anything to /dev/watchdog (char
> major 10 minor 130). Rather than having a watchdog daemon send to
> /dev/watchdog, for an HA application I think it makes sense to have the
> HA control software send to the device. Then when certain conditions
> occur, the HA software can stop sending the signal to the wd to cause a
> reset.

One of the ways you can achieve a partitioned cluster is if the heartbeat
software stops getting properly scheduled for a while. In "heartbeat", you
can specify a watchdog device for it to tickle whenever it hears its own
heartbeat.

What you normally want to do is have some kind of sanity criterion, like
getting scheduled regularly or having the heartbeat software work, that you
have to satisfy before tickling such a device. Hearing your own heartbeat
is a natural criterion to satisfy for such a scheme.

Right now, I tickle /dev/watchdog when I send out a heartbeat, but I should
probably change it to tickle it when I *hear* my own heartbeat. This is a
more complex criterion, but one that would be naturally satisfied by the
software when it is working correctly.

One could also imagine satisfying other types of criteria before deciding to
tickle the watchdog device. One example might be to be able to fork and exec
a process which opened a few files and returned a success or failure result.
When you could no longer do this successfully, the machine would reboot on its
own. This process could do whatever you want it to in order to ensure "basic
sanity" according to some criteria. All the things you talk about doing below
could be part of it.

This is basically the reason why I put basic /dev/watchdog support in about
a year ago... It doesn't have the "watchdog exec" support described above at
this point. That would require a little thought to do right, but wouldn't
be too bad.
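
A "watchdog exec" check of the kind described might be sketched like
this (a hypothetical illustration, not the actual heartbeat code): fork
and exec an external sanity program, and treat a hang or a nonzero exit
as failure, so the watchdog stops being tickled:

```python
import subprocess

def sanity_exec(command, timeout=10):
    """Run an external sanity-check program (e.g. one that opens a
    few files) and report success or failure.  The heartbeat code
    would tickle the watchdog only while this keeps succeeding."""
    try:
        result = subprocess.run(command, timeout=timeout)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False   # a hung or unrunnable check counts as failure
```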

> Thinking more about this, I think it would be useful to have some sort
> of "disk heartbeat" as well. If you're super paranoid, I think the
> thing to do would be to tune the fs buffers to flush more frequently
> though this approach will likely degrade fs I/O performance--This
> approach may also help if you're using any non-journaled filesystems to
> decrease the extent and likelihood of fs corruption. Then the HA
> software could keep an eye on disk I/O via any of several entries in
> /proc eg. /proc/interrupts or /proc/scsi/scsi.
>
> A less paranoid approach would be to force a teeny tiny bit of disk I/O
> when the disk is otherwise "idle" to maintain a "disk heartbeat" and let
> the fs buffers flush normally. Simple file I/O won't do it because that
> is likely to be buffered. One method would be to read a garbage char or
> two from a raw partition, or possibly you could trigger a disk seek, I
> suppose preferably to the middle of the platter so as not to diminish
> subsequent real disk I/O. Alternatively, I suppose you could
> artificially trigger an interrupt, although I think it makes sense to
> use some type of real disk I/O to test if the disk is really working.
> For multiple disks, it might be a good idea to have independent
> heartbeats for each disk.
>
> > It would also be nice if you could first try to trigger a "soft" reboot
> > with a clean shutdown before you try a cold reboot i.e. the equivalent
> > of hitting Ctrl-Alt-Del or issuing the reboot command.
>
> Is there any way to do something with ATX and BIOS settings eg. in the
> APM/ACPI configuration for "Power Button pressed for less than 4
> seconds" to trigger a soft reboot rather than hard? Then you could hook
> up the wd to the motherboard soft power switch connector rather than the
> reset.

With a more sophisticated timer system, you could have layers of timers, where
failing to tickle a short timer resulted in a soft reboot, and failing to
tickle a longer timer resulted in a hard reboot.
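
The layered-timer idea can be sketched as a simple two-level "sanity
timer" (a minimal simulation of the policy, not a driver):

```python
import time

class LayeredWatchdog:
    """Two-level sanity timer: missing the short deadline requests a
    soft (clean) reboot; missing the long one forces a hard reset."""

    def __init__(self, soft_after, hard_after):
        self.soft_after = soft_after
        self.hard_after = hard_after
        self.last_tickle = time.monotonic()

    def tickle(self):
        self.last_tickle = time.monotonic()

    def action(self):
        # Decide what the watchdog should do based on how long we
        # have gone without a tickle.
        silence = time.monotonic() - self.last_tickle
        if silence >= self.hard_after:
            return "hard-reset"
        if silence >= self.soft_after:
            return "soft-reboot"
        return "ok"
```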

Hierarchical watchdog support has been standard fare in telecommunications
systems for longer than I've been in the business (more than 20 years). As
you might guess, they're called "sanity timers" at Lucent.

-- Alan Robertson
alanr@bell-labs.com
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
An example of where it can matter.

Again using the WAN cluster, you can have two sites, each with a complete
database. If you end up partitioned, the databases will start to diverge;
at some point you need to have one site shut down to prevent you from
returning the wrong answer.

David Lang

On Wed, 8 Dec 1999, David Brower wrote:

>
> Regarding partition, it is interesting to consider the consequences
> of split brain. If there are truly no shared resources, does it
> matter if the partitions live? It is conflict over shared resources that
> causes problems. The ones first in mind are shared disk, and shared IP
> address. If there is shared disk, then it ought to be possible to use
> the persistence in that disk to resolve the partition. If there is only IP
> address ownership, then things get murky, and operator intervention may be
> required. What are some other shared resources that can cause problems?
>
> -dB
>
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
Alan Robertson wrote:
> One of the ways you can achieve a partitioned cluster is if the heartbeat
> software stops getting properly scheduled for a while. In "heartbeat", you
> can specify a watchdog device for it to tickle whenever it hears its own
> heartbeat.
>
> What you normally want to do is have some kind of sanity criteria like getting
> scheduled regularly or having the heartbeat software work that you have to
> satisfy before tickling such a device. Hearing your own heart beat is a
> natural criterion to satisfy for such a scheme.
>
> Right now, I tickle /dev/watchdog when I send out a heartbeat, but I should
> probably change it to tickle it when I *hear* my own heartbeat. This is a
> more complex criterion, but one that would be naturally satisfied by the
> software when it is working correctly.

This is one of the reasons why some systems have ended up putting the
heartbeat into the kernel - to prevent false mortality. There are also some
reasons to use a shared disk in the resolution. DEC, er, Compaq does both of
these things in TruClusters. Warning: DEC has patents in this area.

It is certainly easier to work in user space, but doing so may not supply the
necessary scheduling guarantees for the paranoid. This suggests a really good
design framework will admit implementations in both places. That requires
enough abstraction in the notification mechanism that related components don't
care where it really lives.

Regarding partition, it is interesting to consider the consequences
of split brain. If there are truly no shared resources, does it
matter if the partitions live? It is conflict over shared resources that
causes problems. The ones first in mind are shared disk, and shared IP
address. If there is shared disk, then it ought to be possible to use
the persistence in that disk to resolve the partition. If there is only IP
address ownership, then things get murky, and operator intervention may be
required. What are some other shared resources that can cause problems?

-dB
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
David Lang wrote:
>
> An example of where it can matter.
>
> Again using the WAN cluster, you can have two sites, each with a complete
> database, but if you end up partitioned the databases will start to
> diverge, at some point you need to have one site shut down to prevent you
> from returning the wrong answer.

Who has -the- data? If you can't answer that question, you are lost.

If the database(s) weren't shared before partition, how were they being kept
up to date while the cluster was connected? And why doesn't that mechanism
still (sort of) work after the partitions are reconnected?

Here are a few strategies, and how they play out:

1. DLM in the middle synchronizing updates and ownership. In this case, the
DLM space is the critical shared resource, like an IP address. When there is
more than one "authority" on the space, like two nodes fighting over the IP
address, chaos ensues. This might degenerate to having the root of a lock
tree reside at a virtual IP address, turning it into exactly the problem of
resolving IP address ownership.

2. Master/Slave change relationships. If in the normal case one
node is determined to be the master, it can ship coherent changes to the
slave. Decision about who is the master and who is the slave is the crux of
the partition problem, and essentially the same as the DLM problem -- you
can't have two claiming to be the master. At the time of a join, they had
probably better agree on who was the master the last time they were connected,
or replay of changes will have conflicts.

3. Truly symmetric change/replication, where both nodes were always allowed
to make changes and ship deltas to the other node. This can either be done in
real-time, with short conflict windows, or with some delay, having longer
conflict exposures. A persistently queued log of deltas on each node would
survive a partition/rejoin, and need to be resolved. But this is just the
longer case of the same vulnerability faced in the independent but on-line
shipping of changes.
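
The conflict-detection step in case 3 can be sketched as follows (a
minimal illustration, assuming each side keeps a persistently queued log
of (record, new value) deltas during the partition):

```python
def find_conflicts(log_a, log_b):
    """Each log is a list of (record_id, new_value) deltas queued on
    one side of the partition.  A record updated on both sides with
    different final values is a conflict that must be resolved, by
    policy or by an operator, before the copies can converge."""
    updates_a = {rec: val for rec, val in log_a}
    updates_b = {rec: val for rec, val in log_b}
    return sorted(rec for rec in updates_a.keys() & updates_b.keys()
                  if updates_a[rec] != updates_b[rec])

# During a partition, both sites updated record 2 differently:
site_a = [(1, "x"), (2, "paid")]
site_b = [(2, "unpaid"), (3, "y")]
assert find_conflicts(site_a, site_b) == [2]
```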

If you are going to allow the possibility of un-coordinated changes on
multiple nodes, then you must be prepared to resolve conflicts -- either short
term or long term. If you don't want conflicts, then there can be only one
authoritative copy of the data, with serialized access.

What I'm saying is this WAN cluster example is hopelessly flawed in the first
place, because it doesn't have an authority or a mechanism for resolving
ownership of the data. You are really, really open for confusion if you don't
design with a single, authoritative source of the data. Only updates to that
data are real; updates elsewhere are wishful thinking.

You always need a point of decision, either a connected shared resource, or
the brains of an operator. You can't resolve partition without one or the
other.

-dB
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
On Wed, 8 Dec 1999, David Brower wrote:

> David Lang wrote:
> >
> > An example of where it can matter.
> >
> > Again using the WAN cluster, you can have two sites, each with a complete
> > database, but if you end up partitioned the databases will start to
> > diverge, at some point you need to have one site shut down to prevent you
> > from returning the wrong answer.
>
> Who has -the- data? If you can't answer that question, you are lost.
>
> If the database(s) weren't shared before partition, how were they being kept
> up to date while the cluster was connected? And why doesn't that mechanism
> still (sort of) work after the partitions are reconnected?
>
> 3. Truly symmetric change/replication, where both nodes were always allowed
> to make changes and ship deltas to the other node. This can either be done in
> real-time, with short conflict windows, or with some delay, having longer
> conflict exposures. A persistently queued log of deltas on each node would
> survive a partition/rejoin, and need to be resolved. But this is just the
> longer case of the same vulnerability faced in the independant but on-line
> shipping of changes.

This is the case I am dealing with. There are a couple of
application-specific details (by its nature only one location will be
updating a record at a time, 99% read, and others) that make a short
update period effectively instantaneous, but a longer period (several
minutes) has other problems.


> If you are going to allow the possibility of un-coordinated changes on
> multiple nodes, then you must be prepared to resolve conflicts -- either short
> term or long term. If you don't want conflicts, then there can be only one
> authoritative copy of the data, with serialized access.
>
> What I'm saying is this WAN cluster example is hopelessly flawed in the first
> place, because it doesn't have an authority or a mechanism for resolving
> ownership of the data. You are really, really open for confusion if you don't
> design with a single, authoritative source of the data. Only updates to that
> data are real; updates elsewhere are wishful thinking.
>
> You always need a point of decision, either a connected shared resource, or
> the brains of an operator. You can't resolve partition without one or the
> other.
>
> -dB
>

In my case, I am taking the approach that by putting enough different
heartbeat paths in place, it will be (almost) impossible for the partition
to happen. I gave the details in response to a post that asked "why do you
care."

David Lang
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
David Lang wrote:
>
> An example of where it can matter.
>
> Again using the WAN cluster, you can have two sites, each with a complete
> database, but if you end up partitioned the databases will start to
> diverge, at some point you need to have one site shut down to prevent you
> from returning the wrong answer.

But if the database records that I got paid, I may be very unhappy if I don't
get paid because you threw away the copy of the database that recorded the
payment.

If you use shared disk, and you make this mistake, you may forget that I even
exist, much less that I got paid. [when you completely trash the data]

-- Alan Robertson
alanr@bell-labs.com
Re: Reserve/Release for NBD/RAID: PLEASE COMMENT! [ In reply to ]
On Wed, 8 Dec 1999, Jonathan F. Dill wrote:

> "Stephen C. Tweedie" wrote:
> >
> > Hi,
> >
> > On Wed, 08 Dec 1999 02:35:27 -0500, "Jonathan F. Dill"
> > <jfdill@jfdill.suite.net> said:
> >
> > > Alan Robertson wrote:
> > >> However, some applications and some people demand more paranoia than
> > >> this. If you find yourself in that situation, you can always add
> > >> X-10 (or similar) remote reset control. Using the current code, one
> > >> could create a resource script which would reset the machine which
> > >> used to own the resource group (if it is now "dead").
> >
> > >> Does someone sell kits that do this for ISA or PCI cards, or should
> > >> we use X10?
> >
> > > I'm not sure, but I think I have a few leads for you. What exactly is
> > > it that you want to lock again? What is a ballpark time delay for
> > > action to take place, or what is the frequency or number of switches
> > > required within a specified unit of time?
> >
> > The clustering software can adapt to the hardware timeouts: as long as
> > you know that a disconnected node will reset itself within X seconds,
> > you can delay completion of the cluster transition for that time.
> >
> > PC hardware watchdog cards could achieve a lot of this.
>
> In that case, I think the Berkshire Products PCWD may do what you want.
>
> Just to describe briefly, in case not everybody knows how one of these
> things works...The watchdog (wd) has a "countdown timer" which gets
> reset each time the wd detects activity on some I/O address. The timer
> interval and I/O port can be configured via dip switches or jumpers on
> the wd. When the timer runs out, the relays on the wd are toggled--In
> the case of PCWD, there are 2 relays, one which is switched momentarily,
> and the other which is "latched on" until the system is powered down.
> There are internal and external contacts for connecting to both the NO
> and NC contacts of both relays.
>
> The normal operation under linux is to connect the internal NO contacts
> of the relay which gets toggled momentarily to the motherboard reset
> connector so that if the wd timer runs out, the equivalent of pressing
> the reset button will occur. A unique I/O address is used for the
> PCWD, and the kernel pcwd driver is set to trigger that I/O address
> every so many seconds, a sort of "heartbeat," if you will--Hopefully,
> you were smart enough to make sure that pcwd driver sends out the signal
> frequently enough that the timer does not expire during normal
> operation.
>
> I usually use a delay of 5 minutes, 10 minutes, or even 20 minutes
> because I have had a few problems with pcwd. First, if you have SCSI
> timeouts, there may be many seconds between pcwd signals getting out, so
> if you have the timer set to less than a couple minutes, a trigger is
> likely to occur. Second, the reset does not trigger a clean shutdown,
> so if you're using non-journaled filesystems like ext2, the boot up may
> get as far as "An unexpected inconsistency has occurred" during the fsck
> and you may have to enter the root passwd at the console to run fsck
> "manually" and reboot before the system will come up. Also, if you have
> several very large filesystems, or slow disks, you may have to disable
> the pcwd during boot up or else the timer may run out while you're doing
> the fsck, and enable the pcwd after fsck has finished. Ideally, the
> pcwd driver should start sending signals at the very beginning of the
> bootup process, but I'm not sure how you would do that.
>
> In one application, I had several large disks with large filesystems, so
> I set those disks to "noauto" and not to fsck /etc/fstab to let the
> system come up completely before checking those disks. I had a script
> that ran after the system booted up to run fsck on the large
> filesystems, and then mount the disks. However, this workaround would
> not handle the case where the root filesystem has to be fsck'd
> "manually." If you use ext3 or another journaled fs, these precautions
> should not be needed.
>
> For HA, I suspect a shorter time interval of less than 1 minute would be
> desirable. For the I/O address to monitor, you might think that an I/O
> address for the SCSI controller or an NIC would be a good idea, but what
> happens when the system is "idle?" You want some I/O address that is
> definitely going to get triggered, but there are also certain problems
> when you definitely want the card to be triggered even though I/O on
> some other channel might continue to work.
>
> I suppose a good approach would be to trigger on the unique I/O address
> as per the "normal" use of the pcwd under linux, and have additional
> mechanisms to externally trigger the I/O address when certain things are
> working correctly, and mechanisms to externally stop triggering the I/O
> address when a condition occurs that you definitely want to reboot.
>

I don't think that external triggering would be so good. Expecting the
worst, the machine could get stuck while the external triggering is still
active (because of too many processes, or the like). If something really
bad then happens and some process hangs the machine, the external trigger
would keep the machine alive, and that's surely not what is intended.

> It would also be nice if you could first try to trigger a "soft" reboot
> with a clean shutdown before you try a cold reboot i.e. the equivalent
> of hitting Ctrl-Alt-Del or issuing the reboot command.
>

A soft reboot is quite a nice idea, provided the computer is still able to
process it.

> --
> "Jonathan F. Dill" (jfdill@jfdill.suite.net)
>
>
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.tummy.com
> http://lists.tummy.com/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>

Another thing that comes to mind is that a computer may have a faulty
device that works for a period of time and then produces problems; after
rebooting, it starts out behaving nicely and then goes bad again.

At the very least, it should somehow be detected that a computer was
restarted.

kind regards, +43-676-4708155
Michael Moerz Systemengineer +43-1-718-98-80
CUBiT www.cubit.at