Mailing List Archive

about quorum [ In reply to ]
thanks

On Tue, May 23, 2000 at 02:55:07PM -0700, dgould@suse.com wrote:
> I didn't look closely at that, but was just reading a very relevant paper
> which presents a quorum protocol in detail:
>
> "The Part Time Parliment" - Leslie Lamport
> http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-049.html
>
> Also a bit about distributed disks:
>
> http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-155.html
about quorum [ In reply to ]
Jerome Etienne wrote:
>
> On Wed, May 24, 2000 at 03:29:34AM -0600, Alan Robertson wrote:
> > > It may be a good idea to look more closely at these algorithms.
> > > The only one I know is described in structure.txt and
> > > doesn't seem to guarantee that only one partition gets the
> > > quorum.
> >
> > Any method that doesn't guarantee that only one partition gets quorum
> > isn't a quorum algorithm - period. There are dozens of different quorum
> > methods, and they all have to guarantee this. I can assure you that
> > Stephen's methods do in fact guarantee this.
>
> the example in my first email gives an example in which a node falsely
> believes it has quorum. Nobody has spotted a mistake in the reasoning
> I presented, so for now I assume it is correct.
>
> To convince me that the algorithm described in structure.txt works, you
> have to clearly demonstrate the mistakes I made in the example.

structure.txt says:

Each cluster also has a sequence number which is incremented on each
cluster transition, providing applications with an easy way of
polling for potential changed cluster state.

I don't think Stephen has spelled out the partition
problem to Jerome's level of dissection. There is
certainly interaction between the sequence number
(a/k/a generation) and the cluster ID that isn't
obvious.

Since no one has built a TweedieCluster yet, it is not
surprising that there are ambiguities to be discovered
in the preliminary documents.


-dB
about quorum [ In reply to ]
Hola!
about quorum [ In reply to ]
Heya,

) Like I said, I don't pretend to understand drbd at all. I don't
) see how it works if the "primary" is being served by a node
) that is in the losing partition. The quorum holding side
) would want to work with the secondary, but then you'd have
) to move the "master" role, and you might lose writes done
) on the old master before the partition was detected. Merging
) divergent mirrors is nasty stuff.

That's why IMVeryHO shared storage semantics are handled much better by
distributed filesystems, with file-level sharing. Sharing a block device
is OK for two nodes, but more than that smells like trouble to me. :)

In the case of two nodes, you have quorum when both are up, and also when
one of them is down. The problem when they lose communication and both
start writing is that you will lose the writes of one of the sides when
they reconnect, since you'll have to choose who gets to be secondary.
Imagine the following weird :) ASCII picture of a timeline:

Node 1 is writing: A_____B_____C_____D
Node 2 is writing:        \___/

At point A both nodes are up, and things work as normal with a primary and
a secondary. At point B they lose communication, and both think they are
primary, so they start writing to their disks, and update the bitmap to
resync later.

The admin notices this at point C, and has to choose which state will be
kept. Upon doing this, the admin sets one of them as secondary and
reconnects them. As both were consistent until point B, and both logged
every write after point B, either node can bring the other up to date to
its own state at C. It just takes a human to decide which side gets to
continue as primary. Point D is there only for symmetry. :)
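
To make the bitmap idea concrete, here is a minimal Python sketch (not
drbd's actual code; every name is invented) of how the per-block dirty
bitmaps kept by both sides between B and C let the surviving primary bring
the other side up to date by copying only the touched blocks:

    # Toy model of resyncing a replicated disk after a split, assuming each
    # side kept a bitmap of the blocks it wrote while disconnected (B..C).

    def resync(primary_disk, secondary_disk, primary_dirty, secondary_dirty):
        """Overwrite every block either side touched during the split with
        the primary's version, discarding the secondary's divergent writes.
        Returns the number of blocks copied."""
        blocks_to_copy = primary_dirty | secondary_dirty
        for block in blocks_to_copy:
            secondary_disk[block] = primary_disk[block]
        return len(blocks_to_copy)

    # Example: both sides start from the same state at point B ...
    primary = {0: "P0", 1: "B1", 2: "B2", 3: "B3"}
    secondary = {0: "P0", 1: "B1", 2: "B2", 3: "B3"}
    # ... then write independently until the admin reconnects them at C.
    primary[1] = "P1"       # primary wrote block 1
    secondary[2] = "S2"     # secondary wrote block 2 (this write is lost)
    copied = resync(primary, secondary, primary_dirty={1}, secondary_dirty={2})
    assert secondary == primary and copied == 2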

So you can get the disks to be consistent with one of the states, but not
both. Maybe a distributed filesystem like <flame protection>Coda</flame
protection> would help merging both states, provided there were no
conflicting changes.

I hope this enlightens things a bit.

Fábio
( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
about quorum [ In reply to ]
Correct, except for a typo at the end.

Fábio Olivé Leite wrote:
>
> Hola!
>
> From jetienne@arobas.net Thu May 25 09:54:28 2000
> ) the example in my first email gives an example in which a node falsely
> ) believes it has quorum. Nobody has spotted a mistake in the reasoning
> ) I presented, so for now I assume it is correct.
> )
> ) To convince me that the algorithm described in structure.txt works, you
> ) have to clearly demonstrate the mistakes I made in the example.
>
> OK, let's review that first example in light of the discussion that took
> place after it.
>
> On Mon, 22 May 2000, Jerome Etienne wrote:
> )
> ) [node1]<--->[many nodes]<----a link L--->[less nodes]<--->[node2]
> )
> ) At the beginning there are no partitions, so every node has quorum.
> ) Suddenly the link L fails and the cluster is split, so only the part
> ) with node1 is supposed to have quorum.
> ) But the news of the link failure doesn't reach node2 instantly.
> ) During this delay, node2 wrongly believes it has quorum. So this
> ) algorithm seems not to guarantee the quorum property, merely to give
> ) a 'good probability'.
>
> The left side notices the partition and goes through cluster
> reconfiguration. During that process, it recalculates quorum, sees it has
> quorum, and fences off all access to shared resources from the right
> side. After the fencing is done, it proceeds to use the resources, since
> the other side is now known to not be able to access it.
>
> In the time it took the left side to fence off the right side, the right
> side might well have kept on using the shared resources, and as long as
> those resources are always consistent, or can be brought to a consistent
> state after fencing by the left side, everything is fine.
>
> A shared SCSI disk with ReiserFS will always be consistent, even if it
> does not have the ultimate state. Ensuring the ultimate state goes to
> disk is an application issue, either by calling fsync() or even sync().
>
> Upon perceiving the fence (through IO errors or whatever), the left side
 ^
-------------------------------------------------------------------|
he means right side!

> might also go through cluster reconfiguration, see it has lost quorum, and
> call sit_and_cry(). :)
>
> See? As long as you have a quorum algorithm and a working fencing system,
> the shared resources can always be reliably used. The mistake you made was
> not to take into consideration the fencing mechanism.
>
> I hope this helps!
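
The sequence in the quoted explanation (recalculate quorum, fence the
losing side, only then touch the shared resources) can be summarised in a
short sketch. This is purely illustrative Python; the function names are
invented, not from structure.txt or any of the projects discussed:

    # Illustrative ordering of a cluster transition on the side that keeps
    # quorum: recount votes first, then fence the losers, and only then
    # resume touching shared resources.

    def reconfigure(my_partition, all_nodes, fence, use_shared_resources):
        if 2 * len(my_partition) <= len(all_nodes):   # no majority -> no quorum
            return "lost quorum: leave shared resources alone"
        for node in set(all_nodes) - set(my_partition):
            fence(node)                  # cut the losing side off *before* ...
        use_shared_resources()           # ... doing any work on the data
        return "running with quorum"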
about quorum [ In reply to ]
Olá!

) Correct, except for a typo at the end.
) > Upon perceiving the fence (through IO errors or whatever), the left side
) ^
) -------------------------------------------------------------------|
) he means right side!
)
) > might also go through cluster reconfiguration, see it has lost quorum, and
) > call sit_and_cry(). :)

Oops! Thanks David! :)

( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
about quorum [ In reply to ]
On Thu, May 25, 2000 at 10:10:06AM -0300, Fábio Olivé Leite wrote:
> The left side notices the partition and goes through cluster
> reconfiguration. During that process, it recalculates quorum, sees it has
> quorum, and fences off all access to shared resources from the right
> side. After the fencing is done, it proceeds to use the resources, since
> the other side is now known to not be able to access it.
>
> In the time it took the left side to fence off the right side, the right
> side might well have kept on using the shared resources, and as long as
> those resources are always consistent, or can be brought to a consistent
> state after fencing by the left side, everything is fine.

I think the whole issue is here. When the split happens, both partitions
believe they have quorum, so both may potentially modify a shared resource.
We end up with two inconsistent copies of a single resource. Let's assume
for the example that it is a disk replicated on both partitions. Suddenly
both partitions see there is a split and go through the process you
describe. As a result, one partition wins and goes on modifying the
resource. The other loses and stops modifying it.

When the split is over, how does the losing partition resync with the new
state of the shared resource?
1. the delta solution: we try to handle only the modifications. But
   replaying all the modifications made by the winning partition during
   the split isn't enough. The losing partition has to 'undo' the
   modifications made between the split and the moment it saw the split.
   This implies being able to undo, and knowing up to which point the
   undo must be made.
2. a complete resync: the whole resource is copied from the winning
   partition to the losing one.

The delta solution is faster but has important constraints. The complete
resync may be much slower (e.g. copying a whole 4GB disk) but has fewer
constraints.

> See? As long as you have a quorum algorithm and a working fencing system,
> the shared resources can always be reliably used. The mistake you made was
> not to take into consideration the fencing mechanism.
>
> I hope this helps!

It does, but you describe a scenario explaining how things should work.
You don't spot a mistake explaining why the quorum algorithm described
in structure.txt would be guaranteed and not just a 'good probability'.

This thread made me realize that the algorithm can provide a guarantee
if it is performed during a section protected by a fencing mechanism.
Nevertheless a host needs to know whether it has quorum, at least, each
time it modifies a shared resource. It would be -really slow- to stop
all the nodes with a fence each time a single node tries to modify it.

For example: each time a host writes a block on a shared replicated
disk (drbd/odr) it would have to stop the whole cluster to be sure it
still has quorum.

So my current opinion is that the quorum algorithm described in
structure.txt doesn't guarantee the result by itself. The result can be
guaranteed if used during a fence, but the cost would be so huge that it
isn't worth it.

Comments are welcome.
about quorum [ In reply to ]
The existence of quorum doesn't say anything about access
to shared resources. Those are the province of something
like a lock manager. It is the lock manager that serializes
the access to a shared device.

It is not necessary to check quorum before each access.
It is necessary to hold the lock. So, if you held the
lock before a partition, you can keep working on the
resource it represents until you release it or are fenced
off.

When quorum is redetermined, the lock manager is going to
reclaim locks held by nodes in the losing partition. Before
it can do so, it must be sure that those nodes are not going
to be doing anything with the resource.

Therefore, the side winning quorum still isn't going to
access resources until it can acquire locks on them, which
it can't do until a fence is established.
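
A rough sketch of that division of labour, with invented names (this is
not any real DLM's API): ordinary writes only check a lock the node
already holds, and quorum is only consulted indirectly, when a lock has to
be acquired.

    # Toy client showing that ordinary writes never re-check quorum; they
    # rely on a lock granted earlier by the (quorate) lock manager.

    class ResourceClient:
        def __init__(self, lock_manager, node_name):
            self.lm = lock_manager      # assumed lock manager object
            self.node = node_name
            self.held = set()           # locks this node currently holds

        def write(self, resource, data, storage):
            if resource not in self.held:               # slow path, rare:
                self.lm.acquire(self.node, resource)    # may block across a
                self.held.add(resource)                 # reconfig or fence
            storage.setdefault(resource, []).append(data)  # fast path:
                                                            # no quorum check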






Jerome Etienne wrote:
>
> On Thu, May 25, 2000 at 10:10:06AM -0300, Fábio Olivé Leite wrote:
> > The left side notices the partition and goes through cluster
> > reconfiguration. During that process, it recalculates quorum, sees it has
> > quorum, and fences off all access to shared resources from the right
> > side. After the fencing is done, it proceeds to use the resources, since
> > the other side is now known to not be able to access it.
> >
> > In the time it took the left side to fence off the right side, the right
> > side might well have kept on using the shared resources, and as long as
> > those resources are always consistent, or can be brought to a consistent
> > state after fencing by the left side, everything is fine.
>
> I think the whole issue is here. When the split happens, both partitions
> believe they have quorum, so both may potentially modify a shared resource.
  ^^^^^^^^^^^^^^^^^^^^^^^^
This is incorrect. See above.

> We end up
> with two inconsistent copies of a single resource. Let's assume for the
> example that it is a disk replicated on both partitions. Suddenly
> both partitions see there is a split and go through the process you
> describe. As a result, one partition wins and goes on modifying the
> resource. The other loses and stops modifying it.
>
> When the split is over, how does the losing partition resync with the
> new state of the shared resource?
> 1. the delta solution: we try to handle only the modifications. But
>    replaying all the modifications made by the winning partition during
>    the split isn't enough. The losing partition has to 'undo' the
>    modifications made between the split and the moment it saw the split.
>    This implies being able to undo, and knowing up to which point the
>    undo must be made.
> 2. a complete resync: the whole resource is copied from the winning
>    partition to the losing one.

This is the drbd dilemma I still don't understand.
>
> The delta solution is faster but has important constraints. The complete
> resync may be much slower (e.g. copying a whole 4GB disk) but has fewer
> constraints.
>
> > See? As long as you have a quorum algorithm and a working fencing system,
> > the shared resources can always be reliably used. The mistake you made was
> > not to take into consideration the fencing mechanism.
> >
> > I hope this helps!
>
> It does, but you describe a scenario explaining how things should work.
> You don't spot a mistake explaining why the quorum algorithm described
> in structure.txt would be guaranteed and not just a 'good probability'.

I still don't see what you are having a problem with. I thought it had
been established that within a cluster generation, there would be only
one partition capable of holding quorum. If two partitions both believe
they have quorum, they are in different generations, and the later
generation is going to kill the predecessor before proceeding. The new
quorum members will not be doing any operations on contended shared
resources until ownership is settled. The new quorum might still do
operations on resources it held before the partition, and the old
partition may do the same, until it is killed.
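
As a tiny illustration of the generation rule (a sketch, not the actual
algorithm in structure.txt):

    # Two competing quorum claims are ordered by cluster generation; the
    # newer generation wins and the older one must be killed/fenced.

    def resolve_conflict(claim_a, claim_b):
        """Each claim is (generation, partition). Return (winner, loser)."""
        return (claim_a, claim_b) if claim_a[0] > claim_b[0] else (claim_b, claim_a)

    winner, loser = resolve_conflict((7, {"node1"}), (6, {"node2"}))
    assert winner[0] == 7   # the later generation proceeds; the other is fenced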
>
> This thread made me realize that the algorithm can provide a guarantee
> if it is performed during a section protected by a fencing mechanism.
> Nevertheless a host needs to know whether it has quorum, at least, each
> time it modifies a shared resource. It would be -really slow- to stop
> all the nodes with a fence each time a single node tries to modify it.

Quorum gives the authority to issue resource locks. It isn't the lock
itself.

>
> For example: each time a host writes a block on a shared replicated
> disk (drbd/odr) it would have to stop the whole cluster to be sure it
> still has quorum.
>
> So my current opinion is that the quorum algorithm described in
> structure.txt doesn't guarantee the result by itself. The result can be
> guaranteed if used during a fence, but the cost would be so huge that it
> isn't worth it.

It's far more involved than you yet realize.
There are many layers. You haven't even gotten to the barrier
stuff yet, which is where synchronization during reconfiguration
is handled.

-dB
about quorum [ In reply to ]
Here we go again! :)

) I think the whole issue is here. When the split happens, both partitions
) believe they have quorum, so both may potentially modify a shared resource.

Before the split, they all had quorum and agreed that some machine in the
soon-to-be partition-without-quorum has a lock on it to be able to safely
modify it. No other machine is going to modify the same resource while it
keeps the lock.

Having many nodes modify a shared resource concurrently without a good
locking mechanism is dumb at best. See the GFS stuff; the hardest thing to
actually solve is the locking.

Upon losing communication, the node in the partition without quorum may
still hold the lock, but since it will be fenced off shortly, nothing bad
happens. A new lock will only be issued _after_ the partition that actually
has quorum has succeeded in fencing the other partition off.

I may not be completely correct on the locking issue, but that is the
idea, at least. You are really not going to allow random screwing with
the shared resource just because you have quorum. And having quorum does
not mean everybody in the partition hits the disks at will; it means
everybody can reliably decide _who_ has access to the resource at any
given time, and act accordingly.

) We end up with two inconsistent copies of a single resource. Let's
) assume for the example that it is a disk replicated on both partitions.
) Suddenly both partitions see there is a split and go through the
) process you describe. As a result, one partition wins and goes on
) modifying the resource. The other loses and stops modifying it.
)
) When the split is over, how does the losing partition resync with the
) new state of the shared resource?
) 1. the delta solution: we try to handle only the modifications. But
)    replaying all the modifications made by the winning partition during
)    the split isn't enough. The losing partition has to 'undo' the
)    modifications made between the split and the moment it saw the
)    split. This implies being able to undo, and knowing up to which
)    point the undo must be made.

There will be no undo. The changes will keep the disk consistent,
since we're obviously talking about a journaled fs. After fencing, the
disks will still be consistent, and the nodes without quorum will be
fenced off, with their disk IO returning errors to its issuers.

Again only after fencing, when the disk is consistent and there are no
more accesses to it, the partition with quorum will go on using it. The IO
calls on the nodes without quorum will either succeed, because the fence
has not yet been imposed, or return errors, because it has. So there's
nothing "in the queue" to be recovered.

) 2. a complete resync: the whole resource is copied from the winning
)    partition to the losing one.
)
) The delta solution is faster but has important constraints. The complete
) resync may be much slower (e.g. copying a whole 4GB disk) but has fewer
) constraints.
)
) > See? As long as you have a quorum algorithm and a working fencing system,
) > the shared resources can always be reliably used. The mistake you made was
) > not to take into consideration the fencing mechanism.
) >
) > I hope this helps!
)
) It does, but you describe a scenario explaining how things should work.
) You don't spot a mistake explaining why the quorum algorithm described
) in structure.txt would be guaranteed and not just a 'good probability'.

Well, by explaining how things work, I guess it's obvious how things don't
work. When I say that those who have quorum will fence the others off and
then use the resource, it's pretty much obvious that the others will not
access it anymore. That's how it happens, there's no mistake. What was
sent to disk will be on the disk for the others, the rest will cause
errors to be returned.

) This thread made me realize that the algorithm can provide a guarantee
) if it is performed during a section protected by a fencing mechanism.

Oh! Nice! :)

) Nevertheless a host needs to know whether it has quorum, at least, each
) time it modifies a shared resource. It would be -really slow- to stop
) all the nodes with a fence each time a single node tries to modify it.

No... each time it acquires a lock to an atomic part of a resource. You
can have multiple block groups on a disk, for example, and separate locks
on each.

Since it is a distributed system, it __HAS__ to have communication, so all
things get a little slower. We have to live with it, or buy a faster
network. We can't have a "probabilistic distributed system", that has a
good probability of resource locking, a good probability of being
consistent, a good probability of being reliable. The nodes _must_
communicate, and agree before any distributed work is done.

) For example: each time a host writes a block on a shared replicated
) disk (drbd/odr) it would have to stop the whole cluster to be sure it
) still has quorum.

Nope... it will send a message to the distributed lock manager, acquire
a lock on the resources it needs, issue as many operations on them as it
wants, and then drop the locks.

If you have fine grained filesystem level distributed locking, you can
have multiple nodes writing blocks to a shared disk and still have
consistency, since they will only send blocks to places they hold locks
on. This obviously requires some amount of communication, but then, it's a
distributed system. :)
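
A hypothetical sketch of such fine-grained locking (all names invented):
each block group on the shared disk has its own lock, so two nodes can
write to different parts of the disk concurrently without conflicting.

    # Per-block-group locks on one shared disk: a node may only write
    # blocks inside groups whose lock it holds.

    BLOCKS_PER_GROUP = 1024

    class GroupLockedDisk:
        def __init__(self):
            self.owner = {}     # block group -> node holding its lock
            self.data = {}

        def lock_group(self, group, node):
            if self.owner.get(group, node) != node:
                raise RuntimeError("group %d is locked by %s"
                                   % (group, self.owner[group]))
            self.owner[group] = node

        def write(self, node, block_no, payload):
            group = block_no // BLOCKS_PER_GROUP
            assert self.owner.get(group) == node, "write without the lock"
            self.data[block_no] = payload

    disk = GroupLockedDisk()
    disk.lock_group(0, "nodeA")
    disk.lock_group(5, "nodeB")
    disk.write("nodeA", 10, b"a")        # group 0, held by nodeA
    disk.write("nodeB", 5 * 1024, b"b")  # group 5, held by nodeB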

) So my current opinion is that the quorum algorithm described in
) structure.txt doesn't guarantee the result by itself. The result can be
) guaranteed if used during a fence, but the cost would be so huge that it
) isn't worth it.

I don't think the costs are so huge, and if you have reliable group
communication, group membership, quorum management, distributed locking
management and distributed fencing mechanism, the thing just works. It's
simply a complex system built with many simple layers.

Of course, one must understand all layers and how they interact before
picturing the whole system in action. Or at least know about the layers
and their functions, and trust them to provide those functions. :)

This discussion has been very enlightening so far, at least for me! :)
Many concepts are being talked about over and over, and I guess we are all
rethinking many things in the process.

Celebrate, for the wheels are in motion! :)
Fábio
( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
about quorum [ In reply to ]
Let's see if I get it this time :)

When I want to modify a shared resource, I check that I have the lock for
it. If not, I stop the whole cluster via a fencing mechanism and execute
the quorum algorithm. If I have quorum, I lock the resource and start
modifying it. Correct?

Note that a write lock can be owned by only one computer, and that this
kind of lock requires stopping the whole cluster to be modified, so it is
slow and not scalable.
It seems a good idea to have some kind of hierarchical lock/quorum to
avoid stopping the whole cluster each time a lock is modified.
about quorum [ In reply to ]
Jerome Etienne wrote:
>
> Let's see if I get it this time :)
>
> When I want to modify a shared resource, I check that I have the lock for
> it. If not, I stop the whole cluster via a fencing mechanism and execute
> the quorum algorithm. If I have quorum, I lock the resource and start
> modifying it. Correct?
>
> Note that a write lock can be owned by only one computer, and that this
> kind of lock requires stopping the whole cluster to be modified, so it is
> slow and not scalable.
> It seems a good idea to have some kind of hierarchical lock/quorum to
> avoid stopping the whole cluster each time a lock is modified.

Getting closer:

step 1: if you have the lock, you are OK.

step 2: If you don't have the lock, try to get it from
whoever has the lock. If you can, you are OK.

step 3: If you can't get the lock, because you can't talk to
the node that has it, you may be in a partition.
Check quorum and do the right thing.

step 4: If you can't get the lock, but there is no partition or
quorum problem, and the node that has the lock ought to
be giving it to you, file a bug against the locking system,
or dig into the code and figure it out.
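
The same four steps written out as a rough sketch in Python; the cluster
interface used here is assumed, not a real locking API:

    # Sketch of the decision sequence above.  "cluster" is an assumed
    # object exposing the checks described in steps 1-4; none of these
    # method names come from a real lock manager.

    def get_write_access(cluster, resource):
        if cluster.have_lock(resource):                  # step 1
            return True
        if cluster.request_lock(resource):               # step 2: ask holder
            return True
        if not cluster.holder_reachable(resource):       # step 3: partition?
            if not cluster.have_quorum():
                raise SystemExit("lost quorum: stop touching shared resources")
            cluster.wait_for_fence_and_reclaim(resource) # quorate side fences,
            return True                                  # then reclaims lock
        # step 4: holder is reachable but won't give up a lock it should
        raise RuntimeError("locking bug: file a report or dig into the code")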

Your locking system may hide the partition and reconfig from
you completely, by blocking you until you can proceed.
It might also signal a reconfig event, indicating you
should retry. It might also never return, and be the
place where your node is taken down because of a reconfig
or a fence.

Right now is a good time to be playing with locking systems.
The GFS people have one model, and Peter Braam was looking at
reimplementing the VAX/VMS model in a linux context. I haven't
heard that Peter is making any progress, and the GFS dlock daemon
still isn't quite what you really want.

-dB
about quorum [ In reply to ]
Oi!

) Let's see if I get it this time :)

almost there! :)

) When I want to modify a shared resource, I check that I have the lock for
) it. If not, I stop the whole cluster via a fencing mechanism and execute
) the quorum algorithm. If I have quorum, I lock the resource and start
) modifying it. Correct?

Hmmm... if you want to modify a shared resource, you ask for the lock for
it. You do that by sending a message to the lock manager. The lock manager
answers your message, either giving you the lock or saying it is busy or
something like that. In any case, in a normal situation, it should not
take much more than these two messages.

The quorum is established on cluster reconfiguration. Cluster
reconfiguration will be triggered by nodes joining or leaving the cluster,
nodes crashing and being taken out of the cluster, and network partitions
(there may be more cases, but you get the picture).

Fencing is also done on cluster reconfiguration, and is imposed on those
nodes that are not in the partition with quorum.

Those nodes that can communicate and establish a proper partition with
quorum can certainly coordinate themselves, so that those who need access
to shared resources will politely ask for it, and respect the
answer. If you can be sure that other nodes not in this partition can't
access it (that is, they're fenced off), the system works fine.

) Note that a write lock can be owned by only one computer, and that this
) kind of lock requires stopping the whole cluster to be modified, so it is
) slow and not scalable.

It does not require "stopping the whole cluster", it just requires sendind
a message to the lock manager and respecting the answer.

) It seems a good idea to have some kind of hierarchical lock/quorum to
) avoid stopping the whole cluster each time a lock is modified.

Well... I think this is answered above. Just remember that a cluster has
layers, and so there are steps it must go through in order to start. You first
establish reliable communication and group membership, then you calculate
quorum, then you configure fencing, then you start cluster services, like
lock management. After this point the applications can coordinate anything
they want to do in a safe way.

There's no need to recalculate quorum if there was no event that triggered
a change in cluster membership. There's no need to stop every node to
acquire a lock on a distributed resource, since there's a locking service
that relies on quorum and fencing to know the only nodes accessing shared
resources are those that it explicitly allowed.
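
That layering maps naturally onto a bring-up sequence. A purely
illustrative sketch, with the individual layers passed in as callables
because none of these names correspond to a real API:

    # Bring-up order for one node: each layer only starts once the layer
    # below it is in place, and quorum/fencing sit below the lock manager.

    def start_cluster_node(establish_membership, compute_quorum,
                           configure_fencing, start_lock_manager,
                           start_applications):
        members = establish_membership()   # reliable comms + membership first
        if not compute_quorum(members):    # majority (plus any tie-breaker)
            return "no quorum: stay away from shared resources"
        configure_fencing(members)         # be able to cut off non-members
        start_lock_manager(members)        # only now is locking trustworthy
        start_applications()               # apps coordinate through locks
        return "running"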

Cheers!
Fábio
( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
about quorum [ In reply to ]
From: "Stephen C. Tweedie" <sct@redhat.com>@lists.tummy.com
      on 05/26/2000 01:04:41 PM

Hi,

<text deleted for brevity>

There are decent ways to provide a tie-breaker. The usual way is to
use a third device to provide an extra vote. That device may be another
cluster node, but it may be something more lightweight such as the use
of SCSI device reservation on a shared scsi disk.

--Stephen

==============

Another reasonable technique here is to use the SCSI target-mode
communication feature, and have the nodes send heartbeat and cluster
reconfig messages across the disk links, just as they do across the network
links. In this case, if all network communications are down, the nodes are
still aware that each is up, and they can come to some rational conclusion
as to which node should take (or keep) ownership of the disk. This needs
to be used in conjunction with reserves (or fencing) to deal with cases of
mis-behaving nodes (e.g., one that hangs for a while, but then comes back.)

Using TM-SCSI (or TM-FC) for communication can allow you to run a system
without enforcing a quorum mechanism. It is functionally equivalent,
although the cluster services differ somewhat in their operation. Also,
this technique is most useful when the hardware only has a limited number
of connections - such as twin-tailed, or four-tailed disks.


Peter R. Badovinatz -- (503)578-5530 (TL 775)
Clusters and High Availability, Beaverton, OR
wombat@us.ibm.com or IBMUSM00(WOMBAT)
about quorum [ In reply to ]
Hi,

On Mon, May 22, 2000 at 11:10:07PM -0400, Jerome Etienne wrote:
> I have coded a quorum daemon with an associated life monitor and
> am experiencing a problem with the quorum algorithm. I use the one
> described by Stephen Tweedie in structure.txt[1]. The point of quorum
> is to be sure that at -most- one partition has quorum. Let's consider
> the following scenario:
>
> [node1]<--->[many nodes]<----a link L--->[less nodes]<--->[node2]
>
> At the beginning there are no partitions, so every node has quorum.
> Suddenly the link L fails and the cluster is split, so only the part
> with node1 is supposed to have quorum.
> But the news of the link failure doesn't reach node2 instantly.
> During this delay, node2 wrongly believes it has quorum. So this
> algorithm seems not to guarantee the quorum property, merely to give
> a 'good probability'.
>
> I have two questions:
> 1. Am I missing something?

Yes. There's a fundamental property which protects you in these
cases: locking.

It doesn't matter whether or not you are using a distributed lock
manager or some centralised locking resource --- there is some way
of locking things (be they files, disk blocks, drbd stripes, whatever)
so that multiple nodes do not perform conflicting operations at once
in a properly working cluster.

This property is preserved over a cluster partition. If some parts
of the cluster lose connectivity, then the _immediate_ effect, before
the cluster software has decided that it needs to intervene, is to
halt communications between some of the nodes. Now, when we had full
connectivity, we had network traffic occurring to grant permission to
take certain locks and to modify certain data. If some network paths
have failed, then the obtaining of further locks may also fail.

Basically, you rely on the cluster locking --- if something goes
wrong, the disconnected nodes can do no harm, because they simply
cannot obtain the necessary locking permission to mess things up.
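
A toy illustration of that property (invented classes, not a real lock
manager): once the network path to the lock manager is gone, a
disconnected node's lock requests simply fail, so it never gets permission
to do conflicting work.

    # The lock grant travels over the same (now broken) network, so a
    # disconnected node cannot obtain new locks and can do no harm.

    class LockManager:
        def __init__(self):
            self.locks = {}
            self.reachable = set()

        def grant(self, node, resource):
            if node not in self.reachable:
                raise ConnectionError("no path to lock manager: request lost")
            if self.locks.setdefault(resource, node) != node:
                raise RuntimeError("busy: held by %s" % self.locks[resource])
            return True

    lm = LockManager()
    lm.reachable = {"node1"}              # node2 is on the wrong side of the split
    lm.grant("node1", "blockgroup-7")     # quorate side keeps working
    try:
        lm.grant("node2", "blockgroup-7") # disconnected side gets no lock
    except ConnectionError:
        pass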

Cheers,
Stephen
about quorum [ In reply to ]
Hi,

On Thu, May 25, 2000 at 10:48:15AM -0300, Fábio Olivé Leite wrote:
>
> ) Like I said, I don't pretend to understand drbd at all. I don't
> ) see how it works if the "primary" is being served by a node
> ) that is in the losing partition. The quorum holding side
> ) would want to work with the secondary, but then you'd have
> ) to move the "master" role, and you might lose writes done
> ) on the old master before the partition was detected. Merging
> ) divergent mirrors is nasty stuff.
>
> That's why IMVeryHO shared storage semantics are handled much better by
> distributed filesystems, with file-level sharing. Sharing a block device
> is OK for two nodes, but more than that smells like trouble to me. :)

VMS clusters successfully share disks directly amongst dozens of nodes.
As long as you have the locking right, it's still safe, and decent
interconnect technologies make the physical wiring of the disks
relatively painless.

> In the case of two nodes, you have quorum when both are up, and also when
> one of them is down.

No you don't --- not without a tie-breaker mechanism. That's a necessary
part of quorum. The trouble is, when one is down, the other doesn't
_know_ that it is down --- it can't tell the difference between a failed
node and a cluster partition.

There are decent ways to provide a tie-breaker. The usual way is to
use a third device to provide an extra vote. That device may be another
cluster node, but it may be something more lightweight such as the use
of SCSI device reservation on a shared scsi disk.
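
For the two-node case this is easy to see as a vote count (a sketch; vote
weights and the SCSI reservation mechanics are glossed over):

    # Two nodes alone can never break a 1-1 tie after a split; a third
    # vote (another node, or e.g. a reservation won on a shared disk) can.

    def has_quorum(my_votes, total_votes):
        return 2 * my_votes > total_votes        # strict majority

    # Without a tie-breaker: 2 nodes, 2 votes -> a lone survivor has no quorum.
    assert not has_quorum(my_votes=1, total_votes=2)

    # With a tie-breaker device worth one vote: whoever grabs it wins 2 of 3.
    assert has_quorum(my_votes=2, total_votes=3)
    assert not has_quorum(my_votes=1, total_votes=3)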

--Stephen
about quorum [ In reply to ]
Hi,

On Tue, May 23, 2000 at 10:33:09AM -0400, Jerome Etienne wrote:

> Is the fencing problem equivalent to the quorum one ?

No. Quorum is about whether a cluster partition has permission to
write to a disk or not. Fencing is not about whether or not *you*
have permission to write to the disk --- it is about you knowing
for a fact that some other node does *not* have permission, and
how you go about making sure that that node doesn't write stuff
regardless.

Quorum is assumed to be established by a working cluster. Fencing
is about nodes which are not working. If the cluster is working,
you can assume that its writes are consistent, and are protected
by quorum. However, if a node is not working, you don't have any
guarantees about *how* it has failed, and it is up to the rest of
the cluster to guarantee that it can do no damage.

--Stephen
about quorum [ In reply to ]
Hi,

On Tue, May 23, 2000 at 10:26:22AM -0400, Jerome Etienne wrote:
>
> let me rephrase the question, hoping this helps: "does a host require
> to know whether it has quorum without any possible mistake, or is
> an error with a low probability tolerable?"

No. The whole problem is that if you don't have quorum, it is
because something has gone wrong, and if something has gone wrong,
you are in no position to judge whether or not you have still got
quorum!

The example often used to show why this is needed is the case of a
machine which locks up for several seconds. Now, while it is asleep,
the rest of the cluster sees that it has disappeared, and evicts it
from the cluster, reforming quorum without it.

What happens when the sleeping node revives itself? As far as it
is concerned, it has not had time to detect the loss of quorum, so it
continues scribbling on the shared disk. Eventually it will detect
the change of cluster membership and will react to that (probably by
rebooting itself), but it's too late, the damage has been done.

So, the rest of the cluster needs to protect itself from the dead
node, in case the supposedly dead node is in fact still capable of
doing harm. *That* is fencing.

--Stephen
about quorum [ In reply to ]
Oh my Lord, you're getting into the thick of it now!

hal

> -----Original Message-----
> From: wombat@us.ibm.com [SMTP:wombat@us.ibm.com]
> Sent: Friday, May 26, 2000 7:54 AM
> To: linux-ha-dev@lists.tummy.com
> Subject: Re: [Linux-ha-dev] about quorum
>
>
> <text deleted for brevity>
about quorum [ In reply to ]
My apologies for this message. I meant to send it directly to the author of
the original message.

hal

> -----Original Message-----
> From: Porter, Hal (halp) [SMTP:halp@sequent.com]
> Sent: Friday, May 26, 2000 10:56 AM
> To: 'linux-ha-dev@lists.tummy.com'
> Subject: RE: [Linux-ha-dev] about quorum
>
> Oh my Lord, you're getting into the thick of it now!
>
> hal
>
> > <text deleted for brevity>
about quorum [ In reply to ]
"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Tue, May 23, 2000 at 10:33:09AM -0400, Jerome Etienne wrote:
>
> > Is the fencing problem equivalent to the quorum one ?
>
> No. Quorum is about whether a cluster partition has permission to
> write to a disk or not. Fencing is not about whether or not *you*
> have permission to write to the disk --- it is about you knowing
> for a fact that some other node does *not* have permission, and
> how you go about making sure that that node doesn't write stuff
> regardless.
>
> Quorum is assumed to be established by a working cluster. Fencing
> is about nodes which are not working. If the cluster is working,
> you can assume that its writes are consistent, and are protected
> by quorum. However, if a node is not working, you don't have any
> guarantees about *how* it has failed, and it is up to the rest of
> the cluster to guarantee that it can do no damage.

Another important distinction is that fencing guarantees that the I/O is
blocked within a specified amount of time, as determined by the nodes
that have quorum. Quorum (or loss thereof) is determined by each
individual node, at some unspecified time in the future. So, consider a
node that drops out of the cluster because it is temporarily hung.

For example, if Stephen has entered the kernel debugger and then gone to
tea, the node won't determine it has lost quorum until sometime after
Stephen returns from tea and exits the kernel debugger. However, it
will get fenced out of the cluster I/O by the part of the cluster that
has quorum, when *that part* determines that it cannot communicate with
the hung node.
Normally, this would happen long before Stephen returns from tea. Of
course if STONITH is used to perform I/O fencing, then it will have
gotten rebooted before he returns ;-)

This distinction of when it happens is a very important part of what
makes the cluster work properly - even under adverse circumstances, like
distracted kernel developers ;-)

-- Alan Robertson
alanr@suse.com
