Mailing List Archive

about quorum
I have coded a quorum daemon with an associated life monitor and
am experiencing a problem with the quorum algorithm. I use the one
described by Stephen Tweedie in structure.txt[1]. The point of the
quorum is to be sure that at -most- one partition has the quorum.
Let's consider the following scenario:

[node1]<--->[many nodes]<----a link L--->[fewer nodes]<--->[node2]

At the beginning, there are no partitions, so every node has the
quorum. Suddenly the link L fails and the cluster is split, so only
the part with node1 is supposed to have the quorum. But the
information about the link failure doesn't reach node2 instantly.
During this delay, node2 wrongly believes it has the quorum. So this
algorithm seems not to guarantee the quorum, simply to give a 'good
probability'.

I have 2 questions:
1. Am I missing something?
2. Is a guarantee required, or is a 'good probability' enough?


[1] from structure.txt:
" Quorum is necessary to protect cluster-wide shared persistent
state. It is essential to avoid problems when we have "cluster
partition": a possible type of fault in which some of the cluster
members have lost communications with the rest, but where the nodes
themselves are still working. In a partitioned cluster, we need
some mechanism we can rely on to ensure that at most one partition
has the right to update the cluster's shared persistent state.
(That state might be a shared disk, for example.)

Quorum is maintained by assigning a number of votes to each node.
This is a configuration property of the node. The Quorum manager
keeps track of two separate vote counts: the "Cluster Votes", which
is the sum of the votes of every node which is a member of the
cluster, and the "Expected Votes", which is the sum of the votes on
every node which has ever been seen by any voting member of the
cluster. (The storage of those node records is one reason why the
Quorum layer requires a JDB in this design.)

The cluster has Quorum if, and only if, it possesses MORE than half
of the Expected Votes. This guarantees that the known nodes which
are not in this cluster can not possibly form a Quorum on their own."
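
A minimal sketch of that rule (illustrative Python; the names are
invented here, not taken from any real implementation). Because quorum
requires strictly MORE than half of the Expected Votes, two disjoint
partitions can never both pass the test:

def has_quorum(cluster_votes: int, expected_votes: int) -> bool:
    # A partition has quorum iff it holds more than half of the
    # Expected Votes. Two disjoint partitions cannot both satisfy
    # 2*v > E, since their vote totals sum to at most E.
    return 2 * cluster_votes > expected_votes

# The scenario above, with five one-vote nodes split 3/2 by link L:
assert has_quorum(3, 5)        # node1's side keeps quorum
assert not has_quorum(2, 5)    # node2's side can never win on its own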
about quorum [ In reply to ]
Here are a couple of points relevant to understanding
the issue and the solution.

Point 1: until the right side does something
that needs to communicate with the left side, it may not
detect the failure. This is fine, until the left side wants
to do something representing changed state. For instance, imagine
node 2 holds a lock, representing a disk block, before the
partition. It reads and writes and updates, and at some point
the partition happens, and it writes some more, which is all
OK and correct. In the meantime, the left side goes through a
transition, believes it has the quorum, and wants to proceed.

Before it may proceed, it must establish Point 2:
that the other nodes are well and truly gone.
This is the part described as "i/o fencing"
and/or STONITH, "shoot the other node in the head". The
quorum-claiming side must ensure that the evicted nodes
are effectively dead. If not, then node 1 can become
the new lock owner while a dangerously live node 2 is
busily corrupting the data on the disk.

Quorum vote weights resolve a problem with a partition
leaving equal sub-clusters, as only one will have the
weightier member(s). Weights don't resolve the need to
fence/evict members of the losing side; this problem
recurs in all quorum resolution schemes (of which there
are many).

I am currently holding the view that there ought to be
standard APIs and protocols for establishing fences
and/or invoking STONITH. These should exist independent
of the quorum mechanism in question. There seem to be
two reasonable flavors: STONITH, causing an actual
reboot (or halt, depending on your religious/pragmatic
disposition), and "virtual wire cutting", where something
acts as if the wires between the host and the i/o devices
had been chopped. Examples of the latter would include
asserting SCSI reservation on the surviving node in a
shared bus; programming fibre channel switches or storage
devices to reject i/o from a host; changing access rights
to a "smart" shared storage system such as a remote block
device or an NFS server. In all cases, the semantics must
be such that when the "fence" call succeeds, all i/o from
the host in question is guaranteed to be done, and that
no further i/o from the host can be done until the node is
rebooted. It is assumed that a reboot "unfences" the node,
and that the node will be well behaved (properly join the
cluster) before updating any shared resources.
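
A sketch of what such an API might look like (hypothetical Python with
invented names; no such interface actually existed in Linux-HA):

from abc import ABC, abstractmethod

class Fence(ABC):
    @abstractmethod
    def fence(self, node: str) -> None:
        # Contract argued for above: block until all i/o from `node`
        # is done, and guarantee no further i/o is possible until the
        # node reboots and properly rejoins the cluster.
        ...

class Stonith(Fence):
    # Flavor 1: cause an actual reboot (or halt) of the node.
    def fence(self, node: str) -> None:
        raise NotImplementedError("drive a remote power switch here")

class VirtualWireCut(Fence):
    # Flavor 2: act as if the wires to the i/o devices were chopped:
    # a SCSI reservation, a fibre channel switch/zone change, or
    # revoked rights on a smart shared storage server.
    def fence(self, node: str) -> None:
        raise NotImplementedError("revoke the node's storage access here")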

-dB

Jerome Etienne wrote:
>
> I have coded a quorum daemon with an associated life monitor and
> am experiencing a problem with the quorum algorithm. I use the one
> described by Stephen Tweedie in structure.txt[1]. The point of the
> quorum is to be sure that at -most- one partition has the quorum.
> Let's consider the following scenario:
>
> [node1]<--->[many nodes]<----a link L--->[fewer nodes]<--->[node2]
>
> At the beginning, there are no partitions, so every node has the
> quorum. Suddenly the link L fails and the cluster is split, so only
> the part with node1 is supposed to have the quorum. But the
> information about the link failure doesn't reach node2 instantly.
> During this delay, node2 wrongly believes it has the quorum. So this
> algorithm seems not to guarantee the quorum, simply to give a 'good
> probability'.
>
> I have 2 questions:
> 1. Am I missing something?
> 2. Is a guarantee required, or is a 'good probability' enough?
>
> [1] from structure.txt:
> " Quorum is necessary to protect cluster-wide shared persistent
> state. It is essential to avoid problems when we have "cluster
> partition": a possible type of fault in which some of the cluster
> members have lost communications with the rest, but where the nodes
> themselves are still working. In a partitioned cluster, we need
> some mechanism we can rely on to ensure that at most one partition
> has the right to update the cluster's shared persistent state.
> (That state might be a shared disk, for example.)
>
> Quorum is maintained by assigning a number of votes to each node.
> This is a configuration property of the node. The Quorum manager
> keeps track of two separate vote counts: the "Cluster Votes", which
> is the sum of the votes of every node which is a member of the
> cluster, and the "Expected Votes", which is the sum of the votes on
> every node which has ever been seen by any voting member of the
> cluster. (The storage of those node records is one reason why the
> Quorum layer requires a JDB in this design.)
>
> The cluster has Quorum if, and only if, it possesses MORE than half
> of the Expected Votes. This guarantees that the known nodes which
> are not in this cluster can not possibly form a Quorum on their own."
about quorum [ In reply to ]
On Mon, May 22, 2000 at 08:42:43PM -0700, David Brower wrote:
> Here are a couple of points relevant to understanding
> the issue and the solution.

you may consider your email relevant but it doesn't answer my questions.
about quorum [ In reply to ]
To be specific then:

1. You missed the necessary and sufficient issue of
fencing as something to proceed with once quorum
is decided;

2. Yes, a guarantee (of fence viability) is required,
and good probability is not good enough.

-dB

Jerome Etienne wrote:
>
> On Mon, May 22, 2000 at 08:42:43PM -0700, David Brower wrote:
> > Here are a couple of points relevant to understanding
> > the issue and the solution.
>
> you may consider your email relevant but it doesn't answer my questions.
about quorum [ In reply to ]
Jerome Etienne wrote:
>
> On Mon, May 22, 2000 at 08:42:43PM -0700, David Brower wrote:
> > Here are a couple of points relevant to understanding
> > the issue and the solution.
>
> you may consider your email relevant but it doesn't answer my questions.

Actually, David *did* answer your question.

Quorum is not enough in and of itself. You need I/O fencing also.

That's what he said. He just did it with more words.

-- Alan Robertson
alanr@suse.com
about quorum [ In reply to ]
David Brower wrote:
>
> To be specific then:
>
> 1. You missed the necessary and sufficient issue of
> fencing as something to proceed with once quorum
> is decided;
>
> 2. Yes, a guarantee (of fence viability) is required,
> and good probability is not good enough.

A guarantee of immediate termination is necessary when genuine
single-copy shared data (like shared SCSI) is used.

A guarantee of eventual discovery of the loss of quorum is sufficient
when a technique like drbd is used for disk mirroring onto independent
media.

-- Alan Robertson
alanr@suse.com
about quorum [ In reply to ]
On Tue, May 23, 2000 at 08:00:45AM -0600, Alan Robertson wrote:
> > you may consider your email relevant but it doesn't answer my questions.
>
> Actually, David *did* answer your question.
> Quorum is not enough in and of itself. You need I/O fencing also.
> That's what he said. He just did it with more words.

Well, if he said that, he didn't answer my second question.

Let me rephrase the question, hoping this helps: "does a host need
to know whether it has the quorum without any possible mistake, or
is an error with a low probability tolerable?"
The question wasn't 'is quorum enough for an unspecified purpose?'
about quorum [ In reply to ]
On Tue, May 23, 2000 at 06:24:03AM -0700, David Brower wrote:
> 1. You missed the necessary and sufficient issue of
> fencing as something to proceed with once quorum
> is decided;

Is the fencing problem equivalent to the quorum one?
If so, I am interested in any pointers or explanations about
possible solutions.

> 2. Yes, a guarantee (of fence viability) is required,
> and good probability is not good enough.

Thanks for your succinct answer.
about quorum [ In reply to ]
On Tue, May 23, 2000 at 08:04:04AM -0600, Alan Robertson wrote:
> A guarantee of immediate termination is necessary when genuine
> single-copy shared data (like shared SCSI) is used.
>
> A guarantee of eventual discovery of the loss of quorum is sufficient
> when a technique like drbd is used for disk mirroring onto independent
> media.

Why do you make a distinction between the two?
about quorum [ In reply to ]
Jerome Etienne wrote:
>
> On Tue, May 23, 2000 at 08:04:04AM -0600, Alan Robertson wrote:
> > A guarantee of immediate termination is necessary when genuine
> > single-copy shared data (like shared SCSI) is used.
> >
> > A guarantee of eventual discovery of the loss of quorum is sufficient
> > when a technique like drbd is used for disk mirroring onto independent
> > media.
>
> Why do you make a distinction between the two?

> On Tue, May 23, 2000 at 06:24:03AM -0700, David Brower wrote:
> > 1. You missed the necessary and sufficient issue of
> > fencing as something to proceed with once quorum
> > is decided;
>
> Is the fencing problem equivalent to the quorum one?
> If so, I am interested in any pointers or explanations about
> possible solutions.

The quorum step identifies -who- should proceed.

The fencing step means it is -safe- for the quorum members
to proceed.

The strength of the fence necessary depends on the nature
of the resources being shared, and how they recover (or
don't recover) themselves from non-serialized access. It
is easier for something to recover if it knows there were
no non-serialized accesses. A truly shared disk needs
a strong fence.

Alan is suggesting DRBD doesn't need that because of the
nature of its recovery. I'm not sure I understand that
point, maybe because I don't understand drbd recovery enough
to have an opinion. There are reasons to be deeply
suspicious in that area. It is easy to make tragic
mistakes in recovery.

-dB
about quorum [ In reply to ]
On Tue, May 23, 2000 at 08:14:58AM -0700, David Brower wrote:
> The quorum step identifies -who- should proceed.
>
> The fencing step means it is -safe- for the quorum members
> to proceed.

To see if I understand: the quorum doesn't need to be guaranteed,
because the guarantee is ensured by the fencing step, correct?

Where can I find information about fencing algorithms?

> The strength of the fence necessary depends on the nature
> of the resources being shared, and how they recover (or
> don't recover) themselves from non-serialized access. It
> is easier for something to recover if it knows there were
> no non-serialized accesses. A truly shared disk needs
> a strong fence.
>
> Alan is suggesting DRBD doesn't need that because of the
> nature of its recovery. I'm not sure I understand that
> point, maybe because I don't understand drbd recovery enough
> to have an opinion. There are reasons to be deeply
> suspicious in that area. It is easy to make tragic
> mistakes in recovery.
>
> -dB
about quorum [ In reply to ]
The quorum determination -must- be correct in the
sense that everyone agrees what it is. If there
are different opinions about who has The Quorum,
then bad things will happen. In your example case,
the losing partition has not checked its quorum,
and has a mistaken belief that would be corrected
if it checked. That is OK, and different from the
case where two sides of a partition detected the
transition, did their determination, and both
decided that they still hold quorum.

There is sparse literature on fencing mechanisms.
The discussions of STONITH here are a place one
could start looking. The "wolfpack" clustering
in Windows relies on SCSI reservation, so you
could fish the Microsoft sites for how they do it.
I haven't seen any references to how fibre channel
environments actually do things in detail. There
have been suggestions that 'persistent reservations'
can be used, and others that 'domains' (which may
be proprietary to particular fibre channel switch
vendors) may also be used effectively.

-dB

Jerome Etienne wrote:
>
> On Tue, May 23, 2000 at 08:14:58AM -0700, David Brower wrote:
> > The quorum step identifies -who- should proceed.
> >
> > The fencing step means it is -safe- for the quorum members
> > to proceed.
>
> To see if I understand: the quorum doesn't need to be guaranteed,
> because the guarantee is ensured by the fencing step, correct?
>
> Where can I find information about fencing algorithms?
>
> > The strength of the fence necessary depends on the nature
> > of the resources being shared, and how they recover (or
> > don't recover) themselves from non-serialized access. It
> > is easier for something to recover if it knows there were
> > no non-serialized accesses. A truly shared disk needs
> > a strong fence.
> >
> > Alan is suggesting DRBD doesn't need that because of the
> > nature of its recovery. I'm not sure I understand that
> > point, maybe because I don't understand drbd recovery enough
> > to have an opinion. There are reasons to be deeply
> > suspicious in that area. It is easy to make tragic
> > mistakes in recovery.
> >
> > -dB
about quorum [ In reply to ]
On Tue, May 23, 2000 at 08:45:02AM -0700, David Brower wrote:
> In your example case,
> the losing partition has not checked its quorum,
> and has a mistaken belief that would be corrected
> if it checked.

How does a partition check its quorum?

> That is OK, and different from the
> case where two sides of a partition detected the
> transition, did their determination, and both
> decided that they still hold quorum.

In my example, both partitions do that too. But there is a
delay between the beginning of the partition and its detection by
all its nodes. During this delay, both partitions believe they own
the quorum.

> There is sparse literature on fencing mechanisms.

Too bad; if you find some, please post them here.
about quorum [ In reply to ]
Jerome Etienne wrote:
>
> On Tue, May 23, 2000 at 08:00:45AM -0600, Alan Robertson wrote:
> > > you may consider your email relevant but it doesn't answer my questions.
> >
> > Actually, David *did* answer your question.
> > Quorum is not enough in and of itself. You need I/O fencing also.
> > That's what he said. He just did it with more words.
>
> Well, if he said that, he didn't answer my second question.
>
> Let me rephrase the question, hoping this helps: "does a host need
> to know whether it has the quorum without any possible mistake, or
> is an error with a low probability tolerable?"
> The question wasn't 'is quorum enough for an unspecified purpose?'

Quorum isn't enough for shared resources which get irretrievably damaged
by simultaneous access by two nodes. Shared disks are an example of
such a resource which is irretrievably damaged by simultaneous access.
It is enough for many other purposes.

But it *does* depend on what you're protecting with the quorum. It
depends on the characteristics of the resource.

-- Alan Robertson
alanr@suse.com
about quorum [ In reply to ]
Jerome Etienne wrote:
>
> On Tue, May 23, 2000 at 08:04:04AM -0600, Alan Robertson wrote:
> > A guarantee of immediate termination is necessary when genuine
> > single-copy shared data (like shared SCSI) is used.
> >
> > A guarantee of eventual discovery of the loss of quorum is sufficient
> > when a technique like drbd is used for disk mirroring onto independent
> > media.
>
> Why do you make a distinction between the two?

Because they have different characteristics.

If two machines mount a filesystem read/write simultaneously it will be
irretrievably damaged.

If two machines mount separate filesystems, no irretrievable damage will
occur. That's what happens when two drbd instances lose contact with
each other. Each takes over. Quorum will resolve the problem very
soon, and one copy will be declared the good copy, and the other will be
invalidated.

So, no permanent damage will occur.
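
A sketch of that argument (illustrative Python only; this is not how
drbd is actually implemented). Each node has its own full copy, so a
split brain is repaired by declaring one copy good and rebuilding the
other from it; late detection costs a resync rather than corruption:

def resolve_split_brain(replicas: dict, winner: str) -> None:
    # `replicas` maps node name -> that node's private copy. After
    # quorum picks a winner, every other copy is invalidated and
    # rebuilt from the winner (a stand-in for a full resync). The
    # loser's post-partition writes are discarded, not merged.
    for node in replicas:
        if node != winner:
            replicas[node] = replicas[winner]

copies = {"node1": "data+writes-A", "node2": "data+writes-B"}
resolve_split_brain(copies, winner="node1")  # quorum declared node1 good
assert copies["node2"] == copies["node1"]    # node2's divergence is gone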

-- Alan Robertson
alanr@suse.com
about quorum [ In reply to ]
David Brower wrote:
>
> Jerome Etienne wrote:
> >
> > On Tue, May 23, 2000 at 08:04:04AM -0600, Alan Robertson wrote:
> > > A guarantee of immediate termination is necessary when genuine
> > > single-copy shared data (like shared SCSI) is used.
> > >
> > > A guarantee of eventual discovery of the loss of quorum is sufficient
> > > when a technique like drbd is used for disk mirroring onto independent
> > > media.
> >
> > Why do you make a distinction between the two?
>
> > On Tue, May 23, 2000 at 06:24:03AM -0700, David Brower wrote:
> > > 1. You missed the necessary and sufficient issue of
> > > fencing as something to proceed with once quorum
> > > is decided;
> >
> > Is the fencing problem equivalent to the quorum one?
> > If so, I am interested in any pointers or explanations about
> > possible solutions.
>
> The quorum step identifies -who- should proceed.
>
> The fencing step means it is -safe- for the quorum members
> to proceed.
>
> > The strength of the fence necessary depends on the nature
> of the resources being shared, and how they recover (or
> don't recover) themselves from non-serialized access. It
> is easier for something to recover if it knows there were
> no non-serialized accesses. A truly shared disk needs
> a strong fence.
>
> Alan is suggesting DRBD doesn't need that because of the
> nature of its recovery. I'm not sure I understand that
> point, maybe because I don't understand drbd recovery enough
> to have an opinion. There are reasons to be deeply
> suspicious in that area. It is easy to make tragic
> mistakes in recovery.

I agree. However, if drbd doesn't recover that way, it is a bug, and
should be fixed.

-- Alan Robertson
alanr@suse.com
about quorum [ In reply to ]
Jerome Etienne wrote:
>
> On Tue, May 23, 2000 at 08:45:02AM -0700, David Brower wrote:
> > In your example case,
> > the losing partition has not checked its quorum,
> > and has a mistaken belief that would be corrected
> > if it checked.
>
> How does a partition check its quorum?

Implementation dependent. It goes through the process
again and sees if it still has it. It might go through
an election with weights; it might rely on access to
a quorum device, etc. Many quorum schemes exist.

>
> > That is OK, and different from the
> > case where two sides of a partition detected the
> > transition, did their determination, and both
> > decided that they still hold quorum.
>
> In my example, both partitions do that too. But there is a
> delay between the beginning of the partition and its detection by
> all its nodes. During this delay, both partitions believe they own
> the quorum.

Yes, this is why they need to wait for fence establishment
before proceeding -- to turn off any oblivious partition.

I think what Jerome is missing is that there is far
more going on at the time of a cluster transition
than just identifying the quorum holders. There may
be many sub-steps involved in the process, some just
to get to the quorum determination part, others to
establish fences, others to re-establish lock ownership,
etc.

At the time someone detects something that causes
a significant transition, it may need to freeze current
operations until it establishes quorum and knows the
state of the cluster.

Therefore, in Jerome's example, the node1 sub-cluster
is frozen while it makes its determination. It is OK
for the right-hand node2 sub-cluster to not even know
about the transition, and be doing operations to the
resources it controlled before the partition occurred.
This was Point 1 in my original response.

Then, the left side quorum winner fences off the right
side, stopping it from doing anything else. When that
is known to be complete, then and only then may the
left side truly take ownership of the resources that
the right side may have been using.
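
Put as a sketch (hypothetical Python; the names and steps illustrate
the ordering described above, not any real cluster API):

def reconfigure(my_votes: int, expected_votes: int,
                lost_nodes: list, fence) -> list:
    # Takeover happens strictly after quorum is won AND every fence
    # call has completed; a losing side never proceeds at all.
    steps = ["freeze"]                   # stop changing shared state first
    if 2 * my_votes <= expected_votes:
        steps.append("halt")             # lost the quorum vote: stand down
        return steps
    for node in lost_nodes:
        fence(node)                      # blocks until the node can do no i/o
        steps.append("fenced " + node)
    steps.append("take over resources")  # safe only once all fences succeed
    steps.append("unfreeze")
    return steps

# node1's sub-cluster in Jerome's picture: 3 of 5 votes, node2's side lost.
print(reconfigure(3, 5, ["node2"], fence=lambda n: None))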

-dB
about quorum [ In reply to ]
On Tue, May 23, 2000 at 08:14:58AM -0700, David Brower wrote:
> Jerome Etienne wrote:
> > On Tue, May 23, 2000 at 08:04:04AM -0600, Alan Robertson wrote:
> > > A guarantee of eventual discovery of the loss of quorum is sufficient
> > > when a technique like drbd is used for disk mirroring onto independent
> > > media.
> >
> > Why do you make a distinction between the two?
>
> Alan is suggesting DRBD doesn't need that because of the
> nature of its recovery. I'm not sure I understand that
> point, maybe because I don't understand drbd recovery enough
> to have an opinion. There are reasons to be deeply
> suspicious in that area. It is easy to make tragic
> mistakes in recovery.

I don't understand why DRBD recovery influences this either. As far as
I understand it, the drbd primary maintains a list of outstanding writes
to the secondary when it sees that the secondary is unreachable. When
the secondary reconnects, the primary can send it all the missed updates.
I have not thought about how cluster quorum interacts with this, but
don't see offhand why it should be special. Is it?

-dg


--
David Gould dg@suse.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
C++ : an octopus made by nailing extra legs onto a dog
about quorum [ In reply to ]
dgould@suse.com wrote:
>
> On Tue, May 23, 2000 at 08:14:58AM -0700, David Brower wrote:
> > Jerome Etienne wrote:
> > > On Tue, May 23, 2000 at 08:04:04AM -0600, Alan Robertson wrote:
> > > > A guarantee of eventual discovery of the loss of quorum is sufficient
> > > > when a technique like drbd is used for disk mirroring onto independent
> > > > media.
> > >
> > > Why do you make a distinction between the two?
> >
> > Alan is suggesting DRBD doesn't need that because of the
> > nature of its recovery. I'm not sure I understand that
> > point, maybe because I don't understand drbd recovery enough
> > to have an opinion. There are reasons to be deeply
> > suspicious in that area. It is easy to make tragic
> > mistakes in recovery.
>
> I don't understand why DRBD recovery influences this either. As far as
> I understand it, the drbd primary maintains a list of outstanding writes
> to the secondary when it sees that the secondary is unreachable. When
> the secondary reconnects, the primary can send it all the missed updates.
> I have not thought about how cluster quorum interacts with this, but
> don't see offhand why it should be special. Is it?
>
> -dg

Like I said, I don't pretend to understand drbd at all. I don't
see how it works if the "primary" is being served by a node
that is in the losing partition. The quorum holding side
would want to work with the secondary, but then you'd have
to move the "master" role, and you might lose writes done
on the old master before the partition was detected. Merging
divergent mirrors is nasty stuff.

It might be the case that drbd is useful only when the
serving nodes are outside the cluster, acting as independent
storage devices. Again, I have not understood enough
about it or thought it through.

-dB
about quorum [ In reply to ]
On Tue, May 23, 2000 at 10:46:11AM -0700, David Brower wrote:
> > How does a partition check its quorum?
>
> Implementation dependent. It goes through the process
> again and sees if it still has it. It might go through
> an election with weights; it might rely on access to
> a quorum device, etc. Many quorum schemes exist.

It may be a good idea to look more closely at these algorithms.
The only one I know is described in structure.txt and
doesn't seem to guarantee that only one partition gets the
quorum.

> I think what Jerome is missing is that there is far
> more going on at the time of a cluster transition
> than just identifying the quorum holders.

No, currently what I am missing is how a partition can be sure
it has the quorum.

> There may
> be many sub-steps involved in the process, some just
> to get to the quorum determination part, others to
> establish fences, others to re-establish lock ownership,
> etc.

As you said in another email, all that happens -after- the
partition knows if it has the quorum. So let's solve the quorum
problem first.

> Therefore, in Jerome's example, the node1 sub-cluster
> is frozen while it makes its determination.

As I explained in my example, during a delay, some nodes
aren't aware there is a split going on, so they don't trigger
any determination process which might freeze them.
about quorum [ In reply to ]
On Tue, May 23, 2000 at 05:07:18PM -0400, Jerome Etienne wrote:
> On Tue, May 23, 2000 at 10:46:11AM -0700, David Brower wrote:
> > > How does a partition check its quorum?
> >
> > Implementation dependent. It goes through the process
> > again and sees if it still has it. It might go through
> > an election with weights; it might rely on access to
> > a quorum device, etc. Many quorum schemes exist.
>
> It may be a good idea to look more closely at these algorithms.
> The only one I know is described in structure.txt and
> doesn't seem to guarantee that only one partition gets the
> quorum.

I didn't look closely at that, but was just reading a very relevant paper
which presents a quorum protocol in detail:

"The Part Time Parliment" - Leslie Lamport
http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-049.html

Also a bit about distributed disks:

http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-155.html

-dg


--
David Gould dg@suse.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
C++ : an octopus made by nailing extra legs onto a dog
about quorum [ In reply to ]
There is a cluster quorum thingy running on each
machine. When one of them detects a partition,
it signals its compatriots on the other nodes,
and they freeze, and work together to decide
their view of quorum, and go through all the
steps necessary to deal with a reconfiguration.

One of the things you're asking about is the
appearance there there are two quorums -- the
rightful one, and the wrong one in the partition.
This is resolved with cluster generation numbers.
There is only one quorum per generation. In the
example, the erroneous partition may have quorum
and resources in generation N, but the correct
holder is in generation N+1. When gen N+1 is sure
that N is dead, then it can take over the resources.
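
A sketch of that rule (illustrative Python, invented names):

class QuorumClaim:
    # A quorum claim is only meaningful for one cluster generation.
    def __init__(self, generation: int, holder: str):
        self.generation = generation
        self.holder = holder

def may_take_over(claim: QuorumClaim, current_generation: int,
                  old_holders_fenced: bool) -> bool:
    # A stale partition may still "hold quorum" for generation N; that
    # claim is harmless, because resources move only when the
    # generation N+1 winner knows the old holders are dead.
    return claim.generation == current_generation and old_holders_fenced

stale = QuorumClaim(generation=3, holder="node2")  # oblivious partition
fresh = QuorumClaim(generation=4, holder="node1")  # rightful new quorum
assert not may_take_over(stale, current_generation=4, old_holders_fenced=False)
assert may_take_over(fresh, current_generation=4, old_holders_fenced=True)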

The side that does have quorum, as determined
by the procedure started when failure was detected,
needs to fence out the other side. The quorum can
be reliably determined using weights. (There are
possibilities that no quorum can be determined due to
misconfiguration or multiple partitions).

If a node doesn't respond, then it's not going
to be in the quorum group for the new generation.
It might or might not detect the partition and
start its own quorum determination. If it does
start a quorum vote, it will lose, because it
doesn't have the votes. If it doesn't notice
the failure, it will run blindly until fenced.
This is acceptable and safe.

If the side that ought to have the quorum is
twiddling its thumbs, and doesn't get the
indication that it ought to check quorum for
a while, things are just slow. The side that
did try to reconfigure, and felt itself a loser
in the quorum will probably commit suicide.
Sooner or later, the other side -will- detect
something amiss and go into a reconfiguration,
pick up the quorum, and fence off the losers.

The algorithm in structure.txt does guarantee
that quorum will be determined. As I've said
before, it doesn't matter if the loser thinks
it has quorum for an old cluster generation or not.
It is safe to continue using resources it controlled
in the old generation. At some point it will either
realize there is a new generation, and reconfigure,
or it will get fenced, and the failures of i/o will
cause reconfiguration, or those failures will just
take the machine down.

I'm using lots of words to try out different
explanations. Whenever you get it, please
let us know which bit of information seemed
like the critical missing piece to you.

-dB

Jerome Etienne wrote:
>
> On Tue, May 23, 2000 at 10:46:11AM -0700, David Brower wrote:
> > > How does a partition check its quorum?
> >
> > Implementation dependent. It goes through the process
> > again and sees if it still has it. It might go through
> > an election with weights; it might rely on access to
> > a quorum device, etc. Many quorum schemes exist.
>
> It may be a good idea to look more closely at these algorithms.
> The only one I know is described in structure.txt and
> doesn't seem to guarantee that only one partition gets the
> quorum.
>
> > I think what Jerome is missing is that there is far
> > more going on at the time of a cluster transition
> > than just identifying the quorum holders.
>
> No, currently what I am missing is how a partition can be sure
> it has the quorum.
>
> > There may
> > be many sub-steps involved in the process, some just
> > to get to the quorum determination part, others to
> > establish fences, others to re-establish lock ownership,
> > etc.
>
> As you said in another email, all that happens -after- the
> partition knows if it has the quorum. So let's solve the quorum
> problem first.
>
> > Therefore, in Jerome's example, the node1 sub-cluster
> > is frozen while it makes its determination.
>
> As I explained in my example, during a delay, some nodes
> aren't aware there is a split going on, so they don't trigger
> any determination process which might freeze them.

--
Butterflies tell me to say:
"The statements and opinions expressed here are my own and do not necessarily
represent those of Oracle Corporation."
about quorum [ In reply to ]
Jerome Etienne wrote:
>
> On Tue, May 23, 2000 at 10:46:11AM -0700, David Brower wrote:
> > > How does a partition check its quorum?
> >
> > Implementation dependent. It goes through the process
> > again and sees if it still has it. It might go through
> > an election with weights; it might rely on access to
> > a quorum device, etc. Many quorum schemes exist.
>
> It may be a good idea to look more closely at these algorithms.
> The only one I know is described in structure.txt and
> doesn't seem to guarantee that only one partition gets the
> quorum.

Any method that doesn't guarantee that only one partition gets quorum
isn't a quorum algorithm - period. There are dozens of different quorum
methods, and they all have to guarantee this. I can assure you that
Stephen's methods do in fact guarantee this.

> > I think what Jerome is missing is that there is far
> > more going on at the time of a cluster transition
> > than just identifying the quorum holders.
>
> No, currently what I am missing is how a partition can be sure
> it has the quorum.

By using any one of the many variations on quorum methods. This is
their reason for existence. It is a tautology.

> > There may
> > be many sub-steps involved in the process, some just
> > to get to the quorum determination part, others to
> > establish fences, others to re-establish lock ownership,
> > etc.
>
> As you said in another email, all that happens -after- the
> partition knows if it has the quorum. So let's solve the quorum
> problem first.
>
> > Therefore, in Jerome's example, the node1 sub-cluster
> > is frozen while it makes its determination.
>
> As I explained in my example, during a delay, some nodes
> aren't aware there is a split going on, so they don't trigger
> any determination process which might freeze them.

And as has already been mentioned *several* times, if they don't
participate in the quorum process, they'll get killed. No problem.

I would recommend reading the Linux-HA-HOWTO document, and other
relevant literature on the subject.

-- Alan Robertson
alanr@suse.com
about quorum [ In reply to ]
On Tue, May 23, 2000 at 03:04:09PM -0700, David Brower wrote:
> I'm using lots of words to try out different
> explanations. Whenever you get it, please
> let us know which bit of information seemed
> like the critical missing piece to you.

Thanks for the explanation.
The generation mechanism is new to me; I have to think about it
to see if it solves my problem.
about quorum [ In reply to ]
On Wed, May 24, 2000 at 03:29:34AM -0600, Alan Robertson wrote:
> > It may be a good idea to look more closely at these algorithms.
> > The only one I know is described in structure.txt and
> > doesn't seem to guarantee that only one partition gets the
> > quorum.
>
> Any method that doesn't guarantee that only one partition gets quorum
> isn't a quorum algorithm - period. There are dozens of different quorum
> methods, and they all have to guarantee this. I can assure you that
> Stephen's methods do in fact guarantee this.

The example in my first email shows a case in which a node falsely
believes it has the quorum. Nobody has spotted a mistake in the
reasoning I exposed, so currently I assume it is correct.

To assure me that the algorithm described in structure.txt works, you
have to clearly demonstrate the mistakes I made in the example.
