Mailing List Archive

Proposed Consensus Cluster membership manager for heartbeat
Hi,

Heartbeat only tracks membership from the point of view of the local node. If
you have a well-behaved broadcast network, the local point of view should match
the cluster view of membership.

However, you may have asymmetric communication media, or you may have multicast
with a router in the middle.

In any case, it can happen for various reasons that not every node agrees on
cluster membership. This has various bad effects which I won't go into here.

Using the heartbeat API, one can now define such cluster wide services without
modifying heartbeat itself. I've named this service the consensus cluster
membership manager or ccmm.

This particular service is interesting, primarily because of the election
phases. The methodology I propose is below.

Begin Transaction Phase:

Send "begin transaction" packet to the CMM processes in the cluster
if you see a heartbeat membership transition or you receive a begin
transaction packet. One field in this packet is your cluster
membership "uptime". Only one packet is sent from each
machine for every transaction. If you're sending a start
transaction packet because you received one, then you should
introduce a small random delay before sending your own.
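
A minimal sketch of this phase in Python, not part of the original proposal.
The names BeginTransaction, BeginPhase, on_trigger and max_delay are
hypothetical, as is the 0.5 second upper bound on the delay; the actual
sending and the heartbeat API are left out.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BeginTransaction:
    sender: str      # name of the node sending the packet
    uptime: float    # that node's cluster membership "uptime", in seconds


class BeginPhase:
    """Tracks whether this node has already sent its one packet."""

    def __init__(self, node: str, uptime: float, max_delay: float = 0.5):
        self.node = node
        self.uptime = uptime
        self.max_delay = max_delay   # assumed upper bound on the random delay
        self.sent = False

    def on_trigger(
        self, because_packet_received: bool
    ) -> Optional[Tuple[float, BeginTransaction]]:
        """Return (delay_seconds, packet) to send, or None if already sent."""
        if self.sent:
            return None              # only one packet per machine per transaction
        self.sent = True
        # Delay only when reacting to another node's packet
        # (presumably so that replies do not all go out at once).
        if because_packet_received:
            delay = random.uniform(0.0, self.max_delay)
        else:
            delay = 0.0
        return delay, BeginTransaction(self.node, self.uptime)
```

The caller would wait for the returned delay and then broadcast the packet; a
second trigger within the same transaction returns None.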

Initial Election Phase:

This phase is entered when a start transaction packet is received
from every machine you believe to be up, or it has been at least "n"
seconds since the last "start transaction" packet was received.
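
The entry condition can be written as a simple predicate. This is only a
sketch: the function name and its parameters are hypothetical, and times are
assumed to be plain seconds on one clock.

```python
from typing import Set


def election_may_start(
    believed_up: Set[str],      # machines this node believes to be up
    heard_from: Set[str],       # machines whose start transaction packet arrived
    now: float,                 # current time, in seconds
    last_packet_time: float,    # arrival time of the most recent packet
    n_seconds: float,           # the "n" seconds timeout from the text
) -> bool:
    """True once the initial election phase may be entered."""
    everyone_reported = believed_up <= heard_from   # subset test
    timed_out = (now - last_packet_time) >= n_seconds
    return everyone_reported or timed_out
```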

Each machine looks at the set of "start transaction" packets it
got in the start phase, and votes for the machine with the longest
heartbeat uptime. In case of a tie, elect the machine with the
lowest name when sorted alphabetically.
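
A sketch of the voting rule in Python. The function name and the dictionary
of uptimes are hypothetical; "alphabetically" is taken here to mean plain
string comparison, which for ASCII names matches the ASCII collating sequence.

```python
from typing import Dict


def choose_vote(uptimes: Dict[str, float]) -> str:
    """Pick the node to vote for.

    uptimes maps node name -> heartbeat uptime taken from the
    start transaction packets received in the start phase.
    """
    if not uptimes:
        raise ValueError("no start transaction packets received")
    # Longest uptime first (hence the negation), then lowest name.
    return min(uptimes, key=lambda name: (-uptimes[name], name))
```

For example, with uptimes of 500 for node-a and node-b and 120 for node-c,
the vote goes to node-a.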

Leadership Voting Phase:

Broadcast your vote, along with the number of machines you received
start transaction packets from. That count is the weight given to
your vote.

[PERHAPS THE INITIAL ELECTION PHASE AND LEADERSHIP VOTING PHASE
COULD BE COMBINED?] The result would be that you couldn't weight
the votes by the number of machines that a given machine heard from during
this process... I wonder how important that is?

Final Election:

Each machine tallies the votes. A machine "wins" if it receives
the largest weighted vote total. ASCII collating sequence on node
name is used to break ties.
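
A sketch of the tally in Python, using the vote weights from the leadership
voting phase. The function name and data layout are hypothetical, and the
tie-break direction (lowest name wins) is an assumption carried over from the
initial election phase; the text only says the ASCII collating sequence is
used.

```python
from collections import defaultdict
from typing import Dict, Tuple


def tally(votes: Dict[str, Tuple[str, int]]) -> str:
    """Return the winning node.

    votes maps voter name -> (candidate voted for, weight), where weight
    is the number of machines the voter received start transaction
    packets from.
    """
    if not votes:
        raise ValueError("no votes received")
    totals: Dict[str, int] = defaultdict(int)
    for candidate, weight in votes.values():
        totals[candidate] += weight
    # Largest weighted total first, then lowest name (assumed direction).
    return min(totals, key=lambda name: (-totals[name], name))
```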

Everyone should elect the same leader. Any machine which
believes it is the leader sends out a final election notice.
Any node that hears two different final election notices sends an
abort transaction.
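
The abort check amounts to noticing more than one distinct claimed leader.
A tiny sketch, with a hypothetical function name:

```python
from typing import Iterable


def should_abort(final_notice_senders: Iterable[str]) -> bool:
    """True if final election notices name more than one leader.

    final_notice_senders holds the node name from each final election
    notice heard during this transaction (repeats are harmless).
    """
    return len(set(final_notice_senders)) > 1
```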

Now, the elected leader proceeds as below:

Request heartbeat-level (local) cluster membership bitmap from each node.

Each node introduces a small random delay before replying.

Wait until each node it believes to be up returns an answer, or "n" seconds have
elapsed.

The leader ANDs the results it got from all the responding machines, and clears
bits for machines which did not respond to its request. This will become the
Official Cluster Membership.
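
A sketch of the bitmap computation in Python. The representation is an
assumption: each node has an index, and bit i of a bitmap is set when node i
is seen as up. The function name and parameters are hypothetical.

```python
from typing import Dict


def official_membership(replies: Dict[int, int], node_count: int) -> int:
    """Combine the local membership bitmaps into the official one.

    replies maps node index -> the local membership bitmap that node
    returned. Nodes that did not answer are simply absent.
    """
    if not replies:
        return 0
    agreed = (1 << node_count) - 1       # start with every bit set
    responders = 0
    for index, bitmap in replies.items():
        agreed &= bitmap                 # AND the results together
        responders |= 1 << index         # remember who answered
    # Clear bits for machines that did not respond.
    return agreed & responders
```

With four nodes, if nodes 0, 1 and 2 reply with 0b0111, 0b1111 and 0b0111 and
node 3 stays silent, the result is 0b0111.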

It broadcasts the resulting bitmap to each machine in the cluster.

Each machine ACKs the new state (after a small random delay).

When it has received all the ACKs from the cluster members, the transaction
leader signals end-transaction.

If any machine receives heartbeat notification of changes in node reachability
after it received the notice of cluster leadership, then it sends an abort
transaction message.

If notification is received after the end-transaction, it goes to the begin
transaction phase and starts over...

It seems to me that this looks a lot like a specific case of the general n-phase
transaction service that Phoenix implements. The main difference between this
and the more general case is that we don't care who gets elected transaction
leader here. In the more general case, only certain nodes will want to be the
transaction leader, and will vie to be allowed to start a transaction by
nominating themselves in the begin transaction phase. This would be used as a
filter during the initial election phase.

Another difference is that the general transaction service would rely on this
level of cluster membership for its information about cluster membership, as
opposed to heartbeat's idea of cluster membership. This probably has some
subtle implications that escape me at the present time ;-)

I suppose I should go reread Stephen's proposal again...


-- Alan Robertson
alanr@suse.com
Re: Proposed Consensus Cluster membership manager for heartbeat
Alan Robertson wrote:

> It seems to me that this looks a lot like a specific case of the general n-phase
> transaction service that Phoenix implements. The main difference between this
> and the more general case is that we don't care who gets elected transaction
> leader here. In the more general case, only certain nodes will want to be the
> transaction leader, and will vie to be allowed to start a transaction by
> nominating themselves in the begin transaction phase. This would be used as a
> filter during the initial election phase.
>
> Another difference is that the general transaction service would rely on this
> level of cluster membership for its information about cluster membership, as
> opposed to heartbeat's idea of cluster membership. This probably has some
> subtle implications that escape me at the present time ;-)

I don't see how to plug in a cluster-aware service that needs global
recovery at membership change into this scheme. That's where the
n-phase of the phoenix CM is handy. Did you have a scheme in mind
for handling services that need actions before cluster transition
can be considered complete?

-dB
Re: Proposed Consensus Cluster membership manager for heartbeat
David Brower wrote:
>
> Alan Robertson wrote:
>
> > It seems to me that this looks a lot like a specific case of the general n-phase
> > transaction service that Phoenix implements. The main difference between this
> > and the more general case is that we don't care who gets elected transaction
> > leader here. In the more general case, only certain nodes will want to be the
> > transaction leader, and will vie to be allowed to start a transaction by
> > nominating themselves in the begin transaction phase. This would be used as a
> > filter during the initial election phase.
> >
> > Another difference is that the general transaction service would rely on this
> > level of cluster membership for its information about cluster membership, as
> > opposed to heartbeat's idea of cluster membership. This probably has some
> > subtle implications that escape me at the present time ;-)
>
> I don't see how to plug in a cluster-aware service that needs global
> recovery at membership change into this scheme. That's where the
> n-phase of the phoenix CM is handy. Did you have a scheme in mind
> for handling services that need actions before cluster transition
> can be considered complete?

It seems to me like you said a couple of things here:

1) Some cluster-aware apps need to know about transitions

2) Some apps need to be notified of the transition, then ack the transition
before it can be considered complete

As regards the first one, I haven't defined the API for talking to this level of
the software yet - this is more of a conceptual overview.

As regards the second one, this seems like an oversight on my part. Thanks for
the feedback. I'm glad to see you're still making time to read the list!

-- Alan Robertson
alanr@suse.com