Mailing List Archive

FailOver tests - Tiebreaker moves to StandAlone
Hi,

We are using DRBD with a RHEL cluster in two nodes + diskless tiebreaker setup: Corosync, Pacemaker and DRBD(node1, node2, tiebreaker-diskless), our versions are kmod-drbd-9.0.27 and drbd-utils-9.12.1
We are testing failover scenarios on a physical environment with RedHat 7.6 by executing hard shutdowns on node1 and node 2 (we always start from a stable state with all nodes connected and up2date).
From time to time when node2 is shut down (forced) the volume on node1 moves to state Consistent from Up2Date after which (a couple of milliseconds later) the tiebreaker reports an error resulting in the following drbdadm status:

node1
postgres-zabbix-data7790 role:Secondary suspended:quorum
disk:UpToDate quorum:no blocked:upper
node2 connection:Connecting
tiebreaker connection:Connecting

tiebreaker
postgres-zabbix-data7790 role:Secondary suspended:quorum
disk:Diskless quorum:no blocked:upper
node1 role:StandAlone
node2 connection:Connecting


resource postgres-zabbix-data7790 {

options {
quorum majority;
}
protocol C;

startup {
wfc-timeout 10;
degr-wfc-timeout 5;
}
net {
max-epoch-size 2048;
max-buffers 2048;
sndbuf-size 0;
rcvbuf-size 0;
}

disk {
on-io-error detach;
c-max-rate 900M;
c-min-rate 100M;
c-fill-target 1M;
resync-rate 300M;
}

on node1 {
device /dev/drbd7790;
disk /dev/mapper/lvmdata-postgres--zabbix--data7790;
node-id 1;
meta-disk internal;
address 10.21.24.11:7790;
}

on node2 {
device /dev/drbd7790;
disk /dev/mapper/lvmdata-postgres--zabbix--data7790;
meta-disk internal;
node-id 2;
address 10.21.24.12:7790;
}


on tiebreaker {
device /dev/drbd7790;
disk none;
meta-disk internal;
node-id 3;
address 10.21.24.13:7790;
}

connection-mesh {
hosts node1 node2 tiebreaker;
}
}

Tiebreaker logs:
2021-02-23T23:32:24.299214+00:00;ERR;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: PingAck did not arrive in time.;
2021-02-23T23:32:24.299244+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Would lose quorum, but using tiebreaker logic to keep;
2021-02-23T23:32:24.299266+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown );
2021-02-23T23:32:24.299305+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790 node2: pdsk( UpToDate -> DUnknown ) repl( Established -> Off );
2021-02-23T23:32:24.299326+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: ack_receiver terminated;
2021-02-23T23:32:24.299349+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: Terminating ack_recv thread;
2021-02-23T23:32:24.299372+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Would lose quorum, but using tiebreaker logic to keep;
2021-02-23T23:32:24.299394+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Would lose quorum, but using tiebreaker logic to keep;
2021-02-23T23:32:24.318231+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: Restarting sender thread;
2021-02-23T23:32:24.318268+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: Connection closed;
2021-02-23T23:32:24.318290+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Would lose quorum, but using tiebreaker logic to keep;
2021-02-23T23:32:24.318312+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: conn( NetworkFailure -> Unconnected );
2021-02-23T23:32:24.318337+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: Restarting receiver thread;
2021-02-23T23:32:24.318370+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Would lose quorum, but using tiebreaker logic to keep;
2021-02-23T23:32:24.318404+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: conn( Unconnected -> Connecting );
2021-02-23T23:32:24.803484+00:00;ERR;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Got NegDReply; Sector 0s, len 131072.;
2021-02-23T23:32:24.803557+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Would lose quorum, but using tiebreaker logic to keep;
2021-02-23T23:32:24.803587+00:00;ERR;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790: State change failed: Need access to UpToDate data;
2021-02-23T23:32:24.803628+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790 node1: Failed: pdsk( UpToDate -> Consistent );
2021-02-23T23:32:24.803652+00:00;ERR;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: drbd_req_destroy: Logic BUG rq_state: (0:300000, 2:104), completion_ref = 0;
2021-02-23T23:32:24.805509+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790: susp-io( no -> quorum);
2021-02-23T23:32:24.805540+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node1: conn( Connected -> Disconnecting ) peer( Secondary -> Unknown );
2021-02-23T23:32:24.805570+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: quorum( yes -> no );
2021-02-23T23:32:24.805599+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790 node1: pdsk( UpToDate -> DUnknown ) repl( Established -> Off );
2021-02-23T23:32:24.805627+00:00;ERR;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node1: error receiving P_STATE, e: -5 l: 0!;
2021-02-23T23:32:24.807322+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node1: ack_receiver terminated;
2021-02-23T23:32:24.807435+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node1: Terminating ack_recv thread;
2021-02-23T23:32:24.818354+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node1: Aborting remote state change 0 commit not possible;
2021-02-23T23:32:24.818465+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node1: Restarting sender thread;
2021-02-23T23:32:24.818542+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node1: Connection closed;
2021-02-23T23:32:24.818617+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node1: conn( Disconnecting -> StandAlone );
2021-02-23T23:32:24.818690+00:00;INFO;tiebreaker;P-/;[kernel/]; drbd postgres-zabbix-data7790 node1: Terminating receiver thread;

Node1 logs:
2021-02-23T23:32:24.800547+00:00;ERR;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: PingAck did not arrive in time.;
2021-02-23T23:32:24.800610+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Would lose quorum, but using tiebreaker logic to keep;
2021-02-23T23:32:24.800645+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown );
2021-02-23T23:32:24.800676+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: disk( UpToDate -> Consistent );
2021-02-23T23:32:24.800706+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790 node2: pdsk( UpToDate -> DUnknown ) repl( Established -> Off );
2021-02-23T23:32:24.800742+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: ack_receiver terminated;
2021-02-23T23:32:24.800771+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: Terminating ack_recv thread;
2021-02-23T23:32:24.800800+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Would lose quorum, but using tiebreaker logic to keep;
2021-02-23T23:32:24.800829+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Would lose quorum, but using tiebreaker logic to keep;
2021-02-23T23:32:24.800856+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790 tiebreaker: receive_peer_dagatg(): source-set-bitmap by rule 30;
2021-02-23T23:32:24.800886+00:00;ERR;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Can not satisfy peer's read request, no local data.;
2021-02-23T23:32:24.802664+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: Would lose quorum, but using tiebreaker logic to keep;
2021-02-23T23:32:24.811534+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: sock was shut down by peer;
2021-02-23T23:32:24.811594+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790: susp-io( no -> quorum);
2021-02-23T23:32:24.811626+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown );
2021-02-23T23:32:24.811662+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: quorum( yes -> no );
2021-02-23T23:32:24.811692+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790 tiebreaker: pdsk( Diskless -> DUnknown ) repl( Established -> Off );
2021-02-23T23:32:24.811721+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: ack_receiver terminated;
2021-02-23T23:32:24.811750+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: Terminating ack_recv thread;
2021-02-23T23:32:24.811780+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790: Preparing cluster-wide state change 3151158077 (1->-1 0/0);
2021-02-23T23:32:24.811814+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790: Aborting cluster-wide state change 3151158077 (9ms) rv = -19;
2021-02-23T23:32:24.816641+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: Aborting remote state change 0 commit not possible;
2021-02-23T23:32:24.816701+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: Restarting sender thread;
2021-02-23T23:32:24.816925+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: Connection closed;
2021-02-23T23:32:24.816962+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: conn( NetworkFailure -> Unconnected );
2021-02-23T23:32:24.816989+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: Restarting receiver thread;
2021-02-23T23:32:24.817018+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 node2: conn( Unconnected -> Connecting );
2021-02-23T23:32:24.826555+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: Aborting remote state change 0 commit not possible;
2021-02-23T23:32:24.826622+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: Restarting sender thread;
2021-02-23T23:32:24.826651+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: Connection closed;
2021-02-23T23:32:24.826681+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: conn( BrokenPipe -> Unconnected );
2021-02-23T23:32:24.826709+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: Restarting receiver thread;
2021-02-23T23:32:24.826745+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: conn( Unconnected -> Connecting );
2021-02-23T23:32:24.912545+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790: Preparing cluster-wide state change 1764698033 (1->-1 0/0);
2021-02-23T23:32:24.912605+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790: Committing cluster-wide state change 1764698033 (0ms);
2021-02-23T23:32:24.912637+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790/0 drbd7790: disk( Consistent -> UpToDate );
2021-02-23T23:32:34.322531+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: sock was shut down by peer;
2021-02-23T23:32:34.322577+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: conn( Connecting -> BrokenPipe );
2021-02-23T23:32:34.322599+00:00;WARNING;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: short read (expected size 8);
2021-02-23T23:32:34.345524+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: Aborting remote state change 0 commit not possible;
2021-02-23T23:32:34.345582+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: Restarting sender thread;
2021-02-23T23:32:34.347540+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: Connection closed;
2021-02-23T23:32:34.347607+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: conn( BrokenPipe -> Unconnected );
2021-02-23T23:32:35.347753+00:00;INFO;node1;P-/;[kernel/]; drbd postgres-zabbix-data7790 tiebreaker: conn( Unconnected -> Connecting );

Do you have any advice?

Best Regards,
Mihai