Hi,
drbd90 kernel module version:9.0.22-2
drbd90-utils:9.12.2-1
kernel:3.10.0-1127.18.2.el7.x86_64
pacemaker:1.1.21-4
corosync:2.4.5-4
OS: CentOS 7.6
I have a 4-node test system (only ever one active primary) which is going split-brain unexpectedly.
n1 is the primary; n2, n3 and n4 are secondaries.
The system is shut down every night, and sometimes on restart (particularly after a weekend shutdown) some of the nodes are split-brain and require a full resync to fix.
The logs seem to indicate a problem with uuid_compare.
From the system log on n1, for the connection to n3:
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: drbd_sync_handshake:
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: self 30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: peer 554921683EF7CC82:0000000000000000:272E3DE9D9C74A66:04B370F60768109E bits:0 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: uuid_compare()=split-brain-disconnect by rule 100
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: helper command: /sbin/drbdadm initial-split-brain
Then, also from the log on n1, for the connection to n2:
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: drbd_sync_handshake:
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: self 30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: peer BC13E2E36CA8B2C6:CE02E3A41E743EDA:272E3DE9D9C74A66:001E2864952E2E96 bits:416 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: uuid_compare()=split-brain-auto-recover by rule 90
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm initial-split-brain
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: meta connection shut down by peer.
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: ack_receiver terminated
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Terminating ack_recv thread
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm initial-split-brain exit code 0
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0: Split-Brain detected but unresolved, dropping connection!
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm split-brain
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm split-brain exit code 0
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( NetworkFailure -> Disconnecting )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: error receiving P_STATE, e: -5 l: 0!
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Restarting sender thread
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Connection closed
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( Disconnecting -> StandAlone )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Terminating receiver thread
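To make the self/peer UUID lines above easier to compare by eye, here is a minimal Python sketch that extracts them from the log and lists which UUID values the two sides have in common. The function names and the regular expression are my own for illustration, not part of DRBD, and the script only compares values by position; it does not try to reproduce the kernel's uuid_compare() rules.

```python
import re

# Matches the "self"/"peer" lines logged after drbd_sync_handshake.
# The pattern is an assumption based only on the log lines quoted in this post.
HANDSHAKE_RE = re.compile(
    r"drbd \S+ \S+ (?P<node>\S+): (?P<side>self|peer) "
    r"(?P<uuids>[0-9A-F]{16}(?::[0-9A-F]{16})*) bits:(?P<bits>\d+)"
)

ZERO = "0" * 16  # an all-zero slot is treated as "unset" and never counted as shared


def parse_handshake(lines):
    """Return {node: {"self": [uuid, ...], "peer": [uuid, ...]}}."""
    result = {}
    for line in lines:
        m = HANDSHAKE_RE.search(line)
        if m:
            result.setdefault(m["node"], {})[m["side"]] = m["uuids"].split(":")
    return result


def shared_uuids(self_uuids, peer_uuids):
    """List (self_index, peer_index, uuid) for every non-zero value on both sides."""
    return [
        (i, j, u)
        for i, u in enumerate(self_uuids)
        for j, v in enumerate(peer_uuids)
        if u == v and u != ZERO
    ]


# The four UUID lines from the log excerpts above.
SAMPLE = [
    "Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: self 30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20",
    "Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: peer 554921683EF7CC82:0000000000000000:272E3DE9D9C74A66:04B370F60768109E bits:0 flags:20",
    "Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: self 30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20",
    "Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: peer BC13E2E36CA8B2C6:CE02E3A41E743EDA:272E3DE9D9C74A66:001E2864952E2E96 bits:416 flags:20",
]

if __name__ == "__main__":
    for node, sides in parse_handshake(SAMPLE).items():
        print(node, shared_uuids(sides["self"], sides["peer"]))
```

On the lines quoted here, n1 and n2 have one value in common (CE02E3A41E743EDA, second field on both sides), while n1 and n3 have none.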
The logs also contain FIXME messages (which may be unrelated), e.g.:
Oct 13 12:50:42 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[97260] op clear, bitmap locked for 'send_bitmap (WFBitMapS)' by drbd_w_r0[1659]
Oct 13 12:50:42 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[97260] op clear, bitmap locked for 'receive bitmap' by drbd_r_r0[95684]
Sep 23 12:41:34 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[19978] op clear, bitmap locked for 'set_n_write from sync_handshake' by drbd_r_r0[17628]
Regards,
Jeremy Faith