Mailing List Archive

linstor failure
I have a small test setup with two diskless linstor-satellite nodes and
four diskful linstor-satellite nodes, one of which also runs the
linstor-controller.

The idea is that the diskless nodes are the compute nodes (Xen, running
the VMs whose data is on LINSTOR resources).

I have two test VMs. One was (and still is) working OK (an older Debian
Linux guest, crossbowold); the other (a Windows 10 VM, jspiteriVM1)
failed while I was attempting to install the Xen PV drivers (not sure
whether that is relevant). The other two resources are unused (ns2 and
windows-wm).

There is nothing relevant in the LINSTOR error logs, but the LINSTOR
controller node has this in its kern.log:

Dec 30 10:50:44 castle kernel: [4103630.414725] drbd windows-wm
san6.mytest.com.au: sock was shut down by peer
Dec 30 10:50:44 castle kernel: [4103630.414752] drbd windows-wm
san6.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary ->
Unknown )
Dec 30 10:50:44 castle kernel: [4103630.414759] drbd windows-wm/0
drbd1001 san6.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl(
Established -> Off )
Dec 30 10:50:44 castle kernel: [4103630.414807] drbd windows-wm
san6.mytest.com.au: ack_receiver terminated
Dec 30 10:50:44 castle kernel: [4103630.414810] drbd windows-wm
san6.mytest.com.au: Terminating ack_recv thread
Dec 30 10:50:44 castle kernel: [4103630.445961] drbd windows-wm
san6.mytest.com.au: Restarting sender thread
Dec 30 10:50:44 castle kernel: [4103630.479708] drbd windows-wm
san6.mytest.com.au: Connection closed
Dec 30 10:50:44 castle kernel: [4103630.479739] drbd windows-wm
san6.mytest.com.au: helper command: /sbin/drbdadm disconnected
Dec 30 10:50:44 castle kernel: [4103630.486479] drbd windows-wm
san6.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Dec 30 10:50:44 castle kernel: [4103630.486533] drbd windows-wm
san6.mytest.com.au: conn( BrokenPipe -> Unconnected )
Dec 30 10:50:44 castle kernel: [4103630.486556] drbd windows-wm
san6.mytest.com.au: Restarting receiver thread
Dec 30 10:50:44 castle kernel: [4103630.486566] drbd windows-wm
san6.mytest.com.au: conn( Unconnected -> Connecting )
Dec 30 10:50:44 castle kernel: [4103631.006727] drbd windows-wm
san6.mytest.com.au: Handshake to peer 2 successful: Agreed network
protocol version 117
Dec 30 10:50:44 castle kernel: [4103631.006735] drbd windows-wm
san6.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Dec 30 10:50:44 castle kernel: [4103631.006918] drbd windows-wm
san6.mytest.com.au: Peer authenticated using 20 bytes HMAC
Dec 30 10:50:44 castle kernel: [4103631.006943] drbd windows-wm
san6.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1164])
Dec 30 10:50:44 castle kernel: [4103631.041925] drbd windows-wm/0
drbd1001 san6.mytest.com.au: drbd_sync_handshake:
Dec 30 10:50:44 castle kernel: [4103631.041932] drbd windows-wm/0
drbd1001 san6.mytest.com.au: self
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000
bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041937] drbd windows-wm/0
drbd1001 san6.mytest.com.au: peer
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000
bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041941] drbd windows-wm/0
drbd1001 san6.mytest.com.au: uuid_compare()=no-sync by rule 38
Dec 30 10:50:44 castle kernel: [4103631.229931] drbd windows-wm:
Preparing cluster-wide state change 1880606796 (0->2 499/146)
Dec 30 10:50:44 castle kernel: [4103631.230424] drbd windows-wm: State
change 1880606796: primary_nodes=0, weak_nodes=0
Dec 30 10:50:44 castle kernel: [4103631.230429] drbd windows-wm:
Committing cluster-wide state change 1880606796 (0ms)
Dec 30 10:50:44 castle kernel: [4103631.230480] drbd windows-wm
san6.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown ->
Secondary )
Dec 30 10:50:44 castle kernel: [4103631.230486] drbd windows-wm/0
drbd1001 san6.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off ->
Established )
Dec 30 10:58:27 castle kernel: [4104093.577650] drbd jspiteriVM1
xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 10:58:27 castle kernel: [4104093.790062] drbd jspiteriVM1/0
drbd1011: bitmap WRITE of 327 pages took 216 ms
Dec 30 10:58:39 castle kernel: [4104106.278699] drbd jspiteriVM1
xen1.mytest.com.au: Preparing remote state change 490644362
Dec 30 10:58:39 castle kernel: [4104106.278984] drbd jspiteriVM1
xen1.mytest.com.au: Committing remote state change 490644362
(primary_nodes=10)
Dec 30 10:58:39 castle kernel: [4104106.278999] drbd jspiteriVM1
xen1.mytest.com.au: peer( Secondary -> Primary )
Dec 30 10:58:40 castle kernel: [4104106.547178] drbd jspiteriVM1/0
drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Dec 30 10:58:40 castle kernel: [4104106.547191] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: repl( PausedSyncT -> SyncTarget )
resync-susp( peer -> no )
Dec 30 10:58:40 castle kernel: [4104106.547198] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Syncer continues.
Dec 30 11:04:29 castle kernel: [4104456.362585] drbd jspiteriVM1
xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 11:04:30 castle kernel: [4104456.388543] drbd jspiteriVM1/0
drbd1011: bitmap WRITE of 1 pages took 24 ms
Dec 30 11:04:30 castle kernel: [4104456.401108] drbd jspiteriVM1/0
drbd1011 san6.mytest.com.au: pdsk( UpToDate -> Outdated )
Dec 30 11:04:30 castle kernel: [4104456.788360] drbd jspiteriVM1/0
drbd1011 san6.mytest.com.au: pdsk( Outdated -> Inconsistent )
Dec 30 11:09:15 castle kernel: [4104742.275721] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:15 castle kernel: [4104742.377977] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:16 castle kernel: [4104742.481920] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=3
Dec 30 11:09:16 castle kernel: [4104742.585933] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=4
Dec 30 11:09:16 castle kernel: [4104742.689909] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104742.793898] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104742.897895] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.001927] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.105909] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.209908] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.313927] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.417897] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.521909] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.575764] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.625902] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.729908] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.833894] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.937890] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104744.041907] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
[this line repeats until Jan 2 02:33, probably when I rebooted the node]

Jan  2 02:33:46 castle kernel: [4333012.494110] drbd jspiteriVM1
san5.mytest.com.au: Restarting sender thread
Jan  2 02:33:46 castle kernel: [4333012.528437] drbd jspiteriVM1
san5.mytest.com.au: Connection closed
Jan  2 02:33:46 castle kernel: [4333012.528447] drbd jspiteriVM1
san5.mytest.com.au: helper command: /sbin/drbdadm disconnected
Jan  2 02:33:46 castle kernel: [4333012.530942] drbd jspiteriVM1
san5.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Jan  2 02:33:46 castle kernel: [4333012.530960] drbd jspiteriVM1
san5.mytest.com.au: conn( BrokenPipe -> Unconnected )
Jan  2 02:33:46 castle kernel: [4333012.530970] drbd jspiteriVM1
san5.mytest.com.au: Restarting receiver thread
Jan  2 02:33:46 castle kernel: [4333012.530974] drbd jspiteriVM1
san5.mytest.com.au: conn( Unconnected -> Connecting )
Jan  2 02:33:46 castle kernel: [4333013.054060] drbd jspiteriVM1
san5.mytest.com.au: Handshake to peer 1 successful: Agreed network
protocol version 117
Jan  2 02:33:46 castle kernel: [4333013.054067] drbd jspiteriVM1
san5.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan  2 02:33:46 castle kernel: [4333013.054426] drbd jspiteriVM1
san5.mytest.com.au: Peer authenticated using 20 bytes HMAC
Jan  2 02:33:46 castle kernel: [4333013.054452] drbd jspiteriVM1
san5.mytest.com.au: Starting ack_recv thread (from drbd_r_jspiteri [1046])
Jan  2 02:33:46 castle kernel: [4333013.085933] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: drbd_sync_handshake:
Jan  2 02:33:46 castle kernel: [4333013.085941] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: self
122E90789B3D90E2:122E90789B3D90E3:4D2D1C8F63C38B44:B1B847713A96996E
bits:21168661 flags:124
Jan  2 02:33:46 castle kernel: [4333013.085946] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: peer
2B520E804A7D4EAC:0000000000000000:4D2D1C8F63C38B44:B1B847713A96996E
bits:21168661 flags:124
Jan  2 02:33:46 castle kernel: [4333013.085952] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: uuid_compare()=target-set-bitmap by rule 60
Jan  2 02:33:46 castle kernel: [4333013.085956] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Setting and writing one bitmap slot, after
drbd_sync_handshake
Jan  2 02:33:46 castle kernel: [4333013.226948] drbd jspiteriVM1/0
drbd1011: bitmap WRITE of 1078 pages took 88 ms
Jan  2 02:33:46 castle kernel: [4333013.278401] drbd jspiteriVM1:
Preparing cluster-wide state change 3482568163 (0->1 499/146)
Jan  2 02:33:46 castle kernel: [4333013.278980] drbd jspiteriVM1: State
change 3482568163: primary_nodes=0, weak_nodes=0
Jan  2 02:33:46 castle kernel: [4333013.278985] drbd jspiteriVM1:
Committing cluster-wide state change 3482568163 (0ms)
Jan  2 02:33:46 castle kernel: [4333013.279050] drbd jspiteriVM1
san5.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown ->
Secondary )
Jan  2 02:33:46 castle kernel: [4333013.279055] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: repl( Off -> WFBitMapT )
Jan  2 02:33:46 castle kernel: [4333013.326494] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: receive bitmap stats [Bytes(packets)]:
plain 0(0), RLE 23(1), total 23; compression: 100.0%
Jan  2 02:33:46 castle kernel: [4333013.337300] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: send bitmap stats [Bytes(packets)]: plain
0(0), RLE 23(1), total 23; compression: 100.0%
Jan  2 02:33:46 castle kernel: [4333013.337313] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm
before-resync-target
Jan  2 02:33:46 castle kernel: [4333013.339475] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm
before-resync-target exit code 0
Jan  2 02:33:46 castle kernel: [4333013.339503] drbd jspiteriVM1/0
drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339504] drbd jspiteriVM1/0
drbd1011 san7.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339505] drbd jspiteriVM1/0
drbd1011 san6.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339507] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: repl( WFBitMapT -> SyncTarget )
Jan  2 02:33:46 castle kernel: [4333013.339552] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Began resync as SyncTarget (will sync
104859732 KB [26214933 bits set]).
Jan  2 02:50:55 castle kernel: [4334042.151194] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Jan  2 02:50:55 castle kernel: [4334042.254225] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Resync done (total 1028 sec; paused 0 sec;
102000 K/sec)
Jan  2 02:50:55 castle kernel: [4334042.254230] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: expected n_oos:23691797 to be equal to
rs_failed:23727152
Jan  2 02:50:55 castle kernel: [4334042.254232] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au:             23727152 failed blocks
Jan  2 02:50:55 castle kernel: [4334042.254245] drbd jspiteriVM1/0
drbd1011 xen1.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254247] drbd jspiteriVM1/0
drbd1011 san7.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254249] drbd jspiteriVM1/0
drbd1011 san6.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254252] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: pdsk( Outdated -> UpToDate ) repl(
SyncTarget -> Established )
Jan  2 02:50:55 castle kernel: [4334042.281495] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm
after-resync-target
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm
after-resync-target exit code 0
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: pdsk( UpToDate -> Inconsistent )
Jan  2 10:23:28 castle kernel: [4361194.855074] drbd windows-wm
san7.mytest.com.au: sock was shut down by peer
Jan  2 10:23:28 castle kernel: [4361194.855101] drbd windows-wm
san7.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary ->
Unknown )
Jan  2 10:23:28 castle kernel: [4361194.855109] drbd windows-wm/0
drbd1001 san7.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl(
Established -> Off )
Jan  2 10:23:28 castle kernel: [4361194.855161] drbd windows-wm
san7.mytest.com.au: ack_receiver terminated
Jan  2 10:23:28 castle kernel: [4361194.855164] drbd windows-wm
san7.mytest.com.au: Terminating ack_recv thread
Jan  2 10:23:28 castle kernel: [4361194.882138] drbd windows-wm
san7.mytest.com.au: Restarting sender thread
Jan  2 10:23:28 castle kernel: [4361194.961402] drbd windows-wm
san7.mytest.com.au: Connection closed
Jan  2 10:23:28 castle kernel: [4361194.961435] drbd windows-wm
san7.mytest.com.au: helper command: /sbin/drbdadm disconnected
Jan  2 10:23:28 castle kernel: [4361194.968763] drbd windows-wm
san7.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Jan  2 10:23:28 castle kernel: [4361194.968800] drbd windows-wm
san7.mytest.com.au: conn( BrokenPipe -> Unconnected )
Jan  2 10:23:28 castle kernel: [4361194.968812] drbd windows-wm
san7.mytest.com.au: Restarting receiver thread
Jan  2 10:23:28 castle kernel: [4361194.968816] drbd windows-wm
san7.mytest.com.au: conn( Unconnected -> Connecting )
Jan  2 10:23:29 castle kernel: [4361195.486059] drbd windows-wm
san7.mytest.com.au: Handshake to peer 3 successful: Agreed network
protocol version 117
Jan  2 10:23:29 castle kernel: [4361195.486066] drbd windows-wm
san7.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan  2 10:23:29 castle kernel: [4361195.486490] drbd windows-wm
san7.mytest.com.au: Peer authenticated using 20 bytes HMAC
Jan  2 10:23:29 castle kernel: [4361195.486515] drbd windows-wm
san7.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1165])
Jan  2 10:23:29 castle kernel: [4361195.517928] drbd windows-wm/0
drbd1001 san7.mytest.com.au: drbd_sync_handshake:
Jan  2 10:23:29 castle kernel: [4361195.517935] drbd windows-wm/0
drbd1001 san7.mytest.com.au: self
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000
bits:0 flags:120
Jan  2 10:23:29 castle kernel: [4361195.517940] drbd windows-wm/0
drbd1001 san7.mytest.com.au: peer
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000
bits:0 flags:120
Jan  2 10:23:29 castle kernel: [4361195.517944] drbd windows-wm/0
drbd1001 san7.mytest.com.au: uuid_compare()=no-sync by rule 38
Jan  2 10:23:29 castle kernel: [4361195.677932] drbd windows-wm:
Preparing cluster-wide state change 3667329610 (0->3 499/146)
Jan  2 10:23:29 castle kernel: [4361195.678459] drbd windows-wm: State
change 3667329610: primary_nodes=0, weak_nodes=0
Jan  2 10:23:29 castle kernel: [4361195.678466] drbd windows-wm:
Committing cluster-wide state change 3667329610 (0ms)
Jan  2 10:23:29 castle kernel: [4361195.678516] drbd windows-wm
san7.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown ->
Secondary )
Jan  2 10:23:29 castle kernel: [4361195.678522] drbd windows-wm/0
drbd1001 san7.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off ->
Established )

castle:/var/log# linstor resource list
| ResourceName | Node   | Port | Usage  | Conns | State | CreatedOn |
|--------------+--------+------+--------+-------+-------+-----------|
| crossbowold  | castle | 7010 | Unused | Ok | UpToDate | 2020-10-07 00:46:23 |
| crossbowold  | flail  | 7010 | Unused | Ok | Diskless | 2021-01-04 05:03:20 |
| crossbowold  | san5   | 7010 | Unused | Ok | UpToDate | 2020-10-07 00:46:23 |
| crossbowold  | san6   | 7010 | Unused | Ok | UpToDate | 2020-10-07 00:46:22 |
| crossbowold  | san7   | 7010 | Unused | Ok | UpToDate | 2020-10-07 00:46:21 |
| crossbowold  | xen1   | 7010 | InUse  | Ok | Diskless | 2020-10-15 00:30:31 |
| jspiteriVM1  | castle | 7011 | Unused | StandAlone(san6.mytest.com.au,san7.mytest.com.au) | SyncTarget(0.00%) | 2020-10-14 22:15:00 |
| jspiteriVM1  | san5   | 7011 | Unused | Connecting(san7.mytest.com.au) | Inconsistent | 2020-10-14 22:14:59 |
| jspiteriVM1  | san6   | 7011 | Unused | Connecting(castle.mytest.com.au,san7.mytest.com.au) | SyncTarget(0.00%) | 2020-10-14 22:14:58 |
| jspiteriVM1  | san7   | 7011 | Unused | Connecting(castle.mytest.com.au),StandAlone(san6.mytest.com.au,san5.mytest.com.au) | Inconsistent | 2020-10-14 22:14:58 |
| jspiteriVM1  | xen1   | 7011 | Unused | Ok | Diskless | 2020-11-20 20:39:20 |
| ns2          | castle | 7000 | Unused | Ok | UpToDate | 2020-10-28 23:22:13 |
| ns2          | flail  | 7000 | Unused | Ok | Diskless | 2021-01-04 05:03:42 |
| ns2          | san5   | 7000 | Unused | Ok | UpToDate | 2020-10-28 23:22:12 |
| ns2          | san6   | 7000 | Unused | Ok | UpToDate | 2020-10-28 23:22:11 |
| ns2          | xen1   | 7000 | Unused | Ok | Diskless | 2020-10-28 23:30:20 |
| windows-wm   | castle | 7001 | Unused | Ok | UpToDate | 2020-09-30 00:03:41 |
| windows-wm   | flail  | 7001 | Unused | Ok | Diskless | 2021-01-04 05:03:48 |
| windows-wm   | san5   | 7001 | Unused | Ok | UpToDate | 2020-09-30 00:03:40 |
| windows-wm   | san6   | 7001 | Unused | Ok | UpToDate | 2020-09-30 00:03:39 |
| windows-wm   | san7   | 7001 | Unused | Ok | UpToDate | 2020-09-30 00:13:05 |

Could anyone determine from this, or advise what additional logs I
should examine, to work out why this failed? I don't see anything
obvious as to what caused linstor/drbd to fail here; all nodes were
online and uninterrupted as far as I can tell. All physical storage is
backed by MD RAID arrays, so there is some protection against disk
failures (I haven't noticed any, in any case).
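In case it matters, this is roughly how I have been collecting the
information above (a sketch only; log path and resource name are as on
my Debian nodes, adjust to suit):

```shell
# Filter the kernel log down to DRBD messages for the failing resource
grep 'drbd jspiteriVM1' /var/log/kern.log | tail -n 100

# Ask DRBD directly for its current view of the resource, per peer
drbdsetup status jspiteriVM1 --verbose --statistics

# List any error reports LINSTOR has recorded on the controller
linstor error-reports list
```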

I've since upgraded all nodes to the latest versions of the DRBD and
LINSTOR components.

Finally, what could I do to recover the data? Has it been destroyed, or
do I just need to select a node and tell linstor that this node has
up-to-date data? Or can linstor work that out somehow?
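For what it's worth, if I do have to pick a node and declare its copy
authoritative, this is the sort of manual recovery I had in mind
(untested sketch; as I understand it, `drbdadm invalidate` discards the
local copy on the node it runs on and triggers a full resync from the
peers, so it must be run on the stale nodes, never the good one):

```shell
# Untested sketch: run on each node whose copy of jspiteriVM1 is stale,
# AFTER deciding which node holds the good copy.
# WARNING: 'invalidate' throws away the local data on this node.
drbdadm disconnect jspiteriVM1   # detach from peers first
drbdadm invalidate jspiteriVM1   # mark the local disk Inconsistent
drbdadm connect jspiteriVM1      # reconnect; this node resyncs as target
```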

Regards,
Adam

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user