Mailing List Archive

linstor failure
Reposting the below, as I guess early January wasn't the best time to
get any responses. I'd really appreciate any assistance, as I'd prefer
to avoid rebuilding the VM from scratch (wasted hours, not lost data).
I'd also like to know how to resolve or avoid this issue in the future,
when I actually have real data being stored.

Thanks,
Adam


I have a small test setup with 2 x diskless linstor-satellite nodes, and
4 x diskful linstor-satellite nodes, one of which is the linstor-controller.


The idea is that the diskless nodes are the compute nodes (xen, running
the VMs whose data is on linstor resources).

I have 2 x test VMs. One was (and still is) working OK (an older Debian
Linux VM, crossbowold). The other (a Windows 10 VM, jspiteriVM1) failed
while I was attempting to install the xen PV drivers (not sure if that
is relevant or not). The other two resources are unused (ns2 and
windows-wm).

I have nothing relevant in the linstor error logs, but the linstor
controller node has this in its kern.log:

Dec 30 10:50:44 castle kernel: [4103630.414725] drbd windows-wm
san6.mytest.com.au: sock was shut down by peer
Dec 30 10:50:44 castle kernel: [4103630.414752] drbd windows-wm
san6.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary ->
Unknown )
Dec 30 10:50:44 castle kernel: [4103630.414759] drbd windows-wm/0
drbd1001 san6.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl(
Established -> Off )
Dec 30 10:50:44 castle kernel: [4103630.414807] drbd windows-wm
san6.mytest.com.au: ack_receiver terminated
Dec 30 10:50:44 castle kernel: [4103630.414810] drbd windows-wm
san6.mytest.com.au: Terminating ack_recv thread
Dec 30 10:50:44 castle kernel: [4103630.445961] drbd windows-wm
san6.mytest.com.au: Restarting sender thread
Dec 30 10:50:44 castle kernel: [4103630.479708] drbd windows-wm
san6.mytest.com.au: Connection closed
Dec 30 10:50:44 castle kernel: [4103630.479739] drbd windows-wm
san6.mytest.com.au: helper command: /sbin/drbdadm disconnected
Dec 30 10:50:44 castle kernel: [4103630.486479] drbd windows-wm
san6.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Dec 30 10:50:44 castle kernel: [4103630.486533] drbd windows-wm
san6.mytest.com.au: conn( BrokenPipe -> Unconnected )
Dec 30 10:50:44 castle kernel: [4103630.486556] drbd windows-wm
san6.mytest.com.au: Restarting receiver thread
Dec 30 10:50:44 castle kernel: [4103630.486566] drbd windows-wm
san6.mytest.com.au: conn( Unconnected -> Connecting )
Dec 30 10:50:44 castle kernel: [4103631.006727] drbd windows-wm
san6.mytest.com.au: Handshake to peer 2 successful: Agreed network
protocol version 117
Dec 30 10:50:44 castle kernel: [4103631.006735] drbd windows-wm
san6.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Dec 30 10:50:44 castle kernel: [4103631.006918] drbd windows-wm
san6.mytest.com.au: Peer authenticated using 20 bytes HMAC
Dec 30 10:50:44 castle kernel: [4103631.006943] drbd windows-wm
san6.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1164])
Dec 30 10:50:44 castle kernel: [4103631.041925] drbd windows-wm/0
drbd1001 san6.mytest.com.au: drbd_sync_handshake:
Dec 30 10:50:44 castle kernel: [4103631.041932] drbd windows-wm/0
drbd1001 san6.mytest.com.au: self
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000
bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041937] drbd windows-wm/0
drbd1001 san6.mytest.com.au: peer
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000
bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041941] drbd windows-wm/0
drbd1001 san6.mytest.com.au: uuid_compare()=no-sync by rule 38
Dec 30 10:50:44 castle kernel: [4103631.229931] drbd windows-wm:
Preparing cluster-wide state change 1880606796 (0->2 499/146)
Dec 30 10:50:44 castle kernel: [4103631.230424] drbd windows-wm: State
change 1880606796: primary_nodes=0, weak_nodes=0
Dec 30 10:50:44 castle kernel: [4103631.230429] drbd windows-wm:
Committing cluster-wide state change 1880606796 (0ms)
Dec 30 10:50:44 castle kernel: [4103631.230480] drbd windows-wm
san6.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown ->
Secondary )
Dec 30 10:50:44 castle kernel: [4103631.230486] drbd windows-wm/0
drbd1001 san6.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off ->
Established )
Dec 30 10:58:27 castle kernel: [4104093.577650] drbd jspiteriVM1
xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 10:58:27 castle kernel: [4104093.790062] drbd jspiteriVM1/0
drbd1011: bitmap WRITE of 327 pages took 216 ms
Dec 30 10:58:39 castle kernel: [4104106.278699] drbd jspiteriVM1
xen1.mytest.com.au: Preparing remote state change 490644362
Dec 30 10:58:39 castle kernel: [4104106.278984] drbd jspiteriVM1
xen1.mytest.com.au: Committing remote state change 490644362
(primary_nodes=10)
Dec 30 10:58:39 castle kernel: [4104106.278999] drbd jspiteriVM1
xen1.mytest.com.au: peer( Secondary -> Primary )
Dec 30 10:58:40 castle kernel: [4104106.547178] drbd jspiteriVM1/0
drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Dec 30 10:58:40 castle kernel: [4104106.547191] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: repl( PausedSyncT -> SyncTarget )
resync-susp( peer -> no )
Dec 30 10:58:40 castle kernel: [4104106.547198] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Syncer continues.
Dec 30 11:04:29 castle kernel: [4104456.362585] drbd jspiteriVM1
xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 11:04:30 castle kernel: [4104456.388543] drbd jspiteriVM1/0
drbd1011: bitmap WRITE of 1 pages took 24 ms
Dec 30 11:04:30 castle kernel: [4104456.401108] drbd jspiteriVM1/0
drbd1011 san6.mytest.com.au: pdsk( UpToDate -> Outdated )
Dec 30 11:04:30 castle kernel: [4104456.788360] drbd jspiteriVM1/0
drbd1011 san6.mytest.com.au: pdsk( Outdated -> Inconsistent )
Dec 30 11:09:15 castle kernel: [4104742.275721] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:15 castle kernel: [4104742.377977] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:16 castle kernel: [4104742.481920] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=3
Dec 30 11:09:16 castle kernel: [4104742.585933] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=4
Dec 30 11:09:16 castle kernel: [4104742.689909] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104742.793898] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104742.897895] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.001927] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.105909] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.209908] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.313927] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.417897] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.521909] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.575764] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.625902] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.729908] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.833894] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.937890] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104744.041907] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
[... this line repeats until Jan 2, 2:33am, probably when I rebooted it]

Jan  2 02:33:46 castle kernel: [4333012.494110] drbd jspiteriVM1
san5.mytest.com.au: Restarting sender thread
Jan  2 02:33:46 castle kernel: [4333012.528437] drbd jspiteriVM1
san5.mytest.com.au: Connection closed
Jan  2 02:33:46 castle kernel: [4333012.528447] drbd jspiteriVM1
san5.mytest.com.au: helper command: /sbin/drbdadm disconnected
Jan  2 02:33:46 castle kernel: [4333012.530942] drbd jspiteriVM1
san5.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Jan  2 02:33:46 castle kernel: [4333012.530960] drbd jspiteriVM1
san5.mytest.com.au: conn( BrokenPipe -> Unconnected )
Jan  2 02:33:46 castle kernel: [4333012.530970] drbd jspiteriVM1
san5.mytest.com.au: Restarting receiver thread
Jan  2 02:33:46 castle kernel: [4333012.530974] drbd jspiteriVM1
san5.mytest.com.au: conn( Unconnected -> Connecting )
Jan  2 02:33:46 castle kernel: [4333013.054060] drbd jspiteriVM1
san5.mytest.com.au: Handshake to peer 1 successful: Agreed network
protocol version 117
Jan  2 02:33:46 castle kernel: [4333013.054067] drbd jspiteriVM1
san5.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan  2 02:33:46 castle kernel: [4333013.054426] drbd jspiteriVM1
san5.mytest.com.au: Peer authenticated using 20 bytes HMAC
Jan  2 02:33:46 castle kernel: [4333013.054452] drbd jspiteriVM1
san5.mytest.com.au: Starting ack_recv thread (from drbd_r_jspiteri [1046])
Jan  2 02:33:46 castle kernel: [4333013.085933] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: drbd_sync_handshake:
Jan  2 02:33:46 castle kernel: [4333013.085941] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: self
122E90789B3D90E2:122E90789B3D90E3:4D2D1C8F63C38B44:B1B847713A96996E
bits:21168661 flags:124
Jan  2 02:33:46 castle kernel: [4333013.085946] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: peer
2B520E804A7D4EAC:0000000000000000:4D2D1C8F63C38B44:B1B847713A96996E
bits:21168661 flags:124
Jan  2 02:33:46 castle kernel: [4333013.085952] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: uuid_compare()=target-set-bitmap by rule 60
Jan  2 02:33:46 castle kernel: [4333013.085956] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Setting and writing one bitmap slot, after
drbd_sync_handshake
Jan  2 02:33:46 castle kernel: [4333013.226948] drbd jspiteriVM1/0
drbd1011: bitmap WRITE of 1078 pages took 88 ms
Jan  2 02:33:46 castle kernel: [4333013.278401] drbd jspiteriVM1:
Preparing cluster-wide state change 3482568163 (0->1 499/146)
Jan  2 02:33:46 castle kernel: [4333013.278980] drbd jspiteriVM1: State
change 3482568163: primary_nodes=0, weak_nodes=0
Jan  2 02:33:46 castle kernel: [4333013.278985] drbd jspiteriVM1:
Committing cluster-wide state change 3482568163 (0ms)
Jan  2 02:33:46 castle kernel: [4333013.279050] drbd jspiteriVM1
san5.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown ->
Secondary )
Jan  2 02:33:46 castle kernel: [4333013.279055] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: repl( Off -> WFBitMapT )
Jan  2 02:33:46 castle kernel: [4333013.326494] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: receive bitmap stats [Bytes(packets)]:
plain 0(0), RLE 23(1), total 23; compression: 100.0%
Jan  2 02:33:46 castle kernel: [4333013.337300] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: send bitmap stats [Bytes(packets)]: plain
0(0), RLE 23(1), total 23; compression: 100.0%
Jan  2 02:33:46 castle kernel: [4333013.337313] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm
before-resync-target
Jan  2 02:33:46 castle kernel: [4333013.339475] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm
before-resync-target exit code 0
Jan  2 02:33:46 castle kernel: [4333013.339503] drbd jspiteriVM1/0
drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339504] drbd jspiteriVM1/0
drbd1011 san7.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339505] drbd jspiteriVM1/0
drbd1011 san6.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339507] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: repl( WFBitMapT -> SyncTarget )
Jan  2 02:33:46 castle kernel: [4333013.339552] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Began resync as SyncTarget (will sync
104859732 KB [26214933 bits set]).
Jan  2 02:50:55 castle kernel: [4334042.151194] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Jan  2 02:50:55 castle kernel: [4334042.254225] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Resync done (total 1028 sec; paused 0 sec;
102000 K/sec)
Jan  2 02:50:55 castle kernel: [4334042.254230] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: expected n_oos:23691797 to be equal to
rs_failed:23727152
Jan  2 02:50:55 castle kernel: [4334042.254232] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au:             23727152 failed blocks
Jan  2 02:50:55 castle kernel: [4334042.254245] drbd jspiteriVM1/0
drbd1011 xen1.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254247] drbd jspiteriVM1/0
drbd1011 san7.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254249] drbd jspiteriVM1/0
drbd1011 san6.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254252] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: pdsk( Outdated -> UpToDate ) repl(
SyncTarget -> Established )
Jan  2 02:50:55 castle kernel: [4334042.281495] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm
after-resync-target
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm
after-resync-target exit code 0
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: pdsk( UpToDate -> Inconsistent )
Jan  2 10:23:28 castle kernel: [4361194.855074] drbd windows-wm
san7.mytest.com.au: sock was shut down by peer
Jan  2 10:23:28 castle kernel: [4361194.855101] drbd windows-wm
san7.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary ->
Unknown )
Jan  2 10:23:28 castle kernel: [4361194.855109] drbd windows-wm/0
drbd1001 san7.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl(
Established -> Off )
Jan  2 10:23:28 castle kernel: [4361194.855161] drbd windows-wm
san7.mytest.com.au: ack_receiver terminated
Jan  2 10:23:28 castle kernel: [4361194.855164] drbd windows-wm
san7.mytest.com.au: Terminating ack_recv thread
Jan  2 10:23:28 castle kernel: [4361194.882138] drbd windows-wm
san7.mytest.com.au: Restarting sender thread
Jan  2 10:23:28 castle kernel: [4361194.961402] drbd windows-wm
san7.mytest.com.au: Connection closed
Jan  2 10:23:28 castle kernel: [4361194.961435] drbd windows-wm
san7.mytest.com.au: helper command: /sbin/drbdadm disconnected
Jan  2 10:23:28 castle kernel: [4361194.968763] drbd windows-wm
san7.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Jan  2 10:23:28 castle kernel: [4361194.968800] drbd windows-wm
san7.mytest.com.au: conn( BrokenPipe -> Unconnected )
Jan  2 10:23:28 castle kernel: [4361194.968812] drbd windows-wm
san7.mytest.com.au: Restarting receiver thread
Jan  2 10:23:28 castle kernel: [4361194.968816] drbd windows-wm
san7.mytest.com.au: conn( Unconnected -> Connecting )
Jan  2 10:23:29 castle kernel: [4361195.486059] drbd windows-wm
san7.mytest.com.au: Handshake to peer 3 successful: Agreed network
protocol version 117
Jan  2 10:23:29 castle kernel: [4361195.486066] drbd windows-wm
san7.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM
THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan  2 10:23:29 castle kernel: [4361195.486490] drbd windows-wm
san7.mytest.com.au: Peer authenticated using 20 bytes HMAC
Jan  2 10:23:29 castle kernel: [4361195.486515] drbd windows-wm
san7.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1165])
Jan  2 10:23:29 castle kernel: [4361195.517928] drbd windows-wm/0
drbd1001 san7.mytest.com.au: drbd_sync_handshake:
Jan  2 10:23:29 castle kernel: [4361195.517935] drbd windows-wm/0
drbd1001 san7.mytest.com.au: self
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000
bits:0 flags:120
Jan  2 10:23:29 castle kernel: [4361195.517940] drbd windows-wm/0
drbd1001 san7.mytest.com.au: peer
CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000
bits:0 flags:120
Jan  2 10:23:29 castle kernel: [4361195.517944] drbd windows-wm/0
drbd1001 san7.mytest.com.au: uuid_compare()=no-sync by rule 38
Jan  2 10:23:29 castle kernel: [4361195.677932] drbd windows-wm:
Preparing cluster-wide state change 3667329610 (0->3 499/146)
Jan  2 10:23:29 castle kernel: [4361195.678459] drbd windows-wm: State
change 3667329610: primary_nodes=0, weak_nodes=0
Jan  2 10:23:29 castle kernel: [4361195.678466] drbd windows-wm:
Committing cluster-wide state change 3667329610 (0ms)
Jan  2 10:23:29 castle kernel: [4361195.678516] drbd windows-wm
san7.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown ->
Secondary )
Jan  2 10:23:29 castle kernel: [4361195.678522] drbd windows-wm/0
drbd1001 san7.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off ->
Established )
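
The log above is long and, as pasted here, wrapped by the mail client.
A minimal sketch of a filter that re-joins wrapped entries, pulls out
the DRBD state transitions (conn, peer, pdsk, repl, resync-susp) and
counts the repeated drbd_rs_del_all() retries might look like the
following. The function names and regular expressions are illustrative
assumptions based only on the lines shown here; they are not part of
DRBD or LINSTOR, and other syslog formats may need different patterns.

```python
import re
from collections import Counter

# Start of a syslog entry, e.g.
# "Dec 30 10:50:44 castle kernel: [4103630.414725] "
ENTRY_START = re.compile(
    r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d \S+ kernel: \[\s*\d+\.\d+\] "
)
# A state transition, e.g. "conn( Connected -> BrokenPipe )"
TRANSITION = re.compile(r"([\w-]+)\( (.+?) -> (.+?) \)")
RETRY_MARKER = "Retrying drbd_rs_del_all()"


def join_wrapped(lines):
    """Re-join log entries that were wrapped across several lines."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def summarize(lines):
    """Return (transitions, retries).

    transitions: list of (timestamp, subject, field, old, new)
    retries:     Counter of drbd_rs_del_all() retry messages per subject
    """
    transitions = []
    retries = Counter()
    for entry in join_wrapped(lines):
        start = ENTRY_START.match(entry)
        if not start:
            continue  # not a kernel log entry (e.g. a comment line)
        stamp = entry[:15]  # "Mmm dd hh:mm:ss"
        subject, _, text = entry[start.end():].partition(": ")
        for field, old, new in TRANSITION.findall(text):
            transitions.append((stamp, subject, field, old, new))
        if RETRY_MARKER in text:
            retries[subject] += 1
    return transitions, retries


# A few entries copied from the log above, still wrapped as in the mail.
SAMPLE = """\
Dec 30 10:50:44 castle kernel: [4103630.414752] drbd windows-wm
san6.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary ->
Unknown )
Dec 30 11:09:15 castle kernel: [4104742.275721] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:15 castle kernel: [4104742.377977] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0
drbd1011 san5.mytest.com.au: pdsk( UpToDate -> Inconsistent )
""".splitlines()

if __name__ == "__main__":
    found, retry_counts = summarize(SAMPLE)
    for row in found:
        print(*row, sep=" | ")
    for subject, count in retry_counts.items():
        print(f"{subject}: {count} drbd_rs_del_all() retries")
```

On a real node the same function could be fed the file directly, for
example summarize(open("/var/log/kern.log")), and the output from each
node compared side by side.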

castle:/var/log# linstor resource list
+--------------+--------+------+--------+------------------------------------------------------------------------------------+-------------------+---------------------+
| ResourceName | Node   | Port | Usage  | Conns                                                                              |             State | CreatedOn           |
+--------------+--------+------+--------+------------------------------------------------------------------------------------+-------------------+---------------------+
| crossbowold  | castle | 7010 | Unused | Ok                                                                                 |          UpToDate | 2020-10-07 00:46:23 |
| crossbowold  | flail  | 7010 | Unused | Ok                                                                                 |          Diskless | 2021-01-04 05:03:20 |
| crossbowold  | san5   | 7010 | Unused | Ok                                                                                 |          UpToDate | 2020-10-07 00:46:23 |
| crossbowold  | san6   | 7010 | Unused | Ok                                                                                 |          UpToDate | 2020-10-07 00:46:22 |
| crossbowold  | san7   | 7010 | Unused | Ok                                                                                 |          UpToDate | 2020-10-07 00:46:21 |
| crossbowold  | xen1   | 7010 | InUse  | Ok                                                                                 |          Diskless | 2020-10-15 00:30:31 |
| jspiteriVM1  | castle | 7011 | Unused | StandAlone(san6.mytest.com.au,san7.mytest.com.au)                                  | SyncTarget(0.00%) | 2020-10-14 22:15:00 |
| jspiteriVM1  | san5   | 7011 | Unused | Connecting(san7.mytest.com.au)                                                     |      Inconsistent | 2020-10-14 22:14:59 |
| jspiteriVM1  | san6   | 7011 | Unused | Connecting(castle.mytest.com.au,san7.mytest.com.au)                                | SyncTarget(0.00%) | 2020-10-14 22:14:58 |
| jspiteriVM1  | san7   | 7011 | Unused | Connecting(castle.mytest.com.au),StandAlone(san6.mytest.com.au,san5.mytest.com.au) |      Inconsistent | 2020-10-14 22:14:58 |
| jspiteriVM1  | xen1   | 7011 | Unused | Ok                                                                                 |          Diskless | 2020-11-20 20:39:20 |
| ns2          | castle | 7000 | Unused | Ok                                                                                 |          UpToDate | 2020-10-28 23:22:13 |
| ns2          | flail  | 7000 | Unused | Ok                                                                                 |          Diskless | 2021-01-04 05:03:42 |
| ns2          | san5   | 7000 | Unused | Ok                                                                                 |          UpToDate | 2020-10-28 23:22:12 |
| ns2          | san6   | 7000 | Unused | Ok                                                                                 |          UpToDate | 2020-10-28 23:22:11 |
| ns2          | xen1   | 7000 | Unused | Ok                                                                                 |          Diskless | 2020-10-28 23:30:20 |
| windows-wm   | castle | 7001 | Unused | Ok                                                                                 |          UpToDate | 2020-09-30 00:03:41 |
| windows-wm   | flail  | 7001 | Unused | Ok                                                                                 |          Diskless | 2021-01-04 05:03:48 |
| windows-wm   | san5   | 7001 | Unused | Ok                                                                                 |          UpToDate | 2020-09-30 00:03:40 |
| windows-wm   | san6   | 7001 | Unused | Ok                                                                                 |          UpToDate | 2020-09-30 00:03:39 |
| windows-wm   | san7   | 7001 | Unused | Ok                                                                                 |          UpToDate | 2020-09-30 00:13:05 |
+--------------+--------+------+--------+------------------------------------------------------------------------------------+-------------------+---------------------+

Could anyone determine from this, or advise what additional logs I
should examine, to work out why this failed? I don't see anything
obvious as to what caused linstor/drbd to fail here; all nodes were
online and uninterrupted as far as I can tell. All physical storage is
backed by MD RAID arrays, so there is some protection against disk
failures (I haven't noticed any anyway).
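
As a rough cross-check of the resync figures in the Jan 2 log lines
above, the arithmetic can be sketched as below. All constants are
copied from those lines. One assumption is labelled in the code: that
the "failed blocks" count uses the same unit as the bitmap bits. If it
does, most of the resync was reported as failed, which may be related
to the peer disk dropping back to Inconsistent immediately afterwards;
that interpretation is a guess, not something the log confirms.

```python
# Figures copied from the Jan 2 kern.log lines above.
SYNC_KB = 104859732        # "will sync 104859732 KB"
BITS_SET = 26214933        # "[26214933 bits set]"
RESYNC_SECONDS = 1028      # "Resync done (total 1028 sec; ...)"
FAILED_BLOCKS = 23727152   # "23727152 failed blocks"


def resync_summary():
    """Return (KB per bitmap bit, average KB/s, fraction reported failed)."""
    kb_per_bit = SYNC_KB / BITS_SET
    avg_kb_per_sec = SYNC_KB / RESYNC_SECONDS
    # ASSUMPTION: failed blocks are counted in the same unit as bitmap bits.
    failed_fraction = FAILED_BLOCKS / BITS_SET
    return kb_per_bit, avg_kb_per_sec, failed_fraction


if __name__ == "__main__":
    kb_per_bit, avg, failed = resync_summary()
    print(f"{kb_per_bit:.1f} KB per bit, {avg:.0f} KB/s, {failed:.1%} failed")
```

The first two values agree with the log (4 KB per bit, and roughly the
logged 102000 K/sec), so the figures are at least internally
consistent.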

I've since done an upgrade to the latest version of the drbd/linstor
components on all nodes.

Finally, what could I do to recover the data? Has it been destroyed, or
do I just need to select a node and tell linstor that this node has
up-to-date data? Or can linstor work that out somehow?
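
Before picking a node, it may help to check whether any replica still
reports UpToDate at all. A minimal sketch, using the State column
copied by hand from the listing above for two of the resources (the
dictionary and function name are illustrative only, not a LINSTOR API):

```python
# State column per node, copied from the "linstor resource list" output.
STATES = {
    "crossbowold": {
        "castle": "UpToDate", "flail": "Diskless", "san5": "UpToDate",
        "san6": "UpToDate", "san7": "UpToDate", "xen1": "Diskless",
    },
    "jspiteriVM1": {
        "castle": "SyncTarget(0.00%)", "san5": "Inconsistent",
        "san6": "SyncTarget(0.00%)", "san7": "Inconsistent",
        "xen1": "Diskless",
    },
}


def up_to_date_nodes(resource):
    """Return the sorted list of nodes reporting UpToDate for a resource."""
    return sorted(
        node for node, state in STATES[resource].items() if state == "UpToDate"
    )


if __name__ == "__main__":
    for name in STATES:
        print(name, up_to_date_nodes(name))
```

For jspiteriVM1 the list comes back empty: according to this listing,
no node currently claims an up-to-date copy.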

Regards,
Adam

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user