Mailing List Archive

verify/disconnect/connect doesn't resync?
Hi, I have a three-node active/passive DRBD cluster, operating with default configuration. I had to replace disks on one of the nodes (call it node A) and resync the cluster.

Somehow, after doing this, A was not in sync with the primary (node C); I only discovered this because I couldn't even mount the filesystem on it after (temporarily) making A primary. I don't fully understand how I got into this situation but that's a tangent for now.

Following instructions in the documentation, I enabled a verification algorithm, and instructed A to verify (`drbdadm verify <my volume>`). It correctly found many discrepancies (gigabytes worth!) and emitted the ranges to dmesg.
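For the record, the configuration change and commands amounted to roughly the following (resource named `drbd_test`, per the logs below; `crc32c` is just one of the supported digest algorithms):

```
# in the resource file (net section), enable a verification algorithm:
#   verify-alg crc32c;
# then, on node A:
drbdadm adjust drbd_test    # pick up the config change
drbdadm verify drbd_test    # start online verify; results land in dmesg
```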

I then attempted to resynchronize A with C (the primary) by running `drbdadm disconnect <my volume>` and then `drbdadm connect <my volume>`, again, following documentation. This did not appear to do anything, despite verify having just found nearly the entire disk to be out of sync. Indeed, running verify a second time produced the exact same results.
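Concretely (again with the resource named `drbd_test`, on node A):

```
drbdadm disconnect drbd_test   # drop the connection to the peers
drbdadm connect drbd_test      # expected a resync of the flagged blocks here
drbdadm status drbd_test       # but everything still showed Established / UpToDate
```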

Instead I forced a full resync by bringing A down, invalidating it, and bringing it back up again. Now verification showed A and C to be in sync.
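That is, roughly (all on node A):

```
drbdadm down drbd_test
drbdadm invalidate drbd_test   # discard local data; forces a full resync from a peer
drbdadm up drbd_test
```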

However A was still showing a small number (thousands) of discrepancies with node B (the other secondary node). So I repeated the above steps on B -- verify/disconnect/connect/verify -- and again, nothing changed. B still shows discrepancies between it and both A and C.

Running the same steps on node C (the primary) again found discrepancies with B, and again failed to resynchronize.

What am I missing? Is there an additional step needed to convince DRBD to resynchronize blocks found to mismatch during verify?

Further questions --

Why does `drbdadm status` not show whether out-of-sync blocks were found by `drbdadm verify`? Instead it shows UpToDate like nothing is wrong.

Why is resynchronization only triggered on reconnect? Is there a downside to simply starting resynchronization when out-of-sync blocks are discovered?

Version info:
DRBDADM_BUILDTAG=GIT-hash:\ 5acfd06032d4c511c75c92e58662eeeb18bd47db\ build\ by\ ec2-user@test-cluster-c.cpacejo.test\,\ 2021-07-06\ 20:48:54
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090102
DRBD_KERNEL_VERSION=9.1.2
DRBDADM_VERSION_CODE=0x091200
DRBDADM_VERSION=9.18.0

dmesg logs below.

Thanks,
Chris

----

1st verify:

[...]
[7578792.035120] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: Online verify done (total 485 sec; paused 0 sec; 43252 K/sec)
[7578792.043632] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: Online verify found 2131 4k blocks out of sync!
[7578792.051555] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: repl( VerifyS -> Established )
[7578792.058956] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: helper command: /sbin/drbdadm out-of-sync
[7578792.067705] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: helper command: /sbin/drbdadm out-of-sync exit code 0

disconnect:

[7579233.399450] drbd drbd_test: Preparing cluster-wide state change 1874764997 (1->0 496/16)
[7579233.414663] drbd drbd_test: State change 1874764997: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFFA
[7579233.422721] drbd drbd_test: Committing cluster-wide state change 1874764997 (24ms)
[7579233.430676] drbd drbd_test test-cluster-a.cpacejo.test: conn( Connected -> Disconnecting ) peer( Secondary -> Unknown )
[7579233.440086] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[7579233.449071] drbd drbd_test test-cluster-a.cpacejo.test: ack_receiver terminated
[7579233.455798] drbd drbd_test test-cluster-a.cpacejo.test: Terminating ack_recv thread
[7579233.462812] drbd drbd_test test-cluster-a.cpacejo.test: Terminating sender thread
[7579233.469779] drbd drbd_test test-cluster-a.cpacejo.test: Starting sender thread (from drbd_r_drbd_tes [4947])
[7579233.522434] drbd drbd_test test-cluster-a.cpacejo.test: Connection closed
[7579233.527357] drbd drbd_test test-cluster-a.cpacejo.test: helper command: /sbin/drbdadm disconnected
[7579233.535664] drbd drbd_test test-cluster-a.cpacejo.test: helper command: /sbin/drbdadm disconnected exit code 0
[7579233.543610] drbd drbd_test test-cluster-a.cpacejo.test: conn( Disconnecting -> StandAlone )
[7579233.550886] drbd drbd_test test-cluster-a.cpacejo.test: Terminating receiver thread
[7579233.551999] drbd drbd_test: Preparing cluster-wide state change 1117845239 (1->2 496/16)
[7579233.579829] drbd drbd_test: State change 1117845239: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFFA
[7579233.587246] drbd drbd_test test-cluster-c.cpacejo.test: Cluster is now split
[7579233.593837] drbd drbd_test: Committing cluster-wide state change 1117845239 (40ms)
[7579233.600717] drbd drbd_test test-cluster-c.cpacejo.test: conn( Connected -> Disconnecting ) peer( Primary -> Unknown )
[7579233.609005] drbd drbd_test/0 drbd1: disk( UpToDate -> Outdated )
[7579233.613529] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[7579233.621764] drbd drbd_test test-cluster-c.cpacejo.test: ack_receiver terminated
[7579233.628522] drbd drbd_test test-cluster-c.cpacejo.test: Terminating ack_recv thread
[7579233.635568] drbd drbd_test test-cluster-c.cpacejo.test: Terminating sender thread
[7579233.642429] drbd drbd_test test-cluster-c.cpacejo.test: Starting sender thread (from drbd_r_drbd_tes [4949])
[7579233.702427] drbd drbd_test test-cluster-c.cpacejo.test: Connection closed
[7579233.707266] drbd drbd_test test-cluster-c.cpacejo.test: helper command: /sbin/drbdadm disconnected
[7579233.715540] drbd drbd_test test-cluster-c.cpacejo.test: helper command: /sbin/drbdadm disconnected exit code 0
[7579233.723443] drbd drbd_test test-cluster-c.cpacejo.test: conn( Disconnecting -> StandAlone )
[7579233.730614] drbd drbd_test test-cluster-c.cpacejo.test: Terminating receiver thread

connect:

[7579274.994900] drbd drbd_test test-cluster-a.cpacejo.test: conn( StandAlone -> Unconnected )
[7579275.002094] drbd drbd_test test-cluster-a.cpacejo.test: Starting receiver thread (from drbd_w_drbd_tes [4906])
[7579275.003118] drbd drbd_test test-cluster-c.cpacejo.test: conn( StandAlone -> Unconnected )
[7579275.017647] drbd drbd_test test-cluster-c.cpacejo.test: Starting receiver thread (from drbd_w_drbd_tes [4906])
[7579275.026007] drbd drbd_test test-cluster-a.cpacejo.test: conn( Unconnected -> Connecting )
[7579275.033321] drbd drbd_test test-cluster-c.cpacejo.test: conn( Unconnected -> Connecting )
[7579275.555495] drbd drbd_test test-cluster-a.cpacejo.test: Handshake to peer 0 successful: Agreed network protocol version 120
[7579275.563899] drbd drbd_test test-cluster-a.cpacejo.test: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[7579275.573000] drbd drbd_test test-cluster-a.cpacejo.test: Starting ack_recv thread (from drbd_r_drbd_tes [6132])
[7579275.581050] drbd drbd_test test-cluster-c.cpacejo.test: Handshake to peer 2 successful: Agreed network protocol version 120
[7579275.589495] drbd drbd_test test-cluster-c.cpacejo.test: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[7579275.598624] drbd drbd_test test-cluster-c.cpacejo.test: Starting ack_recv thread (from drbd_r_drbd_tes [6133])
[7579275.661901] drbd drbd_test: Preparing cluster-wide state change 2953429557 (1->2 499/146)
[7579275.689909] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: drbd_sync_handshake:
[7579275.696815] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: self 2218B43B129FB2C0:0000000000000000:9AEFA85FA67119B8:10B03C23106E5DBE bits:2131 flags:120
[7579275.708000] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: peer 2218B43B129FB2C0:0000000000000000:10B03C23106E5DBE:9AEFA85FA67119B8 bits:2131 flags:1120
[7579275.719280] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: uuid_compare()=no-sync by rule=reconnected
[7579275.727061] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: bitmap content (2131 bits)
[7579275.827372] drbd drbd_test: State change 2953429557: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFF8
[7579275.834831] drbd drbd_test: Committing cluster-wide state change 2953429557 (172ms)
[7579275.841772] drbd drbd_test test-cluster-c.cpacejo.test: conn( Connecting -> Connected ) peer( Unknown -> Primary )
[7579275.849916] drbd drbd_test/0 drbd1: disk( Outdated -> UpToDate )
[7579275.854455] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
[7579275.880192] drbd drbd_test test-cluster-a.cpacejo.test: Preparing remote state change 159357618
[7579275.901913] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: drbd_sync_handshake:
[7579275.908920] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: self 2218B43B129FB2C0:0000000000000000:9AEFA85FA67119B8:10B03C23106E5DBE bits:2131 flags:100
[7579275.920171] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: peer 2218B43B129FB2C0:0000000000000000:10B03C23106E5DBE:5698F48A94253A12 bits:2131 flags:1100
[7579275.931507] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: uuid_compare()=no-sync by rule=reconnected
[7579275.939352] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: bitmap content (2131 bits)
[7579275.961059] drbd drbd_test test-cluster-a.cpacejo.test: Committing remote state change 159357618 (primary_nodes=4)
[7579275.969219] drbd drbd_test test-cluster-a.cpacejo.test: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
[7579275.978413] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )

2nd verify:

[...]
[7579615.694997] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: Online verify done (total 231 sec; paused 0 sec; 90816 K/sec)
[7579615.703340] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: Online verify found 2131 4k blocks out of sync!
[7579615.711213] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: repl( VerifyS -> Established )
[7579615.718463] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: helper command: /sbin/drbdadm out-of-sync
[7579615.727226] drbd drbd_test/0 drbd1 test-cluster-a.cpacejo.test: helper command: /sbin/drbdadm out-of-sync exit code 0
[7579804.815701] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: Out of sync: start=31438264, size=17048 (sectors)
[7579867.917531] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: Online verify done (total 483 sec; paused 0 sec; 43432 K/sec)
[7579867.926131] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: Online verify found 2131 4k blocks out of sync!
[7579867.934090] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: repl( VerifyS -> Established )
[7579867.941461] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: helper command: /sbin/drbdadm out-of-sync
[7579867.950263] drbd drbd_test/0 drbd1 test-cluster-c.cpacejo.test: helper command: /sbin/drbdadm out-of-sync exit code 0
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: verify/disconnect/connect doesn't resync?
On 29/09/2021 01:10, Chris Pacejo wrote:
> Hi, I have a three-node active/passive DRBD cluster, operating with default configuration. I had to replace disks on one of the nodes (call it node A) and resync the cluster.
>
> Somehow, after doing this, A was not in sync with the primary (node C); I only discovered this because I couldn't even mount the filesystem on it after (temporarily) making A primary. I don't fully understand how I got into this situation but that's a tangent for now.
>
> Following instructions in the documentation, I enabled a verification algorithm, and instructed A to verify (`drbdadm verify <my volume>`). It correctly found many discrepancies (gigabytes worth!) and emitted the ranges to dmesg.
>
> I then attempted to resynchronize A with C (the primary) by running `drbdadm disconnect <my volume>` and then `drbdadm connect <my volume>`, again, following documentation. This did not appear to do anything, despite verify having just found nearly the entire disk to be out of sync. Indeed, running verify a second time produced the exact same results.
>
> Instead I forced a full resync by bringing A down, invalidating it, and bringing it back up again. Now verification showed A and C to be in sync.

What I usually do in this situation (I believe it happens because no writes
have hit the primary while disconnected), to avoid the drastic step of
completely invalidating a secondary node, is: disconnect the secondary,
force a small change on the primary (e.g. touch and delete an empty file
on the filesystem, or run a filesystem check that updates the fs
metadata), then reconnect. This forces a resync, and in my experience,
judging by the number of KB resynced, the resync includes the verified
blocks that were found out of sync.
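Sketched out, using the poster's resource name and a purely illustrative mount point (`/mnt/drbd1` here stands in for wherever the filesystem is mounted):

```
drbdadm disconnect drbd_test        # on the affected secondary
# on the primary: dirty at least one block so reconnect has something to sync
touch /mnt/drbd1/.resync-trigger
rm /mnt/drbd1/.resync-trigger
sync
drbdadm connect drbd_test           # back on the secondary
```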

>
> However A was still showing a small number (thousands) of discrepancies with node B (the other secondary node). So I repeated the above steps on B -- verify/disconnect/connect/verify -- and again, nothing changed. B still shows discrepancies between it and both A and C.
>
> Running the same steps on node C (the primary) again found discrepancies with B, and again failed to resynchronize.
>
> What am I missing? Is there an additional step needed to convince DRBD to resynchronize blocks found to mismatch during verify?
>
> Further questions --
>
> Why does `drbdadm status` not show whether out-of-sync blocks were found by `drbdadm verify`? Instead it shows UpToDate like nothing is wrong.
>
> Why is resynchronization only triggered on reconnect? Is there a downside to simply starting resynchronization when out-of-sync blocks are discovered?

I believe this has just been left for the user to take whatever action
is desired using the out-of-sync helper. I suppose some people might not
want any automatic action taken and just have a helper script send them
a notification so they can manually intervene.
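For example, a handlers stanza along these lines (the script path is illustrative; if I remember right, DRBD also ships a sample notify-out-of-sync.sh helper that could go here instead):

```
resource drbd_test {
  handlers {
    # invoked when an online verify run finishes having found out-of-sync blocks
    out-of-sync "/usr/local/sbin/notify-oos.sh";
  }
}
```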

Eddie

>
> Version info:
> DRBDADM_BUILDTAG=GIT-hash:\ 5acfd06032d4c511c75c92e58662eeeb18bd47db\ build\ by\ ec2-user@test-cluster-c.cpacejo.test\,\ 2021-07-06\ 20:48:54
> DRBDADM_API_VERSION=2
> DRBD_KERNEL_VERSION_CODE=0x090102
> DRBD_KERNEL_VERSION=9.1.2
> DRBDADM_VERSION_CODE=0x091200
> DRBDADM_VERSION=9.18.0
>
> dmesg logs below.
>
> Thanks,
> Chris

<snip>