Mailing List Archive

DRBD corruption with kmod-drbd90-9.1.8-1
I just had my second DRBD cluster fail after updating to
kmod-drbd90-9.1.8-1 and then upgrading the kernel. I'm not sure whether
the kernel update itself broke things or whether the problem only
surfaced because of the reboot it caused. About 2 weeks ago an update
(kmod-drbd90-9.1.8-1) from elrepo got applied, and then, after a kernel
update, the DRBD metadata was corrupt. Here's the gist of the error:

This is on AlmaLinux 8:

Aug  7 16:41:13 nfs6 kernel: drbd r0: Starting worker thread (from drbdsetup [3515])
Aug  7 16:41:13 nfs6 kernel: drbd r0 nfs5: Starting sender thread (from drbdsetup [3519])
Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: meta-data IO uses: blk-bio
Aug  7 16:41:13 nfs6 kernel: attempt to access beyond end of device
sdb1: rw=6144, want=31250710528, limit=31250706432
Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: drbd_md_sync_page_io(,31250710520s,READ) failed with error -5
Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: Error while reading metadata.

This is from a CentOS 7 cluster:
Aug 16 11:04:57 v4 kernel: drbd r0 v3: Starting sender thread (from drbdsetup [9486])
Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: meta-data IO uses: blk-bio
Aug 16 11:04:57 v4 kernel: attempt to access beyond end of device
Aug 16 11:04:57 v4 kernel: sdb1: rw=1072, want=3905945600, limit=3905943552
Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: drbd_md_sync_page_io(,3905945592s,READ) failed with error -5
Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: Error while reading metadata.
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Called drbdadm -c /etc/drbd.conf -v adjust r0
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Exit code 1
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Command output:
drbdsetup new-peer r0 0 --_name=v3 --fencing=resource-only --protocol=C
drbdsetup new-path r0 0 ipv4:10.1.4.82:7788 ipv4:10.1.4.81:7788
drbdmeta 0 v09 /dev/sdb1 internal apply-al
drbdsetup attach 0 /dev/sdb1 /dev/sdb1 internal
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Command stderr: 0: Failure: (118) IO error(s) occurred during initial access to meta-data.
additional info from kernel:
Error while reading metadata.
Command 'drbdsetup attach 0 /dev/sdb1 /dev/sdb1 internal' terminated with exit code 10
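
Reading the numbers in both logs the same way: the failing metadata
read starts just past the end of sdb1, and in both cases the overshoot
is just under a common partition start offset (4096 and 2048 sectors),
which looks consistent with the 9.1.8 module sizing /dev/sdb1 as the
whole disk rather than the partition (an inference from the numbers,
not something the logs state directly). A rough sketch of the
arithmetic, plus standard checks one could run against the backing
device:

  # Failing read vs. device size, in 512-byte sectors (numbers from the logs above)
  echo $(( 31250710520 - 31250706432 ))   # AlmaLinux cluster: 4088 sectors (~2 MiB) past the end
  echo $(( 3905945592 - 3905943552 ))     # CentOS cluster: 2040 sectors (~1 MiB) past the end

  # Sanity checks on the backing device (dump-md needs the resource down/detached)
  blockdev --getsz /dev/sdb1              # partition size in 512-byte sectors
  drbdadm dump-md r0                      # try to read and parse the on-disk metadata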

Both clusters had been running flawlessly for ~2 years. I was in the
process of building a new DRBD cluster to offload the first one when
the 2nd production cluster had a kernel update and ran into the exact
same issue. On the first cluster (RHEL8/Alma) I deleted the metadata
and tried to resync the data over; however, it failed with the same
issue. I'm now in the process of building a new one to fix that broken
DRBD cluster. In the last 15 years of using DRBD I have never run into
any corruption issues. I'm at a loss; I thought the first one was a
fluke; now I know it's not!
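
For reference, the "deleted the metadata and tried to resync" step
would typically be done with the standard drbdadm commands, roughly as
sketched below (resource and device names taken from the logs above;
these are not the exact commands from the original report):

  drbdadm down r0         # take the resource down on the affected node
  drbdadm create-md r0    # write fresh internal metadata on /dev/sdb1 (prompts before overwriting)
  drbdadm up r0           # re-attach and reconnect
  drbdadm invalidate r0   # discard local data and pull a full resync from the peer

With the affected module still loaded, the attach step presumably keeps
looking for the metadata past the end of the partition, which would
explain why the resync attempt failed with the same error.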

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: DRBD corruption with kmod-drbd90-9.1.8-1 [ In reply to ]
The issue has already been reported at elrepo:
https://elrepo.org/bugs/view.php?id=1250

Brent

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: DRBD corruption with kmod-drbd90-9.1.8-1 [ In reply to ]

Hello,

thank you for the report.

We have implemented a fix for this[0] which will be released soon (i.e.
very likely within the next week).

If you can easily do so (and if this is a non-production system), it
would be great if you could build DRBD from that commit and verify that
the fix resolves the issue for you.
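
In case it helps, building the out-of-tree module from that commit
would look roughly like this (a sketch, assuming kernel-devel, gcc and
make for the running kernel are already installed; the commit hash is
the one from [0] below):

  git clone --recursive https://github.com/LINBIT/drbd.git
  cd drbd
  git checkout d7d76aad2b95dee098d6052567aa15d1342b1bc4
  git submodule update --init   # keep drbd-headers in sync with the checkout
  make                          # builds drbd.ko against the running kernel's headers
  sudo make install             # then depmod -a and reload drbd with all
                                # resources down, or simply reboot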

If not, the obvious workaround is to stay on 9.1.7 for now (or downgrade).
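
On the elrepo packages, the downgrade/hold could look something like
the following (again a sketch; whether the 9.1.7 build is still carried
by the repo, and its exact version string, should be checked first):

  yum list --showduplicates kmod-drbd90   # see which builds the repo still carries
  yum downgrade kmod-drbd90-9.1.7         # version string is an example; match what's listed
  # keep updates from pulling 9.1.8 back in until the fixed build lands, e.g.:
  yum versionlock add kmod-drbd90         # needs the versionlock plugin
  # or add "exclude=kmod-drbd90" to /etc/yum.conf (el7) / /etc/dnf/dnf.conf (el8)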

[0]
https://github.com/LINBIT/drbd/commit/d7d76aad2b95dee098d6052567aa15d1342b1bc4

--
Christoph Böhmwalder
LINBIT | Keeping the Digital World Running
DRBD HA — Disaster Recovery — Software defined Storage
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: DRBD corruption with kmod-drbd90-9.1.8-1 [ In reply to ]
There is already a bug report in the LINBIT/drbd GitHub repository:
issue #26, "Bug in drbd 9.1.5 on CentOS 7", from Feb. 2022. I added an
update to that issue noting that the problem persists in 9.1.12 and
giving device info.


_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user