Mailing List Archive

drbd9.2 resync stuck with drbd_set_in_sync: sector=<...>s size=<...> nonsense!
Dear DRBD-users,

We are currently performing an upgrade from proxmox ve-6 to ve-7 on a
three-node linstor/drbd cluster. (Only two nodes are storage+compute
nodes / satellites; the third is the linstor-controller + quorum node.)

This is a testing environment that we built in preparation for the
upgrade of the live cluster.

Before starting the upgrade we were on linstor 1.11, drbd-dkms 9.0.27
and pve 6.3. Our upgrade route was to first upgrade linstor to 1.20,
then upgrade all nodes to pve 6.4 and drbd 9.2 (9.0.27-1 -> 9.2.0-1).

After a fresh boot of all nodes we were in a good state: healthy
cluster, pve6to7 happy, drbd in sync and all packages up to date.
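
For completeness, the pre-upgrade checks looked roughly like this
(a sketch from memory, standard Proxmox/DRBD tooling):

  pve6to7 --full    # Proxmox's full 6-to-7 upgrade checklist script
  drbdadm status    # confirm every resource is UpToDate before starting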

We then performed the upgrade of the first node to pve-7, which seemed
to go well, and rebooted it into pve 7.2-11. As we have three active VMs
with three disk resources, this triggered a drbd resync.

Two resources came out fine:

drbd1000 Testserver1: Resync done (total 2 sec; paused 0 sec; 104448 K/sec)
drbd1002 Testserver1: Resync done (total 55 sec; paused 0 sec; 92120 K/sec)

The third resource, however, synced about 65% of the outdated data and
then stalled (no more sync traffic, no progress in drbdmon).
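
For reference, this is roughly how we watched the stalled resync (the
resource name is from our setup):

  drbdadm status vm-101-disk-1                            # high-level replication state
  drbdsetup status vm-101-disk-1 --verbose --statistics   # detailed counters; they stop moving once the resync stalls
  drbdmon                                                 # interactive monitor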

The kernel message that seems to be relevant here is this:

drbd vm-101-disk-1/0 drbd1001: drbd_set_in_sync: sector=73703424s
size=134479872 nonsense!

More kernel logs from the pve7 node can be found here:
https://pastebin.com/aGjy7Sgp

So far we have tried rebooting the pve7 node, but it always gets stuck
in Inconsistent/SyncTarget (no percentage of progress shown) and prints
the kernel error message "drbd_set_in_sync: sector=73703424s
size=134479872 nonsense!".

The linstor resources are backed by lvm_thin, which is backed by a
MegaRAID controller in RAID1 with SSD drives.
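
In case it is relevant, the thin pool fill level can be checked like
this (pool and volume names depend on local naming, so the interesting
part is the output fields):

  lvs -o lv_name,pool_lv,lv_size,data_percent,metadata_percent   # data/metadata usage of the thin pool and its volumes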

I don't know if this is relevant, but the VM in question has at some
point in its lifetime been rolled back to a snapshot. (All snapshots
had been removed prior to the upgrades.)

At that time the rollback worked OK, but we noticed a huge increase in
the allocated space on the backing device (IIRC it was equal to the
virtual disk size). We have set "discard=on" in proxmox and ran
"fstrim" in the VM (a sketch of those steps follows after the table),
which cut down the space usage, but it's not equal on both nodes:

root@Testserver3:~# linstor resource list-volumes
+-------------+---------------+-------------+-------+---------+---------------+-----------+--------+--------------+
| Node        | Resource      | StoragePool | VolNr | MinorNr | DeviceName    | Allocated | InUse  | State        |
+-------------+---------------+-------------+-------+---------+---------------+-----------+--------+--------------+
| Testserver1 | vm-100-disk-1 | ssd_thin    |     0 |    1000 | /dev/drbd1000 |  2.28 GiB | InUse  | UpToDate     |
| Testserver2 | vm-100-disk-1 | ssd_thin    |     0 |    1000 | /dev/drbd1000 |  2.50 GiB | Unused | UpToDate     |
| Testserver1 | vm-101-disk-1 | ssd_thin    |     0 |    1001 | /dev/drbd1001 | 35.38 GiB | InUse  | UpToDate     |
| Testserver2 | vm-101-disk-1 | ssd_thin    |     0 |    1001 | /dev/drbd1001 | 31.05 GiB | Unused | Inconsistent |
| Testserver1 | vm-102-disk-1 | ssd_thin    |     0 |    1002 | /dev/drbd1002 |  7.04 GiB | InUse  | UpToDate     |
| Testserver2 | vm-102-disk-1 | ssd_thin    |     0 |    1002 | /dev/drbd1002 |  7.04 GiB | Unused | UpToDate     |
+-------------+---------------+-------------+-------+---------+---------------+-----------+--------+--------------+
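
For reference, the discard/trim steps mentioned above were roughly the
following; the VMID, bus and storage name are placeholders rather than
our exact values:

  qm set 101 --scsi0 <storage>:vm-101-disk-1,discard=on   # on the Proxmox host: pass discards through to the backing device
  fstrim -av                                              # inside the VM: trim all mounted filesystems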

The linstor-created resource looks like this:
https://pastebin.com/syLADBdC

Relevant version numbers:

drbd-dkms: 9.2.0-1
linstor-(controller|satellite): 1.20.0-1
linstor-proxmox: 6.1.0-1
proxmox-ve versions: 6.4-1 (two nodes) and 7.2-1 (one node)
kernel: 5.4.203-1-pve (two nodes) and 5.15.64-1-pve (one node)
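
These were collected with the usual tooling, in case anyone wants to
compare:

  cat /proc/drbd     # version of the loaded DRBD module
  dpkg -l drbd-dkms linstor-controller linstor-satellite linstor-proxmox
  pveversion         # Proxmox VE version
  uname -r           # running kernel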

Any insight on this would be most welcome. I'll provide more details if
you feel something is missing.

thanks and kind regards,
Nils
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: drbd9.2 resync stuck with drbd_set_in_sync: sector=<...>s size=<...> nonsense!
Dear Nils,

> The third resource, however, synced about 65% of the outdated data and
> then stalled (no more sync traffic, no progress in drbdmon).
>
> The kernel message that seems to be relevant here is this:
>
> drbd vm-101-disk-1/0 drbd1001: drbd_set_in_sync: sector=73703424s
> size=134479872 nonsense!

Thanks for the report. This looks like the problem that has just been fixed by
https://github.com/LINBIT/drbd/commit/06bbd6eec1b8d576dbda24b29d16129d43537c77

This commit is not yet included in any DRBD release. Is it feasible
for you to test with the master branch? Building unreleased versions
is a little tricky due to the kernel compat handling. Otherwise you
can wait for the next release; DRBD 9.2.1-rc.1 should be here soon.
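
Roughly, the build looks like the following; take it as an untested
sketch, the README in the repository is authoritative:

  git clone --recursive https://github.com/LINBIT/drbd.git   # also pulls the drbd-headers submodule
  cd drbd
  make -j"$(nproc)"     # builds the module against the running kernel's headers
  sudo make install     # or build a distribution package instead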

Best regards,
Joel
Re: drbd9.2 resync stuck with drbd_set_in_sync: sector=<...>s size=<...> nonsense!
Hello Joel,

thank you for your very quick response. It's greatly appreciated!

On 25.10.22 at 16:32, Joel Colledge wrote:
> Thanks for the report. This looks like the problem that has just been fixed by
> https://github.com/LINBIT/drbd/commit/06bbd6eec1b8d576dbda24b29d16129d43537c77
>
> This commit is not yet included in any DRBD release. Is it feasible
> for you to test with the master branch?

I did look at the docs for building the master branch, but it's quite
daunting for a non-developer.

I'd rather wait for rc1 and invest my time in a thorough test of that
version.

thanks again,
Nils
Re: drbd9.2 resync stuck with drbd_set_in_sync: sector=<...>s size=<...> nonsense!
Hello Joel,

On 25.10.22 at 16:32, Joel Colledge wrote:
> Thanks for the report. This looks like the problem that has just been fixed by
> https://github.com/LINBIT/drbd/commit/06bbd6eec1b8d576dbda24b29d16129d43537c77

We were able to test with drbd-9.2.1-rc.1, and the issue is solved for us.

The resync completed without problems and everything is looking good.
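
For the record, we verified with the usual commands (resource name from
our setup); both nodes report UpToDate again:

  drbdadm status vm-101-disk-1
  linstor resource list-volumes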

again thanks and kind regards,
Nils