Dear DRBD-users,
we are currently performing an upgrade from Proxmox VE 6 to VE 7 on a
three-node LINSTOR/DRBD cluster. (Only two nodes are storage+compute
nodes / satellites; the third is the linstor-controller and quorum node.)
This is a testing environment that we built in preparation for the
upgrade of the live cluster.
Before starting the upgrade we were on LINSTOR 1.11, drbd-dkms 9.0.27
and PVE 6.3. Our upgrade route was to first upgrade LINSTOR to 1.20,
then upgrade all nodes to PVE 6.4 and DRBD 9.2 (9.0.27-1 -> 9.2.0-1).
After a fresh boot of all nodes we were in a good state: healthy
cluster, pve6to7 happy, DRBD in sync and all packages up to date.
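For reference, this is roughly how we verified that state on each node
(nothing exotic; resource names are as in the listing further down):

  # Proxmox upgrade checklist (read-only checks)
  pve6to7
  # DRBD state: every resource should be UpToDate on both storage nodes
  drbdadm status
  # LINSTOR view from the controller node
  linstor node list
  linstor resource list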
We then performed the upgrade of the first node to PVE 7, which seemed
to go well, and rebooted that node into pve-7.2-11. As we have three
active VMs with three disk resources, this triggered a DRBD resync.
Two resources came out fine:
drbd1000 Testserver1: Resync done (total 2 sec; paused 0 sec; 104448 K/sec)
drbd1002 Testserver1: Resync done (total 55 sec; paused 0 sec; 92120 K/sec)
The third resource, however, synced about 65% of the outdated data and
then stalled (no more sync traffic, no progress in drbdmon).
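For the record, we watched the (lack of) progress with commands along
these lines (drbdsetup's statistics are what drbdmon displays):

  # live event stream for the stuck resource
  drbdsetup events2 vm-101-disk-1
  # one-shot detailed view including the resync counters
  drbdsetup status --verbose --statistics vm-101-disk-1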
The kernel message that seems to be relevant here is this:
drbd vm-101-disk-1/0 drbd1001: drbd_set_in_sync: sector=73703424s size=134479872 nonsense!
(If I decode that correctly, sector 73703424 at 512 B/sector is about
35.1 GiB into the device, and size=134479872 bytes is about 128 MiB,
which seems far too large for a single resync request.)
More kernel logs from the pve7 node can be found here:
https://pastebin.com/aGjy7Sgp
So far we have tried rebooting the pve7 node, but it always gets stuck
in Inconsistent/SyncTarget (no sync progress percentage shown) and
prints the same kernel error message, "drbd_set_in_sync:
sector=73703424s size=134479872 nonsense".
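Unless someone has a better idea, our next attempt would be to throw
away the partial bitmap on the stuck node and force a full resync of
just this one resource, roughly like this (untested here so far):

  # on the Inconsistent node (Testserver2): discard the local data of
  # this resource and request a full resync from the peer
  drbdadm invalidate vm-101-disk-1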
The LINSTOR resources are backed by LVM thin, which in turn sits on a
MegaRAID RAID1 of SSD drives.
I don't know if this is relevant, but the VM in question has at some
point in its lifetime been rolled back to a snapshot. (All snapshots
were removed prior to the upgrades.)
At that time the rollback worked fine, but we noticed a huge increase
in the allocated space on the backing device (IIRC it was equal to the
virtual disk size). We have set "discard=on" in Proxmox and ran fstrim
inside the VM (commands after the table below), which cut down the
space usage, but it is not equal on both nodes:
root@Testserver3:~# linstor resource list-volumes
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node        ┊ Resource      ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊ State        ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ Testserver1 ┊ vm-100-disk-1 ┊ ssd_thin    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  2.28 GiB ┊ InUse  ┊ UpToDate     ┊
┊ Testserver2 ┊ vm-100-disk-1 ┊ ssd_thin    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  2.50 GiB ┊ Unused ┊ UpToDate     ┊
┊ Testserver1 ┊ vm-101-disk-1 ┊ ssd_thin    ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊ 35.38 GiB ┊ InUse  ┊ UpToDate     ┊
┊ Testserver2 ┊ vm-101-disk-1 ┊ ssd_thin    ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊ 31.05 GiB ┊ Unused ┊ Inconsistent ┊
┊ Testserver1 ┊ vm-102-disk-1 ┊ ssd_thin    ┊     0 ┊    1002 ┊ /dev/drbd1002 ┊  7.04 GiB ┊ InUse  ┊ UpToDate     ┊
┊ Testserver2 ┊ vm-102-disk-1 ┊ ssd_thin    ┊     0 ┊    1002 ┊ /dev/drbd1002 ┊  7.04 GiB ┊ Unused ┊ UpToDate     ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
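For completeness, this is how we trimmed and how we compare the thin
allocation on the hosts; the VG name below is a placeholder for our
actual volume group:

  # inside the VM (needs discard=on on the virtual disk)
  fstrim -av
  # on each storage node: thin LV usage, Data% is the allocated share
  lvs -o lv_name,lv_size,data_percent <vg_name>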
The LINSTOR-generated resource configuration looks like this:
https://pastebin.com/syLADBdC
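Should the pastebin expire: that is the file LINSTOR writes under
/var/lib/linstor.d/ on the satellites. The effective configuration can
also be dumped on a node with:

  drbdadm dump vm-101-disk-1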
Relevant version numbers:
drbd-dkms: 9.2.0-1
linstor-(controller|satellite): 1.20.0-1
linstor-proxmox: 6.1.0-1
proxmox-ve versions: 6.4-1 (two nodes) and 7.2-1 (one node)
kernel: 5.4.203-1-pve (two nodes) and 5.15.64-1-pve (one node)
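Gathered per node with commands along these lines:

  cat /proc/drbd   # loaded DRBD module version
  pveversion       # proxmox-ve version
  uname -r         # running kernel
  dpkg -l drbd-dkms linstor-satellite linstor-proxmox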
Any insight on this would be most welcome. I'll provide more details if
you feel something is missing.
Thanks and kind regards,
Nils
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user