DRBD 8.4 two-node primary locks up, did not send a P_BARRIER
Hi all,

at work my team and I are facing a DRBD 8.4 two-node cluster where the
primary node seemingly randomly locks up, thereby preventing access to
its data.

When this happens dmesg shows entries such as this one coming from DRBD:

We did not send a P_BARRIER for 5118944ms > ko-count (7) * timeout (60 *
0.1s); drbd kernel thread blocked?

At the same time DRBD commands such as 'drbdadm secondary' or viewing
state at '/proc/drbd' no longer return results and just hang.
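
For reference, the numbers in the dmesg line above can be worked out with a few lines of Python. This is only an arithmetic sketch: the three constants are copied from the quoted message, the 0.1 s unit for the timeout is taken from the message itself, and the helper name barrier_threshold_ms is made up for illustration.

```python
# Values copied from the dmesg line quoted above.
KO_COUNT = 7            # ko-count
TIMEOUT_TENTHS = 60     # timeout, in units of 0.1 s (as printed in the message)
BLOCKED_MS = 5118944    # time DRBD reports without having sent a P_BARRIER


def barrier_threshold_ms(ko_count: int, timeout_tenths: int) -> int:
    """Return ko-count * timeout in milliseconds (hypothetical helper)."""
    return ko_count * timeout_tenths * 100  # 0.1 s == 100 ms


threshold_ms = barrier_threshold_ms(KO_COUNT, TIMEOUT_TENTHS)
print(f"threshold: {threshold_ms / 1000:.0f} s")
print(f"blocked:   {BLOCKED_MS / 1000:.0f} s (about {BLOCKED_MS / 60000:.0f} min)")
print(f"ratio:     {BLOCKED_MS / threshold_ms:.0f}x the threshold")
```

In other words, the threshold is 42 seconds, while the reported stall is roughly 85 minutes.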

This is happening on a Debian 10 Xen virtual machine (via XCP-ng). The
installed 'drbd-utils' Debian package is version 9.5.0-1. The 'drbd.ko'
module is version 8.4.10. Kernel is 4.19.208-1 installed via package
'linux-image-4.19.0-18-amd64'. The config as shown via 'drbdadm dump' is
available at Pastebin: https://pastebin.com/raw/b122wQU9.

The DRBD cluster is used as a block device for a ZFS zpool where ZFS
itself is version '2.0.3-9~bpo10+1'.

Systems monitoring suggests that the issue occurs when disk load,
measured as I/O wait time, is higher than usual. Since we've seen this
situation only twice so far, that's not much of a pattern yet. Although
disk load seems to be a factor, none of the other virtual machine
tenants on the same hypervisor and disk array are facing issues. The
underlying disks are an SSD-based RAID 10 array of 4 disks total, which
are not exhibiting suspicious behavior or metrics. Does anyone have any
pointers as to what might be going on here?

Google suggests RAM might be an issue, however, in both instances when
this happened the node in question had about 15 GiB of free RAM out of a
total of 48 GiB.

Just for fun we're thinking about testing a Debian 11 backports kernel
but don't have any concrete direction to go in.

Any and all hints are greatly appreciated, thanks!
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: DRBD 8.4 two-node primary locks up, did not send a P_BARRIER
On 23/11/2021 16:49, Sven-Erik Neve wrote:
> The 'drbd.ko'
> module is version 8.4.10.

That is severely outdated. The latest is 8.4.11, and even that is
several years old:
https://lists.linbit.com/pipermail/drbd-user/2018-April/024061.html

Trevor
