Mailing List Archive

drbd 9.1.1 whole cluster blocked
I'm running a Proxmox cluster with 3 disk nodes and 3 diskless nodes
with drbd 9.1.1. The disk nodes have storage on md RAID6 (8 SSDs each)
with a journal on an Optane device.

Yesterday, the whole cluster was severely impacted when one node had
write problems. There was no indication of any hardware problem, no
events whatsoever. What happened, taken from the logs:

- One diskless node reports "sending time expired" for some devices on a
specific disk node. After 30 seconds, it disconnects those devices on
that node.
- The disk node logs a state change to Outdated.
- After 80s, the disk node logs "task blocked for more than 120
seconds" (the kernel's hung-task watchdog; see the note after this list).
The blocked tasks are the 8 drbd_r_xxx receiver processes, but also md2_reclaim.
- No more logging after that.
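
For reference, the "blocked for more than 120 seconds" messages come from
the kernel's hung-task watchdog (khungtaskd), whose interval defaults to
120 seconds. A minimal check, assuming a standard kernel with the watchdog
enabled:

    # Show the hung-task watchdog interval (120 s by default); the messages
    # above appear when a task sits in uninterruptible sleep longer than this.
    cat /proc/sys/kernel/hung_task_timeout_secs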

After that, the whole cluster was severely impacted, with most VMs
unresponsive. The node hosts were still accessible, but there was no more
kernel logging.

After analyzing the situation and assuming that this single node was
blocking everything, that node was rebooted (no normal reboot was possible;
it needed "echo b >/proc/sysrq-trigger"). This did help, and everything
went back to normal.

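As an aside, a rough sketch of the emergency-reboot sequence used here,
assuming the magic SysRq key is not already enabled on the host:

    # Enable the magic SysRq key (may already be on via sysctl kernel.sysrq)
    echo 1 > /proc/sys/kernel/sysrq
    # Try to flush dirty buffers first; this can hang if the storage stack is stuck
    echo s > /proc/sysrq-trigger
    # Force an immediate reboot without syncing or unmounting filesystems
    echo b > /proc/sysrq-trigger
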
So apparently there are situations in which a backing-storage problem can
block all DRBD processing in a way that prevents normal timeout detection,
and thus the subsequent disconnection, on the other nodes. Reading the 9.1.2
release notes, this doesn't seem to be addressed there.

Regards,
Andreas

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: drbd 9.1.1 whole cluster blocked [ In reply to ]
Could still be related to this fix:

* fix timeout detection after idle periods and for configs with ko-count
  when a disk on a secondary stops delivering IO-completion events

So if you have ko-count set, this fix should cover your case.
Or it is something completely different... ;)
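
For illustration, ko-count (and the related request timeout) are set in the
net section of a resource; a minimal sketch with a hypothetical resource
name "r0", spelling out the DRBD 9 defaults explicitly:

    # /etc/drbd.d/r0.res (hypothetical resource name)
    resource r0 {
        net {
            ko-count 7;    # DRBD 9 default; 0 would disable the mechanism
            timeout  60;   # in units of 0.1 s, i.e. 6 seconds
        }
    }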

Cheers,
Rene

Re: drbd 9.1.1 whole cluster blocked [ In reply to ]
No ko-count set, so apparently something different...


Re: drbd 9.1.1 whole cluster blocked [ In reply to ]
> No ko-count set, so apparently something different...

ko-count is enabled by default (with value "7"). Have you explicitly
disabled it? Your description does sound very similar to the issue that
has been fixed, as Rene mentioned.
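
One way to double-check the effective value, sketched with a hypothetical
resource name "r0" (--show-defaults also prints options that are only in
effect as built-in defaults):

    # Show the effective net options for the resource, including defaults
    drbdsetup show r0 --show-defaults | grep -E 'ko-count|timeout'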

Regards,
Joel
Re: drbd 9.1.1 whole cluster blocked [ In reply to ]
On 27.05.21 at 17:55, Joel Colledge wrote:
>> No ko-count set, so apparently something different...
>
> ko-count is enabled by default (with value "7"). Have you explicitly
> disabled it? Your description does sound very similar to the issue
> that has been fixed as Rene mentioned.

I can confirm that ko-count is 7; I've seen 6 retries with ko=6..1 logged.

Will upgrade soon.
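
For completeness, a quick way to confirm the running module version after
the upgrade, assuming the module's /proc interface is available:

    # The loaded DRBD kernel module reports its version here
    cat /proc/drbd
    # Alternatively, query the module metadata directly
    modinfo -F version drbd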

Regards,
Andreas