
Unable to recover from storage node failure in quorum scenario without node reboot
Hello all,

I'm currently in the process of evaluating DRBD as a possible storage solution for a hypervisor cluster.
Right now, I seem to have hit a roadblock that's possibly due to a misunderstanding on my part, or possibly a bug.

I can't seem to restore a DRBD volume once all storage nodes have failed, even after all nodes are back up and "UpToDate."
This goes as far as having to reboot the entire primary node for volumes that were mounted before the failure.
Even unmounting and re-mounting the volume on the primary node is "broken".

Normally, I would expect this to be solvable on-line without rebooting any hosts.

What am I missing here?
If you require any more information, I'll provide it, of course.

Thanks in advance.

Sincerely
Thomas Keppler

A little background to where I am coming from
----------------------------------------------

In my current setup, I work with pairs of nodes, each running as both a storage and a compute (hypervisor) node.
For future scalability, I want to reduce the dependencies in this scenario and be able to have "compute-only" nodes that have no storage at all.
As far as I understand, "diskless nodes" are the concept I need here, in conjunction with "tiebreakers" and the DRBD9 quorum system.

In order for my concept to work, I need to be able to recover from all kinds of failure conditions without rebooting physical hosts (or at least in most cases without doing so).

Steps I've taken to produce this failure case
---------------------------------------------

storage1 - First storage server
storage2 - Second storage server
compute1 - Imaginary hypervisor node, diskless DRBD node

Preparation:

1.) Create a "test" DRBD volume with the device name "drbd_test" on storage1 and storage2.
2.) Format the volume with ext4.
3.) Promote compute1 to be the primary node.
4.) Mount "drbd_test" as "/mnt/test" on compute1 (a rough command sketch for these preparation steps follows after the script below).
5.) Put this script on the fresh DRBD volume:

> cat write.sh
>> #!/bin/bash
>> clear
>> while true; do
>>     date -Iseconds >> dates.txt
>>     sync
>>     tail -n 1 dates.txt
>>     sleep 1
>> done
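
For reference, the preparation above corresponds to roughly the following commands. This is only a sketch: the /dev/drbd0 device path is assumed from "minor 0" in the resource file further down, and the initial sync and formatting are done on storage1 here rather than strictly in the order listed above.

> # on storage1 and storage2: create the metadata and bring the resource up
> drbdadm create-md test
> drbdadm up test
> # on storage1 only: force the initial sync, then format and demote again
> drbdadm primary --force test
> mkfs.ext4 /dev/drbd0
> drbdadm secondary test
> # on compute1 (diskless): bring the resource up, promote and mount
> drbdadm up test
> drbdadm primary test
> mount /dev/drbd0 /mnt/test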

What works as expected:

1.) Run the script on compute1 (INSIDE the DRBD volume of course).
2.) Power off storage2.

After that, the write "stutters" for a second or so but continues as desired.

3.) Power off storage1.

Of course, this is expected. There is nothing to write to and I/O is suspended.
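
While doing this, the state transitions can be watched from a separate terminal with the standard status tools (the resource name "test" is the one from the config below):

> drbdadm status test      # current role and disk state as seen from this node
> drbdsetup events2 test   # continuously prints state-change events during the power-off/on steps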

What doesn't work as expected:

4.) Power on storage1: This was the last node that "failed" and therefore is "UpToDate."

After some time has passed, it recognizes itself as "UpToDate", but "compute1" still cannot write.

5.) Power on storage2: After a small sync, I would think everything should be fine again.

The real result, though, is that the volume now fails directly after it becomes "UpToDate" on all nodes and is re-mounted read-only.
From this state on, "compute1" is unable to re-mount the volume or unmount it ("target is busy" error).

To get everything going again, I have to reboot compute1.
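
For completeness, this is roughly the on-line recovery I would expect to be possible at that point (a sketch of what I mean by "solvable on-line"):

> drbdadm resume-io test           # should lift any remaining quorum-loss I/O suspension
> mount -o remount,rw /mnt/test    # re-mounting read-write does not work from here
> fuser -vm /mnt/test              # check what keeps the mount point busy
> umount /mnt/test                 # fails with "target is busy"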


Configuration
-------------

All nodes are running the latest version of DRBD9 from the Launchpad PPA on Ubuntu 20.04 LTS:

> for HOST in storage1 storage2 compute1; do echo "=== $HOST ==="; ssh "$HOST" "cat /proc/drbd; echo"; done
>> === storage1 ===
>> version: 9.0.25-1 (api:2/proto:86-117)
>> GIT-hash: 1053e9f98123e8293e9f2897af654b40cde0d24c build by root@storage1, 2020-10-18 18:55:15
>> Transports (api:16):
>>
>> === storage2 ===
>> version: 9.0.25-1 (api:2/proto:86-117)
>> GIT-hash: 1053e9f98123e8293e9f2897af654b40cde0d24c build by root@storage2, 2020-10-18 18:56:49
>> Transports (api:16):
>>
>> === compute1 ===
>> version: 9.0.25-1 (api:2/proto:86-117)
>> GIT-hash: 1053e9f98123e8293e9f2897af654b40cde0d24c build by root@compute1, 2020-10-18 18:58:09
>> Transports (api:16):
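
(In DRBD9, /proc/drbd only carries the kernel module version; if the userland version matters as well, I can provide it, e.g. via:)

> drbdadm --version   # prints the drbd-utils and kernel module version information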

Here is the resource definition of my test volume (backed by an LV on storage1 and storage2):

> cat /etc/drbd.d/test.res
>> resource test {
>>     protocol C;
>>
>>     startup {
>>         wfc-timeout 15;
>>         degr-wfc-timeout 60;
>>     }
>>
>>     net {
>>         fencing dont-care; # Deliberately left at dont-care; fencing is not the focus of this test
>>         cram-hmac-alg sha512;
>>         shared-secret "secret";
>>     }
>>
>>     volume 0 {
>>         device drbd_test minor 0;
>>         disk /dev/vg0/test;
>>         meta-disk internal;
>>     }
>>
>>     on storage1 {
>>         address 172.16.181.128:7788;
>>         node-id 0;
>>     }
>>     on storage2 {
>>         address 172.16.181.129:7788;
>>         node-id 1;
>>     }
>>     on compute1 {
>>         address 172.16.181.130:7788;
>>         node-id 2;
>>
>>         volume 0 {
>>             disk none;
>>         }
>>     }
>>
>>     connection-mesh {
>>         hosts storage1 storage2 compute1;
>>     }
>>
>>     options {
>>         quorum majority;
>>         quorum-minimum-redundancy 1;
>>         on-no-quorum suspend-io;
>>     }
>> }
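
(If it helps, the configuration as actually loaded into the kernel, including the quorum options, can be dumped with "drbdsetup show", e.g.:)

> drbdsetup show test   # dump the running configuration of the resource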

None of the nodes runs anything beyond the standard Ubuntu configuration plus DRBD9. No hypervisor or other abstraction layer is in use here.