Mailing List Archive

rdma: setting resource 'down' fails
Hi all,

We have a 3-node setup with DRBD 9.2.5 on InfiniBand-connected servers, so I tried 'drbd_transport_rdma'.

It seemed to work, but when I try to take down the resource on one of the servers, 'drbdadm down res' does not return for 10 minutes, at which point it times out with
> Command 'drbdsetup down zvol0' did not terminate within 600 seconds.

Afterwards, two processes remain stuck in 'D' state (uninterruptible sleep): 'drbdsetup status res' and 'drbdsetup down res'.
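
In case it helps, something like this (as root) should show where they hang - the PID is of course just a placeholder:

  # show the stuck drbdsetup processes and the kernel function they are waiting in
  ps axo pid,stat,wchan:32,cmd | grep '[d]rbdsetup'
  # kernel stack of the stuck 'drbdsetup down' process
  cat /proc/<pid>/stack
  # or, if sysrq is enabled, dump all blocked tasks to the kernel log
  echo w > /proc/sysrq-trigger && dmesg | tail -n 100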


As an additional goodie, the three resources are zvols - ZFS block devices. Of course I have no idea how that might interfere with the transport mode.

And of course, the same setup / command works when using the tcp module instead.
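
For reference, the only thing I change between the two cases is the transport setting in the net section; the resource config looks roughly like this (host names, devices and addresses are placeholders):

  resource res {
    net {
      transport rdma;   # with "tcp" here (or without the line) everything behaves as expected
    }
    volume 0 {
      device    /dev/drbd0;
      disk      /dev/zvol/tank/zvol0;
      meta-disk internal;
    }
    on nodeA { node-id 0; address 10.0.0.1:7789; }
    on nodeB { node-id 1; address 10.0.0.2:7789; }
    on nodeC { node-id 2; address 10.0.0.3:7789; }
    connection-mesh { hosts nodeA nodeB nodeC; }
  }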

Anything obvious to look for?


Best regards
Thomas
--
------------------------------------------------------------------------------------------------
Thomas Roth
HPC Department

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, http://www.gsi.de/

Gesellschaft mit beschränkter Haftung / Limited Liability Company

Sitz der Gesellschaft / Registered Office: Darmstadt
Handelsregister / Commercial Register:
Amtsgericht Darmstadt, HRB 1528

Geschäftsführung / Managing Directors:
Professor Dr. Paolo Giubellino, Ursula Weyrich, Jörg Blaurock

Vorsitzender des GSI-Aufsichtsrates /
Chairman of the Supervisory Board:
Staatssekretär / State Secretary Dr. Georg Schütte

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: rdma: setting resource 'down' fails
No use. All variants invariably lead to 'sticky' resources:
- Two-way setup
- Single disk instead of zvol to keep ZFS out of the picture
- protocol A/C
- various sndbuf-size, rcvbuf-size, max-buffers

And every parameter change requires a server reset, so testing all of
this takes ages.

Only starting without the "transport rdma" leads to the expected
behavior - "drbdadm down res" is instantaneous.
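
I.e., with the plain tcp transport the very same command comes back right away:

  time drbdadm down res   # returns immediately here; with "transport rdma" it hangs until the 600 s timeout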



The sndbuf/rcvbuf/max-buffers variations seemed necessary because I was
also getting a lot of "Not sending flow_control mgs, no receive window!" messages.
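
Those variations were all along these lines (the values are only examples, not a recommendation):

  net {
    transport    rdma;
    max-buffers  8000;   # counted in pages; per the docs it should be big enough to hold rcvbuf-size
    rcvbuf-size  1M;
    sndbuf-size  1M;
  }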

But with the tedious server resets it is no fun trying to figure out
what counts as 'big enough to hold all rcvbuf-size' in our case (8000
pages is roughly 32 MB, isn't it? A 1M rcvbuf should fit ...).
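
Quick sanity check of that arithmetic, assuming 4 KiB pages:

  echo $((8000 * 4096))   # 32768000 bytes = 31.25 MiB, so "roughly 32 MB" checks out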


Cheers
Thomas
