Mailing List Archive

300% latency difference between protocol A and B with NVMe
Hi,

I have a fairly simple and straightforward setup on which I'm testing and
benchmarking DRBD 9 under Ubuntu 20.04.

Using DKMS and the PPAs I compiled DRBD 9.0.25-1 for Ubuntu 20.04 and
started testing.

My setup (2x):

- SuperMicro 1U machine
- AMD Epyc 7302P 16-core
- 128GB Memory
- 10x Samsung PM983 in RAID-10
- Mellanox ConnectX-5 25Gbit interconnect

The 10 NVMe drives are in a software RAID-10 array managed by mdadm.
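
That is, roughly along these lines (the device names here are illustrative,
not taken from the actual setup):

$ mdadm --create /dev/md0 --level=10 \
    --raid-devices=10 /dev/nvme[0-9]n1   # ten NVMe namespaces, RAID-10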

My benchmark is focused on latency, not on throughput. I tested this
with fio:

$ fio --name=rw_io_1_4k --ioengine=libaio --rw=randwrite \
--bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --direct=1

I tested on the md0 device and on DRBD with protocols A and C (the DRBD
invocation is sketched after the results). The results are as follows:

- md0: 32,200 IOPS
- Protocol A: 30,200 IOPS
- Protocol C: 11,000 IOPS
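
For the DRBD runs the same fio job was simply pointed at the DRBD device
instead of md0; the device path here is just an example:

$ fio --name=rw_io_1_4k --filename=/dev/drbd0 --ioengine=libaio \
  --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 \
  --runtime=60 --direct=1   # /dev/drbd0: example device path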

The network between the two nodes is a direct LACP 2x25Gbit connection
with a 50cm DAC cable, which is about the lowest latency you can get on
Ethernet at the moment.

To me it seems obvious the TCP/IP stack or Ethernet is the problem here,
but I can't pinpoint what is causing such a massive drop.

The latency between the nodes is 0.150 ms for an 8192-byte ping, which
seems very reasonable.

Is this to be expected or is there something wrong here?

Wido
Re: 300% latency difference between protocol A and B with NVMe
On 23/11/2020 16:35, Wido den Hollander wrote:
> [...]
>
> The latency between the nodes is 0.150 ms for an 8192-byte ping, which
> seems very reasonable.

I also tested with qperf to measure the TCP latency and bandwidth:

tcp_lat:
latency = 13.3 us
tcp_bw:
bw = 3.08 GB/sec
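
For reference, the invocation was roughly the following, with qperf started
without arguments (as the server) on the peer; the address is a placeholder:

$ qperf <peer-ip> tcp_lat tcp_bw   # <peer-ip>: the other node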

Looking at those values, the network seems to perform well, but shouldn't
this be good enough to avoid such a big performance impact when writing?

Wido

Re: 300% latency difference between protocol A and B with NVMe
Hi Wido,

These results are not too surprising. Consider the steps involved in a
protocol C write. Note that tcp_lat is the one-way latency, so we get:

Send data to peer: 13.3 us (perhaps more, if qperf was testing with a
size less than 4K)
Write on peer: 1s / 32200 == 31.1 us
Confirmation of write from peer: 13.3 us

Total: 13.3 us + 31.1 us + 13.3 us == 57.7 us

IOPS: 1s / 57.7 us == 17300
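
As a quick check of the arithmetic (the 17300 above is just this figure
rounded):

$ python3 -c 'print(round(1e6 / (13.3 + 31.1 + 13.3)))'
17331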

DRBD achieved 11,000 IOPS, about 63% of the theoretical maximum, so not
all that far off. I would test latency with qperf for 4K messages too;
perhaps DRBD is even closer to the maximum.
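
Something like this should show it; the message-size option here is from
memory of the qperf man page, so treat it as a sketch:

$ qperf -m 4k <peer-ip> tcp_lat   # -m / --msg_size sets the message size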

To improve this you could try disabling LACP, using the disk directly
instead of in RAID, pinning DRBD and fio threads to the same core,
adjusting the interrupt affinities... Anything that simplifies the
process might help a little, but I would be surprised if you get it
much faster.
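
For the pinning, something as simple as the following is worth a try (the
core number is arbitrary; the DRBD threads and the NIC interrupts would
need to be pinned separately):

$ taskset -c 2 fio --name=rw_io_1_4k --ioengine=libaio --rw=randwrite \
  --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --direct=1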

Best regards,
Joel

Re: 300% latency difference between protocol A and B with NVMe
On 24/11/2020 11:21, Joel Colledge wrote:
> [...]
>
> DRBD achieved 11,000 IOPS, about 63% of the theoretical maximum, so not
> all that far off. I would test latency with qperf for 4K messages too;
> perhaps DRBD is even closer to the maximum.
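
Here is what qperf reports for larger message sizes, gathered with roughly
the following (the loop option is quoted from memory of the qperf man page,
and the peer address is a placeholder):

$ qperf <peer-ip> -oo msg_size:4K:64K:*2 tcp_lat   # 4K, 8K, ... 64K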

tcp_lat:
latency = 18 us
msg_size = 4 KB
tcp_lat:
latency = 21.9 us
msg_size = 8 KB
tcp_lat:
latency = 27.3 us
msg_size = 16 KB
tcp_lat:
latency = 48.4 us
msg_size = 32 KB
tcp_lat:
latency = 77.4 us
msg_size = 64 KB

So the total would be about 67.1 us instead of 57.7 us, which puts the
theoretical maximum at roughly 14,900 IOPS.

Interesting information! Would RDMA make a huge difference here?
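
From what I understand of the DRBD 9 documentation, trying RDMA would mainly
be a matter of installing the separate RDMA transport module and selecting
it in the resource's net section. A sketch only, untested here, with an
illustrative resource name:

resource r0 {
    net {
        transport rdma;   # default is tcp
    }
    # ... rest of the resource configuration unchanged
}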

>
> To improve this you could try disabling LACP, using the disk directly
> instead of in RAID, pinning DRBD and fio threads to the same core,
> adjusting the interrupt affinities... Anything that simplifies the
> process might help a little, but I would be surprised if you get it
> much faster.

I disabled LACP in one of my tests, but that didn't change much. I'll
probably use the second NIC just for Pacemaker heartbeats; a single
25Gbit link is sufficient for the replication.

Without RAID we wouldn't have any redundancy inside the boxes.

I did notice that a single PM983 maxes out at 37k 4K write IOPS, so
RAID-10 won't make it much faster. Or do you have any other suggestions?

Wido

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user