Mailing List Archive

Varnish and TCP Incast Throughput Collapse
I've been using Varnish in an "intranet" application. The picture is
roughly:

origin <-> Varnish <-- 10G channel ---> switch <-- 1G channel --> client

The machine running Varnish is a high-performance server. It can
easily saturate a 10Gbit channel. The machine running the client is a
more modest desktop workstation, but it's fully capable of saturating
a 1Gbit channel.

The client makes HTTP requests for objects of size 128kB.

When the client makes those requests serially, "useful" data is
transferred at about 80% of the channel bandwidth of the Gigabit
link, which seems perfectly reasonable.

But when the client makes the requests in parallel (typically
4-at-a-time, but it can vary), *total* throughput drops to about 25%
of the channel bandwidth, i.e., about 30Mbyte/sec.

After looking at traces and doing a fair amount of experimentation, we
have reached the tentative conclusion that we're seeing "TCP Incast
Throughput Collapse" (see references below).

The literature on "TCP Incast Throughput Collapse" typically describes
scenarios where a large number of servers overwhelm a single inbound
port. I haven't found any discussion of incast collapse with only one
server, but it seems like a natural consequence of a 10Gigabit-capable
server feeding a 1-Gigabit downlink.

Has anybody else seen anything similar, with Varnish or other single
servers on 10Gbit-to-1Gbit links?

The literature offers a variety of mitigation strategies, but there are
non-trivial tradeoffs and none appears to be a silver bullet.

If anyone has seen TCP Incast Collapse with Varnish, were you able to work
around it, and if so, how?

Thanks,
John Salmon

References:

http://www.pdl.cmu.edu/Incast/

Annotated Bibliography in:
https://lists.freebsd.org/pipermail/freebsd-net/2015-November/043926.html

--
*.*
Re: Varnish and TCP Incast Throughput Collapse
Out of curiosity, what does ethtool show for the related NICs on both
servers? I also have Varnish on a 10G server, and can reach around
7.7Gbit/s serving anywhere between 6k and 28k requests/second; however, it
did take some sysctl tuning and the Westwood TCP congestion-control algorithm.
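Aside: the sysctl route sets the congestion-control algorithm system-wide, but on Linux it can also be selected per socket, which makes A/B testing easier. A minimal sketch, assuming a Linux kernel ("reno" and "cubic" are built in; Westwood usually requires the tcp_westwood module to be loaded first):

```python
import socket

def socket_with_cc(algo: str) -> socket.socket:
    """Create a TCP socket pinned to a named congestion-control algorithm.

    Linux-only: the algorithm must appear in
    /proc/sys/net/ipv4/tcp_available_congestion_control.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, algo.encode())
    return s

# "reno" is always compiled into the kernel, so it is safe for a smoke test;
# substitute "westwood" once the module is available.
s = socket_with_cc("reno")
algo = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print(algo.split(b"\x00", 1)[0].decode())
s.close()
```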

On Wed, Jul 5, 2017 at 3:09 PM, John Salmon <John.Salmon@deshawresearch.com>
wrote:

Re: Varnish and TCP Incast Throughput Collapse
Two things: do you get the same results when the client is directly on the
Varnish server? (i.e., not going through the switch) And is each new request
opening a new connection?

--
Guillaume Quintard

On Thu, Jul 6, 2017 at 6:45 AM, Andrei <lagged@gmail.com> wrote:

Re: Varnish and TCP Incast Throughput Collapse
> If anyone has seen TCP Incast Collapse with Varnish, were you able to work
> around it, and if so, how?

I don't know, but maybe this could help:

https://github.com/varnish/varnish-modules/blob/master/docs/vmod_tcp.rst#vmod_tcp

Dridi

_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Varnish and TCP Incast Throughput Collapse
Thanks for your suggestions.

One more detail I didn't mention: roughly speaking, the client is
doing "read ahead", but it only reads ahead by a limited amount (about 4
blocks, each of 128KiB). The surprising behavior is that when four
readahead threads are allowed to run concurrently, their aggregate
throughput is much lower than when all the readaheads are serialized
through a single thread.
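To make the access pattern concrete, here is a minimal stand-in for the readahead client (a sketch against a local stub server, not the real client or Varnish): it fetches four 128KiB blocks first serially, then with four concurrent threads. Over loopback both paths behave identically; the collapse only appears across the 10G-to-1G hop.

```python
import http.server
import socketserver
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

BLOCK = 128 * 1024  # 128 KiB, the object size described above

class BlockHandler(http.server.BaseHTTPRequestHandler):
    """Serves a fixed 128 KiB body for any GET, standing in for the cache."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(BLOCK))
        self.end_headers()
        self.wfile.write(b"x" * BLOCK)
    def log_message(self, *args):
        pass  # keep the demo quiet

server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), BlockHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def fetch(i):
    # One "readahead" request for block i; returns the number of bytes read.
    with urlopen(f"http://127.0.0.1:{port}/block/{i}") as resp:
        return len(resp.read())

# Serial case: a single thread issues the four reads back to back.
serial = [fetch(i) for i in range(4)]

# Parallel case: four readahead threads run concurrently (the slow case above).
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(fetch, range(4)))

server.shutdown()
print(sum(serial), sum(parallel))
```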

Traces (with strace and/or tcpdump) show frequent stalls of roughly
200ms during which nothing moves across the channel and all client-side
system calls are waiting. 200ms is suspiciously close to the Linux
default 'rto_min', which was the first thing that led me to suspect
TCP incast collapse. We get some improvement by reducing rto_min on the
server, and some improvement by reducing SO_RCVBUF on the
client. But as I said, both have tradeoffs, so I'm interested to hear
whether anyone else has encountered or overcome this particular problem.
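For concreteness, the client-side SO_RCVBUF mitigation looks roughly like this; the 64KiB cap is purely illustrative, and it must be applied before connect() so the negotiated window scale reflects it:

```python
import socket

RCVBUF = 64 * 1024  # illustrative cap, not a tuned recommendation

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Must happen before connect(): the TCP window scale is fixed at SYN time.
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, RCVBUF)
# Linux doubles the requested value to account for bookkeeping overhead,
# so this reports roughly 2 * RCVBUF there.
effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(effective)
s.close()
```

Capping the receive buffer limits the window the client can advertise, so the server can never have more data in flight than the switch's 1G port buffer can absorb; the tradeoff is reduced single-stream throughput on higher-RTT paths.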

I do not see the dropoff from single-thread to multi-thread when I run
client and server on the same host, i.e., I get around 500MB/s with one
client and roughly the same total bandwidth with multiple clients. I'm
sure that with some tuning the 500MB/s could be improved, but that's
not the issue here.

Here are the ethtool reports:

On the client:

drdws0134$ ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: on (auto)
Cannot get wake-on-lan settings: Operation not permitted
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes
drdws0134$

On the server:

$ ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   1000baseT/Full
                                10000baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        MDI-X: Unknown
Cannot get wake-on-lan settings: Operation not permitted
Cannot get link status: Operation not permitted
$


On 07/06/2017 03:08 AM, Guillaume Quintard wrote:

--
*.*
Re: Varnish and TCP Incast Throughput Collapse
I'm having trouble understanding the concept of readahead in an HTTP
context.

You are using the malloc cache storage, right?

--
Guillaume Quintard

On Thu, Jul 6, 2017 at 7:15 PM, John Salmon <John.Salmon@deshawresearch.com>
wrote:

Re: Varnish and TCP Incast Throughput Collapse
Could you add another switch and use bonded interfaces? If you think the
switch can't handle the load, that may help.
