Mailing List Archive

Tail drop on EX3400
Hi

Been going back and forth with JTAC on an output drop issue. I have two
EX3400-48T switches in a VC configuration, and some ports report tail drops
from time to time. Lately, one customer pulling at most 80 Mbps on a 1 Gbps
connection has been generating some level of drops all day long. No
complaints yet, but enough for monitoring to trigger.

The suggestion from TAC is to configure class-of-service shared-buffer
percent 100. I'm used to QFX5k switches, which let you clearly see the
allocation and change it a bit, but this really isn't clear to me when I
run show system buffers, show interfaces queue, etc.

I've been trying to ask: what's the default percentage allocation? Where
does the rest of the unclaimed percentage live while it's not explicitly
called for? Does calling out 100% take it away from ingress or some other
important queue? Is round robin possible by configuring class-of-service?
Would it avoid dedicating 100% of something to just one direction?

I've asked all of those questions but I can't seem to get a clear answer.

Thanks for your help.
-----
Philippe Girard
_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
Re: Tail drop on EX3400 [ In reply to ]
On May 28, 2019, at 10:17 PM, Philippe Girard <philippe@skyhook.ca> wrote:
>
> I've asked all of those questions but I can't seem to get a clear answer.

One additional question: what is upstream from the 1g interface that's showing drops? Is it 10g (or larger)?

We have several small buildings that we're feeding 1g to from an EX4200-24F. However, the uplink to our core is 10g, so there's a speed mismatch between how fast packets can arrive in the distribution switch and how fast they can drain. We'll frequently see tail drops on egress when a burst of packets comes in, until TCP does its thing and the bandwidth levels out.

If you look at the bandwidth graphs, we're rarely over a hundred Mb/s, so it looks like the interface isn't maxing out. However, the packets arrive much more quickly than they can leave, so there's a bottleneck there.

There really isn't any clever way around it; I think those switches have 12MB of buffer (or is that the QFX?). Anyway, if you do the math you quickly find out that works out to like 10ms of traffic, so the switch simply can't buffer even short amounts of mismatched speed traffic no matter what you do with the buffers. And at 10ms, most monitoring software simply doesn't have the resolution to catch those bursts.

As you noted, the end user often doesn't notice. However, it might help explain why you're seeing loss even at low rates, yet it doesn't appear to adversely affect traffic.

Jason
Re: Tail drop on EX3400 [ In reply to ]
On Thu, 30 May 2019 at 04:06, Jason Healy <jhealy@logn.net> wrote:

> There really isn't any clever way around it; I think those switches have 12MB of buffer (or is that the QFX?). Anyway, if you do the math you quickly find out that works out to like 10ms of traffic, so the switch simply can't buffer even short amounts of mismatched speed traffic no matter what you do with the buffers. And at 10ms, most monitoring software simply doesn't have the resolution to catch those bursts.

12MB / 1Gbps == 96ms. That would be massive buffer.

But the EX4200 actually has 2.5MB, so 2.5MB / 1Gbps == 20ms when all of
the buffer is used exclusively by a single 1GE port.
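
The two figures above can be reproduced with a minimal sketch. It assumes
decimal megabytes (1MB = 1,000,000 bytes), which is what the 96ms and 20ms
results imply, and that the whole buffer is available to one egress port.
The function name drain_time_ms is illustrative only.

```python
MB = 1_000_000          # decimal megabytes, matching the thread's arithmetic
GBPS = 1_000_000_000    # bits per second


def drain_time_ms(buffer_bytes: float, egress_bps: float) -> float:
    """Time for a full buffer to drain at the egress line rate, in ms."""
    return buffer_bytes * 8 / egress_bps * 1000


if __name__ == "__main__":
    for label, size in (("12MB", 12 * MB), ("2.5MB", 2.5 * MB)):
        print(f"{label} at 1Gbps: {drain_time_ms(size, GBPS):.0f} ms")
```

The same function shows how quickly the figure shrinks if only part of the
buffer is usable by the port, which is the point raised further down.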

Meaning your RTT between SRC and DST needs to be <40ms or so for a single
stream to grow its TCP window exponentially from 500Mbps to 1Gbps when no
other traffic is flowing through the switch. After the TCP window has
grown, the buffer stress is gone: a typical sender TCP implementation
floods packets as fast as it can while the window grows, but in steady
state it paces to the RX rate, so almost no buffer stress exists then.
With a TCP algorithm like BBR, packets are always paced, so even during
window growth no significant buffer stress would occur. It is also
possible to configure Linux tc so that packets are paced to the RX rate
regardless of the TCP algorithm, which would also alleviate buffering in
transit.
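
One possible reading of the <40ms figure, sketched under stated
assumptions: when the window doubles, the extra window's worth of data is
sent as a burst, and that burst has to fit in the buffer. The extra data is
(new rate - old rate) x RTT, so the largest RTT the buffer can absorb is
buffer / (new rate - old rate). This is a simplified model, not a
description of any particular TCP stack, and max_rtt_ms is an illustrative
name.

```python
MB = 1_000_000          # decimal megabytes, as in the thread
MBPS = 1_000_000        # bits per second


def max_rtt_ms(buffer_bytes: float, rate_before_bps: float,
               rate_after_bps: float) -> float:
    """Largest RTT (ms) for which the extra window fits in the buffer.

    Assumes the window growth is sent as one burst and the buffer starts
    empty and is fully available to the egress port.
    """
    extra_bps = rate_after_bps - rate_before_bps
    if extra_bps <= 0:
        raise ValueError("rate_after_bps must exceed rate_before_bps")
    return buffer_bytes * 8 / extra_bps * 1000


if __name__ == "__main__":
    rtt = max_rtt_ms(2.5 * MB, 500 * MBPS, 1000 * MBPS)
    print(f"2.5MB buffer, 500Mbps -> 1Gbps: RTT up to {rtt:.0f} ms")
```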


However, as the OP said, I don't think all of this buffer is actually
available to a single port by default, so the situation is even worse. The
right config will definitely help the situation.



--
++ytti
Re: Tail drop on EX3400 [ In reply to ]
On May 30, 2019, at 2:23 AM, Saku Ytti <saku@ytti.fi> wrote:
>
> 12MB / 1Gbps == 96ms. That would be massive buffer.

Not if you're Arista... ;-)

You're correct that it's 96ms for the 1Gbps side, but if packets are arriving at 10Gbps then that's only 9.6ms (ish) before you run out of buffer. It's the mismatch in speed more than the actual buffer itself (assuming we're talking about megabytes of buffer, not gigabytes).

For steady state at a rate less than 1Gbps, the switch has enough buffer to handle the packets in flight. However, if packets arrive in microbursts then you can exceed the buffer briefly even though the amount of traffic is low on a larger timescale. 15MB of traffic evenly spread out over one second is not an issue, but 15MB of traffic arriving at 10Gbps at the start of a second, even with the rest of the second unused, is enough to overflow a buffer. Both rates are "15MB/s", but the arrival rate makes a huge difference.
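
A rough fluid-model sketch of the 15MB example, under assumptions that are
not from the thread: the burst arrives back to back at the ingress rate,
the buffer starts empty, and the whole buffer belongs to the one egress
port. Unlike the 9.6ms estimate above, this version also credits the
packets that drain at 1Gbps while the burst is arriving. Function names are
illustrative.

```python
import math

MB = 1_000_000          # decimal megabytes, as in the thread
GBPS = 1_000_000_000    # bits per second


def time_to_overflow_ms(buffer_bytes: float, ingress_bps: float,
                        egress_bps: float) -> float:
    """Milliseconds until an empty buffer fills; inf if it never does."""
    if ingress_bps <= egress_bps:
        return math.inf
    return buffer_bytes * 8 / (ingress_bps - egress_bps) * 1000


def burst_drop_bytes(burst_bytes: float, buffer_bytes: float,
                     ingress_bps: float, egress_bps: float) -> float:
    """Bytes tail-dropped from one back-to-back burst into an empty buffer."""
    burst_seconds = burst_bytes * 8 / ingress_bps
    drained = egress_bps * burst_seconds / 8   # bytes sent while burst arrives
    return max(0.0, burst_bytes - drained - buffer_bytes)


if __name__ == "__main__":
    for size in (12 * MB, 2.5 * MB):
        fill = time_to_overflow_ms(size, 10 * GBPS, 1 * GBPS)
        lost = burst_drop_bytes(15 * MB, size, 10 * GBPS, 1 * GBPS)
        print(f"buffer {size / MB:g}MB: full after {fill:.2f} ms, "
              f"{lost / MB:.1f}MB of a 15MB burst dropped")
```

The same 15MB spread evenly over one second is 120Mbps, below the 1Gbps
drain rate, so in this model the buffer never fills at all.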

I've certainly seen tail drops on interfaces in bursts like this where it quiets down very quickly, but is enough to trip monitoring alarms. We've maxed out the buffer configs on specific ports and haven't been able to eliminate the issue (not sure if it's reduced, as it's relatively infrequent).

Jason
Re: Tail drop on EX3400 [ In reply to ]
Thanks everyone for your input, very interesting.

Reality is, we have ~300Mbps coming in on the 10G port and ~50-100Mbps per
customer port at peak, which really isn't that much.

Also, although tweaking TCP is nice, I can hardly go to each customer of
mine telling them to augment their TCP window settings because they're
triggering monitoring on my side.

I'm still very much unsatisfied with what JTAC told me. Even with all my
very precise questions, I get few detailed answers and am told to increase
buffer allocation to 100% in CoS. I'll give that a shot after hours today.

One thing that bugs me, though: looking at the default schedulers, only
75% of buffer space is allocated to best-effort traffic, so I'd still get
only 3/4 of the actual 100% allocation. And since I'm not getting the
answers I want, I don't know what the default allocation is vs. the 100%
they want me to configure, so I have no idea what actual increase this
will generate.
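
To make the unknown concrete, here is a small sketch of that reasoning. It
assumes the best-effort share is simply the shared-buffer percentage
multiplied by the scheduler's 75%, which is the reading above and not
documented behavior. The candidate default values (25, 50, 75) are purely
hypothetical placeholders, since the real default is exactly what is
unknown here.

```python
def best_effort_share(shared_percent: float,
                      scheduler_percent: float = 75.0) -> float:
    """Fraction of the shared pool usable by best effort (assumed model)."""
    return (shared_percent / 100) * (scheduler_percent / 100)


if __name__ == "__main__":
    after = best_effort_share(100)
    # Hypothetical defaults only; the actual value is not known from the thread.
    for assumed_default in (25, 50, 75):
        before = best_effort_share(assumed_default)
        print(f"assumed default {assumed_default}%: "
              f"{before:.4f} -> {after:.4f} of the pool "
              f"({after / before:.2f}x)")
```

Whatever the default turns out to be, the 75% scheduler share caps best
effort at 0.75 of the pool in this model, which is why a custom scheduler
comes up next.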

My guess is that I'll have to create a custom scheduler and apply to
interfaces to be able to have all that buffer space available for basic
Internet...

-----
Philippe Girard


Re: Tail drop on EX3400 [ In reply to ]
The 96ms was based on your proposal that the EX4200 has 12MB, which it
does not; it has 2.5MB, so it's 20ms @ 1Gbps.

If we're talking about an uncongested device, then the worst case is 10GE
growth to reach 1Gbps. The only reason to queue at the 10GE port is
congestion, that is, multiple interfaces wanting access to the 10GE
egress.





--
++ytti
Re: Tail drop on EX3400 [ In reply to ]
> My guess is that I'll have to create a custom scheduler and apply to
> interfaces to be able to have all that buffer space available for basic
> Internet...

This is actually something we've needed to do on a lot of 1RU switches
for 20 years. Catalysts, at least since the 3550, have not shipped with a
QoS config where all of the buffer can be used out of the box. It is
rather odd that the default assumes multiclass QoS, while in reality the
vast majority have a single class and microburst avoidance is the key.

--
++ytti