Mailing List Archive

Re: 400G forwarding - how does it work? [ In reply to ]
On Tue, 26 Jul 2022 at 21:39, Lawrence Wobker <ljwobker@gmail.com> wrote:
> So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 8 of these pipelines and get my performance target that way. I could also build a "pipeline" that processes multiple packets per clock, if I have one that does 2 packets/clock then I only need 4 of said pipelines... and so on and so forth.

Thanks for the response Lawrence.

The Broadcom BCM16K KBP has a clock speed of 1.2 GHz, so I expect the
J2 to have something similar (as someone already mentioned, most chips
I've seen are in the 1-1.5 GHz range). In that case "only" 2
pipelines would be needed to maintain the headline 2 Bpps rate of the
J2, or even just 1 if they have managed to squeeze two packets per
cycle out of parallelisation within the pipeline.
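
As a rough sanity check on the arithmetic in this thread, here is a tiny sketch; the clock and throughput figures are just the numbers quoted above, not vendor specifications:

```python
import math

def pipelines_needed(target_pps: float, clock_hz: float, pkts_per_clock: int = 1) -> int:
    # Each pipeline retires pkts_per_clock packets per clock cycle.
    per_pipeline_pps = clock_hz * pkts_per_clock
    return math.ceil(target_pps / per_pipeline_pps)

# Lawrence's example: 1.25 Gpps per pipeline, 10 Bpps target.
print(pipelines_needed(10e9, 1.25e9))     # 8 pipelines
print(pipelines_needed(10e9, 1.25e9, 2))  # 4 pipelines at 2 packets/clock

# J2-style numbers: 2 Bpps headline at a ~1.2 GHz clock.
print(pipelines_needed(2e9, 1.2e9))       # 2 pipelines
print(pipelines_needed(2e9, 1.2e9, 2))    # 1 pipeline
```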

Cheers,
James.
RE: 400G forwarding - how does it work? [ In reply to ]
Hey,

This is not an advertisement but an attempt to help folks better understand networking HW.

Some of you might know (and love 😊) the “between 0x2 nerds” podcast Jeff Doyle and I have been hosting for a couple of years.

Following up on this discussion, we have decided to dedicate a number of upcoming podcasts to networking HW, a topic where more information and better education is very much needed (no, you won’t have to sign an NDA before joining 😊). We have lined up a number of great guests: people who design and build ASICs and can talk firsthand about the evolution of networking HW, the complexity of the process, the differences between fixed and programmable pipelines, and memories and databases. This Thursday (08/04) at 11:00 PST we are joined by the one and only Sharada Yeluri, Sr. Director ASIC at Juniper. Other vendors will be joining in later episodes; the usual rules apply: no marketing, no BS.

More to come, stay tuned.

Live feed: https://lnkd.in/gk2x2ezZ

Between 0x2 nerds playlist, videos will be published to: https://www.youtube.com/playlist?list=PLMYH1xDLIabuZCr1Yeoo39enogPA2yJB7

Cheers,

Jeff
Re: 400G forwarding - how does it work? [ In reply to ]
Apologies for the garbage/HTMLed email, not sure what happened (thanks
Brian F for letting me know).
Anyway, the podcast with Juniper (mostly around Trio/Express) was
broadcast today and is available at
https://www.youtube.com/watch?v=1he8GjDBq9g
Next in the pipeline are:
Cisco SiliconOne
Broadcom DNX (Jericho/Qumran/Ramon)
For both, the guests are the main architects of the silicon.

Enjoy


Re: 400G forwarding - how does it work? [ In reply to ]
Thank you for this.

I wish there had been a deeper dive into the lookup side. My open questions:

a) The Trio model, where a packet stays in a single PPE until done, vs.
the FP model of a line of identical PPE cores. I don't understand the
advantages of the FP model; the Trio model's advantages are clear to me.
Obviously the FP model has to have some advantages, so what are they?

b) What exactly are the gains of putting two Trios on-package in
Trio6? There is no local switching between the WANs of Trios in a
package; they are, as far as I can tell, ships in the night, and packets
between Trios go via fabric, as they would with separate Trios. I can
understand the benefit of putting Trio and HBM2 on the same package,
to reduce distance so wattage goes down or frequency goes up.

c) What evolution are they considering for the shallow ingress buffers
in Trio6? The collateral damage potential is significant, because the
WAN that asks for the most gets the most, instead of each getting its
fair share; thus a potentially arbitrarily-low-rate WAN ingress might
not get access to the ingress buffer, causing drops. Would it be
practical in terms of wattage/area to add some sort of pre-QoS towards
the shallow ingress buffer, so each WAN ingress has a fair guaranteed
rate to the shallow buffers?

--
++ytti
RE: 400G forwarding - how does it work? [ In reply to ]
Disclaimer: I work for Cisco on a bunch of silicon. I'm not intimately familiar with any of these devices, but I'm familiar with the high level tradeoffs. There are also exceptions to almost EVERYTHING I'm about to say, especially once you get into the second- and third-order implementation details. Your mileage will vary... ;-)

If you have a model where one core/block does ALL of the processing, you generally benefit from lower latency, simpler programming, etc. A major downside is that to do this, all of these cores have to have access to all of the different memories used to forward said packet. Conversely, if you break up the processing into stages, you can only connect the FIB lookup memory to the cores that are going to be doing the FIB lookup, and only connect the encap memories to the cores/blocks that are doing the encapsulation work. Those interconnects take up silicon space, which equates to higher cost and power.
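
A toy way to count the interconnect cost described above; the core and memory counts are invented purely for illustration:

```python
def run_to_completion_links(n_cores: int, memories: list[str]) -> int:
    # Run-to-completion: every core must be able to reach every memory.
    return n_cores * len(memories)

def staged_links(cores_per_stage: dict[str, int]) -> int:
    # Staged pipeline: each stage's cores reach only that stage's memory.
    return sum(cores_per_stage.values())

mems = ["fib", "encap", "counters", "qos"]
print(run_to_completion_links(32, mems))  # 128 core<->memory paths
print(staged_links({"fib": 8, "encap": 8, "counters": 8, "qos": 8}))  # 32 paths
```

The quadratic-vs-linear growth in core-to-memory paths is the silicon space (cost and power) referred to above.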

Packaging two cores on a single device is beneficial in that you only have one physical chip to work with instead of two. This often simplifies the board designers' job, and is often lower power than two separate chips. This starts to break down as you get to exceptionally large chips as you bump into the various physical/reticle limitations of how large a chip you can actually build. With newer packaging technology (2.5D chips, HBM and similar memories, chiplets down the road, etc) this becomes even more complicated, but the answer to "why would you put two XYZs on a package?" is that it's just cheaper and lower power from a system standpoint (and often also from a pure silicon standpoint...)

Buffer designs are *really* hard in modern high speed chips, and there are always lots and lots of tradeoffs. The "ideal" answer is an extremely large block of memory that ALL of the forwarding/queueing elements have fair/equal access to... but this physically looks more or less like a full mesh between the memory/buffering subsystem and all the forwarding engines, which becomes really unwieldy (expensive!) from a design standpoint. The amount of memory you can practically put on the main NPU die is on the order of 20-200 **mega** bytes, where a single stack of HBM memory comes in at 4GB -- it's literally 100x the size. Figuring out which side of this gigantic gulf you want to live on is a super important part of the basic architecture and also drives lots of other decisions down the line... once you've decided how much buffering memory you're willing/able to put down, the next challenge is coming up with ways to provide access to that memory from all the different potential clients. It's a LOT easier to wire up/design a chip where you have four separate pipelines/cores/whatever and each one of them accesses 1/4 of the buffer memory... but that also means that any given port only has access to 1/4 of the memory for burst absorption. Lots and lots of Smart People Time has gone into different memory designs that attempt to optimize this problem, and it's a major part of the intellectual property of various chip designs.
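
To make the partitioning tradeoff concrete, a sketch with made-up sizes; the only point is the 1/4 effect described above:

```python
def burst_headroom_mb(total_buffer_mb: float, partitions: int) -> float:
    # Worst case, a bursting port can draw only on its own partition's
    # slice of the buffer, even if the other partitions sit idle.
    return total_buffer_mb / partitions

print(burst_headroom_mb(100, 1))  # 100.0 MB: fully shared, any port can absorb the whole buffer
print(burst_headroom_mb(100, 4))  # 25.0 MB: four pipelines, each owning 1/4 of the memory
```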

--lj

Re: 400G forwarding - how does it work? [ In reply to ]
On Fri, 5 Aug 2022 at 20:31, <ljwobker@gmail.com> wrote:

Hey LJ,

> Disclaimer: I work for Cisco on a bunch of silicon. I'm not intimately familiar with any of these devices, but I'm familiar with the high level tradeoffs. There are also exceptions to almost EVERYTHING I'm about to say, especially once you get into the second- and third-order implementation details. Your mileage will vary... ;-)

I expect it may come to this; my question may be too specific to be
answered without violating some NDA.

> If you have a model where one core/block does ALL of the processing, you generally benefit from lower latency, simpler programming, etc. A major downside is that to do this, all of these cores have to have access to all of the different memories used to forward said packet. Conversely, if you break up the processing into stages, you can only connect the FIB lookup memory to the cores that are going to be doing the FIB lookup, and only connect the encap memories to the cores/blocks that are doing the encapsulation work. Those interconnects take up silicon space, which equates to higher cost and power.

While that is an interesting answer (that is, the statement that the
cost of giving all cores access to memory versus having a
harder-to-program pipeline of cores is a balanced tradeoff), I don't
think it applies to my specific question, though it may apply to the
generic one. We can roughly think of FP as having a similar number of
lines as Trio has PPEs; therefore, a similar number of cores need
access to memory, and possibly a higher number, as more than one core
per line will need memory access.
So the question is more: why many less performant cores, where
performance is achieved by building a pipeline, compared to fewer more
performant cores, where individual cores work on a packet to
completion, when the former has a similar number of core lines as the
latter has cores?

> Packaging two cores on a single device is beneficial in that you only have one physical chip to work with instead of two. This often simplifies the board designers' job, and is often lower power than two separate chips. This starts to break down as you get to exceptionally large chips as you bump into the various physical/reticle limitations of how large a chip you can actually build. With newer packaging technology (2.5D chips, HBM and similar memories, chiplets down the road, etc) this becomes even more complicated, but the answer to "why would you put two XYZs on a package?" is that it's just cheaper and lower power from a system standpoint (and often also from a pure silicon standpoint...)

Thank you for this; it does confirm that the benefits aren't perhaps as
revolutionary as the presentation in the thread proposed. The
presentation divided Trio's evolution into three phases, and multiple
Trios on a package was presented as one of those big evolutions;
perhaps some other division of generations would have been more
communicative.

> Lots and lots of Smart People Time has gone into different memory designs that attempt to optimize this problem, and it's a major part of the intellectual property of various chip designs.

I choose to read this as 'where a lot of innovation happens, a lot of
mistakes happen'. Hopefully we'll figure out a good answer here soon,
as the answers vendors are ending up with are becoming increasingly
visible compromises in the field. I suspect a large part of this is
that cloudy shops represent, if not disproportionate revenue, then
disproportionate focus, and their networks tend to be a lot more static
in config and traffic than access/SP networks. And when you have that
quality, you can make increasingly broad assumptions, assumptions
which don't play as well in SP networks.

--
++ytti
RE: 400G forwarding - how does it work? [ In reply to ]
I don't think I can add much here to the FP and Trio specific questions, for obvious reasons... but ultimately it comes down to a set of tradeoffs where some of the big concerns are things like "how do I get the forwarding state I need back and forth to the things doing the processing work" -- that's an insane level of oversimplification, as a huge amount of engineering time goes into those choices.

I think the "revolutionary-ness" (to vocabulate a useful word?) of putting multiple cores or whatever onto a single package is somewhat in the eye of the beholder. The vast majority of customers would never know nor care whether a chip on the inside was implemented as two parallel "cores" or whether it was just one bigger "core" that does twice the amount of work in the same time. But to the silicon designer, and to a somewhat lesser extent the people writing the forwarding and associated chip-management code, it's definitely a big big deal. Also, having the ability to put two cores down on a given chip opens the door to eventually doing MORE than two cores, and if you really stretch your brain you get to where you might be able to put down "N" pipelines.

This is the story of integration: back in the day we built systems where everything was forwarded on a single CPU. From a performance standpoint, all we cared about was the clock rate and how much work was required to forward a packet: divide the first number by the second, and you get your packets per second. In the late 90's we built systems (the 7500 for me) that were distributed, so now we had a bunch of CPUs on linecards running that code. Horizontal scaling -- sort of. In the early 2000's the GSR came along and now we're doing forwarding in hardware, which is an order of magnitude or two faster, but a whole bunch of features are now too complex to do in hardware, so they go over the side and people have to adapt. To the best of my knowledge, TCP intercept has never come back...
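
The single-CPU arithmetic above, as a sketch; the clock rate and per-packet cycle count are invented, not figures for any real platform:

```python
clock_hz = 200e6          # hypothetical linecard CPU clock
cycles_per_packet = 2000  # hypothetical total work to forward one packet

pps = clock_hz / cycles_per_packet
print(f"{pps:,.0f} packets/sec")  # 100,000 packets/sec
```
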
For a while, GSR and CRS type systems had linecards where each card had a bunch of chips that together built the forwarding pipeline. You had chips for the L1/L2 interfaces, chips for the packet lookups, chips for the QoS/queueing math, and chips for the fabric interfaces. Over time, we integrated more and more of these things together until you (more or less) had a linecard where everything was done on one or two chips, instead of a half dozen or more. Once we got here, the next step was to build linecards where you actually had multiple independent things doing forwarding -- on the ASR9k we called these "slices". This again multiplies the performance you can get, but now both the software and the operators have to deal with the complexity of having multiple things running code where you used to only have one. Now let's jump into the 2010's where the silicon integration allows you to put down multiple cores or pipelines on a single chip, each of these is now (more or less) its own forwarding entity. So now you've got yet ANOTHER layer of abstraction. If I can attempt to draw out the tree, it looks like this now:
1) you have a chassis or a system, which has a bunch of linecards.
2) each of those linecards has a bunch of NPUs/ASICs
3) each of those NPUs has a bunch of cores/pipelines

And all of this stuff has to be managed and tracked by the software. If I've got a system with 16 linecards, and each of those has 4 NPUs, and each of THOSE has 4 cores - I've got over *two hundred and fifty* separate things forwarding packets at the same time. Now a lot of the info they're using is common (the FIB is probably the same for all these entities...) but some of it is NOT. There's no value in wasting memory for the encapsulation data to host XXX if I know that none of the ports on my given NPU/core are going to talk to that host, right? So - figuring out how to manage the *state locality* becomes super important. And yes, this code breaks like all code, but no one has figured out any better way to scale up the performance. If you have a brilliant idea here that will get me the performance of 250+ things running in parallel but the simplicity of it looking and acting like a single thing to the rest of the world, please find an angel investor and we'll get phenomenally rich together.
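
The multiplication above, spelled out with the counts from the example:

```python
linecards = 16
npus_per_linecard = 4
cores_per_npu = 4

# Every core/pipeline is an independent forwarding entity the software
# must manage and keep consistent.
forwarding_entities = linecards * npus_per_linecard * cores_per_npu
print(forwarding_entities)  # 256, i.e. over two hundred and fifty
```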


--lj

Re: 400G forwarding - how does it work? [ In reply to ]
On Sat, 6 Aug 2022 at 17:08, <ljwobker@gmail.com> wrote:


> For a while, GSR and CRS type systems had linecards where each card had a bunch of chips that together built the forwarding pipeline. You had chips for the L1/L2 interfaces, chips for the packet lookups, chips for the QoS/queueing math, and chips for the fabric interfaces. Over time, we integrated more and more of these things together until you (more or less) had a linecard where everything was done on one or two chips, instead of a half dozen or more. Once we got here, the next step was to build linecards where you actually had multiple independent things doing forwarding -- on the ASR9k we called these "slices". This again multiplies the performance you can get, but now both the software and the operators have to deal with the complexity of having multiple things running code where you used to only have one. Now let's jump into the 2010's where the silicon integration allows you to put down multiple cores or pipelines on a single chip, each of these is now (more or less) it's own forwarding entity. So now you've got yet ANOTHER layer of abstraction. If I can attempt to draw out the tree, it looks like this now:

> 1) you have a chassis or a system, which has a bunch of linecards.
> 2) each of those linecards has a bunch of NPUs/ASICs
> 3) each of those NPUs has a bunch of cores/pipelines

Thank you for this. I think we may have some ambiguity here. I'll
ignore multichassis designs for now, as those went out of fashion,
and describe only an 'NPU', not an Express/BRCM-style pipeline.

1) you have a chassis with multiple linecards
2) each linecard has 1 or more forwarding packages
3) each package has 1 or more NPUs (Juniper calls these slices; unsure
if the EZchip vocabulary is the same here)
4) each NPU has 1 or more identical cores (well, I can't really name
any with one core; an NPU, like a GPU, pretty inherently has many,
many cores. And unlike some in this thread, I don't think they are
ever the ARM instruction set; that makes no sense, as you create an
instruction set targeting the application at hand, which the ARM
instruction set is not. Maybe some day we'll have some forwarding-IA,
allowing customers to provide ucode that runs on multiple targets,
though that would reduce the pace of innovation.)

Some of those NPU core architectures are flat, like Trio, where a
single core handles the entire packet. Other core architectures, like
FP, are matrices: you have multiple lines, and a packet picks one of
the lines and traverses each core in that line. (FP has many more
cores per line, compared to the Leaba/Pacific stuff.)
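To make the pipeline arithmetic from earlier in the thread concrete, here is a minimal sketch (the 1.2 GHz clock and the packets-per-clock figures are assumptions discussed above, not vendor-confirmed numbers):

```python
import math

def pipelines_needed(target_pps, clock_hz, packets_per_clock=1):
    """Number of parallel pipelines needed to hit a target packet rate,
    if each pipeline retires packets_per_clock packets per cycle."""
    per_pipeline_pps = clock_hz * packets_per_clock
    return math.ceil(target_pps / per_pipeline_pps)

# 1.25 Gpps per pipeline, 10 Bpps target -> 8 pipelines (LJ's example)
print(pipelines_needed(10e9, 1.25e9))      # 8
# J2's ~2 Bpps headline rate with an assumed 1.2 GHz clock:
print(pipelines_needed(2e9, 1.2e9))        # 2 at 1 packet/clock
print(pipelines_needed(2e9, 1.2e9, 2))     # 1 at 2 packets/clock
```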

--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
ljwobker@gmail.com wrote:

> Buffer designs are *really* hard in modern high speed chips, and
> there are always lots and lots of tradeoffs. The "ideal" answer is
> an extremely large block of memory that ALL of the
> forwarding/queueing elements have fair/equal access to... but this
> physically looks more or less like a full mesh between the
> memory/buffering subsystem and all the forwarding engines, which
> becomes really unwieldly (expensive!) from a design standpoint. The
> amount of memory you can practically put on the main NPU die is on
> the order of 20-200 **mega** bytes, where a single stack of HBM
> memory comes in at 4GB -- it's literally 100x the size.

I'm afraid you imply too much buffer bloat only to cause
unnecessary and unpleasant delay.

With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
buffer is enough to make packet drop probability less than
1%. With 98% load, the probability is 0.0041%.

But there are so many router engineers who think that, with
bloated buffers, packet drop probability can be zero, which
is wrong.

For example,

https://www.broadcom.com/products/ethernet-connectivity/switching/stratadnx/bcm88690
Jericho2 delivers a complete set of advanced features for
the most demanding carrier, campus and cloud environments.
The device supports low power, high bandwidth HBM packet
memory offering up to 160X more traffic buffering compared
with on-chip memory, enabling zero-packet-loss in heavily
congested networks.

Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
On Sun, 7 Aug 2022 at 12:16, Masataka Ohta
<mohta@necom830.hpcl.titech.ac.jp> wrote:

> I'm afraid you imply too much buffer bloat only to cause
> unnecessary and unpleasant delay.
>
> With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
> buffer is enough to make packet drop probability less than
> 1%. With 98% load, the probability is 0.0041%.

I feel like I'll live to regret asking. Which congestion control
algorithm are you thinking of? If we estimate BW and pace TCP window
growth at estimated BW, we don't need much buffering at all.
But Cubic and Reno will burst TCP window growth at sender rate, which
may be much more than receiver rate; someone has to store that growth
and pace it out at receiver rate, otherwise the window won't grow and
receiver rate won't be achieved.
So in an ideal scenario, no we don't need a lot of buffer, in
practical situations today, yes we need quite a bit of buffer.

Now add to this multiple logical interfaces, each having 4-8 queues;
it adds up. "Big buffers are bad, mmmm'kay" is frankly simplistic and
inaccurate.

Also, the shallow ingress buffers discussed in the thread are not delay
buffers, and the problem is complex because no marketable device can
accept wire rate at minimum packet size. So what trade-offs do we
carry when we get bad traffic at wire rate at small packet size? We
can't empty the ingress buffers fast enough; do we have physical
memory for each port, do we share, and how do we share?
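For a sense of scale on the minimum-packet-size problem, a back-of-the-envelope sketch (assuming standard Ethernet per-frame overhead of an 8B preamble plus a 12B inter-frame gap; the helper is illustrative, not any vendor's math):

```python
def wire_rate_pps(link_bps, frame_bytes=64, overhead_bytes=20):
    """Worst-case packets/sec on an Ethernet link: each frame also pays
    8B preamble + 12B inter-frame gap (= 20B) on the wire."""
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_per_frame

print(round(wire_rate_pps(400e9) / 1e6, 1))        # ~595.2 Mpps at 64B
print(round(wire_rate_pps(400e9, 1500) / 1e6, 1))  # ~32.9 Mpps at 1500B
```

At 64B, a 400G port must sustain roughly 18x the lookup rate that 1500B traffic requires, which is exactly the trade-off being argued about here.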

--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
Saku Ytti wrote:

>> I'm afraid you imply too much buffer bloat only to cause
>> unnecessary and unpleasant delay.
>>
>> With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
>> buffer is enough to make packet drop probability less than
>> 1%. With 98% load, the probability is 0.0041%.

> I feel like I'll live to regret asking. Which congestion control
> algorithm are you thinking of?

I'm not assuming a LAN environment, for which paced TCP may
be desirable (if the bandwidth requirement is tight, which is
unlikely in a LAN).

> But Cubic and Reno will burst tcp window growth at sender rate, which
> may be much more than receiver rate, someone has to store that growth
> and pace it out at receiver rate, otherwise window won't grow, and
> receiver rate won't be achieved.

When many TCPs are running, bursts are averaged and the
traffic is Poisson.

> So in an ideal scenario, no we don't need a lot of buffer, in
> practical situations today, yes we need quite a bit of buffer.

That is an old theory known to be invalid (that Ethernet switches
with small buffers are enough for IXes), and theoretically refuted by:

Sizing router buffers
https://dl.acm.org/doi/10.1145/1030194.1015499

after which paced TCP was developed for unimportant exceptional
cases of LAN.

> Now add to this multiple logical interfaces, each having 4-8 queues,
> it adds up.

Having so many queues requires sorting of queues to properly
prioritize them, which costs a lot of computation (and
performance loss) for no benefit and is a bad idea.

> Also the shallow ingress buffers discussed in the thread are not delay
> buffers and the problem is complex because no device is marketable
> that can accept wire rate of minimum packet size, so what trade-offs
> do we carry, when we get bad traffic at wire rate at small packet
> size? We can't empty the ingress buffers fast enough, do we have
> physical memory for each port, do we share, how do we share?

People who use irrationally small packets will suffer, which is
not a problem for the rest of us.

Masataka Ohta
RE: 400G forwarding - how does it work? [ In reply to ]
You're getting to the core of the question (sorry, I could not resist...) -- and again the complexity is as much in the terminology as anything else.

In EZChip, at least as we used it on the ASR9k, the chip had a bunch of processing cores, and each core performed some of the work on each packet. I honestly don't know if the cores themselves were different or if they were all the same physical design, but they were definitely attached to different memories, and they definitely ran different microcode. These cores were allocated to separate stages, and had names along the lines of {parse, search, encap, transmit} etc. I'm sure these aren't 100% correct but you get the point. Importantly, there were NOT the same number of cores for each stage, so when a packet went from stage A to stage B there was some kind of mux in between. If you knew precisely that each stage had the same number of cores, you could choose to arrange it such that the packet always followed a "straight-line" through the processing pipe, which would make some parts of the implementation cheaper/easier.

You're correct that the instruction set for stuff like this is definitely not ARM (nor x86 nor anything else standard) because the problem space you're optimizing for is a lot smaller that what you'd have on a more general purpose CPU.

The (enormous) challenge for running the same ucode on multiple targets is that networking has exceptionally high performance requirements -- billions of packets per second is where this whole thread started! Fortunately, we also have a much smaller problem space to solve than general purpose compute, although in a lot of places that's because we vendors have told operators "Look, if you want something that can forward a couple hundred terabits in a single system, you're going to have to constrain what features you need, because otherwise the current hardware just can't do it".

To get that kind of performance without breaking the bank requires -- or at least has required up until this point in time -- some very tight integration between the hardware forwarding design and the microcode. I was at Barefoot when P4 was released, and Tofino was the closest thing I've seen to a "general purpose network ucode machine" -- and even that was still very much optimized in terms of how the hardware was designed and built, and it VERY much required the P4 programmer to have a deep understanding of what hardware resources were available. When you write a P4 program and compile it for an x86 machine, you can basically create as many tables and lookup stages as you want -- you just have to eat more CPU and memory accesses for more complex programs and they run slower. But on a chip like Tofino (or any other NPU-like target) you're going to have finite limits on how many processing stages and memory tables exist... so it's more the case that when your program gets bigger it no longer "just runs slower" but rather it "doesn't run at all".

The industry would greatly benefit from some magical abstraction layer that would let people write forwarding code that's both target-independent AND high-performance, but at least so far the performance penalty for making such code target independent has been waaaaay more than the market is willing to bear.

--lj

-----Original Message-----
From: Saku Ytti <saku@ytti.fi>
Sent: Sunday, August 7, 2022 4:44 AM
To: ljwobker@gmail.com
Cc: Jeff Tantsura <jefftant.ietf@gmail.com>; NANOG <nanog@nanog.org>; Jeff Doyle <jdoyle@juniper.net>
Subject: Re: 400G forwarding - how does it work?

On Sat, 6 Aug 2022 at 17:08, <ljwobker@gmail.com> wrote:


> For a while, GSR and CRS type systems had linecards where each card had a bunch of chips that together built the forwarding pipeline. You had chips for the L1/L2 interfaces, chips for the packet lookups, chips for the QoS/queueing math, and chips for the fabric interfaces. Over time, we integrated more and more of these things together until you (more or less) had a linecard where everything was done on one or two chips, instead of a half dozen or more. Once we got here, the next step was to build linecards where you actually had multiple independent things doing forwarding -- on the ASR9k we called these "slices". This again multiplies the performance you can get, but now both the software and the operators have to deal with the complexity of having multiple things running code where you used to only have one. Now let's jump into the 2010's where the silicon integration allows you to put down multiple cores or pipelines on a single chip, each of these is now (more or less) it's own forwarding entity. So now you've got yet ANOTHER layer of abstraction. If I can attempt to draw out the tree, it looks like this now:

> 1) you have a chassis or a system, which has a bunch of linecards.
> 2) each of those linecards has a bunch of NPUs/ASICs
> 3) each of those NPUs has a bunch of cores/pipelines

Thank you for this. I think we may have some ambiguity here. I'll ignore multichassis designs, as those went out of fashion, for now.
And describe only 'NPU' not express/brcm style pipeline.

1) you have a chassis with multiple linecards
2) each linecard has 1 or more forwarding packages
3) each package has 1 or more NPUs (Juniper calls these slices, unsure if EZchip vocabulary is same here)
4) each NPU has 1 or more identical cores (well, I can't really name any with 1 core, I reckon, NPU like GPU pretty inherently has many many cores, and unlike some in this thread, I don't think they ever are ARM instruction set, that makes no sense, you create instruction set targeting the application at hand which ARM instruction set is not, but maybe some day we have some forwarding-IA, allowing customers to provide ucode that runs on multiple targets, but this would reduce pace of innovation)

Some of those NPU core architectures are flat, like Trio, where a single core handles the entire packet. Where other core architectures, like FP are matrices, where you have multiple lines and packet picks 1 of the lines and traverses each core in line. (FP has much more cores in line, compared to leaba/pacific stuff)

--
++ytti
RE: 400G forwarding - how does it work? [ In reply to ]
Buffering is a near-religious topic across a large swath of the network industry, but here are some opinions of mine:

a LOT of operators/providers need more buffering than you can realistically put directly onto the ASIC die. Fast chips without external buffers measure capacity in tens of microseconds, which is nowhere near enough for a lot of the market. We can (and do) argue about exactly where and what network roles can be met by this amount of buffering, but it's absolutely not a large enough part of the market to totally go away from "big" external buffers.
Once you "jump off the cliff" of needing something more than on-chip SRAM, you're in this weird area where nothing exists in the technology space that *really* solves the problem, because you really need access rate and bandwidth more than you need capacity. HBM is currently the best (or at least the most popular) combination of capacity, power, access rate, and bandwidth... but it's still nowhere near perfect. A common HBM2 implementation gives you 8GB of buffer space and about 2Tb of raw bandwidth, and a few hundred million IOPS. (A lot of that gets gobbled up by various overheads....)

These values are a function of two things:
1) memory physics - I don't know enough about how these things are Like Really Actually Built to talk about this part.
2) market forces... the market for this stuff is really GPUs, ML/AI applications, etc. The networking silicon market is a drop in the ocean compared to the rest of compute, so the specific needs of my router aren't going to ever drive enough volume to get big memory makers to do exactly what **I** want. I'm at the mercy of what they build for the gigantic players in the rest of the market.

If you told me that someone had a memory technology that was something like "one-fourth the capacity of HBM, but four times the bandwidth and four times the access rate" I would do backflips and buy a lot of it, because it's a way better fit for the specific performance dimensions I need for A Really Fast Router. But nothing remotely along these lines exists... so like a lot of other people I just have to order off the menu. ;-)
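To put the on-chip vs. HBM numbers above into time units, a hedged sketch (the 12.8 Tb/s port capacity and 64 MB SRAM figures are illustrative assumptions, not a specific chip):

```python
def buffer_microseconds(buffer_bytes, drain_bps):
    """How long a buffer lasts when draining at the given rate."""
    return buffer_bytes * 8 / drain_bps * 1e6

# Assumed: 64 MB of on-die SRAM behind 12.8 Tb/s of ports...
print(round(buffer_microseconds(64e6, 12.8e12)))        # ~40 us on-die
# ...vs an 8 GB HBM stack bounded by its ~2 Tb/s raw bandwidth:
print(round(buffer_microseconds(8e9, 2e12) / 1000, 1))  # ~32 ms via HBM
```

This is the "tens of microseconds without external buffers" point in time units: the HBM path buys roughly three orders of magnitude more buffering time, at the cost of its access-rate and bandwidth limits.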


--lj

-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail.com@nanog.org> On Behalf Of Masataka Ohta
Sent: Sunday, August 7, 2022 5:13 AM
To: nanog@nanog.org
Subject: Re: 400G forwarding - how does it work?

ljwobker@gmail.com wrote:

> Buffer designs are *really* hard in modern high speed chips, and there
> are always lots and lots of tradeoffs. The "ideal" answer is an
> extremely large block of memory that ALL of the forwarding/queueing
> elements have fair/equal access to... but this physically looks more
> or less like a full mesh between the memory/buffering subsystem and
> all the forwarding engines, which becomes really unwieldly
> (expensive!) from a design standpoint. The amount of memory you can
> practically put on the main NPU die is on the order of 20-200 **mega**
> bytes, where a single stack of HBM memory comes in at 4GB -- it's
> literally 100x the size.

I'm afraid you imply too much buffer bloat only to cause unnecessary and unpleasant delay.

With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of buffer is enough to make packet drop probability less than 1%. With 98% load, the probability is 0.0041%.

But, there are so many router engineers who think, with bloated buffer, packet drop probability can be zero, which is wrong.

For example,

https://www.broadcom.com/products/ethernet-connectivity/switching/stratadnx/bcm88690
Jericho2 delivers a complete set of advanced features for
the most demanding carrier, campus and cloud environments.
The device supports low power, high bandwidth HBM packet
memory offering up to 160X more traffic buffering compared
with on-chip memory, enabling zero-packet-loss in heavily
congested networks.

Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
There are MANY real-world use cases which require high throughput at 64-byte packet size. Denying those use cases because they don’t fit your world view is short-sighted. The world of networking is not all IMIX.

> On Aug 7, 2022, at 7:16 AM, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
>
> ?Saku Ytti wrote:
>
>>> I'm afraid you imply too much buffer bloat only to cause
>>> unnecessary and unpleasant delay.
>>>
>>> With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
>>> buffer is enough to make packet drop probability less than
>>> 1%. With 98% load, the probability is 0.0041%.
>
>> I feel like I'll live to regret asking. Which congestion control
>> algorithm are you thinking of?
>
> I'm not assuming LAN environment, for which paced TCP may
> be desirable (if bandwidth requirement is tight, which is
> unlikely in LAN).
>
>> But Cubic and Reno will burst tcp window growth at sender rate, which
>> may be much more than receiver rate, someone has to store that growth
>> and pace it out at receiver rate, otherwise window won't grow, and
>> receiver rate won't be achieved.
>
> When many TCPs are running, burst is averaged and traffic
> is poisson.
>
>> So in an ideal scenario, no we don't need a lot of buffer, in
>> practical situations today, yes we need quite a bit of buffer.
>
> That is an old theory known to be invalid (Ethernet switches with
> small buffer is enough for IXes) and theoretically denied by:
>
> Sizing router buffers
> https://dl.acm.org/doi/10.1145/1030194.1015499
>
> after which paced TCP was developed for unimportant exceptional
> cases of LAN.
>
> > Now add to this multiple logical interfaces, each having 4-8 queues,
> > it adds up.
>
> Having so may queues requires sorting of queues to properly
> prioritize them, which costs a lot of computation (and
> performance loss) for no benefit and is a bad idea.
>
> > Also the shallow ingress buffers discussed in the thread are not delay
> > buffers and the problem is complex because no device is marketable
> > that can accept wire rate of minimum packet size, so what trade-offs
> > do we carry, when we get bad traffic at wire rate at small packet
> > size? We can't empty the ingress buffers fast enough, do we have
> > physical memory for each port, do we share, how do we share?
>
> People who use irrationally small packets will suffer, which is
> not a problem for the rest of us.
>
> Masataka Ohta
>
>
Re: 400G forwarding - how does it work? [ In reply to ]
On Sun, 7 Aug 2022 at 17:58, <sronan@ronan-online.com> wrote:

> There are MANY real-world use cases which require high throughput at 64-byte packet size. Denying those use cases because they don’t fit your world view is short-sighted. The world of networking is not all IMIX.

Yes, but it's not an addressable market. Such a market will just buy
silly putty for 2 bucks and modify the existing faceplate to do 64B.

No one will ship that box for you, because the addressable market
will gladly take more WAN ports as a trade-off for a large minimum
mean packet size.
--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
On Sun, 7 Aug 2022 at 14:16, Masataka Ohta
<mohta@necom830.hpcl.titech.ac.jp> wrote:

> When many TCPs are running, burst is averaged and traffic
> is poisson.

If you grow a window, and the sender sends the delta at 100G, and
receiver is 10G, eventually you'll hit that 10G port at 100G rate.
It's largely an edge problem, not a core problem.
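A minimal sketch of the queue arithmetic behind this point (the 1 MB burst size is an illustrative assumption):

```python
def burst_queue_peak(burst_bytes, in_bps, out_bps):
    """Peak queue depth when a burst arrives at in_bps while the
    egress port drains at out_bps."""
    arrival_time = burst_bytes * 8 / in_bps      # seconds to receive the burst
    drained_bytes = out_bps * arrival_time / 8   # bytes sent in that window
    return burst_bytes - drained_bytes

# A 1 MB window-growth burst sent at 100G toward a 10G port:
print(round(burst_queue_peak(1e6, 100e9, 10e9)))  # 900000 bytes to buffer
```

In other words, 90% of the burst has to sit in a buffer somewhere, or it gets dropped and the window never grows.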

> People who use irrationally small packets will suffer, which is
> not a problem for the rest of us.

Quite. Unfortunately, the problem I have exists in the Internet; the
problem you're solving exists in Ohtanet. Ohtanet is much more
civilized and allows for elegant solutions. The Internet just has
different shades of bad solutions to pick from.

--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
You are incredibly incorrect; in fact, the market for those devices is in the billions of dollars. But you continue to pretend that it doesn’t exist.

Shane

> On Aug 7, 2022, at 11:57 AM, Saku Ytti <saku@ytti.fi> wrote:
>
> ?On Sun, 7 Aug 2022 at 17:58, <sronan@ronan-online.com> wrote:
>
>> There are MANY real-world use cases which require high throughput at 64-byte packet size. Denying those use cases because they don’t fit your world view is short-sighted. The world of networking is not all IMIX.
>
> Yes but it's not an addressable market. Such a market will just buy
> silly putty for 2bucks and modify the existing face-plate to do 64B.
>
> No one will ship that box for you, because the addressable market
> gladly will take more WAN ports as trade-off for large minimum mean
> packet size.
> --
> ++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
Disclaimer: I often use the M/M/1 queuing assumption for much of my work to
keep the maths simple, and I believe I am reasonably aware of the contexts
in which it's a right or a wrong application :). Also, I don't intend to
change the core topic of the thread, but since this has come up, I couldn't
resist.

>> With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
>> buffer is enough to make packet drop probability less than
>> 1%. With 98% load, the probability is 0.0041%.

To expand the above a bit so that there is no ambiguity: the above assumes
that the router behaves like an M/M/1 queue. The expected number of packets
in the system is

    E[N] = rho / (1 - rho)

where rho is the utilization. The probability that at least B packets are
in the system is rho**B. For a link utilization of 0.98, the probability of
at least 500 packets in the system is 0.98**500 ~= 0.000041 (0.0041%); for
a utilization of 0.99, 0.99**500 ~= 0.0066 (0.66%).
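These formulas are easy to check numerically; a small sketch of the standard M/M/1 expressions:

```python
def mm1_tail_probability(rho, b):
    """P(at least b packets in an M/M/1 system) = rho**b."""
    return rho ** b

def mm1_expected_packets(rho):
    """E[N] = rho / (1 - rho) for an M/M/1 queue."""
    return rho / (1 - rho)

print(f"{mm1_tail_probability(0.98, 500):.4%}")  # ~0.0041%
print(f"{mm1_tail_probability(0.99, 500):.4%}")  # ~0.66%
print(round(mm1_expected_packets(0.99)))         # ~99 packets on average
```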

>> When many TCPs are running, burst is averaged and traffic
>> is poisson.

M/M/1 queuing assumes that traffic is Poisson, and the Poisson assumption is
1) The number of sources is infinite
2) The traffic arrival pattern is random.

I think the second assumption is where I often question whether the traffic
arrival pattern is truly random. I have seen cases where traffic behaves
more like a self-similar process. Most Poisson models rely on the central
limit theorem, which loosely states that the sample distribution will
approach a normal distribution as we aggregate more sources from various
distributions, so the mean smooths towards a value.

Do you have any good pointers where the research has been done that today's
internet traffic can be modeled accurately by Poisson? For as many papers
supporting Poisson, I have seen as many papers saying it's not Poisson.

https://www.icir.org/vern/papers/poisson.TON.pdf
https://www.cs.wustl.edu/~jain/cse567-06/ftp/traffic_models2/#sec1.2

On Sun, 7 Aug 2022 at 04:18, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp>
wrote:

> Saku Ytti wrote:
>
> >> I'm afraid you imply too much buffer bloat only to cause
> >> unnecessary and unpleasant delay.
> >>
> >> With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
> >> buffer is enough to make packet drop probability less than
> >> 1%. With 98% load, the probability is 0.0041%.
>
> > I feel like I'll live to regret asking. Which congestion control
> > algorithm are you thinking of?
>
> I'm not assuming LAN environment, for which paced TCP may
> be desirable (if bandwidth requirement is tight, which is
> unlikely in LAN).
>
> > But Cubic and Reno will burst tcp window growth at sender rate, which
> > may be much more than receiver rate, someone has to store that growth
> > and pace it out at receiver rate, otherwise window won't grow, and
> > receiver rate won't be achieved.
>
> When many TCPs are running, burst is averaged and traffic
> is poisson.
>
> > So in an ideal scenario, no we don't need a lot of buffer, in
> > practical situations today, yes we need quite a bit of buffer.
>
> That is an old theory known to be invalid (Ethernet switches with
> small buffer is enough for IXes) and theoretically denied by:
>
> Sizing router buffers
> https://dl.acm.org/doi/10.1145/1030194.1015499
>
> after which paced TCP was developed for unimportant exceptional
> cases of LAN.
>
> > Now add to this multiple logical interfaces, each having 4-8 queues,
> > it adds up.
>
> Having so may queues requires sorting of queues to properly
> prioritize them, which costs a lot of computation (and
> performance loss) for no benefit and is a bad idea.
>
> > Also the shallow ingress buffers discussed in the thread are not delay
> > buffers and the problem is complex because no device is marketable
> > that can accept wire rate of minimum packet size, so what trade-offs
> > do we carry, when we get bad traffic at wire rate at small packet
> > size? We can't empty the ingress buffers fast enough, do we have
> > physical memory for each port, do we share, how do we share?
>
> People who use irrationally small packets will suffer, which is
> not a problem for the rest of us.
>
> Masataka Ohta
>
>
>
Re: 400G forwarding - how does it work? [ In reply to ]
If it's of any help... the bloat mailing list at lists.bufferbloat.net has
the largest concentration of queue theorists and network operators +
developers I know of. (Also, bloat readers, this ongoing thread on nanog
about 400Gbit is fascinating.)

There is 10+ years worth of debate in the archives:
https://lists.bufferbloat.net/pipermail/bloat/2012-May/thread.html as one
example.

On Sun, Aug 7, 2022 at 10:14 AM dip <diptanshu.singh@gmail.com> wrote:

>
> Disclaimer: I often use the M/M/1 queuing assumption for much of my work
> to keep the maths simple and believe that I am reasonably aware in which
> context it's a right or a wrong application :). Also, I don't intend to
> change the core topic of the thread, but since this has come up, I couldn't
> resist.
>
> >> With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
> >> buffer is enough to make packet drop probability less than
> >> 1%. With 98% load, the probability is 0.0041%.
>
> To expand the above a bit so that there is no ambiguity: the above assumes
> that the router behaves like an M/M/1 queue. The expected number of packets
> in the system is E[N] = rho / (1 - rho), where rho is the utilization. The
> probability that at least B packets are in the system is rho**B. For a
> link utilization of 0.98, the probability of at least 500 packets in the
> system is 0.98**500 ~= 0.000041 (0.0041%); for a utilization of 0.99,
> 0.99**500 ~= 0.0066 (0.66%).
>
>
Regrettably, TCP CCs by design do not stop growth until you get that drop,
i.e. 100+% utilization.


>> When many TCPs are running, burst is averaged and traffic
> >> is poisson.
>
> M/M/1 queuing assumes that traffic is Poisson, and the Poisson assumption
> is
> 1) The number of sources is infinite
> 2) The traffic arrival pattern is random.
>
> I think the second assumption is where I often question whether the
> traffic arrival pattern is truly random. I have seen cases where traffic
> behaves more like self-similar. Most Poisson models rely on the Central
> limit theorem, which loosely states that the sample distribution will
> approach a normal distribution as we aggregate more from various
> distributions. The mean will smooth towards a value.
>
> Do you have any good pointers where the research has been done that
> today's internet traffic can be modeled accurately by Poisson? For as many
> papers supporting Poisson, I have seen as many papers saying it's not
> Poisson.
>
> https://www.icir.org/vern/papers/poisson.TON.pdf
> https://www.cs.wustl.edu/~jain/cse567-06/ftp/traffic_models2/#sec1.2
>

I am firmly in the not-poisson camp, however, by inserting (esp) FQ and AQM
techniques on the bottleneck links it is very possible to smooth traffic
into this more easily analytical model - and gain enormous benefits from
doing so.



> On Sun, 7 Aug 2022 at 04:18, Masataka Ohta <
> mohta@necom830.hpcl.titech.ac.jp> wrote:
>
>> Saku Ytti wrote:
>>
>> >> I'm afraid you imply too much buffer bloat only to cause
>> >> unnecessary and unpleasant delay.
>> >>
>> >> With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
>> >> buffer is enough to make packet drop probability less than
>> >> 1%. With 98% load, the probability is 0.0041%.
>>
>> > I feel like I'll live to regret asking. Which congestion control
>> > algorithm are you thinking of?
>>
>> I'm not assuming LAN environment, for which paced TCP may
>> be desirable (if bandwidth requirement is tight, which is
>> unlikely in LAN).
>>
>> > But Cubic and Reno will burst tcp window growth at sender rate, which
>> > may be much more than receiver rate, someone has to store that growth
>> > and pace it out at receiver rate, otherwise window won't grow, and
>> > receiver rate won't be achieved.
>>
>> When many TCPs are running, burst is averaged and traffic
>> is poisson.
>>
>> > So in an ideal scenario, no we don't need a lot of buffer, in
>> > practical situations today, yes we need quite a bit of buffer.
>>
>> That is an old theory known to be invalid (Ethernet switches with
>> small buffer is enough for IXes) and theoretically denied by:
>>
>> Sizing router buffers
>> https://dl.acm.org/doi/10.1145/1030194.1015499
>>
>> after which paced TCP was developed for unimportant exceptional
>> cases of LAN.
>>
>> > Now add to this multiple logical interfaces, each having 4-8 queues,
>> > it adds up.
>>
>> Having so may queues requires sorting of queues to properly
>> prioritize them, which costs a lot of computation (and
>> performance loss) for no benefit and is a bad idea.
>>
>> > Also the shallow ingress buffers discussed in the thread are not delay
>> > buffers and the problem is complex because no device is marketable
>> > that can accept wire rate of minimum packet size, so what trade-offs
>> > do we carry, when we get bad traffic at wire rate at small packet
>> > size? We can't empty the ingress buffers fast enough, do we have
>> > physical memory for each port, do we share, how do we share?
>>
>> People who use irrationally small packets will suffer, which is
>> not a problem for the rest of us.
>>
>> Masataka Ohta
>>
>>
>>

--
FQ World Domination pending:
https://blog.cerowrt.org/post/state_of_fq_codel/
Dave Täht CEO, TekLibre, LLC
Re: 400G forwarding - how does it work? [ In reply to ]
Masataka Ohta wrote on 07/08/2022 12:16:
> Ethernet switches with small buffer is enough for IXes

That would not be the experience of IXP operators.

Nick
Re: 400G forwarding - how does it work? [ In reply to ]
sronan@ronan-online.com wrote:

> There are MANY real world use cases which require high throughput at
> 64 byte packet size.

Certainly, there were imaginary-world use cases which required
guaranteeing the oh-so-high throughput of 64kbps with a 48B payload
size, for which a 20(40)B IP header was obviously painful and a 5B
header was used. At that time, poor fair queuing was assumed,
which requires a small packet size for short delay.

But as fair queuing does not scale at all, they disappeared
long ago.

> Denying those use cases because they don’t fit
> your world view is short sighted.

That could have been a valid argument 20 years ago.

Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
dip wrote:

> I have seen cases where traffic behaves
> more like self-similar.

That could happen if there are a small number of TCP streams,
or if multiple TCPs are synchronized through interactions on
bloated buffers, which is one reason why we should avoid
bloated buffers.

> Do you have any good pointers where the research has been done that today's
> internet traffic can be modeled accurately by Poisson? For as many papers
> supporting Poisson, I have seen as many papers saying it's not Poisson.
>
> https://www.icir.org/vern/papers/poisson.TON.pdf

It is based on observations between 1989 and 1994, when the
Internet backbone was slow and the number of users
was small, which means the number of TCP streams
running in parallel was small.

For example, merely 124M packets over 36 days of observation
[LBL-1] is slower than 500kbps, which could be filled
up by a single TCP connection even by computers of that
time, and so is not a meaningful measurement.
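A quick arithmetic check of that figure (assuming 1500B packets as the generous case):

```python
# 124M packets over 36 days of observation (the LBL-1 figure above):
packets = 124e6
seconds = 36 * 86400
pps = packets / seconds
print(round(pps, 1))                # ~39.9 packets/sec on average
print(round(pps * 1500 * 8 / 1e3))  # ~478 kbps even at 1500B, i.e. < 500kbps
```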

> https://www.cs.wustl.edu/~jain/cse567-06/ftp/traffic_models2/#sec1.2

It merely states that some use non-Poisson traffic models.

Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
Saku Ytti wrote:

>> When many TCPs are running, burst is averaged and traffic
>> is poisson.
>
> If you grow a window, and the sender sends the delta at 100G, and
> receiver is 10G, eventually you'll hit that 10G port at 100G rate.

Wrong. If it's local communication where RTT is small, the
window is not so large, smaller than an unbloated router buffer.
If RTT is large, your 100G runs over several 100/400G
backbone links with much other traffic, which makes the
burst much slower than 10G.

Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
On Mon, 8 Aug 2022 at 13:03, Masataka Ohta
<mohta@necom830.hpcl.titech.ac.jp> wrote:

> If RTT is large, your 100G runs over several 100/400G
> backbone links with many other traffic, which makes the
> burst much slower than 10G.

In Ohtanet, I presume.

--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
Saku Ytti wrote:

>> If RTT is large, your 100G runs over several 100/400G
>> backbone links with many other traffic, which makes the
>> burst much slower than 10G.
>
> In Ohtanet, I presume.

which is, unlike Yttinet, the reality.

Masataka Ohta
