Mailing List Archive

Re: 400G forwarding - how does it work? [ In reply to ]
On Mon, 8 Aug 2022 at 14:02, Masataka Ohta
<mohta@necom830.hpcl.titech.ac.jp> wrote:

> which is, unlike Yttinet, the reality.

Yttinet has pesky customers who care about single TCP performance over
long fat links, and observe poor performance with shallow buffers at
the provider end. Yttinet is cost sensitive and does not want to do
work, unless sufficiently motivated by paying customers.

--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
Saku Ytti wrote:

>> which is, unlike Yttinet, the reality.
>
> Yttinet has pesky customers who care about single TCP performance over
> long fat links, and observe poor performance with shallow buffers at
> the provider end.

With such an imaginary assumption, according to the end to end
principle, the customers (the ends) should use paced TCP instead
of paying an unnecessarily bloated amount of money to intelligent
intermediate entities of ISPs using expensive routers with
bloated buffers.
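
(As a concrete sketch of what paced TCP can mean at the end host: on Linux, a sender can cap a socket's pacing rate itself rather than relying on deep intermediate buffers to absorb its bursts. A minimal Python sketch, assuming a Linux kernel with the fq qdisc or internal TCP pacing; SO_MAX_PACING_RATE is defined by hand since not every Python build exposes it, and the endpoint is a placeholder:)

import socket

# Linux-only: SO_MAX_PACING_RATE (47 in <asm-generic/socket.h>) caps a
# socket's pacing rate in bytes per second; it takes effect with the
# 'fq' qdisc or the kernel's internal TCP pacing.
SO_MAX_PACING_RATE = 47

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, 125_000_000)  # ~1 Gbit/s
s.connect(("feed.example.net", 443))  # placeholder endpoint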

> Yttinet is cost sensitive and does not want to do
> work, unless sufficiently motivated by paying customers.

I understand that if customers follow the end to end principle,
revenue of "intelligent" ISPs will be reduced.

Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
On Mon, 8 Aug 2022 at 14:37, Masataka Ohta
<mohta@necom830.hpcl.titech.ac.jp> wrote:

> With such an imaginary assumption, according to the end to end
> principle, the customers (the ends) should use paced TCP instead

I fully agree, unfortunately I do not control the whole problem
domain, and the solutions available with partial control over the
domain are less than elegant.

--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
You keep using the term “imaginary” when presented with evidence that does not match your view of things.

There are many REAL scenarios where single-flow, high-throughput TCP is a real requirement, as well as high throughput at extremely small packet sizes. In the case of the latter, the market is extremely large, but it’s not Internet traffic.
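
(For scale, a back-of-the-envelope sketch in Python of what 64B wire rate means, assuming the standard 20B of per-frame Ethernet overhead:)

# Packets/sec for wire-rate frames: each frame carries 20B of extra
# on-wire overhead (8B preamble + 12B inter-frame gap).
def wire_rate_pps(link_bps, frame_bytes=64):
    return link_bps / ((frame_bytes + 20) * 8)

print(f"{wire_rate_pps(400e9) / 1e6:.1f} Mpps")  # ~595.2 Mpps at 400G
print(f"{wire_rate_pps(100e9) / 1e6:.1f} Mpps")  # ~148.8 Mpps at 100G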

Shane

> On Aug 8, 2022, at 7:34 AM, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
>
> Saku Ytti wrote:
>
>>> which is, unlike Yttinet, the reality.
>> Yttinet has pesky customers who care about single TCP performance over
>> long fat links, and observe poor performance with shallow buffers at
>> the provider end.
>
> With such an imaginary assumption, according to the end to end
> principle, the customers (the ends) should use paced TCP instead
> of paying an unnecessarily bloated amount of money to intelligent
> intermediate entities of ISPs using expensive routers with
> bloated buffers.
>
>> Yttinet is cost sensitive and does not want to do
>> work, unless sufficiently motivated by paying customers.
>
> I understand that if customers follow the end to end principle,
> revenue of "intelligent" ISPs will be reduced.
>
> Masataka Ohta
>
>
>
RE: 400G forwarding - how does it work? [ In reply to ]
Also, for data center traffic, especially real-time market data and other UDP multicast traffic, micro-bursting is one of the biggest issues, especially as you scale out your backbone. We have two 100G switches, and have to distribute the traffic over a LACP LAG with 4 different 100G ports on different ASICs, even though the traffic is < 1% of capacity, just to account for micro-bursts.
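
(A sketch of why micro-bursts force this even at tiny average load: the buffer must absorb the gap between line-rate arrival and the drain rate for the duration of the burst. The numbers below are hypothetical.)

# Buffer needed to absorb a micro-burst arriving at line rate while
# draining at the consumer's slower rate (all values hypothetical).
def burst_buffer_bytes(line_bps, drain_bps, burst_usec):
    burst_sec = burst_usec / 1e6
    return (line_bps - drain_bps) * burst_sec / 8

# A 50us burst at 100G draining at 10G needs ~562 KB of buffer,
# even though average utilization stays far below 1%.
print(burst_buffer_bytes(100e9, 10e9, 50))  # 562500.0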



-----Original Message-----
From: NANOG <nanog-bounces+mhuff=ox.com@nanog.org> On Behalf Of sronan@ronan-online.com
Sent: Monday, August 8, 2022 8:39 AM
To: Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp>
Cc: nanog@nanog.org
Subject: Re: 400G forwarding - how does it work?

You keep using the term “imaginary” when presented with evidence that does not match your view of things.

There are many REAL scenarios where single-flow, high-throughput TCP is a real requirement, as well as high throughput at extremely small packet sizes. In the case of the latter, the market is extremely large, but it’s not Internet traffic.

Shane

> On Aug 8, 2022, at 7:34 AM, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
>
> Saku Ytti wrote:
>
>>> which is, unlike Yttinet, the reality.
>> Yttinet has pesky customers who care about single TCP performance
>> over long fat links, and observe poor performance with shallow
>> buffers at the provider end.
>
> With such an imaginary assumption, according to the end to end
> principle, the customers (the ends) should use paced TCP instead of
> paying an unnecessarily bloated amount of money to intelligent
> intermediate entities of ISPs using expensive routers with bloated
> buffers.
>
>> Yttinet is cost sensitive and does not want to do work, unless
>> sufficiently motivated by paying customers.
>
> I understand that if customers follow the end to end principle,
> revenue of "intelligent" ISPs will be reduced.
>
> Masataka Ohta
>
>
>
Re: 400G forwarding - how does it work? [ In reply to ]
On Sun, Aug 7, 2022 at 11:24 PM Masataka Ohta
<mohta@necom830.hpcl.titech.ac.jp> wrote:
>
> sronan@ronan-online.com wrote:
>
> > There are MANY real world use cases which require high throughput at
> > 64 byte packet size.
>
> Certainly, there were imaginary world use cases which required
> guaranteeing throughput as high as 64kbps with a 48B payload
> size, for which the 20(40)B IP header was obviously painful and a 5B
> header was used. At that time, poor fair queuing was assumed,
> which requires small packet size for short delay.
>
> But as fair queuing does not scale at all, they disappeared
> long ago.

What do you mean by FQ, exactly?

"5-tuple FQ" is scaling today on shaping middleboxes like Preseem and
LibreQoS to over 10Gbit/s. ISPs report that customer calls about
speed simply vanish. Admittedly the AQM is dropping or marking some
0.x% of packets, but tests of FQ with short buffers vs AQM alone
showed the former the clear winner, and fq+aqm together took the
best score.

On Linux, fq_codel is the near-universal default, also. The Linux TCP
stack does fq+pacing at nearly 100Gbit/s today with "BIG TCP".

"Disappeared"? No. Invisible? Possibly. Transitioning from 10Gbit+
down to 1Gbit or less, it is really, really useful. IMHO, and
desperately needed, in way more places.
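
(A toy sketch of what makes 5-tuple FQ scale in these systems: flows are hashed into a fixed set of queues, so per-queue state stays bounded no matter how many flows exist. This is fq_codel-like in spirit only, not its actual implementation.)

import hashlib
from collections import deque

NUM_QUEUES = 1024  # fixed; state is bounded regardless of flow count
queues = [deque() for _ in range(NUM_QUEUES)]

def enqueue(pkt, five_tuple):
    # Hash (src, dst, sport, dport, proto) into one of the fixed queues;
    # unrelated flows only rarely share a queue.
    digest = hashlib.sha1(repr(five_tuple).encode()).digest()
    queues[int.from_bytes(digest[:4], "big") % NUM_QUEUES].append(pkt)

def dequeue_one_round():
    # One round-robin pass: at most one packet from each busy queue.
    for q in queues:
        if q:
            yield q.popleft()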

Lastly, VOQs, LAG, and switch fabrics essentially FQ ports. In the
context of aggregating up to 400Gbit from that, you are FQing also.

Now, FQing inline against the IP headers at 400Gbit appeared
impossible until this convo, when the depth of the pipeline and
hardware hashing were discussed, but I'll settle for more RFC 7567
behavior just in stepping down from that to 100, and from 100
stepping down, adding in fq+aqm.


>
> > Denying those use cases because they don’t fit
> > your world view is short sighted.
>
> That could have been a valid argument 20 years ago.
>
> Masataka Ohta



--
FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/
Dave Täht CEO, TekLibre, LLC
Re: 400G forwarding - how does it work? [ In reply to ]
On Mon, Aug 8, 2022 at 5:39 AM <sronan@ronan-online.com> wrote:

> You keep using the term “imaginary” when presented with evidence that does
> not match your view of things.
>
> There are many REAL scenarios where single-flow, high-throughput TCP is a
> real requirement, as well as high throughput at extremely small packet
> sizes. In the case of the latter, the market is extremely large, but it’s
> not Internet traffic.
>

I believe this all started with ASIC experts saying trade-offs need to be
made to operate at crazy speeds in a single package.

Ohta-san is simply saying your use case did not make the cut, which is
clear. That said, ASIC makers have gotten things wrong (for me), and some
things they can adjust in code, others not so much. The LPM/LEM lookup
table distribution is certainly one that has burned me on IPv6 and MPLS
label scale, but thankfully SiliconOne can make some adjustments… but
watch out if your network is anything other than /48s.

The only thing that will change that is $$$.



> Shane
>
> > On Aug 8, 2022, at 7:34 AM, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
> >
> > Saku Ytti wrote:
> >
> >>> which is, unlike Yttinet, the reality.
> >> Yttinet has pesky customers who care about single TCP performance over
> >> long fat links, and observe poor performance with shallow buffers at
> >> the provider end.
> >
> > With such an imaginary assumption, according to the end to end
> > principle, the customers (the ends) should use paced TCP instead
> > of paying an unnecessarily bloated amount of money to intelligent
> > intermediate entities of ISPs using expensive routers with
> > bloated buffers.
> >
> >> Yttinet is cost sensitive and does not want to do
> >> work, unless sufficiently motivated by paying customers.
> >
> > I understand that if customers follow the end to end principle,
> > revenue of "intelligent" ISPs will be reduced.
> >
> > Masataka Ohta
> >
> >
> >
>
RE: 400G forwarding - how does it work? [ In reply to ]
Saku,

I have forwarded your questions to Sharada.

All,

For this week – at 11:00am PST, Thursday 08/11, we will be joined by Guy Caspary (co-founder of Leaba Semiconductor, acquired by Cisco -> SiliconOne).
https://m.youtube.com/watch?v=GDthnCj31_Y

For the next week, I’m planning to get one of the main architects of Broadcom DNX (Jericho/Qumran/Ramon).

Cheers,
Jeff

From: Saku Ytti
Sent: Friday, August 5, 2022 12:15 AM
To: Jeff Tantsura
Cc: NANOG; Jeff Doyle
Subject: Re: 400G forwarding - how does it work?

Thank you for this.

I wish there had been a deeper dive into the lookup side. My open questions:

a) Trio model of packet staying in a single PPE until done vs. FP model of
a line-of-PPEs (identical cores). I don't understand the advantages of
the FP model; the Trio model advantages are clear to me. Obviously the
FP model has to have some advantages, what are they?

b) What exactly are the gains of putting two Trios on-package in
Trio6? There is no local switching between the WANs of Trios in-package;
they are, as far as I can tell, ships in the night, packets between
Trios go via fabric, as they would with separate Trios. I can
understand the benefit of putting Trio and HBM2 on the same package,
to reduce distance so wattage goes down or frequency goes up.

c) What evolution are they thinking of for the shallow ingress buffers
in Trio6? The collateral damage potential is significant, because the WAN
which asks most gets most, instead of each having its fair share; thus
a potentially arbitrarily low-rate WAN ingress might not get access to
the ingress buffer, causing drops. Would it be practical in terms of
wattage/area to add some sort of pre-QoS towards the shallow ingress
buffer, so each WAN ingress has a fair guaranteed rate to the shallow
buffers?

On Fri, 5 Aug 2022 at 02:18, Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
>
> Apologies for garbage/HTMLed email, not sure what happened (thanks
> Brian F for letting me know).
> Anyway, the podcast with Juniper (mostly around Trio/Express) was
> broadcast today and is available at
> https://www.youtube.com/watch?v=1he8GjDBq9g
> Next in the pipeline are:
> Cisco SiliconOne
> Broadcom DNX (Jericho/Qumran/Ramon)
> For both, the guests are the main architects of the silicon
>
> Enjoy
>
>
> On Wed, Aug 3, 2022 at 5:06 PM Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
> >
> > Hey,
> >
> > This is not an advertisement but an attempt to help folks to better understand networking HW.
> >
> > Some of you might know (and love 😊) the “between 0x2 nerds” podcast Jeff Doyle and I have been hosting for a couple of years.
> >
> > Following up on that discussion, we have decided to dedicate a number of upcoming podcasts to networking HW, a topic where more information and better education is very much needed (no, you won’t have to sign an NDA before joining 😊). We have lined up a number of great guests, people who design and build ASICs and can talk firsthand about the evolution of networking HW, the complexity of the process, differences between fixed and programmable pipelines, memories and databases. This Thursday (08/04) at 11:00 PST we are joined by the one and only Sharada Yeluri - Sr. Director, ASIC at Juniper. Other vendors will be joining in later episodes; the usual rules apply – no marketing, no BS.
> >
> > More to come, stay tuned.
> >
> > Live feed: https://lnkd.in/gk2x2ezZ
> >
> > Between 0x2 nerds playlist, videos will be published to: https://www.youtube.com/playlist?list=PLMYH1xDLIabuZCr1Yeoo39enogPA2yJB7
> >
> > Cheers,
> >
> > Jeff
> >
> > From: James Bensley
> > Sent: Wednesday, July 27, 2022 12:53 PM
> > To: Lawrence Wobker; NANOG
> > Subject: Re: 400G forwarding - how does it work?
> >
> > On Tue, 26 Jul 2022 at 21:39, Lawrence Wobker <ljwobker@gmail.com> wrote:
> >
> > > So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 8 of these pipelines and get my performance target that way. I could also build a "pipeline" that processes multiple packets per clock; if I have one that does 2 packets/clock then I only need 4 of said pipelines.. and so on and so forth.
> >
> > Thanks for the response Lawrence.
> >
> > The Broadcom BCM16K KBP has a clock speed of 1.2GHz, so I expect the
> > J2 to have something similar (as someone already mentioned, most chips
> > I've seen are in the 1-1.5GHz range), so in this case "only" 2
> > pipelines would be needed to maintain the headline 2Bpps rate of the
> > J2, or even just 1 if they have managed to squeeze out two packets per
> > cycle through parallelisation within the pipeline.
> >
> > Cheers,
> >
> > James.

--
  ++ytti
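(A quick sketch of the pipeline arithmetic James quotes above: packets per second per pipeline is roughly clock rate times packets retired per clock, so the required pipeline count falls out directly. Python, with the numbers from his example:)

import math

def pipelines_needed(target_pps, clock_hz, pkts_per_clock=1):
    # Each pipeline retires pkts_per_clock packets every clock cycle.
    return math.ceil(target_pps / (clock_hz * pkts_per_clock))

print(pipelines_needed(2e9, 1.2e9))     # -> 2 pipelines at 1 pkt/clock
print(pipelines_needed(2e9, 1.2e9, 2))  # -> 1 pipeline at 2 pkts/clock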
Re: 400G forwarding - how does it work? [ In reply to ]
Saku Ytti wrote:

>> With such an imaginary assumption, according to the end to end
>> principle, the customers (the ends) should use paced TCP instead

> I fully agree, unfortunately I do not control the whole problem
> domain, and the solutions available with partial control over the
> domain are less than elegant.

OK. But you should be aware that, with a bloated buffer, all
the customers sharing the buffer will suffer from delay.
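
(The delay cost is straightforward to quantify: a buffer B bytes deep on a link of rate R adds up to B*8/R seconds of queueing delay for every flow behind it. A sketch with hypothetical numbers:)

# Worst-case queueing delay added by a full shared buffer.
def buffer_delay_ms(buffer_bytes, link_bps):
    return buffer_bytes * 8 / link_bps * 1e3

print(buffer_delay_ms(4e9, 100e9))   # 4 GB at 100G -> 320.0 ms for everyone
print(buffer_delay_ms(32e6, 100e9))  # 32 MB at 100G -> ~2.6 ms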

Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
Matthew Huff wrote:

> Also, for data center traffic, especially real-time market data and
> other UDP multicast traffic, micro-bursting is one of the biggest
> issues especially as you scale out your backbone.

Are you saying you rely on multicast even though loss of a packet
means loss of a large amount of money?

Is that the reason why you use large buffers: to eliminate the
possibility of packet drops caused by buffer overflow, though not
drops from other causes?

Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
Dave Taht wrote:

>> But as fair queuing does not scale at all, they disappeared
>> long ago.
>
> What do you mean by FQ, exactly?

Fair queuing is "fair queuing", not some queuing idea
which is, by someone, considered "fair".

See, for example,

https://en.wikipedia.org/wiki/Fair_queuing

> "5 tuple FQ" is scaling today

Fair queuing does not scale w.r.t. the number of queues.
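
(In that strict sense — e.g. the Demers/Keshav/Shenker scheme the Wikipedia page describes — the scheduler keeps a virtual finish time per flow, so state grows with the number of flows. A simplified sketch, assuming equal weights:)

import heapq

virtual_time = 0.0
last_finish = {}  # one entry per active flow: this is the state that grows
pending = []      # packets ordered by virtual finish time

def fq_enqueue(flow_id, pkt_len):
    # Classic FQ: a packet's finish time is its flow's previous finish
    # (or "now") plus its own transmission time.
    start = max(virtual_time, last_finish.get(flow_id, 0.0))
    last_finish[flow_id] = start + pkt_len
    heapq.heappush(pending, (last_finish[flow_id], flow_id, pkt_len))

def fq_dequeue():
    global virtual_time
    finish, flow_id, pkt_len = heapq.heappop(pending)
    virtual_time = finish  # simplified virtual-clock update
    return flow_id, pkt_len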

Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
How do you propose to fairly distribute market data feeds to the market if not multicast?

Shane


> On Aug 9, 2022, at 10:19 PM, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
>
> Matthew Huff wrote:
>
>> Also, for data center traffic, especially real-time market data and
>> other UDP multicast traffic, micro-bursting is one of the biggest
>> issues especially as you scale out your backbone.
>
> Are you saying you rely on multicast even though loss of a packet
> means loss of large amount of money?
>
> Is it a reason why you use large buffer to eliminate possibilities
> of packet dropping caused by buffer overflow but not by other
> reasons?
>
> Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
On Wed, 10 Aug 2022 at 06:48, <sronan@ronan-online.com> wrote:

> How do you propose to fairly distribute market data feeds to the market if not multicast?

I expected your aggressive support for small packets was for fintech.
An anecdote:

one of the largest exchanges in the world used MX for multicast
replication, which is btree, or today utree, replication; that is, each
NPU gets the replicated packet at a wildly different time, and therefore
so do the receivers. This wasn't a problem for them, because they didn't
know that's how it works and suffered no negative consequence from it,
which arguably should have been a show stopper if we need receivers to
receive it at a remotely similar time.
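
(A stylized model of that skew: in binary-tree replication the set of copies roughly doubles per step, so the first leaf is one replication delay away and the last is ~log2(N) delays away. The hop delay below is a made-up placeholder.)

import math

def btree_delivery_times(n_receivers, hop_usec=1.0):
    # Stylized binary-tree replication: copies double per step, so
    # receiver i is reached after ~floor(log2(i+1)) hop delays.
    return [math.floor(math.log2(i + 1)) * hop_usec
            for i in range(1, n_receivers + 1)]

times = btree_delivery_times(64)
print(min(times), max(times))  # 1.0 6.0: first vs last receiver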

Also, it is not in disagreement with my statement that it is not an
addressable market, because this market can use products which do not
do 64B wire-rate, for two separate reasons, either/and: a) the port is
nowhere near congested; b) the market is not cost sensitive, they buy
the device with many WAN ports and don't provision them to the point
where they can't get 64B on each actually used port.


--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
Sharada’s answers:

a) Yes, the run-to-completion model of Trio is superior to the FP5/Nokia model when it comes to flexible processing engines. In Trio, the same engines can do either ingress or egress processing. Traditionally, there is more processing on ingress than on egress. When that happens, by design, fewer processing engines get used for egress, and more engines are available for ingress processing. Trio gives full flexibility. Unless Nokia is optimizing the engines (not all engines are identical, and some are area-optimized for specific processing) to save overall area, I do not see any other advantage.

b) Trio provides on-chip shallow buffering on ingress for fabric queues. We share this buffer between the slices on the same die. This gives us the flexibility to go easy on the size of the SRAM we want to support for buffering.

c) I didn't completely follow the question. Shallow ingress buffers are for fabric-facing queues, and we do not expect sustained fabric congestion. This, combined with the fact that we have some speed-up over the fabric, ensures that all WAN packets do reach the egress PFE buffer. On ingress, if packet processing is oversubscribed, we have line-rate pre-classifiers that do proper drops based on WAN queue priority.

Cheers,
Jeff

> On Aug 9, 2022, at 16:34, Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
>
>
> Saku,
>
> I have forwarded your questions to Sharada.
>
> All,
>
> For this week – at 11:00am PST, Thursday 08/11, we will be joined by Guy Caspary (co-founder of Leaba Semiconductor, acquired by Cisco -> SiliconOne).
> https://m.youtube.com/watch?v=GDthnCj31_Y
>
> For the next week, I’m planning to get one of the main architects of Broadcom DNX (Jericho/Qumran/Ramon).
>
> Cheers,
> Jeff
>
> From: Saku Ytti
> Sent: Friday, August 5, 2022 12:15 AM
> To: Jeff Tantsura
> Cc: NANOG; Jeff Doyle
> Subject: Re: 400G forwarding - how does it work?
>
> Thank you for this.
>
> I wish there had been a deeper dive into the lookup side. My open questions:
>
> a) Trio model of packet staying in a single PPE until done vs. FP model of
> a line-of-PPEs (identical cores). I don't understand the advantages of
> the FP model; the Trio model advantages are clear to me. Obviously the
> FP model has to have some advantages, what are they?
>
> b) What exactly are the gains of putting two Trios on-package in
> Trio6? There is no local switching between the WANs of Trios in-package;
> they are, as far as I can tell, ships in the night, packets between
> Trios go via fabric, as they would with separate Trios. I can
> understand the benefit of putting Trio and HBM2 on the same package,
> to reduce distance so wattage goes down or frequency goes up.
>
> c) What evolution are they thinking of for the shallow ingress buffers
> in Trio6? The collateral damage potential is significant, because the WAN
> which asks most gets most, instead of each having its fair share; thus
> a potentially arbitrarily low-rate WAN ingress might not get access to
> the ingress buffer, causing drops. Would it be practical in terms of
> wattage/area to add some sort of pre-QoS towards the shallow ingress
> buffer, so each WAN ingress has a fair guaranteed rate to the shallow
> buffers?
>
> On Fri, 5 Aug 2022 at 02:18, Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
> >
> > Apologies for garbage/HTMLed email, not sure what happened (thanks
> > Brian F for letting me know).
> > Anyway, the podcast with Juniper (mostly around Trio/Express) was
> > broadcast today and is available at
> > https://www.youtube.com/watch?v=1he8GjDBq9g
> > Next in the pipeline are:
> > Cisco SiliconOne
> > Broadcom DNX (Jericho/Qumran/Ramon)
> > For both, the guests are the main architects of the silicon
> >
> > Enjoy
> >
> >
> > On Wed, Aug 3, 2022 at 5:06 PM Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
> > >
> > > Hey,
> > >
> > >
> > >
> > > This is not an advertisement but an attempt to help folks to better understand networking HW.
> > >
> > >
> > >
> > > Some of you might know (and love 😊) the “between 0x2 nerds” podcast Jeff Doyle and I have been hosting for a couple of years.
> > >
> > >
> > >
> > > Following up on that discussion, we have decided to dedicate a number of upcoming podcasts to networking HW, a topic where more information and better education is very much needed (no, you won’t have to sign an NDA before joining 😊). We have lined up a number of great guests, people who design and build ASICs and can talk firsthand about the evolution of networking HW, the complexity of the process, differences between fixed and programmable pipelines, memories and databases. This Thursday (08/04) at 11:00 PST we are joined by the one and only Sharada Yeluri - Sr. Director, ASIC at Juniper. Other vendors will be joining in later episodes; the usual rules apply – no marketing, no BS.
> > >
> > > More to come, stay tuned.
> > >
> > > Live feed: https://lnkd.in/gk2x2ezZ
> > >
> > > Between 0x2 nerds playlist, videos will be published to: https://www.youtube.com/playlist?list=PLMYH1xDLIabuZCr1Yeoo39enogPA2yJB7
> > >
> > >
> > >
> > > Cheers,
> > >
> > > Jeff
> > >
> > >
> > >
> > > From: James Bensley
> > > Sent: Wednesday, July 27, 2022 12:53 PM
> > > To: Lawrence Wobker; NANOG
> > > Subject: Re: 400G forwarding - how does it work?
> > >
> > >
> > >
> > > On Tue, 26 Jul 2022 at 21:39, Lawrence Wobker <ljwobker@gmail.com> wrote:
> > >
> > > > So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 8 of these pipelines and get my performance target that way. I could also build a "pipeline" that processes multiple packets per clock; if I have one that does 2 packets/clock then I only need 4 of said pipelines.. and so on and so forth.
> > >
> > >
> > >
> > > Thanks for the response Lawrence.
> > >
> > >
> > >
> > > The Broadcom BCM16K KBP has a clock speed of 1.2GHz, so I expect the
> > >
> > > J2 to have something similar (as someone already mentioned, most chips
> > >
> > > I've seen are in the 1-1.5GHz range), so in this case "only" 2
> > >
> > > pipelines would be needed to maintain the headline 2Bpps rate of the
> > >
> > > J2, or even just 1 if they have managed to squeeze out two packets per
> > >
> > > cycle through parallelisation within the pipeline.
> > >
> > >
> > >
> > > Cheers,
> > >
> > > James.
> > >
> > >
>
>
>
> --
> ++ytti
>
Re: 400G forwarding - how does it work? [ In reply to ]
sronan@ronan-online.com wrote:

> How do you propose to fairly distribute market data feeds to the
> market if not multicast?

Unicast with randomized order.
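
(A minimal sketch of that idea: shuffle the unicast send order per update, so over many updates no subscriber is systematically first. The send callback is a placeholder.)

import random

def publish(update, subscribers, send):
    # Randomize the unicast send order per update so that, over many
    # updates, each subscriber is first (and last) equally often.
    order = list(subscribers)
    random.shuffle(order)
    for sub in order:
        send(sub, update)

# Example with a placeholder transport:
subs = ["sub-a", "sub-b", "sub-c"]
for update in [b"tick1", b"tick2"]:
    publish(update, subs, lambda sub, u: print(sub, u))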

To minimize latency, bloated buffers should be avoided
and TCP with a configured small (initial) RTT should be
used.

Masataka Ohta
