Mailing List Archive

400G forwarding - how does it work?
Hi All,

I've been trying to understand how forwarding at 400G is possible,
specifically in this example, in relation to the Broadcom J2 chips,
but I don't think the mystery is anything specific to them...

According to the Broadcom Jericho2 BCM88690 data sheet it provides
4.8Tbps of traffic processing and supports packet forwarding at 2Bpps.
According to my maths that means it requires packet sizes of 300B to
reach line rate across all ports. The data sheet says packet sizes
above 284B, so I guess this is excluding some headers like the
inter-frame gap and CRC (nothing after the PHY/MAC needs to know about
them if the CRC is valid)? As I interpret the data sheet, J2 should
support a chassis with 12x 400Gbps ports at line rate with 284B
packets, then.
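
As a quick sanity check of that maths in Python (the 20B of preamble +
inter-frame gap overhead per packet is my assumption, not a data sheet
figure):

# Back-of-the-envelope check of the Jericho2 data sheet numbers.
THROUGHPUT_BPS = 4.8e12   # 4.8 Tbps traffic processing
FORWARDING_PPS = 2e9      # 2 Bpps packet forwarding
PORT_SPEED_BPS = 400e9
PORTS = 12
WIRE_OVERHEAD_B = 20      # assumed 8B preamble + 12B inter-frame gap

# Smallest packet the chip can forward at full throughput:
min_packet_b = THROUGHPUT_BPS / FORWARDING_PPS / 8
print(f"Min packet size for 4.8Tbps @ 2Bpps: {min_packet_b:.0f}B")        # 300B

# Per-port and total packet rates at 400G line rate with 284B packets:
pps_per_port = PORT_SPEED_BPS / ((284 + WIRE_OVERHEAD_B) * 8)
print(f"400G line rate @ 284B packets: {pps_per_port:,.0f} pps")          # ~164.5Mpps
print(f"All {PORTS} ports: {PORTS * pps_per_port / 1e9:.2f} Bpps")        # ~1.97Bpps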

Jericho2 can be linked to a BCM16K for expanded packet forwarding
tables and lookup processing (i.e. to hold the full global routing
table; in such a case, forwarding lookups are offloaded to the
BCM16K). The BCM16K documentation suggests that it uses TCAM for exact
matching (e.g., for ACLs) in something called the "Database Array"
(with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in
something called the "User Data Array" (with 16M 32b entries?).
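
If those entry counts are right, the raw capacities work out roughly as
follows (illustrative only, based on the figures quoted above):

# Rough capacity of the two BCM16K arrays as described above.
database_array_bits = 2_000_000 * 40     # TCAM: 2M x 40b entries
user_data_array_bits = 16_000_000 * 32   # SRAM: 16M x 32b entries
print(f"Database Array:  {database_array_bits / 1e6:.0f} Mb of TCAM")    # 80 Mb
print(f"User Data Array: {user_data_array_bits / 1e6:.0f} Mb of SRAM")   # 512 Mb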

A BCM16K supports 16 parallel searches, which means that each of the
12x 400G ports on a Jericho2 could perform a forwarding lookup at the
same time. This means that the BCM16K "only" needs to perform
forwarding look-ups at a linear rate of 1x 400Gbps, not 4.8Tbps, and
"only" for packets larger than 284 bytes, because that is the Jericho2
line-rate pps rate. This means that each of the 16 parallel searches
in the BCM16K needs to support a rate of 164Mpps (164,473,684 pps) to
reach 400Gbps. This is much more in the realm of feasible, but still
pretty extreme...

1 second / 164,473,684 packets = 1 packet every 6.08 nanoseconds, which
is within the access time of TCAM and SRAM, but this needs to include
some computing time too, e.g. generating a key for a lookup and passing
the results along the pipeline etc. The BCM16K has a clock speed of
1GHz (1,000,000,000 cycles per second, or one cycle every 1 nanosecond)
and supports an SRAM memory access in a single clock cycle (according
to the data sheet). If one cycle is required for an SRAM lookup, the
BCM16K only has 5 cycles to perform other computation tasks, and the
J2 chip needs to do the various header re-writes and various counter
updates etc., so how is this magic happening?!?
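
Putting that per-lookup budget into numbers (the 1GHz clock and single-cycle
SRAM access are taken from my reading of the data sheet above):

# Per-packet time/cycle budget for one of the 16 parallel search engines.
pps_per_search = 164_473_684            # 400G @ 284B (+ overhead) packets
packet_interval_ns = 1e9 / pps_per_search
clock_ns = 1.0                          # 1GHz -> 1ns per cycle
cycles_per_packet = packet_interval_ns / clock_ns
print(f"One packet every {packet_interval_ns:.2f}ns")                     # ~6.08ns
print(f"Cycle budget per packet: {cycles_per_packet:.1f}")                # ~6 cycles
print(f"Cycles left after a 1-cycle SRAM read: {cycles_per_packet - 1:.1f}")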

The obvious answer is that it's not magic and my understanding is
fundamentally flawed, so please enlighten me.

Cheers,
James.
Re: 400G forwarding - how does it work? [ In reply to ]
Once upon a time, James Bensley <jwbensley+nanog@gmail.com> said:
> The obvious answer is that it's not magic and my understanding is
> fundamentally flawed, so please enlighten me.

So I can't answer to your specific question, but I just wanted to say
that your CPU analysis is simplistic and doesn't really match how CPUs
work now. Something can be "line rate" but not push the first packet
through in the shortest time. CPUs break operations down into a series
of very small operations and then run those operations in a pipeline,
with different parts of the CPU working on the micro operations for
different overall operations at the same time. The first object out of
the pipeline (packet destination calculated in this case) may take more
time, but then after that you keep getting a result every cycle/few
cycles.

For example, it might take 4 times as long to process the first packet,
but as long as the hardware can handle 4 packets in a queue, you'll get
a packet result every cycle after that, without dropping anything. So
maybe the first result takes 12 cycles, but then you can keep getting a
result every 3 cycles as long as the pipeline is kept full.
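
To put the example numbers above into a toy model (12 cycles for the first
result, one result every 3 cycles after that; these are illustrative, not
any real chip):

# Toy model: pipeline latency vs. throughput.
def cycles_to_process(n_packets, first_latency=12, issue_interval=3):
    """Total cycles to get n_packets results out of a full pipeline."""
    if n_packets == 0:
        return 0
    return first_latency + (n_packets - 1) * issue_interval

for n in (1, 10, 1_000_000):
    total = cycles_to_process(n)
    print(f"{n:>9} packets -> {total:>9} cycles "
          f"({total / n:.3f} cycles/packet average)")
# The per-packet average approaches the issue interval (3 cycles), even
# though no single packet ever finishes in fewer than 12 cycles.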

This type of pipelined+superscalar processing was a big deal with Cray
supercomputers, but made it down to PC-level hardware with the Pentium
Pro. It has issues (see all the Spectre and Retbleed CPU flaws with
branch prediction for example), but in general it allows a CPU to handle
a chain of operations faster than it can handle each operation
individually.

--
Chris Adams <cma@cmadams.net>
Re: 400G forwarding - how does it work? [ In reply to ]
I'm not sure what your specific question is, so I'll answer my own
question instead.

Q: How can we do lookups fast enough to do 'big number' per second,
while the underlying hardware inherently takes longer?
A: We throw memory at the problem.

I.e. say a JNPR Trio PPE has many threads, and only one thread is
running; the rest of the threads are waiting for answers from memory.
That is, once we start pushing packets through the device, it takes a
long-ass time (like single-digit microseconds) before we see any packets
out. 1000x longer than your calculated single-digit nanoseconds.


--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
So the most important bits are pipelining and parallelism. And this is substantially simplified, but hopefully it helps.

Pipelining basically means that you have a whole bunch of different operations that you need to perform to forward a packet. Lots of these are lookups into things like the FIB tables, the encap tables, the MAC tables, and literally dozens of other places where you store configuration and network state. Some of these are very small simple tables ("give me the value for a packet with TOS = 0b101") and some are very complicated, like multi-level longest-prefix trees/tries that are built from lots of custom hardware logic and memory. It varies a lot from chip to chip, but there are on the order of 50-100 different tables for the current generation of "fast" chips doing lots of 400GE interfaces. Figuring out how to distribute all this forwarding state across all the different memory banks/devices in a big, fast chip is one of the Very Hard Problems that the chip makers and system vendors have to figure out.

So once you build out this pipeline, you've got a bunch of different steps that all happen sequentially. The "length" of the pipeline puts a floor on the latency for switching a single packet -- if I have to do 25 lookups and they're all dependent on the one before, it's not possible for me to switch the packet in any less than 25 clocks. BUT, if I have a bunch of hardware all running these operations at the same time, I can push the aggregate forwarding capacity way higher. This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput. Now there's plenty of complexity in terms of HOW I do all that parallelism -- figuring out whether I have to replicate entire memory structures or if I can come up with sneaky ways of doing multiple lookups more efficiently, but that's getting into the magic secret sauce type stuff.

I work on/with a chip that can forward about 10B packets per second... so if we go back to the order-of-magnitude number that I'm doing about "tens" of memory lookups for every one of those packets, we're talking about something like a hundred BILLION total memory lookups... and since memory does NOT give me answers in 1 picosecond... we get back to pipelining and parallelism.
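
To put rough numbers on that (the 10Bpps and "tens of lookups" figures are
from above; the memory access time is a made-up illustrative value, not a
real device spec):

# Rough parallelism math for ~10Bpps with ~10 lookups per packet.
pps = 10e9
lookups_per_packet = 10          # "tens" -> order of magnitude 10
mem_access_ns = 2.0              # assumed single-bank random access time

lookups_per_sec = pps * lookups_per_packet           # ~1e11 lookups/sec
lookups_per_bank = 1e9 / mem_access_ns               # 500M lookups/sec per bank
banks_needed = lookups_per_sec / lookups_per_bank
print(f"Total lookups/sec: {lookups_per_sec:.2e}")
print(f"Concurrent memory banks/pipelines needed: {banks_needed:.0f}")    # ~200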

Hopefully that helps at least some.

Disclaimer: I'm a Cisco employee, these words are mine and not representative of anything awesome that I may or may not work on in my day job...

--lj
Re: 400G forwarding - how does it work? [ In reply to ]
Thanks for the responses Chris, Saku…

On Mon, 25 Jul 2022 at 15:17, Chris Adams <cma@cmadams.net> wrote:
>
> Once upon a time, James Bensley <jwbensley+nanog@gmail.com> said:
> > The obvious answer is that it's not magic and my understanding is
> > fundamentally flawed, so please enlighten me.
>
> So I can't answer to your specific question, but I just wanted to say
> that your CPU analysis is simplistic and doesn't really match how CPUs
> work now.

It wasn't a CPU analysis because switching ASICs != CPUs.

I am aware of the x86 architecture, but know little of network ASICs,
so I was deliberately trying to not apply my x86 knowledge here, in
case it sent me down the wrong path. You made references towards
typical CPU features;

> For example, it might take 4 times as long to process the first packet,
> but as long as the hardware can handle 4 packets in a queue, you'll get
> a packet result every cycle after that, without dropping anything. So
> maybe the first result takes 12 cycles, but then you can keep getting a
> result every 3 cycles as long as the pipeline is kept full.

Yes, in the x86/x64 CPU world keeping the instruction cache and data
cache hot indeed results in optimal performance, and as you say modern
CPUs use parallel pipelines amongst other techniques like branch
prediction, SIMD, (N)UMA, and so on, but I would assume (because I
don't know) that not all of the x86 feature set maps nicely to packet
processing in ASICs (VPP uses these techniques on COTS CPUs to
emulate a fixed pipeline, rather than a run-to-completion model).

You and Saku both suggest that heavy parallelism is the magic source;

> Something can be "line rate" but not push the first packet
> through in the shortest time.

On Mon, 25 Jul 2022 at 15:16, Saku Ytti <saku@ytti.fi> wrote:
> I.e. say JNPR Trio PPE has many threads, and only one thread is
> running, rest of the threads are waiting for answers from memory. That
> is, once we start pushing packets through the device, it takes a long
> ass time (like single digit microseconds) before we see any packets
> out. 1000x longer than your calculated single digit nanoseconds.

In principle I accept this idea. But let's try and do the maths, I'd
like to properly understand;

The non-drop rate of the J2 is 2Bpps @ 284 bytes == 4.8Tbps, my
example scenario was a single J2 chip in a 12x400G device. If each
port is receiving 400G @ 284 bytes (164,473,684 pps), that’s one every
6.08 nanoseconds coming in. What kind of parallelism is required to
stop from ingress dropping?

It takes say 5 microseconds to process and forward a packet (seems
reasonable looking at some Arista data sheets which use J2 variants),
which means we need to be operating on 5,000ns / 6.08ns == 822 packets
per port simultaneously, so 9868 packets are being processed across
all 12 ports simultaneously, to stop ingress dropping on all
interfaces.
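
That sum in Python (effectively Little's law: packets in flight equals
per-packet latency divided by packet inter-arrival time; the 5us figure is
my assumption from the data sheets mentioned above):

# Packets that must be in flight to avoid ingress drops.
processing_ns = 5_000        # assumed ~5us pipeline/processing latency
arrival_ns = 6.08            # one 284B packet every 6.08ns at 400G
ports = 12
in_flight_per_port = processing_ns / arrival_ns
print(f"Per 400G port: {in_flight_per_port:.0f} packets in flight")    # ~822
print(f"Across {ports} ports: {ports * in_flight_per_port:.0f}")       # ~9868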

I think the latest generation Trio has 160 PPEs per PFE, but I’m not
sure how many threads per PPE. Older generations had 20
threads/contexts per PPE, so if it hasn’t increased that would make
for 3200 threads in total. That is a 1.6Tbps FD chip, although not
apples to apples of course, Trio is run to completion too.

The Nokia FP5 has 1,200 cores (I have no idea how many threads per
core) and is rated for 4.8Tbps FD. Again doing something quite
different to a J2 chip, and again it's RTC.

J2 is a partially-fixed pipeline, but slightly programmable if I have
understood correctly, and definitely at the other end of the spectrum
compared to RTC. So are we to surmise that a J2 chip has circa 10k
parallel pipelines, in order to process 9868 packets in parallel?

I have no frame of reference here, but in comparison to Gen 6 Trio or
FP5, that seems very high to me (to the point where I assume I am
wrong).

Cheers,
James.
Re: 400G forwarding - how does it work? [ In reply to ]
Hi Lawrence, thanks for your response.

On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwobker@gmail.com> wrote:
> This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput.
...
> I work on/with a chip that can forwarding about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picoseconds… we get back to pipelining and parallelism.

What level of parallelism is required to forward 10Bpps? Or 2Bpps like
my J2 example :)

Cheers,
James.
Re: 400G forwarding - how does it work? [ In reply to ]
On Mon, 25 Jul 2022 at 21:51, James Bensley <jwbensley+nanog@gmail.com> wrote:

> I have no frame of reference here, but in comparison to Gen 6 Trio of
> NP5, that seems very high to me (to the point where I assume I am
> wrong).

No, you are right, FP has many, many more PPEs than Trio.

For a fair calculation, you compare how many lines the FP has to how many
PPEs Trio has. Because in Trio a single PPE handles the entire packet, and
all PPEs run identical ucode, performing the same work.

In FP each PPE in a line has its own function: the first PPE in the line
could be parsing the packet and extracting keys from it, the second could
be doing ingressACL, the 3rd ingressQoS, the 4th the ingress lookup, and
so forth.

Why choose this NP design instead of the Trio design, I don't know. I
don't understand the upsides.

The downside is easy to understand: picture yourself as a ucode developer,
and you get the task to 'add this magic feature in the ucode'.
Implementing it in Trio seems trivial: add the code in the ucode, rock on.
On FP, you might have to go 'aww shit, I need to do this before PPE5
but after PPE3 in the pipeline, but the instruction cost it adds isn't
in the budget that I have in PPE4, crap, now I need to shuffle
around and figure out which PPE in the line runs what function to keep
the PPS we promise to customers'.

Let's look at it from another vantage point: let's cook up an IPv6 header
with a crapton of EHs. In Trio, the PPE keeps churning through it, taking
a long time, but eventually it gets there or raises an exception and
gives up. Every other PPE in the box is fully available to perform work.
Same thing in FP? You have HOLB: the PPEs in the line after this PPE
are not doing anything and can't do anything, until the PPE before them
in the line is done.

Today Cisco and Juniper do 'proper' CoPP, that is, they do ingressACL
before and after lookup; before is normally needed for ingressACL, but
the after-lookup ingressACL is needed for CoPP (we only know after the
lookup if it is a control-plane packet). Nokia doesn't do this at all,
and I bet they can't do it, because if they'd add it in the core where
it needs to be in the line, total PPS would go down, as there is no
budget for an additional ACL. Instead all control-plane packets from
the ingress FP are sent to the control-plane FP, and inshallah we don't
congest that connection or the control-plane FP itself.


--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+nanog@gmail.com>
wrote:

> On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwobker@gmail.com> wrote:
> > This is the parallelism part. I can take multiple instances of these
> memory/logic pipelines, and run them in parallel to increase the throughput.
> ...
> > I work on/with a chip that can forwarding about 10B packets per second…
> so if we go back to the order-of-magnitude number that I’m doing about
> “tens” of memory lookups for every one of those packets, we’re talking
> about something like a hundred BILLION total memory lookups… and since
> memory does NOT give me answers in 1 picoseconds… we get back to pipelining
> and parallelism.
>
> What level of parallelism is required to forward 10Bpps? Or 2Bpps like
> my J2 example :)
>

I suspect many folks know the exact answer for J2, but it's likely under
NDA to talk about said specific answer for a given thing.

Without being platform- or device-specific, the core clock rate of many
network devices is often in a "goldilocks" zone of (today) 1 to 1.5GHz, with
a goal of 1 packet forwarded 'per clock'. As LJ described the pipeline, that
doesn't mean a latency of 1 clock ingress-to-egress, but rather that every
clock there is a forwarding decision from one 'pipeline', and the MPPS/BPPS
packet rate is achieved by having enough pipelines in parallel to achieve
that.
The number here is often "1" or "0.5", so you can work the number backwards
(e.g. it emits a packet every clock, or every 2nd clock).
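
Working the numbers backwards as suggested (these inputs are illustrative,
not any specific device):

# Pipelines needed = target packet rate / (clock rate * packets per clock).
def pipelines_needed(target_pps, clock_hz=1.25e9, packets_per_clock=1.0):
    return target_pps / (clock_hz * packets_per_clock)

for target in (2e9, 10e9):
    print(f"{target/1e9:.0f} Bpps @ 1.25GHz, 1 pkt/clock:  "
          f"{pipelines_needed(target):.1f} pipelines")
    print(f"{target/1e9:.0f} Bpps @ 1.25GHz, 0.5 pkt/clock: "
          f"{pipelines_needed(target, packets_per_clock=0.5):.1f} pipelines")
# i.e. a handful of parallel pipelines, not the ~10K speculated earlier.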

It's possible to build an ASIC/NPU to run at a faster clock rate, but that
gets back to what I'm hand-wavingly describing as "goldilocks". Look up
power vs frequency and you'll see it's non-linear.
Just as CPUs can scale by adding more cores (vs increasing frequency),
the same roughly holds true for network silicon, and you can go wider with
multiple pipelines. But it's not 10K parallel slices; there are some
parallel parts, but there are multiple 'stages' in each doing different
things.

Using your CPU comparison, there are some analogies here that do work:
- you have multiple cpu cores that can do things in parallel -- analogous
to pipelines
- they often share some common I/O (e.g. CPUs have PCIe, maybe sharing
some DRAM or LLC) -- maybe some lookup engines, or centralized
buffer/memory
- most modern CPUs are out-of-order execution, where under-the-covers a
cache-miss or DRAM fetch has a disproportionate hit on performance, so it's
hidden away from you as much as possible by speculative, out-of-order
execution
-- no direct analogy to this one - it's unlikely most forwarding
pipelines do speculative execution like a general purpose CPU does - but
they definitely do 'other work' while waiting for a lookup to happen

A common-garden x86 is unlikely to achieve such a rate for a few different
reasons:
- packets-in or packets-out go via DRAM, then you need sufficient DRAM
(page opens/sec, DRAM bandwidth) to sustain at least one write and one read
per packet. Look closer at DRAM and see its speed; pay attention to page
opens/sec, and what that consumes.
- one 'trick' is to not DMA packets to DRAM but instead have it go into
SRAM of some form - e.g. Intel DDIO, ARM Cache Stashing, which at least
potentially saves you that DRAM write+read per packet
- ... but then do e.g. an LPM lookup, and best case that is back to a
memory access per packet. Maybe it's in L1/L2/L3 cache, but likely at large
table sizes it isn't.
- ... do more things to the packet (urpf lookups, counters) and it's yet
more lookups.

Software can achieve high rates, but note that a typical ASIC/NPU does on
the order of >100 separate lookups per packet, and 100 counter updates per
packet.
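
For a sense of scale, here is what that many per-packet memory touches can
mean for a software forwarder (the DRAM latency and overlap factor are
generic assumptions, not measurements):

# What ~200 memory ops/packet implies when tables mostly miss cache.
dram_access_ns = 80            # assumed dependent DRAM access latency
mem_ops_per_packet = 200       # ~100 lookups + ~100 counter updates
dependent_fraction = 0.1       # assume 90% can be overlapped/prefetched

effective_ns = mem_ops_per_packet * dependent_fraction * dram_access_ns
pps_per_core = 1e9 / effective_ns
print(f"~{effective_ns:.0f} ns/packet -> ~{pps_per_core/1e6:.2f} Mpps per core")
# Even with generous overlap assumptions, a general-purpose core is orders
# of magnitude away from billions of packets per second.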
Just as forwarding in an ASIC or NPU is a series of tradeoffs, forwarding in
software on generic CPUs is also a series of tradeoffs.


cheers,

lincoln.
Re: 400G forwarding - how does it work? [ In reply to ]
>
> It wasn't a CPU analysis because switching ASICs != CPUs.
>
> I am aware of the x86 architecture, but know little of network ASICs,
> so I was deliberately trying to not apply my x86 knowledge here, in
> case it sent me down the wrong path. You made references towards
> typical CPU features;
>

A CPU is 'jack of all trades, master of none'. An ASIC is 'master of one
specific thing'.

If a given feature or design paradigm found in a CPU fits with the use case
the ASIC is being designed for, there's no reason it cannot be used.


RE: 400G forwarding - how does it work? [ In reply to ]
All high-performance networking devices on the market have a pipeline architecture.

The pipeline consists of "stages".



ASICs have stages fixed to particular functions:

[inline image: diagram of fixed-function ASIC pipeline stages]

Well, some stages are driven by code these days (a little flexibility).



Juniper is pipeline-based too (like any ASIC). They just invented one special stage in 1996 for lookup (a sequential search by nibble in a big external memory tree) – it was public information up to the year 2000. It is a different principle from a TCAM search – performance is traded for flexibility/simplicity/cost.



Network Processors emulate stages on general-purpose ARM cores. It is a pipeline too (different cores for different functions, many cores for every function), it is just a virtual pipeline.



Ed/

Re: 400G forwarding - how does it work? [ In reply to ]
On Tue, 26 Jul 2022 at 10:52, Vasilenko Eduard <vasilenko.eduard@huawei.com>
wrote:

> Juniper is pipeline-based too (like any ASIC). They just invented one
> special stage in 1996 for lookup (sequence search by nibble in the big
> external memory tree) – it was public information up to 2000year. It is a
> different principle from TCAM search – performance is traded for
> flexibility/simplicity/cost.
>

How do you define a pipeline? My understanding is that fabric and WAN
connections are in a chip called MQ; the 'head' of the packet, being some
320B or so (a bit less on more modern Trio, I didn't measure specifically),
is then sent to the LU complex for lookup.
The LU then sprays packets to one of many PPEs, but once a packet hits a
PPE, it is processed until done; it doesn't jump to another PPE.
Reordering will occur, which is later restored for flows, but outside of
flows the reorder may remain.

I don't know what the cores are, but I'm comfortable betting money they
are not ARM. I know Cisco used to use EZchip in the ASR9k but is now
jumping to their own NPU called Lightspeed, and Lightspeed, like CRS-1 and
ASR1k, uses Tensilica cores, which are decidedly not ARM.

Nokia, as mentioned, kind of has a pipeline, because a single packet hits
every core in the line, and each core does a separate thing.



--
++ytti
RE: 400G forwarding - how does it work? [ In reply to ]
Nope, ASIC vendors are not ARM-based for the PFE. Every "stage" is a very specialized ASIC block with limited programmability (not so limited for P4 and some latest-generation ASICs).
ARM cores are for Network Processors (NPs). ARM cores (with the proper microcode) could emulate any "stage" of an ASIC. That is the typical explanation for why NPs are more flexible than ASICs.

Stages are connected to a common internal memory where enriched packet headers are stored. The pipeline is just the order of stages that process these internal enriched headers.
The size of this internal header is the critical restriction of the ASIC, never disclosed or discussed (but people know it anyway for the most popular ASICs – it is possible to google "key buffer").
Hint: the smallest one in the industry is 128 bytes, the biggest 384 bytes. It is not possible to process longer headers in one PFE pass.
A non-compressed SRv6 header could be 208 bytes (+TCP/UDP +VLAN +L2 +ASIC internal stuff). Hence the need for compression.
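
For reference, the 208-byte figure is consistent with a plain IPv6 + SRH
encoding (sizes below are per my reading of RFC 8200/8754, not from any
vendor document):

# Uncompressed SRv6 header size: 40B IPv6 + 8B SRH base + 16B per SID.
def srv6_header_bytes(num_sids):
    return 40 + 8 + 16 * num_sids

for sids in (5, 10):
    print(f"{sids} SIDs -> {srv6_header_bytes(sids)}B")        # 128B, 208B

# Compare against the "key buffer" sizes mentioned above:
for key_buffer in (128, 384):
    max_sids = (key_buffer - 48) // 16
    print(f"{key_buffer}B key buffer fits at most {max_sids} SIDs "
          "(before any L2/VLAN/L4/internal metadata)")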

It was a big marketing announcement from one famous ASIC vendor just a few years ago that some ASIC stages are capable of dynamically sharing a common big external memory (used for MACs/IPs/filters).
It may be internal memory too for small scale, but typically it is external memory. This memory is always discussed in detail – it is needed by the operations team.

This is only about headers. The packet itself (the payload) is stored in a separate memory (buffer) that is not visible to the pipeline stages.

There were times when it was difficult to squeeze everything into one ASIC. Then one chip would prepare an internal (enriched) header and maybe do some processing (some simple stages), then send this header to the next chip for other "stages" (especially the complicated lookup with external memory connected). That is an artifact now.

Ed/
Re: 400G forwarding - how does it work? [ In reply to ]
>
> How do you define a pipeline?


For what it's worth, with just a cursory look through this email, and
without wishing to offend anyone's knowledge: a pipeline in processing is
the division of the instruction cycle into a number of stages.
General-purpose RISC processors are often organized into five such stages.
Under optimal conditions, which can be fairly (albeit loosely) interpreted
as "one instruction does not affect its peers which are already in one of
the stages", a pipeline can increase the number of instructions retired
per second, often quoted as MIPS (millions of instructions per second),
by a factor equal to the number of stages in the pipeline.


Cheers,

Etienne




--
Ing. Etienne-Victor Depasquale
Assistant Lecturer
Department of Communications & Computer Engineering
Faculty of Information & Communication Technology
University of Malta
Web. https://www.um.edu.mt/profile/etiennedepasquale
RE: 400G forwarding - how does it work? [ In reply to ]
Pipeline stages are like separate computers (with their own ALUs) sharing the same memory.
In the ASIC case, the computers are of different types (different capabilities).

Re: 400G forwarding - how does it work? [ In reply to ]
On 25 July 2022 19:02:50 UTC, Saku Ytti <saku@ytti.fi> wrote:
>On Mon, 25 Jul 2022 at 21:51, James Bensley <jwbensley+nanog@gmail.com> wrote:
>
>> I have no frame of reference here, but in comparison to Gen 6 Trio of
>> NP5, that seems very high to me (to the point where I assume I am
>> wrong).
>
>No you are right, FP has much much more PPEs than Trio.

Can you give any examples?


>Why choose this NP design instead of Trio design, I don't know. I
>don't understand the upsides.

I think one use case is fixed latency. If you have minimal variation in your traffic you can provide a guaranteed upper bound on latency. This should be possible with the RTC model too of course, just harder, because any variation in traffic at all will result in a different run-time duration, and I imagine it is easier to measure, find, and fix/tune chunks of code (running on separate cores, like in a pipeline) than more code all running on one core (like in RTC). So that's possibly a second benefit: maybe FP is easier to debug and measure changes in?

>Downside is easy to understand, picture yourself as ucode developer,
>and you get task to 'add this magic feature in the ucode'.
>Implementing it in Trio seems trivial, add the code in ucode, rock on.
>On FP, you might have to go 'aww shit, I need to do this before PPE5
>but after PPE3 in the pipeline, but the instruction cost it adds isn't
>in the budget that I have in the PPE4, crap, now I need to shuffle
>around and figure out which PPE in line runs what function to keep the
>PPS we promise to customer.

That's why we have packet recirc <troll face>

>Let's look it from another vantage point, let's cook-up IPv6 header
>with crapton of EH, in Trio, PPE keeps churning it out, taking long
>time, but eventually it gets there or raises exception and gives up.
>Every other PPE in the box is fully available to perform work.
>Same thing in FP? You have HOLB, the PPEs in the line after thisPPE
>are not doing anything and can't do anything, until the PPE before in
>line is done.

This is exactly the benefit of FP vs NPU: less flexible, more throughput. The NPU has served us (the industry) well at the edge, and FP is serving us well in the core.

>Today Cisco and Juniper do 'proper' CoPP, that is, they do ingressACL
>before and after lookup, before is normally needed for ingressACL but
>after lookup ingressACL is needed for CoPP (we only know after lookup
>if it is control-plane packet). Nokia doesn't do this at all, and I
>bet they can't do it, because if they'd add it in the core where it
>needs to be in line, total PPS would go down. as there is no budget
>for additional ACL. Instead all control-plane packets from ingressFP
>are sent to control plane FP, and inshallah we don't congest the
>connection there or it.

Interesting.

Cheers,
James.
Re: 400G forwarding - how does it work? [ In reply to ]
>
>
> "Pipeline" in the context of networking chips is not a terribly
> well-defined term. In some chips, you'll have a pipeline that is built
> from very rigid hardware logic blocks -- the first block does exactly one
> part of the packet forwarding, then hands the packet (or just the header
> and metadata) to the second block, which does another portion of the
> forwarding. You build the pipeline out of as many blocks as you need to
> solve your particular networking problem, and voila!



"Pipeline", in the context of networking chips, is not a terribly
well-defined term! In some chips, you'll have an almost-literal pipeline
that is built from very rigid hardware logic blocks. The first block does
exactly one part of the packet forwarding, then hands the packet (or just
the header and metadata) to the second block, which does another portion of
the forwarding. You build the pipeline out of as many blocks as you need
to solve your particular networking problem, and voila!
The advantage here is that you can make things very fast and power
efficient, but they aren't all that flexible, and deity help you if you
ever need to do something in a different order than your pipeline!

You can also build a "pipeline" out of software functions - write up some
Python code (because everyone loves Python, right?) where function A calls
function B and so on. At some level, you've just built a pipeline out of
different software functions. This is going to be a lot slower (C code
will be faster but nowhere near as fast as dedicated hardware) but it's WAY
more flexible. You can more or less dynamically build your "pipeline" on a
packet-by-packet basis, depending on what features and packet data you're
dealing with.
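
For example, a throwaway sketch of that software "pipeline" idea (stage names and packet fields are invented purely for illustration, not any real forwarding code):

# A toy software "pipeline": each stage is a plain Python function that takes a
# packet (here just a dict of header fields/metadata) and passes it on.

def parse(pkt):
    pkt["l3"] = "ipv4" if pkt["raw"][0] >> 4 == 4 else "ipv6"
    return pkt

def lookup(pkt):
    fib = {"ipv4": "eth1", "ipv6": "eth2"}      # stand-in for a real FIB
    pkt["egress"] = fib[pkt["l3"]]
    return pkt

def rewrite(pkt):
    pkt["ttl"] = pkt.get("ttl", 64) - 1         # decrement TTL/hop limit
    return pkt

PIPELINE = [parse, lookup, rewrite]

def forward(pkt):
    for stage in PIPELINE:                      # function A calls B calls C...
        pkt = stage(pkt)
    return pkt

print(forward({"raw": bytes([0x45, 0x00]), "ttl": 64}))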

"Microcode" is really just a term we use for something like "really
optimized and limited instruction sets for packet forwarding". Just like
an x86 or an ARM has some finite set of instructions that it can execute,
so do current networking chips. The larger that instruction space is and
the more combinations of those instructions you can store, the more
flexible your code is. Of course, you can't make that part of the chip
bigger without making something else smaller, so there's another tradeoff.
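
As a completely invented toy to make that concrete - the only point being that flexibility is bounded by which operations exist and by how many of them fit in the program store:

# A made-up "ucode" with a tiny instruction set. Real NPU microcode is nothing
# like this in detail; instruction names and the store size are illustrative.

MAX_UCODE_LEN = 8   # pretend the chip can only hold 8 instructions per program

def run_ucode(program, pkt):
    assert len(program) <= MAX_UCODE_LEN, "program does not fit in ucode store"
    for op, arg in program:
        if op == "DEC_TTL":
            pkt["ttl"] -= 1
        elif op == "PUSH_LABEL":
            pkt.setdefault("labels", []).insert(0, arg)
        elif op == "SET_EGRESS":
            pkt["egress"] = arg
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return pkt

program = [("DEC_TTL", None), ("PUSH_LABEL", 16001), ("SET_EGRESS", "eth3")]
print(run_ucode(program, {"ttl": 64}))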

MOST current chips are really a hybrid/combination of these two extremes.
You have some set of fixed logic blocks that do exactly One Set Of Things,
and you have some other logic blocks that can be reconfigured to do A Few
Different Things. The degree to which the programmable stuff is
programmable is a major input to how many different features you can do on
the chip, and at what speeds. Sometimes you can use the same hardware
block to do multiple things on a packet if you're willing to sacrifice some
packet rate and/or bandwidth. The constant "law of physics" is that you
can always do a given function in less power/space/cost if you're willing
to optimize for that specific thing -- but you're sacrificing flexibility
to do it. The more flexibility ("programmability") you want to add to a
chip, the more logic and memory you need to add.

From a performance standpoint, on current "fast" chips, many (but certainly
not all) of the "pipelines" are designed to forward one packet per clock
cycle for "normal" use cases. (Of course we sneaky vendors get to decide
what is normal and what's not, but that's a separate issue...) So if I
have a chip that has one pipeline and it's clocked at 1.25GHz, that means
that it can forward 1.25 billion packets per second. Note that this does
NOT mean that I can forward a packet in "a one-point-two-five-billionth of
a second" -- but it does mean that every clock cycle I can start on a new
packet and finish another one. The length of the pipeline impacts the
latency of the chip, although this part of the latency is often a rounding
error compared to the number of times I have to read and write the packet
into different memories as it goes through the system.

So if this pipeline can do 1.25 billion PPS and I want to be able to
forward 10BPPS, I can build a chip that has 8 of these pipelines and get my
performance target that way. I could also build a "pipeline" that
processes multiple packets per clock, if I have one that does 2
packets/clock then I only need 4 of said pipelines... and so on and so
forth. The exact details of how the pipelines are constructed and how much
parallelism I built INSIDE a pipeline as opposed to replicating pipelines
is sort of Gooky Implementation Details, but it's a very very important
part of doing the chip level architecture as those sorts of decisions drive
lots of Other Important Decisions in the silicon design...
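
Quick back-of-envelope of the above (the clock rate and packets-per-clock come from the paragraphs above; the pipeline depth is invented just to show that latency and packet rate are independent knobs):

# Packets/sec from clock rate, pipelines, and packets-per-clock.
clock_hz          = 1.25e9   # 1.25 GHz, from the example above
packets_per_clock = 1        # one packet completes per clock per pipeline
pipeline_depth    = 200      # invented stage count, latency illustration only

pps_per_pipeline = clock_hz * packets_per_clock
print(f"one pipeline: {pps_per_pipeline/1e9:.2f} Bpps")

target_bpps = 10e9
print(f"pipelines needed for 10 Bpps: {target_bpps / pps_per_pipeline:.0f}")

# Latency contribution of the pipeline itself (ignores the memory reads and
# writes, which as noted above usually dominate):
print(f"pipeline latency: {pipeline_depth / clock_hz * 1e9:.0f} ns")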

--lj
Re: 400G forwarding - how does it work? [ In reply to ]
mandatory slide of laundry analogy for pipelining
https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html



On Tue, 26 Jul 2022 at 12:41, Lawrence Wobker <ljwobker@gmail.com> wrote:

>
>> "Pipeline" in the context of networking chips is not a terribly
>> well-defined term. In some chips, you'll have a pipeline that is built
>> from very rigid hardware logic blocks -- the first block does exactly one
>> part of the packet forwarding, then hands the packet (or just the header
>> and metadata) to the second block, which does another portion of the
>> forwarding. You build the pipeline out of as many blocks as you need to
>> solve your particular networking problem, and voila!
>
>
>
> "Pipeline", in the context of networking chips, is not a terribly
> well-defined term! In some chips, you'll have an almost-literal pipeline
> that is built from very rigid hardware logic blocks. The first block does
> exactly one part of the packet forwarding, then hands the packet (or just
> the header and metadata) to the second block, which does another portion of
> the forwarding. You build the pipeline out of as many blocks as you need
> to solve your particular networking problem, and voila!
> The advantages here is that you can make things very fast and power
> efficient, but they aren't all that flexible, and deity help you if you
> ever need to do something in a different order than your pipeline!
>
> You can also build a "pipeline" out of software functions - write up some
> Python code (because everyone loves Python, right?) where function A calls
> function B and so on. At some level, you've just build a pipeline out of
> different software functions. This is going to be a lot slower (C code
> will be faster but nowhere near as fast as dedicated hardware) but it's WAY
> more flexible. You can more or less dynamically build your "pipeline" on a
> packet-by-packet basis, depending on what features and packet data you're
> dealing with.
>
> "Microcode" is really just a term we use for something like "really
> optimized and limited instruction sets for packet forwarding". Just like
> an x86 or an ARM has some finite set of instructions that it can execute,
> so do current networking chips. The larger that instruction space is and
> the more combinations of those instructions you can store, the more
> flexible your code is. Of course, you can't make that part of the chip
> bigger without making something else smaller, so there's another tradeoff.
>
> MOST current chips are really a hybrid/combination of these two extremes.
> You have some set of fixed logic blocks that do exactly One Set Of Things,
> and you have some other logic blocks that can be reconfigured to do A Few
> Different Things. The degree to which the programmable stuff is
> programmable is a major input to how many different features you can do on
> the chip, and at what speeds. Sometimes you can use the same hardware
> block to do multiple things on a packet if you're willing to sacrifice some
> packet rate and/or bandwidth. The constant "law of physics" is that you
> can always do a given function in less power/space/cost if you're willing
> to optimize for that specific thing -- but you're sacrificing flexibility
> to do it. The more flexibility ("programmability") you want to add to a
> chip, the more logic and memory you need to add.
>
> From a performance standpoint, on current "fast" chips, many (but
> certainly not all) of the "pipelines" are designed to forward one packet
> per clock cycle for "normal" use cases. (Of course we sneaky vendors get
> to decide what is normal and what's not, but that's a separate issue...)
> So if I have a chip that has one pipeline and it's clocked at 1.25Ghz, that
> means that it can forward 1.25 billion packets per second. Note that this
> does NOT mean that I can forward a packet in "a
> one-point-two-five-billionth of a second" -- but it does mean that every
> clock cycle I can start on a new packet and finish another one. The length
> of the pipeline impacts the latency of the chip, although this part of the
> latency is often a rounding error compared to the number of times I have to
> read and write the packet into different memories as it goes through the
> system.
>
> So if this pipeline can do 1.25 billion PPS and I want to be able to
> forward 10BPPS, I can build a chip that has 8 of these pipelines and get my
> performance target that way. I could also build a "pipeline" that
> processes multiple packets per clock, if I have one that does 2
> packets/clock then I only need 4 of said pipelines... and so on and so
> forth. The exact details of how the pipelines are constructed and how much
> parallelism I built INSIDE a pipeline as opposed to replicating pipelines
> is sort of Gooky Implementation Details, but it's a very very important
> part of doing the chip level architecture as those sorts of decisions drive
> lots of Other Important Decisions in the silicon design...
>
> --lj
>
Re: 400G forwarding - how does it work? [ In reply to ]
As Lincoln said - all of us directly working with BCM/other silicon vendors have signed numerous NDAs.
However if you ask a well crafted question - there’s always a way to talk about it ;-)

In general, if we look at the whole spectrum, on one side there’re massively parallelized “many core” RTC ASICs, such as Trio, Lightspeed, and similar (as the last gasp of the Redback/Ericsson venture, we built a 1400-HW-thread ASIC (Spider)).
On the other side of the spectrum are fixed-pipeline ASICs, from BCM Tomahawk at its extreme (max speed/radix, min features), moving through BCM Trident, Innovium, Barefoot (quite a different animal wrt programmability), etc. - usually with a shallow on-chip buffer only (100-200M).

In between we have so-called programmable-pipeline silicon; BCM DNX and Juniper Express are in this category. These are usually a combo of OCB + off-chip memory (most often HBM), (2-6G), and usually have line-rate/high-scale security/overlay encap/decap capabilities. They usually have highly optimized RTC blocks within a pipeline (RTC within a macro). The way and speed of access to DBs and memories is evolving with each generation, and the number/speed of non-networking cores (usually ARM) keeps growing - OAM, INT, and local optimizations are the primary users of them.

Cheers,
Jeff

> On Jul 25, 2022, at 15:59, Lincoln Dale <ltd@interlink.com.au> wrote:
>
> ?
>> On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+nanog@gmail.com> wrote:
>
>> On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwobker@gmail.com> wrote:
>> > This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput.
>> ...
>> > I work on/with a chip that can forwarding about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picoseconds… we get back to pipelining and parallelism.
>>
>> What level of parallelism is required to forward 10Bpps? Or 2Bpps like
>> my J2 example :)
>
> I suspect many folks know the exact answer for J2, but it's likely under NDA to talk about said specific answer for a given thing.
>
> Without being platform or device-specific, the core clock rate of many network devices is often in a "goldilocks" zone of (today) 1 to 1.5GHz with a goal of 1 packet forwarded 'per-clock'. As LJ described the pipeline that doesn't mean a latency of 1 clock ingress-to-egress but rather that every clock there is a forwarding decision from one 'pipeline', and the MPPS/BPPS packet rate is achieved by having enough pipelines in parallel to achieve that.
> The number here is often "1" or "0.5" so you can work the number backwards. (e.g. it emits a packet every clock, or every 2nd clock).
>
> It's possible to build an ASIC/NPU to run a faster clock rate, but gets back to what I'm hand-waving describing as "goldilocks". Look up power vs frequency and you'll see its non-linear.
> Just as CPUs can scale by adding more cores (vs increasing frequency), ~same holds true on network silicon, and you can go wider, multiple pipelines. But its not 10K parallel slices, there's some parallel parts, but there are multiple 'stages' on each doing different things.
>
> Using your CPU comparison, there are some analogies here that do work:
> - you have multiple cpu cores that can do things in parallel -- analogous to pipelines
> - they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some DRAM or LLC) -- maybe some lookup engines, or centralized buffer/memory
> - most modern CPUs are out-of-order execution, where under-the-covers, a cache-miss or DRAM fetch has a disproportionate hit on performance, so its hidden away from you as much as possible by speculative execution out-of-order
> -- no direct analogy to this one - it's unlikely most forwarding pipelines do speculative execution like a general purpose CPU does - but they definitely do 'other work' while waiting for a lookup to happen
>
> A common-garden x86 is unlikely to achieve such a rate for a few different reasons:
> - packets-in or packets-out go via DRAM then you need sufficient DRAM (page opens/sec, DRAM bandwidth) to sustain at least one write and one read per packet. Look closer at DRAM and see its speed, Pay attention to page opens/sec, and what that consumes.
> - one 'trick' is to not DMA packets to DRAM but instead have it go into SRAM of some form - e.g. Intel DDIO, ARM Cache Stashing, which at least potentially saves you that DRAM write+read per packet
> - ... but then do e.g. a LPM lookup, and best case that is back to a memory access/packet. Maybe it's in L1/L2/L3 cache, but likely at large table sizes it isn't.
> - ... do more things to the packet (urpf lookups, counters) and it's yet more lookups.
>
> Software can achieve high rates, but note that a typical ASIC/NPU does on the order of >100 separate lookups per packet, and 100 counter updates per packet.
> Just as forwarding in a ASIC or NPU is a series of tradeoffs, forwarding in software on generic CPUs is also a series of tradeoffs.
>
>
> cheers,
>
> lincoln.
>
Re: 400G forwarding - how does it work? [ In reply to ]
On Tue, 26 Jul 2022 at 21:28, <jwbensley+nanog@gmail.com> wrote:

> >No you are right, FP has much much more PPEs than Trio.
>
> Can you give any examples?

Nokia FP is like >1k, Juniper Trio is closer to 100 (earlier Trio LUs
had far fewer). I could give exact numbers for EA and YT if needed;
they are visible in the CLI, and the end user can even profile them to
see what ucode function they are spending their time on.

--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
On Tue, 26 Jul 2022 at 23:15, Jeff Tantsura <jefftant.ietf@gmail.com> wrote:

> In general, if we look at the whole spectrum, on one side there’re massively parallelized “many core” RTC ASICs, such as Trio, Lightspeed, and similar (as the last gasp of Redback/Ericsson venture - we have built 1400 HW threads ASIC (Spider).
> On another side of the spectrum - fixed pipeline ASICs, from BCM Tomahawk at its extreme (max speed/radix - min features) moving with BCM Trident, Innovium, Barefoot(quite different animal wrt programmability), etc - usually shallow on chip buffer only (100-200M).
>
> In between we have got so called programmable pipeline silicon, BCM DNX and Juniper Express are in this category, usually a combo of OCB + off chip memory (most often HBM), (2-6G), usually have line-rate/high scale security/overlay encap/decap capabilities. Usually have highly optimized RTC blocks within a pipeline (RTC within macro). The way and speed to access DBs, memories is evolving with each generation, number/speed of non networking cores(usually ARM) keeps growing - OAM, INT, local optimizations are primary users of it.

What do we call Nokia FP? Where you have a pipeline of identical cores
doing different things, and the packet has to hit each core in line in
order? How do we contrast this to NPU where a given packet hits
exactly one core?

I think ASIC, NPU, pipeline, RTC are all quite ambiguous. When we say
pipeline, usually people assume a purpose build unique HW blocks
packet travels through (like DNX, Express) and not fully flexible
identical cores pipeline like FP.

So I guess I would consider a 'true pipeline' to be a pipeline of unique
HW blocks, and a 'true NPU' to be one where a given packet hits exactly
1 core, and anything else as more or less a hybrid.

I expect that once you get into the details of implementation, all of
these generalisations lose communicative power.

--
++ytti
Re: 400G forwarding - how does it work? [ In reply to ]
James Bensley wrote:

> The BCM16K documentation suggests that it uses TCAM for exact
> matching (e.g.,for ACLs) in something called the "Database Array"
> (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in
> something called the "User Data Array" (with 16M 32b entries?).

Which documentation?

According to:

https://docs.broadcom.com/docs/16000-DS1-PUB

figure 1 and related explanations:

Database records 40b: 2048k/1024k.
Table width configurable as 80/160/320/480/640 bits.
User Data Array for associated data, width configurable as
32/64/128/256 bits.

means that the header extracted by the 88690 is analyzed by the 16K,
finally resulting in 40b of information (a lot shorter than IPv6
addresses, but still perhaps enough for an IPv6 backbone to identify
sites) via a "database" lookup - obviously in CAM, because 40b is
painful for SRAM - which is then converted to "32/64/128/256 bits of data".

> 1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which
> is within the access time of TCAM and SRAM

As high-speed TCAM and SRAM should be pipelined, the cycle time, which
is what matters, is shorter than the access time.
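
To put toy numbers on that (the cycle and access times below are invented, not from any data sheet): with a pipelined memory a new lookup can be issued every cycle, even though each individual result takes several cycles to come back.

# Throughput of a pipelined lookup memory: issue rate is set by cycle time,
# not by the full access latency of any single lookup.
cycle_time_ns  = 1.0    # one new lookup issued per ns (1 GHz), invented
access_time_ns = 4.0    # each result returns 4 ns after issue, invented

lookups_per_sec = 1e9 / cycle_time_ns
print(f"sustained rate: {lookups_per_sec/1e6:.0f} M lookups/sec")
print(f"latency of any single lookup: {access_time_ns} ns")

# Per-packet budget from earlier in the thread: ~6.08 ns per packet at
# 164 Mpps still fits, because lookups overlap rather than run back-to-back.
packet_budget_ns = 1e9 / 164_473_684
print(f"per-packet budget: {packet_budget_ns:.2f} ns")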

Finally, it should be pointed out that most, if not all, performance
figures such as MIPS and Flops are merely guaranteed not to be exceeded.

In this case, if deep packet inspection of lengthy headers is required
for some complicated routing schemes, or to satisfy NSA requirements,
the communication speed between the 88690 and the 16K will be the
limiting factor for PPS, resulting in a lot less than the maximum
possible PPS.

Masataka Ohta
RE: 400G forwarding - how does it work? [ In reply to ]
The Broadcom KBP -- often called an "external TCAM" is really closer to a completely separate NPU than just an external TCAM. "Back in the day" we used external TCAMs to store forwarding state (FIB tables, ACL tables, whatever) on devices that were pretty much just a bunch of TCAM memory and an interface for the "main" NPU to ask for a lookup. Today the modern KBP devices have WAY more functionality, they have lots of different databases and tables available, which can be sliced and diced into different widths and depths. They can store lots of different kinds of state, from counters to LPM prefixes and ACLs. At risk of correcting Ohta-san, note that most ACLs are implemented using TCAMs with wildcard/masking support, as opposed to an exact match lookup. Exact match lookups are generally used for things that do not require masking or wildcard bits: MAC addresses and MPLS label values are the canonical examples here.

The SRAM memories used in fast networking chips are almost always built such that they provide one lookup per clock, although hardware designers often use multiple banks of these to increase the number of *effective* lookups per clock. TCAMs are also generally built such that they provide one lookup/result per clock, but again you can stack up multiple devices to increase this.

Many hardware designs also allow for more flexibility in how the various memories are utilized by the software -- almost everyone is familiar with the idea of "I can have a million entries of X bits, or half a million entries of 2*X bits". If the hardware and software complexity was free, we'd design memories that could be arbitrarily chopped into exactly the sizes we need, but that complexity is Absolutely Not Free.... so we end up picking a few discrete sizes and the software/forwarding code has to figure out how to use those bits efficiently. And you can bet your life that as soon as you have a memory that can function using either 80b or 160b entries, you will immediately come across a use case that really really needs to use entries of 81b.
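
A trivial way to picture that width/depth carving (the pool size and widths below are hypothetical, not any particular device):

# A fixed pool of memory bits carved into entries of a few discrete widths.
# Capacity halves as width doubles, and an 81-bit entry still burns a whole
# 160-bit slot.
total_bits = 1_000_000 * 80          # pretend pool: 1M entries of 80 bits

for width in (80, 160, 320):
    print(f"{width:3}b entries: {total_bits // width:,}")

needed_width = 81                    # the awkward real-world use case
slot = next(w for w in (80, 160, 320) if w >= needed_width)
print(f"an {needed_width}b entry consumes a {slot}b slot "
      f"({slot - needed_width} bits wasted per entry)")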

FYI: There's nothing particularly magical about 40b memory widths. When building these chips you can (more or less) pick whatever width of SRAM you want to build, and the memory libraries that you use spit out the corresponding physical design.

Ohta-san correctly mentions that a critical part of the performance analysis is how fast the different parts of the pipeline can talk to each other. Note that this concept applies whether we're talking about the connection between very small blocks within the ASIC/NPU, or the interface between the NPU and an external KBP/TCAM, or for that matter between multiple NPUs/fabric chips within a system. At some point you'll always be constrained by whatever the slowest link in the pipeline is, so balancing all that stuff out is Yet One More Thing for the system designer to deal with.



--lj

-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail.com@nanog.org> On Behalf Of Masataka Ohta
Sent: Wednesday, July 27, 2022 9:09 AM
To: nanog@nanog.org
Subject: Re: 400G forwarding - how does it work?

James Bensley wrote:

> The BCM16K documentation suggests that it uses TCAM for exact matching
> (e.g.,for ACLs) in something called the "Database Array"
> (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in
> something called the "User Data Array" (with 16M 32b entries?).

Which documentation?

According to:

https://docs.broadcom.com/docs/16000-DS1-PUB

figure 1 and related explanations:

Database records 40b: 2048k/1024k.
Table width configurable as 80/160/320/480/640 bits.
User Data Array for associated data, width configurable as
32/64/128/256 bits.

means that header extracted by 88690 is analyzed by 16K finally resulting in 40b (a lot shorter than IPv6 addresses, still may be enough for IPv6 backbone to identify sites) information by "database"
lookup, which is, obviously by CAM because 40b is painful for SRAM, converted to "32/64/128/256 bits data".

> 1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which
> is within the access time of TCAM and SRAM

As high speed TCAM and SRAM should be pipelined, cycle time, which matters, is shorter than access time.

Finally, it should be pointed out that most, if not all, performance figures such as MIPS and Flops are merely guaranteed not to be exceeded.

In this case, if so deep packet inspections by lengthy header for some complicated routing schemes or to satisfy NSA requirements are required, communication speed between 88690 and 16K will be the limitation factor for PPS resulting in a lot less than maximum possible PPS.

Masataka Ohta
Re: 400G forwarding - how does it work? [ In reply to ]
This convo is giving me some hope that the sophisticated FQ and AQM
algorithms I favor can be made to run in more hardware at high rates,
but most of the work I'm aware of has targeted Tofino and P4.

The only thing I am aware of shipping is AFD in some cisco hw. Anyone
using that?
Re: 400G forwarding - how does it work? [ In reply to ]
FYI

https://community.juniper.net/blogs/nicolas-fevrier/2022/07/27/voq-and-dnx-pipeline

Cheers,
Jeff

> On Jul 25, 2022, at 15:59, Lincoln Dale <ltd@interlink.com.au> wrote:
>
> ?
>> On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+nanog@gmail.com> wrote:
>
>> On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwobker@gmail.com> wrote:
>> > This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput.
>> ...
>> > I work on/with a chip that can forwarding about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picoseconds… we get back to pipelining and parallelism.
>>
>> What level of parallelism is required to forward 10Bpps? Or 2Bpps like
>> my J2 example :)
>
> I suspect many folks know the exact answer for J2, but it's likely under NDA to talk about said specific answer for a given thing.
>
> Without being platform or device-specific, the core clock rate of many network devices is often in a "goldilocks" zone of (today) 1 to 1.5GHz with a goal of 1 packet forwarded 'per-clock'. As LJ described the pipeline that doesn't mean a latency of 1 clock ingress-to-egress but rather that every clock there is a forwarding decision from one 'pipeline', and the MPPS/BPPS packet rate is achieved by having enough pipelines in parallel to achieve that.
> The number here is often "1" or "0.5" so you can work the number backwards. (e.g. it emits a packet every clock, or every 2nd clock).
>
> It's possible to build an ASIC/NPU to run a faster clock rate, but gets back to what I'm hand-waving describing as "goldilocks". Look up power vs frequency and you'll see its non-linear.
> Just as CPUs can scale by adding more cores (vs increasing frequency), ~same holds true on network silicon, and you can go wider, multiple pipelines. But its not 10K parallel slices, there's some parallel parts, but there are multiple 'stages' on each doing different things.
>
> Using your CPU comparison, there are some analogies here that do work:
> - you have multiple cpu cores that can do things in parallel -- analogous to pipelines
> - they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some DRAM or LLC) -- maybe some lookup engines, or centralized buffer/memory
> - most modern CPUs are out-of-order execution, where under-the-covers, a cache-miss or DRAM fetch has a disproportionate hit on performance, so its hidden away from you as much as possible by speculative execution out-of-order
> -- no direct analogy to this one - it's unlikely most forwarding pipelines do speculative execution like a general purpose CPU does - but they definitely do 'other work' while waiting for a lookup to happen
>
> A common-garden x86 is unlikely to achieve such a rate for a few different reasons:
> - packets-in or packets-out go via DRAM then you need sufficient DRAM (page opens/sec, DRAM bandwidth) to sustain at least one write and one read per packet. Look closer at DRAM and see its speed, Pay attention to page opens/sec, and what that consumes.
> - one 'trick' is to not DMA packets to DRAM but instead have it go into SRAM of some form - e.g. Intel DDIO, ARM Cache Stashing, which at least potentially saves you that DRAM write+read per packet
> - ... but then do e.g. a LPM lookup, and best case that is back to a memory access/packet. Maybe it's in L1/L2/L3 cache, but likely at large table sizes it isn't.
> - ... do more things to the packet (urpf lookups, counters) and it's yet more lookups.
>
> Software can achieve high rates, but note that a typical ASIC/NPU does on the order of >100 separate lookups per packet, and 100 counter updates per packet.
> Just as forwarding in a ASIC or NPU is a series of tradeoffs, forwarding in software on generic CPUs is also a series of tradeoffs.
>
>
> cheers,
>
> lincoln.
>
Re: 400G forwarding - how does it work? [ In reply to ]
On Wed, 27 Jul 2022 at 15:11, Masataka Ohta
<mohta@necom830.hpcl.titech.ac.jp> wrote:
>
> James Bensley wrote:
>
> > The BCM16K documentation suggests that it uses TCAM for exact
> > matching (e.g.,for ACLs) in something called the "Database Array"
> > (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in
> > something called the "User Data Array" (with 16M 32b entries?).
>
> Which documentation?
>
> According to:
>
> https://docs.broadcom.com/docs/16000-DS1-PUB
>
> figure 1 and related explanations:
>
> Database records 40b: 2048k/1024k.
> Table width configurable as 80/160/320/480/640 bits.
> User Data Array for associated data, width configurable as
> 32/64/128/256 bits.
>
> means that header extracted by 88690 is analyzed by 16K finally
> resulting in 40b (a lot shorter than IPv6 addresses, still may be
> enough for IPv6 backbone to identify sites) information by "database"
> lookup, which is, obviously by CAM because 40b is painful for
> SRAM, converted to "32/64/128/256 bits data".

Hi Masataka,

Yes I had read that data sheet. If you have 2M 40b entries in CAM, you
could also have 1M 80b entries (or a mixture); the 40b CAM blocks can
be chained together to store IPv4/IPv6/MPLS/whatever entries.
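
Rough numbers for that chaining, as a simplified model (real KBP record formats and per-record overheads aren't modelled here):

# Simplified view of carving 40b database blocks into wider keys.
total_40b_records = 2_048_000        # "2M" 40b records from the data sheet

def capacity(key_width_bits):
    blocks_per_entry = -(-key_width_bits // 40)   # ceiling division onto 40b blocks
    return total_40b_records // blocks_per_entry

for width in (40, 80, 160):          # e.g. labels, IPv4-ish keys, IPv6-ish keys
    print(f"{width:3}b keys: {capacity(width):,} entries")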

Cheers,
James.
