Re: MX304 Port Layout
On 7/2/23 15:19, Saku Ytti wrote:

> Right as is MX304.
>
> I don't think this is 'my definition', everything was centralised
> originally, until Cisco7500 came out, which then had distributed
> forwarding capabilities.
>
> Now, does centralisation truly mean a BOM benefit to vendors? Probably
> not, but it may allow them to address a lower-margin market which has
> lower per-port performance needs, without cannibalising the
> higher-margin market.

Technically, do we not think that an oversubscribed Juniper box with a
single Trio 6 chip with no fabric is feasible? And is it not being built
because Juniper don't want to cannibalize their other distributed
compact boxes?

The MX204, for example, is a single Trio 3 chip that is oversubscribed
by an extra 240Gbps. So we know they can do it. The issue with the MX204
is that most customers will run out of ports before they run out of
bandwidth.
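
For what it's worth, the arithmetic behind "oversubscribed by an extra
240Gbps" is just front-panel capacity over chip capacity. A rough
back-of-the-envelope sketch, treating the single-chip capacity figure
as an assumption and taking the 240Gbps straight from the paragraph
above:

    # Oversubscription as discussed here: front-panel capacity sold
    # beyond what the single forwarding chip can push. Illustrative only.
    def oversubscription_ratio(front_panel_gbps: float, chip_gbps: float) -> float:
        """Gbps of front panel sold per Gbps of chip forwarding capacity."""
        return front_panel_gbps / chip_gbps

    chip_gbps = 400.0                      # assumed single-chip capacity
    front_panel_gbps = chip_gbps + 240.0   # the "extra 240Gbps" of ports
    print(f"{oversubscription_ratio(front_panel_gbps, chip_gbps):.2f}:1")
    # -> 1.60:1 under these assumptions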

I don't think the issue is vendors using Broadcom to oversubscribe a
high-capacity chip; it's that vendors with in-house silicon won't do
the same with their own.


Mark.
Re: MX304 Port Layout
On Sun, 2 Jul 2023 at 17:15, Mark Tinka <mark@tinka.africa> wrote:

> Technically, do we not think that an oversubscribed Juniper box with a
> single Trio 6 chip with no fabric is feasible? And is it not being built
> because Juniper don't want to cannibalize their other distributed
> compact boxes?
>
> The MX204, for example, is a single Trio 3 chip that is oversubscribed
> by an extra 240Gbps. So we know they can do it. The issue with the MX204
> is that most customers will run out of ports before they run out of
> bandwidth.

Not disagreeing here, but how do we define oversubscribed? Are all
boxes oversubscribed which can't do a) 100% at max packet size, b)
100% at min packet size, and c) 100% of packets to the delay buffer?
I think that would be quite a reasonable definition, but as far as I
know, no current device of non-modest scale satisfies all three;
almost all of them only satisfy a).

Let's consider first-gen Trio serdes:
1) 2/4 goes to fabric (btree replication)
2) 1/4 goes to delay buffer
3) 1/4 goes to WAN ports
(and actually ~0.2 additionally goes to the lookup engine)

So you're selling less than 1/4 of the serdes you ship; more than 3/4
are 'overhead'. Compare that to, say, Silicon One, which is partially
buffered: they're selling almost 1/2 of the serdes they ship. You
could in theory put ports on all of these serdes in bps terms, but not
in pps terms, at least not with off-chip memory.
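
To make that ratio concrete, here is a small sketch of the
sellable-serdes fraction; the Trio split comes from the list above,
while the Silicon One split is a rough assumption matching the
"almost 1/2" figure, so treat both as illustrative rather than
datasheet values:

    # Fraction of shipped serdes that face revenue (WAN) ports.
    def sellable_fraction(wan: float, overhead: float) -> float:
        """wan and overhead in arbitrary but consistent serdes units."""
        return wan / (wan + overhead)

    # First-gen Trio: 2 units fabric + 1 delay buffer + ~0.2 lookup vs 1 WAN.
    trio1 = sellable_fraction(wan=1.0, overhead=2.0 + 1.0 + 0.2)
    # Silicon One (partially buffered): roughly 1 unit overhead vs 1 WAN.
    silicon_one = sellable_fraction(wan=1.0, overhead=1.0)

    print(f"Trio 1:      {trio1:.0%} of serdes sold as WAN ports")       # ~24%
    print(f"Silicon One: {silicon_one:.0%} of serdes sold as WAN ports")  # 50%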

And in the pizza box case, you could sell those fabric ports, as there
is no fabric, so a given NPU always has ~2x the bps in pizza box
format (but usually no more pps). In the MX80/MX104, Juniper did just
this: they sell 80G of WAN ports, when in linecard mode the same NPU
is only a 40G WAN port device. I don't consider that oversubscribed,
even though the minimum packet size for line rate went up, because the
lookup capacity didn't increase.
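
A quick sketch of why that isn't oversubscription in the pps sense:
doubling the WAN bps against a fixed lookup budget only raises the
minimum packet size at which the box still does line rate. The lookup
figure below is purely illustrative, not an MX80/MX104 datasheet
number:

    # Minimum packet size for line rate given a fixed lookup budget.
    def min_linerate_packet_bytes(wan_gbps: float, lookup_mpps: float) -> float:
        return (wan_gbps * 1e9) / (lookup_mpps * 1e6) / 8

    for wan_gbps in (40, 80):   # linecard mode vs. pizza box mode
        size = min_linerate_packet_bytes(wan_gbps, lookup_mpps=55.0)
        print(f"{wan_gbps}G WAN: line rate above ~{size:.0f}-byte packets")
    # 40G -> ~91 B, 80G -> ~182 B under the assumed 55 Mpps budget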

Curiously, AMZN told NANOG that when the design is fully scaled to
100T, their ratio is 1/4: 400T of bought ports for 100T of useful
ports. It's unclear how long 100T was going to scale, but obviously
they wouldn't launch an architecture that needs to be redone next
year, so when they decided on the 100T cap for the scale, they didn't
have a 100T need yet. This design was with 112Gx128 chips, and the
boxes were single-chip, so all serdes connect to ports, no fabrics,
i.e. a true pizzabox.

I found this very interesting, because the 100T design was, I think,
3 racks? And last year 50T ASICs shipped; next year we'd likely get
100T ASICs (224Gx512? or 112Gx1024?). So even hyperscalers are growing
slower than silicon, and can basically put their DC in a chip, greatly
reducing cost (both CAPEX and OPEX), as there is no need to waste 3/4
of the investment on overhead.
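
The same overhead arithmetic, sketched with the figures quoted from
the talk (400T bought, 100T useful, single 112Gx128 chips); the
per-box number is just serdes count times lane rate:

    # Bought vs. revenue capacity in the fully scaled pizzabox design.
    bought_tbps, useful_tbps = 400, 100
    print(f"revenue ports : {useful_tbps / bought_tbps:.0%}")                  # 25%
    print(f"overhead ports: {(bought_tbps - useful_tbps) / bought_tbps:.0%}")  # 75%

    # Per-box front panel, assuming single 112G x 128 serdes chips as stated.
    per_box_tbps = 112 * 128 / 1000
    print(f"per box: ~{per_box_tbps:.1f} Tbps of serdes")   # ~14.3 Tbps
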
The scale also surprised me, even though perhaps it should not have:
they quoted 1M+ network devices, and considering they quote 20M+
Nitro systems shipped, that's fewer than 20 revenue-generating compute
nodes per network device. Depending on the refresh cycle, this means
Amazon is buying 15-30k network devices per month, which I expect is
significantly more than Cisco, Juniper and Nokia ship combined to SP
infra, so no wonder SPs get little love.
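
The 15-30k/month figure falls straight out of the fleet size and an
assumed refresh cycle; a quick sanity check, where the 1M device count
comes from the talk and the refresh windows are an assumption:

    # Rough replacement rate for a ~1M-device fleet.
    fleet_devices = 1_000_000

    for refresh_years in (3, 4, 5):
        per_month = fleet_devices / (refresh_years * 12)
        print(f"{refresh_years}-year refresh: ~{per_month:,.0f} devices/month")
    # 3y -> ~27,778   4y -> ~20,833   5y -> ~16,667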

--
++ytti
Re: MX304 Port Layout
On 7/2/23 18:04, Saku Ytti wrote:

> Not disagreeing here, but how do we define oversubscribed? Are all
> boxes oversubscribed which can't do a) 100% at max packet size, b)
> 100% at min packet size, and c) 100% of packets to the delay buffer?
> I think that would be quite a reasonable definition, but as far as I
> know, no current device of non-modest scale satisfies all three;
> almost all of them only satisfy a).

Well, the typical operator will use "oversubscribed" in the context of
number of ports vs. chip capacity. However, it is not unwise to consider
packet handling as a function of oversubscription too.


> Let's consider first-gen Trio serdes:
> 1) 2/4 goes to fabric (btree replication)
> 2) 1/4 goes to delay buffer
> 3) 1/4 goes to WAN ports
> (and actually ~0.2 additionally goes to the lookup engine)
>
> So you're selling less than 1/4 of the serdes you ship; more than 3/4
> are 'overhead'. Compare that to, say, Silicon One, which is partially
> buffered: they're selling almost 1/2 of the serdes they ship. You
> could in theory put ports on all of these serdes in bps terms, but not
> in pps terms, at least not with off-chip memory.

To be fair, Silicon One is Cisco's first iteration of that chip, so
comparing it to Trio 1 isn't quite fair :-).

But I take your point.


> And in the pizza box case, you could sell those fabric ports, as there
> is no fabric, so a given NPU always has ~2x the bps in pizza box
> format (but usually no more pps). In the MX80/MX104, Juniper did just
> this: they sell 80G of WAN ports, when in linecard mode the same NPU
> is only a 40G WAN port device. I don't consider that oversubscribed,
> even though the minimum packet size for line rate went up, because the
> lookup capacity didn't increase.

Makes sense, but what that means is that you are more concerned with
pps, while someone else could be more concerned with bps. I guess it
depends on whether your operation is more pps-heavy or more bps-heavy
at the average packet size.


> Curiously, AMZN told NANOG that when the design is fully scaled to
> 100T, their ratio is 1/4: 400T of bought ports for 100T of useful
> ports. It's unclear how long 100T was going to scale, but obviously
> they wouldn't launch an architecture that needs to be redone next
> year, so when they decided on the 100T cap for the scale, they didn't
> have a 100T need yet. This design was with 112Gx128 chips, and the
> boxes were single-chip, so all serdes connect to ports, no fabrics,
> i.e. a true pizzabox.
>
> I found this very interesting, because the 100T design was, I think,
> 3 racks? And last year 50T ASICs shipped; next year we'd likely get
> 100T ASICs (224Gx512? or 112Gx1024?). So even hyperscalers are growing
> slower than silicon, and can basically put their DC in a chip, greatly
> reducing cost (both CAPEX and OPEX), as there is no need to waste 3/4
> of the investment on overhead.

Yes, I watched this NANOG session and was also quite surprised when they
mentioned that they only plan for 25% usage of the deployed capacity.
Are they giving themselves room to peak before they move to another chip
(considering that they are likely in a never-ending installation/upgrade
cycle), or trying to maintain line-rate across a vast number of packet
sizes? Or both?


> The scale also surprised me, even though perhaps it should not have:
> they quoted 1M+ network devices, and considering they quote 20M+
> Nitro systems shipped, that's fewer than 20 revenue-generating compute
> nodes per network device. Depending on the refresh cycle, this means
> Amazon is buying 15-30k network devices per month, which I expect is
> significantly more than Cisco, Juniper and Nokia ship combined to SP
> infra, so no wonder SPs get little love.

Well, the lack of love for service providers has been going on for
some time now. It largely started with the optical vendors around
2015, when coherent gave us 400Gbps waves over medium-haul distances
and the content folk began deploying DCIs. Around the same time,
submarine systems began deploying uncompensated cables, and with most
of them being funded by the content folk, optical vendors focused 90%
of their attention that way, ignoring the service providers.

The content folk are largely IETF people, so they had options around
what they could do to optimize routing and switching (including
building their own gear). But I see that there is some interest in
what they can do with chips from Cisco, Juniper and Nokia if they have
arrangements where those are opened up to them for self-development;
not to mention Broadcom. That means we - as network operators - are
likely to see even less love from routing/switching vendors going
forward.

But with AWS deploying that many nodes, even with tooling, it must be a
mission staying on top of software (and hardware) upgrades.

Mark.
Re: MX304 Port Layout
On Tue, 4 Jul 2023 at 08:34, Mark Tinka <mark@tinka.africa> wrote:

> Yes, I watched this NANOG session and was also quite surprised when they
> mentioned that they only plan for 25% usage of the deployed capacity.
> Are they giving themselves room to peak before they move to another chip
> (considering that they are likely in a never-ending installation/upgrade
> cycle), or trying to maintain line-rate across a vast number of packet
> sizes? Or both?

You must have misunderstood. When they fully scale the current design,
it offers 100T of capacity, but they've bought 400T of ports. 3/4 of
the ports are overhead needed to build the design, to connect the
pizzaboxes together. All ports are used, but only 1/4 are revenue
ports.

--
++ytti
Re: MX304 Port Layout
On 7/4/23 09:11, Saku Ytti wrote:

> You must have misunderstood. When they fully scale the current design,
> it offers 100T of capacity, but they've bought 400T of ports. 3/4 of
> the ports are overhead needed to build the design, to connect the
> pizzaboxes together. All ports are used, but only 1/4 are revenue
> ports.

Thanks, makes sense.

This is one of the reasons I prefer to use Ethernet switches to
interconnect devices in large data centre deployments.

Connecting stuff directly into the core routers or directly together
eats up a bunch of ports, without necessarily using all the available
capacity.

But to be fair, at the scale AWS run, I'm not exactly sure how I'd do
things.

Mark.
Re: MX304 Port Layout
On Wed, 5 Jul 2023 at 04:45, Mark Tinka <mark@tinka.africa> wrote:

> This is one of the reasons I prefer to use Ethernet switches to
> interconnect devices in large data centre deployments.
>
> Connecting stuff directly into the core routers or directly together
> eats up a bunch of ports, without necessarily using all the available
> capacity.
>
> But to be fair, at the scale AWS run, I'm not exactly sure how I'd do
> things.

I'm sure it's perfectly reasonable, with some upsides and some
downsides compared to hiding the overhead ports inside a chassis
fabric instead of exposing them on the front plate.

--
++ytti
