Mailing List Archive

OSPF reference-bandwidth 1T
In the process of adding 100G links, LAGs with multiple 100G members, and
preparing for 400G, I'm looking for feedback on setting the OSPF
reference-bandwidth to 1T.

Please let me know if you have had any issues with this, or if it has been
a smooth transition for you.

Thanks in advance!
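The arithmetic behind the question is simple; a sketch of how OSPF derives interface cost from a reference bandwidth (illustrative only: vendors differ in exact rounding, this version truncates and clamps to 1):

```python
# Sketch of OSPF auto-cost: cost = reference bandwidth / interface
# bandwidth, truncated, with a minimum of 1. Not vendor code.

def ospf_cost(ref_bw_bps: int, link_bw_bps: int) -> int:
    """Classic OSPF auto-cost derivation."""
    return max(1, ref_bw_bps // link_bw_bps)

G = 10**9
for ref in (100 * G, 1000 * G):          # 100G vs 1T reference
    costs = {bw: ospf_cost(ref, bw * G) for bw in (1, 10, 40, 100, 400)}
    print(ref // G, costs)

# With a 100G reference, 100G, 400G and any faster LAG all collapse to
# cost 1; with a 1T reference they stay distinct (10 vs 2), which is the
# motivation for raising the reference before deploying 100G/400G.
```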
_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
Re: OSPF reference-bandwidth 1T
Hey,

> In the process of adding 100G, LAGs with multiple 100G, and to be prepared
> for 400G, looking for feedback on setting ospf reference-bandwidth to 1T.
>
> Please let me know if you have had any issues with this, or if it has been
> a smooth transition for you .

No one should be using bandwidth-based metrics; it's quite
nonsensical. I would recommend that if you have only a few egress
points for a given prefix, you adopt a role-based metric: P-PE, P-P-city,
P-P-country, etc. If you have many egress options for a given prefix,
a latency-based metric might be a better bet.

Traffic should fit your ideal SPT; if you are not able to engineer it so
that the SPT has enough capacity, run RSVP to move some traffic off the SPT.
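The role-based scheme described above can be sketched as a lookup table; the role names and metric values here are hypothetical, chosen only to show the ordering, not taken from any real deployment:

```python
# Hypothetical role-based IGP metric plan: the metric encodes the link's
# role in the topology, not its bandwidth. Values are chosen so that
# roles compose sensibly (e.g. any intra-city detour is cheaper than one
# inter-country hop); they are illustrative only.
ROLE_METRIC = {
    "P-PE":         10,    # core-to-edge
    "P-P-city":    100,    # core links within a metro
    "P-P-country": 1000,   # core links between cities
    "P-P-intl":   10000,   # core links between countries
}

def path_metric(roles):
    """Total IGP metric of a path, given the role of each hop."""
    return sum(ROLE_METRIC[r] for r in roles)

# A three-hop path inside a city still beats one inter-city hop:
assert path_metric(["P-PE", "P-P-city", "P-PE"]) < path_metric(["P-P-country"])
```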

--
++ytti
Re: OSPF reference-bandwidth 1T
On 2019-01-16 16:41 MET, Saku Ytti wrote:

> No one should be using bandwidth based metrics, it's quite
> non-sensical. I would recommend that if you have only few egress
> points for given prefix, adopt role based metric P-PE, P-P-city,
> P-P-country etc. If you have many egress options for given prefix
> latency based metric might be better bet.

You are obviously talking from a service-provider perspective here,
since you are talking about P and PE. Not an unreasonable assumption
on this list, of course, but I don't see any indication of what kind of
network Event Script is running.

Would you advise avoiding bandwidth-based metrics in e.g. datacenter
or campus networks as well?

(I am myself running a mostly DC network, with a little bit of campus
network on the side, and we use bandwidth-based metrics in our OSPF.
But we have standardized on using 3 Tbit/s as our "reference bandwidth",
and Junos doesn't allow us to set that, so we set explicit metrics.)
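The explicit metrics Bellman describes can be derived mechanically from the 3 Tbit/s figure; a sketch of the arithmetic (the truncation behaviour is an assumption for illustration, not Bellman's actual table):

```python
# Explicit per-interface metrics computed from a 3 Tbit/s "reference
# bandwidth" that the router itself cannot be configured with, so the
# results are set by hand on each interface.
G = 10**9
REF = 3000 * G

def metric(link_bw_bps: int) -> int:
    return max(1, REF // link_bw_bps)

for bw in (1, 10, 40, 100, 400):
    print(f"{bw:>4}G -> metric {metric(bw * G)}")
# prints 3000, 300, 75, 30 and 7 respectively
```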


/Bellman
Re: OSPF reference-bandwidth 1T
On Thu, 17 Jan 2019 at 17:49, Thomas Bellman <bellman@nsc.liu.se> wrote:

> Would you advise avoiding bandwidth-based metrics in e.g. datacenter
> or campus networks as well?

I see no point at all in bandwidth-based metrics. If we are talking
about simple networks, then the actual SPT outcome would probably be the
same with the same metric on every link, i.e. make a distance vector out
of your link-state protocol (essentially what RFC 7938 topologies are).
This would be the simplest possible metric design.
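The equivalence claimed here is easy to check: with an identical metric on every link, Dijkstra's SPF reduces to plain hop count. A minimal sketch on a hypothetical five-node topology:

```python
# With the same metric on every link, link-state SPF degenerates into
# hop count, as in RFC 7938-style fabrics. Minimal Dijkstra sketch.
import heapq

def spf(graph, src):
    """Dijkstra: graph maps node -> {neighbor: metric}; returns distances."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, m in graph[u].items():
            if d + m < dist.get(v, float("inf")):
                dist[v] = d + m
                heapq.heappush(pq, (d + m, v))
    return dist

uniform = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 1},
    "D": {"B": 1, "C": 1, "E": 1},
    "E": {"D": 1},
}
# Every distance is just the hop count: A->E is 3 hops via either B or C.
assert spf(uniform, "A")["E"] == 3
```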

> (I am myself running a mostly DC network, with a little bit of campus
> network on the side, and we use bandwidth-based metrics in our OSPF.
> But we have standardized on using 3 Tbit/s as our "reference bandwidth",
> and Junos doesn't allow us to set that, so we set explicit metrics.)

And you have shorter paths with inferior bandwidth which you do not
want to use? You'd rather take 9x10GE links than 1x1GE to reach the
destination? It boggles my mind which network has a _common case_ where
bandwidth is most indicative of the best SPT.

--
++ytti
Re: OSPF reference-bandwidth 1T
> Thomas Bellman
> Sent: Thursday, January 17, 2019 3:48 PM
>
> On 2019-01-16 16:41 MET, Saku Ytti wrote:
>
> > No one should be using bandwidth based metrics, it's quite
> > non-sensical. I would recommend that if you have only few egress
> > points for given prefix, adopt role based metric P-PE, P-P-city,
> > P-P-country etc. If you have many egress options for given prefix
> > latency based metric might be better bet.
>
> You are obviously talking about a service-provider perspective here, since
> you are talking about P and PE. Not an unreasonable assumption on this list
> of course, but I don't see any indication of what kind of network Event Script
> is running
>
> Would you advise avoiding bandwidth-based metrics in e.g. datacenter or
> campus networks as well?
>
> (I am myself running a mostly DC network, with a little bit of campus network
> on the side, and we use bandwidth-based metrics in our OSPF.
> But we have standardized on using 3 Tbit/s as our "reference bandwidth",
> and Junos doesn't allow us to set that, so we set explicit metrics.)
>
It makes me wonder where in a Clos or Benes fabric I would need to resort to bw-based metrics.
The very point of Clos graphs is that all edges and vertices of a given layer are built equal, and in a Benes graph all edges and vertices are the same.
If you have shortcut/backdoor links in these, then you're in trouble already.

One major problem with the reference BW is that you'll ultimately outgrow whatever value, however reasonable at the time, you set there.

adam



Re: OSPF reference-bandwidth 1T
Hi all,


> Would you advise avoiding bandwidth-based metrics in e.g. datacenter
> or campus networks as well?
>
> (I am myself running a mostly DC network, with a little bit of campus
> network on the side, and we use bandwidth-based metrics in our OSPF.
> But we have standardized on using 3 Tbit/s as our "reference bandwidth",
> and Junos doesn't allow us to set that, so we set explicit metrics.)

As Adam has already mentioned, DC networks are becoming more and more Clos-based, so you basically don't need OSPF at all for this.

Fabric uplinks, backbone/DCI and legacy links still exist, though; however, in the DC we tend to ECMP it all, so you normally don't want to have unequal-bandwidth links in parallel in the DC.

Workarounds happen: sometimes you have no more 100G ports available and need to plug in, let's say, 4x40G "temporarily" in addition to two existing 100G links which are starting to saturate. In such a case you'd rather consciously decide whether you want to ECMP these 200 gigs among six links (2x100 + 4x40) or use the 40G links as backup only (which might not be the best idea in this scenario).

So it's not the reference bandwidth itself which is bad in the DC; rather, the use-cases where it can technically work are not the best fit for modern DC networks.

--
Pavel
Re: OSPF reference-bandwidth 1T
On 2019-01-22 12:02 MET, Pavel Lunin wrote:

>> (I am myself running a mostly DC network, with a little bit of campus
>> network on the side, and we use bandwidth-based metrics in our OSPF.
>> But we have standardized on using 3 Tbit/s as our "reference bandwidth",
>> and Junos doesn't allow us to set that, so we set explicit metrics.)

> As Adam has already mentioned, DC networks are becoming more and more
> Clos-based, so you basically don't need OSPF at all for this.
>
> Fabric uplinks, Backbone/DCI and legacy still exist though, however in
> the DC we tend to ECMP it all, so you normally don't want to have unequal
> bandwidth links in parallel in the DC.

Our network is roughly spine-and-leaf. But we have a fairly small net
(two spines, around twenty leafs, split over two computer rooms a couple
of hundred meters apart the way the fiber goes), and it doesn't make
economical sense to make it a perfectly pure folded Clos network. So,
there are a couple of leaf switches that are just layer 2 with spanning
tree, and the WAN connections to our partner in the neighbouring city
go directly into our spines instead of into "peering leafs". (The
border routers for our normal Internet connectivity are connected as
leafs to our spines, but they are really our ISP's CPE routers, not
ours.)

Also, the leaves have wildly different bandwidth needs. Our DNS, email
and web servers don't need as much bandwidth as a 2000 node HPC cluster,
which in turn needs less bandwidth than the storage cluster for LHC
data. Most leaves have 10G uplinks (one to each spine), but we also
have leafs with 1G and with 40G uplinks.

I don't want a leaf with 1G uplinks becoming a "transit" node for traffic
between two other leafs in (some) failure cases, because an elephant flow
could easily saturate those 1G links. Thus, I want higher costs for those
links than for the 10G and 40G links. Of course, the costs don't have to
be exactly <REFERENCE_BW> / <ACTUAL_BW>, but there needs to be some relation
to the bandwidth.
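The constraint behind this can be checked with trivial arithmetic; the cost values below are hypothetical (only the ordering matters), a sketch of keeping a 1G leaf off the transit path:

```python
# Sketch of the constraint above: pick costs so that a leaf with 1G
# uplinks never becomes transit between two other leaves. Transiting a
# leaf costs two of its uplinks, so we need
#   2 * cost(1G) > worst-case path across the spines.
# The exact values are hypothetical; only the ordering matters.
COST = {1: 100, 10: 10, 40: 4}   # Gbit/s -> IGP cost, roughly ref/bw

# Normal leaf-to-leaf path: up a 10G uplink, across a spine, down a 10G.
via_spines = COST[10] + COST[10]
# Failure-case detour through a leaf with 1G uplinks:
via_1g_leaf = COST[1] + COST[1]

assert via_1g_leaf > via_spines  # the 1G leaf is never preferred transit
```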

> Workarounds happen, sometimes you have no more 100G ports available and
> need to plug, let's say, 4x40G "temporarily" in addition to two existing
> 100G which are starting to be saturated. In such a case you'd rather
> consciously decide weather you want to ECMP these 200 Gigs among six
> links (2x100 + 4x40) or use 40GB links as a backup only (might be not
> the best idea in this scenario).

Right. I actually have one leaf switch with unequal bandwidth uplinks.
On one side, it uses 2×10G link aggregation, but on the other side, I
could use an old Infiniband AOC cable giving us a 40G uplink. In that
case, I have explicitly set the two uplinks to have the same costs.


/Bellman, NSC
Re: OSPF reference-bandwidth 1T
On Wed, 16 Jan 2019 at 15:06, Event Script <event.script.jnsp@gmail.com> wrote:
>
> In the process of adding 100G, LAGs with multiple 100G, and to be prepared
> for 400G, looking for feedback on setting ospf reference-bandwidth to 1T.
>
> Please let me know if you have had any issues with this, or if it has been
> a smooth transition for you .
>
> Thanks in advance!

Hi there,

I have worked on networks with the OSPF reference bandwidth set to
1 Tbps for the same reason as you: we were planning to deploy 100G
links within the next 12 months. As we built the network we used a
reference bandwidth of 1 Tbps from the start so that we wouldn't have
to change it in the future. We also expected to deploy 100G LAGs, so a
reference bandwidth of 100G wasn't enough. I've worked on networks
where it has been set to 100 Gbps; that just seems silly to me, as
you'll just have to increase it at some point. Setting it to 1 Tbps
makes good sense to me, and in my experience it works fine (tested on
both Cisco and Juniper).

Cheers,
James.
Re: OSPF reference-bandwidth 1T
On Thu, 17 Jan 2019 at 18:09, Saku Ytti <saku@ytti.fi> wrote:
> It boggles my mind which network has _common case_ where
> bandwidth is most indicative of best SPT.

Hi Saku,

I've worked on several small networks where you don't have equal-bandwidth
links in the network. I don't mean U/ECMP; I mean a ring
topology, for example, where some links might be 10G and some 1G.
Maybe the top half of the ring, from 9 o'clock moving clockwise round
to 3 o'clock, is 10 Gbps or 20 Gbps, and the bottom half, from 3 o'clock
moving clockwise round to 9 o'clock, is 10 Gbps or 1 Gbps. I want traffic
from the 3 o'clock PE to always go anticlockwise to get to 8 o'clock,
despite it being one hop further, to reduce the traffic across the bottom
half of the ring.
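That preference falls out of bandwidth-derived costs directly; a sketch with hypothetical costs of 100 per 10G hop and 1000 per 1G hop (roughly a 1T reference):

```python
# Ring sketched numerically: nodes at clock positions; the top half
# (9 o'clock clockwise round to 3 o'clock) is 10G, the bottom half 1G.
# Hypothetical bandwidth-derived costs: 10G -> 100, 1G -> 1000.
TOP = {(9, 10), (10, 11), (11, 12), (12, 1), (1, 2), (2, 3)}  # 10G links

def cost(a, b):
    return 100 if (a, b) in TOP or (b, a) in TOP else 1000

def path_cost(path):
    return sum(cost(a, b) for a, b in zip(path, path[1:]))

clockwise = [3, 4, 5, 6, 7, 8]                # across the 1G half
anticlockwise = [3, 2, 1, 12, 11, 10, 9, 8]   # round the 10G half

# Two hops further, yet far cheaper (1600 vs 5000), so traffic from the
# 3 o'clock PE to 8 o'clock goes anticlockwise.
assert path_cost(anticlockwise) < path_cost(clockwise)
```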

Previously you said:

On Wed, 16 Jan 2019 at 15:42, Saku Ytti <saku@ytti.fi> wrote:
...
> No one should be using bandwidth based metrics, it's quite
> non-sensical.

But for any link between PoP A and PoP B the bandwidth is directly
related to the cost, i.e. 1 Gbps from A to B costs less than 10 Gbps, and
10 Gbps from A to B costs less than 100 Gbps, etc. Having worked at very
small ISPs with only 2 or 3 PEs and a lucky 8-ball to get by, cost is
everything, and you end up with both links of varying speeds and
links of varying MTUs (oh the joy!).

> P-P-country etc. If you have many egress options for given prefix
> latency based metric might be better bet.

Yeah, for larger networks with more money this works well. $dayjob has
a lot of realtime voice and video flying around, so we use latency-based
metrics and they work well, but we also have our own transmission
infrastructure, meaning that bandwidth isn't a factor for us. Not
everyone has that luxury.

Cheers,
James.
Re: OSPF reference-bandwidth 1T
On 17/Jan/19 20:08, Saku Ytti wrote:
> And you have shorter paths with inferior bandwidth which you do not
> want to use, you'll rather take 9x10GE links than 1xGE to reach the
> destination? It boggles my mind which network has _common case_ where
> bandwidth is most indicative of best SPT.

In some economies, shorter paths can be more expensive than longer ones.

Mark.
Re: OSPF reference-bandwidth 1T
On 16/Jan/19 17:41, Saku Ytti wrote:

> No one should be using bandwidth based metrics, it's quite
> non-sensical. I would recommend that if you have only few egress
> points for given prefix, adopt role based metric P-PE, P-P-city,
> P-P-country etc. If you have many egress options for given prefix
> latency based metric might be better bet.
>
> Traffic should fit your ideal SPT, if you are not able to engineer so
> that SPT has enough capacity, run RSVP to move some traffic off SPT.

We designed both latency- and function-based metrics, based on a
reference bandwidth of 1 Tbps. It works well, although, yes, you can design
this without the reference bandwidth in mind.

Luckily, when 1Tbps goes mainstream, it's easy to continue with our
approach without worrying about why we needed the 1Tbps reference
bandwidth in the first place.

Mark.
Re: OSPF reference-bandwidth 1T
On Thu, 24 Jan 2019 at 10:57, Mark Tinka <mark.tinka@seacom.mu> wrote:

> > And you have shorter paths with inferior bandwidth which you do not
> > want to use, you'll rather take 9x10GE links than 1xGE to reach the
> > destination? It boggles my mind which network has _common case_ where
> > bandwidth is most indicative of best SPT.
>
> In some economies, shorter paths can be more expensive than longer ones.

I don't disagree; I just disagree that there is a common case where
bandwidth is most indicative of a good SPT.

Consider I have

10GE-1:
PE1 - P1 - P2 - P3 - P4 - P5 - P6 - P7 - P8 - PE2

10GE-2:
PE1 - P1 - P2 - P3 - P4 - P5 - P6 - P7 - P8 - P9 - PE2

10GE-3:
PE1 - P1 - P2 - P3 - P4 - P5 - P6 - P7 - P8 - P9 - P10 - PE2

1GE:
PE1 - PE2

In which realistic topology

a) in 10GE-1 + 1GE, I want to prefer the 10GE between PE?
b) in 10GE-2 + 1GE, I want to balance between the paths
c) in 10GE-3 + 1GE, I want to prefer the 1GE
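With a 1T reference (each 10GE hop cost 100, the direct 1GE cost 1000), the three cases are pure arithmetic; a sketch of the comparison being made:

```python
# Saku's three scenarios under a hypothetical 1T reference: each 10GE
# hop costs 100, the direct 1GE hop costs 1000. Bandwidth-based SPF
# picks whichever total is lower, however arbitrary the result looks.
COST_10GE, COST_1GE = 100, 1000

def prefer(hops_10ge: int) -> str:
    long_path = hops_10ge * COST_10GE
    if long_path < COST_1GE:
        return "10GE path"
    if long_path > COST_1GE:
        return "1GE path"
    return "ECMP"

assert prefer(9)  == "10GE path"   # case a: 9 hops, 900 < 1000
assert prefer(10) == "ECMP"        # case b: 10 hops, dead heat
assert prefer(11) == "1GE path"    # case c: 11 hops, 1100 > 1000
```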

All these seem nonsensical. What is actually meant is "1GE has role Z,
10GE has role X; have a higher metric for role Z", regardless of what the
actual bandwidth is. It just happens that bandwidth approximates role
in that topology, but the desired topology is likely achieved with
distance vector or a simple role topology, and bandwidth is not the
relevant information.

Even if the P boxes are in the same PoP, each device adds some 5-10 km
worth of latency. So we'd prefer 40-80 km of latency over the direct
connection. Why did that direct 1GE exist, and when would you realistically
fall back to using the lower-bandwidth link? It seems ridiculously
arbitrary and not indicative of any design.

I'd like to see a mock-up topology where a bandwidth metric makes any
sense at all and is more frequently right than role or latency.

--
++ytti
Re: OSPF reference-bandwidth 1T
On 24/Jan/19 11:26, Saku Ytti wrote:

> I don't disagree, I just disagree that there are common case where
> bandwidth is most indicative of good SPT.

Agreed.

> I'd like to see mock-up topology, where bandwidth metric makes any
> sense at all, and is more frequently right than role or latency.

It doesn't.

As with any thought-process that was used back in the day, the general
belief was that the more bandwidth you have, the better life will always
be. Real life, obviously, is very different.

Mark.

Re: OSPF reference-bandwidth 1T
TL;DR: metrics aren't a purely design/academic decision; they are
operational too.

On Thu, 24 Jan 2019 at 09:27, Saku Ytti <saku@ytti.fi> wrote:
> I don't disagree, I just disagree that there are common case where
> bandwidth is most indicative of good SPT.

If by "good" you mean "shortest" (least number of hops) then I
disagree with you: bandwidth is usually indicative of the shortest number
of hops (not always, but usually). In any reasonable hierarchical
design, northbound links aren't going to be of a lower speed than
southbound links. Taking Adam's example of a folded Clos network as a
theoretical utopian text-book example, you also wouldn't have
east-west links between leaves, and if you did they wouldn't be as fast
as or faster than your northbound links. The problem is that in reality
no SP network looks as neat and tidy, or as simple, as a Clos network;
see below...

> Consider I have
>
> 10GE-1:
> PE1 - P1 - P2 - P3 - P4 - P5 - P6 - P7 - P8 - PE2
>
> 10GE-2:
> PE1 - P1 - P2 - P3 - P4 - P5 - P6 - P7 - P8 - P9 - PE2
>
> 10GE-3:
> PE1 - P1 - P2 - P3 - P4 - P5 - P6 - P7 - P8 - P9 - P10 - PE2
>
> 1GE:
> PE1 - PE2
>
> In which realistic topology

> a) in 10GE-1 + 1GE, I want to prefer the 10GE between PE?

As soon as you have 1.000000001 Gbps of traffic to shift (see my
previous email). And this is where reality kicks in: why would you
have a PE with a 10G and a 1G uplink? In the hypothetical Clos design
you simply wouldn't have mixed-speed links facing northbound; in the
real SP networking world you wouldn't have a 10G uplink if you didn't
have >1 Gbps of provisioned downstream connectivity, otherwise you're
wasting capex/opex (except for rare circumstances like a carrier
promotion selling 10G for the price of 1G or something, but you
probably hadn't planned for that). So, assuming there is a reason you
have bandwidth-asymmetrical uplinks in your topology, it's probably
downstream-bandwidth related. It could also be upstream related, though;
upstream link upgrades don't happen in a fixed time or perfectly
symmetrically. Maybe the road closure is delayed, route planning
changes, a PoP closes, transmission equipment gets upgraded; you end up
upgrading one northbound circuit in 3 months and the other takes 12
months. To come full circle to your original point, bandwidth is
dictating the "best" SPT here, where "best" means "avoiding congestion
during normal operations, not in times of exceptional operations, which
is when we look to QoS for help".

This is what happens in the "real world", not in Clos networks. We
might want diverse connections to a remote PoP where only one carrier
has 10G of capacity, so our backup link has to be 1G. We
actually have more than 1G of provisioned downstream connectivity, but
that is all we can get unless we want 2x10G from the same carrier and
no resilience. Maybe we can bond a few 1G links from the 2nd carrier
and have 10G + 5G backup. To be clear, I don't approve of such a
design; my point is that in the real world, where things aren't
simple, circuit costs are higher than expected, we don't have enough
100G or 10G ports, the project has been under-budgeted, and the lead
time on the new router from the vendor is 12 months, not the promised
3, we end up with these kinds of weird asymmetrical topologies and we
have to use a bandwidth-based metric to route traffic.

> b) in 10GE-2 + 1GE, I want to balance between the paths

So, from a purely technical perspective, if you did per-flow load
balancing it would work. Should you do it? I'd say hell no, but not
because of anything to do with IGPs. The operational complexity of
troubleshooting such a topology is too high in this scenario; imagine
if each one of those 10G links between P nodes was from a different
carrier: it would be a case of service credits lining up, ready to be
given away.

> c) in 10GE-3 + 1GE, I want to prefer the 1GE

When you actually have some bandwidth-critical services which are <= 1 Gbps.

> All these seem nonsensical, what actually is meant '1GE has role Z,
> 10GE has role X, have higher metric for role Z', regardless what the
> actual bandwidth is. I just happens that bandwidth approximates role
> in that topology, but desired topology is likely achieved with
> distance vector or simple role topology and bandwidth is not relevant
> information.

To me they aren't nonsensical; they are "not ideal" for a specific
purpose, i.e. sub-optimal for latency, or operationally more complex.
Going right back to basics: the reason we have a metric at all in the
IGP is that there is some reason why the shortest path (number of
hops) from A to B isn't the most optimal path, so we're using the
metric as a weight to influence the SPT calculation. So the question
is: why isn't the SPT optimal for you? In the hypothetical Clos model
it is; in real life it isn't, so we're always trying to get as close
to that as we can. Metrics aren't just a purely design/academic
decision (function-based or role-based); they are operational too,
e.g. breaking up a failure domain or breaking up a change-request
domain.

I've had to move traffic away from a P/PE node because traffic around
the core ring was disproportionately distributed, such that the failure
of one P node had a much larger impact than other P nodes. As I
mentioned in my previous email, these issues only go away when you
have the kind of luxuries that I, and I expect you, have, like your own
dedicated transmission network or enough influence to tell a carrier
where to lay fibre next.

Cheers,
James.