Mailing List Archive

LAG/ECMP hash performance
Has anyone run into a set of flows where ostensibly you have enough
entropy to balance fairly, but you end up seeing significant imbalance
anyhow? Can you share the story? What platform? How did you
troubleshoot? How did you fix?

I'll take non-JNPR stories too.

It looks like many/most vendors are still using CRC for LAG/ECMP,
which historically makes sense, as you could piggyback on the Ethernet
FCS transistors for a zero-cost implementation. Today the transistors
are likely separate anyhow, as the PHY and the lookup engine are two
different chips, so CRC may not be a very good choice for the problem.

If I read this right (thanks David)
https://github.com/rurban/smhasher/blob/master/doc/crc32 - CRC32
appears to have less than perfect 'diffusion' quality, which would
suggest that there are scenarios where poor balancing is by design,
and where another hash implementation with good diffusion quality
would balance fairly.
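
As a quick sanity check of that claim, something like this untested
Python sketch measures diffusion directly: flip one input bit and count
how many of the 32 output bits change; the ideal average is 16. zlib's
IEEE CRC32 stands in here; a real PFE's polynomial and input vector may
differ.

import random
import zlib

def avalanche(hash_fn, key_len=12, trials=100_000):
    # Average number of output bits that flip when one input bit flips;
    # ideal diffusion for a 32-bit hash is 16.0. CRC is linear, so the
    # flipped output pattern depends only on the bit position, not on
    # the rest of the key - already a hint of weakness as a hash.
    total = 0
    for _ in range(trials):
        key = bytearray(random.randbytes(key_len))
        h1 = hash_fn(bytes(key))
        bit = random.randrange(key_len * 8)
        key[bit // 8] ^= 1 << (bit % 8)   # flip a single input bit
        h2 = hash_fn(bytes(key))
        total += bin(h1 ^ h2).count("1")  # count changed output bits
    return total / trials

print("CRC32 avalanche:", avalanche(zlib.crc32))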

Thanks!
--
++ytti
Re: LAG/ECMP hash performance
On Sat, 24 Aug 2019 at 10:06, Saku Ytti <saku@ytti.fi> wrote:

Hi Saku,

> Has anyone run into a set of flows where ostensibly you have enough
> entropy to balance fairly, but you end up seeing significant imbalance
> anyhow? Can you share the story? What platform? How did you
> troubleshoot? How did you fix?

No. Out of curiosity, have you, which is what led you to post this?
If yes, what platform?

> It looks like many/most vendors are still using CRC for LAG/ECMP,
> which historically makes sense, as you could piggyback on the Ethernet
> FCS transistors for a zero-cost implementation. Today the transistors
> are likely separate anyhow, as the PHY and the lookup engine are two
> different chips, so CRC may not be a very good choice for the problem.

Yeah, I more or less agree. It's a bit computationally expensive if the
lookup engine is not something "modern" (i.e. a typical modern Intel
x86_64 chip) with a native CRC32 instruction. In the case of, say, an
Intel chip (or any ASIC with CRC32 built in), generating a CRC32 sum
for load-balancing wouldn't be much of an overhead. But even with a
native CRC32 instruction it seems like overkill. If "speed is
everything", a CRC32 instruction might not complete in a single CPU
cycle, so other methods could be faster, especially given that most
people don't need the 32 bits of entropy produced by CRC32 (as in, they
don't have 2^32 links in a single LAG bundle or that many ECMP
routes).

> If I read this right (thanks David)
> https://github.com/rurban/smhasher/blob/master/doc/crc32 - CRC32
> appears to have less than perfect 'diffusion' quality, which would
> suggest that there are scenarios where poor balancing is by design,
> and where another hash implementation with good diffusion quality
> would balance fairly.

That is my understanding of CRC32 also, although I didn't know it was
being widely used for load-balancing, so I had never thought of it as
an actual practical issue. One thing to consider is that not all CRC32
sums are the same: the polynomial used varies, and so $box1
doing CRC32 for load-balancing might produce different results to
$box2 if they use different polynomials. I have recorded some common
ones here: https://null.53bits.co.uk/index.php?page=crc-and-checksum-error-detection#polynomial

It looks like the standard IEEE 802.3 value 0x04C11DB7 is being used
for these tests, here
https://github.com/jwbensley/Ethernet-CRC32/blob/master/crc32.c

Other polys are used though, e.g. for larger packets: when using jumbo
frames, stretching the amount of data the CRC has to protect with the
same size of sum (32 bits), other polynomials can be more effective.
It's probably a safe bet that most implementations that use CRC32 for
hashing use the standard poly value, but I'm keen to hear more about
this.
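
For anyone who wants to experiment, here's a quick untested sketch of a
bit-serial CRC32 with the polynomial as a parameter; it shows how two
boxes using different polys compute different hashes for the same key.
The init/reflection/xor-out below follow the common Ethernet
conventions, but real implementations vary in those too.

def crc32_generic(data: bytes, poly_reflected: int) -> int:
    # Reflected bit-serial CRC32: init 0xFFFFFFFF, final XOR, LSB-first.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (poly_reflected & -(crc & 1))
    return crc ^ 0xFFFFFFFF

key = bytes.fromhex("17000014 9d000014 0050 c83a".replace(" ", ""))
# IEEE 802.3 0x04C11DB7 reflects to 0xEDB88320;
# CRC-32C (Castagnoli) 0x1EDC6F41 reflects to 0x82F63B78.
print(hex(crc32_generic(key, 0xEDB88320)))  # "$box1"
print(hex(crc32_generic(key, 0x82F63B78)))  # "$box2" - different hash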

Cheers,
James.
Re: LAG/ECMP hash performance
On Wed, 28 Aug 2019 at 09:54, James Bensley
<jwbensley+juniper-nsp@gmail.com> wrote:

> No. Out of curiosity, have you, which is what led you to post this?
> If yes, what platform?

I've had two issues where I cannot explain why there is imbalance, one
on an MX2020 and another on a PTX. I can't find any elephant flows in
netflow, but I can find traffic grouped closely together with a modest
amount of IP address entropy (like 20-32 SADDRs + 20-32 DADDRs + 1
SPORT + random DPORT). My understanding is that the random DPORT alone
should guarantee fair balancing, in the absence of elephant flows and
when the flow count is sufficient.

I did briefly talk to some people, and one person mentioned they saw
this problem on NOK in their VOD distribution; again the flows were
grouped together, but with ostensibly enough entropy. Curiously the
NOK case was fixed by adding static bits (the host IP) to the input of
every single hash calculation. I think the fix supports the
hash-weakness theory: moving the input bits around shifted the changing
bits from more 'vulnerable' bit positions to less vulnerable ones.
Another person mentioned seeing this on Jericho.
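
The static-bits fix is easy to demonstrate in software. An untested
sketch, with zlib's CRC32 and modulo bucket selection standing in for
whatever the NPU really computes: prepending a static word adds no
entropy, but it moves every key to a different point in the hash space
and so changes which keys collide.

import zlib
from collections import Counter

def buckets(keys, n_links, prefix=b""):
    # Distribution of keys over n_links members, optionally with a
    # static prefix mixed into every hash input.
    return Counter(zlib.crc32(prefix + k) % n_links for k in keys)

# Closely grouped keys, like the VOD flows described above.
keys = [bytes([10, 0, 0, s, 10, 0, 1, d])
        for s in range(32) for d in range(32)]

print(buckets(keys, 3))                              # without static bits
print(buckets(keys, 3, prefix=b"\x00\x00\x00\x2a"))  # with a static "SystemID"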

I did a trivial lab test on an MX2020, which I'll post at the end of
this email, and which appears (though not controlled enough to say for
sure) to support that the hashing is less than ideal.

> That is my understanding of CRC32 also, although I didn't know it was
> being widely used for load-balancing, so I had never thought of it as
> an actual practical issue. One thing to consider is that not all CRC32
> sums are the same: the polynomial used varies, and so $box1
> doing CRC32 for load-balancing might produce different results to
> $box2 if they use different polynomials. I have recorded some common
> ones here: https://null.53bits.co.uk/index.php?page=crc-and-checksum-error-detection#polynomial

Yes, I'm sure vendors have put some thought into this and have tried
to work around what seems to be a fundamental property of CRC: it is
not a hash function with particularly good diffusion.

> It looks like the standard IEEE 802.3 value 0x04C11DB7 is being used
> for these tests, here
> https://github.com/jwbensley/Ethernet-CRC32/blob/master/crc32.c
>
> Other polys are used though, e.g. for larger packets: when using jumbo
> frames, stretching the amount of data the CRC has to protect with the
> same size of sum (32 bits), other polynomials can be more effective.
> It's probably a safe bet that most implementations that use CRC32 for
> hashing use the standard poly value, but I'm keen to hear more about
> this.

Do you think that with other parameters it would achieve better
diffusion quality? Statistically you should see half of the output
bits change when a single input bit changes, and it may be that CRC
fundamentally does not satisfy this. I think that makes sense, because
the goal of CRC is to catch as many _small_ changes as possible: the
Ethernet FCS will catch all single-bit flips, I think maybe even all
double-bit flips, and then perhaps all even- or odd-count flips, I
forget which. And if you are spending the range on that goal, which is
fundamentally important in that application, then I don't think you're
going to achieve good diffusion with the same algorithm. Testing
whether a hash function is good for ECMP/LAG should be fairly trivial,
as you can analyse a large segment of the practical space and look for
statistical bias in diffusion.
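
Something along these lines would do it (untested sketch; zlib's CRC32
plus plain modulo member selection is only a stand-in for the real PFE
logic, so whether it reproduces the imbalance depends on how faithful
that stand-in is). It walks a structured flow set shaped like the one
in the lab test below and reports the skew:

import hashlib
import random
import struct
import zlib

def flow_keys(n=50_000):
    # Flows shaped like the production case: few IPs, random DPORT.
    for _ in range(n):
        yield struct.pack(
            ">IIHH",
            0x17000014 + random.randrange(27),  # ~27 source addresses
            0x9D000014 + random.randrange(16),  # ~16 destination addresses
            80,                                 # static SPORT
            random.randrange(2074, 65471),      # random DPORT
        )

def skew(hash_fn, n_links=3):
    counts = [0] * n_links
    for key in flow_keys():
        counts[hash_fn(key) % n_links] += 1
    return max(counts) / min(counts)  # 1.0 == perfectly fair

print("crc32 :", skew(zlib.crc32))
print("sha256:", skew(lambda k: int.from_bytes(
    hashlib.sha256(k).digest()[:4], "big")))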

SRC: (single 100GE interface, single unit 0)
23.A.B.20 .. 23.A.B.46
TCP/80
DST: (N*10GE LACP)
157.C.D.20 .. 157.C.D.35
TCP 2074..65470 (RANDOM; this alone, with everything else static,
should have guaranteed fair balancing)

I'm running this through IXIA and my results are:

3*10GE Egress:
port1 10766516pps
port2 10766543pps
port3 7536578pps
after (set forwarding-options enhanced-hash-key family inet
incoming-interface-index)
port1 9689881pps
port2 11791986pps
port3 5383270pps
after removing s-int-index and setting adaptive
port1 9689889pps
port2 9689892pps
port3 9689884pps

I think this supports the idea that the hash function diffuses poorly.
It should be noted that the 2nd step adds entirely _static_ bits to the
input of the hash; the source interface does not change. And it's
perfectly repeatable. This is to be expected: the weak bit positions
shift, making the problem either worse or better.
I.e. the flows are 100% perfectly hashable, just not without the hash
biasing the results. There aren't any elephants.


4*10GE Egress:
port1 4306757pps
port2 8612807pps
port3 9689893pps
port4 6459931pps
after adding incoming-interface-index
port1 6459922pps
port2 8613236pps
port3 9691485pps
port4 4306620pps
after removing s-index and adding adaptive:
port1 7536562pps
port2 7536593pps
port3 6459928pps
port4 7536566pps
after removing adaptive and adding no-destination-port + no-source-port
port1: 5383279pps
port2: 9689886pps
port3: 7536588pps
port4: 6459922pps
after removing no-source-port (i.e. destination port is used for hash)
port1: 8613235pps
port2: 5383272pps
port3: 5383274pps
port4: 9689884pps

It is curious that it actually balances more fairly without using TCP
ports at all, even though there is _tons_ of entropy there due to the
random DPORT!


--
++ytti
Re: LAG/ECMP hash performance
On Wed, 28 Aug 2019 at 08:21, Saku Ytti <saku@ytti.fi> wrote:
> I've had two issues where I cannot explain why there is imbalance,
> one on an MX2020 and another on a PTX. I can't find any elephant flows
> in netflow, but I can find traffic grouped closely together with a
> modest amount of IP address entropy (like 20-32 SADDRs + 20-32 DADDRs
> + 1 SPORT + random DPORT). My understanding is that the random DPORT
> alone should guarantee fair balancing, in the absence of elephant
> flows and when the flow count is sufficient.

Hi Saku,

Hmm, interesting, but has anyone confirmed to you that these devices
use CRC32 for the hashing, or are you trying to reverse engineer this?
Is there any reason why this couldn't just be a dodgy Juniper
proprietary hash algo? I'm just playing devil's advocate here.

> I did a trivial lab test on an MX2020, which I'll post at the end of
> this email, and which appears (though not controlled enough to say for
> sure) to support that the hashing is less than ideal.

I had the same idea to break out the lab Ixia but I haven't had time yet...

> Do you think that with other parameters it would achieve better
> diffusion quality?

Different parameters may or may not change the diffusion density, but
they may increase the range of results, i.e. perfect diffusion over
2^2 outcomes vs. perfect diffusion over 2^6 outcomes.

Also, ASR9Ks use a CRC32 on Typhoon cards, but not over the whole
frame: "Post IOS-XR 4.2.0, Typhoon NPUs use a CRC based calculation of
the L3/L4 info and compute a 32 bit hash value." So actually, your
results below should have good diffusion in theory if this were an
ASR9K (although I'm sure that's not the case in reality). Is the
Juniper taking (1) the whole frame into the CRC function, (2) all the
headers but no payload, or (3) just the specific header fields (S/D
MAC/IP/Port/Intf)?

In the worst-case scenario (1), with 4 bytes of CRC output
representing an entire frame, there is a large number of hash
collisions; for a minimum-size frame (6-byte SRC + 6-byte DST + 2-byte
EType + 46-byte payload) that's 2^480 possible Ethernet frames being
mapped onto 2^32 CRC values. So that to me says that twiddling the
DPORT value randomly wouldn't necessarily get great diffusion, because
this isn't a perfect-hashing scenario and there are many collisions (I
also don't know how random that Ixia RNG is).

In the best-case scenario (3), with just the required header fields
feeding the CRC, the input field sizes still far exceed the 32-bit
output size.

> SRC: (single 100GE interface, single unit 0)
> 23.A.B.20 .. 23.A.B.46
> TCP/80
> DST: (N*10GE LACP)
> 157.C.D.20 .. 157.C.D.35
> TCP 2074..65470 (RANDOM; this alone, with everything else static,
> should have guaranteed fair balancing)
>
> I'm running this through IXIA and my results are:
>
> 3*10GE Egress:
> port1 10766516pps
> port2 10766543pps
> port3 7536578pps
> after (set forwarding-options enhanced-hash-key family inet
> incoming-interface-index)
> port1 9689881pps
> port2 11791986pps
> port3 5383270pps
> after removing s-int-index and setting adaptive
> port1 9689889pps
> port2 9689892pps
> port3 9689884pps
>
> I think this supports the idea that the hash function diffuses poorly.
> It should be noted that the 2nd step adds entirely _static_ bits to the
> input of the hash; the source interface does not change. And it's
> perfectly repeatable. This is to be expected: the weak bit positions
> shift, making the problem either worse or better.
> I.e. the flows are 100% perfectly hashable, just not without the hash
> biasing the results. There aren't any elephants.
>
>
> 4*10GE Egress:
> port1 4306757pps
> port2 8612807pps
> port3 9689893pps
> port4 6459931pps
> after adding incoming-interface-index
> port1 6459922pps
> port2 8613236pps
> port3 9691485pps
> port4 4306620pps
> after removing s-index and adding adaptive:
> port1 7536562pps
> port2 7536593pps
> port3 6459928pps
> port4 7536566pps
> after removing adaptive and adding no-destination-port + no-source-port
> port1: 5383279pps
> port2: 9689886pps
> port3: 7536588pps
> port4: 6459922pps
> after removing no-source-port (i.e. destination port is used for hash)
> port1: 8613235pps
> port2: 5383272pps
> port3: 5383274pps
> port4: 9689884pps
>
> It is curious that it actually balances more fairly without using TCP
> ports at all, even though there is _tons_ of entropy there due to the
> random DPORT!

This is interesting.

In the past I have simply generated a flow like: SRC MAC
11:11:11:11:11:11, DST MAC 11:11:11:11:11:11, SRC IP 1.1.1.1, DST IP
1.1.1.1, SPORT 1, DPORT 1; send the traffic, note which egress port was
chosen from the LAG, increment one value (e.g. DST IP) by 1, send
traffic, note the egress port, repeat; and you can sometimes "get a
feel" for the hashing being used. It's very laborious but it might give
some more insight into what's going on here.
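
These days the sending half of that is easy to automate, e.g. with
Scapy (untested sketch; the interface name and addresses are
placeholders). You still have to read the per-member LAG counters on
the box after each step:

import ipaddress
import time
from scapy.all import Ether, IP, TCP, sendp

base = ipaddress.IPv4Address("1.1.1.1")
for i in range(16):
    # One probe flow per step; only DST IP changes between steps.
    pkt = (Ether(src="02:11:11:11:11:11", dst="02:22:22:22:22:22")
           / IP(src="1.1.1.1", dst=str(base + i))
           / TCP(sport=1, dport=1))
    sendp(pkt, iface="eth0", count=1000)  # burst so counters move visibly
    time.sleep(5)                         # pause to note the egress member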

Cheers,
James.
Re: LAG/ECMP hash performance
On Thu, 29 Aug 2019 at 11:52, James Bensley
<jwbensley+juniper-nsp@gmail.com> wrote:

> Hmm, interesting, but has anyone confirmed to you that these devices
> use CRC32 for the hashing, or are you trying to reverse engineer this?
> Is there any reason why this couldn't just be a dodgy Juniper
> proprietary hash algo? I'm just playing devil's advocate here.

JNPR states CRC32+CRC16, but I'd rather have proper pseudocode so
I could implement it myself and see how well its diffusion performs.

Because you have to have the transistors there for FCS anyway,
historically the same block has been used for functions other than
checking FCS, like hash buckets for MAC addresses. And, it stands to
reason, for LAG/ECMP balancing.
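
Absent real pseudocode, my naive guess at what "CRC32 rotated by CRC16"
could mean - purely speculative, almost certainly not the actual Trio
algorithm, just something concrete to run through a diffusion test:

import zlib

def crc16_ccitt(data: bytes) -> int:
    # Plain bit-serial CRC-16/CCITT, poly 0x1021.
    crc = 0xFFFF
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc

def guess_hash(key: bytes) -> int:
    # Speculative: rotate the CRC32 result left by an amount taken
    # from a CRC16 of the same key.
    h = zlib.crc32(key)
    r = crc16_ccitt(key) % 32
    return ((h << r) | (h >> (32 - r))) & 0xFFFFFFFF

print(hex(guess_hash(b"\x17\x00\x00\x14\x9d\x00\x00\x14")))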

> This is interesting.

I think the removal of L4 keys improving the results mirrors the
report I got from the NOK owner. The NOK owner added the SystemID (a
static 32b input), which fixed the balancing problem for them, implying
that the input bits causing weak diffusion were moved to the left or
right, so the changing input bits no longer landed on weak positions.
For CRC's designed function, of course, diffusion isn't an important
quality; CRC is very, very good for FCS.

--
++ytti
Re: LAG/ECMP hash performance
On Thu, Aug 29, 2019 at 2:52 AM James Bensley
<jwbensley+juniper-nsp@gmail.com> wrote:
<snip>
> Different parameters may or may not change the diffusion density, but
> they may increase the range of results, i.e. perfect diffusion over
> 2^2 outcomes vs. perfect diffusion over 2^6 outcomes.
>
> Also, ASR9Ks use a CRC32 on Typhoon cards, but not over the whole
> frame: "Post IOS-XR 4.2.0, Typhoon NPUs use a CRC based calculation of
> the L3/L4 info and compute a 32 bit hash value." So actually, your
> results below should have good diffusion in theory if this were an
> ASR9K (although I'm sure that's not the case in reality). Is the
> Juniper taking (1) the whole frame into the CRC function, (2) all the
> headers but no payload, or (3) just the specific header fields (S/D
> MAC/IP/Port/Intf)?
<snip>

I think 802.3ad and ECMP both require a given connection to hash to
the same link to prevent out-of-order delivery.

Taking full frames or even full headers into your hashing algorithm
would likely break the expectation of in-order delivery (unless your
have the same vendor on both sides with something proprietary).
Ignoring that requirement, you could ditch hashing altogether and go
for round-robin. Standards-compliant hashing implementations can only
look at header fields that don't change for a flow, namely src/dest
mac, ip, protocol, and port for TCP/UDP (maybe adding in certain MPLS,
VLAN, etc. fields or interface ids or other proprietary information
available to the chip that satisfies that requirement).
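
As a sketch of that constraint (field choice illustrative, zlib's CRC32
an arbitrary stand-in hash): only fields that are constant for the life
of the flow go into the key, so every packet of a connection picks the
same member.

import socket
import struct
import zlib

def select_member(src_ip, dst_ip, proto, sport, dport, n_members):
    # Key built only from per-flow-immutable fields, so the result is
    # stable for the life of the connection (in-order delivery holds).
    key = struct.pack(">4s4sBHH",
                      socket.inet_aton(src_ip), socket.inet_aton(dst_ip),
                      proto, sport, dport)
    return zlib.crc32(key) % n_members

# Every packet of this TCP connection lands on the same LAG member:
print(select_member("23.0.0.20", "157.0.0.20", 6, 80, 51234, 4))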

--
Eldon
Re: LAG/ECMP hash performance
Hi Eldon,

You are very correct. I was highly surprised to read Saku mentioning
the use of CRC for hashing, but then a quick google revealed this link:

https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/hash-parameters-edit-forwarding-options.html


Looks like ECMP and LAG hashing may seriously spread your flows, as
clearly the CRC includes payload, and the payload is likely to be
different with every packet.

Good that this is only for QFX though :-)

For MX I recall that the hash is not computed over the entire packet.
Specific packet fields are taken as input (per configuration) and CRC
functions are used to mangle them - which is very different from saying
that the packet's CRC is used as input.

But I admit, looking at a few prod MXes, the load across LAG members is
far from well balanced.

Thx,
R.


On Thu, Aug 29, 2019 at 4:56 PM Eldon Koyle <
ekoyle+puck.nether.net@gmail.com> wrote:

> On Thu, Aug 29, 2019 at 2:52 AM James Bensley
> <jwbensley+juniper-nsp@gmail.com> wrote:
> <snip>
> > Different parameters may or may not change the diffusion density, but
> > they may increase the range of results, i.e. perfect diffusion over
> > 2^2 outcomes vs. perfect diffusion over 2^6 outcomes.
> >
> > Also, ASR9Ks use a CRC32 on Typhoon cards, but not over the whole
> > frame: "Post IOS-XR 4.2.0, Typhoon NPUs use a CRC based calculation of
> > the L3/L4 info and compute a 32 bit hash value." So actually, your
> > results below should have good diffusion in theory if this were an
> > ASR9K (although I'm sure that's not the case in reality). Is the
> > Juniper taking (1) the whole frame into the CRC function, (2) all the
> > headers but no payload, or (3) just the specific header fields (S/D
> > MAC/IP/Port/Intf)?
> <snip>
>
> I think 802.3ad and ECMP both require a given connection to hash to
> the same link to prevent out-of-order delivery.
>
> Taking full frames or even full headers into your hashing algorithm
> would likely break the expectation of in-order delivery (unless your
> have the same vendor on both sides with something proprietary).
> Ignoring that requirement, you could ditch hashing altogether and go
> for round-robin. Standards-compliant hashing implementations can only
> look at header fields that don't change for a flow, namely src/dest
> mac, ip, protocol, and port for TCP/UDP (maybe adding in certain MPLS,
> VLAN, etc. fields or interface ids or other proprietary information
> available to the chip that satisfies that requirement).
>
> --
> Eldon
Re: LAG/ECMP hash performance
On 2019-08-29 17:31 +0200, Robert Raszuk wrote:

> You are very correct. I was highly surprised to read Saku mentioning
> the use of CRC for hashing, but then a quick google revealed this link:
>
> https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/hash-parameters-edit-forwarding-options.html
>
> Looks like ECMP and LAG hashing may seriously spread your flows, as
> clearly the CRC includes payload, and the payload is likely to be
> different with every packet.

On what basis do you figure CRC "clearly" includes payload? I see
no indication on that page, or a few other pages close by, that
anything but select layer 2 or layer 3/4 headers are used in the
hashes for LAG and ECMP.

Are you perhaps misled by the 'forwarding-options enhanced-hash-key
hash-mode layer2-payload' setting? My understanding is that its
meaning is to use select L3 and/or L4 headers, as opposed to using
select L2 headers, as input to the CRC function. A better name for
that setting would probably be 'layer2/3-headers'.

https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/hash-mode-edit-forwarding-options-ex-series.html
says:

If the hash mode is set to layer2-payload, you can set the fields
used by the hashing algorithm to hash IPv4 traffic using the set
forwarding-options enhanced-hash-key inet statement. You can set
the fields used by the hashing algorithm to hash IPv6 traffic using
the set forwarding-options enhanced-hash-key inet6 statement.

The fields you can select/deselect are:

- Source IPv4/IPv6 address
- Destination IPv4/IPv6 address
- Source L4 port
- Destination L4 port
- IPv4 protocol / IPv6 NextHdr
- VLAN-id (on EX and QFX 5k)
- Incoming port (on QFX 10k)
- IPv6 flow label (on QFX 10k)
- GPRS Tunneling Protocol endpoint id


> Good that this is only for QFX though :-)

The 'hash-parameter' settings are not even valid on all QFXes. At
least Trident II (QFX 51x0) uses a Broadcom-proprietary hash called
RTAG7. I'm guessing that using CRC16 or CRC32 for LAG/ECMP hashing
is only done on QFX 10k, not on any of the Trident- or Tomahawk-based
routers/switches.


> For MX I recall that the hash is not computed over the entire packet.
> Specific packet fields are taken as input (per configuration) and CRC
> functions are used to mangle them - which is very different from saying
> that the packet's CRC is used as input.

I don't think anyone has said that any product uses the ethernet
packet's CRC for LAG/ECMP hashing. Just that they might reuse
the CRC circuitry in the NPU/ASIC for calculating this hash, but
based on different inputs.


--
Thomas Bellman, National Supercomputer Centre, Linköping Univ., Sweden
"We don't understand the software, and sometimes we don't understand
the hardware, but we can *see* the blinking lights!"
Re: LAG/ECMP hash performance
> James Bensley
> Sent: Thursday, August 29, 2019 9:52 AM
>
> In the worst-case scenario (1), with 4 bytes of CRC output
> representing an entire frame, there is a large number of hash
> collisions; for a minimum-size frame (6-byte SRC + 6-byte DST + 2-byte
> EType + 46-byte payload) that's 2^480 possible Ethernet frames being
> mapped onto 2^32 CRC values.
>
Hmm, isn't it more like 2^480 possible Ethernet frames being mapped to
2^32 CRC values, which are then mapped to 2^6 buckets (assuming 64-way
balancing - not sure what the value is in Saku's setup)?
All this time watching the thread I've been wondering whether in some
cases, no matter how good the hash value is, the available buckets
skew the balancing. (And I guess that's why there are knobs to shift
the hash around.)
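
To illustrate (a sketch: the 64-bucket table and round-robin fill are
made up for the example, real hardware tables differ): even a perfectly
uniform 32-bit hash gets folded into a small bucket table, and the
bucket-to-member assignment skews things whenever the member count does
not divide the bucket count.

import random
from collections import Counter

N_BUCKETS, N_MEMBERS = 64, 3
# 64 buckets over 3 members cannot split evenly: 22 + 21 + 21.
bucket_to_member = [b % N_MEMBERS for b in range(N_BUCKETS)]

load = Counter()
for _ in range(1_000_000):
    h = random.getrandbits(32)  # stand-in for a perfectly uniform hash
    load[bucket_to_member[h % N_BUCKETS]] += 1

print(load)  # member 0 carries ~22/64 of the traffic, the others ~21/64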

adam

Re: LAG/ECMP hash performance
On Thu, 29 Aug 2019 at 22:17, Thomas Bellman <bellman@nsc.liu.se> wrote:

> I don't think anyone has said that any product uses the ethernet
> packet's CRC for LAG/ECMP hashing. Just that they might reuse
> the CRC circuitry in the NPU/ASIC for calculating this hash, but
> based on different inputs.

Exactly, and even that isn't true anymore today, since the PHY and the
lookup are usually different physical chips (exceptions still exist).
Reuse of CRC circuitry for MAC table hashing and LAG hashing does of
course make sense under some specific constraints, which usually no
longer hold in modern devices; I think people have just grandfathered
the old solution.

Obviously the Ethernet FCS specifically has bias, because it will
detect every single and every double bit flip at 1500B, so it
particularly avoids collisions on small changes. A hash function with
good diffusion quality would not have such bias, and would make a
terrible FCS, as for FCS we care more about coverage of small changes
than about treating all changes as equally important.
To put it another way: CRC is a great FCS and an okay-ish LAG balancer,
but ideally LAG should use a hash with theoretically ideal diffusion,
and such a hash can be implemented cheaply.
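
For example the well-known (public domain) murmur3 32-bit finalizer
below is just a few shifts and multiplies, yet has near-ideal
avalanche; cheap in software and in gates. A sketch of the idea only,
not a claim about what any vendor ships:

import zlib

def mix32(h: int) -> int:
    # murmur3 fmix32: near-ideal avalanche from shifts and multiplies.
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

# Could even be bolted on to post-mix an existing CRC32 result:
key = b"\x17\x00\x00\x14\x9d\x00\x00\x14\x00\x50\xc8\x3a"
print(hex(mix32(zlib.crc32(key))))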

--
++ytti
Re: LAG/ECMP hash performance
On Thu, 29 Aug 2019 at 23:10, <adamv0025@netconsultings.com> wrote:

> All this time watching the thread I've been wondering whether in some
> cases, no matter how good the hash value is, the available buckets
> skew the balancing. (And I guess that's why there are knobs to shift
> the hash around.)

Obviously there is no perfect solution, but alarm bells should ring if
you notice a pattern to the imbalance, such as closely grouped flows.
With ideal diffusion the pattern would be random: any 5-tuple is
equally likely to collide with any other 5-tuple. With CRC it seems
some 5-tuples are far more likely to collide with certain other
5-tuples, and closely grouped, high-volume traffic occurs naturally in
very typical network scenarios, like CDN-internal traffic or SP VOD
distribution.
Bear in mind the solution the NOK owner found for his problem: adding a
completely static set of input bits! Which does support (or at least
adds confirmation bias to :) the explanation that the hash has poor
diffusion quality.

--
++ytti
Re: LAG/ECMP hash performance
On Wed, 28 Aug 2019 at 08:21, Saku Ytti <saku@ytti.fi> wrote:
> SRC: (single 100GE interface, single unit 0)
> 23.A.B.20 .. 23.A.B.46
> TCP/80
> DST: (N*10GE LACP)
> 157.C.D.20 .. 157.C.D.35
> TCP 2074..65470 (RANDOM; this alone, with everything else static,
> should have guaranteed fair balancing)
>
> I'm running this through IXIA and my results are:
>
> 3*10GE Egress:
> port1 10766516pps
> port2 10766543pps
> port3 7536578pps
> after (set forwarding-options enhanced-hash-key family inet
> incoming-interface-index)
> port1 9689881pps
> port2 11791986pps
> port3 5383270pps
> after removing s-int-index and setting adaptive
> port1 9689889pps
> port2 9689892pps
> port3 9689884pps
>
> I think this supports the idea that the hash function diffuses poorly.
> It should be noted that the 2nd step adds entirely _static_ bits to the
> input of the hash; the source interface does not change. And it's
> perfectly repeatable. This is to be expected: the weak bit positions
> shift, making the problem either worse or better.
> I.e. the flows are 100% perfectly hashable, just not without the hash
> biasing the results. There aren't any elephants.
>
>
> 4*10GE Egress:
> port1 4306757pps
> port2 8612807pps
> port3 9689893pps
> port4 6459931pps
> after adding incoming-interface-index
> port1 6459922pps
> port2 8613236pps
> port3 9691485pps
> port4 4306620pps
> after removing s-index and adding adaptive:
> port1 7536562pps
> port2 7536593pps
> port3 6459928pps
> port4 7536566pps
> after removing adaptive and adding no-destination-port + no-source-port
> port1: 5383279pps
> port2: 9689886pps
> port3: 7536588pps
> port4: 6459922pps
> after removing no-source-port (i.e. destination port is used for hash)
> port1: 8613235pps
> port2: 5383272pps
> port3: 5383274pps
> port4: 9689884pps
>
> It is curious that it actually balances more fairly without using TCP
> ports at all, even though there is _tons_ of entropy there due to the
> random DPORT!

Better late than never....

100G link from Ixia to ASR9K Hu0/1/0/3, with a pseudowire attachment
interface configured on Hu0/1/0/3.4001, and 3x100G core-facing LAG
links (Hu0/0/0/0, Hu0/0/0/5, Hu0/0/0/6).

The packet stream sent from Ixia has an Ethernet header with a random
dest MAC and random src MAC, and VLAN ID 4001 to match the pseudowire
AC; IPv4 headers are next with a random dest IP and random src IP, and
TCP headers follow with a random dest port and random src port. The
payload is random data. Frame size is 1522 bytes.

Everything is re-randomised every frame. Sending ~100Mbps of traffic...

The default load-balancing method on ASR9K for L2VPNs is
per-pseudowire, so initially everything falls onto one core-facing LAG
member:

ar0-ws.bllab Monitor Time: 00:15:42 SysUptime: 312:06:24
Last Clear: 00:10:36
Protocol:General
Interface In(bps) Out(bps) InBytes/Delta OutBytes/Delta
Hu0/1/0/3 99.1M/ 0% 0/ 0% 2.7G/24.8M 0/0
Hu0/0/0/0 11000/ 0% 15000/ 0% 495110/3226 639642/4198
Hu0/0/0/5 12000/ 0% 100.3M/ 0% 467544/2958 2.7G/25.1M
Hu0/0/0/6 13000/ 0% 12000/ 0% 523510/3328 483334/3020


Switch to src+dst MAC load-balancing and we get a more or less perfect
distribution:
!
l2vpn
load-balancing flow src-dst-mac
!

ar0-ws.bllab Monitor Time: 00:20:56 SysUptime: 312:11:38
Last Clear: 00:17:02
Protocol:General
Interface In(bps) Out(bps) InBytes/Delta OutBytes/Delta
Hu0/1/0/3 99.7M/ 0% 0/ 0% 2.9G/24.9M 0/0
Hu0/0/0/0 12000/ 0% 31.7M/ 0% 371774/2972 993.0M/8.6M
Hu0/0/0/5 12000/ 0% 33.4M/ 0% 366524/2958 980.9M/8.1M
Hu0/0/0/6 12000/ 0% 33.3M/ 0% 373604/3442 979.3M/8.4M


When switching to src+dst IP load-balancing we get basically the same
distribution:
!
l2vpn
load-balancing flow src-dst-ip
!

ar0-ws.bllab Monitor Time: 00:23:22 SysUptime: 312:14:04
Last Clear: 00:21:58
Protocol:General
Interface In(bps) Out(bps) InBytes/Delta OutBytes/Delta
Hu0/1/0/3 99.7M/ 0% 0/ 0% 1.0G/24.9M 0/0
Hu0/0/0/0 11000/ 0% 31.2M/ 0% 135550/2888 355.8M/8.4M
Hu0/0/0/5 12000/ 0% 33.6M/ 0% 131840/3396 353.1M/8.4M
Hu0/0/0/6 12000/ 0% 33.4M/ 0% 134639/3091 351.1M/8.3M

The Tomahawk NPU is using CRC32 for load-balancing, so I'm not sure why
the MX2020 box you tested was so uneven if it is also using CRC32. It
could be implementation-specific, as you mentioned with the Nokia owner
who added a 32b static value. Despite there being TCP headers on top of
the IP headers, if I remove TCP, or set the TCP ports to static,
random, incrementing etc. values, it has no impact on the above, so the
ASR9K isn't feeding layer 4 keys into the CRC32 (which is exactly what
the Cisco documentation states).

This is not 100% apples to apples, because I'm interested in testing
how pseudowire traffic is load-balanced towards the core and I expect
you're looking at layer 3 ECMP; however, it is kind of the same: the
pseudowire ingress PE has access to the layer 2/3/4 headers of the
L2VPN payload traffic, so it has the same keys to feed into a CRC32.

Just a 2nd data point for you...

Cheers,
James.
Re: LAG/ECMP hash performance
On Tue, 26 Nov 2019 at 18:27, James Bensley
<jwbensley+juniper-nsp@gmail.com> wrote:

> The Tomahawk NPU is using CRC32 for load-balancing, so I'm not sure why
> the MX2020 box you tested was so uneven if it is also using CRC32. It
> could be

Pure CRC32 would be terrible; JNPR is doing something like CRC32
rotated by CRC16, which is not horrible.

> This is not 100% apples to apples, because I'm interested in testing
> how pseudowire traffic is load-balanced towards the core and I expect
> you're looking at layer 3 ECMP; however, it is kind of the same: the
> pseudowire ingress PE has access to the layer 2/3/4 headers of the
> L2VPN payload traffic, so it has the same keys to feed into a CRC32.

You were also using a lot of random keys, which should yield a good
result. I was copying what I saw in production, which is also
reasonable. There will be pathological biases in balancing with an
unlucky set of keys, because the algorithms aren't perfectly diffusing.
If I use random keys, I get more or less perfect balancing, but in this
example I was not using random IP addresses. Anyhow, adaptive balancing
fixes it, and of course it also fixes elephant flows, something which a
perfectly diffusing algorithm can't fix, so adaptive load balancing is
a superior solution to a better algorithm anyhow.
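
Conceptually adaptive balancing is something like this toy sketch (not
Juniper's actual implementation): periodically measure per-member load
and move the busiest bucket from the hottest member to the coldest one,
at the cost of briefly reordering the flows in the moved bucket.

def rebalance(bucket_to_member, bucket_bytes, n_members):
    # One rebalance pass; bucket_bytes[b] is observed traffic per bucket.
    member_load = [0] * n_members
    for b, m in enumerate(bucket_to_member):
        member_load[m] += bucket_bytes[b]
    hot = max(range(n_members), key=member_load.__getitem__)
    cold = min(range(n_members), key=member_load.__getitem__)
    # Move the hot member's busiest bucket onto the cold member.
    hot_buckets = [b for b, m in enumerate(bucket_to_member) if m == hot]
    busiest = max(hot_buckets, key=bucket_bytes.__getitem__)
    bucket_to_member[busiest] = cold
    return bucket_to_member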

--
++ytti