Mailing List Archive

[lvs-users] LVS-TUN BUG? - Intermittent incorrect source IP in TUN header for non-local realservers (PBR no help)
Hello fellow LVS/IPVS users,

I have been running IPVS for a few years with great success, but I have
recently run into a very perplexing problem in TUN mode that looks like
stale/leaked routing lookups.

On an IPVS director with multiple interfaces (and policy-based routing),
the IPVS TUN code emits IPIP packets from the correct interface
(according to PBR) but sometimes selects the wrong source IP for the
TUN IP header. This is reproducible on our systems, but is _VERY_
infrequent (e.g. 354 out of 1.5 million packets had an incorrect/invalid
source IP on our system in the last 24 hours).

Details of the failing IPVS director as follows:

<snip>
uname -r
3.18.14 (mainline vanilla)

head -1 /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=32768)

ip route show
default via 172.23.200.1 dev vlan200
172.23.10.0/24 dev vlan10 proto kernel scope link src 172.23.10.11
172.23.20.0/24 dev vlan20 proto kernel scope link src 172.23.20.11
172.23.200.0/24 dev vlan200 proto kernel scope link src 172.23.200.11
192.168.254.0/27 dev vlan500 proto kernel scope link src 192.168.254.15

ip rule show
0: from all lookup local
32764: from all to 192.168.254.32/27 lookup 3
32765: from 192.168.254.0/27 lookup 3
32766: from all lookup main
32767: from all lookup default

ip route show table 3
default via 192.168.254.1 dev vlan500 src 192.168.254.15

ipvsadm -Ln
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  254 wlc
  -> 192.168.254.48:0             Tunnel  1      0          1528
FWM  255 wlc
  -> 172.23.10.100:0              Tunnel  1      0          35

</snip>

In the following examples I have a remote client (8.8.8.8) attempting
to hit an LVS-TUN service (4.4.4.4:443). The packets are marked in the
iptables mangle table and are matched by IPVS using FWMARK (254 or 255).

For IPVS clusters with realservers on locally connected networks (FWM
255), the emitted TUN packets seem to work as expected 100% of the time.
E.g. IPVS encapsulates the original packet (src 8.8.8.8 dst 4.4.4.4) in
an IP header (src 172.23.10.11 dst 172.23.10.100) and emits the packet
via the vlan10 interface.

<snip>
tcpdump -n -i vlan10
IP 172.23.10.11 > 172.23.10.100: IP 8.8.8.8.62757 > 4.4.4.4.443: Flags
[S], seq 2658459997, win 65535, options [mss 1460,nop,wscale
5,nop,nop,TS val 832819348 ecr 0,sackOK,eol], length 0 (ipip-proto-4)
</snip>

If instead the inbound traffic is handled by an IPVS cluster with
remote L3 realservers, things work most of the time, and then the
unexpected happens.

IPVS clusters with realservers on a remote L3 network (FWM 254), IPVS
encapsulates the original packet (src 8.8.8.8 dst 4.4.4.4) in a IP
header (src 192.168.254.15 dst 192.168.254.48) and emits the packet via
the vlan500 interface passing it to the L2 (mac address) of
192.168.254.1.
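
The expected source and interface for both realservers can be sketched
with a simplified model of the rule and route tables listed above. This
is only an illustration and not how the kernel implements lookups: the
names TABLES, RULES and lookup are made up for this sketch, the "local"
and "default" tables are left out, and each table does a plain
longest-prefix match.

```python
import ipaddress

# Route tables copied from the "ip route show" output in this thread.
# Each entry: (prefix, device, gateway or None, preferred source or None).
TABLES = {
    "main": [
        ("0.0.0.0/0", "vlan200", "172.23.200.1", None),
        ("172.23.10.0/24", "vlan10", None, "172.23.10.11"),
        ("172.23.20.0/24", "vlan20", None, "172.23.20.11"),
        ("172.23.200.0/24", "vlan200", None, "172.23.200.11"),
        ("192.168.254.0/27", "vlan500", None, "192.168.254.15"),
    ],
    "3": [("0.0.0.0/0", "vlan500", "192.168.254.1", "192.168.254.15")],
}

# Policy rules copied from "ip rule show": (priority, from, to, table).
# None stands for "all".
RULES = [
    (32764, None, "192.168.254.32/27", "3"),
    (32765, "192.168.254.0/27", None, "3"),
    (32766, None, None, "main"),
]


def _matches(prefix, addr):
    """True when the selector is "all" or the address falls inside it."""
    if prefix is None:
        return True
    if addr is None:
        return False
    return ipaddress.ip_address(addr) in ipaddress.ip_network(prefix)


def lookup(dst, src=None):
    """Return (table, device, gateway, preferred source) for dst, or None."""
    for _prio, rule_from, rule_to, table in sorted(RULES, key=lambda r: r[0]):
        if not (_matches(rule_from, src) and _matches(rule_to, dst)):
            continue
        # Longest-prefix match inside the table selected by the rule.
        candidates = [r for r in TABLES[table] if _matches(r[0], dst)]
        if candidates:
            best = max(candidates,
                       key=lambda r: ipaddress.ip_network(r[0]).prefixlen)
            return (table,) + best[1:]
    return None


if __name__ == "__main__":
    print(lookup("172.23.10.100"))   # local realserver (FWM 255)
    print(lookup("192.168.254.48"))  # remote L3 realserver (FWM 254)
```

In this model the remote realserver is matched by rule 32764 and routed
through table 3, which is why 192.168.254.15 on vlan500 is the source I
expect to see in the TUN header.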

This worked swimmingly well until I noticed that very intermittently
the following happens:

IPVS encapsulates the original packet (src 8.8.8.8 dst 4.4.4.4) in a IP
header (src 172.23.10.11 dst 192.168.254.48) and emits the packet via
the vlan500 interface passing it to the mac address of 192.168.254.1.
(NOTE: the use of 172.23.10.11 rather than the expected 192.168.254.48
source IP in the TUN header)

Example of broken packet via tcpdump
<snip>
tcpdump -n -i vlan500
IP 172.23.10.11 > 192.168.254.48: IP 8.8.8.8.54063 > 4.4.4.4.443: Flags
[S], seq 1418130791, win 65535, options [mss 1386,nop,wscale
5,nop,nop,TS val 1213718660 ecr 0,sackOK,eol], length 0 (ipip-proto-4)
</snip>

So in the last 24 hours out of 1.5M packets emitted by IPVS on vlan500
(FWM 254) I had 349 packets which get emitted with the wrong source IP
address in the Tunnel IP header. The periods where the wrong source IP
is used by IPVS seem to last for ~2-5min at a time, and affects all
traffic in the LVS cluster with remote L3 realservers.
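
A count like this can be obtained by filtering a text capture. The
following is a rough sketch and not necessarily how the numbers above
were produced: it assumes one packet per line in the format printed by
"tcpdump -n", and EXPECTED_SRC and count_bad_sources are names chosen
for this example only.

```python
import re
from collections import Counter

# Outer IPIP header as printed by "tcpdump -n": "IP <src> > <dst>: IP ..."
OUTER_RE = re.compile(
    r"\bIP (\d+\.\d+\.\d+\.\d+) > (\d+\.\d+\.\d+\.\d+): IP ")

# Assumption for this sketch: the only valid tunnel source on vlan500.
EXPECTED_SRC = "192.168.254.15"


def count_bad_sources(lines, expected_src=EXPECTED_SRC):
    """Return (number of IPIP packets, Counter of unexpected outer sources)."""
    total = 0
    bad = Counter()
    for line in lines:
        m = OUTER_RE.search(line)
        if not m:
            continue  # not an IPIP packet line
        total += 1
        src = m.group(1)
        if src != expected_src:
            bad[src] += 1
    return total, bad


if __name__ == "__main__":
    # Shortened, illustrative lines in the same shape as the captures above.
    sample = [
        "IP 192.168.254.15 > 192.168.254.48: IP 8.8.8.8.62757 > 4.4.4.4.443: Flags [S]",
        "IP 172.23.10.11 > 192.168.254.48: IP 8.8.8.8.54063 > 4.4.4.4.443: Flags [S]",
    ]
    total, bad = count_bad_sources(sample)
    print(total, dict(bad))
```

Grouping the bad packets by source address also makes it easy to see
which local IPs are being picked up.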

Another interesting meta-point: on an LVS director with lots of
interfaces and active IPVS clusters, I only see IPVS select invalid
source IPs which meet the following criteria:

1. The IP exists on one of the director's local interfaces.
2. The IP is currently used as a source IP for TUN traffic in other
IPVS clusters that are currently active (passing TUN traffic to local
realservers).

I am guessing this is related to the "saddr" address being stale/not
reinitialized/leaked in the do_output_route4 function of ip_vs_xmit.c?

I'm not sure what would cause this type of issue in IPVS, or if it's
really related to recent routing changes in the kernel?

Anyone have thoughts/suggestions?

Thanks

-Mike

--
Michael Vallaly <lvs@nolatency.com>

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@LinuxVirtualServer.org
Send requests to lvs-users-request@LinuxVirtualServer.org
or go to http://lists.graemef.net/mailman/listinfo/lvs-users
Re: [lvs-users] LVS-TUN BUG? - Intermittent incorrect source IP in TUN header for non-local realservers (PBR no help)
Hello,

On Wed, 17 Jun 2015, Michael Vallaly wrote:

> IPVS clusters with realservers on a remote L3 network (FWM 254), IPVS
> encapsulates the original packet (src 8.8.8.8 dst 4.4.4.4) in a IP
> header (src 192.168.254.15 dst 192.168.254.48) and emits the packet via
> the vlan500 interface passing it to the L2 (mac address) of
> 192.168.254.1.
>
> This worked swimmingly well until I noticed that very intermittently
> the following happens:
>
> IPVS encapsulates the original packet (src 8.8.8.8 dst 4.4.4.4) in a IP
> header (src 172.23.10.11 dst 192.168.254.48) and emits the packet via
> the vlan500 interface passing it to the mac address of 192.168.254.1.
> (NOTE: the use of 172.23.10.11 rather than the expected 192.168.254.48
> source IP in the TUN header)

May be IP 192.168.254.15 was removed?

__ip_vs_get_out_rt should provide previous saddr but
after commit 026ace060dfe ("ipvs: optimize dst usage for real server")
we always provide 0.0.0.0 as initial source, so
do_output_route4 should always get fresh source address
and then will get second route with this source.

So, now on dst_cache refresh we do not try
to preserve the previous saddr.

If you see different address here, it means
it is returned by routing. The routing cache does
not keep source addresses but nexthops can remember
source returned by fib_info_update_nh_saddr. It
should be from the same subnet because you have
"via 192.168.254.1". Otherwise, first address from
device or system is returned.

Also __ip_vs_dst_cache_reset is called on
dest add/edit, for dests coming from trash...
I'll think more on this problem but for now I don't
see what can be the cause.

> So in the last 24 hours out of 1.5M packets emitted by IPVS on vlan500
> (FWM 254) I had 349 packets which get emitted with the wrong source IP
> address in the Tunnel IP header. The periods where the wrong source IP
> is used by IPVS seem to last for ~2-5min at a time, and affects all
> traffic in the LVS cluster with remote L3 realservers.

Do you see same IP/routing config when this
happens?

Regards

--
Julian Anastasov <ja@ssi.bg>

Re: [lvs-users] LVS-TUN BUG? - Intermittent incorrect source IP in TUN header for non-local realservers (PBR no help)
On Thu, 18 Jun 2015 10:45:12 +0300 (EEST)
Julian Anastasov <ja@ssi.bg> wrote:

> Hello,
Hi,

> On Wed, 17 Jun 2015, Michael Vallaly wrote:
>
> > IPVS clusters with realservers on a remote L3 network (FWM 254), IPVS
> > encapsulates the original packet (src 8.8.8.8 dst 4.4.4.4) in a IP
> > header (src 192.168.254.15 dst 192.168.254.48) and emits the packet via
> > the vlan500 interface passing it to the L2 (mac address) of
> > 192.168.254.1.
> >
> > This worked swimmingly well until I noticed that very intermittently
> > the following happens:
> >
> > IPVS encapsulates the original packet (src 8.8.8.8 dst 4.4.4.4) in a IP
> > header (src 172.23.10.11 dst 192.168.254.48) and emits the packet via
> > the vlan500 interface passing it to the mac address of 192.168.254.1.
> > (NOTE: the use of 172.23.10.11 rather than the expected 192.168.254.48
> > source IP in the TUN header)
>
> May be IP 192.168.254.15 was removed?
>
Sorry, I fat-fingered the 192.168.254.15 IP address in the email. The
sentence should have read "the use of 172.23.10.11 rather than the
expected _192.168.254.15_ source IP in the TUN header".

> __ip_vs_get_out_rt should provide previous saddr but
> after commit 026ace060dfe ("ipvs: optimize dst usage for real server")
> we always provide 0.0.0.0 as initial source, so
> do_output_route4 should always get fresh source address
> and then will get second route with this source.
>
If my understanding of this is correct this means that the IPVS TUN
code just binds its socket to all interfaces, which leverages the
kernel routing code to select the "best" interface?

As far as I can tell this seems to affect all traffic emitted from the
IPVS TUN code for this IPVS cluster during the timeframe. Eg. I see no
"valid" src_ip packets emitted for the remote L3 realserver during the
timeframe. I confirmed this happens even with multiple local
realservers in the same cluster as well.

> So, now on dst_cache refresh we do not try
> to preserve the previous saddr.
>
> If you see different address here, it means
> it is returned by routing. The routing cache does
> not keep source addresses but nexthops can remember
> source returned by fib_info_update_nh_saddr. It
> should be from the same subnet because you have
> "via 192.168.254.1". Otherwise, first address from
> device or system is returned.
>
I was under the impression that the routing cache was removed from the
kernel after 3.6? I see there is some sort of i4flow caching that
seems to be done now in the fib_trie, but I am not very familiar with
it. Do you know of a way to dump/monitor the route nexthop information?
I had attempted previous to my email to use "ip monitor
route/neigh/link" and I don't see any netlink events around/during the
incorrect packet emission timeframes.

> Also __ip_vs_dst_cache_reset is called on
> dest add/edit, for dests coming from trash...
> I'll think more on this problem but for now I don't
> see what can be the cause.

To be explicitly clear the IPVS config for these tests is static (the
realserver IP never changes), and no interfaces / route / policy changes
take place on the machine.

I have been suspecting that, since the packets get emitted correctly
more than 99% of the time, the routing code obviously works the
majority of the time, but maybe we are missing a "use" count and the
route to the NH eventually expires? Maybe there is a corner-case code
path that is getting executed? Unfortunately I don't understand why,
even if this were the case, a route lookup would ever select a SRC IP
from a different interface entirely, especially since the packet gets
emitted out the correct interface (vlan500). Given that the policy
route is in place which explicitly defines what interface / next hop to
use, and there is a local IP bound to the interface which is directly
connected to the nexthop subnet, why would any other interface IP be
returned from the routing code?

Additionally strange/convenient seems to be the observed use of SRC_IP
of interfaces currently being used by other IPVS clusters.

E.g. on a box with 20 vlan interfaces, 100% of the errant SRC_IPs
(7) were sourced from VLANs actively running other IPVS clusters (with
locally connected realservers). The IPs of the remaining 13 vlan
interfaces never got used by the IPVS xmit code (but maybe I just need
to wait longer). I don't see, for example, the consistent use of only
the first route/interface in the main routing table, which would seem
to me to be a more natural fallback/last resort.

>
> > So in the last 24 hours out of 1.5M packets emitted by IPVS on vlan500
> > (FWM 254) I had 349 packets which get emitted with the wrong source IP
> > address in the Tunnel IP header. The periods where the wrong source IP
> > is used by IPVS seem to last for ~2-5min at a time, and affects all
> > traffic in the LVS cluster with remote L3 realservers.
>
> Do you see same IP/routing config when this
> happens?

Yup, no IP/routing changes are being made to the system.

I can fairly easily reproduce this behavior, and would be happy to
try / provide any additional discovery/testing to get to the bottom of
this.

All suggestions / speculation welcome ;)

Thanks!

-Mike

>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>


--
Michael Vallaly <lvs@nolatency.com>

Re: [lvs-users] LVS-TUN BUG? - Intermittent incorrect source IP in TUN header for non-local realservers (PBR no help)
Hello,

On Thu, 18 Jun 2015, Michael Vallaly wrote:

> If my understanding of this is correct this means that the IPVS TUN
> code just binds its socket to all interfaces, which leverages the
> kernel routing code to select the "best" interface?

We simply get the source from routing, but ensure
with a second lookup that IP rules are considered.
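
The two-step lookup can be sketched roughly as below. This is a
user-space toy and not the code in ip_vs_xmit.c: route, output_route
and the trimmed-down RULES list are names invented for this example,
and the only check on a supplied source is whether it is a local
address.

```python
import ipaddress

# Local addresses on the director (subset used in this example).
LOCAL_ADDRS = {"172.23.10.11", "192.168.254.15"}

# (from-prefix or None, to-prefix or None, device, preferred source).
# A trimmed-down view of the rules and routes posted in this thread.
RULES = [
    (None, "192.168.254.32/27", "vlan500", "192.168.254.15"),  # rule 32764
    ("192.168.254.0/27", None, "vlan500", "192.168.254.15"),   # rule 32765
    (None, "172.23.10.0/24", "vlan10", "172.23.10.11"),        # main table
]


def _in(prefix, addr):
    """True when the selector is "all" or the address falls inside it."""
    return prefix is None or (
        addr is not None
        and ipaddress.ip_address(addr) in ipaddress.ip_network(prefix))


def route(dst, src=None):
    """One lookup: return (device, source); ValueError for a non-local src."""
    if src is not None and src not in LOCAL_ADDRS:
        raise ValueError("source is not a local address")
    for rule_from, rule_to, dev, pref_src in RULES:
        if _in(rule_from, src) and _in(rule_to, dst):
            return dev, src or pref_src
    raise LookupError("no route")


def output_route(dst, hint=None):
    """Two-step lookup: learn the source first, then route again with it."""
    try:
        dev, src = route(dst, hint)
    except ValueError:
        dev, src = route(dst, None)   # invalid hint: start over without it
    if hint != src:
        dev, src = route(dst, src)    # second pass so "from" rules apply
    return dev, src


if __name__ == "__main__":
    print(output_route("192.168.254.48"))                       # no hint
    print(output_route("192.168.254.48", hint="172.23.10.11"))  # stale hint
```

In this model a local address taken from another interface is accepted
as the source, and the "to 192.168.254.32/27" rule still selects
vlan500, so the packet leaves the right interface with the wrong saddr.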

> I was under the impression that the routing cache was removed from the
> kernel after 3.6? I see there is some sort of i4flow caching that
> seems to be done now in the fib_trie, but I am not very familiar with
> it. Do you know of a way to dump/monitor the route nexthop information?

Hm, no.

> I had attempted previous to my email to use "ip monitor
> route/neigh/link" and I don't see any netlink events around/during the
> incorrect packet emission timeframes.

Yes, we still have caching but it is bound.

> Additionally strange/convenient seems to be the observed use of SRC_IP
> of interfaces currently being used by other IPVS clusters.

OK, the route is there, only the saddr is from another service.
I see a bug that can explain it: ip_vs_dest_dst_alloc uses
kmalloc, not kzalloc, so saddr is not always 0 (0.0.0.0).
If random memory is provided, we get -EINVAL and then we
get a valid address. But what if memory freed by another
dest_dst (containing the saddr used by another dest IP) is
allocated? Routing agrees because this saddr is valid, so we
get a valid local address, only it is not suitable for
our target IP. So, you are probably right about this...
we get saddr from other dests/services.

You can fix and test it by changing
ip_vs_dest_dst_alloc to use kzalloc. Let me know if
you prefer a patch from me or if you can change it
directly for the test. If that works, we can prepare it
as a patch for mainline, because I plan to provide
the old saddr as was done before the wrong change.
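
The suspected mechanism can be modelled in user space. The sketch below
is a toy and not kernel code: Slab, route_lookup and pick_tunnel_source
are hypothetical names, the "slab" just hands back the most recently
freed buffer, and the routing step only checks whether the hinted
source is a local address.

```python
import ipaddress

# Addresses on the director and the source routing would normally choose.
LOCAL_ADDRS = {"172.23.10.11", "192.168.254.15"}
PREFERRED_SRC = {"172.23.10.100": "172.23.10.11",
                 "192.168.254.48": "192.168.254.15"}


class Slab:
    """Toy allocator: recycles freed 4-byte buffers, optionally zeroing."""

    def __init__(self):
        self.free_list = []

    def alloc(self, zero):
        # A brand-new buffer happens to be zeroed here; in the kernel the
        # contents of non-zeroed memory would be arbitrary.
        buf = self.free_list.pop() if self.free_list else bytearray(4)
        if zero:
            buf[:] = bytes(4)   # kzalloc-like: always cleared
        return buf              # kmalloc-like: old contents survive

    def free(self, buf):
        self.free_list.append(buf)


def route_lookup(dst, saddr):
    """Return the source to use, or None (think -EINVAL) for a bad hint."""
    if saddr == "0.0.0.0":
        return PREFERRED_SRC[dst]   # no hint: routing chooses the source
    if saddr in LOCAL_ADDRS:
        return saddr                # any valid local address is accepted
    return None


def pick_tunnel_source(slab, dst, zero):
    """Allocate a cache entry for dst and resolve its tunnel source."""
    entry = slab.alloc(zero)
    hint = str(ipaddress.ip_address(bytes(entry)))
    src = route_lookup(dst, hint)
    if src is None:                 # garbage hint: retry without one
        src = route_lookup(dst, "0.0.0.0")
    entry[:] = ipaddress.ip_address(src).packed
    return entry, src


def demo(zero):
    slab = Slab()
    entry, _ = pick_tunnel_source(slab, "172.23.10.100", zero)
    slab.free(entry)                # entry of the other dest is released
    _, src = pick_tunnel_source(slab, "192.168.254.48", zero)
    return src


if __name__ == "__main__":
    print("kmalloc-like:", demo(zero=False))
    print("kzalloc-like:", demo(zero=True))
```

In this model the recycled, non-zeroed entry yields 172.23.10.11 for
the remote realserver, while the zeroed one yields 192.168.254.15.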

Regards

--
Julian Anastasov <ja@ssi.bg>
