Mailing List Archive

[lvs-users] UDP packet loss when real server removed from farm
I'm using keepalived to distribute DNS requests (UDP port 53) to a
group of DNS servers. The farm uses source hashing. The environment is
RHEL with the stock keepalived and IPVS packages; I've reproduced the
problem on RHEL 7.2, 6.8, and an older 6.x release.
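
For reference, here is roughly what an equivalent farm looks like when set
up by hand with ipvsadm (the addresses are made-up placeholders, not our
real ones):

  # Made-up addresses, for illustration only.
  VIP=192.0.2.10
  ipvsadm -A -u ${VIP}:53 -s sh                    # UDP service, source-hash scheduler
  for RIP in 192.0.2.21 192.0.2.22 192.0.2.23 192.0.2.24 192.0.2.25; do
      ipvsadm -a -u ${VIP}:53 -r ${RIP}:53 -g -w 1 # -g = direct routing to the real server
  done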

When a health check fails and keepalived takes a real server out of the
farm, tests show that a client using the removed server has its packets
discarded until it is remapped to a new server. I can also provoke the
problem without keepalived, by using ipvsadm to remove a real server from
the farm.
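
The manual reproduction is just deleting one real server from the UDP/53
service and watching the table (same placeholder addresses as above):

  ipvsadm -d -u 192.0.2.10:53 -r 192.0.2.23:53   # drop one real server from the farm
  ipvsadm -Ln                                    # confirm it is gone from the service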

I ran tcpdump on the load-balancing server during the test. When the
IPVS load balancing is working as expected, I see the packets arrive
on the incoming interface (a 2-interface bond) and then immediately get
forwarded to a real server. We are using direct routing (DR), so there's
no manipulation of the IP headers.
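
The capture was nothing fancy, roughly along these lines (the bond
interface name is a placeholder for whatever your bond is called):

  # Requests arrive on the bond and, with direct routing, leave for the real
  # server with the IP header untouched (only the destination MAC changes).
  tcpdump -ni bond0 udp port 53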

After the real server is removed from the farm, requests from clients
that were hashed to that server still arrive, but they are not forwarded
out. I haven't worked out all the numbers yet, but on a farm handling
roughly 7500 requests per second, removing one of the five real servers
leaves around 3400 requests unforwarded. Under various test scenarios it
can take as long as a second for the farm to behave normally again from
the affected clients' perspective, and the problem gets worse as the
request rate increases.

I didn't see any loss for clients who were not using the removed server
during the transition. I also didn't see any loss when a real server
was added back into the farm.

When I change the farm from source hashing to round-robin, the problem
is reduced by an order of magnitude - instead of hundreds of lost
requests, I get at most a few dozen.

I'm kind of stuck at this point, as I don't know much about IPVS internals.
I've looked at the IPVS stats in /proc, but those only cover packets that
were successfully processed; there don't seem to be any counters for errors
or drops.
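
For the record, these are the counters I mean; as far as I can tell they
only count what IPVS handled, with nothing for drops:

  cat /proc/net/ip_vs_stats     # aggregate conns/packets/bytes, in and out
  ipvsadm -Ln --stats           # per-service and per-real-server counters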

iptables is in use on the load balancer hosts (a very short ruleset with 3
or 4 drop rules), but in my test environment I saw no difference when the
iptables modules were unloaded ("service iptables stop", then confirmed with
lsmod). The modules iptables uses in NAT mode (I think it's nf_conntrack and
a couple of others) are already blacklisted, as they caused havoc one day
last year when they were accidentally loaded.
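
What I mean by "unloaded", roughly (the module names in the grep are just
the usual suspects):

  iptables -vnL                 # the handful of drop rules, with packet counters
  service iptables stop         # on RHEL 6 this flushes the rules and unloads the modules
  lsmod | grep -E 'ip_tables|iptable_|nf_conntrack'   # confirm nothing is left loaded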

So my questions are:

* Could there be a bug in the connection table code when a real server
is removed and the farm mappings have to be recalculated?

* Is it realistic to expect that no packets will be dropped when a real
server is removed from the farm?

* If not, what can I do to minimize the packet loss?

Thanks,

-- Ed

Re: [lvs-users] UDP packet loss when real server removed from farm
Ed,

I faced a similar issue with HTTP traffic. The cause for me was that, by
default, keepalived removes a real server from the configuration when it is
detected as down, breaking any sessions that were going to it.

If you specify one of the weighted scheduling algorithms and use
'inhibit_on_failure', then instead of removing the real server from the
config keepalived will mark it with a weight of 0. Existing connections will
be able to complete, and new ones will be routed to a different destination.
Weighted round robin might be a good fit.
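
Roughly what that looks like in keepalived.conf (made-up addresses, health
check omitted), plus the hand-rolled ipvsadm equivalent of the quiesce,
which is just setting the weight to 0 instead of deleting the entry:

  # Hypothetical keepalived.conf fragment: inhibit_on_failure keeps the real
  # server in the IPVS table with weight 0 when its health check fails.
  cat >> /etc/keepalived/keepalived.conf <<'EOF'
  virtual_server 192.0.2.10 53 {
      lb_algo wrr
      lb_kind DR
      protocol UDP
      real_server 192.0.2.21 53 {
          weight 1
          inhibit_on_failure
          # health check block omitted
      }
  }
  EOF

  # Roughly equivalent by hand: quiesce instead of delete.
  ipvsadm -e -u 192.0.2.10:53 -r 192.0.2.21:53 -w 0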

One caveat: when I did this we had the persistence value set to non-zero,
and that made some busy clients never drain from the node that was removed.
To work around it I set the 'net.ipv4.vs.expire_quiescent_template' sysctl
to 1. Without that setting, even when the real server was hard down, IPVS
(an admittedly older version) would keep sending traffic to the down
server.
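
Setting it is just a sysctl (persist it in /etc/sysctl.conf or sysctl.d if
it turns out to help):

  # Expire persistence templates that still point at quiesced (weight 0) servers.
  sysctl -w net.ipv4.vs.expire_quiescent_template=1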

Phillip Moore

Re: [lvs-users] UDP packet loss when real server removed from farm
Our farm uses source hashing, not round-robin, so I don't think the weight
setting and inhibit_on_failure are applicable. I had already tried that, but
when a farm host was removed via ipvsadm there was no rehash; the clients
just had their packets dropped until the server was put back in the farm.

Persistence is an interesting issue, though. We're using UDP, so it doesn't
work the same way as in TCP applications. For DNS there are no "connections
in progress" from the client's point of view: it's one UDP packet out, the
server responds with an answer, and the transaction is over. But LVS sees it
differently; it can't parse the UDP payload, and it appears to implement a
timeout based on when the last packet arrived from the same source IP and
source port. As this is DNS, the client could be a caching server for a busy
network or busy host and thus have a continual stream of different requests,
but none of them are related to each other, so they're not really a
"session".
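
If I understand the tooling correctly, that per-source state is visible in
the connection table, and the UDP timeout it uses can be inspected and
adjusted:

  ipvsadm -Lnc                  # one entry per client source IP/port for UDP
  ipvsadm -L --timeout          # show the tcp / tcpfin / udp timeouts (seconds)
  ipvsadm --set 900 120 30      # example values: keep the TCP defaults, shorten UDP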

I'll test out the expire_quiescent_template option next and see if it
changes anything.

Thanks,

-- Ed


--
Ed Ravin   | Warning - this email may contain rhetorical
           | devices, metaphors, analogies, typographical
eravin@    | errors, or just plain snarkiness. A sense of
panix.com  | humor may be required for proper interpretation.

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@LinuxVirtualServer.org
Send requests to lvs-users-request@LinuxVirtualServer.org
or go to http://lists.graemef.net/mailman/listinfo/lvs-users