Mailing List Archive

[lvs-users] IPVS adding a 1s delay on connection establishment under moderately high number of TCP req/s
Hello,

We detected a problem with the IPVS module. Here's a quick summary of
what triggers it:

- IPVS has a hardcoded TIME_WAIT timeout of 120s (see the quick
check after this list)
- the kernel's TCP/IP stack has a hardcoded TIME_WAIT timeout of 60s
- the connection rescheduling mechanism in IPVS works by dropping the
first SYN it receives and letting the client retransmit the SYN after
the (also hardcoded) initial RTO, which in practice is 1s
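
As a quick check of the first point (assuming ipvsadm is installed),
the IPVS connection table can be dumped together with its per-entry
expiry timers; entries in TIME_WAIT show a countdown starting near
120s:

    # list IPVS connections with protocol, remaining expiry and state
    ipvsadm -L -n -c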

Here is a scenario that triggers this problem:

- we have some backend servers balanced by IPVS
- we have an external load balancer that balances requests from real
clients to IPVS and does SNAT

Here is what happens in that scenario under high throughput:

- the external load balancer behaves (due to SNAT) as a single
source IP for the requests it forwards to IPVS
- IPVS receives connections and forwards them to the internal servers,
but once served, connections remain in TIME_WAIT in the IPVS
connection table for 120s
- the external load balancer has a TIME_WAIT of 60s, so after that
time (or earlier, if it reuses connections in TIME_WAIT) it recycles
the same ephemeral ports to send requests to IPVS (see the check
after this list)
- between those 60s (when the external LB starts reusing ports) and
those 120s (while IPVS still holds the connection in TIME_WAIT), the
rescheduling mechanism in IPVS ends up adding a 1s delay (due to the
SYN drop and the RTO timeout on the LB) to connection establishment
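
A quick way to watch this reuse window from the LB side (assuming a
Linux LB with iproute2):

    # on the external LB: count local sockets sitting in TIME_WAIT;
    # once ports start being recycled, fresh SYNs can hit IPVS entries
    # still within their 120s TIME_WAIT
    ss -tan state time-wait | wc -l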

This implies that when the external LB is under moderate load, approx.
250 req/s (calculated as [the size of net.ipv4.ip_local_port_range on
the LB] divided by [the TIME_WAIT timeout on the LB, 60s]), the
rescheduling mechanism in IPVS adds a 1s delay to the establishment of
TCP connections to the internal servers.
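
For reference, a minimal sketch of that calculation (the exact figure
depends on the configured range; the common default of 32768-60999
would give roughly 470 req/s):

    # approximate port-recycling rate = ephemeral range size / TW timeout
    read lo hi < /proc/sys/net/ipv4/ip_local_port_range
    echo $(( (hi - lo) / 60 ))   # requests/s at which ports start recycling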

This 1s delay seems to be caused by either:

- the mismatch between the hardcoded TIME_WAIT timeouts: 120s in IPVS
vs. 60s in the standard kernel TCP stack
- the rescheduling algorithm in IPVS, which forces the client (the LB)
to wait an entire RTO before retransmitting the SYN packet

I'm not saying that IPVS is badly parametrized or that the
rescheduling algorithm is badly designed. You guys are awesome and
have done really great work with IPVS.

The question is then: what can we do to avoid that 1s delay when
rescheduling connections?

If needed, I can elaborate on any of the above details, and even
provide a link to a GitHub issue (on the Docker project) describing
how we ended up writing to this list.

Thanks in advance,
Toni

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@LinuxVirtualServer.org
Send requests to lvs-users-request@LinuxVirtualServer.org
or go to http://lists.graemef.net/mailman/listinfo/lvs-users
Re: [lvs-users] IPVS adding a 1s delay on connection establishment under moderately high number of TCP req/s
Hello,

On Wed, 23 May 2018, Toni Martí wrote:

> We detected a problem with the IPVS module. Here's a quick summary of
> what triggers it:
>
> - IPVS has a hardcoded TIME_WAIT timeout of 120s
> - the kernel's TCP/IP stack has a hardcoded TIME_WAIT timeout of 60s
> - the connection rescheduling mechanism in IPVS works by dropping the
> first SYN it receives and letting the client retransmit the SYN after
> the (also hardcoded) initial RTO, which in practice is 1s
>
> Here is a scenario that triggers this problem:
>
> - we have some backend servers balanced by IPVS
> - we have an external load balancer that balances requests from real
> clients to IPVS and does SNAT
>
> Here is what happens in that scenario under high throughput:
>
> - the external load balancer behaves (due to SNAT) as a single
> source IP for the requests it forwards to IPVS
> - IPVS receives connections and forwards them to the internal servers,
> but once served, connections remain in TIME_WAIT in the IPVS
> connection table for 120s
> - the external load balancer has a TIME_WAIT of 60s, so after that
> time (or earlier, if it reuses connections in TIME_WAIT) it recycles
> the same ephemeral ports to send requests to IPVS
> - between those 60s (when the external LB starts reusing ports) and
> those 120s (while IPVS still holds the connection in TIME_WAIT), the
> rescheduling mechanism in IPVS ends up adding a 1s delay (due to the
> SYN drop and the RTO timeout on the LB) to connection establishment
>
> This implies that when the external LB is under moderate load, approx.
> 250 req/s (calculated as [the size of net.ipv4.ip_local_port_range on
> the LB] divided by [the TIME_WAIT timeout on the LB, 60s]), the
> rescheduling mechanism in IPVS adds a 1s delay to the establishment of
> TCP connections to the internal servers.
>
> This 1s delay seems to be caused by either:
>
> - the mismatch between the hardcoded TIME_WAIT timeouts: 120s in IPVS
> vs. 60s in the standard kernel TCP stack
> - the rescheduling algorithm in IPVS, which forces the client (the LB)
> to wait an entire RTO before retransmitting the SYN packet
>
> I'm not saying that IPVS is badly parametrized or that the
> rescheduling algorithm is badly designed. You guys are awesome and
> have done really great work with IPVS.
>
> The question is then: what can we do to avoid that 1s delay when
> rescheduling connections?

There was a recent discussion about this 1-second delay.
Maybe you will find the needed answers here:

https://marc.info/?t=151683118100004&r=1&w=2

Basically, you have 3 options:

- echo 0 > conn_reuse_mode: do not attempt to reschedule on
port reuse (new SYN hits an unexpired conn), just use the same real
server. This can be bad: we do not select an alive server if the
server used by the old connection is not available anymore (weight=0
or removed).
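
In full, that is (the toggle lives under /proc/sys/net/ipv4/vs/):

    echo 0 > /proc/sys/net/ipv4/vs/conn_reuse_mode
    # or equivalently:
    sysctl -w net.ipv4.vs.conn_reuse_mode=0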

The next two options apply if you do not want to use the first one:

- echo 0 > conntrack: if you do not use rules that match
conntrack state for the IPVS packets. This is the slowest option:
conntracks are created and destroyed for every packet.
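
In full:

    echo 0 > /proc/sys/net/ipv4/vs/conntrack
    # or equivalently:
    sysctl -w net.ipv4.vs.conntrack=0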

- use NOTRACK for the IPVS packets: the fastest option, conntracks
are not created and less memory is used.
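
For example, a minimal sketch using the raw table and a hypothetical
VIP 10.0.0.1:80 (adapt the addresses and ports to your own services):

    # mark client->VIP packets as untracked before conntrack sees them
    iptables -t raw -A PREROUTING -d 10.0.0.1 -p tcp --dport 80 -j CT --notrack
    # in NAT mode, replies from the real servers (here a hypothetical
    # 192.168.0.0/24 backend network) cross PREROUTING as well:
    iptables -t raw -A PREROUTING -s 192.168.0.0/24 -p tcp --sport 80 -j CT --notrack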

> If needed, I can elaborate on any of the above details, and even
> provide a link to a GitHub issue (on the Docker project) describing
> how we ended up writing to this list.
>
> Thanks in advance,
> Toni

Regards

--
Julian Anastasov <ja@ssi.bg>
Re: [lvs-users] IPVS adding a 1s delay on connection establishment under moderately high number of TCP req/s

Many thanks, Julian.

Really good options you've given me :-)

> There was a recent discussion about this 1-second delay.
> Maybe you will find the needed answers here:
>
> https://marc.info/?t=151683118100004&r=1&w=2

So basically the proposed solutions are the same as below.

> Basically, you have 3 options:
>
> - echo 0 > conn_reuse_mode: do not attempt to reschedule on
> port reuse (new SYN hits an unexpired conn), just use the same real
> server. This can be bad: we do not select an alive server if the
> server used by the old connection is not available anymore (weight=0
> or removed).

Already tried this, but it has the ugly side effect that, under high
throughput (with connections effectively being reused), IPVS does not
balance to servers newly added to the balanced set.

> - echo 0 > conntrack: if you do not use rules that match
> conntrack state for the IPVS packets. This is the slowest option:
> conntracks are created and destroyed for every packet.

Also tried this one, but I think Docker (the main IPVS user here)
installs iptables rules that require conntrack, and TCP connections
were not being established at all.

> - use NOTRACK for the IPVS packets: the fastest option, conntracks
> are not created and less memory is used.

So I think this is the only good remaining option: rewriting the
iptables rules (created by Docker Swarm) so that they don't use
tracking.
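
(A quick way to verify the rules take effect, assuming conntrack-tools
is installed, is to watch the conntrack table size under load; it
should stop growing with the request rate:)

    conntrack -C   # print the current number of conntrack entries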

Many, many thanks again for your help.

I will try the 3rd option and come back here with the result.

Regards,
Toni
