Mailing List Archive

[lvs-users] connection broken after 2MB of data transmitted
Hi,

We are using this tool in one of our projects, and we are facing a disconnect after every ~2MB of data transferred.

rhel 7n
ipvsadm.x86_64 1.27-7.el7
keepalived.x86_64 1.3.5-1.el7
kernel.x86_64 3.10.0-514.21.1.el7


Our configuration:
VIP 10.1.1.130
LB1 10.1.1.131 Virtual Server keepalived Active
LB2 10.1.1.132 Virtual Server keepalived backup
MQ1 10.1.1.151 Real Server MQ Active
MQ2 10.1.1.152 Real Server MQ Standby

Our keepalived.conf (simplified)
global_defs {
notification_email {
blablalba@mymail.com
}
notification_email_from blablalba@mymail.com
smtp_server sysmail.mymail.com
smtp_connect_timeout 30
}

vrrp_instance vi_y-maas {
state BACKUP
virtual_router_id 100
interface ens32
priority 150
advert_int 5
nopreempt
smtp_alert
virtual_ipaddress {
10.1.1.130/25
}
}

# My MQ
virtual_server 10.1.1.130 1423 {
delay_loop 2
protocol TCP
lb_algo rr
lb_kind DR

real_server 10.1.1.151 1423 {
weight 10
TCP_CHECK {
}
}
real_server 10.1.1.152 1423 {
weight 10
TCP_CHECK {
}
}
}

On MQ1 and MQ2, we have added ARP rules (due to Direct Routing)
:INPUT ACCEPT
:OUTPUT ACCEPT
:FORWARD ACCEPT
-A INPUT -j DROP -d 10.1.1.130
-A OUTPUT -j mangle -s 10.1.1.130 --mangle-ip-s 10.1.1.151
And
:INPUT ACCEPT
:OUTPUT ACCEPT
:FORWARD ACCEPT
-A INPUT -j DROP -d 10.1.1.130
-A OUTPUT -j mangle -s 10.1.1.130 --mangle-ip-s 10.1.1.152

MQ1 and MQ2 also have the VIP Address as a secondary address of the interface
2: ens32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
inet 10.1.1.151/25 brd 10.1.1.255 scope global ens32
valid_lft forever preferred_lft forever
inet 10.1.1.130/25 scope global secondary ens32
valid_lft forever preferred_lft forever

This lets us route traffic to the active MQ without intervention: if the active MQ fails, the standby takes over, and the LB detects that MQ1 is down and MQ2 is up.

My problem

When reading messages (~8'000) from MQ through the VIP, the program can read ~2MB, then the connection breaks. In the Wireshark trace we see 5 TCP retransmits, with increasing delay between them, between the IP where the application program runs and the VIP address. At the same time, there is no more traffic from LB1 (active LB) to MQ1 (active MQ).

With the same program and the same read of messages, connecting directly to MQ1 causes no problems.

Could this be a problem related to keepalived or to linux-lvs itself?

Many thanks and regards
Robert

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@LinuxVirtualServer.org
Send requests to lvs-users-request@LinuxVirtualServer.org
or go to http://lists.graemef.net/mailman/listinfo/lvs-users
Re: [lvs-users] connection broken after 2MB of data transmitted
Hello,

On Mon, 14 May 2018, Robert.Grange@swisscom.com wrote:

> We are using this tool in one of our project, and we are facing a disconnect every ~2MB of data transferred.
>
> rhel 7n
> ipvsadm.x86_64 1.27-7.el7
> keepalived.x86_64 1.3.5-1.el7
> kernel.x86_64 3.10.0-514.21.1.el7
>
>
> Our configuration:
> VIP 10.1.1.130
> LB1 10.1.1.131 Virtual Server keepalived Active
> LB2 10.1.1.132 Virtual Server keepalived backup
> MQ1 10.1.1.151 Real Server MQ Active
> MQ2 10.1.1.152 Real Server MQ Standby
>
> Our keepalived.conf (simplified)
> global_defs {
> notification_email {
> blablalba@mymail.com
> }
> notification_email_from blablalba@mymail.com
> smtp_server sysmail.mymail.com
> smtp_connect_timeout 30
> }
>
> vrrp_instance vi_y-maas {
> state BACKUP
> virtual_router_id 100
> interface ens32
> priority 150
> advert_int 5
> nopreempt
> smtp_alert
> virtual_ipaddress {
> 10.1.1.130/25
> }
> }
>
> # My MQ
> virtual_server 10.1.1.130 1423 {
> delay_loop 2
> protocol TCP
> lb_algo rr
> lb_kind DR
>
> real_server 10.1.1.151 1423 {
> weight 10
> TCP_CHECK {
> }
> }
> real_server 10.1.1.152 1423 {
> weight 10
> TCP_CHECK {
> }
> }
> }
>
> On MQ1 and MQ2, we have added ARP rules (due to Direct Routing)
> :INPUT ACCEPT
> :OUTPUT ACCEPT
> :FORWARD ACCEPT
> -A INPUT -j DROP -d 10.1.1.130
> -A OUTPUT -j mangle -s 10.1.1.130 --mangle-ip-s 10.1.1.151
> And
> :INPUT ACCEPT
> :OUTPUT ACCEPT
> :FORWARD ACCEPT
> -A INPUT -j DROP -d 10.1.1.130
> -A OUTPUT -j mangle -s 10.1.1.130 --mangle-ip-s 10.1.1.152
>
> MQ1 and MQ2 also have the VIP Address as a secondary address of the interface
> 2: ens32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
> link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
> inet 10.1.1.151/25 brd 10.1.1.255 scope global ens32
> valid_lft forever preferred_lft forever
> inet 10.1.1.130/25 scope global secondary ens32
> valid_lft forever preferred_lft forever
>
> This permit us to direct the routing to the active MQ without intervention (if the Active MQ fail, the StdBy take relay and the LB detect that MQ1 is down and MQ2 is up)
>
> My problem
>
> When trying to read messages (~8'000) from MQ, using VIP to connect, the program can read ~2MB, then the connection is broken (We can see that in Wireshark trace that there are 5 TCP Retransmit with increasing delay between retransmit, between the IP where the Application PGM runs and the VIP address, and that at the same time, there is no more traffic between the LB1 (active LB) to the MQ1 (Active MQ)

It would be useful to see a trace from just before the
retransmissions start, from the client, director and real server:

tcpdump -lnnnv -i any -s 0 port 1423 or icmp

If you prefer, you can scramble the addresses; we
care about things like checksums, packet sizes, PMTU (ICMP errors?).

Also, you can try to stop GRO/GSO on the director:

ethtool -K ETH gso off
ethtool -K ETH gro off

Check on the client with arp -an that the MAC for the VIP is correct,
just to be sure.
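
The ARP check above can be scripted. This is only a minimal sketch: it assumes the Linux net-tools `arp -an` line format, and the helper name vip_mac is hypothetical, not part of any tool. It prints the MAC that the client's ARP table holds for a given IP, so it can be compared by eye with the MAC of the active director.

```shell
#!/bin/sh
# Print the MAC address listed for IP $1 in `arp -an` output read from stdin.
# Assumed line format (Linux net-tools):
#   ? (10.1.1.130) at 00:11:22:33:44:55 [ether] on ens32
vip_mac() {
    # Field 2 is the address in parentheses, field 4 is the MAC.
    awk -v ip="($1)" '$2 == ip && $3 == "at" { print $4 }'
}

# Example use on the client:
#   arp -an | vip_mac 10.1.1.130
```

With LVS-DR the result should be the MAC of the active director, not that of a real server.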

> With the same pgm, same read of messages, when connecting directly to the MQ1, there is no problems.
>
> Could this be a problem related to keepalived or from linux-lvs it-self ?
>
> Many thanks and regards
> Robert

Regards

--
Julian Anastasov <ja@ssi.bg>

Re: [lvs-users] connection broken after 2MB of data transmitted
Hi Julian,

Here are the 3 anonymized traces (IP, MAC, port, and with no payload):
mq_00001_20180518102401_onlyip_192.168.0.1_anon at frame 3430
director_00001_20180518102356_onlyip_192.168.0.1_anon at frame 3428 (but with TCP Retransmit and TCP Dup ACK from the very beginning)
client_00001_20180518102350_anon at frame 4287

I also tried

ethtool -K ETH gso off
ethtool -K ETH gro off

but it didn't help
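
To confirm that such a change took effect, the current feature state can be read back with `ethtool -k` (lowercase k). This is a small sketch: it assumes the usual `name: on|off` output lines, offload_state is a hypothetical helper name, and ETH stands for the real interface name.

```shell
#!/bin/sh
# Print the state (on/off) of offload feature $1 from `ethtool -k` output on stdin.
# Assumed line format: "generic-receive-offload: off" (optionally followed by "[fixed]").
offload_state() {
    awk -F': *' -v f="$1" '
        { name = $1; sub(/^[ \t]+/, "", name) }   # strip leading indentation
        name == f { split($2, s, " "); print s[1] }
    '
}

# Example use on the director:
#   ethtool -k ETH | offload_state generic-segmentation-offload
#   ethtool -k ETH | offload_state generic-receive-offload
```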

On the client, arp -a shows the correct MAC for the VIP (that of the active director).

Robert

Re: [lvs-users] connection broken after 2MB of data transmitted
Hello,

On Fri, 18 May 2018, Robert.Grange@swisscom.com wrote:

> Hi Julian,
>
> Here are the 3 traces anonymized (IP, MAC, Port, and with no payload)
> mq_00001_20180518102401_onlyip_192.168.0.1_anon at frame 3430
> director_00001_20180518102356_onlyip_192.168.0.1_anon at frame 3428 (but always TCP Retransmit & TCP Dup Ack since the beginning)
> client_00001_20180518102350_anon at frame 4287
>
> I also tried to change the
>
> ethtool -K ETH gso off
> ethtool -K ETH gro off
>
> but it didn't help
>
> on the client, the arp -a shows that the MAC of the VIP has the correct MAC (the one of the active director)

Maybe there is some HW problem with the receiving
interface on the director. The client sends packets for seq 67008:67084
on the 192.168.0.1.50315 > 192.168.0.101.1234 connection,
with IP id 12552, 12554, 12555, 12556, 12557, 12558.
None of them is received on the director. Still, the director successfully
receives id 12553 and 12559, which are pure ACKs.

Try to disable HW RX CSUM on the director with
'ethtool -K INDEV rx off' to check whether that helps to receive
such packets, in case they are dropped for some reason.
You can also monitor the interface stats for strange
errors:

ethtool -S INDEV
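
The `ethtool -S` output can be long, so a filter helps. This is a rough sketch: counter names are driver-specific, so the pattern below (err, drop, csum, miss) is only an assumption about what they might be called, and nonzero_error_counters is a hypothetical helper name.

```shell
#!/bin/sh
# From `ethtool -S` output on stdin, print counters whose name looks
# error-related and whose value is greater than zero, as name=value.
# The name pattern is a guess; adjust it to the counters of your driver.
nonzero_error_counters() {
    awk -F': *' '
        tolower($1) ~ /err|drop|csum|miss/ && $2 + 0 > 0 {
            name = $1
            sub(/^[ \t]+/, "", name)   # strip leading indentation
            print name "=" $2
        }
    '
}

# Example use on the director, before and after a failing transfer:
#   ethtool -S INDEV | nonzero_error_counters
```

Comparing the output before and after a failing transfer shows which counters, if any, grow while packets go missing.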

If the problem is with the HW RX CSUMs, try a
fresh kernel or different hardware.

Another possibility is a problem with TX CSUMs on the
client, but you say it works directly to the real server...
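
The comparison of IP ids between two captures, as done above by hand, can be sketched in shell. It assumes text output from `tcpdump -v` (which prints `id NNNN` in the IP header line) saved to files, with each file already filtered to one direction of the connection. The helper names ip_ids and missing_ids are hypothetical.

```shell
#!/bin/sh
# Extract IP ids from `tcpdump -v` text output on stdin, one per line, sorted.
# Assumed header format: "... IP (tos 0x0, ttl 64, id 12552, offset 0, ...".
ip_ids() {
    sed -n 's/.*, id \([0-9][0-9]*\),.*/\1/p' | sort -u
}

# Print IP ids present in capture $1 (sender side) but absent from
# capture $2 (receiver side).
missing_ids() {
    mi_a=$(mktemp) mi_b=$(mktemp)
    ip_ids < "$1" > "$mi_a"
    ip_ids < "$2" > "$mi_b"
    comm -23 "$mi_a" "$mi_b"   # lines unique to the first list
    rm -f "$mi_a" "$mi_b"
}

# Example use (file names are placeholders):
#   missing_ids client.txt director.txt
```

IP ids wrap around and are only unique per sender, so this is only meaningful over a short window of a single flow.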

Regards

--
Julian Anastasov <ja@ssi.bg>
