[lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV

brudolph at admantx

Feb 28, 2014, 11:21 AM

Post #1 of 12 (5021 views)

I have an LVS-DR cluster which has been running for seven months without
a hitch. Recently, the cluster started to time out on the majority of
connections. Some connections were passed through to a real server and
processed. I have tried for a week to figure out what happened. What I
found was that one real server out of five is connecting and servicing
the client request. The other four real servers have the HTTP connection
stuck in the SYN_RECV state until it times out (60 seconds).

In summary, I have seven CentOS 6.4 servers (kernel
2.6.32-358.18.1.el6.x86_64). Two servers are configured as load
balancers (a primary and a backup) and five as real servers. I have set up
LVS-DR using IPTables. The servers each have a public IP bound to a NIC
device and an internal VLAN bound to a second NIC. The VIP is configured
on the real servers' local loopback (lo:0) device. The
/etc/sysconfig/ha/lvs.cf was set up properly and everything was running
successfully for seven months.

We installed new versions of our software for the web service we are
running. Nothing network related. All five real servers were updated the
same way. I am comparing the one working real server with the four that
are not working. So far I have found nothing.

Any ideas on troubleshooting points?

--
Best Regards,
Bruce

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@LinuxVirtualServer.org
Send requests to lvs-users-request@LinuxVirtualServer.org
or go to http://lists.graemef.net/mailman/listinfo/lvs-users

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV

malcolm at loadbalancer

Feb 28, 2014, 11:26 AM

Post #2 of 12 (4947 views)

snip -- "I have set up
LVS-DR using IPTables."

Then why are you using a loopback adapter as well?

You only need to use one method: iptables REDIRECT, or a loopback
adapter plus arptables settings.

SYN_RECV means the real server is not replying when hit with a packet
that says "Hi, are you the VIP?"
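
The two alternatives described above might look like this on a real server. This is a sketch only: the VIP 192.0.2.10 and the port are placeholders, the helper name is hypothetical, and the loopback variant uses the kernel's arp_ignore/arp_announce sysctls in place of the arptables rules mentioned in the post. The helper prints the commands rather than running them, since they need root and change live networking.

```shell
#!/bin/sh
# Print (do not run) the commands for ONE of the two LVS-DR real-server
# methods. All addresses passed in are placeholders chosen by the caller.
dr_realserver_cmds() {
    method=$1
    vip=$2
    port=$3
    case "$method" in
        redirect)
            # Method 1: accept packets addressed to the VIP without owning it.
            echo "iptables -t nat -A PREROUTING -p tcp -d $vip --dport $port -j REDIRECT"
            ;;
        loopback)
            # Method 2: own the VIP on lo, but never answer ARP for it.
            echo "ip addr add $vip/32 dev lo label lo:0"
            echo "sysctl -w net.ipv4.conf.all.arp_ignore=1"
            echo "sysctl -w net.ipv4.conf.all.arp_announce=2"
            ;;
        *)
            echo "usage: dr_realserver_cmds redirect|loopback VIP PORT" >&2
            return 1
            ;;
    esac
}

# Show the commands for the REDIRECT method with placeholder values.
dr_realserver_cmds redirect 192.0.2.10 80
```

Printing first makes it easy to compare what each method would change against what is already configured on a server before touching it.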

--
Regards,

Malcolm Turnbull.

Loadbalancer.org Ltd.
Phone: +44 (0)870 443 8779
http://www.loadbalancer.org/

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV

brudolph at admantx

Feb 28, 2014, 12:01 PM

Post #3 of 12 (4942 views)

I followed instructions from two sources

1)
http://www.centos.org/docs/5/html/Virtual_Server_Administration/s2-lvs-direct-iptables-VSA.html

I updated iptables using the commands on this page.

2)
http://ptylr.com/2013/05/01/configuring-lvs-piranha-on-centos-for-direct-routing/

This page had information on configuring lo:0 which was
the final step that I needed to get LVS-DR to work.

This setup had been working since last August. It is still
working on one of the real servers but not on the other four. Very odd.

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV

malcolm at loadbalancer

Feb 28, 2014, 12:23 PM

Post #4 of 12 (4938 views)

Bruce,

You definitely only need one, and personally I find the iptables method easiest.
NB: your Apache instance must be configured to respond to the VIP as
well as the RIP (health checks are on the RIP).
If you use a local web browser on the real server, does it work when
you connect to the VIP? i.e.

links x.x.x.x

If so, then great, but your routing is probably messed up by the lo:0 adapter.

--
Regards,

Malcolm Turnbull.

Loadbalancer.org Ltd.
Phone: +44 (0)870 443 8779
http://www.loadbalancer.org/

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV

brudolph at admantx

Feb 28, 2014, 1:25 PM

Post #5 of 12 (4939 views)

Malcolm,

If there is a conflict with performing both steps (iptables redirect and
binding the VIP to lo:0), then I would think this should have failed when
I first set it up. And now one real server is handling requests and passing
responses to the client, while the other four have sessions in the SYN_RECV
state. For example:

Every 5.0s: netstat -t                        Fri Feb 28 22:15:42 2014

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address        Foreign Address            State
tcp        0      0 172.18.30.20:http    <client_IP_address>:50864  SYN_RECV

I tried the two approaches you indicated on one of the failing servers
and got the same SYN_RECV result.
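
A quick way to compare servers is to count the half-open connections instead of watching the netstat screen. The helper below is an illustrative sketch (the function name is hypothetical); it reads netstat-style text on stdin and counts the tcp lines whose last column is SYN_RECV.

```shell
#!/bin/sh
# Count TCP sockets in SYN_RECV from `netstat -tn`-style text on stdin.
# The connection state is the last column of each tcp line.
count_syn_recv() {
    awk '$1 ~ /^tcp/ && $NF == "SYN_RECV" { n++ } END { print n + 0 }'
}

# On a real server, with live data:
#   netstat -tn | count_syn_recv
```

Running it on the working server and on a failing one at the same moment gives a simple number to compare.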

My servers are headless. No local browsers.

Thanks for the response and the ideas. I will keep trying.

Bruce

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV

brudolph at admantx

Mar 1, 2014, 9:38 AM

Post #6 of 12 (4942 views)

My current findings.

The overall LVS cluster is running with degraded performance because
four of the five real servers are failing. The failure is strange. When
a client sends a request to the VIP (Virtual IP address), the LVS
Director (load balancer) distributes it to one of the real servers based
on the scheduling algorithm (LC).

Legend for the examples

VIP = Virtual IP Address for the LVS cluster
DIR = the LVS Director or Load Balancer
RS = Real Server - the web service we have running listening on port 80

The servers that are failing are doing so because of the following sequence:
ERROR SEQUENCE

Client sends SYN to VIP
DIR forwards SYN to an available RS
RS receives the SYN and responds to Client with SYN-ACK
Client does not receive the SYN-ACK so it never sends an ACK. It
continues to send a SYN trying to establish a connection until the
timeout. THIS IS THE FAILURE POINT.

The one working real server has the following sequence:
SUCCESS SEQUENCE

Client sends SYN to VIP
DIR forwards SYN to an available RS
RS receives the SYN and responds to Client with SYN-ACK
Client receives the SYN-ACK and sends an ACK
Client sends data packet (service request)
RS receives data packet
RS pushes data to the application
RS sends ACK to Client
RS application sends response data packet to Client
RS sends FIN to Client
Client receives response data and sends ACK
Client sends ACK to RS (for the FIN)
Client sends FIN to RS
RS sends ACK to Client (the connection is closed)

IMPORTANT: I can send the same request to a real server's public IP
(RIP) address rather than the VIP, and each real server responds correctly.

The working real server was set up the same way as the currently broken real
servers. I have not found why the broken real servers send a SYN-ACK
directed to the Client that is never received by the Client. Since
the Client doesn't receive the SYN-ACK, it keeps sending SYNs until a
timeout closes the request. The session on the real server is stuck in
SYN_RECV until it times out.

Any ideas given this scenario?

Bruce

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV

Mar 1, 2014, 11:01 AM

Post #7 of 12 (4962 views)

Hello,

On Sat, 1 Mar 2014, Bruce Rudolph wrote:

> My current findings.
>
> The overall LVS cluster is working at a degraded performance because
> four of the five real servers are failing. The failure is strange. When
> a client sends a request to the VIP (Virtual IP address) the LVS
> Director (load balancer) distributes it to one of the real servers based
> on the scheduling algorithm (LC).
>
> Legend for the examples
>
> VIP = Virtual IP Address for the LVS cluster
> DIR = the LVS Director or Load Balancer
> RS = Real Server - the web service we have running listening on port 80
>
>
> The servers that are failing are doing so because of the following sequence:
> ERROR SEQUENCE
>
> Client sends SYN to VIP
> DIR forwards SYN to an available RS
> RS receives the SYN and responds to Client with SYN-ACK

If there is a response, check on the real server that
it is correct:

1. It should contain the VIP in saddr in the IP header. This is expected
because the director should send the request to the real server
with the VIP in daddr. Also, the client should see the same
server port (vport) in the response.

2. 'tcpdump -lennn src host VIP' on the real server can show
to which destination MAC the response is sent.

3. If it is going via the director, you can also notice it with
tcpdump on the director. I guess DR setups do not use the
director for responses, otherwise they would use NAT mode
to avoid the source spoofing checks. I guess all your
real servers use the same default gateway.
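
Checks 2 and 3 could be run roughly as follows. This is a sketch, not something posted in the thread: the interface name eth0 and the address 192.0.2.10 are placeholders, and the hypothetical helper only prints the command so it can be reviewed before being run as root. Naming a specific interface matters here, because a capture taken with '-i any' uses Linux cooked (SLL) framing, which records only one link-layer address per packet, so the destination MAC is not shown.

```shell
#!/bin/sh
# Print a tcpdump command that shows packets sourced from the VIP,
# with Ethernet headers (-e), on one named interface.
# Arguments are placeholders: the VIP address and the interface name.
vip_reply_capture_cmd() {
    vip=$1
    iface=$2
    echo "tcpdump -lennn -i $iface src host $vip and tcp port 80"
}

# Real server: which destination MAC do the SYN-ACKs go to?
# Director: the same command there shows whether replies pass through it.
vip_reply_capture_cmd 192.0.2.10 eth0
```

Comparing the destination MAC seen on a failing server with the one seen on the working server would show whether the replies leave toward the same next hop.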

> Client does not receive the SYN-ACK so it never sends an ACK. It
> continues to send a SYN trying to establish a connection until the
> timeout. THIS IS THE FAILURE POINT.

Regards

--
Julian Anastasov <ja@ssi.bg>

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV

brudolph at admantx

Mar 3, 2014, 6:32 AM

Post #8 of 12 (4948 views)

Julian,

Thanks for the suggestions. The following shows the results with the
failing servers:

# tcpdump -lennnvvv -i any port http
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked),
capture size 65535 bytes
18:21:12.346348 In 68:05:ca:18:61:c1 ethertype IPv4 (0x0800),
length 80: (tos 0x28, ttl 53, id 52608, offset 0, flags [DF], proto
TCP (6), length 64)
<CIP>.62628 > <VIP>.80: Flags [S], cksum 0x3e62 (correct), seq
4011092518, win 65535, options [mss 1460,nop,wscale 1,nop,nop,TS val
3844971164 ecr 0,sackOK,eol], length 0
18:21:12.346386 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP
(6), length 60)
<VIP>.80 > <CIP>.62628: Flags [S.], cksum 0xf2a9 (correct),
seq 4207299083, ack 4011092519, win 14480, options [mss
1460,sackOK,TS val 82369115 ecr 3844971164,nop,wscale 7], length 0
18:21:13.478479 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP
(6), length 60)
<VIP>.80 > <CIP>.62628: Flags [S.], cksum 0xee3c (correct), seq
4207299083, ack 4011092519, win 14480, options [mss 1460,sackOK,TS
val 82370248 ecr 3844971164,nop,wscale 7], length 0
18:21:13.550009 In 68:05:ca:18:61:c1 ethertype IPv4 (0x0800),
length 80: (tos 0x28, ttl 53, id 21930, offset 0, flags [DF], proto
TCP (6), length 64)
<CIP>.62628 > <VIP>.80: Flags [S], cksum 0x39b5 (correct), seq
4011092518, win 65535, options [mss 1460,nop,wscale 1,nop,nop,TS val
3844972361 ecr 0,sackOK,eol], length 0
18:21:13.550032 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP
(6), length 60)
<VIP>.80 > <CIP>.62628: Flags [S.], cksum 0xedf5 (correct), seq
4207299083, ack 4011092519, win 14480, options [mss 1460,sackOK,TS
val 82370319 ecr 3844971164,nop,wscale 7], length 0
18:21:14.666596 In 68:05:ca:18:61:c1 ethertype IPv4 (0x0800),
length 80: (tos 0x28, ttl 53, id 24982, offset 0, flags [DF], proto
TCP (6), length 64)
<CIP>.62628 > <VIP>.80: Flags [S], cksum 0x356e (correct), seq
4011092518, win 65535, options [mss 1460,nop,wscale 1,nop,nop,TS val
3844973456 ecr 0,sackOK,eol], length 0
18:21:14.666626 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP
(6), length 60)
<VIP>.80 > <CIP>.62628: Flags [S.], cksum 0xe998 (correct), seq
4207299083, ack 4011092519, win 14480, options [mss 1460,sackOK,TS
val 82371436 ecr 3844971164,nop,wscale 7], length 0
18:21:15.478479 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP
(6), length 60)
<VIP>.80 > <CIP>.62628: Flags [S.], cksum 0xe66c (correct), seq
4207299083, ack 4011092519, win 14480, options [mss 1460,sackOK,TS
val 82372248 ecr 3844971164,nop,wscale 7], length 0
18:21:15.758857 In 68:05:ca:18:61:c1 ethertype IPv4 (0x0800),
length 80: (tos 0x28, ttl 53, id 40934, offset 0, flags [DF], proto
TCP (6), length 64)

The pattern above shows the cycle:

CIP                      VIP
-----                    -----
SYN  ---------->
     <----------  SYN-ACK
     <----------  SYN-ACK (1+ seconds later)
SYN  ---------->
     <----------  SYN-ACK
     <----------  SYN-ACK (1+ seconds later)

On the same failing real servers, I can send the request to the RIP
successfully. The tcpdump output follows:

# tcpdump -lennnvvv -i any port http
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked),
capture size 65535 bytes
15:25:35.287886 In 00:23:9c:10:e0:41 ethertype IPv4 (0x0800),
length 80: (tos 0x28, ttl 53, id 20068, offset 0, flags [DF], proto
TCP (6), length 64)
<CIP>.52747 > <RIP>.80: Flags [S], cksum 0x6bde (correct), seq
2178856449, win 65535, options [mss 1460,nop,wscale 1,nop,nop,TS val
3920435549 ecr 0,sackOK,eol], length 0
15:25:35.287937 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP
(6), length 60)
<RIP>.80 > <CIP>.52747: Flags [S.], cksum 0xde4a (correct), seq
242406834, ack 2178856450, win 14480, options [mss 1460,sackOK,TS
val 56852073 ecr 3920435549,nop,wscale 7], length 0
15:25:35.401916 In 00:23:9c:10:e0:41 ethertype IPv4 (0x0800),
length 68: (tos 0x28, ttl 53, id 10796, offset 0, flags [DF], proto
TCP (6), length 52)
<CIP>.52747 > <RIP>.80: Flags [.], cksum 0xc321 (correct), seq
1, ack 1, win 33304, options [nop,nop,TS val 3920435658 ecr
56852073], length 0
15:25:43.297092 In 00:23:9c:10:e0:41 ethertype IPv4 (0x0800),
length 87: (tos 0x28, ttl 53, id 38439, offset 0, flags [DF], proto
TCP (6), length 71)
<CIP>.52747 > <RIP>.80: Flags [P.], cksum 0x9558 (correct), seq
1:20, ack 1, win 33304, options [nop,nop,TS val 3920443505 ecr
56852073], length 19
15:25:43.297119 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
length 68: (tos 0x0, ttl 64, id 43880, offset 0, flags [DF], proto
TCP (6), length 52)
<RIP>.80 > <CIP>.52747: Flags [.], cksum 0x06c5 (correct), seq
1, ack 20, win 114, options [nop,nop,TS val 56860082 ecr
3920443505], length 0
15:25:43.300061 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
length 72: (tos 0x0, ttl 64, id 43881, offset 0, flags [DF], proto
TCP (6), length 56)
<RIP>.80 > <CIP>.52747: Flags [P.], cksum 0xc206 (incorrect ->
0x27df), seq 1:5, ack 20, win 114, options [nop,nop,TS val 56860085
ecr 3920443505], length 4
15:25:43.300077 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
length 68: (tos 0x0, ttl 64, id 43882, offset 0, flags [DF], proto
TCP (6), length 52)
<RIP>.80 > <CIP>.52747: Flags [F.], cksum 0x06bd (correct), seq
5, ack 20, win 114, options [nop,nop,TS val 56860085 ecr
3920443505], length 0
15:25:43.414941 In 00:23:9c:10:e0:41 ethertype IPv4 (0x0800),
length 68: (tos 0x28, ttl 53, id 36982, offset 0, flags [DF], proto
TCP (6), length 52)
<CIP>.52747 > <RIP>.80: Flags [.], cksum 0x84aa (correct), seq
20, ack 5, win 33302, options [nop,nop,TS val 3920443616 ecr
56860085], length 0
15:25:43.414964 In 00:23:9c:10:e0:41 ethertype IPv4 (0x0800),
length 68: (tos 0x28, ttl 53, id 11379, offset 0, flags [DF], proto
TCP (6), length 52)
<CIP>.52747 > <RIP>.80: Flags [.], cksum 0x84a6 (correct), seq
20, ack 6, win 33304, options [nop,nop,TS val 3920443617 ecr
56860085], length 0
15:25:43.414974 In 00:23:9c:10:e0:41 ethertype IPv4 (0x0800),
length 68: (tos 0x28, ttl 53, id 13828, offset 0, flags [DF], proto
TCP (6), length 52)
<CIP>.52747 > <RIP>.80: Flags [F.], cksum 0x84a5 (correct), seq
20, ack 6, win 33304, options [nop,nop,TS val 3920443617 ecr
56860085], length 0
15:25:43.414986 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
length 68: (tos 0x0, ttl 64, id 43883, offset 0, flags [DF], proto
TCP (6), length 52)
<RIP>.80 > <CIP>.52747: Flags [.], cksum 0x05d9 (correct), seq
6, ack 21, win 114, options [nop,nop,TS val 56860200 ecr
3920443617], length 0

This at least proves the NIC is working.

Again, the odd thing is that one real server works when requests are
sent to the VIP. I am still trying to find a difference, but these
servers were all set up the same way and so far I have found nothing.

Thanks again for the additional insights.

Regards,
Bruce

On 3/1/14 2:01 PM, Julian Anastasov wrote:
> Hello,
>
> On Sat, 1 Mar 2014, Bruce Rudolph wrote:
>
>> My current findings.
>>
>> The overall LVS cluster is working at a degraded performance because
>> four of the five real servers are failing. The failure is strange. When
>> a client sends a request to the VIP (Virtual IP address) the LVS
>> Director (load balancer) distributes it to one of the real servers based
>> on the scheduling algorithm (LC).
>>
>> Legend for the examples
>>
>> VIP = Virtual IP Address for the LVS cluster
>> DIR = the LVS Director or Load Balancer
>> RS = Real Server - the web service we have running listening on port 80
>>
>>
>> The servers that are failing are doing so because of the following sequence:
>> ERROR SEQUENCE
>>
>> Client sends SYN to VIP
>> DIR forwards SYN to an available RS
>> RS receives the SYN and responds to Client with SYN-ACK
> If there is a response, check on the real server that
> it is correct:
>
> 1. It should contain VIP in saddr in IP header. This is expected
> because director should send the request to real server
> with VIP in daddr. Also, the client should see the same
> server port (vport) in the response.
>
> 2. 'tcpdump -lennn src host VIP' on the real server can show
> the destination MAC to which the response is sent
>
> 3. If it is going via director you can notice it with
> tcpdump also on director. I guess, DR setups do not use
> director for responses, otherwise they would use NAT mode
> to avoid the source spoofing checks. I guess all your
> real servers use same default gateway.
>
>> Client does not receive the SYN-ACK so it never sends an ACK. It
>> continues to send a SYN trying to establish a connection until the
>> timeout. THIS IS THE FAILURE POINT.
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
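
The first check in the quoted reply (the response must carry the VIP as
its source address and the same server port) can be scripted. Below is a
minimal Python sketch, assuming a capture taken with numeric addresses
(tcpdump -n) and the address-line layout seen in the dumps in this
thread. The function name check_synack, the regular expression, and the
sample addresses (documentation ranges) are illustrative assumptions and
are not part of tcpdump or LVS.

```python
import re

# Address line as tcpdump prints it for IPv4 TCP with -n, for example:
#   "192.0.2.10.80 > 198.51.100.7.62628: Flags [S.], cksum ..."
ADDR_LINE = re.compile(
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): "
    r"Flags \[(?P<flags>[^\]]+)\]"
)


def check_synack(line, vip, vport):
    """Return a list of problems found in one tcpdump address line.

    An empty list means the line is a SYN-ACK sent from vip:vport.
    """
    m = ADDR_LINE.search(line)
    if m is None:
        return ["not an IPv4 TCP address line"]
    problems = []
    if m.group("flags") != "S.":
        problems.append("not a SYN-ACK (flags %s)" % m.group("flags"))
    if m.group("src") != vip:
        problems.append("source is %s, expected VIP %s" % (m.group("src"), vip))
    if int(m.group("sport")) != vport:
        problems.append("source port is %s, expected %d" % (m.group("sport"), vport))
    return problems


if __name__ == "__main__":
    # Hypothetical line: the real server answered from its RIP, not the VIP.
    bad = "10.0.0.5.80 > 198.51.100.7.62628: Flags [S.], cksum 0xf2a9 (correct),"
    print(check_synack(bad, "192.0.2.10", 80))
```

An empty result only says the packet left the real server in the right
shape. It says nothing about whether a router further along dropped it.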

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV [ In reply to ]

Mar 3, 2014, 9:28 AM

Post #9 of 12 (4933 views)

Hello,

On Mon, 3 Mar 2014, Bruce Rudolph wrote:

> 18:21:12.346386 Out e4:11:5b:ae:f9:e5 ethertype IPv4 (0x0800),
> length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP
> (6), length 60)
> <VIP>.80 > <CIP>.62628: Flags [S.], cksum 0xf2a9 (correct),
> seq 4207299083, ack 4011092519, win 14480, options [mss
> 1460,sackOK,TS val 82369115 ecr 3844971164,nop,wscale 7], length 0

Is the response going to e4:11:5b:ae:f9:e5? Do
you see it reaching there? Also, a simple test with a
client on the LAN can reveal the problem; just check with
tcpdump on the client box. It can show whether the problem
comes from the router or from the real servers. Sometimes
smart switches can be the culprit too.

Also, check on the real servers (mostly the working
one) with tcpdump that you don't see the VIP in
outgoing ARP packets; only the director may expose the VIP
in ARP packets. This can also be checked from a client on
the LAN with 'arping -c 1 VIP'; only the director should
reply for the VIP.

Regards

--
Julian Anastasov <ja@ssi.bg>
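
The arping check above comes down to one question: does any MAC other
than the director's answer for the VIP? Here is a rough Python sketch
that answers it from saved arping output. It assumes iputils-style reply
lines of the form "Unicast reply from IP [MAC]"; other arping
implementations print differently. The function names and the sample
text are made up for illustration and are not output from this cluster.

```python
import re

# Assumed reply-line format (iputils arping):
#   "Unicast reply from 192.0.2.10 [00:11:22:33:44:55]  0.7ms"
REPLY = re.compile(r"reply from (?P<ip>\S+) \[(?P<mac>[0-9A-Fa-f:]{17})\]")


def responders(arping_output):
    """Return the set of MAC addresses that answered, lower-cased."""
    return {m.group("mac").lower() for m in REPLY.finditer(arping_output)}


def only_director_answers(arping_output, director_mac):
    """True if the director's MAC is the one and only responder."""
    return responders(arping_output) == {director_mac.lower()}


if __name__ == "__main__":
    # Hypothetical sample: a second host also answers for the VIP.
    sample = (
        "ARPING 192.0.2.10 from 192.0.2.50 eth0\n"
        "Unicast reply from 192.0.2.10 [00:11:22:33:44:55]  0.7ms\n"
        "Unicast reply from 192.0.2.10 [66:77:88:99:AA:BB]  0.9ms\n"
    )
    print(sorted(responders(sample)))
    print(only_director_answers(sample, "00:11:22:33:44:55"))
```

If more than one MAC shows up, a real server is probably still answering
ARP for the VIP, and the ARP hiding on that host needs another look.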

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV [ In reply to ]

brudolph at admantx

Mar 3, 2014, 10:54 AM

Post #10 of 12 (4927 views)

On the failing real servers the response is sent but is never received
by the client (e4:11:5b:ae:f9:e5). On the working server the response is
sent and the client gets it and sends an ACK and the connection is open.

I run tcpdump on the client (my Mac for the testing) and that is how I
know that the SYN-ACK packet is not received from the failing real servers.

This is the mind-boggling thing...where are they going? Could it be a
smart switch in the cloud environment? If so, then why would one server
out of five work correctly?

The real servers are not responding to arping. Only the Director does.

Bruce

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV [ In reply to ]

brudolph at admantx

Mar 5, 2014, 10:24 AM

Post #11 of 12 (4918 views)

One more follow up to see if there are any other suggestions.

Yesterday I added a sixth real server to the cluster. All of these
servers are of the exact same type (bare metal machines). I installed
and configured the new server exactly as the others. I added it to the
cluster and tried it. It failed too, that is, sending requests to the
VIP causes the real server to send a SYN-ACK (response to the SYN), but
it is never seen by the client. The one working server, of the same
type, continues to respond correctly!

Today I reconfigured a non-working server to use Direct Routing via the
arptables_jf technique. I tried a request and it failed. Then I
reconfigured the working server to use arptables_jf and it worked. So
the failure persists on all five bad servers with either DR
configuration, and the one good server works with either.

I doubt five servers can have a hardware problem with their NICs.

The cloud vendor has checked their smart switches and they state they
are working fine.

Thanks for listening and any support suggestions you may have.

Regards,
Bruce

Re: [lvs-users] LVS-DR Cluster Some Real Servers Stuck in SYN_RECV [ In reply to ]

brudolph at admantx

Mar 13, 2014, 2:35 PM

Post #12 of 12 (4906 views)

All,

Figured I'd update all with the discovered problem in case it happens to
you.

My bare-metal vendor enabled uRPF (unicast reverse path forwarding) on
some routers in their environment, which causes the outgoing SYN-ACKs
from the Real Server to the Client to be dropped. On the one working
Real Server, the network segment did not have uRPF enabled.
Frustratingly, for two weeks they did not let me know of this change.

Bruce
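
For readers wondering why unicast reverse path forwarding (uRPF) would
drop only these packets, here is a small Python sketch of the strict
check in general terms. It is not the vendor's configuration. A router
in strict mode accepts a packet only if its own best route back to the
packet's source address points out of the interface the packet arrived
on. In LVS-DR the real server replies with the VIP as source, so a
router that routes the VIP towards the director can treat the reply as
spoofed. The route table, interface names, and addresses below are
hypothetical.

```python
import ipaddress


def best_route(routes, addr):
    """Longest-prefix match: return the interface for addr, or None."""
    ip = ipaddress.ip_address(addr)
    matches = [(net, ifname) for net, ifname in routes if ip in net]
    if not matches:
        return None
    return max(matches, key=lambda r: r[0].prefixlen)[1]


def strict_urpf_accepts(routes, src_addr, ingress_if):
    """Strict uRPF: the route back to the source must use the ingress interface."""
    return best_route(routes, src_addr) == ingress_if


# Hypothetical router view: the VIP is reached via the director's port,
# the real servers' own addresses via another port.
ROUTES = [
    (ipaddress.ip_network("192.0.2.10/32"), "to-director"),
    (ipaddress.ip_network("198.51.100.0/24"), "to-realservers"),
]

if __name__ == "__main__":
    # SYN-ACK from a real server, source = VIP, arriving on the real-server port.
    print(strict_urpf_accepts(ROUTES, "192.0.2.10", "to-realservers"))
    # Ordinary traffic from the real server's own address on the same port.
    print(strict_urpf_accepts(ROUTES, "198.51.100.20", "to-realservers"))
```

Under these assumptions the first call returns False (the packet is
dropped) and the second returns True. That fits the symptom in this
thread: the real server sends a correct SYN-ACK and the client never
sees it. A segment without the check would pass both packets.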
