Mailing List Archive: Wackamole failing after cable dis-/reconnect

Oct 22, 2003, 3:44 AM

Post #1 of 5 (2819 views)

Hello,

I have a Spread/Wackamole setup that works fine, at least in testing,
as long as I fail one machine of the two-machine setup by rebooting it
completely or by killing the Wackamole or Spread daemon.
In those cases the other machine in the little cluster takes over
with only a very short outage.

The problem I encounter appears after physically disconnecting the
network cable on either machine and then reconnecting it.
Step by step I do the following, assuming both machines are initially
fine and listening:
1. Disconnect the NIC cable of machine A (2.4.20 kernel, German
SUSE 8.2 distro).
2. Watch syslog on the other machine B (RH 7.1, kernel 2.4.2-2) and
wait for Wackamole to complete the ARP spoof.
3. Watch ping -t on a Windows box on the same network. After the
disconnection there is a brief outage of one to two seconds, then the
other machine jumps in and ping receives good responses again.
4. Reconnect the NIC cable of machine A (where the Spread daemon and
Wackamole have continued running while the cable was off).
5. Watch syslog of machine B: Wackamole brings the VIP down.
6. Watch syslog of machine A: there is no activity apart from the
notice that the cable has been reconnected and a 100 Mbit link has been
established.
7. Watch ping -t on the Windows box: ping receives "destination host
unreachable" messages originating from the physical IP of machine B.
Since machine B took down the VIP it was listening on when the cable on
machine A was reconnected, it should not be able to respond to pings
going to the VIP, which is OK.
8. Running arp -a on the Windows box, I see that the ARP cache entry for
the VIP has not been updated. One explanation that occurs to me is that
the ARP spoof and the subsequent update of the shared ARP cache seem to
happen only when a VIP comes up, not when it goes down. So in my case
the VIP on machine B goes down without anyone being notified of it. And
the VIP on machine A, which stayed up throughout the physical
disconnect, does not sense any change and therefore does not broadcast
ARP information either.
9. If I purge the VIP from the Windows box's ARP cache, ping comes
right back with good responses.
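
The arp -a check described above can be automated. The following is only a minimal sketch: the sample output, the MAC addresses and the helper names are made up for illustration, and it assumes the usual three-column layout of Windows `arp -a` output.

```python
from typing import Optional

# Hypothetical sample of Windows `arp -a` output; all addresses are made up.
SAMPLE = """\
Interface: 192.168.1.10 --- 0x2
  Internet Address      Physical Address      Type
  192.168.1.1           00-50-56-00-00-01     dynamic
  192.168.1.200         00-0c-29-aa-bb-cc     dynamic
"""


def normalize(mac: str) -> str:
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.lower().replace("-", ":")


def cached_mac(arp_output: str, ip: str) -> Optional[str]:
    """Return the MAC cached for `ip`, or None if there is no entry."""
    for line in arp_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == ip:
            return normalize(fields[1])
    return None


def vip_entry_is_stale(arp_output: str, vip: str, owner_mac: str) -> bool:
    """True if the cache holds an entry for `vip` that is not `owner_mac`."""
    cached = cached_mac(arp_output, vip)
    return cached is not None and cached != normalize(owner_mac)
```

If the entry is stale, deleting it on the client (as in the last step above) forces a fresh ARP lookup.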

Well, I hope no one got bored by the lengthy explanation.
I will post the important parts of my conf below.
The Wackamole conf is different from most others I have seen. I want
(have) to use only one IP address as the VIP for both machines in my
little cluster, since both machines have to be exposed under that IP
address, not at the same time (I know this wouldn't work) but
alternately, depending on their health state or running condition. The
network has no DNS available, therefore I have to go with the IP.

Spread (conf is identical on both machines A and B):
Spread_Segment 192.168.1.255:4803 {
    "192" 192.168.1.141
    ibm-linux 192.168.1.59
}

Wackamole (identical as well):
# The Spread daemon we are going to connect to. It should be on the
# local box
Spread = 4803
SpreadRetryInterval = 2s
# The group name
Group = wack1
# Named socket for online control
Control = /var/run/wack.it

# Denote the interface we prefer to have
#prefer eth0:10.3.4.5/8
#prefer { eth0:10.2.3.4/8 eth1:192.168.10.23/24 }

# In most cases, I just don't care. Let wackamole decide.
Prefer None

# List all the virtual interfaces (ALL of them)
VirtualInterfaces {
    # The following two lines have the same effect
    # en0:192.168.1.2/24
    { eth0:192.168.1.200/24 }

    # This is how you say 2 or more IPs are to be treated as a single
    # "set" or "virtual interface". If wackamole decides that this
    # machine will manage it, you are ensured to get ALL the ips in the
    # set.
    # { en1:10.0.0.1/8 en0:192.168.35.64/26 }
}

# Collect and broadcast the IPs in our ARP table every so often
Arp-Cache = 1s

# List who we will notify
# Here the netblock (/24 or /28) can be deceptive. It is NOT a
# netmask for a single IP. It is how one will describe that they want to
# notify ALL IPs in a segment.
Notify {
    # Let's notify our router:
    eth0:192.168.1.1/32
    # Notify our DNS servers
    # en1:10.0.0.10/32
    # en1:10.0.0.11/32
    # 10.0.0.0 -> 10.0.0.255, but only 128 notifications/sec
    # en0:10.0.0.0/24 throttle 128
    # Wackamole shares arp-cache across machines, this says to
    # notify every IP address in the aggregate shared arp-cache.
    arp-cache
}
balance {
    # This field is the maximum number of IP addresses that will move
    # from one wackamole to another during a round of balancing.
    AcquisitionsPerRound = 1
    # Time interval in each balancing round.
    interval = 1s
}
# How long it takes us to mature
mature = 3s
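
The comment in the Notify block says the netblock names every IP in a segment rather than a netmask for one IP. As an illustration only (this is not Wackamole's parser, the function name is hypothetical, and `throttle` suffixes are not handled), the expansion of an `interface:address/prefix` entry could look like this:

```python
import ipaddress
from typing import List, Tuple


def expand_notify_entry(entry: str) -> Tuple[str, List[str]]:
    """Split 'iface:a.b.c.d/len' and list every address in that block.

    Mirrors the config comment "10.0.0.0 -> 10.0.0.255": all addresses
    of the block are returned, including the first and the last one.
    """
    iface, _, block = entry.partition(":")
    network = ipaddress.ip_network(block, strict=False)
    return iface, [str(addr) for addr in network]
```

Under this reading, `eth0:192.168.1.1/32` names exactly one target (the router), while `en0:10.0.0.0/24` names 256.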

-----

If anyone has some time:
Can what I intend to do work at all?
Any hints on how I could work around my problem?

If you don't have the time, thanks for reading anyway!

--
Mit freundlichen Gruessen / Kind Regards
Toralf Richter

triplesense GmbH
Hanauer Landstraße 186
60314 Frankfurt am Main

Wackamole failing after cable dis-/reconnect [ In reply to ]

munjal at cnds

Oct 22, 2003, 11:36 AM

Post #2 of 5 (2719 views)

Hi Toralf,
As you said, your network has no DNS available. Somewhere in
Wackamole we obtain an index for each machine based on DNS information,
and when that is not available Wackamole tends to screw up. If you care
to dig into the code to fix that for your case, you'll need to give
each machine a unique index through the command line.

Ashima (munjal@cnds.jhu.edu)
-----------------------------------

Wackamole failing after cable dis-/reconnect [ In reply to ]

jesus at omniti

Oct 22, 2003, 10:02 PM

Post #3 of 5 (2731 views)

On Wednesday, Oct 22, 2003, at 06:44 US/Eastern, Toralf Richter wrote:
> I have a Spread/Wackamole setup that works fine, at least in testing,
> as long as I fail one machine of the two-machine setup by rebooting it
> completely or by killing the Wackamole or Spread daemon.
> In those cases the other machine in the little cluster takes over
> with only a very short outage.
>
> The problem I encounter appears after physically disconnecting the
> network cable on either machine and then reconnecting it.
> Step by step I do the following, assuming both machines are initially
> fine and listening:
> 1. Disconnect the NIC cable of machine A (2.4.20 kernel, German
> SUSE 8.2 distro).
> 2. Watch syslog on the other machine B (RH 7.1, kernel 2.4.2-2) and
> wait for Wackamole to complete the ARP spoof.
> 3. Watch ping -t on a Windows box on the same network. After the
> disconnection there is a brief outage of one to two seconds, then the
> other machine jumps in and ping receives good responses again.
> 4. Reconnect the NIC cable of machine A (where the Spread daemon and
> Wackamole have continued running while the cable was off).
> 5. Watch syslog of machine B: Wackamole brings the VIP down.
> 6. Watch syslog of machine A: there is no activity apart from the
> notice that the cable has been reconnected and a 100 Mbit link has been
> established.

Machine A should drop one of its VIPs here -- which from your later
description sounds like it does.

The "bug" is that it doesn't re-arp the VIP that it keeps. I agree
this is not the desired behaviour. I'll need to look at the algorithm
and see if there is a way that A can realize that "at start" it was in
conflict with at least one of its peers and re-arp all conflicting VIPs.
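
For readers unfamiliar with what "re-arp" means on the wire, here is a minimal sketch of a generic gratuitous ARP announcement frame. It is not Wackamole's code (how Wackamole builds and addresses its notifications is not shown in this thread); the function name and the example MAC address are assumptions for illustration. The block only builds the bytes and does not send anything.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ARP_REQUEST = 1


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a broadcast Ethernet frame announcing that `ip` is at `mac`.

    Sender and target protocol address are both `ip`, which is what
    makes the ARP packet "gratuitous".
    """
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our MAC as source, ARP type.
    ethernet = b"\xff" * 6 + hw + struct.pack("!H", ETHERTYPE_ARP)
    # ARP header: Ethernet (1), IPv4 (0x0800), address lengths 6 and 4.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, ARP_REQUEST)
    arp += hw + addr          # sender hardware / protocol address
    arp += b"\x00" * 6 + addr  # target hardware (unknown) / protocol address
    return ethernet + arp


frame = build_gratuitous_arp("00:11:22:33:44:55", "192.168.1.200")
```

Hosts that receive such a frame typically overwrite an existing cache entry for the announced IP, which is the effect the Windows box in the original report is missing after the reconnect.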

// Theo Schlossnagle
// Principal Engineer -- http://www.omniti.com/~jesus/
// Postal Engine -- http://www.postalengine.com/
// Ecelerity: fastest MTA on earth

Wackamole failing after cable dis-/reconnect [ In reply to ]

t.richter at triplesense

Oct 23, 2003, 1:16 AM

Post #4 of 5 (2730 views)

Ashima Munjal wrote:

> Hi Toralf,
> As you said, your network has no DNS available. Somewhere in
> Wackamole we obtain an index for each machine based on DNS information,
> and when that is not available Wackamole tends to screw up. If you care
> to dig into the code to fix that for your case, you'll need to give
> each machine a unique index through the command line.

Hi Ashima,
thanks for the reply.
Just for clarification:
I have the two machines that make up the little cluster using DNS.
However, since I do not maintain the network I am working in, I am
still in the process of finding out whether the two cluster machines
have their own records in the DNS. Clients that connect to either
machine in the cluster via the VIP will never be able to use DNS
resolution and will always connect using the IP address.

Is what you said above about Wackamole keeping a machine index based on
the DNS records still true considering what I have written now?

Would Wackamole work in a scenario where the cluster machines have
access to DNS and have their records in it, and only the client machines
do not have access to DNS information?

--
Mit freundlichen Gruessen / Kind Regards
Toralf Richter

fon 069.94 34 05-10
fax 069.94 34 05-27
t.richter@triplesense.de

triplesense GmbH
Hanauer Landstraße 186
60314 Frankfurt am Main

Wackamole failing after cable dis-/reconnect [ In reply to ]

t.richter at triplesense

Oct 23, 2003, 1:26 AM

Post #5 of 5 (2742 views)

Theo Schlossnagle wrote:

Hi Theo,
thanks for your reply. A few remarks from me:
> Machine A should drop one of its VIPs here -- which from your later
> description sounds like it does.
As I said in the earlier email, on reconnection of the cable to machine
A, machine B drops the VIP, as is visible in the syslog.
Machine A does not do anything regarding the VIP, at least not anything
that appears in the syslog. I would therefore suppose that while the
cable is off, machine A keeps the VIP but naturally cannot respond on it
during the physical disconnection. On disconnection of the cable,
machine B picks up the VIP and broadcasts the updated ARP information.
On reconnection it drops the VIP but does not broadcast ARP information
(if I run wackamole from the command line with -d, I do not see the
shared ARP cache being updated and broadcast). So to me it seems that
Wackamole should broadcast when it shuts a VIP down on a machine, just
as it does when it claims a VIP.
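
The sequence described above can be replayed in a toy model. This is only a sketch of the reasoning, not of Wackamole's implementation: the class, the function names and the MAC addresses are invented, and the client's ARP cache has no timeouts.

```python
from typing import Dict, List


class Client:
    """A host on the LAN with a very simplified ARP cache (no expiry)."""

    def __init__(self) -> None:
        self.arp_cache: Dict[str, str] = {}

    def on_announce(self, ip: str, mac: str) -> None:
        self.arp_cache[ip] = mac


def acquire_vip(owner_mac: str, vip: str, clients: List[Client]) -> None:
    """The owner brings the VIP up and announces it to all clients."""
    for client in clients:
        client.on_announce(vip, owner_mac)


def release_vip(owner_mac: str, vip: str, clients: List[Client]) -> None:
    """The owner brings the VIP down; as observed, nothing is announced."""
    return None


VIP = "192.168.1.200"
MAC_A, MAC_B = "aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb"
windows_box = Client()

acquire_vip(MAC_A, VIP, [windows_box])  # initial state: A holds the VIP
acquire_vip(MAC_B, VIP, [windows_box])  # cable pulled on A: B takes over
release_vip(MAC_B, VIP, [windows_box])  # cable back: B drops, A stays silent

actual_holder = MAC_A
cache_is_stale = windows_box.arp_cache[VIP] != actual_holder
```

In this model the client keeps pointing at machine B until someone announces again, either B on release or A after the reconnect.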

> The "bug" is that it doesn't re-arp the VIP that it keeps. I agree this
> is not the desired behaviour. I'll need to look at the algorithm and
> see if there is a way that A can realize that "at start" it was in
> conflict with at least one of its peers and re-arp all conflicting VIPs.
Correct me if I am wrong, but I think the problem is that after the
reconnection of the cable Wackamole does not really come into an "at
start" position, because it does not notice or respond to the down or up
condition of the physical network link.
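
As a rough illustration of what noticing the link condition could look like, the sketch below polls the `carrier` file that Linux exposes under sysfs. Note the assumptions: sysfs exists on 2.6 and later kernels, not on the 2.4 kernels used in this thread, and the class and function names are hypothetical and not part of Wackamole.

```python
from pathlib import Path
from typing import Optional


def link_is_up(iface: str, sysfs_root: str = "/sys/class/net") -> bool:
    """Return True if the kernel reports carrier on `iface`."""
    try:
        text = (Path(sysfs_root) / iface / "carrier").read_text()
    except OSError:
        # Missing interface, or interface administratively down.
        return False
    return text.strip() == "1"


class LinkWatcher:
    """Remembers the last link state and reports down-to-up transitions."""

    def __init__(self, iface: str, sysfs_root: str = "/sys/class/net") -> None:
        self.iface = iface
        self.sysfs_root = sysfs_root
        self.last: Optional[bool] = None

    def poll(self) -> bool:
        """Return True only when the link came back up since the last poll."""
        current = link_is_up(self.iface, self.sysfs_root)
        came_up = self.last is False and current
        self.last = current
        return came_up
```

A daemon could call `poll()` periodically and re-announce the VIPs it holds whenever it returns True.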

I would be happy if that helped you. If my further explanations seem
redundant to you, I apologize.

Have a nice day.

--
Mit freundlichen Gruessen / Kind Regards
Toralf Richter

fon 069.94 34 05-10
fax 069.94 34 05-27
t.richter@triplesense.de

triplesense GmbH
Hanauer Landstraße 186
60314 Frankfurt am Main