Mailing List Archive

Issues with ARP notification on latest CVS of wackamole
Greetings;

We've had some problems with wackamole and ARP notification. Specifically, the
Cisco PIX firewall doesn't seem to get the ARP notifications from all the
machines in the wackamole cluster behind it. We found a posting in the
mailing list archives about this, and it appeared that it was all taken care
of (problem verified and fixed) in the latest CVS version. It was something
about now notifying the bc_mac address instead of the ze_mac address.

So we put in the latest CVS and, it appears, still have the problem. If we
fail all machines in the cluster, then when they come back up, outside
references to those machines hit the Cisco PIX, and the PIX still has the ARP
entries for those IPs pointing at the wrong (old) machines. We can immediately
fix this problem by doing a "clear arp" on the PIX. However, I don't think
this is a PIX issue.

I'm wondering if perhaps the arp entries are being cleaned up, but just not
as quickly as I would have thought. It's hard to test this theory because I
can't keep those machines down longer than a minute or two or top brass gets
a little irked :)

So in the search for what could be causing this, I'm wondering about the
time-related variables in wackamole.conf, and whether I fully understand
the implications of their settings. Here is the file
(identical on all machines in the cluster except for Spread=):

Spread = 4803@britney.kwcorp.com
Group = web
SpreadRetryInterval = 5s
Control = /var/tmp/wack.it
Prefer None
VirtualInterfaces {
{em0:192.168.55.100/32 em0:192.168.55.101/32 em0:192.168.55.102/32
em0:192.168.55.103/32 em0:192.168.55.104/32}
{em0:192.168.55.110/32 em0:192.168.55.111/32 em0:192.168.55.112/32
em0:192.168.55.113/32 em0:192.168.55.114/32}
{em0:192.168.55.120/32 em0:192.168.55.121/32 em0:192.168.55.122/32
em0:192.168.55.123/32 em0:192.168.55.124/32}
{em0:192.168.55.130/32 em0:192.168.55.131/32 em0:192.168.55.132/32
em0:192.168.55.133/32 em0:192.168.55.134/32}
}
Arp-Cache = 90s
Notify {
em0:192.168.55.1/32
em0:192.168.55.0/24 throttle 128
arp-cache
}
balance {
AcquisitionsPerRound = all
interval = 4s
}
mature = 5s

Basically there are 4 machines in the cluster, and each machine has 4 VIPs
that should move as a group. 192.168.55.1 is the address of the inside
interface on the pix (where these webserver machines are located). I can't
find a lot in the documentation to explain exactly what the time related
settings really do, like SpreadRetryInterval, arp-cache, throttle 128,
interval, and mature. I have some basic idea what they mean, but not really
the impact or how to intelligently set them. Am I headed down the right path
here, and if so, can someone educate me a bit more on these settings? Also,
to my knowledge there are no special access lists or configuration to the
pix that would need to be done to allow this to happen.

Thanks!

Jay West
Knights Direct

---
[This E-mail scanned for viruses by Declude Virus]
Issues with ARP notification on latest CVS of wackamole [ In reply to ]
On Apr 12, 2004, at 1:37 PM, Jay West wrote:
> So we put in the latest CVS and still have the problem it appears. If we
> fail all machines in the cluster, when they come back up outside references
> to those machines hit the cisco pix, and the pix has the arp entries for
> those IP's on the wrong (old) machines. We can immediately fix this problem
> by doing a "clear arp" on the cisco pix. However, I don't think this is a
> pix issue.

I think it is the PIX not allowing arp-spoofing -- it is a firewall
after all :-)

Perhaps the ARP reply packets need to be addressed specifically to the
PIX's MAC address. I looked through the documentation, but I don't see
anything that mentions how to send unsolicited ARP replies to a PIX.
Any insight would be appreciated. Whatever it is, I am sure it is a
simple software fix.
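
For reference, here is a minimal sketch of what a unicast ARP reply addressed
to the PIX's MAC would look like on the wire. This is not wackamole's
notification code; it only packs the frame bytes in Python. The two MAC
addresses are made-up placeholders, the VIP is one from the config posted
earlier, and 192.168.55.1 is the PIX inside interface mentioned there.
Actually transmitting the frame needs a raw socket or BPF device and root
privileges, which is left out here.

```python
import socket
import struct


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return bytes(int(p, 16) for p in parts)


def build_arp_reply(vip: str, vip_mac: str, target_ip: str, target_mac: str) -> bytes:
    """Build an Ethernet frame carrying an ARP reply ("vip is at vip_mac").

    The frame is unicast to target_mac instead of the broadcast address.
    """
    sender_hw = mac_to_bytes(vip_mac)
    target_hw = mac_to_bytes(target_mac)

    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP).
    ethernet = target_hw + sender_hw + struct.pack("!H", 0x0806)

    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        2,                            # opcode 2: reply
        sender_hw,                    # sender hardware address (new VIP owner)
        socket.inet_aton(vip),        # sender protocol address (the VIP)
        target_hw,                    # target hardware address (the PIX)
        socket.inet_aton(target_ip),  # target protocol address (the PIX)
    )
    return ethernet + arp


# Hypothetical MACs; the IP addresses come from the posted wackamole.conf.
frame = build_arp_reply(
    vip="192.168.55.100",
    vip_mac="02:00:00:00:00:01",
    target_ip="192.168.55.1",
    target_mac="02:00:00:00:00:fe",
)
```

Whether the PIX would honor such a frame is exactly the open question here;
the sketch only shows the difference from a broadcast notification, which is
the destination MAC in the Ethernet header and the target fields in the ARP
payload.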

// Theo Schlossnagle
// Principal Engineer -- http://www.omniti.com/~jesus/
// Postal Engine -- http://www.postalengine.com/
// Ecelerity: fastest MTA on Earth
Issues with ARP notification on latest CVS of wackamole [ In reply to ]
Theo wrote...
> I think it is the PIX not allowing arp-spoofing -- it is a firewall
> after all :-)
>
> Perhaps the ARP reply packets need to be addressed specifically to the
> PIX's MAC address. I looked through the documentation, but I don't see
> anything that mentions how to send unsolicited ARP replies to a PIX.
> Any insight would be appreciated. Whatever it is, I am sure it is a
> simple software fix.

I just got a response from a Cisco engineer - no dice really. There is no
way to make the PIX accept unsolicited arp requests. However, the timeout
can be set from the standard 14400 seconds down to 60 seconds. Any less than
60 seconds and the pix will probably start losing packets.

So this leaves me with two possible solutions. First, we can set the arp
cache timeout on the pix down to 60. This will work fine, but we really hate
the 60 seconds of downtime for a failover. It matters most when we push a cvs
update to each machine for new website code: the system doing the pushing
fails a machine in the cluster (wackatrl -f), stops apache on the machine
(ssh xxxxx apachectl stop), rsyncs the new code, starts apache, then starts
wackamole (wackatrl -s). It then moves to the next machine in the cluster.
For small CVS updates it's entirely possible that 3 or 4 of the machines will
be in a failed state or just recovering at the same time (the same 60 second
window). Since there are 4 machines in the web cluster, this would result in
too large a "downtime" window. Round robin DNS would mean only one of 4
machines (or none) can service any requests. We may have to go that route
though.
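
The push sequence above can be sketched roughly as follows. This is only an
illustration in Python, not our actual push script: the host names and paths
are placeholders, running wackatrl over ssh is an assumption, and the pause
between hosts (defaulting to the 60 second PIX timeout) is one possible way
to keep several machines from sitting in the stale-ARP window at once, at
the cost of a slower push.

```python
from typing import List


def push_commands(host: str, src: str, dest: str) -> List[List[str]]:
    """Return the per-host command sequence in the order described above."""
    ssh = ["ssh", host]
    return [
        ssh + ["wackatrl", "-f"],                # fail the node so its VIPs move
        ssh + ["apachectl", "stop"],             # stop apache on the node
        ["rsync", "-a", src, f"{host}:{dest}"],  # push the new code
        ssh + ["apachectl", "start"],            # start apache again
        ssh + ["wackatrl", "-s"],                # let the node rejoin the cluster
    ]


def rolling_plan(hosts: List[str], src: str, dest: str,
                 settle_seconds: int = 60) -> List[List[str]]:
    """Build the full plan, pausing between hosts but not after the last one."""
    plan: List[List[str]] = []
    for index, host in enumerate(hosts):
        plan.extend(push_commands(host, src, dest))
        if index < len(hosts) - 1:
            plan.append(["sleep", str(settle_seconds)])
    return plan


# Hypothetical host names and paths, for illustration only.
plan = rolling_plan(["web1", "web2"], "/srv/site/", "/var/www/site/")
for command in plan:
    print(" ".join(command))
```

Each entry is an argument list, so the plan could be handed to
subprocess.run(command, check=True) one step at a time, stopping on the
first failure.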

However, there may be a second solution. Is there some way that whenever
wackamole brings up any VIPs, it can call an external program? I'm thinking
along the lines of perhaps sending an SNMP request to the PIX to clear its
arp cache when this happens. Any thoughts?

Jay West

Issues with ARP notification on latest CVS of wackamole [ In reply to ]
Jay West wrote:

> However, there may be a second solution. Is there some way that whenever
> wackamole brings up any VIPs, it can call an external program? I'm thinking
> along the lines of perhaps sending an SNMP request to the PIX to clear its
> arp cache when this happens. Any thoughts?

Should work like a charm. The latest CVS version supports both embedded
C and Perl modules for external functions. You should be able to hack an
"announce" Perl handler using Net::SNMP without much trouble at all.

There is an example or three in the distribution, though no docs currently.

We use this in production to add routes on some of our nameservers
running wackamole. An SNMP ping is much less cumbersome than calling
out to /sbin/route :-)

Can you post your results here? I'd love to use this example in my
OSCON talk in July.

--
// Theo Schlossnagle
// Principal Engineer -- http://www.omniti.com/~jesus/
// Postal Engine -- http://www.postalengine.com/
// Ecelerity: fastest MTA on Earth