Mailing List Archive: FW: pulse issues?

Hello all...

This is a copy of a message I posted to the piranha and ha-dev mailing
lists, so I apologize if anyone is subscribed to multiple lists and is
getting this more than once, but to be honest I'm not sure which list is
appropriate...so here goes

I've downloaded the HA software, experimented with it, and have found a
couple of issues.

I'm using (basically) the instructions for creating an lvs cluster from the
document "Setting up a five-node cluster" off the RH website. Except I've
only got 4 machines, so there are 2 lvs routers and 2 webservers. I'm using
NAT to route. So, as far as I understand, the initial state is that lvs1
acting as primary router has its real public and private interfaces up and 2
aliases, one for the virtual server externally and one for the virtual nat
router internally. Pulse is heartbeating over the external interface to
lvs2, now the backup server. lvs2 has only its real interfaces up. In the
event that the pulse is lost, lvs2 brings up all virtual interfaces and
begins running nanny in order to act as the new virtual router and web
server.

I have 2 main issues with this scenario.

1) Pulse cannot tell the difference between a source and destination
failure. I.e., if the backup lvs router for some reason has a broken
connection to the network, it will assume the primary server has failed. It
will then bring up its virtual aliases, begin arp spoofing, and attempt to
route. The webservers will all use lvs2 for their router (I deduced this
experimentally as well as theoretically). But since lvs2 has no external
connection, it cannot route, so the cluster fails. The same might happen if
the external network connection on the primary failed, because the primary
would leave its internal virtual interfaces up AND the backup would also
bring its internal virtual interfaces up. Therefore 2 machines would be
responding to the arps. I don't know exactly what the result of that
scenario would be.
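
To make the ambiguity in issue 1 concrete, here is a toy model in Python. It
is only a sketch of the behaviour I observed, not pulse's actual code; the
names Scenario, heartbeat_seen_by_backup and backup_takes_over are made up
for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical description of one failure case (not a pulse structure)."""
    primary_alive: bool
    backup_external_link_up: bool


def heartbeat_seen_by_backup(s: Scenario) -> bool:
    # The backup only hears a heartbeat if the primary is sending one
    # AND the backup's own external link is able to receive it.
    return s.primary_alive and s.backup_external_link_up


def backup_takes_over(s: Scenario) -> bool:
    # Assumed behaviour, as described above: a missing heartbeat is always
    # read as "the primary has failed".
    return not heartbeat_seen_by_backup(s)


cases = {
    "primary really died": Scenario(primary_alive=False, backup_external_link_up=True),
    "backup lost its own link": Scenario(primary_alive=True, backup_external_link_up=False),
    "everything healthy": Scenario(primary_alive=True, backup_external_link_up=True),
}
for name, s in cases.items():
    print(f"{name}: backup takes over = {backup_takes_over(s)}")
```

The first two cases look identical from the backup's point of view, so it
takes over in both, even though in the second one the primary is fine.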

Now, I BELIEVE this can be solved if I use direct routing instead of NAT.
This way, since the webservers would be returning responses directly to the
clients, they would only be dependent on the routers for incoming requests,
and I haven't yet figured out a plausible scenario where pulse would allow
both external virtual interfaces to be up at the same time. Even if it did,
who cares, as long as the packets get to the webservers.

2) Pulse only has knowledge of one interface at a time. Therefore, if the
internal interface on the primary lvs goes down, and with it the connection
to the webservers, pulse will not transfer control to the backup because
the backup continues to get a heartbeat through the external interface. Thus
the cluster fails, because lvs1 continues to act as a virtual server without
being able to communicate with the real servers.
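
The blind spot in issue 2 can be shown the same way. Again, this is a toy
model under my own assumptions, not how pulse is implemented; the function
failover_triggered and the link names "external" and "internal" are
hypothetical.

```python
from typing import Dict, List


def failover_triggered(primary_links: Dict[str, bool], monitored: List[str]) -> bool:
    """Return True if the backup would take over.

    primary_links maps a link name to whether it is up on the primary.
    monitored lists the links that actually carry a heartbeat; a dead link
    that is not monitored is invisible to the backup.
    """
    return any(not primary_links[name] for name in monitored)


# The primary's internal NIC (towards the webservers) has failed.
primary_links = {"external": True, "internal": False}

# Heartbeat over the external interface only: the failure goes unnoticed.
print(failover_triggered(primary_links, monitored=["external"]))

# Heartbeat over both interfaces: the failure is detected.
print(failover_triggered(primary_links, monitored=["external", "internal"]))
```

With only the external interface monitored, no failover happens even though
lvs1 can no longer reach the real servers.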

What I need to know is whether or not there's a better way to configure
pulse to account for these situations. Perhaps multiple instances could run,
communicate with one another, and make decisions based on the states of
all of the nics. Or pulse could heartbeat more than one server, in an
effort to diagnose exactly where the failure sits. Or is that asking too
much?
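
Roughly, the kind of decision logic I have in mind for the backup looks like
the sketch below. It assumes a heartbeat per nic plus a check against some
third host (a "witness", e.g. the upstream gateway or one of the webservers)
reachable through each nic. None of this exists in pulse as far as I know;
Action, decide and the witness idea are all my own assumptions, and the
actual probing (pings, etc.) is left out.

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    STAY_BACKUP = "stay backup"
    TAKE_OVER = "take over"
    HOLD = "hold: local fault suspected"


def decide(heartbeat_seen: Dict[str, bool], witness_reachable: Dict[str, bool]) -> Action:
    """Decide what the backup should do.

    heartbeat_seen:    nic name -> was the primary's heartbeat seen on it?
    witness_reachable: nic name -> can a third host be reached through it?
    """
    if all(heartbeat_seen.values()):
        return Action.STAY_BACKUP
    for nic, seen in heartbeat_seen.items():
        # Heartbeat missing but our own link demonstrably works:
        # the fault is most likely on the primary's side.
        if not seen and witness_reachable.get(nic, False):
            return Action.TAKE_OVER
    # Heartbeats are missing only on nics where we cannot reach anything,
    # so the backup itself is probably the broken one.
    return Action.HOLD


# Issue 1: the backup's own external link is broken.
print(decide({"external": False, "internal": True},
             {"external": False, "internal": True}))

# Issue 2: the primary's internal nic has failed.
print(decide({"external": True, "internal": False},
             {"external": True, "internal": True}))
```

In the first case the backup holds off instead of grabbing the virtual
addresses; in the second it takes over even though the external heartbeat is
still arriving.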

Also, any general advice on the advantages of buying the HA software package
(i.e., the tech support) vs. just downloading and doing it myself would be
appreciated.

Thanks in advance for your advice/answers...

-----------------------
Brian J. Sweeney
"I want to know God's thoughts ... the rest are details." -Albert Einstein
Systems Admin, imagedog
bsweeney@imagedog.com