Mailing List Archive: Initial resource takeover problems in heartbeat-0.4.5a

Nov 5, 1999, 11:50 PM

Post #1 of 6 (1911 views)

Several different people have reported problems with initial resource takeover
in heartbeat 0.4.5a.

I tried to reproduce it here, and I could -- on one machine. When I recompiled
it from source, it seemed to go away. One of the people reporting it seemed to
have the same experience.

I have added a little debugging output to the code, fixed a problem with logging
from shell scripts, and now call the result 0.4.5b.

The download page now points to it.

Please let me know what you find. I would encourage anyone who is willing to
test it to try the RPM version first.

Thanks!!

-- Alan Robertson
alanr@bell-labs.com

Initial resource takeover problems in heartbeat-0.4.5a

george at captech

Nov 7, 1999, 3:33 AM

Post #2 of 6 (1883 views)

On Fri, 5 Nov 1999, Alan Robertson wrote:

> Several different people have reported problems with initial resource takeover
> in heartbeat 0.4.5a.
>
> I tried to reproduce it here, and I could -- on one machine. When I recompiled
> it from source, it seemed to go away. One of the people reporting it seemed to
> have the same experience.
>

If this is the same problem I had a while back ... the secondary server would
immediately take the HA resource on startup, and even recompiling did not fix
it. I had to move over a binary that I had compiled on a different machine.

The problem seemed to me to be that it would try to check to see if the
resource was being served but would return immediately (not enough time
for a heartbeat to arrive) thinking it was the only server alive and start
serving the ha resource. It would then ignore the fact that the primary
was alive from then on.
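
The startup race described above can be sketched as follows. This is only an
illustration of the idea of waiting for a peer heartbeat before claiming a
resource, not heartbeat's actual code; the function name, the `peer_alive`
callable, and the timing parameters are all assumptions made for the example.

```python
import time


def should_take_resources(peer_alive, grace_seconds=2.0, poll_interval=0.05,
                          clock=time.monotonic, sleep=time.sleep):
    """Return True only if no peer heartbeat is seen during the grace period.

    peer_alive is a hypothetical callable that returns True once a heartbeat
    from the other node has arrived.
    """
    deadline = clock() + grace_seconds
    while clock() < deadline:
        if peer_alive():
            # A peer is already up, so do not claim the resource.
            return False
        sleep(poll_interval)
    # One last check at the end of the grace period.
    return not peer_alive()
```

Returning immediately, as in the failure described, corresponds to a grace
period of zero: the node decides before any heartbeat has had time to arrive.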

I >>THINK<< that if I took the primary down and restarted it, the
secondary would then give up the resource but I have been to sleep since
then.

I recompiled several times then finally moved over the binary from the
primary and it worked fine. I also think that one system is using gcc272
while the other is using 2.95 but I do not know if that is a problem
because I think I tracked it down to something in ResourceManager before
giving up and copying the binary over.

Some things:

Any reason hardware flow control is not enabled on the ports? I get a lot
of serial overrun errors and wonder if that could cause problems.

Any reason the primary has to regain control? I posted this in a private
message a while back (sorry I have not posted here but I have been very
busy building a global WAN the last couple of weeks). I would like to see
some method of designating equal peers, either of which could be primary,
and possibly a backup machine that should give up the resource when one of
the primaries comes back up. In other words ... multiple primaries.

My reason for needing this is because I would like to use heartbeat for
routers. These routers do NAT. If one fails and the resource moves to the
backup, all sessions through that machine must be restarted. When the
original primary comes back up, I would like it to assume the role of
backup rather than taking back the resource and causing all sessions to
have to be restarted again. I had in mind a configuration like this:

resource_group router {
    ip_address {
        192.168.1.1;
        192.168.2.1;
        172.16.5.1;
        10.2.4.1;
    };
    command {
        zebra::-d;
        ospfd::-d;
    };
    primary_server {
        SC-A;
        SC-B;
    };
    secondary_server {
        fubar;
    };
};

Meaning that normally SC-A is the primary serving the 4 IP addresses and
running zebra and ospfd. If it fails, SC-B takes over the IP addresses and
starts the routing processes. If SC-A returns, it does nothing unless SC-B
fails. fubar does nothing unless both SC-A and SC-B fail; then it takes over
the resources, but only until one of them reappears, at which point it gives
them back.
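
The ownership rules described above can be written down as a small decision
function. This is a sketch of the proposed policy only, not anything heartbeat
implements; `choose_owner` and its arguments are hypothetical names, and the
assumption that the first listed live primary wins at startup is mine.

```python
def choose_owner(current, alive, primaries, secondaries):
    """Pick which node should hold the resource group.

    current     -- node holding the resources now, or None
    alive       -- set of node names currently sending heartbeats
    primaries   -- equal peers, in order of preference at initial startup
    secondaries -- backups that hold resources only while no primary is up
    """
    # A live primary that already holds the resources keeps them (no failback).
    if current in primaries and current in alive:
        return current
    # Otherwise the first live primary takes over, also from a secondary.
    for node in primaries:
        if node in alive:
            return node
    # No primary is up: a live secondary keeps or takes the resources.
    if current in secondaries and current in alive:
        return current
    for node in secondaries:
        if node in alive:
            return node
    return None
```

With `primaries = ["SC-A", "SC-B"]` and `secondaries = ["fubar"]`, SC-B keeps
the resources when SC-A comes back, while fubar hands them over as soon as
either primary reappears.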

Initial resource takeover problems in heartbeat-0.4.5a

alanr at bell-labs

Nov 7, 1999, 7:06 AM

Post #3 of 6 (1900 views)

George Bonser wrote:
>
> On Fri, 5 Nov 1999, Alan Robertson wrote:
>
> > Several different people have reported problems with initial resource takeover
> > in heartbeat 0.4.5a.
> >
> > I tried to reproduce it here, and I could -- on one machine. When I recompiled
> > it from source, it seemed to go away. One of the people reporting it seemed to
> > have the same experience.
> >
>
> If this is the same problem I had a while back ... secondary server would
> immediately take the ha resource on startup, even recompiling did not fix
> it. I had to move a binary that I compiled on a different machine over.
>
> The problem seemed to me to be that it would try to check to see if the
> resource was being served but would return immediately (not enough time
> for a heartbeat to arrive) thinking it was the only server alive and start
> serving the ha resource. It would then ignore the fact that the primary
> was alive from then on.
>
> I >>THINK<< that if I took the primary down and restarted it, the
> secondary would then give up the resource but I have been to sleep since
> then.
>
> I recompiled several times then finally moved over the binary from the
> primary and it worked fine. I also think that one system is using gcc272
> while the other is using 2.95 but I do not know if that is a problem
> because I think I tracked it down to something in ResourceManager before
> giving up and copying the binary over.
>
> Some things:
>
> Any reason hardware flow control is not enabled on the ports? I get a lot
> of serial overrun errors and wonder if that could cause problems.

I've seen that in recent kernels. I didn't see that in older kernels.
It may be something that I'm doing wrong. People on the Pilot-UNIX mailing list
were complaining about kernel problems with serial ports. Perhaps this is
something similar. I'll reexamine the flow control issue.

> Any reason the primary has to regain control? I posted this in a private
> message a while back (sorry I have not posted here but I have been very
> busy building a global WAN the last couple of weeks). I would like to see
> some method of designating equal peers, either one could be primary and
> possibly a backup machine that should give up the resource when one of the
> primaries comes back up. In other words ... multiple primaries.
>
> My reason for needing this is because I would like to use heartbeat for
> routers. These routers do NAT. If one fails and the resource moves to the
> backup, all sessions through that machine must be restarted. When the
> original primary comes back up, I would like it to assume the role of
> backup rather than taking back the resource and causing all sessions to
> have to be restarted again. I had in mind a configuration like this:
>
> resource_group router {
> ip_address {
> 192.168.1.1;
> 192.168.2.1;
> 172.16.5.1;
> 10.2.4.1;
> };
> command {
> zebra::-d;
> ospfd::-d;
> };
> primary_server {
> SC-A;
> SC-B;
> };
> secondary_server {
> fubar;
> };
>
> };
>
> Meaning that normally SC-A is the primary serving the 4 IP addresses and
> running zebra and ospfd. If it fails, SC-B takes over the IP addresses and
> starts the routing processes. If SC-A returns, it does nothing unless SC-B
> fails. Fubar does nothing unless SC-A and SC-B fail then it takes over
> resources only until one of them reappears and gives them back.

This is a very common thing to use heartbeat for.
The only reason heartbeat works the way it does is that it is very simple.
This is something I could fix in the current version of heartbeat with some
trouble. It wouldn't take the complete rearchitecting which Volker is
beginning. On the other hand, I couldn't provide the complete (3-node) fix
until we get the new code done.

Can you live with this a few more months for the new code, or should I finish
the protocol work I'm in the middle of and try to do something short-term to
allow a machine to keep a resource once it has it?

-- Alan Robertson
alanr@bell-labs.com

Initial resource takeover problems in heartbeat-0.4.5a

george at captech

Nov 7, 1999, 2:10 PM

Post #4 of 6 (1909 views)

>
> Can you live with this a few more months for the new code, or should I finish
> the protocol work I'm in the middle of and try and do something short-term to
> allow a machine to keep a resource once it has it?

If this is already being worked on, it is probably not worth the trouble of
risking a break and a lot of debugging of the currently working code. If
it should be put into the existing stuff, it should probably be
configurable because I have a feeling there are a lot of people out there
that depend on it working as it now does. BTW, I was not suggesting a
config file format in the message I posted, it is just that the format I
used tends to show the logic a little better ... at least in my mind.
Something along the lines of a gated or bind-8 config file. Mon even has
something somewhat like that.

I can live with it since we have decided to go with a commercial router
for the initial rollout that does failover quite well but longer term, we
would like to use Linux in more of our operations. From my point of view
this is a development issue more than a current production issue and I am
up to my eyes in alligators in production issues at the moment :)

Re: Initial resource takeover problems in heartbeat-0.4.5a

th at ant

Nov 8, 1999, 2:05 PM

Post #5 of 6 (1900 views)

Hi,
On Fri, Nov 05, 1999 at 11:50:09PM -0700, Alan Robertson wrote:
> Several different people have reported problems with initial resource takeover
> in heartbeat 0.4.5a.
>
> I tried to reproduce it here, and I could -- on one machine. When I recompiled
> it from source, it seemed to go away. One of the people reporting it seemed to
> have the same experience.
>
> I have added a little debug to the code, fixed a problem with logging from shell
> scripts, and now call it 0.4.5b.
>
> It's now pointed to by the download page.
>
> Please let me know what you find. I would encourage anyone who is willing to
> try the RPM version first.

OK, I gave it a try, and no luck (Debian system). So I added some debugging
to find the place where it fails. It seems that the command which is
run by req_our_resources does not respond in time. I changed the fgets
loop to retry the read more than once if the first read fails, waiting 1 second
after every failed read, and it works .....
I have no idea why the first read fails ..., errno is set to 4.

Hope this helps a little bit ....

Thomas
--
-----------------------------------------------
| Thomas Hepper th@ant.han.de |
| ( If the above address fail try ) |
| ( thomas.hepper@planet-interkom.de) |
-----------------------------------------------

Re: Initial resource takeover problems in heartbeat-0.4.5a

alanr at bell-labs

Nov 8, 1999, 10:50 PM

Post #6 of 6 (1895 views)

Thomas Hepper wrote:
>
> Hi,
> On Fri, Nov 05, 1999 at 11:50:09PM -0700, Alan Robertson wrote:
> > Several different people have reported problems with initial resource takeover
> > in heartbeat 0.4.5a.
> >
> > I tried to reproduce it here, and I could -- on one machine. When I recompiled
> > it from source, it seemed to go away. One of the people reporting it seemed to
> > have the same experience.
> >
> > I have added a little debug to the code, fixed a problem with logging from shell
> > scripts, and now call it 0.4.5b.
> >
> > It's now pointed to by the download page.
> >
> > Please let me know what you find. I would encourage anyone who is willing to
> > try the RPM version first.
>
> OK, I gave it a try, and no luck (Debian system). So I added some debugging
> to find the place where it fails. It seems that the command which is
> run by req_our_resources does not respond in time. I changed the fgets
> loop to retry the read more than once if the first read fails, waiting 1 second
> after every failed read, and it works .....
> I have no idea why the first read fails ..., errno is set to 4.

Errno 4 is EINTR. This process has an alarm running, and its success or
failure probably depends on where in the alarm cycle the read occurs. A good bit
of my testing isn't on a real cluster, so I never hear from any other
machines. That means my test cases are synchronized to the alarm code, and I
almost always have a full second before the SIGALRM goes off.
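
The retry described in the quoted message can be sketched like this. It is
written in Python rather than heartbeat's C, so it only illustrates the logic:
`read_line_with_retry` and the `read_line` callable are hypothetical names,
and `InterruptedError` stands in for a C read that fails with errno set to
EINTR. Current Python versions already retry most interrupted system calls on
their own, so the stand-in reader in the usage note is what raises the error.

```python
import time


def read_line_with_retry(read_line, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call read_line(), retrying when the read is interrupted by a signal.

    read_line is a hypothetical callable that returns one line of output or
    raises InterruptedError (the Python form of EINTR) when a signal such as
    SIGALRM cuts the read short.
    """
    for attempt in range(max_attempts):
        try:
            return read_line()
        except InterruptedError:
            if attempt == max_attempts - 1:
                # Out of attempts: report the failure to the caller.
                raise
            # Wait before the next try, as in the 1-second wait described.
            sleep(delay)
```

A reader that is interrupted twice and then succeeds returns its line on the
third attempt, after two waits of `delay` seconds.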

This sounds like a great find!

Thanks Thomas!

-- Alan Robertson
alanr@bell-labs.com