Mailing List Archive: Re: I just installed heartbeat.

Alexandre Perematko wrote:
>
> Hello,
>
> >> at version 2.2.12 of kernel. Is it developed for a different line of
> >> kernels?
> > I may have bollixed up the version I distributed. Can you tell me what
> > the error message is?
>
> cd kernel.d; make
> make[1]: Entering directory `/home/alexp/heartbeat-0.4.4/kernel.d'
> gcc -Wall -fomit-frame-pointer -c proc_ha.c -o ha.o
> proc_ha.c:205: warning: initialization from incompatible pointer type
> proc_ha.c:264: warning: initialization from incompatible pointer type
> proc_ha.c: In function `proc_hactl_write':
> proc_ha.c:326: void value not ignored as it ought to be
> proc_ha.c:326: void value not ignored as it ought to be
> proc_ha.c: In function `proc_hb_read':
> proc_ha.c:707: structure has no member named `f_dentry'
> proc_ha.c:743: void value not ignored as it ought to be
> proc_ha.c:744: void value not ignored as it ought to be
> proc_ha.c: In function `proc_read_allattrs':
> proc_ha.c:756: structure has no member named `f_dentry'
> proc_ha.c:788: void value not ignored as it ought to be
> proc_ha.c:789: void value not ignored as it ought to be
> proc_ha.c:790: void value not ignored as it ought to be
> proc_ha.c:791: void value not ignored as it ought to be
> proc_ha.c: In function `init_module':
> proc_ha.c:930: warning: assignment from incompatible pointer type
> make[1]: *** [ha.o] Error 1
> make[1]: Leaving directory `/home/alexp/heartbeat-0.4.4/kernel.d'
> make: *** [kernel] Error 2

Yes. This IS something I messed up. One of those misguided attempts to improve
Volker's code :-) I don't remember exactly what the problem is, but there is a
#ifdef LINUX_VERSION_CODE >= VERSION_CODE(2,1,0) that isn't working for some
reason. Since it appears to compile correctly on my machine, perhaps it's a
little difference between Debian and Red Hat...
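
One possible culprit, assuming the directive really reads the way I quoted it from memory: #ifdef only asks whether a macro is defined, and the comparison after the name is ignored (or at best draws a warning), so the 2.1+ branch would be compiled even against 2.0-style headers. Here is a minimal sketch of the intended test, not the actual proc_ha.c code. KERNEL_VERSION is the macro that <linux/version.h> provides; the fallback definitions and the HA_HAVE_DENTRY name are made up so the fragment compiles outside a kernel tree.

```c
/* Stand-ins so this builds without kernel headers. In a real module both
 * macros come from <linux/version.h>; the 2.0.36 value is an assumption
 * chosen only to exercise the "old headers" branch. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#endif
#ifndef LINUX_VERSION_CODE
#define LINUX_VERSION_CODE KERNEL_VERSION(2, 0, 36)
#endif

/* #if evaluates the comparison. #ifdef would only check that
 * LINUX_VERSION_CODE is defined and would always take the first branch. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 1, 0)
#define HA_HAVE_DENTRY 1   /* hypothetical flag: struct file has f_dentry */
#else
#define HA_HAVE_DENTRY 0   /* older struct file layout, no f_dentry */
#endif

/* Returns 1 when the 2.1+ code path was selected at compile time. */
int ha_uses_dentry_api(void)
{
    return HA_HAVE_DENTRY;
}
```

If the headers gcc finds under /usr/include/linux are older than the running kernel, the check would pick the old code path even on a 2.2 box, which would fit a Debian versus Red Hat difference. That is a guess, not a diagnosis.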

Volker's in the process of updating this driver substantially. He hasn't said
when he expects to be done.

> >> 2) Single IP failover works OK. But when the node that was down tries
> >> to rejoin the cluster, strange things happen. Several times I got the
> >> situation with the same IP address existing on both nodes. The only way
> >> to restart the cluster is to restart heartbeat on both nodes.
>
> > If you lose heartbeat communication without the heartbeat daemon going
> > down, then it's going to do that. I highly recommend redundant
> > communication paths. A copy of the logs from both machines would help
> > here. Do you know what the sequence of events was leading up to this
> > point?
>
> I was able to recreate the situation several times today. It looks like it
> happens if you restart heartbeat within 10-20 seconds after failover. Attached
> are log files from both nodes. The clocks on both are synchronized. Both nodes
> run Debian (slink) with the newest version of the nettools package.

If it usually works, the version of nettools probably isn't at issue. Thanks
for reading the notes (or is it the list?). You seem to be on top of the known
issues here. That's great!

Given your description, I could imagine that this could happen. There is a
particular timing that isn't quite as it should be. If you try and revert back
to the "master" node less than 30 seconds after it's taken over (I think this
means a minimum interval of 40 seconds or less), something like this could
happen.

Since a 40 second reboot time would be quite phenomenal, I hadn't worried too
much about this issue yet. Is this a big issue for you at this point?
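
For illustration, the window described above could be closed with a simple hold-off check before a node gives resources back. This is only a sketch under assumptions, not what heartbeat does today: the function name is made up, and the 30-second value is just my estimate from above.

```c
#include <time.h>

/* Assumed hold-off, taken from the rough 30-second estimate above. */
#define FAILBACK_HOLDOFF_SECS 30

/* Returns 1 if at least FAILBACK_HOLDOFF_SECS have elapsed since the
 * takeover, 0 if giving the resources back now would fall inside the
 * risky window. Hypothetical helper, not part of heartbeat. */
int failback_is_safe(time_t takeover_time, time_t now)
{
    return difftime(now, takeover_time) >= (double)FAILBACK_HOLDOFF_SECS;
}
```

A restart 10-20 seconds after failover would be refused by a check like this, and the caller would simply retry later.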

> Uname -a:
> Linux node1 2.2.13pre9 #4 SMP Wed Sep 29 16:53:02 EDT 1999 i586 unknown
> Linux node2 2.2.13pre9 #4 SMP Wed Sep 29 16:53:02 EDT 1999 i586 unknown
> uname -n works OK on both nodes (producing node1 & node2).
>
>
> There are several attempts to produce the problem in the logs, but only the
> last succeeded.

That's how this kind of stuff works :-)

I'll look at the logs, and see if I can verify exactly what's going on.

> >> 3) I tried to create a configuration with the following lines in the
> >> ipresourses file:
> >>
> >> node1 10.10.10.3/24
> >> node2 10.10.10.4/24
> >
> >> with the expected result of having different IP addresses on the nodes,
> >> failing over
>
> > This works on my machines here. It isn't something like both machines
> > being named node1, or something, is it?
>
> It started to work, out of the blue, this morning and I was not able to
> reproduce the situation today. I have the impression that it was the same
> problem as in the previous case. It requires pretty specific timing to happen.

I want to solve these two problems, but it will require having a "real" cluster
manager. This probably won't be available until early next year.

Thanks for trying heartbeat!

-- Alan Robertson
alanr@bell-labs.com