Mailing List Archive

Question about pulling the ethernet
Hello

When I pull the ethernet cable off the active node, the secondary does
not take over. I have heartbeat 0.4.8 and nice_failback is set. There
are two network cards, and I pulled the cables from both, but it still
does not switch over.

What do I have to do to make this work, i.e. so that when the active node
is no longer reachable because, say, the switch is broken, the secondary
node takes over?

Regards,
Holger
Question about pulling the ethernet [ In reply to ]
Holger Kiehl wrote:
>
> Hello
>
> When pulling the ethernet cable off the active node, the secondary will
> not take over. I have heartbeat 0.4.8 and nice_failback is set. There
> are two network cards but I pulled the cable of both cards but it does
> not switch over.
>
> What do I have to do that this works, ie when the active node is no longer
> reachable because, say the switch is broken, that the secondary node will
> take over?

I assume that you have a serial connection between your hosts. If so,
the two hosts are still communicating over the serial line and can
see that your active node is still active. There is no service
monitoring in heartbeat: it wouldn't fail over if your service
(e.g. apache) died. You have to provide this bit yourself, e.g. using mon.
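
To illustrate the kind of check heartbeat leaves to an external tool, here is a minimal Python sketch. It is not mon and does not use mon's interface. The names tcp_reachable and service_failed, the TCP probe, and the threshold of three consecutive misses are all assumptions made for the example.

```python
import socket
from typing import Iterable


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_failed(results: Iterable[bool], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive checks have failed.

    `results` is the sequence of probe outcomes, oldest first.
    A single successful probe resets the failure counter.
    """
    misses = 0
    for ok in results:
        misses = 0 if ok else misses + 1
        if misses >= threshold:
            return True
    return False
```

A monitoring loop would call tcp_reachable periodically, feed the outcomes to service_failed, and trigger whatever action the administrator has chosen once it returns True. That action is deliberately left out here.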

juri

--
juri.haberland@innominate.de
system engineer innominate AG
clustering & security networking people
phone: +49-30-308806-45 fax: -77 web: http://innominate.de
Question about pulling the ethernet [ In reply to ]
On 2000-08-01T06:55:54,
Alan Robertson <alanr@suse.com> said:

> Make sure you are only trying to move ALIASES (eth0:0, etc.) between machines as
> service addresses. Heartbeat won't move main interfaces (eth0, etc.) between
> machines.

If anyone would care to step forward and write an IPaddr script that uses
"iproute2" (i.e. the "ip addr" commands), this problem could be solved, because
those commands no longer have the concept of "aliases" and expose more of the
way the Linux kernel handles them internally: namely, as addresses associated
with a network device.
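
As a rough illustration of what such a script would have to run, the Python sketch below builds the "ip addr add" and "ip addr del" command lines for a service address. It is not an existing heartbeat resource script. The function name ip_addr_commands is hypothetical, and the example only constructs the commands; it does not execute them or send a gratuitous ARP.

```python
import ipaddress
from typing import Dict, List


def ip_addr_commands(address: str, netmask: str, device: str) -> Dict[str, List[str]]:
    """Build iproute2 command lines that add or remove a service address.

    The address is attached directly to the device, so no alias
    name such as eth0:0 is needed.
    """
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    cidr = iface.with_prefixlen  # e.g. "192.168.124.127/24"
    broadcast = str(iface.network.broadcast_address)
    return {
        "start": ["ip", "addr", "add", cidr, "broadcast", broadcast, "dev", device],
        "stop": ["ip", "addr", "del", cidr, "dev", device],
    }


if __name__ == "__main__":
    cmds = ip_addr_commands("192.168.124.127", "255.255.255.0", "eth0")
    print(" ".join(cmds["start"]))
    print(" ".join(cmds["stop"]))
```

The argument lists could be handed to subprocess.run by a real resource script. Running them requires root privileges and is left out on purpose.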

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Question about pulling the ethernet [ In reply to ]
Holger Kiehl wrote:
>
> Hello
>
> When pulling the ethernet cable off the active node, the secondary will
> not take over. I have heartbeat 0.4.8 and nice_failback is set. There
> are two network cards but I pulled the cable of both cards but it does
> not switch over.
>
> What do I have to do that this works, ie when the active node is no longer
> reachable because, say the switch is broken, that the secondary node will
> take over?

This is a pretty frequently asked question, unfortunately...

In every case but one where this has occurred so far, it has been a
configuration error.

Usually it's a problem in the haresources file, or in how the ethernets are
configured. You have to have the haresources files identical between the two
machines.

As an aside, for now, don't put more than one tab per line in the haresources
file.

Make sure you are only trying to move ALIASES (eth0:0, etc.) between machines as
service addresses. Heartbeat won't move main interfaces (eth0, etc.) between
machines.
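
The two haresources rules above (identical files on both machines, no more than one tab per line) can be checked mechanically. The following Python sketch is one way to do it. The function names are hypothetical, and treating lines that start with "#" as comments is an assumption about the file format.

```python
from typing import List


def haresources_problems(text: str) -> List[str]:
    """Report lines of a haresources file that contain more than one tab."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Assumption: blank lines and '#' comment lines carry no resources.
        if not stripped or stripped.startswith("#"):
            continue
        if line.count("\t") > 1:
            problems.append(f"line {number}: more than one tab")
    return problems


def same_haresources(local_text: str, peer_text: str) -> bool:
    """Return True if both nodes have byte-for-byte identical contents."""
    return local_text == peer_text
```

In practice the two texts would be read from /etc/ha.d/haresources on each node, for example after copying the peer's file over for comparison.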

-- Alan Robertson
alanr@suse.com
Question about pulling the ethernet [ In reply to ]
On Tue, 1 Aug 2000, Alan Robertson wrote:

> Holger Kiehl wrote:
> >
> > Hello
> >
> > When pulling the ethernet cable off the active node, the secondary will
> > not take over. I have heartbeat 0.4.8 and nice_failback is set. There
> > are two network cards but I pulled the cable of both cards but it does
> > not switch over.
> >
> > What do I have to do that this works, ie when the active node is no longer
> > reachable because, say the switch is broken, that the secondary node will
> > take over?
>
> This is a pretty-frequently asked question, unfortunately...
>
> In every case but one that this has occured so far, this has been a
> configuration error.
>
So this must work and there is no need for mon?

> Usually it's a problem in the haresources file, or in how the ethernets are
> configured. You have to have the haresources files identical between the two
> machines.
>
I think I have set up everything correctly. But could someone please check it:

There are two nodes yoda (192.168.124.126) and presto (192.168.124.125)
which both have the same haresources file:

yoda 192.168.124.127

With only one tab inside!

The ha.cf of yoda looks as follows:

debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 2
deadtime 10
serial /dev/ttyS1
baud 19200
udpport 694
udp eth0
node yoda
node presto.dwd.de

And that of presto:

debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 2
deadtime 10
serial /dev/ttyS1
baud 19200
udpport 694
udp eth0
node presto.dwd.de
node yoda

I did check the serial line by catting some data from one node to the other.
So the serial line is working. But when I pull the ethernet cable from
yoda this is the ha-log output from yoda:

heartbeat: 2000/08/01_22:25:03 info: Configuration validated. Starting heartbeat 0.4.8
heartbeat: 2000/08/01_22:25:03 info: heartbeat: version 0.4.8
heartbeat: 2000/08/01_22:25:04 notice: Starting serial heartbeat on tty /dev/ttyS1
heartbeat: 2000/08/01_22:25:04 notice: UDP heartbeat started on port 694 interface eth0
heartbeat: 2000/08/01_22:25:04 info: Local status now set to: 'up'
heartbeat: 2000/08/01_22:25:04 info: Link yoda:eth0: status up
heartbeat: 2000/08/01_22:25:05 info: Local status now set to: 'active'
heartbeat: 2000/08/01_22:25:05 info: Link presto.dwd.de:eth0: status up
heartbeat: 2000/08/01_22:25:05 info: Node presto.dwd.de: status up
heartbeat: 2000/08/01_22:25:05 info: Node presto.dwd.de: status active
heartbeat: 2000/08/01_22:25:05 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2000/08/01_22:25:05 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2000/08/01_22:25:05 info: Link presto.dwd.de:/dev/ttyS1: status up
heartbeat: 2000/08/01_22:25:05 info: Running /etc/ha.d/resource.d/IPaddr 192.168.124.127 status
heartbeat: 2000/08/01_22:25:05 info: Running /etc/ha.d/rc.d/ip-request ip-request
heartbeat: 2000/08/01_22:25:15 info: Acquiring resource group: yoda 192.168.124.127
heartbeat: 2000/08/01_22:25:16 info: Running /etc/ha.d/resource.d/IPaddr 192.168.124.127 start
heartbeat: 2000/08/01_22:25:16 info: ifconfig eth0:0 192.168.124.127 netmask 255.255.255.0 broadcast 192.168.124.255
heartbeat: 2000/08/01_22:25:16 info: Sending Gratuitous Arp for 192.168.124.127 on eth0:0 [eth0]
heartbeat: 2000/08/01_22:25:46 WARN: Link presto.dwd.de:eth0 dead.
heartbeat: 2000/08/01_22:27:12 info: Link presto.dwd.de:eth0: status up

The last line was logged after I plugged the ethernet cable back in.
192.168.124.127 does not answer any pings. I did this several times, also
waiting longer, always with the same result. The ha-log of presto looks
nearly the same:

heartbeat: 2000/08/25_00:23:50 info: Configuration validated. Starting heartbeat 0.4.8
heartbeat: 2000/08/25_00:23:50 info: heartbeat: version 0.4.8
heartbeat: 2000/08/25_00:23:51 notice: Starting serial heartbeat on tty /dev/ttyS1
heartbeat: 2000/08/25_00:23:51 notice: UDP heartbeat started on port 694 interface eth0
heartbeat: 2000/08/25_00:23:52 info: Local status now set to: 'up'
heartbeat: 2000/08/25_00:23:52 info: Link presto.dwd.de:eth0: status up
heartbeat: 2000/08/25_00:23:52 info: Local status now set to: 'active'
heartbeat: 2000/08/25_00:23:52 info: Link yoda:eth0: status up
heartbeat: 2000/08/25_00:23:52 info: Node yoda: status active
heartbeat: 2000/08/25_00:23:52 info: Link yoda:/dev/ttyS1: status up
heartbeat: 2000/08/25_00:23:52 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2000/08/25_00:23:52 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys presto.dwd.de]
heartbeat: 2000/08/25_00:23:52 info: Running /etc/ha.d/rc.d/ip-request ip-request
heartbeat: 2000/08/25_00:23:53 info: Running /etc/ha.d/resource.d/IPaddr 192.168.124.127 status
heartbeat: 2000/08/25_00:24:32 WARN: Link yoda:eth0 dead.
heartbeat: 2000/08/25_00:25:58 info: Link yoda:eth0: status up

I must be doing something wrong, since I get the same results with
another cluster. Could someone please give me a pointer to what I am
doing wrong?

Thanks,
Holger
Question about pulling the ethernet [ In reply to ]
Holger Kiehl wrote:
>
> On Tue, 1 Aug 2000, Alan Robertson wrote:
>
> > Holger Kiehl wrote:
> > >
> > > Hello
> > >
> > > When pulling the ethernet cable off the active node, the secondary will
> > > not take over. I have heartbeat 0.4.8 and nice_failback is set. There
> > > are two network cards but I pulled the cable of both cards but it does
> > > not switch over.
> > >
> > > What do I have to do that this works, ie when the active node is no longer
> > > reachable because, say the switch is broken, that the secondary node will
> > > take over?
> >
> So this must work and there is no need for mon?
>
> I think I have setup everything correct. But could someone please check it:
>
> There are two nodes yoda (192.168.124.126) and presto (192.168.124.125)
> which both have the same haresources file:
>
> yoda 192.168.124.127
>
> With only one tab inside!
>
> The ha.cf of yoda looks as follows:
>
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> keepalive 2
> deadtime 10
> serial /dev/ttyS1
> baud 19200
> udpport 694
> udp eth0
> node yoda
> node presto.dwd.de
>
> And that of presto:
>
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> keepalive 2
> deadtime 10
> serial /dev/ttyS1
> baud 19200
> udpport 694
> udp eth0
> node presto.dwd.de
> node yoda

Ok, looks good, just one thing as mentioned in the sample ha.cf file:

The host name in the node statement has to match the output of 'uname
-n'
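
A quick way to verify this on each machine is sketched below in Python. The helper names are hypothetical. os.uname().nodename returns the same value as 'uname -n' on Unix systems, and the parser assumes one host name per node line, as in the ha.cf files quoted above.

```python
import os
from typing import List, Optional


def configured_nodes(ha_cf_text: str) -> List[str]:
    """Return the host names given in the 'node' statements of an ha.cf file."""
    nodes = []
    for line in ha_cf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "node":
            nodes.append(fields[1])
    return nodes


def local_node_is_listed(ha_cf_text: str, local_name: Optional[str] = None) -> bool:
    """Check that this machine's 'uname -n' name appears in a node statement."""
    if local_name is None:
        local_name = os.uname().nodename  # same value as `uname -n`
    return local_name in configured_nodes(ha_cf_text)
```

With the configuration shown earlier, presto would only pass this check if 'uname -n' there prints presto.dwd.de; the short name presto would not match.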


>
> I did check the serial line by catting some data from one node to the other.
> So the serial line is working. But when I pull the ethernet cable from
>
> The last line is after I plugged in the ethernet cable. 192.168.124.127
> will not answer to any pings. I did this several times also waiting longer,
> always with the same result. The ha-log of presto looks nearly the same:

And to see if I understand what you are trying to achieve:

You have set up an HA cluster with two nodes, connected by two different
ethernet cables and additionally by a serial connection. You are now
trying to simulate a connectivity loss of one node by pulling the
primary network cable. You even pulled both ethernet cables. And you
expect that the other node will notice that the primary node is no
longer reachable by the clients and therefore fail over - right?

If my assumption is right, then this won't work, as stated in my
previous post, because the nodes can still see the heartbeat via the
serial connection and because heartbeat does not test the reachability
of the service IP address. That must be done with mon or other
similar software.

juri

--
juri.haberland@innominate.de
system engineer innominate AG
clustering & security networking people
phone: +49-30-308806-45 fax: -77 http://innominate.de
Question about pulling the ethernet [ In reply to ]
On Fri, 4 Aug 2000, Juri Haberland wrote:

> > node presto.dwd.de
> > node yoda
>
> Ok, looks good, just one thing as mentioned in the sample ha.cf file:
>
> The host name in the node statement has to match the output of 'uname
> -n'
>
Yes, I have made sure that this is correct.

>
> And to see if I understand, what you are trying to achieve:
>
> You have setup a ha-cluster with two nodes, connected by two different
> ethernet cable and additionally with a serial connection. You are now
> trying to simulate a connectivity loss of one node by pulling the
> primary network cable. You even pulled both ethernet cables. And you are
> expecting that the other node would notice that the primary node is no
> longer reachable by the clients and therefor do a fail over - right?
>
I just pull one cable. Each box is connected to its own switch. So
what I am trying to simulate is the switch at the active node (say yoda) dying:

---------+----------------------+-------------
         |                      |
    +----+----+            +----+----+
    | Switch1 |            | Switch2 |
    +----+----+            +----+----+
         |                      |
         |eth0                  |eth0
    +----+----+ /dev/ttyS1 +----+----+
    |  yoda   +------------+ presto  |
    +---------+            +---------+

> If my assumption is right, than this won't work, as stated in my
> previous post, because the nodes can still see the heartbeat via the
> serial connection and because heartbeat does not test the reachability
> of the service IP address. This must be done with mon or any other
> similar software.
>
Ok, I thought this would work without any additional tools.

Holger
Question about pulling the ethernet [ In reply to ]
On Fri, 4 Aug 2000, Holger Kiehl wrote:

>
>
> On Fri, 4 Aug 2000, Juri Haberland wrote:
>
> > > node presto.dwd.de
> > > node yoda
> >
> > Ok, looks good, just one thing as mentioned in the sample ha.cf file:
> >
> > The host name in the node statement has to match the output of 'uname
> > -n'
> >
> Yes, I have made sure that this is correct.
>
> >
> > And to see if I understand, what you are trying to achieve:
> >
> > You have setup a ha-cluster with two nodes, connected by two different
> > ethernet cable and additionally with a serial connection. You are now
> > trying to simulate a connectivity loss of one node by pulling the
> > primary network cable. You even pulled both ethernet cables. And you are
> > expecting that the other node would notice that the primary node is no
> > longer reachable by the clients and therefor do a fail over - right?
> >
> I just pull one cable. Each box is connected to its own switch. So
> what I try to simulate if the switch at the active node (say yoda) dies:
>
> ---------+----------------------+-------------
>          |                      |
>     +----+----+            +----+----+
>     | Switch1 |            | Switch2 |
>     +----+----+            +----+----+
>          |                      |
>          |eth0                  |eth0
>     +----+----+ /dev/ttyS1 +----+----+
>     |  yoda   +------------+ presto  |
>     +---------+            +---------+
>
> > If my assumption is right, than this won't work, as stated in my
> > previous post, because the nodes can still see the heartbeat via the
> > serial connection and because heartbeat does not test the reachability
> > of the service IP address. This must be done with mon or any other
> > similar software.
> >
> Ok, I thought this would work without any additional tools.

In the first version of my "per-link status" patch I added a config option
to mark a given interface as mandatory, so that if this interface's status
became "down", heartbeat would consider the other node dead even if
other interfaces still had status up.

I removed it because it's an ugly hack, and what we want instead is a
service<->link dependency scheme.

Anyway, I'll redo this patch now and send it to you.
Question about pulling the ethernet [ In reply to ]

--661009-1397044680-965598613=:12321
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Sun, 6 Aug 2000, Marcelo Tosatti wrote:

<snip>

> On the first version of my "per-link status" patch I added a config option
> to set a given interface as mandatory, so in case this interface status
> became "down", heartbeat would consider the other node as dead, even if
> there were other interfaces with status up.
>
> I removed it because its an ugly hack and what we want instead is a
> service<->link dependancy scheme.
>
> Anyway, I'll redo this patch now and send you.
>

Ok, I'm attaching the patch I described above. (against heartbeat 0.4.8)

Now you have a new config option, named "mandatory".

Here is my config file:

keepalive 1
deadtime 5
udp eth0
udp eth1
mandatory eth1
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
node ha1
node ha2
udpport 1001

And here is the result:

Aug 6 21:40:40 ha2 heartbeat[27080]: info: heartbeat: version
Aug 6 21:40:40 ha2 heartbeat[27080]: notice: UDP heartbeat started on port 1001 interface eth0
Aug 6 21:40:40 ha2 heartbeat[27080]: notice: UDP heartbeat started on port 1001 interface eth1
Aug 6 21:40:40 ha2 heartbeat[27081]: info: Local status now set to: 'up'
Aug 6 21:40:40 ha2 heartbeat[27081]: info: Link ha2:eth0: status up
Aug 6 21:40:40 ha2 heartbeat[27081]: info: Link ha2:eth1: status up
Aug 6 21:40:40 ha2 heartbeat[27081]: info: Local status now set to: 'active'
Aug 6 21:40:40 ha2 heartbeat[27081]: info: Link ha1:eth0: status up
(unplugged eth1 cable)
Aug 6 21:40:45 ha2 heartbeat[27081]: WARN: node ha1: is dead
Aug 6 21:40:45 ha2 heartbeat[27081]: WARN: Link ha1:eth1 dead.
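
The decision rule this option changes can be sketched in a few lines of Python. This is an illustration of the behaviour described in this thread, not the code from the attached patch. The function name node_is_dead, the dictionary of per-link timestamps, and the strict "greater than deadtime" comparison are assumptions.

```python
from typing import Dict, Optional


def node_is_dead(
    last_heard: Dict[str, float],
    now: float,
    deadtime: float,
    mandatory: Optional[str] = None,
) -> bool:
    """Decide whether a peer node should be considered dead.

    last_heard maps a link name (e.g. "eth0", "/dev/ttyS1") to the time
    of the last heartbeat received from the peer over that link.
    """
    def silent(link: str) -> bool:
        # A link never heard from counts as silent.
        return now - last_heard.get(link, float("-inf")) > deadtime

    if mandatory is not None:
        # With a mandatory link, only that link decides.
        return silent(mandatory)
    # Without one, the node stays alive while any link still carries heartbeats.
    return all(silent(link) for link in last_heard)


if __name__ == "__main__":
    heard = {"eth0": 100.0, "eth1": 94.0}  # eth1 unplugged six seconds ago
    print(node_is_dead(heard, now=100.0, deadtime=5))
    print(node_is_dead(heard, now=100.0, deadtime=5, mandatory="eth1"))
```

The first call models the default behaviour that kept the original setup from failing over: as long as one link (there, the serial line) still delivers heartbeats, the peer is treated as alive. The second call models the "mandatory eth1" setting from the config file above.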




--661009-1397044680-965598613=:12321
Content-Type: TEXT/PLAIN; charset=US-ASCII; name="heartbeat-0.4.8-mandatory.patch"
Content-Transfer-Encoding: BASE64
Content-ID: <Pine.LNX.4.21.0008061850130.12321@freak.distro.conectiva>
Content-Description:
Content-Disposition: attachment; filename="heartbeat-0.4.8-mandatory.patch"

ZGlmZiAtTnVyIGhlYXJ0YmVhdC5vcmlnL2NvbmZpZy5jIGhlYXJ0YmVhdC9j
b25maWcuYw0KLS0tIGhlYXJ0YmVhdC5vcmlnL2NvbmZpZy5jCVN1biBBdWcg
IDYgMjE6MTM6MTAgMjAwMA0KKysrIGhlYXJ0YmVhdC9jb25maWcuYwlTdW4g
QXVnICA2IDIxOjIxOjUxIDIwMDANCkBAIC0xOTIsNiArMTkyLDcgQEANCiAj
ZGVmaW5lCUtFWV9MT0dGSUxFCSJsb2dmaWxlIg0KICNkZWZpbmUJS0VZX0RC
R0ZJTEUJImRlYnVnZmlsZSINCiAjZGVmaW5lIEtFWV9GQUlMQkFDSyAgICAi
bmljZV9mYWlsYmFjayINCisjZGVmaW5lIEtFWV9NQU5EQVRPUlkJIm1hbmRh
dG9yeSINCiANCiBpbnQgYWRkX25vZGUoY29uc3QgY2hhciAqKTsNCiBpbnQg
c2V0X2hvcGZ1ZGdlKGNvbnN0IGNoYXIgKik7DQpAQCAtMjA0LDYgKzIwNSw3
IEBADQogaW50IHNldF9sb2dmaWxlKGNvbnN0IGNoYXIgKik7DQogaW50IHNl
dF9kYmdmaWxlKGNvbnN0IGNoYXIgKik7DQogaW50IHNldF9uaWNlX2ZhaWxi
YWNrKGNvbnN0IGNoYXIgKik7DQoraW50IHNldF9tYW5kYXRvcnkoY29uc3Qg
Y2hhciAqKTsNCiANCiBleHRlcm4gY29uc3Qgc3RydWN0IGhiX21lZGlhX2Zu
cwlpcF9tZWRpYV9mbnM7DQogZXh0ZXJuIGNvbnN0IHN0cnVjdCBoYl9tZWRp
YV9mbnMJc2VyaWFsX21lZGlhX2ZuczsNCkBAIC0yMzQsNiArMjM2LDcgQEAN
CiAsCXtLRVlfTE9HRklMRSwgICBzZXRfbG9nZmlsZX0NCiAsCXtLRVlfREJH
RklMRSwgICBzZXRfZGJnZmlsZX0NCiAsICAgICAgIHtLRVlfRkFJTEJBQ0ss
ICBzZXRfbmljZV9mYWlsYmFja30NCissCXtLRVlfTUFOREFUT1JZLCBzZXRf
bWFuZGF0b3J5fQ0KIH07DQogDQogDQpAQCAtODc5LDMgKzg4MiwyMCBAQA0K
IA0KICAgICAgICAgcmV0dXJuKEhBX09LKTsNCiB9DQorDQoraW50IA0KK3Nl
dF9tYW5kYXRvcnkoY29uc3QgY2hhciAqdmFsdWUpIA0KK3sNCisJaWYgKGNv
bmZpZy0+bWFuZGF0b3J5ICE9IE5VTEwpIHsNCisJCWZwcmludGYoc3RkZXJy
LCAiJXM6IE1hbmRhdG9yeSBkZXZpY2UgbXVsdGlwbHkgc3BlY2lmaWVkLlxu
Ig0KKwkJLAljbWRuYW1lKTsNCisJCXJldHVybihIQV9GQUlMKTsNCisJfQ0K
KwlpZiAoKGNvbmZpZy0+bWFuZGF0b3J5ID0gKGNoYXIgKiloYV9tYWxsb2Mo
c3RybGVuKHZhbHVlKSsxKSkgPT0gTlVMTCkgew0KKwkJZnByaW50ZihzdGRl
cnIsICIlczogT3V0IG9mIG1lbW9yeSBmb3IgbWFuZGF0b3J5IGRldmljZVxu
Ig0KKwkJLAljbWRuYW1lKTsNCisJCXJldHVybihIQV9GQUlMKTsNCisJfQ0K
KwlzdHJjcHkoY29uZmlnLT5tYW5kYXRvcnksIHZhbHVlKTsNCisgCXJldHVy
bihIQV9PSyk7DQorIH0NCmRpZmYgLU51ciBoZWFydGJlYXQub3JpZy9oZWFy
dGJlYXQuYyBoZWFydGJlYXQvaGVhcnRiZWF0LmMNCi0tLSBoZWFydGJlYXQu
b3JpZy9oZWFydGJlYXQuYwlTdW4gQXVnICA2IDIxOjEzOjEwIDIwMDANCisr
KyBoZWFydGJlYXQvaGVhcnRiZWF0LmMJU3VuIEF1ZyAgNiAyMTo0MDoyNyAy
MDAwDQpAQCAtMTI2MCw5ICsxMjYwLDE1IEBADQogDQogCQkJdGhpc25vZGUt
PnJtdF9sYXN0dXBkYXRlID0gbXNndGltZTsNCiANCisJCQl0aGlzbm9kZS0+
c3RhdHVzX3NlcW5vID0gc2Vxbm87DQorDQorCQkJaWYoY29uZmlnLT5tYW5k
YXRvcnkgJiYgdGhpc25vZGUgIT0gY3Vybm9kZSkgew0KKwkJCQlpZihzdHJj
bXAoY29uZmlnLT5tYW5kYXRvcnksIGlmYWNlKSAhPSAwKSANCisJCQkJCWNv
bnRpbnVlOw0KKwkJCX0NCisNCiAJCQl0aGlzbm9kZS0+bG9jYWxfbGFzdHVw
ZGF0ZSA9IG1lc3NhZ2V0aW1lOw0KIA0KLQkJCXRoaXNub2RlLT5zdGF0dXNf
c2Vxbm8gPSBzZXFubzsNCiANCiAJCQkvKiBJcyB0aGUgbm9kZSBzdGF0dXMg
dGhlIHNhbWU/ICovDQogCQkJaWYgKHN0cmNhc2VjbXAodGhpc25vZGUtPnN0
YXR1cywgc3RhdHVzKSAhPSAwKSB7DQpkaWZmIC1OdXIgaGVhcnRiZWF0Lm9y
aWcvaGVhcnRiZWF0LmggaGVhcnRiZWF0L2hlYXJ0YmVhdC5oDQotLS0gaGVh
cnRiZWF0Lm9yaWcvaGVhcnRiZWF0LmgJU3VuIEF1ZyAgNiAyMTo0MjozMSAy
MDAwDQorKysgaGVhcnRiZWF0L2hlYXJ0YmVhdC5oCVN1biBBdWcgIDYgMjE6
MTI6NTIgMjAwMA0KQEAgLTIwNyw2ICsyMDcsNyBAQA0KIAlzdHJ1Y3QgYXV0
aF9pbmZvKiBhdXRobWV0aG9kOwkvKiBhdXRoX2NvbmZpZ1thdXRobnVtXSAq
Lw0KIAlzdHJ1Y3Qgbm9kZV9pbmZvICBub2Rlc1tNQVhOT0RFXTsNCiAJc3Ry
dWN0IGF1dGhfaW5mbyAgYXV0aF9jb25maWdbTUFYQVVUSF07DQorCWNoYXIq
IG1hbmRhdG9yeTsNCiB9Ow0KIA0KIA0K
--661009-1397044680-965598613=:12321--