Mailing List Archive

Error in serial code of heartbeat?
Hello

There seems to be a bug in heartbeat serial code. I have been using
heartbeat for a very long time and have had no problems. But since I moved
the machine and put a higher constant load on it, I am getting
the following errors every hour:
TTY write timeout on [/dev/ttyS1] (no connection?)

At first I was running version 0.4.6c when these errors popped up. I
rebooted both nodes several times, but this did not help; the error
always came back. I then ran strace on the heartbeat process that
handles the serial port and could see that it reads from the serial fd
only once every two seconds, although the serial buffer was full of
data! I could verify this by simply disconnecting the serial connection:
the heartbeat process kept reading data from the serial
port for about 5 - 10 minutes before the buffer was empty! After connecting
it again, this time with a serial analyser between the two nodes, one
could see the buffer fill up until it was full again and the RTS
signal dropped.

It seems that heartbeat is reading just one record every two seconds
and does not read everything from the buffer. So if the process
writing to the port writes faster, it will always fill the
buffer and heartbeat will NOT detect if the other node has
crashed for 5 - 10 minutes until the buffer is empty.

Two days ago I decided to upgrade to 0.4.7 and everything seemed to
be running. However, looking at the log files this morning, I see that
the same messages appear on both nodes.

As I said, this all started to happen when I moved the nodes from one
room to another and started running more processes on them, causing a
higher load on the active node:

9:06am up 1 day, 22:19, 5 users, load average: 0.87, 0.61, 0.44

There are about 195 processes now running on the active node. Before
the move the load average was always around zero.

Holger
Error in serial code of heartbeat?
On Thu, Apr 20, 2000 at 09:24:43AM +0200, Holger Kiehl wrote:
> It seems that heartbeat is reading just one record every two seconds
> and does not read everything from the buffer. So if the process
> writing to the port writes faster, it will always fill the
> buffer and heartbeat will NOT detect if the other node has
> crashed for 5 - 10 minutes until the buffer is empty.


I'm trying to track this down but I'm not having a lot of luck.
The serial code does only read one line at a time, but the process
that handles the reading of the data should (by my reading of the code)
be continuously reading information from the serial port.

One possibility I thought of is that the buffer for the pipe
used to communicate status between heartbeat processes is being
filled. The process that reads the pipe should, again by my reading
of the code, be doing this continuously.

Another possibility is that heartbeat is taking so long to write
messages that it is unable to read messages fast enough (from the pipe).
This seems unlikely, though it would tie in with the load requirement.
If this is the case, then a mechanism for flushing the buffers continuously
and discarding backlogged messages would be required.

I am going to try to reproduce this problem to understand it
better. Alan, do you have any ideas on what the cause might be?



--
Horms
Error in serial code of heartbeat?
Horms wrote:
> I am going to try to reproduce this problem to understand it
> better. Alan, do you have any ideas on what the cause might be?

I have no solid idea on what this might be. I'll see if I can set
things up here to reproduce it. Heartbeat has the design it does
primarily to prevent these kinds of things...

I'll let you know what I find out as well.

Sorry for the difficulties!

-- Alan Robertson
alanr@suse.com
Re: Error in serial code of heartbeat?
Holger Kiehl wrote:
> There seems to be a bug in heartbeat serial code. I have been using
> heartbeat for a very long time and have had no problems. But since I moved
> the machine and put a higher constant load on it, I am getting
> the following errors every hour:
> TTY write timeout on [/dev/ttyS1] (no connection?)

Hi Holger,

What OS are you running this with? What was the version of heartbeat
that you were using before that worked fine?

Thanks!

-- Alan Robertson
alanr@suse.com
Re: Error in serial code of heartbeat?
Alan Robertson wrote:
> Hi Holger,
>
> What OS are you running this with? What was the version of heartbeat
> that you were using before that worked fine?

Hi Holger,

I had another (and I think better) thought about your problem.

Since I tested this code extensively before releasing it back in about
the 0.4.6c timeframe, I suspect that you may have fallen victim to a
change in cable requirements.

The old code only required TX, RX and ground. The new code requires
that TX, RX, CTS, RTS, CD, TR and ground are all hooked up.

If your cables don't carry all those signals through, then I think you
would have troubles very similar to those you describe.

If you check /proc/tty/driver/serial, you should see at least these
signals:
RTS|CTS|DTR|CD
regardless of which machine you look from.

Check it out and let me know what you see.

Thanks!

-- Alan Robertson
alanr@suse.com
Re: Error in serial code of heartbeat?
On Fri, 21 Apr 2000, Alan Robertson wrote:

> Hi Holger,
>
> I had another (and I think better) thought about your problem.
>
> Since I tested this code extensively before releasing it back in about
> the 0.4.6c timeframe, I suspect that you may have fallen victim to a
> change in cable requirements.
>
> The old code only required TX, RX and ground. The new code requires
> that TX, RX, CTS, RTS, CD, TR and ground are all hooked up.
>
> If your cables don't carry all those signals through, then I think you
> would have troubles very similar to those you describe.
>
> If you check /proc/tty/driver/serial, you should see at least these
> signals:
> RTS|CTS|DTR|CD
> regardless of which machine you look from.
>
> Check it out and let me know what you see.
>

Here is the information from the system that has the problem. It is
running SuSE 6.1 with kernel 2.2.12:

afd@diagnostix:~$ cat /proc/tty/driver/serial
serinfo:1.0 driver:4.27
0: uart:16550A port:3F8 irq:4 baud:9600 tx:0 rx:0
1: uart:16550A port:2F8 irq:3 baud:19200 tx:5612304 rx:4978108 fe:3903 RTS|DTR
2: uart:unknown port:3E8 irq:4
3: uart:unknown port:2E8 irq:3

I have another system with RH 6.1 and kernel 2.2.14 where I do NOT see
these problems. The serial information is as follows:

afd@botanix:~$ cat /proc/tty/driver/serial
serinfo:1.0 driver:4.27
0: uart:16550A port:3F8 irq:4 baud:19200 tx:12430759 rx:10835270 fe:2541 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:2F8 irq:3 tx:0 rx:0 RTS
2: uart:unknown port:3E8 irq:4
3: uart:unknown port:2E8 irq:3

So if I understand you correctly, this shows that the cables of the two
systems are not the same? I had them made at the same time.
I will check them on Tuesday when I am back at work.

Besides, another thing I noticed on the system with the working cable
is that heartbeat on the master terminates itself when the master is under
very heavy load (> 100). The secondary node then takes over, but after
some time it reaches the same load and its heartbeat terminates itself too.
Now neither node has heartbeat or the services running. I managed to get
around this by increasing the deadtime parameter in ha.cf. Here is the
ha-log output of node botanix:

heartbeat: 2000/04/20_20:04:52 warn: node botanix: is dead
heartbeat: 2000/04/20_20:04:52 error: No local heartbeat. Forcing shutdown.
heartbeat: 2000/04/20_20:05:09 info: Heartbeat shutdown in progress.
heartbeat: 2000/04/20_20:05:10 info: Giving up all HA resources.
heartbeat: 2000/04/20_20:05:11 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 2000/04/20_20:05:12 Taking over resource group 192.168.124.124
heartbeat: 2000/04/20_20:05:15 Releasing resource group: botanix 192.168.124.124
heartbeat: 2000/04/20_20:05:15 INFO: Running /etc/ha.d/resource.d/IPaddr 192.168.124.124 stop
heartbeat: 2000/04/20_20:05:28 IP Address 192.168.124.124 released
heartbeat: 2000/04/20_20:05:28 info: All HA resources relinquished.
heartbeat: 2000/04/20_20:05:28 info: Heartbeat shutdown complete.

Is this the correct behaviour? I thought it would be correct for heartbeat
to hand over the services but not to terminate itself. Heartbeat
would then function like a load balancer. In that case it would
also be nice to have a feature so that heartbeat does not always switch back
to the master node.

This brings me to another point. Would it not be good that heartbeat
have its own init daemon, that will activate all heartbeat processes
and then check that they are always active. If one heartbeat process fails,
segmentation fault or bus error or some other unexpected error situation,
it will automatically be restarted by this init daemon. We could then
start this init process via the /etc/inittab, that way heartbeat will
always be active regardless what happens.
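
A minimal sketch of such a restart loop, in shell. This only illustrates
the proposal; it is not something heartbeat provides, and the `supervise`
function and its restart limit are hypothetical names chosen for the example:

```shell
# Hypothetical supervisor: rerun a command whenever it exits abnormally,
# up to a fixed number of restarts. Not part of heartbeat.
supervise() {
    max_restarts=$1
    shift
    restarts=0
    while :; do
        if "$@"; then
            # A clean exit (status 0) is treated as an intentional shutdown.
            return 0
        else
            status=$?
        fi
        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "giving up after $restarts restarts (last status $status)" >&2
            return "$status"
        fi
        restarts=$((restarts + 1))
        echo "command exited with status $status, restart $restarts" >&2
    done
}
```

Something like `supervise 5 some_daemon` would restart the command up to
five times. A respawn entry in /etc/inittab would make init itself play
the same role.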

Looking at the code I also noticed that when heartbeat is started
no core dumps are allowed. Why is this so? I have always found core files
very informative. In fact I always call abort() in my programs when
I hit an unexpected situation. Core files can give one very useful
information in strange situations.

Thanks for all the very quick responses to my serial problem. I
hope that this is the correct list to post all these questions.

Regards,
Holger
Re: Error in serial code of heartbeat?
Holger Kiehl wrote:
> Here is the information of the system that does have the problem it is
> with SuSE 6.1 and kernel 2.2.12:
>
> afd@diagnostix:~$ cat /proc/tty/driver/serial
> serinfo:1.0 driver:4.27
> 0: uart:16550A port:3F8 irq:4 baud:9600 tx:0 rx:0
> 1: uart:16550A port:2F8 irq:3 baud:19200 tx:5612304 rx:4978108 fe:3903 RTS|DTR
>
> I have another system with RH 6.1 and kernel 2.2.14 where I do NOT see
> these problems. The serial information is as follows:
>
> afd@botanix:~$ cat /proc/tty/driver/serial
> serinfo:1.0 driver:4.27
> 0: uart:16550A port:3F8 irq:4 baud:19200 tx:12430759 rx:10835270 fe:2541 RTS|CTS|DTR|DSR|CD
> 1: uart:16550A port:2F8 irq:3 tx:0 rx:0 RTS

> So if I understand you correctly, this shows that the cables of the two
> systems are not the same? I had them made at the same time.
> I will check them on Tuesday when I am back at work.

Or the port is broken, or something else is wrong. The UART needs to
see CTS coming from the other system. Without it, the UART won't write
any characters, and you get the results you've been seeing. This
difference in signals (RTS vs CTS|RTS) accounts for the difference in
behavior. RTS and TR on the local end translate into CTS and CD on the
remote end through the null modem cable. What does the system at the
other end of the cable see? Do you have a "breakout box"?

It actually looks like the cable in the first case isn't hooked up *at
all*. Please trace the cable and check it out... [Maybe you've gotten
things plugged into the wrong ports?]. This could happen to you
undetected if you're also doing heartbeat across the ethernet. Note
that in the first case, it looks like you're using COM2 (/dev/ttyS1),
and in the second case it looks like you're using COM1 (/dev/ttyS0).

> Besides, another thing I noticed on the system with the working cable,
> is that heartbeat on the master terminates itself when that is under
> very heavy load (> 100). Now the secondary node takes over but this one
> will, after some time reach the same load and heartbeat terminates itself.
> Now both nodes do not have heartbeat and the services. I managed to get
> around this by increasing the deadtime parameter in ha.cf. Here is the
> ha-log output of node botanix:
>
> heartbeat: 2000/04/20_20:04:52 warn: node botanix: is dead
> heartbeat: 2000/04/20_20:04:52 error: No local heartbeat. Forcing shutdown.

This message is *always* unwelcome :-(. It means that heartbeat was unable to
hear its own heartbeat in the given heartbeat interval. This means
that the code isn't functioning correctly [probably because of load in
your case]. There are several things that one could do about this:
1) Increase the heartbeat "dead" time
2) Lock heartbeat into memory and increase its priority so it gets
scheduled
3) Decrease the system load

#2 is the best option, followed by #2 and #1 combined. #3 would be a
last-ditch effort. I hadn't done #2 because no one had reported the
problem before.

> heartbeat: 2000/04/20_20:05:09 info: Heartbeat shutdown in progress.
> heartbeat: 2000/04/20_20:05:10 info: Giving up all HA resources.
> heartbeat: 2000/04/20_20:05:11 INFO: Running /etc/ha.d/rc.d/status status
> heartbeat: 2000/04/20_20:05:12 Taking over resource group 192.168.124.124
> heartbeat: 2000/04/20_20:05:15 Releasing resource group: botanix 192.168.124.124
> heartbeat: 2000/04/20_20:05:15 INFO: Running /etc/ha.d/resource.d/IPaddr 192.168.124.124 stop
> heartbeat: 2000/04/20_20:05:28 IP Address 192.168.124.124 released
> heartbeat: 2000/04/20_20:05:28 info: All HA resources relinquished.
> heartbeat: 2000/04/20_20:05:28 info: Heartbeat shutdown complete.
>
> Is this the correct behaviour? I thought for heartbeat that it is correct
> to give over the services but not to terminate itself. So heartbeat
> would function in that case like a load balancer. In that case it would
> also be nice to have a feature that heartbeat will not always switch back
> to the master node.

See discussion above.

The feature of not switching back to the master node is coming. It's
called "nice_failback".

> This brings me to another point. Would it not be good that heartbeat
> have its own init daemon, that will activate all heartbeat processes
> and then check that they are always active. If one heartbeat process fails,
> segmentation fault or bus error or some other unexpected error situation,
> it will automatically be restarted by this init daemon. We could then
> start this init process via the /etc/inittab, that way heartbeat will
> always be active regardless what happens.

I've thought about doing the self-monitoring thing in heartbeat. It
wouldn't be too hard since the list of processes is in shared memory.
Ditto for automatic restart on memory leaks, etc. I just haven't gotten
around to it.

> Looking at the code I also noticed that when heartbeat is started
> no core dumps are allowed. Why is this so? I always found core files
> very informative. In fact I always do an abort() in my programs when
> I hit an unexpected situation. Core files might give one very useful
> information in strange situations.

I don't do anything to deliberately disable core dumps. I've seen it
fail to dump core on many occasions - and it was quite frustrating.
What do you think I'm doing to prevent this? If you tell me what I'm
doing to cause this behavior, I'll fix it so fast it'll make your head
spin ;-)

> Thanks, for all the very quick responses to my serial problem and I
> hope that this is the correct list to post all these questions.

This is *exactly* the right list to post your queries to.

Thanks Holger!!

-- Alan Robertson
alanr@suse.com
Re: Error in serial code of heartbeat?
On Sat, Apr 22, 2000 at 05:41:00PM +0200, Holger Kiehl wrote:
> Here is the information of the system that does have the problem it is
> with SuSE 6.1 and kernel 2.2.12:
>
> afd@diagnostix:~$ cat /proc/tty/driver/serial
> serinfo:1.0 driver:4.27
> 0: uart:16550A port:3F8 irq:4 baud:9600 tx:0 rx:0
> 1: uart:16550A port:2F8 irq:3 baud:19200 tx:5612304 rx:4978108 fe:3903 RTS|DTR
> 2: uart:unknown port:3E8 irq:4
> 3: uart:unknown port:2E8 irq:3
>
> I have another system with RH 6.1 and kernel 2.2.14 where I do NOT see
> these problems. The serial information is as follows:
>
> afd@botanix:~$ cat /proc/tty/driver/serial
> serinfo:1.0 driver:4.27
> 0: uart:16550A port:3F8 irq:4 baud:19200 tx:12430759 rx:10835270 fe:2541 RTS|CTS|DTR|DSR|CD
> 1: uart:16550A port:2F8 irq:3 tx:0 rx:0 RTS
> 2: uart:unknown port:3E8 irq:4
> 3: uart:unknown port:2E8 irq:3
>
> So if I understand you correctly, this shows that the cables of the two
> systems are not the same? I had them made at the same time.
> I will check them on Tuesday when I am back at work.

My reading would be that the first system does not have
CTS DSR or CD hooked up.
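
That reading can be checked mechanically. A minimal sketch, assuming the
signal flags are always the last whitespace-separated field of a port's
line, as in the output quoted above; `missing_signals` is a hypothetical
helper, not part of heartbeat:

```shell
# Hypothetical helper: given one line of /proc/tty/driver/serial and a list
# of signal names, print the signals that are NOT asserted on that port.
# Assumes the flags (e.g. "RTS|CTS|DTR|DSR|CD") are the last field and the
# line has no trailing whitespace.
missing_signals() {
    line=$1
    shift
    # Strip everything up to the last space, leaving only the final field.
    flags=${line##* }
    missing=
    for sig in "$@"; do
        case "|$flags|" in
            *"|$sig|"*) ;;                     # signal is present
            *) missing="$missing $sig" ;;      # signal is absent
        esac
    done
    # Print the missing names separated by spaces (an empty line if none).
    printf '%s\n' "${missing# }"
}
```

Called as `missing_signals "$line" RTS CTS DTR DSR CD` on the ttyS1 line
from diagnostix, it prints `CTS DSR CD`; on the ttyS0 line from botanix it
prints an empty line.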

--
Horms
Re: Error in serial code of heartbeat?
Holger Kiehl wrote:
>
> On Sat, 22 Apr 2000, Alan Robertson wrote:
>
> > Holger Kiehl wrote:
> > >
> > > Here is the information of the system that does have the problem it is
> > > with SuSE 6.1 and kernel 2.2.12:
> > >
> > > afd@diagnostix:~$ cat /proc/tty/driver/serial
> > > serinfo:1.0 driver:4.27
> > > 0: uart:16550A port:3F8 irq:4 baud:9600 tx:0 rx:0
> > > 1: uart:16550A port:2F8 irq:3 baud:19200 tx:5612304 rx:4978108 fe:3903 RTS|DTR
> > >
> > > I have another system with RH 6.1 and kernel 2.2.14 where I do NOT see
> > > these problems. The serial information is as follows:
> > >
> > > afd@botanix:~$ cat /proc/tty/driver/serial
> > > serinfo:1.0 driver:4.27
> > > 0: uart:16550A port:3F8 irq:4 baud:19200 tx:12430759 rx:10835270 fe:2541 RTS|CTS|DTR|DSR|CD
> > > 1: uart:16550A port:2F8 irq:3 tx:0 rx:0 RTS
> >
> This output is from another cluster. My wording should have been:
> I have another CLUSTER with RH 6.1 ...
> Sorry, for the confusion!
>
> > > So if I understand you correctly, this shows that the cables of the two
> > > systems are not the same? I had them made at the same time.
> > > I will check them on Tuesday when I am back at work.
> >
> > Or the port is broken, or something else is wrong. The UART needs to
> > see CTS coming from the other system. Without it, the UART won't write
> > any characters, and you get the results you've been seeing. This
> > difference (RTS vs CTS|RTS) accounts for the difference. RTS and TR on
> > the local end translates into CTS and CD on the remote end through the
> > null modem cable. What does the system at the other end of the cable
> > see? Do you have a "breakout box"?
> >
> Here is the output of the other box (secondary node) of the cluster
> with this problem:
>
> afd@praktifix:~$ cat /proc/tty/driver/serial
> serinfo:1.0 driver:4.27
> 0: uart:16550A port:3F8 irq:4 baud:19200 tx:34964 rx:21248 fe:2610
> 1: uart:16550A port:2F8 irq:3 baud:19200 tx:4963114 rx:5608860 fe:2712 brk:5 RTS|DTR
> 2: uart:unknown port:3E8 irq:4
> 3: uart:unknown port:2E8 irq:3
>
> > It actually looks like the cable in the first case isn't hooked up *at
> > all*. Please trace the cable and check it out... [Maybe you've gotten
> > things plugged into the wrong ports?]. This could happen to you
> > undetected if you're also doing heartbeat across the ethernet. Note
> > that in the first case, it looks like you're using COM2 (/dev/ttyS1),
> > and in the second case it looks like you're using COM1 (/dev/ttyS0).
> >
> The second system I mentioned is another cluster. Sorry, about that!
> On the cluster where I have the problem with the serial port both are
> connected to ttyS1. I am also sure that they are connected and data is
> going over the line. But I need to check the cable if all lines are
> connected as you mentioned.
>
> > >
> > > heartbeat: 2000/04/20_20:04:52 warn: node botanix: is dead
> > > heartbeat: 2000/04/20_20:04:52 error: No local heartbeat. Forcing shutdown.
> >
> > This message is *always* unwelcome :-(. It means that it was unable to
> > hear its own heartbeat in the given heartbeat interval. This means
> > that the code isn't functioning correctly [probably because of load in
> > your case]. There are several things that one could do about this:
> > 1) Increase the heartbeat "dead" time
> > 2) Lock heartbeat into memory and increase its priority so it gets
> > scheduled
> > 3) decrease the system load
> >
> > #2 is the best option, followed by #2 and #1 combined. #3 would be a
> > last-ditch effort. I hadn't done #2 because no one had reported the
> > problem before.
> >
> Will #2 be implemented in heartbeat at some later stage?
>
> > The feature of not switching back to the master node is coming. It's
> > called "nice_failback".
> >
> Good, I can hardly wait for it. ;-)
>
> >
> > > Looking at the code I also noticed that when heartbeat is started
> > > no core dumps are allowed. Why is this so? I always found core files
> > > very informative. In fact I always do an abort() in my programs when
> > > I hit an unexpected situation. Core files might give one very useful
> > > information in strange situations.
> >
> > I don't do anything to deliberately disable core dumps. I've seen it
> > fail to dump core on many occasions - and it was quite frustrating.
> > What do you think I'm doing to prevent this? If you tell me what I'm
> > doing to cause this behavior, I'll fix it so fast it'll make your head
> > spin ;-)
> >
> Sorry, it's not you, it's the RH function daemon(). There they do a
> ulimit -c 0, thus disabling core dumps. But I am not sure if this
> function is still being used to start up heartbeat.
>
> Besides, I like that about spinning my head! ;-)


It looks like the daemon() script that comes with Red Hat is being used
to start heartbeat if you're running on Red Hat. Otherwise it uses its
own daemon() function. Sounds like you've found it.

This code occurs early in heartbeat.sh:

#
#  Source in Red Hat's function library.
#
if
  [ ! -x $RHFUNCS ]
then
  daemon() {
    $*
  }
  status() {
    $HA_BIN/heartbeat -s
  }
else
  . $RHFUNCS
fi

And later on, I use daemon to start it up:


start_heartbeat() {
  if
    daemon $HA_BIN/heartbeat # -d >& /dev/null
  then
    : OK
  else
    return $?
  fi
}

So, it looks like you're right. This fix goes into the next release ;-)
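
One possible shape for such a change, offered only as a sketch since the
actual fix is not shown here: start the program through a small wrapper
that raises the core-file limit itself instead of relying on a sourced
daemon(). `start_with_cores` is a hypothetical name, and whether the limit
can actually be raised depends on the hard limit in effect:

```shell
# Hypothetical wrapper, not the actual heartbeat fix: run a command with the
# core-file size limit raised, without touching the caller's own limits.
start_with_cores() {
    (
        # Try to allow core files of any size; if the hard limit forbids
        # it, carry on with whatever limit is already in place.
        ulimit -c unlimited 2>/dev/null || true
        # Replace the subshell with the command so its exit status is ours.
        exec "$@"
    )
}
```

With this in place, `start_with_cores $HA_BIN/heartbeat` could stand in for
the `daemon $HA_BIN/heartbeat` call above.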


Thanks Holger!

-- Alan Robertson
alanr@suse.com
Re: Error in serial code of heartbeat?
Holger Kiehl wrote:
>
> On Sat, 22 Apr 2000, Alan Robertson wrote:
>
> > Holger Kiehl wrote:
> > >
> > > Here is the information of the system that does have the problem it is
> > > with SuSE 6.1 and kernel 2.2.12:
> > >
> > > afd@diagnostix:~$ cat /proc/tty/driver/serial
> > > serinfo:1.0 driver:4.27
> > > 0: uart:16550A port:3F8 irq:4 baud:9600 tx:0 rx:0
> > > 1: uart:16550A port:2F8 irq:3 baud:19200 tx:5612304 rx:4978108 fe:3903 RTS|DTR
> > >
> > > I have another system with RH 6.1 and kernel 2.2.14 where I do NOT see
> > > these problems. The serial information is as follows:
> > >
> > > afd@botanix:~$ cat /proc/tty/driver/serial
> > > serinfo:1.0 driver:4.27
> > > 0: uart:16550A port:3F8 irq:4 baud:19200 tx:12430759 rx:10835270 fe:2541 RTS|CTS|DTR|DSR|CD
> > > 1: uart:16550A port:2F8 irq:3 tx:0 rx:0 RTS
> >
> This output is from another cluster. My wording should have been:
> I have another CLUSTER with RH 6.1 ...
> Sorry for the confusion!
>
> > > So if I understand you correctly, this shows that the cables of the two
> > > systems are not the same? I had them made at the same time.
> > > I will check them on Tuesday when I am back at work.
> >
> > Or the port is broken, or something else is wrong. The UART needs to
> > see CTS coming from the other system. Without it, the UART won't write
> > any characters, and you get the results you've been seeing. This
> > difference (RTS vs CTS|RTS) accounts for the difference. RTS and DTR on
> > the local end translates into CTS and CD on the remote end through the
> > null modem cable. What does the system at the other end of the cable
> > see? Do you have a "breakout box"?
> >
> Here is the output of the other box (secondary node) of the cluster
> with this problem:
>
> afd@praktifix:~$ cat /proc/tty/driver/serial
> serinfo:1.0 driver:4.27
> 0: uart:16550A port:3F8 irq:4 baud:19200 tx:34964 rx:21248 fe:2610
> 1: uart:16550A port:2F8 irq:3 baud:19200 tx:4963114 rx:5608860 fe:2712 brk:5 RTS|DTR

This is consistent with the other machine. Neither one sees CTS or CD.
This might also happen if there was no null modem on the line. In any
case, it looks like a wiring or other hardware problem.
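
As a rough illustration of the check being done by eye here, the following
sketch reports which of CTS and CD are missing from one line of
/proc/tty/driver/serial. The function name missing_modem_lines is made up
for this example, and it assumes the signal list, when present, is the last
field on the line, as in the outputs quoted in this thread.

```shell
# Hypothetical helper: print the signals (of CTS and CD) that are absent
# from one line of /proc/tty/driver/serial; print nothing if both are there.
missing_modem_lines() {
    # Take the last space-separated field and wrap it in '|' so each
    # signal name can be matched as a whole word, e.g. "|RTS|DTR|".
    flags="|${1##* }|"
    missing=""
    for sig in CTS CD; do
        case "$flags" in
            *"|$sig|"*) ;;                              # signal is asserted
            *) missing="${missing:+$missing }$sig" ;;   # signal is absent
        esac
    done
    printf '%s\n' "$missing"
}
```

For the port used in this thread it could be run as
`missing_modem_lines "$(grep '^1:' /proc/tty/driver/serial)"`. With the
RTS|DTR lines shown above it would print both CTS and CD on each node.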

> > It actually looks like the cable in the first case isn't hooked up *at
> > all*. Please trace the cable and check it out... [Maybe you've gotten
> > things plugged into the wrong ports?]. This could happen to you
> > undetected if you're also doing heartbeat across the ethernet. Note
> > that in the first case, it looks like you're using COM2 (/dev/ttyS1),
> > and in the second case it looks like you're using COM1 (/dev/ttyS0).
> >
> The second system I mentioned is another cluster. Sorry about that!
> On the cluster where I have the problem with the serial port, both nodes
> are connected via ttyS1. I am also sure that they are connected and data
> is going over the line. But I still need to check whether all lines in
> the cable are connected, as you mentioned.
>
> > >
> > > heartbeat: 2000/04/20_20:04:52 warn: node botanix: is dead
> > > heartbeat: 2000/04/20_20:04:52 error: No local heartbeat. Forcing shutdown.
> >
> > This message is *always* unwelcome :-(. It means that it was unable to
> > hear its own heartbeat in the given heartbeat interval. This means
> > that the code isn't functioning correctly [probably because of load in
> > your case]. There are several things that one could do about this:
> > 1) Increase the heartbeat "dead" time
> > 2) Lock heartbeat into memory and increase its priority so it gets
> > scheduled
> > 3) decrease the system load
> >
> > #2 is the best option, followed by #2 and #1 combined. #3 would be a
> > last-ditch effort. I hadn't done #2 because no one had reported the
> > problem before.
> >
> Will #2 be implemented in heartbeat at some later stage?

Yes. Probably next release.

-- Alan Robertson
alanr@suse.com
Re: Error in serial code of heartbeat? [ In reply to ]
On Sat, 22 Apr 2000, Alan Robertson wrote:

> Holger Kiehl wrote:
> >
> > Here is the information for the system that has the problem; it runs
> > SuSE 6.1 with kernel 2.2.12:
> >
> > afd@diagnostix:~$ cat /proc/tty/driver/serial
> > serinfo:1.0 driver:4.27
> > 0: uart:16550A port:3F8 irq:4 baud:9600 tx:0 rx:0
> > 1: uart:16550A port:2F8 irq:3 baud:19200 tx:5612304 rx:4978108 fe:3903 RTS|DTR
> >
> > I have another system with RH 6.1 and kernel 2.2.14 where I do NOT see
> > these problems. The serial information is as follows:
> >
> > afd@botanix:~$ cat /proc/tty/driver/serial
> > serinfo:1.0 driver:4.27
> > 0: uart:16550A port:3F8 irq:4 baud:19200 tx:12430759 rx:10835270 fe:2541 RTS|CTS|DTR|DSR|CD
> > 1: uart:16550A port:2F8 irq:3 tx:0 rx:0 RTS
>
This output is from another cluster. My wording should have been:
I have another CLUSTER with RH 6.1 ...
Sorry for the confusion!

> > So if I understand you correctly, this shows that the cables of the two
> > systems are not the same? I had them made at the same time.
> > I will check them on Tuesday when I am back at work.
>
> Or the port is broken, or something else is wrong. The UART needs to
> see CTS coming from the other system. Without it, the UART won't write
> any characters, and you get the results you've been seeing. This
> difference (RTS vs CTS|RTS) accounts for the difference. RTS and DTR on
> the local end translates into CTS and CD on the remote end through the
> null modem cable. What does the system at the other end of the cable
> see? Do you have a "breakout box"?
>
Here is the output of the other box (secondary node) of the cluster
with this problem:

afd@praktifix:~$ cat /proc/tty/driver/serial
serinfo:1.0 driver:4.27
0: uart:16550A port:3F8 irq:4 baud:19200 tx:34964 rx:21248 fe:2610
1: uart:16550A port:2F8 irq:3 baud:19200 tx:4963114 rx:5608860 fe:2712 brk:5 RTS|DTR
2: uart:unknown port:3E8 irq:4
3: uart:unknown port:2E8 irq:3

> It actually looks like the cable in the first case isn't hooked up *at
> all*. Please trace the cable and check it out... [Maybe you've gotten
> things plugged into the wrong ports?]. This could happen to you
> undetected if you're also doing heartbeat across the ethernet. Note
> that in the first case, it looks like you're using COM2 (/dev/ttyS1),
> and in the second case it looks like you're using COM1 (/dev/ttyS0).
>
The second system I mentioned is another cluster. Sorry about that!
On the cluster where I have the problem with the serial port, both nodes
are connected via ttyS1. I am also sure that they are connected and data
is going over the line. But I still need to check whether all lines in
the cable are connected, as you mentioned.

> >
> > heartbeat: 2000/04/20_20:04:52 warn: node botanix: is dead
> > heartbeat: 2000/04/20_20:04:52 error: No local heartbeat. Forcing shutdown.
>
> This message is *always* unwelcome :-(. It means that it was unable to
> hear its own heartbeat in the given heartbeat interval. This means
> that the code isn't functioning correctly [probably because of load in
> your case]. There are several things that one could do about this:
> 1) Increase the heartbeat "dead" time
> 2) Lock heartbeat into memory and increase its priority so it gets
> scheduled
> 3) decrease the system load
>
> #2 is the best option, followed by #2 and #1 combined. #3 would be a
> last-ditch effort. I hadn't done #2 because no one had reported the
> problem before.
>
Will #2 be implemented in heartbeat at some later stage?

> The feature of not switching back to the master node is coming. It's
> called "nice_failback".
>
Good, I can hardly wait for it. ;-)

>
> > Looking at the code I also noticed that when heartbeat is started
> > no core dumps are allowed. Why is this so? I always found core files
> > very informative. In fact I always do an abort() in my programs when
> > I hit an unexpected situation. Core files might give one very useful
> > information in strange situations.
>
> I don't do anything to deliberately disable core dumps. I've seen it
> fail to dump core on many occasions - and it was quite frustrating.
> What do you think I'm doing to prevent this? If you tell me what I'm
> doing to cause this behavior, I'll fix it so fast it'll make your head
> spin ;-)
>
Sorry, it's not you, it's the RH function daemon(). Here they do an
ulimit -c 0, thus disabling core dumps. But I am not sure if this
function is still being used to start up heartbeat.

Besides, I like that about spinning my head! ;-)

Thanks,
Holger
Re: Error in serial code of heartbeat? [ In reply to ]
Hello

First of all sorry for replying so late!

It seems it was a cable problem. I had it checked (taken apart) and
they told me there was nothing wrong with the cable. But after plugging
the cable in again, everything seems to be working. Here is the output of
/proc/tty/driver/serial:

node 1

1: uart:16550A port:2F8 irq:3 baud:19200 tx:15811383 rx:14934790 fe:3903 RTS|CTS|DTR|DSR|CD

node 2

1: uart:16550A port:2F8 irq:3 baud:19200 tx:14919246 rx:15807379 fe:2712 brk:5 RTS|CTS|DTR|DSR|CD

Also, there are NO more "TTY write timeout on" messages on either node.
So, I assume there must have been a loose connection in the cable itself.

Sorry for bothering the list with this problem, and thanks for the very
good help I received!

Regards,
Holger
Re: Error in serial code of heartbeat? [ In reply to ]
Holger Kiehl wrote:
>
> Hello
>
> First of all sorry for replying so late!
>
> It seems it was a cable problem. I had it checked (taken apart) and
> they told me there was nothing wrong with the cable. But after plugging
> the cable in again, everything seems to be working. Here is the output of
> /proc/tty/driver/serial:
>
> node 1
>
> 1: uart:16550A port:2F8 irq:3 baud:19200 tx:15811383 rx:14934790 fe:3903 RTS|CTS|DTR|DSR|CD
>
> node 2
>
> 1: uart:16550A port:2F8 irq:3 baud:19200 tx:14919246 rx:15807379 fe:2712 brk:5 RTS|CTS|DTR|DSR|CD
>
> Also, there are NO more "TTY write timeout on" messages on either node.
> So, I assume there must have been a loose connection in the cable itself.
>
> Sorry for bothering the list with this problem, and thanks for the very
> good help I received!

No! It was good to bother the list. These kinds of things happen to
everyone, and it helps everyone's problem debugging skills ;-) - and you
had other quite valuable things to say as well. I'm about to put out
0.4.7a which has many of the things you wanted in it: nice_failback,
core dumps, and lots of other small tweaks.

Thanks again!

-- Alan Robertson
alanr@suse.com