Mailing List Archive: Error in serial code of heartbeat?

Apr 20, 2000, 12:24 AM

Hello

There seems to be a bug in the heartbeat serial code. I have been using
heartbeat for a very long time without any problems. But since I moved
the machines and put a higher constant load on the active node, I have
been getting the following error every hour:
TTY write timeout on [/dev/ttyS1] (no connection?)

I was running version 0.4.6c when these errors first appeared. I
rebooted both nodes several times, but that did not help: the error
always came back. I then ran strace on the heartbeat process that
handles the serial port and saw that it reads from the serial fd only
once every two seconds, even though the serial buffer was full of
data! I verified this by disconnecting the serial connection: the
heartbeat process kept reading data from the serial port for about
5 to 10 minutes before the buffer was empty! After reconnecting it,
this time with a serial analyser between the two nodes, I could
watch the buffer fill up until it was full again and the RTS
signal dropped.

It seems that heartbeat reads just one record every two seconds
instead of reading everything that is in the buffer. So if the
other node writes to the port faster than that, the buffer always
fills up, and heartbeat will NOT notice that the other node has
crashed until the buffer is empty, 5 to 10 minutes later.
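
To make the arithmetic concrete, here is a small Python sketch (not heartbeat code) of a sender that queues records faster than the reader removes them. Only the two-second read interval comes from the strace output; the one-second send interval and the 200-record buffer capacity are assumptions picked purely for illustration, and both function names are made up.

```python
def backlog_after(duration_ms: int, send_ms: int, read_ms: int, capacity: int) -> int:
    """Records still queued after duration_ms.

    The peer queues one record every send_ms (assumed value), the reader
    removes exactly one record every read_ms, and the buffer holds at most
    `capacity` records (assumed value).
    """
    queued = 0
    for t in range(1, duration_ms + 1):
        if t % send_ms == 0 and queued < capacity:
            # Peer writes a record. When the buffer is full the write is
            # skipped here; on a real line, flow control stalls the writer.
            queued += 1
        if t % read_ms == 0 and queued > 0:
            # Reader consumes one record per wake-up, never more.
            queued -= 1
    return queued


def drain_delay_ms(queued: int, read_ms: int) -> int:
    """Time needed to work through a backlog at one record per read_ms."""
    return queued * read_ms


if __name__ == "__main__":
    backlog = backlog_after(600_000, send_ms=1000, read_ms=2000, capacity=200)
    print(backlog, "records queued;", drain_delay_ms(backlog, 2000) / 1000, "s to drain")
```

With these assumed numbers the backlog settles at 199 records, which take 398 seconds (a bit over six and a half minutes) to work through at one record every two seconds. During that time the reader still sees "fresh" heartbeats from a node that may already be dead.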

Two days ago I upgraded to 0.4.7 and everything seemed to be
running fine. However, this morning I found the same messages in
the log files on both nodes.

As I said, this all started when I moved the nodes from one room
to another and now have more processes running, which puts a
higher load on the active node:

9:06am up 1 day, 22:19, 5 users, load average: 0.87, 0.61, 0.44

There are now about 195 processes running on the active node. Before
the move, the load average was always around zero.

Holger

horms at vergenet

Apr 21, 2000, 1:27 AM

On Thu, Apr 20, 2000 at 09:24:43AM +0200, Holger Kiehl wrote:
> It seems that heartbeat is reading just one record every two seconds
> and does not read everything from the buffer. So if the process
> writing to the port writes faster, it will always fill the
> buffer and heartbeat will NOT detect if the other node has
> crashed for 5 - 10 minutes until the buffer is empty.

I'm trying to track this down, but I'm not having a lot of luck.
The serial code does read only one line at a time, but the process
that handles the reading of the data should (by my reading of the code)
be continuously reading information from the serial port.

One possibility I thought of is that the buffer for the pipe
used to communicate status between heartbeat processes is filling
up. But the process that reads the pipe should, again by my reading
of the code, be doing this continuously.

Another possibility is that heartbeat is taking so long to write
messages that it is unable to read messages fast enough (from the pipe).
This seems unlikely, though it would fit with the problem only
appearing under load. If this is the case, then a mechanism for
flushing the buffers continuously and discarding backlogged messages
would be required.
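
As a rough illustration of what "flush and discard the backlog" could look like, here is a minimal Python sketch. It is not heartbeat's code and not a proposed patch: the function name `drain_latest`, the newline-delimited `hb N` records and the 4096-byte read size are all assumptions made for the example. It reads everything currently available on a non-blocking descriptor and keeps only the newest complete line.

```python
import os


def drain_latest(fd: int, pending: bytes = b""):
    """Read all data currently available on a non-blocking fd.

    Returns (latest, partial, discarded):
      latest    - newest complete newline-terminated record, or None
      partial   - trailing bytes of an unfinished record, to pass back in
      discarded - number of older complete records that were dropped
    """
    chunks = [pending]
    while True:
        try:
            chunk = os.read(fd, 4096)   # read size is an arbitrary choice
        except BlockingIOError:
            break                       # nothing more to read right now
        if not chunk:
            break                       # EOF: the writer closed its end
        chunks.append(chunk)
    *complete, partial = b"".join(chunks).split(b"\n")
    latest = complete[-1] if complete else None
    discarded = max(len(complete) - 1, 0)
    return latest, partial, discarded


if __name__ == "__main__":
    # A pipe stands in for the serial port or the inter-process pipe (Unix only).
    r, w = os.pipe()
    os.set_blocking(r, False)
    os.write(w, b"hb 1\nhb 2\nhb 3\nhb")     # three records plus a fragment
    print(drain_latest(r))
```

The point of the design is that the reader's view of the peer is never older than the last wake-up, no matter how far behind it has fallen. The cost is that intermediate messages are thrown away, which is only acceptable if each heartbeat supersedes the previous one.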

I am going to try to reproduce this problem to understand it
better. Alan, do you have any ideas on what the cause might be?

--
Horms

alanr at suse

Apr 21, 2000, 7:41 AM

Horms wrote:
> I am going to try to reproduce this problem to understand it
> better. Alan, do you have any ideas on what the cause might be?

I have no solid idea what this might be. I'll see if I can set
things up here to reproduce it. Heartbeat has the design it does
primarily to prevent these kinds of things...

I'll let you know what I find out as well.

Sorry for the difficulties!

-- Alan Robertson
alanr@suse.com