Hello
There seems to be a bug in the heartbeat serial code. I have been using
heartbeat for a very long time without any problems. But since I moved
the machine and put a higher constant load on it, I have been getting
the following error every hour:
TTY write timeout on [/dev/ttyS1] (no connection?)
I was running version 0.4.6c when these errors first popped up. I
rebooted both nodes several times, but this did not help; the error
always came back. I then ran strace on the heartbeat process that
handles the serial port and could see that it reads from the serial fd
only once every two seconds, even though the serial buffer was full of
data! I verified this by simply disconnecting the serial connection:
the heartbeat process kept reading data from the serial port for about
5 - 10 minutes before the buffer was empty! When I connected it again,
this time with a serial analyser between the two nodes, I could see
the buffer fill up until it was full again and the RTS signal dropped.
It seems that heartbeat reads just one record every two seconds
and does not read everything that is in the buffer. So if the process
writing to the port writes faster than that, the buffer always fills
up, and heartbeat will NOT detect that the other node has crashed
until the buffer has drained, which takes 5 - 10 minutes.
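
To illustrate what I mean by reading everything from the buffer, here
is a minimal sketch in Python. This is NOT the heartbeat code (which I
have only observed through strace); the function name drain_records and
the assumption that records are newline-terminated are mine, purely for
illustration. The idea is to put the fd into non-blocking mode and keep
reading until the kernel reports that nothing more is queued, instead
of taking one record per wakeup:

```python
import os


def drain_records(fd, chunk_size=4096):
    """Read everything currently queued on fd without blocking.

    Returns (records, remainder): the complete newline-terminated
    records that were queued, plus any trailing partial record.
    Hypothetical helper for illustration, not part of heartbeat.
    """
    os.set_blocking(fd, False)
    pending = bytearray()
    while True:
        try:
            data = os.read(fd, chunk_size)
        except BlockingIOError:
            break  # nothing more queued right now
        if not data:
            break  # EOF: the other end closed the connection
        pending += data
    # Everything before the last newline is a complete record;
    # whatever follows it is an incomplete record to keep for later.
    *records, remainder = bytes(pending).split(b"\n")
    return records, remainder
```

Called once per wakeup, a loop like this empties the queue each time,
so a backlog cannot build up behind a slow reader. The caller would
have to keep the remainder and prepend it to the data from the next
call.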
Two days ago I decided to upgrade to 0.4.7, and everything seemed to
be running fine. However, looking at the log files this morning, I see
that the same messages appear on both nodes.
As I said, this all started to happen when I moved the nodes from one
room to another and started running more processes on them, causing a
higher load on the active node:
9:06am up 1 day, 22:19, 5 users, load average: 0.87, 0.61, 0.44
There are now about 195 processes running on the active node. Before
the move, the load average was always around zero.
Holger