Mailing List Archive

Anybody seen this?
Hi all,
I am running heavy traffic on the primary and then kill the secondary
(power-off). Once the primary has recognized what happened, it logs these
messages:
drbd0: ack timeout detected (pc=30)!
drbd : timeout detected! (pid=3)
drbd0: Connection lost.(pc=30,uc=0)

and then dies!!!

Not pingable any more, no console input - simply dead. This happens every
now and then. Rebooting both nodes gets the system back into a good
state.
I have mounted the disks with sync and I am continuously copying files to
the disk to generate the load.

Any idea what happened? Unfortunately I am no kernel wizard :-/

/Wolfram


Here are the system details:
----------------------------------------
700 MHz Pentium III
512MB RAM
18GB IDE Disk
Lynuxworks BlueCat 3.0 (Kernel 2.2.12-1)
Heartbeat 0.4.9
DRBD 0.5.8.1 (reproduced with 0.6.1-pre2 as well)

eth0 (100BaseT) is used for heartbeat
eth1 (100BaseT) is used for heartbeat and drbd-sync
eth2 (100BaseT) is used for cluster-IP and client access

Here is the drbdsetup:
----------------------------------
Node1# drbdsetup /dev/nb0 show
Lower device: 22:03 (/dev/hdc3)
Disk options:
do-panic
Local address: 172.21.1.1:7788
Remote address: 172.21.2.1:7788
Wire protocol: B
Net options:
timeout = 6.0 sec
sync-rate = 3000 KB/sec
tl-size = 256
connect-int = 10 sec
ping-int = 10 sec

Node2# drbdsetup /dev/nb0 show
Lower device: 22:03 (/dev/hdc3)
Disk options:
do-panic
Local address: 172.21.2.1:7788
Remote address: 172.21.1.1:7788
Wire protocol: B
Net options:
timeout = 6.0 sec
sync-rate = 3000 KB/sec
tl-size = 256
connect-int = 10 sec
ping-int = 10 sec


=======================================================================
Wolfram Weyer FORCE COMPUTERS GmbH
Staff Engineer - Systems Engineering A Solectron Subsidiary

phone: +49 89 60814-523 Street: Prof.-Messerschmitt-Str. 1
fax: +49 89 60814-112 City: D-85579 Neubiberg/Muenchen
mailto:Wolfram.Weyer@example.com
http://www.forcecomputers.com
=======================================================================
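
For readers comparing the two "drbdsetup show" dumps above by eye, here is a minimal Python sketch (not part of the original post) that parses the pasted output and checks that the two nodes mirror each other. The parsing rules and the helper names parse_show and is_mirrored are assumptions derived only from the output shown above, not from any documented drbdsetup format.

```python
NODE1 = """\
Lower device: 22:03 (/dev/hdc3)
Disk options:
do-panic
Local address: 172.21.1.1:7788
Remote address: 172.21.2.1:7788
Wire protocol: B
Net options:
timeout = 6.0 sec
sync-rate = 3000 KB/sec
tl-size = 256
connect-int = 10 sec
ping-int = 10 sec
"""

NODE2 = """\
Lower device: 22:03 (/dev/hdc3)
Disk options:
do-panic
Local address: 172.21.2.1:7788
Remote address: 172.21.1.1:7788
Wire protocol: B
Net options:
timeout = 6.0 sec
sync-rate = 3000 KB/sec
tl-size = 256
connect-int = 10 sec
ping-int = 10 sec
"""


def parse_show(text):
    """Parse 'key: value' and 'key = value' lines; bare words become flags."""
    settings = {"flags": []}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if " = " in line:
            key, value = line.split(" = ", 1)
            settings[key] = value
        elif ": " in line:
            key, value = line.split(": ", 1)
            settings[key] = value
        elif line.endswith(":"):
            continue  # section header such as "Disk options:"
        else:
            settings["flags"].append(line)  # e.g. "do-panic"
    return settings


def is_mirrored(a, b):
    """True if the addresses are swapped and every other setting is equal."""
    swapped = (a["Local address"] == b["Remote address"]
               and a["Remote address"] == b["Local address"])
    skip = {"Local address", "Remote address"}
    rest_a = {k: v for k, v in a.items() if k not in skip}
    rest_b = {k: v for k, v in b.items() if k not in skip}
    return swapped and rest_a == rest_b


node1 = parse_show(NODE1)
node2 = parse_show(NODE2)
print(is_mirrored(node1, node2))
print(node1["timeout"], node1["flags"])
```

With the two dumps from this post, the check passes: the local and remote addresses are swapped and all other settings match.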
RE: Anybody seen this?
I have seen another flavor of this behavior:
When I power off the secondary node, there is a chance that the writing
process (the cp command in this case) gets stuck somewhere in the kernel
(kill -9 does not work). Surprisingly, it recovers once drbd comes up on
the rebooting secondary.

/Wolfram

-----Original Message-----
From: Weyer, Wolfram [mailto:Wolfram.Weyer@example.com]
Sent: Thursday, 27 September 2001 11:01
To: drbd-devel@example.com
Subject: [DRBD-dev] Anybody seen this?


Re: Anybody seen this?
Hi,

Something like this happened to me.
I got rid of the ack timeouts and connection losses by increasing the
timeout parameter in DRBD.
Maybe you will need to tune the Heartbeat timeout as well.
It seems that you are putting excessive load on your system.

Ricardo
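
Ricardo's suggestion can be pictured with a toy model: an ack that arrives later than the configured timeout counts as a lost connection, so a larger timeout tolerates slower acks under load. This is an illustration only, not DRBD's code. The function name first_timeout and the sample delays are made up; the 6.0 sec value is the timeout from the configuration posted earlier in the thread.

```python
def first_timeout(ack_delays, timeout):
    """Return the index of the first ack slower than `timeout`, or None."""
    for i, delay in enumerate(ack_delays):
        if delay > timeout:
            return i
    return None


# Hypothetical ack delays in seconds on a heavily loaded pair.
loaded = [0.2, 1.5, 7.0, 0.4]

print(first_timeout(loaded, 6.0))   # 2: the third ack is slower than 6.0 s
print(first_timeout(loaded, 10.0))  # None: every ack arrives within 10.0 s

# A powered-off peer never acks; model that as an infinite delay.
dead_peer = [0.2, float("inf")]
print(first_timeout(dead_peer, 60.0))  # 1: no finite timeout avoids this
```

In this model a longer timeout only helps when the peer is slow. When the peer is gone, the timeout fires regardless of its value.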


Re: Anybody seen this?
Wolfram,

Are you using 0.5.8 code or 0.6.1-pre?

-Philipp

Re: Anybody seen this?
Hmmm, this sounds bad. I will try this as soon as I have a cluster made of
real iron again.

-Philipp

* Weyer, Wolfram <Wolfram.Weyer@example.com> [010927 12:19]:
> I have seen another flavor of this behavior:
> When I power off the secondary node, there is a chance that the writing
> process (the cp command in this case) gets stuck somewhere in the kernel
> (kill -9 does not work). Surprisingly, it recovers once drbd comes up on
> the rebooting secondary.
>
> /Wolfram
RE: Anybody seen this?
Hi Ricardo,
the messages about ack timeouts are expected because I killed the
secondary. DRBD should time out and from then on write to its own disk only.
However, the bad thing that showed up is that the primary died after
recognizing that the secondary was dead.
I am loading the system by means of disk I/O. The CPU load does not
exceed 5%.

/Wolfram
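
The behaviour expected here can be sketched as a small state machine: the primary replicates writes while the peer acknowledges them, and after an ack timeout it marks the connection as lost and keeps writing to its own disk only. This is a toy model under stated assumptions, not DRBD's implementation. The class name Primary, the method write, and the parameter peer_ack_delay are hypothetical; the 6.0 sec default is the timeout from the posted configuration.

```python
class Primary:
    """Toy primary node that falls back to local-only writes on timeout."""

    def __init__(self, timeout=6.0):
        self.timeout = timeout    # seconds to wait for the peer's ack
        self.connected = True     # whether the secondary is considered alive
        self.local_disk = []      # blocks written to the local disk
        self.log = []             # messages the node would log

    def write(self, block, peer_ack_delay):
        """Store a block locally; replicate it while the peer is connected.

        peer_ack_delay is how long the peer takes to acknowledge, in seconds
        (float("inf") models a powered-off secondary).
        """
        self.local_disk.append(block)
        if not self.connected:
            return "local-only"
        if peer_ack_delay > self.timeout:
            # Expected reaction to a dead peer: log it and carry on alone.
            self.connected = False
            self.log.append("Connection lost.")
            return "local-only"
        return "replicated"


primary = Primary()
results = [
    primary.write("a", 0.1),           # peer alive, ack in time
    primary.write("b", float("inf")),  # peer powered off, ack never comes
    primary.write("c", float("inf")),  # already disconnected
]
print(results)             # ['replicated', 'local-only', 'local-only']
print(primary.local_disk)  # ['a', 'b', 'c']
print(primary.log)         # ['Connection lost.']
```

In the model every write still reaches the local disk after the timeout. The problem reported in this thread is that the real primary hangs instead of continuing in this state.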

>>> -----Original Message-----
>>> From: Ricardo Alexandre Mattar [mailto:mattar@example.com]
>>> Sent: Saturday, 29 September 2001 18:24
>>> To: Weyer, Wolfram
>>> Cc: drbd-devel@example.com
>>> Subject: Re: [DRBD-dev] Anybody seen this?
>>>
>>> _______________________________________________
>>> DRBD-devel mailing list
>>> DRBD-devel@example.com
>>> https://lists.sourceforge.net/lists/listinfo/drbd-devel
>>>