Hello Philipp,
My test scenario is pretty simple:
I have a "Connected" drbd cluster and put load on the disk by simply
copying files with "cp". Then I switch off the secondary node, restart it,
and let it join again. After a quicksync everything should be back in the
initial state.
I got the following results out of 20 runs:
16 times : everything ok with the following console messages on the primary
drbd0: ack timeout detected (pc=2)!
drbd0: Connection lost. (pc=2,uc=0)
2 times : the primary node hung completely after printing the console
messages
drbd0: ack timeout detected (pc=36)!
drbd : timeout detected! (pid=19932)
drbd0: Connection lost. (pc=36,uc=0)
The node was not pingable any more. No console input possible.
2 times : the last write access to the nb-device (a shell "cp" command)
gets stuck.
The disk is still writable, but the node prints the console messages
drbd0: ack timeout detected (pc=29)!
drbd : timeout detected! (pid=3)
The status of drbd changes to "Timeout st". When the secondary has
rebooted and reconnected I get some more console messages on the primary
drbd0: send timed out!! (pid=3)
drbd0: Connection lost. (pc=29,uc=0)
drbd0: Connection established.
...
and everything continues as if nothing had happened.
The messages common to both failure scenarios seem to be these "drbd : timeout
detected!" messages.
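
To sort further runs into the three outcomes above, the console output could be checked mechanically. What follows is only a minimal Python sketch, not part of drbd: the function names and the labels "ok", "hung" and "stuck-write" are made up for illustration, and it assumes that a run without the "drbd : timeout detected!" message is a good run and that only the stuck-write case ever logs "Connection established." afterwards.

```python
import re

# Patterns taken from the console messages quoted above.
GENERIC_TIMEOUT = re.compile(r"drbd\s*: timeout detected! \(pid=(\d+)\)")
PENDING = re.compile(r"pc=(\d+)")
RECONNECTED = "Connection established."


def classify_run(lines):
    """Label one run's console output; the labels are hypothetical names."""
    saw_timeout = any(GENERIC_TIMEOUT.search(line) for line in lines)
    if not saw_timeout:
        return "ok"
    # Assumption: a hung node never logs a reconnect, a stuck write does.
    if any(RECONNECTED in line for line in lines):
        return "stuck-write"
    return "hung"


def max_pending(lines):
    """Return the largest pc= value seen in the run, or None if absent."""
    values = [int(m.group(1)) for line in lines for m in PENDING.finditer(line)]
    return max(values) if values else None
```

Feeding it the lines from one run, for example classify_run(open("run07.log").read().splitlines()) with a hypothetical log file name, gives the label for that run, and max_pending shows the pc value that was reported with it.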
-------------------------------------------------------------------------------
I am using the 2.2.12-1 Kernel from the BlueCat 3.0 Distribution.
Below is my drbd.conf file, nothing special about that I think. As already
mentioned I have seen this behavior with the 6.1 pre-versions as well.
resource drbd0 {
  protocol=B
  fsckcmd=fsck -p -y
  disk {
    do-panic
    # disk-size=4096543
  }
  net {
    sync-rate=5000
    # skip-sync
    tl-size=256
    timeout=60
    connect-int=10
    ping-int=10
  }
  on node1 {
    device=/dev/nb0
    disk=/dev/hdc3
    address=172.21.1.1
    port=7788
  }
  on node2 {
    device=/dev/nb0
    disk=/dev/hdc3
    address=172.21.2.1
    port=7788
  }
}
Please let me know if there is anything more I can help with. Unfortunately I
have never used a kernel debugger to see what is happening here.
/Wolfram
>>> -----Original Message-----
>>> From: Philipp Reisner [mailto:philipp.reisner@example.com]
>>> Sent: Mittwoch, 17. Oktober 2001 12:38
>>> To: Weyer, Wolfram
>>> Cc: drbd-devel@example.com
>>> Subject: Re: [DRBD-dev] drpb pre4 test
>>>
>>>
>>> Hi Wolfram,
>>>
>>> could you give us a more detailed description of this lockup?
>>> You are using a 2.2.x kernel, right?
>>>
>>> -Philipp
>>>
>>> * Weyer, Wolfram <Wolfram.Weyer@example.com> [011017 09:47]:
>>> > Hi,
>>> > I think it's ok that you get these messages when the secondary dies.
>>> > However, my tests have shown that the primary then gets into some kind
>>> > of kernel lockup and has to be rebooted as well. Is this what you see?
>>> > It happens with all other versions (even 5.8.1) as well.
>>> >
>>> > /Wolfram
>>> >
>>> > >>> -----Original Message-----
>>> > >>> From: Jean-Yves Bouet - 78636 [mailto:jean-yves.bouet@example.com]
> >>> Sent: Mittwoch, 17. Oktober 2001 09:29
> >>> To: drbd-devel@example.com
> >>> Subject: [DRBD-dev] drpb pre4 test
> >>>
> >>>
> >>> Hello,
> >>>
> >>> strange message in my syslog using drbd pre4:
> >>>
> >>> Oct 16 17:03:21 CNODE-1-120 kernel: drbd0: ping ack did not arrive
> >>> Oct 16 17:03:21 CNODE-1-120 kernel: drbd0: sock_recvmsg returned -512
> >>> Oct 16 17:03:21 CNODE-1-120 kernel: drbd0: Connection lost.(pc=0,uc=0)
> >>>
> >>> I got it when secondary node fails.
> >>>
> >>> Bye!
> >>>
> >>> --
> >>> Jean-Yves BOUET
> >>> EADS Defence and Security Networks
> >>> jean-yves.bouet@example.com
> >>> 01 34 60 86 36
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> DRBD-devel mailing list
> >>> DRBD-devel@example.com
> >>> https://lists.sourceforge.net/lists/listinfo/drbd-devel
> >>>
>