Mailing List Archive

drpb pre4 test
Hello,

I found some strange messages in my syslog while using drbd pre4:

Oct 16 17:03:21 CNODE-1-120 kernel: drbd0: ping ack did not arrive
Oct 16 17:03:21 CNODE-1-120 kernel: drbd0: sock_recvmsg returned -512
Oct 16 17:03:21 CNODE-1-120 kernel: drbd0: Connection lost.(pc=0,uc=0)

I got them when the secondary node failed.

Bye!

--
Jean-Yves BOUET
EADS Defence and Security Networks
jean-yves.bouet@example.com
01 34 60 86 36
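
A note on the -512 in the second log line: that value is consistent with
ERESTARTSYS, a kernel-internal code Linux uses when a blocking call is
interrupted by a signal. This is an interpretation of the log, not something
stated in the thread. The sketch below decodes such return values; the table
is copied by hand from the Linux errno header, and describe_return is a
hypothetical helper name.

```python
import errno

# Kernel-internal return codes as defined in Linux's include/linux/errno.h.
# They are not meant to reach user space, so Python's errno module does not
# list them; this table is written out by hand for illustration.
KERNEL_INTERNAL_CODES = {
    512: "ERESTARTSYS",
    513: "ERESTARTNOINTR",
    514: "ERESTARTNOHAND",
    515: "ENOIOCTLCMD",
}


def describe_return(value):
    """Map a negative kernel return value to a symbolic name (hypothetical helper)."""
    if value >= 0:
        return "success"
    code = -value
    if code in KERNEL_INTERNAL_CODES:
        return KERNEL_INTERNAL_CODES[code]
    # Fall back to the ordinary errno table of the platform running this script.
    return errno.errorcode.get(code, "unknown")


print(describe_return(-512))  # ERESTARTSYS
```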
RE: drpb pre4 test [ In reply to ]
Hi,
I think it's OK that you get these messages when the secondary dies. However,
my tests have shown that the primary then gets into some kind of kernel
lockup and has to be rebooted as well. Is this what you see?
It happens with all other versions (even 5.8.1) as well.

/Wolfram

Re: drpb pre4 test [ In reply to ]
"Weyer, Wolfram" wrote:

> Hi,
> I think it's OK that you get these messages when the secondary dies. However,
> my tests have shown that the primary then gets into some kind of kernel
> lockup and has to be rebooted as well. Is this what you see?
> It happens with all other versions (even 5.8.1) as well.
>
> /Wolfram

No, I haven't had any problem with the primary node when the secondary dies.
Moreover, the strange messages I got didn't appear with former versions.
Re: drpb pre4 test [ In reply to ]
Hi Wolfram,

Could you give us a more detailed description of this lockup?
You are using a 2.2.x kernel, right?

-Philipp

RE: drpb pre4 test [ In reply to ]
Hello Philipp,

my test scenario is pretty easy:
I have a "Connected" drbd cluster which puts load on the disk by simply
copying files via "cp". Then I switch off the secondary node. Then I restart
the secondary and let it join again. After a quicksync everything should be
in the initial state again.

I got the following results out of 20 runs:

16 times: everything was OK, with the following console messages on the
primary:
    drbd0: ack timeout detected (pc=2)!
    drbd0: Connection lost. (pc=2,uc=0)

2 times: the primary node hung completely after printing these console
messages:
    drbd0: ack timeout detected (pc=36)!
    drbd : timeout detected! (pid=19932)
    drbd0: Connection lost. (pc=36,uc=0)
The node was no longer pingable, and no console input was possible.

2 times: the last write access to the nb device (a shell cp command) got
stuck. The disk was still writable, but the node printed these console
messages:
    drbd0: ack timeout detected (pc=29)!
    drbd : timeout detected! (pid=3)
The status of drbd changed to "Timeout st". Once the secondary had
rebooted and reconnected, I got some more console messages on the primary:
    drbd0: send timed out!! (pid=3)
    drbd0: Connection lost. (pc=29,uc=0)
    drbd0: Connection established.
    ...
and everything continued as if nothing had happened.

The message common to both failure scenarios seems to be the "drbd : timeout
detected!" message.
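
To check many runs for that marker, the console output can be scanned
mechanically. The following is a minimal sketch: the regular expressions are
derived only from the messages quoted in this mail, not from the drbd source,
and summarize is a hypothetical helper name.

```python
import re

# Patterns for the console messages quoted in this thread. The formats are
# taken from the quoted logs only, so other drbd versions may differ.
ACK_TIMEOUT = re.compile(r"drbd\d+: ack timeout detected \(pc=(\d+)\)!")
CONN_LOST = re.compile(r"drbd\d+: Connection lost\.\s*\(pc=(\d+),uc=(\d+)\)")
TIMEOUT_PID = re.compile(r"drbd\s*: timeout detected! \(pid=(\d+)\)")


def summarize(lines):
    """Collect the pc/uc fields and any 'timeout detected' pids from one run."""
    summary = {"pc": None, "uc": None, "timeout_pids": []}
    for line in lines:
        m = ACK_TIMEOUT.search(line)
        if m:
            summary["pc"] = int(m.group(1))
        m = CONN_LOST.search(line)
        if m:
            summary["pc"] = int(m.group(1))
            summary["uc"] = int(m.group(2))
        m = TIMEOUT_PID.search(line)
        if m:
            summary["timeout_pids"].append(int(m.group(1)))
    return summary


ok_run = [
    "drbd0: ack timeout detected (pc=2)!",
    "drbd0: Connection lost. (pc=2,uc=0)",
]
hung_run = [
    "drbd0: ack timeout detected (pc=36)!",
    "drbd : timeout detected! (pid=19932)",
    "drbd0: Connection lost. (pc=36,uc=0)",
]

for name, run in (("ok", ok_run), ("hung", hung_run)):
    result = summarize(run)
    # A non-empty timeout_pids list is the marker shared by both failure cases.
    print(name, result, "suspicious" if result["timeout_pids"] else "clean")
```

A run with an empty timeout_pids list matches the 16 good runs; the marker
alone does not distinguish a full hang from a stuck write.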

-------------------------------------------------------------------------------
I am using the 2.2.12-1 Kernel from the BlueCat 3.0 Distribution.


Below is my drbd.conf file, nothing special about that I think. As already
mentioned I have seen this behavior with the 6.1 pre-versions as well.

resource drbd0 {

  protocol=B
  fsckcmd=fsck -p -y

  disk {
    do-panic
    # disk-size=4096543
  }

  net {
    sync-rate=5000
    # skip-sync
    tl-size=256
    timeout=60
    connect-int=10
    ping-int=10
  }

  on node1 {
    device=/dev/nb0
    disk=/dev/hdc3
    address=172.21.1.1
    port=7788
  }

  on node2 {
    device=/dev/nb0
    disk=/dev/hdc3
    address=172.21.2.1
    port=7788
  }
}
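
For comparing settings between test setups, a file like this can be read into
nested dictionaries. This is only a sketch: the grammar is inferred from the
fragment above (brace-delimited sections, key=value lines, bare flags, #
comments) and is not the parser drbd itself uses. parse_conf and SAMPLE are
hypothetical names, and SAMPLE is an abbreviated copy of the file.

```python
SAMPLE = """
resource drbd0 {
  protocol=B
  fsckcmd=fsck -p -y
  disk {
    do-panic
    # disk-size=4096543
  }
  net {
    timeout=60
    ping-int=10
  }
  on node1 {
    device=/dev/nb0
    address=172.21.1.1
  }
}
"""


def parse_conf(text):
    """Parse brace-delimited sections with key=value lines into nested dicts."""
    root = {}
    stack = [root]  # innermost open section is always stack[-1]
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.endswith("{"):
            section = {}
            stack[-1][line[:-1].strip()] = section
            stack.append(section)
        elif line == "}":
            stack.pop()
        elif "=" in line:
            key, value = line.split("=", 1)
            stack[-1][key.strip()] = value.strip()
        else:
            stack[-1][line] = True  # bare flag such as do-panic
    return root


conf = parse_conf(SAMPLE)
print(conf["resource drbd0"]["net"])
```

Values stay as strings, so no unit is assumed for timeout or ping-int.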


Please let me know if there is anything more I can help with. Unfortunately, I
have never used a kernel debugger to see what is happening here.


/Wolfram




Re: drpb pre4 test [ In reply to ]
* Weyer, Wolfram <Wolfram.Weyer@example.com> [011017 17:51]:
> Hello Philipp,
>
> my test scenario is pretty easy:
> I have a "Connected" drbd cluster which puts load on the disk by simply
> copying files via "cp". Then I switch off the secondary node. Then I restart
> the secondary and let it join again. After a quicksync everything should be
> in the initial state again.
>
> I got the following results out of 20 runs:
>
> 16 times : everything ok with the following console messages on the primary
> drbd0: ack timeout detected (pc=2)!
> drbd0: Connection lost. (pc=2,uc=0)
>
> 2 times : the primary node got completely hung after printing the console
> messages
> drbd0: ack timeout detected (pc=36)!
> drbd : timeout detected! (pid=19932)
> drbd0: Connection lost. (pc=36,uc=0)
> The node was not pingable any more. No console input possible.

Hmmm, a complete lockup. Why are you using the 2.2.12 kernel? There
are more recent 2.2.x kernels out there...

> 2 times : the last write access to the nb-device gets stuck (a shell-cp
> command).
> The disk is still writable but the node throws the console messages
> drbd0: ack timeout detected (pc=29)!
> drbd : timeout detected! (pid=3)
> The status of drbd changes to "Timeout st". When the secondary has
> rebooted and reconnected I get some more console message on the primary
> drbd0: send timed out!! (pid=3)
> drbd0: Connection lost. (pc=29,uc=0)
> drbd0: Connection established.
> ...
> and everything continues as if nothing has happened
>
> The message common to both failure scenarios seems to be the "drbd : timeout
> detected!" message.

OK, this means that the signal is not getting through. I will try to
reproduce this in the course of the next week. (Just to be sure, this
was with pre4, right?)

-Philipp
RE: drpb pre4 test [ In reply to ]
Philipp,
we are using the 2.2.12 kernel as it is part of the BlueCat 3.0
distribution. We want to integrate this into a high-volume embedded product
and a support contract is already in place, which is based on this
distribution. If we change the kernel we will lose supportability.
In the near future BlueCat 4.0 will be available, which is based on the
2.4.2 kernel. This might improve some things.

For the test run I reported I used 5.8.1, but I reproduced this with 6.1pre3
as well (the messages may have been different). Is it of any use if I repeat
the test with 6.1pre3? As already mentioned, I could not get pre4 to compile
with the 2.2.12 kernel :(

/Wolfram
