Mailing List Archive

Timing Issue with Heartbeat?
I've been testing fail-over of a two node drbd/heartbeat
configuration. My method of forcing a fail-over has been
to run "/etc/rc.d/init.d/heartbeat stop" on the system that
is the drbd Primary.

Occasionally, after heartbeat transfers control from
one system to another, drbd status shows a disconnect:

On Node2:
version : 58

0: cs:WFConnection st:Secondary/Unknown ns:44 nr:28 dw:74 dr:17579 of:0

On Node1:
version : 58

0: cs:Unconfigured st:Primary/Unknown ns:28 nr:44 dw:73 dr:1081 of:0
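
For anyone comparing output like this across the two nodes, the individual fields can be pulled out of a status line with a few lines of shell. This is only a sketch: drbd_field is a made-up helper, not part of drbd, and it assumes the line format is exactly what is shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of drbd): print the value of one
# "key:value" field from a status line shaped like the ones above.
drbd_field() {
    key=$1
    line=$2
    for word in $line; do
        case "$word" in
            "$key":*) printf '%s\n' "${word#"$key":}"; return 0 ;;
        esac
    done
    return 1    # field not present
}

node1='0: cs:Unconfigured st:Primary/Unknown ns:28 nr:44 dw:73 dr:1081 of:0'
node2='0: cs:WFConnection st:Secondary/Unknown ns:44 nr:28 dw:74 dr:17579 of:0'

echo "node1: cs=$(drbd_field cs "$node1") st=$(drbd_field st "$node1")"
echo "node2: cs=$(drbd_field cs "$node2") st=$(drbd_field st "$node2")"
```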

I suspect that the "new" Primary brings up drbd before the "old"
Primary has fully released it. My haresources file looks like this:

bcmdual1 datadisk 10.1.15.1 101.1.15.1

Datadisk is brought up first on the "new" Primary and brought down
last on the "old" primary. The problem occurs more frequently when I
add a resource to the haresources:

bcmdual1 datadisk 10.1.15.1 101.1.15.1 myservice

Where "myservice" takes a good deal of time to go up or down.
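
To make the ordering concrete: the resources on a haresources line are started from left to right and stopped from right to left, which matches the behaviour described above. The sketch below only prints that order. print_order is a made-up helper and nothing here calls heartbeat itself.

```shell
#!/bin/sh
# Hypothetical helper: print the resources of a haresources line
# (node name first) in start order or in stop order.
print_order() {
    action=$1
    shift            # drop the action
    shift            # drop the node name
    order=
    for res in "$@"; do
        if [ "$action" = start ]; then
            order="${order:+$order }$res"     # left to right
        else
            order="$res${order:+ $order}"     # right to left
        fi
    done
    printf '%s\n' "$order"
}

print_order start bcmdual1 datadisk 10.1.15.1 101.1.15.1 myservice
print_order stop  bcmdual1 datadisk 10.1.15.1 101.1.15.1 myservice
```

In the stop order myservice comes first and datadisk last, so a slow myservice delays the release of datadisk on the old Primary. That would be consistent with the problem showing up more often once myservice is added.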

I seem to prevent the problem from happening by adding (yet another)
haresource:

bcmdual1 wait datadisk 10.1.15.1 101.1.15.1

Where the script for "wait" ensures that the other drbd node is not
Primary before continuing. Here is the wait script:

<<<<<<<<<<<<<<<<<<<<<<<<<<< start >>>>>>>>>>>>>>>>>>>>>>>>>>

#!/bin/sh
#
INSTALL="/etc/ha.d/resource.d"
. $INSTALL/drbd_commun $*


# See how we were called.
case "$1" in
  start)
    # Use this to wait for other side to become not Primary.

    if [ -z "$RESOURCE" ]; then
      runForAll start
      exit $RETVAL
    fi
    haveRemotePrimary $RESOURCE
    RPRIMARY=$?

    while [ $RPRIMARY -eq 1 ]; do
      logger -t "wait" "Waiting for other side to not be Primary."
      sleep 1
      haveRemotePrimary $RESOURCE
      RPRIMARY=$?
    done
    logger -t "wait" "Remote side is not Primary"
    ;;

  stop)
    if [ -z "$RESOURCE" ]; then
      runForAll stop
      exit $RETVAL
    fi
    ;;

  status)
    $INSTALL/drbdc $*
    ;;

  *)
    echo "Usage: $0 [resource] {start|stop|restart|status}"
    RETVAL=1
    ;;
esac

exit $RETVAL

<<<<<<<<<<<<<<<<<<<<<<<<<<< end >>>>>>>>>>>>>>>>>>>>>>>>>>

Is anyone else aware of this problem? Anyone see any issues with my
approach to fixing it? If not, should this be part of datadisk?

My Configuration:

Running RH 6.1. Heartbeat is running over two 10/100 ethernets.
nice_failback is ON. Heartbeat is using CRC.

DRBD:
GLOBAL_MODE = force
GLOBAL_OPTIONS="-t 99 -d 4195768 --sync-rate 51000"
GLOBAL_PROTOCOL="C"


--
Tony Willoughby
ADC Telecommunications, Inc.
Broadband Access and Transport Group
mailto:tony_willoughby@example.com
Re: Timing Issue with Heartbeat?
Hi,

The problem was raised before, but I missed this solution.
I am wondering whether the wait code could simply be added to the datadisk
start action instead of being a separate script. Please correct me if this
raises any problems, as I am not a heartbeat expert.

Thank you
Thomas

----- Original Message -----
From: "tony willoughby" <twilloughby@example.com>
To: "DRBD Developer List" <DRBD-devel@example.com>
Sent: Friday, October 27, 2000 7:06 PM
Subject: [DRBD-dev] Timing Issue with Heartbeat?


> _______________________________________________
> DRBD-devel mailing list
> DRBD-devel@example.com
> http://lists.sourceforge.net/mailman/listinfo/drbd-devel
>
Re: Timing Issue with Heartbeat?
Hello,

I updated datadisk in CVS for the "force" mode.
It now includes a piece of code that waits for the other node to release its
primary state before taking it over.
A timeout value can now be set in the configuration file. It is set to 300.

The code is new and untested, and may not work properly.
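
In outline, a wait with a timeout could look like the sketch below. It is not the code from CVS: wait_for_release and remote_is_primary are stand-in names, the remote check is faked with a counter so that the loop runs without drbd, and the timeout is assumed to be in seconds. Note that the stand-in returns 0 while the remote node is Primary, which may not match the return convention of the real check.

```shell
#!/bin/sh
# Stand-in for the real check: pretend the remote node stays Primary
# for the first $REMOTE_BUSY polls, then releases the resource.
REMOTE_BUSY=2
remote_is_primary() {
    [ "$REMOTE_BUSY" -gt 0 ] || return 1
    REMOTE_BUSY=$((REMOTE_BUSY - 1))
    return 0
}

# Poll until the remote node is no longer Primary or until the timeout
# (assumed to be in seconds) expires. Returns 0 on release, 1 on timeout.
# What to do after a timeout is left to the caller.
wait_for_release() {
    timeout=$1
    interval=${2:-1}
    waited=0
    while remote_is_primary; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

if wait_for_release 300 1; then
    echo "remote side released the resource after ${waited}s"
else
    echo "timed out after ${waited}s"
fi
```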

Thomas
--
Thomas Mangin (mailto:thomas.mangin@example.com)
System Administrator (mailto:systems@example.com)
Legend Internet Ltd. (http://www.legend.co.uk:/)
--
The urgent is done, the impossible is on the way, for miracles expect a
small delay
Re: Timing Issue with Heartbeat?
On Wed, 1 Nov 2000, Thomas Mangin wrote:

> Hello,
>
> I updated datadisk in CVS for the "force" mode.
> It now includes a piece of code that waits for the other node to release its
> primary state before taking it over.
> A timeout value can now be set in the configuration file. It is set to 300.
>
> The code is new and untested, and may not work properly.

I finally had a chance to sit down and try this out. It seems to work
well, cleaner than my proposal. Thanks for the effort.

Just a couple of notes:

- I downloaded the latest CVS code on November 9th.
- On my system (Red Hat 6.1), datadisk's calls to logger were hanging.
  The -t switch takes the tag as its argument, so a second string
  must follow as the message.
- I was getting syntax errors on the line:
      if [ $havetowait -eq 1]; then
  Adding a space after the 1 and before the ] cleaned it up.
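
Both problems are easy to reproduce outside datadisk. With only one string after -t, logger takes that string as the tag and then reads the message from standard input, which would explain the hang. The sketch below is not taken from datadisk: log_msg and check_wait are made-up names, and the fallback to echo is only there so the example runs on a machine without a working logger.

```shell
#!/bin/sh
# Log with an explicit tag. The message is a separate argument, so
# logger never falls back to reading the message from standard input.
log_msg() {
    logger -t "datadisk" "$1" 2>/dev/null || echo "datadisk: $1" >&2
}

# Hypothetical function: report whether we have to wait.
check_wait() {
    havetowait=$1
    # "]" must be a separate word, hence the space after the 1.
    if [ "$havetowait" -eq 1 ]; then
        echo "wait"
    else
        echo "no wait"
    fi
}

check_wait 1
check_wait 0
log_msg "The other node is now secondary"
```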

The diff follows.

17a18,19
> # set -x
>
35c37
< logger -t "Waiting for the other node to become secondary"
---
> logger -t "datadisk" "Waiting for the other node to become secondary"
45c47
< if [ $havetowait -eq 1]; then
---
> if [ $havetowait -eq 1 ]; then
47,48c49,50
< logger -t "The other node didn't release the resource after $2 second"
< logger -t "Forcing the to become primary"
---
> logger -t "datadisk" "The other node didn't release the resource after $2 second"
> logger -t "datadisk" "Forcing the to become primary"
50c52
< logger -t "The other node is now secondary"
---
> logger -t "datadisk" "The other node is now secondary"
53c55
< logger -t "The node can become primary without waiting"
---
> logger -t "datadisk" "The node can become primary without waiting"

--
Tony Willoughby
ADC Telecommunications, Inc.
Broadband Access and Transport Group
mailto:tony_willoughby@example.com
Re: Timing Issue with Heartbeat?
Hi,

Thank you for the report. I will update CVS as soon as I can.
Happy to see that heartbeat is finally working properly.

Thomas
----- Original Message -----
From: "tony willoughby" <twilloughby@example.com>
To: "Thomas Mangin" <thomasm@example.com>
Cc: "DRBD Developer List" <DRBD-devel@example.com>
Sent: Monday, November 13, 2000 7:27 PM
Subject: Re: [DRBD-dev] Timing Issue with Heartbeat?

