I've been testing fail-over of a two-node drbd/heartbeat
configuration. My method of forcing a fail-over has been
to run "/etc/rc.d/init.d/heartbeat stop" on the system that
is the drbd Primary.
Occasionally, after heartbeat transfers control from
one system to another, drbd status shows a disconnect:
On Node2:
version : 58
0: cs:WFConnection st:Secondary/Unknown ns:44 nr:28 dw:74 dr:17579 of:0
On Node1:
version : 58
0: cs:Unconfigured st:Primary/Unknown ns:28 nr:44 dw:73 dr:1081 of:0
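For anyone who wants to check for this state from a script, here is a rough
sketch of pulling the connection state (cs:) and the local/remote roles (st:)
out of one status line. The get_field helper is my own illustration, not part
of drbd, and the sample line is simply the Node1 output above.

```sh
#!/bin/sh
# Sketch only: split one drbd status line into its cs: and st: fields.
# get_field is a hypothetical helper, not something drbd ships.
line='0: cs:Unconfigured st:Primary/Unknown ns:28 nr:44 dw:73 dr:1081 of:0'

get_field() {
    # $1 = field name (e.g. cs), $2 = full status line
    for tok in $2; do
        case "$tok" in
            "$1":*) echo "${tok#*:}"; return 0 ;;
        esac
    done
    return 1
}

cs=$(get_field cs "$line")
st=$(get_field st "$line")

# st: holds "local/remote"; split it on the slash.
echo "connection=$cs local=${st%%/*} remote=${st#*/}"
```

Run against the Node1 line, this reports the connection as Unconfigured with
the local side Primary and the remote side Unknown.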
I suspect that the "new" Primary brings up drbd before the "old"
Primary has fully released it. My haresources looks like this:
bcmdual1 datadisk 10.1.15.1 101.1.15.1
Datadisk is brought up first on the "new" Primary and brought down
last on the "old" primary. The problem occurs more frequently when I
add a resource to the haresources:
bcmdual1 datadisk 10.1.15.1 101.1.15.1 myservice
Where "myservice" takes a good deal of time to go up or down.
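To make the ordering concrete, here is a small sketch that prints the order I
understand heartbeat to use for the line above: resources are started left to
right and stopped right to left. It only manipulates the resource names as
strings and does not talk to heartbeat at all.

```sh
#!/bin/sh
# Illustration only: derive start and stop order from a haresources
# resource list (node name omitted). Assumes left-to-right start and
# right-to-left stop, which matches what I observe for datadisk.
resources="datadisk 10.1.15.1 101.1.15.1 myservice"

start_order=$resources
stop_order=""
for r in $resources; do
    # Prepend each resource so the list ends up reversed.
    stop_order="$r${stop_order:+ $stop_order}"
done

echo "start: $start_order"
echo "stop:  $stop_order"
```

With "myservice" at the end, the old Primary has to finish stopping it before
it reaches datadisk, while the new Primary starts datadisk first. That would
explain why a slow "myservice" makes the overlap more likely.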
I seem to prevent the problem from happening by adding (yet another)
haresource:
bcmdual1 wait datadisk 10.1.15.1 101.1.15.1
Where the script for "wait" ensures that the other drbd node is not
Primary before continuing. Here is the wait script:
<<<<<<<<<<<<<<<<<<<<<<<<<<< start >>>>>>>>>>>>>>>>>>>>>>>>>>
#!/bin/sh
#
# wait: heartbeat resource script that blocks on "start" until the
# remote drbd node is no longer Primary.

INSTALL="/etc/ha.d/resource.d"
# drbd_commun provides RESOURCE, RETVAL, runForAll and haveRemotePrimary.
. "$INSTALL/drbd_commun" "$@"

# See how we were called.
case "$1" in
    start)
        # Use this to wait for other side to become not Primary.
        if [ -z "$RESOURCE" ]; then
            runForAll start
            exit $RETVAL
        fi
        # haveRemotePrimary returns 1 while the peer is still Primary.
        haveRemotePrimary "$RESOURCE"
        RPRIMARY=$?
        while [ $RPRIMARY -eq 1 ]; do
            logger -t "wait" "Waiting for other side to not be Primary."
            sleep 1
            haveRemotePrimary "$RESOURCE"
            RPRIMARY=$?
        done
        logger -t "wait" "Remote side is not Primary"
        ;;
    stop)
        # Nothing to wait for on stop.
        if [ -z "$RESOURCE" ]; then
            runForAll stop
            exit $RETVAL
        fi
        ;;
    status)
        "$INSTALL/drbdc" "$@"
        ;;
    *)
        echo "Usage: $0 [resource] {start|stop|status}"
        RETVAL=1
        ;;
esac
exit $RETVAL
<<<<<<<<<<<<<<<<<<<<<<<<<<< end >>>>>>>>>>>>>>>>>>>>>>>>>>
Is anyone else aware of this problem? Anyone see any issues with my
approach to fixing it? If not, should this be part of datadisk?
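One issue I can see myself: the loop in "wait" never gives up if the other
side stays Primary. A bounded version might look roughly like the sketch
below. MAX_WAIT and the resource name drbd0 are made-up values for
illustration, and haveRemotePrimary is replaced here by a stub so the snippet
runs on its own; the real function comes from drbd_commun.

```sh
#!/bin/sh
# Sketch only: the same wait loop with an upper bound on the wait time.
MAX_WAIT=30    # assumed limit in seconds; not a drbd or heartbeat setting

# Stub standing in for haveRemotePrimary from drbd_commun. It reports
# "peer is Primary" (return 1) for the first two polls, then 0.
polls=0
haveRemotePrimary() {
    polls=$((polls + 1))
    [ "$polls" -le 2 ] && return 1
    return 0
}

wait_for_peer() {
    # $1 = resource name. Returns 0 once the peer is not Primary,
    # 1 if MAX_WAIT seconds pass first.
    waited=0
    while :; do
        haveRemotePrimary "$1"
        [ $? -eq 1 ] || return 0
        [ "$waited" -ge "$MAX_WAIT" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}

wait_for_peer drbd0    # drbd0 is a hypothetical resource name
rc=$?
echo "rc=$rc after ${waited}s"
```

What the script should do on a timeout (fail the start, or carry on anyway)
is a separate question that I have not settled.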
My Configuration:
Running RH 6.1. Heartbeat is running over two 10/100 Ethernet links.
nice_failback is ON. Heartbeat is using CRC.
DRBD:
GLOBAL_MODE = force
GLOBAL_OPTIONS="-t 99 -d 4195768 --sync-rate 51000"
GLOBAL_PROTOCOL="C"
--
Tony Willoughby
ADC Telecommunications, Inc.
Broadband Access and Transport Group
mailto:tony_willoughby@example.com