Mailing List Archive

heartbeat/drbd0.6.1-pre2/strange behaviour
Hello,

When the former master joins the cluster, it must be fully synchronized,
right?
This is not the case with drbd 0.6.1-pre2. It worked with pre1.

What I did:
node 1, node 2: rm /var/lib/drbd/*
node 1: drbd start
node 2: drbd start
node 1: heartbeat start (becomes master, mounts OK, services OK)
node 2: heartbeat start
node 1: reboot
node 2: becomes master.
node 1: comes back.
node 1: drbd start :
Setting up drbd0...
Setting up drbd1...
Do you want to abort waiting for other server and make this one primary?
no
--> Node 1 should start a full sync here, no?

Bye,

--
Jean-Yves BOUET
EADS Defence and Security Networks
jean-yves.bouet@example.com
Re: heartbeat/drbd0.6.1-pre2/strange behaviour
Sorry, I cannot reproduce this report:

What I did:

tcube3:/var/lib/drbd# /etc/init.d/drbd start
Setting up drbd0...[ OK ]
Setting up drbd1...[ OK ]
Setting up drbd2...[ OK ]
Setting up drbd3...[ OK ]
Setting up drbd4...[ OK ]
Setting up drbd5...[ OK ]
Setting up drbd6...[ OK ]
Do you want to abort waiting for other server and make this one primary? no
tcube3:/var/lib/drbd# drbdsetup /dev/nb0 primary
tcube3:/var/lib/drbd# /home/phil/drbd/testing/read_gc.pl
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 2 1 primary
drbd1 1 1 1 1 secondary
drbd2 1 1 1 1 secondary
drbd3 1 1 1 1 secondary
drbd4 1 1 1 1 secondary
drbd5 1 1 1 1 secondary
drbd6 1 1 1 1 secondary
tcube3:/var/lib/drbd# cp drbd0 drbd0_
#### Since I do not want to reboot my machine by pressing the reset
button, I make a copy of the state file.

tcube3:/var/lib/drbd# rmmod drbd
tcube3:/var/lib/drbd# /home/phil/drbd/testing/read_gc.pl
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 2 1 secondary
drbd1 1 1 1 1 secondary
drbd2 1 1 1 1 secondary
drbd3 1 1 1 1 secondary
drbd4 1 1 1 1 secondary
drbd5 1 1 1 1 secondary
drbd6 1 1 1 1 secondary
tcube3:/var/lib/drbd# mv drbd0_ drbd0
#### You can see that module unloading changed the "lastState" field.
In case of a real crash the "lastState" field would still be primary,
therefore I move drbd0_ back into its place.

tcube2:~# drbdsetup /dev/nb0 primary
#### On the other node, I am simulating heartbeat's action (switching
the device into primary state).

tcube3:/var/lib/drbd# /etc/init.d/drbd start
Setting up drbd0...[ OK ]
Setting up drbd1...[ OK ]
Setting up drbd2...[ OK ]
Setting up drbd3...[ OK ]
Setting up drbd4...[ OK ]
Setting up drbd5...[ OK ]
Setting up drbd6...[ OK ]
Do you want to abort waiting for other server and make this one primary?
Waiting until drbd0 is up to date (using SyncingAll) abort?
#### You can see here that drbd0 starts a SyncAll from tcube2, exactly
what I expected.

-Philipp

* Jean-Yves Bouet - 78636 <jean-yves.bouet@example.com> [010926 09:09]:
> Hello,
>
> When the former master joins the cluster, it must be fully synchronized,
> right?
> This is not the case with drbd 0.6.1-pre2. It worked with pre1.
>
> What i did:
> node1 ,node 2: rm /var/lib/drbd/*
> node 1: drbd start
> node 2: drbd start
> node 1: heartbeat start (becomes master, mount OK,services OK)
> node 2: heartbeat start
> node 1: reboot
> node 2: becomes master.
> node 1: comes back.
> node 1: drbd start :
> Setting up drbd0...
> Setting up drbd1...
> Do you want to abort waiting for other server and make this one primary?
> no
> --> Node 1 should start a full sync here, no?
>
> Bye,
>
> --
> Jean-Yves BOUET
> EADS Defence and Security Networks
> jean-yves.bouet@example.com
>
>
>
>
> _______________________________________________
> DRBD-devel mailing list
> DRBD-devel@example.com
> https://lists.sourceforge.net/lists/listinfo/drbd-devel
Re: heartbeat/drbd0.6.1-pre2/strange behaviour
Philipp Reisner wrote:

> Sorry, I can not reproduce this report:
>
> What I did:
>
> tcube3:/var/lib/drbd# /etc/init.d/drbd start
> Setting up drbd0...[ OK ]
> Setting up drbd1...[ OK ]
> Setting up drbd2...[ OK ]
> Setting up drbd3...[ OK ]
> Setting up drbd4...[ OK ]
> Setting up drbd5...[ OK ]
> Setting up drbd6...[ OK ]
> Do you want to abort waiting for other server and make this one primary? no
> tcube3:/var/lib/drbd# drbdsetup /dev/nb0 primary
> tcube3:/var/lib/drbd# /home/phil/drbd/testing/read_gc.pl
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 2 1 primary
> drbd1 1 1 1 1 secondary
> drbd2 1 1 1 1 secondary
> drbd3 1 1 1 1 secondary
> drbd4 1 1 1 1 secondary
> drbd5 1 1 1 1 secondary
> drbd6 1 1 1 1 secondary
> tcube3:/var/lib/drbd# cp drbd0 drbd0_
> #### Since I do not want to reboot my machine by reset button pressing
> I make a copy of the state file.
>
> tcube3:/var/lib/drbd# rmmod drbd
> tcube3:/var/lib/drbd# /home/phil/drbd/testing/read_gc.pl
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 2 1 secondary
> drbd1 1 1 1 1 secondary
> drbd2 1 1 1 1 secondary
> drbd3 1 1 1 1 secondary
> drbd4 1 1 1 1 secondary
> drbd5 1 1 1 1 secondary
> drbd6 1 1 1 1 secondary
> tcube3:/var/lib/drbd# mv drbd0_ drbd0
> #### You can see that module unloading changed the "lastState" field.
> In case of a real crash the "lastState" field should still be primary,
> therefore I move the drbd0_ into it's place.
>
> tcube2:~# drbdsetup /dev/nb0 primary
> #### On the other node, I am simulating heartbeat's action (switching
> the device into primary state.
>
> tcube3:/var/lib/drbd# /etc/init.d/drbd start
> Setting up drbd0...[ OK ]
> Setting up drbd1...[ OK ]
> Setting up drbd2...[ OK ]
> Setting up drbd3...[ OK ]
> Setting up drbd4...[ OK ]
> Setting up drbd5...[ OK ]
> Setting up drbd6...[ OK ]
> Do you want to abort waiting for other server and make this one primary?
> Waiting until drbd0 is up to date (using SyncingAll) abort?
> #### You can see here that drbd0 becomes a SyncAll from tcube2, exactly
> what I expected.
>

OK, my test did not work because I used a SOFT reboot.

But in that case, the node should also be synchronized when it comes back, no?
Re: heartbeat/drbd0.6.1-pre2/strange behaviour
* Jean-Yves Bouet - 78636 <jean-yves.bouet@example.com> [010926 16:08]:
> Philipp Reisner wrote:
>
> > Sorry, I can not reproduce this report:
> >
> > What I did:
> >
[description removed]
> >
>
> OK, my test did not work because I used a SOFT reboot.
>
> But in that case, the node should also be synchronized when it comes back, no?
>

If you are doing a graceful reboot, the module is unloaded and there is
no reason for a full sync.

If the primary leaves the cluster gracefully, we know that it cannot
have made a last-second modification of the storage.

-Philipp
Re: heartbeat/drbd0.6.1-pre2/strange behaviour
>
> If you are doing a graceful reboot, the module is unloaded and there is
> no reason for a full sync.
>
> If the primary leaves the cluster gracefully, we know that it cannot
> have made a last-second modification of the storage.
>
> -Philipp

Hello,
OK, in the case of a graceful reboot, the former master does not need a full sync.
But it should at least do a quick sync: it must get the data that was written on the
other node while it was away.
And I think this is not the case with drbd-0.6.1-pre2.

node 1 (master): GRACEFUL reboot
node 2: becomes master, mounts drbd devices
node 2: writes on drbd device
node 1: comes back, drbd start. Here it should do a quick sync, no? That's not
what happens.
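The expected behaviour can be sketched as a tiny decision function. This is a hypothetical illustration only, not DRBD's actual code, and `sync_mode` and its arguments are invented names: a node whose state file still says "primary" must have crashed and needs a full sync, while a node that left gracefully only needs a quick sync of whatever the surviving primary wrote in the meantime.

```python
# Hypothetical sketch of the sync decision described in this thread.
# This is NOT DRBD's real code; names and logic are illustrative only.

def sync_mode(last_state_on_disk, peer_wrote_while_away):
    """Decide how a returning node should resynchronize.

    last_state_on_disk: "primary" or "secondary", as recorded in the
    state file. A graceful shutdown flips it to "secondary"; after a
    crash it still reads "primary".
    """
    if last_state_on_disk == "primary":
        # Crash while primary: the local data cannot be trusted at all.
        return "SyncingAll"      # full sync
    if peer_wrote_while_away:
        # Graceful leave, but the peer modified the device meanwhile.
        return "SyncingQuick"    # quick sync of the changed blocks
    return "NoSync"              # nothing changed, nothing to do

# The two scenarios from this thread:
print(sync_mode("primary", True))    # reboot -f    -> SyncingAll
print(sync_mode("secondary", True))  # /sbin/reboot -> SyncingQuick
```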

Bye.

--
Jean-Yves BOUET
EADS Defence and Security Networks
jean-yves.bouet@example.com
01 34 60 86 36
Re: heartbeat/drbd0.6.1-pre2/strange behaviour
Hi Jean,

After the CUBiT fiasco (they have only paid part of my September
salary) I finally have a development environment set up at home
(consisting of two UML boxes; but I am confident that soon I will have a real
cluster as well, including an SMP box).

##Ok, I started with connected devices, one put into primary state:
[root@uml1 /root]# /home/philipp/src/uni/drbd/testing/read_gc.pl
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 2 1 primary
[root@uml2 /root]# /home/philipp/src/uni/drbd/testing/read_gc.pl
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 2 1 secondary

##Graceful reboot on uml1 == unloading of the module:
[root@uml1 /root]# /home/philipp/src/uni/drbd/testing/read_gc.pl
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 2 1 secondary

##Becoming primary on uml2
[root@uml2 /root]# drbdsetup /dev/nb0 primary
[root@uml2 /root]# /home/philipp/src/uni/drbd/testing/read_gc.pl
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 2 2 primary

##Now uml1 comes back
[root@uml1 /root]# /etc/init.d/drbd start
Setting up drbd0...[ OK ]
Do you want to abort waiting for other server and make this one primary?
Waiting until drbd0 is up to date (using SyncingQuick) abort? no

##Quicksync happened.


There must be a difference between what you did and what I did. Could you
please try to find out what happened on your cluster? -- Use the read_gc.pl
script after each and every step.

-Philipp

* Jean-Yves Bouet - 78636 <jean-yves.bouet@example.com> [010927 14:18]:
> > Hello,
> > OK, in the case of a graceful reboot, the former master does not need a full sync.
> > But it should at least do a quick sync: it must get the data that was written on the
> > other node while it was away.
> > And I think this is not the case with drbd-0.6.1-pre2.
> >
> > node 1 (master): GRACEFUL reboot
> > node 2: becomes master, mounts drbd devices
> > node 2: writes on drbd device
> > node 1: comes back, drbd start. Here it should do a quick sync, no? That's not
> > what happens.
> >
> > Bye.
>
> When i play this scenario I have this message in my syslog:
> (when node 1 comes back)
> kernel: drbd0: Connection established.
> drbd0: size=52416 KB / blksize=4096 B
> drbd0: predetermined states are in contradiction to GC's
> drbd0: cancelling automatic resynchronisation
>
Re: heartbeat/drbd0.6.1-pre2/strange behaviour
Philipp Reisner wrote:

> Hi Jean,
>
> After the CUBiT fiasco (the have only paid part of my September
> salary) I have finaly a development environment set up at home
> (consisting of two UML boxes; But I am confident that soon I will have a real
> cluster as well hope, including a SMP box)
>
> ##Ok, I started with connected devices, one put into primary state:
> [root@uml1 /root]# /home/philipp/src/uni/drbd/testing/read_gc.pl
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 2 1 primary
> [root@uml2 /root]# /home/philipp/src/uni/drbd/testing/read_gc.pl
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 2 1 secondary
>
> ##Gracefull reboot on uml1 == unloading of module:
> [root@uml1 /root]# /home/philipp/src/uni/drbd/testing/read_gc.pl
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 2 1 secondary
>
> ##Becoming primary on uml2
> [root@uml2 /root]# drbdsetup /dev/nb0 primary
> [root@uml2 /root]# /home/philipp/src/uni/drbd/testing/read_gc.pl
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 2 2 primary
>
> ##Now uml1 comes back
> [root@uml1 /root]# ./etc/init.d/drbd start
> Setting up drbd0...[ OK ]
> Do you want to abort waiting for other server and make this one primary?
> Waiting until drbd0 is up to date (using SyncingQuick) abort? no
>
> ##Quicksync happended.
>
> There must be a difference in what you did, and I did. Could you please
> try to find out what happened at your cluster ? -- Use the read_pc.pl
> script after each and every step.
>
> -Philipp
>

Hi Philipp,
Thanks for looking at my problems, in spite of your business troubles ...
Here are more details about what I did:
#node 1 : rm /var/lib/drbd/*
#node 2 : rm /var/lib/drbd/*

#node 1 : drbd start
#node 2 : drbd start

#node 1 : readgc
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 1 1 secondary
drbd1 1 1 1 1 secondary

#node 2 : readgc
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 1 1 secondary
drbd1 1 1 1 1 secondary

#node 1: heartbeat start
node 1 is master. Mounts OK. Services OK.
#node 2: heartbeat start


#node 1 : readgc
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 2 1 primary
drbd1 1 1 2 1 primary
#node 2 : readgc
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 2 1 primary
drbd1 1 1 2 1 primary


#node 1 : reboot (/sbin/reboot)
#node 2 : becomes master. Mounts OK, services OK

#node 1 : is back
#node 1 : readgc
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 3 1 secondary
drbd1 1 1 3 1 secondary

#node 1 : drbd start
Setting up drbd0...
[ OK ]
Setting up drbd1...
[ OK ]
Do you want to abort waiting for other server and make this one primary?
no
--> No sync seems to start ... <--
#node 1 : cat /proc/drbd
version: 0.6.1-pre2 (api:58/proto:58)

0: cs:Connected st:Secondary/Primary ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
1: cs:Connected st:Secondary/Primary ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
#node 1 : tail syslog
Oct 2 10:25:10 CNODE-1-110 kernel: drbd: initialised. Version: 0.6.1-pre2
(api:58/proto:58)
Oct 2 10:25:10 CNODE-1-110 kernel: drbd : vmallocing 3213 B for bitmap. @c88c0020
Oct 2 10:25:10 CNODE-1-110 kernel: drbd1: Connection established.
Oct 2 10:25:10 CNODE-1-110 kernel: drbd1: size=102816 KB / blksize=4096 B
Oct 2 10:25:10 CNODE-1-110 kernel: drbd1: predetermined states are in contradiction to
GC's
Oct 2 10:25:10 CNODE-1-110 kernel: drbd1: cancelling automatic resynchronisation
Oct 2 10:25:10 CNODE-1-110 kernel: drbd : vmallocing 1638 B for bitmap. @c88c2020
Oct 2 10:25:10 CNODE-1-110 kernel: drbd0: Connection established.
Oct 2 10:25:10 CNODE-1-110 kernel: drbd0: size=52416 KB / blksize=4096 B
Oct 2 10:25:10 CNODE-1-110 kernel: drbd0: predetermined states are in contradiction to
GC's
Oct 2 10:25:10 CNODE-1-110 kernel: drbd0: cancelling automatic resynchronisation
#node 2 : readgc
device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
drbd0 1 1 2 2 primary
drbd1 1 1 2 2 primary

If I use "reboot -f" instead of a simple "reboot", node 1 starts a full sync when it
comes back.

My drbd.conf:


resource drbd0 {
  protocol=B
  fsckcmd=true
  net {
    sync-rate=2500
    tl-size=512
  }

  on CNODE-1-110 {
    device=/dev/nb0
    disk=/dev/hdc3
    address=136.10.15.110
    port=7788
  }

  on CNODE-1-120 {
    device=/dev/nb0
    disk=/dev/hdc3
    address=136.10.15.120
    port=7788
  }
}

resource drbd1 {
  protocol=C
  fsckcmd=true
  net {
    sync-rate=2500
    tl-size=512
  }

  on CNODE-1-110 {
    device=/dev/nb1
    disk=/dev/hdc4
    address=136.10.15.110
    port=7789
  }

  on CNODE-1-120 {
    device=/dev/nb1
    disk=/dev/hdc4
    address=136.10.15.120
    port=7789
  }
}


Bye!

PS: another question:
could you explain more precisely what the tl-size parameter does? The doc isn't
very clear.
Thanks.

--
Jean-Yves BOUET
EADS Defence and Security Networks
jean-yves.bouet@example.com
01 34 60 86 36
Re: heartbeat/drbd0.6.1-pre2/strange behaviour
* Jean-Yves Bouet - 78636 <jean-yves.bouet@example.com> [011002 10:49]:
[...]
>
> Hi Philipp,
> Thanks for looking at my problems, in spite of your business troubles ...
> Here are more details about what I did:
> #node 1 : rm /var/lib/drbd/*
> #node 2 : rm /var/lib/drbd/*
>
> #node 1 : drbd start
> #node 2 : drbd start
>
> #node 1 : readgc
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 1 1 secondary
> drbd1 1 1 1 1 secondary
>
> #node 2 : readgc
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 1 1 secondary
> drbd1 1 1 1 1 secondary
>
> #node 1: heartbeat start
> node 1 is master. Mounts OK. Services OK.
> #node 2: heartbeat start
>
>
> #node 1 : readgc
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 2 1 primary
> drbd1 1 1 2 1 primary
> #node 2 : readgc
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 2 1 primary
> drbd1 1 1 2 1 primary
>
>
> #node 1 : reboot (/sbin/reboot)
> #node 2 : becomes master. Mounts OK, services OK
>
> #node 1 : is back
> #node 1 : readgc
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 3 1 secondary
> drbd1 1 1 3 1 secondary
>
> #node 1 : drbd start
> Setting up drbd0...
> [ OK ]
> Setting up drbd1...
> [ OK ]
> Do you want to abort waiting for other server and make this one primary?
> no
> -->No synch seems to start ...<--
> #node 1 : cat /proc/drbd
> version: 0.6.1-pre2 (api:58/proto:58)
>
> 0: cs:Connected st:Secondary/Primary ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
> 1: cs:Connected st:Secondary/Primary ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
> #node 1 : tail syslog
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd: initialised. Version: 0.6.1-pre2
> (api:58/proto:58)
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd : vmallocing 3213 B for bitmap. @c88c0020
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd1: Connection established.
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd1: size=102816 KB / blksize=4096 B
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd1: predetermined states are in contradiction to
> GC's
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd1: cancelling automatic resynchronisation
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd : vmallocing 1638 B for bitmap. @c88c2020
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd0: Connection established.
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd0: size=52416 KB / blksize=4096 B
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd0: predetermined states are in contradiction to
> GC's
> Oct 2 10:25:10 CNODE-1-110 kernel: drbd0: cancelling automatic resynchronisation
> #node 2 : readgc
> device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
> drbd0 1 1 2 2 primary
> drbd1 1 1 2 2 primary
>
> If i use "reboot -f" instead of a simple "reboot", node 1 starts full synch when it comes
> back.
>
> My drbd.conf:
[...]


Ok, now I understand what happens here.

There are two problems: one on your machines and one in the drbd code. The
first one exposed the bug in the drbd code, so that is a good thing :)

After the reboot of node1, its GC was (1,3,1,s) while node2's was (1,2,2,p). This
shows us that the nodes were separated by a network failure.

Node1 increased its connected count, since it was primary and from its
point of view the secondary went away.

Node2 increased its arbitrary count, since it was put into primary state
after it had lost the connection to its partner.
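That bookkeeping can be mimicked in a small model. This is hypothetical code, not DRBD's: the `Node` class and its methods are invented, and only the field names follow read_gc.pl's output. Starting from the (1,2,1) counters both nodes showed after the cluster formed, a network failure that hits before drbd stop runs reproduces exactly the (1,3,1,s) / (1,2,2,p) mismatch seen above:

```python
# Hypothetical model of the generation counters discussed in this thread.
# This is NOT DRBD's real code; only the field names follow read_gc.pl.

class Node:
    def __init__(self, connected=1, state="secondary"):
        self.human = 1              # HumanCnt
        self.connected = connected  # ConnectedCnt
        self.arbitrary = 1          # ArbitraryCnt
        self.state = state          # lastState (as written to the state file)
        self.linked = True          # currently connected to the peer?

    def peer_lost(self):
        # A primary that sees its secondary vanish bumps ConnectedCnt.
        if self.state == "primary":
            self.connected += 1
        self.linked = False

    def become_primary(self):
        # Promotion without seeing the peer is an "arbitrary" decision.
        if not self.linked:
            self.arbitrary += 1
        self.state = "primary"

    def gc(self):
        return (self.human, self.connected, self.arbitrary, self.state[0])

# Both nodes after the cluster formed normally: (1, 2, 1, ...)
node1 = Node(connected=2, state="primary")
node2 = Node(connected=2, state="secondary")

# The network stops before drbd stop runs on node1:
node1.peer_lost()          # node1: ConnectedCnt 2 -> 3
node1.state = "secondary"  # graceful module unload flips lastState
node2.peer_lost()          # node2 is secondary: no counter change
node2.become_primary()     # node2: ArbitraryCnt 1 -> 2

print(node1.gc())  # (1, 3, 1, 's')
print(node2.gc())  # (1, 2, 2, 'p')
```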

==> Have a look at your power-down script sequence. It should be:
1) heartbeat stop
2) drbd stop
3) network stop
---
It seems that network stop is called before heartbeat and drbd stop.

*********************
The right decision for drbd would be to do a full sync. That drbd
cancels the resynchronisation is wrong. -- I will fix this.

-Philipp
RE: heartbeat/drbd0.6.1-pre2/strange behaviour
Hi Philipp / Jean-Yves

> After the reboot of node1, its GC was (1,3,1,s) while node2's was
> (1,2,2,p). This shows us that the nodes were separated by a network
> failure.
>
> Node1 increased its connected count, since it was primary and from its
> point of view the secondary went away.
>
> Node2 increased its arbitrary count, since it was put into primary state
> after it had lost the connection to its partner.
>
> ==> Have a look at the your power-down script sequence. It should be:
> 1) heartbeat stop
> 2) drbd stop
> 3) network stop
> ---
> It seems that network stop is called before heartbeat and drbd stop.

Crosscheck with shutdown-redhat patch:
*) is this on a redhat box?
*) does drbd start touch /var/lock/subsys/drbd to enable proper service
shutdown?

Bye, Martin