Mailing List Archive

More Meta-data testing...
I've done some more testing of the meta-data code and think I've
uncovered a bug. It's reproducible: I can trigger it with heartbeat
controlling DRBD or by controlling DRBD by hand. (The example below
is without HB.)

We can get into a situation where a file created on one node is not
quick-synced to its partner. DRBD always selects the correct node to
be primary, but it seems it doesn't do the quick-sync. My two systems
are called "cuda1" and "cuda2":


1) Cuda1: start DRBD, make it primary, mount fs.

2) Full sync from Cuda1 to Cuda2.

-- [root@cuda1 /root]# cat /proc/drbd ; dmeta
-- version: 0.6.1-pre3 (api:58/proto:58)
--
-- 0: cs:Connected st:Primary/Secondary ns:4097800 nr:0 dw:164 dr:4097460 pe:0 ua:0
-- device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
-- drbd0 1 1 27 11 primary

3) Cuda2: stop DRBD.

-- [root@cuda2 /root]# /etc/rc.d/init.d/drbd stop

On Cuda1 we have:

-- [root@cuda1 /root]# cat /proc/drbd ; dmeta
-- version: 0.6.1-pre3 (api:58/proto:58)
--
-- 0: cs:WFConnection st:Primary/Unknown ns:4097800 nr:0 dw:164 dr:4097460 pe:0 ua:0
-- device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
-- drbd0 1 1 28 11 primary

4) Cuda1: Create file in shared partition.

-- [root@cuda1 /root]# echo "testing" > /bas/data/testfile
-- [root@cuda1 /root]# ls -l /bas/data/testfile
-- -rw-r--r-- 1 root root 8 Oct 15 15:20 /bas/data/testfile

5) Cuda1: umount fs, make DRBD secondary, stop DRBD.

-- [root@cuda1 /root]# umount /bas/data
-- [root@cuda1 /root]# drbdsetup /dev/nb0 secondary
-- [root@cuda1 /root]# /etc/rc.d/init.d/drbd stop
-- [root@cuda1 /root]#
-- [root@cuda1 /root]# cat /proc/drbd ; dmeta
-- cat: /proc/drbd: No such file or directory
-- device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
-- drbd0 1 1 28 11 secondary
-- [root@cuda1 /root]#

6) Start DRBD on both nodes. (Note: DRBD always selects the correct
node to be primary)

-- [root@cuda1 /root]# /etc/rc.d/init.d/drbd start
-- Setting up drbd0...[ OK ]
-- Do you want to abort waiting for other server and make this one primary?
-- /dev/nb0 is Connected,Primaryno
-- [root@cuda1 /root]# cat /proc/drbd ; dmeta
-- version: 0.6.1-pre3 (api:58/proto:58)
--
-- 0: cs:Connected st:Primary/Secondary ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
-- device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
-- drbd0 1 1 28 12 primary

7) Cuda1: DRBD is already primary, mount the fs. Observe the file
created in step 4 is present.

-- [root@cuda1 /root]# mount /bas/data
-- [root@cuda1 /root]# ls -l /bas/data/testfile
-- -rw-r--r-- 1 root root 8 Oct 15 15:20 /bas/data/testfile

8) Cuda1: umount fs, make drbd secondary.

-- [root@cuda1 /root]# umount /bas/data
-- [root@cuda1 /root]# drbdsetup /dev/nb0 secondary
-- [root@cuda1 /root]# cat /proc/drbd ; dmeta
-- version: 0.6.1-pre3 (api:58/proto:58)
--
-- 0: cs:Connected st:Secondary/Secondary ns:176 nr:0 dw:44 dr:4 pe:0 ua:0
-- device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
-- drbd0 1 1 28 12 secondary

9) Cuda2: make DRBD primary. mount fs. Observe the file created in
step 4 is missing.

-- [root@cuda2 /root]# drbdsetup /dev/nb0 primary
-- [root@cuda2 /root]# mount /bas/data
-- [root@cuda2 /root]# ls -l /bas/data/testfile
-- ls: /bas/data/testfile: No such file or directory
-- [root@cuda2 /root]# cat /proc/drbd ; dmeta
-- version: 0.6.1-pre3 (api:58/proto:58)
--
-- 0: cs:Connected st:Primary/Secondary ns:16 nr:44 dw:48 dr:80 pe:0 ua:0
-- device | Consistent | HumanCnt | ConnectedCnt | ArbitraryCnt | lastState
-- drbd0 1 1 29 12 primary
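
The /proc/drbd status lines in the steps above can be compared
mechanically. The following is a minimal Python sketch, not part of
DRBD: the function name parse_drbd_status is hypothetical, and it
assumes only the "key:value" layout visible in the output above. It
also assumes that ns counts data sent over the network. Under that
assumption, the ns:0 reported right after the reconnect in step 6
suggests that nothing was transferred to cuda2.

```python
import re


def parse_drbd_status(line):
    """Split one /proc/drbd device line into a dict of its key:value fields.

    Hypothetical helper; assumes the layout shown in the transcript,
    e.g. "0: cs:Connected st:Primary/Secondary ns:0 nr:0 ...".
    """
    line = line.strip()
    # Drop the "-- " prefix used in the mail transcript, if present.
    if line.startswith("--"):
        line = line[2:].strip()
    match = re.match(r"(\d+):\s+(.*)", line)
    if match is None:
        raise ValueError("not a /proc/drbd device line: %r" % line)
    fields = {"device": int(match.group(1))}
    for token in match.group(2).split():
        key, _, value = token.partition(":")
        # Counters become ints; states such as "Primary/Secondary" stay strings.
        fields[key] = int(value) if value.isdigit() else value
    return fields


# Lines copied from step 3 (before restart) and step 6 (after reconnect).
before = parse_drbd_status(
    "-- 0: cs:WFConnection st:Primary/Unknown ns:4097800 nr:0 dw:164 dr:4097460 pe:0 ua:0"
)
after = parse_drbd_status(
    "-- 0: cs:Connected st:Primary/Secondary ns:0 nr:0 dw:0 dr:0 pe:0 ua:0"
)

if after["cs"] == "Connected" and after["ns"] == 0:
    print("connected, but nothing sent to the peer since start")
```

Comparing the parsed dicts from both nodes after each step makes it
easier to spot the point at which a sync was expected but did not occur.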

Has anyone else seen this problem?


--------------------------------------------------------------------------
System configuration follows:

Two RH 6.1 systems.
DRBD: version: 0.6.1-pre3 (api:58/proto:58).
reiserfs over DRBD.
Heartbeat: version 0.4.9.0d.

drbd.conf:

resource drbd0 {

protocol=B
fsckcmd=fsck -p -y

disk {
do-panic
disk-size=2048728
}

net {
sync-rate=5000
skip-sync
tl-size=256
timeout=60
connect-int=10
ping-int=10
}

on cuda1 {
device=/dev/nb0
disk=/dev/hda5
address=192.0.2.1
port=7788
}

on cuda2 {
device=/dev/nb0
disk=/dev/hda5
address=192.0.2.2
port=7788
}
}


--
Tony Willoughby tonyw@example.com

"That's where Japanzees live."
-My four year old describing Japan.

RE: More Meta-data testing...
Hi Tony,


> Has anyone else seen this problem?

only when configuring drbd to skip sync on connect - which should only be
done for debugging.

> net {
> sync-rate=5000
> skip-sync
^^^^^^^^^
> tl-size=256
> timeout=60
> connect-int=10
> ping-int=10
> }

Try without this. By the way, in the scenario you described, drbd should
do a full sync, not a quick sync (the primary was turned off after
modifications, so no blockmap is available for a quick sync).
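
A quick way to catch this before starting DRBD is to scan the config
for an active skip-sync line. The following is a minimal Python sketch
under stated assumptions: the helper names active_options and
find_option are hypothetical, options are assumed to be written one per
line as "name" or "name=value" as in the drbd.conf quoted above, and
"#" is assumed to start a comment.

```python
def active_options(conf_text):
    """Return (line_number, option_name) for every uncommented option line.

    Hypothetical helper; assumes one option per line and '#' comments.
    """
    found = []
    for number, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop anything after '#'
        # Skip blanks, closing braces and section openers such as "net {".
        if not line or line == "}" or line.endswith("{"):
            continue
        name = line.split("=", 1)[0].strip()
        found.append((number, name))
    return found


def find_option(conf_text, option):
    """Return the line numbers on which `option` is active."""
    return [number for number, name in active_options(conf_text) if name == option]


SAMPLE = """net {
sync-rate=5000
skip-sync
tl-size=256
}
"""

print(find_option(SAMPLE, "skip-sync"))  # [3]
```

An empty list means the option is absent or commented out.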

Bye, Martin

Re: More Meta-data testing...
On Tuesday 16 October 2001 03:01 am, Bene, Martin wrote:
> Hi Tony,
>
> > Has anyone else seen this problem?
>
> only when configuring drbd to skip sync on connect - which should only be
> done for debugging.
>
> > net {
> > sync-rate=5000
> > skip-sync
>
> ^^^^^^^^^
>
> > tl-size=256
> > timeout=60
> > connect-int=10
> > ping-int=10
> > }
>
> Try without this - btw, in the scenario you described, drbd should do a
> full sync not a quick sync. (primary turned off after modifications ==> no
> blockmap for quicksync available.
>
> Bye, Martin

Thank you Martin!

That was the problem, kind of obvious in hindsight.

The drbd.conf file in the tarball has this as well (it's where I got it). It
might make life easier in the future if it were commented out in the tarball.

Thanks again!



--
Tony Willoughby tonyw@example.com

"Some nights I can't tell which finger the thumb pick goes on"
-Warren Zevon

Re: More Meta-data testing...
* Tony Willoughby <tonyw@example.com> [011016 15:27]:
> On Tuesday 16 October 2001 03:01 am, Bene, Martin wrote:
> > Hi Tony,
> >
> > > Has anyone else seen this problem?
> >
> > only when configuring drbd to skip sync on connect - which should only be
> > done for debugging.
> >
> > > net {
> > > sync-rate=5000
> > > skip-sync
> >
> > ^^^^^^^^^
> >
> > > tl-size=256
> > > timeout=60
> > > connect-int=10
> > > ping-int=10
> > > }
> >
> > Try without this - btw, in the scenario you described, drbd should do a
> > full sync not a quick sync. (primary turned off after modifications ==> no
> > blockmap for quicksync available.
> >
> > Bye, Martin
>
> Thank you Martin!
>
> That was the problem, kind of obvious in hindsight.
>
> The drbd.conf file in the tarball has this as well (it's where I got it). It
> might make life easier in the future if it was commented out in the tar ball.
>
> Thanks again!

Definitely!

I changed it to ...

resource drbd0 {

protocol=B
fsckcmd=fsck -p -y
# inittimeout=60

disk {
do-panic
disk-size=4096543
}

net {
sync-rate=250
# skip-sync
tl-size=5000
timeout=60
connect-int=10
ping-int=10
}
...
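
For anyone who already has a config copied from the older sample, the
same change can be applied mechanically. This is a minimal Python
sketch, not a DRBD tool: the function name comment_out is hypothetical,
and it assumes one option per line with "#" as the comment marker, as
in the excerpt above.

```python
def comment_out(conf_text, option):
    """Return conf_text with every active `option` line prefixed by '# '.

    Hypothetical helper; lines that are already commented are left alone.
    """
    result = []
    for raw in conf_text.splitlines():
        stripped = raw.strip()
        name = stripped.split("=", 1)[0].strip()
        if name == option:
            # Preserve the original indentation in front of the comment marker.
            indent = raw[: len(raw) - len(raw.lstrip())]
            result.append(indent + "# " + stripped)
        else:
            result.append(raw)
    return "\n".join(result) + "\n"


OLD = "net {\nsync-rate=5000\nskip-sync\ntimeout=60\n}\n"
print(comment_out(OLD, "skip-sync"))
```

Running it a second time on its own output changes nothing, since a
line starting with "#" no longer matches the option name.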


-Philipp