Mailing List Archive

drbd 5.7 hang on secondary
I have a reliable hang on the secondary side using drbd. No echo, no file
access, no disk i/o, but console switching and sysrq work.

Setup:

Primary 2 processor SGI box, 1 scsi disk, 1 drbd partition
Secondary is the same.

The secondary hangs immediately when I try to make a filesystem on the
primary.

According to magic sysrq 't' the current task is drbdd_0, but if I keep
hitting it, sometimes it is drbd_asender_0.

Totally reproducible, so I can work on it when I have time, but if anyone
has any thoughts (like, silly boy, everyone knows drbd does not work on SMP
or such...) I would like to hear them.

I tried the latest from cvs too, and it is no different (not likely, given the diff).

Thanks

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"So many ways to skin a cat, and still everyone uses a great big knife."
Re: drbd 5.7 hang on secondary
On Wed, Sep 27, 2000 at 09:16:19PM -0700, David Gould wrote:
>
> I have a reliable hang on the secondary side using drbd. No echo, no file
> access, no disk i/o, but console switching and sysrq work.
>
> Setup:
>
> Primary 2 processor SGI box, 1 scsi disk, 1 drbd partition
> Secondary is the same.
>
> The secondary hangs immediately when I try to make a filesystem on the
> primary.
>
> According to magic sysrq 't' the current task is drbdd_0, but if I keep
> hitting it, sometimes it is drbd_asender_0.
>
> Totally reproducible, so I can work on it when I have time, but if anyone
> has any thoughts (like, silly boy, everyone knows drbd does not work on SMP
> or such...) I would like to hear them.

After some testing here is a bit more information:

If I test with "dd of=/dev/nb0 if=/dev/zero bs=4096 count=N", protocols A
and B seem to work even for large N. But protocol C hangs the secondary
every time with N=32. That is, just 32 4k writes will consistently hang it.
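
For anyone who wants to repeat this test, here is a minimal sketch of the
same dd run as a shell loop. The TARGET variable, the write_blocks helper
and the list of counts are made up for illustration. TARGET defaults to a
scratch file so nothing gets overwritten by accident; set it to the drbd
device (/dev/nb0 above) only on a test box. The script only drives the
writes on the primary; the secondary still has to be watched by hand.

```shell
#!/bin/sh
# Sketch only: write N blocks of 4096 zero bytes to TARGET for a few
# values of N. TARGET, write_blocks and the count list are hypothetical
# names and values chosen for this example.
TARGET="${TARGET:-$(mktemp)}"

write_blocks() {
    # $1 = number of 4096-byte blocks to write
    dd if=/dev/zero of="$TARGET" bs=4096 count="$1" 2>/dev/null
}

for n in 1 8 16 32 64; do
    if write_blocks "$n"; then
        echo "count=$n ok"
    else
        echo "count=$n failed"
        break
    fi
done
```

Run it as "TARGET=/dev/nb0 sh ddtest.sh" (the file name is arbitrary) once
for each protocol and note the first count at which the secondary stops
responding.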

When I get some time, I will try this with a UP kernel, but for now it
seems that protocol C does not work at all on SMP.

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"So many ways to skin a cat, and still everyone uses a great big knife."
Re: drbd 5.7 hang on secondary
On Thu, 28 Sep 2000 David Gould wrote:

David,
it's quite likely that there are still a lot of problems with SMP
systems.
But as I have currently no SMP systems for development and there are
also a number of problems on UP systems I am currently hunting...

PS: I have started to write documentation about the source, but
it's far from being finished. Do you think there is immediate
need for more docs ?

-Philipp

--
Want to try something new? Are you a Linux hacker?
Volunteer in testing mergemem!
(Get it from http://das.ist.org/mergemem)
-----
Philipp Reisner PGP: http://der.ist.org/~kde/pgp.asc
Re: drbd 5.7 hang on secondary
On Sat, Sep 30, 2000 at 09:38:34AM +0200, Philipp Reisner wrote:
> On Thu, 28 Sep 2000 David Gould wrote:
>
> David,
> it's quite likely that there are still a lot of problems with SMP
> systems.
> But as I have currently no SMP systems for development and there are
> also a number of problems on UP systems I am currently hunting...
>
> PS: I have started to write documentation about the source, but
> it's far from being finished. Do you think there is immediate
> need for more docs ?

Yes. Only, not as separate docs, but more as block comments in the source,
especially explaining "what we are doing, and why we need it" rather than
detailed "how". The "how" is derivable by reading the code; the "why" is not.

For example, the epoch barrier thing is quite simply explained in your
new paper, but the fact that it is an optimization, and what rules it
enforces, is not at all obvious from reading the code.

There is just enough going on in drbd to make it tricky for someone to
pick up just reading the code. Which I expect is why you get more help
with the admin tools than the module.

Also it would help if there was a "known issues" in the distribution or
on the website. I would have just used UP kernels if I had known that
SMP might not work... At least for this case. Likewise, if there was
such information, I might have tried to debug it on SMP before I needed
it.

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"So many ways to skin a cat, and still everyone uses a great big knife."
Re: drbd 5.7 hang on secondary
On Sat, 30 Sep 2000, David Gould wrote:

> On Sat, Sep 30, 2000 at 09:38:34AM +0200, Philipp Reisner wrote:
> > On Thu, 28 Sep 2000 David Gould wrote:
> >
> > David,
> > it's quite likely that there are still a lot of problems with SMP
> > systems.
> > But as I have currently no SMP systems for development and there are
> > also a number of problems on UP systems I am currently hunting...
> >
> > PS: I have started to write documentation about the source, but
> > it's far from being finished. Do you think there is immediate
> > need for more docs ?
>
> Yes. Only, not as separate docs, but more as block comments in the source,
> especially explaining "what we are doing, and why we need it" rather than
> detailed "how". The "how" is derivable by reading the code; the "why" is not.
>
> For example, the epoch barrier thing is quite simply explained in your
> new paper, but the fact that it is an optimization, and what rules it
> enforces, is not at all obvious from reading the code.

Well, you won't find a description of file descriptors, file (objects),
inodes and their relationship in Linux's source. You will find this
in a Unix textbook. -- You can find a description of DRBD's algorithms
in the paper(s) about DRBD. :)

> There is just enough going on in drbd to make it tricky for someone to
> pick up just reading the code. Which I expect is why you get more help
> with the admin tools than the module.

I am writing this "source documentation", which will map the
concepts of the DRBD paper to function names, variables and data
structures.

> Also it would help if there was a "known issues" in the distribution or
> on the website. I would have just used UP kernels if I had known that
> SMP might not work... At least for this case. Likewise, if there was
> such information, I might have tried to debug it on SMP before I needed
> it.

Yes, that's right. We do not have this "known issues" file yet, we
only have this TODO file in CVS.

-Philipp
Re: drbd 5.7 hang on secondary
Hi Philipp!

How're you doing, dear friend?

On Tue, 3 Oct 2000, Philipp Reisner wrote:
...
> I am writing this "source documentation", which will map the
> concepts of the DRBD paper to function names, variables and data
> structures.

Could you please send me what you've already written about it?
I'm having lots of hangs and I'm starting to look at the code :)
...
If it helps, I'm using dbench to stress the cluster. It hangs with
'dbench -2'. It's funny that when I set '-r 20000' it hangs faster
than with '-r 10000' (default).

Thanks for your help!

Luis

[ Luis Claudio R. Goncalves lclaudio@example.com ]
[. MSc coming soon -- Conectiva HA Team -- Gospel User -- Linuxer -- :) ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]
Re: drbd 5.7 hang on secondary
On Tue, Oct 03, 2000 at 09:42:55AM +0200, Philipp Reisner wrote:
> On Sat, 30 Sep 2000, David Gould wrote:
>
> > On Sat, Sep 30, 2000 at 09:38:34AM +0200, Philipp Reisner wrote:
> > > On Thu, 28 Sep 2000 David Gould wrote:
> > >
> Well, you won't find a description of file descriptors, file (objects),
> inodes and their relationship in Linux's source. You will find this
> in a Unix textbook. -- You can find a description of DRBD's algorithms
> in the paper(s) about DRBD. :)

Yes, very nice. My unix books all talk a lot about a function namei(),
which has not appeared at all in Linux since 2.0. I guess I need to patch
my books to know about dcache...

> > There is just enough going on in drbd to make it tricky for someone to
> > pick up just reading the code. Which I expect is why you get more help
> > with the admin tools than the module.
>
> I am writing this "source documentation", which will map the
> concepts of the DRBD paper to function names, variables and data
> structures.

Excellent. I am just urging that it go into the source itself. At least
there it is possible it may get maintained. And even if it doesn't, a
comment that disagrees with the code at least alerts you that _something_
changed.

> Yes, that's right. We do not have this "known issues" file yet, we
> only have this TODO file in CVS.

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"So many ways to skin a cat, and still everyone uses a great big knife."
Re: drbd 5.7 hang on secondary
Hi!

On Tue, 3 Oct 2000, David Gould wrote:
...
> Excellent. I am just urging that it go into the source itself. At least
> there it is possible it may get maintained. And even if it doesn't, a
> comment that disagrees with the code at least alerts you that _something_
> changed.

Philipp, if you can put a straightforward description of the structs you
used in drbd.[hc], I can spread some comments through the source code.
I'm currently hunting a bug (a solution, to be honest) that I emailed
you about this morning, and I'd be glad to comment the code (a bit).

Hugs!

Luis
[ Luis Claudio R. Goncalves lclaudio@example.com ]
[. MSc coming soon -- Conectiva HA Team -- Gospel User -- Linuxer -- :) ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]