Mailing List Archive

Re: Slakware strike back ..
On Sun, 04 Jun 2000 you wrote:
>Hi,
>
>Sorry to disturb you again with my Slackware ;) but I desperately cannot
>reach "step 3" of the drbd script without failure:
>
>www:/etc/rc.d/init.d# ./drbd start
>Loading DRBD module...ok
>Configuring DRBD resource drbd0...ok
>Waiting for DRBD resource drbd0 to resynchronize...failure!
>

You are right, this message is a bit misleading.
The fact is that if the other side is not in primary state,
resynchronization "fails" because there is nothing to
resynchronize :)

A better message would be "not-needed" or "skipped", but this script
emerged from a RedHat environment, and RH's action function can
only print "OK" or "failure" :)

>And there is no delay for synchronizing; the failure appears immediately.
>The network seems OK; I'm using 192.168.150.3 and 4 as the IPs of the 2
>servers.

Well, I was assuming that there was nothing to resynchronize. To test it,
you could try:

1) Bring up both nodes.
2) Simulate a failure on the master node.
3) Modify the data on the slave node (which should offer the service by now)
4) Restart the master node. (now resynchronization should work)

Without heartbeat:

1) Connect the device-pair.
2) Unload the module on one machine (this simulates the failure of the master node :)
3) Modify the data on the other (datadisk start && modify fs)
4) Run the "drbd" script again on the machine without the module.
(Now you should see a working resynchronization)

>So my question is, as always: does the problem appear because I'm on
>Slackware, or because my config wasn't OK?

I do not think that it is Slackware-specific.


>I'm working on modifying your program to make it "Debian and Slack-friendly"
>;) I'll tell you when that is done.
>
That would be great.

PS: Are you from the German-speaking part of Switzerland?

-Philipp
Re: Re: Slakware strike back ..
> >www:/etc/rc.d/init.d# ./drbd start
> >Loading DRBD module...ok
> >Configuring DRBD resource drbd0...ok
> >Waiting for DRBD resource drbd0 to resynchronize...failure!
>
> A better message would be "not-needed" or "skipped", but this
> emerged from a RedHat environment and RH's action function can
> only print "OK" or "failure" :)

The best would be to use "passed" like for fsck ...

Thomas
Re: Re: Slakware strike back ..
Hi,

) > A better message would be "not-needed" or "skipped", but this
) > emerged from a RedHat environment and RH's action function can
) > only print "OK" or "failure" :)
)
) The best would be to use "passed" like for fsck ...

Really? Hmmm... Passed is for something that had errors but was
recovered. That is actually not the case with drbd. If it can't connect,
resync fails, so I think the script should say so.

Ideas welcome... :)

( Fábio Olivé Leite -* ConectivaLinux *- olive@example.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
AW: Re: Slakware strike back ..
Hi,

> ) > A better message would be "not-needed" or "skipped", but this
> ) > emerged from a RedHat environment and RH's action function can
> ) > only print "OK" or "failure" :)
> )
> ) The best would be to use "passed" like for fsck ...
>
> Really? Hmmm... Passed is for something that had errors but was
> recovered. That is actually not the case with drbd. If it can't connect,
> resync fails, so I think the script should say so.
>
> Ideas welcome... :)

I'd say that "failed" is only appropriate if drbd couldn't connect to the
remote side.

Connected to other slave, nothing to do,
Connected to master, nothing to do,
Connected to master, sync finished

could all be reported as OK in my opinion; I only expect to see a failure
notification if something is actually wrong.
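
A minimal sketch of that reporting rule, written in Python purely for
illustration (the real init script is shell). The outcome names and the
report() function are hypothetical and not part of drbd; the only rule taken
from the text above is "failure means we could not connect":

```python
from enum import Enum


class SyncOutcome(Enum):
    """Hypothetical outcomes of the 'wait for resync' step."""
    CONNECT_FAILED = "could not connect to the remote side"
    SLAVE_NOTHING_TO_DO = "connected to other slave, nothing to do"
    MASTER_NOTHING_TO_DO = "connected to master, nothing to do"
    MASTER_SYNC_FINISHED = "connected to master, sync finished"


def report(outcome):
    """Return (label, exit_code) the way an init script might print it.

    Only a failed connection counts as a failure; every connected
    outcome is reported as OK, whether or not a resync actually ran.
    """
    if outcome is SyncOutcome.CONNECT_FAILED:
        return ("failure", 1)
    return ("OK", 0)


if __name__ == "__main__":
    for outcome in SyncOutcome:
        label, code = report(outcome)
        print(f"{outcome.value}: {label} (exit {code})")
```

With this rule the case from the original report (peer reachable, nothing to
resynchronize) would print "OK" instead of "failure!".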

The next thing to seriously think about would be at least an event counter,
or perhaps an event counter plus an on-disk copy of the replication bitmap.
If you're running with the disk size specified (so you can get the device up
and running even without the 2nd computer available), it would still be nice
to be able to detect when the devices get out of sync in a way that requires
full replication.

I'm thinking of stuff like:
* secondary goes down,
* primary continues, data is changed
* primary gets restarted
* secondary comes online.

For the simpler case, it'd be nice for the secondary to be able to tell that
the primary had been restarted and that the primary's resync bitmap is
therefore NOT sufficient to get the drives fully synchronized again.

More ambitious: if the primary saved the resync bitmap on shutdown and
restored it on startup, we'd have a reliable map of sectors to update on the
secondary even after prolonged disconnection.
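
A rough sketch of the save/restore idea, in Python for illustration only. It
is not how drbd stores anything: the JSON file format, the function names and
the block numbers are all assumptions. The point it tries to show is that a
missing map must mean "full resync", never "nothing to do":

```python
import json
import os


def save_dirty_blocks(path, dirty_blocks):
    """Write the set of out-of-sync block numbers to disk on shutdown.

    The data goes to a temporary file first and is renamed into place,
    so a crash during the save never leaves a half-written map behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(dirty_blocks), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_dirty_blocks(path):
    """Return the saved set on startup, or None if no map was saved.

    None means the primary went down without saving (e.g. a crash),
    so the caller has to fall back to a full resync. The file is
    removed once read, so a stale map cannot be reused after a later
    crash.
    """
    try:
        with open(path) as f:
            blocks = set(json.load(f))
    except FileNotFoundError:
        return None
    os.remove(path)
    return blocks
```

A clean shutdown followed by a startup returns the saved set; any startup
without a saved map returns None and would force full replication.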

Hmm, probably we'd want counters for both sides of the device, so the
primary would store something like "I'm at 99 and when I last talked to the
secondary it was at 96". That should also at least allow detection of the
messy case where data was changed on BOTH sides because of complete system
separation. I'm not sure what the correct recovery procedure for such a case
would be, but I sure want to KNOW about it. Probably leave the device
disconnected and force admin intervention plus a manual resync.
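
To make the two-counter idea concrete, here is a small sketch in Python. It
is only an illustration under stated assumptions: each node bumps its own
counter whenever it changes data or restarts without its peer, and each node
remembers the peer's counter from their last contact. The Counters class and
classify() are hypothetical names, not drbd code:

```python
from dataclasses import dataclass


@dataclass
class Counters:
    own: int        # this node's event counter
    peer_seen: int  # the peer's counter when the two last talked


def classify(a, b):
    """Decide what has to happen when nodes a and b reconnect.

    A node has changed since the last contact if its own counter is
    higher than the value its peer remembers for it.
    """
    a_changed = a.own > b.peer_seen
    b_changed = b.own > a.peer_seen
    if a_changed and b_changed:
        # Both sides were modified while separated: do not guess.
        return "diverged"
    if a_changed:
        return "a_to_b"
    if b_changed:
        return "b_to_a"
    return "in_sync"


if __name__ == "__main__":
    # "I'm at 99 and when I last talked to the secondary it was at 96"
    primary = Counters(own=99, peer_seen=96)
    secondary = Counters(own=96, peer_seen=97)
    print(classify(primary, secondary))  # a_to_b
```

In the "diverged" case the sketch only reports the condition; leaving the
device disconnected and waiting for the admin is up to the caller.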

Bye, Martin

"you have moved your mouse, please reboot to make this change take effect"
--------------------------------------------------
Martin Bene vox: +43-316-813824
simon media fax: +43-316-813824-6
Andreas-Hofer-Platz 9 e-mail: mb@example.com
8010 Graz, Austria
--------------------------------------------------
finger mb@example.com for PGP public key