Mailing List Archive

Why SyncAll ?
I now have my NFS cluster almost working (Red Hat 7.1 base,
2.4.10-ac11 patched for ext3, drbd 0.6.1-pre6). I am testing right now by
leaving the secondary node always up and simply rebooting the primary
node. The failover seems to go fine: drbd switches to primary on node2 and
the services start without a hitch. When the primary node comes back up,
it seems to want to SyncAll most of the time - I have seen it do the quick
sync sometimes, but usually SyncAll... This causes the datadisk scripts to
fail to mount nb0 and nb1, which in turn causes most of the NFS
services to fail, since they try to export /home (nb0) and the lock dirs
are symlinked to nb1.
It seems to me that drbd should either do the quick sync always
(assuming both nodes didn't go down at the same time) or else datadisk
should at least wait and retry mounting the partitions until the SyncAll
completes (an hour later).

Any suggestions?
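The wait-and-retry idea above could be sketched as a small wrapper around the mount step. This is only an illustration: the `retry` helper, the attempt limits, and the device names are my own, not part of drbd's datadisk scripts.

```shell
#!/bin/sh
# Sketch of a generic retry helper: keep re-running a command (e.g. the
# datadisk mount) until it succeeds or the attempt limit is hit.
# The helper name, limits, and device names are illustrative only.
retry() {
    tries="$1"; delay="$2"; shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@"; then
            return 0          # command succeeded
        fi
        i=$((i + 1))
        sleep "$delay"        # wait before the next attempt
    done
    return 1                  # exhausted all attempts
}

# e.g. retry the mount every 5 seconds for up to an hour:
# retry 720 5 mount /dev/nb0 /home
```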
Re: Why SyncAll ? [ In reply to ]
* Ryan Rubley <rubleyr@example.com> [011026 23:34]:
>
> I now have my NFS cluster almost working (Red Hat 7.1 base,
> 2.4.10-ac11 patched for ext3, drbd 0.6.1-pre6). I am testing right now by
> leaving the secondary node always up and simply rebooting the primary
> node. The failover seems to go fine: drbd switches to primary on node2 and
> the services start without a hitch. When the primary node comes back up,
> it seems to want to SyncAll most of the time - I have seen it do the quick
> sync sometimes, but usually SyncAll... This causes the datadisk scripts to
> fail to mount nb0 and nb1, which in turn causes most of the NFS
> services to fail, since they try to export /home (nb0) and the lock dirs
> are symlinked to nb1.
> It seems to me that drbd should either do the quick sync always
> (assuming both nodes didn't go down at the same time) or else datadisk
> should at least wait and retry mounting the partitions until the SyncAll
> completes (an hour later).
>
> Any suggestions?
>

Heartbeat should be started _after_ drbd in the boot process. Since
the /etc/init.d/drbd script does not terminate before synchronisation
is done, heartbeat will only start once the sync has finished.

PS: If the primary crashes, a SyncAll is necessary.
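On SysV-style boot, the ordering is determined by the S-numbers of the runlevel symlinks, so this can be sanity-checked by sorting the link names. The paths and numbers below are typical Red Hat values, shown as an assumption rather than taken from any particular install:

```shell
#!/bin/sh
# Sketch: verify that drbd's start link sorts before heartbeat's, since
# SysV init runs Sxx scripts in lexical order. The link names below are
# example values; on a live system they would come from /etc/rc3.d.
check_order() {
    links="$1"                 # newline-separated Sxx link names
    first=$(printf '%s\n' "$links" | sort | head -n 1)
    case "$first" in
        S*drbd) echo "ok: drbd starts before heartbeat" ;;
        *)      echo "BAD: heartbeat would start first" ;;
    esac
}

# On a live system: check_order "$(ls /etc/rc3.d | grep -E 'drbd|heartbeat')"
check_order "S70drbd
S99heartbeat"
```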

-Philipp
Re: Why SyncAll ? [ In reply to ]
On Sat, 27 Oct 2001, Philipp Reisner wrote:

> * Ryan Rubley <rubleyr@example.com> [011026 23:34]:
> >
> > I now have my NFS cluster almost working (Red Hat 7.1 base,
> > 2.4.10-ac11 patched for ext3, drbd 0.6.1-pre6). I am testing right now by
> > leaving the secondary node always up and simply rebooting the primary
> > node. The failover seems to go fine: drbd switches to primary on node2 and
> > the services start without a hitch. When the primary node comes back up,
> > it seems to want to SyncAll most of the time - I have seen it do the quick
> > sync sometimes, but usually SyncAll... This causes the datadisk scripts to
> > fail to mount nb0 and nb1, which in turn causes most of the NFS
> > services to fail, since they try to export /home (nb0) and the lock dirs
> > are symlinked to nb1.
> > It seems to me that drbd should either do the quick sync always
> > (assuming both nodes didn't go down at the same time) or else datadisk
> > should at least wait and retry mounting the partitions until the SyncAll
> > completes (an hour later).
> >
> > Any suggestions?
> >
>
> Heartbeat should be started _after_ drbd in the boot process. Since
> the /etc/init.d/drbd script does not terminate before synchronisation
> is done, heartbeat will only start once the sync has finished.
>
> PS: If the primary crashes, a SyncAll is necessary.
>
> -Philipp
>
>

heartbeat is the last thing to start (S99) - is it the init-timeout=10
line that is causing drbd not to block and to go to the background, maybe?

Why can't the secondary do a quick sync to the primary of only the blocks
that were modified while the secondary had control? At the very least,
some sort of CRC algorithm could be used to speed up syncing, couldn't it?
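The checksum idea could look something like the following: hash the device in fixed-size blocks on both nodes, compare the lists, and resend only the blocks that differ. This is purely a sketch of the concept - the block size, function name, and the od/md5sum pipeline are my own illustration, and drbd 0.6 does not work this way.

```shell
#!/bin/sh
# Sketch: emit "<block-number> <checksum>" for each fixed-size block of a
# device/file, so two nodes can diff the lists and resend only changed
# blocks. Names and the 4 KiB block size are illustrative assumptions.
block_sums() {
    dev="$1"; bs=4096; n=0
    while :; do
        # hex-dump one block (avoids NUL bytes in shell variables)
        chunk=$(dd if="$dev" bs="$bs" skip="$n" count=1 2>/dev/null | od -An -tx1)
        [ -z "$chunk" ] && break            # past end of device
        printf '%s ' "$n"
        printf '%s' "$chunk" | md5sum | cut -d' ' -f1
        n=$((n + 1))
    done
}

# Hypothetical comparison of the two sides (bash process substitution):
#   diff <(block_sums /dev/nb0) <(ssh node2 'block_sums /dev/nb0')
```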