Mailing List Archive: Both nodes crashed scenario...

Nov 14, 2000, 1:53 PM

Post #1 of 14 (3883 views)

Hi,

I am sorry that I am so inactive on the mailing list currently. I really hope
that I will be able to spend more of my time on drbd development soon.

As was discussed recently, it is a serious problem if both nodes are
down at the same time.
In order to solve this problem I want to introduce a "cluster state number".
This number (CSN for short) will increase with every "agreed"
state change in the cluster.

The CSN must be stored in non-volatile memory. I have not yet decided where
to put it. The possibilities are:

*) a file (/var/drbd...)
+easy to do
-not coupled with the ll_dev of drbd

*) last block on the ll_dev
-Then it would no longer be possible to convert a partition with
a filesystem into an ll_dev of a drbd device. (At least not in all
cases)
+physically coupled to drbd
+can also be used to detect replacement disks.

*) In a metadata location provided by LVM.
-Not yet implemented by LVM.
-Not available on systems not using LVM.
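
As a rough illustration of the "last block on the ll_dev" option, the sketch below stores a CSN in the final 1024 bytes of a device or image file. The magic string, the field layout and the function names are assumptions made for this example; nothing in the thread defines an on-disk format.

```python
import os
import struct

META_SIZE = 1024                 # size of the reserved trailing area (assumed)
MAGIC = b"CSN0"                  # hypothetical marker, not a real drbd magic
LAYOUT = struct.Struct("<4sQ")   # magic + 64-bit little-endian counter (assumed)


def write_csn(path, csn):
    """Store csn in the last META_SIZE bytes of the file or device at path."""
    with open(path, "r+b") as dev:
        size = dev.seek(0, os.SEEK_END)
        if size < META_SIZE:
            raise ValueError("device too small for the metadata area")
        dev.seek(size - META_SIZE)
        dev.write(LAYOUT.pack(MAGIC, csn).ljust(META_SIZE, b"\0"))
        dev.flush()
        os.fsync(dev.fileno())   # the CSN has to survive a crash


def read_csn(path):
    """Return the stored CSN, or None if no valid metadata is found."""
    with open(path, "rb") as dev:
        size = dev.seek(0, os.SEEK_END)
        if size < META_SIZE:
            return None
        dev.seek(size - META_SIZE)
        magic, csn = LAYOUT.unpack(dev.read(LAYOUT.size))
    return csn if magic == MAGIC else None
```

With a layout like this, a missing marker on read is one way a replaced disk could show up, since a fresh disk would not carry it.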

On node startup a node looks for its partner node; if it is not
available, it continues to wait for it for a configurable time.
(Administrators of databases will use a setting of -1, which
means: wait forever.)
During this time it will offer the administrator the possibility
to make this node the new primary/master node.
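
The waiting behaviour can be sketched roughly as follows. The function name, the `peer_is_up` callback and the polling interval are hypothetical; only the meaning of -1 (wait forever) comes from the description above, and the administrator override is left out.

```python
import time


def wait_for_peer(peer_is_up, timeout, poll_interval=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Wait until peer_is_up() returns True.

    timeout is in seconds; -1 means wait forever.
    Returns True if the peer appeared, False if the timeout expired.
    """
    deadline = None if timeout == -1 else clock() + timeout
    while True:
        if peer_is_up():
            return True
        if deadline is not None and clock() >= deadline:
            return False
        sleep(poll_interval)
```

Passing `clock` and `sleep` as parameters is only a convenience so the loop can be exercised without real delays.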

If the node can finally communicate with the other node, they will
be able to decide which node has the newer data and should therefore
become master, _AND_ they will decide whether a QuickSync is possible
and sufficient.
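
A minimal sketch of the comparison step, assuming that a higher CSN simply means newer data. The function name is hypothetical, and the thread does not say how a tie is handled or how the QuickSync decision is made, so both are left open here.

```python
def newer_node(csn_a, csn_b):
    """Return "a" or "b" for the node with the higher CSN, or None on a tie."""
    if csn_a > csn_b:
        return "a"
    if csn_b > csn_a:
        return "b"
    return None
```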
-----
Unfortunately it's not possible to implement the CSN logic only in
drbdsetup. ...
... it will need changes in the module, drbdsetup and the scripts.

-----
Here is my wishlist for the authors of cluster manager software:

You need to implement a resource type that may be unavailable
at node startup, but may become available later.

-Philipp

Re: Both nodes crashed scenario... [ In reply to ]

thomas.mangin at example

Nov 15, 2000, 2:41 AM

Post #2 of 14 (3845 views)

> I am sorry that I am so inactive on the mailing list currently. I really
> hope that I will be able to spend more of my time on drbd development soon.
>
> As was discussed recently, it is a serious problem if both nodes are
> down at the same time.
> In order to solve this problem I want to introduce a "cluster state
> number".
> This number (CSN for short) will increase with every "agreed"
> state change in the cluster.
>
> The CSN must be stored in non-volatile memory. I have not yet decided
> where to put it. The possibilities are:

Very good idea ..

> *) a file (/var/drbd...)
> +easy to do
> -not coupled with the ll_dev of drbd

Why not put this information at the root of the mounted file system
as a hidden file?

> +easy to do
> +can also be used to detect replacement disks.
> -not coupled with the ll_dev of drbd

> *) last block on the ll_dev
> -Then it would no longer be possible to convert a partition with
> a filesystem into an ll_dev of a drbd device. (At least not in all
> cases)
> +physically coupled to drbd
> +can also be used to detect replacement disks.

I do think this is a major tradeoff .. I like to be able to mount my /dev/sd
device if something goes wrong (I had to do it last week, and still must
revert to drbd)

> On node startup a node looks for its partner node; if it is not
> available, it continues to wait for it for a configurable time.
> (Administrators of databases will use a setting of -1, which
> means: wait forever.)
> During this time it will offer the administrator the possibility
> to make this node the new primary/master node.

We can already do that at the script level, or am I missing something?

> If the node can finally communicate with the other node, they will
> be able to decide which node has the newer data and should therefore
> become master, _AND_ they will decide whether a QuickSync is possible
> and sufficient.

Looks good ..

> Unfortunately it's not possible to implement the CSN logic only in
> drbdsetup. ...
> ... it will need changes in the module, drbdsetup and the scripts.

For the script I see no problems ;*)

Thomas

Re: Both nodes crashed scenario... [ In reply to ]

lmb at example

Nov 15, 2000, 2:42 AM

Post #3 of 14 (3852 views)

On 2000-11-15T09:41:46,
Thomas Mangin <thomas.mangin@example.com> said:

> Why not put this information at the root of the mounted file system
> as a hidden file?

Because drbd doesn't have any knowledge about what is on the blockdevice.

Sincerely,
Lars Marowsky-Brée <lmb@example.com>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl

Re: Both nodes crashed scenario... [ In reply to ]

thomas.mangin at example

Nov 15, 2000, 3:02 AM

Post #4 of 14 (3852 views)

> On 2000-11-15T09:41:46,
> Thomas Mangin <thomas.mangin@example.com> said:
>
> > Why not put this information at the root of the mounted file
> > system as a hidden file?
>
> Because drbd doesn't have any knowledge about what is on the blockdevice.

no more than what is in /var/run ..
Can you please explain what I am missing?
I understand that drbd has no knowledge of the FS ..

Thomas

AW: Both nodes crashed scenario... [ In reply to ]

mb at example

Nov 15, 2000, 3:46 AM

Post #5 of 14 (3856 views)

Hi Thomas,

> Why not put this information at the root of the mounted file system
> as a hidden file?

a) the drbd device need not actually contain a filesystem - think about raw
database devices.
b) you need to access this info at a time when the drbd device is not yet
mounted or even active.

> I do think this is a major tradeoff .. I like to be able to mount
> my /dev/sd device if something goes wrong (I had to do it last week,
> and still must revert to drbd)

That problem is not as bad as you think: raid does it this way (reserving
the last 4k of data on each physical partition used). You can still mount a
raw partition used in raid1 (mirroring). The filesystem on the partition
just doesn't use the whole partition but can be accessed without any
restrictions. The one problem you have is converting an existing filesystem
to drbd - you'd have to shrink the existing fs to make room for the drbd
metadata.

Bye, Martin

"you have moved your mouse, please reboot to make this change take effect"
--------------------------------------------------
Martin Bene vox: +43-316-813824
simon media fax: +43-316-813824-6
Nikolaiplatz 4 e-mail: mb@example.com
8020 Graz, Austria
--------------------------------------------------
finger mb@example.com for PGP public key

Re: Both nodes crashed scenario... [ In reply to ]

thomas.mangin at example

Nov 16, 2000, 2:12 AM

Post #6 of 14 (3855 views)

Hi,

Thank you both for your explanation.
It clarified the point.

Thomas

----- Original Message -----
From: Philipp Reisner <philipp@example.com>
To: DRBD-List <drbd-devel@example.com>
Sent: Wednesday, November 15, 2000 9:34 PM
Subject: Re: [DRBD-dev] Both nodes crashed scenario...

> * Martin Bene <mb@example.com> [001115 11:46]:
> > Hi Thomas,
> >
> > > Why not put this information at the root of the mounted file
> > > system as a hidden file?
> >
> > a) the drbd device need not actually contain a filesystem - think about
> > raw database devices.
> > b) you need to access this info at a time when the drbd device is not yet
> > mounted or even active.
> >
> > > I do think this is a major tradeoff .. I like to be able to mount
> > > my /dev/sd device if something goes wrong (I had to do it last week,
> > > and still must revert to drbd)
> >
> > That problem is not as bad as you think: raid does it this way
> > (reserving the last 4k of data on each physical partition used). You can
> > still mount a raw partition used in raid1 (mirroring). The filesystem on
> > the partition just doesn't use the whole partition but can be accessed
> > without any restrictions. The one problem you have is converting an
> > existing filesystem to drbd - you'd have to shrink the existing fs to
> > make room for the drbd metadata.
> >
>
> Yes, Martin explained the situation very precisely. Actually,
> the sizes of block devices are given in 1024-byte units in Linux.
> A lot of filesystems are using a block size of 4k these days
> (ReiserFS always uses 4k, and I think most distributions use 4k
> as the default for mkfs.ext2).
>
> -- When the FS uses 4k blocks we have a chance of 75% that the
> last 1k block is unused !!
> -- The only situation you have to worry about is if you want to
> use an already existing FS for the first time as a DRBD
> device.
> -- Today we have working resize utilities for ext2 and reiserfs.
>
> Storing the metadata in the last 1024 bytes of the ll_dev seems
> to be a nice solution.
>
> -Philipp
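
The 75% figure can be checked with a little arithmetic. This sketch assumes that device sizes, counted in 1 KiB units, are equally likely to leave any remainder when divided by the 4 KiB filesystem block size; that uniformity is an assumption of the example, not something stated in the thread.

```python
from fractions import Fraction

FS_BLOCK_KIB = 4  # filesystem block size in KiB, as in the post

# A device of size_kib KiB leaves size_kib % 4 KiB unused by the filesystem.
remainders = range(FS_BLOCK_KIB)
spare = [r for r in remainders if r >= 1]   # at least the last 1 KiB is free
chance = Fraction(len(spare), len(remainders))
print(chance)  # 3/4
```

Only a device whose size is an exact multiple of 4 KiB has its last 1 KiB block in use by the filesystem.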

Re: Both nodes crashed scenario... [ In reply to ]

dg at example

Nov 17, 2000, 7:14 PM

Post #7 of 14 (3849 views)

On Wed, Nov 15, 2000 at 10:02:17AM -0000, Thomas Mangin wrote:
> > On 2000-11-15T09:41:46,
> > Thomas Mangin <thomas.mangin@example.com> said:
> >
> > > Why not put this information at the root of the mounted file
> > > system as a hidden file?
> >
> > Because drbd doesn't have any knowledge about what is on the blockdevice.
>
> no more than what is in /var/run ..
> Can you explain me what I am missing please ?
> I understand that drbd have no knowledge of the FS ..

Do you really mean as a hidden file in /var/run, and not on the drbd
partition itself? This is not too bad, except that hidden files are the
work of Satan.

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I personally think Unix is "superior" because on LSD it tastes
like Blue." -- jbarnett

Re: Both nodes crashed scenario... [ In reply to ]

dg at example

Nov 17, 2000, 8:10 PM

Post #8 of 14 (3848 views)

On Wed, Nov 15, 2000 at 10:34:03PM +0100, Philipp Reisner wrote:
> * Martin Bene <mb@example.com> [001115 11:46]:
> > Hi Thomas,
> >
> > > Why not put this information at the root of the mounted file system
> > > as a hidden file?

> > That problem is not as bad as you think: raid does it this way, (reserving
> > the last 4k of data on each physical partition used). You can still mount a
> > raw partition used in raid1 (mirroring). The filesystem on the partition
> > just doesn't use the whole partition but can be accessed without any
> > restrictions. The one problem you have is converting an existing filesystem
> > to drbd - you'd have to shrink the existing fs to make room for the drbd
> > metadata.

> -- The only situation you have to worry about is if you want to
> use an already existing FS for the first time as a DRBD
> device.

LVM?

> -- Today we have working resize utilities for ext2 and reiserfs.
>
> Storing the metadata in the last 1024 bytes of the ll_dev seems
> to be a nice solution.

I wonder what problem we are trying to solve?

1. given a bootable system that was set up with drbd, be able to configure
and start the drbd devices?
or
2. given an unidentified hard disk, scan it for drbd volumes and configure
them based on the stored metadata?

I think we only need option 1, and that can easily be done with files
in /var and /etc. There is _no_need_ to fool around with the on disk
format of the volume.

If we need option 2, then we will need on-volume metadata. But there are
right now several subsystems that need a bit of metadata on a volume,
think of LVM, GFS, drbd, raid, etc.

My vote is to keep it simple and solve problems we have, not problems
we don't.

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I personally think Unix is "superior" because on LSD it tastes
like Blue." -- jbarnett

AW: Both nodes crashed scenario... [ In reply to ]

mb at example

Nov 18, 2000, 5:17 AM

Post #9 of 14 (3848 views)

Hi David,

> > Storing the metadata in the last 1024 bytes of the ll_dev seems
> > to be a nice solution.
>
> I wonder what problem we are trying to solve?
>
> 1. given a bootable system that was set up with drbd, be able to
> configure and start the drbd devices?
> or
> 2. given an unidentified hard disk, scan it for drbd volumes and
> configure them based on the stored metadata?
>
> I think we only need option 1, and that can easily be done with files
> in /var and /etc. There is _no_need_ to fool around with the on disk
> format of the volume.

Agreed, we only need option 1; however, it seems that having on-device
metadata would be a good idea for that as well; otherwise there's no chance
of detecting a replaced disk, for example. A robust system should avoid just
doing a quicksync to a completely new disk/partition and claiming to be
in sync afterwards..

Bye, Martin

"you have moved your mouse, please reboot to make this change take effect"
--------------------------------------------------
Martin Bene vox: +43-316-813824
simon media fax: +43-316-813824-6
Nikolaiplatz 4 e-mail: mb@example.com
8020 Graz, Austria
--------------------------------------------------
finger mb@example.com for PGP public key

Re: Both nodes crashed scenario... [ In reply to ]

philipp at example

Nov 18, 2000, 1:25 PM

Post #10 of 14 (3840 views)

* Martin Bene <mb@example.com> [001118 13:17]:
> Hi David,
>
> > > Storing the metadata in the last 1024 bytes of the ll_dev seems
> > > to be a nice solution.
> >
> > I wonder what problem we are trying to solve?
> >
> > 1. given a bootable system that was set up with drbd, be able to
> > configure and start the drbd devices?
> > or
> > 2. given an unidentified hard disk, scan it for drbd volumes and
> > configure them based on the stored metadata?
> >
> > I think we only need option 1, and that can easily be done with files
> > in /var and /etc. There is _no_need_ to fool around with the on disk
> > format of the volume.
>
> Agreed, we only need option 1; however, it seems that having on-device
> metadata would be a good idea for that as well; otherwise there's no chance
> of detecting a replaced disk, for example. A robust system should avoid just
> doing a quicksync to a completely new disk/partition and claiming to be
> in sync afterwards..
>

Hmmm, while reading this the following idea just came to my mind:

We could use the /var/... approach and store checksums of some
blocks in the file as well. If the disk is replaced by a new
one, the checksums will no longer match.
We could have, let's say, 10 blocks. If one of the blocks is
changed by the application (or FS), we have to update the /var/..
file to contain the new checksum.

On system startup 9 checksums have to match the corresponding
blocks.
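
A rough sketch of the startup check described above. CRC32, the 1024-byte block size and all function names are choices made for this example; the thread does not name a checksum algorithm or a file format.

```python
import zlib

BLOCK_SIZE = 1024      # assumed block size
REQUIRED_MATCHES = 9   # "9 checksums have to match" out of 10


def block_checksum(device, block_no):
    """CRC32 of one block of device (a bytes-like object standing in for the disk)."""
    start = block_no * BLOCK_SIZE
    return zlib.crc32(device[start:start + BLOCK_SIZE])


def record_checksums(device, block_numbers):
    """Build the table that would be saved in the /var/... file."""
    return {n: block_checksum(device, n) for n in block_numbers}


def disk_looks_unchanged(device, table, required=REQUIRED_MATCHES):
    """True if at least `required` recorded checksums still match."""
    matches = sum(1 for n, crc in table.items()
                  if block_checksum(device, n) == crc)
    return matches >= required
```

Allowing one mismatch presumably covers a block that was written just before a crash, before the file could be updated.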

Hmmm.
What do you think of this?

-Philipp

Re: Both nodes crashed scenario... [ In reply to ]

dg at example

Nov 18, 2000, 5:36 PM

Post #11 of 14 (3840 views)

On Sat, Nov 18, 2000 at 09:25:19PM +0100, Philipp Reisner wrote:
> * Martin Bene <mb@example.com> [001118 13:17]:
> > Hi David,
> >
> > > > Storing the metadata in the last 1024 bytes of the ll_dev seems
> > > > to be a nice solution.
> > >
> > > I wonder what problem we are trying to solve?
> > >
> > > 1. given a bootable system that was set up with drbd, be able to
> > > configure and start the drbd devices?
> > > or
> > > 2. given an unidentified hard disk, scan it for drbd volumes and
> > > configure them based on the stored metadata?
> > >
> > > I think we only need option 1, and that can easily be done with files
> > > in /var and /etc. There is _no_need_ to fool around with the on disk
> > > format of the volume.
> >
> > Agreed, we only need option 1; however, it seems that having on-device
> > metadata would be a good idea for that as well; otherwise there's no chance
> > of detecting a replaced disk, for example. A robust system should avoid just
> > doing a quicksync to a completely new disk/partition and claiming to be
> > in sync afterwards..
> >
>
> Hmmm, while reading this the following idea just came to my mind:
>
> We could use /var/... approach and store some checksums of some
> blocks in the file as well. If the disk is replaced by a new
> one, the checksums will no longer match.
> We could have let's say 10 blocks. If one of the blocks is
> changed by the application (or FS) we have to update the /var/..
> file to contain the new checksum.
>
> On system startup 9 checksums have to match to the corresponding
> blocks.

> Hmmm.
> What do you think of this?

This is a great idea! It might even be safer than what you describe to
just reuse the mount code that determines the filesystem type and then
reads the superblocks, which will let us take advantage of most filesystems
moving toward using unique identifiers.

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I personally think Unix is "superior" because on LSD it tastes
like Blue." -- jbarnett

Re: Both nodes crashed scenario... [ In reply to ]

marcelo at example

Nov 18, 2000, 5:49 PM

Post #12 of 14 (3841 views)

On Sat, 18 Nov 2000, Philipp Reisner wrote:

> * Martin Bene <mb@example.com> [001118 13:17]:
> > Hi David,
> >
> > > > Storing the metadata in the last 1024 bytes of the ll_dev seems
> > > > to be a nice solution.
> > >
> > > I wonder what problem we are trying to solve?
> > >
> > > 1. given a bootable system that was set up with drbd, be able to
> > > configure and start the drbd devices?
> > > or
> > > 2. given an unidentified hard disk, scan it for drbd volumes and
> > > configure them based on the stored metadata?
> > >
> > > I think we only need option 1, and that can easily be done with files
> > > in /var and /etc. There is _no_need_ to fool around with the on disk
> > > format of the volume.
> >
> > Agreed, we only need option 1; however, it seems that having on-device
> > metadata would be a good idea for that as well; otherwise there's no chance
> > of detecting a replaced disk, for example. A robust system should avoid just
> > doing a quicksync to a completely new disk/partition and claiming to be
> > in sync afterwards..
> >
>
> Hmmm, while reading this the following idea just came to my mind:
>
> We could use /var/... approach and store some checksums of some
> blocks in the file as well. If the disk is replaced by a new
> one, the checksums will no longer match.
> We could have let's say 10 blocks. If one of the blocks is
> changed by the application (or FS) we have to update the /var/..
> file to contain the new checksum.
>
> On system startup 9 checksums have to match to the corresponding
> blocks.
>
> Hmmm.
> What do you think of this?

Which blocks are you going to use to checksum?

Re: Both nodes crashed scenario... [ In reply to ]

philipp at example

Nov 21, 2000, 12:29 PM

Post #13 of 14 (3849 views)

* Marcelo Tosatti <marcelo@example.com> [001119 01:49]:
>
> On Sat, 18 Nov 2000, Philipp Reisner wrote:
>
> > * Martin Bene <mb@example.com> [001118 13:17]:
> > > Hi David,
> > >
> > > > > Storing the metadata in the last 1024 bytes of the ll_dev seems
> > > > > to be a nice solution.
> > > >
> > > > I wonder what problem we are trying to solve?
> > > >
> > > > 1. given a bootable system that was set up with drbd, be able to
> > > > configure and start the drbd devices?
> > > > or
> > > > 2. given an unidentified hard disk, scan it for drbd volumes and
> > > > configure them based on the stored metadata?
> > > >
> > > > I think we only need option 1, and that can easily be done with files
> > > > in /var and /etc. There is _no_need_ to fool around with the on disk
> > > > format of the volume.
> > >
> > > Agreed, we only need option 1; however, it seems that having on-device
> > > metadata would be a good idea for that as well; otherwise there's no chance
> > > of detecting a replaced disk, for example. A robust system should avoid just
> > > doing a quicksync to a completely new disk/partition and claiming to be
> > > in sync afterwards..
> > >
> >
> > Hmmm, while reading this the following idea just came to my mind:
> >
> > We could use /var/... approach and store some checksums of some
> > blocks in the file as well. If the disk is replaced by a new
> > one, the checksums will no longer match.
> > We could have let's say 10 blocks. If one of the blocks is
> > changed by the application (or FS) we have to update the /var/..
> > file to contain the new checksum.
> >
> > On system startup 9 checksums have to match to the corresponding
> > blocks.
> >
> > Hmmm.
> > What do you think of this?
>
> Which blocks are you going to use to checksum?
>

I select the blocks at random. If one of the blocks is written, I
would select another one for the "checksum set".
==> The system should converge to a state where the most seldom
touched blocks are used for the "checksum set".
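
The re-selection rule might look roughly like this. The class and method names are hypothetical, the set size of 10 is taken from the earlier post, and the sketch assumes the device has more blocks than the set holds; recomputing the stored checksum for the newly chosen block is left out.

```python
import random


class ChecksumSet:
    """Tracks which block numbers are currently checksummed (sketch only)."""

    def __init__(self, total_blocks, size=10, rng=None):
        self.total_blocks = total_blocks
        self.rng = rng or random.Random()
        self.blocks = set(self.rng.sample(range(total_blocks), size))

    def on_write(self, block_no):
        """Called for every write; swaps out a block that was just written."""
        if block_no not in self.blocks:
            return
        self.blocks.discard(block_no)
        while True:
            candidate = self.rng.randrange(self.total_blocks)
            if candidate != block_no and candidate not in self.blocks:
                self.blocks.add(candidate)
                return
```

Because frequently written blocks keep getting evicted, the set drifts toward blocks that are rarely touched.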

-Philipp

Re: Both nodes crashed scenario... [ In reply to ]

dg at example

Nov 21, 2000, 3:43 PM

Post #14 of 14 (3855 views)

On Tue, Nov 21, 2000 at 08:29:42PM +0100, Philipp Reisner wrote:
> * Marcelo Tosatti <marcelo@example.com> [001119 01:49]:
> >
> > On Sat, 18 Nov 2000, Philipp Reisner wrote:
> >
> > > * Martin Bene <mb@example.com> [001118 13:17]:
> > > > Hi David,
> > > >
> > > > > > Storing the metadata in the last 1024 bytes of the ll_dev seems
> > > > > > to be a nice solution.
> > > > >
> > > We could have let's say 10 blocks. If one of the blocks is
> > > changed by the application (or FS) we have to update the /var/..
> > > file to contain the new checksum.
> > >
> > > On system startup 9 checksums have to match to the corresponding
> > > blocks.
> > >
> > > Hmmm.
> > > What do you think of this?
> >
> > Which blocks are you going to use to checksum?
> >
>
> I select the blocks at random. If one of the blocks is written, I
> would select another one for the "checksum set".
> ==> The system should converge to a state where the most seldom
> touched blocks are used for the "checksum set".

You are scaring me. Maybe I don't understand what you mean by "select the
blocks by random", but this seems like a really unsafe way to proceed, i.e.
on a mostly empty device, it might never detect changes.
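
The concern can be illustrated with a small sketch. It assumes that unused blocks on the old device and every block on the replacement disk are zero-filled, and it uses CRC32 and a fixed list of block numbers purely as stand-ins; none of these details come from the thread.

```python
import zlib

BLOCK_SIZE = 1024  # assumed block size

old_disk = bytes(BLOCK_SIZE * 100)   # mostly empty device: every block is zeros
new_disk = bytes(BLOCK_SIZE * 100)   # blank replacement disk

sampled = [3, 17, 42, 58, 99]        # stand-in for randomly chosen blocks


def crc(disk, n):
    """CRC32 of block n of disk."""
    return zlib.crc32(disk[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE])


# Every sampled checksum matches, so the replacement goes unnoticed.
undetected = all(crc(old_disk, n) == crc(new_disk, n) for n in sampled)
print(undetected)  # True
```

A checksum set that has converged on never-written blocks is exactly the set most likely to consist of such identical blocks.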

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I personally think Unix is "superior" because on LSD it tastes
like Blue." -- jbarnett