Mailing List Archive: Datadisk

Datadisk

May 18, 2000, 6:58 AM

Post #1 of 17 (7278 views)

Hi,

Another version of Datadisk (again). I will add the resync option as soon as
possible.
This script has been tested for some time and is working fine.

You can filter the output of the state option with cut -f1 -d" " to find out
your computer's state.

Could you have a look at the slave section? I am using drbdsetup there, and I
would like to know whether it can be improved to use REPL if needed.

Does drbdsetup return a special error code if it fails because of a data
inconsistency between the two nodes
(after a WAIT, so that we know whether we must run a REPL)?

AW: Datadisk [ In reply to ]

mb at example

May 19, 2000, 10:47 PM

Post #2 of 17 (7202 views)

Hi Thomas,

> Another version of Datadisk (again), I will add the resync option
> as soon as
> possible.
> This script was tested for a time and is working fine.
>
> You can grep the state option with a cut -f1 -d" " to know if you computer
> state.
>
> Can you have a look to the slave section as I am using drdbsetup
> and I would
> like to know if it can be improved to use repl if needed
>
> Does drbdsetup return a special error code if failed due to data
> inconsistancy between the two nodes.
> (after a WAIT to know if we must run a REPL)

I've been experimenting with this version of datadisk; I just had my
test partition eaten by it: drbd was primary on the slave in a cluster after
failure of the primary node. When the primary came back up and requested the
resources, datadisk failed to release the device to secondary status. I
couldn't unmount due to active processes (interactive session by sysop)
working on the device.

I think it might be a good idea to really try HARD to get the device freed when
asked to stop; it might even do a fuser -m on the mountpoint and kill every
process listed as accessing the device in order to get it released for use
by the requesting primary.

Bye, Martin

"you have moved your mouse, please reboot to make this change take effect"
--------------------------------------------------
Martin Bene vox: +43-316-813824
simon media fax: +43-316-813824-6
Andreas-Hofer-Platz 9 e-mail: mb@example.com
8010 Graz, Austria
--------------------------------------------------
finger mb@example.com for PGP public key

Re: AW: Datadisk [ In reply to ]

philipp at example

May 19, 2000, 11:46 PM

Post #3 of 17 (7191 views)

On Sat, 20 May 2000, Martin Bene wrote:
>Hi Thomas,
>
>> Another version of Datadisk (again), I will add the resync option
[...]
>
>I' ve been experimenting with this version of datadisk; I just had my
>test-partition eaten by it: drbd was primary on the slave in a cluster after
>failure of primary node. When the primary came back up and requested the
>rescources, datadisk failed to release the device to secondary status. I
>couldn't unmount due to active processes (interactive session by sysop)
>working on the device.
>
>I think it might a good Idea to really try HARD to get the device freed when
>asked to stop; might even do a fuser -m on the mountpoint and kill every
>process listed as accessing the device in order to get it released for use
>by the requesting primary.
>

I think this is a problem of heartbeat, which migrates the
service back to its home node as soon as the home node is up again.

I think Luis Claudio added a nice feature to heartbeat which changes
this behaviour. It is called nice_failback and it is available in the
0.4.7a release of heartbeat.

-Philipp

AW: AW: Datadisk [ In reply to ]

mb at example

May 19, 2000, 11:57 PM

Post #4 of 17 (7187 views)

Hi Philipp,

> >I think it might a good Idea to really try HARD to get the
> device freed when
> >asked to stop; might even do a fuser -m on the mountpoint and kill every
> >process listed as accessing the device in order to get it
> released for use
> >by the requesting primary.
> >
>
> I think this is a problem of heartbeat, which is migrating the
> service back to it's home node as soon as the home node is up again.

It certainly is - heartbeat won't take NO for an answer when requesting a
resource. The problem is compounded by heartbeat's lack of knowledge about
resource interdependency - there really is no point in starting services
and acquiring IP addresses if it can't get the drbd device the services need
in order to function.

> I think Luis Claudio added a nice feature to heartbeat, which changes
> this behaviour. It is called nice_failback and it is available in the
> 0.4.7a release of heartbeat.

Unfortunately, it was later removed and is no longer in heartbeat (current
version 0.4.7b); an improved version should be reintroduced using the
heartbeat API currently in development.

So, since datadisk is specifically meant to interoperate with heartbeat,
we'll have to work with heartbeat's current requirements.

I've added

action "killing processes active on $LOCAL_POINT " \
fuser -km $MASTER_MNTPOINT

right before the unmount command to my datadisk script; seems to do the job
for the moment. It's a bit ugly since you'll get a [FAILED] status printed
out in the (normal) case when there are no tasks accessing the device but
it'll work anyway.
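
A minimal sketch of one way to avoid the spurious [FAILED] status, assuming the cause is that fuser exits non-zero when no process is using the file system. The function name kill_mount_users is hypothetical and not part of the datadisk script; it only asks fuser to kill when fuser first reports at least one user.

```shell
# Hypothetical helper, not taken from the datadisk script.
# $1 is the mount point (or device) to free.
kill_mount_users() {
    # fuser -m exits non-zero when nothing is using the file system,
    # so "nothing to kill" is treated as success here.
    if fuser -m "$1" >/dev/null 2>&1; then
        # At least one process was found: send SIGKILL to all of them.
        fuser -km "$1" >/dev/null 2>&1
    else
        return 0
    fi
}

# Possible use inside the init script (variable name as in the snippet above):
# action "killing processes active on $MASTER_MNTPOINT" kill_mount_users "$MASTER_MNTPOINT"
```

A process that exits between the two fuser calls can still produce a non-zero status, so this narrows the false alarm rather than removing it entirely.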

A second question: I made a mistake when configuring drbd by hand and
set /dev/nb0 to secondary while it was still mounted rw. After this, write
attempts to the directory work just fine; the stats in /proc/drbd show that
local disk writes are executed while network writes are not. Would it be
possible to fail local writes as well and return errors for write attempts,
to avoid getting the disks out of sync?

Bye, Martin

Re: AW: Datadisk [ In reply to ]

thomas.mangin at example

May 20, 2000, 1:36 AM

Post #5 of 17 (7186 views)

Hi everybody,

> >I've been experimenting with this version of datadisk; I just had my
> >test-partition eaten by it: drbd was primary on the slave in a cluster
> >after failure of primary node. When the primary came back up and requested
> >the resources, datadisk failed to release the device to secondary status. I
> >couldn't unmount due to active processes (interactive session by sysop)
> >working on the device.

;*) Too bad ...

Since yesterday our mail server has been replicating using drbd. I hope I
won't see that on it ;*)

For my job I wrote a set of scripts (including a simple Perl command daemon
bound to the private interface) which allows swapping the master and the
slave.

You must of course make sure that releasing the slave succeeded before
switching the other node to master. (That is why you now have a state option,
whose output you can cut with -f1 -d" " to find out whether your command
worked.)
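
A rough sketch of that check, under stated assumptions: the exact text printed by the state option is not shown in this thread, so the word "secondary" and the helper names first_field and state_is are only illustrative.

```shell
# Hypothetical helpers; the status wording is an assumption.

# Print the first space-separated field of the given status text.
first_field() {
    printf '%s\n' "$1" | cut -f1 -d" "
}

# Succeed only if the first field of status text $2 equals the expected
# state $1.
state_is() {
    [ "$(first_field "$2")" = "$1" ]
}

# Possible use before promoting the other node (script path is an assumption):
# status=$(/etc/ha.d/resource.d/datadisk state)
# state_is secondary "$status" || echo "release failed, do not switch" >&2
```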

> >I think it might a good Idea to really try HARD to get the device freed
> >when asked to stop; might even do a fuser -m on the mountpoint and kill
> >every process listed as accessing the device in order to get it released
> >for use by the requesting primary.

It looks like a good idea, but if you are not freeing the device, it may be
for a reason that means you do not want to switch after all.

With my mail server solution I have to be sure all the qmail daemons are
properly stopped; otherwise someone may access the unmounted mount point.

> I think this is a problem of heartbeat, which is migrating the
> service back to it's home node as soon as the home node is up again.

If I read the source correctly, it calls stop on the active side before
switching, which makes sense and looks safe.

Thomas

AW: AW: Datadisk [ In reply to ]

mb at example

May 20, 2000, 2:19 AM

Post #6 of 17 (7186 views)

Hi Thomas,

> -----Original Message-----
> From: drbd-devel-admin@example.com
> [mailto:drbd-devel-admin@example.com] On Behalf Of Thomas
> Mangin
> Sent: Saturday, 20 May 2000 10:37
> To: drbd-devel@example.com
> Subject: Re: AW: [DRBD-dev] Datadisk

> > >couldn't unmount due to active processes (interactive session by sysop)
> > >working on the device.
>
> ;*) Too bad ...

That's what test installations are for, no problem :-)

> > >I think it might a good Idea to really try HARD to get the device
> > >freed when asked to stop; might even do a fuser -m on the
> > >mountpoint and kill every process listed as accessing the device
> > >in order to get it released for use by the requesting primary.
>
> It look like a good idea but if you are not feeing the device it
> can be due to reason for which you finally don't wan't to swich

That's the rub with the current heartbeat code - you CAN'T refuse to give up a
resource; the master will ask the slave to free a resource, wait for the
answer and proceed to allocate it regardless of the answer sent by the
client. The nice_failback stuff tried to fix this situation but was
removed again to make way for a newer, better and not yet available
solution.

> > I think this is a problem of heartbeat, which is migrating the
> > service back to it's home node as soon as the home node is up again.

> If i well read the source it call stop on the active side before switching
> which make sense and look safe.

IF the stop command works, it's safe all right - but that's a BIG if. It'll
work for the normal case where only the jobs you're expecting to access the
mountpoint actually do, but as in my example it can fail if the sysop is
sitting on the device in an interactive shell. What happens next is not a
pretty sight; freeing the slave fails and the master takes over anyway.

Another stupid mistake in my setup I just discovered:

If you're running drbd as a resource under heartbeat, make VERY sure that
drbd gets started before heartbeat in the system startup sequence. Otherwise
this will happen when the master node restarts:

1) heartbeat starts, initialisation begins...
2) drbd starts, loads the device as slave, waits for sync to finish
3) heartbeat decides to take over the drbd device, switches to primary ->
   image of virtual mushroom cloud like you get from an A-bomb.
4) sysop bursts out in tears :-)

BTW, the default startup sequence turns out to be the wrong way round.
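
One way to check the ordering, sketched under the assumption of a SysV-style rc directory in which start links are named S<two digits><service> (for example S70drbd). The directory path and the function names start_prio and starts_before are illustrative, not something either package ships.

```shell
# Print the two-digit start priority of service $2 in rc directory $1,
# e.g. a link named S70drbd yields 70. Returns 1 if no start link exists.
start_prio() {
    for link in "$1"/S[0-9][0-9]"$2"; do
        # An unmatched glob is left as a literal pattern; skip it.
        [ -e "$link" ] || [ -L "$link" ] || continue
        basename "$link" | cut -c2-3
        return 0
    done
    return 1
}

# Succeed only if service $2 starts before service $3 in rc directory $1.
# Returns 2 if either service has no start link.
starts_before() {
    first=$(start_prio "$1" "$2") || return 2
    second=$(start_prio "$1" "$3") || return 2
    # Strip a leading zero so values such as 08 are compared as decimals.
    [ "${first#0}" -lt "${second#0}" ]
}

# Possible use (the runlevel directory is an assumption):
# starts_before /etc/rc.d/rc3.d drbd heartbeat || echo "drbd must start first" >&2
```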

Bye, Martin

Re: AW: Datadisk [ In reply to ]

thomas.mangin at example

May 20, 2000, 2:57 AM

Post #7 of 17 (7187 views)

Hi Martin,

> > > >I think it might a good Idea to really try HARD to get the device
> > > >freed when asked to stop; might even do a fuser -m on the
> > > >mountpoint and kill every process listed as accessing the device
> > > >in order to get it released for use by the requesting primary.
> >
> > It look like a good idea but if you are not feeing the device it
> > can be due to reason for which you finally don't wan't to swich
>
> That's the rub with current heartbeat code - you CAN'T refuse to give up a
> resource; the master will ask the slave to free a resource, wait for the
> answer and proceed to allocate it regardless of the answer sent by the
> client. the "nice-failover" stuff tried to fix this situation but was
> removed again to make way for a newer, better and not yet available
> solution.

You expressed more clearly the reason why I wrote an ugly piece of
script which allows me to do the takeover by hand. If it can't do it, it
stays in its previous configuration.

> > > I think this is a problem of heartbeat, which is migrating the
> > > service back to it's home node as soon as the home node is up again.

> IF the stop command works, it's save all right - bit that's a BIG if.
> It'll work for the normal case where only the jobs you're expecting to
> access the mountpoint actually are, but as I in my example it can fail if
> the sysop's sittin on the device in an interactive shell. Waht happens next
> is not a pretty sight; freeing the slave fails and the master takes over
> anyway.

I wonder why heartbeat can't call two scripts, one for the migration and
one to test whether it succeeded? I guess this is off topic here ...

> Another stupid mistake in my setup I just discovered:

Make sure the chkconfig values make heartbeat stop first at shutdown, too!
Or "sysop bursts out in tears :-)"

> "you have moved your mouse, please reboot to make this change take effect"

You are not using Windows 2000, which doesn't load the mouse and keyboard
drivers if they are not present at boot time!! Such an economy, very useful
with VNC!

Thomas

Re: AW: AW: Datadisk [ In reply to ]

philipp at example

May 21, 2000, 12:10 AM

Post #8 of 17 (7196 views)

On Sat, 20 May 2000, Martin Bene wrote:
[...]
>A second question: I've made a mistake when configureing drbd my hand and
>set /dev/nb0 to secondary while still mounted rw. After this, write attempts
>to the directory work just fine, the stats on /proc/drbd show that local
>disk writes are executed while network writes are not executed. would it be
>possible to fail local writes as well and errors for write attempts to avoid
>getting the disks out of sync?
>
>Bye, Martin
>

Yes, this is a bug. Let me explain why it is like this:

There is a bitfield in the kernel where block devices can be marked as
read-only. But currently it is not possible to mark a drbd device that
is in secondary state as read-only, because when it receives blocks
over the network, it puts them in the buffer cache (as blocks from that
drbd device) and from there they are processed by drbd's do_request
function. (If we marked the drbd device as read-only, the IO request would
be canceled somewhere in ll_rw_block().)

These are the options:

1) Switching to secondary state fails as long as at least one process has
opened the device in rw mode.

2) We wait until we have GFS support. In this case we could use the
kernel's bitmap.

I am opting for solution 1, because it is "the unix way of life", just as
unmount does not work as long as someone is accessing an object in that
filesystem.

So, to be compatible with current heartbeat (which forces the service back
to its home node as soon as the home node is up again), datadisk stop must:
=> Try to switch the device into secondary state.
=> If that does not work, try to unmount the FS which is
   mounted on top of that device.
=> If that does not work either, kill the processes which are
   accessing this filesystem.

What do you think ?

-Philipp
--
Want to try something new? Are you a Linux hacker?
Volunteer in testing mergemem!
(Get it from http://das.ist.org/mergemem)
-----
Philipp Reisner PGP: http://der.ist.org/~kde/pgp.asc

AW: Datadisk [ In reply to ]

mb at example

May 21, 2000, 1:12 AM

Post #9 of 17 (7185 views)

Hi Philipp,

On Sun, 21, May 2000 09:10 Philipp Reisner wrote:

> Yes this is a bug, let me explain why it is like this:
>
> There is a bitfield in the kernel, where block-devices can be marked as
> read-only. But currently it is not possible to mark a drbd device, which
> is in secondary state, as read-only. Because when it receives blocks
> over the network, it puts them on the buffer-cache (as blocks from that
> drbd-device) and from there they are processes by drbd's do_request
> function. (If we mark the drbd device as read-only the IO-request would
> be canceled somwhere in ll_rw_block())
>
> This are the options:
>
> 1) Switching to secondary state fails as long as at least one
> has opened
> the device in rw-mode.
>
> 2) We wait until we have the GFS-Support. In this case we could use the
> kernel's bitmap.
>
> I am opting for solution 1, becuase it is "the unix way of life", like
> unmount is not working as long as someone accesses a object in
> that filesystem.

Agreed that 2) isn't what we want. However, I'm not too happy about 1)
either.

OK, we can't switch the secondary fs to read only mode. Do you think it
would be feasible to return an error code to local write requests, just like
you'd get an error if you try to write to a bad sector on your harddisk?

I'd really like to be able to force drbd to a secondary role, even if
unmount fails - if whatever still tries to write to the device afterwards
gets errors back, that's fine by me (the worst thing that can happen is that
some application exits with an error - which is just what we wanted to
achieve anyway).

> So, to be compatible with current heartbeat (which forces the service back
> to it's home node as soon as the home node is up again).
>
> datadisk stop must:
> => Try to switch the device into secondary state.
> => If it is not working it must try to unmount the FS which is
> mounted on top of that device.
> => If this is not working it must kill the processes which are
> accessing this filesystem.
>
> What do you think ?

I'm using a different sequence, but the effect is about the same:

0) datadisk is the last resource to be freed before releasing the IP
address - so all the resources using drbd should already have been shut down
cleanly.

1) kill all processes still accessing the device (using fuser -mk
/dev/nb0)

2) try unmounting

3a) currently: if it doesn't work, we're in trouble;
3b) hoped for: switch to secondary mode anyway, return disk error to any
local writer.

Possible?

Bye, Martin

Re: AW: Datadisk [ In reply to ]

philipp at example

May 22, 2000, 12:37 AM

Post #10 of 17 (7187 views)

On Sun, 21 May 2000, you wrote:
>Hi Philipp,
>
>
>On Sun, 21, May 2000 09:10 Philipp Reisner wrote:
>
>> Yes this is a bug, let me explain why it is like this:
>>
>> There is a bitfield in the kernel, where block-devices can be marked as
>> read-only. But currently it is not possible to mark a drbd device, which
>> is in secondary state, as read-only. Because when it receives blocks
>> over the network, it puts them on the buffer-cache (as blocks from that
>> drbd-device) and from there they are processes by drbd's do_request
>> function. (If we mark the drbd device as read-only the IO-request would
>> be canceled somwhere in ll_rw_block())
>>
>> This are the options:
>>
>> 1) Switching to secondary state fails as long as at least one
>> has opened
>> the device in rw-mode.
>>
>> 2) We wait until we have the GFS-Support. In this case we could use the
>> kernel's bitmap.
>>
>> I am opting for solution 1, becuase it is "the unix way of life", like
>> unmount is not working as long as someone accesses a object in
>> that filesystem.
>
>Agreed that 2) isn't what we want. However, I'm not too happy about 1)
>either.
>
>OK, we can't switch the secondary fs to read only mode. Do you think it
>would be feasible to return an error code to local write requests, just like
>you'd get an error if you try to write to a bad sector on your harddisk?
>
>I'd really like to be able to force drbd to a secondary role, even if
>unmount fails - If whatever still tries to write to the device afterwards
>gets errors back, that's fine by me (worst thing that can happen is that
>some aplication exits with an error - which is just what we wanted to
>achieve anyway).
>

Sorry, I did not express myself clearly.

With "... return an error code to local write requests, just like
you'd get an error if you try to write to a bad sector..."
you are asking for option 2.

Option 2 is currently not possible. (Because in do_request() drbd can not
distinguish between blocks received from the network and blocks originating
from a local filesystem)

I think option 1 is the way to go, because if we went with 2, the
filesystem would see these errors. Ext2 will not crash/panic but will print a
lot of messages to syslog, and the application processes will never see an
error!! (The last time I checked, ReiserFS crashed hard when getting errors
on IO requests.)

It's the same thing as with unmount. There is no way to force an unmount.
You have to use fuser and then you have to retry unmount.

The current behaviour is the worst:
drbdsetup simply suggests that switching to secondary worked, while there are
still applications/filesystems accessing the /dev/nbX device.

The proposed behaviour (aka option 1):
drbdsetup /dev/nbX SEC fails as long as there is at least one
application/filesystem that has opened the device in rw mode.

>> So, to be compatible with current heartbeat (which forces the service back
>> to it's home node as soon as the home node is up again).
>>
>> datadisk stop must:
>> => Try to switch the device into secondary state.
>> => If it is not working it must try to unmount the FS which is
>> mounted on top of that device.
>> => If this is not working it must kill the processes which are
>> accessing this filesystem.
>>
>> What do you think ?
>
>I'm using a different sequence, but the effect is about the same -
> 0) datadisk is the last resource to be freed before releasing the IP
>address - so all the resources using drbd should already have been shut down
>cleanly.
>
> 1) kill all processess still accessing the device (using fuser -mk
>/dev/nb0)
>
> 2) try unmounting
>
> 3a) currently: if it doesn't work, we're in trouble;
> 3b) hoped for: switch to seconary mode anyway, return disk error to any
>local writer.
>
>Possible?

Yes, datadisk should be the last resource to be freed, ...

Thus at first it tries to switch the device into secondary state.
If this does not work because the sysadmin has a running shell somewhere
in there, datadisk tries to unmount the fs on top of that device. If that
fails too, because of that shell, it tries to kill the shell (using fuser).
Then it retries the unmount, and finally it retries switching the device
into secondary state.
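
The escalation described above could look roughly like the following sketch. It assumes the proposed behaviour (option 1), where drbdsetup /dev/nbX SEC fails while the device is still open in rw mode; with the current behaviour the first call would report success regardless. The function name datadisk_stop and the example device and mount point are placeholders.

```shell
# Hypothetical sketch of the stop sequence; $1 is the drbd device,
# $2 is the mount point of the file system on top of it.
datadisk_stop() {
    dev=$1
    mnt=$2

    # 1) Try to switch the device into secondary state right away.
    drbdsetup "$dev" SEC && return 0

    # 2) That failed: try to unmount the file system.
    if ! umount "$mnt"; then
        # 3) Still busy: kill whatever is using it, then retry the unmount.
        fuser -km "$mnt"
        umount "$mnt" || return 1
    fi

    # 4) Finally retry the switch to secondary state.
    drbdsetup "$dev" SEC
}

# Possible use: datadisk_stop /dev/nb0 /mnt/data
```

The function returns non-zero if the device could not be released, which current heartbeat would ignore anyway, as discussed earlier in the thread.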

3b is currently not possible, but will be possible as soon as GFS support is
finished.

-Philipp

Re: AW: Datadisk [ In reply to ]

lclaudio at example

May 22, 2000, 7:07 AM

Post #11 of 17 (7188 views)

Hi!

I wrote a new, smaller and cleaner version of nice_failback that's
working fine for me and some other victm^Wadmins who are testing this
feature.
The nice_failback stuff that was in 0.4.7a is broken and goes against the
heartbeat philosophy of working (the main reason why the new stuff will
be kept out of heartbeat until the API is ready). As nice_failback is a need
for some of us, I rewrote the patch and sent it to Alan (who's
travelling right now) and maybe it fits in 0.4.7c.
Anyway, here's the patch. :)

Best Regards!

[ Luis Claudio R. Goncalves lclaudio@example.com ]
[. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]

Re: AW: Datadisk [ In reply to ]

olive at example

May 22, 2000, 11:07 AM

Post #12 of 17 (7190 views)

Hi there,

) I think it might a good Idea to really try HARD to get the device freed when
) asked to stop; might even do a fuser -m on the mountpoint and kill every
) process listed as accessing the device in order to get it released for use
) by the requesting primary.

I think that could be used as a very last alternative. I don't think it
would be very smart to keep open shells on such a volatile area as a drbd
volume, and since the admin is usually the one in charge of getting
machines back up, he would be shooting himself in the foot by keeping a shell
on a volume he knows will go away (or try to) when the other machine takes
over.

Does anyone know whether a read-only remount would work in cases where all
services are properly shut down but someone still keeps a file open on the
mounted drbd device? Maybe after mount -o ro,remount, we can drbdsetup SEC
and avoid killing those processes.

See ya!
Fábio
( Fábio Olivé Leite -* ConectivaLinux *- olive@example.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )

AW: AW: Datadisk [ In reply to ]

mb at example

May 22, 2000, 1:31 PM

Post #13 of 17 (7181 views)

Hi Fábio,

> ) I think it might a good Idea to really try HARD to get the
> device freed when
> ) asked to stop; might even do a fuser -m on the mountpoint and kill every
> ) process listed as accessing the device in order to get it
> released for use
> ) by the requesting primary.

> I think that could be used as a very last alternative. I don't think it
> would be very smart to keep open shells on such a volatile area as a drbd
> volume, and since the admin is usually the one in charge of getting
> machines back up, he would be shooting his own foot keeping a shell on a
> volume he knows will go away (or try to) when the other machine takes
> over.

Granted - it isn't a very intelligent or logical thing to do. However, I
think it's a mistake that's very easy to make - chances are you'll want to
look around on your drbd device on the master to check things out before
initiating failover/failback. So I'd say the probability of someone leaving
a shell sitting around on the device is rather high.

> Anyone knows if a remount read-only would work for such cases where all
> services are properly shutdown but someone still keeps a file open in the
> mounted drbd device? Maybe after mount -o ro,remount, we can drbdsetup SEC
> and avoid killing those processes.

For many cases this will work, esp. in the case of the interactive shell - I
just tried it. So it looks like we should do:

try to unmount the mounted filesystem
if it fails, try to remount ro
if it fails,
    try killing processes with fuser -mk /dev/nbx
    retry unmount/remount ro.
set /dev/nbx to secondary mode
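
As a sketch only, the release part of this sequence might be written as below. The function name release_fs and the example paths are assumptions; the commands themselves (umount, mount -o ro,remount, fuser -km) are the ones mentioned in this thread.

```shell
# Hypothetical sketch: free the file system mounted on $1, preferring an
# unmount and falling back to a read-only remount. Returns 0 once the
# file system is either unmounted or remounted read-only.
release_fs() {
    umount "$1" && return 0
    mount -o ro,remount "$1" && return 0
    # Both attempts failed: kill the processes still using the file
    # system, then retry the unmount and, failing that, the remount.
    fuser -km "$1"
    umount "$1" || mount -o ro,remount "$1"
}

# Possible use, followed by the switch to secondary mode:
# release_fs /mnt/data && drbdsetup /dev/nb0 SEC
```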

Depending on what you're actually doing with your drbd device, killing
processes might be preferable to switching to read-only mode - when switching
to ro, chances are higher of some process lingering on and confusing things
even if it isn't very useful once it can't write to disk any longer.

Also, drbd needs to be changed to fail switching to secondary if the
filesystem is still mounted rw - it shouldn't matter with correct scripts but
should be caught at driver level anyway.

Bye, Martin

Re: AW: Datadisk [ In reply to ]

thomas.mangin at example

May 23, 2000, 2:29 AM

Post #14 of 17 (7188 views)

> Granted - it isn't a very inteligent or logical thing to do. however, I
> think it's a mistake that's very easy to make - chances are you'll want to
> look around on your drbd device on the master to check out things before
> initiating failover/failback. So I'd say the probability of someone
> letting a shell sitting around on the device is rather high.

I agree it is an easy mistake to make, but if you don't think about your
shell, you won't think properly about your daemons either ...

You need to stop them before a takeover ...

> try to unmount the mounted filesystem
> if it fails, tray to remount ro
> if it fails,
try to find the daemons using it and stop them properly
if it is still not free
> try killing processes with fuser -mk /dev/nbx

Why? At this stage all processes have normally been killed!
> retry unmount/remount ro.
Better:
try to unmount
if it fails,
then remount ro
> set /dev/dbx to secondary mode

> Depending on what you're actually doing with your drbd device killing
> processes might be preferable to switching to readonly mode - when
> switching to ro chances are higher of some process lingering on and
> confising things even if it isn't very usefull if it can't write to disk
> any longer.

I totally agree. It is not a good idea to try to guess what DRBD will be
used for.
I use it for a mail server, and mail is very different from web.

Re: AW: Datadisk [ In reply to ]

alanr at example

May 23, 2000, 7:56 PM

Post #15 of 17 (7183 views)

Martin Bene wrote:
>
> I think it might a good Idea to really try HARD to get the device freed when
> asked to stop; might even do a fuser -m on the mountpoint and kill every
> process listed as accessing the device in order to get it released for use
> by the requesting primary.

For high-availability systems this is essential.

-- Alan Robertson
alanr@example.com

Re: AW: Datadisk [ In reply to ]

alanr at example

May 23, 2000, 8:04 PM

Post #16 of 17 (7186 views)

Philipp Reisner wrote:
>
> On Sat, 20 May 2000, Martin Bene wrote:
> >Hi Thomas,
> >
> >> Another version of Datadisk (again), I will add the resync option
> [...]
> >
> >I' ve been experimenting with this version of datadisk; I just had my
> >test-partition eaten by it: drbd was primary on the slave in a cluster after
> >failure of primary node. When the primary came back up and requested the
> >rescources, datadisk failed to release the device to secondary status. I
> >couldn't unmount due to active processes (interactive session by sysop)
> >working on the device.
> >
> >I think it might a good Idea to really try HARD to get the device freed when
> >asked to stop; might even do a fuser -m on the mountpoint and kill every
> >process listed as accessing the device in order to get it released for use
> >by the requesting primary.
> >
>
> I think this is a problem of heartbeat, which is migrating the
> service back to it's home node as soon as the home node is up again.

Why does this cause a problem with drbd? I don't understand why drbd
shouldn't tolerate moving the resources back. I understand and agree
with Luis' idea, but don't understand why its absence should cause
problems.
>
> I think Luis Claudio added a nice feature to heartbeat, which changes
> this behaviour. It is called nice_failback and it is available in the
> 0.4.7a release of heartbeat.

0.4.7a has serious problems. nice_failback was removed in 0.4.7b because it
had bugs in it. It may come back in a future version, or it may not.

-- Alan Robertson
alanr@example.com

Re: AW: Datadisk [ In reply to ]

alanr at example

May 24, 2000, 6:46 PM

Post #17 of 17 (7189 views)

"Luis Claudio R. Goncalves" wrote:
>
> Hi!
>
> I wrote a new, smaller and cleaner version of nice_failback that's
> working fine for me and some other victm^Wadmins who are testing this
> feature.
> The nice_failback stuff that was in 0.4.7a is broken and hurts the
> heartbeat philosofy of working (the main reason why the new stuff will
> be off heartbeat until the API is ready). As nice_failback is a need
> for some of us, I rewrote the patch and send it to Alan (who's
> travelling right now) and maybe it fits in 0.4.7c .
> Anyway, here's the pacth. :)

Having read all the threads about the extreme pain that heartbeat is
causing to people using drbd, we will work diligently to get Luis'
changes integrated into heartbeat, probably in 0.4.7c.

Nevertheless, it should be possible to force it back to the other
machine if for no other reason than the sysadmin told you to do so...

Sorry for the pain...

-- Alan Robertson
alanr@example.com
