Hi,
Sorry to send this to both lists. I couldn't decide which.
This is an attempt to solve the problem of network partitioning possibly causing
an application to report that an I/O operation was a success when, in fact, it
shouldn't have been successful, because we are in a "partitioned cluster" mode.
I have been mulling over an idea which no doubt still has some holes in it.
Let's see if we can make them very small, or better yet, make them go completely
away.
I propose a modified version of the NBD and the mirroring code. Perhaps the
changes will be small. Perhaps they won't, and the RAID driver has to fork.
Perhaps we'll decide it isn't practical. Let's find out.
As in all mirroring schemes, each write has to go both to the local disk and the
remote disk. In this scheme, the RAID code would wait for the remote disk write
to complete before attempting the local disk write.
If the remote machine has falsely declared the local machine "down", then it
will make the remote disk "busy" (effectively reserved), and the local machine
will then get an error when it tries to write to the remote disk. It would then
treat this error as a special case, and refuse to write to the local disk as
well - propagating this error back to the caller.
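To make the ordering concrete, here is a rough sketch in Python (not kernel C,
just to show the control flow). Everything in it is made up for illustration:
RemoteDisk, LocalDisk, PartitionedError and mirrored_write are hypothetical
names, not anything that exists in NBD or the RAID driver, and the "reserved"
flag simply stands in for the remote machine having marked its disk "busy".

```python
class PartitionedError(Exception):
    """The remote side has reserved its disk because it believes we are down."""


class RemoteDisk:
    """Stand-in for the disk reached over NBD (hypothetical model)."""

    def __init__(self):
        self.blocks = {}
        self.reserved = False  # set by the remote machine when it declares us down

    def write(self, block_no, data):
        if self.reserved:
            raise PartitionedError(f"block {block_no}: remote disk is busy")
        self.blocks[block_no] = data


class LocalDisk:
    """Stand-in for the local half of the mirror (hypothetical model)."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data


def mirrored_write(local, remote, block_no, data):
    """Write remote first; touch the local disk only if the remote accepted."""
    # If this raises PartitionedError, the local write below never happens
    # and the error propagates back to the caller.
    remote.write(block_no, data)
    local.write(block_no, data)
```

With this ordering, a caller that gets PartitionedError knows that neither copy
of the block was changed by this write.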
This obviously has some potential performance (latency) issues. I'm not sure
they're worse than the "normal" network RAID case, since (I think?) the writer
has to wait for both writes anyway. You could always save the old block from
the disk, and then put it back if the remote write gets the partitioned cluster
error. It is essential that you not return success to the user before the
remote responds or times out.
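The "save the old block" variant might look something like the sketch below,
again in Python and again with invented names (overlapped_write, remote_write,
PartitionedError are all hypothetical). It assumes the local disk can be
modelled as a dict of blocks and that the remote write is a callable which
either returns, raises PartitionedError, or hangs. The local write overlaps
with the remote one, but the function does not return success until the remote
has answered, and it restores the old block on a partition error or timeout.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class PartitionedError(Exception):
    """The remote side refused the write because it believes we are down."""


def overlapped_write(local_blocks, remote_write, block_no, data, timeout=5.0):
    """Overlap local and remote writes; undo the local one if the remote fails."""
    old = local_blocks.get(block_no)  # None means the block did not exist yet
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        pending = pool.submit(remote_write, block_no, data)  # remote in flight
        local_blocks[block_no] = data                        # local write meanwhile
        try:
            # Success is reported only after the remote responds.
            pending.result(timeout=timeout)
        except (PartitionedError, FutureTimeout):
            # Put the saved block back, then pass the error to the caller.
            if old is None:
                local_blocks.pop(block_no, None)
            else:
                local_blocks[block_no] = old
            raise
    finally:
        pool.shutdown(wait=False)  # do not block on a hung remote write
```

One hole this sketch does not close: after a timeout the remote write may
still land later, so the two copies can differ until a resync.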
If the local machine has declared the remote machine down, then the process is
in some sense simpler, since this means that the "owning" machine simply has to
resync the mirror on the remote machine. No potential of scrogging data state
here...
The worst case is probably where each machine thinks the other one is down.
There are several cases I haven't considered here, and there are also questions
about how to involve alternative (e.g., serial) communication media in this, so
that you can handle cases where various kinds of network failures occur.
It's a little late for me to make this bulletproof before going to bed, but I
thought I'd throw it out for you to tear up and improve.
-- Alan Robertson
alanr@bell-labs.com