
journalled data CFS understanding
I've been thinking about the issues with clustered journalled-data
file systems.

In GFS, the intent is to treat all pages in the file as versions of
the data, with the version represented by a generation number in the
dinode. This version will need to be a 64-bit value, as each write to
the file could potentially update it.
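
Roughly, I am picturing something like this on disk (the field names
here are invented, not the actual GFS structures):

    /* Hypothetical dinode fragment -- just to illustrate a 64-bit
     * per-file generation number that is bumped on every write. */
    #include <stdint.h>

    struct dinode {
            uint64_t di_generation;   /* bumped each time the file is written */
            uint64_t di_size;         /* file size in bytes */
            /* ... block pointers, ownership, timestamps, etc. ... */
    };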

In the event of a multiple failure, it is believed that replay of the
failed nodes can be done sequentially and independently, without
merging the logs. The belief is that each log record can be compared
to the on-disk dinode version, and the data blocks in the log applied
only if needed. I think this works if, and only if, -all- dirty
blocks in a file are flushed to disk before a block is transferred to
another node. A log flush alone will not suffice. This will have
awkward performance, as it effectively removes the benefit of locking
individual blocks and sending them individually.
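
In rough C, the per-record replay test I understand to be proposed is
something like this (a sketch only, names invented):

    #include <stdint.h>

    /* 'rec_generation' is the dinode generation stored with the data
     * block in the log; 'disk_generation' is read from the on-disk
     * dinode at recovery time. */
    static int should_replay(uint64_t rec_generation, uint64_t disk_generation)
    {
            /* Apply the logged block only if the log is newer than
             * what already made it to disk. */
            return rec_generation > disk_generation;
    }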

Am I missing something?

Perhaps it is necessary to have a global dinode generation, and
then per-journal generations. This way, recovery can know that
the dinode is up to date to global generation X, but that the
data for this journal is only good through M.
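
Something like the following, just to make the idea concrete (layout
and names invented):

    #include <stdint.h>

    #define MAX_JOURNALS 64   /* assumed cluster-wide limit */

    /* One global generation for the dinode as a whole (X), plus a
     * per-journal generation (M) saying how far each journal's data
     * is known to be good. */
    struct dinode_generations {
            uint64_t global_gen;
            uint64_t journal_gen[MAX_JOURNALS];
    };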

-dB
Re: journalled data CFS understanding
On Thu, Jan 27, 2000 at 06:46:04PM -0800, David Brower wrote:
> In the event of a multiple failure, it is believed that replay of the
> failed nodes can be done sequentially and independently, without
> merging the logs. The belief is that each log record can be compared
> to the on-disk dinode version, and the data blocks in the log applied
> only if needed. I think this works if, and only if, -all- dirty
> blocks in a file are flushed to disk before a block is transferred to
> another node. A log flush alone will not suffice. This will have
> awkward performance, as it effectively removes the benefit of locking
> individual blocks and sending them individually.
>
> Am I missing something?

No, that's right.

In the current versions of GFS there is only one lock per inode, so all the
file's dirty blocks must be flushed to disk anyway.

Multiple locks per inode can't be easily done with Dlock. It's only just
recently (because of the Cluster Workshop) that we've been looking at a
DLM-style locking interface in GFS. GFS has locking modules, and that is
enough to use a DLM, but we're only just looking at changing the interface
to really take advantage of all its nifty properties.

It's possible to put a generation number in every block of metadata. Metadata
blocks in the same file, but with different locks, could be passed back and
forth without flushing the whole file. This fixes much of the problem.
(Expanding the generation number concept when doing data journaling isn't
something we've totally explored yet.)
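
Concretely, each metadata block would carry something like this in its
header (a sketch, not the current GFS on-disk format):

    #include <stdint.h>

    /* Per-block metadata header with its own generation number, so
     * blocks under different locks can move between nodes without
     * flushing the whole file.  Hypothetical layout. */
    struct meta_header {
            uint32_t mh_magic;        /* identifies a metadata block */
            uint32_t mh_type;         /* dinode, indirect, dirent, ... */
            uint64_t mh_generation;   /* bumped each time the block is logged */
    };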

There are a couple of reasons we haven't put a generation number in all the
metadata yet, though. The first is just the simple reason that it involves
a lot of code changes. We've been concentrating on getting the stuff we
already have to work.

The bigger issue is that any piece of metadata that has a generation number
is fairly permanently stuck on disk. The problem is that you can't deallocate,
say, an indirect block with a generation number and reuse it as a data block
until all references to that indirect block are out of the active regions of
all journals in the cluster. As long as a reference to that block is in any
client's active journal, it's possible that the generation number will be
needed to do recovery. Overwriting it with data could be bad.

Determining when the block is out of all machines' journals is a somewhat
difficult thing to do, so right now it's "once a generation number, always
a generation number". (Maybe we'll come up with a good way around this in
the future.)
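
The check that reallocation would need is roughly this (invented
helper, just to show why it's awkward):

    #include <stdbool.h>
    #include <stdint.h>

    struct journal;   /* opaque; one per node in the cluster */

    /* Invented helper: does the active (unreclaimed) region of this
     * journal still reference the block? */
    bool journal_refs_block(struct journal *j, uint64_t blkno);

    static bool safe_to_reuse_as_data(uint64_t blkno,
                                      struct journal **journals, int njournals)
    {
            for (int i = 0; i < njournals; i++) {
                    if (journal_refs_block(journals[i], blkno))
                            return false;   /* generation number may still
                                             * be needed for recovery */
            }
            return true;
    }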

So, making inodes persistent on the disk isn't too difficult a thing to do,
but it becomes more of a pain when *every* piece of metadata must be kept in
a list to be reused. (Again, it comes down to a lot of rewritten code.)
This might be something we do in the future, but we aren't planning on
doing it right now.

> Perhaps it is necessary to have a global dinode generation, and
> then per-journal generations. This way, recovery can know that
> the dinode is up to date to global generation X, but that the
> data for this journal is only good through M.

I don't see right off how this would work, but I'll think about it some more.

--
Ken Preslan <kpreslan@zarniwoop.com>
Re: journalled data CFS understanding
Ken wrote:
> David wrote:
> > Perhaps it is necessary to have a global dinode generation, and
> > then per-journal generations. This way, recovery can know that
> > the dinode is up to date to global generation X, but that the
> > data for this journal is only good through M.
>
> I don't see right off how this would work, but I'll think about it some more.
>

This _is_ a totally standard "vector time protocol", also known as version
vectors. The version vectors form a partially ordered set of versions.
One way of implementing them is for each system to remember how many
updates it made, and how many updates it believes other systems made.

Each log record affecting an object should have a version vector.

The DLM or locking protocol now imposes a total order on these vectors.
Replaying log records now merely has to be done in order.
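
For example, the standard comparison looks like this (nothing
GFS-specific; NNODES is just an assumed cluster size):

    #include <stdint.h>

    #define NNODES 8

    enum vv_order { VV_EQUAL, VV_BEFORE, VV_AFTER, VV_CONCURRENT };

    /* Classic version-vector comparison: one update counter per node.
     * A vector that is >= in every slot dominates; mixed ordering
     * means the updates were concurrent. */
    static enum vv_order vv_compare(const uint64_t a[NNODES],
                                    const uint64_t b[NNODES])
    {
            int a_gt = 0, b_gt = 0;

            for (int i = 0; i < NNODES; i++) {
                    if (a[i] > b[i]) a_gt = 1;
                    if (b[i] > a[i]) b_gt = 1;
            }
            if (a_gt && b_gt) return VV_CONCURRENT;
            if (a_gt)         return VV_AFTER;
            if (b_gt)         return VV_BEFORE;
            return VV_EQUAL;
    }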

I think that passing a representation of version vectors around as lock
versions would give a lot of information.

BTW, these are just generalities ... Coda uses version vectors for server
replication, which is not so different from cluster file system issues.
Many books on distributed systems discuss replication and version vectors.

- Peter -

Re: journalled data CFS understanding
Hi,

On Thu, 27 Jan 2000 23:04:40 -0700 (MST), "Peter J. Braam"
<braam@cs.cmu.edu> said:

> Each log record affecting an object should have a version vector.

> The DLM or locking protocol now imposes a total order on these vectors.
> Replaying log records now merely has to be done in order.

That's not the problem. Sure, adding a version tag to each record
will allow multiple journal playbacks to be synchronised. The problem
is that if you deallocate some metadata (e.g. an indirect or directory
block), and then reallocate it as data, you now have writes to that
same block which are not journaled at all. It's the interaction
between data and metadata which is the real problem here.

Ken, surely you can get the best of both worlds by maintaining an
inode generation number in addition to the fine granularity locks?
For deletes, this would be just fine --- the update to the inode
sequence number would prevent any old log entries for that inode from
being replayed over the newly-released disk blocks.

For truncate, things would have to fall back on the current
synchronous behaviour, flushing out all dirty blocks, because the act
of bumping the inode sequence number would invalidate any updates
already journaled for that inode.

For all other operations, we would be able to perform
fine-granularity locking within the inode using the per-block locks
and sequence numbers.
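
In other words, the replay test becomes two-level, something like this
(a sketch, all names invented):

    #include <stdbool.h>
    #include <stdint.h>

    struct log_rec {
            uint64_t inode_gen;   /* inode generation when the record was logged */
            uint64_t block_seq;   /* per-block sequence number in the record */
    };

    /* Replay a logged block only if (a) the inode generation hasn't
     * been bumped by a delete or truncate since the record was
     * written, and (b) the record is newer than what is on disk. */
    static bool should_replay(const struct log_rec *rec,
                              uint64_t cur_inode_gen, uint64_t disk_block_seq)
    {
            if (rec->inode_gen != cur_inode_gen)
                    return false;
            return rec->block_seq > disk_block_seq;
    }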

--Stephen