RAID 0, protocol C
Hello all,

I've made an interesting observation, and I'm not quite sure if I should
be concerned or not:

I've set up a new HA system: 2 Compaq ProLiants with SmartArray 3200
hardware RAID controllers, 5 NICs, four 9GB 10K RPM SCSI drives, and 1GB
RAM. I'm using protocol C, but when I watch as I copy data onto my drbd
partition, the copy finishes almost immediately, and there is no activity
on the network or the second node's disks until after the copy completes
- it's as if I'm using protocol A.

Is there any way of finding out which protocol drbd thinks it's using
while it's loaded?

Thanks,
Dan

--
Dan Yocum, Sr. Linux Consultant
Linuxcare, Inc.
630.697.8066 tel
yocum@example.com, http://www.linuxcare.com

Linuxcare. Support for the revolution.
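
A quick way to see what the module itself reports is simply to read
/proc/drbd while drbd is loaded, either with cat or with the minimal
sketch below. Whether the protocol letter actually appears in that output
depends on the drbd version, so treat that part as an assumption.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/drbd", "r");
    char line[256];

    if (f == NULL) {
        perror("/proc/drbd");   /* drbd module probably not loaded */
        return 1;
    }
    /* dump whatever status drbd exposes, one line per device */
    while (fgets(line, sizeof(line), f) != NULL)
        fputs(line, stdout);
    fclose(f);
    return 0;
}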
Re: RAID 0, protocol C
On Tue, Dec 12, 2000 at 05:17:57PM -0600, Dan Yocum wrote:
> Hello all,
>
> I've made an interesting observation, and I'm not quite sure if I should
> be concerned or not:
>
> I've set up a new HA system: 2 Compaq ProLiants with SmartArray 3200
> hardware RAID controllers, 5 NICs, four 9GB 10K RPM SCSI drives, and 1GB
> RAM. I'm using protocol C, but when I watch as I copy data onto my drbd
> partition, the copy finishes almost immediately, and there is no activity
> on the network or the second node's disks until after the copy completes
> - it's as if I'm using protocol A.
>
> Is there any way of finding out which protocol drbd thinks it's using
> while it's loaded?

How much data are you copying? This looks (given the 1GB RAM) like all
the writes are waiting for a sync.
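
The effect is easy to demonstrate. Here is a minimal sketch (the filename
is an arbitrary example -- point it at a file on the drbd filesystem) that
times the write() calls against the final fsync(); with plenty of free RAM
the write()s return almost instantly and nearly all of the real I/O time
shows up in the fsync():

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";
    static char buf[1 << 16];
    double t0, t1, t2;
    int i, fd;

    memset(buf, 'x', sizeof(buf));
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror(path);
        return 1;
    }

    t0 = seconds();
    for (i = 0; i < 1024; i++)          /* 64MB of buffered writes */
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) {
            perror("write");
            return 1;
        }
    t1 = seconds();
    if (fsync(fd) != 0) {               /* now actually wait for the disk */
        perror("fsync");
        return 1;
    }
    t2 = seconds();

    printf("write() calls: %.2fs  fsync(): %.2fs\n", t1 - t0, t2 - t1);
    close(fd);
    return 0;
}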

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
why would you want to own /dev/null? "ooo! ooo! look! i stole nothing!
i'm the thief of nihilism! i'm the new god of zen monks."
Re: RAID 0, protocol C
David Gould wrote:
>
> On Tue, Dec 12, 2000 at 05:17:57PM -0600, Dan Yocum wrote:
> > Hello all,
> >
> > I've made an interesting observation, and I'm not quite sure if I should
> > be concerned or not:
> >
> > I've set up a new HA system: 2 Compaq ProLiants with SmartArray 3200
> > hardware RAID controllers, 5 NICs, four 9GB 10K RPM SCSI drives, and 1GB
> > RAM. I'm using protocol C, but when I watch as I copy data onto my drbd
> > partition, the copy finishes almost immediately, and there is no activity
> > on the network or the second node's disks until after the copy completes
> > - it's as if I'm using protocol A.
> >
> > Is there any way of finding out which protocol drbd thinks it's using
> > while it's loaded?
>
> How much data are you copying? This looks (given the 1GB RAM) like all
> the writes are waiting for a sync.


Ah. Hm. Good point. But, protocol C essentially has a 'sync' built
into it, right? I mean, that's the whole point of C: a write is not
complete until a block is received by the secondary node *and* written
to disk there.

OK, so let's mount the fs with the sync option...
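
(For reference, the sync mount option corresponds to the MS_SYNCHRONOUS
flag of mount(2). A minimal sketch of the same thing from C follows -- the
device and mount point are made-up examples; substitute your own drbd
device:)

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* MS_SYNCHRONOUS is what "mount -o sync" sets: every write to the
     * filesystem becomes synchronous.  /dev/nb0 and /mnt/drbd are
     * placeholders for your own device and mount point. */
    if (mount("/dev/nb0", "/mnt/drbd", "ext2", MS_SYNCHRONOUS, NULL) != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}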

Ugh. What a dog. Talk about slow... I mean it's painfully slow now.

FWIW, this is on a 2.2.16 kernel from Red Hat with no patches applied
(no Marcelo patches, no Andreas patches, no Alan patches, nuthin').

Dan

--
Dan Yocum, Sr. Linux Consultant
Linuxcare, Inc.
630.697.8066 tel
yocum@example.com, http://www.linuxcare.com

Linuxcare. Support for the revolution.
Re: RAID 0, protocol C
On Tue, Dec 12, 2000 at 08:17:23PM -0600, Dan Yocum wrote:
> David Gould wrote:
> > How much data are you copying? This looks (given the 1GB RAM) like all
> > the writes are waiting for a sync.
>
> Ah. Hm. Good point. But, protocol C essentially has a 'sync' built
> into it, right? I mean, that's the whole point of C: a write is not

Uh, no. The sync comes from the filesystem, not the block device.

> complete until a block is received by the secondary node *and* written
> to disk there.

Right. Well really, the write is not complete until both the local disk
I/O completes and the remote acknowledges completion too.

But the case I suspect you are seeing is that the writes are not _started_,
so completing them is not the issue.

-dg

> Ugh. What a dog. Talk about slow... I mean it's painfully slow, now.

Well, there is a reason for all that buffer cache stuff...

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
why would you want to own /dev/null? "ooo! ooo! look! i stole nothing!
i'm the thief of nihilism! i'm the new god of zen monks."
Re: RAID 0, protocol C
David,

David Gould wrote:
>
> On Tue, Dec 12, 2000 at 08:17:23PM -0600, Dan Yocum wrote:
> > David Gould wrote:
> > > How much data are you copying? This looks (given the 1GB RAM) like all
> > > the writes are waiting for a sync.

However, Philipp quoteth thusly in his Linux Kongress paper:

"With protocol C a write operation is considered complete when a
block-has-been-written acklowedgement[sic] from the standy system is
received. - This protocol can guarantee the transaction semantics in all
failure cases."


What I'm seeing is that the cp of a file to the drbd partition finishes
(i.e., the shell prompt comes back) on the primary system *long* before
the secondary even gets all the data! I don't like this. This implies,
to me, that the write has not finished. Am I wrong? I see this by
running 'watch cat /proc/drbd' as well as watching the packets fly over
the second NIC with netwatch. There's still a lot of data being
transferred after the prompt comes back.

What gets me is that I only see this on a SCSI hardware RAID system and
not on a simple, non-RAID IDE system.

> >
> > Ah. Hm. Good point. But, protocol C essentially has a 'sync' built
> > into it, right? I mean, that's the whole point of C: a write is not
>
> Uh, no. The sync comes from the filesystem, not the block device.
>
> > complete until a block is received by the secondary node *and* written
> > to disk there.
>
> Right. Well really, the write is not complete until both the local disk
> i/o completes and the remote acknowledges completion too.
>
> But the case I suspect you are seeing is that the writes are not _started_
> so completeing them is not the issue.

But then, with protocol C, the operation should block until it is
started and subsequently finished.

So, who is using DRBD protocol C successfully on a) a SCSI system, b) a
hardware RAID system, and b1) is it an IDE RAID or a SCSI RAID?

Thanks,
Dan


--
Dan Yocum, Sr. Linux Consultant
Linuxcare, Inc.
630.697.8066 tel
yocum@example.com, http://www.linuxcare.com

Linuxcare. Support for the revolution.
Re: RAID 0, protocol C
On Wed, Dec 13, 2000 at 11:08:40AM -0600, Dan Yocum wrote:

> However, Philipp quoteth thusly in his Linux Kongress paper:
>
> "With protocol C a write operation is considered complete when a
> block-has-been-written acklowedgement[sic] from the standy system is
> received. - This protocol can guarantee the transaction semantics in all
> failure cases."

Right, I am not disputing this. I am saying there is no conflict between
this and what you are seeing.

> What I'm seeing is that the cp of a file to the drbd partition finishes
> (i.e., the shell prompt comes back) on the primary system *long* before
> the secondary even gets all the data! I don't like this. This implies,
> to me, that the write has not finished. Am I wrong? I see this by

Yes, you are wrong. The shell prompt comes back long before the primary
gets all the data too. There is no deterministic relationship between
the writes and the completion of the shell command, or even of the write()
syscall.

> What gets me is that I only see this on a SCSI hardware RAID system and
> not on a simple, non-RAID IDE system.

Interesting, but suspicious. Are you completely sure? Did you have 1GB of
memory on the IDE system too?

> > But the case I suspect you are seeing is that the writes are not _started_,
> > so completing them is not the issue.
>
> But then, with protocol C, the operation should block until it is
> started and subsequently finished.

Not really, you are confusing two different instances of "the operation".
For example "cp 4kb_file /someotherdevice/4kb_file" is going to do:

- cp reads the input file, and opens the output file (skipping some details
  here ;-)
- cp makes the syscall write(fd, buf, 4096), which does approximately:
  - kernel allocates a page of memory and a descriptor that describes
    it as part of the file.
  - kernel copies the data from the user buffer to the new pagecache page.
  - kernel marks the page as "dirty".
  - syscall write() completes.
- cp exits.
- The cp "operation" is complete. Note that no block I/O has been requested
  or started or waited for or anything.

- time passes (and the dirty file page just sits in the cache).

- the kernel, for one of several reasons (a sync call, flushd running, trying
  to free up some memory...), decides that it wants to clean the dirty cached
  page:
  - kernel allocates an I/O request structure and fills it in pointing to
    the dirty page.
  - kernel marks the cached page as "in I/O".
  - kernel puts the I/O request on the device request queue.
- depending on some rather voodoo optimizations (plugging the device),
  the device may decide to wait for more requests or not ...
- but eventually, the device is unplugged and:
  - the driver takes the request from the queue.
  - the driver sets up the device with the request and tells it to start.
  - the driver marks the request as in progress and returns control to
    the kernel.

- time passes

- the device interrupts when it completes the request.
- the ISR clears the device and (skipping a lot of detail here) notifies
  the driver.
- the driver takes the request off the queue, checks for errors, etc.,
  then clears the dirty and in-I/O flags from the page.

At this point the request "operation" is complete. This is completely
asynchronous to the original write() system call.

The difference with drbd is that the drbd driver has to make a copy of the
request and uses one copy to send a notice of the I/O to the slave, and the
other copy to issue I/O to the lower device. Drbd also modifies the requests
so they report back to drbd rather than clearing the page dirty flags
directly.

With protocol A, drbd reports completion (marks the page clean)
as soon as the local I/O request reports the completion to drbd.

With protocol C, drbd waits for both the local I/O to complete and for a
packet from the slave that says it has completed there, before reporting
completion of the original request.
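
In rough pseudo-driver form (an illustrative sketch only -- the names here
are invented, not drbd's actual source), the difference looks like this:

#include <stdio.h>

enum proto { PROTO_A, PROTO_C };

struct drbd_req {
    enum proto protocol;
    int local_done;     /* local disk I/O has completed */
    int remote_done;    /* peer has acked (C: written on its disk) */
};

/* stand-in for clearing the page's dirty / in-I/O flags */
static void complete_request(struct drbd_req *req)
{
    (void) req;
    printf("original request completed\n");
}

/* called both when the local I/O finishes and when the
 * peer's packet arrives */
static void maybe_complete(struct drbd_req *req)
{
    if (req->protocol == PROTO_A) {
        /* protocol A: local completion alone is enough;
         * the packet to the peer is fire-and-forget */
        if (req->local_done)
            complete_request(req);
    } else {
        /* protocol C: need the local completion *and* the
         * peer's block-has-been-written ack */
        if (req->local_done && req->remote_done)
            complete_request(req);
    }
}

int main(void)
{
    struct drbd_req req = { PROTO_C, 1, 0 };

    maybe_complete(&req);   /* nothing yet: still waiting for the ack */
    req.remote_done = 1;
    maybe_complete(&req);   /* now the request completes */
    return 0;
}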

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
why would you want to own /dev/null? "ooo! ooo! look! i stole nothing!
i'm the thief of nihilism! i'm the new god of zen monks."
Re: RAID 0, protocol C
On Wed, 13 Dec 2000, David Gould wrote:

<snip>

> > What gets me is that I only see this on a SCSI hardware RAID system and
> > not on a simple, non-RAID IDE system.
>
> Interesting, but suspicious. Are you completely sure? Did you have 1GB of
> memory on the IDE system too?

With 2.2 kernel this is possible. See below.

> > > But the case I suspect you are seeing is that the writes are not _started_,
> > > so completing them is not the issue.
> >
> > But then, with protocol C, the operation should block until it is
> > started and subsequently finished.
>
> Not really, you are confusing two different instances of "the operation".
> For example "cp 4kb_file /someotherdevice/4kb_file" is going to do:
>
> - cp reads the input file, and opens the output file (skipping some details
>   here ;-)
> - cp makes the syscall write(fd, buf, 4096), which does approximately:
>   - kernel allocates a page of memory and a descriptor that describes
>     it as part of the file.
>   - kernel copies the data from the user buffer to the new pagecache page.
>   - kernel marks the page as "dirty".

On 2.2 writes do not go through the page cache (on 2.4 they do).

On 2.2, the kernel queues the IO requests during the write syscall. This does
not guarantee operation completion but it can be the reason why Dan
noticed a difference between SCSI and IDE -- the SCSI layer has "its own"
queueing code, which has a different behaviour from the generic queueing
code.
Re: RAID 0, protocol C
On Wed, Dec 13, 2000 at 09:54:00PM -0200, Marcelo Tosatti wrote:
>
> On Wed, 13 Dec 2000, David Gould wrote:
>
> <snip>
>
> > > What gets me is that I only see this on a SCSI hardware RAID system and
> > > not on a simple, non-RAID IDE system.
> >
> > Interesting, but suspicious. Are you completely sure? Did you have 1GB of
> > memory on the IDE system too?
>
> With 2.2 kernel this is possible. See below.
>
> > > > But the case I suspect you are seeing is that the writes are not _started_,
> > > > so completing them is not the issue.
> > >
> > > But then, with protocol C, the operation should block until it is
> > > started and subsequently finished.
> >
> > Not really, you are confusing two different instances of "the operation".
> > For example "cp 4kb_file /someotherdevice/4kb_file" is going to do:
> >
> > - cp reads the input file, and opens the output file (skipping some details
> >   here ;-)
> > - cp makes the syscall write(fd, buf, 4096), which does approximately:
> >   - kernel allocates a page of memory and a descriptor that describes
> >     it as part of the file.
> >   - kernel copies the data from the user buffer to the new pagecache page.
> >   - kernel marks the page as "dirty".
>
> On 2.2 writes do not go through the page cache (on 2.4 they do).

Yes, I was just trying to get the sense of things out without my mail
going over a million lines ;-)

> On 2.2, the kernel queues the IO requests during the write syscall. This does
> not guarantee operation completion but it can be the reason why Dan
> noticed a difference between SCSI and IDE -- the SCSI layer has "its own"
> queueing code, which has a different behaviour from the generic queueing
> code.

There are differences between IDE and SCSI wrt plugging, but I was sort of
hoping not to explain all that, especially since I don't think it matters
much to this case.

Also, the kernel in 2.2 does not queue the IO request in the write syscall.
It just puts the data in a buffer and marks it dirty. To be specific, I
just had a look at ext2_file_write() and as far as I can see, we only queue
any I/O (via ll_rw_block()) when we have to read in a partially updated
block, or in the O_SYNC case:

ext2_file_write()
...
        do {
                bh = ext2_getblk (inode, block, 1, &err);
                ...
                if (c > count)
                        c = count;
                new_buffer = (!buffer_uptodate(bh) && !buffer_locked(bh) &&
                              c == sb->s_blocksize);
                if (new_buffer) {
                        /* full-block write: just fill the buffer and
                           mark it uptodate -- no I/O queued here */
                        set_bit(BH_Lock, &bh->b_state);
                        c -= copy_from_user (bh->b_data + offset, buf, c);
                        ...
                        mark_buffer_uptodate(bh, 1);
                        unlock_buffer(bh);
                } else {
                        if (!buffer_uptodate(bh)) {
                                /* partial update of an uncached block:
                                   have to read it in first */
                                ll_rw_block (READ, 1, &bh);
                                wait_on_buffer (bh);
                                if (!buffer_uptodate(bh)) {
                                        brelse (bh);
                                        ...
                                        break;
                                }
                        }
                        c -= copy_from_user (bh->b_data + offset, buf, c);
                }
                ...
                if (filp->f_flags & O_SYNC)
                        bufferlist[buffercount++] = bh;
                else
                        brelse(bh);
                if (buffercount == NBUF) {
                        /* O_SYNC only: queue the writes and wait */
                        ll_rw_block(WRITE, buffercount, bufferlist);
                        for (i = 0; i < buffercount; i++) {
                                wait_on_buffer(bufferlist[i]);
                                ...
                                brelse(bufferlist[i]);
                        }
                        buffercount = 0;
                }
                ...
                block++;
                offset = 0;
                c = sb->s_blocksize;
        } while (count);
        if (buffercount) {
                /* flush any remaining O_SYNC buffers */
                ll_rw_block(WRITE, buffercount, bufferlist);
                for (i = 0; i < buffercount; i++) {
                        wait_on_buffer(bufferlist[i]);
                        ...
                        brelse(bufferlist[i]);
                }
        }

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
why would you want to own /dev/null? "ooo! ooo! look! i stole nothing!
i'm the thief of nihilism! i'm the new god of zen monks."
Re: RAID 0, protocol C
Meant to send this to the list, too.

Dan


--
Dan Yocum, Sr. Linux Consultant
Linuxcare, Inc.
630.697.8066 tel
yocum@example.com, http://www.linuxcare.com

Linuxcare. Support for the revolution.
Re: RAID 0, protocol C
David Gould wrote:
>
> On Thu, Dec 21, 2000 at 03:41:49PM -0600, Dan Yocum wrote:
> > David, et al.,
> >
> > Not to belabor the thread, but what if, on a high performance system
> > with lots of RAM (e.g., 1GB) and *really* fast SCSI RAID disk (e.g.,
> > Ultra160), we set nfract in bdflush_param to something rather small, say
> > 5 instead of 40, which is in linux/fs/buffer.c (can this be set
>
> nfract is the first field of /proc/sys/vm/bdflush.

Cool. Thanks.

>
> > somewhere in /proc instead?) This seems like it would reduce the
> > write-to-SEC-machine latency that I've been seeing, but what other
> > problems would it cause? How much of a performance hit would this
> > incur?
>
> Just to repeat myself, what you are seeing is the usual write buffering.
> It happens with or without drbd or a second machine.


Right. Sorry, that's what I meant - not what I wrote... I meant
write-to-disk "latency" (I just have SEC on the brain).


> That said, cutting down nfract and nref_dirty will make writes more likely
> to happen sooner. The downside is that some extra unnecessary writes will


Good, and I suspect that in this environment, extra writes can be
tolerated, within reason. Clearly, mounting the FS in sync mode on a
DRBD device is not acceptable, performance-wise.

> be done. Depending on workload this may be acceptable or it may be horrible.
> A horrible workload would be one where the same pages are written many times
> (ie, databases), or where many files are created, written, and then soon
> deleted.

I guess I'll just have to experiment with this to see - I'll let you
know.

>
> Just what problem are you trying to solve?

I'm trying to minimize the amount of data lost (i.e., the data that's
sitting in the buffer cache) when the primary node crashes. The more
often it writes to disk, the less data will be lost. Since the disks
are nice and fast, we might as well write to them as often as possible.
Make sense?

Dan

--
Dan Yocum, Sr. Linux Consultant
Linuxcare, Inc.
630.697.8066 tel
yocum@example.com, http://www.linuxcare.com

Linuxcare. Support for the revolution.
Re: RAID 0, protocol C
On Fri, Dec 22, 2000 at 10:12:19AM -0600, Dan Yocum wrote:
> > That said, cutting down nfract and nref_dirty will make writes more likely
> > to happen sooner. The downside is that some extra unnecessary writes will
>
> Good, and I suspect that in this environment, extra writes can be
> tolerated, within reason. Clearly, mounting the FS in sync mode on a
> DRBD device is not acceptable, performance-wise.

It would rarely be acceptable performance-wise on a local disk either.

> > be done. Depending on workload this may be acceptable or it may be horrible.
> > A horrible workload would be one where the same pages are written many times
> > (ie, databases), or where many files are created, written, and then soon
> > deleted.
>
> I guess I'll just have to experiment with this to see - I'll let you
> know.

Please do.

> I'm trying to minimize the amount of data lost (i.e., the data that's
> sitting in the buffer cache) when the primary node crashes. The more
> often it writes to disk, the less data will be lost. Since the disks
> are nice and fast, we might as well write to them as often as possible.
> Make sense?

Ah, well in that case you will want to cut down the bdflush times too;
currently bdflush looks for work every 5 seconds for metadata and every
30 seconds for user data. Change these to as small as you can stand; see
linux/Documentation/sysctl.txt for details. There are also some pagecache
tuning knobs that might help. One other thing: if you are writing dirty
pages frequently, you may want to cut down the number that get written at
one time, to smooth out the work a bit, but you will have to try it and
see.
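
For example, to see the current parameter line before tuning it (a minimal
sketch; the number and meaning of the fields is kernel dependent, so check
the sysctl.txt in your own tree before writing anything back):

#include <stdio.h>

int main(void)
{
    /* On 2.2, /proc/sys/vm/bdflush is a single line of numeric fields,
     * with nfract first.  To change a value, write the whole line back
     * with that field edited. */
    FILE *f = fopen("/proc/sys/vm/bdflush", "r");
    char line[256];

    if (f == NULL) {
        perror("/proc/sys/vm/bdflush");
        return 1;
    }
    if (fgets(line, sizeof(line), f) != NULL)
        fputs(line, stdout);
    fclose(f);
    return 0;
}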

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
why would you want to own /dev/null? "ooo! ooo! look! i stole nothing!
i'm the thief of nihilism! i'm the new god of zen monks."