Mailing List Archive

[PATCH -mm 0/10][RFC] aio: make struct kiocb private
This series is an attempt to generalize the async I/O paths to be
implementation agnostic. It completely eliminates knowledge of
the kiocb structure in the generic code and makes it private within the
current aio code. Things get noticeably cleaner without that layering
violation.

The new interface takes a file_endio_t function pointer, and a private data
pointer, which would normally be aio_complete and a kiocb pointer,
respectively. If the aio submission function gets back EIOCBQUEUED, that is
a guarantee that the endio function will be called, or *already has been
called*. If the file_endio_t pointer provided to aio_[read|write] is NULL,
the FS must block on I/O completion, then return either the number of bytes
read, or an error.

I had to touch more areas that I had originally expected, so there are
changes in a corner of the socket code, and a slight behavior change in the
direct-io completion path with affects XFS and OCFS2. I would appreciate
further review there, so I copied some extra people I hope can help.

This patch is against 2.6.20-rc4-mm1. It has been compile-tested at each
stage. It needs some runtime testing yet, but I prefer to get it out for
commentary and test later.

These patches are for RFC only and have not yet been signed off.

NATE

---

Documentation/filesystems/Locking | 11 +
Documentation/filesystems/vfs.txt | 11 +
arch/s390/hypfs/inode.c | 16 +-
drivers/net/pppoe.c | 8 -
drivers/net/tun.c | 13 +-
drivers/usb/gadget/inode.c | 239 +-------------------------------------
fs/aio.c | 74 ++++++-----
fs/bad_inode.c | 10 -
fs/block_dev.c | 109 +++++++++++------
fs/cifs/cifsfs.c | 10 -
fs/compat.c | 56 --------
fs/direct-io.c | 92 ++++++++------
fs/ecryptfs/file.c | 16 +-
fs/ext2/inode.c | 12 -
fs/ext3/file.c | 9 -
fs/ext3/inode.c | 11 -
fs/ext4/file.c | 9 -
fs/ext4/inode.c | 11 -
fs/fat/inode.c | 12 -
fs/fuse/dev.c | 13 +-
fs/gfs2/ops_address.c | 14 +-
fs/hfs/inode.c | 13 --
fs/hfsplus/inode.c | 13 --
fs/jfs/inode.c | 12 -
fs/nfs/direct.c | 92 +++++++-------
fs/nfs/file.c | 62 +++++----
fs/ntfs/file.c | 71 ++---------
fs/ocfs2/aops.c | 24 +--
fs/ocfs2/aops.h | 8 -
fs/ocfs2/file.c | 44 +++---
fs/ocfs2/inode.h | 2
fs/pipe.c | 12 -
fs/read_write.c | 225 ++++++++++++-----------------------
fs/read_write.h | 8 -
fs/reiserfs/inode.c | 13 --
fs/smbfs/file.c | 28 ++--
fs/udf/file.c | 13 +-
fs/xfs/linux-2.6/xfs_aops.c | 44 +++---
fs/xfs/linux-2.6/xfs_file.c | 58 +++++----
fs/xfs/linux-2.6/xfs_lrw.c | 29 ++--
fs/xfs/linux-2.6/xfs_lrw.h | 10 -
fs/xfs/linux-2.6/xfs_vnode.h | 20 +--
include/linux/aio.h | 11 -
include/linux/fs.h | 114 +++++++++---------
include/linux/net.h | 18 +-
include/linux/nfs_fs.h | 12 -
include/net/bluetooth/bluetooth.h | 2
include/net/inet_common.h | 3
include/net/scm.h | 2
include/net/sock.h | 45 +------
include/net/tcp.h | 6
include/net/udp.h | 3
mm/filemap.c | 109 ++++++++---------
net/appletalk/ddp.c | 5
net/atm/common.c | 6
net/atm/common.h | 7 -
net/ax25/af_ax25.c | 7 -
net/bluetooth/af_bluetooth.c | 4
net/bluetooth/hci_sock.c | 7 -
net/bluetooth/l2cap.c | 2
net/bluetooth/rfcomm/sock.c | 8 -
net/bluetooth/sco.c | 3
net/core/sock.c | 12 -
net/dccp/dccp.h | 8 -
net/dccp/probe.c | 3
net/dccp/proto.c | 7 -
net/decnet/af_decnet.c | 7 -
net/econet/af_econet.c | 7 -
net/ipv4/af_inet.c | 5
net/ipv4/raw.c | 8 -
net/ipv4/tcp.c | 7 -
net/ipv4/tcp_probe.c | 3
net/ipv4/udp.c | 9 -
net/ipv4/udp_impl.h | 2
net/ipv6/raw.c | 6
net/ipv6/udp.c | 10 -
net/ipv6/udp_impl.h | 6
net/ipx/af_ipx.c | 7 -
net/irda/af_irda.c | 29 ++--
net/key/af_key.c | 6
net/llc/af_llc.c | 7 -
net/netlink/af_netlink.c | 24 +--
net/netrom/af_netrom.c | 7 -
net/packet/af_packet.c | 11 -
net/rose/af_rose.c | 7 -
net/sctp/socket.c | 9 -
net/socket.c | 199 +++++++++----------------------
net/tipc/socket.c | 28 +---
net/unix/af_unix.c | 116 +++++++-----------
net/wanrouter/af_wanpipe.c | 7 -
net/x25/af_x25.c | 6
sound/core/pcm_native.c | 15 +-
92 files changed, 1009 insertions(+), 1500 deletions(-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private [ In reply to ]
On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote:
> This series is an attempt to generalize the async I/O paths to be
> implementation agnostic. It completely eliminates knowledge of
> the kiocb structure in the generic code and makes it private within the
> current aio code. Things get noticeably cleaner without that layering
> violation.
>
> The new interface takes a file_endio_t function pointer, and a private data
> pointer, which would normally be aio_complete and a kiocb pointer,
> respectively. If the aio submission function gets back EIOCBQUEUED, that is
> a guarantee that the endio function will be called, or *already has been
> called*. If the file_endio_t pointer provided to aio_[read|write] is NULL,
> the FS must block on I/O completion, then return either the number of bytes
> read, or an error.

I don't really like this patchet at all. At some point it's a lot nicer
to have a lot of paramaters that are related and passed down a long
callchain into a structure, and I think the aio code is over that threshold.
The completion function cleanups look okay to me, but I'd rather add
that completion function to struct kiocb instead of removing kiocb use.

I have this slight feeling you want to use this completions for something
else than the current aio code, if that's the case it would help
if you could explain briefly in what direction your heading.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private [ In reply to ]
On 1/15/07, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote:
> > This series is an attempt to generalize the async I/O paths to be
> > implementation agnostic. It completely eliminates knowledge of
> > the kiocb structure in the generic code and makes it private within the
> > current aio code. Things get noticeably cleaner without that layering
> > violation.
> >
> > The new interface takes a file_endio_t function pointer, and a private data
> > pointer, which would normally be aio_complete and a kiocb pointer,
> > respectively. If the aio submission function gets back EIOCBQUEUED, that is
> > a guarantee that the endio function will be called, or *already has been
> > called*. If the file_endio_t pointer provided to aio_[read|write] is NULL,
> > the FS must block on I/O completion, then return either the number of bytes
> > read, or an error.
>
> I don't really like this patchet at all. At some point it's a lot nicer
> to have a lot of paramaters that are related and passed down a long
> callchain into a structure, and I think the aio code is over that threshold.
> The completion function cleanups look okay to me, but I'd rather add
> that completion function to struct kiocb instead of removing kiocb use.
>
> I have this slight feeling you want to use this completions for something
> else than the current aio code, if that's the case it would help
> if you could explain briefly in what direction your heading.

Actually I agree with you more than you might think. I had intended
this to mesh with your struct iodesc idea, where iodesc would contain
the iovec pointer, nr_segs, iov_length, and whatever else needs to be
there, potentially even the endio function and its private data, tying
those to the iovec instead of a separate structure that needs to be
kept in sync. There's a distinct layering that should exist between
things that should accompany the iovec transparently, and private data
that should be attached opaquely by layers above.

The biggest thing I have in mind for this patch, actually, is to fix
up the *sync* paths. I don't think we should be waiting on sync I/O
at the *top* of the call stack, like with wait_on_sync_kiocb(), I'd
say the best place to wait is at the *bottom*, down in the I/O
scheduler. This would make it a lot easier to clean up the completion
paths, because in the sync case, you'd be right back in process
context again as you traverse upward through the RAID, encryption,
loopback, directIO, FS log commit, etc. It doesn't by itself
eliminate the need for all the threads and workqueues and such that
those layers each own, but it is a step in the right direction.

Now if you want to talk about long-term vaporware style ideas, yeah, I
do have my own thoughts on how aio should work. And from Agami's
perspective, this patch also makes it easier for us to do certain
debugging traces that we wish to hack together, in order to profile
performance on our platform. But I'd be hesitant to make those
arguments, cause they are largely irrelevant (we can obviously carry
the patch for debugging without buy-in from the community). This is
the right thing to do from a design perspective. Hopefully it enables
a new architecture that can reduce context switches in I/O completion,
and reduce overhead. That's the real motive ;)

NATE
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private [ In reply to ]
On Monday 15 January 2007 8:25 pm, Nate Diller wrote:

> I don't think we should be waiting on sync I/O
> at the *top* of the call stack, like with wait_on_sync_kiocb(), I'd
> say the best place to wait is at the *bottom*, down in the I/O
> scheduler.

Erm ... *what* I/O scheduler? These I/O requests may go directly
to the end of the hardware I/O queue, which already has an I/O model
where each request can correspond directly to a KIOCB. And which
does not include any synchronous primitives.

No such scheduler has previously been, or _should_ be, required.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private [ In reply to ]
On Mon, Jan 15, 2007 at 08:25:15PM -0800, Nate Diller wrote:
> the right thing to do from a design perspective. Hopefully it enables
> a new architecture that can reduce context switches in I/O completion,
> and reduce overhead. That's the real motive ;)

And it's a broken motive. Context switches per se are not bad, as they
make it possible to properly schedule code in a busy system (which is
*very* important when realtime concerns come into play). Have a look
at how things were done in the 2.4 aio code to see how completion would
get done with a non-retry method, typically in interrupt context. I had
code that did direct I/O rather differently by sharing code with the
read/write code paths at some point, the catch being that it was pretty
invasive, which meant that it never got merged with the changes to handle
writeback pressure and other work that happened during 2.5.

That said, you can't make kiocb private without completely removing the
ability of the rest of the kernel to complete an aio sanely from irq context.
You need some form of i/o descriptor, and a kiocb is just that. Adding more
layering is just going to make things messier and slower for no real gain.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private [ In reply to ]
On Wed, 17 Jan 2007, Benjamin LaHaise wrote:

> On Mon, Jan 15, 2007 at 08:25:15PM -0800, Nate Diller wrote:
>> the right thing to do from a design perspective. Hopefully it enables
>> a new architecture that can reduce context switches in I/O completion,
>> and reduce overhead. That's the real motive ;)
>
> And it's a broken motive. Context switches per se are not bad, as they
> make it possible to properly schedule code in a busy system (which is
> *very* important when realtime concerns come into play). Have a look
> at how things were done in the 2.4 aio code to see how completion would
> get done with a non-retry method, typically in interrupt context. I had
> code that did direct I/O rather differently by sharing code with the
> read/write code paths at some point, the catch being that it was pretty
> invasive, which meant that it never got merged with the changes to handle
> writeback pressure and other work that happened during 2.5.

I'm having some trouble understanding your concern. From my perspective,
any unnecessary context switch represents not only performance loss, but
extra complexity in the code. In this case, I'm not suggesting that the
aio.c code causes problems, quite the opposite. The code I'd like to change
is FS and md levels, where context switches happen because of timers,
workqueues, and worker threads. For sync I/O, these layers could be doing
their completion work in process context, but because waiting on sync I/O is
done in layers above, they must resort to other means, even for the common
case. The dm-crypt module is the most straightforward example.

I took a look at some 2.4.18 aio patches in kernel.org/.../bcrl/aio/, and if
I understand what you did, you were basically operating at the aops level
rather than f_ops. I actually like that idea, it's nicer than having the
direct-io code do its work seperately from the aio code. Part of where I'm
going with this patch is a better integration between the block layer
(make_request), page layer (aops), and FS layer (f_ops), particularly in the
completion paths. The direct-io code is an improvement over the common code
on that point, do_readahead() and friends all wait on individual pages to
become uptodate. I'd like to bring some improvements from the directIO
architecture into use in the common case, which I hope will help
performance.

I know that might seem somewhat unrelated, but I don't think it is. This
change goes hand in hand with using completion handlers in the aops. That
will link together the completion callback in the bio with the aio callback,
so that the whole stack can finish its work in one context.

> That said, you can't make kiocb private without completely removing the
> ability of the rest of the kernel to complete an aio sanely from irq context.
> You need some form of i/o descriptor, and a kiocb is just that. Adding more
> layering is just going to make things messier and slower for no real gain.

This patchset does not change how or when I/O completion happens,
aio_complete() will still get called from direct-io.c, nfs-direct.c, et al.
The iocb structure is still passed to aio_complete, just like before. The
only difference is that the lower level code doesn't know that it's got an
iocb, all it sees is an opaque cookie. It's more like enforcing a layer
that's already in place, and I think things got simpler rather than messier.
Whether things are slower or not remains to be seen, but I expect no
measurable changes either way with this patch.

I'm releasing a new version of the patch soon, it will use a new iodesc
structure to keep track of iovec state, which simplifies things further. It
also will have a new version of the usb gadget code, and some general
cleanups. I hope you'll take a look at it.

NATE
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/