Mailing List Archive

[patch 10/21] buffer heads: Support slab defrag
Defragmentation support for buffer heads. We convert the references to
buffers to struct page references and try to remove the buffers from
those pages. If the pages are dirty then trigger writeout so that the
buffer heads can be removed later.

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/buffer.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2008-05-07 20:27:15.182659486 -0700
+++ linux-2.6/fs/buffer.c 2008-05-07 20:29:13.052102980 -0700
@@ -3255,6 +3255,104 @@ int bh_submit_read(struct buffer_head *b
}
EXPORT_SYMBOL(bh_submit_read);

+/*
+ * Writeback a page to clean the dirty state
+ */
+static void trigger_write(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+	int rc;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = 1,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 0
+	};
+
+	if (!mapping->a_ops->writepage)
+		/* No write method for the address space */
+		return;
+
+	if (!clear_page_dirty_for_io(page))
+		/* Someone else already triggered a write */
+		return;
+
+	rc = mapping->a_ops->writepage(page, &wbc);
+	if (rc < 0)
+		/* I/O Error writing */
+		return;
+
+	if (rc == AOP_WRITEPAGE_ACTIVATE)
+		unlock_page(page);
+}
+
+/*
+ * Get references on buffers.
+ *
+ * We obtain references on the page that uses the buffer. v[i] will point to
+ * the corresponding page after get_buffers() is through.
+ *
+ * We are safe from the underlying page being removed simply by doing
+ * a get_page_unless_zero. The buffer head removal may race at will.
+ * try_to_free_buffers will later take appropriate locks to remove the
+ * buffers if they are still there.
+ */
+static void *get_buffers(struct kmem_cache *s, int nr, void **v)
+{
+	struct page *page;
+	struct buffer_head *bh;
+	int i, j;
+	int n = 0;
+
+	for (i = 0; i < nr; i++) {
+		bh = v[i];
+		v[i] = NULL;
+
+		page = bh->b_page;
+
+		if (page && PagePrivate(page)) {
+			for (j = 0; j < n; j++)
+				if (page == v[j])
+					continue;
+		}
+
+		if (get_page_unless_zero(page))
+			v[n++] = page;
+	}
+	return NULL;
+}
+
+/*
+ * Despite its name: kick_buffers operates on a list of pointers to
+ * page structs that was set up by get_buffers().
+ */
+static void kick_buffers(struct kmem_cache *s, int nr, void **v,
+							void *private)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		page = v[i];
+
+		if (!page || PageWriteback(page))
+			continue;
+
+		if (!TestSetPageLocked(page)) {
+			if (PageDirty(page))
+				trigger_write(page);
+			else {
+				if (PagePrivate(page))
+					try_to_free_buffers(page);
+				unlock_page(page);
+			}
+		}
+		put_page(page);
+	}
+}
+
+
static void
init_buffer_head(struct kmem_cache *cachep, void *data)
{
@@ -3273,6 +3371,7 @@ void __init buffer_init(void)
(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
SLAB_MEM_SPREAD),
init_buffer_head);
+ kmem_cache_setup_defrag(bh_cachep, get_buffers, kick_buffers);

/*
* Limit the bh occupancy to 10% of ZONE_NORMAL
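
A note on the duplicate check in get_buffers() in the patch: in C, `continue` binds to the innermost enclosing loop, so the `continue` inside the `for (j ...)` loop only advances j and does not skip the outer iteration. A page that is already in v[] therefore still gets another reference and another slot. The userspace sketch below shows one way to write the check, assuming the intent was to record each page once. `struct page`, `model_get_page_unless_zero()` and `collect_unique_pages()` are stand-ins invented for this illustration, not kernel code, and the input array is assumed to already hold the page of each buffer.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct page; only a reference count is modeled. */
struct page {
	int refcount;
};

/* Model of get_page_unless_zero(): take a reference unless the count is 0. */
static int model_get_page_unless_zero(struct page *page)
{
	if (page->refcount == 0)
		return 0;
	page->refcount++;
	return 1;
}

/*
 * Compact v[0..nr) in place so that it holds each live page exactly once.
 * Returns the number of pages stored; all remaining slots are NULL.
 */
static int collect_unique_pages(struct page **v, int nr)
{
	int i, j, n = 0;

	for (i = 0; i < nr; i++) {
		struct page *page = v[i];
		int seen = 0;

		v[i] = NULL;
		if (!page)
			continue;

		/* Look for the page among the n entries stored so far. */
		for (j = 0; j < n; j++) {
			if (v[j] == page) {
				seen = 1;
				break;
			}
		}
		if (seen)
			continue;	/* this one skips the outer iteration */

		if (model_get_page_unless_zero(page))
			v[n++] = page;
	}
	return n;
}
```

With a flag and a `continue` in the outer loop body, each page ends up with exactly one extra reference, which matches the single put_page() per slot in kick_buffers().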

--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 10/21] buffer heads: Support slab defrag
On Fri, May 09, 2008 at 08:08:41PM -0700, Christoph Lameter wrote:
> Defragmentation support for buffer heads. We convert the references to
> buffers to struct page references and try to remove the buffers from
> those pages. If the pages are dirty then trigger writeout so that the
> buffer heads can be removed later.

Oh, no, please don't trigger more random single page writeback from
memory reclaim. We should be killing the VM's use of ->writepage,
not encouraging it.

If you are going to clean bufferheads (or pages), please clean entire
mappings via ->writepages as it leads to far superior I/O patterns
and a far higher aggregate rate of page cleaning.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Mon, 12 May 2008, David Chinner wrote:

> If you are going to clean bufferheads (or pages), please clean entire
> mappings via ->writepages as it leads to far superior I/O patterns
> and a far higher aggregate rate of page cleaning.....

That brings up another issue: Let's say I use writepages on a large file
(couple of gig). How much do you want to write back?

--
Re: [patch 10/21] buffer heads: Support slab defrag
On Thu, May 15, 2008 at 10:42:15AM -0700, Christoph Lameter wrote:
> On Mon, 12 May 2008, David Chinner wrote:
>
> > If you are going to clean bufferheads (or pages), please clean entire
> > mappings via ->writepages as it leads to far superior I/O patterns
> > and a far higher aggregate rate of page cleaning.....
>
> That brings up another issue: Let's say I use writepages on a large file
> (couple of gig). How much do you want to write back?

We're out of memory. I'd suggest writing back as much as you can
without blocking. e.g. treat it like pdflush and say 1024 pages, or
like balance_dirty_pages() and write a 'write_chunk' back from the
mapping (i.e. sync_writeback_pages()).

Any of these are better from an I/O perspective than single page
writeback....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Fri, 16 May 2008, David Chinner wrote:

> On Thu, May 15, 2008 at 10:42:15AM -0700, Christoph Lameter wrote:
> > On Mon, 12 May 2008, David Chinner wrote:
> >
> > > If you are going to clean bufferheads (or pages), please clean entire
> > > mappings via ->writepages as it leads to far superior I/O patterns
> > > and a far higher aggregate rate of page cleaning.....
> >
> > That brings up another issue: Let's say I use writepages on a large file
> > (couple of gig). How much do you want to write back?
>
> We're out of memory. I'd suggest writing back as much as you can
> without blocking. e.g. treat it like pdflush and say 1024 pages, or
> like balance_dirty_pages() and write a 'write_chunk' back from the
> mapping (i.e. sync_writeback_pages()).

Why are we out of memory? How do you trigger such a special writeout?

> Any of these are better from an I/O perspective than single page
> writeback....

But then the filesystem can do tricks like writing out the surrounding areas
as needed. The filesystem likely can estimate better how much writeout
makes sense.

--
Re: [patch 10/21] buffer heads: Support slab defrag
On Fri, May 16, 2008 at 10:01:38AM -0700, Christoph Lameter wrote:
> On Fri, 16 May 2008, David Chinner wrote:
>
> > On Thu, May 15, 2008 at 10:42:15AM -0700, Christoph Lameter wrote:
> > > On Mon, 12 May 2008, David Chinner wrote:
> > >
> > > > If you are going to clean bufferheads (or pages), please clean entire
> > > > mappings via ->writepages as it leads to far superior I/O patterns
> > > > and a far higher aggregate rate of page cleaning.....
> > >
> > > That brings up another issue: Let's say I use writepages on a large file
> > > (couple of gig). How much do you want to write back?
> >
> > We're out of memory. I'd suggest writing back as much as you can
> > without blocking. e.g. treat it like pdflush and say 1024 pages, or
> > like balance_dirty_pages() and write a 'write_chunk' back from the
> > mapping (i.e. sync_writeback_pages()).
>
> Why are we out of memory?

Defragmentation is triggered as part of the usual memory reclaim
process. Which implies we've run out of free memory, correct?

> How do you trigger such a special writeout?

filemap_fdatawrite_range() perhaps?

> > Any of these are better from an I/O perspective than single page
> > writeback....
>
> But then the filesystem can do tricks like writing out the surrounding areas
> as needed. The filesystem likely can estimate better how much writeout
> makes sense.

Pushing write-around into a method that is only supposed to write
the single page that is passed to it is a pretty bad abuse of the
API. Especially as we have many simple, ranged writeback methods
you could call. filemap_fdatawrite_range(), do_writepages(),
->writepages, etc.

FWIW, look at the mess of layering violations that write clustering
causes in XFS because we have to do this to keep allocation overhead
and fragmentation down to a minimum. It's a nasty hack to mitigate
the impact of the awful I/O patterns we see from the VM - suggesting
that all filesystems do this just so you don't have to call a
slightly smarter writeback primitive is insane....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Mon, 19 May 2008, David Chinner wrote:

> Defragmentation is triggered as part of the usual memory reclaim
> process. Which implies we've run out of free memory, correct?

Yes but we have already reclaimed some memory.

> > How do you trigger such a special writeout?
>
> filemap_fdatawrite_range() perhaps?

Could you provide me such a patch? I would not know how much to write out.
If we had such a method then we could also use that for the swap case
where we also write out single pages?
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Mon, May 19, 2008 at 09:44:11AM -0700, Christoph Lameter wrote:
> On Mon, 19 May 2008, David Chinner wrote:
>
> > Defragmentation is triggered as part of the usual memory reclaim
> > process. Which implies we've run out of free memory, correct?
>
> Yes but we have already reclaimed some memory.
>
> > > How do you trigger such a special writeout?
> >
> > filemap_fdatawrite_range() perhaps?
>
> Could you provide me such a patch? I would not know how much to write out.
> If we had such a method then we could also use that for the swap case
> where we also write out single pages?

How hard is it? I don't have time right now to do this, but it's essentially:

mapping = page->mapping;
......
- mapping->a_ops->writepage();
+ filemap_fdatawrite_range(mapping, start, end);

Where [start,end] spans page->index and is large enough
to get a substantial sized I/O to disk (say at least SWAP_CLUSTER_MAX
pages, preferably larger for 4k page size machines).
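
As a rough illustration of that range calculation (not from the original mail), the userspace sketch below picks a cluster-aligned window of pages that contains a given page index and converts it to byte offsets, since filemap_fdatawrite_range() takes a start and an inclusive end in bytes rather than page indexes. The cluster size of 32 pages (standing in for SWAP_CLUSTER_MAX), the 4k page size, the alignment policy and the names `struct wb_window` and `writeback_window()` are all assumptions made for the example.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT	12	/* assumption: 4k pages */
#define MODEL_CLUSTER_PAGES	32	/* assumption: stands in for SWAP_CLUSTER_MAX */

/* A writeback window, both as page indexes and as byte offsets. */
struct wb_window {
	uint64_t first_index;	/* first page index in the window */
	uint64_t last_index;	/* last page index in the window (inclusive) */
	int64_t start;		/* byte offset of the first byte */
	int64_t end;		/* byte offset of the last byte (inclusive) */
};

/* Return the cluster-aligned window that contains page 'index'. */
static struct wb_window writeback_window(uint64_t index)
{
	struct wb_window w;

	w.first_index = index - (index % MODEL_CLUSTER_PAGES);
	w.last_index = w.first_index + MODEL_CLUSTER_PAGES - 1;
	w.start = (int64_t)(w.first_index << MODEL_PAGE_SHIFT);
	w.end = (int64_t)(((w.last_index + 1) << MODEL_PAGE_SHIFT) - 1);
	return w;
}
```

In kernel code the result would presumably be used as filemap_fdatawrite_range(mapping, w.start, w.end). Clamping the window against the file size is left out here.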

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Tue, May 20, 2008 at 10:25:03AM +1000, David Chinner (dgc@sgi.com) wrote:
> + filemap_fdatawrite_range(mapping, start, end);
>
> Where [start,end] spans page->index and is large enough
> to get a substantial sized I/O to disk (say at least SWAP_CLUSTER_MAX
> pages, preferably larger for 4k page size machines).

Or just sync_inode().

--
Evgeniy Polyakov
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Tue, May 20, 2008 at 10:56:23AM +0400, Evgeniy Polyakov wrote:
> On Tue, May 20, 2008 at 10:25:03AM +1000, David Chinner (dgc@sgi.com) wrote:
> > + filemap_fdatawrite_range(mapping, start, end);
> >
> > Where [start,end] spans page->index and is large enough
> > to get a substantial sized I/O to disk (say at least SWAP_CLUSTER_MAX
> > pages, preferably larger for 4k page size machines).
>
> Or just sync_inode().

Oh, god no. Let's not put the inode_lock right at the top of
the VM page cleaning path. We don't need to modify inode state,
the superblock dirty lists, etc - all we need to do is write
dirty pages on a given mapping in a more efficient manner.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Wed, May 21, 2008 at 07:46:17AM +1000, David Chinner (dgc@sgi.com) wrote:
> Oh, god no. Let's not put the inode_lock right at the top of
> the VM page cleaning path. We don't need to modify inode state,
> the superblock dirty lists, etc - all we need to do is write
> dirty pages on a given mapping in a more efficient manner.

I'm not advocating that, but having swap on reclaim does not hurt
anyone, this is essentially the same, but with different underlying
storage. System will do that anyway sooner or later during usual
writeback, which in turn can be a result of the same reclaim...

--
Evgeniy Polyakov
--
Re: [patch 10/21] buffer heads: Support slab defrag
David Chinner wrote:
> > Why are we out of memory?
>
> Defragmentation is triggered as part of the usual memory reclaim
> process. Which implies we've run out of free memory, correct?

I don't think that's true on no-MMU. Defragmentation can be needed
often on no-MMU when there's lots of free memory, just in the wrong
places.

-- Jamie
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Wed, May 21, 2008 at 02:25:05AM +0400, Evgeniy Polyakov wrote:
> On Wed, May 21, 2008 at 07:46:17AM +1000, David Chinner (dgc@sgi.com) wrote:
> > Oh, god no. Let's not put the inode_lock right at the top of the VM page
> > cleaning path. We don't need to modify inode state, the superblock dirty
> > lists, etc - all we need to do is write dirty pages on a given mapping in
> > a more efficient manner.
>
> I'm not advocating that, but having swap on reclaim does not hurt anyone,
> this is essentially the same, but with different underlying storage.

Sure. But my point is simply that sync_inode() is far too
heavy-weight to be used in a reclaim context. The fact that it holds
the inode_lock will interfere with normal writeback via pdflush and
that could potentially slow down writeback even more.

e.g. think of kswapd threads running on 20 nodes of a NUMA machine
all at once writing back dirty memory (yes, it happens). If we use
sync_inode() to write back dirty mappings we would then have at
least 20 CPUs serialising on the inode_lock trying to write back
pages. If we instead use a thin wrapper around ->writepages() then
they can all run in parallel through the filesystem(s), block
devices, etc rather than being serialised at the highest possible
layer....

> System
> will do that anyway sooner or later during usual writeback, which in turn
> can be a result of the same reclaim...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Wed, May 21, 2008 at 02:25:05AM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > Oh, god no. Let's not put the inode_lock right at the top of
> > the VM page cleaning path. We don't need to modify inode state,
> > the superblock dirty lists, etc - all we need to do is write
> > dirty pages on a given mapping in a more efficient manner.
>
> I'm not advocating that, but having swap on reclaim does not hurt
> anyone, this is essentially the same, but with different underlying
> storage. System will do that anyway sooner or later during usual
> writeback, which in turn can be a result of the same reclaim...

And actually having tiny operations under inode_lock is the last thing
to worry about when we are about to start writing pages to disk because
memory is so fragmented that we need to move things around.

That is the simplest from the typing viewpoint; one can also do
something like this:

    struct address_space *mapping = page->mapping;
    struct backing_dev_info *bdi = mapping->backing_dev_info;
    struct writeback_control wbc = {
        .bdi = bdi,
        .sync_mode = WB_SYNC_ALL, /* likely we want to wait... */
        .older_than_this = NULL,
        .nr_to_write = 13,
        .range_cyclic = 0,
        .range_start = start_index,
        .range_end = end_index
    };

    do_writepages(mapping, &wbc);

Christoph, is this the example you wanted to check out? It will only try to
write .nr_to_write pages between .range_start and .range_end without
syncing inode info itself.
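
To make the described behaviour concrete, here is a toy userspace model (not kernel code) of a ranged writeback pass with a page budget: it walks a range and "writes" at most nr_to_write dirty pages. The names `struct model_wbc` and `model_writepages()` are hypothetical, and the range is given in page indexes for simplicity, whereas the range_start and range_end fields of the kernel's writeback_control are byte offsets.

```c
#include <stddef.h>

/* Toy stand-in for the relevant writeback_control fields. */
struct model_wbc {
	long nr_to_write;	/* budget: decremented for each page written */
	size_t range_start;	/* first page index (the kernel uses byte offsets) */
	size_t range_end;	/* last page index, inclusive */
};

/*
 * Walk dirty[] from range_start to range_end and "write" (clear) dirty
 * pages until the budget is used up. Returns the number of pages written.
 */
static long model_writepages(unsigned char *dirty, size_t nr_pages,
			     struct model_wbc *wbc)
{
	long written = 0;
	size_t i;

	for (i = wbc->range_start; i <= wbc->range_end && i < nr_pages; i++) {
		if (wbc->nr_to_write <= 0)
			break;		/* budget exhausted */
		if (!dirty[i])
			continue;	/* clean pages cost nothing */
		dirty[i] = 0;
		wbc->nr_to_write--;
		written++;
	}
	return written;
}
```

Pages outside the range and dirty pages beyond the budget stay dirty, which is the property the mail relies on: the amount of I/O issued per call is bounded.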

--
Evgeniy Polyakov
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Wed, 21 May 2008 09:19:42 +1000 David Chinner <dgc@sgi.com> wrote:

> sync_inode() is far too
> heavy-weight to be used in a reclaim context

It's more than efficiency. There are lots and lots of things we cannot
do in direct-reclaim context.

a) Can't lock pages (well we kinda sorta could, but generally code
will just trylock)

b) Cannot rely on the inode or the address_space being present in
memory after we have unlocked the page.

c) Cannot run iput(). Or at least, we couldn't five or six years
ago. afaik nobody has investigated whether the situation is now
better or worse.

d) lots of deadlock scenarios - need to test __GFP_FS basically everywhere
in which you share code with normal writeback paths.

Plus e), f), g) and h). Direct-reclaim is a hostile environment.
Things like b) are a real killer - nasty, subtle, rare,
memory-pressure-dependent crashes.

--
Re: [patch 10/21] buffer heads: Support slab defrag
On Wed, May 21, 2008 at 03:22:56AM +0400, Evgeniy Polyakov wrote:
> On Wed, May 21, 2008 at 02:25:05AM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > > Oh, god no. Let's not put the inode_lock right at the top of
> > > the VM page cleaning path. We don't need to modify inode state,
> > > the superblock dirty lists, etc - all we need to do is write
> > > dirty pages on a given mapping in a more efficient manner.
> >
> > I'm not advocating that, but having swap on reclaim does not hurt
> > anyone, this is essentially the same, but with different underlying
> > storage. System will do that anyway sooner or later during usual
> > writeback, which in turn can be a result of the same reclaim...
>
> And actually having tiny operations under inode_lock is the last thing
> to worry about when we are about to start writing pages to disk because
> memory is so fragmented that we need to move things around.
>
> That is the simplest from the typing viewpoint; one can also do
> something like this:
>
>     struct address_space *mapping = page->mapping;
>     struct backing_dev_info *bdi = mapping->backing_dev_info;
>     struct writeback_control wbc = {
>         .bdi = bdi,
>         .sync_mode = WB_SYNC_ALL, /* likely we want to wait... */
>         .older_than_this = NULL,
>         .nr_to_write = 13,
>         .range_cyclic = 0,
>         .range_start = start_index,
>         .range_end = end_index
>     };
>
>     do_writepages(mapping, &wbc);

Which is the exact implementation of

filemap_fdatawrite_range(mapping, start, end);

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Wed, 21 May 2008, Evgeniy Polyakov wrote:

> Christoph, is this the example you wanted to check out? It will only try to
> write .nr_to_write pages between .range_start and .range_end without
> syncing inode info itself.

Well that is what Dave wants. I'd rather go the safe route for now and
defer this until later. I think you are much more an expert on the
filesystems and I/O paths than I am. So I'd rather take my hands off as
soon as possible.

--
Re: [patch 10/21] buffer heads: Support slab defrag
On Tue, May 20, 2008 at 04:28:16PM -0700, Andrew Morton (akpm@linux-foundation.org) wrote:
> It's more than efficiency. There are lots and lots of things we cannot
> do in direct-reclaim context.
>
> a) Can't lock pages (well we kinda sorta could, but generally code
> will just trylock)
>
> b) Cannot rely on the inode or the address_space being present in
> memory after we have unlocked the page.
>
> c) Cannot run iput(). Or at least, we couldn't five or six years
> ago. afaik nobody has investigated whether the situation is now
> better or worse.
>
> d) lots of deadlock scenarios - need to test __GFP_FS basically everywhere
> in which you share code with normal writeback paths.
>
> Plus e), f), g) and h). Direct-reclaim is a hostile environment.
> Things like b) are a real killer - nasty, subtle, rare,
> memory-pressure-dependent crashes.

Which basically means we can not do direct writeback at reclaim time?..

--
Evgeniy Polyakov
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Wed, May 21, 2008 at 09:30:15AM +1000, David Chinner (dgc@sgi.com) wrote:
> Which is the exact implementation of
>
> filemap_fdatawrite_range(mapping, start, end);

Cool, I did not know that, probably because it is not exported :)

--
Evgeniy Polyakov
--
Re: [patch 10/21] buffer heads: Support slab defrag
On Wed, 21 May 2008 10:15:32 +0400 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Tue, May 20, 2008 at 04:28:16PM -0700, Andrew Morton (akpm@linux-foundation.org) wrote:
> > It's more than efficiency. There are lots and lots of things we cannot
> > do in direct-reclaim context.
> >
> > a) Can't lock pages (well we kinda sorta could, but generally code
> > will just trylock)
> >
> > b) Cannot rely on the inode or the address_space being present in
> > memory after we have unlocked the page.
> >
> > c) Cannot run iput(). Or at least, we couldn't five or six years
> > ago. afaik nobody has investigated whether the situation is now
> > better or worse.
> >
> > d) lots of deadlock scenarios - need to test __GFP_FS basically everywhere
> > in which you share code with normal writeback paths.
> >
> > Plus e), f), g) and h). Direct-reclaim is a hostile environment.
> > Things like b) are a real killer - nasty, subtle, rare,
> > memory-pressure-dependent crashes.
>
> Which basically means we can not do direct writeback at reclaim time?..
>

Well, we _can_, but doing so within the present constraints is delicate.

An implementation which locked all the to-be-written pages up front and
then wrote them out and which was careful not to touch the inode or
address_space after the last page is unlocked could work.

Or perhaps add a new lock to the inode and then in reclaim

a) lock a page on the LRU, thus pinning the address_space and inode.

b) take some new sleeping lock in the inode

c) unlock that page and now proceed to do writeback. But still
honouring !GFP_FS.

and teach the unmount code to take the per-inode locks too, to ensure
that reclaim has got out of there before zapping the inodes. Perhaps a
per-superblock lock rather than per-inode, dunno.

But we won't be able to just dive in there and call the existing
writeback functions from within reclaim. Because

a) callers can hold all sorts of locks, including implicit ones such
as journal_start() and

b) reclaim doesn't have a reference on the page's inode, and the
inode and address_space can vanish if reclaim isn't holding a lock
on one of the address_space's pages.

--