Mailing List Archive

Folio discussion recap
So I've been following the folio discussion, and it seems like it has gone off
the rails a bit, partly because struct page is such a mess and has been so
overused; we all want to see that cleaned up, but we're not being clear about
what that means. I was just talking with Johannes off list, and I thought I'd
recap that discussion as well as other talks with Matthew and see if I can lay
something out that everyone agrees with.

Some background:

For some years now, the overhead of dealing with 4k pages in the page cache has
gotten really, really painful. Any time we're doing buffered IO, we end up
walking a radix tree to get to the cached page, then doing a memcpy to or from
that page - which quite conveniently blows away the CPU cache - then walking
the radix tree to look up the next page, often touching locks along the way that
are no longer in cache - it's really bad.
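
To make the shape of that loop concrete, here is a heavily simplified sketch
of the per-page pattern being described - not the actual filemap.c code, and
the readahead/miss and error paths are omitted:

static ssize_t buffered_read_sketch(struct address_space *mapping,
				    loff_t pos, struct iov_iter *iter)
{
	ssize_t copied = 0;

	while (iov_iter_count(iter)) {
		size_t offset = pos & ~PAGE_MASK;
		size_t bytes = min_t(size_t, iov_iter_count(iter),
				     PAGE_SIZE - offset);
		struct page *page;

		/* a full tree walk for every 4k of data */
		page = find_get_page(mapping, pos >> PAGE_SHIFT);
		if (!page)
			break;	/* cache miss: go do readahead and IO */

		/* the memcpy that conveniently blows away the CPU cache */
		bytes = copy_page_to_iter(page, offset, bytes, iter);
		put_page(page);
		if (!bytes)
			break;
		pos += bytes;
		copied += bytes;
	}
	return copied;
}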

We've been hacking around this - the btrfs people have a vectorized buffered
write path, and this is also what my generic_file_buffered_read() patches were
about, batching up the page cache lookups - but really these are hacks that make
our core IO paths even more complicated, when the right answer that's been
staring all of us filesystem people in the face for years has been that it's
2021 and dealing with cached data in 4k chunks (when block based filesystems are
a thing of the past!) is abject stupidity.

So we need to be moving to larger, variable sized allocations for cached data.
We NEED this, this HAS TO HAPPEN - spend some time really digging into profiles
and looking at actual application usage; this is the #1 thing that's killing our
performance in the IO paths. Remember, us developers tend to be benchmarking
things like direct IO and small random IOs because we're looking at the whole IO
path, but most reads and writes are buffered, and they're already in cache, and
they're mostly big and sequential.

I emphasize this because a lot of us have really been waiting rather urgently
for Willy's work to go in, and there will no doubt be a lot more downstream
filesystem work to be done to fully take advantage of it and we're waiting on
this stuff to get merged so we can actually start testing and profiling the
brave new world and seeing what to work on next.

As an aside, before this there have been quite a few attempts at using
hugepages to deal with these issues, and they're all _fucking gross_, because
they all do if (normal page) else if (hugepage), and they all cut and paste
filemap.c code because no one (rightly) wanted to add their abortions to the
main IO paths. But look around the kernel and see how many times you can find
core filemap.c code duplicated elsewhere... Anyways, Willy's work is going to
let us delete all that crap.

So: this all means that filesystem code needs to start working in larger,
variable sized units, which today means - compound pages. Hence, the folio work
started out as a wrapper around compound pages.
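
Schematically, the wrapper looks something like this (simplified from the
actual series - the real page_folio() is a macro, but the idea is the same):
holding a struct folio pointer is proof that you're not looking at a tail
page, so compound_head() calls disappear from the callers.

struct folio {
	struct page page;	/* overlays the head struct page for now */
};

static inline struct folio *page_folio_sketch(struct page *page)
{
	return (struct folio *)compound_head(page);
}

static inline size_t folio_size_sketch(struct folio *folio)
{
	/* always a head (or order-0) page, so this is always valid */
	return PAGE_SIZE << compound_order(&folio->page);
}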

So, one objection to folios has been that they leak too much MM details out into
the filesystem code. To that we must point out: all the code that's going to be
using folios is right now using struct page - this isn't leaking out new details
and making things worse, this is actually (potentially!) a step in the right
direction, by moving some users of struct page to a new type that is actually
created for a specific purpose.

I think a lot of the acrimony in this discussion came precisely from this mess;
Johannes and the other MM people would like to see this situation improved so
that they have more freedom to reengineer and improve things on their side. One
particularly noteworthy idea was having struct page refer to multiple hardware
pages, and using slab/slub for larger alloctions. In my view, the primary reason
for making this change isn't the memory overhead to struct page (though reducing
that would be nice); it's that the slab allocator is _significantly_ faster than
the buddy allocator (the buddy allocator isn't percpu!) and as average
allocation sizes increase, this is hurting us more and more over time.

So we should listen to the MM people.

Fortunately, Matthew made a big step in the right direction by making folios a
new type. Right now, struct folio is not separately allocated - it's just
unionized/overlaid with struct page - but perhaps in the future they could be
separately allocated. I don't think that is a remotely realistic goal for _this_
patch series given the amount of code that touches struct page (think: writeback
code, LRU list code, page fault handlers!) - but I think that's a goal we could
keep in mind going forward.

We should also be clear on what _exactly_ folios are for, so they don't become
the new dumping ground for everyone to stash their crap. They're to be a new
core abstraction, and we should endeavor to keep our core data structures
_small_, and _simple_. So: no scatter gather. A folio should just represent a
single buffer of physically contiguous memory - vmap is slow, kmap_atomic() only
works on single pages, we do _not_ want to make filesystem code jump through
hoops to deal with anything else. The buffers should probably be power-of-two
sized, as that's what the buddy allocator likes to give us - that doesn't
necessarily have to be baked into the design, but I can't see us ever actually
wanting non-power-of-two sized allocations.

Q: But what about fragmentation? Won't these allocations fail sometimes?

Yes, and that's OK. The relevant filesystem code is all changing to handle
variable sized allocations, so it's completely fine if we fail a 256k allocation
and we have to fall back to whatever is available.
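
A minimal sketch of that fallback - the function name and flag choices here
are illustrative, not taken from any posted patch:

static struct page *alloc_cache_pages_sketch(gfp_t gfp, unsigned int order)
{
	while (order > 0) {
		/* opportunistic: don't retry hard, don't warn on failure */
		struct page *page = alloc_pages(gfp | __GFP_NORETRY |
						__GFP_NOWARN, order);
		if (page)
			return page;
		order--;	/* fragmented? try the next size down */
	}
	return alloc_pages(gfp, 0);	/* order 0 should always succeed */
}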

But also keep in mind that switching the biggest consumer of kernel side memory
to larger allocations is going to do more than anything else to help prevent
memory from getting fragmented in the first place. We _want_ this.

Q: Oh yeah, but what again are folios for, exactly?

Folios are for cached filesystem data which (importantly) may be mapped to
userspace.

So when MM people see a new data structure come up with new references to page
size - there's a very good reason for that, which is that we need to be
allocating in multiples of the hardware page size if we're going to be able to
map it to userspace and have PTEs point to it.

So going forward, if the MM people want struct page to refer to multiple
hardware pages - this shouldn't prevent that, and folios will refer to
multiples of the _hardware_ page size, not of whatever size struct page covers.

Also - all the filesystem code that's being converted tends to talk and think in
units of pages. So going forward, it would be a nice cleanup to get rid of as
many of those references as possible and just talk in terms of bytes (e.g. I
have generally been trying to get rid of references to PAGE_SIZE in bcachefs
wherever reasonable, for other reasons) - those cleanups are probably for
another patch series, and in the interests of getting this patch series merged
with the fewest introduced bugs possible we probably want the current helpers.
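
As an example of the difference, a loop written against the
folio_pos()/folio_size() style helpers from the series talks in bytes and
works for any folio order - the loop body here is just a stand-in:

static void process_folio_bytes_sketch(struct folio *folio, size_t blocksize)
{
	loff_t pos = folio_pos(folio);		/* byte offset in the file */
	size_t count = folio_size(folio);	/* bytes, whatever the order */

	while (count) {
		size_t len = min(count, blocksize);

		/* ... process the byte range [pos, pos + len) ... */
		pos += len;
		count -= len;
	}
}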

-------------

That's my recap, I hope I haven't missed anything. The TL;DR is:

* struct page is a mess; yes, we know. We're all living with that pain.

* This isn't our ultimate end goal (nothing ever is!) - but it's probably along
the right path.

* Going forward: maybe struct folio should be separately allocated. That will
entail a lot more work so it's not appropriate for this patch series, but I
think it's a goal that would make everyone happy.

* We should probably think and talk more concretely about what our end goals
are.

Getting away from struct page is something that comes up again and again - DAX
is another notable (and acrimonious) area where this has come up. Also,
page->mapping and page->index make sharing cached data between different files
(think: reflink, snapshots) pretty much a non-starter.

I'm going to publicly float one of my own ideas here: maybe entries in the page
cache radix tree don't have to be just a single pointer/ulong. If those entries
were bigger, perhaps some things would fit better there than in either struct
page/folio.
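
Purely as a hypothetical illustration of that idea - every field placement
here is speculative:

struct pagecache_entry_sketch {
	struct folio	*folio;		/* the cached memory itself */
	pgoff_t		index;		/* could migrate out of struct page */
	unsigned long	workingset;	/* refault info without shadow entries */
};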


Excessive PAGE_SIZE usage:
--------------------------

Another thing that keeps coming up is - indiscriminate use of PAGE_SIZE makes it
hard, especially when we're reviewing new code, to tell what's a legitimate use
or not. When it's tied to the hardware page size (as folios are), it's probably
legitimate, but PAGE_SIZE is _way_ overused.

Partly this was because historically slab had to be used for small allocations
and the buddy allocator, __get_free_pages(), had to be used for larger
allocations. This is still somewhat the case - slab can go up to something like
128k, but there's still a hard cap on allocation size with kmalloc().

Perhaps the MM people could look into lifting this restriction, so that
kmalloc() could be used for any sized physically contiguous allocation that the
system could satisfy? If we had this, then it would make it more practical to
go through and refactor existing code that uses __get_free_pages() and convert
it to kmalloc(), without having to stare at code and figure out if it's safe.
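
The conversion in question would be mechanical - something like this, assuming
the kmalloc() size cap (KMALLOC_MAX_SIZE) were lifted:

/* today: buddy allocator, because size may exceed KMALLOC_MAX_SIZE */
static void *big_alloc_today(size_t size)
{
	return (void *)__get_free_pages(GFP_KERNEL, get_order(size));
}

/* the proposed refactor: one allocator API, exact sizes, no order math */
static void *big_alloc_proposed(size_t size)
{
	return kmalloc(size, GFP_KERNEL);
}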

And that's my $.02
Re: Folio discussion recap
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> So we should listen to the MM people.

Count me here.

I think the problem with folio is that everybody reads their own hopes and
dreams into it and gets disappointed when they see that their somewhat
related problem doesn't get magically fixed with folio.

Folio started as a way to relieve the pain of dealing with compound pages.
It provides a unified view of base pages and compound pages. That's it.

It is the required groundwork for wider adoption of compound pages in
the page cache. But it will also be useful for anon THP and hugetlb.

Based on adoption rate and resulting code, the new abstraction has nice
downstream effects. It may be suitable for more than it was intended for
initially. That's great.

But if it doesn't solve your problem... well, sorry...

The patchset makes a nice step forward and cuts back on the mess I created on
the way to huge-tmpfs.

I would be glad to see the patchset upstream.

--
Kirill A. Shutemov
Re: Folio discussion recap
On Sat 11-09-21 04:23:24, Kirill A. Shutemov wrote:
> On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> > So we should listen to the MM people.
>
> Count me here.
>
> I think the problem with folio is that everybody reads their own hopes and
> dreams into it and gets disappointed when they see that their somewhat
> related problem doesn't get magically fixed with folio.
>
> Folio started as a way to relieve the pain of dealing with compound pages.
> It provides a unified view of base pages and compound pages. That's it.
>
> It is the required groundwork for wider adoption of compound pages in
> the page cache. But it will also be useful for anon THP and hugetlb.
>
> Based on adoption rate and resulting code, the new abstraction has nice
> downstream effects. It may be suitable for more than it was intended for
> initially. That's great.
>
> But if it doesn't solve your problem... well, sorry...
>
> The patchset makes a nice step forward and cuts back on the mess I created on
> the way to huge-tmpfs.
>
> I would be glad to see the patchset upstream.

I do agree here. While the points that Johannes brought up are relevant
and worth thinking about, I also see a clear advantage that folio (or
whatever $name) brings. The compound page handling is just a mess
and a source of practical problems and bugs.

This really requires some systematic approach to deal with it. The
proposed type system is definitely a good way to approach it. Johannes
is not happy about having the type still refer to page units, but I
haven't seen an example where that leads to worse or harder to
maintain code so far. The evolution is likely not going to stop at the
current type system, but I haven't seen any specifics to prove it would
stand in the way. The existing code (fs or other subsystem interacting
with MM) is going to require quite a lot of changes to move away from
the struct page notion, but I do not see folios adding a fundamental
blocker there.

All that being said, not only do I see folios as a step in the right
direction to address the compound page mess, they are also code that
already exists and gives some real advantages. I haven't heard anybody
subscribing to a different approach and providing an implementation in
the foreseeable future, so I would rather go with this approach than
deal with the existing code long term.
--
Michal Hocko
SUSE Labs
Re: Folio discussion recap
On Mon, Sep 13, 2021 at 01:32:30PM +0200, Michal Hocko wrote:
> The existing code (fs or other subsystem interacting with MM) is
> going to require quite a lot of changes to move away from the struct
> page notion, but I do not see folios adding a fundamental blocker
> there.

The current folio seems to do quite a bit of that work, actually. But
it'll be undone when the MM conversion matures the data structure into
the full-blown new page.

It's not about hopes and dreams, it's the simple fact that the patches
do something now that seems very valuable, but which we'll lose again
over time. And avoiding that is a relatively minor adjustment at this
time compared to a much larger one later on.

So yeah, it's not really a blocker. It's just a missed opportunity to
lastingly disentangle struct page's multiple roles when touching all
the relevant places anyway. It's also (needlessly) betting that
compound pages can be made into a scalable, reliable, and predictable
allocation model, and proliferating them into fs/ based on that.

These patches, and all the ones that will need to follow to finish the
conversion, are exceptionally expensive. It would have been nice to
get more out of this disruption than to identify the relatively few
places that genuinely need compound_head(), and having a datatype for
N contiguous pages. Is there merit in solving those problems? Sure. Is
it a robust, forward-looking direction for the MM space that justifies
the cost of these and later patches? You seem to think so, I don't.

It doesn't look like we'll agree on this. But I think I've made my
points several times now, so I'll defer to Linus and Andrew.
Re: Folio discussion recap
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> One particularly noteworthy idea was having struct page refer to
> multiple hardware pages, and using slab/slub for larger
> allocations. In my view, the primary reason for making this change
> isn't the memory overhead to struct page (though reducing that would
> be nice);

Don't underestimate this, however.

Picture the near future Willy describes, where we don't bump struct
page size yet but serve most cache with compound huge pages.

On x86, it would mean that the average page cache entry has 512
mapping pointers, 512 index members, 512 private pointers, 1024 LRU
list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate
flags, 512 memcg pointers etc. - you get the idea.
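
Back-of-envelope, assuming a 64-byte struct page and 4k base pages:

/*
 * 2MB huge page / 4k base pages = 512 struct pages
 * 512 * 64 bytes = 32k of struct page per 2MB cache entry (~1.6% of
 * memory), almost all of it duplicated or unused tail-page state -
 * versus a single descriptor per cache entry if mapping, index, flags
 * etc. lived in a separately allocated object.
 */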

This is a ton of memory. I think this doesn't get more traction
because it's memory we've always allocated, and we're simply more
sensitive to regressions than long-standing pain. But nevertheless
this is a pretty low-hanging fruit.

The folio makes a great first step moving those into a separate data
structure, opening the door to one day realizing these savings. Even
when some MM folks say this was never the intent behind the patches, I
think this is going to matter significantly, if not more so, later on.

> Fortunately, Matthew made a big step in the right direction by making folios a
> new type. Right now, struct folio is not separately allocated - it's just
> unionized/overlaid with struct page - but perhaps in the future they could be
> separately allocated. I don't think that is a remotely realistic goal for _this_
> patch series given the amount of code that touches struct page (think: writeback
> code, LRU list code, page fault handlers!) - but I think that's a goal we could
> keep in mind going forward.

Yeah, agreed. Not doable out of the gate, but retaining the ability to
allocate the "cache entry descriptor" bits - mapping, index etc. -
on-demand would be a huge benefit down the road for the above reason.

For that they would have to be in - and stay in - their own type.

> We should also be clear on what _exactly_ folios are for, so they don't become
> the new dumping ground for everyone to stash their crap. They're to be a new
> core abstraction, and we should endeavor to keep our core data structures
> _small_, and _simple_.

Right. struct page is a lot of things and anything but simple and
obvious today. struct folio in its current state does a good job
separating some of that stuff out.

However, when we think about *which* of the struct page mess the folio
wants to address, I think that bias toward recent pain over much
bigger long-standing pain strikes again.

The compound page proliferation is new, and we're sensitive to the
ambiguity it created between head and tail pages. It's added some
compound_head() in lower-level accessor functions that are not
necessary for many contexts. The folio type safety will help clean
that up, and this is great.
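
To illustrate, here is a simplified version of what the page flag accessors
expand to today - every test of a PF_HEAD-policy flag pays for the head
lookup, whether or not the caller could ever hold a tail page:

static inline int PageActive_sketch(struct page *page)
{
	page = compound_head(page);	/* the hidden cost in every accessor */
	return test_bit(PG_active, &page->flags);
}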

However, there is a much bigger, systematic type ambiguity in the MM
world that we've just gotten used to over the years: anon vs file vs
shmem vs slab vs ...

- Many places rely on context to say "if we get here, it must be
anon/file", and then unsafely access overloaded member elements:
page->mapping, PG_readahead, PG_swapcache, PG_private

- On the other hand, we also have low-level accessor functions that
disambiguate the type and impose checks on contexts that may or may
not actually need them - not unlike compound_head() in PageActive():

struct address_space *folio_mapping(struct folio *folio)
{
	struct address_space *mapping;

	/* This happens if someone calls flush_dcache_page on slab page */
	if (unlikely(folio_test_slab(folio)))
		return NULL;

	if (unlikely(folio_test_swapcache(folio)))
		return swap_address_space(folio_swap_entry(folio));

	mapping = folio->mapping;
	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
		return NULL;

	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}

Then we go identify places that say "we know it's at least not a
slab page!" and convert them to page_mapping_file() which IS safe to
use with anon. Or we say "we know this MUST be a file page" and just
access the (unsafe) mapping pointer directly.

- We have a singular page lock, but what it guards depends on what
type of page we're dealing with. For a cache page it protects
uptodate and the mapping. For an anon page it protects swap state.

A lot of us can remember the rules if we try, but the code doesn't
help and it gets really tricky when dealing with multiple types of
pages simultaneously. Even mature code like reclaim just serializes
the operation instead of protecting data - the writeback checks and
the page table reference tests don't seem to need page lock.

When the cgroup folks wrote the initial memory controller, they just
added their own page-scope lock to protect page->memcg even though
the page lock would have covered what it needed.

- shrink_page_list() uses page_mapping() in the first half of the
function to tell whether the page is anon or file, but halfway
through we do this:

/* Adding to swap updated mapping */
mapping = page_mapping(page);

and then use PageAnon() to disambiguate the page type.

- At activate_locked:, we check PG_swapcache directly on the page and
rely on it doing the right thing for anon, file, and shmem pages.
But this flag is PG_owner_priv_1 and actually used by the filesystem
for something else. I guess PG_checked pages currently don't make it
this far in reclaim, or we'd crash somewhere in try_to_free_swap().

I suppose we're also never calling page_mapping() on PageChecked
filesystem pages right now, because it would return a swap mapping
before testing whether this is a file page. You know, because shmem.

These are just a few examples from an MM perspective. I'm sure the FS
folks have their own stories and examples about pitfalls in dealing
with struct page members.

We're so used to this that we don't realize how much bigger and
pervasive this lack of typing is than the compound page thing.

I'm not saying the compound page mess isn't worth fixing. It is.

I'm saying if we started with a file page or cache entry abstraction
we'd solve not only the huge page cache, but also set us up for a MUCH
more comprehensive cleanup in MM code and MM/FS interaction that makes
the tailpage cleanup pale in comparison. For the same amount of churn,
since folio would also touch all of these places.
Re: Folio discussion recap
Hello all,

I am an outsider following the discussion here on this subject.
Can we not go upstream with the current state of development?
There will always be optimizations, and new kernel releases too.

I cannot assess the risk, but I think a decision must be made.

Damian


On Wed, 15. Sep 11:40, Johannes Weiner wrote:
> On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> > One particularly noteworthy idea was having struct page refer to
> > multiple hardware pages, and using slab/slub for larger
> > allocations. In my view, the primary reason for making this change
> > isn't the memory overhead to struct page (though reducing that would
> > be nice);
>
> Don't underestimate this, however.
>
> Picture the near future Willy describes, where we don't bump struct
> page size yet but serve most cache with compound huge pages.
>
> On x86, it would mean that the average page cache entry has 512
> mapping pointers, 512 index members, 512 private pointers, 1024 LRU
> list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate
> flags, 512 memcg pointers etc. - you get the idea.
>
> This is a ton of memory. I think this doesn't get more traction
> because it's memory we've always allocated, and we're simply more
> sensitive to regressions than long-standing pain. But nevertheless
> this is a pretty low-hanging fruit.
>
> The folio makes a great first step moving those into a separate data
> structure, opening the door to one day realizing these savings. Even
> when some MM folks say this was never the intent behind the patches, I
> think this is going to matter significantly, if not more so, later on.
>
> > Fortunately, Matthew made a big step in the right direction by making folios a
> > new type. Right now, struct folio is not separately allocated - it's just
> > unionized/overlaid with struct page - but perhaps in the future they could be
> > separately allocated. I don't think that is a remotely realistic goal for _this_
> > patch series given the amount of code that touches struct page (think: writeback
> > code, LRU list code, page fault handlers!) - but I think that's a goal we could
> > keep in mind going forward.
>
> Yeah, agreed. Not doable out of the gate, but retaining the ability to
> allocate the "cache entry descriptor" bits - mapping, index etc. -
> on-demand would be a huge benefit down the road for the above reason.
>
> For that they would have to be in - and stay in - their own type.
>
> > We should also be clear on what _exactly_ folios are for, so they don't become
> > the new dumping ground for everyone to stash their crap. They're to be a new
> > core abstraction, and we should endeavor to keep our core data structures
> > _small_, and _simple_.
>
> Right. struct page is a lot of things and anything but simple and
> obvious today. struct folio in its current state does a good job
> separating some of that stuff out.
>
> However, when we think about *which* of the struct page mess the folio
> wants to address, I think that bias toward recent pain over much
> bigger long-standing pain strikes again.
>
> The compound page proliferation is new, and we're sensitive to the
> ambiguity it created between head and tail pages. It's added some
> compound_head() in lower-level accessor functions that are not
> necessary for many contexts. The folio type safety will help clean
> that up, and this is great.
>
> However, there is a much bigger, systematic type ambiguity in the MM
> world that we've just gotten used to over the years: anon vs file vs
> shmem vs slab vs ...
>
> - Many places rely on context to say "if we get here, it must be
> anon/file", and then unsafely access overloaded member elements:
> page->mapping, PG_readahead, PG_swapcache, PG_private
>
> - On the other hand, we also have low-level accessor functions that
> disambiguate the type and impose checks on contexts that may or may
> not actually need them - not unlike compound_head() in PageActive():
>
> struct address_space *folio_mapping(struct folio *folio)
> {
> 	struct address_space *mapping;
>
> 	/* This happens if someone calls flush_dcache_page on slab page */
> 	if (unlikely(folio_test_slab(folio)))
> 		return NULL;
>
> 	if (unlikely(folio_test_swapcache(folio)))
> 		return swap_address_space(folio_swap_entry(folio));
>
> 	mapping = folio->mapping;
> 	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
> 		return NULL;
>
> 	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
> }
>
> Then we go identify places that say "we know it's at least not a
> slab page!" and convert them to page_mapping_file() which IS safe to
> use with anon. Or we say "we know this MUST be a file page" and just
> access the (unsafe) mapping pointer directly.
>
> - We have a singular page lock, but what it guards depends on what
> type of page we're dealing with. For a cache page it protects
> uptodate and the mapping. For an anon page it protects swap state.
>
> A lot of us can remember the rules if we try, but the code doesn't
> help and it gets really tricky when dealing with multiple types of
> pages simultaneously. Even mature code like reclaim just serializes
> the operation instead of protecting data - the writeback checks and
> the page table reference tests don't seem to need page lock.
>
> When the cgroup folks wrote the initial memory controller, they just
> added their own page-scope lock to protect page->memcg even though
> the page lock would have covered what it needed.
>
> - shrink_page_list() uses page_mapping() in the first half of the
> function to tell whether the page is anon or file, but halfway
> through we do this:
>
> /* Adding to swap updated mapping */
> mapping = page_mapping(page);
>
> and then use PageAnon() to disambiguate the page type.
>
> - At activate_locked:, we check PG_swapcache directly on the page and
> rely on it doing the right thing for anon, file, and shmem pages.
> But this flag is PG_owner_priv_1 and actually used by the filesystem
> for something else. I guess PG_checked pages currently don't make it
> this far in reclaim, or we'd crash somewhere in try_to_free_swap().
>
> I suppose we're also never calling page_mapping() on PageChecked
> filesystem pages right now, because it would return a swap mapping
> before testing whether this is a file page. You know, because shmem.
>
> These are just a few examples from an MM perspective. I'm sure the FS
> folks have their own stories and examples about pitfalls in dealing
> with struct page members.
>
> We're so used to this that we don't realize how much bigger and
> pervasive this lack of typing is than the compound page thing.
>
> I'm not saying the compound page mess isn't worth fixing. It is.
>
> I'm saying if we started with a file page or cache entry abstraction
> we'd solve not only the huge page cache, but also set us up for a MUCH
> more comprehensive cleanup in MM code and MM/FS interaction that makes
> the tailpage cleanup pale in comparison. For the same amount of churn,
> since folio would also touch all of these places.
>
Re: Folio discussion recap
On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote:
> On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> > One particularly noteworthy idea was having struct page refer to
> > multiple hardware pages, and using slab/slub for larger
> > allocations. In my view, the primary reason for making this change
> > isn't the memory overhead to struct page (though reducing that would
> > be nice);
>
> Don't underestimate this, however.
>
> Picture the near future Willy describes, where we don't bump struct
> page size yet but serve most cache with compound huge pages.
>
> On x86, it would mean that the average page cache entry has 512
> mapping pointers, 512 index members, 512 private pointers, 1024 LRU
> list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate
> flags, 512 memcg pointers etc. - you get the idea.
>
> This is a ton of memory. I think this doesn't get more traction
> because it's memory we've always allocated, and we're simply more
> sensitive to regressions than long-standing pain. But nevertheless
> this is a pretty low-hanging fruit.
>
> The folio makes a great first step moving those into a separate data
> structure, opening the door to one day realizing these savings. Even
> when some MM folks say this was never the intent behind the patches, I
> think this is going to matter significantly, if not more so, later on.

So ... I chatted with Kent the other day, who suggested to me that maybe
the point you're really after is that you want to increase the hw page
size to reduce overhead while retaining the ability to hand out parts of
those larger pages to the page cache, and folios don't get us there?

> > Fortunately, Matthew made a big step in the right direction by making folios a
> > new type. Right now, struct folio is not separately allocated - it's just
> > unionized/overlaid with struct page - but perhaps in the future they could be
> > separately allocated. I don't think that is a remotely realistic goal for _this_
> > patch series given the amount of code that touches struct page (think: writeback
> > code, LRU list code, page fault handlers!) - but I think that's a goal we could
> > keep in mind going forward.
>
> Yeah, agreed. Not doable out of the gate, but retaining the ability to
> allocate the "cache entry descriptor" bits - mapping, index etc. -
> on-demand would be a huge benefit down the road for the above reason.
>
> For that they would have to be in - and stay in - their own type.
>
> > We should also be clear on what _exactly_ folios are for, so they don't become
> > the new dumping ground for everyone to stash their crap. They're to be a new
> > core abstraction, and we should endeavor to keep our core data structures
> > _small_, and _simple_.
>
> Right. struct page is a lot of things and anything but simple and
> obvious today. struct folio in its current state does a good job
> separating some of that stuff out.
>
> However, when we think about *which* of the struct page mess the folio
> wants to address, I think that bias toward recent pain over much
> bigger long-standing pain strikes again.
>
> The compound page proliferation is new, and we're sensitive to the
> ambiguity it created between head and tail pages. It's added some
> compound_head() in lower-level accessor functions that are not
> necessary for many contexts. The folio type safety will help clean
> that up, and this is great.
>
> However, there is a much bigger, systematic type ambiguity in the MM
> world that we've just gotten used to over the years: anon vs file vs
> shmem vs slab vs ...
>
> - Many places rely on context to say "if we get here, it must be
> anon/file", and then unsafely access overloaded member elements:
> page->mapping, PG_readahead, PG_swapcache, PG_private
>
> - On the other hand, we also have low-level accessor functions that
> disambiguate the type and impose checks on contexts that may or may
> not actually need them - not unlike compound_head() in PageActive():
>
> struct address_space *folio_mapping(struct folio *folio)
> {
> 	struct address_space *mapping;
>
> 	/* This happens if someone calls flush_dcache_page on slab page */
> 	if (unlikely(folio_test_slab(folio)))
> 		return NULL;
>
> 	if (unlikely(folio_test_swapcache(folio)))
> 		return swap_address_space(folio_swap_entry(folio));
>
> 	mapping = folio->mapping;
> 	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
> 		return NULL;
>
> 	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
> }
>
> Then we go identify places that say "we know it's at least not a
> slab page!" and convert them to page_mapping_file() which IS safe to
> use with anon. Or we say "we know this MUST be a file page" and just
> access the (unsafe) mapping pointer directly.
>
> - We have a singular page lock, but what it guards depends on what
> type of page we're dealing with. For a cache page it protects
> uptodate and the mapping. For an anon page it protects swap state.
>
> A lot of us can remember the rules if we try, but the code doesn't
> help and it gets really tricky when dealing with multiple types of
> pages simultaneously. Even mature code like reclaim just serializes
> the operation instead of protecting data - the writeback checks and
> the page table reference tests don't seem to need page lock.
>
> When the cgroup folks wrote the initial memory controller, they just
> added their own page-scope lock to protect page->memcg even though
> the page lock would have covered what it needed.
>
> - shrink_page_list() uses page_mapping() in the first half of the
> function to tell whether the page is anon or file, but halfway
> through we do this:
>
> /* Adding to swap updated mapping */
> mapping = page_mapping(page);
>
> and then use PageAnon() to disambiguate the page type.
>
> - At activate_locked:, we check PG_swapcache directly on the page and
> rely on it doing the right thing for anon, file, and shmem pages.
> But this flag is PG_owner_priv_1 and actually used by the filesystem
> for something else. I guess PG_checked pages currently don't make it
> this far in reclaim, or we'd crash somewhere in try_to_free_swap().
>
> I suppose we're also never calling page_mapping() on PageChecked
> filesystem pages right now, because it would return a swap mapping
> before testing whether this is a file page. You know, because shmem.

(Yes, it would be helpful to fix these ambiguities, because I feel like
discussions about all these other non-pagecache uses of memory keep
coming up on fsdevel and the code /really/ doesn't help me figure out
what everyone's talking about before the discussion moves on...)

> These are just a few examples from an MM perspective. I'm sure the FS
> folks have their own stories and examples about pitfalls in dealing
> with struct page members.

We do, and I thought we were making good progress pushing a lot of that
into the fs/iomap/ library. With fs iomap, disk filesystems pass space
mapping data to the iomap functions and let them deal with pages (or
folios). IOWs, filesystems don't deal with pages directly anymore, and
folios sounded like an easy transition (for a filesystem) to whatever
comes next. At some point it would be nice to get fscrypt and fsverity
hooked up so that we could move ext4 further off of buffer heads.

I don't know how we proceed from here -- there's quite a bit of
filesystems work that depended on the folios series actually landing.
Given that Linus has neither pulled it, rejected it, nor told willy what
to do, and the folio series now has a NAK on it, I can't even start on
how to proceed from here.

--D

> We're so used to this that we don't realize how much bigger and
> pervasive this lack of typing is than the compound page thing.
>
> I'm not saying the compound page mess isn't worth fixing. It is.
>
> I'm saying if we started with a file page or cache entry abstraction
> we'd solve not only the huge page cache, but also set us up for a MUCH
> more comprehensive cleanup in MM code and MM/FS interaction that makes
> the tailpage cleanup pale in comparison. For the same amount of churn,
> since folio would also touch all of these places.
Re: Folio discussion recap
On Wed, Sep 15, 2021 at 07:58:54PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote:
> > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> > > One particularly noteworthy idea was having struct page refer to
> > > multiple hardware pages, and using slab/slub for larger
> > > allocations. In my view, the primary reason for making this change
> > > isn't the memory overhead to struct page (though reducing that would
> > > be nice);
> >
> > Don't underestimate this, however.
> >
> > Picture the near future Willy describes, where we don't bump struct
> > page size yet but serve most cache with compound huge pages.
> >
> > On x86, it would mean that the average page cache entry has 512
> > mapping pointers, 512 index members, 512 private pointers, 1024 LRU
> > list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate
> > flags, 512 memcg pointers etc. - you get the idea.
> >
> > This is a ton of memory. I think this doesn't get more traction
> > because it's memory we've always allocated, and we're simply more
> > sensitive to regressions than long-standing pain. But nevertheless
> > this is a pretty low-hanging fruit.
> >
> > The folio makes a great first step moving those into a separate data
> > structure, opening the door to one day realizing these savings. Even
> > when some MM folks say this was never the intent behind the patches, I
> > think this is going to matter significantly, if not more so, later on.
>
> So ... I chatted with Kent the other day, who suggested to me that maybe
> the point you're really after is that you want to increase the hw page
> size to reduce overhead while retaining the ability to hand out parts of
> those larger pages to the page cache, and folios don't get us there?

Yes, that's one of the points.

It's exporting the huge page model we've been using for anonymous
memory to the filesystems, even though that model has shown
significant limitations in practice: it doesn't work well out of the
box, the necessary configuration is painful and complicated, and even
when done correctly it still has high allocation latencies. It's much
more "handtuned HPC workload" than "general purpose feature".

Fixing this is an open problem. I don't know for sure if we need to
increase the page size for that, but neither does anybody else. This
is simply work and experiments that haven't been done on the MM side.

Exposing the filesystems to that implementation now exposes them to
the risk of a near-term do-over, and puts a significantly higher
barrier on fixing the allocation model down the line.

There isn't a technical reason for coupling the filesystems this
tightly to the allocation model. It's just that the filesystem people
would like a size-agnostic cache object, and some MM folks would like
to clean up the compound page mess, and folio tries to do both of
these things at once.

> > > Fortunately, Matthew made a big step in the right direction by making folios a
> > > new type. Right now, struct folio is not separately allocated - it's just
> > > unionized/overlaid with struct page - but perhaps in the future they could be
> > > separately allocated. I don't think that is a remotely realistic goal for _this_
> > > patch series given the amount of code that touches struct page (think: writeback
> > > code, LRU list code, page fault handlers!) - but I think that's a goal we could
> > > keep in mind going forward.
> >
> > Yeah, agreed. Not doable out of the gate, but retaining the ability to
> > allocate the "cache entry descriptor" bits - mapping, index etc. -
> > on-demand would be a huge benefit down the road for the above reason.
> >
> > For that they would have to be in - and stay in - their own type.
> >
> > > We should also be clear on what _exactly_ folios are for, so they don't become
> > > the new dumping ground for everyone to stash their crap. They're to be a new
> > > core abstraction, and we should endeavor to keep our core data structures
> > > _small_, and _simple_.
> >
> > Right. struct page is a lot of things and anything but simple and
> > obvious today. struct folio in its current state does a good job
> > separating some of that stuff out.
> >
> > However, when we think about *which* of the struct page mess the folio
> > wants to address, I think that bias toward recent pain over much
> > bigger long-standing pain strikes again.
> >
> > The compound page proliferation is new, and we're sensitive to the
> > ambiguity it created between head and tail pages. It's added some
> > compound_head() in lower-level accessor functions that are not
> > necessary for many contexts. The folio type safety will help clean
> > that up, and this is great.
> >
> > However, there is a much bigger, systematic type ambiguity in the MM
> > world that we've just gotten used to over the years: anon vs file vs
> > shmem vs slab vs ...
> >
> > - Many places rely on context to say "if we get here, it must be
> > anon/file", and then unsafely access overloaded member elements:
> > page->mapping, PG_readahead, PG_swapcache, PG_private
> >
> > - On the other hand, we also have low-level accessor functions that
> > disambiguate the type and impose checks on contexts that may or may
> > not actually need them - not unlike compound_head() in PageActive():
> >
> > struct address_space *folio_mapping(struct folio *folio)
> > {
> > 	struct address_space *mapping;
> >
> > 	/* This happens if someone calls flush_dcache_page on slab page */
> > 	if (unlikely(folio_test_slab(folio)))
> > 		return NULL;
> >
> > 	if (unlikely(folio_test_swapcache(folio)))
> > 		return swap_address_space(folio_swap_entry(folio));
> >
> > 	mapping = folio->mapping;
> > 	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
> > 		return NULL;
> >
> > 	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
> > }
> >
> > Then we go identify places that say "we know it's at least not a
> > slab page!" and convert them to page_mapping_file() which IS safe to
> > use with anon. Or we say "we know this MUST be a file page" and just
> > access the (unsafe) mapping pointer directly.
> >
> > - We have a singular page lock, but what it guards depends on what
> > type of page we're dealing with. For a cache page it protects
> > uptodate and the mapping. For an anon page it protects swap state.
> >
> > A lot of us can remember the rules if we try, but the code doesn't
> > help and it gets really tricky when dealing with multiple types of
> > pages simultaneously. Even mature code like reclaim just serializes
> > the operation instead of protecting data - the writeback checks and
> > the page table reference tests don't seem to need page lock.
> >
> > When the cgroup folks wrote the initial memory controller, they just
> > added their own page-scope lock to protect page->memcg even though
> > the page lock would have covered what it needed.
> >
> > - shrink_page_list() uses page_mapping() in the first half of the
> > function to tell whether the page is anon or file, but halfway
> > through we do this:
> >
> > /* Adding to swap updated mapping */
> > mapping = page_mapping(page);
> >
> > and then use PageAnon() to disambiguate the page type.
> >
> > - At activate_locked:, we check PG_swapcache directly on the page and
> > rely on it doing the right thing for anon, file, and shmem pages.
> > But this flag is PG_owner_priv_1 and actually used by the filesystem
> > for something else. I guess PG_checked pages currently don't make it
> > this far in reclaim, or we'd crash somewhere in try_to_free_swap().
> >
> > I suppose we're also never calling page_mapping() on PageChecked
> > filesystem pages right now, because it would return a swap mapping
> > before testing whether this is a file page. You know, because shmem.
>
> (Yes, it would be helpful to fix these ambiguities, because I feel like
> discussions about all these other non-pagecache uses of memory keep
> coming up on fsdevel and the code /really/ doesn't help me figure out
> what everyone's talking about before the discussion moves on...)

Excellent.

However, after listening to Kent and other filesystem folks, I think
it's important to point out that the folio is not a dedicated page
cache page descriptor that will address any of the above examples.

The MM POV (and the justification for both the acks and the naks of
the patchset) is that it's a generic, untyped compound page
abstraction, which applies to file, anon, slab, networking
pages. Certainly, the folio patches as of right now also convert anon
page handling to the folio. If followed to its conclusion, the folio
will have plenty of members and API functions for non-pagecache users
and look pretty much like struct page today, just with a dynamic size.

I know Kent was surprised by this. I know Dave Chinner suggested to
call it "cache page" or "cage" early on, which also suggests an
understanding of a *dedicated* cache page descriptor.

I don't think the ambiguous folio name and the ambiguous union with
the page helped in any way in aligning fs and mm folks on what this
thing is actually supposed to be!

I agree with what I think the filesystems want: instead of an untyped,
variable-sized block of memory, I think we should have a typed page
cache descriptor.

That would work better for the filesystems, and I think would also
work better for the MM code down the line and fix the above examples.

The headpage/tailpage cleanup would come free with that.

> > These are just a few examples from an MM perspective. I'm sure the FS
> > folks have their own stories and examples about pitfalls in dealing
> > with struct page members.
>
> We do, and I thought we were making good progress pushing a lot of that
> into the fs/iomap/ library. With fs iomap, disk filesystems pass space
> mapping data to the iomap functions and let them deal with pages (or
> folios). IOWs, filesystems don't deal with pages directly anymore, and
> folios sounded like an easy transition (for a filesystem) to whatever
> comes next. At some point it would be nice to get fscrypt and fsverity
> hooked up so that we could move ext4 further off of buffer heads.
>
> I don't know how we proceed from here -- there's quite a bit of
> filesystems work that depended on the folios series actually landing.
> Given that Linus has neither pulled it, rejected it, nor told willy what
> to do, and the folio series now has a NAK on it, I can't even start on
> how to proceed from here.

I think divide and conquer is the way forward.

The crux of the matter is that folio is trying to 1) replace struct
page as the filesystem interface to the MM and 2) replace struct page
as the internal management object for file and anon, and conceptually
also slab & networking pages all at the same time.

As you can guess, goals 1) and 2) have vastly different scopes.

Replacing struct page in the filesystem isn't very controversial, and
filesystem folks seem uniformly ready to go. I agree.

Replacing struct page in MM code is much less clear cut. We have some
people who say it'll be great, some people who say we can probably
figure out open questions down the line, and we have some people who
have expressed doubts that all this churn will ever be worth it. I
think it's worth replacing, but not with an untyped compound thing.

It's sh*tty that the filesystem people are acutely blocked on
large-scope, long-term MM discussions they don't care about.

It's also sh*tty that these MM discussions are rushed by folks who
aren't familiar with or care too much about the MM internals.

This friction isn't necessary. The folio conversion is an incremental
process. It's not like everything in MM code has been fully converted
already - some stuff deals with the folio, most stuff with the page.

An easy way forward that I see is to split this large, open-ended
project into more digestible pieces. E.g. separate 1) and 2): merge a
"size-agnostic cache page" type now; give MM folks the time they need
to figure out how and if they want to replace struct page internally.

That's why I suggested to drop the anon page conversion bits in
swap.c, workingset.c, memcontrol.c etc, and just focus on the
uncontroversial page cache bits for now.
Re: Folio discussion recap
Johannes Weiner <hannes@cmpxchg.org> wrote:

> I know Kent was surprised by this. I know Dave Chinner suggested to
> call it "cache page" or "cage" early on, which also suggests an
> understanding of a *dedicated* cache page descriptor.

If we are aiming to get pages out of the view of the filesystem, then we
should probably not include "page" in the name. "Data cache" would seem
obvious, but we already have that concept for the CPU. How about something
like "struct content" and rename i_pages to i_content?

David
Re: Folio discussion recap
On Thu, Sep 16, 2021 at 12:54:22PM -0400, Johannes Weiner wrote:
> On Wed, Sep 15, 2021 at 07:58:54PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote:
> > > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> The MM POV (and the justification for both the acks and the naks of
> the patchset) is that it's a generic, untyped compound page
> abstraction, which applies to file, anon, slab, networking
> pages. Certainly, the folio patches as of right now also convert anon
> page handling to the folio. If followed to its conclusion, the folio
> will have plenty of members and API functions for non-pagecache users
> and look pretty much like struct page today, just with a dynamic size.
>
> I know Kent was surprised by this. I know Dave Chinner suggested to
> call it "cache page" or "cage" early on, which also suggests an
> understanding of a *dedicated* cache page descriptor.

Don't take a flippant comment I made in a bikeshed as any sort of
representation of what I think about this current situation. I've
largely been silent because of your history of yelling incoherently
in response to anything I say that you don't agree with.

But now you've explicitly drawn me into this discussion, I'll point
out that I'm one of very few people in the wider Linux mm/fs
community who has any *direct experience* with the cache handle
based architecture being advocated for here.

I don't agree with your assertion that cache handle based objects
are the way forward, so please read and try to understand what
I've just put a couple of hours into writing before you start
shouting. Please?

---

Ok, so this cache page descriptor/handle/object architecture has
been implemented in other operating systems. It's the solution that
Irix implemented back in the early _1990s_ via its chunk cache.

I've talked about this a few times in the past 15 years, so I guess
I'll talk about it again, e.g. at LSFMM 2014, where I said "we
don't really want to go down that path" in reference to supporting
sector sizes > PAGE_SIZE:

https://lwn.net/Articles/592101/

So, in more gory detail, here's why I don't think we really want to go
down that path...

The Irix chunk cache sat between the low layer global, disk address
indexed buffer cache[1] and the high layer per-mm-context page cache
used for mmap().

A "chunk" was a variable sized object indexed by file offset on a
per-inode AVL tree - basically the same caching architecture as our
current per-inode mapping tree uses to index pages. But unlike the
Linux page cache, these chunks were an extension of the low level
buffer cache. Hence they were also indexed by physical disk
address, and their life-cycle was managed by the buffer cache shrinker
rather than the mm-based page cache reclaim algorithms.

Chunks were built from page cache pages, and pages pointed back to
the chunk that they belonged to. Chunks needed their own locking. IO
was done based on chunks, not pages. Filesystems decided the size of
chunks, not the page cache. Pages attached to chunks could be of any
hardware supported size - the only limitation was that all pages
attached to a chunk had to be the same size. A large hardware page
in the page cache could be mapped by multiple smaller chunks. A
chunk made up of multiple hardware pages could vmap its contents if
the user needed contiguous access.[2]
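
For illustration only, the object - reconstructed from the description
above rather than from Irix source; avl_node_t and daddr_t are stand-in
types - looked something like:

struct chunk {
	avl_node_t	c_offset_node;	/* per-inode tree, file offset index */
	avl_node_t	c_daddr_node;	/* buffer cache index by disk address */
	off_t		c_offset;	/* file range this chunk caches */
	size_t		c_len;		/* variable size, chosen by the fs */
	daddr_t		c_daddr;	/* physical disk address */
	int		c_dirty;	/* writeback state lives here, not per page */
	struct page	**c_pages;	/* backing pages, all the same size */
	void		*c_vaddr;	/* optional vmap for contiguous access */
};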

Chunks were largely unaware of ongoing mmap operations. When a page
fault hit a page that had no associated chunk (e.g. one originally
populated into the page cache by a read fault into a hole, or a cached
page whose chunk the buffer cache had torn down), a new chunk had to be
built. The code needed to handle partially populated chunks in this
sort of situation was really, really nasty, as it required interacting
with the filesystem and having the filesystem take locks and call
back up into the page cache to build the new chunk in the IO path.

Similarly, dirty page state from page faults needed to be propagated
down to the chunks, because dirty tracking for writeback was done at
the chunk level, not the page cache level. This was *really* nasty,
because if the page didn't have a chunk already built, it couldn't
be built in a write fault context. Hence sweeping dirty page state
to the IO subsystem was handled periodically by a pdflush daemon,
which could work with the filesystem to build new (dirty) chunks and
insert them into the chunk cache for writeback.

Similar problems will have to be considered during design for Linux
because the dirty tracking in Linux for writeback is done at the
per-inode mapping tree level. Hence things like ->page_mkwrite are
going to have to dig through the page to the cached chunk and mark
the chunk dirty rather than the page. Whether deadlocks are going to
have to be worked around is an open question; I don't have answers
to these concerns because nobody is proposing an architecture
detailed enough to explore these situations.

This also leads to really interesting questions about how page and
chunk state w.r.t. IO is kept coherent. e.g. if we are not tracking
IO state on individual page cache pages, how do we ensure all the
pages stay stable when IO is being done to a block device that
requires stable pages? Along similar lines: what's the interlock
mechanism that we'll use to ensure that IO or truncate can lock out
per-page accesses if the filesystem IO paths no longer directly
interact with page state any more? I also wonder how will we manage
cached chunks if the filesystem currently relies on page level
locking for atomicity, concurrency and existence guarantees (e.g.
ext4 buffered IO)?

IOWs, it is extremely likely that there will still be situations
where we have to blast directly through the cache handle abstraction
to manipulate the objects behind the abstraction so that we can make
specific functionality work correctly, without regressions and/or
efficiently.

Hence the biggest issue that a chunk-like cache handle introduces is
complex multi-dimensional state update interactions. These will
require more complex locking, and that locking will be required to
work in arbitrary orders for operations to be performed safely and
atomically: e.g. IO needs inode->chunk->page order, whilst page
migration/compaction needs page->chunk->inode order. Page migration
and compaction on Irix had some unfixable deadlocks in rare corner
cases because of locking inversion problems between filesystems,
chunks, pages and mm contexts. I don't see any fundamental
difference in Linux architecture that makes me think that it will
be any different.[3]

I've got a war chest full of chunk cache related data corruption bugs
on Irix that were crazy hard to reproduce and even more difficult to
fix. At least half the bugs I had to fix in the chunk cache over
3-4 years as maintainer were data corruption bugs resulting from
inconsistencies in multi-object state updates.

I've got a whole 'nother barrel full of problem cases that revolve
around memory reclaim, too. The cache handles really need to pin the
pages that back them, and so we can't really do access optimised
per-page based reclaim of file-backed pages anymore. The Irix chunk
cache had its own LRUs and shrinker[4] to manage the life-cycles of
chunks under memory pressure, and the mm code had its own
independent page cache shrinker. Hence pages didn't get freed until
both the chunk cache and the page cache released the pages they had
references to.

IOWs, we're going to end up needing to reclaim cache handles before
we can do page reclaim. This needs careful thought and will likely
need a complete redesign of the vmscan.c algorithms to work
properly. I really, really don't want to see awful layer violations
like bufferhead reclaim getting hacked into the low layer page
reclaim algorithms happen ever again. We're still paying the price
for that.

And given the way Linux uses the mapping tree for keeping stuff like
per-page working set refault information after the pages have been
removed from the page cache, I really struggle to see how
functionality like this can be supported with a chunk based cache
index that doesn't actually have direct tracking of individual page
access and reclaim behaviour.

We're also going to need a range-based indexing mechanism for the
mapping tree if we want to avoid the inefficiencies that mapping
large objects into the XArray requires. We'll need an rcu-aware tree
of some kind, be it a btree, maple tree or something else so that we
can maintain lockless lookups of cache objects. That infrastructure
doesn't exist yet, either.
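
For reference, the way a large object goes into the mapping tree
today is as an XArray multi-index entry - roughly the sketch below,
with locking conflicts, error handling, shadow entries and accounting
all omitted (see the page cache insertion path in Willy's tree for
the real thing):

/* Insert an order-N folio as one entry occupying 2^N slots. */
XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));

xas_lock_irq(&xas);
xas_store(&xas, folio);
xas_unlock_irq(&xas);

Even as a single logical entry, the index space is still carved up in
PAGE_SIZE units - which is part of what a true range-based index
would change.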

And on that note, it is worth keeping in mind that one of the
reasons that the current linux page cache architecture scales better
for single files than the Irix architecture ever did is because the
Irix chunk cache could not be made lockless. The requirements for
atomic multi-dimensional indexing updates and coherent, atomic
multi-object state changes could never be solved in a lockless
manner. It was not for lack of trying or talent; people way
smarter than me couldn't solve that problem. So there's an open
question as to whether we can maintain existing lockless algorithms
when a chunk cache is layered over the top of the page cache.

IOWs, I see significant, fundamental problems that chunk cache
architectures suffer from. I know there are inherent problems with
state coherency, locking, complexity in the IO path, etc. Some of
these problems will not be discovered until the implementation is
well under way. Some of these problems may well be unsolvable, too.
And until there's an actual model proposed of how everything will
interact and work, we can't actually do any of this architectural
analysis to determine if it might work or not. The chunk cache
proposal is really just a grand thought experiment at this point in
time.

OTOH, folios have none of these problems and are here right now.
Sure, they have their own issues, but we can see them for what they
are given the code is already out there, and pretty much everyone
sees them as a big step forwards.

Folios don't prevent a chunk cache from being implemented. In fact,
to make folios highly efficient, we have to implement things a chunk
cache would also require - e.g. range-based cache
indexing. Unlike a chunk cache, folios don't depend on this being
done first - they stand alone without those changes, and will only
improve from making them. IOWs, you can't use the "folios being
mapped 512 times into the mapping tree" as a reason the chunk cache
is better - the chunk cache also requires this same problem to be
solved, but the chunk cache needs efficient range lookups done
*before* it is implemented, not provided afterwards as an
optimisation.

IOWs, if we want to move towards a chunk cache, the first step is to
move to folios to allow large objects in the page cache. Then we can
implement a lock-less range based index mechanism for the mapping
tree. Then we can look to replace folios with a typed cache handle
without having to worry about all the whacky multi-object coherency
problems because they only need to point to a single folio. Then we
can work out all the memory reclaim issues, locking issues, sort out
the API that filesystems use instead of folios, etc. that need to be
done when cache handles are introduced. And once we've worked
through all that, then we can add support for multiple folios within
a single cache object and discover all the really hard problems that
this exposes. At this point, the cache objects are no longer
dependent on folios to provide objects > PAGE_SIZE to the
filesystems, and we can start to remove folios from the mm code and
replace them with something else that the cache handle uses to
provide the backing store to the filesystems...

Seriously, I have given a lot of thought over the years to a chunk
cache for Linux. Right now, a chunk cache is a solution looking for
a problem to solve. Unless there's an overall architectural mm
plan that is being worked towards that requires a chunk cache, then
I just don't see the justification for doing all this work because
the first two steps above get filesystems everything they are
currently asking for. Everything else past that is really just an
experiment...

> I agree with what I think the filesystems want: instead of an untyped,
> variable-sized block of memory, I think we should have a typed page
> cache descriptor.

I don't think that's what fs devs want at all. It's what you think
fs devs want. If you'd been listening to us the same way that Willy
has been for the past year, maybe you'd have a different opinion.

Indeed, we don't actually need a new page cache abstraction.
fs/iomap already provides filesystems with a complete, efficient
page cache abstraction that only requires filesystems to provide
block mapping services. Filesystems using iomap do not interact with
the page cache at all. And David Howells is working with Willy and
all the network fs devs to build an equivalent generic netfs page
cache abstraction based on folios that is supported by the major
netfs client implementations in the kernel.

IOWs, fs devs don't need a new page cache abstraction - we've got
our own abstractions tailored directly to our needs. What we need
are API cleanups, consistency in object access mechanisms and
dynamic object size support to simplify and fill out the feature set
of the abstractions we've already built.

The reason that so many fs developers are pushing *hard* for folios is
that it provides what we've been asking for individually over the
last few years. Willy has done a great job of working with the fs
developers and getting feedback at every step of the process, and
you see that in the amount of work in progress that is already
based on folios. And it provides those cleanups and new
functionality without changing or invalidating any of the knowledge
we collectively hold about how the page cache works. That's _pure
gold_ right there.

In summary:

If you don't know anything about the architecture and limitations of
the XFS buffer cache (also read the footnotes), you'd do very well
to pay heed to what I've said in this email, considering the direct
relevance its history has to the alternative cache handle proposal
being made here. We also need to consider the evidence that
filesystems do not actually need a new page cache abstraction - they
just need the existing page cache to be able to index objects larger
than PAGE_SIZE.

So with all that in mind, I consider folios (or whatever we call
them) to be the best stepping stone towards a PAGE_SIZE independent
future that we currently have. Folios don't prevent us from
introducing a cache handle based architecture if we have a
compelling reason to do so in the future, nor do they stop anyone
working on such infrastructure in parallel if it really is
necessary. But the reality is that we don't need such a fundamental
architectural change to provide the functionality that folios
provide us with _right now_.

Folios are not perfect, but they are here and they solve many issues
we need solved. We're never going to have a perfect solution that
everyone agrees with, so the real question is "are folios good
enough?". To me the answer is a resounding yes.

Cheers,

Dave.

[1] fs/xfs/xfs_buf.c is an example of a high performance handle
based, variable object size cache that abstracts away the details of
the data store being allocated from slab, discontiguous pages,
contiguous pages or [2] vmapped memory. It is basically a two decade
old re-implementation of the Irix low layer global disk-addressed
buffer cache, modernised and tailored directly to the needs of XFS
metadata caching.

[3] Keep in mind that the xfs_buf cache used to be page cache
backed. The page cache provided the caching and memory reclaim
infrastructure to the xfs_buf handles - and so we do actually have
recent direct experience on Linux with the architecture you are
proposing here. This architecture proved to be a major limitation
from performance, multi-object state coherency and cache residency
prioritisation aspects. It really sucked with systems that had 64KB
page sizes and 4kB metadata block sizes, and ....

[4] So we went back to the old Irix way of managing the cache - our
own buffer based LRUs and aging mechanisms, with memory reclaim run
by shrinkers based on buffer-type based priorities. We use bulk page
allocation for buffers that are >= PAGE_SIZE, and slab allocation for
those < PAGE_SIZE. That's exactly what you are suggesting we do with 2MB
sized base pages, but without having to care about mmap() at all.
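
In code terms, that policy is roughly the following - this is not the
literal fs/xfs/xfs_buf.c code, and xfs_buf_data_cache is an invented
name standing in for the real slab cache:

/* Sketch of the xfs_buf backing memory policy described above. */
if (size < PAGE_SIZE) {
        /* Sub-page buffers come out of a slab cache. */
        bp->b_addr = kmem_cache_alloc(xfs_buf_data_cache, GFP_NOFS);
} else {
        /* Larger buffers grab all their pages in one bulk call
         * (retry handling on partial allocation omitted). */
        alloc_pages_bulk_array(GFP_NOFS, nr_pages, bp->b_pages);
}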

--
Dave Chinner
david@fromorbit.com
Re: Folio discussion recap
On Fri, Sep 17, 2021 at 03:24:40PM +1000, Dave Chinner wrote:
> Folios are not perfect, but they are here and they solve many issues
> we need solved. We're never going to have a perfect solution that
> everyone agrees with, so the real question is "are folios good
> enough?". To me the answer is a resounding yes.

Besides agreeing with all you said, the other important part is:
even if we were to eventually go with Johannes' grand plans (which I
disagree with in many aspects), what is the harm in doing folios now?

Despite all the fuss, the pending folio PR does nothing but add type
safety to compound pages. Which is something we badly need, no matter
what kind of other caching grand plans people have.
Re: Folio discussion recap
On Fri, Sep 17, 2021 at 03:24:40PM +1000, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 12:54:22PM -0400, Johannes Weiner wrote:
> > I agree with what I think the filesystems want: instead of an untyped,
> > variable-sized block of memory, I think we should have a typed page
> > cache descriptor.
>
> I don't think that's what fs devs want at all. It's what you think
> fs devs want. If you'd been listening to us the same way that Willy
> has been for the past year, maybe you'd have a different opinion.

I was going off of Darrick's remarks about non-pagecache uses, Kent's
remarks about simple and obvious core data structures, and yes
your suggestion of "cache page".

But I think you may have overinterpreted what I meant by cache
descriptor:

> Indeed, we don't actually need a new page cache abstraction.

I didn't suggest to change what the folio currently already is for the
page cache. I asked to keep anon pages out of it (and in the future
potentially other random stuff that is using compound pages).

It doesn't have any bearing on how it presents to you on the
filesystem side, other than that it isn't as overloaded as struct page
is with non-pagecache stuff.

A full-on disconnect between the cache entry descriptor and the page
is something that came up during speculation on how the MM will be
able to effectively raise the page size and meet scalability
requirements on modern hardware - and in that context I do appreciate
you providing background information on the chunk cache, which will be
valuable to inform *that* discussion.

But it isn't what I suggested as the immediate action to unblock the
folio merge.

> The reason that so many fs developers are pushing *hard* for folios is
> that it provides what we've been asking for individually over the
> last few years.

I'm not sure filesystem people are pushing hard for non-pagecache
stuff to be in the folio.

> Willy has done a great job of working with the fs developers and
> getting feedback at every step of the process, and you see that in
> the amount of work in progress that is already based on
> folios.

And that's great, but the folio is blocked on MM questions:

1. Is the folio a good descriptor for all uses of anon and file pages
inside MM code way beyond the page cache layer YOU care about?

2. Are compound pages a scalable, future-proof allocation strategy?

For some people the answers are yes, for others they are a no.

For 1), the value proposition is to clean up the relatively recent
head/tail page confusion. And though everybody agrees that there is
value in that, it's a LOT of churn for what it does. Several people
have pointed this out, and AFAICS this is the most common reason for
people that have expressed doubt or hesitation over the patches.

In an attempt to address this, I pointed out the cleanup opportunities
that would open up by using separate anon and file folio types instead
of one type for both. Nothing more. No intermediate thing, no chunk
cache. Doesn't affect you. Just taking Willy's concept of type safety
and applying it to file and anon instead of page vs compound page.

- It wouldn't change anything for fs people from the current folio
patchset (except maybe the name)

- It would accomplish the head/tail page cleanup the same way, since
just like a folio, a "file folio" could also never be a tail page

- It would take the same solution folio prescribes to the compound
page issue (explicit typing to get rid of useless checks, lookups
and subtle bugs) and solve way more instances of this all over MM
code, thereby hopefully boosting the value proposition and making
*that part* of the patches a clearer win for the MM subsystem

This is a question directed at MM people, not filesystem people. It
doesn't pertain to you at all.

And if MM people agree or want to keep discussing it, the relatively
minor action item for the folio patch is the same: drop the partial
anon-to-folio conversion bits inside MM code for now and move on.

For 2), nobody knows the answer to this. Nobody. Anybody who claims to
do so is full of sh*t. Maybe compound pages work out, maybe they
don't. We can talk a million years about larger page sizes, how to
handle internal fragmentation, the difficulties of implementing a
chunk cache, but it's completely irrelevant because it's speculative.

We know there are multiple page sizes supported by the hardware and
the smallest supported one is no longer the most dominant one. We do
not know for sure yet how the MM is internally going to lay out its
type system so that the allocator, mmap, page reclaim etc. can be CPU
efficient and the descriptors be memory efficient.

Nobody's "grand plan" here is any more viable, tested or proven than
anybody else's.

My question for fs folks is simply this: as long as you can pass a
folio to kmap and mmap and it knows what to do with it, is there any
filesystem relevant requirement that the folio map to 1 or more
literal "struct page", and that folio_page(), folio_nr_pages() etc be
part of the public API? Or can we keep this translation layer private
to MM code? And will page_folio() be required for anything beyond the
transitional period away from pages?

Can we move things not used outside of MM into mm/internal.h, mark the
transitional bits of the public API as such, and move on?

The unproductive vitriol, personal attacks and dismissiveness over
relatively minor asks and RFCs from the subsystem that is the most
impacted by this patchset is just nuts.
Re: Folio discussion recap
On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> I didn't suggest to change what the folio currently already is for the
> page cache. I asked to keep anon pages out of it (and in the future
> potentially other random stuff that is using compound pages).

It would mean that anon-THP cannot benefit from the work Willy did with
folios. Anon-THP is the most active user of compound pages at the moment
and it also suffers from the compound_head() plague. You ask to exclude
anon-THP citing *possible* future benefits for pagecache.

Sorry, but this doesn't sound fair to me.

We already had a similar experiment with PAGE_CACHE_SIZE. It was introduced
with the hope of having PAGE_CACHE_SIZE != PAGE_SIZE one day. It never happened
and only caused confusion on the border between pagecache-specific code
and generic code that handled both file and anon pages.

If you want to limit usage of the new type to pagecache, the burden is on
you to prove that it is useful and not just a dead weight.

--
Kirill A. Shutemov
Re: Folio discussion recap
Snipped, reordered:

On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> 2. Are compound pages a scalable, future-proof allocation strategy?
>
> For 2), nobody knows the answer to this. Nobody. Anybody who claims to
> do so is full of sh*t. Maybe compound pages work out, maybe they
> don't. We can talk a million years about larger page sizes, how to
> handle internal fragmentation, the difficulties of implementing a
> chunk cache, but it's completely irrelevant because it's speculative.

Calling it compound pages here is a misnomer, and it confuses the discussion.
The question is really about whether we should start using higher order
allocations for data in the page cache, and perhaps a better way of framing that
question is: should we continue to fragment all our page cache allocations up
front into individual pages?

But I don't think this is really the blocker.

> 1. Is the folio a good descriptor for all uses of anon and file pages
> inside MM code way beyond the page cache layer YOU care about?
>
> For some people the answers are yes, for others they are a no.

The anon page conversion does seem to be where all the disagreement is coming
from.

So my ask, to everyone involved is - if anonymous pages are dropped from the
folio patches, do we have any other real objections to the patch series?

It's an open question as to how much anonymous pages are like file pages, and if
we continue down the route of splitting up struct page into separate types
whether anonymous pages should be the same type as file pages.

Also, it appears even file pages aren't fully converted to folios in Willy's
patch set - grepping around reveals plenty of references to struct page left in
fs/. I think that even if anonymous pages are going to become folios it's a
pretty reasonable ask for that to wait a cycle or two and see how the conversion
of file pages fully plays out.

Also: it's become pretty clear to me that we have crappy communications between
MM developers and filesystem developers. Internally both teams have solid
communications - I know in filesystem land we all talk to each other and are
pretty good at working collaboratively, and it sounds like the MM team also has
good internal communications. But we seem to have some problems with tackling
issues that cross over between FS and MM land, or awkwardly sit between them.

Perhaps this is something we could try to address when picking conference topics
in the future. Johannes also mentioned a monthly group call the MM devs schedule
- I wonder if it would be useful to get something similar going between MM and
interested parties in filesystem land.
Re: Folio discussion recap
On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > I didn't suggest to change what the folio currently already is for the
> > page cache. I asked to keep anon pages out of it (and in the future
> > potentially other random stuff that is using compound pages).
>
> It would mean that anon-THP cannot benefit from the work Willy did with
> folios. Anon-THP is the most active user of compound pages at the moment
> and it also suffers from the compound_head() plague. You ask to exclude
> anon-THP citing *possible* future benefits for pagecache.
>
> Sorry, but this doesn't sound fair to me.

I'm less concerned with what's fair than figuring out what the consensus is so
we can move forward. I agree that anonymous THPs could benefit greatly from
conversion to folios - but looking at the code it doesn't look like much of that
has been done yet.

I understand you've had some input into the folio patches, so maybe you'd be
best able to answer while Matthew is away - would it be fair to say that, in the
interests of moving forward, anonymous pages could be split out for now? That
way the MM people gain time to come to their own consensus and we can still
unblock the FS work that's already been done on top of folios.
Re: Folio discussion recap
On Fri, Sep 17, 2021 at 05:17:09PM -0400, Kent Overstreet wrote:
> On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > > I didn't suggest to change what the folio currently already is for the
> > > page cache. I asked to keep anon pages out of it (and in the future
> > > potentially other random stuff that is using compound pages).
> >
> > It would mean that anon-THP cannot benefit from the work Willy did with
> > folios. Anon-THP is the most active user of compound pages at the moment
> > and it also suffers from the compound_head() plague. You ask to exclude
> > anon-THP citing *possible* future benefits for pagecache.
> >
> > Sorry, but this doesn't sound fair to me.
>
> I'm less concerned with what's fair than figuring out what the consensus is so
> we can move forward. I agree that anonymous THPs could benefit greatly from
> conversion to folios - but looking at the code it doesn't look like much of that
> has been done yet.
>
> I understand you've had some input into the folio patches, so maybe you'd be
> best able to answer while Matthew is away - would it be fair to say that, in the
> interests of moving forward, anonymous pages could be split out for now? That
> way the MM people gain time to come to their own consensus and we can still
> unblock the FS work that's already been done on top of folios.

I can't answer for Matthew.

The anon conversion patchset doesn't exist yet (but it is planned) so
there's nothing to split out. Once someone comes up with such a patchset
they will have to sell it upstream on its own merit.

Possible future efforts should not block the code at hand. "Talk is cheap.
Show me the code."

--
Kirill A. Shutemov
Re: Folio discussion recap
On Sat, Sep 18, 2021 at 01:02:09AM +0300, Kirill A. Shutemov wrote:
> I can't answer for Matthew.
>
> The anon conversion patchset doesn't exist yet (but it is planned) so
> there's nothing to split out. Once someone comes up with such a patchset
> they will have to sell it upstream on its own merit.

Perhaps we've been operating under some incorrect assumptions then. If the
current patch series doesn't actually touch anonymous pages - the patch series
does touch code in e.g. mm/swap.c, but looking closer it might just be due to
the (mis)organization of the current code - maybe there aren't any real
objections left?
Re: Folio discussion recap
On Fri, Sep 17, 2021 at 05:13:10PM -0400, Kent Overstreet wrote:
> Also: it's become pretty clear to me that we have crappy
> communications between MM developers and filesystem
> developers.

I think one of the challenges has been the lack of an LSF/MM since
2019. And it may be that having *some* kind of ad hoc technical
discussion given that LSF/MM in 2021 is not happening might be a good
thing. I'm sure if we asked nicely, we could use the LPC
infrastructure to set up something, assuming we can find a mutually
agreeable day or dates.

> Internally both teams have solid communications - I know
> in filesystem land we all talk to each other and are pretty good at
> working collaboratively, and it sounds like the MM team also has good
> internal communications. But we seem to have some problems with
> tackling issues that cross over between FS and MM land, or awkwardly
> sit between them.

That's a bit of an over-generalization; it seems like we've uncovered
that some of the disagreements are between different parts of the MM
community over the suitability of folios for anonymous pages.

And it's interesting, because I don't really consider Willy to be one
of "the FS folks" --- and he has been quite diligent to reaching out
to a number of folks in the FS community about our needs, and it's
clear that this has been really, really helpful. There's no question
that we've had for many years some difficulties in the code paths that
sit between FS and MM, and I'd claim that it's not just because of
communications, but the relative lack of effort that was focused in
that area. The fact that Willy has spent the last 9 months working on
FS / MM interactions has been really great, and I hope it continues.

That being said, it sounds like there are issues internal to the MM
devs that still need to be ironed out, and at the risk of throwing the
anon-THP folks under the bus, if we can land at least some portion of
the folio commits, it seems like that would be a step in the right
direction.

Cheers,

- Ted
Re: Folio discussion recap
On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > I didn't suggest to change what the folio currently already is for the
> > page cache. I asked to keep anon pages out of it (and in the future
> > potentially other random stuff that is using compound pages).
>
> It would mean that anon-THP cannot benefit from the work Willy did with
> folios. Anon-THP is the most active user of compound pages at the moment
> and it also suffers from the compound_head() plague. You ask to exclude
> anon-THP citing *possible* future benefits for pagecache.
>
> Sorry, but this doesn't sound fair to me.

Hold on Kirill. I'm not saying we shouldn't fix anonthp. But let's
clarify the actual code in question in this specific patchset. You say
anonthp cannot benefit from folio, but in the other email you say this
patchset isn't doing the conversion yet.

The code I'm specifically referring to here is the conversion of some
code that encounters both anon and file pages - swap.c, memcontrol.c,
workingset.c, and a few other places. It's a small part of the folio
patches, but it's a big deal for the MM code conceptually.

I'm requesting to drop those and just keep the page cache bits. Not
because I think anonthp shouldn't be fixed, but because I think we're
not in agreement yet on how they should be fixed. And it's somewhat
independent of fixing the page cache interface now, which people are
waiting on much more desperately and acutely than we inside MM wait
for a struct page cleanup. It's not good to hold them while we argue.

Dropping the anon bits isn't final. Depending on how our discussion
turns out, we can still put them in later or we can put in something
new. The important thing is that the uncontroversial page cache bits
aren't held up any longer while we figure it out.

> If you want to limit usage of the new type to pagecache, the burden is on
> you to prove that it is useful and not just a dead weight.

I'm not asking to add anything to the folio patches, just to remove
some bits around the edges. And for the page cache bits: I think we
have a rather large number of folks really wanting those. Now.

Again, I think we should fix anonthp. But I also think we should
really look at struct page more broadly. And I think we should have
that discussion inside a forum of MM people that truly care.

I'm just trying to unblock the fs folks at this point and merge what
we can now.
Re: Folio discussion recap
On 9/17/21 6:25 PM, Theodore Ts'o wrote:
> On Fri, Sep 17, 2021 at 05:13:10PM -0400, Kent Overstreet wrote:
>> Also: it's become pretty clear to me that we have crappy
>> communications between MM developers and filesystem
>> developers.
>
> I think one of the challenges has been the lack of an LSF/MM since
> 2019. And it may be that having *some* kind of ad hoc technical
> discussion given that LSF/MM in 2021 is not happening might be a good
> thing. I'm sure if we asked nicely, we could use the LPC
> infrastructure to set up something, assuming we can find a mutually
> agreeable day or dates.
>

We have a slot for this in the FS MC, first slot actually, so hopefully
we can get things hashed out there. Thanks,

Josef
Re: Folio discussion recap
On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> My question for fs folks is simply this: as long as you can pass a
> folio to kmap and mmap and it knows what to do with it, is there any
> filesystem relevant requirement that the folio map to 1 or more
> literal "struct page", and that folio_page(), folio_nr_pages() etc be
> part of the public API?

In the short term, yes, we need those things in the public API.
In the long term, not so much.

We need something in the public API that tells us the offset and
size of the folio. Lots of page cache code currently does stuff like
calculate the size or iteration counts based on the difference of
page->index values (i.e. number of pages) and iterate page by page.
A direct conversion of such algorithms increments by
folio_nr_pages() instead of 1. So stuff like this is definitely
necessary as public APIs in the initial conversion.
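
i.e. the conversion pattern looks roughly like this - hand-waved,
with locking, holes and error handling omitted, and with
filemap_get_folio() standing in for whatever the lookup primitive
ends up being called:

/* Old style: iterate the range one PAGE_SIZE unit at a time. */
pgoff_t index = start;
while (index < end) {
        struct page *page = find_get_page(mapping, index);
        /* ... work on one page ... */
        put_page(page);
        index++;
}

/* Folio style: step over however many pages the folio covers. */
index = start;
while (index < end) {
        struct folio *folio = filemap_get_folio(mapping, index);
        /* ... work on the whole folio at once ... */
        index = folio->index + folio_nr_pages(folio);
        folio_put(folio);
}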

Let's face it, folio_nr_pages() is a huge improvement on directly
exposing THP/compound page interfaces to filesystems and leaving
them to work it out for themselves. So even in the short term, these
API members represent a major step forward in mm API cleanliness.

As for long term, everything in the page cache API needs to
transition to byte offsets and byte counts instead of units of
PAGE_SIZE and page->index. That's a more complex transition, but
AFAIA that's part of the future work Willy is intended to do with
folios and the folio API. Once we get away from accounting and
tracking everything as units of struct page, all the public facing
APIs that use those units can go away.
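
folio_pos() and folio_size() already point in that direction; a
byte-based version of the loop above might eventually look something
like this sketch:

loff_t pos = start_byte;
while (pos < end_byte) {
        struct folio *folio =
                filemap_get_folio(mapping, pos >> PAGE_SHIFT);

        /* ... work on folio_size(folio) bytes at folio_pos(folio) ... */
        pos = folio_pos(folio) + folio_size(folio);
        folio_put(folio);
}

Note the lookup argument is still a page index - exactly the sort of
page-sized unit that would disappear once the indexing API itself
goes byte-ranged.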

It's fairly slow to do this, because we have so much code that is
doing stuff like converting file offsets between byte counts and
page counts and vice versa. And it's not necessary to do an initial
conversion to folios, either. But once everything in the page cache
indexing API moves to byte ranges, the need to count pages, use page
counts as ranges, iterate by page index, etc all goes away and
hence those APIs can also go away.

As for converting between folios and pages, we'll need those sorts
of APIs for the foreseeable future because low level storage layers
and hardware use pages for their scatter gather arrays and at some
point we've got to expose those pages from behind the folio API.
Even if we replace struct page with some other hardware page
descriptor, we're still going to need such translation APIs at some
point in the stack....

> Or can we keep this translation layer private
> to MM code? And will page_folio() be required for anything beyond the
> transitional period away from pages?

No idea, but as per above I think it's a largely irrelevant concern
for the foreseeable future because pages will be here for a long time
yet.

> Can we move things not used outside of MM into mm/internal.h, mark the
> transitional bits of the public API as such, and move on?

Sure, but that's up to you to do as a patch set on top of Willy's
folio trees if you think it improves the status quo. Write the
patches and present them for review just like everyone else does,
and they can be discussed on their merits in that context rather
than being presented as a reason for blocking current progress on
folios.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
Re: Folio discussion recap
On Sat, Sep 18, 2021 at 11:04:40AM +1000, Dave Chinner wrote:
> As for long term, everything in the page cache API needs to
> transition to byte offsets and byte counts instead of units of
> PAGE_SIZE and page->index. That's a more complex transition, but
> AFAIA that's part of the future work Willy is intended to do with
> folios and the folio API. Once we get away from accounting and
> tracking everything as units of struct page, all the public facing
> APIs that use those units can go away.

Probably 95% of the places we use page->index and page->mapping aren't necessary
because we've already got that information from the context we're in and
removing them would be a useful cleanup - if we've already got that from context
(e.g. we're looking up the page in the page cache, via i_pages) eliminating the
page->index or page->mapping use means we're getting rid of a data dependency so
it's good for performance - but more importantly, those (much fewer) places in
the code where we actually _do_ need page->index and page->mapping are really
important places to be able to find because they're interesting boundaries
between different components in the VM.
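
e.g. when we're already walking i_pages, the iterator knows the
index, so touching page->index is pure overhead - a quick sketch:

XA_STATE(xas, &mapping->i_pages, start);
struct page *page;

rcu_read_lock();
xas_for_each(&xas, page, end) {
        /*
         * xas.xa_index already says where we are; loading
         * page->index would add a dependent load on the page's
         * cacheline for information we have for free.
         */
        pgoff_t index = xas.xa_index;
        /* ... */
}
rcu_read_unlock();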
Re: Folio discussion recap
On Sat, Sep 18, 2021 at 12:51:50AM -0400, Kent Overstreet wrote:
> On Sat, Sep 18, 2021 at 11:04:40AM +1000, Dave Chinner wrote:
> > As for long term, everything in the page cache API needs to
> > transition to byte offsets and byte counts instead of units of
> > PAGE_SIZE and page->index. That's a more complex transition, but
> > AFAIA that's part of the future work Willy is intended to do with
> > folios and the folio API. Once we get away from accounting and
> > tracking everything as units of struct page, all the public facing
> > APIs that use those units can go away.
>
> Probably 95% of the places we use page->index and page->mapping aren't necessary
> because we've already got that information from the context we're in and
> removing them would be a useful cleanup

*nod*

> - if we've already got that from context
> (e.g. we're looking up the page in the page cache, via i_pages) eliminating the
> page->index or page->mapping use means we're getting rid of a data dependency so
> it's good for performance - but more importantly, those (much fewer) places in
> the code where we actually _do_ need page->index and page->mapping are really
> important places to be able to find because they're interesting boundaries
> between different components in the VM.

*nod*

This is where infrastructure like write_cache_pages() is
problematic. It's not actually a component of the VM - it's core
page cache/filesystem API functionality - but the implementation is
determined by the fact there is no clear abstraction between the
page cache and the VM, and so while the filesystem side of the API is
byte-range based, the VM side is struct page based and the
impedance mismatch has to be handled in the page cache
implementation.

Folios are definitely pointing out issues like this whilst, IMO,
demonstrating that an abstraction like folios is also a necessary
first step to address the problems they make obvious...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
Re: Folio discussion recap
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> Q: Oh yeah, but what again are folios for, exactly?
>
> Folios are for cached filesystem data which (importantly) may be mapped to
> userspace.
>
> So when MM people see a new data structure come up with new references to page
> size - there's a very good reason for that, which is that we need to be
> allocating in multiples of the hardware page size if we're going to be able to
> map it to userspace and have PTEs point to it.
>
> So going forward, if the MM people want struct page to refer to multiple hardware
> pages - this shouldn't prevent that, and folios will refer to multiples of the
> _hardware_ page size, not struct page pagesize.
>
> Also - all the filesystem code that's being converted tends to talk and think in
> units of pages. So going forward, it would be a nice cleanup to get rid of as
> many of those references as possible and just talk in terms of bytes (e.g. I
> have generally been trying to get rid of references to PAGE_SIZE in bcachefs
> wherever reasonable, for other reasons) - those cleanups are probably for
> another patch series, and in the interests of getting this patch series merged
> with the fewest introduced bugs possible we probably want the current helpers.

I'd like to thank those who reached out off-list. Some of you know I've
had trouble with depression in the past, and I'd like to reassure you
that that's not a problem at the moment. I had a good holiday, and I
was able to keep from thinking about folios most of the time.

I'd also like to thank those who engaged in the discussion while I was
gone. A lot of good points have been made. I don't think the normal
style of replying to each email individually makes a lot of sense at
this point, so I'll make some general comments instead. I'll respond
to the process issues on the other thread.

I agree with the feeling a lot of people have expressed, that struct page
is massively overloaded and we would do much better with stronger typing.
I like it when the compiler catches bugs for me. Disentangling struct
page is something I've been working on for a while, and folios are a
step in that direction (in that they remove the two types of tail page
from the universe of possibilities).

I don't believe it is realistic to disentangle file pages and anon
pages from each other. Thanks to swap and shmem, both file pages and
anon pages need to be able to be moved in and out of the swap cache.
The swap cache shares a lot of code with the page cache, so changing
how the swap cache works is also tricky.

What I do believe is possible is something Kent hinted at; treating anon
pages more like file pages. I also believe that shmem should be able to
write pages to swap without moving the pages into the swap cache first.
But these two things are just beliefs. I haven't tried to verify them
and they may come to nothing.

I also want to split out slab_page and page_table_page from struct page.
I don't intend to convert either of those to folios.

I do want to make struct page dynamically allocated (and have for
a while). There are some complicating factors ...

There are two primary places where we need to map from a physical
address to a "memory descriptor". The one that most people care about
is get_user_pages(). We have a page table entry and need to increment
the refcount on the head page, possibly mark the head page dirty, but
also return the subpage of any compound page we find. The one that far
fewer people care about is memory-failure.c; we also need to find the
head page to determine what kind of memory has been affected, but we
need to mark the subpage as HWPoison.

Both of these need to be careful to not confuse tail and non-tail pages.
So yes, we need to use folios for anything that's mappable to userspace.
That's not just anon & file pages but also network pools, graphics card
memory and vmalloc memory. Eventually, I think struct page actually goes
down to a union of a few words of padding, along with ->compound_head.
Because that's all we're guaranteed is actually there; everything else
is only there in head pages.
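
For the curious, this is also why ->compound_head is the one field a
tail page really needs - the sketch below is roughly the current
implementation:

static inline struct page *compound_head(struct page *page)
{
        unsigned long head = READ_ONCE(page->compound_head);

        /*
         * Tail pages have bit 0 set in ->compound_head, with the
         * rest of the word pointing at the head page.
         */
        if (unlikely(head & 1))
                return (struct page *)(head - 1);
        return page;
}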

There are a lot of places that should use folios which the current
patchset doesn't convert. I prioritised filesystems because we've got
~60 filesystems to convert, and working on the filesystems can proceed
in parallel with working on the rest of the MM. Also, if I converted
the entire MM at once, there would be complaints that a 600 patch series
was unreviewable. So here we are, there's a bunch of compatibility code
that indicates areas which still need to be converted.

I'm sure I've missed things, but I've been working on this email all
day and wanted to send it out before going to sleep.
Re: Folio discussion recap
On Fri, Sep 17, 2021 at 07:15:40PM -0400, Johannes Weiner wrote:
> The code I'm specifically referring to here is the conversion of some
> code that encounters both anon and file pages - swap.c, memcontrol.c,
> workingset.c, and a few other places. It's a small part of the folio
> patches, but it's a big deal for the MM code conceptually.

Hard to say without actually trying, but my worry here is that this may lead
to code duplication to separate the file and anon code paths. I dunno.

--
Kirill A. Shutemov
