
Folio discussion recap
So I've been following the folio discussion, and it seems like it has gone off
the rails a bit, partly just because struct page is such a mess and has been so
overused; we all want to see that cleaned up, but we're not being clear about
what that means. I was just talking with Johannes off list, and I thought I'd
recap that discussion, as well as other talks with Matthew, and see if I can lay
something out that everyone agrees with.

Some background:

For some years now, the overhead of dealing with 4k pages in the page cache has
gotten really, really painful. Any time we're doing buffered IO, we end up
walking a radix tree to get to the cached page, then doing a memcpy to or from
that page - which quite conveniently blows away the CPU cache - then walking
the radix tree to look up the next page, often touching locks along the way that
are no longer in cache - it's really bad.
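
To spell out the pattern (a rough sketch of the per-page loop, not the actual
generic_file_buffered_read() code):

        while (iov_iter_count(iter)) {
                /* radix tree / xarray walk, touching locks that may be cold: */
                struct page *page = find_get_page(mapping, index);

                if (!page)
                        break;  /* miss path: readahead, allocation, IO wait - elided */

                /* the memcpy that blows away the CPU cache: */
                copied = copy_page_to_iter(page, offset,
                                min(iov_iter_count(iter), PAGE_SIZE - offset), iter);
                put_page(page);

                /* ...then walk the tree all over again for the next 4k page */
                index++;
                offset = 0;
        }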

We've been hacking around this - the btrfs people have a vectorized buffered
write path, and this is also what my generic_file_buffered_read() patches were
about: batching up the page cache lookups - but really these are hacks that make
our core IO paths even more complicated, when the right answer that's been
staring all of us filesystem people in the face for years is that it's
2021 and dealing with cached data in 4k chunks (when block based filesystems are
a thing of the past!) is abject stupidity.

So we need to be moving to larger, variable sized allocations for cached data.
We NEED this, this HAS TO HAPPEN - spend some time really digging into profiles
and looking at actual application usage; this is the #1 thing that's killing our
performance in the IO paths. Remember, we developers tend to be benchmarking
things like direct IO and small random IOs because we're looking at the whole IO
path, but most reads and writes are buffered, and they're already in cache, and
they're mostly big and sequential.

I emphasize this because a lot of us have really been waiting rather urgently
for Willy's work to go in, and there will no doubt be a lot more downstream
filesystem work to be done to fully take advantage of it. We're waiting on
this stuff to get merged so we can actually start testing and profiling the
brave new world and seeing what to work on next.

As an aside, before this there have been quite a few attempts at using
hugepages to deal with these issues, and they're all _fucking gross_, because
they all do if (normal page) else if (hugepage), and they all cut and paste
filemap.c code because no one (rightly) wanted to add their abortions to the
main IO paths. But look around the kernel and see how many times you can find
core filemap.c code duplicated elsewhere... Anyways, Willy's work is going to
let us delete all that crap.

So: this all means that filesystem code needs to start working in larger,
variable sized units, which today means - compound pages. Hence, the folio work
started out as a wrapper around compound pages.

So, one objection to folios has been that they leak too many MM details out into
the filesystem code. To that we must point out: all the code that's going to be
using folios is right now using struct page - this isn't leaking out new details
and making things worse, this is actually (potentially!) a step in the right
direction, by moving some users of struct page to a new type that is actually
created for a specific purpose.

I think a lot of the acrimony in this discussion came precisely from this mess;
Johannes and the other MM people would like to see this situation improved so
that they have more freedom to reengineer and improve things on their side. One
particularly noteworthy idea was having struct page refer to multiple hardware
pages, and using slab/slub for larger allocations. In my view, the primary reason
for making this change isn't the memory overhead of struct page (though reducing
that would be nice); it's that the slab allocator is _significantly_ faster than
the buddy allocator (the buddy allocator isn't percpu!) and as average
allocation sizes increase, this is hurting us more and more over time.

So we should listen to the MM people.

Fortunately, Matthew made a big step in the right direction by making folios a
new type. Right now, struct folio is not separately allocated - it's just
unionized/overlayed with struct page - but perhaps in the future they could be
separately allocated. I don't think that is a remotely realistic goal for _this_
patch series given the amount of code that touches struct page (think: writeback
code, LRU list code, page fault handlers!) - but I think that's a goal we could
keep in mind going forward.

We should also be clear on what _exactly_ folios are for, so they don't become
the new dumping ground for everyone to stash their crap. They're to be a new
core abstraction, and we should endeavor to keep our core data structures
_small_, and _simple_. So: no scatter gather. A folio should just represent a
single buffer of physically contiguous memory - vmap is slow, kmap_atomic() only
works on single pages, we do _not_ want to make filesystem code jump through
hoops to deal with anything else. The buffers should probably be power of two
sized, as that's what the buddy allocator likes to give us - that doesn't
necessarily have to be baked into the design, but I can't see us ever actually
wanting non power of two sized allocations.

Q: But what about fragmentation? Won't these allocations fail sometimes?

Yes, and that's OK. The relevant filesystem code is all changing to handle
variable sized allocations, so it's completely fine if we fail a 256k allocation
and we have to fall back to whatever is available.
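
E.g. something like this (just a sketch - folio_alloc() here stands in for
whatever the allocation helper ends up being called):

        unsigned int order = get_order(SZ_256K);        /* what we'd like to get */
        struct folio *folio;

        /* 256k not available? Fall back to 128k, 64k, ... down to a single page. */
        while (!(folio = folio_alloc(GFP_NOFS, order)) && order > 0)
                order--;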

But also keep in mind that switching the biggest consumer of kernel side memory
to larger allocations is going to do more than anything else to help prevent
memory from getting fragmented in the first place. We _want_ this.

Q: Oh yeah, but what again are folios for, exactly?

Folios are for cached filesystem data which (importantly) may be mapped to
userspace.

So when MM people see a new data structure come up with new references to page
size - there's a very good reason for that, which is that we need to be
allocating in multiples of the hardware page size if we're going to be able to
map it to userspace and have PTEs point to it.

So going forward, if the MM people want struct page to refer to multiple hardware
pages - this shouldn't prevent that, and folios will refer to multiples of the
_hardware_ page size, not whatever size struct page ends up covering.

Also - all the filesystem code that's being converted tends to talk and think in
units of pages. So going forward, it would be a nice cleanup to get rid of as
many of those references as possible and just talk in terms of bytes (e.g. I
have generally been trying to get rid of references to PAGE_SIZE in bcachefs
wherever reasonable, for other reasons) - those cleanups are probably for
another patch series, and in the interests of getting this patch series merged
with the fewest introduced bugs possible we probably want the current helpers.
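
E.g. (illustrative only, using folio_size() from the series):

        /* thinking in pages: */
        if (len > PAGE_SIZE - offset)
                len = PAGE_SIZE - offset;

        /* thinking in bytes - agnostic of what actually backs the cache entry: */
        if (len > folio_size(folio) - offset)
                len = folio_size(folio) - offset;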

-------------

That's my recap, I hope I haven't missed anything. The TL;DR is:

* struct page is a mess; yes, we know. We're all living with that pain.

* This isn't our ultimate end goal (nothing ever is!) - but it's probably along
the right path.

* Going forward: maybe struct folio should be separately allocated. That will
entail a lot more work, so it's not appropriate for this patch series, but I
think it's a goal that would make everyone happy.

* We should probably think and talk more concretely about what our end goals
are.

Getting away from struct page is something that comes up again and again - DAX
is another notable (and acrimonious) area where this has come up. Also, page->mapping
and page->index make sharing cached data between different files (think: reflink,
snapshots) pretty much a non-starter.

I'm going to publicly float one of my own ideas here: maybe entries in the page
cache radix tree don't have to be just a single pointer/ulong. If those entries
were bigger, perhaps some things would fit better there than in either struct
page/folio.


Excessive PAGE_SIZE usage:
--------------------------

Another thing that keeps coming up is - indiscriminate use of PAGE_SIZE makes it
hard, especially when we're reviewing new code, to tell what's a legitimate use
or not. When it's tied to the hardware page size (as folios are), it's probably
legitimate, but PAGE_SIZE is _way_ overused.

Partly this was because historically slab had to be used for small allocations
and the buddy allocator, __get_free_pages(), had to be used for larger
allocations. This is still somewhat the case - slab can go up to something like
128k, but there's still a hard cap on allocation size with kmalloc().

Perhaps the MM people could look into lifting this restriction, so that
kmalloc() could be used for any sized physically contiguous allocation that the
system could satisfy? If we had this, then it would make it more practical to
go through and refactor existing code that uses __get_free_pages() and convert
it to kmalloc(), without having to stare at code and figure out if it's safe.
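
i.e. conversions of this shape (a sketch - today, deciding whether any given
call site can be converted safely is exactly the part that requires staring at
the code):

        /* before: order-based, straight from the buddy allocator */
        buf = (void *)__get_free_pages(GFP_KERNEL, get_order(size));
        ...
        free_pages((unsigned long)buf, get_order(size));

        /* after: byte-based, assuming kmalloc() could satisfy any size */
        buf = kmalloc(size, GFP_KERNEL);
        ...
        kfree(buf);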

And that's my $.02
Re: Folio discussion recap
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> So we should listen to the MM people.

Count me here.

I think the problem with folio is that everybody wants to read her/his
hopes and dreams into it and gets disappointed when they see that their somewhat
related problem doesn't get magically fixed with folio.

Folio started as a way to relieve the pain of dealing with compound pages.
It provides a unified view of base pages and compound pages. That's it.

It is required groundwork for wider adoption of compound pages in the page
cache. But it will also be useful for anon THP and hugetlb.

Based on the adoption rate and the resulting code, the new abstraction has nice
downstream effects. It may be suitable for more than it was initially intended
for. That's great.

But if it doesn't solve your problem... well, sorry...

The patchset makes a nice step forward and cuts back on the mess I created on
the way to huge-tmpfs.

I would be glad to see the patchset upstream.

--
Kirill A. Shutemov
Re: Folio discussion recap
On Sat 11-09-21 04:23:24, Kirill A. Shutemov wrote:
> On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> > So we should listen to the MM people.
>
> Count me here.
>
> I think the problem with folio is that everybody wants to read in her/his
> hopes and dreams into it and gets disappointed when see their somewhat
> related problem doesn't get magically fixed with folio.
>
> Folio started as a way to relief pain from dealing with compound pages.
> It provides an unified view on base pages and compound pages. That's it.
>
> It is required ground work for wider adoption of compound pages in page
> cache. But it also will be useful for anon THP and hugetlb.
>
> Based on adoption rate and resulting code, the new abstraction has nice
> downstream effects. It may be suitable for more than it was intended for
> initially. That's great.
>
> But if it doesn't solve your problem... well, sorry...
>
> The patchset makes a nice step forward and cuts back on mess I created on
> the way to huge-tmpfs.
>
> I would be glad to see the patchset upstream.

I do agree here. While the points that Johannes brought up are relevant
and worth thinking about, I do also see a clear advantage in what folio (or
whatever $name) is bringing. The compound page handling is just a mess
and a source of practical problems and bugs.

This really requires a systematic approach to deal with it. The
proposed type system is definitely a good way to approach it. Johannes
is not happy about having the type still refer to page units, but I
haven't seen an example where that leads to worse or harder to
maintain code so far. The evolution is likely not going to stop at the
current type system, but I haven't seen any specifics proving it would
stand in the way. The existing code (fs or other subsystems interacting
with MM) is going to require quite a lot of changes to move away from
the struct page notion, but I do not see folios adding a fundamental
blocker there.

All that being said, not only do I see folios as a step in the right
direction to address the compound page mess, it is also code that
already exists and gives some real advantages. I haven't heard anybody
subscribing to a different approach and providing an implementation in
the foreseeable future, so I would rather go with this approach than
keep dealing with the existing code long term.
--
Michal Hocko
SUSE Labs
Re: Folio discussion recap
On Mon, Sep 13, 2021 at 01:32:30PM +0200, Michal Hocko wrote:
> The existing code (fs or other subsystem interacting with MM) is
> going to require quite a lot of changes to move away from struct
> page notion but I do not see folios to add fundamental blocker
> there.

The current folio seems to do quite a bit of that work, actually. But
it'll be undone when the MM conversion matures the data structure into
the full-blown new page.

It's not about hopes and dreams, it's the simple fact that the patches
do something now that seems very valuable, but which we'll lose again
over time. And avoiding that is a relatively minor adjustment at this
time compared to a much larger one later on.

So yeah, it's not really a blocker. It's just a missed opportunity to
lastingly disentangle struct page's multiple roles when touching all
the relevant places anyway. It's also (needlessly) betting that
compound pages can be made into a scalable, reliable, and predictable
allocation model, and proliferating them into fs/ based on that.

These patches, and all the ones that will need to follow to finish the
conversion, are exceptionally expensive. It would have been nice to
get more out of this disruption than to identify the relatively few
places that genuinely need compound_head(), and having a datatype for
N contiguous pages. Is there merit in solving those problems? Sure. Is
it a robust, forward-looking direction for the MM space that justifies
the cost of these and later patches? You seem to think so, I don't.

It doesn't look like we'll agree on this. But I think I've made my
points several times now, so I'll defer to Linus and Andrew.
Re: Folio discussion recap
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> One particularly noteworthy idea was having struct page refer to
> multiple hardware pages, and using slab/slub for larger
> alloctions. In my view, the primary reason for making this change
> isn't the memory overhead to struct page (though reducing that would
> be nice);

Don't underestimate this, however.

Picture the near future Willy describes, where we don't bump struct
page size yet but serve most cache with compound huge pages.

On x86, it would mean that the average page cache entry has 512
mapping pointers, 512 index members, 512 private pointers, 1024 LRU
list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate
flags, 512 memcg pointers etc. - you get the idea.

This is a ton of memory. I think this doesn't get more traction
because it's memory we've always allocated, and we're simply more
sensitive to regressions than long-standing pain. But nevertheless
this is a pretty low-hanging fruit.

The folio makes a great first step toward moving those into a separate data
structure, opening the door to one day realizing these savings. Even
if some MM folks say this was never the intent behind the patches, I
think this is going to matter just as much, if not more, later on.

> Fortunately, Matthew made a big step in the right direction by making folios a
> new type. Right now, struct folio is not separately allocated - it's just
> unionized/overlayed with struct page - but perhaps in the future they could be
> separately allocated. I don't think that is a remotely realistic goal for _this_
> patch series given the amount of code that touches struct page (thing: writeback
> code, LRU list code, page fault handlers!) - but I think that's a goal we could
> keep in mind going forward.

Yeah, agreed. Not doable out of the gate, but retaining the ability to
allocate the "cache entry descriptor" bits - mapping, index etc. -
on-demand would be a huge benefit down the road for the above reason.

For that they would have to be in - and stay in - their own type.

> We should also be clear on what _exactly_ folios are for, so they don't become
> the new dumping ground for everyone to stash their crap. They're to be a new
> core abstraction, and we should endeaver to keep our core data structures
> _small_, and _simple_.

Right. struct page is a lot of things and anything but simple and
obvious today. struct folio in its current state does a good job
separating some of that stuff out.

However, when we think about *which* of the struct page mess the folio
wants to address, I think that bias toward recent pain over much
bigger long-standing pain strikes again.

The compound page proliferation is new, and we're sensitive to the
ambiguity it created between head and tail pages. It's added
compound_head() calls in lower-level accessor functions that aren't
necessary in many contexts. The folio type safety will help clean
that up, and this is great.

However, there is a much bigger, systematic type ambiguity in the MM
world that we've just gotten used to over the years: anon vs file vs
shmem vs slab vs ...

- Many places rely on context to say "if we get here, it must be
anon/file", and then unsafely access overloaded member elements:
page->mapping, PG_readahead, PG_swapcache, PG_private

- On the other hand, we also have low-level accessor functions that
disambiguate the type and impose checks on contexts that may or may
not actually need them - not unlike compound_head() in PageActive():

struct address_space *folio_mapping(struct folio *folio)
{
        struct address_space *mapping;

        /* This happens if someone calls flush_dcache_page on slab page */
        if (unlikely(folio_test_slab(folio)))
                return NULL;

        if (unlikely(folio_test_swapcache(folio)))
                return swap_address_space(folio_swap_entry(folio));

        mapping = folio->mapping;
        if ((unsigned long)mapping & PAGE_MAPPING_ANON)
                return NULL;

        return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}

Then we go identify places that say "we know it's at least not a
slab page!" and convert them to page_mapping_file() which IS safe to
use with anon. Or we say "we know this MUST be a file page" and just
access the (unsafe) mapping pointer directly.

- We have a singular page lock, but what it guards depends on what
type of page we're dealing with. For a cache page it protects
uptodate and the mapping. For an anon page it protects swap state.

A lot of us can remember the rules if we try, but the code doesn't
help and it gets really tricky when dealing with multiple types of
pages simultaneously. Even mature code like reclaim just serializes
the operation instead of protecting data - the writeback checks and
the page table reference tests don't seem to need page lock.

When the cgroup folks wrote the initial memory controller, they just
added their own page-scope lock to protect page->memcg even though
the page lock would have covered what it needed.

- shrink_page_list() uses page_mapping() in the first half of the
function to tell whether the page is anon or file, but halfway
through we do this:

/* Adding to swap updated mapping */
mapping = page_mapping(page);

and then use PageAnon() to disambiguate the page type.

- At activate_locked:, we check PG_swapcache directly on the page and
rely on it doing the right thing for anon, file, and shmem pages.
But this flag is PG_owner_priv_1 and actually used by the filesystem
for something else. I guess PG_checked pages currently don't make it
this far in reclaim, or we'd crash somewhere in try_to_free_swap().

I suppose we're also never calling page_mapping() on PageChecked
filesystem pages right now, because it would return a swap mapping
before testing whether this is a file page. You know, because shmem.

These are just a few examples from an MM perspective. I'm sure the FS
folks have their own stories and examples about pitfalls in dealing
with struct page members.

We're so used to this that we don't realize how much bigger and more
pervasive this lack of typing is than the compound page thing.

I'm not saying the compound page mess isn't worth fixing. It is.

I'm saying if we started with a file page or cache entry abstraction
we'd solve not only the huge page cache, but also set us up for a MUCH
more comprehensive cleanup in MM code and MM/FS interaction that makes
the tailpage cleanup pale in comparison. For the same amount of churn,
since folio would also touch all of these places.
Re: Folio discussion recap
Hello everyone,

I am an outsider following the discussion here on the subject.
Can we not go upstream with the current state of development?
There will always be optimizations, and new kernel releases too.

I cannot assess the risk, but I think a decision must be made.

Damian


On Wed, 15. Sep 11:40, Johannes Weiner wrote:
> On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> > One particularly noteworthy idea was having struct page refer to
> > multiple hardware pages, and using slab/slub for larger
> > alloctions. In my view, the primary reason for making this change
> > isn't the memory overhead to struct page (though reducing that would
> > be nice);
>
> Don't underestimate this, however.
>
> Picture the near future Willy describes, where we don't bump struct
> page size yet but serve most cache with compound huge pages.
>
> On x86, it would mean that the average page cache entry has 512
> mapping pointers, 512 index members, 512 private pointers, 1024 LRU
> list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate
> flags, 512 memcg pointers etc. - you get the idea.
>
> This is a ton of memory. I think this doesn't get more traction
> because it's memory we've always allocated, and we're simply more
> sensitive to regressions than long-standing pain. But nevertheless
> this is a pretty low-hanging fruit.
>
> The folio makes a great first step moving those into a separate data
> structure, opening the door to one day realizing these savings. Even
> when some MM folks say this was never the intent behind the patches, I
> think this is going to matter significantly, if not more so, later on.
>
> > Fortunately, Matthew made a big step in the right direction by making folios a
> > new type. Right now, struct folio is not separately allocated - it's just
> > unionized/overlayed with struct page - but perhaps in the future they could be
> > separately allocated. I don't think that is a remotely realistic goal for _this_
> > patch series given the amount of code that touches struct page (thing: writeback
> > code, LRU list code, page fault handlers!) - but I think that's a goal we could
> > keep in mind going forward.
>
> Yeah, agreed. Not doable out of the gate, but retaining the ability to
> allocate the "cache entry descriptor" bits - mapping, index etc. -
> on-demand would be a huge benefit down the road for the above reason.
>
> For that they would have to be in - and stay in - their own type.
>
> > We should also be clear on what _exactly_ folios are for, so they don't become
> > the new dumping ground for everyone to stash their crap. They're to be a new
> > core abstraction, and we should endeaver to keep our core data structures
> > _small_, and _simple_.
>
> Right. struct page is a lot of things and anything but simple and
> obvious today. struct folio in its current state does a good job
> separating some of that stuff out.
>
> However, when we think about *which* of the struct page mess the folio
> wants to address, I think that bias toward recent pain over much
> bigger long-standing pain strikes again.
>
> The compound page proliferation is new, and we're sensitive to the
> ambiguity it created between head and tail pages. It's added some
> compound_head() in lower-level accessor functions that are not
> necessary for many contexts. The folio type safety will help clean
> that up, and this is great.
>
> However, there is a much bigger, systematic type ambiguity in the MM
> world that we've just gotten used to over the years: anon vs file vs
> shmem vs slab vs ...
>
> - Many places rely on context to say "if we get here, it must be
> anon/file", and then unsafely access overloaded member elements:
> page->mapping, PG_readahead, PG_swapcache, PG_private
>
> - On the other hand, we also have low-level accessor functions that
> disambiguate the type and impose checks on contexts that may or may
> not actually need them - not unlike compound_head() in PageActive():
>
> struct address_space *folio_mapping(struct folio *folio)
> {
> struct address_space *mapping;
>
> /* This happens if someone calls flush_dcache_page on slab page */
> if (unlikely(folio_test_slab(folio)))
> return NULL;
>
> if (unlikely(folio_test_swapcache(folio)))
> return swap_address_space(folio_swap_entry(folio));
>
> mapping = folio->mapping;
> if ((unsigned long)mapping & PAGE_MAPPING_ANON)
> return NULL;
>
> return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
> }
>
> Then we go identify places that say "we know it's at least not a
> slab page!" and convert them to page_mapping_file() which IS safe to
> use with anon. Or we say "we know this MUST be a file page" and just
> access the (unsafe) mapping pointer directly.
>
> - We have a singular page lock, but what it guards depends on what
> type of page we're dealing with. For a cache page it protects
> uptodate and the mapping. For an anon page it protects swap state.
>
> A lot of us can remember the rules if we try, but the code doesn't
> help and it gets really tricky when dealing with multiple types of
> pages simultaneously. Even mature code like reclaim just serializes
> the operation instead of protecting data - the writeback checks and
> the page table reference tests don't seem to need page lock.
>
> When the cgroup folks wrote the initial memory controller, they just
> added their own page-scope lock to protect page->memcg even though
> the page lock would have covered what it needed.
>
> - shrink_page_list() uses page_mapping() in the first half of the
> function to tell whether the page is anon or file, but halfway
> through we do this:
>
> /* Adding to swap updated mapping */
> mapping = page_mapping(page);
>
> and then use PageAnon() to disambiguate the page type.
>
> - At activate_locked:, we check PG_swapcache directly on the page and
> rely on it doing the right thing for anon, file, and shmem pages.
> But this flag is PG_owner_priv_1 and actually used by the filesystem
> for something else. I guess PG_checked pages currently don't make it
> this far in reclaim, or we'd crash somewhere in try_to_free_swap().
>
> I suppose we're also never calling page_mapping() on PageChecked
> filesystem pages right now, because it would return a swap mapping
> before testing whether this is a file page. You know, because shmem.
>
> These are just a few examples from an MM perspective. I'm sure the FS
> folks have their own stories and examples about pitfalls in dealing
> with struct page members.
>
> We're so used to this that we don't realize how much bigger and
> pervasive this lack of typing is than the compound page thing.
>
> I'm not saying the compound page mess isn't worth fixing. It is.
>
> I'm saying if we started with a file page or cache entry abstraction
> we'd solve not only the huge page cache, but also set us up for a MUCH
> more comprehensive cleanup in MM code and MM/FS interaction that makes
> the tailpage cleanup pale in comparison. For the same amount of churn,
> since folio would also touch all of these places.
>
Re: Folio discussion recap
On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote:
> On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> > One particularly noteworthy idea was having struct page refer to
> > multiple hardware pages, and using slab/slub for larger
> > alloctions. In my view, the primary reason for making this change
> > isn't the memory overhead to struct page (though reducing that would
> > be nice);
>
> Don't underestimate this, however.
>
> Picture the near future Willy describes, where we don't bump struct
> page size yet but serve most cache with compound huge pages.
>
> On x86, it would mean that the average page cache entry has 512
> mapping pointers, 512 index members, 512 private pointers, 1024 LRU
> list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate
> flags, 512 memcg pointers etc. - you get the idea.
>
> This is a ton of memory. I think this doesn't get more traction
> because it's memory we've always allocated, and we're simply more
> sensitive to regressions than long-standing pain. But nevertheless
> this is a pretty low-hanging fruit.
>
> The folio makes a great first step moving those into a separate data
> structure, opening the door to one day realizing these savings. Even
> when some MM folks say this was never the intent behind the patches, I
> think this is going to matter significantly, if not more so, later on.

So ... I chatted with Kent the other day, who suggested to me that maybe
the point you're really after is that you want to increase the hw page
size to reduce overhead while retaining the ability to hand out parts of
those larger pages to the page cache, and folios don't get us there?

> > Fortunately, Matthew made a big step in the right direction by making folios a
> > new type. Right now, struct folio is not separately allocated - it's just
> > unionized/overlayed with struct page - but perhaps in the future they could be
> > separately allocated. I don't think that is a remotely realistic goal for _this_
> > patch series given the amount of code that touches struct page (thing: writeback
> > code, LRU list code, page fault handlers!) - but I think that's a goal we could
> > keep in mind going forward.
>
> Yeah, agreed. Not doable out of the gate, but retaining the ability to
> allocate the "cache entry descriptor" bits - mapping, index etc. -
> on-demand would be a huge benefit down the road for the above reason.
>
> For that they would have to be in - and stay in - their own type.
>
> > We should also be clear on what _exactly_ folios are for, so they don't become
> > the new dumping ground for everyone to stash their crap. They're to be a new
> > core abstraction, and we should endeaver to keep our core data structures
> > _small_, and _simple_.
>
> Right. struct page is a lot of things and anything but simple and
> obvious today. struct folio in its current state does a good job
> separating some of that stuff out.
>
> However, when we think about *which* of the struct page mess the folio
> wants to address, I think that bias toward recent pain over much
> bigger long-standing pain strikes again.
>
> The compound page proliferation is new, and we're sensitive to the
> ambiguity it created between head and tail pages. It's added some
> compound_head() in lower-level accessor functions that are not
> necessary for many contexts. The folio type safety will help clean
> that up, and this is great.
>
> However, there is a much bigger, systematic type ambiguity in the MM
> world that we've just gotten used to over the years: anon vs file vs
> shmem vs slab vs ...
>
> - Many places rely on context to say "if we get here, it must be
> anon/file", and then unsafely access overloaded member elements:
> page->mapping, PG_readahead, PG_swapcache, PG_private
>
> - On the other hand, we also have low-level accessor functions that
> disambiguate the type and impose checks on contexts that may or may
> not actually need them - not unlike compound_head() in PageActive():
>
> struct address_space *folio_mapping(struct folio *folio)
> {
> struct address_space *mapping;
>
> /* This happens if someone calls flush_dcache_page on slab page */
> if (unlikely(folio_test_slab(folio)))
> return NULL;
>
> if (unlikely(folio_test_swapcache(folio)))
> return swap_address_space(folio_swap_entry(folio));
>
> mapping = folio->mapping;
> if ((unsigned long)mapping & PAGE_MAPPING_ANON)
> return NULL;
>
> return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
> }
>
> Then we go identify places that say "we know it's at least not a
> slab page!" and convert them to page_mapping_file() which IS safe to
> use with anon. Or we say "we know this MUST be a file page" and just
> access the (unsafe) mapping pointer directly.
>
> - We have a singular page lock, but what it guards depends on what
> type of page we're dealing with. For a cache page it protects
> uptodate and the mapping. For an anon page it protects swap state.
>
> A lot of us can remember the rules if we try, but the code doesn't
> help and it gets really tricky when dealing with multiple types of
> pages simultaneously. Even mature code like reclaim just serializes
> the operation instead of protecting data - the writeback checks and
> the page table reference tests don't seem to need page lock.
>
> When the cgroup folks wrote the initial memory controller, they just
> added their own page-scope lock to protect page->memcg even though
> the page lock would have covered what it needed.
>
> - shrink_page_list() uses page_mapping() in the first half of the
> function to tell whether the page is anon or file, but halfway
> through we do this:
>
> /* Adding to swap updated mapping */
> mapping = page_mapping(page);
>
> and then use PageAnon() to disambiguate the page type.
>
> - At activate_locked:, we check PG_swapcache directly on the page and
> rely on it doing the right thing for anon, file, and shmem pages.
> But this flag is PG_owner_priv_1 and actually used by the filesystem
> for something else. I guess PG_checked pages currently don't make it
> this far in reclaim, or we'd crash somewhere in try_to_free_swap().
>
> I suppose we're also never calling page_mapping() on PageChecked
> filesystem pages right now, because it would return a swap mapping
> before testing whether this is a file page. You know, because shmem.

(Yes, it would be helpful to fix these ambiguities, because I feel like
discussions about all these other non-pagecache uses of memory keep
coming up on fsdevel and the code /really/ doesn't help me figure out
what everyone's talking about before the discussion moves on...)

> These are just a few examples from an MM perspective. I'm sure the FS
> folks have their own stories and examples about pitfalls in dealing
> with struct page members.

We do, and I thought we were making good progress pushing a lot of that
into the fs/iomap/ library. With fs iomap, disk filesystems pass space
mapping data to the iomap functions and let them deal with pages (or
folios). IOWs, filesystems don't deal with pages directly anymore, and
folios sounded like an easy transition (for a filesystem) to whatever
comes next. At some point it would be nice to get fscrypt and fsverity
hooked up so that we could move ext4 further off of buffer heads.

I don't know how we proceed from here -- there's quite a bit of
filesystem work that depended on the folio series actually landing.
Given that Linus has neither pulled it, rejected it, nor told willy what
to do, and the folio series now has a NAK on it, I can't even start on
how to proceed from here.

--D

> We're so used to this that we don't realize how much bigger and
> pervasive this lack of typing is than the compound page thing.
>
> I'm not saying the compound page mess isn't worth fixing. It is.
>
> I'm saying if we started with a file page or cache entry abstraction
> we'd solve not only the huge page cache, but also set us up for a MUCH
> more comprehensive cleanup in MM code and MM/FS interaction that makes
> the tailpage cleanup pale in comparison. For the same amount of churn,
> since folio would also touch all of these places.
Re: Folio discussion recap
On Wed, Sep 15, 2021 at 07:58:54PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote:
> > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> > > One particularly noteworthy idea was having struct page refer to
> > > multiple hardware pages, and using slab/slub for larger
> > > alloctions. In my view, the primary reason for making this change
> > > isn't the memory overhead to struct page (though reducing that would
> > > be nice);
> >
> > Don't underestimate this, however.
> >
> > Picture the near future Willy describes, where we don't bump struct
> > page size yet but serve most cache with compound huge pages.
> >
> > On x86, it would mean that the average page cache entry has 512
> > mapping pointers, 512 index members, 512 private pointers, 1024 LRU
> > list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate
> > flags, 512 memcg pointers etc. - you get the idea.
> >
> > This is a ton of memory. I think this doesn't get more traction
> > because it's memory we've always allocated, and we're simply more
> > sensitive to regressions than long-standing pain. But nevertheless
> > this is a pretty low-hanging fruit.
> >
> > The folio makes a great first step moving those into a separate data
> > structure, opening the door to one day realizing these savings. Even
> > when some MM folks say this was never the intent behind the patches, I
> > think this is going to matter significantly, if not more so, later on.
>
> So ... I chatted with Kent the other day, who suggested to me that maybe
> the point you're really after is that you want to increase the hw page
> size to reduce overhead while retaining the ability to hand out parts of
> those larger pages to the page cache, and folios don't get us there?

Yes, that's one of the points.

It's exporting the huge page model we've been using for anonymous
memory to the filesystems, even though that model has shown
significant limitations in practice: it doesn't work well out of the
box, the necessary configuration is painful and complicated, and even
when done correctly it still has high allocation latencies. It's much
more "handtuned HPC workload" than "general purpose feature".

Fixing this is an open problem. I don't know for sure if we need to
increase the page size for that, but neither does anybody else. This
is simply work and experiments that haven't been done on the MM side.

Exposing the filesystems to that implementation now exposes them to
the risk of a near-term do-over, and puts a significantly higher
barrier on fixing the allocation model down the line.

There isn't a technical reason for this coupling the filesystems that
tightly to the allocation model. It's just that the filesystem people
would like a size-agnostic cache object, and some MM folks would like
to clean up the compound page mess, and folio tries to do both of
these things at once.

> > > Fortunately, Matthew made a big step in the right direction by making folios a
> > > new type. Right now, struct folio is not separately allocated - it's just
> > > unionized/overlayed with struct page - but perhaps in the future they could be
> > > separately allocated. I don't think that is a remotely realistic goal for _this_
> > > patch series given the amount of code that touches struct page (thing: writeback
> > > code, LRU list code, page fault handlers!) - but I think that's a goal we could
> > > keep in mind going forward.
> >
> > Yeah, agreed. Not doable out of the gate, but retaining the ability to
> > allocate the "cache entry descriptor" bits - mapping, index etc. -
> > on-demand would be a huge benefit down the road for the above reason.
> >
> > For that they would have to be in - and stay in - their own type.
> >
> > > We should also be clear on what _exactly_ folios are for, so they don't become
> > > the new dumping ground for everyone to stash their crap. They're to be a new
> > > core abstraction, and we should endeaver to keep our core data structures
> > > _small_, and _simple_.
> >
> > Right. struct page is a lot of things and anything but simple and
> > obvious today. struct folio in its current state does a good job
> > separating some of that stuff out.
> >
> > However, when we think about *which* of the struct page mess the folio
> > wants to address, I think that bias toward recent pain over much
> > bigger long-standing pain strikes again.
> >
> > The compound page proliferation is new, and we're sensitive to the
> > ambiguity it created between head and tail pages. It's added some
> > compound_head() in lower-level accessor functions that are not
> > necessary for many contexts. The folio type safety will help clean
> > that up, and this is great.
> >
> > However, there is a much bigger, systematic type ambiguity in the MM
> > world that we've just gotten used to over the years: anon vs file vs
> > shmem vs slab vs ...
> >
> > - Many places rely on context to say "if we get here, it must be
> > anon/file", and then unsafely access overloaded member elements:
> > page->mapping, PG_readahead, PG_swapcache, PG_private
> >
> > - On the other hand, we also have low-level accessor functions that
> > disambiguate the type and impose checks on contexts that may or may
> > not actually need them - not unlike compound_head() in PageActive():
> >
> > struct address_space *folio_mapping(struct folio *folio)
> > {
> > struct address_space *mapping;
> >
> > /* This happens if someone calls flush_dcache_page on slab page */
> > if (unlikely(folio_test_slab(folio)))
> > return NULL;
> >
> > if (unlikely(folio_test_swapcache(folio)))
> > return swap_address_space(folio_swap_entry(folio));
> >
> > mapping = folio->mapping;
> > if ((unsigned long)mapping & PAGE_MAPPING_ANON)
> > return NULL;
> >
> > return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
> > }
> >
> > Then we go identify places that say "we know it's at least not a
> > slab page!" and convert them to page_mapping_file() which IS safe to
> > use with anon. Or we say "we know this MUST be a file page" and just
> > access the (unsafe) mapping pointer directly.
> >
> > - We have a singular page lock, but what it guards depends on what
> > type of page we're dealing with. For a cache page it protects
> > uptodate and the mapping. For an anon page it protects swap state.
> >
> > A lot of us can remember the rules if we try, but the code doesn't
> > help and it gets really tricky when dealing with multiple types of
> > pages simultaneously. Even mature code like reclaim just serializes
> > the operation instead of protecting data - the writeback checks and
> > the page table reference tests don't seem to need page lock.
> >
> > When the cgroup folks wrote the initial memory controller, they just
> > added their own page-scope lock to protect page->memcg even though
> > the page lock would have covered what it needed.
> >
> > - shrink_page_list() uses page_mapping() in the first half of the
> > function to tell whether the page is anon or file, but halfway
> > through we do this:
> >
> > /* Adding to swap updated mapping */
> > mapping = page_mapping(page);
> >
> > and then use PageAnon() to disambiguate the page type.
> >
> > - At activate_locked:, we check PG_swapcache directly on the page and
> > rely on it doing the right thing for anon, file, and shmem pages.
> > But this flag is PG_owner_priv_1 and actually used by the filesystem
> > for something else. I guess PG_checked pages currently don't make it
> > this far in reclaim, or we'd crash somewhere in try_to_free_swap().
> >
> > I suppose we're also never calling page_mapping() on PageChecked
> > filesystem pages right now, because it would return a swap mapping
> > before testing whether this is a file page. You know, because shmem.
>
> (Yes, it would be helpful to fix these ambiguities, because I feel like
> discussions about all these other non-pagecache uses of memory keep
> coming up on fsdevel and the code /really/ doesn't help me figure out
> what everyone's talking about before the discussion moves on...)

Excellent.

However, after listening to Kent and other filesystem folks, I think
it's important to point out that the folio is not a dedicated page
cache page descriptor that will address any of the above examples.

The MM POV (and the justification for both the acks and the naks of
the patchset) is that it's a generic, untyped compound page
abstraction, which applies to file, anon, slab, networking
pages. Certainly, the folio patches as of right now also convert anon
page handling to the folio. If followed to its conclusion, the folio
will have plenty of members and API functions for non-pagecache users
and look pretty much like struct page today, just with a dynamic size.

I know Kent was surprised by this. I know Dave Chinner suggested to
call it "cache page" or "cage" early on, which also suggests an
understanding of a *dedicated* cache page descriptor.

I don't think the ambiguous folio name and the ambiguous union with
the page helped in any way in aligning fs and mm folks on what this
thing is actually supposed to be!

I agree with what I think the filesystems want: instead of an untyped,
variable-sized block of memory, I think we should have a typed page
cache descriptor.

That would work better for the filesystems, and I think would also
work better for the MM code down the line and fix the above examples.

The headpage/tailpage cleanup would come free with that.

> > These are just a few examples from an MM perspective. I'm sure the FS
> > folks have their own stories and examples about pitfalls in dealing
> > with struct page members.
>
> We do, and I thought we were making good progress pushing a lot of that
> into the fs/iomap/ library. With fs iomap, disk filesystems pass space
> mapping data to the iomap functions and let them deal with pages (or
> folios). IOWs, filesystems don't deal with pages directly anymore, and
> folios sounded like an easy transition (for a filesystem) to whatever
> comes next. At some point it would be nice to get fscrypt and fsverify
> hooked up so that we could move ext4 further off of buffer heads.
>
> I don't know how we proceed from here -- there's quite a bit of
> filesystems work that depended on the folios series actually landing.
> Given that Linus has neither pulled it, rejected it, or told willy what
> to do, and the folio series now has a NAK on it, I can't even start on
> how to proceed from here.

I think divide and conquer is the way forward.

The crux of the matter is that folio is trying to 1) replace struct
page as the filesystem interface to the MM and 2) replace struct page
as the internal management object for file and anon, and conceptually
also slab & networking pages all at the same time.

As you can guess, goals 1) and 2) have vastly different scopes.

Replacing struct page in the filesystem isn't very controversial, and
filesystem folks seem uniformly ready to go. I agree.

Replacing struct page in MM code is much less clear cut. We have some
people who say it'll be great, some people who say we can probably
figure out open questions down the line, and we have some people who
have expressed doubts that all this churn will ever be worth it. I
think it's worth replacing, but not with an untyped compound thing.

It's sh*tty that the filesystem people are acutely blocked on
large-scope, long-term MM discussions they don't care about.

It's also sh*tty that these MM discussions are rushed by folks who
aren't familiar with or care too much about the MM internals.

This friction isn't necessary. The folio conversion is an incremental
process. It's not like everything in MM code has been fully converted
already - some stuff deals with the folio, most stuff with the page.

An easy way forward that I see is to split this large, open-ended
project into more digestible pieces. E.g. separate 1) and 2): merge a
"size-agnostic cache page" type now; give MM folks the time they need
to figure out how and if they want to replace struct page internally.

That's why I suggested to drop the anon page conversion bits in
swap.c, workingset.c, memcontrol.c etc, and just focus on the
uncontroversial page cache bits for now.
Re: Folio discussion recap
Johannes Weiner <hannes@cmpxchg.org> wrote:

> I know Kent was surprised by this. I know Dave Chinner suggested to
> call it "cache page" or "cage" early on, which also suggests an
> understanding of a *dedicated* cache page descriptor.

If we are aiming to get pages out of the view of the filesystem, then we
should probably not include "page" in the name. "Data cache" would seem
obvious, but we already have that concept for the CPU. How about something
like "struct content" and rename i_pages to i_content?

David
Re: Folio discussion recap
On Thu, Sep 16, 2021 at 12:54:22PM -0400, Johannes Weiner wrote:
> On Wed, Sep 15, 2021 at 07:58:54PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 15, 2021 at 11:40:11AM -0400, Johannes Weiner wrote:
> > > On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> The MM POV (and the justification for both the acks and the naks of
> the patchset) is that it's a generic, untyped compound page
> abstraction, which applies to file, anon, slab, networking
> pages. Certainly, the folio patches as of right now also convert anon
> page handling to the folio. If followed to its conclusion, the folio
> will have plenty of members and API functions for non-pagecache users
> and look pretty much like struct page today, just with a dynamic size.
>
> I know Kent was surprised by this. I know Dave Chinner suggested to
> call it "cache page" or "cage" early on, which also suggests an
> understanding of a *dedicated* cache page descriptor.

Don't take a flippant comment I made in a bikeshed as any sort of
representation of what I think about this current situation. I've
largely been silent because of your history of yelling incoherently
in response to anything I say that you don't agree with.

But now you've explicitly drawn me into this discussion, I'll point
out that I'm one of very few people in the wider Linux mm/fs
community who has any *direct experience* with the cache handle
based architecture being advocated for here.

I don't agree with your assertion that cache handle based objects
are the way forward, so please read and try to understand what
I've just put a couple of hours into writing before you start
shouting. Please?

---

Ok, so this cache page descriptor/handle/object architecture has
been implemented in other operating systems. It's the solution that
Irix implemented back in the early _1990s_ via its chunk cache.

I've talked about this a few times in the past 15 years, so I guess
I'll talk about it again. eg at LSFMM 2014 where I said "we
don't really want to go down that path" in reference to supporting
sector sizes > PAGE_SIZE:

https://lwn.net/Articles/592101/

So, in more gory detail why I don't think we really want to go down
that path.....

The Irix chunk cache sat between the low layer global, disk address
indexed buffer cache[1] and the high layer per-mm-context page cache
used for mmap().

A "chunk" was a variable sized object indexed by file offset on a
per-inode AVL tree - basically the same caching architecture as our
current per-inode mapping tree uses to index pages. But unlike the
Linux page cache, these chunks were an extension of the low level
buffer cache. Hence they were also indexed by physical disk
address and the life-cycle was managed by the buffer cache shrinker
rather than the mm-based page cache reclaim algorithms.

Chunks were built from page cache pages, and pages pointed back to
the chunk that they belonged to. Chunks needed their own locking. IO
was done based on chunks, not pages. Filesystems decided the size of
chunks, not the page cache. Pages attached to chunks could be of any
hardware supported size - the only limitation was that all pages
attached to a chunk had to be the same size. A large hardware page
in the page cache could be mapped by multiple smaller chunks. A
chunk made up of multiple hardware pages could vmap its contents if
the user needed contiguous access.[2]
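
To give a rough idea of the shape of the thing, in Linux-flavoured terms
(purely illustrative - not the actual Irix structure):

        struct chunk {
                struct rb_node  by_offset;  /* per-inode tree (AVL on Irix), by file offset */
                sector_t        daddr;      /* also indexed by physical disk address */
                loff_t          offset;     /* byte offset into the file */
                size_t          len;        /* variable size, chosen by the filesystem */
                unsigned long   state;      /* dirty/uptodate/writeback live here, not per page */
                spinlock_t      lock;       /* chunks need their own locking */
                void            *vaddr;     /* vmapped contents, if contiguous access was needed */
                unsigned int    nr_pages;
                struct page     *pages[];   /* backing pages, all the same hardware size */
        };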

Chunks were largely unaware of ongoing mmap operations. If a page
fault hit a page that had no associated chunk (e.g. one originally
populated into the page cache by a read fault into a hole, or a cached
page whose chunk the buffer cache had torn down), then a new chunk had
to be built. The code needed to handle partially populated chunks in
this sort of situation was really, really nasty as it required
interacting with the filesystem and having the filesystem take locks
and call back up into the page cache to build the new chunk in the IO
path.

Similarly, dirty page state from page faults needed to be propagated
down to the chunks, because dirty tracking for writeback was done at
the chunk level, not the page cache level. This was *really* nasty,
because if the page didn't have a chunk already built, it couldn't
be built in a write fault context. Hence sweeping dirty page state
to the IO subsystem was handled periodically by a pdflush daemon,
which could work with the filesystem to build new (dirty) chunks and
insert them into the chunk cache for writeback.

Similar problems will have to be considered during design for Linux
because the dirty tracking in Linux for writeback is done at the
per-inode mapping tree level. Hence things like ->page_mkwrite are
going to have to dig through the page to the cached chunk and mark
the chunk dirty rather than the page. Whether deadlocks are going to
have to be worked around is an open question; I don't have answers
to these concerns because nobody is proposing an architecture
detailed enough to explore these situations.
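
In hand-wavy pseudo-code, the sort of thing being described - note how many
hypothetical helpers it already takes:

        static vm_fault_t example_page_mkwrite(struct vm_fault *vmf)
        {
                struct page *page = vmf->page;
                struct chunk *chunk;

                lock_page(page);
                chunk = chunk_from_page(page);          /* hypothetical back-pointer */
                if (!chunk)
                        chunk = build_chunk(page);      /* needs fs locks - deadlock risk? */
                chunk_mark_dirty(chunk);                /* dirty tracking lives in the chunk */
                return VM_FAULT_LOCKED;
        }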

This also leads to really interesting questions about how page and
chunk state w.r.t. IO is kept coherent. e.g. if we are not tracking
IO state on individual page cache pages, how do we ensure all the
pages stay stable when IO is being done to a block device that
requires stable pages? Along similar lines: what's the interlock
mechanism that we'll use to ensure that IO or truncate can lock out
per-page accesses if the filesystem IO paths no longer directly
interact with page state any more? I also wonder how will we manage
cached chunks if the filesystem currently relies on page level
locking for atomicity, concurrency and existence guarantees (e.g.
ext4 buffered IO)?

IOWs, it is extremely likely that there will still be situations
where we have to blast directly through the cache handle abstraction
to manipulate the objects behind the abstraction so that we can make
specific functionality work correctly, without regressions and/or
efficiently.

Hence the biggest issue that a chunk-like cache handle introduces
is the complex multi-dimensional state update interactions. These will
require more complex locking and that locking will be required to
work in arbitrary orders for operations to be performed safely and
atomically. e.g. IO needs inode->chunk->page order, whilst page
migration/compaction needs page->chunk->inode order. Page migration
and compaction on Irix had some unfixable deadlocks in rare corner
cases because of locking inversion problems between filesystems,
chunks, pages and mm contexts. I don't see any fundamental
difference in Linux architecture that makes me think that it will
be any different.[3]

I've got a war chest full of chunk cache related data corruption bugs
on Irix that were crazy hard to reproduce and even more difficult to
fix. At least half the bugs I had to fix in the chunk cache over
3-4 years as maintainer were data corruption bugs resulting from
inconsistencies in multi-object state updates.

I've got a whole 'nother barrel full of problem cases that revolve
around memory reclaim, too. The cache handles really need to pin the
pages that back them, and so we can't really do access optimised
per-page based reclaim of file-backed pages anymore. The Irix chunk
cache had its own LRUs and shrinker[4] to manage life-cycles of
chunks under memory pressure, and the mm code had its own
independent page cache shrinker. Hence pages didn't get freed until
both the chunk cache and the page cache released the pages they had
references to.

IOWs, we're going to end up needing to reclaim cache handles before
we can do page reclaim. This needs careful thought and will likely
need a complete redesign of the vmscan.c algorithms to work
properly. I really, really don't want to see awful layer violations
like bufferhead reclaim getting hacked into the low layer page
reclaim algorithms happen ever again. We're still paying the price
for that.

And given the way Linux uses the mapping tree for keeping stuff like
per-page working set refault information after the pages have been
removed from the page cache, I really struggle to see how
functionality like this can be supported with a chunk based cache
index that doesn't actually have direct tracking of individual page
access and reclaim behaviour.

We're also going to need a range-based indexing mechanism for the
mapping tree if we want to avoid the inefficiencies that mapping
large objects into the XArray requires. We'll need an rcu-aware tree
of some kind, be it a btree, maple tree or something else so that we
can maintain lockless lookups of cache objects. That infrastructure
doesn't exist yet, either.

And on that note, it is worth keeping in mind that one of the
reasons that the current linux page cache architecture scales better
for single files than the Irix architecture ever did is because the
Irix chunk cache could not be made lockless. The requirements for
atomic multi-dimensional indexing updates and coherent, atomic
multi-object state changes could never be solved in a lockless
manner. It was not for lack of trying or talent; people way
smarter than me couldn't solve that problem. So there's an open
question as to whether we can maintain existing lockless algorithms
when a chunk cache is layered over the top of the page cache.

IOWs, I see significant, fundamental problems that chunk cache
architectures suffer from. I know there are inherent problems with
state coherency, locking, complexity in the IO path, etc. Some of
these problems will not be discovered until the implementation is
well under way. Some of these problems may well be unsolvable, too.
And until there's an actual model proposed of how everything will
interact and work, we can't actually do any of this architectural
analysis to determine if it might work or not. The chunk cache
proposal is really just a grand thought experiment at this point in
time.

OTOH, folios have none of these problems and are here right now.
Sure, they have their own issues, but we can see them for what they
are given the code is already out there, and pretty much everyone
sees them as a big step forwards.

Folios don't prevent a chunk cache from being implemented. In fact,
to make folios highly efficient, we have to do things a chunk cache
would also require to be implemented. e.g. range-based cache
indexing. Unlike a chunk cache, folios don't depend on this being
done first - they stand alone without those changes, and will only
improve from making them. IOWs, you can't use the "folios being
mapped 512 times into the mapping tree" as a reason the chunk cache
is better - the chunk cache also requires this same problem to be
solved, but the chunk cache needs efficient range lookups done
*before* it is implemented, not provided afterwards as an
optimisation.

IOWs, if we want to move towards a chunk cache, the first step is to
move to folios to allow large objects in the page cache. Then we can
implement a lock-less range based index mechanism for the mapping
tree. Then we can look to replace folios with a typed cache handle
without having to worry about all the whacky multi-object coherency
problems because they only need to point to a single folio. Then we
can work out all the memory reclaim issues, locking issues, sort out
the API that filesystems use instead of folios, etc. that need to be
done when cache handles are introduced. And once we've worked
through all that, then we can add support for multiple folios within
a single cache object and discover all the really hard problems that
this exposes. At this point, the cache objects are no longer
dependent on folios to provide objects > PAGE_SIZE to the
filesystems, and we can start to remove folios from the mm code and
replace them with something else that the cache handle uses to
provide the backing store to the filesystems...

Seriously, I have given a lot of thought over the years to a chunk
cache for Linux. Right now, a chunk cache is a solution looking for
a problem to solve. Unless there's an overall architectural mm
plan that is being worked towards that requires a chunk cache, then
I just don't see the justification for doing all this work because
the first two steps above get filesystems everything they are
currently asking for. Everything else past that is really just an
experiment...

> I agree with what I think the filesystems want: instead of an untyped,
> variable-sized block of memory, I think we should have a typed page
> cache desciptor.

I don't think that's what fs devs want at all. It's what you think
fs devs want. If you'd been listening to us the same way that Willy
has been for the past year, maybe you'd have a different opinion.

Indeed, we don't actually need a new page cache abstraction.
fs/iomap already provides filesystems with a complete, efficient
page cache abstraction that only requires filesystems to provide
block mapping services. Filesystems using iomap do not interact with
the page cache at all. And David Howells is working with Willy and
all the network fs devs to build an equivalent generic netfs page
cache abstraction based on folios that is supported by the major
netfs client implementations in the kernel.
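
As a rough illustration of how thin the filesystem side of that
abstraction is - the exact prototypes vary between kernel versions,
but this is roughly the current shape of the interface - iomap only
asks the filesystem for block mapping callbacks:

	struct iomap_ops {
		/* Map a byte range of the file for an IO operation. */
		int (*iomap_begin)(struct inode *inode, loff_t pos,
				   loff_t length, unsigned flags,
				   struct iomap *iomap, struct iomap *srcmap);

		/* Tear down the mapping once the operation completes. */
		int (*iomap_end)(struct inode *inode, loff_t pos,
				 loff_t length, ssize_t written,
				 unsigned flags, struct iomap *iomap);
	};

Everything else - page cache lookups, dirtying, writeback
interactions - is handled inside fs/iomap on the filesystem's behalf.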

IOWs, fs devs don't need a new page cache abstraction - we've got
our own abstractions tailored directly to our needs. What we need
are API cleanups, consistency in object access mechanisms and
dynamic object size support to simplify and fill out the feature set
of the abstractions we've already built.

The fact that so many fs developers are pushing *hard* for folios is
that it provides what we've been asking for individually over the
last few years. Willy has done a great job of working with the fs
developers and getting feedback at every step of the process, and
you see that in the amount of work in progress that is already
based on folios. And it provides those cleanups and new
functionality without changing or invalidating any of the knowledge
we collectively hold about how the page cache works. That's _pure
gold_ right there.

In summary:

If you don't know anything about the architecture and limitations of
the XFS buffer cache (also read the footnotes), you'd do very well
to pay heed to what I've said in this email, considering the direct
relevance its history has to the alternative cache handle proposal
being made here. We also need to consider the evidence that
filesystems do not actually need a new page cache abstraction - they
just need the existing page cache to be able to index objects larger
than PAGE_SIZE.

So with all that in mind, I consider folios (or whatever we call
them) to be the best stepping stone towards a PAGE_SIZE independent
future that we currently have. Folios don't prevent us from
introducing a cache handle based architecture if we have a
compelling reason to do so in the future, nor do they stop anyone
working on such infrastructure in parallel if it really is
necessary. But the reality is that we don't need such a fundamental
architectural change to provide the functionality that folios
provide us with _right now_.

Folios are not perfect, but they are here and they solve many issues
we need solved. We're never going to have a perfect solution that
everyone agrees with, so the real question is "are folios good
enough?". To me the answer is a resounding yes.

Cheers,

Dave.

[1] fs/xfs/xfs_buf.c is an example of a high performance handle
based, variable object size cache that abstracts away the details of
the data store being allocated from slab, discontiguous pages,
contiguous pages or [2] vmapped memory. It is basically a two decade
old re-implementation of the Irix low layer global disk-addressed
buffer cache, modernised and tailored directly to the needs of XFS
metadata caching.

[3] Keep in mind that the xfs_buf cache used to be page cache
backed. The page cache provided the caching and memory reclaim
infrastructure to the xfs_buf handles - and so we do actually have
recent direct experience on Linux with the architecture you are
proposing here. This architecture proved to be a major limitation
from performance, multi-object state coherency and cache residency
prioritisation aspects. It really sucked with systems that had 64KB
page sizes and 4kB metadata block sizes, and ....

[4] So we went back to the old Irix way of managing the cache - our
own buffer based LRUs and aging mechanisms, with memory reclaim run
by shrinkers based on buffer-type based priorities. We use bulk
page allocation for buffers >= PAGE_SIZE, and slab allocation for
those < PAGE_SIZE. That's exactly what you are suggesting we do with 2MB
sized base pages, but without having to care about mmap() at all.

--
Dave Chinner
david@fromorbit.com
Re: Folio discussion recap [ In reply to ]
On Fri, Sep 17, 2021 at 03:24:40PM +1000, Dave Chinner wrote:
> Folios are not perfect, but they are here and they solve many issues
> we need solved. We're never going to have a perfect solution that
> everyone agrees with, so the real question is "are folios good
> enough?". To me the answer is a resounding yes.

Besides agreeing with all you said, the other important part is:
even if we were to eventually go with Johannes' grand plans (which I
disagree with in many aspects), what is the harm in doing folios now?

Despite all the fuss, the pending folio PR does nothing but add type
safety to compound pages. Which is something we badly need, no matter
what kind of other caching grand plans people have.
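
To be concrete about what that type safety amounts to - simplified,
the real struct folio in the series carries more than this, but the
concept is the same:

	struct folio {
		struct page page;	/* always a head (or order-0) page */
	};

	/* page_folio() resolves any page, head or tail, to its folio... */
	struct folio *folio = page_folio(page);

	/*
	 * ...and code that is handed a folio can trust it is not a tail
	 * page, so helpers like folio_nr_pages() don't need the repeated
	 * compound_head() lookups the struct page interfaces do.
	 */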
Re: Folio discussion recap [ In reply to ]
On Fri, Sep 17, 2021 at 03:24:40PM +1000, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 12:54:22PM -0400, Johannes Weiner wrote:
> > I agree with what I think the filesystems want: instead of an untyped,
> > variable-sized block of memory, I think we should have a typed page
> > cache desciptor.
>
> I don't think that's what fs devs want at all. It's what you think
> fs devs want. If you'd been listening to us the same way that Willy
> has been for the past year, maybe you'd have a different opinion.

I was going off of Darrick's remarks about non-pagecache uses, Kent's
remarks about simple and obvious core data structures, and yes
your suggestion of "cache page".

But I think you may have overinterpreted what I meant by cache
descriptor:

> Indeed, we don't actually need a new page cache abstraction.

I didn't suggest to change what the folio currently already is for the
page cache. I asked to keep anon pages out of it (and in the future
potentially other random stuff that is using compound pages).

It doesn't have any bearing on how it presents to you on the
filesystem side, other than that it isn't as overloaded as struct page
is with non-pagecache stuff.

A full-on disconnect between the cache entry descriptor and the page
is something that came up during speculation on how the MM will be
able to effectively raise the page size and meet scalability
requirements on modern hardware - and in that context I do appreciate
you providing background information on the chunk cache, which will be
valuable to inform *that* discussion.

But it isn't what I suggested as the immediate action to unblock the
folio merge.

> The fact that so many fs developers are pushing *hard* for folios is
> that it provides what we've been asking for individually over the
> last few years.

I'm not sure filesystem people are pushing hard for non-pagecache
stuff to be in the folio.

> Willy has done a great job of working with the fs developers and
> getting feedback at every step of the process, and you see that in
> the amount of work in progress that is already based on
> folios.

And that's great, but the folio is blocked on MM questions:

1. Is the folio a good descriptor for all uses of anon and file pages
inside MM code way beyond the page cache layer YOU care about?

2. Are compound pages a scalable, future-proof allocation strategy?

For some people the answers are yes, for others they are a no.

For 1), the value proposition is to clean up the relatively recent
head/tail page confusion. And though everybody agrees that there is
value in that, it's a LOT of churn for what it does. Several people
have pointed this out, and AFAICS this is the most common reason for
people that have expressed doubt or hesitation over the patches.

In an attempt to address this, I pointed out the cleanup opportunities
that would open up by using separate anon and file folio types instead
of one type for both. Nothing more. No intermediate thing, no chunk
cache. Doesn't affect you. Just taking Willy's concept of type safety
and applying it to file and anon instead of page vs compound page.

- It wouldn't change anything for fs people from the current folio
patchset (except maybe the name)

- It would accomplish the head/tail page cleanup the same way, since
just like a folio, a "file folio" could also never be a tail page

- It would take the same solution folio prescribes to the compound
page issue (explicit typing to get rid of useless checks, lookups
and subtle bugs) and solve way more instances of this all over MM
code, thereby hopefully boosting the value proposition and making
*that part* of the patches a clearer win for the MM subsystem

This is a question directed at MM people, not filesystem people. It
doesn't pertain to you at all.

And if MM people agree or want to keep discussing it, the relatively
minor action item for the folio patch is the same: drop the partial
anon-to-folio conversion bits inside MM code for now and move on.

For 2), nobody knows the answer to this. Nobody. Anybody who claims to
do so is full of sh*t. Maybe compound pages work out, maybe they
don't. We can talk a million years about larger page sizes, how to
handle internal fragmentation, the difficulties of implementing a
chunk cache, but it's completely irrelevant because it's speculative.

We know there are multiple page sizes supported by the hardware and
the smallest supported one is no longer the most dominant one. We do
not know for sure yet how the MM is internally going to lay out its
type system so that the allocator, mmap, page reclaim etc. can be CPU
efficient and the descriptors be memory efficient.

Nobody's "grand plan" here is any more viable, tested or proven than
anybody else's.

My question for fs folks is simply this: as long as you can pass a
folio to kmap and mmap and it knows what to do with it, is there any
filesystem relevant requirement that the folio map to 1 or more
literal "struct page", and that folio_page(), folio_nr_pages() etc be
part of the public API? Or can we keep this translation layer private
to MM code? And will page_folio() be required for anything beyond the
transitional period away from pages?

Can we move things not used outside of MM into mm/internal.h, mark the
transitional bits of the public API as such, and move on?

The unproductive vitriol, personal attacks and dismissiveness over
relatively minor asks and RFCs from the subsystem that is the most
impacted by this patchset is just nuts.
Re: Folio discussion recap [ In reply to ]
On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> I didn't suggest to change what the folio currently already is for the
> page cache. I asked to keep anon pages out of it (and in the future
> potentially other random stuff that is using compound pages).

It would mean that anon-THP cannot benefit from the work Willy did with
folios. Anon-THP is the most active user of compound pages at the moment
and it also suffers from the compound_head() plague. You ask to exclude
anon-THP citing *possible* future benefits for pagecache.

Sorry, but this doesn't sound fair to me.

We already had a similar experiment with PAGE_CACHE_SIZE. It was introduced
with the hope of having PAGE_CACHE_SIZE != PAGE_SIZE one day. It never happened
and only caused confusion on the border between pagecache-specific code
and generic code that handled both file and anon pages.

If you want to limit usage of the new type to pagecache, the burden is on
you to prove that it is useful and not just a dead weight.

--
Kirill A. Shutemov
Re: Folio discussion recap [ In reply to ]
Snipped, reordered:

On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> 2. Are compound pages a scalable, future-proof allocation strategy?
>
> For 2), nobody knows the answer to this. Nobody. Anybody who claims to
> do so is full of sh*t. Maybe compound pages work out, maybe they
> don't. We can talk a million years about larger page sizes, how to
> handle internal fragmentation, the difficulties of implementing a
> chunk cache, but it's completely irrelevant because it's speculative.

Calling it compound pages here is a misnomer, and it confuses the discussion.
The question is really about whether we should start using higher order
allocations for data in the page cache, and perhaps a better way of framing that
question is: should we continue to fragment all our page cache allocations up
front into individual pages?

But I don't think this is really the blocker.

> 1. Is the folio a good descriptor for all uses of anon and file pages
> inside MM code way beyond the page cache layer YOU care about?
>
> For some people the answers are yes, for others they are a no.

The anon page conversion does seem to be where all the disagreement is coming
from.

So my ask, to everyone involved is - if anonymous pages are dropped from the
folio patches, do we have any other real objections to the patch series?

It's an open question as to how much anonymous pages are like file pages, and if
we continue down the route of splitting up struct page into separate types,
whether anonymous pages should be the same type as file pages.

Also, it appears even file pages aren't fully converted to folios in Willy's
patch set - grepping around reveals plenty of references to struct page left in
fs/. I think that even if anonymous pages are going to become folios it's a
pretty reasonable ask for that to wait a cycle or two and see how the conversion
of file pages fully plays out.

Also: it's become pretty clear to me that we have crappy communications between
MM developers and filesystem developers. Internally both teams have solid
communications - I know in filesystem land we all talk to each other and are
pretty good at working collaboratively, and it sounds like the MM team also has
good internal communications. But we seem to have some problems with tackling
issues that cross over between FS and MM land, or awkwardly sit between them.

Perhaps this is something we could try to address when picking conference topics
in the future. Johannes also mentioned a monthly group call the MM devs schedule
- I wonder if it would be useful to get something similar going between MM and
interested parties in filesystem land.
Re: Folio discussion recap [ In reply to ]
On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > I didn't suggest to change what the folio currently already is for the
> > page cache. I asked to keep anon pages out of it (and in the future
> > potentially other random stuff that is using compound pages).
>
> It would mean that anon-THP cannot benefit from the work Willy did with
> folios. Anon-THP is the most active user of compound pages at the moment
> and it also suffers from the compound_head() plague. You ask to exclude
> anon-THP citing *possible* future benefits for pagecache.
>
> Sorry, but this doesn't sound fair to me.

I'm less concerned with what's fair than figuring out what the consensus is so
we can move forward. I agree that anonymous THPs could benefit greatly from
conversion to folios - but looking at the code it doesn't look like much of that
has been done yet.

I understand you've had some input into the folio patches, so maybe you'd be
best able to answer while Matthew is away - would it be fair to say that, in the
interests of moving forward, anonymous pages could be split out for now? That
way the MM people gain time to come to their own consensus and we can still
unblock the FS work that's already been done on top of folios.
Re: Folio discussion recap [ In reply to ]
On Fri, Sep 17, 2021 at 05:17:09PM -0400, Kent Overstreet wrote:
> On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > > I didn't suggest to change what the folio currently already is for the
> > > page cache. I asked to keep anon pages out of it (and in the future
> > > potentially other random stuff that is using compound pages).
> >
> > It would mean that anon-THP cannot benefit from the work Willy did with
> > folios. Anon-THP is the most active user of compound pages at the moment
> > and it also suffers from the compound_head() plague. You ask to exclude
> > anon-THP citing *possible* future benefits for pagecache.
> >
> > Sorry, but this doesn't sound fair to me.
>
> I'm less concerned with what's fair than figuring out what the consensus is so
> we can move forward. I agree that anonymous THPs could benefit greatly from
> conversion to folios - but looking at the code it doesn't look like much of that
> has been done yet.
>
> I understand you've had some input into the folio patches, so maybe you'd be
> best able to answer while Matthew is away - would it be fair to say that, in the
> interests of moving forward, anonymous pages could be split out for now? That
> way the MM people gain time to come to their own consensus and we can still
> unblock the FS work that's already been done on top of folios.

I can't answer for Matthew.

The anon conversion patchset doesn't exist yet (but it is planned), so
there's nothing to split out. Once someone comes up with such a patchset,
they will have to sell it upstream on its own merit.

Possible future efforts should not block code at hands. "Talk is cheap.
Show me the code."

--
Kirill A. Shutemov
Re: Folio discussion recap [ In reply to ]
On Sat, Sep 18, 2021 at 01:02:09AM +0300, Kirill A. Shutemov wrote:
> I can't answer for Matthew.
>
> The anon conversion patchset doesn't exist yet (but it is planned), so
> there's nothing to split out. Once someone comes up with such a patchset,
> they will have to sell it upstream on its own merit.

Perhaps we've been operating under some incorrect assumptions then. If the
current patch series doesn't actually touch anonymous pages - the patch series
does touch code in e.g. mm/swap.c, but looking closer it might just be due to
the (mis)organization of the current code - maybe there aren't any real
objections left?
Re: Folio discussion recap [ In reply to ]
On Fri, Sep 17, 2021 at 05:13:10PM -0400, Kent Overstreet wrote:
> Also: it's become pretty clear to me that we have crappy
> communications between MM developers and filesystem
> developers.

I think one of the challenges has been the lack of an LSF/MM since
2019. And it may be that having *some* kind of ad hoc technical
discussion given that LSF/MM in 2021 is not happening might be a good
thing. I'm sure if we asked nicely, we could use the LPC
infrastructure to set up something, assuming we can find a mutually
agreeable day or dates.

> Internally both teams have solid communications - I know
> in filesystem land we all talk to each other and are pretty good at
> working collaboratively, and it sounds like the MM team also has good
> internal communications. But we seem to have some problems with
> tackling issues that cross over between FS and MM land, or awkwardly
> sit between them.

That's a bit of an over-generalization; it seems like we've uncovered
that some of the disagreements are between different parts of the MM
community over the suitability of folios for anonymous pages.

And it's interesting, because I don't really consider Willy to be one
of "the FS folks" --- and he has been quite diligent to reaching out
to a number of folks in the FS community about our needs, and it's
clear that this has been really, really helpful. There's no question
that we've had for many years some difficulties in the code paths that
sit between FS and MM, and I'd claim that it's not just because of
communications, but the relative lack of effort that was focused in
that area. The fact that Willy has spent the last 9 months working on
FS / MM interactions has been really great, and I hope it continues.

That being said, it sounds like there are issues internal to the MM
devs that still need to be ironed out, and at the risk of throwing the
anon-THP folks under the bus, if we can land at least some portion of
the folio commits, it seems like that would be a step in the right
direction.

Cheers,

- Ted
Re: Folio discussion recap [ In reply to ]
On Fri, Sep 17, 2021 at 11:57:35PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> > I didn't suggest to change what the folio currently already is for the
> > page cache. I asked to keep anon pages out of it (and in the future
> > potentially other random stuff that is using compound pages).
>
> It would mean that anon-THP cannot benefit from the work Willy did with
> folios. Anon-THP is the most active user of compound pages at the moment
> and it also suffers from the compound_head() plague. You ask to exclude
> anon-THP citing *possible* future benefits for pagecache.
>
> Sorry, but this doesn't sound fair to me.

Hold on Kirill. I'm not saying we shouldn't fix anonthp. But let's
clarify the actual code in question in this specific patchset. You say
anonthp cannot benefit from folio, but in the other email you say this
patchset isn't doing the conversion yet.

The code I'm specifically referring to here is the conversion of some
code that encounters both anon and file pages - swap.c, memcontrol.c,
workingset.c, and a few other places. It's a small part of the folio
patches, but it's a big deal for the MM code conceptually.

I'm requesting to drop those and just keep the page cache bits. Not
because I think anonthp shouldn't be fixed, but because I think we're
not in agreement yet on how they should be fixed. And it's somewhat
independent of fixing the page cache interface, which people are now
waiting on much more desperately and acutely than we inside MM wait
for a struct page cleanup. It's not good to hold them up while we argue.

Dropping the anon bits isn't final. Depending on how our discussion
turns out, we can still put them in later or we can put in something
new. The important thing is that the uncontroversial page cache bits
aren't held up any longer while we figure it out.

> If you want to limit usage of the new type to pagecache, the burden is on
> you to prove that it is useful and not just a dead weight.

I'm not asking to add anything to the folio patches, just to remove
some bits around the edges. And for the page cache bits: I think we
have a rather large number of folks really wanting those. Now.

Again, I think we should fix anonthp. But I also think we should
really look at struct page more broadly. And I think we should have
that discussion inside a forum of MM people that truly care.

I'm just trying to unblock the fs folks at this point and merge what
we can now.
Re: Folio discussion recap [ In reply to ]
On 9/17/21 6:25 PM, Theodore Ts'o wrote:
> On Fri, Sep 17, 2021 at 05:13:10PM -0400, Kent Overstreet wrote:
>> Also: it's become pretty clear to me that we have crappy
>> communications between MM developers and filesystem
>> developers.
>
> I think one of the challenges has been the lack of an LSF/MM since
> 2019. And it may be that having *some* kind of ad hoc technical
> discussion given that LSF/MM in 2021 is not happening might be a good
> thing. I'm sure if we asked nicely, we could use the LPC
> infrasutrcture to set up something, assuming we can find a mutually
> agreeable day or dates.
>

We have a slot for this in the FS MC, first slot actually, so hopefully
we can get things hashed out there. Thanks,

Josef
Re: Folio discussion recap [ In reply to ]
On Fri, Sep 17, 2021 at 12:31:36PM -0400, Johannes Weiner wrote:
> My question for fs folks is simply this: as long as you can pass a
> folio to kmap and mmap and it knows what to do with it, is there any
> filesystem relevant requirement that the folio map to 1 or more
> literal "struct page", and that folio_page(), folio_nr_pages() etc be
> part of the public API?

In the short term, yes, we need those things in the public API.
In the long term, not so much.

We need something in the public API that tells us the offset and
size of the folio. Lots of page cache code currently does stuff like
calculate the size or iteration counts based on the difference of
page->index values (i.e. number of pages) and iterate page by page.
A direct conversion of such algorithms increments by
folio_nr_pages() instead of 1. So stuff like this is definitely
necessary as public APIs in the initial conversion.
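
As a minimal sketch of that kind of conversion (illustrative only;
filemap_get_folio() stands in for whatever the folio lookup helper
ends up being, and error handling is elided):

	/* Today: iterate a range one PAGE_SIZE unit at a time. */
	for (index = start; index <= end; index++) {
		page = find_get_page(mapping, index);
		/* ... operate on a single page ... */
		put_page(page);
	}

	/* Direct folio conversion: step by the size of each folio. */
	for (index = start; index <= end; ) {
		folio = filemap_get_folio(mapping, index);
		/* ... operate on the whole folio at once ... */
		index += folio_nr_pages(folio);
		folio_put(folio);
	}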

Let's face it, folio_nr_pages() is a huge improvement on directly
exposing THP/compound page interfaces to filesystems and leaving
them to work it out for themselves. So even in the short term, these
API members represent a major step forward in mm API cleanliness.

As for long term, everything in the page cache API needs to
transition to byte offsets and byte counts instead of units of
PAGE_SIZE and page->index. That's a more complex transition, but
AFAIA that's part of the future work Willy is intended to do with
folios and the folio API. Once we get away from accounting and
tracking everything as units of struct page, all the public facing
APIs that use those units can go away.

It's fairly slow to do this, because we have so much code that is
doing stuff like converting file offsets between byte counts and
page counts and vice versa. And it's not necessary to do an initial
conversion to folios, either. But once everything in the page cache
indexing API moves to byte ranges, the need to count pages, use page
counts as ranges, iterate by page index, etc. all goes away and
hence those APIs can also go away.

As for converting between folios and pages, we'll need those sorts
of APIs for the foreseeable future because low level storage layers
and hardware use pages for their scatter gather arrays and at some
point we've got to expose those pages from behind the folio API.
Even if we replace struct page with some other hardware page
descriptor, we're still going to need such translation APIs at some
point in the stack....

> Or can we keep this translation layer private
> to MM code? And will page_folio() be required for anything beyond the
> transitional period away from pages?

No idea, but as per above I think it's a largely irrelevant concern
for the foreseeable future because pages will be here for a long time
yet.

> Can we move things not used outside of MM into mm/internal.h, mark the
> transitional bits of the public API as such, and move on?

Sure, but that's up to you to do as a patch set on top of Willy's
folio trees if you think it improves the status quo. Write the
patches and present them for review just like everyone else does,
and they can be discussed on their merits in that context rather
than being presented as a reason for blocking current progress on
folios.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
Re: Folio discussion recap [ In reply to ]
On Sat, Sep 18, 2021 at 11:04:40AM +1000, Dave Chinner wrote:
> As for long term, everything in the page cache API needs to
> transition to byte offsets and byte counts instead of units of
> PAGE_SIZE and page->index. That's a more complex transition, but
> AFAIA that's part of the future work Willy is intended to do with
> folios and the folio API. Once we get away from accounting and
> tracking everything as units of struct page, all the public facing
> APIs that use those units can go away.

Probably 95% of the places we use page->index and page->mapping aren't necessary
because we've already got that information from the context we're in and
removing them would be a useful cleanup - if we've already got that from context
(e.g. we're looking up the page in the page cache, via i_pages) eliminating the
page->index or page->mapping use means we're getting rid of a data dependency so
it's good for performance - but more importantly, those (much fewer) places in
the code where we actually _do_ need page->index and page->mapping are really
important places to be able to find because they're interesting boundaries
between different components in the VM.
Re: Folio discussion recap [ In reply to ]
On Sat, Sep 18, 2021 at 12:51:50AM -0400, Kent Overstreet wrote:
> On Sat, Sep 18, 2021 at 11:04:40AM +1000, Dave Chinner wrote:
> > As for long term, everything in the page cache API needs to
> > transition to byte offsets and byte counts instead of units of
> > PAGE_SIZE and page->index. That's a more complex transition, but
> > AFAIA that's part of the future work Willy is intended to do with
> > folios and the folio API. Once we get away from accounting and
> > tracking everything as units of struct page, all the public facing
> > APIs that use those units can go away.
>
> Probably 95% of the places we use page->index and page->mapping aren't necessary
> because we've already got that information from the context we're in and
> removing them would be a useful cleanup

*nod*

> - if we've already got that from context
> (e.g. we're looking up the page in the page cache, via i_pages) eliminating the
> page->index or page->mapping use means we're getting rid of a data dependency so
> it's good for performance - but more importantly, those (much fewer) places in
> the code where we actually _do_ need page->index and page->mapping are really
> important places to be able to find because they're interesting boundaries
> between different components in the VM.

*nod*

This is where infrastructure like write_cache_pages() is
problematic. It's not actually a component of the VM - it's core
page cache/filesystem API functionality - but the implementation is
determined by the fact there is no clear abstraction between the
page cache and the VM, and so while the filesystem side of the API is
byte-range based, the VM side is struct page based and so the
impedance mismatch has to be handled in the page cache
implementation.

Folios are definitely pointing out issues like this whilst, IMO,
demonstrating that an abstraction like folios is also a necessary
first step to address the problems they make obvious...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
Re: Folio discussion recap [ In reply to ]
On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> Q: Oh yeah, but what again are folios for, exactly?
>
> Folios are for cached filesystem data which (importantly) may be mapped to
> userspace.
>
> So when MM people see a new data structure come up with new references to page
> size - there's a very good reason with that, which is that we need to be
> allocating in multiples of the hardware page size if we're going to be able to
> map it to userspace and have PTEs point to it.
>
> So going forward, if the MM people want struct page to refer to multiple hardware
> pages - this shouldn't prevent that, and folios will refer to multiples of the
> _hardware_ page size, not struct page pagesize.
>
> Also - all the filesystem code that's being converted tends to talk and think in
> units of pages. So going forward, it would be a nice cleanup to get rid of as
> many of those references as possible and just talk in terms of bytes (e.g. I
> have generally been trying to get rid of references to PAGE_SIZE in bcachefs
> wherever reasonable, for other reasons) - those cleanups are probably for
> another patch series, and in the interests of getting this patch series merged
> with the fewest introduced bugs possible we probably want the current helpers.

I'd like to thank those who reached out off-list. Some of you know I've
had trouble with depression in the past, and I'd like to reassure you
that that's not a problem at the moment. I had a good holiday, and I
was able to keep from thinking about folios most of the time.

I'd also like to thank those who engaged in the discussion while I was
gone. A lot of good points have been made. I don't think the normal
style of replying to each email individually makes a lot of sense at
this point, so I'll make some general comments instead. I'll respond
to the process issues on the other thread.

I agree with the feeling a lot of people have expressed, that struct page
is massively overloaded and we would do much better with stronger typing.
I like it when the compiler catches bugs for me. Disentangling struct
page is something I've been working on for a while, and folios are a
step in that direction (in that they remove the two types of tail page
from the universe of possibilities).

I don't believe it is realistic to disentangle file pages and anon
pages from each other. Thanks to swap and shmem, both file pages and
anon pages need to be able to be moved in and out of the swap cache.
The swap cache shares a lot of code with the page cache, so changing
how the swap cache works is also tricky.

What I do believe is possible is something Kent hinted at; treating anon
pages more like file pages. I also believe that shmem should be able to
write pages to swap without moving the pages into the swap cache first.
But these two things are just beliefs. I haven't tried to verify them
and they may come to nothing.

I also want to split out slab_page and page_table_page from struct page.
I don't intend to convert either of those to folios.

I do want to make struct page dynamically allocated (and have for
a while). There are some complicating factors ...

There are two primary places where we need to map from a physical
address to a "memory descriptor". The one that most people care about
is get_user_pages(). We have a page table entry and need to increment
the refcount on the head page, possibly mark the head page dirty, but
also return the subpage of any compound page we find. The one that far
fewer people care about is memory-failure.c; we also need to find the
head page to determine what kind of memory has been affected, but we
need to mark the subpage as HWPoison.

Both of these need to be careful to not confuse tail and non-tail pages.
So yes, we need to use folios for anything that's mappable to userspace.
That's not just anon & file pages but also network pools, graphics card
memory and vmalloc memory. Eventually, I think struct page actually goes
down to a union of a few words of padding, along with ->compound_head.
Because that's all we're guaranteed is actually there; everything else
is only there in head pages.

There are a lot of places that should use folios which the current
patchset doesn't convert. I prioritised filesystems because we've got
~60 filesystems to convert, and working on the filesystems can proceed
in parallel with working on the rest of the MM. Also, if I converted
the entire MM at once, there would be complaints that a 600 patch series
was unreviewable. So here we are, there's a bunch of compatibility code
that indicates areas which still need to be converted.

I'm sure I've missed things, but I've been working on this email all
day and wanted to send it out before going to sleep.
Re: Folio discussion recap [ In reply to ]
On Fri, Sep 17, 2021 at 07:15:40PM -0400, Johannes Weiner wrote:
> The code I'm specifically referring to here is the conversion of some
> code that encounters both anon and file pages - swap.c, memcontrol.c,
> workingset.c, and a few other places. It's a small part of the folio
> patches, but it's a big deal for the MM code conceptually.

Hard to say without actually trying, but my worry here is that this may lead
to code duplication to separate the file and anon code paths. I dunno.

--
Kirill A. Shutemov
Re: Folio discussion recap [ In reply to ]
Just a note upfront:

This discussion is now about whether folios are suitable for anon pages
as well. I'd like to reiterate that regardless of the outcome of this
discussion I think we should probably move ahead with the page cache
bits, since people are specifically blocked on those and there is no
dependency on the anon stuff, as the conversion is incremental.

On Mon, Sep 20, 2021 at 03:17:15AM +0100, Matthew Wilcox wrote:
> I don't believe it is realistic to disentangle file pages and anon
> pages from each other. Thanks to swap and shmem, both file pages and
> anon pages need to be able to be moved in and out of the swap cache.

Yes, the swapcache is actually shared code and needs a shared type.

However, once swap and shmem are fully operating on *typed* anon and
file pages, there are no possible routes of admission for tail pages
into the swapcache:

vmscan:
	add_to_swap_cache(anon_page->page);

shmem:
	delete_from_swap_cache(file_page->page);

and so the justification for replacing page with folio *below* those
entry points to address tailpage confusion becomes nil: there is no
confusion. Move the anon bits to anon_page and leave the shared bits
in page. That's 912 lines of swap_state.c we could mostly leave alone.

The same is true for the LRU code in swap.c. Conceptually, already no
tailpages *should* make it onto the LRU. Once the high-level page
instantiation functions - add_to_page_cache_lru, do_anonymous_page -
have type safety, you really do not need to worry about tail pages
deep in the LRU code. 1155 more lines of swap.c.

And when you've ensured that tail pages can't make it onto the LRU,
that takes care of the entire page reclaim code as well; converting it
wholesale to folio again would provide little additional value. 4707
lines of vmscan.c.

And with the page instantiation functions typed, nobody can pass tail
pages into memcg charging code, either. 7509 lines of memcontrol.c.

But back to your generic swapcache example: beyond the swapcache and
the page LRU management, there really isn't a great deal of code that
is currently truly type-agnostic and generic like that. And the rest
could actually benefit from being typed more tightly to bring out what
is actually going on.

The anon_page->page relationship may look familiar too. It's a natural
type hierarchy between superclass and subclasses that is common in
object oriented languages: page has attributes and methods that are
generic and shared; anon_page and file_page encode where their
implementation differs.

A type system like that would set us up for a lot of clarification and
generalization of the MM code. For example it would immediately
highlight when "generic" code is trying to access type-specific stuff
that maybe it shouldn't, and thus help/force us refactor - something
that a shared, flat folio type would not.

And again, higher-level types would take care of the tail page
confusion in many (most?) places automatically.

> The swap cache shares a lot of code with the page cache, so changing
> how the swap cache works is also tricky.

The overlap is actually fairly small right now. Add and delete
routines are using the raw xarray functions. Lookups use the most
minimal version of find_get_page(), which wouldn't be a big deal to
open-code until swapcache and pagecache would *actually* be unified.

> What I do believe is possible is something Kent hinted at; treating
> anon pages more like file pages. I also believe that shmem should
> be able to write pages to swap without moving the pages into the
> swap cache first. But these two things are just beliefs. I haven't
> tried to verify them and they may come to nothing.

Treating anon and file pages the same where possible makes sense. It's
simple: the more code that can be made truly generic and be shared
between subclasses, the better.

However, for that we first have to identify what parts actually are
generic, and what parts are falsely shared and shoehorned into
equivalency due to being crammed into the same overloaded structure.

For example, page->mapping for file is an address_space and the page's
membership in that tree structure is protected by the page
lock. page->mapping for anon is... not that. The pointer itself is
ad-hoc typed to point to an anon_vma instead. And anon_vmas behave
completely differently from a page's pagecache state.

The *swapcache* state of an anon page is actually much closer to what
the pagecache state of a file page is. And since it would be nice to
share more of the swapcache and pagecache *implementation*, it makes
sense that the relevant page attributes would correspond as well.

(Yeah, page->mapping and page->index are used "the same way" for rmap,
but that's a much smaller, read-only case. And when you look at how
"generic" the rmap code is - with its foo_file and foo_anon functions,
and PageAnon() checks, and conditional page locking in the shared
bits-- the attribute sharing at the page level really did nothing to
help the implementation be more generic.)

It really should be something like:

struct page {
	/* pagecache/swapcache state */
	struct address_space *address_space;
	pgoff_t index;
	lock_t lock;
};

struct file_page {
	struct page;
};

struct anon_page {
	struct page;
	struct anon_vma *anon_vma;
	pgoff_t offset;
};

to recognize the difference in anon vs file rmapping and locking,
while recognizing the similarity between swapcache and pagecache.

A shared folio would perpetuate false equivalencies between anon and
file which make it difficult to actually split out and refactor what
*should* be generic vs what should be type-specific. And instead lead
to more "generic" code littered with FolioAnon() conditionals.

And in the name of tail page cleanup it would churn through thousands
of lines of code where there is no conceptual confusion about tail
pages to begin with.

Proper type inheritance would allow us to encode how things actually
are implemented right now and would be a great first step in
identifying what needs to be done in order to share more code.

And it would take care of so many places re: tail pages that it's a
legitimate question to ask: how many places would actually be *left*
that *need* to deal with tail pages? Couldn't we bubble
compound_head() and friends into these few select places and be done?

> I also want to split out slab_page and page_table_page from struct page.
> I don't intend to convert either of those to folios.
>
> I do want to make struct page dynamically allocated (and have for
> a while). There are some complicating factors ...
>
> There are two primary places where we need to map from a physical
> address to a "memory descriptor". The one that most people care about
> is get_user_pages(). We have a page table entry and need to increment
> the refcount on the head page, possibly mark the head page dirty, but
> also return the subpage of any compound page we find. The one that far
> fewer people care about is memory-failure.c; we also need to find the
> head page to determine what kind of memory has been affected, but we
> need to mark the subpage as HWPoison.
>
> Both of these need to be careful to not confuse tail and non-tail pages.

That makes sense.

But gup() as an interface to the rest of the kernel is rather strange:
It's not a generic page table walker that can take a callback argument
to deal with whatever the page table points to. It also doesn't return
properly typed objects: it returns struct page which is currently a
wildcard for whatever people cram into it.

> So yes, we need to use folios for anything that's mappable to userspace.
> That's not just anon & file pages but also network pools, graphics card
> memory and vmalloc memory. Eventually, I think struct page actually goes
> down to a union of a few words of padding, along with ->compound_head.
> Because that's all we're guaranteed is actually there; everything else
> is only there in head pages.

(Side question: if GUP can return tail pages, how does that map to
folios?)

Anyway, I don't see that folio for everything mappable is the obvious
conclusion, because it doesn't address what is really weird about gup.

While a folio interface would clean up the head and tail page issue,
it maintains the incentive of cramming everything that people want to
mmap and gup into the same wildcard type struct. And it still leaves the
bigger problem of ad-hoc typing that wildcard ("What is the thing that
was returned? Anon? File? GPU memory?") to the user.

I think rather than cramming everything that can be mmapped into folio
for the purpose of GUP and tail pages - even when these objects have
otherwise little in common - it would make more sense to reconsider
how GUP as an interface deals with typing. Some options:

a) Make it a higher-order function that leaves typing fully to the
provided callback (see the sketch after this list). This makes it
clear (and greppable) which functions need to be wary about tail
pages, and type inference in general.

b) Create an intermediate mmap type that can map to one of the
higher-order types like anon or file, but never to a tail
page. This sounds like what you want struct page to be
long-term. But this sort of depends on the side question above -
what if pte maps tail page?

c) Provide a stricter interface for known higher-order types
(get_anon_pages...). Supporting new types means adding more entry
function, which IMO is preferable to cramming more stuff into a
wildcard struct folio.

d) A hybrid of a) and c) to safely cover common cases, while
allowing "i know what i'm doing" uses.

In summary, I think a page type hierarchy would do wonders to clean up
anon and file page implementations, and encourage and enable more code
sharing down the line, while taking care of tail pages as well.

This leaves the question how many places are actually *left* to deal
with tail pages in MM. Folios are based on the premise that the
confusion is simply everywhere, and that everything needs to be
converted first to be safe. This is convenient because it means we
never have to identify which parts truly *do* need tailpage handling,
truly *need* the compound_head() lookups. Yes, compound_head() has to
go from generic page flags testers. But as per the examples at the
top, I really don't think we need to convert every crevice of the MM
code to folio before we can be reasonably sure that removing it is
safe. I really want to see a better ballpark analysis of what parts
need to deal with tail pages to justify all this churn for them.
Re: Folio discussion recap [ In reply to ]
On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote:
> This discussion is now about whether folios are suitable for anon pages
> as well. I'd like to reiterate that regardless of the outcome of this
> discussion I think we should probably move ahead with the page cache
> bits, since people are specifically blocked on those and there is no
> dependency on the anon stuff, as the conversion is incremental.

So you withdraw your NAK for the 5.15 pull request which is now four
weeks old and has utterly missed the merge window?

> and so the justification for replacing page with folio *below* those
> entry points to address tailpage confusion becomes nil: there is no
> confusion. Move the anon bits to anon_page and leave the shared bits
> in page. That's 912 lines of swap_state.c we could mostly leave alone.

Your argument seems to be based on "minimising churn". Which is certainly
a goal that one could have, but I think in this case is actually harmful.
There are hundreds, maybe thousands, of functions throughout the kernel
(certainly throughout filesystems) which assume that a struct page is
PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that,
but tracking them all down is a never-ending task as new ones will be
added as fast as they can be removed.

> The same is true for the LRU code in swap.c. Conceptually, already no
> tailpages *should* make it onto the LRU. Once the high-level page
> instantiation functions - add_to_page_cache_lru, do_anonymous_page -
> have type safety, you really do not need to worry about tail pages
> deep in the LRU code. 1155 more lines of swap.c.

It's actually impossible in practice as well as conceptually. The list
LRU is in the union with compound_head, so you cannot put a tail page
onto the LRU. But yet we call compound_head() on every one of them
multiple times because our current type system does not allow us to
express "this is not a tail page".

> The anon_page->page relationship may look familiar too. It's a natural
> type hierarchy between superclass and subclasses that is common in
> object oriented languages: page has attributes and methods that are
> generic and shared; anon_page and file_page encode where their
> implementation differs.
>
> A type system like that would set us up for a lot of clarification and
> generalization of the MM code. For example it would immediately
> highlight when "generic" code is trying to access type-specific stuff
> that maybe it shouldn't, and thus help/force us refactor - something
> that a shared, flat folio type would not.

If you want to try your hand at splitting out anon_folio from folio
later, be my guest. I've just finished splitting out 'slab' from page,
and I'll post it later. I don't think that splitting anon_folio from
folio is worth doing, but will not stand in your way. I do think that
splitting tail pages from non-tail pages is worthwhile, and that's what
this patchset does.
Re: Folio discussion recap [ In reply to ]
On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote:
> > and so the justification for replacing page with folio *below* those
> > entry points to address tailpage confusion becomes nil: there is no
> > confusion. Move the anon bits to anon_page and leave the shared bits
> > in page. That's 912 lines of swap_state.c we could mostly leave alone.
>
> Your argument seems to be based on "minimising churn". Which is certainly
> a goal that one could have, but I think in this case is actually harmful.
> There are hundreds, maybe thousands, of functions throughout the kernel
> (certainly throughout filesystems) which assume that a struct page is
> PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that,
> but tracking them all down is a never-ending task as new ones will be
> added as fast as they can be removed.

Yet it's only file backed pages that are actually changing in behaviour right
now - folios don't _have_ to be the tool to fix that elsewhere, for anon, for
network pools, for slab.

> > The anon_page->page relationship may look familiar too. It's a natural
> > type hierarchy between superclass and subclasses that is common in
> > object oriented languages: page has attributes and methods that are
> > generic and shared; anon_page and file_page encode where their
> > implementation differs.
> >
> > A type system like that would set us up for a lot of clarification and
> > generalization of the MM code. For example it would immediately
> > highlight when "generic" code is trying to access type-specific stuff
> > that maybe it shouldn't, and thus help/force us refactor - something
> > that a shared, flat folio type would not.
>
> If you want to try your hand at splitting out anon_folio from folio
> later, be my guest. I've just finished splitting out 'slab' from page,
> and I'll post it later. I don't think that splitting anon_folio from
> folio is worth doing, but will not stand in your way. I do think that
> splitting tail pages from non-tail pages is worthwhile, and that's what
> this patchset does.

Eesh, we can and should hold ourselves to a higher standard in our technical
discussions.

Let's not let past misfortune (and yes, folios missing 5.15 _was_ unfortunate
and shouldn't have happened) colour our perceptions and keep us from having
productive working relationships going forward. The points Johannes is bringing
up are valid and pertinent and deserve to be discussed.

If you're still trying to sell folios as the be all, end all solution for
anything using compound pages, I think you should be willing to make the
argument that that really is the _right_ solution - not just that it was the one
easiest for you to implement.

Actual code might make this discussion more concrete and clearer. Could you post
your slab conversion?
Re: Folio discussion recap [ In reply to ]
On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote:
> > This discussion is now about whether folio are suitable for anon pages
> > as well. I'd like to reiterate that regardless of the outcome of this
> > discussion I think we should probably move ahead with the page cache
> > bits, since people are specifically blocked on those and there is no
> > dependency on the anon stuff, as the conversion is incremental.
>
> So you withdraw your NAK for the 5.15 pull request which is now four
> weeks old and has utterly missed the merge window?

Once you drop the bits that convert shared anon and file
infrastructure, yes. Because we haven't yet discussed, nor agreed,
that folios are the way forward for anon pages.

> > and so the justification for replacing page with folio *below* those
> > entry points to address tailpage confusion becomes nil: there is no
> > confusion. Move the anon bits to anon_page and leave the shared bits
> > in page. That's 912 lines of swap_state.c we could mostly leave alone.
>
> Your argument seems to be based on "minimising churn". Which is certainly
> a goal that one could have, but I think in this case is actually harmful.
> There are hundreds, maybe thousands, of functions throughout the kernel
> (certainly throughout filesystems) which assume that a struct page is
> PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that,
> but tracking them all down is a never-ending task as new ones will be
> added as fast as they can be removed.

What does that have to do with anon pages?

> > The same is true for the LRU code in swap.c. Conceptually, already no
> > tailpages *should* make it onto the LRU. Once the high-level page
> > instantiation functions - add_to_page_cache_lru, do_anonymous_page -
> > have type safety, you really do not need to worry about tail pages
> > deep in the LRU code. 1155 more lines of swap.c.
>
> It's actually impossible in practice as well as conceptually. The list
> LRU is in the union with compound_head, so you cannot put a tail page
> onto the LRU. But yet we call compound_head() on every one of them
> multiple times because our current type system does not allow us to
> express "this is not a tail page".

No, because we haven't identified *who actually needs* these calls
and moved them up and out of the low-level helpers.

It was a mistake to add them there, yes. But they were added recently
for rather few callers. And we've had people send patches already to
move them where they are actually needed.

Of course converting *absolutely everybody else* to not-tailpage
instead will also fix the problem... I just don't agree that this is
an appropriate response to the issue.

Asking again: who conceptually deals with tail pages in MM? LRU and
reclaim don't. The page cache doesn't. Compaction doesn't. Migration
doesn't. All these data structures and operations are structured
around headpages, because that's the logical unit they operate on. The
notable exception, of course, is the page tables, because they map the
pfns of tail pages. But is that it? Does it come down to page table
walkers encountering pte-mapped tailpages? And needing compound_head()
before calling mark_page_accessed() or set_page_dirty()?
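
To make that concrete, the call site pattern at issue looks roughly like
this - a sketch with a made-up function name, not code from the series,
assuming the usual <linux/mm.h> helpers:

static void touch_mapped_page(struct vm_area_struct *vma,
			      unsigned long addr, pte_t pte)
{
	struct page *page = vm_normal_page(vma, addr, pte);

	if (!page)
		return;
	/*
	 * The pte may map a tail page of a compound page; the flags,
	 * LRU and dirty state live in the head page, so somebody - the
	 * walker here, or today the helper itself - has to hop there.
	 */
	page = compound_head(page);
	mark_page_accessed(page);
}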

We couldn't fix vm_normal_page() to handle this? And switch khugepaged
to a new vm_raw_page() or whatever?

It should be possible to answer this question as part of the case for
converting tens of thousands of lines of code to folios.
Re: Folio discussion recap
On Tue, Sep 21, 2021 at 05:11:09PM -0400, Kent Overstreet wrote:
> On Tue, Sep 21, 2021 at 09:38:54PM +0100, Matthew Wilcox wrote:
> > On Tue, Sep 21, 2021 at 03:47:29PM -0400, Johannes Weiner wrote:
> > > and so the justification for replacing page with folio *below* those
> > > entry points to address tailpage confusion becomes nil: there is no
> > > confusion. Move the anon bits to anon_page and leave the shared bits
> > > in page. That's 912 lines of swap_state.c we could mostly leave alone.
> >
> > Your argument seems to be based on "minimising churn". Which is certainly
> > a goal that one could have, but I think in this case is actually harmful.
> > There are hundreds, maybe thousands, of functions throughout the kernel
> > (certainly throughout filesystems) which assume that a struct page is
> > PAGE_SIZE bytes. Yes, every single one of them is buggy to assume that,
> > but tracking them all down is a never-ending task as new ones will be
> > added as fast as they can be removed.
>
> Yet it's only file backed pages that are actually changing in behaviour right
> now - folios don't _have_ to be the tool to fix that elsewhere, for anon, for
> network pools, for slab.

The point (I think) Johannes is making is that some of the patches in
this series touch code paths which are used by both anon and file pages.
And it's those he's objecting to.

> > If you want to try your hand at splitting out anon_folio from folio
> > later, be my guest. I've just finished splitting out 'slab' from page,
> > and I'll post it later. I don't think that splitting anon_folio from
> > folio is worth doing, but will not stand in your way. I do think that
> > splitting tail pages from non-tail pages is worthwhile, and that's what
> > this patchset does.
>
> Eesh, we can and should hold ourselves to a higher standard in our technical
> discussions.
>
> Let's not let past misfourtune (and yes, folios missing 5.15 _was_ unfortunate
> and shouldn't have happened) colour our perceptions and keep us from having
> productive working relationships going forward. The points Johannes is bringing
> up are valid and pertinent and deserve to be discussed.
>
> If you're still trying to sell folios as the be all, end all solution for
> anything using compound pages, I think you should be willing to make the
> argument that that really is the _right_ solution - not just that it was the one
> easiest for you to implement.

Starting from the principle that the type of a pointer should never be
wrong, GUP can convert from a PTE to a struct page. We need a name for
the head page that GUP converts to, and my choice for that name is folio.
A folio needs a refcount, a lock bit and a dirty bit.
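
To spell that out - and this is only a conceptual sketch, not the struct
folio definition from the series, which mirrors struct page's layout - the
minimum such a type has to carry is:

struct folio {
	unsigned long flags;	/* holds the lock bit and the dirty bit */
	atomic_t _refcount;	/* the reference GUP takes */
};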

By the way, I think I see a path to:

struct page {
unsigned long compound_head;
};

which will reduce the overhead of struct page from 64 bytes to 8.
That should solve one of Johannes' problems.
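
(For scale: with 4KiB pages, 64 bytes of struct page is ~1.6% of all
memory - roughly 256MiB on a 16GiB machine - whereas 8 bytes would be
~0.2%, about 32MiB.)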

> Actual code might make this discussion more concrete and clearer. Could you post
> your slab conversion?

It's a bit big and deserves to be split into multiple patches.
It's on top of folio-5.15. It also only really works for SLUB right
now; CONFIG_SLAB doesn't compile yet. It does pass xfstests with
CONFIG_SLUB ;-)

I'm not entirely convinced I've done the right thing with
page_memcg_check(). There are probably other things wrong with it;
I was banging it out during gaps between sessions at Plumbers.

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ddeaba947eb3..5f3d2efeb88b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -981,7 +981,7 @@ static void __meminit free_pagetable(struct page *page, int order)
if (PageReserved(page)) {
__ClearPageReserved(page);

- magic = (unsigned long)page->freelist;
+ magic = page->index;
if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
while (nr_pages--)
put_page_bootmem(page++);
diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
index 2bc8b1f69c93..cc35d010fa94 100644
--- a/include/linux/bootmem_info.h
+++ b/include/linux/bootmem_info.h
@@ -30,7 +30,7 @@ void put_page_bootmem(struct page *page);
*/
static inline void free_bootmem_page(struct page *page)
{
- unsigned long magic = (unsigned long)page->freelist;
+ unsigned long magic = page->index;

/*
* The reserve_bootmem_region sets the reserved flag on bootmem
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index dd874a1ee862..59c860295618 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -188,11 +188,11 @@ static __always_inline size_t kasan_metadata_size(struct kmem_cache *cache)
return 0;
}

-void __kasan_poison_slab(struct page *page);
-static __always_inline void kasan_poison_slab(struct page *page)
+void __kasan_poison_slab(struct slab *slab);
+static __always_inline void kasan_poison_slab(struct slab *slab)
{
if (kasan_enabled())
- __kasan_poison_slab(page);
+ __kasan_poison_slab(slab);
}

void __kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
@@ -317,7 +317,7 @@ static inline void kasan_cache_create(struct kmem_cache *cache,
slab_flags_t *flags) {}
static inline void kasan_cache_create_kmalloc(struct kmem_cache *cache) {}
static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; }
-static inline void kasan_poison_slab(struct page *page) {}
+static inline void kasan_poison_slab(struct slab *slab) {}
static inline void kasan_unpoison_object_data(struct kmem_cache *cache,
void *object) {}
static inline void kasan_poison_object_data(struct kmem_cache *cache,
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 562b27167c9e..1c0b3b95bdd7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -546,41 +546,39 @@ static inline bool folio_memcg_kmem(struct folio *folio)
}

/*
- * page_objcgs - get the object cgroups vector associated with a page
- * @page: a pointer to the page struct
+ * slab_objcgs - get the object cgroups vector associated with a slab
+ * @slab: a pointer to the slab struct
*
- * Returns a pointer to the object cgroups vector associated with the page,
- * or NULL. This function assumes that the page is known to have an
- * associated object cgroups vector. It's not safe to call this function
- * against pages, which might have an associated memory cgroup: e.g.
- * kernel stack pages.
+ * Returns a pointer to the object cgroups vector associated with the slab,
+ * or NULL. This function assumes that the slab is known to have an
+ * associated object cgroups vector.
*/
-static inline struct obj_cgroup **page_objcgs(struct page *page)
+static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
{
- unsigned long memcg_data = READ_ONCE(page->memcg_data);
+ unsigned long memcg_data = READ_ONCE(slab->memcg_data);

- VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS), page);
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
+ VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS), &slab->page);
+ VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, &slab->page);

return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}

/*
- * page_objcgs_check - get the object cgroups vector associated with a page
- * @page: a pointer to the page struct
+ * slab_objcgs_check - get the object cgroups vector associated with a slab
+ * @slab: a pointer to the slab struct
*
- * Returns a pointer to the object cgroups vector associated with the page,
- * or NULL. This function is safe to use if the page can be directly associated
+ * Returns a pointer to the object cgroups vector associated with the slab,
+ * or NULL. This function is safe to use if the slab can be directly associated
* with a memory cgroup.
*/
-static inline struct obj_cgroup **page_objcgs_check(struct page *page)
+static inline struct obj_cgroup **slab_objcgs_check(struct slab *slab)
{
- unsigned long memcg_data = READ_ONCE(page->memcg_data);
+ unsigned long memcg_data = READ_ONCE(slab->memcg_data);

if (!memcg_data || !(memcg_data & MEMCG_DATA_OBJCGS))
return NULL;

- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
+ VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, &slab->page);

return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}
@@ -591,12 +589,12 @@ static inline bool folio_memcg_kmem(struct folio *folio)
return false;
}

-static inline struct obj_cgroup **page_objcgs(struct page *page)
+static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
{
return NULL;
}

-static inline struct obj_cgroup **page_objcgs_check(struct page *page)
+static inline struct obj_cgroup **slab_objcgs_check(struct slab *slab)
{
return NULL;
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1066afc9a06d..6db4d64ebe6d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -109,33 +109,6 @@ struct page {
*/
unsigned long dma_addr[2];
};
- struct { /* slab, slob and slub */
- union {
- struct list_head slab_list;
- struct { /* Partial pages */
- struct page *next;
-#ifdef CONFIG_64BIT
- int pages; /* Nr of pages left */
- int pobjects; /* Approximate count */
-#else
- short int pages;
- short int pobjects;
-#endif
- };
- };
- struct kmem_cache *slab_cache; /* not slob */
- /* Double-word boundary */
- void *freelist; /* first free object */
- union {
- void *s_mem; /* slab: first object */
- unsigned long counters; /* SLUB */
- struct { /* SLUB */
- unsigned inuse:16;
- unsigned objects:15;
- unsigned frozen:1;
- };
- };
- };
struct { /* Tail pages of compound page */
unsigned long compound_head; /* Bit zero is set */

@@ -199,9 +172,6 @@ struct page {
* which are currently stored here.
*/
unsigned int page_type;
-
- unsigned int active; /* SLAB */
- int units; /* SLOB */
};

/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
@@ -231,6 +201,59 @@ struct page {
#endif
} _struct_page_alignment;

+struct slab {
+ union {
+ struct {
+ unsigned long flags;
+ union {
+ struct list_head slab_list;
+ struct { /* Partial pages */
+ struct slab *next;
+#ifdef CONFIG_64BIT
+ int slabs; /* Nr of slabs left */
+ int pobjects; /* Approximate count */
+#else
+ short int slabs;
+ short int pobjects;
+#endif
+ };
+ };
+ struct kmem_cache *slab_cache; /* not slob */
+ /* Double-word boundary */
+ void *freelist; /* first free object */
+ union {
+ void *s_mem; /* slab: first object */
+ unsigned long counters; /* SLUB */
+ struct { /* SLUB */
+ unsigned inuse:16;
+ unsigned objects:15;
+ unsigned frozen:1;
+ };
+ };
+
+ union {
+ unsigned int active; /* SLAB */
+ int units; /* SLOB */
+ };
+ atomic_t _refcount;
+#ifdef CONFIG_MEMCG
+ unsigned long memcg_data;
+#endif
+ };
+ struct page page;
+ };
+};
+
+#define SLAB_MATCH(pg, sl) \
+ static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
+SLAB_MATCH(flags, flags);
+SLAB_MATCH(compound_head, slab_list);
+SLAB_MATCH(_refcount, _refcount);
+#ifdef CONFIG_MEMCG
+SLAB_MATCH(memcg_data, memcg_data);
+#endif
+#undef SLAB_MATCH
+
/**
* struct folio - Represents a contiguous set of bytes.
* @flags: Identical to the page flags.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b48bc214fe89..a21d14fec973 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -167,6 +167,8 @@ enum pageflags {
/* Remapped by swiotlb-xen. */
PG_xen_remapped = PG_owner_priv_1,

+ /* SLAB / SLUB / SLOB */
+ PG_pfmemalloc = PG_active,
/* SLOB */
PG_slob_free = PG_private,

@@ -193,6 +195,25 @@ static inline unsigned long _compound_head(const struct page *page)

#define compound_head(page) ((typeof(page))_compound_head(page))

+/**
+ * page_slab - Converts from page to slab.
+ * @p: The page.
+ *
+ * This function cannot be called on a NULL pointer. It can be called
+ * on a non-slab page; the caller should check is_slab() to be sure
+ * that the slab really is a slab.
+ *
+ * Return: The slab which contains this page.
+ */
+#define page_slab(p) (_Generic((p), \
+ const struct page *: (const struct slab *)_compound_head(p), \
+ struct page *: (struct slab *)_compound_head(p)))
+
+static inline bool is_slab(struct slab *slab)
+{
+ return test_bit(PG_slab, &slab->flags);
+}
+
/**
* page_folio - Converts from page to folio.
* @p: The page.
@@ -921,34 +942,6 @@ extern bool is_free_buddy_page(struct page *page);

__PAGEFLAG(Isolated, isolated, PF_ANY);

-/*
- * If network-based swap is enabled, sl*b must keep track of whether pages
- * were allocated from pfmemalloc reserves.
- */
-static inline int PageSlabPfmemalloc(struct page *page)
-{
- VM_BUG_ON_PAGE(!PageSlab(page), page);
- return PageActive(page);
-}
-
-static inline void SetPageSlabPfmemalloc(struct page *page)
-{
- VM_BUG_ON_PAGE(!PageSlab(page), page);
- SetPageActive(page);
-}
-
-static inline void __ClearPageSlabPfmemalloc(struct page *page)
-{
- VM_BUG_ON_PAGE(!PageSlab(page), page);
- __ClearPageActive(page);
-}
-
-static inline void ClearPageSlabPfmemalloc(struct page *page)
-{
- VM_BUG_ON_PAGE(!PageSlab(page), page);
- ClearPageActive(page);
-}
-
#ifdef CONFIG_MMU
#define __PG_MLOCKED (1UL << PG_mlocked)
#else
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 3aa5e1e73ab6..f1bfcb10f5e0 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -87,11 +87,11 @@ struct kmem_cache {
struct kmem_cache_node *node[MAX_NUMNODES];
};

-static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
+static inline void *nearest_obj(struct kmem_cache *cache, struct slab *slab,
void *x)
{
- void *object = x - (x - page->s_mem) % cache->size;
- void *last_object = page->s_mem + (cache->num - 1) * cache->size;
+ void *object = x - (x - slab->s_mem) % cache->size;
+ void *last_object = slab->s_mem + (cache->num - 1) * cache->size;

if (unlikely(object > last_object))
return last_object;
@@ -106,16 +106,16 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
* reciprocal_divide(offset, cache->reciprocal_buffer_size)
*/
static inline unsigned int obj_to_index(const struct kmem_cache *cache,
- const struct page *page, void *obj)
+ const struct slab *slab, void *obj)
{
- u32 offset = (obj - page->s_mem);
+ u32 offset = (obj - slab->s_mem);
return reciprocal_divide(offset, cache->reciprocal_buffer_size);
}

-static inline int objs_per_slab_page(const struct kmem_cache *cache,
- const struct page *page)
+static inline int objs_per_slab(const struct kmem_cache *cache,
+ const struct slab *slab)
{
- if (is_kfence_address(page_address(page)))
+ if (is_kfence_address(slab_address(slab)))
return 1;
return cache->num;
}
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..7394c959dc5f 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -43,9 +43,9 @@ enum stat_item {
struct kmem_cache_cpu {
void **freelist; /* Pointer to next available object */
unsigned long tid; /* Globally unique transaction id */
- struct page *page; /* The slab from which we are allocating */
+ struct slab *slab; /* The slab from which we are allocating */
#ifdef CONFIG_SLUB_CPU_PARTIAL
- struct page *partial; /* Partially allocated frozen slabs */
+ struct slab *partial; /* Partially allocated frozen slabs */
#endif
#ifdef CONFIG_SLUB_STATS
unsigned stat[NR_SLUB_STAT_ITEMS];
@@ -159,16 +159,16 @@ static inline void sysfs_slab_release(struct kmem_cache *s)
}
#endif

-void object_err(struct kmem_cache *s, struct page *page,
+void object_err(struct kmem_cache *s, struct slab *slab,
u8 *object, char *reason);

void *fixup_red_left(struct kmem_cache *s, void *p);

-static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
+static inline void *nearest_obj(struct kmem_cache *cache, struct slab *slab,
void *x) {
- void *object = x - (x - page_address(page)) % cache->size;
- void *last_object = page_address(page) +
- (page->objects - 1) * cache->size;
+ void *object = x - (x - slab_address(slab)) % cache->size;
+ void *last_object = slab_address(slab) +
+ (slab->objects - 1) * cache->size;
void *result = (unlikely(object > last_object)) ? last_object : object;

result = fixup_red_left(cache, result);
@@ -184,16 +184,16 @@ static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
}

static inline unsigned int obj_to_index(const struct kmem_cache *cache,
- const struct page *page, void *obj)
+ const struct slab *slab, void *obj)
{
if (is_kfence_address(obj))
return 0;
- return __obj_to_index(cache, page_address(page), obj);
+ return __obj_to_index(cache, slab_address(slab), obj);
}

-static inline int objs_per_slab_page(const struct kmem_cache *cache,
- const struct page *page)
+static inline int objs_per_slab(const struct kmem_cache *cache,
+ const struct slab *slab)
{
- return page->objects;
+ return slab->objects;
}
#endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
index 5b152dba7344..cf8f62c59b0a 100644
--- a/mm/bootmem_info.c
+++ b/mm/bootmem_info.c
@@ -15,7 +15,7 @@

void get_page_bootmem(unsigned long info, struct page *page, unsigned long type)
{
- page->freelist = (void *)type;
+ page->index = type;
SetPagePrivate(page);
set_page_private(page, info);
page_ref_inc(page);
@@ -23,14 +23,13 @@ void get_page_bootmem(unsigned long info, struct page *page, unsigned long type)

void put_page_bootmem(struct page *page)
{
- unsigned long type;
+ unsigned long type = page->index;

- type = (unsigned long) page->freelist;
BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);

if (page_ref_dec_return(page) == 1) {
- page->freelist = NULL;
+ page->index = 0;
ClearPagePrivate(page);
set_page_private(page, 0);
INIT_LIST_HEAD(&page->lru);
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 2baf121fb8c5..a8b9a7822b9f 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -247,8 +247,9 @@ struct kasan_free_meta *kasan_get_free_meta(struct kmem_cache *cache,
}
#endif

-void __kasan_poison_slab(struct page *page)
+void __kasan_poison_slab(struct slab *slab)
{
+ struct page *page = &slab->page;
unsigned long i;

for (i = 0; i < compound_nr(page); i++)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c954fda9d7f4..c21b9a63fb4a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2842,16 +2842,16 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
*/
#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)

-int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
+int memcg_alloc_slab_obj_cgroups(struct slab *slab, struct kmem_cache *s,
gfp_t gfp, bool new_page)
{
- unsigned int objects = objs_per_slab_page(s, page);
+ unsigned int objects = objs_per_slab(s, slab);
unsigned long memcg_data;
void *vec;

gfp &= ~OBJCGS_CLEAR_MASK;
vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
- page_to_nid(page));
+ slab_nid(slab));
if (!vec)
return -ENOMEM;

@@ -2862,8 +2862,8 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
* it's memcg_data, no synchronization is required and
* memcg_data can be simply assigned.
*/
- page->memcg_data = memcg_data;
- } else if (cmpxchg(&page->memcg_data, 0, memcg_data)) {
+ slab->memcg_data = memcg_data;
+ } else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) {
/*
* If the slab page is already in use, somebody can allocate
* and assign obj_cgroups in parallel. In this case the existing
@@ -2891,38 +2891,39 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
*/
struct mem_cgroup *mem_cgroup_from_obj(void *p)
{
- struct page *page;
+ struct slab *slab;

if (mem_cgroup_disabled())
return NULL;

- page = virt_to_head_page(p);
+ slab = virt_to_slab(p);

/*
* Slab objects are accounted individually, not per-page.
* Memcg membership data for each individual object is saved in
- * the page->obj_cgroups.
+ * the slab->obj_cgroups.
*/
- if (page_objcgs_check(page)) {
+ if (slab_objcgs_check(slab)) {
struct obj_cgroup *objcg;
unsigned int off;

- off = obj_to_index(page->slab_cache, page, p);
- objcg = page_objcgs(page)[off];
+ off = obj_to_index(slab->slab_cache, slab, p);
+ objcg = slab_objcgs(slab)[off];
if (objcg)
return obj_cgroup_memcg(objcg);

return NULL;
}

+ /* I am pretty sure this is wrong */
/*
- * page_memcg_check() is used here, because page_has_obj_cgroups()
+ * page_memcg_check() is used here, because slab_has_obj_cgroups()
* check above could fail because the object cgroups vector wasn't set
* at that moment, but it can be set concurrently.
- * page_memcg_check(page) will guarantee that a proper memory
+ * page_memcg_check() will guarantee that a proper memory
* cgroup pointer or NULL will be returned.
*/
- return page_memcg_check(page);
+ return page_memcg_check(&slab->page);
}

__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
diff --git a/mm/slab.h b/mm/slab.h
index f997fd5e42c8..1c6311fd7060 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -5,6 +5,69 @@
* Internal slab definitions
*/

+static inline void *slab_address(const struct slab *slab)
+{
+ return page_address(&slab->page);
+}
+
+static inline struct pglist_data *slab_pgdat(const struct slab *slab)
+{
+ return page_pgdat(&slab->page);
+}
+
+static inline int slab_nid(const struct slab *slab)
+{
+ return page_to_nid(&slab->page);
+}
+
+static inline struct slab *virt_to_slab(const void *addr)
+{
+ struct page *page = virt_to_page(addr);
+
+ return page_slab(page);
+}
+
+static inline bool SlabMulti(const struct slab *slab)
+{
+ return test_bit(PG_head, &slab->flags);
+}
+
+static inline int slab_order(const struct slab *slab)
+{
+ if (!SlabMulti(slab))
+ return 0;
+ return (&slab->page)[1].compound_order;
+}
+
+static inline size_t slab_size(const struct slab *slab)
+{
+ return PAGE_SIZE << slab_order(slab);
+}
+
+/*
+ * If network-based swap is enabled, sl*b must keep track of whether pages
+ * were allocated from pfmemalloc reserves.
+ */
+static inline bool SlabPfmemalloc(const struct slab *slab)
+{
+ return test_bit(PG_pfmemalloc, &slab->flags);
+}
+
+static inline void SetSlabPfmemalloc(struct slab *slab)
+{
+ set_bit(PG_pfmemalloc, &slab->flags);
+}
+
+static inline void __ClearSlabPfmemalloc(struct slab *slab)
+{
+ __clear_bit(PG_pfmemalloc, &slab->flags);
+}
+
+static inline void ClearSlabPfmemalloc(struct slab *slab)
+{
+ clear_bit(PG_pfmemalloc, &slab->flags);
+}
+
#ifdef CONFIG_SLOB
/*
* Common fields provided in kmem_cache by all slab allocators
@@ -245,15 +308,15 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
}

#ifdef CONFIG_MEMCG_KMEM
-int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
+int memcg_alloc_slab_obj_cgroups(struct slab *slab, struct kmem_cache *s,
gfp_t gfp, bool new_page);
void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
enum node_stat_item idx, int nr);

-static inline void memcg_free_page_obj_cgroups(struct page *page)
+static inline void memcg_free_slab_obj_cgroups(struct slab *slab)
{
- kfree(page_objcgs(page));
- page->memcg_data = 0;
+ kfree(slab_objcgs(slab));
+ slab->memcg_data = 0;
}

static inline size_t obj_full_size(struct kmem_cache *s)
@@ -298,7 +361,7 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
gfp_t flags, size_t size,
void **p)
{
- struct page *page;
+ struct slab *slab;
unsigned long off;
size_t i;

@@ -307,19 +370,19 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,

for (i = 0; i < size; i++) {
if (likely(p[i])) {
- page = virt_to_head_page(p[i]);
+ slab = virt_to_slab(p[i]);

- if (!page_objcgs(page) &&
- memcg_alloc_page_obj_cgroups(page, s, flags,
+ if (!slab_objcgs(slab) &&
+ memcg_alloc_slab_obj_cgroups(slab, s, flags,
false)) {
obj_cgroup_uncharge(objcg, obj_full_size(s));
continue;
}

- off = obj_to_index(s, page, p[i]);
+ off = obj_to_index(s, slab, p[i]);
obj_cgroup_get(objcg);
- page_objcgs(page)[off] = objcg;
- mod_objcg_state(objcg, page_pgdat(page),
+ slab_objcgs(slab)[off] = objcg;
+ mod_objcg_state(objcg, slab_pgdat(slab),
cache_vmstat_idx(s), obj_full_size(s));
} else {
obj_cgroup_uncharge(objcg, obj_full_size(s));
@@ -334,7 +397,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
struct kmem_cache *s;
struct obj_cgroup **objcgs;
struct obj_cgroup *objcg;
- struct page *page;
+ struct slab *slab;
unsigned int off;
int i;

@@ -345,24 +408,24 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
if (unlikely(!p[i]))
continue;

- page = virt_to_head_page(p[i]);
- objcgs = page_objcgs(page);
+ slab = virt_to_slab(p[i]);
+ objcgs = slab_objcgs(slab);
if (!objcgs)
continue;

if (!s_orig)
- s = page->slab_cache;
+ s = slab->slab_cache;
else
s = s_orig;

- off = obj_to_index(s, page, p[i]);
+ off = obj_to_index(s, slab, p[i]);
objcg = objcgs[off];
if (!objcg)
continue;

objcgs[off] = NULL;
obj_cgroup_uncharge(objcg, obj_full_size(s));
- mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
+ mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s),
-obj_full_size(s));
obj_cgroup_put(objcg);
}
@@ -374,14 +437,14 @@ static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
return NULL;
}

-static inline int memcg_alloc_page_obj_cgroups(struct page *page,
+static inline int memcg_alloc_slab_obj_cgroups(struct slab *slab,
struct kmem_cache *s, gfp_t gfp,
bool new_page)
{
return 0;
}

-static inline void memcg_free_page_obj_cgroups(struct page *page)
+static inline void memcg_free_slab_obj_cgroups(struct slab *slab)
{
}

@@ -407,33 +470,33 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s,

static inline struct kmem_cache *virt_to_cache(const void *obj)
{
- struct page *page;
+ struct slab *slab;

- page = virt_to_head_page(obj);
- if (WARN_ONCE(!PageSlab(page), "%s: Object is not a Slab page!\n",
+ slab = virt_to_slab(obj);
+ if (WARN_ONCE(!is_slab(slab), "%s: Object is not a Slab page!\n",
__func__))
return NULL;
- return page->slab_cache;
+ return slab->slab_cache;
}

-static __always_inline void account_slab_page(struct page *page, int order,
+static __always_inline void account_slab(struct slab *slab, int order,
struct kmem_cache *s,
gfp_t gfp)
{
if (memcg_kmem_enabled() && (s->flags & SLAB_ACCOUNT))
- memcg_alloc_page_obj_cgroups(page, s, gfp, true);
+ memcg_alloc_slab_obj_cgroups(slab, s, gfp, true);

- mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+ mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
PAGE_SIZE << order);
}

-static __always_inline void unaccount_slab_page(struct page *page, int order,
+static __always_inline void unaccount_slab(struct slab *slab, int order,
struct kmem_cache *s)
{
if (memcg_kmem_enabled())
- memcg_free_page_obj_cgroups(page);
+ memcg_free_slab_obj_cgroups(slab);

- mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+ mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
-(PAGE_SIZE << order));
}

@@ -635,7 +698,7 @@ static inline void debugfs_slab_release(struct kmem_cache *s) { }
#define KS_ADDRS_COUNT 16
struct kmem_obj_info {
void *kp_ptr;
- struct page *kp_page;
+ struct slab *kp_slab;
void *kp_objp;
unsigned long kp_data_offset;
struct kmem_cache *kp_slab_cache;
@@ -643,7 +706,7 @@ struct kmem_obj_info {
void *kp_stack[KS_ADDRS_COUNT];
void *kp_free_stack[KS_ADDRS_COUNT];
};
-void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page);
+void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab);
#endif

#endif /* MM_SLAB_H */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 1c673c323baf..d0d843cb7cf1 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -585,18 +585,18 @@ void kmem_dump_obj(void *object)
{
char *cp = IS_ENABLED(CONFIG_MMU) ? "" : "/vmalloc";
int i;
- struct page *page;
+ struct slab *slab;
unsigned long ptroffset;
struct kmem_obj_info kp = { };

if (WARN_ON_ONCE(!virt_addr_valid(object)))
return;
- page = virt_to_head_page(object);
- if (WARN_ON_ONCE(!PageSlab(page))) {
+ slab = virt_to_slab(object);
+ if (WARN_ON_ONCE(!is_slab(slab))) {
pr_cont(" non-slab memory.\n");
return;
}
- kmem_obj_info(&kp, object, page);
+ kmem_obj_info(&kp, object, slab);
if (kp.kp_slab_cache)
pr_cont(" slab%s %s", cp, kp.kp_slab_cache->name);
else
diff --git a/mm/slub.c b/mm/slub.c
index 090fa14628f9..c3b84bd61400 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -47,7 +47,7 @@
* Lock order:
* 1. slab_mutex (Global Mutex)
* 2. node->list_lock
- * 3. slab_lock(page) (Only on some arches and for debugging)
+ * 3. slab_lock(slab) (Only on some arches and for debugging)
*
* slab_mutex
*
@@ -56,17 +56,17 @@
*
* The slab_lock is only used for debugging and on arches that do not
* have the ability to do a cmpxchg_double. It only protects:
- * A. page->freelist -> List of object free in a page
- * B. page->inuse -> Number of objects in use
- * C. page->objects -> Number of objects in page
- * D. page->frozen -> frozen state
+ * A. slab->freelist -> List of object free in a slab
+ * B. slab->inuse -> Number of objects in use
+ * C. slab->objects -> Number of objects in slab
+ * D. slab->frozen -> frozen state
*
* If a slab is frozen then it is exempt from list management. It is not
* on any list except per cpu partial list. The processor that froze the
- * slab is the one who can perform list operations on the page. Other
+ * slab is the one who can perform list operations on the slab. Other
* processors may put objects onto the freelist but the processor that
* froze the slab is the only one that can retrieve the objects from the
- * page's freelist.
+ * slab's freelist.
*
* The list_lock protects the partial and full list on each node and
* the partial slab counter. If taken then no new slabs may be added or
@@ -94,10 +94,10 @@
* cannot scan all objects.
*
* Slabs are freed when they become empty. Teardown and setup is
- * minimal so we rely on the page allocators per cpu caches for
+ * minimal so we rely on the slab allocators per cpu caches for
* fast frees and allocs.
*
- * page->frozen The slab is frozen and exempt from list processing.
+ * slab->frozen The slab is frozen and exempt from list processing.
* This means that the slab is dedicated to a purpose
* such as satisfying allocations for a specific
* processor. Objects may be freed in the slab while
@@ -192,7 +192,7 @@ static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)

#define OO_SHIFT 16
#define OO_MASK ((1 << OO_SHIFT) - 1)
-#define MAX_OBJS_PER_PAGE 32767 /* since page.objects is u15 */
+#define MAX_OBJS_PER_PAGE 32767 /* since slab.objects is u15 */

/* Internal SLUB flags */
/* Poison object */
@@ -357,22 +357,20 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
}

/*
- * Per slab locking using the pagelock
+ * Per slab locking using the slablock
*/
-static __always_inline void slab_lock(struct page *page)
+static __always_inline void slab_lock(struct slab *slab)
{
- VM_BUG_ON_PAGE(PageTail(page), page);
- bit_spin_lock(PG_locked, &page->flags);
+ bit_spin_lock(PG_locked, &slab->flags);
}

-static __always_inline void slab_unlock(struct page *page)
+static __always_inline void slab_unlock(struct slab *slab)
{
- VM_BUG_ON_PAGE(PageTail(page), page);
- __bit_spin_unlock(PG_locked, &page->flags);
+ __bit_spin_unlock(PG_locked, &slab->flags);
}

/* Interrupts must be disabled (for the fallback code to work right) */
-static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
+static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab,
void *freelist_old, unsigned long counters_old,
void *freelist_new, unsigned long counters_new,
const char *n)
@@ -381,22 +379,22 @@ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
if (s->flags & __CMPXCHG_DOUBLE) {
- if (cmpxchg_double(&page->freelist, &page->counters,
+ if (cmpxchg_double(&slab->freelist, &slab->counters,
freelist_old, counters_old,
freelist_new, counters_new))
return true;
} else
#endif
{
- slab_lock(page);
- if (page->freelist == freelist_old &&
- page->counters == counters_old) {
- page->freelist = freelist_new;
- page->counters = counters_new;
- slab_unlock(page);
+ slab_lock(slab);
+ if (slab->freelist == freelist_old &&
+ slab->counters == counters_old) {
+ slab->freelist = freelist_new;
+ slab->counters = counters_new;
+ slab_unlock(slab);
return true;
}
- slab_unlock(page);
+ slab_unlock(slab);
}

cpu_relax();
@@ -409,7 +407,7 @@ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page
return false;
}

-static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
+static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab,
void *freelist_old, unsigned long counters_old,
void *freelist_new, unsigned long counters_new,
const char *n)
@@ -417,7 +415,7 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
if (s->flags & __CMPXCHG_DOUBLE) {
- if (cmpxchg_double(&page->freelist, &page->counters,
+ if (cmpxchg_double(&slab->freelist, &slab->counters,
freelist_old, counters_old,
freelist_new, counters_new))
return true;
@@ -427,16 +425,16 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
unsigned long flags;

local_irq_save(flags);
- slab_lock(page);
- if (page->freelist == freelist_old &&
- page->counters == counters_old) {
- page->freelist = freelist_new;
- page->counters = counters_new;
- slab_unlock(page);
+ slab_lock(slab);
+ if (slab->freelist == freelist_old &&
+ slab->counters == counters_old) {
+ slab->freelist = freelist_new;
+ slab->counters = counters_new;
+ slab_unlock(slab);
local_irq_restore(flags);
return true;
}
- slab_unlock(page);
+ slab_unlock(slab);
local_irq_restore(flags);
}

@@ -475,24 +473,24 @@ static inline bool slab_add_kunit_errors(void) { return false; }
#endif

/*
- * Determine a map of object in use on a page.
+ * Determine a map of object in use on a slab.
*
- * Node listlock must be held to guarantee that the page does
+ * Node listlock must be held to guarantee that the slab does
* not vanish from under us.
*/
-static unsigned long *get_map(struct kmem_cache *s, struct page *page)
+static unsigned long *get_map(struct kmem_cache *s, struct slab *slab)
__acquires(&object_map_lock)
{
void *p;
- void *addr = page_address(page);
+ void *addr = slab_address(slab);

VM_BUG_ON(!irqs_disabled());

spin_lock(&object_map_lock);

- bitmap_zero(object_map, page->objects);
+ bitmap_zero(object_map, slab->objects);

- for (p = page->freelist; p; p = get_freepointer(s, p))
+ for (p = slab->freelist; p; p = get_freepointer(s, p))
set_bit(__obj_to_index(s, addr, p), object_map);

return object_map;
@@ -552,19 +550,19 @@ static inline void metadata_access_disable(void)
* Object debugging
*/

-/* Verify that a pointer has an address that is valid within a slab page */
+/* Verify that a pointer has an address that is valid within a slab */
static inline int check_valid_pointer(struct kmem_cache *s,
- struct page *page, void *object)
+ struct slab *slab, void *object)
{
void *base;

if (!object)
return 1;

- base = page_address(page);
+ base = slab_address(slab);
object = kasan_reset_tag(object);
object = restore_red_left(s, object);
- if (object < base || object >= base + page->objects * s->size ||
+ if (object < base || object >= base + slab->objects * s->size ||
(object - base) % s->size) {
return 0;
}
@@ -675,11 +673,11 @@ void print_tracking(struct kmem_cache *s, void *object)
print_track("Freed", get_track(s, object, TRACK_FREE), pr_time);
}

-static void print_page_info(struct page *page)
+static void print_slab_info(struct slab *slab)
{
pr_err("Slab 0x%p objects=%u used=%u fp=0x%p flags=%#lx(%pGp)\n",
- page, page->objects, page->inuse, page->freelist,
- page->flags, &page->flags);
+ slab, slab->objects, slab->inuse, slab->freelist,
+ slab->flags, &slab->flags);

}

@@ -713,12 +711,12 @@ static void slab_fix(struct kmem_cache *s, char *fmt, ...)
va_end(args);
}

-static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
+static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
void **freelist, void *nextfree)
{
if ((s->flags & SLAB_CONSISTENCY_CHECKS) &&
- !check_valid_pointer(s, page, nextfree) && freelist) {
- object_err(s, page, *freelist, "Freechain corrupt");
+ !check_valid_pointer(s, slab, nextfree) && freelist) {
+ object_err(s, slab, *freelist, "Freechain corrupt");
*freelist = NULL;
slab_fix(s, "Isolate corrupted freechain");
return true;
@@ -727,14 +725,14 @@ static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
return false;
}

-static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
+static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
{
unsigned int off; /* Offset of last byte */
- u8 *addr = page_address(page);
+ u8 *addr = slab_address(slab);

print_tracking(s, p);

- print_page_info(page);
+ print_slab_info(slab);

pr_err("Object 0x%p @offset=%tu fp=0x%p\n\n",
p, p - addr, get_freepointer(s, p));
@@ -766,18 +764,18 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
dump_stack();
}

-void object_err(struct kmem_cache *s, struct page *page,
+void object_err(struct kmem_cache *s, struct slab *slab,
u8 *object, char *reason)
{
if (slab_add_kunit_errors())
return;

slab_bug(s, "%s", reason);
- print_trailer(s, page, object);
+ print_trailer(s, slab, object);
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
}

-static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
+static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
const char *fmt, ...)
{
va_list args;
@@ -790,7 +788,7 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
vsnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
slab_bug(s, "%s", buf);
- print_page_info(page);
+ print_slab_info(slab);
dump_stack();
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
}
@@ -818,13 +816,13 @@ static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
memset(from, data, to - from);
}

-static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
+static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab,
u8 *object, char *what,
u8 *start, unsigned int value, unsigned int bytes)
{
u8 *fault;
u8 *end;
- u8 *addr = page_address(page);
+ u8 *addr = slab_address(slab);

metadata_access_enable();
fault = memchr_inv(kasan_reset_tag(start), value, bytes);
@@ -843,7 +841,7 @@ static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
pr_err("0x%p-0x%p @offset=%tu. First byte 0x%x instead of 0x%x\n",
fault, end - 1, fault - addr,
fault[0], value);
- print_trailer(s, page, object);
+ print_trailer(s, slab, object);
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);

skip_bug_print:
@@ -889,7 +887,7 @@ static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
* may be used with merged slabcaches.
*/

-static int check_pad_bytes(struct kmem_cache *s, struct page *page, u8 *p)
+static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
{
unsigned long off = get_info_end(s); /* The end of info */

@@ -902,12 +900,12 @@ static int check_pad_bytes(struct kmem_cache *s, struct page *page, u8 *p)
if (size_from_object(s) == off)
return 1;

- return check_bytes_and_report(s, page, p, "Object padding",
+ return check_bytes_and_report(s, slab, p, "Object padding",
p + off, POISON_INUSE, size_from_object(s) - off);
}

-/* Check the pad bytes at the end of a slab page */
-static int slab_pad_check(struct kmem_cache *s, struct page *page)
+/* Check the pad bytes at the end of a slab */
+static int slab_pad_check(struct kmem_cache *s, struct slab *slab)
{
u8 *start;
u8 *fault;
@@ -919,8 +917,8 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page)
if (!(s->flags & SLAB_POISON))
return 1;

- start = page_address(page);
- length = page_size(page);
+ start = slab_address(slab);
+ length = slab_size(slab);
end = start + length;
remainder = length % s->size;
if (!remainder)
@@ -935,7 +933,7 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page)
while (end > fault && end[-1] == POISON_INUSE)
end--;

- slab_err(s, page, "Padding overwritten. 0x%p-0x%p @offset=%tu",
+ slab_err(s, slab, "Padding overwritten. 0x%p-0x%p @offset=%tu",
fault, end - 1, fault - start);
print_section(KERN_ERR, "Padding ", pad, remainder);

@@ -943,23 +941,23 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page)
return 0;
}

-static int check_object(struct kmem_cache *s, struct page *page,
+static int check_object(struct kmem_cache *s, struct slab *slab,
void *object, u8 val)
{
u8 *p = object;
u8 *endobject = object + s->object_size;

if (s->flags & SLAB_RED_ZONE) {
- if (!check_bytes_and_report(s, page, object, "Left Redzone",
+ if (!check_bytes_and_report(s, slab, object, "Left Redzone",
object - s->red_left_pad, val, s->red_left_pad))
return 0;

- if (!check_bytes_and_report(s, page, object, "Right Redzone",
+ if (!check_bytes_and_report(s, slab, object, "Right Redzone",
endobject, val, s->inuse - s->object_size))
return 0;
} else {
if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
- check_bytes_and_report(s, page, p, "Alignment padding",
+ check_bytes_and_report(s, slab, p, "Alignment padding",
endobject, POISON_INUSE,
s->inuse - s->object_size);
}
@@ -967,15 +965,15 @@ static int check_object(struct kmem_cache *s, struct page *page,

if (s->flags & SLAB_POISON) {
if (val != SLUB_RED_ACTIVE && (s->flags & __OBJECT_POISON) &&
- (!check_bytes_and_report(s, page, p, "Poison", p,
+ (!check_bytes_and_report(s, slab, p, "Poison", p,
POISON_FREE, s->object_size - 1) ||
- !check_bytes_and_report(s, page, p, "End Poison",
+ !check_bytes_and_report(s, slab, p, "End Poison",
p + s->object_size - 1, POISON_END, 1)))
return 0;
/*
* check_pad_bytes cleans up on its own.
*/
- check_pad_bytes(s, page, p);
+ check_pad_bytes(s, slab, p);
}

if (!freeptr_outside_object(s) && val == SLUB_RED_ACTIVE)
@@ -986,8 +984,8 @@ static int check_object(struct kmem_cache *s, struct page *page,
return 1;

/* Check free pointer validity */
- if (!check_valid_pointer(s, page, get_freepointer(s, p))) {
- object_err(s, page, p, "Freepointer corrupt");
+ if (!check_valid_pointer(s, slab, get_freepointer(s, p))) {
+ object_err(s, slab, p, "Freepointer corrupt");
/*
* No choice but to zap it and thus lose the remainder
* of the free objects in this slab. May cause
@@ -999,57 +997,57 @@ static int check_object(struct kmem_cache *s, struct page *page,
return 1;
}

-static int check_slab(struct kmem_cache *s, struct page *page)
+static int check_slab(struct kmem_cache *s, struct slab *slab)
{
int maxobj;

VM_BUG_ON(!irqs_disabled());

- if (!PageSlab(page)) {
- slab_err(s, page, "Not a valid slab page");
+ if (!is_slab(slab)) {
+ slab_err(s, slab, "Not a valid slab");
return 0;
}

- maxobj = order_objects(compound_order(page), s->size);
- if (page->objects > maxobj) {
- slab_err(s, page, "objects %u > max %u",
- page->objects, maxobj);
+ maxobj = order_objects(slab_order(slab), s->size);
+ if (slab->objects > maxobj) {
+ slab_err(s, slab, "objects %u > max %u",
+ slab->objects, maxobj);
return 0;
}
- if (page->inuse > page->objects) {
- slab_err(s, page, "inuse %u > max %u",
- page->inuse, page->objects);
+ if (slab->inuse > slab->objects) {
+ slab_err(s, slab, "inuse %u > max %u",
+ slab->inuse, slab->objects);
return 0;
}
/* Slab_pad_check fixes things up after itself */
- slab_pad_check(s, page);
+ slab_pad_check(s, slab);
return 1;
}

/*
- * Determine if a certain object on a page is on the freelist. Must hold the
+ * Determine if a certain object on a slab is on the freelist. Must hold the
* slab lock to guarantee that the chains are in a consistent state.
*/
-static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
+static int on_freelist(struct kmem_cache *s, struct slab *slab, void *search)
{
int nr = 0;
void *fp;
void *object = NULL;
int max_objects;

- fp = page->freelist;
- while (fp && nr <= page->objects) {
+ fp = slab->freelist;
+ while (fp && nr <= slab->objects) {
if (fp == search)
return 1;
- if (!check_valid_pointer(s, page, fp)) {
+ if (!check_valid_pointer(s, slab, fp)) {
if (object) {
- object_err(s, page, object,
+ object_err(s, slab, object,
"Freechain corrupt");
set_freepointer(s, object, NULL);
} else {
- slab_err(s, page, "Freepointer corrupt");
- page->freelist = NULL;
- page->inuse = page->objects;
+ slab_err(s, slab, "Freepointer corrupt");
+ slab->freelist = NULL;
+ slab->inuse = slab->objects;
slab_fix(s, "Freelist cleared");
return 0;
}
@@ -1060,34 +1058,34 @@ static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
nr++;
}

- max_objects = order_objects(compound_order(page), s->size);
+ max_objects = order_objects(slab_order(slab), s->size);
if (max_objects > MAX_OBJS_PER_PAGE)
max_objects = MAX_OBJS_PER_PAGE;

- if (page->objects != max_objects) {
- slab_err(s, page, "Wrong number of objects. Found %d but should be %d",
- page->objects, max_objects);
- page->objects = max_objects;
+ if (slab->objects != max_objects) {
+ slab_err(s, slab, "Wrong number of objects. Found %d but should be %d",
+ slab->objects, max_objects);
+ slab->objects = max_objects;
slab_fix(s, "Number of objects adjusted");
}
- if (page->inuse != page->objects - nr) {
- slab_err(s, page, "Wrong object count. Counter is %d but counted were %d",
- page->inuse, page->objects - nr);
- page->inuse = page->objects - nr;
+ if (slab->inuse != slab->objects - nr) {
+ slab_err(s, slab, "Wrong object count. Counter is %d but counted were %d",
+ slab->inuse, slab->objects - nr);
+ slab->inuse = slab->objects - nr;
slab_fix(s, "Object count adjusted");
}
return search == NULL;
}

-static void trace(struct kmem_cache *s, struct page *page, void *object,
+static void trace(struct kmem_cache *s, struct slab *slab, void *object,
int alloc)
{
if (s->flags & SLAB_TRACE) {
pr_info("TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
s->name,
alloc ? "alloc" : "free",
- object, page->inuse,
- page->freelist);
+ object, slab->inuse,
+ slab->freelist);

if (!alloc)
print_section(KERN_INFO, "Object ", (void *)object,
@@ -1101,22 +1099,22 @@ static void trace(struct kmem_cache *s, struct page *page, void *object,
* Tracking of fully allocated slabs for debugging purposes.
*/
static void add_full(struct kmem_cache *s,
- struct kmem_cache_node *n, struct page *page)
+ struct kmem_cache_node *n, struct slab *slab)
{
if (!(s->flags & SLAB_STORE_USER))
return;

lockdep_assert_held(&n->list_lock);
- list_add(&page->slab_list, &n->full);
+ list_add(&slab->slab_list, &n->full);
}

-static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct page *page)
+static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct slab *slab)
{
if (!(s->flags & SLAB_STORE_USER))
return;

lockdep_assert_held(&n->list_lock);
- list_del(&page->slab_list);
+ list_del(&slab->slab_list);
}

/* Tracking of the number of slabs for debugging purposes */
@@ -1156,7 +1154,7 @@ static inline void dec_slabs_node(struct kmem_cache *s, int node, int objects)
}

/* Object debug checks for alloc/free paths */
-static void setup_object_debug(struct kmem_cache *s, struct page *page,
+static void setup_object_debug(struct kmem_cache *s, struct slab *slab,
void *object)
{
if (!kmem_cache_debug_flags(s, SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON))
@@ -1167,90 +1165,90 @@ static void setup_object_debug(struct kmem_cache *s, struct page *page,
}

static
-void setup_page_debug(struct kmem_cache *s, struct page *page, void *addr)
+void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr)
{
if (!kmem_cache_debug_flags(s, SLAB_POISON))
return;

metadata_access_enable();
- memset(kasan_reset_tag(addr), POISON_INUSE, page_size(page));
+ memset(kasan_reset_tag(addr), POISON_INUSE, slab_size(slab));
metadata_access_disable();
}

static inline int alloc_consistency_checks(struct kmem_cache *s,
- struct page *page, void *object)
+ struct slab *slab, void *object)
{
- if (!check_slab(s, page))
+ if (!check_slab(s, slab))
return 0;

- if (!check_valid_pointer(s, page, object)) {
- object_err(s, page, object, "Freelist Pointer check fails");
+ if (!check_valid_pointer(s, slab, object)) {
+ object_err(s, slab, object, "Freelist Pointer check fails");
return 0;
}

- if (!check_object(s, page, object, SLUB_RED_INACTIVE))
+ if (!check_object(s, slab, object, SLUB_RED_INACTIVE))
return 0;

return 1;
}

static noinline int alloc_debug_processing(struct kmem_cache *s,
- struct page *page,
+ struct slab *slab,
void *object, unsigned long addr)
{
if (s->flags & SLAB_CONSISTENCY_CHECKS) {
- if (!alloc_consistency_checks(s, page, object))
+ if (!alloc_consistency_checks(s, slab, object))
goto bad;
}

/* Success perform special debug activities for allocs */
if (s->flags & SLAB_STORE_USER)
set_track(s, object, TRACK_ALLOC, addr);
- trace(s, page, object, 1);
+ trace(s, slab, object, 1);
init_object(s, object, SLUB_RED_ACTIVE);
return 1;

bad:
- if (PageSlab(page)) {
+ if (is_slab(slab)) {
/*
- * If this is a slab page then lets do the best we can
+ * If this is a slab then lets do the best we can
* to avoid issues in the future. Marking all objects
* as used avoids touching the remaining objects.
*/
slab_fix(s, "Marking all objects used");
- page->inuse = page->objects;
- page->freelist = NULL;
+ slab->inuse = slab->objects;
+ slab->freelist = NULL;
}
return 0;
}

static inline int free_consistency_checks(struct kmem_cache *s,
- struct page *page, void *object, unsigned long addr)
+ struct slab *slab, void *object, unsigned long addr)
{
- if (!check_valid_pointer(s, page, object)) {
- slab_err(s, page, "Invalid object pointer 0x%p", object);
+ if (!check_valid_pointer(s, slab, object)) {
+ slab_err(s, slab, "Invalid object pointer 0x%p", object);
return 0;
}

- if (on_freelist(s, page, object)) {
- object_err(s, page, object, "Object already free");
+ if (on_freelist(s, slab, object)) {
+ object_err(s, slab, object, "Object already free");
return 0;
}

- if (!check_object(s, page, object, SLUB_RED_ACTIVE))
+ if (!check_object(s, slab, object, SLUB_RED_ACTIVE))
return 0;

- if (unlikely(s != page->slab_cache)) {
- if (!PageSlab(page)) {
- slab_err(s, page, "Attempt to free object(0x%p) outside of slab",
+ if (unlikely(s != slab->slab_cache)) {
+ if (!is_slab(slab)) {
+ slab_err(s, slab, "Attempt to free object(0x%p) outside of slab",
object);
- } else if (!page->slab_cache) {
+ } else if (!slab->slab_cache) {
pr_err("SLUB <none>: no slab for object 0x%p.\n",
object);
dump_stack();
} else
- object_err(s, page, object,
- "page slab pointer corrupt.");
+ object_err(s, slab, object,
+ "slab slab pointer corrupt.");
return 0;
}
return 1;
@@ -1258,21 +1256,21 @@ static inline int free_consistency_checks(struct kmem_cache *s,

/* Supports checking bulk free of a constructed freelist */
static noinline int free_debug_processing(
- struct kmem_cache *s, struct page *page,
+ struct kmem_cache *s, struct slab *slab,
void *head, void *tail, int bulk_cnt,
unsigned long addr)
{
- struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+ struct kmem_cache_node *n = get_node(s, slab_nid(slab));
void *object = head;
int cnt = 0;
unsigned long flags;
int ret = 0;

spin_lock_irqsave(&n->list_lock, flags);
- slab_lock(page);
+ slab_lock(slab);

if (s->flags & SLAB_CONSISTENCY_CHECKS) {
- if (!check_slab(s, page))
+ if (!check_slab(s, slab))
goto out;
}

@@ -1280,13 +1278,13 @@ static noinline int free_debug_processing(
cnt++;

if (s->flags & SLAB_CONSISTENCY_CHECKS) {
- if (!free_consistency_checks(s, page, object, addr))
+ if (!free_consistency_checks(s, slab, object, addr))
goto out;
}

if (s->flags & SLAB_STORE_USER)
set_track(s, object, TRACK_FREE, addr);
- trace(s, page, object, 0);
+ trace(s, slab, object, 0);
/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
init_object(s, object, SLUB_RED_INACTIVE);

@@ -1299,10 +1297,10 @@ static noinline int free_debug_processing(

out:
if (cnt != bulk_cnt)
- slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n",
+ slab_err(s, slab, "Bulk freelist count(%d) invalid(%d)\n",
bulk_cnt, cnt);

- slab_unlock(page);
+ slab_unlock(slab);
spin_unlock_irqrestore(&n->list_lock, flags);
if (!ret)
slab_fix(s, "Object at 0x%p not freed", object);
@@ -1514,26 +1512,26 @@ slab_flags_t kmem_cache_flags(unsigned int object_size,
}
#else /* !CONFIG_SLUB_DEBUG */
static inline void setup_object_debug(struct kmem_cache *s,
- struct page *page, void *object) {}
+ struct slab *slab, void *object) {}
static inline
-void setup_page_debug(struct kmem_cache *s, struct page *page, void *addr) {}
+void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}

static inline int alloc_debug_processing(struct kmem_cache *s,
- struct page *page, void *object, unsigned long addr) { return 0; }
+ struct slab *slab, void *object, unsigned long addr) { return 0; }

static inline int free_debug_processing(
- struct kmem_cache *s, struct page *page,
+ struct kmem_cache *s, struct slab *slab,
void *head, void *tail, int bulk_cnt,
unsigned long addr) { return 0; }

-static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
+static inline int slab_pad_check(struct kmem_cache *s, struct slab *slab)
{ return 1; }
-static inline int check_object(struct kmem_cache *s, struct page *page,
+static inline int check_object(struct kmem_cache *s, struct slab *slab,
void *object, u8 val) { return 1; }
static inline void add_full(struct kmem_cache *s, struct kmem_cache_node *n,
- struct page *page) {}
+ struct slab *slab) {}
static inline void remove_full(struct kmem_cache *s, struct kmem_cache_node *n,
- struct page *page) {}
+ struct slab *slab) {}
slab_flags_t kmem_cache_flags(unsigned int object_size,
slab_flags_t flags, const char *name)
{
@@ -1552,7 +1550,7 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node,
static inline void dec_slabs_node(struct kmem_cache *s, int node,
int objects) {}

-static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
+static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
void **freelist, void *nextfree)
{
return false;
@@ -1662,10 +1660,10 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
return *head != NULL;
}

-static void *setup_object(struct kmem_cache *s, struct page *page,
+static void *setup_object(struct kmem_cache *s, struct slab *slab,
void *object)
{
- setup_object_debug(s, page, object);
+ setup_object_debug(s, slab, object);
object = kasan_init_slab_obj(s, object);
if (unlikely(s->ctor)) {
kasan_unpoison_object_data(s, object);
@@ -1678,18 +1676,25 @@ static void *setup_object(struct kmem_cache *s, struct page *page,
/*
* Slab allocation and freeing
*/
-static inline struct page *alloc_slab_page(struct kmem_cache *s,
+static inline struct slab *alloc_slab(struct kmem_cache *s,
gfp_t flags, int node, struct kmem_cache_order_objects oo)
{
struct page *page;
+ struct slab *slab;
unsigned int order = oo_order(oo);

if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
page = __alloc_pages_node(node, flags, order);
+ if (!page)
+ return NULL;

- return page;
+ __SetPageSlab(page);
+ slab = (struct slab *)page;
+ if (page_is_pfmemalloc(page))
+ SetSlabPfmemalloc(slab);
+ return slab;
}

#ifdef CONFIG_SLAB_FREELIST_RANDOM
@@ -1710,7 +1715,7 @@ static int init_cache_random_seq(struct kmem_cache *s)
return err;
}

- /* Transform to an offset on the set of pages */
+ /* Transform to an offset on the set of slabs */
if (s->random_seq) {
unsigned int i;

@@ -1734,54 +1739,54 @@ static void __init init_freelist_randomization(void)
}

/* Get the next entry on the pre-computed freelist randomized */
-static void *next_freelist_entry(struct kmem_cache *s, struct page *page,
+static void *next_freelist_entry(struct kmem_cache *s, struct slab *slab,
unsigned long *pos, void *start,
- unsigned long page_limit,
+ unsigned long slab_limit,
unsigned long freelist_count)
{
unsigned int idx;

/*
- * If the target page allocation failed, the number of objects on the
- * page might be smaller than the usual size defined by the cache.
+ * If the target slab allocation failed, the number of objects on the
+ * slab might be smaller than the usual size defined by the cache.
*/
do {
idx = s->random_seq[*pos];
*pos += 1;
if (*pos >= freelist_count)
*pos = 0;
- } while (unlikely(idx >= page_limit));
+ } while (unlikely(idx >= slab_limit));

return (char *)start + idx;
}

/* Shuffle the single linked freelist based on a random pre-computed sequence */
-static bool shuffle_freelist(struct kmem_cache *s, struct page *page)
+static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
{
void *start;
void *cur;
void *next;
- unsigned long idx, pos, page_limit, freelist_count;
+ unsigned long idx, pos, slab_limit, freelist_count;

- if (page->objects < 2 || !s->random_seq)
+ if (slab->objects < 2 || !s->random_seq)
return false;

freelist_count = oo_objects(s->oo);
pos = get_random_int() % freelist_count;

- page_limit = page->objects * s->size;
- start = fixup_red_left(s, page_address(page));
+ slab_limit = slab->objects * s->size;
+ start = fixup_red_left(s, slab_address(slab));

/* First entry is used as the base of the freelist */
- cur = next_freelist_entry(s, page, &pos, start, page_limit,
+ cur = next_freelist_entry(s, slab, &pos, start, slab_limit,
freelist_count);
- cur = setup_object(s, page, cur);
- page->freelist = cur;
+ cur = setup_object(s, slab, cur);
+ slab->freelist = cur;

- for (idx = 1; idx < page->objects; idx++) {
- next = next_freelist_entry(s, page, &pos, start, page_limit,
+ for (idx = 1; idx < slab->objects; idx++) {
+ next = next_freelist_entry(s, slab, &pos, start, slab_limit,
freelist_count);
- next = setup_object(s, page, next);
+ next = setup_object(s, slab, next);
set_freepointer(s, cur, next);
cur = next;
}
@@ -1795,15 +1800,15 @@ static inline int init_cache_random_seq(struct kmem_cache *s)
return 0;
}
static inline void init_freelist_randomization(void) { }
-static inline bool shuffle_freelist(struct kmem_cache *s, struct page *page)
+static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
{
return false;
}
#endif /* CONFIG_SLAB_FREELIST_RANDOM */

-static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
- struct page *page;
+ struct slab *slab;
struct kmem_cache_order_objects oo = s->oo;
gfp_t alloc_gfp;
void *start, *p, *next;
@@ -1825,65 +1830,62 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~(__GFP_RECLAIM|__GFP_NOFAIL);

- page = alloc_slab_page(s, alloc_gfp, node, oo);
- if (unlikely(!page)) {
+ slab = alloc_slab(s, alloc_gfp, node, oo);
+ if (unlikely(!slab)) {
oo = s->min;
alloc_gfp = flags;
/*
* Allocation may have failed due to fragmentation.
* Try a lower order alloc if possible
*/
- page = alloc_slab_page(s, alloc_gfp, node, oo);
- if (unlikely(!page))
+ slab = alloc_slab(s, alloc_gfp, node, oo);
+ if (unlikely(!slab))
goto out;
stat(s, ORDER_FALLBACK);
}

- page->objects = oo_objects(oo);
+ slab->objects = oo_objects(oo);

- account_slab_page(page, oo_order(oo), s, flags);
+ account_slab(slab, oo_order(oo), s, flags);

- page->slab_cache = s;
- __SetPageSlab(page);
- if (page_is_pfmemalloc(page))
- SetPageSlabPfmemalloc(page);
+ slab->slab_cache = s;

- kasan_poison_slab(page);
+ kasan_poison_slab(slab);

- start = page_address(page);
+ start = slab_address(slab);

- setup_page_debug(s, page, start);
+ setup_slab_debug(s, slab, start);

- shuffle = shuffle_freelist(s, page);
+ shuffle = shuffle_freelist(s, slab);

if (!shuffle) {
start = fixup_red_left(s, start);
- start = setup_object(s, page, start);
- page->freelist = start;
- for (idx = 0, p = start; idx < page->objects - 1; idx++) {
+ start = setup_object(s, slab, start);
+ slab->freelist = start;
+ for (idx = 0, p = start; idx < slab->objects - 1; idx++) {
next = p + s->size;
- next = setup_object(s, page, next);
+ next = setup_object(s, slab, next);
set_freepointer(s, p, next);
p = next;
}
set_freepointer(s, p, NULL);
}

- page->inuse = page->objects;
- page->frozen = 1;
+ slab->inuse = slab->objects;
+ slab->frozen = 1;

out:
if (gfpflags_allow_blocking(flags))
local_irq_disable();
- if (!page)
+ if (!slab)
return NULL;

- inc_slabs_node(s, page_to_nid(page), page->objects);
+ inc_slabs_node(s, slab_nid(slab), slab->objects);

- return page;
+ return slab;
}

-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
{
if (unlikely(flags & GFP_SLAB_BUG_MASK))
flags = kmalloc_fix_flags(flags);
@@ -1892,76 +1894,77 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
}

-static void __free_slab(struct kmem_cache *s, struct page *page)
+static void __free_slab(struct kmem_cache *s, struct slab *slab)
{
- int order = compound_order(page);
- int pages = 1 << order;
+ struct page *page = &slab->page;
+ int order = slab_order(slab);
+ int slabs = 1 << order;

if (kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) {
void *p;

- slab_pad_check(s, page);
- for_each_object(p, s, page_address(page),
- page->objects)
- check_object(s, page, p, SLUB_RED_INACTIVE);
+ slab_pad_check(s, slab);
+ for_each_object(p, s, slab_address(slab),
+ slab->objects)
+ check_object(s, slab, p, SLUB_RED_INACTIVE);
}

- __ClearPageSlabPfmemalloc(page);
+ __ClearSlabPfmemalloc(slab);
__ClearPageSlab(page);
- /* In union with page->mapping where page allocator expects NULL */
- page->slab_cache = NULL;
+ page->mapping = NULL;
if (current->reclaim_state)
- current->reclaim_state->reclaimed_slab += pages;
- unaccount_slab_page(page, order, s);
- __free_pages(page, order);
+ current->reclaim_state->reclaimed_slab += slabs;
+ unaccount_slab(slab, order, s);
+ put_page(page);
}

static void rcu_free_slab(struct rcu_head *h)
{
struct page *page = container_of(h, struct page, rcu_head);
+ struct slab *slab = (struct slab *)page;

- __free_slab(page->slab_cache, page);
+ __free_slab(slab->slab_cache, slab);
}

-static void free_slab(struct kmem_cache *s, struct page *page)
+static void free_slab(struct kmem_cache *s, struct slab *slab)
{
if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) {
- call_rcu(&page->rcu_head, rcu_free_slab);
+ call_rcu(&slab->page.rcu_head, rcu_free_slab);
} else
- __free_slab(s, page);
+ __free_slab(s, slab);
}

-static void discard_slab(struct kmem_cache *s, struct page *page)
+static void discard_slab(struct kmem_cache *s, struct slab *slab)
{
- dec_slabs_node(s, page_to_nid(page), page->objects);
- free_slab(s, page);
+ dec_slabs_node(s, slab_nid(slab), slab->objects);
+ free_slab(s, slab);
}

/*
* Management of partially allocated slabs.
*/
static inline void
-__add_partial(struct kmem_cache_node *n, struct page *page, int tail)
+__add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
{
n->nr_partial++;
if (tail == DEACTIVATE_TO_TAIL)
- list_add_tail(&page->slab_list, &n->partial);
+ list_add_tail(&slab->slab_list, &n->partial);
else
- list_add(&page->slab_list, &n->partial);
+ list_add(&slab->slab_list, &n->partial);
}

static inline void add_partial(struct kmem_cache_node *n,
- struct page *page, int tail)
+ struct slab *slab, int tail)
{
lockdep_assert_held(&n->list_lock);
- __add_partial(n, page, tail);
+ __add_partial(n, slab, tail);
}

static inline void remove_partial(struct kmem_cache_node *n,
- struct page *page)
+ struct slab *slab)
{
lockdep_assert_held(&n->list_lock);
- list_del(&page->slab_list);
+ list_del(&slab->slab_list);
n->nr_partial--;
}

@@ -1972,12 +1975,12 @@ static inline void remove_partial(struct kmem_cache_node *n,
* Returns a list of objects or NULL if it fails.
*/
static inline void *acquire_slab(struct kmem_cache *s,
- struct kmem_cache_node *n, struct page *page,
+ struct kmem_cache_node *n, struct slab *slab,
int mode, int *objects)
{
void *freelist;
unsigned long counters;
- struct page new;
+ struct slab new;

lockdep_assert_held(&n->list_lock);

@@ -1986,12 +1989,12 @@ static inline void *acquire_slab(struct kmem_cache *s,
* The old freelist is the list of objects for the
* per cpu allocation list.
*/
- freelist = page->freelist;
- counters = page->counters;
+ freelist = slab->freelist;
+ counters = slab->counters;
new.counters = counters;
*objects = new.objects - new.inuse;
if (mode) {
- new.inuse = page->objects;
+ new.inuse = slab->objects;
new.freelist = NULL;
} else {
new.freelist = freelist;
@@ -2000,19 +2003,19 @@ static inline void *acquire_slab(struct kmem_cache *s,
VM_BUG_ON(new.frozen);
new.frozen = 1;

- if (!__cmpxchg_double_slab(s, page,
+ if (!__cmpxchg_double_slab(s, slab,
freelist, counters,
new.freelist, new.counters,
"acquire_slab"))
return NULL;

- remove_partial(n, page);
+ remove_partial(n, slab);
WARN_ON(!freelist);
return freelist;
}

-static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
-static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
+static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain);
+static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);

/*
* Try to allocate a partial slab from a specific node.
@@ -2020,7 +2023,7 @@ static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
struct kmem_cache_cpu *c, gfp_t flags)
{
- struct page *page, *page2;
+ struct slab *slab, *slab2;
void *object = NULL;
unsigned int available = 0;
int objects;
@@ -2035,23 +2038,23 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
return NULL;

spin_lock(&n->list_lock);
- list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
+ list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
void *t;

- if (!pfmemalloc_match(page, flags))
+ if (!pfmemalloc_match(slab, flags))
continue;

- t = acquire_slab(s, n, page, object == NULL, &objects);
+ t = acquire_slab(s, n, slab, object == NULL, &objects);
if (!t)
break;

available += objects;
if (!object) {
- c->page = page;
+ c->slab = slab;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
} else {
- put_cpu_partial(s, page, 0);
+ put_cpu_partial(s, slab, 0);
stat(s, CPU_PARTIAL_NODE);
}
if (!kmem_cache_has_cpu_partial(s)
@@ -2064,7 +2067,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
}

/*
- * Get a page from somewhere. Search in increasing NUMA distances.
+ * Get a slab from somewhere. Search in increasing NUMA distances.
*/
static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
struct kmem_cache_cpu *c)
@@ -2128,7 +2131,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
}

/*
- * Get a partial page, lock it and return it.
+ * Get a partial slab, lock it and return it.
*/
static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
struct kmem_cache_cpu *c)
@@ -2218,19 +2221,19 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
/*
* Remove the cpu slab
*/
-static void deactivate_slab(struct kmem_cache *s, struct page *page,
+static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
void *freelist, struct kmem_cache_cpu *c)
{
enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
- struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+ struct kmem_cache_node *n = get_node(s, slab_nid(slab));
int lock = 0, free_delta = 0;
enum slab_modes l = M_NONE, m = M_NONE;
void *nextfree, *freelist_iter, *freelist_tail;
int tail = DEACTIVATE_TO_HEAD;
- struct page new;
- struct page old;
+ struct slab new;
+ struct slab old;

- if (page->freelist) {
+ if (slab->freelist) {
stat(s, DEACTIVATE_REMOTE_FREES);
tail = DEACTIVATE_TO_TAIL;
}
@@ -2249,7 +2252,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
* 'freelist_iter' is already corrupted. So isolate all objects
* starting at 'freelist_iter' by skipping them.
*/
- if (freelist_corrupted(s, page, &freelist_iter, nextfree))
+ if (freelist_corrupted(s, slab, &freelist_iter, nextfree))
break;

freelist_tail = freelist_iter;
@@ -2259,25 +2262,25 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
}

/*
- * Stage two: Unfreeze the page while splicing the per-cpu
- * freelist to the head of page's freelist.
+ * Stage two: Unfreeze the slab while splicing the per-cpu
+ * freelist to the head of slab's freelist.
*
- * Ensure that the page is unfrozen while the list presence
+ * Ensure that the slab is unfrozen while the list presence
* reflects the actual number of objects during unfreeze.
*
* We setup the list membership and then perform a cmpxchg
- * with the count. If there is a mismatch then the page
- * is not unfrozen but the page is on the wrong list.
+ * with the count. If there is a mismatch then the slab
+ * is not unfrozen but the slab is on the wrong list.
*
* Then we restart the process which may have to remove
- * the page from the list that we just put it on again
+ * the slab from the list that we just put it on again
* because the number of objects in the slab may have
* changed.
*/
redo:

- old.freelist = READ_ONCE(page->freelist);
- old.counters = READ_ONCE(page->counters);
+ old.freelist = READ_ONCE(slab->freelist);
+ old.counters = READ_ONCE(slab->counters);
VM_BUG_ON(!old.frozen);

/* Determine target state of the slab */
@@ -2299,7 +2302,7 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
lock = 1;
/*
* Taking the spinlock removes the possibility
- * that acquire_slab() will see a slab page that
+ * that acquire_slab() will see a slab that
* is frozen
*/
spin_lock(&n->list_lock);
@@ -2319,18 +2322,18 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,

if (l != m) {
if (l == M_PARTIAL)
- remove_partial(n, page);
+ remove_partial(n, slab);
else if (l == M_FULL)
- remove_full(s, n, page);
+ remove_full(s, n, slab);

if (m == M_PARTIAL)
- add_partial(n, page, tail);
+ add_partial(n, slab, tail);
else if (m == M_FULL)
- add_full(s, n, page);
+ add_full(s, n, slab);
}

l = m;
- if (!__cmpxchg_double_slab(s, page,
+ if (!__cmpxchg_double_slab(s, slab,
old.freelist, old.counters,
new.freelist, new.counters,
"unfreezing slab"))
@@ -2345,11 +2348,11 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
stat(s, DEACTIVATE_FULL);
else if (m == M_FREE) {
stat(s, DEACTIVATE_EMPTY);
- discard_slab(s, page);
+ discard_slab(s, slab);
stat(s, FREE_SLAB);
}

- c->page = NULL;
+ c->slab = NULL;
c->freelist = NULL;
}

@@ -2365,15 +2368,15 @@ static void unfreeze_partials(struct kmem_cache *s,
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct kmem_cache_node *n = NULL, *n2 = NULL;
- struct page *page, *discard_page = NULL;
+ struct slab *slab, *next_slab = NULL;

- while ((page = slub_percpu_partial(c))) {
- struct page new;
- struct page old;
+ while ((slab = slub_percpu_partial(c))) {
+ struct slab new;
+ struct slab old;

- slub_set_percpu_partial(c, page);
+ slub_set_percpu_partial(c, slab);

- n2 = get_node(s, page_to_nid(page));
+ n2 = get_node(s, slab_nid(slab));
if (n != n2) {
if (n)
spin_unlock(&n->list_lock);
@@ -2384,8 +2387,8 @@ static void unfreeze_partials(struct kmem_cache *s,

do {

- old.freelist = page->freelist;
- old.counters = page->counters;
+ old.freelist = slab->freelist;
+ old.counters = slab->counters;
VM_BUG_ON(!old.frozen);

new.counters = old.counters;
@@ -2393,16 +2396,16 @@ static void unfreeze_partials(struct kmem_cache *s,

new.frozen = 0;

- } while (!__cmpxchg_double_slab(s, page,
+ } while (!__cmpxchg_double_slab(s, slab,
old.freelist, old.counters,
new.freelist, new.counters,
"unfreezing slab"));

if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
- page->next = discard_page;
- discard_page = page;
+ slab->next = next_slab;
+ next_slab = slab;
} else {
- add_partial(n, page, DEACTIVATE_TO_TAIL);
+ add_partial(n, slab, DEACTIVATE_TO_TAIL);
stat(s, FREE_ADD_PARTIAL);
}
}
@@ -2410,40 +2413,40 @@ static void unfreeze_partials(struct kmem_cache *s,
if (n)
spin_unlock(&n->list_lock);

- while (discard_page) {
- page = discard_page;
- discard_page = discard_page->next;
+ while (next_slab) {
+ slab = next_slab;
+ next_slab = next_slab->next;

stat(s, DEACTIVATE_EMPTY);
- discard_slab(s, page);
+ discard_slab(s, slab);
stat(s, FREE_SLAB);
}
#endif /* CONFIG_SLUB_CPU_PARTIAL */
}

/*
- * Put a page that was just frozen (in __slab_free|get_partial_node) into a
- * partial page slot if available.
+ * Put a slab that was just frozen (in __slab_free|get_partial_node) into a
+ * partial slab slot if available.
*
* If we did not find a slot then simply move all the partials to the
* per node partial list.
*/
-static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
+static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
- struct page *oldpage;
- int pages;
+ struct slab *oldslab;
+ int slabs;
int pobjects;

preempt_disable();
do {
- pages = 0;
+ slabs = 0;
pobjects = 0;
- oldpage = this_cpu_read(s->cpu_slab->partial);
+ oldslab = this_cpu_read(s->cpu_slab->partial);

- if (oldpage) {
- pobjects = oldpage->pobjects;
- pages = oldpage->pages;
+ if (oldslab) {
+ pobjects = oldslab->pobjects;
+ slabs = oldslab->slabs;
if (drain && pobjects > slub_cpu_partial(s)) {
unsigned long flags;
/*
@@ -2453,22 +2456,22 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
local_irq_save(flags);
unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
local_irq_restore(flags);
- oldpage = NULL;
+ oldslab = NULL;
pobjects = 0;
- pages = 0;
+ slabs = 0;
stat(s, CPU_PARTIAL_DRAIN);
}
}

- pages++;
- pobjects += page->objects - page->inuse;
+ slabs++;
+ pobjects += slab->objects - slab->inuse;

- page->pages = pages;
- page->pobjects = pobjects;
- page->next = oldpage;
+ slab->slabs = slabs;
+ slab->pobjects = pobjects;
+ slab->next = oldslab;

- } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
- != oldpage);
+ } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldslab, slab)
+ != oldslab);
if (unlikely(!slub_cpu_partial(s))) {
unsigned long flags;

@@ -2483,7 +2486,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
stat(s, CPUSLAB_FLUSH);
- deactivate_slab(s, c->page, c->freelist, c);
+ deactivate_slab(s, c->slab, c->freelist, c);

c->tid = next_tid(c->tid);
}
@@ -2497,7 +2500,7 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
{
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

- if (c->page)
+ if (c->slab)
flush_slab(s, c);

unfreeze_partials(s, c);
@@ -2515,7 +2518,7 @@ static bool has_cpu_slab(int cpu, void *info)
struct kmem_cache *s = info;
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

- return c->page || slub_percpu_partial(c);
+ return c->slab || slub_percpu_partial(c);
}

static void flush_all(struct kmem_cache *s)
@@ -2546,19 +2549,19 @@ static int slub_cpu_dead(unsigned int cpu)
* Check if the objects in a per cpu structure fit numa
* locality expectations.
*/
-static inline int node_match(struct page *page, int node)
+static inline int node_match(struct slab *slab, int node)
{
#ifdef CONFIG_NUMA
- if (node != NUMA_NO_NODE && page_to_nid(page) != node)
+ if (node != NUMA_NO_NODE && slab_nid(slab) != node)
return 0;
#endif
return 1;
}

#ifdef CONFIG_SLUB_DEBUG
-static int count_free(struct page *page)
+static int count_free(struct slab *slab)
{
- return page->objects - page->inuse;
+ return slab->objects - slab->inuse;
}

static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
@@ -2569,15 +2572,15 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)

#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
static unsigned long count_partial(struct kmem_cache_node *n,
- int (*get_count)(struct page *))
+ int (*get_count)(struct slab *))
{
unsigned long flags;
unsigned long x = 0;
- struct page *page;
+ struct slab *slab;

spin_lock_irqsave(&n->list_lock, flags);
- list_for_each_entry(page, &n->partial, slab_list)
- x += get_count(page);
+ list_for_each_entry(slab, &n->partial, slab_list)
+ x += get_count(slab);
spin_unlock_irqrestore(&n->list_lock, flags);
return x;
}
@@ -2625,7 +2628,7 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
{
void *freelist;
struct kmem_cache_cpu *c = *pc;
- struct page *page;
+ struct slab *slab;

WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));

@@ -2634,62 +2637,62 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
if (freelist)
return freelist;

- page = new_slab(s, flags, node);
- if (page) {
+ slab = new_slab(s, flags, node);
+ if (slab) {
c = raw_cpu_ptr(s->cpu_slab);
- if (c->page)
+ if (c->slab)
flush_slab(s, c);

/*
- * No other reference to the page yet so we can
+ * No other reference to the slab yet so we can
* muck around with it freely without cmpxchg
*/
- freelist = page->freelist;
- page->freelist = NULL;
+ freelist = slab->freelist;
+ slab->freelist = NULL;

stat(s, ALLOC_SLAB);
- c->page = page;
+ c->slab = slab;
*pc = c;
}

return freelist;
}

-static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
+static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags)
{
- if (unlikely(PageSlabPfmemalloc(page)))
+ if (unlikely(SlabPfmemalloc(slab)))
return gfp_pfmemalloc_allowed(gfpflags);

return true;
}

/*
- * Check the page->freelist of a page and either transfer the freelist to the
- * per cpu freelist or deactivate the page.
+ * Check the slab->freelist of a slab and either transfer the freelist to the
+ * per cpu freelist or deactivate the slab.
*
- * The page is still frozen if the return value is not NULL.
+ * The slab is still frozen if the return value is not NULL.
*
- * If this function returns NULL then the page has been unfrozen.
+ * If this function returns NULL then the slab has been unfrozen.
*
* This function must be called with interrupt disabled.
*/
-static inline void *get_freelist(struct kmem_cache *s, struct page *page)
+static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
{
- struct page new;
+ struct slab new;
unsigned long counters;
void *freelist;

do {
- freelist = page->freelist;
- counters = page->counters;
+ freelist = slab->freelist;
+ counters = slab->counters;

new.counters = counters;
VM_BUG_ON(!new.frozen);

- new.inuse = page->objects;
+ new.inuse = slab->objects;
new.frozen = freelist != NULL;

- } while (!__cmpxchg_double_slab(s, page,
+ } while (!__cmpxchg_double_slab(s, slab,
freelist, counters,
NULL, new.counters,
"get_freelist"));
@@ -2711,7 +2714,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page)
*
* And if we were unable to get a new slab from the partial slab lists then
* we need to allocate a new slab. This is the slowest path since it involves
- * a call to the page allocator and the setup of a new slab.
+ * a call to the slab allocator and the setup of a new slab.
*
* Version of __slab_alloc to use when we know that interrupts are
* already disabled (which is the case for bulk allocation).
@@ -2720,12 +2723,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
unsigned long addr, struct kmem_cache_cpu *c)
{
void *freelist;
- struct page *page;
+ struct slab *slab;

stat(s, ALLOC_SLOWPATH);

- page = c->page;
- if (!page) {
+ slab = c->slab;
+ if (!slab) {
/*
* if the node is not online or has no normal memory, just
* ignore the node constraint
@@ -2737,7 +2740,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
}
redo:

- if (unlikely(!node_match(page, node))) {
+ if (unlikely(!node_match(slab, node))) {
/*
* same as above but node_match() being false already
* implies node != NUMA_NO_NODE
@@ -2747,18 +2750,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto redo;
} else {
stat(s, ALLOC_NODE_MISMATCH);
- deactivate_slab(s, page, c->freelist, c);
+ deactivate_slab(s, slab, c->freelist, c);
goto new_slab;
}
}

/*
- * By rights, we should be searching for a slab page that was
+ * By rights, we should be searching for a slab that was
* PFMEMALLOC but right now, we are losing the pfmemalloc
- * information when the page leaves the per-cpu allocator
+ * information when the slab leaves the per-cpu allocator
*/
- if (unlikely(!pfmemalloc_match(page, gfpflags))) {
- deactivate_slab(s, page, c->freelist, c);
+ if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
+ deactivate_slab(s, slab, c->freelist, c);
goto new_slab;
}

@@ -2767,10 +2770,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (freelist)
goto load_freelist;

- freelist = get_freelist(s, page);
+ freelist = get_freelist(s, slab);

if (!freelist) {
- c->page = NULL;
+ c->slab = NULL;
stat(s, DEACTIVATE_BYPASS);
goto new_slab;
}
@@ -2780,10 +2783,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
load_freelist:
/*
* freelist is pointing to the list of objects to be used.
- * page is pointing to the page from which the objects are obtained.
- * That page must be frozen for per cpu allocations to work.
+ * slab is pointing to the slab from which the objects are obtained.
+ * That slab must be frozen for per cpu allocations to work.
*/
- VM_BUG_ON(!c->page->frozen);
+ VM_BUG_ON(!c->slab->frozen);
c->freelist = get_freepointer(s, freelist);
c->tid = next_tid(c->tid);
return freelist;
@@ -2791,8 +2794,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
new_slab:

if (slub_percpu_partial(c)) {
- page = c->page = slub_percpu_partial(c);
- slub_set_percpu_partial(c, page);
+ slab = c->slab = slub_percpu_partial(c);
+ slub_set_percpu_partial(c, slab);
stat(s, CPU_PARTIAL_ALLOC);
goto redo;
}
@@ -2804,16 +2807,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return NULL;
}

- page = c->page;
- if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
+ slab = c->slab;
+ if (likely(!kmem_cache_debug(s) && pfmemalloc_match(slab, gfpflags)))
goto load_freelist;

/* Only entered in the debug case */
if (kmem_cache_debug(s) &&
- !alloc_debug_processing(s, page, freelist, addr))
+ !alloc_debug_processing(s, slab, freelist, addr))
goto new_slab; /* Slab failed checks. Next slab needed */

- deactivate_slab(s, page, get_freepointer(s, freelist), c);
+ deactivate_slab(s, slab, get_freepointer(s, freelist), c);
return freelist;
}

@@ -2869,7 +2872,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
{
void *object;
struct kmem_cache_cpu *c;
- struct page *page;
+ struct slab *slab;
unsigned long tid;
struct obj_cgroup *objcg = NULL;
bool init = false;
@@ -2902,9 +2905,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
/*
* Irqless object alloc/free algorithm used here depends on sequence
* of fetching cpu_slab's data. tid should be fetched before anything
- * on c to guarantee that object and page associated with previous tid
+ * on c to guarantee that object and slab associated with previous tid
* won't be used with current tid. If we fetch tid first, object and
- * page could be one associated with next tid and our alloc/free
+ * slab could be one associated with next tid and our alloc/free
* request will be failed. In this case, we will retry. So, no problem.
*/
barrier();
@@ -2917,8 +2920,8 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
*/

object = c->freelist;
- page = c->page;
- if (unlikely(!object || !page || !node_match(page, node))) {
+ slab = c->slab;
+ if (unlikely(!object || !slab || !node_match(slab, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c);
} else {
void *next_object = get_freepointer_safe(s, object);
@@ -3020,17 +3023,17 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
* have a longer lifetime than the cpu slabs in most processing loads.
*
* So we still attempt to reduce cache line usage. Just take the slab
- * lock and free the item. If there is no additional partial page
+ * lock and free the item. If there is no additional partial slab
* handling required then we can return immediately.
*/
-static void __slab_free(struct kmem_cache *s, struct page *page,
+static void __slab_free(struct kmem_cache *s, struct slab *slab,
void *head, void *tail, int cnt,
unsigned long addr)

{
void *prior;
int was_frozen;
- struct page new;
+ struct slab new;
unsigned long counters;
struct kmem_cache_node *n = NULL;
unsigned long flags;
@@ -3041,7 +3044,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
return;

if (kmem_cache_debug(s) &&
- !free_debug_processing(s, page, head, tail, cnt, addr))
+ !free_debug_processing(s, slab, head, tail, cnt, addr))
return;

do {
@@ -3049,8 +3052,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
spin_unlock_irqrestore(&n->list_lock, flags);
n = NULL;
}
- prior = page->freelist;
- counters = page->counters;
+ prior = slab->freelist;
+ counters = slab->counters;
set_freepointer(s, tail, prior);
new.counters = counters;
was_frozen = new.frozen;
@@ -3069,7 +3072,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,

} else { /* Needs to be taken off a list */

- n = get_node(s, page_to_nid(page));
+ n = get_node(s, slab_nid(slab));
/*
* Speculatively acquire the list_lock.
* If the cmpxchg does not succeed then we may
@@ -3083,7 +3086,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
}
}

- } while (!cmpxchg_double_slab(s, page,
+ } while (!cmpxchg_double_slab(s, slab,
prior, counters,
head, new.counters,
"__slab_free"));
@@ -3098,10 +3101,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
stat(s, FREE_FROZEN);
} else if (new.frozen) {
/*
- * If we just froze the page then put it onto the
+ * If we just froze the slab then put it onto the
* per cpu partial list.
*/
- put_cpu_partial(s, page, 1);
+ put_cpu_partial(s, slab, 1);
stat(s, CPU_PARTIAL_FREE);
}

@@ -3116,8 +3119,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
* then add it.
*/
if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
- remove_full(s, n, page);
- add_partial(n, page, DEACTIVATE_TO_TAIL);
+ remove_full(s, n, slab);
+ add_partial(n, slab, DEACTIVATE_TO_TAIL);
stat(s, FREE_ADD_PARTIAL);
}
spin_unlock_irqrestore(&n->list_lock, flags);
@@ -3128,16 +3131,16 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
/*
* Slab on the partial list.
*/
- remove_partial(n, page);
+ remove_partial(n, slab);
stat(s, FREE_REMOVE_PARTIAL);
} else {
/* Slab must be on the full list */
- remove_full(s, n, page);
+ remove_full(s, n, slab);
}

spin_unlock_irqrestore(&n->list_lock, flags);
stat(s, FREE_SLAB);
- discard_slab(s, page);
+ discard_slab(s, slab);
}

/*
@@ -3152,11 +3155,11 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
* with all sorts of special processing.
*
* Bulk free of a freelist with several objects (all pointing to the
- * same page) possible by specifying head and tail ptr, plus objects
+ * same slab) possible by specifying head and tail ptr, plus objects
* count (cnt). Bulk free indicated by tail pointer being set.
*/
static __always_inline void do_slab_free(struct kmem_cache *s,
- struct page *page, void *head, void *tail,
+ struct slab *slab, void *head, void *tail,
int cnt, unsigned long addr)
{
void *tail_obj = tail ? : head;
@@ -3180,7 +3183,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
/* Same with comment on barrier() in slab_alloc_node() */
barrier();

- if (likely(page == c->page)) {
+ if (likely(slab == c->slab)) {
void **freelist = READ_ONCE(c->freelist);

set_freepointer(s, tail_obj, freelist);
@@ -3195,11 +3198,11 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
}
stat(s, FREE_FASTPATH);
} else
- __slab_free(s, page, head, tail_obj, cnt, addr);
+ __slab_free(s, slab, head, tail_obj, cnt, addr);

}

-static __always_inline void slab_free(struct kmem_cache *s, struct page *page,
+static __always_inline void slab_free(struct kmem_cache *s, struct slab *slab,
void *head, void *tail, int cnt,
unsigned long addr)
{
@@ -3208,13 +3211,13 @@ static __always_inline void slab_free(struct kmem_cache *s, struct page *page,
* to remove objects, whose reuse must be delayed.
*/
if (slab_free_freelist_hook(s, &head, &tail))
- do_slab_free(s, page, head, tail, cnt, addr);
+ do_slab_free(s, slab, head, tail, cnt, addr);
}

#ifdef CONFIG_KASAN_GENERIC
void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr)
{
- do_slab_free(cache, virt_to_head_page(x), x, NULL, 1, addr);
+ do_slab_free(cache, virt_to_slab(x), x, NULL, 1, addr);
}
#endif

@@ -3223,13 +3226,13 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
s = cache_from_obj(s, x);
if (!s)
return;
- slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_);
+ slab_free(s, virt_to_slab(x), x, NULL, 1, _RET_IP_);
trace_kmem_cache_free(_RET_IP_, x, s->name);
}
EXPORT_SYMBOL(kmem_cache_free);

struct detached_freelist {
- struct page *page;
+ struct slab *slab;
void *tail;
void *freelist;
int cnt;
@@ -3239,8 +3242,8 @@ struct detached_freelist {
/*
* This function progressively scans the array with free objects (with
* a limited look ahead) and extract objects belonging to the same
- * page. It builds a detached freelist directly within the given
- * page/objects. This can happen without any need for
+ * slab. It builds a detached freelist directly within the given
+ * slab/objects. This can happen without any need for
* synchronization, because the objects are owned by running process.
* The freelist is build up as a single linked list in the objects.
* The idea is, that this detached freelist can then be bulk
@@ -3255,10 +3258,10 @@ int build_detached_freelist(struct kmem_cache *s, size_t size,
size_t first_skipped_index = 0;
int lookahead = 3;
void *object;
- struct page *page;
+ struct slab *slab;

/* Always re-init detached_freelist */
- df->page = NULL;
+ df->slab = NULL;

do {
object = p[--size];
@@ -3268,18 +3271,18 @@ int build_detached_freelist(struct kmem_cache *s, size_t size,
if (!object)
return 0;

- page = virt_to_head_page(object);
+ slab = virt_to_slab(object);
if (!s) {
/* Handle kalloc'ed objects */
- if (unlikely(!PageSlab(page))) {
- BUG_ON(!PageCompound(page));
+ if (unlikely(!is_slab(slab))) {
+ BUG_ON(!SlabMulti(slab));
kfree_hook(object);
- __free_pages(page, compound_order(page));
+ put_page(&slab->page);
p[size] = NULL; /* mark object processed */
return size;
}
/* Derive kmem_cache from object */
- df->s = page->slab_cache;
+ df->s = slab->slab_cache;
} else {
df->s = cache_from_obj(s, object); /* Support for memcg */
}
@@ -3292,7 +3295,7 @@ int build_detached_freelist(struct kmem_cache *s, size_t size,
}

/* Start new detached freelist */
- df->page = page;
+ df->slab = slab;
set_freepointer(df->s, object, NULL);
df->tail = object;
df->freelist = object;
@@ -3304,8 +3307,8 @@ int build_detached_freelist(struct kmem_cache *s, size_t size,
if (!object)
continue; /* Skip processed objects */

- /* df->page is always set at this point */
- if (df->page == virt_to_head_page(object)) {
+ /* df->slab is always set at this point */
+ if (df->slab == virt_to_slab(object)) {
/* Opportunity build freelist */
set_freepointer(df->s, object, df->freelist);
df->freelist = object;
@@ -3337,10 +3340,10 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
struct detached_freelist df;

size = build_detached_freelist(s, size, p, &df);
- if (!df.page)
+ if (!df.slab)
continue;

- slab_free(df.s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_);
+ slab_free(df.s, df.slab, df.freelist, df.tail, df.cnt, _RET_IP_);
} while (likely(size));
}
EXPORT_SYMBOL(kmem_cache_free_bulk);
@@ -3435,7 +3438,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk);
*/

/*
- * Minimum / Maximum order of slab pages. This influences locking overhead
+ * Minimum / Maximum order of slabs. This influences locking overhead
* and slab fragmentation. A higher order reduces the number of partial slabs
* and increases the number of allocations possible without having to
* take the list_lock.
@@ -3449,7 +3452,7 @@ static unsigned int slub_min_objects;
*
* The order of allocation has significant impact on performance and other
* system components. Generally order 0 allocations should be preferred since
- * order 0 does not cause fragmentation in the page allocator. Larger objects
+ * order 0 does not cause fragmentation in the slab allocator. Larger objects
* be problematic to put into order 0 slabs because there may be too much
* unused space left. We go to a higher order if more than 1/16th of the slab
* would be wasted.
@@ -3461,15 +3464,15 @@ static unsigned int slub_min_objects;
*
* slub_max_order specifies the order where we begin to stop considering the
* number of objects in a slab as critical. If we reach slub_max_order then
- * we try to keep the page order as low as possible. So we accept more waste
- * of space in favor of a small page order.
+ * we try to keep the slab order as low as possible. So we accept more waste
+ * of space in favor of a small slab order.
*
* Higher order allocations also allow the placement of more objects in a
* slab and thereby reduce object handling overhead. If the user has
* requested a higher minimum order then we start with that one instead of
* the smallest order which will fit the object.
*/
-static inline unsigned int slab_order(unsigned int size,
+static inline unsigned int calc_slab_order(unsigned int size,
unsigned int min_objects, unsigned int max_order,
unsigned int fract_leftover)
{
@@ -3533,7 +3536,7 @@ static inline int calculate_order(unsigned int size)

fraction = 16;
while (fraction >= 4) {
- order = slab_order(size, min_objects,
+ order = calc_slab_order(size, min_objects,
slub_max_order, fraction);
if (order <= slub_max_order)
return order;
@@ -3546,14 +3549,14 @@ static inline int calculate_order(unsigned int size)
* We were unable to place multiple objects in a slab. Now
* lets see if we can place a single object there.
*/
- order = slab_order(size, 1, slub_max_order, 1);
+ order = calc_slab_order(size, 1, slub_max_order, 1);
if (order <= slub_max_order)
return order;

/*
* Doh this slab cannot be placed using slub_max_order.
*/
- order = slab_order(size, 1, MAX_ORDER, 1);
+ order = calc_slab_order(size, 1, MAX_ORDER, 1);
if (order < MAX_ORDER)
return order;
return -ENOSYS;
@@ -3605,38 +3608,38 @@ static struct kmem_cache *kmem_cache_node;
*/
static void early_kmem_cache_node_alloc(int node)
{
- struct page *page;
+ struct slab *slab;
struct kmem_cache_node *n;

BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));

- page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
+ slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);

- BUG_ON(!page);
- if (page_to_nid(page) != node) {
+ BUG_ON(!slab);
+ if (slab_nid(slab) != node) {
pr_err("SLUB: Unable to allocate memory from node %d\n", node);
pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
}

- n = page->freelist;
+ n = slab->freelist;
BUG_ON(!n);
#ifdef CONFIG_SLUB_DEBUG
init_object(kmem_cache_node, n, SLUB_RED_ACTIVE);
init_tracking(kmem_cache_node, n);
#endif
n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false);
- page->freelist = get_freepointer(kmem_cache_node, n);
- page->inuse = 1;
- page->frozen = 0;
+ slab->freelist = get_freepointer(kmem_cache_node, n);
+ slab->inuse = 1;
+ slab->frozen = 0;
kmem_cache_node->node[node] = n;
init_kmem_cache_node(n);
- inc_slabs_node(kmem_cache_node, node, page->objects);
+ inc_slabs_node(kmem_cache_node, node, slab->objects);

/*
* No locks need to be taken here as it has just been
* initialized and there is no concurrent access.
*/
- __add_partial(n, page, DEACTIVATE_TO_HEAD);
+ __add_partial(n, slab, DEACTIVATE_TO_HEAD);
}

static void free_kmem_cache_nodes(struct kmem_cache *s)
@@ -3894,8 +3897,8 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
#endif

/*
- * The larger the object size is, the more pages we want on the partial
- * list to avoid pounding the page allocator excessively.
+ * The larger the object size is, the more slabs we want on the partial
+ * list to avoid pounding the slab allocator excessively.
*/
set_min_partial(s, ilog2(s->size) / 2);

@@ -3922,19 +3925,19 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
return -EINVAL;
}

-static void list_slab_objects(struct kmem_cache *s, struct page *page,
+static void list_slab_objects(struct kmem_cache *s, struct slab *slab,
const char *text)
{
#ifdef CONFIG_SLUB_DEBUG
- void *addr = page_address(page);
+ void *addr = slab_address(slab);
unsigned long *map;
void *p;

- slab_err(s, page, text, s->name);
- slab_lock(page);
+ slab_err(s, slab, text, s->name);
+ slab_lock(slab);

- map = get_map(s, page);
- for_each_object(p, s, addr, page->objects) {
+ map = get_map(s, slab);
+ for_each_object(p, s, addr, slab->objects) {

if (!test_bit(__obj_to_index(s, addr, p), map)) {
pr_err("Object 0x%p @offset=%tu\n", p, p - addr);
@@ -3942,7 +3945,7 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page,
}
}
put_map(map);
- slab_unlock(page);
+ slab_unlock(slab);
#endif
}

@@ -3954,23 +3957,23 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page,
static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
{
LIST_HEAD(discard);
- struct page *page, *h;
+ struct slab *slab, *h;

BUG_ON(irqs_disabled());
spin_lock_irq(&n->list_lock);
- list_for_each_entry_safe(page, h, &n->partial, slab_list) {
- if (!page->inuse) {
- remove_partial(n, page);
- list_add(&page->slab_list, &discard);
+ list_for_each_entry_safe(slab, h, &n->partial, slab_list) {
+ if (!slab->inuse) {
+ remove_partial(n, slab);
+ list_add(&slab->slab_list, &discard);
} else {
- list_slab_objects(s, page,
+ list_slab_objects(s, slab,
"Objects remaining in %s on __kmem_cache_shutdown()");
}
}
spin_unlock_irq(&n->list_lock);

- list_for_each_entry_safe(page, h, &discard, slab_list)
- discard_slab(s, page);
+ list_for_each_entry_safe(slab, h, &discard, slab_list)
+ discard_slab(s, slab);
}

bool __kmem_cache_empty(struct kmem_cache *s)
@@ -4003,31 +4006,31 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
}

#ifdef CONFIG_PRINTK
-void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page)
+void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
{
void *base;
int __maybe_unused i;
unsigned int objnr;
void *objp;
void *objp0;
- struct kmem_cache *s = page->slab_cache;
+ struct kmem_cache *s = slab->slab_cache;
struct track __maybe_unused *trackp;

kpp->kp_ptr = object;
- kpp->kp_page = page;
+ kpp->kp_slab = slab;
kpp->kp_slab_cache = s;
- base = page_address(page);
+ base = slab_address(slab);
objp0 = kasan_reset_tag(object);
#ifdef CONFIG_SLUB_DEBUG
objp = restore_red_left(s, objp0);
#else
objp = objp0;
#endif
- objnr = obj_to_index(s, page, objp);
+ objnr = obj_to_index(s, slab, objp);
kpp->kp_data_offset = (unsigned long)((char *)objp0 - (char *)objp);
objp = base + s->size * objnr;
kpp->kp_objp = objp;
- if (WARN_ON_ONCE(objp < base || objp >= base + page->objects * s->size || (objp - base) % s->size) ||
+ if (WARN_ON_ONCE(objp < base || objp >= base + slab->objects * s->size || (objp - base) % s->size) ||
!(s->flags & SLAB_STORE_USER))
return;
#ifdef CONFIG_SLUB_DEBUG
@@ -4115,8 +4118,8 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
unsigned int order = get_order(size);

flags |= __GFP_COMP;
- page = alloc_pages_node(node, flags, order);
- if (page) {
+ slab = alloc_pages_node(node, flags, order);
+ if (slab) {
ptr = page_address(page);
mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
PAGE_SIZE << order);
@@ -4165,7 +4168,7 @@ EXPORT_SYMBOL(__kmalloc_node);
* Returns NULL if check passes, otherwise const char * to name of cache
* to indicate an error.
*/
-void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
+void __check_heap_object(const void *ptr, unsigned long n, struct slab *slab,
bool to_user)
{
struct kmem_cache *s;
@@ -4176,18 +4179,18 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
ptr = kasan_reset_tag(ptr);

/* Find object and usable object size. */
- s = page->slab_cache;
+ s = slab->slab_cache;

/* Reject impossible pointers. */
- if (ptr < page_address(page))
- usercopy_abort("SLUB object not in SLUB page?!", NULL,
+ if (ptr < slab_address(slab))
+ usercopy_abort("SLUB object not in SLUB slab?!", NULL,
to_user, 0, n);

/* Find offset within object. */
if (is_kfence)
offset = ptr - kfence_object_start(ptr);
else
- offset = (ptr - page_address(page)) % s->size;
+ offset = (ptr - slab_address(slab)) % s->size;

/* Adjust for redzone and reject if within the redzone. */
if (!is_kfence && kmem_cache_debug_flags(s, SLAB_RED_ZONE)) {
@@ -4222,25 +4225,25 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,

size_t __ksize(const void *object)
{
- struct page *page;
+ struct slab *slab;

if (unlikely(object == ZERO_SIZE_PTR))
return 0;

- page = virt_to_head_page(object);
+ slab = virt_to_slab(object);

- if (unlikely(!PageSlab(page))) {
- WARN_ON(!PageCompound(page));
- return page_size(page);
+ if (unlikely(!is_slab(slab))) {
+ WARN_ON(!SlabMulti(slab));
+ return slab_size(slab);
}

- return slab_ksize(page->slab_cache);
+ return slab_ksize(slab->slab_cache);
}
EXPORT_SYMBOL(__ksize);

void kfree(const void *x)
{
- struct page *page;
+ struct slab *slab;
void *object = (void *)x;

trace_kfree(_RET_IP_, x);
@@ -4248,18 +4251,19 @@ void kfree(const void *x)
if (unlikely(ZERO_OR_NULL_PTR(x)))
return;

- page = virt_to_head_page(x);
- if (unlikely(!PageSlab(page))) {
- unsigned int order = compound_order(page);
+ slab = virt_to_slab(x);
+ if (unlikely(!is_slab(slab))) {
+ unsigned int order = slab_order(slab);
+ struct page *page = &slab->page;

- BUG_ON(!PageCompound(page));
+ BUG_ON(!SlabMulti(slab));
kfree_hook(object);
mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
-(PAGE_SIZE << order));
- __free_pages(page, order);
+ put_page(page);
return;
}
- slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
+ slab_free(slab->slab_cache, slab, object, NULL, 1, _RET_IP_);
}
EXPORT_SYMBOL(kfree);

@@ -4279,8 +4283,8 @@ int __kmem_cache_shrink(struct kmem_cache *s)
int node;
int i;
struct kmem_cache_node *n;
- struct page *page;
- struct page *t;
+ struct slab *slab;
+ struct slab *t;
struct list_head discard;
struct list_head promote[SHRINK_PROMOTE_MAX];
unsigned long flags;
@@ -4298,22 +4302,22 @@ int __kmem_cache_shrink(struct kmem_cache *s)
* Build lists of slabs to discard or promote.
*
* Note that concurrent frees may occur while we hold the
- * list_lock. page->inuse here is the upper limit.
+ * list_lock. slab->inuse here is the upper limit.
*/
- list_for_each_entry_safe(page, t, &n->partial, slab_list) {
- int free = page->objects - page->inuse;
+ list_for_each_entry_safe(slab, t, &n->partial, slab_list) {
+ int free = slab->objects - slab->inuse;

- /* Do not reread page->inuse */
+ /* Do not reread slab->inuse */
barrier();

/* We do not keep full slabs on the list */
BUG_ON(free <= 0);

- if (free == page->objects) {
- list_move(&page->slab_list, &discard);
+ if (free == slab->objects) {
+ list_move(&slab->slab_list, &discard);
n->nr_partial--;
} else if (free <= SHRINK_PROMOTE_MAX)
- list_move(&page->slab_list, promote + free - 1);
+ list_move(&slab->slab_list, promote + free - 1);
}

/*
@@ -4326,8 +4330,8 @@ int __kmem_cache_shrink(struct kmem_cache *s)
spin_unlock_irqrestore(&n->list_lock, flags);

/* Release empty slabs */
- list_for_each_entry_safe(page, t, &discard, slab_list)
- discard_slab(s, page);
+ list_for_each_entry_safe(slab, t, &discard, slab_list)
+ discard_slab(s, slab);

if (slabs_node(s, node))
ret = 1;
@@ -4461,7 +4465,7 @@ static struct notifier_block slab_memory_callback_nb = {

/*
* Used for early kmem_cache structures that were allocated using
- * the page allocator. Allocate them properly then fix up the pointers
+ * the slab allocator. Allocate them properly then fix up the pointers
* that may be pointing to the wrong kmem_cache structure.
*/

@@ -4480,7 +4484,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
*/
__flush_cpu_slab(s, smp_processor_id());
for_each_kmem_cache_node(s, node, n) {
- struct page *p;
+ struct slab *p;

list_for_each_entry(p, &n->partial, slab_list)
p->slab_cache = s;
@@ -4656,54 +4660,54 @@ EXPORT_SYMBOL(__kmalloc_node_track_caller);
#endif

#ifdef CONFIG_SYSFS
-static int count_inuse(struct page *page)
+static int count_inuse(struct slab *slab)
{
- return page->inuse;
+ return slab->inuse;
}

-static int count_total(struct page *page)
+static int count_total(struct slab *slab)
{
- return page->objects;
+ return slab->objects;
}
#endif

#ifdef CONFIG_SLUB_DEBUG
-static void validate_slab(struct kmem_cache *s, struct page *page)
+static void validate_slab(struct kmem_cache *s, struct slab *slab)
{
void *p;
- void *addr = page_address(page);
+ void *addr = slab_address(slab);
unsigned long *map;

- slab_lock(page);
+ slab_lock(slab);

- if (!check_slab(s, page) || !on_freelist(s, page, NULL))
+ if (!check_slab(s, slab) || !on_freelist(s, slab, NULL))
goto unlock;

/* Now we know that a valid freelist exists */
- map = get_map(s, page);
- for_each_object(p, s, addr, page->objects) {
+ map = get_map(s, slab);
+ for_each_object(p, s, addr, slab->objects) {
u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;

- if (!check_object(s, page, p, val))
+ if (!check_object(s, slab, p, val))
break;
}
put_map(map);
unlock:
- slab_unlock(page);
+ slab_unlock(slab);
}

static int validate_slab_node(struct kmem_cache *s,
struct kmem_cache_node *n)
{
unsigned long count = 0;
- struct page *page;
+ struct slab *slab;
unsigned long flags;

spin_lock_irqsave(&n->list_lock, flags);

- list_for_each_entry(page, &n->partial, slab_list) {
- validate_slab(s, page);
+ list_for_each_entry(slab, &n->partial, slab_list) {
+ validate_slab(s, slab);
count++;
}
if (count != n->nr_partial) {
@@ -4715,8 +4719,8 @@ static int validate_slab_node(struct kmem_cache *s,
if (!(s->flags & SLAB_STORE_USER))
goto out;

- list_for_each_entry(page, &n->full, slab_list) {
- validate_slab(s, page);
+ list_for_each_entry(slab, &n->full, slab_list) {
+ validate_slab(s, slab);
count++;
}
if (count != atomic_long_read(&n->nr_slabs)) {
@@ -4838,7 +4842,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
cpumask_set_cpu(track->cpu,
to_cpumask(l->cpus));
}
- node_set(page_to_nid(virt_to_page(track)), l->nodes);
+ node_set(slab_nid(virt_to_slab(track)), l->nodes);
return 1;
}

@@ -4869,19 +4873,19 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
cpumask_clear(to_cpumask(l->cpus));
cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
nodes_clear(l->nodes);
- node_set(page_to_nid(virt_to_page(track)), l->nodes);
+ node_set(slab_nid(virt_to_slab(track)), l->nodes);
return 1;
}

static void process_slab(struct loc_track *t, struct kmem_cache *s,
- struct page *page, enum track_item alloc)
+ struct slab *slab, enum track_item alloc)
{
- void *addr = page_address(page);
+ void *addr = slab_address(slab);
void *p;
unsigned long *map;

- map = get_map(s, page);
- for_each_object(p, s, addr, page->objects)
+ map = get_map(s, slab);
+ for_each_object(p, s, addr, slab->objects)
if (!test_bit(__obj_to_index(s, addr, p), map))
add_location(t, s, get_track(s, p, alloc));
put_map(map);
@@ -4924,32 +4928,32 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab,
cpu);
int node;
- struct page *page;
+ struct slab *slab;

- page = READ_ONCE(c->page);
- if (!page)
+ slab = READ_ONCE(c->slab);
+ if (!slab)
continue;

- node = page_to_nid(page);
+ node = slab_nid(slab);
if (flags & SO_TOTAL)
- x = page->objects;
+ x = slab->objects;
else if (flags & SO_OBJECTS)
- x = page->inuse;
+ x = slab->inuse;
else
x = 1;

total += x;
nodes[node] += x;

- page = slub_percpu_partial_read_once(c);
- if (page) {
- node = page_to_nid(page);
+ slab = slub_percpu_partial_read_once(c);
+ if (slab) {
+ node = slab_nid(slab);
if (flags & SO_TOTAL)
WARN_ON_ONCE(1);
else if (flags & SO_OBJECTS)
WARN_ON_ONCE(1);
else
- x = page->pages;
+ x = slab->slabs;
total += x;
nodes[node] += x;
}
@@ -5146,31 +5150,31 @@ SLAB_ATTR_RO(objects_partial);
static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
{
int objects = 0;
- int pages = 0;
+ int slabs = 0;
int cpu;
int len = 0;

for_each_online_cpu(cpu) {
- struct page *page;
+ struct slab *slab;

- page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
+ slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));

- if (page) {
- pages += page->pages;
- objects += page->pobjects;
+ if (slab) {
+ slabs += slab->slabs;
+ objects += slab->pobjects;
}
}

- len += sysfs_emit_at(buf, len, "%d(%d)", objects, pages);
+ len += sysfs_emit_at(buf, len, "%d(%d)", objects, slabs);

#ifdef CONFIG_SMP
for_each_online_cpu(cpu) {
- struct page *page;
+ struct slab *slab;

- page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
- if (page)
+ slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
+ if (slab)
len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
- cpu, page->pobjects, page->pages);
+ cpu, slab->pobjects, slab->slabs);
}
#endif
len += sysfs_emit_at(buf, len, "\n");
@@ -5825,16 +5829,16 @@ static int slab_debug_trace_open(struct inode *inode, struct file *filep)

for_each_kmem_cache_node(s, node, n) {
unsigned long flags;
- struct page *page;
+ struct slab *slab;

if (!atomic_long_read(&n->nr_slabs))
continue;

spin_lock_irqsave(&n->list_lock, flags);
- list_for_each_entry(page, &n->partial, slab_list)
- process_slab(t, s, page, alloc);
- list_for_each_entry(page, &n->full, slab_list)
- process_slab(t, s, page, alloc);
+ list_for_each_entry(slab, &n->partial, slab_list)
+ process_slab(t, s, slab, alloc);
+ list_for_each_entry(slab, &n->full, slab_list)
+ process_slab(t, s, slab, alloc);
spin_unlock_irqrestore(&n->list_lock, flags);
}

diff --git a/mm/sparse.c b/mm/sparse.c
index 6326cdf36c4f..2b1099c986c6 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -750,7 +750,7 @@ static void free_map_bootmem(struct page *memmap)
>> PAGE_SHIFT;

for (i = 0; i < nr_pages; i++, page++) {
- magic = (unsigned long) page->freelist;
+ magic = page->index;

BUG_ON(magic == NODE_INFO);

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 68e8831068f4..0661dc09e11b 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -17,7 +17,7 @@
*
* Usage of struct page fields:
* page->private: points to zspage
- * page->freelist(index): links together all component pages of a zspage
+ * page->index: links together all component pages of a zspage
* For the huge page, this is always 0, so we use this field
* to store handle.
* page->units: first object offset in a subpage of zspage
@@ -827,7 +827,7 @@ static struct page *get_next_page(struct page *page)
if (unlikely(PageHugeObject(page)))
return NULL;

- return page->freelist;
+ return (struct page *)page->index;
}

/**
@@ -901,7 +901,7 @@ static void reset_page(struct page *page)
set_page_private(page, 0);
page_mapcount_reset(page);
ClearPageHugeObject(page);
- page->freelist = NULL;
+ page->index = 0;
}

static int trylock_zspage(struct zspage *zspage)
@@ -1027,7 +1027,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,

/*
* Allocate individual pages and link them together as:
- * 1. all pages are linked together using page->freelist
+ * 1. all pages are linked together using page->index
* 2. each sub-page point to zspage using page->private
*
* we set PG_private to identify the first page (i.e. no other sub-page
@@ -1036,7 +1036,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
for (i = 0; i < nr_pages; i++) {
page = pages[i];
set_page_private(page, (unsigned long)zspage);
- page->freelist = NULL;
+ page->index = 0;
if (i == 0) {
zspage->first_page = page;
SetPagePrivate(page);
@@ -1044,7 +1044,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
class->pages_per_zspage == 1))
SetPageHugeObject(page);
} else {
- prev_page->freelist = page;
+ prev_page->index = (unsigned long)page;
}
prev_page = page;
}
Re: Folio discussion recap
On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote:

...

> +/**
> + * page_slab - Converts from page to slab.
> + * @p: The page.
> + *
> + * This function cannot be called on a NULL pointer. It can be called
> + * on a non-slab page; the caller should check is_slab() to be sure
> + * that the slab really is a slab.
> + *
> + * Return: The slab which contains this page.
> + */
> +#define page_slab(p) (_Generic((p), \
> + const struct page *: (const struct slab *)_compound_head(p), \
> + struct page *: (struct slab *)_compound_head(p)))
> +
> +static inline bool is_slab(struct slab *slab)
> +{
> + return test_bit(PG_slab, &slab->flags);
> +}
> +

I'm sorry, I don't have a dog in this fight and conceptually I think folios are
a good idea...

But for this work, having a call which returns if a 'struct slab' really is a
'struct slab' seems odd and well, IMHO, wrong. Why can't page_slab() return
NULL if there is no slab containing that page?

Ira
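
(For illustration, a minimal sketch of the variant Ira is describing: a
conversion helper that returns NULL when the page is not part of a slab.
The helper name and shape are hypothetical, not taken from the posted
series.)

static inline struct slab *page_slab_or_null(struct page *page)
{
	page = compound_head(page);
	/* Only pages with PG_slab set are backed by a struct slab. */
	if (!PageSlab(page))
		return NULL;
	return (struct slab *)page;
}

A caller would then write "slab = page_slab_or_null(page); if (!slab) ..."
instead of converting first and checking is_slab() afterwards.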
Re: Folio discussion recap
On Wed, Sep 22, 2021 at 05:45:15PM -0700, Ira Weiny wrote:
> On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote:
> > +/**
> > + * page_slab - Converts from page to slab.
> > + * @p: The page.
> > + *
> > + * This function cannot be called on a NULL pointer. It can be called
> > + * on a non-slab page; the caller should check is_slab() to be sure
> > + * that the slab really is a slab.
> > + *
> > + * Return: The slab which contains this page.
> > + */
> > +#define page_slab(p) (_Generic((p), \
> > + const struct page *: (const struct slab *)_compound_head(p), \
> > + struct page *: (struct slab *)_compound_head(p)))
> > +
> > +static inline bool is_slab(struct slab *slab)
> > +{
> > + return test_bit(PG_slab, &slab->flags);
> > +}
> > +
>
> I'm sorry, I don't have a dog in this fight and conceptually I think folios are
> a good idea...
>
> But for this work, having a call which returns if a 'struct slab' really is a
> 'struct slab' seems odd and well, IMHO, wrong. Why can't page_slab() return
> NULL if there is no slab containing that page?

No, this is a good question.

The way slub works right now is that if you ask for a "large" allocation,
it does:

flags |= __GFP_COMP;
page = alloc_pages_node(node, flags, order);

and returns page_address(page) (eventually; the code is more complex)
So when you call kfree(), it uses the PageSlab flag to determine if the
allocation was "large" or not:

page = virt_to_head_page(x);
if (unlikely(!PageSlab(page))) {
free_nonslab_page(page, object);
return;
}
slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);

Now, you could say that this is a bad way to handle things, and every
allocation from slab should have PageSlab set, and it should use one of
the many other bits in page->flags to indicate whether it's a large
allocation or not. I may have feelings in that direction myself.
But I don't think I should be changing that in this patch.

Maybe calling this function is_slab() is the confusing thing.
Perhaps it should be called SlabIsLargeAllocation(). Not sure.
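
(To make the intended calling convention concrete, here is a sketch of the
kfree() path above rewritten in terms of the new type. It assumes the
posted page_slab()/is_slab() helpers plus a slab_page() accessor for
converting back to struct page; names outside the quoted code are
illustrative only.)

void kfree_sketch(const void *x)
{
	void *object = (void *)x;
	struct slab *slab = page_slab(virt_to_head_page(x));

	if (unlikely(!is_slab(slab))) {
		/* No PG_slab: this came straight from the page allocator. */
		free_nonslab_page(slab_page(slab), object);
		return;
	}
	slab_free(slab->slab_cache, slab, object, NULL, 1, _RET_IP_);
}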
Re: Folio discussion recap
On Thu, Sep 23, 2021 at 04:41:04AM +0100, Matthew Wilcox wrote:
> On Wed, Sep 22, 2021 at 05:45:15PM -0700, Ira Weiny wrote:
> > On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote:
> > > +/**
> > > + * page_slab - Converts from page to slab.
> > > + * @p: The page.
> > > + *
> > > + * This function cannot be called on a NULL pointer. It can be called
> > > + * on a non-slab page; the caller should check is_slab() to be sure
> > > + * that the slab really is a slab.
> > > + *
> > > + * Return: The slab which contains this page.
> > > + */
> > > +#define page_slab(p) (_Generic((p), \
> > > + const struct page *: (const struct slab *)_compound_head(p), \
> > > + struct page *: (struct slab *)_compound_head(p)))
> > > +
> > > +static inline bool is_slab(struct slab *slab)
> > > +{
> > > + return test_bit(PG_slab, &slab->flags);
> > > +}
> > > +
> >
> > I'm sorry, I don't have a dog in this fight and conceptually I think folios are
> > a good idea...
> >
> > But for this work, having a call which returns if a 'struct slab' really is a
> > 'struct slab' seems odd and well, IMHO, wrong. Why can't page_slab() return
> > NULL if there is no slab containing that page?
>
> No, this is a good question.
>
> The way slub works right now is that if you ask for a "large" allocation,
> it does:
>
> flags |= __GFP_COMP;
> page = alloc_pages_node(node, flags, order);
>
> and returns page_address(page) (eventually; the code is more complex)
> So when you call kfree(), it uses the PageSlab flag to determine if the
> allocation was "large" or not:
>
> page = virt_to_head_page(x);
> if (unlikely(!PageSlab(page))) {
> free_nonslab_page(page, object);
> return;
> }
> slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
>
> Now, you could say that this is a bad way to handle things, and every
> allocation from slab should have PageSlab set,

Yea basically.

So what makes 'struct slab' different from 'struct page' in an order 0
allocation? Am I correct in deducing that PG_slab is not set in that case?

> and it should use one of
> the many other bits in page->flags to indicate whether it's a large
> allocation or not.

Isn't the fact that it is a compound page enough to know that?

> I may have feelings in that direction myself.
> But I don't think I should be changing that in this patch.
>
> Maybe calling this function is_slab() is the confusing thing.
> Perhaps it should be called SlabIsLargeAllocation(). Not sure.

Well that makes a lot more sense to me from an API standpoint but checking
PG_slab is still likely to raise some eyebrows.

Regardless, I like the fact that the community is at least attempting to fix
stuff like this, because adding types like this makes it easier for people like
me to understand what is going on.

Ira
Re: Folio discussion recap
On Thu, Sep 23, 2021 at 03:12:41PM -0700, Ira Weiny wrote:
> On Thu, Sep 23, 2021 at 04:41:04AM +0100, Matthew Wilcox wrote:
> > On Wed, Sep 22, 2021 at 05:45:15PM -0700, Ira Weiny wrote:
> > > On Tue, Sep 21, 2021 at 11:18:52PM +0100, Matthew Wilcox wrote:
> > > > +/**
> > > > + * page_slab - Converts from page to slab.
> > > > + * @p: The page.
> > > > + *
> > > > + * This function cannot be called on a NULL pointer. It can be called
> > > > + * on a non-slab page; the caller should check is_slab() to be sure
> > > > + * that the slab really is a slab.
> > > > + *
> > > > + * Return: The slab which contains this page.
> > > > + */
> > > > +#define page_slab(p) (_Generic((p), \
> > > > + const struct page *: (const struct slab *)_compound_head(p), \
> > > > + struct page *: (struct slab *)_compound_head(p)))
> > > > +
> > > > +static inline bool is_slab(struct slab *slab)
> > > > +{
> > > > + return test_bit(PG_slab, &slab->flags);
> > > > +}
> > > > +
> > >
> > > I'm sorry, I don't have a dog in this fight and conceptually I think folios are
> > > a good idea...
> > >
> > > But for this work, having a call which returns if a 'struct slab' really is a
> > > 'struct slab' seems odd and well, IMHO, wrong. Why can't page_slab() return
> > > NULL if there is no slab containing that page?
> >
> > No, this is a good question.
> >
> > The way slub works right now is that if you ask for a "large" allocation,
> > it does:
> >
> > flags |= __GFP_COMP;
> > page = alloc_pages_node(node, flags, order);
> >
> > and returns page_address(page) (eventually; the code is more complex)
> > So when you call kfree(), it uses the PageSlab flag to determine if the
> > allocation was "large" or not:
> >
> > page = virt_to_head_page(x);
> > if (unlikely(!PageSlab(page))) {
> > free_nonslab_page(page, object);
> > return;
> > }
> > slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
> >
> > Now, you could say that this is a bad way to handle things, and every
> > allocation from slab should have PageSlab set,
>
> Yea basically.
>
> So what makes 'struct slab' different from 'struct page' in an order 0
> allocation? Am I correct in deducing that PG_slab is not set in that case?

You might mean a couple of different things by that question, so let
me say some things which are true (on x86) but might not answer your
question:

If you kmalloc(4095) bytes, it comes from a slab. That slab would usually
be an order-3 allocation. If that order-3 allocation fails, slab might
go as low as an order-0 allocation, but PageSlab will always be set on
that head/base page because the allocation is smaller than two pages.

If you kmalloc(8193) bytes, slub throws up its hands and does an
allocation from the page allocator. So it allocates an order-2 page,
does not set PG_slab on it, but PG_head is set on the head page and
PG_tail is set on all three tail pages.
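
(A rough sketch of the size cut-off being described; the real slub code has
more cases, and kmalloc_slab() here just stands in for the lookup of the
matching kmalloc cache. With SLUB on x86 and 4K pages, KMALLOC_MAX_CACHE_SIZE
is two pages, so 4095 bytes stays in slab while 8193 bytes falls through to
the page allocator with __GFP_COMP and no PG_slab.)

static void *kmalloc_sketch(size_t size, gfp_t flags)
{
	if (size > KMALLOC_MAX_CACHE_SIZE) {
		unsigned int order = get_order(size);
		struct page *page;

		/* Large allocation: straight from the page allocator,
		 * compound but without PG_slab. */
		flags |= __GFP_COMP;
		page = alloc_pages(flags, order);
		return page ? page_address(page) : NULL;
	}
	/* Small allocation: served from one of the kmalloc slab caches. */
	return kmem_cache_alloc(kmalloc_slab(size, flags), flags);
}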

> > and it should use one of
> > the many other bits in page->flags to indicate whether it's a large
> > allocation or not.
>
> Isn't the fact that it is a compound page enough to know that?

No -- regular slab allocations have PG_head set. But it could use,
eg, slab->slab_cache == NULL to distinguish page allocations
from slab allocations.
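
(Sketched out, the distinction suggested here would look something like the
following; this is hypothetical and not what the posted series does:)

static inline bool slab_is_page_allocation(const struct slab *slab)
{
	/* Large allocations bypass slab, so no cache is attached. */
	return !slab->slab_cache;
}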

> > I may have feelings in that direction myself.
> > But I don't think I should be changing that in this patch.
> >
> > Maybe calling this function is_slab() is the confusing thing.
> > Perhaps it should be called SlabIsLargeAllocation(). Not sure.
>
> Well that makes a lot more sense to me from an API standpoint but checking
> PG_slab is still likely to raise some eyebrows.

Yeah. Here's what I have right now:

+static inline bool SlabMultiPage(const struct slab *slab)
+{
+ return test_bit(PG_head, &slab->flags);
+}
+
+/* Did this allocation come from the page allocator instead of slab? */
+static inline bool SlabPageAllocation(const struct slab *slab)
+{
+ return !test_bit(PG_slab, &slab->flags);
+}

> Regardless, I like the fact that the community is at least attempting to fix
> stuff like this, because adding types like this makes it easier for people like
> me to understand what is going on.

Yes, I dislike that 'struct page' is so hard to understand, and so easy
to misuse. It's a very weak type.