Mailing List Archive

Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On 1/27/21 10:10 AM, Christoph Lameter wrote:
> On Tue, 26 Jan 2021, Will Deacon wrote:
>
>> > Hm, but booting the secondaries is just a software (kernel) action? They are
>> > already physically there, so it seems to me as if the cpu_present_mask is not
>> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
>>
>> I think the present_mask retains CPUs if they are hotplugged off, whereas
>> the online mask does not. We can't really do any better on arm64, as there's
>> no way of telling that a CPU is present until we've seen it.
>
> The order of each page in a kmem cache --and therefore also the number
> of objects in a slab page-- can be different because that information is
> stored in the page struct.
>
> Therefore it is possible to retune the order while the cache is in operation.

Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
is enabled, see [1].

But as a quick fix for the regression, the heuristic idea could work reasonably
on all architectures?
- if num_present_cpus() is > 1, trust that it doesn't have an issue such as
the one on arm64, and use it
- otherwise use nr_cpu_ids (rough sketch below)

Long-term we can attempt to do the retuning safely, or decide that the number of
cpus shouldn't determine the order...

[1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/
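
A minimal sketch of that heuristic, assuming it would replace the cpu count
currently fed into calculate_order()'s min_objects calculation in mm/slub.c
(illustrative only, not a tested patch; the helper name is made up):

static unsigned int slub_nr_cpus(void)
{
        unsigned int nr_cpus = num_present_cpus();

        /*
         * If only the boot CPU is marked present, the architecture
         * (e.g. arm64) may be unable to enumerate CPUs this early,
         * so fall back to the possible-cpu count instead.
         */
        if (nr_cpus <= 1)
                nr_cpus = nr_cpu_ids;

        return nr_cpus;
}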

> This means you can run an initcall after all cpus have been brought up to
> set the order and number of objects in a slab page differently.
>
> The older slab pages will continue to exist with the old orders until they
> are freed.
>
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On 1/26/21 2:59 PM, Michal Hocko wrote:
>>
>> On 8 CPUs, I ran hackbench with up to 16 groups, which means 16*40
>> threads. But I raised that to 256 groups, which means 256*40 threads, on
>> the 224 CPUs system. In fact, hackbench -g 1 (with 1 group) doesn't
>> regress on the 224 CPUs system. The next test with 4 groups starts
>> to regress by 7%. But the next one, hackbench -g 16, regresses by 187%
>> (the duration is almost 3 times longer). It seems reasonable to assume
>> that the number of running threads and resources scale with the number
>> of CPUs because we want to run more stuff.
>
> OK, I do understand that more jobs scale with the number of CPUs but I
> would also expect that higher order pages are generally more expensive
> to get, so this is not really clear-cut, especially under some more
> demand on the memory where allocations are smooth. So the question
> really is whether this is not just optimizing for artificial conditions.

FWIW, I enabled CONFIG_SLUB_STATS and ran "hackbench -l 16000 -g 16" in a
(small) VM, checked tools/vm/slabinfo -DA as per the config option's help,
and it seems to be these 2 caches that are stressed:

Name                Objects     Alloc      Free %Fast Fallb O  CmpX UL
kmalloc-512             812  25655535  25654908  71 1     0 0 20082  0
skbuff_head_cache       304  25602632  25602632  84 1     0 0 11241  0

I guess larger pages mean more batched per-cpu allocations without going to the
shared structures or even the page allocator. But a 3x longer duration is still
surprising to me. I'll dig more.
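
As a back-of-the-envelope illustration (assuming 4K pages and ignoring any
per-object overhead): an order-0 slab holds 4096/512 = 8 kmalloc-512 objects,
while an order-2 slab holds 16384/512 = 32, so the ~25.6 million allocations
above would need roughly 3.2 million slab refills at order 0 versus roughly
0.8 million at order 2, i.e. about 4x more trips past the per-cpu fastpath
to the shared structures.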
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Tue, Jan 26, 2021 at 02:59:18PM +0100, Michal Hocko wrote:
> > > This thread shows that this is still somehow related to performance but
> > > the real reason is not clear. I believe we should be focusing on the
> > > actual reasons for the performance impact rather than playing with some
> > > fancy math and tuning for a benchmark on a particular machine which doesn't
> > > work for others due to subtle initialization timing issues.
> > >
> > > Fundamentally, why should a higher number of CPUs imply the size of a slab
> > > in the first place?
> >
> > A first answer is that the activity and the number of threads involved
> > scale with the number of CPUs. Taking the hackbench benchmark as
> > an example, the number of groups/threads rises to a higher level on the
> > server than on the small system, which doesn't seem unreasonable.
> >
> > On 8 CPUs, I ran hackbench with up to 16 groups, which means 16*40
> > threads. But I raised that to 256 groups, which means 256*40 threads, on
> > the 224 CPUs system. In fact, hackbench -g 1 (with 1 group) doesn't
> > regress on the 224 CPUs system. The next test with 4 groups starts
> > to regress by 7%. But the next one, hackbench -g 16, regresses by 187%
> > (the duration is almost 3 times longer). It seems reasonable to assume
> > that the number of running threads and resources scale with the number
> > of CPUs because we want to run more stuff.
>
> OK, I do understand that more jobs scale with the number of CPUs but I
> would also expect that higher order pages are generally more expensive
> to get, so this is not really clear-cut, especially under some more
> demand on the memory where allocations are smooth. So the question
> really is whether this is not just optimizing for artificial conditions.

The flip side is that smaller orders increase zone lock contention, and that
contention can scale with the number of CPUs, so it's partially related.
hackbench-sockets is an extreme case (pipetest is not affected) but it's
the messenger here.

On an x86-64 2-socket 40-core (80 threads) machine, comparing a revert
of the patch with vanilla 5.11-rc5 gives:

hackbench-process-sockets
                             5.11-rc5             5.11-rc5
                      revert-lockstat     vanilla-lockstat
Amean     1          1.1560 (   0.00%)     1.0633 *   8.02%*
Amean     4          2.0797 (   0.00%)     2.5470 * -22.47%*
Amean     7          3.2693 (   0.00%)     4.3433 * -32.85%*
Amean     12         5.2043 (   0.00%)     6.5600 * -26.05%*
Amean     21        10.5817 (   0.00%)    11.3320 *  -7.09%*
Amean     30        13.3923 (   0.00%)    15.5817 * -16.35%*
Amean     48        20.3893 (   0.00%)    23.6733 * -16.11%*
Amean     79        31.4210 (   0.00%)    38.2787 * -21.83%*
Amean     110       43.6177 (   0.00%)    53.8847 * -23.54%*
Amean     141       56.3840 (   0.00%)    68.4257 * -21.36%*
Amean     172       70.0577 (   0.00%)    85.0077 * -21.34%*
Amean     203       81.9717 (   0.00%)   100.7137 * -22.86%*
Amean     234       95.1900 (   0.00%)   116.0280 * -21.89%*
Amean     265      108.9097 (   0.00%)   130.4307 * -19.76%*
Amean     296      119.7470 (   0.00%)   142.3637 * -18.89%*

i.e. the patch incurs a 7% to 32% performance penalty. This bisected
cleanly yesterday when I was looking for the regression and then found
the thread.

Numerous caches change size. For example, kmalloc-512 goes from order-0
(vanilla) to order-2 with the revert.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

VANILLA
&zone->lock: 1202731 1203433 0.07 120.55 1555485.48 1.29 8920825 12537091 0.06 84.10 9855085.12 0.79
-----------
&zone->lock 61903 [<00000000b47dc96a>] free_one_page+0x3f/0x530
&zone->lock 7655 [<00000000099f6e05>] get_page_from_freelist+0x475/0x1370
&zone->lock 36529 [<0000000075b9b918>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock 1097346 [<00000000b8e4950a>] get_page_from_freelist+0xaf0/0x1370
-----------
&zone->lock 44716 [<00000000099f6e05>] get_page_from_freelist+0x475/0x1370
&zone->lock 69813 [<0000000075b9b918>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock 31596 [<00000000b47dc96a>] free_one_page+0x3f/0x530
&zone->lock 1057308 [<00000000b8e4950a>] get_page_from_freelist+0xaf0/0x1370

REVERT
&zone->lock: 735827 739037 0.06 66.12 699661.56 0.95 4095299 7757942 0.05 54.35 5670083.68 0.73
-----------
&zone->lock 101927 [<00000000a60d5f86>] free_one_page+0x3f/0x530
&zone->lock 626426 [<00000000122cecf3>] get_page_from_freelist+0xaf0/0x1370
&zone->lock 9207 [<0000000068b9c9a1>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock 1477 [<00000000f856e720>] get_page_from_freelist+0x475/0x1370
-----------
&zone->lock 6249 [<00000000f856e720>] get_page_from_freelist+0x475/0x1370
&zone->lock 92224 [<00000000a60d5f86>] free_one_page+0x3f/0x530
&zone->lock 19690 [<0000000068b9c9a1>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock 620874 [<00000000122cecf3>] get_page_from_freelist+0xaf0/0x1370

Each individual wait time is small but the maximum wait time (waittime-max) is
roughly double with vanilla (120us vs 66us with the patch reverted). Total wait
time is also roughly doubled due to the patch, and acquisitions are almost doubled.

So mostly this is down to the number of times SLUB calls into the page
allocator which only caches order-0 pages on a per-cpu basis. I do have
a prototype for a high-order per-cpu allocator but it is very rough --
high watermarks stop making sense, code is rough, memory needed for the
pcpu structures quadruples etc.
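
To make the order-0 point concrete, here is a heavily simplified sketch of the
5.11-era zone allocation path (illustrative only, not the real mm/page_alloc.c
code; take_from_pcplist() and take_from_buddy() are made-up placeholders for
the pcplist and buddy paths):

/* placeholders, not real mm/ APIs */
struct page *take_from_pcplist(struct zone *zone);
struct page *take_from_buddy(struct zone *zone, unsigned int order);

struct page *alloc_pages_from_zone_sketch(struct zone *zone, unsigned int order)
{
        unsigned long flags;
        struct page *page;

        if (order == 0) {
                /* Order-0 requests are served from the per-cpu page lists;
                 * zone->lock is only taken for batched refills and drains. */
                page = take_from_pcplist(zone);
                if (page)
                        return page;
        }

        /* Any higher order (e.g. an order-2 or order-3 slab page) goes
         * straight to the buddy freelists under zone->lock, which is the
         * contention visible in the lockstat data above. */
        spin_lock_irqsave(&zone->lock, flags);
        page = take_from_buddy(zone, order);
        spin_unlock_irqrestore(&zone->lock, flags);

        return page;
}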


--
Mel Gorman
SUSE Labs
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Thu 28-01-21 13:45:12, Mel Gorman wrote:
[...]
> So mostly this is down to the number of times SLUB calls into the page
> allocator which only caches order-0 pages on a per-cpu basis. I do have
> a prototype for a high-order per-cpu allocator but it is very rough --
> high watermarks stop making sense, code is rough, memory needed for the
> pcpu structures quadruples etc.

Thanks, this is really useful. But it really raises the question whether this
is a general case or more of an exception. And as such maybe we want to
define high-throughput caches which would gain higher order pages to
keep pace with allocations and reduce the churn, or deploy some other
techniques to reduce the direct page allocator involvement.
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Thu, Jan 28, 2021 at 02:57:10PM +0100, Michal Hocko wrote:
> On Thu 28-01-21 13:45:12, Mel Gorman wrote:
> [...]
> > So mostly this is down to the number of times SLUB calls into the page
> > allocator which only caches order-0 pages on a per-cpu basis. I do have
> > a prototype for a high-order per-cpu allocator but it is very rough --
> > high watermarks stop making sense, code is rough, memory needed for the
> > pcpu structures quadruples etc.
>
> Thanks, this is really useful. But it really raises the question whether this
> is a general case or more of an exception. And as such maybe we want to
> define high-throughput caches which would gain higher order pages to
> keep pace with allocations and reduce the churn, or deploy some other
> techniques to reduce the direct page allocator involvement.

I don't think we want to define "high-throughput caches" because it'll
be workload dependent and a game of whack-a-mole. If the "high-throughput
cache" is a kmalloc cache for some set of workloads and one of the inode
caches or dcaches for another one, there will be no setting that is
universally good.

--
Mel Gorman
SUSE Labs
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Wed, Jan 27, 2021 at 12:04:01PM +0100, Vlastimil Babka wrote:
> On 1/27/21 10:10 AM, Christoph Lameter wrote:
> > On Tue, 26 Jan 2021, Will Deacon wrote:
> >
> >> > Hm, but booting the secondaries is just a software (kernel) action? They are
> >> > already physically there, so it seems to me as if the cpu_present_mask is not
> >> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
> >>
> >> I think the present_mask retains CPUs if they are hotplugged off, whereas
> >> the online mask does not. We can't really do any better on arm64, as there's
> >> no way of telling that a CPU is present until we've seen it.
> >
> > The order of each page in a kmem cache --and therefore also the number
> > of objects in a slab page-- can be different because that information is
> > stored in the page struct.
> >
> > Therefore it is possible to retune the order while the cache is in operation.
>
> Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
> is enabled, see [1].
>
> But as a quick fix for the regression, the heuristic idea could work reasonably
> on all architectures?
> - if num_present_cpus() is > 1, trust that it doesn't have an issue such as
> the one on arm64, and use it
> - otherwise use nr_cpu_ids
>
> Long-term we can attempt to do the retuning safely, or decide that the number
> of cpus shouldn't determine the order...
>
> [1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/

So what is preferable here now? The above, some other quick fix, or reverting
the original commit?

Regards,
Bharata.
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Wed, 3 Feb 2021 at 12:10, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Wed, Jan 27, 2021 at 12:04:01PM +0100, Vlastimil Babka wrote:
> > On 1/27/21 10:10 AM, Christoph Lameter wrote:
> > > On Tue, 26 Jan 2021, Will Deacon wrote:
> > >
> > >> > Hm, but booting the secondaries is just a software (kernel) action? They are
> > >> > already physically there, so it seems to me as if the cpu_present_mask is not
> > >> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
> > >>
> > >> I think the present_mask retains CPUs if they are hotplugged off, whereas
> > >> the online mask does not. We can't really do any better on arm64, as there's
> > >> no way of telling that a CPU is present until we've seen it.
> > >
> > > The order of each page in a kmem cache --and therefore also the number
> > > of objects in a slab page-- can be different because that information is
> > > stored in the page struct.
> > >
> > > Therefore it is possible to retune the order while the cache is in operation.
> >
> > Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
> > is enabled, see [1].
> >
> > But as a quick fix for the regression, the heuristic idea could work reasonably
> > on all architectures?
> > - if num_present_cpus() is > 1, trust that it doesn't have an issue such as
> > the one on arm64, and use it
> > - otherwise use nr_cpu_ids
> >
> > Long-term we can attempt to do the retuning safely, or decide that the number
> > of cpus shouldn't determine the order...
> >
> > [1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/
>
> So what is preferable here now? The above, some other quick fix, or reverting
> the original commit?

I'm fine with whatever solution, as long as we can keep using
nr_cpu_ids when other values, like num_present_cpus(), don't correctly
reflect the system.

Regards,
Vincent

>
> Regards,
> Bharata.
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Thu, 4 Feb 2021, Vincent Guittot wrote:

> > So what is preferable here now? The above, some other quick fix, or reverting
> > the original commit?
>
> I'm fine with whatever solution, as long as we can keep using
> nr_cpu_ids when other values, like num_present_cpus(), don't correctly
> reflect the system.

AFAICT they are correctly reflecting the current state of the system.

The problem here is the bringup of the system and the tuning therefor.

One additional thing that may help: The slab caches can work in a degraded
mode where no fastpath allocations can occur. That mode is used primarily
for debugging, but maybe it can also help during bootstrap to avoid
having to deal with the per-cpu data and so on.

In degraded mode SLUB will take a lock for each operation on an object.

In this mode the following is true

kmem_cache_cpu->page == NULL
kmem_cache_cpu->freelist == NULL

kmem_cache_debug(s) == true

So if you define a new debug mode and include it in SLAB_DEBUG_FLAGS then
you can force SLUB to fall back to operations where a lock is taken and
where slab allocation can be stopped. This may be ok for bring-up.

The debug flags are also tied to some wizardry that can patch the code at
runtime to optimize for debugging or fast operations. You would tie into
that one as well. Start in debug mode by default and switch to fast
operations after all processors are up.
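
A very rough sketch of that idea, with a hypothetical SLAB_BOOTSTRAP flag (the
flag name, its bit value and the helper below are made up for illustration;
a real implementation would also need to handle the static-key/code-patching
side mentioned above):

/* mm/slab.h (sketch): a hypothetical flag marking caches as "still in
 * bootstrap", folded into SLAB_DEBUG_FLAGS so that kmem_cache_debug(s)
 * is true and SLUB takes the slow, locked paths instead of the per-cpu
 * fastpath. */
#define SLAB_BOOTSTRAP          ((slab_flags_t __force)0x10000000U)     /* hypothetical */

#define SLAB_DEBUG_FLAGS        (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
                                 SLAB_TRACE | SLAB_CONSISTENCY_CHECKS | \
                                 SLAB_BOOTSTRAP)

/* mm/slub.c (sketch): once all CPUs are up, drop the flag and retune the
 * order now that the online/present cpu counts are meaningful. */
static int __init slab_finish_bootstrap(void)
{
        struct kmem_cache *s;

        mutex_lock(&slab_mutex);
        list_for_each_entry(s, &slab_caches, list) {
                s->flags &= ~SLAB_BOOTSTRAP;
                /* recompute the order/min objects here, plus whatever
                 * runtime patching the real debug flags rely on */
        }
        mutex_unlock(&slab_mutex);
        return 0;
}
late_initcall(slab_finish_bootstrap);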
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On 2/3/21 12:10 PM, Bharata B Rao wrote:
> On Wed, Jan 27, 2021 at 12:04:01PM +0100, Vlastimil Babka wrote:
>> Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
>> is enabled, see [1].
>>
>> But as a quick fix for the regression, the heuristic idea could work reasonably
>> on all architectures?
>> - if num_present_cpus() is > 1, trust that it doesn't have an issue such as
>> the one on arm64, and use it
>> - otherwise use nr_cpu_ids
>>
>> Long-term we can attempt to do the retuning safely, or decide that the number
>> of cpus shouldn't determine the order...
>>
>> [1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/
>
> So what is preferable here now? The above, some other quick fix, or reverting
> the original commit?

I would try the above first; in case it doesn't work, revert. That is the
immediate fix for the regression that people can safely backport.
Anything more complex will take more time and would be riskier to backport.

> Regards,
> Bharata.
>
