Mailing List Archive

[RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
The page order of the slab that gets chosen for a given slab
cache depends on the number of objects that can fit in the
slab while meeting other requirements. We start with a value
of minimum objects based on nr_cpu_ids, which is driven by the
possible number of CPUs and hence could be higher than the
actual number of CPUs present in the system. This leads to
calculate_order() choosing a page order that is on the higher
side, leading to increased slab memory consumption on systems
that have bigger page sizes.

Hence rely on the number of online CPUs when determining the
minimum objects, thereby increasing the chances of choosing
a lower, more conservative page order for the slab.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
This is a generic change and I am unsure how it would affect
other archs, but as a start, here are some numbers from a
PowerPC pseries KVM guest with and without this patch:

This table shows how this change has affected some of the slab
caches.
===================================================================
                       Current                    Patched
Cache           <objperslab> <pagesperslab> <objperslab> <pagesperslab>
===================================================================
TCPv6                53            2             26            1
net_namespace        53            4             26            2
dtl                  32            2             16            1
names_cache          32            2             16            1
task_struct          53            8             13            2
thread_stack         32            8              8            2
pgtable-2^11         16            8              8            4
pgtable-2^8          32            2             16            1
kmalloc-32k          16            8              8            4
kmalloc-16k          32            8              8            2
kmalloc-8k           32            4              8            1
kmalloc-4k           32            2             16            1
===================================================================

Slab memory (kB) consumption comparison
==================================================================
                      Current     Patched
==================================================================
After-boot             205760      156096
During-hackbench       629145      506752  (Avg of 5 runs)
After-hackbench        474176      331840  (after drop_caches)
==================================================================

Hackbench Time (Avg of 5 runs)
(hackbench -s 1024 -l 200 -g 200 -f 25 -P)
==========================================
     Current     Patched
==========================================
     10.990      11.010
==========================================

Measuring the effect due to CPU hotplug
----------------------------------------
Since the patch doesn't consider all the possible CPUs for page
order calculation, let's see how it affects the case when CPUs are
hotplugged. Here I compare a system that is booted with 64 CPUs
against a system that is booted with 16 CPUs and has 48 CPUs
hotplugged after boot. These numbers are with the patch applied.

Slab memory (kB) consumption comparison
===================================================================
                   64bootCPUs    16bootCPUs+48HotPluggedCPUs
===================================================================
After-boot            390272          159744
After-hotplug              -          251328
During-hackbench     1001267          941926  (Avg of 5 runs)
After-hackbench       913600          827200  (after drop_caches)
===================================================================

Hackbench Time (Avg of 5 runs)
(hackbench -s 1024 -l 200 -g 200 -f 25 -P)
===========================================
 64bootCPUs    16bootCPUs+48HotPluggedCPUs
===========================================
   12.554             12.589
===========================================
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 34dcc09e2ec9..8342c0a167b2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects)
-		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+		min_objects = 4 * (fls(num_online_cpus()) + 1);
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);

--
2.26.2
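
For a concrete feel of the heuristic being changed here, below is a
standalone userspace sketch (not kernel code; fls() is reimplemented
with the kernel helper's semantics) of how min_objects scales with the
CPU count:

#include <stdio.h>

/* fls(): 1-based position of the most significant set bit, 0 for 0,
 * matching the semantics of the kernel helper used in calculate_order(). */
static int fls_(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

int main(void)
{
	unsigned int cpus[] = { 1, 4, 16, 64, 224, 256 };

	/* min_objects as calculate_order() derives it when no
	 * slub_min_objects= override is given on the command line */
	for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
		printf("%3u CPUs -> min_objects = %d\n",
		       cpus[i], 4 * (fls_(cpus[i]) + 1));
	return 0;
}

This prints 8 for 1 CPU, 16 for 4, 24 for 16, 32 for 64, 36 for 224 and
40 for 256: min_objects grows only logarithmically with the CPU count,
and the 8 and 36 values come up again later in the thread.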
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On 11/18/20 9:27 AM, Bharata B Rao wrote:
> [...]
>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Ideally, we would react to hotplug events and update existing caches
accordingly. But for that, recalculation of order for existing caches
would have to be made safe, while not affecting hot paths. We have
removed the sysfs interface with 32a6f409b693 ("mm, slub: remove runtime
allocation order changes") as it didn't seem easy and worth the trouble.

In case somebody wants to start with a large order right from the boot
because they know they will hotplug lots of cpus later, they can use
slub_min_objects= boot param to override this heuristic. So in case this
change regresses somebody's performance, there's a way around it and
thus the risk is low IMHO.

Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Wed, Nov 18, 2020 at 12:25:38PM +0100, Vlastimil Babka wrote:
> On 11/18/20 9:27 AM, Bharata B Rao wrote:
> > [...]
> >
> > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> Ideally, we would react to hotplug events and update existing caches
> accordingly. But for that, recalculation of order for existing caches would
> have to be made safe, while not affecting hot paths. We have removed the
> sysfs interface with 32a6f409b693 ("mm, slub: remove runtime allocation
> order changes") as it didn't seem easy and worth the trouble.
>
> In case somebody wants to start with a large order right from the boot
> because they know they will hotplug lots of cpus later, they can use
> slub_min_objects= boot param to override this heuristic. So in case this
> change regresses somebody's performance, there's a way around it and thus
> the risk is low IMHO.

I agree. For the absolute majority of users there will be no difference.
And there is a good workaround for the rest.

Acked-by: Roman Gushchin <guro@fb.com>
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Wed, 18 Nov 2020, Roman Gushchin wrote:

> On Wed, Nov 18, 2020 at 12:25:38PM +0100, Vlastimil Babka wrote:
> > On 11/18/20 9:27 AM, Bharata B Rao wrote:
> > > [...]
> > > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> >
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > Ideally, we would react to hotplug events and update existing caches
> > accordingly. But for that, recalculation of order for existing caches would
> > have to be made safe, while not affecting hot paths. We have removed the
> > sysfs interface with 32a6f409b693 ("mm, slub: remove runtime allocation
> > order changes") as it didn't seem easy and worth the trouble.
> >
> > In case somebody wants to start with a large order right from the boot
> > because they know they will hotplug lots of cpus later, they can use
> > slub_min_objects= boot param to override this heuristic. So in case this
> > change regresses somebody's performance, there's a way around it and thus
> > the risk is low IMHO.
>
> I agree. For the absolute majority of users there will be no difference.
> And there is a good workaround for the rest.
>
> Acked-by: Roman Gushchin <guro@fb.com>
>

Acked-by: David Rientjes <rientjes@google.com>
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
Hi,

On Wed, 18 Nov 2020 at 09:28, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> [...]
>

I'm facing a significant performance regression on a large arm64 server
system (224 CPUs). The regression is also present on a small arm64 system
(8 CPUs), but at a far smaller magnitude.

On the 224 CPUs system: 9 iterations of hackbench -l 16000 -g 16
v5.11-rc4:                     9.135 sec (+/- 0.45%)
v5.11-rc4 + revert this patch: 3.173 sec (+/- 0.48%)
v5.10:                         3.136 sec (+/- 0.40%)

This is a 191% regression compared to v5.10.

The problem is that calculate_order() is called a number of times
before secondary CPUs are booted, when num_online_cpus() returns 1
instead of 224 (so min_objects is just 4 * (fls(1) + 1) = 8). This
makes the use of num_online_cpus() irrelevant for those cases.

After adding "slub_min_objects=36" to my command line, which equals
4 * (fls(num_online_cpus()) + 1) with the correct num_online_cpus() == 224,
the regression disappears:

9 iterations of hackbench -l 16000 -g 16: 3.201 sec (+/- 0.90%)



> diff --git a/mm/slub.c b/mm/slub.c
> index 34dcc09e2ec9..8342c0a167b2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
>  	 */
>  	min_objects = slub_min_objects;
>  	if (!min_objects)
> -		min_objects = 4 * (fls(nr_cpu_ids) + 1);
> +		min_objects = 4 * (fls(num_online_cpus()) + 1);
>  	max_objects = order_objects(slub_max_order, size);
>  	min_objects = min(min_objects, max_objects);
>
> --
> 2.26.2
>
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Wed, Jan 20, 2021 at 06:36:31PM +0100, Vincent Guittot wrote:
> Hi,
>
> On Wed, 18 Nov 2020 at 09:28, Bharata B Rao <bharata@linux.ibm.com> wrote:
> >
> > [...]
>
> I'm facing a significant performance regression on a large arm64 server
> system (224 CPUs). The regression is also present on a small arm64 system
> (8 CPUs), but at a far smaller magnitude.
>
> On the 224 CPUs system: 9 iterations of hackbench -l 16000 -g 16
> v5.11-rc4:                     9.135 sec (+/- 0.45%)
> v5.11-rc4 + revert this patch: 3.173 sec (+/- 0.48%)
> v5.10:                         3.136 sec (+/- 0.40%)
>
> This is a 191% regression compared to v5.10.
>
> The problem is that calculate_order() is called a number of times
> before secondary CPUs are booted, when num_online_cpus() returns 1
> instead of 224. This makes the use of num_online_cpus() irrelevant
> for those cases.
>
> After adding "slub_min_objects=36" to my command line, which equals
> 4 * (fls(num_online_cpus()) + 1) with the correct num_online_cpus() == 224,
> the regression disappears:
>
> 9 iterations of hackbench -l 16000 -g 16: 3.201 sec (+/- 0.90%)

Should we have switched to num_present_cpus() rather than
num_online_cpus()? If so, the below patch should address the
above problem.

From 252b332ccbee7152da1e18f1fff5b83f8e01b8df Mon Sep 17 00:00:00 2001
From: Bharata B Rao <bharata@linux.ibm.com>
Date: Thu, 21 Jan 2021 10:35:08 +0530
Subject: [PATCH] mm/slub: let number of present CPUs determine the slub
page order

Commit 045ab8c9487b ("mm/slub: let number of online CPUs determine
the slub page order") changed the slub page order computation to
depend on num_online_cpus() instead of nr_cpu_ids. However, certain
caches (kmalloc) are initialized even before the secondary CPUs are
onlined, resulting in a lower slub page order and a subsequent
regression.

Switch to num_present_cpus() instead.

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order")
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index d9e4e10683cc..2f3e412c849d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects)
-		min_objects = 4 * (fls(num_online_cpus()) + 1);
+		min_objects = 4 * (fls(num_present_cpus()) + 1);
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);

--
2.26.2
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Thu, 21 Jan 2021 at 06:31, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Wed, Jan 20, 2021 at 06:36:31PM +0100, Vincent Guittot wrote:
> > Hi,
> >
> > On Wed, 18 Nov 2020 at 09:28, Bharata B Rao <bharata@linux.ibm.com> wrote:
> > >
> > > [...]
> >
> > [...]
>
> Should we have switched to num_present_cpus() rather than
> num_online_cpus()? If so, the below patch should address the
> above problem.

The problem is the same with num_present_cpus(), which is initialized
at the same time as num_online_cpus().
Only num_possible_cpus() returns a correct value, just like nr_cpu_ids.

Both num_possible_cpus() and nr_cpu_ids return the number of CPUs of
the platform, not NR_CPUS:
num_possible_cpus() = nr_cpu_ids = 224 from the beginning, whereas
NR_CPUS=256 on my large system
num_possible_cpus() = nr_cpu_ids = 8 from the beginning, whereas
NR_CPUS=256 on my small system


Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Thu, 21 Jan 2021, Bharata B Rao wrote:

> > The problem is that calculate_order() is called a number of times
> > before secondary CPUs are booted, when num_online_cpus() returns 1
> > instead of 224. This makes the use of num_online_cpus() irrelevant
> > for those cases.
> >
> > After adding "slub_min_objects=36" to my command line, which equals
> > 4 * (fls(num_online_cpus()) + 1) with the correct num_online_cpus() == 224,
> > the regression disappears:
> >
> > 9 iterations of hackbench -l 16000 -g 16: 3.201 sec (+/- 0.90%)
>
> Should we have switched to num_present_cpus() rather than
> num_online_cpus()? If so, the below patch should address the
> above problem.

There is certainly an initcall after secondaries are booted where we could
redo the calculate_order()?

Or num_online_cpus needs to be up to date earlier. Why does this issue
not occur on x86? Does x86 have an up-to-date num_online_cpus earlier?
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Thu, 21 Jan 2021 at 11:01, Christoph Lameter <cl@linux.com> wrote:
>
> On Thu, 21 Jan 2021, Bharata B Rao wrote:
>
> > > [...]
> >
> > Should we have switched to num_present_cpus() rather than
> > num_online_cpus()? If so, the below patch should address the
> > above problem.
>
> There is certainly an initcall after secondaries are booted where we could
> redo the calculate_order?
>
> Or the num_online_cpus needs to be up to date earlier. Why does this issue
> not occur on x86? Does x86 have an up to date num_online_cpus earlier?

I have added a printk in calculate_order():

	pr_info("SLUB calculate_order cmd %d min %d online %d present %d possible %d cpus %d",
		slub_min_objects, min_objects, num_online_cpus(),
		num_present_cpus(), num_possible_cpus(), nr_cpu_ids);

And checked with
qemu-system-x86_64 -kernel bzImage -nographic -smp 4 -append "console=ttyS0"

[ 0.064927] SLUB calculate_order cmd 0 min 8 online 1 present 4 possible 4 cpus 4
[ 0.064970] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1

num_online_cpus() behaves the same as on my platform:
num_online_cpus() == 1 when kmem_cache_init() is called.

Only num_present_cpus() is 4 from the beginning, but that's probably
just because it runs in a VM.

Also, it's interesting to notice that num_possible_cpus() and nr_cpu_ids
are set to the correct value.

Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On 1/21/21 11:01 AM, Christoph Lameter wrote:
> On Thu, 21 Jan 2021, Bharata B Rao wrote:
>
>> > The problem is that calculate_order() is called a number of times
>> > before secondary CPUs are booted, when num_online_cpus() returns 1
>> > instead of 224. This makes the use of num_online_cpus() irrelevant
>> > for those cases.
>> >
>> > After adding "slub_min_objects=36" to my command line, which equals
>> > 4 * (fls(num_online_cpus()) + 1) with the correct num_online_cpus() == 224,
>> > the regression disappears:
>> >
>> > 9 iterations of hackbench -l 16000 -g 16: 3.201 sec (+/- 0.90%)

I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
supposed to be a scheduler benchmark? What exactly is going on?

>> Should we have switched to num_present_cpus() rather than
>> num_online_cpus()? If so, the below patch should address the
>> above problem.
>
> There is certainly an initcall after secondaries are booted where we could
> redo the calculate_order?

We could do it even in a hotplug handler. But in practice that means making sure
it's safe, i.e. all users of oo_order/oo_objects must handle the value changing.

Consider e.g. init_cache_random_seq() which uses oo_objects(s->oo) to allocate
s->random_seq when cache s is created. Then shuffle_freelist() will use the
current value of oo_objects(s->oo) to index s->random_seq, for a newly allocated
slab - what if the page order has increased meanwhile due to secondary booting
or hotplug? Array overflow. That's why I just made the former sysfs handler for
changing order read-only.
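
To make the hazard concrete, here is a minimal userspace model (the
names only mirror the roles of s->random_seq and shuffle_freelist() in
mm/slub.c; the scaffolding around them is invented for illustration):

#include <stdio.h>
#include <stdlib.h>

/* Model: random_seq is sized from the object count at cache creation,
 * but indexed with the object count read later. */
struct cache_model {
	unsigned int objects_at_create;	/* length of random_seq */
	unsigned int *random_seq;
};

static void shuffle_freelist_model(struct cache_model *s,
				   unsigned int objects_now)
{
	/* If hotplug raised the page order in the meantime, objects_now
	 * exceeds objects_at_create and this loop reads past the end of
	 * random_seq. */
	for (unsigned int i = 0; i < objects_now; i++)
		printf("seq[%u] = %u\n", i, s->random_seq[i]);
}

int main(void)
{
	struct cache_model s = {
		.objects_at_create = 16,
		.random_seq = calloc(16, sizeof(unsigned int)),
	};

	shuffle_freelist_model(&s, 16);	/* fine: order unchanged */
	shuffle_freelist_model(&s, 32);	/* array overflow */
	free(s.random_seq);
	return 0;
}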

Things would be easier if we could trust *on all arches* either

- num_present_cpus() to count what the hardware really physically has during
boot, even if not yet onlined, at the time we init slab. This would still not
handle later hotplug (probably mostly in a VM scenario, not that somebody would
bring a bunch of actual new cpu boards to a running bare metal system?).

- num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
where it's not really possible to plug more CPU's. In a VM scenario we could
still have an opposite problem, where theoretically "anything is possible" but
the virtual cpus are never added later.

We could also start questioning the very assumption that number of cpus should
affect slab page size in the first place. Should it? After all, each CPU will
have one or more slab pages privately cached, as we discuss in the other
thread... So why make the slab pages also larger?

Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >
> >> > [...]
>
> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> supposed to be a scheduler benchmark? What exactly is going on?
>

From the hackbench description:
    Hackbench is both a benchmark and a stress test for the Linux kernel
    scheduler. Its main job is to create a specified number of pairs of
    schedulable entities (either threads or traditional processes) which
    communicate via either sockets or pipes, and to time how long it
    takes for each pair to send data back and forth.

> >> Should we have switched to num_present_cpus() rather than
> >> num_online_cpus()? If so, the below patch should address the
> >> above problem.
> >
> > There is certainly an initcall after secondaries are booted where we could
> > redo the calculate_order?
>
> We could do it even in hotplug handler. But in practice that means making sure
> it's safe, i.e. all users of oo_order/oo_objects must handle the value changing.
>
> Consider e.g. init_cache_random_seq() which uses oo_objects(s->oo) to allocate
> s->random_seq when cache s is created. Then shuffle_freelist() will use the
> current value of oo_objects(s->oo) to index s->random_seq, for a newly allocated
> slab - what if the page order has increased meanwhile due to secondary booting
> or hotplug? Array overflow. That's why I just made the former sysfs handler for
> changing order read-only.
>
> Things would be easier if we could trust *on all arches* either
>
> - num_present_cpus() to count what the hardware really physically has during
> boot, even if not yet onlined, at the time we init slab. This would still not
> handle later hotplug (probably mostly in a VM scenario, not that somebody would
> >> bring a bunch of actual new cpu boards to a running bare metal system?).
>
> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
> where it's not really possible to plug more CPU's. In a VM scenario we could
> still have an opposite problem, where theoretically "anything is possible" but
> the virtual cpus are never added later.

On all the systems that I have tested, num_possible_cpus()/nr_cpu_ids
were correctly initialized:

large arm64 acpi system
small arm64 DT based system
VM on x86 system

Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On 1/22/21 9:03 AM, Vincent Guittot wrote:
> On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 1/21/21 11:01 AM, Christoph Lameter wrote:
>> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
>> >
>> >> > [...]
>>
>> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
>> supposed to be a scheduler benchmark? What exactly is going on?
>>
>
> From the hackbench description:
>     Hackbench is both a benchmark and a stress test for the Linux kernel
>     scheduler. Its main job is to create a specified number of pairs of
>     schedulable entities (either threads or traditional processes) which
>     communicate via either sockets or pipes, and to time how long it
>     takes for each pair to send data back and forth.

Yep, so I wonder which slab entities this is stressing that much.

>> Things would be easier if we could trust *on all arches* either
>>
>> - num_present_cpus() to count what the hardware really physically has during
>> boot, even if not yet onlined, at the time we init slab. This would still not
>> handle later hotplug (probably mostly in a VM scenario, not that somebody would
>> bring a bunch of actual new cpu boards to a running bare metal system?).
>>
>> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
>> where it's not really possible to plug more CPU's. In a VM scenario we could
>> still have an opposite problem, where theoretically "anything is possible" but
>> the virtual cpus are never added later.
>
> > On all the systems that I have tested, num_possible_cpus()/nr_cpu_ids
> were correctly initialized
>
> large arm64 acpi system
> small arm64 DT based system
> VM on x86 system

So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
by bios or the hypervisor? How does num_present_cpus() look there?

What about a heuristic:
- num_online_cpus() > 1 - we trust that and use it
- otherwise nr_cpu_ids
Would that work? Too arbitrary?
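
A sketch of what such a heuristic could look like (hypothetical, not
merged code; num_online_cpus() and nr_cpu_ids are the existing kernel
symbols):

/* Early in boot only the boot CPU is online, so treat an online count
 * of 1 as "secondaries not up yet" and fall back to nr_cpu_ids;
 * otherwise trust the online count. */
static unsigned int slub_effective_nr_cpus(void)
{
	unsigned int n = num_online_cpus();

	return n > 1 ? n : nr_cpu_ids;
}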


Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >
> >> > [...]
>
> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> supposed to be a scheduler benchmark? What exactly is going on?

Uuuh, I think powerpc doesn't have cmpxchg_double?

"vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
<https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
says under "POWERPC": "no DW LL/SC"

So powerpc is probably hitting the page-bitlock-based implementation
all the time for stuff like __slab_free()? Do you have detailed
profiling results from "perf top" or something like that?

(I actually have some WIP patches and a design document for getting
rid of cmpxchg_double in struct page that I hacked together in the
last couple days; I'm currently in the process of sending them over to
some other folks in the company who hopefully have cycles to
review/polish/benchmark them so that they can be upstreamed, assuming
that those folks think they're important enough. I don't have the
cycles for it...)
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Fri, Jan 22, 2021 at 2:05 PM Jann Horn <jannh@google.com> wrote:
> On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> > >
> > >> > [...]
> >
> > I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> > supposed to be a scheduler benchmark? What exactly is going on?
>
> Uuuh, I think powerpc doesn't have cmpxchg_double?
>
> "vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
> <https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
> says under "POWERPC": "no DW LL/SC"
>
> So powerpc is probably hitting the page-bitlock-based implementation
> all the time for stuff like __slab_free()? Do you have detailed
> profiling results from "perf top" or something like that?
>
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

(The stuff I have in mind will only work on 64-bit though. We are
talking about PPC64 here, right?)
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Fri, 22 Jan 2021 at 13:03, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> [...]
> So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
> by bios or the hypervisor? How does num_present_cpus() look there?

num_present_cpus() starts at 1 until secondary CPUs boot in the arm64 case

Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On 1/22/21 2:05 PM, Jann Horn wrote:
> On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 1/21/21 11:01 AM, Christoph Lameter wrote:
>> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
>> >
>> >> > [...]
>>
>> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
>> supposed to be a scheduler benchmark? What exactly is going on?
>
> Uuuh, I think powerpc doesn't have cmpxchg_double?

The benchmark was done by Vincent on arm64, AFAICS. PowerPC (ppc64) was what
Bharata had used to demonstrate the order calculation change in his patch.

There seems to be some implementation dependency on CONFIG_ARM64_LSE_ATOMICS but
AFAICS that doesn't determine if cmpxchg_double is provided.

> "vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
> <https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
> says under "POWERPC": "no DW LL/SC"

Interesting find in any case.

> So powerpc is probably hitting the page-bitlock-based implementation
> all the time for stuff like __slab_free()? Do you have detailed
> profiling results from "perf top" or something like that?
>
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

I'm curious, so I hope this works out :)
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
On Fri, Jan 22, 2021 at 01:03:57PM +0100, Vlastimil Babka wrote:
> On 1/22/21 9:03 AM, Vincent Guittot wrote:
> > On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> >> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >> >
> >> >> > [...]
> [...]
> So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
> by bios or the hypervisor? How does num_present_cpus() look there?

PowerPC PowerNV Host: (160 cpus)
num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160

PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160

That's what I see on powerpc, hence I thought num_present_cpus() could
be the correct one to use in the slub page order calculation.

>
> What about heuristic:
> - num_online_cpus() > 1 - we trust that and use it
> - otherwise nr_cpu_ids
> Would that work? Too arbitrary?

Looking at the following snippet from include/linux/cpumask.h, it
appears that num_present_cpus() should be a reasonable compromise
between online and possible/nr_cpu_ids to use here.

/*
* The following particular system cpumasks and operations manage
* possible, present, active and online cpus.
*
* cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
* cpu_present_mask - has bit 'cpu' set iff cpu is populated
* cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
* cpu_active_mask - has bit 'cpu' set iff cpu available to migration
*
* If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
*
* The cpu_possible_mask is fixed at boot time, as the set of CPU id's
* that it is possible might ever be plugged in at anytime during the
* life of that system boot. The cpu_present_mask is dynamic(*),
* representing which CPUs are currently plugged in. And
* cpu_online_mask is the dynamic subset of cpu_present_mask,
* indicating those CPUs available for scheduling.
*
* If HOTPLUG is enabled, then cpu_possible_mask is forced to have
* all NR_CPUS bits set, otherwise it is just the set of CPUs that
* ACPI reports present at boot.
*
* If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
* depending on what ACPI reports as currently plugged in, otherwise
* cpu_present_mask is just a copy of cpu_possible_mask.
*
* (*) Well, cpu_present_mask is dynamic in the hotplug case. If not
* hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
*/

So for host systems, present is (usually) equal to possible and for
guest systems present should indicate the CPUs found to be present
at boot time. The intention of my original patch was to use this
metric in slub page order calculation rather than nr_cpu_ids
or num_possible_cpus(), which could be high on guest systems that
typically support CPU hotplug.
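
To make that concrete, here is a small standalone illustration (plain
userspace C, not the kernel code) that plugs the pseries guest counts
from above into the 4 * (fls(ncpus) + 1) formula for slub_min_objects
quoted elsewhere in this thread:

#include <stdio.h>

/* Userspace stand-in for the kernel's fls(): position of the highest set bit. */
static int fls_int(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

int main(void)
{
	/* The pseries KVM guest counts quoted above (-smp 16,maxcpus=160). */
	unsigned int online = 1, present = 16, possible = 160;

	printf("online=%u:   min_objects=%d\n", online, 4 * (fls_int(online) + 1));
	printf("present=%u:  min_objects=%d\n", present, 4 * (fls_int(present) + 1));
	printf("possible=%u: min_objects=%d\n", possible, 4 * (fls_int(possible) + 1));
	return 0;
}

That gives min_objects of 8, 24 and 36 respectively, which is why
present looked like the middle ground to me.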

Regards,
Bharata.
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order [ In reply to ]
+Adding arch arm64 Maintainers

On Sat, 23 Jan 2021 at 06:16, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Fri, Jan 22, 2021 at 01:03:57PM +0100, Vlastimil Babka wrote:
> > On 1/22/21 9:03 AM, Vincent Guittot wrote:
> > > On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
> > >>
> > >> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > >> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> > >> >
> > >> >> > The problem is that calculate_order() is called a number of times
> > >> >> > before secondary CPUs are booted, and it returns 1 instead of 224.
> > >> >> > This makes the use of num_online_cpus() irrelevant for those cases.
> > >> >> >
> > >> >> > After adding "slub_min_objects=36" to my command line, which equals
> > >> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
> > >> >> > the regression disappears:
> > >> >> >
> > >> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> > >>
> > >> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> > >> supposed to be a scheduler benchmark? What exactly is going on?
> > >>
> > >
> > > From hackbench description:
> > > Hackbench is both a benchmark and a stress test for the Linux kernel
> > > scheduler. Its main job is to create a specified number of pairs of
> > > schedulable entities (either threads or traditional processes) which
> > > communicate via either sockets or pipes, and time how long it takes
> > > for each pair to send data back and forth.
> >
> > Yep, so I wonder which slab entities this is stressing that much.
> >
> > >> Things would be easier if we could trust *on all arches* either
> > >>
> > >> - num_present_cpus() to count what the hardware really physically has during
> > >> boot, even if not yet onlined, at the time we init slab. This would still not
> > >> handle later hotplug (probably mostly in a VM scenario, not that somebody would
> > >> bring a bunch of actual new CPU boards to a running bare metal system?).
> > >>
> > >> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
> > >> where it's not really possible to plug more CPU's. In a VM scenario we could
> > >> still have an opposite problem, where theoretically "anything is possible" but
> > >> the virtual cpus are never added later.
> > >
> > > On all the systems that I have tested, num_possible_cpus()/nr_cpu_ids
> > > were correctly initialized:
> > >
> > > large arm64 acpi system
> > > small arm64 DT based system
> > > VM on x86 system
> >
> > So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
> > by bios or the hypervisor? How does num_present_cpus() look there?
>
> PowerPC PowerNV Host: (160 cpus)
> num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160
>
> PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
> num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160
>
> That's what I see on powerpc, hence I thought num_present_cpus() could
> be the correct one to use in slub page order calculation.

num_present_cpus() is set to 1 on arm64 until secondary CPUs boot.

arm64 224cpus acpi host:
num_online_cpus 1 num_present_cpus 1 num_possible_cpus 224 nr_cpu_ids 224
arm64 8cpus DT host:
num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
arm64 8cpus qemu-system-aarch64 (-smp 8,maxcpus=256)
num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8

Then present and online increase to num_possible_cpus once all CPUs are booted.
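
(For anyone who wants to collect the same four counts, a minimal hedged
sketch of a module doing the print is below. A module only shows the
values at load time, though; catching the early numbers that
calculate_order() actually sees would need the print in the boot path
itself.)

#include <linux/module.h>
#include <linux/cpumask.h>

/* Log the four CPU counts discussed in this thread. */
static int __init cpu_counts_init(void)
{
	pr_info("num_online_cpus %u num_present_cpus %u num_possible_cpus %u nr_cpu_ids %u\n",
		num_online_cpus(), num_present_cpus(),
		num_possible_cpus(), nr_cpu_ids);
	return 0;
}
module_init(cpu_counts_init);

static void __exit cpu_counts_exit(void)
{
}
module_exit(cpu_counts_exit);

MODULE_LICENSE("GPL");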

>
> >
> > What about heuristic:
> > - num_online_cpus() > 1 - we trust that and use it
> > - otherwise nr_cpu_ids
> > Would that work? Too arbitrary?
>
> Looking at the following snippet from include/linux/cpumask.h, it
> appears that num_present_cpus() should be a reasonable compromise
> between online and possible/nr_cpu_ids to use here.
>
> /*
> * The following particular system cpumasks and operations manage
> * possible, present, active and online cpus.
> *
> * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
> * cpu_present_mask - has bit 'cpu' set iff cpu is populated
> * cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
> * cpu_active_mask - has bit 'cpu' set iff cpu available to migration
> *
> * If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
> *
> * The cpu_possible_mask is fixed at boot time, as the set of CPU id's
> * that it is possible might ever be plugged in at anytime during the
> * life of that system boot. The cpu_present_mask is dynamic(*),
> * representing which CPUs are currently plugged in. And
> * cpu_online_mask is the dynamic subset of cpu_present_mask,
> * indicating those CPUs available for scheduling.
> *
> * If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> * all NR_CPUS bits set, otherwise it is just the set of CPUs that
> * ACPI reports present at boot.
> *
> * If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
> * depending on what ACPI reports as currently plugged in, otherwise
> * cpu_present_mask is just a copy of cpu_possible_mask.
> *
> * (*) Well, cpu_present_mask is dynamic in the hotplug case. If not
> * hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
> */
>
> So for host systems, present is (usually) equal to possible and for

But "cpu_present_mask varies dynamically, depending on what ACPI
reports as currently plugged in"

So it should vary when secondary CPUs are booted.

> guest systems present should indicate the CPUs found to be present
> at boot time. The intention of my original patch was to use this
> metric in slub page order calculation rather than nr_cpu_ids
> or num_possible_cpus(), which could be high on guest systems that
> typically support CPU hotplug.
>
> Regards,
> Bharata.
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order [ In reply to ]
On Fri, Jan 22, 2021 at 02:05:47PM +0100, Jann Horn wrote:
> On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> > >
> > >> > The problem is that calculate_order() is called a number of times
> > >> > before secondary CPUs are booted, and it returns 1 instead of 224.
> > >> > This makes the use of num_online_cpus() irrelevant for those cases.
> > >> >
> > >> > After adding "slub_min_objects=36" to my command line, which equals
> > >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
> > >> > the regression disappears:
> > >> >
> > >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >
> > I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> > supposed to be a scheduler benchmark? What exactly is going on?
>
> Uuuh, I think powerpc doesn't have cmpxchg_double?
>
> "vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
> <https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
> says under "POWERPC": "no DW LL/SC"
>
> So powerpc is probably hitting the page-bitlock-based implementation
> all the time for stuff like __slub_free()? Do you have detailed
> profiling results from "perf top" or something like that?

I can check that, but the current patch was aimed at reducing
the page order of the slub caches so that they don't end up
consuming more memory on systems with 64K base pages.
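
A rough standalone illustration of that effect (hedged: it ignores
slub_max_order capping and the waste-based backoff that the real
calculate_order() applies, so it shows the direction of the effect
rather than the exact orders in the tables above):

#include <stdio.h>

/* Smallest order whose slab (page_size << order) fits min_objects objects. */
static unsigned int naive_order(unsigned int size, unsigned int min_objects,
				unsigned int page_size)
{
	unsigned int order = 0;

	while ((page_size << order) / size < min_objects)
		order++;
	return order;
}

int main(void)
{
	unsigned int page = 65536, size = 8192;	/* 64K base pages, 8K objects */
	unsigned int mo;

	/* min_objects 36 (possible = 160 CPUs) vs 24 (present = 16 CPUs) */
	for (mo = 24; mo <= 36; mo += 12) {
		unsigned int o = naive_order(size, mo, page);

		printf("min_objects=%u: order %u (%u KB per slab)\n",
		       mo, o, (page << o) / 1024);
	}
	return 0;
}

With 64K base pages, going from min_objects 24 to 36 pushes an
8K-object cache from a 256 KB slab to a 512 KB slab.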

>
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

Sounds interesting, I will watch for its effect on powerpc.
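
(For reference, the page-bit-lock fallback mentioned above follows
roughly the pattern below. This is a heavily simplified, hedged sketch
of mm/slub.c's dual path, with update_freelist() as a made-up stand-in
name for the real __cmpxchg_double_slab() helper.)

static bool update_freelist(struct kmem_cache *s, struct page *page,
			    void *old_fl, unsigned long old_cnt,
			    void *new_fl, unsigned long new_cnt)
{
#ifdef CONFIG_HAVE_CMPXCHG_DOUBLE
	/* arches with a double-word cmpxchg update freelist+counters atomically */
	if (s->flags & __CMPXCHG_DOUBLE)
		return cmpxchg_double(&page->freelist, &page->counters,
				      old_fl, old_cnt, new_fl, new_cnt);
#endif
	/* powerpc and others serialize on a bit spinlock in the page instead */
	slab_lock(page);
	if (page->freelist == old_fl && page->counters == old_cnt) {
		page->freelist = new_fl;
		page->counters = new_cnt;
		slab_unlock(page);
		return true;
	}
	slab_unlock(page);
	return false;
}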

Regards,
Bharata.
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order [ In reply to ]
On 1/23/21 1:32 PM, Vincent Guittot wrote:
>> PowerPC PowerNV Host: (160 cpus)
>> num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160
>>
>> PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
>> num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160
>>
>> That's what I see on powerpc, hence I thought num_present_cpus() could
>> be the correct one to use in slub page order calculation.
>
> num_present_cpus() is set to 1 on arm64 until secondary CPUs boot.
>
> arm64 224cpus acpi host:
> num_online_cpus 1 num_present_cpus 1 num_possible_cpus 224 nr_cpu_ids 224
> arm64 8cpus DT host:
> num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
> arm64 8cpus qemu-system-aarch64 (-smp 8,maxcpus=256)
> num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8

I would have expected num_present_cpus to be 224, 8, 8, respectively.

> Then present and online increase to num_possible_cpus once all CPUs are booted.
>
>>
>> >
>> > What about heuristic:
>> > - num_online_cpus() > 1 - we trust that and use it
>> > - otherwise nr_cpu_ids
>> > Would that work? Too arbitrary?
>>
>> Looking at the following snippet from include/linux/cpumask.h, it
>> appears that num_present_cpus() should be a reasonable compromise
>> between online and possible/nr_cpu_ids to use here.
>>
>> /*
>> * The following particular system cpumasks and operations manage
>> * possible, present, active and online cpus.
>> *
>> * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
>> * cpu_present_mask - has bit 'cpu' set iff cpu is populated
>> * cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
>> * cpu_active_mask - has bit 'cpu' set iff cpu available to migration
>> *
>> * If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
>> *
>> * The cpu_possible_mask is fixed at boot time, as the set of CPU id's
>> * that it is possible might ever be plugged in at anytime during the
>> * life of that system boot. The cpu_present_mask is dynamic(*),
>> * representing which CPUs are currently plugged in. And
>> * cpu_online_mask is the dynamic subset of cpu_present_mask,
>> * indicating those CPUs available for scheduling.
>> *
>> * If HOTPLUG is enabled, then cpu_possible_mask is forced to have
>> * all NR_CPUS bits set, otherwise it is just the set of CPUs that
>> * ACPI reports present at boot.
>> *
>> * If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
>> * depending on what ACPI reports as currently plugged in, otherwise
>> * cpu_present_mask is just a copy of cpu_possible_mask.
>> *
>> * (*) Well, cpu_present_mask is dynamic in the hotplug case. If not
>> * hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
>> */
>>
>> So for host systems, present is (usually) equal to possible and for
>
> But "cpu_present_mask varies dynamically, depending on what ACPI
> reports as currently plugged in"
>
> So it should vary when secondary CPUs are booted.

Hm, but booting the secondaries is just a software (kernel) action? They are
already physically there, so it seems to me as if the cpu_present_mask is not
populated correctly on arm64, and it's just a mirror of cpu_online_mask?
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order [ In reply to ]
On Thu 21-01-21 19:19:21, Vlastimil Babka wrote:
[...]
> We could also start questioning the very assumption that number of cpus should
> affect slab page size in the first place. Should it? After all, each CPU will
> have one or more slab pages privately cached, as we discuss in the other
> thread... So why make the slab pages also larger?

I do agree. What is the actual justification for this scaling?
/*
* Attempt to find best configuration for a slab. This
* works by first attempting to generate a layout with
* the best configuration and backing off gradually.
*
* First we increase the acceptable waste in a slab. Then
* we reduce the minimum objects required in a slab.
*/

doesn't speak about CPUs. 9b2cd506e5f2 ("slub: Calculate min_objects
based on number of processors.") does talk about hackbench ("This has
been shown to address the performance issues in hackbench on 16p etc.")
but it doesn't give any more detail on _why_ that actually works.

This thread shows that this is still somehow related to performance, but
the real reason is not clear. I believe we should be focusing on the
actual reasons for the performance impact rather than playing with some
fancy math and tuning for a benchmark on a particular machine, which
doesn't work for others due to subtle initialization timing issues.

Fundamentally, why should a higher number of CPUs imply a larger slab
size in the first place?
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order [ In reply to ]
On Tue, 26 Jan 2021 at 09:52, Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 21-01-21 19:19:21, Vlastimil Babka wrote:
> [...]
> > We could also start questioning the very assumption that number of cpus should
> > affect slab page size in the first place. Should it? After all, each CPU will
> > have one or more slab pages privately cached, as we discuss in the other
> > thread... So why make the slab pages also larger?
>
> I do agree. What is the actual justification for this scaling?
> /*
> * Attempt to find best configuration for a slab. This
> * works by first attempting to generate a layout with
> * the best configuration and backing off gradually.
> *
> * First we increase the acceptable waste in a slab. Then
> * we reduce the minimum objects required in a slab.
> */
>
> doesn't speak about CPUs. 9b2cd506e5f2 ("slub: Calculate min_objects
> based on number of processors.") does talk about hackbench ("This has
> been shown to address the performance issues in hackbench on 16p etc.")
> but it doesn't give any more detail on _why_ that actually works.
>
> This thread shows that this is still somehow related to performance, but
> the real reason is not clear. I believe we should be focusing on the
> actual reasons for the performance impact rather than playing with some
> fancy math and tuning for a benchmark on a particular machine, which
> doesn't work for others due to subtle initialization timing issues.
>
> Fundamentally, why should a higher number of CPUs imply a larger slab
> size in the first place?

A first answer is that the activity and the number of threads involved
scale with the number of CPUs. Taking the hackbench benchmark as an
example, the number of groups/threads rises to a higher level on the
server than on the small system, which doesn't seem unreasonable.

On 8 CPUs, I run hackbench with up to 16 groups, which means 16*40
threads. But I go up to 256 groups, which means 256*40 threads, on
the 224-CPU system. In fact, hackbench -g 1 (with 1 group) doesn't
regress on the 224-CPU system. The next test, with 4 groups, starts
to regress by -7%. But the next one, hackbench -g 16, regresses by 187%
(duration is almost 3 times longer). It seems reasonable to assume
that the number of running threads and resources scales with the number
of CPUs because we want to run more stuff.


> --
> Michal Hocko
> SUSE Labs
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order [ In reply to ]
On Tue 26-01-21 14:38:14, Vincent Guittot wrote:
> On Tue, 26 Jan 2021 at 09:52, Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 21-01-21 19:19:21, Vlastimil Babka wrote:
> > [...]
> > > We could also start questioning the very assumption that number of cpus should
> > > affect slab page size in the first place. Should it? After all, each CPU will
> > > have one or more slab pages privately cached, as we discuss in the other
> > > thread... So why make the slab pages also larger?
> >
> > I do agree. What is the actual justification for this scaling?
> > /*
> > * Attempt to find best configuration for a slab. This
> > * works by first attempting to generate a layout with
> > * the best configuration and backing off gradually.
> > *
> > * First we increase the acceptable waste in a slab. Then
> > * we reduce the minimum objects required in a slab.
> > */
> >
> > doesn't speak about CPUs. 9b2cd506e5f2 ("slub: Calculate min_objects
> > based on number of processors.") does talk about hackbench ("This has
> > been shown to address the performance issues in hackbench on 16p etc.")
> > but it doesn't give any more detail on _why_ that actually works.
> >
> > This thread shows that this is still somehow related to performance, but
> > the real reason is not clear. I believe we should be focusing on the
> > actual reasons for the performance impact rather than playing with some
> > fancy math and tuning for a benchmark on a particular machine, which
> > doesn't work for others due to subtle initialization timing issues.
> >
> > Fundamentally, why should a higher number of CPUs imply a larger slab
> > size in the first place?
>
> A first answer is that the activity and the number of threads involved
> scale with the number of CPUs. Taking the hackbench benchmark as an
> example, the number of groups/threads rises to a higher level on the
> server than on the small system, which doesn't seem unreasonable.
>
> On 8 CPUs, I run hackbench with up to 16 groups, which means 16*40
> threads. But I go up to 256 groups, which means 256*40 threads, on
> the 224-CPU system. In fact, hackbench -g 1 (with 1 group) doesn't
> regress on the 224-CPU system. The next test, with 4 groups, starts
> to regress by -7%. But the next one, hackbench -g 16, regresses by 187%
> (duration is almost 3 times longer). It seems reasonable to assume
> that the number of running threads and resources scales with the number
> of CPUs because we want to run more stuff.

OK, I do understand that more jobs scale with the number of CPUs, but I
would also expect that higher-order pages are generally more expensive
to get, so this is not really clear cut, especially under heavier memory
demand where allocations are no longer smooth. So the question really is
whether this is not just optimizing for artificial conditions.
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order [ In reply to ]
On Tue, 26 Jan 2021, Will Deacon wrote:

> > Hm, but booting the secondaries is just a software (kernel) action? They are
> > already physically there, so it seems to me as if the cpu_present_mask is not
> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
>
> I think the present_mask retains CPUs if they are hotplugged off, whereas
> the online mask does not. We can't really do any better on arm64, as there's
> no way of telling that a CPU is present until we've seen it.

The order of each page in a kmem cache (and therefore also the number
of objects in a slab page) can be different, because that information is
stored in the page struct.

Therefore it is possible to retune the order while the cache is in operation.

This means you can run an initcall after all CPUs have been brought up to
set the order and number of objects in a slab page differently.

The older slab pages will continue to exist with the old orders until they
are freed.
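
A hedged sketch of such an initcall, as it might look inside mm/slub.c,
is below. slub_recalc_order() is hypothetical (mm/slub.c has no such
helper today), while slab_mutex and the slab_caches list are the real
cache registry:

static int __init slub_retune_orders(void)
{
	struct kmem_cache *s;

	/* secondary CPUs are already up by late_initcall time */
	mutex_lock(&slab_mutex);
	list_for_each_entry(s, &slab_caches, list)
		slub_recalc_order(s);	/* hypothetical: recompute s->oo */
	mutex_unlock(&slab_mutex);
	return 0;
}
late_initcall(slub_retune_orders);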
Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order [ In reply to ]
On Mon, Jan 25, 2021 at 12:20:14PM +0100, Vlastimil Babka wrote:
> On 1/23/21 1:32 PM, Vincent Guittot wrote:
> >> PowerPC PowerNV Host: (160 cpus)
> >> num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160
> >>
> >> PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
> >> num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160
> >>
> >> That's what I see on powerpc, hence I thought num_present_cpus() could
> >> be the correct one to use in slub page order calculation.
> >
> > num_present_cpus() is set to 1 on arm64 until secondary CPUs boot.
> >
> > arm64 224cpus acpi host:
> > num_online_cpus 1 num_present_cpus 1 num_possible_cpus 224 nr_cpu_ids 224
> > arm64 8cpus DT host:
> > num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
> > arm64 8cpus qemu-system-aarch64 (-smp 8,maxcpus=256)
> > num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
>
> I would have expected num_present_cpus to be 224, 8, 8, respectively.
>
> > Then present and online increase to num_possible_cpus once all CPUs are booted.
> >
> >>
> >> >
> >> > What about heuristic:
> >> > - num_online_cpus() > 1 - we trust that and use it
> >> > - otherwise nr_cpu_ids
> >> > Would that work? Too arbitrary?
> >>
> >> Looking at the following snippet from include/linux/cpumask.h, it
> >> appears that num_present_cpus() should be a reasonable compromise
> >> between online and possible/nr_cpu_ids to use here.
> >>
> >> /*
> >> * The following particular system cpumasks and operations manage
> >> * possible, present, active and online cpus.
> >> *
> >> * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
> >> * cpu_present_mask - has bit 'cpu' set iff cpu is populated
> >> * cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
> >> * cpu_active_mask - has bit 'cpu' set iff cpu available to migration
> >> *
> >> * If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
> >> *
> >> * The cpu_possible_mask is fixed at boot time, as the set of CPU id's
> >> * that it is possible might ever be plugged in at anytime during the
> >> * life of that system boot. The cpu_present_mask is dynamic(*),
> >> * representing which CPUs are currently plugged in. And
> >> * cpu_online_mask is the dynamic subset of cpu_present_mask,
> >> * indicating those CPUs available for scheduling.
> >> *
> >> * If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> >> * all NR_CPUS bits set, otherwise it is just the set of CPUs that
> >> * ACPI reports present at boot.
> >> *
> >> * If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
> >> * depending on what ACPI reports as currently plugged in, otherwise
> >> * cpu_present_mask is just a copy of cpu_possible_mask.
> >> *
> >> * (*) Well, cpu_present_mask is dynamic in the hotplug case. If not
> >> * hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
> >> */
> >>
> >> So for host systems, present is (usually) equal to possible and for
> >
> > But "cpu_present_mask varies dynamically, depending on what ACPI
> > reports as currently plugged in"
> >
> > So it should vary when secondary CPUs are booted.
>
> Hm, but booting the secondaries is just a software (kernel) action? They are
> already physically there, so it seems to me as if the cpu_present_mask is not
> populated correctly on arm64, and it's just a mirror of cpu_online_mask?

I think the present_mask retains CPUs if they are hotplugged off, whereas
the online mask does not. We can't really do any better on arm64, as there's
no way of telling that a CPU is present until we've seen it.

Will
