Mailing List Archive

Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Tue, May 30, 2023 at 06:17:11PM -0400, Linus Torvalds wrote:
>On Tue, May 30, 2023 at 3:41 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>>
>> OK thanks! So just to confirm, it seems fine to return the same error
>> code if duplicates wait, or do you prefer for some reason for there to
>> be an exception and return -EEXIST if the module did succeed in the
>> duplicate case?
>
>I think either should be fine, since either was possible before.
>
>By definition, these are module loads being done in parallel, and so
>any of them "could" have been the first, and returned success before.
>
>And by extension, any of them could have been not first, and returned
>-EEXIST if somebody else loaded the same module first.
>
>So that "somebody else did a load" code:
>
>	if (idempotent(&idem, file_inode(f))) {
>		wait_for_completion(&idem.complete);
>		return idem.ret;
>	}
>
>could certainly have made the return value be something like
>
> return idem.ret ? : -EEXIST;

yes, this is what I had in mind.

>
>instead of that "return idem.ret".
>
>But it does seem simpler - and more in line with the conceptual
>"loading the same module is an idempotent operation" of the patch -
>to just always return the success value to all of them.
>
>After all, they all did in some sense succeed to get that module
>loaded, even if it was a communal effort, and some threads did more
>than others...
>
>As mentioned, I don't think it can matter either way, since any of the
>callers might as well have been the successful one, and they would
>basically have to act the same way regardless (ie "somebody else
>succeeded" and "you succeeded" are basically equivalent return
>values). If the module was a prerequisite for another module being
>loaded, either -EEXIST or 0 _is_ a success case.
>
> Linus

agreed, it will just be a slightly different behavior if finit_module()
is called twice and the first call is already in the process of
initializing the module, i.e. complete_formation() was already called,
putting the module in the MODULE_STATE_COMING state, as per
kernel/module/main.c:add_unformed_module():

	/*
	 * We are here only when the same module was being loaded. Do
	 * not try to load it again right now. It prevents long delays
	 * caused by serialized module load failures. It might happen
	 * when more devices of the same type trigger load of
	 * a particular module.
	 */
	if (old && old->state == MODULE_STATE_LIVE)
		err = -EEXIST;
	else
		err = -EBUSY;
	goto out;
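
For reference, the states involved come from enum module_state in
include/linux/module.h:

enum module_state {
	MODULE_STATE_LIVE,	/* Normal state. */
	MODULE_STATE_COMING,	/* Full formation, running module_init. */
	MODULE_STATE_GOING,	/* Going away. */
	MODULE_STATE_UNFORMED,	/* Still setting it up. */
};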

in userspace we already deal with that in a special way, and it should be
compatible with returning 0 for all practical purposes.
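
A rough sketch of the kind of userspace handling being referred to --
purely illustrative and hypothetical, not kmod's actual code:

#include <errno.h>

/*
 * Hypothetical: classify a failed finit_module() call, given the errno
 * it set (0 meaning the call succeeded outright).
 */
static int classify_finit_errno(int err)
{
	switch (err) {
	case 0:
		return 0;	/* loaded by this call */
	case EEXIST:
		return 0;	/* module already live: treat as success */
	case EBUSY:
		return EAGAIN;	/* same module still initializing; retry/wait */
	default:
		return err;	/* real failure */
	}
}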

thanks
Lucas De Marchi

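For readers following along, a minimal, self-contained sketch of the
mechanism discussed in this thread. The names (struct idempotent,
idem_hash, idem_lock, IDEM_HASH_BITS) mirror the snippets quoted in the
thread; the function bodies are an illustration of the idea, not the
exact patch:

#include <linux/completion.h>
#include <linux/hash.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct idempotent {
	const void *cookie;		/* here: the module file's inode */
	struct hlist_node entry;
	struct completion complete;
	int ret;
};

#define IDEM_HASH_BITS 8
static struct hlist_head idem_hash[1 << IDEM_HASH_BITS];
static DEFINE_SPINLOCK(idem_lock);

/*
 * Register 'u' for 'cookie'. Returns true if another load with the same
 * cookie is already in flight, in which case the caller should just do
 * wait_for_completion(&u->complete) and return u->ret.
 */
static bool idempotent(struct idempotent *u, const void *cookie)
{
	struct hlist_head *head = idem_hash + hash_ptr(cookie, IDEM_HASH_BITS);
	struct idempotent *existing;
	bool first = true;

	u->ret = 0;
	u->cookie = cookie;
	init_completion(&u->complete);

	spin_lock(&idem_lock);
	hlist_for_each_entry(existing, head, entry) {
		if (existing->cookie == cookie) {
			first = false;
			break;
		}
	}
	hlist_add_head(&u->entry, head);
	spin_unlock(&idem_lock);

	return !first;
}

/*
 * The first loader calls this when done: propagate its return value to
 * every waiter registered with the same cookie and wake them all up.
 */
static int idempotent_complete(struct idempotent *u, int ret)
{
	const void *cookie = u->cookie;
	struct hlist_head *head = idem_hash + hash_ptr(cookie, IDEM_HASH_BITS);
	struct hlist_node *next;
	struct idempotent *pos;

	spin_lock(&idem_lock);
	hlist_for_each_entry_safe(pos, next, head, entry) {
		if (pos->cookie != cookie)
			continue;
		hlist_del(&pos->entry);
		pos->ret = ret;
		complete(&pos->complete);
	}
	spin_unlock(&idem_lock);
	return ret;
}

The finit_module() path then uses it as in the fragment quoted above: if
idempotent() returns true, the caller waits on idem.complete and returns
idem.ret (or idem.ret ? : -EEXIST, per the discussion).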
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On 30.05.23 18:22, Luis Chamberlain wrote:
> On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
>> On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
>>>
>>> I took a closer look at some of the modules that failed to load and
>>> noticed a pattern in that they have dependencies that are needed by more
>>> than one device.
>>
>> Ok, this is a "maybe something like this" RFC series of two patches -
>> one trivial one to re-organize things a bit so that we can then do the
>> real one which uses a filter based on the inode pointer to return an
>> "idempotent return value" for module loads that share the same inode.
>>
>> It's entirely untested, and since I'm on the road I'm going to not
>> really be able to test it. It compiles for me, and the code looks
>> fairly straightforward, but it's probably buggy.
>>
>> It's very loosely based on Luis' attempt, but it
>> (a) is internal to module loading
>> (b) uses a reliable cookie
>> (c) doesn't leave the cookie around randomly for later
>> (d) has seen absolutely no testing
>>
>> Put another way: if somebody wants to play with this, please treat it
>> as a starting point, not the final thing. You might need to debug
>> things, and fix silly mistakes.
>>
>> The idea is to just have a simple hash list of currently executing
>> module loads, protected by a trivial spinlock. Every module loader
>> adds itself to the right hash list, and if they were the *first* one
>> (ie no other pending module loads for that inode), will actually do
>> the module load.
>>
>> Everybody who *isn't* the first one will just wait for completion and
>> return the same error code that the first one returned.
>
> That's also a much snazzier MODULE_DEBUG_AUTOLOAD_DUPS, if we ever
> wanted to do something similar there to also join request_module()
> calls, instead of hiding it under debug.
>
>> This is technically bogus. The first one might fail due to arguments.
>
> For boot it's fine, as I can't think of boot wanting to support trying
> to load a module with different arguments, but who knows. But I can't
> see it being sensible to issue multiple concurrent requests for modules
> with different arguments without waiting in userspace for the first
> to fail.
>
> Even post-boot, doing that sounds rather insane, but it would certainly
> be a compromise and should probably be clearly documented. I think just
> a comment acknowledging that corner case seems sensible.
>
> We won't be able to get the arguments until we process the
> module, so it would be too late for this optimization on kread. That is
> why I had also stuck to the original feature being in kread, as it then
> provides a unique kread call and the caller is aware of it. But indeed I
> had not considered the effects of arguments.
>
> Lucas, any thoughts, from the kmod userspace perspective, on whether
> anyone is likely to issue concurrent module requests with
> differing arguments?
>
>> So the cookie shouldn't be just the inode, it should be the inode and
>> a hash of the arguments or something like that.
>
> Personally I think it's a fine optimization without the arguments.
>
>> But it is what it is,
>> and apart from possible show-stopper bugs this is no worse than the
>> failed "exclusive write deny" attempt. IOW - maybe worth trying?
>
> The only thing I can think of is allowing threads other than the
> first one to complete before the one that actually loaded the
> module. I thought about this race for module auto-loading, see
> the comment in kmod_dup_request_announce(); that just
> further delays the completion to the other threads with a stupid
> queue_work(). That seems more important for module auto-loading
> duplicates than for boot finit_module() duplicates. But not sure
> if ordering matters in the end due to a preemptible kernel, and maybe
> that concern is hysteria.
>
>> And if *that* didn't sell people on this patch series, I don't know
>> what will. I should be in marketing! Two drink minimums, here I come!
>
> Sold:
>
> On 255 vCPUs, 0 duplicates were found with this setup:
>
> root@kmod ~ # cat /sys/kernel/debug/modules/stats
> Mods ever loaded 66
> Mods failed on kread 0
> Mods failed on decompress 0
> Mods failed on becoming 0
> Mods failed on load 0
> Total module size 11268096
> Total mod text size 4149248
> Failed kread bytes 0
> Failed decompress bytes 0
> Failed becoming bytes 0
> Failed kmod bytes 0
> Virtual mem wasted bytes 0
> Average mod size 170729
> Average mod text size 62868
>
> So:
>
> Tested-by: Luis Chamberlain <mcgrof@kernel.org>
>
> In terms of bootup timing:
>
> Before:
> Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
> graphical.target reached after 44.178s in userspace.
>
> After:
> Startup finished in 23.995s (kernel) + 40.350s (userspace) = 1min 4.345s
> graphical.target reached after 40.226s in userspace.

I'll try grabbing the system where we saw the KASAN-related issues [1]
and give it a churn with and without the two patches. Might take a bit
(~1 day), unfortunately.

[1] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

--
Thanks,

David / dhildenb
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Wed, May 31, 2023 at 09:51:41AM +0200, David Hildenbrand wrote:
> [...]
>
> I'll try grabbing the system where we saw the KASAN-related issues [1] and
> give it a churn with and without the two patches. Might take a bit (~1 day),
> unfortunately.
>
> [1] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

Great, don't forget:

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 82b0dcc1fe77..222015093eeb 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -3066,7 +3066,7 @@ struct idempotent {
 
 #define IDEM_HASH_BITS 8
 static struct hlist_head idem_hash[1 << IDEM_HASH_BITS];
-static struct spinlock idem_lock;
+static DEFINE_SPINLOCK(idem_lock);
 
 static bool idempotent(struct idempotent *u, const void *cookie)
 {
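
For context on why that one-liner matters: a bare "static struct spinlock
idem_lock;" is only zero-initialized, which is not a validly initialized
lock once CONFIG_DEBUG_SPINLOCK or lockdep is enabled (the debug magic and
lock class never get set up, so the first spin_lock() complains). The two
idiomatic initializers are sketched below; struct foo and foo_init() are
made-up examples, not from the patch:

#include <linux/spinlock.h>

/* Static locks: fully initialized at compile time. */
static DEFINE_SPINLOCK(example_lock);

/* Locks embedded in (typically dynamically allocated) objects: */
struct foo {
	spinlock_t lock;
};

static void foo_init(struct foo *f)
{
	/* Must run before the lock is first taken. */
	spin_lock_init(&f->lock);
}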
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On 31.05.23 18:57, Luis Chamberlain wrote:
> [...]
>
> Great, don't forget:
>
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 82b0dcc1fe77..222015093eeb 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -3066,7 +3066,7 @@ struct idempotent {
>
> #define IDEM_HASH_BITS 8
> static struct hlist_head idem_hash[1 << IDEM_HASH_BITS];
> -static struct spinlock idem_lock;
> +static DEFINE_SPINLOCK(idem_lock);
>
> static bool idempotent(struct idempotent *u, const void *cookie)
> {
>

Finally was able to run it on that ThinkSystem SR950 with 8 populated
sockets -> 224 cores and 448 logical CPUs.

The KASAN vmap issues on that system were no longer reproducible with your
(Luis) previous work that's already in master.

I tested a !debug and debug config (both based on corresponding RHEL9 configs), comparing
929ed21dfdb6 ("master") with 929ed21dfdb6 + Linus' patches ("patched").


Unfortunately, boot times vary a lot, and I did not figure out how to reduce
the noise. I captured the "systemd-analyze blame" output as well.


1) !debug config (not enabling KASAN)

a) master

Startup finished in 32.225s (kernel) + 7.399s (initrd) + 20.378s (userspace) = 1min 3ms
multi-user.target reached after 20.352s in userspace.
Startup finished in 43.734s (kernel) + 7.288s (initrd) + 19.827s (userspace) = 1min 10.851s
multi-user.target reached after 19.800s in userspace.
Startup finished in 50.514s (kernel) + 7.171s (initrd) + 24.757s (userspace) = 1min 22.443s
multi-user.target reached after 24.734s in userspace.
Startup finished in 26.722s (kernel) + 7.249s (initrd) + 23.923s (userspace) = 57.895s
multi-user.target reached after 23.892s in userspace.

b) patched

Startup finished in 36.318s (kernel) + 7.177s (initrd) + 21.383s (userspace) = 1min 4.879s
multi-user.target reached after 21.355s in userspace.
Startup finished in 36.318s (kernel) + 7.177s (initrd) + 21.383s (userspace) = 1min 4.879s
multi-user.target reached after 21.355s in userspace.
Startup finished in 1min 34.678s (kernel) + 7.239s (initrd) + 24.066s (userspace) = 2min 5.985s
multi-user.target reached after 24.040s in userspace.
Startup finished in 25.879s (kernel) + 7.144s (initrd) + 29.665s (userspace) = 1min 2.689s
multi-user.target reached after 29.637s in userspace.



2) debug config (enabling KASAN)

a) master

Startup finished in 2min 12.695s (kernel) + 25.058s (initrd) + 1min 13.012s (userspace) = 3min 50.765s
multi-user.target reached after 1min 12.903s in userspace.
Startup finished in 1min 45.400s (kernel) + 24.294s (initrd) + 1min 8.910s (userspace) = 3min 18.606s
multi-user.target reached after 1min 8.786s in userspace.
Startup finished in 2min 4.857s (kernel) + 24.715s (initrd) + 1min 5.088s (userspace) = 3min 34.660s
multi-user.target reached after 1min 4.967s in userspace.
Startup finished in 3min 20.400s (kernel) + 24.703s (initrd) + 1min 5.469s (userspace) = 4min 50.573s
multi-user.target reached after 1min 5.344s in userspace.

b) patched

Startup finished in 2min 5.250s (kernel) + 25.049s (initrd) + 1min 1.961s (userspace) = 3min 32.262s
multi-user.target reached after 1min 1.844s in userspace.
Startup finished in 1min 52.524s (kernel) + 24.897s (initrd) + 1min 5.062s (userspace) = 3min 22.484s
multi-user.target reached after 1min 4.916s in userspace.
Startup finished in 9min 36.817s (kernel) + 24.859s (initrd) + 1min 18.657s (userspace) = 11min 20.335s
multi-user.target reached after 1min 18.455s in userspace.
Startup finished in 30min 20.715s (kernel) + 24.722s (initrd) + 1min 7.039s (userspace) = 31min 52.476s
multi-user.target reached after 1min 6.907s in userspace.


What concerns me a bit is that on the patched kernel we seem to hit more cases where
boot takes much longer (in both kernel configs).

I'll do some more runs/investigation to see if this is reproducible or just some system oddity.

Staring just at the udev settle time (systemd-analyze blame), it's very similar between both kernel
versions.

--
Cheers,

David / dhildenb
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Fri, Jun 02, 2023 at 05:19:58PM +0200, David Hildenbrand wrote:
> [...]
>
> What concerns me a bit is that on the patched kernel we seem to hit more cases where
> boot takes much longer (in both kernel configs).
>
> I'll do some more runs/investigation to see if this is reproducible or just some system oddity.
>
> Staring just at the udev settle time (systemd-analyze blame), it's very similar between both kernel
> versions.

Thanks for these tests. Having MODULE_STATS enabled +

cat /sys/kernel/debug/modules/stats

would be of huge value, both on master and patched.
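
(That is, a kernel built with the MODULE_STATS config option enabled and
debugfs mounted at /sys/kernel/debug.)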

Luis
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Fri, Jun 2, 2023 at 11:20 AM David Hildenbrand <david@redhat.com> wrote:
>
> What concerns me a bit is that on the patched kernel we seem to hit more cases where
> boot takes much longer (in both kernel configs).

So it potentially serializes the loads to the same file more, but in
the process uses much less memory (since the ones waiting will not
have done any of the "load file contents and uncompress them"). So
it's a bit of a trade-off.

We could complicate things a bit, and let other callers return -EEXIST
a bit earlier, but I'm not convinced it really matters.

Honestly, taking too long because user space does something stupid and
wrong is not a kernel bug. Not booting because we use too much memory
- that's problematic. But booting slowly because udev does several
thousand unnecessary module loads is entirely on udev.

Linus
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On 02.06.23 18:06, Linus Torvalds wrote:
> On Fri, Jun 2, 2023 at 11:20 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> What concerns me a bit is that on the patched kernel we seem to hit more cases where
>> boot takes much longer (in both kernel configs).
>
> So it potentially serializes the loads to the same file more, but in
> the process uses much less memory (since the ones waiting will not
> have done any of the "load file contents and uncompress them"). So
> it's a bit of a trade-off.

I have the feeling that -- on this system -- it's some inaccurate
accounting of firmware+loader times to the kernel startup time, combined
with some other noise. The firmware loading time especially seems to be
fairly random.

I guess what we care about regarding module loading is the
initrd+userspace loading times, and they are fairly stable. But we
mostly care about udev.

So let's look only at "systemd-udev" services:

1) !debug

a) master

5.672s systemd-udev-settle.service
505ms systemd-udev-trigger.service
272ms systemd-udevd.service
5.418s systemd-udev-settle.service
487ms systemd-udev-trigger.service
258ms systemd-udevd.service
5.707s systemd-udev-settle.service
527ms systemd-udev-trigger.service
273ms systemd-udevd.service
6.250s systemd-udev-settle.service
455ms systemd-udev-trigger.service
283ms systemd-udevd.service


b) patched

4.652s systemd-udev-settle.service
461ms systemd-udev-trigger.service
302ms systemd-udevd.service
4.652s systemd-udev-settle.service
461ms systemd-udev-trigger.service
302ms systemd-udevd.service
4.634s systemd-udev-settle.service
444ms systemd-udev-trigger.service
296ms systemd-udevd.service
4.745s systemd-udev-settle.service
444ms systemd-udev-trigger.service
273ms systemd-udevd.service


2) debug

a) master

32.806s systemd-udev-settle.service
9.584s systemd-udev-trigger.service
471ms systemd-udevd.service
29.901s systemd-udev-settle.service
8.914s systemd-udev-trigger.service
400ms systemd-udevd.service
28.640s systemd-udev-settle.service
9.260s systemd-udev-trigger.service
477ms systemd-udevd.service
29.498s systemd-udev-settle.service
9.073s systemd-udev-trigger.service
444ms systemd-udevd.service


b) patched

28.765s systemd-udev-settle.service
8.898s systemd-udev-trigger.service
400ms systemd-udevd.service
28.292s systemd-udev-settle.service
8.903s systemd-udev-trigger.service
401ms systemd-udevd.service
34.588s systemd-udev-settle.service
8.959s systemd-udev-trigger.service
455ms systemd-udevd.service
28.641s systemd-udev-settle.service
8.953s systemd-udev-trigger.service
389ms systemd-udevd.service



So except for some noise, in the general case the patched version seems to
be faster, just looking at systemd-udev.

>
> We could complicate things a bit, and let other callers return -EEXIST
> a bit earlier, but I'm not convinced it really matters.

Looking at the numbers, agreed.

>
> Honestly, taking too long because user space does something stupid and
> wrong is not a kernel bug. Not booting because we use too much memory
> - that's problematic. But booting slowly because udev does several
> thousand unnecessary module loads is entirely on udev.

Yes.


I'll do some more experiments, but from what I can tell

Tested-by: David Hildenbrand <david@redhat.com>

--
Cheers,

David / dhildenb
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
> [...]
>
> It's entirely untested, and since I'm on the road I'm going to not
> really be able to test it. It compiles for me, and the code looks
> fairly straightforward, but it's probably buggy.

The updated patches fix the issue experienced with 6.4-rc4 where the Intel
NUC12 booted fine, but media playback resulted in the video stuttering,
then freezing - and audio was not coming through.

# diff modules-rc4 modules-rc4-patched
< snd_hda_codec_hdmi 77824 0
< snd_hda_codec_realtek 159744 0
---
> snd_hda_codec_hdmi 77824 1
> snd_hda_codec_realtek 159744 1

Revert "module: error out early on concurrent load of the same module file"
module: split up 'finit_module()' into init_module_from_file() helper
modules: catch concurrent module loads, take two
Johan's fix:
-static struct spinlock idem_lock;
+static DEFINE_SPINLOCK(idem_lock);

Tested-by: Rudi Heitbaum <rudi@heitbaum.com>
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On 02.06.23 18:04, Luis Chamberlain wrote:
> [...]
>
> Thanks for these tests. Having MODULE_STATS enabled +
>
> cat /sys/kernel/debug/modules/stats
>
> would be of huge value, both on master and patched.

Sure! Sorry for the slight delay; I had to recompile the kernels.

I only did a single run on each kernel, which should be good enough for the purpose here.


1) !debug config (not enabling KASAN)

a) master

# cat /sys/kernel/debug/modules/stats
Mods ever loaded 75
Mods failed on kread 0
Mods failed on decompress 0
Mods failed on becoming 8052
Mods failed on load 531
Total module size 16343040
Total mod text size 4767744
Failed kread bytes 0
Failed decompress bytes 0
Failed becoming bytes 19185958488
Failed kmod bytes 1172591936
Virtual mem wasted bytes 20358550424
Average mod size 217908
Average mod text size 63570
Avg fail becoming bytes 2382757
Average fail load bytes 2208272
Duplicate failed modules:
Module-name How-many-times Reason
intel_rapl_common 447 Becoming
intel_uncore_frequency 447 Becoming
intel_uncore_frequency_common 445 Becoming & Load
isst_if_mbox_msr 447 Becoming
isst_if_common 447 Becoming & Load
skx_edac 447 Becoming & Load
nfit 447 Becoming
libnvdimm 447 Becoming
x86_pkg_temp_thermal 447 Becoming
intel_powerclamp 447 Becoming
coretemp 447 Becoming & Load
kvm_intel 447 Becoming
kvm 447 Becoming & Load
irdma 3 Load
ice 3 Becoming
irqbypass 447 Becoming & Load
rapl 393 Becoming & Load
gnss 2 Becoming & Load
ioatdma 63 Becoming & Load
ib_uverbs 3 Becoming & Load
intel_cstate 447 Becoming & Load
ipmi_si 1 Load
intel_uncore 447 Becoming & Load
ib_core 3 Becoming & Load
ipmi_devintf 1 Load
ipmi_msghandler 1 Becoming
dca 63 Becoming
ahci 1 Becoming
i40e 3 Becoming
libahci 1 Becoming
crct10dif_pclmul 195 Becoming
crc32_pclmul 262 Becoming
crc32c_intel 126 Becoming & Load
libata 1 Load
ghash_clmulni_intel 297 Becoming & Load
fjes 11 Becoming


b) patched

# cat /sys/kernel/debug/modules/stats
Mods ever loaded 75
Mods failed on kread 0
Mods failed on decompress 0
Mods failed on becoming 0
Mods failed on load 0
Total module size 16343040
Total mod text size 4767744
Failed kread bytes 0
Failed decompress bytes 0
Failed becoming bytes 0
Failed kmod bytes 0
Virtual mem wasted bytes 0
Average mod size 217908
Average mod text size 63570



2) debug config (enabling KASAN)

a) master

# cat /sys/kernel/debug/modules/stats
Mods ever loaded 75
Mods failed on kread 0
Mods failed on decompress 0
Mods failed on becoming 3173
Mods failed on load 923
Total module size 32288768
Total mod text size 14819328
Failed kread bytes 0
Failed decompress bytes 0
Failed becoming bytes 4279728952
Failed kmod bytes 2174133088
Virtual mem wasted bytes 6453862040
Average mod size 430517
Average mod text size 197592
Avg fail becoming bytes 1348796
Average fail load bytes 2355508
Duplicate failed modules:
Module-name How-many-times Reason
ioatdma 63 Becoming & Load
intel_rapl_common 313 Becoming
intel_uncore_frequency 136 Becoming & Load
intel_uncore_frequency_common 38 Becoming
isst_if_mbox_msr 443 Becoming & Load
isst_if_common 210 Becoming & Load
skx_edac 257 Becoming & Load
nfit 180 Becoming & Load
libnvdimm 294 Becoming & Load
x86_pkg_temp_thermal 321 Becoming & Load
intel_powerclamp 247 Becoming & Load
coretemp 55 Becoming & Load
kvm_intel 2 Becoming
kvm 3 Becoming & Load
irqbypass 3 Becoming
rapl 1 Becoming
intel_cstate 444 Load
intel_uncore 447 Becoming & Load
ib_uverbs 1 Load
ib_core 3 Becoming & Load
ipmi_si 1 Becoming
dca 20 Becoming & Load
sg 1 Becoming
i40e 3 Becoming & Load
ahci 1 Becoming
crct10dif_pclmul 160 Becoming & Load
crc32_pclmul 151 Becoming & Load
crc32c_intel 146 Becoming & Load
ghash_clmulni_intel 140 Becoming
libata 1 Load
fjes 11 Becoming & Load



b) patched

# cat /sys/kernel/debug/modules/stats
Mods ever loaded 75
Mods failed on kread 0
Mods failed on decompress 0
Mods failed on becoming 19
Mods failed on load 0
Total module size 32288768
Total mod text size 14819328
Failed kread bytes 0
Failed decompress bytes 0
Failed becoming bytes 6441296
Failed kmod bytes 0
Virtual mem wasted bytes 6441296
Average mod size 430517
Average mod text size 197592
Avg fail becoming bytes 339016
Duplicate failed modules:
Module-name How-many-times Reason
intel_rapl_common 2 Becoming
intel_uncore_frequency_common 5 Becoming
isst_if_common 3 Becoming
intel_powerclamp 2 Becoming
irqbypass 5 Becoming
crct10dif_pclmul 1 Becoming
crc32c_intel 1 Becoming



So, with these (helpful) stats, the improvement is obvious (and explains the ~1s
improvement I saw staring at the startup times of the udev services).

There are still some failed module loads with the debug config (only in the
becoming state); I did not dive deeply into the actual code changes (-EBUSY).
I'm just spelling it out so we can decide if this is to be expected or some
corner case that shouldn't be happening.


Thanks!

--
Cheers,

David / dhildenb
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Tue, May 30, 2023 at 11:40:35AM +0200, Johan Hovold wrote:
> [...]
>
> With the missing spinlock initialisation fixed:
>
> -static struct spinlock idem_lock;
> +static DEFINE_SPINLOCK(idem_lock);
>
> this passes basic smoke testing and allows the X13s to boot.
>
> It does not seem to have any significant impact on boot time, but it
> avoids some of the unnecessary load attempts as intended:

Took another look at the code, which, modulo some nits like missing spaces
around a + operator and an "ourselfs" typo, looks good to me. So with the
spinlock initialisation fixed:

Reviewed-by: Johan Hovold <johan@kernel.org>
Tested-by: Johan Hovold <johan@kernel.org>

Johan
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Mon, Jun 05, 2023 at 01:26:00PM +0200, David Hildenbrand wrote:
> I only did a single run on each kernel, which should be good enough for the purpose here.
>
>
> 1) !debug config (not enabling KASAN)
>
> a) master
>
> # cat /sys/kernel/debug/modules/stats
> Virtual mem wasted bytes 20358550424
18 GiB

> Average mod size 217908
212 KiB

> Average mod text size 63570
62 KiB

> b) patched
>
> # cat /sys/kernel/debug/modules/stats
> Virtual mem wasted bytes 0

> 2) debug config (enabling KASAN)
>
> a) master
>
> # cat /sys/kernel/debug/modules/stats
> Virtual mem wasted bytes 6453862040
6 GiB

> Average mod size 430517
420 KiB, so ballpark kasan pretty much doubles module size.

> Average mod text size 197592
192 KiB, and it is reflected in module .text too; in fact the average .text roughly triples.

It would have otherwise been difficult to get some of these stats, so
thanks! I make note of .text just because of the recent development work
going on for a new module_alloc(). About 14 MiB is required to house a big
iron kasan enabled module .text, whereas about a third of that is required
for !kasan.

> b) patched
>
> # cat /sys/kernel/debug/modules/stats
> Virtual mem wasted bytes 6441296

We've gone down from ~6 GiB to ~6 MiB.

> So, with these (helpful) stats,

Extremely useful, yes thanks.

> the improvement is obvious (and explains the ~1s
> improvement I saw staring at the startup times of the udev services).
>
> There are still some failed module loads with the debug config (only in the
> becoming state), I did not dive deeply into the actual code changes (-EBUSY),

That's fine. Linus' patch does not keep the lock for the entire life
of the module; it releases it right away after we're done with the
kernel_read(). Because of this, there is a small race between a thread
finishing the kernel_read() and the module then being processed
into the modules linked list. During that time, if a new load of the
same module comes in, we'll have to allow it since the lock was released.
Those extra modules end up lingering, waiting for the first one that made
it to the modules linked list.
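
An illustrative interleaving of that window, following the description
above (not the patch verbatim), with two loaders of the same module:

    loader A (first)                loader B (duplicate)
    ----------------                --------------------
    idempotent(inode) -> first
    kernel_read() + decompress
    idempotent_complete(ret)        /* dedup entry gone from here on */
                                    idempotent(inode) -> "first" again
    add_unformed_module() succeeds  kernel_read() + decompress
    module goes COMING, then LIVE   add_unformed_module() -> -EBUSY
                                    (a "Becoming" failure in the stats)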

I don't think we need to worry about 6 MiB; this patch alone should
suffice for a long time, until userspace gets its act together and fixes
this properly. Fixing userspace should reduce some latencies on bootup
as well, so someone who cares about bootup speeds on high-end systems
could likely be encouraged to fix that.

> just spelling it out so we can decide if this is to be expected or some corner
> case that shouldn't be happening.

It is expected. In fact, that the heuristic works so well without
keeping the lock forever, and therefore keeps the code changes
to a minimum, is quite amazing.

Luis
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Mon, Jun 05, 2023 at 08:17:42AM -0700, Luis Chamberlain wrote:
> We've gone down from ~6 GiB to ~6 MiB.

And just to also highlight, that was just for the KASAN enabled case, and
for !KASAN we went from ~18 GiB to 0.

Luis
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Mon, Jun 05, 2023 at 08:28:41AM -0700, Luis Chamberlain wrote:
> On Mon, Jun 05, 2023 at 08:17:42AM -0700, Luis Chamberlain wrote:
> > We've gone down from ~6 GiB to ~6 MiB.
>
> And just to also highlight, that was just for the KASAN enabled case, and
> for !KASAN we went from ~18 GiB to 0.

Linus, were you thinking of including these patches for v6.5-rc1?

Luis
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Wed, 28 Jun 2023 at 11:52, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Linus, were you thinking of including these patches for v6.5-rc1?

Heh, I don't even have them in my tree any more; I was assuming that
if they were good they'd be sent back to me...

I guess I can go fish them out from my old emails, but I actually was
expecting to just get them from you.

I do actually maintain my own branches for *some* things, and merge
them myself, but they tend to be for areas that I feel I'm
co-maintaining (ie notably vfs and mm that I still feel I'm involved
in).

In other areas, I may send out patches, but I don't feel like I'm a
maintainer, so I then think that "the real maintainer can decide if
these patches are good or not".

And I would very much hope that people don't take said patches _just_
because they come from me. They should be judged on their own merits,
and then occasionally people can mutter "Christ, what drugs is Linus
on _today_?" and just throw my patches in the garbage.

Linus
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Wed, 28 Jun 2023 at 13:14, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> In other areas, I may send out patches, but I don't feel like I'm a
> maintainer, so I then think that "the real maintainer can decide if
> these patches are good or not".

Anyway, since I clearly didn't communicate clearly that I was throwing
the patch over to you, let me go and find it and just apply it myself.

Linus
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Wed, 28 Jun 2023 at 15:07, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, since I clearly didn't communicate clearly that I was throwing
> the patch over to you, let me go and find it and just apply it myself.

Ok, I spent some time writing a better commit message for that change,
tried to make sure I got everybody's comments and Tested-by's sorted
out etc, and committed the end result.

I also did the actual module pull and your sysctl pull.

Can you please check that it all looks like you expected?

Linus
Re: [PATCH 2/2] module: add support to avoid duplicates early on load
On Wed, Jun 28, 2023 at 04:17:10PM -0700, Linus Torvalds wrote:
> On Wed, 28 Jun 2023 at 15:07, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Anyway, since I clearly didn't communicate clearly that I was throwing
> > the patch over to you, let me go and find it and just apply it myself.
>
> Ok, I spent some time writing a better commit message for that change,
> tried to make sure I got everybody's comments and Tested-by's sorted
> out etc, and committed the end result.
>
> I also did the actual module pull and your sysctl pull.
>
> Can you please check that it all looks like you expected?

Looks good to me, thanks for doing that. It certainly was not clear
to me that you had expected me to merge your patch just because you
had previously merged your own and later reverted it. But now I know;
in the future I may just pick up your patches instead of expecting
you to shortcut it. Thanks!

Luis
