Mailing List Archive

[RFC] KEXEC: allocate crash note buffers at boot time v2
Hello,

Presented is version 2 of this patch, which uses cpu hotplug
notifications to be rather safer when allocating buffers.

It occurs to me that there is a bit of an API problem with it comes to
combining a crashdump kernel with potential hotplug events.

If dom0 does not get notification of new/removed pcpus, and if it
doesn't re-evaluate /proc/iomem by recalling things like
KEXEC_CMD_get_range, then subsequent calls to /sbin/kexec -p will bundle
up stale information into the kdump magic bundle.

Even if dom0 does get a notification of pcpu hotplug events, and it
updates its iomem map, /sbin/kexec would still need to be called to
bundle the new information into the kdump magic bundle. Possibly doable
off a udev CPU hotplug event.

Even if /sbin/kexec gets called after hotplug events, a crash before the
new kexec magic bundle has been loaded will still result in the old
bundle being used, with its stale information regarding the position and
number of crash notes.

Sadly, I don't see any easy or sensible solution to these problems.
However, it is probably worth knowing as a potential limitation.

I have worked around this problem by never deallocating crash notes, so
even if information is stale as to the number of crash notes to be
expected, the stale physical addresses still point to allocated notes
buffers which have been partially initialized to have sensible note
headers in. This unfortunately means that an offlined cpu which was
present at boot time will have a notes section with zero'd data. It
also means that a cpu onlined after boot which crashes will not have its
register state presented to the kdump environment.

I would like to hope that both of these cases are unlikely, but again,
it is still a potential limitation.

The cpu crash notes themselves (PRSTATUS and XEN_ELFNOTE_CRASH_REGS)
don't contain a reference to which pcpu they represent. This is
inferred by /sbin/kexec from the order in which they appear in dom0's
/proc/iomem, meaning that the kdump environments idea of which pcpu is
which might differ from Xen's. This depending on whether Xen allocates
the notes buffer in ascending order, whether dom0 sorts memory addresses
reported, and, as it currently gets it correct, whether either of these
behaviors change in the future.

This final issue is within my ability to fix and I will do so in the
near future, along with some other additions I already need to make to
the per cpu crash notes.

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Re: [RFC] KEXEC: allocate crash note buffers at boot time v3 [ In reply to ]
On 01/12/2011 12:56, "Jan Beulich" <JBeulich@suse.com> wrote:

>> + register_keyhandler('C', &crashdump_trigger_keyhandler);
>> +
>> + /* If no crash area, no need to allocate space for notes. */
>> + if ( 0 == kexec_crash_area.size )
>> + return 0;
>
> Wouldn't it make sense to switch the order of these?

I also really hate this constant-first ordering.

-- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v2 [ In reply to ]
>>> On 30.11.11 at 18:24, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>-static DEFINE_PER_CPU_READ_MOSTLY(void *, crash_notes);
>+static void * crash_notes[NR_CPUS];

If at all possible, can you please allocate this dynamically using
nr_cpu_ids for the size?

>+static int kexec_init_cpu_notes(const int cpu)

If the function's argument was made 'unsigned int' (as it already is in
its caller), then ...

>+ BUG_ON( cpu < 0 || cpu >= NR_CPUS );

... only the second check would be needed. Furthermore, it's
pointless to use signed types for CPU numbers (and it's
inefficient on various architectures), despite there being numerous
(bad) examples throughout the tree.

>+ if ( 0 == cpu )
>+ nr_bytes = nr_bytes +
>+ sizeof_note("Xen", sizeof(crash_xen_info_t));

Minor nit: Why not use += here (presumably allowing the statement to
fit on one line)?

>+ if ( ! note )
>+ /* Ideally, this would be -ENOMEM. However, there are more problems
>+ * assocated with trying to deal with -ENOMEM sensibly than just
>+ * pretending that the crash note area doesn't exist. */
>+ return 0;

The only current caller ignores the return value, so why the bogus
return value?

>+ if ( nr < 0 || nr >= nr_cpu_ids || !cpu_online(nr) || !crash_notes[nr] )
>...
>- if ( per_cpu(crash_notes, nr) == NULL )
>- {
>...
>- }

I would suggest to retry allocation here. Even if this isn't suitable for
a 32-bit kdump kernel on large systems, there#s no reason to penalize
fully 64-bit environment.

And here it would also become meaningful that kexec_init_cpu_notes()
actually returns a meaningful error upon failure.

Also, I can't see the reason for the patch to move around
do_crashdump_trigger() and crashdump_trigger_keyhandler.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v2 [ In reply to ]
On 01/12/11 09:08, Jan Beulich wrote:
>>>> On 30.11.11 at 18:24, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> -static DEFINE_PER_CPU_READ_MOSTLY(void *, crash_notes);
>> +static void * crash_notes[NR_CPUS];
> If at all possible, can you please allocate this dynamically using
> nr_cpu_ids for the size?

Ok

>> +static int kexec_init_cpu_notes(const int cpu)
> If the function's argument was made 'unsigned int' (as it already is in
> its caller), then ...
>
>> + BUG_ON( cpu < 0 || cpu >= NR_CPUS );
> ... only the second check would be needed. Furthermore, it's
> pointless to use signed types for CPU numbers (and it's
> inefficient on various architectures), despite there being numerous
> (bad) examples throughout the tree.

Good point. I will clean this up

>> + if ( 0 == cpu )
>> + nr_bytes = nr_bytes +
>> + sizeof_note("Xen", sizeof(crash_xen_info_t));
> Minor nit: Why not use += here (presumably allowing the statement to
> fit on one line)?

Because the next few patches in my series adds a new crash notes at this
point. I felt it was neater to leave in this form for a reduced diff
next patch, and gcc will optimize it to +=

>> + if ( ! note )
>> + /* Ideally, this would be -ENOMEM. However, there are more problems
>> + * assocated with trying to deal with -ENOMEM sensibly than just
>> + * pretending that the crash note area doesn't exist. */
>> + return 0;
> The only current caller ignores the return value, so why the bogus
> return value?

Originally it passed the return value back up to the cpu hotplug
notifier, but I decided that causing an -ENOMEM at this point was a
little silly given that:
1) a lack of mem on boot will quickly kill the host in other ways, and
2) while there is no way useful way to ensure that the crashdump kernel
gets reloaded with new notes pointers, blocking on not being able to
allocate notes is silly.

Would you consider this enough of a problem to actually fail the
CPU_PREPARE_UP ?

On further thought from my musings yesterday, it would be fantastic if
we were able to point the kdump kernel at a PT_NOTE header without
discerned length, at which point Xen can rewrite its crash notes without
having to get /sbin/kexec to repackage its magic elf blog pointing to
new/different memory locations. However, as this would involve changes
to kexec-tools, Linux (and Xen) in a not trivially backward compatible
fashion, the chances of it happening are slim.

>> + if ( nr < 0 || nr >= nr_cpu_ids || !cpu_online(nr) || !crash_notes[nr] )
>> ...
>> - if ( per_cpu(crash_notes, nr) == NULL )
>> - {
>> ...
>> - }
> I would suggest to retry allocation here. Even if this isn't suitable for
> a 32-bit kdump kernel on large systems, there#s no reason to penalize
> fully 64-bit environment.

At this point, none of the allocation makes it suitable for a 32bit
system. For that, I need to start investigating something akin to
xalloc_below(), which is probably going to be todays task. If we have
previously failed to allocate the notes buffer (and are somehow still
running), what if anything makes it likely that we will successfully
allocate memory this time?

> And here it would also become meaningful that kexec_init_cpu_notes()
> actually returns a meaningful error upon failure.
> Also, I can't see the reason for the patch to move around
> do_crashdump_trigger() and crashdump_trigger_keyhandler.

This is probably collateral damage from having to reorder sizeof_note()
and setup_note() in preference to making a forward declaration of them.
I will see about fixing.

> Jan
>

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v2 [ In reply to ]
>>> On 01.12.11 at 10:49, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 01/12/11 09:08, Jan Beulich wrote:
>>>>> On 30.11.11 at 18:24, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> + if ( ! note )
>>> + /* Ideally, this would be -ENOMEM. However, there are more problems
>>> + * assocated with trying to deal with -ENOMEM sensibly than just
>>> + * pretending that the crash note area doesn't exist. */
>>> + return 0;
>> The only current caller ignores the return value, so why the bogus
>> return value?
>
> Originally it passed the return value back up to the cpu hotplug
> notifier, but I decided that causing an -ENOMEM at this point was a
> little silly given that:
> 1) a lack of mem on boot will quickly kill the host in other ways, and
> 2) while there is no way useful way to ensure that the crashdump kernel
> gets reloaded with new notes pointers, blocking on not being able to
> allocate notes is silly.
>
> Would you consider this enough of a problem to actually fail the
> CPU_PREPARE_UP ?

No, absolutely not. Ignoring the return value there is fine.

>>> + if ( nr < 0 || nr >= nr_cpu_ids || !cpu_online(nr) || !crash_notes[nr] )
>>> ...
>>> - if ( per_cpu(crash_notes, nr) == NULL )
>>> - {
>>> ...
>>> - }
>> I would suggest to retry allocation here. Even if this isn't suitable for
>> a 32-bit kdump kernel on large systems, there#s no reason to penalize
>> fully 64-bit environment.
>
> At this point, none of the allocation makes it suitable for a 32bit
> system. For that, I need to start investigating something akin to
> xalloc_below(), which is probably going to be todays task. If we have
> previously failed to allocate the notes buffer (and are somehow still
> running), what if anything makes it likely that we will successfully
> allocate memory this time?

Because memory got freed meanwhile? I'm particularly having post-
boot onlining of CPUs in mind; of course, if allocation failed during
boot we have bigger problems than this one.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v3 [ In reply to ]
Here is v3 of this patch.

After some consideration, I am not so certain the spinlock is strictly
necessary. However, as it is not a common codepath, I figure that it is
better to be safe than sorry.

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Re: [RFC] KEXEC: allocate crash note buffers at boot time v3 [ In reply to ]
>>> On 01.12.11 at 13:29, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>+static spinlock_t crash_notes_lock = SPIN_LOCK_UNLOCKED;

Please use DEFINE_SPINLOCK() here.

>+ register_keyhandler('C', &crashdump_trigger_keyhandler);
>+
>+ /* If no crash area, no need to allocate space for notes. */
>+ if ( 0 == kexec_crash_area.size )
>+ return 0;

Wouldn't it make sense to switch the order of these?

>+ crash_notes = xmalloc_bytes(nr_cpu_ids * sizeof(void*));

Please use xmalloc_array() here.

>+ if ( !crash_notes[nr] && 0 != kexec_init_cpu_notes(nr) )

The first check is pointless - the function will return zero if the
allocation was already done.

Further, you shouldn't take a lock around a call to xmalloc() or alike
unless absolutely necessary. It is pretty simple to avoid here - you
really only need to lock around the storing of the pointer and maybe
the setup_note() calls (but be careful with returning -ENOMEM - you
shouldn't if the allocation fails, but you then find - under the lock -
that a pointer was already set by another CPU).

Finally, one thing I failed to notice on the previous version - the
nr_bytes calculations are now being done twice. This should
probably be moved into a helper function, especially since you
said you intend to add stuff here subsequently.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v3 [ In reply to ]
On 01/12/11 12:56, Jan Beulich wrote:
>>>> On 01.12.11 at 13:29, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> +static spinlock_t crash_notes_lock = SPIN_LOCK_UNLOCKED;
> Please use DEFINE_SPINLOCK() here.

Ok

>> + register_keyhandler('C', &crashdump_trigger_keyhandler);
>> +
>> + /* If no crash area, no need to allocate space for notes. */
>> + if ( 0 == kexec_crash_area.size )
>> + return 0;
> Wouldn't it make sense to switch the order of these?

Possibly. In the case where a crash kernel has not been loaded, it
would degrade to a reboot, so it is still of some use if the there is no
kexec area. Having said that, there is an explicit reboot handler, so
making this one disappear is probably a good thing.

>> + crash_notes = xmalloc_bytes(nr_cpu_ids * sizeof(void*));
> Please use xmalloc_array() here.

Yes - it was dim of me to forget that.

>> + if ( !crash_notes[nr] && 0 != kexec_init_cpu_notes(nr) )
> The first check is pointless - the function will return zero if the
> allocation was already done.

Good point - I missed that.

> Further, you shouldn't take a lock around a call to xmalloc() or alike
> unless absolutely necessary. It is pretty simple to avoid here - you
> really only need to lock around the storing of the pointer and maybe
> the setup_note() calls (but be careful with returning -ENOMEM - you
> shouldn't if the allocation fails, but you then find - under the lock -
> that a pointer was already set by another CPU).

So what we should do is this:

xmalloc
take lock
check to see if the entry is been filled in the meantime. if so, free
the malloc'd region
release lock
only return -ENOMEM if we fail the malloc and the crash_note is still
NULL when we take the lock

I think this ought to cover all possible cases ?

(In reality I think the xmalloc itself should be covered by the fact we
will fail the !cpu_online(nr) test before we consider trying to
reallocate the buffer, but that doesn't preclude future proofing the code)

> Finally, one thing I failed to notice on the previous version - the
> nr_bytes calculations are now being done twice. This should
> probably be moved into a helper function, especially since you
> said you intend to add stuff here subsequently.

I had noticed this and was going to let it slide for now, considering
what would be best to do about it. Playing with void pointers and
calculating lengths with sizeof is always more dangerous than
calculating a size, malloc'ing it and filling in a range start and size.

Given that it is such a rare codepath, I am honestly not sure which is
the better tradeof - an extra function call in 2 places or doubling the
size of the crash_notes array by introducing a size as well as a start.
Both seem very minor in the grand scheme of things.

> Jan
>

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v3 [ In reply to ]
On 01/12/11 05:20, Keir Fraser wrote:
> On 01/12/2011 12:56, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>> + register_keyhandler('C', &crashdump_trigger_keyhandler);
>>> +
>>> + /* If no crash area, no need to allocate space for notes. */
>>> + if ( 0 == kexec_crash_area.size )
>>> + return 0;
>> Wouldn't it make sense to switch the order of these?
> I also really hate this constant-first ordering.
>
> -- Keir
>

It was a useful habit I got into when learning C/C++ and has stayed with
me. I will endeavor to not do it.

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v3 [ In reply to ]
On 01/12/11 12:56, Jan Beulich wrote:
>>>> On 01.12.11 at 13:29, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> +static spinlock_t crash_notes_lock = SPIN_LOCK_UNLOCKED;
> Please use DEFINE_SPINLOCK() here.
>
>> + register_keyhandler('C', &crashdump_trigger_keyhandler);
>> +
>> + /* If no crash area, no need to allocate space for notes. */
>> + if ( 0 == kexec_crash_area.size )
>> + return 0;
> Wouldn't it make sense to switch the order of these?
>
>> + crash_notes = xmalloc_bytes(nr_cpu_ids * sizeof(void*));
> Please use xmalloc_array() here.
>
>> + if ( !crash_notes[nr] && 0 != kexec_init_cpu_notes(nr) )
> The first check is pointless - the function will return zero if the
> allocation was already done.

I forgot to say in my previous email. Short circuit evaluation should
prevent the kexec_init_cpu_notes() function call actually being made.

> Further, you shouldn't take a lock around a call to xmalloc() or alike
> unless absolutely necessary. It is pretty simple to avoid here - you
> really only need to lock around the storing of the pointer and maybe
> the setup_note() calls (but be careful with returning -ENOMEM - you
> shouldn't if the allocation fails, but you then find - under the lock -
> that a pointer was already set by another CPU).
>
> Finally, one thing I failed to notice on the previous version - the
> nr_bytes calculations are now being done twice. This should
> probably be moved into a helper function, especially since you
> said you intend to add stuff here subsequently.
>
> Jan
>

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v3 [ In reply to ]
>>> On 01.12.11 at 14:59, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 01/12/11 12:56, Jan Beulich wrote:
>> Further, you shouldn't take a lock around a call to xmalloc() or alike
>> unless absolutely necessary. It is pretty simple to avoid here - you
>> really only need to lock around the storing of the pointer and maybe
>> the setup_note() calls (but be careful with returning -ENOMEM - you
>> shouldn't if the allocation fails, but you then find - under the lock -
>> that a pointer was already set by another CPU).
>
> So what we should do is this:
>
> xmalloc
> take lock
> check to see if the entry is been filled in the meantime. if so, free
> the malloc'd region

Don't call xfree() with the lock held either, if possible.

> release lock
> only return -ENOMEM if we fail the malloc and the crash_note is still
> NULL when we take the lock

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v3 [ In reply to ]
>>> On 01.12.11 at 16:02, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 01/12/11 12:56, Jan Beulich wrote:
>>>>> On 01.12.11 at 13:29, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> +static spinlock_t crash_notes_lock = SPIN_LOCK_UNLOCKED;
>> Please use DEFINE_SPINLOCK() here.
>>
>>> + register_keyhandler('C', &crashdump_trigger_keyhandler);
>>> +
>>> + /* If no crash area, no need to allocate space for notes. */
>>> + if ( 0 == kexec_crash_area.size )
>>> + return 0;
>> Wouldn't it make sense to switch the order of these?
>>
>>> + crash_notes = xmalloc_bytes(nr_cpu_ids * sizeof(void*));
>> Please use xmalloc_array() here.
>>
>>> + if ( !crash_notes[nr] && 0 != kexec_init_cpu_notes(nr) )
>> The first check is pointless - the function will return zero if the
>> allocation was already done.
>
> I forgot to say in my previous email. Short circuit evaluation should
> prevent the kexec_init_cpu_notes() function call actually being made.

What would that buy you? Performance is not an issue on this code
path.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v4 [ In reply to ]
Version 4 attached.

Fixed the logic regarding locking and x{malloc,free}'ing, added a few
more comments in places, and changed the coding style to avoid
constant-first compares.

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Re: [RFC] KEXEC: allocate crash note buffers at boot time v4 [ In reply to ]
>>> On 01.12.11 at 18:14, Andrew Cooper <andrew.cooper3@citrix.com> wrote:

Looks good to me now except for one minor thing (below), and the
fact that you still decided to retain the two duplicates of the size
calculation (I'll have to remember to clean this up if you don't, unless
Keir explicitly agrees with the duplication).

>+ /* Try once again to allocate room for the crash notes. It is just possible
>+ * that more space has become available since we last tried. If space has
>+ * already been allocated, kexec_init_cpu_notes() will return early with 0.
>+ */
>+ if ( kexec_init_cpu_notes(nr) )
> return -EINVAL;

The function can fail only with -ENOMEM, so why not return this here?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [RFC] KEXEC: allocate crash note buffers at boot time v4 [ In reply to ]
On 02/12/11 08:02, Jan Beulich wrote:
>>>> On 01.12.11 at 18:14, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Looks good to me now except for one minor thing (below), and the
> fact that you still decided to retain the two duplicates of the size
> calculation (I'll have to remember to clean this up if you don't, unless
> Keir explicitly agrees with the duplication).

Ok - I will turn them into a start,size pair

>> + /* Try once again to allocate room for the crash notes. It is just possible
>> + * that more space has become available since we last tried. If space has
>> + * already been allocated, kexec_init_cpu_notes() will return early with 0.
>> + */
>> + if ( kexec_init_cpu_notes(nr) )
>> return -EINVAL;
> The function can fail only with -ENOMEM, so why not return this here?
>
> Jan
>

Actually, returning -EINVAL here is counter productive. -EINVAL is used
by dom0 to work out when it has asked for each CPU. This in itself is a
little broken because there is nothing stopping a middle CPU from being
offline at the time these hypercalls are made. The other thing is that
there is nothing stopping an offline cpu from having a valid notes
section. Therefore, the test of online should be removed, so -EINVAL
only gets returned for cpu out of range, or not set up notes at all in
the first place.

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel