Mailing List Archive

Problems with APIC on versions 4.9 and later (4.8 works)
Hi,

Sorry for the simultaneous post on xen-users and xen-devel, but as I noted
that the problem appears only for versions of xen that are >= 4.9, I think
that developers may have a look at this.

I recently bought a generic mainboard and a Xeon E5-2926v2 CPU, it is a 12
core, 24 threads cpu.

My system was already running on another machine with xen 4.11 on a debian
10 system and after replacing the mainboard it didn't boot.

After many tries I noticed that downgrading to the previous version of Xen
(4.8, available on Debian 9) works well. I also tried a lot of variations
for the dom0 kernel, all of them with the same results.

All my tests were done with 4.11, but I checked with a live version of
Alpine Linux (3.7.3, with Xen 4.9.4) that the system doesn't boot on that
release also.

With more research I noticed that if I limit dom0 to use only one CPU
(dom0_max_vcpus=1) the system boots up, I didn't check if guest VM worked,
but I suppose that they would not be able to use the other 23 vcpus
available, anyway, a system with one vcpu for dom0 would be very slow I
think.

I've noticed also that if I keep dom0 to use more than one core but disable
acpi on the dom0 kernel, it boots up, unfortunately this is not
sufficient because I cannot use any device attached to the system (not even
the usb keyboard). This only helps to detect that the problem may be in the
ACPI/APIC code.

I tried many variations with parameters related with ACPI and APIC. None of
them was satisfactory, always ended on a halted system or a self rebooting
one.

Can anyone point me to a solution for this?

Best regards,
Claudemir
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 18.01.2021 21:15, Claudemir Todo Bom wrote:
> Sorry for the simultaneous post on xen-users and xen-devel, but as I noted
> that the problem appears only for versions of xen that are >= 4.9, I think
> that developers may have a look at this.

Dropping xen-users.

> I recently bought a generic mainboard and a Xeon E5-2926v2 CPU, it is a 12
> core, 24 threads cpu.
>
> My system was already running on another machine with xen 4.11 on a debian
> 10 system and after replacing the mainboard it didn't boot.
>
> After many tries I noticed that downgrading to the previous version of Xen
> (4.8, available on Debian 9) works well. I also tried a lot of variations
> for the dom0 kernel, all of them with the same results.
>
> All my tests were done with 4.11, but I checked with a live version of
> Alpine Linux (3.7.3, with Xen 4.9.4) that the system doesn't boot on that
> release also.
>
> With more research I noticed that if I limit dom0 to use only one CPU
> (dom0_max_vcpus=1) the system boots up, I didn't check if guest VM worked,
> but I suppose that they would not be able to use the other 23 vcpus
> available, anyway, a system with one vcpu for dom0 would be very slow I
> think.
>
> I've noticed also that if I keep dom0 to use more than one core but disable
> acpi on the dom0 kernel, it boots up, unfortunately this is not
> sufficient because I cannot use any device attached to the system (not even
> the usb keyboard). This only helps to detect that the problem may be in the
> ACPI/APIC code.
>
> I tried many variations with parameters related with ACPI and APIC. None of
> them was satisfactory, always ended on a halted system or a self rebooting
> one.
>
> Can anyone point me to a solution for this?

For this we first of all need details about your problem: A full
boot log ideally, or if this isn't available anything at least
allowing us to see what exactly goes wrong (and with this I mean
the first thing going wrong, not later possible follow-on issues
from earlier problems), like screen shots. And this again ideally
with master / staging Xen, or if that's not feasible with as new
a version as possible.

I don't suppose you'd be up for bisecting the 4.8 ... 4.9 window
to determine when exactly your issue was introduced?

Jan
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 19.01.2021 20:36, Claudemir Todo Bom wrote:
> I do not have serial output on this setup, so I recorded a video with
> boot_delay=50 in order to be able to get all the kernel messages:
> https://youtu.be/y95h6vqoF7Y

This doesn't show any badness afaics.

> This is running 4.14 from debian bullseye (testing).
>
> I'm also attaching the dmesg output when booting xen 4.8 with the same
> kernel version and same parameters.
>
> I visually compared all the messages, and the only thing I noticed was that
> 4.14 used tsc as clocksource and 4.8 used xen. I tried to boot the kernel
> with "clocksource=xen" and the problem is happening with that also.

There's some confusion here I suppose: The clock source you talk
about is the kernel's, not Xen's. I didn't think this would
change for the same kernel version with different Xen underneath,
but the Linux maintainers of the Xen code there may know better.
Cc-ing them.

> The "start" of the problem is that when the kernel gets to the "Freeing
> unused kernel image (initmem) memory: 2380K" it hangs and stays there for a
> while. After a few minutes it shows that a process (swapper) is blocked for
> sometime (image attached)

Now that's pretty unusual - the call trace seen in the screen
shot you had attached indicates the kernel didn't even make it
past its own initialization just yet. Just to have explored that
possibility - could you enable Xen's NMI watchdog (simply
"watchdog" on the Xen command line)? Among the boot messages
there ought to be one indicating whether it actually works on
your system. Without a serial console you wouldn't see anything
if it triggers, but the system would then never make it to the
kernel side issue.

As far as making sure we at least see all kernel messages -
are you having "ignore_loglevel" in place? I don't think I've
been able to spot the kernel command line anywhere in the video.

I'm afraid there's no real way around seeing the full Xen
messages, i.e. including possible ones while Dom0 already boots
(and allowing some debug keys to be issued, as the rcu_barrier
on the stack may suggest there's an issue with one of the
secondary CPUs). You could try whether "vag=keep" on the Xen
command line allows you to see more, but this option may have
quite severe an effect on the timing of Dom0's booting, which
may make an already bad situation worse.

Alternatively the kernel may need instrumenting to figure what
exactly it is that prevent forward progress.

There's one other wild guess you may want to try: "cpuidle=no"
on the Xen command line.

Jan
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 20.01.21 09:50, Jan Beulich wrote:
> On 19.01.2021 20:36, Claudemir Todo Bom wrote:
>> I do not have serial output on this setup, so I recorded a video with
>> boot_delay=50 in order to be able to get all the kernel messages:
>> https://youtu.be/y95h6vqoF7Y
>
> This doesn't show any badness afaics.
>
>> This is running 4.14 from debian bullseye (testing).
>>
>> I'm also attaching the dmesg output when booting xen 4.8 with the same
>> kernel version and same parameters.
>>
>> I visually compared all the messages, and the only thing I noticed was that
>> 4.14 used tsc as clocksource and 4.8 used xen. I tried to boot the kernel
>> with "clocksource=xen" and the problem is happening with that also.
>
> There's some confusion here I suppose: The clock source you talk
> about is the kernel's, not Xen's. I didn't think this would
> change for the same kernel version with different Xen underneath,
> but the Linux maintainers of the Xen code there may know better.
> Cc-ing them.

This might depend on CPUID bits given to dom0 by Xen, e.g. regarding
TSC stability.

>
>> The "start" of the problem is that when the kernel gets to the "Freeing
>> unused kernel image (initmem) memory: 2380K" it hangs and stays there for a
>> while. After a few minutes it shows that a process (swapper) is blocked for
>> sometime (image attached)
>
> Now that's pretty unusual - the call trace seen in the screen
> shot you had attached indicates the kernel didn't even make it
> past its own initialization just yet. Just to have explored that
> possibility - could you enable Xen's NMI watchdog (simply
> "watchdog" on the Xen command line)? Among the boot messages
> there ought to be one indicating whether it actually works on
> your system. Without a serial console you wouldn't see anything
> if it triggers, but the system would then never make it to the
> kernel side issue.
>
> As far as making sure we at least see all kernel messages -
> are you having "ignore_loglevel" in place? I don't think I've
> been able to spot the kernel command line anywhere in the video.
>
> I'm afraid there's no real way around seeing the full Xen
> messages, i.e. including possible ones while Dom0 already boots
> (and allowing some debug keys to be issued, as the rcu_barrier
> on the stack may suggest there's an issue with one of the
> secondary CPUs). You could try whether "vag=keep" on the Xen
> command line allows you to see more, but this option may have
> quite severe an effect on the timing of Dom0's booting, which
> may make an already bad situation worse.
>
> Alternatively the kernel may need instrumenting to figure what
> exactly it is that prevent forward progress.
>
> There's one other wild guess you may want to try: "cpuidle=no"
> on the Xen command line.

Other wild guesses are:

- add "sched=credit" to the Xen command line

or

- add "xen.fifo_events=0" to the dom0 command line


Juergen
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
Em qua., 20 de jan. de 2021 às 12:13, Jürgen Groß <jgross@suse.com> escreveu:
>
> On 20.01.21 09:50, Jan Beulich wrote:
> > On 19.01.2021 20:36, Claudemir Todo Bom wrote:
> >> I do not have serial output on this setup, so I recorded a video with
> >> boot_delay=50 in order to be able to get all the kernel messages:
> >> https://youtu.be/y95h6vqoF7Y
> >
> > This doesn't show any badness afaics.
> >
> >> This is running 4.14 from debian bullseye (testing).
> >>
> >> I'm also attaching the dmesg output when booting xen 4.8 with the same
> >> kernel version and same parameters.
> >>
> >> I visually compared all the messages, and the only thing I noticed was that
> >> 4.14 used tsc as clocksource and 4.8 used xen. I tried to boot the kernel
> >> with "clocksource=xen" and the problem is happening with that also.
> >
> > There's some confusion here I suppose: The clock source you talk
> > about is the kernel's, not Xen's. I didn't think this would
> > change for the same kernel version with different Xen underneath,
> > but the Linux maintainers of the Xen code there may know better.
> > Cc-ing them.
>
> This might depend on CPUID bits given to dom0 by Xen, e.g. regarding
> TSC stability.

Will ignore this for now, I suppose it is not the cause of the problem.

> >> The "start" of the problem is that when the kernel gets to the "Freeing
> >> unused kernel image (initmem) memory: 2380K" it hangs and stays there for a
> >> while. After a few minutes it shows that a process (swapper) is blocked for
> >> sometime (image attached)
> >
> > Now that's pretty unusual - the call trace seen in the screen
> > shot you had attached indicates the kernel didn't even make it
> > past its own initialization just yet. Just to have explored that
> > possibility - could you enable Xen's NMI watchdog (simply
> > "watchdog" on the Xen command line)? Among the boot messages
> > there ought to be one indicating whether it actually works on
> > your system. Without a serial console you wouldn't see anything
> > if it triggers, but the system would then never make it to the
> > kernel side issue.

"watchdog" parameter changed nothing.

> > As far as making sure we at least see all kernel messages -
> > are you having "ignore_loglevel" in place? I don't think I've
> > been able to spot the kernel command line anywhere in the video.

I was using loglevel=7, since it is the maximum level according to
documentation, should be the same, but tested with "ignore_loglevel"
and it looks pretty similar.

> > I'm afraid there's no real way around seeing the full Xen
> > messages, i.e. including possible ones while Dom0 already boots
> > (and allowing some debug keys to be issued, as the rcu_barrier
> > on the stack may suggest there's an issue with one of the
> > secondary CPUs). You could try whether "vag=keep" on the Xen
> > command line allows you to see more, but this option may have
> > quite severe an effect on the timing of Dom0's booting, which
> > may make an already bad situation worse.

already used "vga=keep", no new information. Will try to enable a
serial output in order to debug more. Is there any parameters I could
give to Xen in order to it write more information on serial line while
the dom0 is booting on the screen?

> > Alternatively the kernel may need instrumenting to figure what
> > exactly it is that prevent forward progress.
> >
> > There's one other wild guess you may want to try: "cpuidle=no"
> > on the Xen command line.
> Other wild guesses are:
>
> - add "sched=credit" to the Xen command line
>
> or
>
> - add "xen.fifo_events=0" to the dom0 command line

all 3 suggestions changed nothing.

I noticed that Debian have a lot of distribution managed patches, so I
think that if I want to find exactly where after 4.8.5 the problem
started I will need to build Xen from sources ignoring debian helpers.

Best regards,
Claudemir
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 20.01.2021 21:13, Claudemir Todo Bom wrote:
> Em qua., 20 de jan. de 2021 às 12:13, Jürgen Groß <jgross@suse.com> escreveu:
>> On 20.01.21 09:50, Jan Beulich wrote:
>>> I'm afraid there's no real way around seeing the full Xen
>>> messages, i.e. including possible ones while Dom0 already boots
>>> (and allowing some debug keys to be issued, as the rcu_barrier
>>> on the stack may suggest there's an issue with one of the
>>> secondary CPUs). You could try whether "vag=keep" on the Xen
>>> command line allows you to see more, but this option may have
>>> quite severe an effect on the timing of Dom0's booting, which
>>> may make an already bad situation worse.
>
> already used "vga=keep", no new information. Will try to enable a
> serial output in order to debug more. Is there any parameters I could
> give to Xen in order to it write more information on serial line while
> the dom0 is booting on the screen?

You'll presumably want "console=" and "com<N>="; the exact values
to use depend on your system, so I will refer you to the command
line doc[1].

Jan

[1] https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
Em qua., 20 de jan. de 2021 às 12:13, Jürgen Groß <jgross@suse.com> escreveu:
>
> On 20.01.21 09:50, Jan Beulich wrote:
> > On 19.01.2021 20:36, Claudemir Todo Bom wrote:
> >> I do not have serial output on this setup, so I recorded a video with
> >> boot_delay=50 in order to be able to get all the kernel messages:
> >> https://youtu.be/y95h6vqoF7Y
> >
> > This doesn't show any badness afaics.
> >
> >> This is running 4.14 from debian bullseye (testing).
> >>
> >> I'm also attaching the dmesg output when booting xen 4.8 with the same
> >> kernel version and same parameters.
> >>
> >> I visually compared all the messages, and the only thing I noticed was that
> >> 4.14 used tsc as clocksource and 4.8 used xen. I tried to boot the kernel
> >> with "clocksource=xen" and the problem is happening with that also.
> >
> > There's some confusion here I suppose: The clock source you talk
> > about is the kernel's, not Xen's. I didn't think this would
> > change for the same kernel version with different Xen underneath,
> > but the Linux maintainers of the Xen code there may know better.
> > Cc-ing them.
>
> This might depend on CPUID bits given to dom0 by Xen, e.g. regarding
> TSC stability.
>

Looks like the CPUID changes I observed and wrote on the other
messages are another
problem I may end up with. I narrowed down the cause of the problem on
booting of dom0 with more than 1 core on the following commit:

https://github.com/xen-project/xen/commit/63e1d01b8fd948b3e0fa3beea494e407668aa43b

Code built from this commit doesn't boot, built from the parent of it, boots.

Now, there is something I can do on the command line to make it boots?
Or its needed to fix on the code?

Best regards,
Claudemir
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 23.01.2021 00:36, Claudemir Todo Bom wrote:
> Em qua., 20 de jan. de 2021 às 12:13, Jürgen Groß <jgross@suse.com> escreveu:
>>
>> On 20.01.21 09:50, Jan Beulich wrote:
>>> On 19.01.2021 20:36, Claudemir Todo Bom wrote:
>>>> I do not have serial output on this setup, so I recorded a video with
>>>> boot_delay=50 in order to be able to get all the kernel messages:
>>>> https://youtu.be/y95h6vqoF7Y
>>>
>>> This doesn't show any badness afaics.
>>>
>>>> This is running 4.14 from debian bullseye (testing).
>>>>
>>>> I'm also attaching the dmesg output when booting xen 4.8 with the same
>>>> kernel version and same parameters.
>>>>
>>>> I visually compared all the messages, and the only thing I noticed was that
>>>> 4.14 used tsc as clocksource and 4.8 used xen. I tried to boot the kernel
>>>> with "clocksource=xen" and the problem is happening with that also.
>>>
>>> There's some confusion here I suppose: The clock source you talk
>>> about is the kernel's, not Xen's. I didn't think this would
>>> change for the same kernel version with different Xen underneath,
>>> but the Linux maintainers of the Xen code there may know better.
>>> Cc-ing them.
>>
>> This might depend on CPUID bits given to dom0 by Xen, e.g. regarding
>> TSC stability.
>>
>
> Looks like the CPUID changes I observed and wrote on the other
> messages are another
> problem I may end up with. I narrowed down the cause of the problem on
> booting of dom0 with more than 1 core on the following commit:
>
> https://github.com/xen-project/xen/commit/63e1d01b8fd948b3e0fa3beea494e407668aa43b
>
> Code built from this commit doesn't boot, built from the parent of it, boots.

Odd.

> Now, there is something I can do on the command line to make it boots?
> Or its needed to fix on the code?

That's too early to ask. We first need to understand what's going
on. There are two things I'd like you to try: One is to use
"clocksource=tsc" on the Xen (not the kernel) command line, and
the other (without that option) is to try the debugging patch
below. Of course that patch is only going to be useful when you
can somehow record Xen's log messages up to the point where the
system hangs. (Both ideally on as new a Xen as you can arrange
for.)

Jan

--- unstable.orig/xen/arch/x86/time.c
+++ unstable/xen/arch/x86/time.c
@@ -1799,9 +1799,11 @@ static void time_calibration(void *unuse
cpumask_copy(&r.cpu_calibration_map, &cpu_online_map);

/* @wait=1 because we must wait for all cpus before freeing @r. */
+printk("TSC: %ps\n", time_calibration_rendezvous_fn);//temp
on_selected_cpus(&r.cpu_calibration_map,
time_calibration_rendezvous_fn,
&r, 1);
+printk("TSC: end rendezvous\n");//temp
}

static struct cpu_time_stamp ap_bringup_ref;
@@ -2043,6 +2045,7 @@ static int __init verify_tsc_reliability
* While with constant-rate TSCs the scale factor can be shared, when TSCs
* are not marked as 'reliable', re-sync during rendezvous.
*/
+printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
!boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
@@ -2056,6 +2059,7 @@ int __init init_xen_time(void)
{
tsc_check_writability();

+printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);

/* NB. get_wallclock_time() can take over one second to execute. */
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 25.01.2021 20:37, Claudemir Todo Bom wrote:
> I've managed to get the debug messages on the screen using
> vga=text-80x50,keep and disabling all messages from the kernel. Two
> images are attached with the output running the debug patch.

And the 1st of them (161303) was taken at the time of the hang of
the kernel (or entire system), not any earlier? I ask because one
part of the reason for the patch was to understand whether the
rendezvousing itself would fail in some way (like one of the CPUs
not calling in).

Were new log messages (from the debugging patch) still issued at
this point, showing Xen itself was still alive?

The 2nd of the pictures (162313) at least clarifies that indeed
the commit in question had a functional effect on this system,
because of

(XEN) TSC warp detected, disabling TSC_RELIABLE

I still can't figure though why the change in rendezvous handling
(from "std" to "tsc") would have broken your system.

> About the version I've used to test: since the 4.14 shows that other
> bug with the detection of cpu features I mentioned on the other
> subthread, I chose to work on 4.11 that doesn't shows that behaviour.
>
> Calling with clocksource on the xen command line changed nothing.

Oh, right, because the specific feature that causes the change
of rendezvous functions for you also is a prereq for that mode
of operation.

> I don't know if this part of code is intended to execute a lot of
> times, but when starting with dom0_max_vcpus=1, the system boots up
> and keeps showing the messages.

When there's just one CPU, there's no CPU to rendezvous with.

Iirc you did say that you observe the hang even with as little
as 2 CPUs? The problem the above quoted message is supposed to
address is normally coming into play only on multi-socket
systems. Yet from your initial report I deduce this is a
single socket system. So in the end I suppose there are two
problems - one is the hang, and the other is that your system
gets diagnosed as having an unreliable TSC (at least I didn't
think Xeon E5 v2 should have a problem there).

I will want to extend the debugging patch, but I'd like to
have clarification on some of the points above first.

Jan
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
Em ter., 26 de jan. de 2021 às 08:48, Jan Beulich <jbeulich@suse.com> escreveu:
>
> On 25.01.2021 20:37, Claudemir Todo Bom wrote:
> > I've managed to get the debug messages on the screen using
> > vga=text-80x50,keep and disabling all messages from the kernel. Two
> > images are attached with the output running the debug patch.
>
> And the 1st of them (161303) was taken at the time of the hang of
> the kernel (or entire system), not any earlier? I ask because one
> part of the reason for the patch was to understand whether the
> rendezvousing itself would fail in some way (like one of the CPUs
> not calling in).

I could not tell if it already hung when I took the picture, but I can
tell the messages keep appearing after the hang. I tested this
enabling log messages... the screen became a mess, but I can assure
that the rendezvous function is being run and completed multiple times
after the "freeing memory" message that freezes the kernel.

> Were new log messages (from the debugging patch) still issued at
> this point, showing Xen itself was still alive?
>
> The 2nd of the pictures (162313) at least clarifies that indeed
> the commit in question had a functional effect on this system,
> because of
>
> (XEN) TSC warp detected, disabling TSC_RELIABLE
>
> I still can't figure though why the change in rendezvous handling
> (from "std" to "tsc") would have broken your system.
>
> > About the version I've used to test: since the 4.14 shows that other
> > bug with the detection of cpu features I mentioned on the other
> > subthread, I chose to work on 4.11 that doesn't shows that behaviour.
> >
> > Calling with clocksource on the xen command line changed nothing.
>
> Oh, right, because the specific feature that causes the change
> of rendezvous functions for you also is a prereq for that mode
> of operation.

Oh, this should be why reverting the code on 4.14 didn't work...
probably messed up with features introduced after 4.11.

> > I don't know if this part of code is intended to execute a lot of
> > times, but when starting with dom0_max_vcpus=1, the system boots up
> > and keeps showing the messages.
>
> When there's just one CPU, there's no CPU to rendezvous with.
>
> Iirc you did say that you observe the hang even with as little
> as 2 CPUs? The problem the above quoted message is supposed to
> address is normally coming into play only on multi-socket
> systems. Yet from your initial report I deduce this is a
> single socket system. So in the end I suppose there are two
> problems - one is the hang, and the other is that your system
> gets diagnosed as having an unreliable TSC (at least I didn't
> think Xeon E5 v2 should have a problem there).

It is a single socket, I was talking about virtual cpus for domain 0.

After the last tests I tried to boot it with maxcpus=1 parameter on
the xen command line. This changed the rendezvous code to std and the
system worked on all versions up to 4.14.

Is there any performance issue on using this parameter and this "std"
rendezvous code?

> I will want to extend the debugging patch, but I'd like to
> have clarification on some of the points above first.

If this information is good for more tests, please send the patch and
I will test it!

Best regards,
Claudemir
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 26.01.2021 14:03, Claudemir Todo Bom wrote:
> If this information is good for more tests, please send the patch and
> I will test it!

Here you go. For simplifying analysis it may be helpful if you
could limit the number of CPUs in use, e.g. by "maxcpus=4" or
at least "smt=0". Provided the problem still reproduces with
such options, of course.

Jan

--- unstable.orig/xen/arch/x86/time.c
+++ unstable/xen/arch/x86/time.c
@@ -1574,6 +1574,12 @@ static void local_time_calibration(void)
* TSC Reliability check
*/

+static struct {//temp
+ unsigned cpu;
+ signed iter;
+ cycles_t prev, now;
+} check_log[NR_CPUS + 4];
+static unsigned check_idx;//temp
/*
* The Linux original version of this function is
* Copyright (c) 2006, Red Hat, Inc., Ingo Molnar
@@ -1582,6 +1588,7 @@ static void check_tsc_warp(unsigned long
{
static DEFINE_SPINLOCK(sync_lock);
static cycles_t last_tsc;
+unsigned idx, cpu = smp_processor_id();//temp

cycles_t start, now, prev, end;
int i;
@@ -1592,6 +1599,15 @@ static void check_tsc_warp(unsigned long
end = start + tsc_khz * 20ULL;
now = start;

+{//temp
+ spin_lock(&sync_lock);
+ idx = check_idx++;
+ check_log[idx].cpu = cpu;
+ check_log[idx].iter = -1;
+ check_log[idx].now = now;
+ spin_unlock(&sync_lock);
+}
+
for ( i = 0; ; i++ )
{
/*
@@ -1626,7 +1642,14 @@ static void check_tsc_warp(unsigned long
{
spin_lock(&sync_lock);
if ( *max_warp < prev - now )
+{//temp
*max_warp = prev - now;
+ idx = check_idx++;
+ check_log[idx].cpu = cpu;
+ check_log[idx].iter = i;
+ check_log[idx].prev = prev;
+ check_log[idx].now = now;
+}
spin_unlock(&sync_lock);
}
}
@@ -1663,6 +1686,12 @@ static void tsc_check_reliability(void)
cpu_relax();

spin_unlock(&lock);
+{//temp
+ unsigned i;
+ printk("CHK[%2u] %lx\n", cpu, tsc_max_warp);//temp
+ for(i = 0; i < ARRAY_SIZE(check_log) && check_log[i].now; ++i)
+ printk("chk[%4u] CPU%-2u %016lx %016lx #%d\n", i, check_log[i].cpu, check_log[i].prev, check_log[i].now, check_log[i].iter);
+}
}

/*
@@ -1677,6 +1706,7 @@ struct calibration_rendezvous {
u64 master_tsc_stamp;
};

+static bool rdzv_log;//temp
static void
time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
{
@@ -1686,6 +1716,7 @@ time_calibration_rendezvous_tail(const s
c->local_stime = get_s_time_fixed(c->local_tsc);
c->master_stime = r->master_stime;

+if(rdzv_log) printk("RDZV[%2u] t=%016lx(%016lx) s=%012lx(%012lx)\n", smp_processor_id(), c->local_tsc, r->master_tsc_stamp, c->local_stime, r->master_stime);//temp
raise_softirq(TIME_CALIBRATE_SOFTIRQ);
}

@@ -1699,6 +1730,7 @@ static void time_calibration_tsc_rendezv
struct calibration_rendezvous *r = _r;
unsigned int total_cpus = cpumask_weight(&r->cpu_calibration_map);

+if(rdzv_log) printk("RDZV[%2u] t=%016lx\n", smp_processor_id(), rdtsc_ordered());//temp
/* Loop to get rid of cache effects on TSC skew. */
for ( i = 4; i >= 0; i-- )
{
@@ -1788,6 +1820,12 @@ static void time_calibration(void *unuse
struct calibration_rendezvous r = {
.semaphore = ATOMIC_INIT(0)
};
+static unsigned long cnt, thr;//temp
+if(system_state > SYS_STATE_smp_boot && ++cnt > thr) {//temp
+ thr |= cnt;
+ printk("TSC: %ps\n", time_calibration_rendezvous_fn);
+ rdzv_log = true;
+}

if ( clocksource_is_tsc() )
{
@@ -1802,6 +1840,10 @@ static void time_calibration(void *unuse
on_selected_cpus(&r.cpu_calibration_map,
time_calibration_rendezvous_fn,
&r, 1);
+if(rdzv_log || system_state <= SYS_STATE_smp_boot) {//temp
+ rdzv_log = false;
+ printk("TSC: end rendezvous\n");
+}
}

static struct cpu_time_stamp ap_bringup_ref;
@@ -1898,6 +1940,7 @@ void init_percpu_time(void)
}
t->stamp.local_tsc = tsc;
t->stamp.local_stime = now;
+printk("INIT[%2u] t=%016lx s=%012lx m=%012lx\n", smp_processor_id(), tsc, now, t->stamp.master_stime);//temp
}

/*
@@ -2043,6 +2086,7 @@ static int __init verify_tsc_reliability
* While with constant-rate TSCs the scale factor can be shared, when TSCs
* are not marked as 'reliable', re-sync during rendezvous.
*/
+printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
!boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
@@ -2056,6 +2100,7 @@ int __init init_xen_time(void)
{
tsc_check_writability();

+printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);

/* NB. get_wallclock_time() can take over one second to execute. */
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 28.01.2021 10:47, Jan Beulich wrote:
> On 26.01.2021 14:03, Claudemir Todo Bom wrote:
>> If this information is good for more tests, please send the patch and
>> I will test it!
>
> Here you go. For simplifying analysis it may be helpful if you
> could limit the number of CPUs in use, e.g. by "maxcpus=4" or
> at least "smt=0". Provided the problem still reproduces with
> such options, of course.

Speaking of command line options - it doesn't look like you have
told us what else you have on the Xen command line, and without
a serial log this isn't visible (e.g. in your video).

Jan
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 28.01.2021 14:08, Claudemir Todo Bom wrote:
> Em qui., 28 de jan. de 2021 às 06:49, Jan Beulich <jbeulich@suse.com> escreveu:
>>
>> On 28.01.2021 10:47, Jan Beulich wrote:
>>> On 26.01.2021 14:03, Claudemir Todo Bom wrote:
>>>> If this information is good for more tests, please send the patch and
>>>> I will test it!
>>>
>>> Here you go. For simplifying analysis it may be helpful if you
>>> could limit the number of CPUs in use, e.g. by "maxcpus=4" or
>>> at least "smt=0". Provided the problem still reproduces with
>>> such options, of course.
>>
>> Speaking of command line options - it doesn't look like you have
>> told us what else you have on the Xen command line, and without
>> a serial log this isn't visible (e.g. in your video).
>
> All tests are done with xen command line:
>
> dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true
> smt=false vga=text-80x50,keep
>
> and kernel command line:
>
> loglevel=0 earlyprintk=xen nomodeset
>
> this way I can get all xen messages on console.
>
> Attached are the frames I captured from a video, I manually selected
> them starting from the first readable frame.

Okay, so we seem to be hitting two previously noticed issues, neither
of which so far was necessary to address directly (because there was
always something else to be tweaked such that the problems went away).

For one, the boot CPU has a TSC value that lags by more than a
second compared to all secondary CPUs. The way
time_calibration_tsc_rendezvous() works, together with the way we
calculate system time from the TSC (get_s_time_fixed() - this is
where the known issue here is: the function breaks when trying to
scale a negative delta, hence the absurdly high s= values in the
screenshots you've provided), allows for small negative deltas
between CPUs, but expects to bring all CPUs' TSCs forward (i.e. over
the 1s interval between rendezvous' the lagging CPUs are assumed to
have made enough progress to be ahead of the more towards the future
timestamps on the previous run). Secondary lagging behind the boot
CPU more than this could also be dealt with, but on your system the
situation is the other way around.

Btw - what kind of BIOS do you have on this system? This way of the
TSCs being set is pretty odd, and must be - unless you run other
pre-boot software or an unusual boot loader - caused by the BIOS.

And then this points out (again, afaic at least) that the way we
kickstart the rendezvous handling is likely inappropriate.
Especially when TSCs are skewed like they are here, it is unhelpful
to launch Dom0 before having brought the TSC in sync. (Related to
this, I also don't think we should arm the respective timer before
AP bringup was done, or else we risk the first rendezvous instance
to not hit all CPUs.)

I'll work on addressing both, hoping that in particular for the
first one you'll be ready to help with testing (through perhaps
multiple iterations).

> As a sidenote, I managed to get the system working with the parameter
> "tsc=unstable", performance looks satisfactory but I do not know what
> problems I may end with this parameter.

I _think_ you'd be running into trouble if you removed dom0_vcpus_pin
(which imo really no-one should use without reporting a bug, despite
all the hits to the contrary that one gets when searching the web),
and if you ran any guests (PV at least) without pinning their vCPU-s
to pCPU-s.

Jan
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
Em sex., 29 de jan. de 2021 às 11:21, Jan Beulich <jbeulich@suse.com> escreveu:
>
> On 28.01.2021 14:08, Claudemir Todo Bom wrote:
> > Em qui., 28 de jan. de 2021 às 06:49, Jan Beulich <jbeulich@suse.com> escreveu:
> >>
> >> On 28.01.2021 10:47, Jan Beulich wrote:
> >>> On 26.01.2021 14:03, Claudemir Todo Bom wrote:
> >>>> If this information is good for more tests, please send the patch and
> >>>> I will test it!
> >>>
> >>> Here you go. For simplifying analysis it may be helpful if you
> >>> could limit the number of CPUs in use, e.g. by "maxcpus=4" or
> >>> at least "smt=0". Provided the problem still reproduces with
> >>> such options, of course.
> >>
> >> Speaking of command line options - it doesn't look like you have
> >> told us what else you have on the Xen command line, and without
> >> a serial log this isn't visible (e.g. in your video).
> >
> > All tests are done with xen command line:
> >
> > dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true
> > smt=false vga=text-80x50,keep
> >
> > and kernel command line:
> >
> > loglevel=0 earlyprintk=xen nomodeset
> >
> > this way I can get all xen messages on console.
> >
> > Attached are the frames I captured from a video, I manually selected
> > them starting from the first readable frame.
>
> Okay, so we seem to be hitting two previously noticed issues, neither
> of which so far was necessary to address directly (because there was
> always something else to be tweaked such that the problems went away).
>
> For one, the boot CPU has a TSC value that lags by more than a
> second compared to all secondary CPUs. The way
> time_calibration_tsc_rendezvous() works, together with the way we
> calculate system time from the TSC (get_s_time_fixed() - this is
> where the known issue here is: the function breaks when trying to
> scale a negative delta, hence the absurdly high s= values in the
> screenshots you've provided), allows for small negative deltas
> between CPUs, but expects to bring all CPUs' TSCs forward (i.e. over
> the 1s interval between rendezvous' the lagging CPUs are assumed to
> have made enough progress to be ahead of the more towards the future
> timestamps on the previous run). Secondary lagging behind the boot
> CPU more than this could also be dealt with, but on your system the
> situation is the other way around.
>
> Btw - what kind of BIOS do you have on this system? This way of the
> TSCs being set is pretty odd, and must be - unless you run other
> pre-boot software or an unusual boot loader - caused by the BIOS.

It is a generic mainboard acquired from china... it is very lame! I
was already thinking the big issue is the BIOS. Unfortunately I don't
know how to upgrade it.

> And then this points out (again, afaic at least) that the way we
> kickstart the rendezvous handling is likely inappropriate.
> Especially when TSCs are skewed like they are here, it is unhelpful
> to launch Dom0 before having brought the TSC in sync. (Related to
> this, I also don't think we should arm the respective timer before
> AP bringup was done, or else we risk the first rendezvous instance
> to not hit all CPUs.)
>
> I'll work on addressing both, hoping that in particular for the
> first one you'll be ready to help with testing (through perhaps
> multiple iterations).

I can help you a little more until end of next week. After that I will
move the host to another address and I will not have a quick "hands
on" access to it.

> > As a sidenote, I managed to get the system working with the parameter
> > "tsc=unstable", performance looks satisfactory but I do not know what
> > problems I may end with this parameter.
>
> I _think_ you'd be running into trouble if you removed dom0_vcpus_pin
> (which imo really no-one should use without reporting a bug, despite
> all the hits to the contrary that one gets when searching the web),
> and if you ran any guests (PV at least) without pinning their vCPU-s
> to pCPU-s.

just tested it without the cpu pin, it worked.

I stress-tested both dom0 and a pv guest with the "yes method"
described here:
https://linuxconfig.org/how-to-stress-test-your-cpu-on-linux.

Best regards,
Claudemir
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 28.01.2021 14:08, Claudemir Todo Bom wrote:
> Em qui., 28 de jan. de 2021 às 06:49, Jan Beulich <jbeulich@suse.com> escreveu:
>>
>> On 28.01.2021 10:47, Jan Beulich wrote:
>>> On 26.01.2021 14:03, Claudemir Todo Bom wrote:
>>>> If this information is good for more tests, please send the patch and
>>>> I will test it!
>>>
>>> Here you go. For simplifying analysis it may be helpful if you
>>> could limit the number of CPUs in use, e.g. by "maxcpus=4" or
>>> at least "smt=0". Provided the problem still reproduces with
>>> such options, of course.
>>
>> Speaking of command line options - it doesn't look like you have
>> told us what else you have on the Xen command line, and without
>> a serial log this isn't visible (e.g. in your video).
>
> All tests are done with xen command line:
>
> dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true
> smt=false vga=text-80x50,keep
>
> and kernel command line:
>
> loglevel=0 earlyprintk=xen nomodeset
>
> this way I can get all xen messages on console.
>
> Attached are the frames I captured from a video, I manually selected
> them starting from the first readable frame.

I've just sent a pair of patches, with you Cc-ed on the 2nd one.
Please give that one a try, with or without the updated debugging
patch below. In case of problems I'd of course want to see the
output from the debugging patch as well. I think it's up to you
whether you also use the first patch from that series - afaict it
shouldn't directly affect your case, but I may be wrong.

Thanks, Jan

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1558,6 +1558,12 @@ static void local_time_calibration(void)
* TSC Reliability check
*/

+static struct {//temp
+ unsigned cpu;
+ signed iter;
+ cycles_t prev, now;
+} check_log[NR_CPUS + 4];
+static unsigned check_idx;//temp
/*
* The Linux original version of this function is
* Copyright (c) 2006, Red Hat, Inc., Ingo Molnar
@@ -1566,6 +1572,7 @@ static void check_tsc_warp(unsigned long
{
static DEFINE_SPINLOCK(sync_lock);
static cycles_t last_tsc;
+unsigned idx, cpu = smp_processor_id();//temp

cycles_t start, now, prev, end;
int i;
@@ -1576,6 +1583,15 @@ static void check_tsc_warp(unsigned long
end = start + tsc_khz * 20ULL;
now = start;

+{//temp
+ spin_lock(&sync_lock);
+ idx = check_idx++;
+ check_log[idx].cpu = cpu;
+ check_log[idx].iter = -1;
+ check_log[idx].now = now;
+ spin_unlock(&sync_lock);
+}
+
for ( i = 0; ; i++ )
{
/*
@@ -1610,7 +1626,14 @@ static void check_tsc_warp(unsigned long
{
spin_lock(&sync_lock);
if ( *max_warp < prev - now )
+{//temp
*max_warp = prev - now;
+ idx = check_idx++;
+ check_log[idx].cpu = cpu;
+ check_log[idx].iter = i;
+ check_log[idx].prev = prev;
+ check_log[idx].now = now;
+}
spin_unlock(&sync_lock);
}
}
@@ -1647,6 +1670,12 @@ static void tsc_check_reliability(void)
cpu_relax();

spin_unlock(&lock);
+{//temp
+ unsigned i;
+ printk("CHK[%2u] %lx\n", cpu, tsc_max_warp);//temp
+ for(i = 0; i < ARRAY_SIZE(check_log) && check_log[i].now; ++i)
+ printk("chk[%4u] CPU%-2u %016lx %016lx #%d\n", i, check_log[i].cpu, check_log[i].prev, check_log[i].now, check_log[i].iter);
+}
}

/*
@@ -1661,6 +1690,7 @@ struct calibration_rendezvous {
uint64_t master_tsc_stamp, max_tsc_stamp;
};

+static bool rdzv_log;//temp
static void
time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
{
@@ -1670,6 +1700,7 @@ time_calibration_rendezvous_tail(const s
c->local_stime = get_s_time_fixed(c->local_tsc);
c->master_stime = r->master_stime;

+if(rdzv_log) printk("RDZV[%2u] t=%016lx(%016lx) s=%012lx(%012lx)\n", smp_processor_id(), c->local_tsc, r->master_tsc_stamp, c->local_stime, r->master_stime);//temp
raise_softirq(TIME_CALIBRATE_SOFTIRQ);
}

@@ -1682,7 +1713,9 @@ static void time_calibration_tsc_rendezv
int i;
struct calibration_rendezvous *r = _r;
unsigned int total_cpus = cpumask_weight(&r->cpu_calibration_map);
+uint64_t adj = 0;//temp

+if(rdzv_log) printk("RDZV[%2u] t=%016lx\n", smp_processor_id(), rdtsc_ordered());//temp
/* Loop to get rid of cache effects on TSC skew. */
for ( i = 4; i >= 0; i-- )
{
@@ -1706,6 +1739,7 @@ static void time_calibration_tsc_rendezv
*/
uint64_t delta = r->max_tsc_stamp - r->master_tsc_stamp;

+adj = delta;//temp
r->master_stime += scale_delta(delta,
&this_cpu(cpu_time).tsc_scale);
r->master_tsc_stamp = r->max_tsc_stamp;
@@ -1747,6 +1781,13 @@ static void time_calibration_tsc_rendezv
}

time_calibration_rendezvous_tail(r);
+if(adj) {//temp
+ static unsigned long cnt, thr;
+ if(++cnt > thr) {
+ thr |= cnt;
+ printk("TSC adjusted by %lx\n", adj);
+ }
+}
}

/* Ordinary rendezvous function which does not modify TSC values. */
@@ -1798,6 +1839,12 @@ static void time_calibration(void *unuse
struct calibration_rendezvous r = {
.semaphore = ATOMIC_INIT(0)
};
+static unsigned long cnt, thr;//temp
+if(system_state > SYS_STATE_smp_boot && ++cnt > thr) {//temp
+ thr |= cnt;
+ printk("TSC: %ps\n", time_calibration_rendezvous_fn);
+ rdzv_log = true;
+}

if ( clocksource_is_tsc() )
{
@@ -1812,6 +1859,10 @@ static void time_calibration(void *unuse
on_selected_cpus(&r.cpu_calibration_map,
time_calibration_rendezvous_fn,
&r, 1);
+if(rdzv_log || system_state <= SYS_STATE_smp_boot) {//temp
+ rdzv_log = false;
+ printk("TSC: end rendezvous\n");
+}
}

static struct cpu_time_stamp ap_bringup_ref;
@@ -1908,6 +1959,7 @@ void init_percpu_time(void)
}
t->stamp.local_tsc = tsc;
t->stamp.local_stime = now;
+printk("INIT[%2u] t=%016lx s=%012lx m=%012lx\n", smp_processor_id(), tsc, now, t->stamp.master_stime);//temp
}

/*
@@ -2050,6 +2102,7 @@ static int __init verify_tsc_reliability
* While with constant-rate TSCs the scale factor can be shared, when TSCs
* are not marked as 'reliable', re-sync during rendezvous.
*/
+printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
!boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
@@ -2065,6 +2118,7 @@ int __init init_xen_time(void)
{
tsc_check_writability();

+printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);

/* NB. get_wallclock_time() can take over one second to execute. */
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
Em sex., 29 de jan. de 2021 às 13:24, Jan Beulich <jbeulich@suse.com> escreveu:
>
> On 28.01.2021 14:08, Claudemir Todo Bom wrote:
> > Em qui., 28 de jan. de 2021 às 06:49, Jan Beulich <jbeulich@suse.com> escreveu:
> >>
> >> On 28.01.2021 10:47, Jan Beulich wrote:
> >>> On 26.01.2021 14:03, Claudemir Todo Bom wrote:
> >>>> If this information is good for more tests, please send the patch and
> >>>> I will test it!
> >>>
> >>> Here you go. For simplifying analysis it may be helpful if you
> >>> could limit the number of CPUs in use, e.g. by "maxcpus=4" or
> >>> at least "smt=0". Provided the problem still reproduces with
> >>> such options, of course.
> >>
> >> Speaking of command line options - it doesn't look like you have
> >> told us what else you have on the Xen command line, and without
> >> a serial log this isn't visible (e.g. in your video).
> >
> > All tests are done with xen command line:
> >
> > dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true
> > smt=false vga=text-80x50,keep
> >
> > and kernel command line:
> >
> > loglevel=0 earlyprintk=xen nomodeset
> >
> > this way I can get all xen messages on console.
> >
> > Attached are the frames I captured from a video, I manually selected
> > them starting from the first readable frame.
>
> I've just sent a pair of patches, with you Cc-ed on the 2nd one.
> Please give that one a try, with or without the updated debugging
> patch below. In case of problems I'd of course want to see the
> output from the debugging patch as well. I think it's up to you
> whether you also use the first patch from that series - afaict it
> shouldn't directly affect your case, but I may be wrong.

I've applied both patches, system didn't booted, used following parameters:

xen: dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true smt=true
kernel: loglevel=3

The screen cleared right after the initial xen messages and frozen
there for a few minutes until I restarted the system.

I've added "vga=text-80x25,keep" to the xen command line and
"nomodeset" to the kernel command line, hoping to get some more info
and surprisingly this was sufficient to make system boot!

System prompt took a lot to appear, the kernel driver for the usb
keyboard loaded after 3 minutes and the driver for the usb wifi dongle
I am using loaded about five minutes after kernel boot, and I had to
issue "ifup -a" to get an ip address from the dhcp server, and it took
almost one minute to get it!

Also, the system became very sensible for the keyboard, a lot of
double-typing happening with it, console beeps keeps sounding longer
also.

Then with the debug patch, since it booted I was able to get the "xl
dmesg" output with all the debug messages, it is attached. I'm also
attaching the dmesg output also, but I do not know if it is useful in
any form. Kernel started to boot at stamp ~20 of the xen hypervisor.

Hope all of this helps!
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 29.01.2021 20:31, Claudemir Todo Bom wrote:
> Em sex., 29 de jan. de 2021 às 13:24, Jan Beulich <jbeulich@suse.com> escreveu:
>>
>> On 28.01.2021 14:08, Claudemir Todo Bom wrote:
>>> Em qui., 28 de jan. de 2021 às 06:49, Jan Beulich <jbeulich@suse.com> escreveu:
>>>>
>>>> On 28.01.2021 10:47, Jan Beulich wrote:
>>>>> On 26.01.2021 14:03, Claudemir Todo Bom wrote:
>>>>>> If this information is good for more tests, please send the patch and
>>>>>> I will test it!
>>>>>
>>>>> Here you go. For simplifying analysis it may be helpful if you
>>>>> could limit the number of CPUs in use, e.g. by "maxcpus=4" or
>>>>> at least "smt=0". Provided the problem still reproduces with
>>>>> such options, of course.
>>>>
>>>> Speaking of command line options - it doesn't look like you have
>>>> told us what else you have on the Xen command line, and without
>>>> a serial log this isn't visible (e.g. in your video).
>>>
>>> All tests are done with xen command line:
>>>
>>> dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true
>>> smt=false vga=text-80x50,keep
>>>
>>> and kernel command line:
>>>
>>> loglevel=0 earlyprintk=xen nomodeset
>>>
>>> this way I can get all xen messages on console.
>>>
>>> Attached are the frames I captured from a video, I manually selected
>>> them starting from the first readable frame.
>>
>> I've just sent a pair of patches, with you Cc-ed on the 2nd one.
>> Please give that one a try, with or without the updated debugging
>> patch below. In case of problems I'd of course want to see the
>> output from the debugging patch as well. I think it's up to you
>> whether you also use the first patch from that series - afaict it
>> shouldn't directly affect your case, but I may be wrong.
>
> I've applied both patches, system didn't booted, used following parameters:
>
> xen: dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true smt=true
> kernel: loglevel=3
>
> The screen cleared right after the initial xen messages and frozen
> there for a few minutes until I restarted the system.
>
> I've added "vga=text-80x25,keep" to the xen command line and
> "nomodeset" to the kernel command line, hoping to get some more info
> and surprisingly this was sufficient to make system boot!

Odd, but as per my reply to the patch submission itself a
few minutes ago, over the weekend I realized a flaw. I do
think this explains the anomalies seen from the log between
CPU0 and and all other CPUs; the problem merely isn't as
severe anymore as it was before as it seems. I also did
realize I ought to be able to mimic your system's behavior;
if so I ought to be able to send out an updated series that
actually had some testing for the specific case. Later
today, hopefully.

Jan
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 29.01.2021 20:31, Claudemir Todo Bom wrote:
> I've applied both patches, system didn't booted, used following parameters:
>
> xen: dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true smt=true
> kernel: loglevel=3
>
> The screen cleared right after the initial xen messages and frozen
> there for a few minutes until I restarted the system.
>
> I've added "vga=text-80x25,keep" to the xen command line and
> "nomodeset" to the kernel command line, hoping to get some more info
> and surprisingly this was sufficient to make system boot!
>
> System prompt took a lot to appear, the kernel driver for the usb
> keyboard loaded after 3 minutes and the driver for the usb wifi dongle
> I am using loaded about five minutes after kernel boot, and I had to
> issue "ifup -a" to get an ip address from the dhcp server, and it took
> almost one minute to get it!

I was able to repro this behavior, by deliberately screwing up
CPU0's TSC early during boot. This of course did make it a lot
easier to find and fix the problem. I've Cc-ed you on the full
3-patch series that I've sent a minute ago, because while you
may continue to opt for ignoring the first patch, you'll now
need the latter two. And as before, the updated debugging patch
below.

Jan

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1558,6 +1558,12 @@ static void local_time_calibration(void)
* TSC Reliability check
*/

+static struct {//temp
+ unsigned cpu;
+ signed iter;
+ cycles_t prev, now;
+} check_log[NR_CPUS + 4];
+static unsigned check_idx;//temp
/*
* The Linux original version of this function is
* Copyright (c) 2006, Red Hat, Inc., Ingo Molnar
@@ -1566,6 +1572,7 @@ static void check_tsc_warp(unsigned long
{
static DEFINE_SPINLOCK(sync_lock);
static cycles_t last_tsc;
+unsigned idx, cpu = smp_processor_id();//temp

cycles_t start, now, prev, end;
int i;
@@ -1576,6 +1583,15 @@ static void check_tsc_warp(unsigned long
end = start + tsc_khz * 20ULL;
now = start;

+{//temp
+ spin_lock(&sync_lock);
+ idx = check_idx++;
+ check_log[idx].cpu = cpu;
+ check_log[idx].iter = -1;
+ check_log[idx].now = now;
+ spin_unlock(&sync_lock);
+}
+
for ( i = 0; ; i++ )
{
/*
@@ -1610,7 +1626,14 @@ static void check_tsc_warp(unsigned long
{
spin_lock(&sync_lock);
if ( *max_warp < prev - now )
+{//temp
*max_warp = prev - now;
+ idx = check_idx++;
+ check_log[idx].cpu = cpu;
+ check_log[idx].iter = i;
+ check_log[idx].prev = prev;
+ check_log[idx].now = now;
+}
spin_unlock(&sync_lock);
}
}
@@ -1647,6 +1670,12 @@ static void tsc_check_reliability(void)
cpu_relax();

spin_unlock(&lock);
+{//temp
+ unsigned i;
+ printk("CHK[%2u] %lx\n", cpu, tsc_max_warp);//temp
+ for(i = 0; i < ARRAY_SIZE(check_log) && check_log[i].now; ++i)
+ printk("chk[%4u] CPU%-2u %016lx %016lx #%d\n", i, check_log[i].cpu, check_log[i].prev, check_log[i].now, check_log[i].iter);
+}
}

/*
@@ -1661,6 +1690,7 @@ struct calibration_rendezvous {
uint64_t master_tsc_stamp, max_tsc_stamp;
};

+static bool rdzv_log;//temp
static void
time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
uint64_t old_tsc, uint64_t new_tsc)
@@ -1671,6 +1701,7 @@ time_calibration_rendezvous_tail(const s
c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc);
c->master_stime = r->master_stime;

+if(rdzv_log) printk("RDZV[%2u] t=%016lx(%016lx) s=%012lx(%012lx)\n", smp_processor_id(), c->local_tsc, r->master_tsc_stamp, c->local_stime, r->master_stime);//temp
raise_softirq(TIME_CALIBRATE_SOFTIRQ);
}

@@ -1684,7 +1715,9 @@ static void time_calibration_tsc_rendezv
struct calibration_rendezvous *r = _r;
unsigned int total_cpus = cpumask_weight(&r->cpu_calibration_map);
uint64_t tsc = 0;
+uint64_t adj = 0;//temp

+if(rdzv_log) printk("RDZV[%2u] t=%016lx\n", smp_processor_id(), rdtsc_ordered());//temp
/* Loop to get rid of cache effects on TSC skew. */
for ( i = 4; i >= 0; i-- )
{
@@ -1701,6 +1734,7 @@ static void time_calibration_tsc_rendezv
* Use the largest value observed anywhere on the first
* iteration.
*/
+adj = r->max_tsc_stamp - r->master_tsc_stamp,//temp
r->master_tsc_stamp = r->max_tsc_stamp;
else if ( i == 0 )
r->master_stime = read_platform_stime(NULL);
@@ -1743,6 +1777,13 @@ static void time_calibration_tsc_rendezv
}

time_calibration_rendezvous_tail(r, tsc, r->master_tsc_stamp);
+if(adj) {//temp
+ static unsigned long cnt, thr;
+ if(++cnt > thr) {
+ thr |= cnt;
+ printk("TSC adjusted by %lx\n", adj);
+ }
+}
}

/* Ordinary rendezvous function which does not modify TSC values. */
@@ -1794,6 +1835,12 @@ static void time_calibration(void *unuse
struct calibration_rendezvous r = {
.semaphore = ATOMIC_INIT(0)
};
+static unsigned long cnt, thr;//temp
+if(++cnt > thr) {//temp
+ thr |= cnt;
+ printk("TSC: %ps\n", time_calibration_rendezvous_fn);
+ rdzv_log = true;
+}

if ( clocksource_is_tsc() )
{
@@ -1808,6 +1855,10 @@ static void time_calibration(void *unuse
on_selected_cpus(&r.cpu_calibration_map,
time_calibration_rendezvous_fn,
&r, 1);
+if(rdzv_log) {//temp
+ rdzv_log = false;
+ printk("TSC: end rendezvous\n");
+}
}

static struct cpu_time_stamp ap_bringup_ref;
@@ -1904,6 +1955,7 @@ void init_percpu_time(void)
}
t->stamp.local_tsc = tsc;
t->stamp.local_stime = now;
+printk("INIT[%2u] t=%016lx s=%012lx m=%012lx\n", smp_processor_id(), tsc, now, t->stamp.master_stime);//temp
}

/*
@@ -2046,6 +2098,7 @@ static int __init verify_tsc_reliability
* While with constant-rate TSCs the scale factor can be shared, when TSCs
* are not marked as 'reliable', re-sync during rendezvous.
*/
+printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
!boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
@@ -2061,6 +2114,7 @@ int __init init_xen_time(void)
{
tsc_check_writability();

+printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);

/* NB. get_wallclock_time() can take over one second to execute. */
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
Tested first without the debug patch and with following parameters:

xen: dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true smt=true
kernel: loglevel=3

same behaviour as before... black screen right after the xen messages.

adding earlyprintk=xen to the kernel command line is sufficient to
make it boot, I can imagine this can be happening because Xen is not
releasing console to the kernel at that moment.

The system worked well (with earlyprintk=xen), tested with the "yes
stress test" mentioned before on a guest and on dom0.

Then, I installed the debug patch and booted it again, it also needed
the earlyprintk=xen parameter on the kernel command line. I've also
added console_timestamps=boot to the xen command line in order to get
the time of the messages.

I'm attaching the outputs of "xl dmesg" and "dmesg" on this message.

Think it is almost done! WIll wait for the next round of tests!

Thank you very much!

Em seg., 1 de fev. de 2021 às 09:47, Jan Beulich <jbeulich@suse.com> escreveu:
>
> On 29.01.2021 20:31, Claudemir Todo Bom wrote:
> > I've applied both patches, system didn't booted, used following parameters:
> >
> > xen: dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true smt=true
> > kernel: loglevel=3
> >
> > The screen cleared right after the initial xen messages and frozen
> > there for a few minutes until I restarted the system.
> >
> > I've added "vga=text-80x25,keep" to the xen command line and
> > "nomodeset" to the kernel command line, hoping to get some more info
> > and surprisingly this was sufficient to make system boot!
> >
> > System prompt took a lot to appear, the kernel driver for the usb
> > keyboard loaded after 3 minutes and the driver for the usb wifi dongle
> > I am using loaded about five minutes after kernel boot, and I had to
> > issue "ifup -a" to get an ip address from the dhcp server, and it took
> > almost one minute to get it!
>
> I was able to repro this behavior, by deliberately screwing up
> CPU0's TSC early during boot. This of course did make it a lot
> easier to find and fix the problem. I've Cc-ed you on the full
> 3-patch series that I've sent a minute ago, because while you
> may continue to opt for ignoring the first patch, you'll now
> need the latter two. And as before, the updated debugging patch
> below.
>
> Jan
>
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1558,6 +1558,12 @@ static void local_time_calibration(void)
> * TSC Reliability check
> */
>
> +static struct {//temp
> + unsigned cpu;
> + signed iter;
> + cycles_t prev, now;
> +} check_log[NR_CPUS + 4];
> +static unsigned check_idx;//temp
> /*
> * The Linux original version of this function is
> * Copyright (c) 2006, Red Hat, Inc., Ingo Molnar
> @@ -1566,6 +1572,7 @@ static void check_tsc_warp(unsigned long
> {
> static DEFINE_SPINLOCK(sync_lock);
> static cycles_t last_tsc;
> +unsigned idx, cpu = smp_processor_id();//temp
>
> cycles_t start, now, prev, end;
> int i;
> @@ -1576,6 +1583,15 @@ static void check_tsc_warp(unsigned long
> end = start + tsc_khz * 20ULL;
> now = start;
>
> +{//temp
> + spin_lock(&sync_lock);
> + idx = check_idx++;
> + check_log[idx].cpu = cpu;
> + check_log[idx].iter = -1;
> + check_log[idx].now = now;
> + spin_unlock(&sync_lock);
> +}
> +
> for ( i = 0; ; i++ )
> {
> /*
> @@ -1610,7 +1626,14 @@ static void check_tsc_warp(unsigned long
> {
> spin_lock(&sync_lock);
> if ( *max_warp < prev - now )
> +{//temp
> *max_warp = prev - now;
> + idx = check_idx++;
> + check_log[idx].cpu = cpu;
> + check_log[idx].iter = i;
> + check_log[idx].prev = prev;
> + check_log[idx].now = now;
> +}
> spin_unlock(&sync_lock);
> }
> }
> @@ -1647,6 +1670,12 @@ static void tsc_check_reliability(void)
> cpu_relax();
>
> spin_unlock(&lock);
> +{//temp
> + unsigned i;
> + printk("CHK[%2u] %lx\n", cpu, tsc_max_warp);//temp
> + for(i = 0; i < ARRAY_SIZE(check_log) && check_log[i].now; ++i)
> + printk("chk[%4u] CPU%-2u %016lx %016lx #%d\n", i, check_log[i].cpu, check_log[i].prev, check_log[i].now, check_log[i].iter);
> +}
> }
>
> /*
> @@ -1661,6 +1690,7 @@ struct calibration_rendezvous {
> uint64_t master_tsc_stamp, max_tsc_stamp;
> };
>
> +static bool rdzv_log;//temp
> static void
> time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
> uint64_t old_tsc, uint64_t new_tsc)
> @@ -1671,6 +1701,7 @@ time_calibration_rendezvous_tail(const s
> c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc);
> c->master_stime = r->master_stime;
>
> +if(rdzv_log) printk("RDZV[%2u] t=%016lx(%016lx) s=%012lx(%012lx)\n", smp_processor_id(), c->local_tsc, r->master_tsc_stamp, c->local_stime, r->master_stime);//temp
> raise_softirq(TIME_CALIBRATE_SOFTIRQ);
> }
>
> @@ -1684,7 +1715,9 @@ static void time_calibration_tsc_rendezv
> struct calibration_rendezvous *r = _r;
> unsigned int total_cpus = cpumask_weight(&r->cpu_calibration_map);
> uint64_t tsc = 0;
> +uint64_t adj = 0;//temp
>
> +if(rdzv_log) printk("RDZV[%2u] t=%016lx\n", smp_processor_id(), rdtsc_ordered());//temp
> /* Loop to get rid of cache effects on TSC skew. */
> for ( i = 4; i >= 0; i-- )
> {
> @@ -1701,6 +1734,7 @@ static void time_calibration_tsc_rendezv
> * Use the largest value observed anywhere on the first
> * iteration.
> */
> +adj = r->max_tsc_stamp - r->master_tsc_stamp,//temp
> r->master_tsc_stamp = r->max_tsc_stamp;
> else if ( i == 0 )
> r->master_stime = read_platform_stime(NULL);
> @@ -1743,6 +1777,13 @@ static void time_calibration_tsc_rendezv
> }
>
> time_calibration_rendezvous_tail(r, tsc, r->master_tsc_stamp);
> +if(adj) {//temp
> + static unsigned long cnt, thr;
> + if(++cnt > thr) {
> + thr |= cnt;
> + printk("TSC adjusted by %lx\n", adj);
> + }
> +}
> }
>
> /* Ordinary rendezvous function which does not modify TSC values. */
> @@ -1794,6 +1835,12 @@ static void time_calibration(void *unuse
> struct calibration_rendezvous r = {
> .semaphore = ATOMIC_INIT(0)
> };
> +static unsigned long cnt, thr;//temp
> +if(++cnt > thr) {//temp
> + thr |= cnt;
> + printk("TSC: %ps\n", time_calibration_rendezvous_fn);
> + rdzv_log = true;
> +}
>
> if ( clocksource_is_tsc() )
> {
> @@ -1808,6 +1855,10 @@ static void time_calibration(void *unuse
> on_selected_cpus(&r.cpu_calibration_map,
> time_calibration_rendezvous_fn,
> &r, 1);
> +if(rdzv_log) {//temp
> + rdzv_log = false;
> + printk("TSC: end rendezvous\n");
> +}
> }
>
> static struct cpu_time_stamp ap_bringup_ref;
> @@ -1904,6 +1955,7 @@ void init_percpu_time(void)
> }
> t->stamp.local_tsc = tsc;
> t->stamp.local_stime = now;
> +printk("INIT[%2u] t=%016lx s=%012lx m=%012lx\n", smp_processor_id(), tsc, now, t->stamp.master_stime);//temp
> }
>
> /*
> @@ -2046,6 +2098,7 @@ static int __init verify_tsc_reliability
> * While with constant-rate TSCs the scale factor can be shared, when TSCs
> * are not marked as 'reliable', re-sync during rendezvous.
> */
> +printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
> if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
> time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
> @@ -2061,6 +2114,7 @@ int __init init_xen_time(void)
> {
> tsc_check_writability();
>
> +printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
> open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);
>
> /* NB. get_wallclock_time() can take over one second to execute. */
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 01.02.2021 15:46, Claudemir Todo Bom wrote:
> Tested first without the debug patch and with following parameters:

And this test was all three of the non-debugging patches?

> xen: dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true smt=true
> kernel: loglevel=3
>
> same behaviour as before... black screen right after the xen messages.
>
> adding earlyprintk=xen to the kernel command line is sufficient to
> make it boot, I can imagine this can be happening because Xen is not
> releasing console to the kernel at that moment.

If the answer to the above question is "yes", then I start
suspecting this to be a different problem. I'm not sure I
see a way to debug this without having access to any output
(i.e. neither video nor serial). Without "earlyprintk=xen"
and instead with "vga=keep watchdog" on the Xen command
line, is there anything helpful (without or if need be with
the debugging patch in place)?

> The system worked well (with earlyprintk=xen), tested with the "yes
> stress test" mentioned before on a guest and on dom0.
>
> Then, I installed the debug patch and booted it again, it also needed
> the earlyprintk=xen parameter on the kernel command line. I've also
> added console_timestamps=boot to the xen command line in order to get
> the time of the messages.
>
> I'm attaching the outputs of "xl dmesg" and "dmesg" on this message.
>
> Think it is almost done! WIll wait for the next round of tests!

As per above, not sure if there's going to be one. Thanks
for your patient testing!

Jan
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
Em seg., 1 de fev. de 2021 às 12:09, Jan Beulich <jbeulich@suse.com> escreveu:
>
> On 01.02.2021 15:46, Claudemir Todo Bom wrote:
> > Tested first without the debug patch and with following parameters:
>
> And this test was all three of the non-debugging patches?

Yes, all three patches.

> > xen: dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true smt=true
> > kernel: loglevel=3
> >
> > same behaviour as before... black screen right after the xen messages.
> >
> > adding earlyprintk=xen to the kernel command line is sufficient to
> > make it boot, I can imagine this can be happening because Xen is not
> > releasing console to the kernel at that moment.
>
> If the answer to the above question is "yes", then I start
> suspecting this to be a different problem. I'm not sure I
> see a way to debug this without having access to any output
> (i.e. neither video nor serial). Without "earlyprintk=xen"
> and instead with "vga=keep watchdog" on the Xen command
> line, is there anything helpful (without or if need be with
> the debugging patch in place)?

with "vga=text-80x25,keep watchdog" and without the earlyprintk,
system booted. I'm attaching the "xl dmesg" and "dmesg" outputs here.

>
> > The system worked well (with earlyprintk=xen), tested with the "yes
> > stress test" mentioned before on a guest and on dom0.
> >
> > Then, I installed the debug patch and booted it again, it also needed
> > the earlyprintk=xen parameter on the kernel command line. I've also
> > added console_timestamps=boot to the xen command line in order to get
> > the time of the messages.
> >
> > I'm attaching the outputs of "xl dmesg" and "dmesg" on this message.
> >
> > Think it is almost done! WIll wait for the next round of tests!
>
> As per above, not sure if there's going to be one. Thanks
> for your patient testing!

I can live with the "earlyprintk=xen" or any other solution that makes
it boot. I'm pretty sure this problem will appear in tests of other
people with a setup that helps debug deeper.

Thank you for your work!

Best regards,
Claudemir
Re: Problems with APIC on versions 4.9 and later (4.8 works) [ In reply to ]
On 01.02.2021 16:26, Claudemir Todo Bom wrote:
> Em seg., 1 de fev. de 2021 às 12:09, Jan Beulich <jbeulich@suse.com> escreveu:
>>
>> On 01.02.2021 15:46, Claudemir Todo Bom wrote:
>>> Tested first without the debug patch and with following parameters:
>>
>> And this test was all three of the non-debugging patches?
>
> Yes, all three patches.
>
>>> xen: dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true smt=true
>>> kernel: loglevel=3
>>>
>>> same behaviour as before... black screen right after the xen messages.
>>>
>>> adding earlyprintk=xen to the kernel command line is sufficient to
>>> make it boot, I can imagine this can be happening because Xen is not
>>> releasing console to the kernel at that moment.
>>
>> If the answer to the above question is "yes", then I start
>> suspecting this to be a different problem. I'm not sure I
>> see a way to debug this without having access to any output
>> (i.e. neither video nor serial). Without "earlyprintk=xen"
>> and instead with "vga=keep watchdog" on the Xen command
>> line, is there anything helpful (without or if need be with
>> the debugging patch in place)?
>
> with "vga=text-80x25,keep watchdog" and without the earlyprintk,
> system booted.

Well, you clearly don't want to keep "vga=keep". There has to
be something that's still going wrong, but this may now be a
kernel side issue. In the logs you provided I couldn't spot
anything odd, but these were from working cases after all. So
as said, for now I'm lost, and you may need to live with some
form of workaround (which you've said you're okay with).

Jan