Random Hangs At Boot Around setup_local_APIC(void): ExtINT on CPU#
A few days ago I posted to this list about a problem I was having with
an Intel Atom-based computer with UEFI and Xen, under the subject
"Boot Sometimes Hangs At "masked EXTINT" (Varies)".  I am experiencing
random hangs at start-up before the message

"(XEN) [...] Brought up 8 CPUs"

and after the message

"(XEN) [...] HVM: HAP page sizes: 4kB, 2MB".

Here is the section of a successful start-up log (which I can obtain
at random if I keep rebooting) where the hang occurs:

(XEN) [2019-03-04 22:43:05] HVM: Hardware Assisted Paging (HAP) detected
(XEN) [2019-03-04 22:43:05] HVM: HAP page sizes: 4kB, 2MB
(XEN) [2019-03-04 22:43:01] masked ExtINT on CPU#1
(XEN) [2019-03-04 22:43:01] masked ExtINT on CPU#2
(XEN) [2019-03-04 22:43:01] masked ExtINT on CPU#3
(XEN) [2019-03-04 22:43:01] masked ExtINT on CPU#4
(XEN) [2019-03-04 22:43:01] masked ExtINT on CPU#5
(XEN) [2019-03-04 22:43:01] masked ExtINT on CPU#6
(XEN) [2019-03-04 22:43:01] masked ExtINT on CPU#7
(XEN) [2019-03-04 22:43:06] Brought up 8 CPUs

Without altering anything (e.g. without touching my grub.cfg),
I can repeatedly try to launch the Xen
hypervisor and obtain inconsistent results.  Sometimes booting
hangs after printing "masked ExtINT on CPU#1", or #2, or any of #3
through #6, and sometimes it makes it through all of the "masked..." output
to a successful start-up.  This randomness leads me to believe
that something hardware-related has not been
accounted for by the software.
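
To compare attempts, the last CPU reported before a hang can be pulled
out of a captured console log.  The following is only a minimal sketch in C,
assuming the line format shown in the log above; the function names are
hypothetical and not part of Xen.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the CPU number from a line such as
 * "(XEN) [...] masked ExtINT on CPU#1", or -1 if the line does not
 * contain that message. */
int masked_extint_cpu(const char *line)
{
    static const char marker[] = "masked ExtINT on CPU#";
    const char *p = strstr(line, marker);
    int cpu;

    if (p == NULL)
        return -1;
    if (sscanf(p + strlen(marker), "%d", &cpu) != 1)
        return -1;
    return cpu;
}

/* Return the last CPU reported in a captured log, or -1 if no
 * "masked ExtINT" line was seen at all. */
int last_masked_cpu(FILE *fp)
{
    char line[512];
    int last = -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        int cpu = masked_extint_cpu(line);
        if (cpu >= 0)
            last = cpu;
    }
    return last;
}
```

Run over the log of a hung boot, last_masked_cpu() would give the CPU
after which output stopped.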

I could not find an official web page for the code that I could link to,
as I might were it on GitHub,
so I'll just have to provide references.  The print-out of
"masked ExtINT on CPU#..." occurs within apic.c in the function
setup_local_APIC(void), around line 646 (version 4.11.1?).

Oh, and I have tried Gentoo's Xen versions 4.10.2 and 4.11.1 and
see no difference in the errant behavior.
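
For readers without the source at hand, here is a rough stand-alone model in C
of the kind of decision that appears to surround that message: ExtINT
delivery on LVT0 (LINT0) is left unmasked on at most one CPU and masked on
all others.  This is a sketch, not a copy of the Xen code; the function name,
the parameter, and the assumption that CPU 0 is the boot processor are all
hypothetical.  Only the bit positions (delivery mode in bits 10:8, mask in
bit 16) are taken from the Intel manual.

```c
#include <stdbool.h>
#include <stdint.h>

#define LVT_DM_EXTINT (0x7u << 8)  /* delivery mode 111b = ExtINT */
#define LVT_MASKED    (1u << 16)   /* LVT mask bit */

/* Hypothetical model: the value to program into LVT0 for a given CPU.
 * Assumes CPU 0 is the boot processor and that it alone may accept
 * ExtINT interrupts from the 8259. */
uint32_t lvt0_for_cpu(unsigned int cpu, bool bsp_wants_extint)
{
    if (cpu == 0 && bsp_wants_extint)
        return LVT_DM_EXTINT;              /* ExtINT left unmasked */
    return LVT_DM_EXTINT | LVT_MASKED;     /* ExtINT masked */
}
```

Under this model every CPU from #1 upward ends up with the mask bit set,
which matches the one "masked ExtINT" line per secondary CPU in the log.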

I'm wondering whether the following post, which I found on LKML, bears on the issue:

======= start posting =====
I remember seeing something like this in the past and it turned out to be
a BIOS issue.  BIOS was enabling the APs to interact with the legacy 8259
interrupt controller when only the BSP should. During POST the APs were
exposed to ExtINT/INTR events as a result of the mis-configuration
(probably due to a UEFI timer-tick using the 8259) and this left a
pending ExtINT/INTR interrupt latched on the APs.

When the APs were started by the OS, the latched ExtINT/INTR interrupt is
processed shortly after the OS enables interrupts. The AP then queries the
8259 to identify the vector number (which is the value of the 8259's ICW2
register + the IRQ level). The master 8259's ICW2 was set to 0x30 and,
since no interrupts are actually pending, the 8259 will respond with
IRQ7 (spurious interrupt) yielding a vector of 0x37 or 55.

The OS was not expecting vector 55 and printed the message.

From the Intel Developer's Manual: Vol 3a, Section 10.5.1:
"Only one processor in the system should have an LVT entry
configured to use the ExtINT delivery mode."
======= end posting =====
From https://lkml.org/lkml/2019/3/5/538
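
The vector arithmetic in that posting can be checked with a few lines of C.
This is only an illustrative sketch; pic_vector() is a made-up name.  It
relies on the 8259 (in 8086 mode) taking the top five bits of ICW2 as the
vector base and filling in the low three bits with the IRQ level, which for
an 8-aligned base such as 0x30 is the same as the "ICW2 + IRQ level" sum
described above.

```c
#include <stdint.h>

/* Vector delivered by an 8259 in 8086 mode: the upper five bits come
 * from ICW2, the lower three bits are the IRQ level (0..7). */
uint8_t pic_vector(uint8_t icw2, uint8_t irq)
{
    return (uint8_t)((icw2 & 0xF8u) | (irq & 0x07u));
}
```

With ICW2 = 0x30 and the spurious IRQ7, pic_vector(0x30, 7) gives 0x37,
i.e. decimal 55, the vector named in the posting.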

If someone wants to provide a patch that I could apply to Gentoo's package,
I can run it to see whether something is afoot that has not been
considered.  Otherwise, does anyone have suggestions on how to
work around this problem?


John



_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users