Mailing List Archive

Re: Ryzen 6000 (Mobile)
Hi Jan,

Yes please. I have the Qubes build system set up with sourcehut, so I can add patches at will; however, please be aware that Qubes currently uses Xen 4.14.

I'll take a look and see if I can access that location.

With the added logging I should be able to trigger the crash and get to the bottom of it.

Thank you for your help, Jan.
Re: Ryzen 6000 (Mobile)
On 25.08.2022 13:10, Dylanger Daly wrote:
> Yes please. I have the Qubes build system set up with sourcehut, so I can add patches at will; however, please be aware that Qubes currently uses Xen 4.14.
>
> I'll take a look and see if I can access that location.
>
> With the added logging I should be able to trigger the crash and get to the bottom of it.

Here's the (trivial) patch. It's against our version of 4.14, but I expect
it to apply fine. Further logging would likely need to go in the kernel,
since, if my analysis and guessing are right, we wouldn't even see a
mapping attempt in Xen.

Jan

--- sle15sp3.orig/xen/arch/x86/e820.c
+++ sle15sp3/xen/arch/x86/e820.c
@@ -700,3 +700,15 @@ unsigned long __init init_e820(const cha
 
     return find_max_pfn();
 }
+
+#include <xen/domain_page.h>//temp
+static int __init ryzen6000_init(void) {//temp
+    if(e820_all_mapped(0x7AF67000, 0x7AF68000, E820_NVS)) {
+        const uint32_t *p = map_domain_page(_mfn(0x7AF67));
+        printk("0x7AF67000: %08x %08x %08x %08x\n", p[0], p[1], p[2], p[3]);
+        printk("0x7AF67010: %08x %08x %08x %08x\n", p[4], p[5], p[6], p[7]);
+        unmap_domain_page(p);
+    }
+    return 0;
+}
+__initcall(ryzen6000_init);
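The patch only dumps the page if the whole range [0x7AF67000, 0x7AF68000) is reported as ACPI NVS. As a rough illustration of that guard, here is a user-space sketch (not Xen code) modeled on the coverage walk that e820_all_mapped() performs: it assumes an E820 table sorted by address, and a range passes only when entries of the requested type cover it completely. The struct and helper names, and the idea of copying the words out of a page buffer, are hypothetical; type codes 1 (RAM) and 4 (ACPI NVS) are the standard E820 values.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Standard E820 type codes; only these two are used here. */
#define E820_RAM 1u
#define E820_NVS 4u

/* Hypothetical stand-in for one E820 entry: [addr, addr + size) of one type. */
struct e820_entry_sketch {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/*
 * Returns true if [start, end) is completely covered by entries of `type`.
 * Assumes `map` is sorted by address: every matching entry that covers the
 * current `start` moves `start` to that entry's end, and the range passes
 * once `start` reaches `end`.
 */
static bool range_all_mapped(const struct e820_entry_sketch *map, size_t n,
                             uint64_t start, uint64_t end, uint32_t type)
{
    assert(start <= end);
    for (size_t i = 0; i < n; i++) {
        const struct e820_entry_sketch *e = &map[i];

        if (e->type != type)
            continue;
        /* Skip entries that do not overlap what is left of the range. */
        if (e->addr >= end || e->addr + e->size <= start)
            continue;
        /* Entry covers the current start: everything up to its end is fine. */
        if (e->addr <= start)
            start = e->addr + e->size;
        if (start >= end)
            return true;
    }
    return false;
}

/* Copies the first eight 32-bit words of a page, i.e. p[0]..p[7]. */
static void load_words(uint32_t out[8], const unsigned char *page)
{
    memcpy(out, page, 8 * sizeof(uint32_t));
}

/*
 * Formats row `row` (0 or 1) of those eight words in the same layout as the
 * patch's printk() calls: the row's address, then four words in hex.
 */
static int format_dump_row(char *buf, size_t len, uint64_t page_base,
                           const uint32_t w[8], unsigned int row)
{
    assert(row < 2);
    const uint32_t *r = &w[row * 4];

    return snprintf(buf, len,
                    "0x%08" PRIX64 ": %08" PRIx32 " %08" PRIx32
                    " %08" PRIx32 " %08" PRIx32,
                    page_base + row * 16u, r[0], r[1], r[2], r[3]);
}
```

For example, with a table in which RAM ends at 0x7AF00000 and a 1 MiB NVS entry follows, range_all_mapped(map, 2, 0x7AF67000, 0x7AF68000, E820_NVS) is true, while the same call with E820_RAM is false.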
Re: Ryzen 6000 (Mobile)
Hi Jan,

Please see the attached iomem and meminfo text files from both Qubes under Xen and a working Fedora Workstation install.

After some investigation, I've found that I can run a few small VMs just fine; however, as soon as I start a larger VM, PCI devices appear to stop working. I assume this is because AMD moved the PCI register addresses and Xen doesn't know about this change, so it's mapping this memory for other appVMs to use, resulting in weird behavior? Does that sound correct?

Please also see the attached screenshot of dom0 crashing when I issue a restart after triggering the bad state.

Thank you
Re: Ryzen 6000 (Mobile)
Hi Jan,

I think I've narrowed the issue down to a PCI device: if I start two large VMs and then simply run lspci in dom0, it triggers a crash.

This makes sense, as sys-net works fine until I start a larger VM; then I see a 'chip reset' error in the appVM's dmesg. I assume the entire PCI bus goes into a bad state.

Cheers
Re: Ryzen 6000 (Mobile)
Hi Jan,

Another update: if I assign only 1 core to my VMs, the system seems stable and the PCIe bus does not break, so I assume the issue is related to the scheduler or the APIC. With loglvl=all set, triggering the issue (with 4 cores) makes Xen complain about the APIC; I've attached a picture of the Xen log.

Cheers
Re: Ryzen 6000 (Mobile)
Hi Jan,

I've managed to finagle together a very unstable environment.

What I'm seeing is the following:

1. All appVMs, including dom0, must have 1 core assigned.
2. This means I'm able to launch 4 appVMs; as soon as I launch a 5th, all PCIe devices go into a bad state.
3. If I shut the 5th VM down, I'm able to restart sys-usb, for example, and "recover" the USB controller.

I'm not entirely sure, but I think this means it's a scheduling/core issue?

It almost works.

Cheers
Re: Ryzen 6000 (Mobile)
On 01.09.2022 00:12, Dylanger Daly wrote:
> I think I've narrowed the issue down to a PCI device: if I start two large VMs and then simply run lspci in dom0, it triggers a crash.
>
> This makes sense, as sys-net works fine until I start a larger VM; then I see a 'chip reset' error in the appVM's dmesg. I assume the entire PCI bus goes into a bad state.

Sounds like this wants investigating from the Qubes side first, then.

Jan
Re: Ryzen 6000 (Mobile)
On 29.08.2022 17:26, Dylanger Daly wrote:
> Please see the attached iomem and meminfo text files from both Qubes under Xen and a working Fedora Workstation install.
>
> After some investigation, I've found that I can run a few small VMs just fine; however, as soon as I start a larger VM, PCI devices appear to stop working. I assume this is because AMD moved the PCI register addresses and Xen doesn't know about this change, so it's mapping this memory for other appVMs to use, resulting in weird behavior? Does that sound correct?

Not really, no. Even if BARs were moved, they still shouldn't overlap RAM.
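The point that a BAR must never overlap RAM can be stated as a simple half-open interval test against the E820 map. The sketch below is a hypothetical user-space check, not something Xen or Linux provides under these names: a BAR occupies [base, base + size), and it is only acceptable if it intersects no RAM entry. The struct, the helper names, and the addresses used with it are made up for illustration, and overflow near the top of the 64-bit address space is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define E820_RAM 1u /* standard E820 type code for usable RAM */

/* Hypothetical E820 entry: [addr, addr + size) of one type. */
struct e820_entry_sketch {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* True if the half-open intervals [a, a + alen) and [b, b + blen) intersect. */
static bool ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/* True if a BAR at [base, base + size) overlaps any RAM entry of the map. */
static bool bar_overlaps_ram(const struct e820_entry_sketch *map, size_t n,
                             uint64_t base, uint64_t size)
{
    assert(size != 0); /* implemented BARs always have a non-zero size */
    for (size_t i = 0; i < n; i++)
        if (map[i].type == E820_RAM &&
            ranges_overlap(map[i].addr, map[i].size, base, size))
            return true;
    return false;
}
```

With RAM below 0x7AF00000 and above 4 GiB, a BAR at 0xF0000000 passes this check, whereas one starting at 0x7AE00000 and spanning 2 MiB would collide with RAM; per the reply above, that is the situation that should never arise, moved BARs or not.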

But yes, BARs moving without Xen knowing would already be a problem. But
then the basic question is: why would BARs be moved? The only reason I
could see (in the context of VM creation) is if PCI pass-through was
involved. IIRC you said it isn't, though.

Jan
Re: Ryzen 6000 (Mobile)
On 27.08.2022 19:52, Dylanger Daly wrote:
> Thank you for your reply. Xen appears to crash immediately on startup and appears to hit the patch.

Oh, yes, silly me: map_domain_page() can't be used that way. You
may want to give the replacement patch (below) a try, although later
replies of yours have now hinted in a different direction anyway.

Jan

--- sle15sp3.orig/xen/arch/x86/e820.c
+++ sle15sp3/xen/arch/x86/e820.c
@@ -700,3 +700,16 @@ unsigned long __init init_e820(const cha
 
     return find_max_pfn();
 }
+
+#include <xen/domain_page.h>//temp
+static int __init ryzen6000_init(void) {//temp
+    if(e820_all_mapped(0x7AF67000, 0x7AF68000, E820_NVS)) {
+        mfn_t mfn = _mfn(0x7AF67);
+        const uint32_t *p = vmap(&mfn, 1);
+        printk("0x7AF67000: %08x %08x %08x %08x\n", p[0], p[1], p[2], p[3]);
+        printk("0x7AF67010: %08x %08x %08x %08x\n", p[4], p[5], p[6], p[7]);
+        vunmap(p);
+    }
+    return 0;
+}
+__initcall(ryzen6000_init);
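The constants in both patches follow from 4 KiB page arithmetic on x86. The guarded range [0x7AF67000, 0x7AF68000) is exactly one page. Its frame number is the address shifted right by 12, hence _mfn(0x7AF67) and vmap(&mfn, 1) mapping a single frame. p[4] sits 16 bytes into the page, which is why the second line is labelled 0x7AF67010. A minimal sketch of that arithmetic follows; the helper and macro names are hypothetical and not Xen's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u                         /* 4 KiB pages on x86 */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Frame number containing a physical address (the value passed to _mfn()). */
static uint64_t frame_of(uint64_t paddr)
{
    return paddr >> SKETCH_PAGE_SHIFT;
}

/* Exclusive end address of the page containing paddr. */
static uint64_t page_end(uint64_t paddr)
{
    return (frame_of(paddr) + 1) << SKETCH_PAGE_SHIFT;
}

/* Physical address of 32-bit word `idx` when the page is read as uint32_t[]. */
static uint64_t word_addr(uint64_t page_base, size_t idx)
{
    assert(idx < SKETCH_PAGE_SIZE / sizeof(uint32_t));
    return page_base + idx * sizeof(uint32_t);
}
```

So frame_of(0x7AF67000) gives 0x7AF67, page_end(0x7AF67000) gives the 0x7AF68000 bound used in the e820_all_mapped() call, and word_addr(0x7AF67000, 4) gives 0x7AF67010.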
