Mailing List Archive

Serious AMD-Vi issue
Apparently this was first noticed with Xen 4.14, but more recently I've
been able to reproduce the issue:

https://bugs.debian.org/988477

The original observation features MD-RAID1 using a pair of Samsung
SATA-attached flash devices. The main symptom is lines like the
following in `xl dmesg`:

(XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I

Here the device (the BDF) points at the SATA controller. I've ended up
reproducing this with some noticeable differences.
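
(For reference, the faulting BDF can be matched to a physical device
with something along these lines; DDDD:bb:dd.f is the placeholder from
the log line, substitute the actual value Xen reports:)

    # Map the requester ID from the fault line to a PCI device
    lspci -s DDDD:bb:dd.f -nn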

A major goal of RAID is to have different devices fail at different
times. Hence my initial run had a Samsung device plus a device from
another reputable flash manufacturer.

I initially noticed this due to messages in domain 0's dmesg about
errors from the SATA device. It wasn't until rather later that I noticed
the IOMMU warnings in Xen's dmesg (perhaps Xen messages emitted after
domain 0 is up should be duplicated into domain 0's dmesg?).

All of the failures consistently pointed at the Samsung device. Since I
expected it to fail first (the lower-quality offering with lesser
guarantees), I proceeded to replace it with an NVMe device.

With some monitoring I discovered the NVMe device was now triggering
IOMMU errors, though not nearly as many as the Samsung SATA device did.
As such, AMD-Vi plus MD-RAID1 appears to be exposing some sort of IOMMU
issue with Xen.
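
(For anyone wanting to do the same, the monitoring amounts to watching
Xen's ring buffer while the array is under load, along these lines; the
awk field position assumes the message format quoted above:)

    # Count fault lines per device (field 4 is the BDF in the quoted format)
    xl dmesg | grep IO_PAGE_FAULT | awk '{print $4}' | sort | uniq -c
    # Or simply watch the total grow while the array is busy
    watch -n 10 'xl dmesg | grep -c IO_PAGE_FAULT'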


All I can do is offer speculation about the underlying cause. There
does seem to be a pattern of higher-performance flash storage devices
being more severely affected.

I was speculating about the issue being the MD-RAID1 driver abusing
Linux's DMA infrastructure in some fashion.

Upon further consideration, I'm wondering if this is perhaps a latency
issue. I imagine there is some sort of flush after the IOMMU tables are
modified. Perhaps the Samsung SATA (and all NVMe) devices were trying to
execute commands before the reload of the IOMMU tables was complete.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi issue
On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> Apparently this was first noticed with 4.14, but more recently I've been
> able to reproduce the issue:
>
> https://bugs.debian.org/988477
>
> The original observation features MD-RAID1 using a pair of Samsung
> SATA-attached flash devices. The main line shows up in `xl dmesg`:
>
> (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
>
> Where the device points at the SATA controller. I've ended up
> reproducing this with some noticable differences.
>
> A major goal of RAID is to have different devices fail at different
> times. Hence my initial run had a Samsung device plus a device from
> another reputable flash manufacturer.
>
> I initially noticed this due to messages in domain 0's dmesg about
> errors from the SATA device. Wasn't until rather later that I noticed
> the IOMMU warnings in Xen's dmesg (perhaps post-domain 0 messages should
> be duplicated into domain 0's dmesg?).
>
> All of the failures consistently pointed at the Samsung device. Due to
> the expectation it would fail first (lower quality offering with
> lesser guarantees), I proceeded to replace it with a NVMe device.
>
> With some monitoring I discovered the NVMe device was now triggering
> IOMMU errors, though not nearly as many as the Samsung SATA device did.
> As such looks like AMD-Vi plus MD-RAID1 appears to be exposing some sort
> of IOMMU issue with Xen.
>
>
> All I can do is offer speculation about the underlying cause. There
> does seem to be a pattern of higher-performance flash storage devices
> being more severely effected.
>
> I was speculating about the issue being the MD-RAID1 driver abusing
> Linux's DMA infrastructure in some fashion.
>
> Upon further consideration, I'm wondering if this is perhaps a latency
> issue. I imagine there is some sort of flush after the IOMMU tables are
> modified. Perhaps the Samsung SATA (and all NVMe) devices were trying to
> execute commands before reloading the IOMMU tables is complete.

Ping!

The recipe seems to be Linux MD RAID1, plus Samsung SATA or any NVMe.

To make it explicit: when I tried Crucial SATA + Samsung SATA, the
IOMMU errors matched the Samsung SATA device (and the SATA driver
complained a number of times).

As stated, I'm speculating that lower-latency devices start executing
commands before the IOMMU tables have finished reloading. When the
IOMMU support was originally implemented, fast flash devices were rare.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi issue
On Mon, Feb 12, 2024 at 03:23:00PM -0800, Elliott Mitchell wrote:
> On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> > Apparently this was first noticed with 4.14, but more recently I've been
> > able to reproduce the issue:
> >
> > https://bugs.debian.org/988477
> >
> > The original observation features MD-RAID1 using a pair of Samsung
> > SATA-attached flash devices. The main line shows up in `xl dmesg`:
> >
> > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> >
> > Where the device points at the SATA controller. I've ended up
> > reproducing this with some noticable differences.
> >
> > A major goal of RAID is to have different devices fail at different
> > times. Hence my initial run had a Samsung device plus a device from
> > another reputable flash manufacturer.
> >
> > I initially noticed this due to messages in domain 0's dmesg about
> > errors from the SATA device. Wasn't until rather later that I noticed
> > the IOMMU warnings in Xen's dmesg (perhaps post-domain 0 messages should
> > be duplicated into domain 0's dmesg?).
> >
> > All of the failures consistently pointed at the Samsung device. Due to
> > the expectation it would fail first (lower quality offering with
> > lesser guarantees), I proceeded to replace it with a NVMe device.
> >
> > With some monitoring I discovered the NVMe device was now triggering
> > IOMMU errors, though not nearly as many as the Samsung SATA device did.
> > As such looks like AMD-Vi plus MD-RAID1 appears to be exposing some sort
> > of IOMMU issue with Xen.
> >
> >
> > All I can do is offer speculation about the underlying cause. There
> > does seem to be a pattern of higher-performance flash storage devices
> > being more severely effected.
> >
> > I was speculating about the issue being the MD-RAID1 driver abusing
> > Linux's DMA infrastructure in some fashion.
> >
> > Upon further consideration, I'm wondering if this is perhaps a latency
> > issue. I imagine there is some sort of flush after the IOMMU tables are
> > modified. Perhaps the Samsung SATA (and all NVMe) devices were trying to
> > execute commands before reloading the IOMMU tables is complete.
>
> Ping!
>
> The recipe seems to be Linux MD RAID1, plus Samsung SATA or any NVMe.
>
> To make it explicit, when I tried Crucial SATA + Samsung SATA. IOMMU
> errors matched the Samsung SATA (a number of times the SATA driver
> complained).
>
> As stated, I'm speculating lower latency devices starting to execute
> commands before IOMMU tables have finished reloading. When originally
> implemented fast flash devices were rare.

I guess I'm lucky I ended up with some slightly higher-latency hardware.
This is a very serious issue as data loss can occur.

AMD needs to fund their Xen engineers more; otherwise AMD hardware may
soon no longer be viable with Xen.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: AMD-Vi issue
On 25/01/2024 8:24 pm, Elliott Mitchell wrote:
> The original observation features MD-RAID1 using a pair of Samsung
> SATA-attached flash devices. The main line shows up in `xl dmesg`:
>
> (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I

You have a device which is issuing interrupts which have been blocked
due to the IOMMU configuration.

But you've elided all details other than the fact it's assigned to
dom0.  As such, there is nothing we can do to help.

This isn't the first time you've been asked to provide a bare minimum
amount of details such that we might be able to help.  I'm sorry you are
having problems, but continuing to ping a question with no actionable
information is unfair to those of us who are providing support
to the community for free.

~Andrew
Re: AMD-Vi issue
On Mon, Mar 04, 2024 at 11:55:07PM +0000, Andrew Cooper wrote:
> On 25/01/2024 8:24 pm, Elliott Mitchell wrote:
> > The original observation features MD-RAID1 using a pair of Samsung
> > SATA-attached flash devices. The main line shows up in `xl dmesg`:
> >
> > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
>
> You have a device which is issuing interrupts which have been blocked
> due to the IOMMU configuration.
>
> But you've elided all details other than the fact it's assigned to
> dom0.  As such, there is nothing we can do to help.

I've provided the details I thought most likely to allow others to
reproduce. I've pointed to a report from someone else with somewhat
similar hardware who also encountered this (and had enough hardware to
try several combinations).

Sorry about being reluctant to send exact hardware serial numbers to a
public mailing list.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
I sent a ping on this about 2 weeks ago. Since the plan is to move x86
IOMMU maintainership under general x86, the other x86 maintainers should
be aware of this:

On Mon, Feb 12, 2024 at 03:23:00PM -0800, Elliott Mitchell wrote:
> On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> > Apparently this was first noticed with 4.14, but more recently I've been
> > able to reproduce the issue:
> >
> > https://bugs.debian.org/988477
> >
> > The original observation features MD-RAID1 using a pair of Samsung
> > SATA-attached flash devices. The main line shows up in `xl dmesg`:
> >
> > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> >
> > Where the device points at the SATA controller. I've ended up
> > reproducing this with some noticable differences.
> >
> > A major goal of RAID is to have different devices fail at different
> > times. Hence my initial run had a Samsung device plus a device from
> > another reputable flash manufacturer.
> >
> > I initially noticed this due to messages in domain 0's dmesg about
> > errors from the SATA device. Wasn't until rather later that I noticed
> > the IOMMU warnings in Xen's dmesg (perhaps post-domain 0 messages should
> > be duplicated into domain 0's dmesg?).
> >
> > All of the failures consistently pointed at the Samsung device. Due to
> > the expectation it would fail first (lower quality offering with
> > lesser guarantees), I proceeded to replace it with a NVMe device.
> >
> > With some monitoring I discovered the NVMe device was now triggering
> > IOMMU errors, though not nearly as many as the Samsung SATA device did.
> > As such looks like AMD-Vi plus MD-RAID1 appears to be exposing some sort
> > of IOMMU issue with Xen.
> >
> >
> > All I can do is offer speculation about the underlying cause. There
> > does seem to be a pattern of higher-performance flash storage devices
> > being more severely effected.
> >
> > I was speculating about the issue being the MD-RAID1 driver abusing
> > Linux's DMA infrastructure in some fashion.
> >
> > Upon further consideration, I'm wondering if this is perhaps a latency
> > issue. I imagine there is some sort of flush after the IOMMU tables are
> > modified. Perhaps the Samsung SATA (and all NVMe) devices were trying to
> > execute commands before reloading the IOMMU tables is complete.
>
> Ping!
>
> The recipe seems to be Linux MD RAID1, plus Samsung SATA or any NVMe.
>
> To make it explicit, when I tried Crucial SATA + Samsung SATA. IOMMU
> errors matched the Samsung SATA (a number of times the SATA driver
> complained).
>
> As stated, I'm speculating lower latency devices starting to execute
> commands before IOMMU tables have finished reloading. When originally
> implemented fast flash devices were rare.

Both reproductions of this issue I'm aware of were on systems with AMD
processors. I'm doubtful distrust of flash storage hardware is unique
to owners of AMD systems. As a result, while this /could/ also affect
Intel systems, the lack of reports /suggests/ otherwise.

I've noticed two things when glancing at the original report. LVM is not
in use here, so that doesn't seem to affect the problem. The Phenom II
system which the original reporter found not to have the issue might
have lacked proper BIOS support, hence a non-functional IOMMU.

This being a latency issue is *speculation*, but it would explain the
pattern of devices being affected.

This is rather serious as it can lead to data loss (phew! glad I just
barely dodged this outcome).


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
Hi Elliott,

I hope you're well.

I'm Kelly, the community manager at the Xen Project.

I can see you've recently engaged with our community with some issues you'd
like help with.
We love the fact you are participating in our project; however, our
developers aren't able to help if you do not provide the specific details.

As an open-source project, our developers are committed to helping and
contributing as much as possible. We welcome you to continue
participating; however, please refrain from requesting help without
providing the necessary details, as this takes up a lot of our
community's time to analyze
what is possible and what assistance you might need.

I'd recommend providing logs or specific information so the community can
help you further.

If you'd like to chat more, let me know.

Many thanks,
Kelly Choi

Community Manager
Xen Project


On Mon, Mar 18, 2024 at 7:42 PM Elliott Mitchell <ehem+xen@m5p.com> wrote:

> I sent a ping on this about 2 weeks ago. Since the plan is to move x86
> IOMMU under general x86, the other x86 maintainers should be aware of
> this:
>
> On Mon, Feb 12, 2024 at 03:23:00PM -0800, Elliott Mitchell wrote:
> > On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> > > Apparently this was first noticed with 4.14, but more recently I've
> been
> > > able to reproduce the issue:
> > >
> > > https://bugs.debian.org/988477
> > >
> > > The original observation features MD-RAID1 using a pair of Samsung
> > > SATA-attached flash devices. The main line shows up in `xl dmesg`:
> > >
> > > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000
> flags 0x8 I
> > >
> > > Where the device points at the SATA controller. I've ended up
> > > reproducing this with some noticable differences.
> > >
> > > A major goal of RAID is to have different devices fail at different
> > > times. Hence my initial run had a Samsung device plus a device from
> > > another reputable flash manufacturer.
> > >
> > > I initially noticed this due to messages in domain 0's dmesg about
> > > errors from the SATA device. Wasn't until rather later that I noticed
> > > the IOMMU warnings in Xen's dmesg (perhaps post-domain 0 messages
> should
> > > be duplicated into domain 0's dmesg?).
> > >
> > > All of the failures consistently pointed at the Samsung device. Due to
> > > the expectation it would fail first (lower quality offering with
> > > lesser guarantees), I proceeded to replace it with a NVMe device.
> > >
> > > With some monitoring I discovered the NVMe device was now triggering
> > > IOMMU errors, though not nearly as many as the Samsung SATA device did.
> > > As such looks like AMD-Vi plus MD-RAID1 appears to be exposing some
> sort
> > > of IOMMU issue with Xen.
> > >
> > >
> > > All I can do is offer speculation about the underlying cause. There
> > > does seem to be a pattern of higher-performance flash storage devices
> > > being more severely effected.
> > >
> > > I was speculating about the issue being the MD-RAID1 driver abusing
> > > Linux's DMA infrastructure in some fashion.
> > >
> > > Upon further consideration, I'm wondering if this is perhaps a latency
> > > issue. I imagine there is some sort of flush after the IOMMU tables
> are
> > > modified. Perhaps the Samsung SATA (and all NVMe) devices were trying
> to
> > > execute commands before reloading the IOMMU tables is complete.
> >
> > Ping!
> >
> > The recipe seems to be Linux MD RAID1, plus Samsung SATA or any NVMe.
> >
> > To make it explicit, when I tried Crucial SATA + Samsung SATA. IOMMU
> > errors matched the Samsung SATA (a number of times the SATA driver
> > complained).
> >
> > As stated, I'm speculating lower latency devices starting to execute
> > commands before IOMMU tables have finished reloading. When originally
> > implemented fast flash devices were rare.
>
> Both reproductions of this issue I'm aware of were on systems with AMD
> processors. I'm doubtul suspicion of flash storage hardware is unique
> to owners of AMD systems. As a result while this /could/ also effect
> Intel systems, the lack of reports /suggests/ otherwise.
>
> I've noticed two things when glancing at the original report. LVM is not
> in use here, so that doesn't seem to effect the problem. The Phenom II
> the original reporter tested as not having the issue might have lacked
> proper BIOS support, hence IOMMU not being functional.
>
> This being a latency issue is *speculation*, but would explain the
> pattern of devices being effected.
>
> This is rather serious as it can lead to data loss (phew! glad I just
> barely dodged this outcome).
>
>
> --
> (\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
> \BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
> \_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
> 8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
>
>
>
>
Re: Serious AMD-Vi(?) issue
On Fri, Mar 22, 2024 at 04:41:45PM +0000, Kelly Choi wrote:
>
> I can see you've recently engaged with our community with some issues you'd
> like help with.
> We love the fact you are participating in our project, however, our
> developers aren't able to help if you do not provide the specific details.

Please point to specific details which have been omitted. Fairly little
data has been provided as fairly little data is available. The primary
observation is large numbers of:

(XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I

Lines in Xen's ring buffer. I recall spotting 3 messages from Linux's
SATA driver (which weren't saved, since other causes were suspected),
which would likely be associated with hundreds or thousands of the above
log messages. I never observed any messages from the NVMe subsystem
during that phase.

The most overt sign was telling the Linux kernel to scan the RAID1
array for inconsistencies and the kernel finding some. The domain didn't
otherwise appear to notice any trouble.
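
(For concreteness, the scan in question is the standard MD consistency
check; /dev/md0 is illustrative, substitute the actual array:)

    # Ask the MD layer to compare the two halves of the mirror
    echo check > /sys/block/md0/md/sync_action
    # Once the check completes, a non-zero count means the halves disagree
    cat /sys/block/md0/md/mismatch_cnt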

This is from memory; it would take some time to discover whether any
messages were missed. The present mitigation inhibits the messages, but
the trouble is certainly still lurking.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On 22.03.2024 20:22, Elliott Mitchell wrote:
> On Fri, Mar 22, 2024 at 04:41:45PM +0000, Kelly Choi wrote:
>>
>> I can see you've recently engaged with our community with some issues you'd
>> like help with.
>> We love the fact you are participating in our project, however, our
>> developers aren't able to help if you do not provide the specific details.
>
> Please point to specific details which have been omitted. Fairly little
> data has been provided as fairly little data is available. The primary
> observation is large numbers of:
>
> (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
>
> Lines in Xen's ring buffer.

Yet this is (part of) the problem: By providing only the messages that appear
relevant to you, you imply that you know that no other message is in any way
relevant. That's judgement you'd better leave to people actually trying to
investigate. Unless of course you were proposing an actual code change, with
suitable justification.

In fact when running into trouble, the usual course of action would be to
increase verbosity in both hypervisor and kernel, just to make sure no
potentially relevant message is missed.
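
For example (the options below are the standard documented command-line
ones; which exactly to use is of course a judgement call):

    # Xen command line: maximise console logging and IOMMU diagnostics
    loglvl=all guest_loglvl=all iommu=verbose,debug
    # dom0 kernel command line: keep all kernel messages, larger ring buffer
    debug log_buf_len=16M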

> I recall spotting 3 messages from Linux's
> SATA driver (which weren't saved due to other causes being suspected),
> which would likely be associated with hundreds or thousands of the above
> log messages. I never observed any messages from the NVMe subsystem
> during that phase.
>
> The most overt sign was telling the Linux kernel to scan for
> inconsistencies and the kernel finding some. The domain didn't otherwise
> appear to notice trouble.
>
> This is from memory, it would take some time to discover whether any
> messages were missed. Present mitigation action is inhibiting the
> messages, but the trouble is certainly still lurking.

Iirc you were considering whether any of this might be a timing issue. Yet
beyond voicing that suspicion, you didn't provide any technical details as
to why you think so. Such technical details would include taking into
account how IOMMU mappings and associated IOMMU TLB flushing are carried
out. Right now, to me at least, your speculation in this regard fails
basic sanity checking. Therefore the scenario that you're thinking of
would need better describing, imo.

Jan
Re: Serious AMD-Vi(?) issue
On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> On 22.03.2024 20:22, Elliott Mitchell wrote:
> > On Fri, Mar 22, 2024 at 04:41:45PM +0000, Kelly Choi wrote:
> >>
> >> I can see you've recently engaged with our community with some issues you'd
> >> like help with.
> >> We love the fact you are participating in our project, however, our
> >> developers aren't able to help if you do not provide the specific details.
> >
> > Please point to specific details which have been omitted. Fairly little
> > data has been provided as fairly little data is available. The primary
> > observation is large numbers of:
> >
> > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> >
> > Lines in Xen's ring buffer.
>
> Yet this is (part of) the problem: By providing only the messages that appear
> relevant to you, you imply that you know that no other message is in any way
> relevant. That's judgement you'd better leave to people actually trying to
> investigate. Unless of course you were proposing an actual code change, with
> suitable justification.

Honestly, I forgot about the very small number of messages from the SATA
subsystem. The question of whether the current mitigation actions are
effective right now was a bigger issue. As such, monitoring `xl dmesg`
took priority over looking at SATA messages which had failed to reliably
indicate status.

I *thought* I would be able to retrieve those via other slow means, but a
different and possibly overlapping issue has shown up. Unfortunately
this means those are no longer retrievable. :-(

> In fact when running into trouble, the usual course of action would be to
> increase verbosity in both hypervisor and kernel, just to make sure no
> potentially relevant message is missed.

More/better information might have been obtained if I'd been engaged
earlier.

> > The most overt sign was telling the Linux kernel to scan for
> > inconsistencies and the kernel finding some. The domain didn't otherwise
> > appear to notice trouble.
> >
> > This is from memory, it would take some time to discover whether any
> > messages were missed. Present mitigation action is inhibiting the
> > messages, but the trouble is certainly still lurking.
>
> Iirc you were considering whether any of this might be a timing issue. Yet
> beyond voicing that suspicion, you didn't provide any technical details as
> to why you think so. Such technical details would include taking into
> account how IOMMU mappings and associated IOMMU TLB flushing are carried
> out. Right now, to me at least, your speculation in this regard fails
> basic sanity checking. Therefore the scenario that you're thinking of
> would need better describing, imo.

True. Mostly I'm analyzing the known information and considering what
the patterns suggest.

Presently I'm aware of two reports (the original Debian reporter's and
mine).

Both of these feature machines with AMD processors. It could be that
people with AMD processors are less trustful of flash storage, or it
could be an AMD-only IOMMU issue. Ideally someone would test and confirm
there is no issue with Linux software RAID1 on flash on an Intel machine.

Both reports feature two flash storage devices being run through Linux
MD RAID1. It could be that the MD RAID1 subsystem is abusing the DMA
interface in some fashion. While the original reporter noted this not
occurring with a single device, the report does not explicitly state
whether that was a degenerate RAID1 versus non-RAID. I'm unaware of any
testing with three devices in RAID1.
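
(If anyone does test the single-device case, the distinction can be made
explicit. A degenerate RAID1 keeps the MD layer in the picture with only
one member present, e.g. the sketch below, versus simply using the bare
device with no MD involvement at all. The device name is illustrative.)

    # Degenerate (single-member) RAID1: MD layer present, second device absent
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX1 missing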


Both reports feature Samsung SATA flash devices. My case also includes a
Crucial NVMe device. My case also features a Crucial SATA flash device
for which the problem did NOT occur. So the question becomes, why did
the problem not occur for this Crucial SATA device?

According to the specifications, the Crucial SATA device is roughly on
par with the Samsung SATA devices in terms of read/write speeds. The
NVMe device's specifications are massively better.

What comes to mind is that the Crucial SATA device might have higher
latency before executing commands. The specifications don't mention
command execution latency, so it isn't possible to know whether this is
the issue.


Yes, latency/timing is speculation. It does seem a good fit for the
pattern though.

This could be a Linux MD RAID1 bug or a Xen bug.

Unfortunately data loss is a very serious type of bug so I'm highly
reluctant to let go of mitigations without hope for progress.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> > On 22.03.2024 20:22, Elliott Mitchell wrote:
> > > On Fri, Mar 22, 2024 at 04:41:45PM +0000, Kelly Choi wrote:
> > >>
> > >> I can see you've recently engaged with our community with some issues you'd
> > >> like help with.
> > >> We love the fact you are participating in our project, however, our
> > >> developers aren't able to help if you do not provide the specific details.
> > >
> > > Please point to specific details which have been omitted. Fairly little
> > > data has been provided as fairly little data is available. The primary
> > > observation is large numbers of:
> > >
> > > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> > >
> > > Lines in Xen's ring buffer.
> >
> > Yet this is (part of) the problem: By providing only the messages that appear
> > relevant to you, you imply that you know that no other message is in any way
> > relevant. That's judgement you'd better leave to people actually trying to
> > investigate. Unless of course you were proposing an actual code change, with
> > suitable justification.
>
> Honestly, I forgot about the very small number of messages from the SATA
> subsystem. The question of whether the current mitigation actions are
> effective right now was a bigger issue. As such monitoring `xl dmesg`
> was a priority to looking at SATA messages which failed to reliably
> indicate status.
>
> I *thought* I would be able to retrieve those via other slow means, but a
> different and possibly overlapping issue has shown up. Unfortunately
> this means those are no longer retrievable. :-(

With some persistence I was able to retrieve them. There are other
pieces of software with worse UIs than Xen.

> > In fact when running into trouble, the usual course of action would be to
> > increase verbosity in both hypervisor and kernel, just to make sure no
> > potentially relevant message is missed.
>
> More/better information might have been obtained if I'd been engaged
> earlier.

This is still true; things are in full mitigation mode and I'll be
quite unhappy to go back to experiments at this point.


I now see why I left those out. The messages from the SATA subsystem
were from a kernel into which a bad patch had leaked via an LTS branch.
It looks like the SATA subsystem was significantly broken and I'm unsure
whether any useful information can be retrieved. Notably there is quite
a bit of noise from SATA devices not affected by this issue.

Some of the messages /might/ be useful, but the amount of noise is quite
high. Do messages from a broken kernel interest you?


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On 27.03.2024 18:27, Elliott Mitchell wrote:
> On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
>> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
>>> On 22.03.2024 20:22, Elliott Mitchell wrote:
>>>> On Fri, Mar 22, 2024 at 04:41:45PM +0000, Kelly Choi wrote:
>>>>>
>>>>> I can see you've recently engaged with our community with some issues you'd
>>>>> like help with.
>>>>> We love the fact you are participating in our project, however, our
>>>>> developers aren't able to help if you do not provide the specific details.
>>>>
>>>> Please point to specific details which have been omitted. Fairly little
>>>> data has been provided as fairly little data is available. The primary
>>>> observation is large numbers of:
>>>>
>>>> (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
>>>>
>>>> Lines in Xen's ring buffer.
>>>
>>> Yet this is (part of) the problem: By providing only the messages that appear
>>> relevant to you, you imply that you know that no other message is in any way
>>> relevant. That's judgement you'd better leave to people actually trying to
>>> investigate. Unless of course you were proposing an actual code change, with
>>> suitable justification.
>>
>> Honestly, I forgot about the very small number of messages from the SATA
>> subsystem. The question of whether the current mitigation actions are
>> effective right now was a bigger issue. As such monitoring `xl dmesg`
>> was a priority to looking at SATA messages which failed to reliably
>> indicate status.
>>
>> I *thought* I would be able to retrieve those via other slow means, but a
>> different and possibly overlapping issue has shown up. Unfortunately
>> this means those are no longer retrievable. :-(
>
> With some persistence I was able to retrieve them. There are other
> pieces of software with worse UIs than Xen.
>
>>> In fact when running into trouble, the usual course of action would be to
>>> increase verbosity in both hypervisor and kernel, just to make sure no
>>> potentially relevant message is missed.
>>
>> More/better information might have been obtained if I'd been engaged
>> earlier.
>
> This is still true, things are in full mitigation mode and I'll be
> quite unhappy to go back with experiments at this point.

Well, it very likely won't work without further experimenting by someone
able to observe the bad behavior. Recall we're on xen-devel here; it is
kind of expected that without clear (and practical) repro instructions
experimenting as well as info collection will remain with the reporter.

> I now see why I left those out. The messages from the SATA subsystem
> were from a kernel which a bad patch had leaked into a LTS branch. Looks
> like the SATA subsystem was significantly broken and I'm unsure whether
> any useful information could be retrieved. Notably there is quite a bit
> of noise from SATA devices not effected by this issue.
>
> Some of the messages /might/ be useful, but the amount of noise is quite
> high. Do messages from a broken kernel interest you?

If this was a less vague (in terms of possible root causes) issue, I'd
probably have answered "yes". But in the case here I'm afraid such might
further confuse things rather than clarifying them.

Jan
Re: Serious AMD-Vi(?) issue
On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
> On 27.03.2024 18:27, Elliott Mitchell wrote:
> > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> >>> On 22.03.2024 20:22, Elliott Mitchell wrote:
> >>>> On Fri, Mar 22, 2024 at 04:41:45PM +0000, Kelly Choi wrote:
> >>>>>
> >>>>> I can see you've recently engaged with our community with some issues you'd
> >>>>> like help with.
> >>>>> We love the fact you are participating in our project, however, our
> >>>>> developers aren't able to help if you do not provide the specific details.
> >>>>
> >>>> Please point to specific details which have been omitted. Fairly little
> >>>> data has been provided as fairly little data is available. The primary
> >>>> observation is large numbers of:
> >>>>
> >>>> (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> >>>>
> >>>> Lines in Xen's ring buffer.
> >>>
> >>> Yet this is (part of) the problem: By providing only the messages that appear
> >>> relevant to you, you imply that you know that no other message is in any way
> >>> relevant. That's judgement you'd better leave to people actually trying to
> >>> investigate. Unless of course you were proposing an actual code change, with
> >>> suitable justification.
> >>
> >> Honestly, I forgot about the very small number of messages from the SATA
> >> subsystem. The question of whether the current mitigation actions are
> >> effective right now was a bigger issue. As such monitoring `xl dmesg`
> >> was a priority to looking at SATA messages which failed to reliably
> >> indicate status.
> >>
> >> I *thought* I would be able to retrieve those via other slow means, but a
> >> different and possibly overlapping issue has shown up. Unfortunately
> >> this means those are no longer retrievable. :-(
> >
> > With some persistence I was able to retrieve them. There are other
> > pieces of software with worse UIs than Xen.
> >
> >>> In fact when running into trouble, the usual course of action would be to
> >>> increase verbosity in both hypervisor and kernel, just to make sure no
> >>> potentially relevant message is missed.
> >>
> >> More/better information might have been obtained if I'd been engaged
> >> earlier.
> >
> > This is still true, things are in full mitigation mode and I'll be
> > quite unhappy to go back with experiments at this point.
>
> Well, it very likely won't work without further experimenting by someone
> able to observe the bad behavior. Recall we're on xen-devel here; it is
> kind of expected that without clear (and practical) repro instructions
> experimenting as well as info collection will remain with the reporter.

The first reporter (https://bugs.debian.org/988477) gave pretty specific
details about their setups.

While the exact boundary isn't very well defined, that seems enough to
give a pretty good start. We don't know whether all Samsung SATA devices
are affected, but most of the recent ones (<5 years old) are. This
requires a pair of devices in software RAID1. It likely reproduces better
with AMD AM4/AM5 processors, but almost certainly needs a fully
operational IOMMU.

(ASUS motherboards tend to have well setup IOMMUs)

I would be surprised if you don't have all of the hardware on hand. The
only issue would be finding an appropriate pair of SATA devices, since
those tend to remain in service. I would look for older devices which
were removed from service for being too small (like the 128GB 840 PRO
from the first report), or which were pulled from service after too many
writes.


> > I now see why I left those out. The messages from the SATA subsystem
> > were from a kernel which a bad patch had leaked into a LTS branch. Looks
> > like the SATA subsystem was significantly broken and I'm unsure whether
> > any useful information could be retrieved. Notably there is quite a bit
> > of noise from SATA devices not effected by this issue.
> >
> > Some of the messages /might/ be useful, but the amount of noise is quite
> > high. Do messages from a broken kernel interest you?
>
> If this was a less vague (in terms of possible root causes) issue, I'd
> probably have answered "yes". But in the case here I'm afraid such might
> further confuse things rather than clarifying them.

Okay.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On Thu, Mar 28, 2024 at 08:22:31AM -0700, Elliott Mitchell wrote:
> On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
> > On 27.03.2024 18:27, Elliott Mitchell wrote:
> > > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> > >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> > >>>
> > >>> In fact when running into trouble, the usual course of action would be to
> > >>> increase verbosity in both hypervisor and kernel, just to make sure no
> > >>> potentially relevant message is missed.
> > >>
> > >> More/better information might have been obtained if I'd been engaged
> > >> earlier.
> > >
> > > This is still true, things are in full mitigation mode and I'll be
> > > quite unhappy to go back with experiments at this point.
> >
> > Well, it very likely won't work without further experimenting by someone
> > able to observe the bad behavior. Recall we're on xen-devel here; it is
> > kind of expected that without clear (and practical) repro instructions
> > experimenting as well as info collection will remain with the reporter.
>
> The first reporter: https://bugs.debian.org/988477 gave pretty specific
> details about their setups.
>
> While the exact border isn't very well defined, that seems enough to give
> a pretty good start. We don't know whether all Samsung SATA devices are
> effected, but most of the recent ones (<5 years old) are. This requires
> a pair of devices in software RAID1. Likely reproduces better with AMD
> AM4/AM5 processors, but almost certainly needs a fully operational IOMMU.
>
> (ASUS motherboards tend to have well setup IOMMUs)
>
> I would be surprised if you don't have all of the hardware on-hand. Only
> issue would be finding an appropriate pair of SATA devices, since those
> tend to remain in service. I would look for older devices which were
> removed from service due to being too small (128GB 840 PRO from the first
> report), or were pulled from service due to having had too many writes.

Come to think of it, there is one more possible ingredient to this.
Similar to the first report, when the problem occurred the SATA device
was plugged into an on-chipset SATA port, not the extra controller this
motherboard has. I don't know whether the performance difference of an
off-chipset controller would influence this, but it might.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
> On 27.03.2024 18:27, Elliott Mitchell wrote:
> > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> >>>
> >>> In fact when running into trouble, the usual course of action would be to
> >>> increase verbosity in both hypervisor and kernel, just to make sure no
> >>> potentially relevant message is missed.
> >>
> >> More/better information might have been obtained if I'd been engaged
> >> earlier.
> >
> > This is still true, things are in full mitigation mode and I'll be
> > quite unhappy to go back with experiments at this point.
>
> Well, it very likely won't work without further experimenting by someone
> able to observe the bad behavior. Recall we're on xen-devel here; it is
> kind of expected that without clear (and practical) repro instructions
> experimenting as well as info collection will remain with the reporter.

After looking at the situation and considering the issues, I /may/ be
able to set up for more testing. I guess I should confirm: which of
those criteria do you think the currently provided information fails at?

AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) +
dbench; seems a pretty specific setup.
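
Spelled out, the reproduction sketch is roughly the following (device
names, filesystem and dbench parameters are illustrative; the key
ingredients are an active AMD IOMMU, both mirror halves being flash, and
sustained write load):

    # dom0 on an AMD machine with the IOMMU enabled
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX /dev/nvme0n1
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt
    dbench -D /mnt 8                    # sustained mixed I/O load
    xl dmesg | grep -c IO_PAGE_FAULT    # growing count => reproduced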

I could see this being criticised as impractical if /new/ devices were
required, but the confirmed flash devices are several generations old.
The difficulty is that cheaper candidate devices are being recycled for
their precious-metal content rather than resold as used.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On 11.04.2024 04:41, Elliott Mitchell wrote:
> On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
>> On 27.03.2024 18:27, Elliott Mitchell wrote:
>>> On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
>>>> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
>>>>>
>>>>> In fact when running into trouble, the usual course of action would be to
>>>>> increase verbosity in both hypervisor and kernel, just to make sure no
>>>>> potentially relevant message is missed.
>>>>
>>>> More/better information might have been obtained if I'd been engaged
>>>> earlier.
>>>
>>> This is still true, things are in full mitigation mode and I'll be
>>> quite unhappy to go back with experiments at this point.
>>
>> Well, it very likely won't work without further experimenting by someone
>> able to observe the bad behavior. Recall we're on xen-devel here; it is
>> kind of expected that without clear (and practical) repro instructions
>> experimenting as well as info collection will remain with the reporter.
>
> After looking at the situation and considering the issues, I /may/ be
> able to setup for doing more testing. I guess I should confirm, which of
> those criteria do you think currently provided information fails at?
>
> AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) +
> dbench; seems a pretty specific setup.

Indeed. If that's the only way to observe the issue, it suggests to me
that it'll need to be mainly you to do further testing, and perhaps even
debugging. Which isn't to say we're not available to help, but from all
I have gathered so far we're pretty much in the dark even as to which
component(s) may be to blame. As can still be seen at the top in reply
context, some suggestions were given as to obtaining possible further
information (or confirming the absence thereof).

I'd also like to come back to the vague theory you did voice, in that
you're suspecting flushes to take too long. I continue to have trouble
with this, and I would therefore like to ask that you put this down in
more technical terms, making connections to actual actions taken by
software / hardware.

Jan

> I could see this being criticised as impractical if /new/ devices were
> required, but the confirmed flash devices are several generations old.
> Difficulty is cheaper candidate devices are being recycled for their
> precious metal content, rather than resold as used.
>
>
Re: Serious AMD-Vi(?) issue
On Wed, Apr 17, 2024 at 02:40:09PM +0200, Jan Beulich wrote:
> On 11.04.2024 04:41, Elliott Mitchell wrote:
> > On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
> >> On 27.03.2024 18:27, Elliott Mitchell wrote:
> >>> On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> >>>> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> >>>>>
> >>>>> In fact when running into trouble, the usual course of action would be to
> >>>>> increase verbosity in both hypervisor and kernel, just to make sure no
> >>>>> potentially relevant message is missed.
> >>>>
> >>>> More/better information might have been obtained if I'd been engaged
> >>>> earlier.
> >>>
> >>> This is still true, things are in full mitigation mode and I'll be
> >>> quite unhappy to go back with experiments at this point.
> >>
> >> Well, it very likely won't work without further experimenting by someone
> >> able to observe the bad behavior. Recall we're on xen-devel here; it is
> >> kind of expected that without clear (and practical) repro instructions
> >> experimenting as well as info collection will remain with the reporter.
> >
> > After looking at the situation and considering the issues, I /may/ be
> > able to setup for doing more testing. I guess I should confirm, which of
> > those criteria do you think currently provided information fails at?
> >
> > AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) +
> > dbench; seems a pretty specific setup.
>
> Indeed. If that's the only way to observe the issue, it suggests to me
> that it'll need to be mainly you to do further testing, and perhaps even
> debugging. Which isn't to say we're not available to help, but from all
> I have gathered so far we're pretty much in the dark even as to which
> component(s) may be to blame. As can still be seen at the top in reply
> context, some suggestions were given as to obtaining possible further
> information (or confirming the absence thereof).

There may be other ways which haven't yet been found.

I've been left with the suspicion AMD was to some degree sponsoring
work to ensure Xen works on their hardware. Given the severity of this
problem I would expect them not to want to gain a reputation for having
data-loss issues. Assuming a suitable pair of devices weren't already on
hand, acquiring them should be well within their budget.

> I'd also like to come back to the vague theory you did voice, in that
> you're suspecting flushes to take too long. I continue to have trouble
> with this, and I would therefore like to ask that you put this down in
> more technical terms, making connections to actual actions taken by
> software / hardware.

I'm trying to figure out a pattern.

Nominally all the devices are roughly on par (only a very cheap flash
device will be unable to overwhelm SATA's bandwidth). Yet why did the
Crucial SATA device /seem/ not to have the issue? Why did a Crucial NVMe
device demonstrate the issue?

My guess is the flash controllers Samsung uses may be able to start
executing commands faster than the ones Crucial uses. Meanwhile NVMe
is lower overhead and latency than SATA (SATA's overhead isn't an issue
for actual disks). Perhaps the IOMMU is still flushing its TLB, or
hasn't loaded the new tables.

I suspect that when MD-RAID1 issues block requests to a pair of devices,
it likely sends the block to one device and then reuses most/all of the
structures for the second device. As a result the second request would
likely get a command to the device rather faster than the first request.

Perhaps look into which structures the MD-RAID1 subsystem reuses, then
see whether doing early setup of those structures triggers the issue?

(okay I'm deep into speculation here, but this seems the simplest
explanation for what could be occurring)


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On 18.04.2024 08:45, Elliott Mitchell wrote:
> On Wed, Apr 17, 2024 at 02:40:09PM +0200, Jan Beulich wrote:
>> On 11.04.2024 04:41, Elliott Mitchell wrote:
>>> On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
>>>> On 27.03.2024 18:27, Elliott Mitchell wrote:
>>>>> On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
>>>>>> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
>>>>>>>
>>>>>>> In fact when running into trouble, the usual course of action would be to
>>>>>>> increase verbosity in both hypervisor and kernel, just to make sure no
>>>>>>> potentially relevant message is missed.
>>>>>>
>>>>>> More/better information might have been obtained if I'd been engaged
>>>>>> earlier.
>>>>>
>>>>> This is still true, things are in full mitigation mode and I'll be
>>>>> quite unhappy to go back with experiments at this point.
>>>>
>>>> Well, it very likely won't work without further experimenting by someone
>>>> able to observe the bad behavior. Recall we're on xen-devel here; it is
>>>> kind of expected that without clear (and practical) repro instructions
>>>> experimenting as well as info collection will remain with the reporter.
>>>
>>> After looking at the situation and considering the issues, I /may/ be
>>> able to setup for doing more testing. I guess I should confirm, which of
>>> those criteria do you think currently provided information fails at?
>>>
>>> AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) +
>>> dbench; seems a pretty specific setup.
>>
>> Indeed. If that's the only way to observe the issue, it suggests to me
>> that it'll need to be mainly you to do further testing, and perhaps even
>> debugging. Which isn't to say we're not available to help, but from all
>> I have gathered so far we're pretty much in the dark even as to which
>> component(s) may be to blame. As can still be seen at the top in reply
>> context, some suggestions were given as to obtaining possible further
>> information (or confirming the absence thereof).
>
> There may be other ways which haven't yet been found.
>
> I've been left with the suspicion AMD was to some degree sponsoring
> work to ensure Xen works on their hardware. Given the severity of this
> problem I would kind of expect them not want to gain a reputation for
> having data loss issues. Assuming a suitable pair of devices weren't
> already on-hand, I would kind of expect this to be well within their
> budget.

You've got to talk to AMD then. Plus I assume it's clear to you that
even if the (presumably) necessary hardware was available, it still
would require respective setup, leaving open whether the issue then
could indeed be reproduced.

>> I'd also like to come back to the vague theory you did voice, in that
>> you're suspecting flushes to take too long. I continue to have trouble
>> with this, and I would therefore like to ask that you put this down in
>> more technical terms, making connections to actual actions taken by
>> software / hardware.
>
> I'm trying to figure out a pattern.
>
> Nominally all the devices are roughly on par (only a very cheap flash
> device will be unable to overwhelm SATA's bandwidth). Yet why did the
> Crucial SATA device /seem/ not to have the issue? Why did a Crucial NVMe
> device demonstrate the issue.
>
> My guess is the flash controllers Samsung uses may be able to start
> executing commands faster than the ones Crucial uses. Meanwhile NVMe
> is lower overhead and latency than SATA (SATA's overhead isn't an issue
> for actual disks). Perhaps the IOMMU is still flushing its TLB, or
> hasn't loaded the new tables.

Which would be an IOMMU issue then, that software at best may be able to
work around.

Jan

> I suspect when the MD-RAID1 issues block requests to a pair of devices,
> it likely sends the block to one device and then reuses most/all of the
> structures for the second device. As a result the second request would
> likely get a command to the device rather faster than the first request.
>
> Perhaps look into what structures the MD-RAID1 subsystem reuses are.
> Then see whether doing early setup of those structures triggers the
> issue?
>
> (okay I'm deep into speculation here, but this seems the simplest
> explanation for what could be occuring)
>
>
Re: Serious AMD-Vi(?) issue
On Thu, Apr 18, 2024 at 09:09:51AM +0200, Jan Beulich wrote:
> On 18.04.2024 08:45, Elliott Mitchell wrote:
> > On Wed, Apr 17, 2024 at 02:40:09PM +0200, Jan Beulich wrote:
> >> On 11.04.2024 04:41, Elliott Mitchell wrote:
> >>> On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
> >>>> On 27.03.2024 18:27, Elliott Mitchell wrote:
> >>>>> On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> >>>>>> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> >>>>>>>
> >>>>>>> In fact when running into trouble, the usual course of action would be to
> >>>>>>> increase verbosity in both hypervisor and kernel, just to make sure no
> >>>>>>> potentially relevant message is missed.
> >>>>>>
> >>>>>> More/better information might have been obtained if I'd been engaged
> >>>>>> earlier.
> >>>>>
> >>>>> This is still true, things are in full mitigation mode and I'll be
> >>>>> quite unhappy to go back with experiments at this point.
> >>>>
> >>>> Well, it very likely won't work without further experimenting by someone
> >>>> able to observe the bad behavior. Recall we're on xen-devel here; it is
> >>>> kind of expected that without clear (and practical) repro instructions
> >>>> experimenting as well as info collection will remain with the reporter.
> >>>
> >>> After looking at the situation and considering the issues, I /may/ be
> >>> able to setup for doing more testing. I guess I should confirm, which of
> >>> those criteria do you think currently provided information fails at?
> >>>
> >>> AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) +
> >>> dbench; seems a pretty specific setup.
> >>
> >> Indeed. If that's the only way to observe the issue, it suggests to me
> >> that it'll need to be mainly you to do further testing, and perhaps even
> >> debugging. Which isn't to say we're not available to help, but from all
> >> I have gathered so far we're pretty much in the dark even as to which
> >> component(s) may be to blame. As can still be seen at the top in reply
> >> context, some suggestions were given as to obtaining possible further
> >> information (or confirming the absence thereof).
> >
> > There may be other ways which haven't yet been found.
> >
> > I've been left with the suspicion AMD was to some degree sponsoring
> > work to ensure Xen works on their hardware. Given the severity of this
> > problem I would kind of expect them not want to gain a reputation for
> > having data loss issues. Assuming a suitable pair of devices weren't
> > already on-hand, I would kind of expect this to be well within their
> > budget.
>
> You've got to talk to AMD then. Plus I assume it's clear to you that
> even if the (presumably) necessary hardware was available, it still
> would require respective setup, leaving open whether the issue then
> could indeed be reproduced.

I had a vain hope your links to AMD would allow you to say "we've got a
major problem in need of addressing ASAP".

I suspect it will reproduce readily. The sparsity of reports is likely
due to few people using RAID1 with flash. Yet even though initial surveys
suggest flash has a rather lower early failure rate, they still point to
distinctly non-zero failure rates in the first 5 years.

> >> I'd also like to come back to the vague theory you did voice, in that
> >> you're suspecting flushes to take too long. I continue to have trouble
> >> with this, and I would therefore like to ask that you put this down in
> >> more technical terms, making connections to actual actions taken by
> >> software / hardware.
> >
> > I'm trying to figure out a pattern.
> >
> > Nominally all the devices are roughly on par (only a very cheap flash
> > device will be unable to overwhelm SATA's bandwidth). Yet why did the
> > Crucial SATA device /seem/ not to have the issue? Why did a Crucial NVMe
> > device demonstrate the issue.
> >
> > My guess is the flash controllers Samsung uses may be able to start
> > executing commands faster than the ones Crucial uses. Meanwhile NVMe
> > is lower overhead and latency than SATA (SATA's overhead isn't an issue
> > for actual disks). Perhaps the IOMMU is still flushing its TLB, or
> > hasn't loaded the new tables.
>
> Which would be an IOMMU issue then, that software at best may be able to
> work around.

Yet even if uses of RAID1 with flash are uncommon or rare, I would
expect this to have already manifested on Linux without Xen. In turn
this would suggest Linux likely already has some sort of workaround.
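
(A control test, if someone is in a position to run it: boot the same
installation bare-metal, without Xen, and repeat the same load; on
bare-metal Linux any AMD IOMMU faults would show up in the kernel log
instead. Roughly, with the same illustrative names as before:)

    dbench -D /mnt 8
    dmesg | grep -ic IO_PAGE_FAULT       # bare-metal AMD-Vi faults, if any
    cat /sys/block/md0/md/mismatch_cnt   # after an MD "check" pass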

I suspect this is a case of some step being missing from Xen's IOMMU
handling. Perhaps something which Linux does during an early DMA setup
stage, but which the current Xen implementation does lazily?
Alternatively some flag setting which is missing?

I should be able to try another test approach in a few weeks, but I
would love it if something could be found sooner.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445