Mailing List Archive

Debian 10, xen 4.11 reliability
Hello,

I have used Debian/Xen dom0 for many years. Debian 7 with the 3.2.41 kernel
and Xen 4.1.4 still works very reliably on some servers with no problems;
only the need to replace hard discs pushed me to install a new Debian 10
with newer Xen packages.

Wherever I install Debian 10 with the latest 4.19 kernel and the Xen
4.11.4-pre it comes with, everything crashes within 1-2 weeks. At first I
blamed the HW, but now it happens on old, reliable servers too. My config is
2-3 hard discs per server node in RAID1 with sw md raid; it crashes with md
losing access to a disc. After a reboot everything works as it should.

I tested different HW and discs, all with the same problem.
Loads are not big, just a few testing DomU nodes.
Any suggestions? I'm currently running the latest Debian 4.19 kernel and was
thinking of downgrading to test a different kernel.
Or is the problem with the Debian Xen package, as it is not so popular anymore?
Any suggestions on what to test to figure out the problem?

Sincerely,
Casper
Re: Debian 10, xen 4.11 reliability [ In reply to ]
Get rid of Systemd. Use Devuan. Works like a charm.

[ sent from mobile device ]

Casper <kl@os.lv> wrote on Thu, 9 July 2020, 10:50:

> [...]
Re: Debian 10, xen 4.11 reliability [ In reply to ]
BSD is not a bad idea, but it would be interesting to find the problem.
I'm curious to try a different kernel/Xen version to test the problem.

On 09.07.20 11:55, Goran wrote:
> Or take some BSD...
>
> [ sent from mobile device ]
>
> Goran <sendmailtogoran@gmail.com> wrote on Thu, 9 July 2020, 10:54:
>
>> Get rid of Systemd. Use Devuan. Works like a charm.
>>
>> [ sent from mobile device ]
>>
>> Casper <kl@os.lv> wrote on Thu, 9 July 2020, 10:50:
>>
>>> [...]
Re: Debian 10, xen 4.11 reliability [ In reply to ]
On 7/9/20 1:45 AM, Casper wrote:
> Hello,
>
> I have used Debian/Xen dom0 for many years, Debian 7, 3.2.41 kernel with Xen 4.1.4 on some servers still work very reliable with no problems, only
> wanted to change new hard discs pushed me to install new Debian 10 with Xen packages to newer version.
>
> Where I reinstall Debian 10, latest 4.19 kernel it comes with Xen 4.11.4-pre all crashes in 1-2weeks. In start I was blaming HW, but now it repeats in
> old reliable servers too. My config is I have 2-3 hard discs per server node and RAID1 with sw md raid, it crashes with md lost access for disc. After
> reboot all works as it should.
>
> I tested different HW and discs, all the same problems.
> Loads are no big, just few testing DomU nodes.
> Any suggestion? I`m currently running latest Debian kernel 4.19 was thinking to downgrade to test different kernel.
> Or problem with Debian Xen package as it not so popular anymore?
> Any suggestion what to test to figure out problem?

Kernel messages and all the kernel versions you've tried since Debian 7 please?

We're trying to track down some other issues with debian guests; maybe they are the same root cause.

--Sarah
Re: Debian 10, xen 4.11 reliability [ In reply to ]
Hello,

I have exactly the same problem with Xen 4.11 and Buster: Xen does not start (it does not even recognize the boot disk!) and
never detects the RAID 1 disks.
I cannot boot with Xen; the same kernel boots perfectly OK without Xen and all RAID disks are OK.
I can't find any clue on the Internet and can no longer use Xen (I have used it since 2009 with no problems...).

Regards

JP P

----- Original message -----
From: "Sarah Newman" <srn@prgmr.com>
To: "Casper" <kl@os.lv>, xen-users@lists.xenproject.org
Sent: Thursday, 9 July 2020 16:41:25
Subject: Re: Debian 10, xen 4.11 reliability

[...]
Re: Debian 10, xen 4.11 reliability [ In reply to ]
Please post bottom or inline

On 7/9/20 3:10 PM, JP P wrote:
>
> ----- Original message -----
> From: "Sarah Newman" <srn@prgmr.com>
> To: "Casper" <kl@os.lv>, xen-users@lists.xenproject.org
> Sent: Thursday, 9 July 2020 16:41:25
> Subject: Re: Debian 10, xen 4.11 reliability
>
> On 7/9/20 1:45 AM, Casper wrote:
>> Hello,
>>
>> I have used Debian/Xen dom0 for many years, Debian 7, 3.2.41 kernel with Xen 4.1.4 on some servers still work very reliable with no problems, only
>> wanted to change new hard discs pushed me to install new Debian 10 with Xen packages to newer version.
>>
>> Where I reinstall Debian 10, latest 4.19 kernel it comes with Xen 4.11.4-pre all crashes in 1-2weeks. In start I was blaming HW, but now it repeats in
>> old reliable servers too. My config is I have 2-3 hard discs per server node and RAID1 with sw md raid, it crashes with md lost access for disc. After
>> reboot all works as it should.
>>
>> I tested different HW and discs, all the same problems.
>> Loads are no big, just few testing DomU nodes.
>> Any suggestion? I`m currently running latest Debian kernel 4.19 was thinking to downgrade to test different kernel.
>> Or problem with Debian Xen package as it not so popular anymore?
>> Any suggestion what to test to figure out problem?
>
> Kernel messages and all the kernel versions you've tried since Debian 7 please?
>
> We're trying to track down some other issues with debian guests; maybe they are the same root cause.
>
> --Sarah
>

> Hello,
>
> I have exactly the same problem with XEN 4.11 and Buster, Xen does not start (it even does not recognize the boot disk!) and
> never detects the RAID 1 disks.
> I cannot boot with Xen, the same Kernel boots perfectly OK without XEN and all RAID disks are OK.
> I don't find any clue on Internet and can no more use Xen (I used it since 2009 with no problems ...).
>
> Regards
>
> JP P

I don't think this is the same issue. It's much different to have a problem come up several days after boot versus not booting at all.

--Sarah
Re: Debian 10, xen 4.11 reliability [ In reply to ]
Hi,

I am running Debian Buster with Xen 4.11 on dozens of machines. Works
like a charm. But I use a custom kernel, because the Debian kernel is too
old for me.

Don't have any problems with it.

Could you provide more info?

--
~Holger
Re: Debian 10, xen 4.11 reliability [ In reply to ]
Hi,

I think the default Debian kernel does not work with Xen.
I have updated the original Debian packages from, I think, 10.0 to 10.4 - all
the same.
Debian 7 works like a charm. I will switch to another kernel or build my own
custom one to test it.
I have md raid with 2-3 discs, tried with raid1 and raid5; it all fails,
mostly with an I/O problem accessing some disc. I had another issue with the
e1000e Intel driver, but it seems upgrading to the latest version fixed that.
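
For the md side of this, a quick check of which arrays are currently degraded can help correlate a crash with a disc dropping out. The following is a minimal sketch, assuming the usual `/proc/mdstat` layout where a missing member shows up as `_` in the `[UU]` status field; the function name `degraded_arrays` is made up for illustration.

```shell
#!/bin/sh
# Print the names of md arrays whose member status contains a "_"
# (a missing or failed disc), reading /proc/mdstat-style text on stdin.

degraded_arrays() {
    awk '
        /^md[^ ]* :/        { name = $1 }     # remember the current array
        /\[[U_]*_[U_]*\]/   { print name }    # status like [U_] or [_U]
    '
}

# Usage: degraded_arrays </proc/mdstat
```

An empty result means no array is reporting a missing member at that moment; `mdadm --detail /dev/md0` (device name assumed) gives the per-disc view.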

What more info are you interested in?

Casper

On 11.07.20 13:35, Holger Schramm wrote:
> Hi,
>
> i am running Debian Buster with Xen 4.11 on dozens of machines. works
> like a charm. but i use a custom kernel, because debian kernel is too
> old for me.
>
> Don't have any problems with it.
>
> Could you provide more infos?
>
Re: Debian 10, xen 4.11 reliability [ In reply to ]
Hello,

I tried some different kernels:
Debian 5.4 and 5.5,
"home-made" builds of kernels 5.6 and 5.7, with no more luck.

RAID disks (RAID 0 here) are never recognized... and I can't boot with Xen.
I had to use KVM to start the different virtual machines...
Xen 4.11 is definitely dead for me.

Regards

JP P

----- Original message -----
From: "Casper" <kl@os.lv>
To: xen-users@lists.xenproject.org
Sent: Saturday, 11 July 2020 16:25:11
Subject: Re: Debian 10, xen 4.11 reliability

[...]
Re: Debian 10, xen 4.11 reliability [ In reply to ]
Hi Casper,

On 7/9/20 10:45 AM, Casper wrote:
>
> I have used Debian/Xen dom0 for many years, Debian 7, 3.2.41 kernel with
> Xen 4.1.4 on some servers still work very reliable with no problems,
> only wanted to change new hard discs pushed me to install new Debian 10
> with Xen packages to newer version.
>
> Where I reinstall Debian 10, latest 4.19 kernel it comes with Xen
> 4.11.4-pre all crashes in 1-2weeks. In start I was blaming HW, but now
> it repeats in old reliable servers too. My config is I have 2-3 hard
> discs per server node and RAID1 with sw md raid, it crashes with md lost
> access for disc. After reboot all works as it should.
>
> I tested different HW and discs, all the same problems.
> Loads are no big, just few testing DomU nodes.
> Any suggestion? I`m currently running latest Debian kernel 4.19 was
> thinking to downgrade to test different kernel.
> Or problem with Debian Xen package as it not so popular anymore?
> Any suggestion what to test to figure out problem?

The first suggestion is, like Sarah also mentions, capturing logging,
and providing that. Also, all details possible, do not leave out minor
things that you think "this for sure won't cause it" about. Even your
raid controller type.

Does the entire machine crash? Or does only a virtual machine crash? Are
the domUs also Debian Buster?

If something crashes, can you still login to the dom0? Can you do dmesg
and xl dmesg there?
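
If the dom0 is still reachable, the output asked for above can be captured in one go. This is only a sketch: the helper names `collect` and `collect_all` and the output directory are made up for illustration, and it assumes a POSIX shell run as root in the dom0. It uses nothing beyond `uname`, `dmesg`, `xl dmesg`, `xl info`, `xl list` and `/proc/mdstat`; commands that are not installed are noted rather than treated as errors.

```shell
#!/bin/sh
# Collect dom0 diagnostics into one directory for attaching to a report.

collect() {
    # $1 = output file, remaining arguments = command to run
    out=$1; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$out" 2>&1 || echo "command exited non-zero: $*" >>"$out"
    else
        echo "not available: $1" >"$out"
    fi
}

collect_all() {
    # $1 = target directory (default: a timestamped one under /var/tmp)
    dir=${1:-/var/tmp/xen-diag-$(date +%Y%m%d-%H%M%S)}
    mkdir -p "$dir" || return 1
    collect "$dir/uname.txt"    uname -a
    collect "$dir/dmesg.txt"    dmesg
    collect "$dir/xl-dmesg.txt" xl dmesg
    collect "$dir/xl-info.txt"  xl info
    collect "$dir/xl-list.txt"  xl list
    collect "$dir/mdstat.txt"   cat /proc/mdstat
    echo "$dir"
}

# Usage (as root in the dom0): collect_all
```

Running it once right after boot and once after a failure gives two snapshots to compare.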

If a domU crashes, then enable logging of the xen console to log files
on the dom0. Also look at xl dmesg to see if there's anything added
while it happens, pointing at Xen choosing to destroy the domU when it's
trying to do something that's absolutely not allowed.

If it completely crashes, then you need a serial console to capture
whatever it tries to tell you in the last breath, because, otherwise
it's invisible.

Use loglvl=all and guest_loglvl=all on the hypervisor command line
(GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub.d/xen.cfg).
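
Put together, the hypervisor command line could look like the fragment below. The `loglvl` and `guest_loglvl` values are the ones named above; the commented serial-console variant is an assumption about a machine with a first serial port at 115200 baud and has to be adjusted to the actual hardware.

```shell
# Fragment for /etc/default/grub.d/xen.cfg (the file named above).
# Maximum hypervisor and guest log verbosity, as suggested:
GRUB_CMDLINE_XEN_DEFAULT="loglvl=all guest_loglvl=all"

# Assumed variant with a serial console on the first port at 115200 baud;
# adjust to the real hardware before uncommenting:
# GRUB_CMDLINE_XEN_DEFAULT="loglvl=all guest_loglvl=all com1=115200,8n1 console=com1,vga"

# Afterwards, as root: run update-grub and reboot. The active options then
# show up in the xen_commandline line of `xl info`.
```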

Also, at least update to newest xen and kernel packages in Buster.
That's 4.11.4+24-gddaaccbbab-1~deb10u1 for Xen and 4.19.118-2+deb10u1
for Linux.

When only saying 'it does not work', nobody can help you.

Hans
Re: Debian 10, xen 4.11 reliability [ In reply to ]
On 14/7/20 03:02, Hans van Kranenburg wrote:
> Hi Casper,
>
> On 7/9/20 10:45 AM, Casper wrote:
>> I have used Debian/Xen dom0 for many years, Debian 7, 3.2.41 kernel with
>> Xen 4.1.4 on some servers still work very reliable with no problems,
>> only wanted to change new hard discs pushed me to install new Debian 10
>> with Xen packages to newer version.
>>
>> Where I reinstall Debian 10, latest 4.19 kernel it comes with Xen
>> 4.11.4-pre all crashes in 1-2weeks. In start I was blaming HW, but now
>> it repeats in old reliable servers too. My config is I have 2-3 hard
>> discs per server node and RAID1 with sw md raid, it crashes with md lost
>> access for disc. After reboot all works as it should.
>>
>> I tested different HW and discs, all the same problems.
>> Loads are no big, just few testing DomU nodes.
>> Any suggestion? I`m currently running latest Debian kernel 4.19 was
>> thinking to downgrade to test different kernel.
>> Or problem with Debian Xen package as it not so popular anymore?
>> Any suggestion what to test to figure out problem?

BTW, I don't think it is a general rule that Debian 10.4 with the Xen
4.11 packages doesn't work. I have a couple of Debian 10 boxes running
multiple DomUs and they are working well:

ii  libxencall1:i386           4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen runtime library - libxencall
ii  libxendevicemodel1:i386    4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen runtime libraries - libxendevicemodel
ii  libxenevtchn1:i386         4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen runtime libraries - libxenevtchn
ii  libxenforeignmemory1:i386  4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen runtime libraries - libxenforeignmemory
ii  libxengnttab1:i386         4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen runtime libraries - libxengnttab
ii  libxenmisc4.11:i386        4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen runtime libraries - miscellaneous, versioned ABI
ii  libxenstore3.0:i386        4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen runtime libraries - libxenstore
ii  libxentoolcore1:i386       4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen runtime libraries - libxentoolcore
ii  libxentoollog1:i386        4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen runtime libraries - libxentoollog
ii  xen-hypervisor-4.11-amd64  4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen Hypervisor on AMD64
ii  xen-hypervisor-common      4.11.3+24-g14b62ab3e5-1~deb10u1  all   Xen Hypervisor - common files
ii  xen-utils-4.11             4.11.3+24-g14b62ab3e5-1~deb10u1  i386  XEN administrative tools
ii  xen-utils-common           4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xen administrative tools - common files
ii  xenstore-utils             4.11.3+24-g14b62ab3e5-1~deb10u1  i386  Xenstore command line utilities for Xen

Linux flail 4.19.0-9-686-pae #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) i686 GNU/Linux

09:15:19 up 29 days, 12:56,  2 users,  load average: 0.23, 0.19, 0.18

(This machine was only recently updated, hence a recent reboot, but
almost a month seems a lot longer than what is being complained about here)

So, YMMV, but I think you will definitely need to provide more detailed
information on what happens if you are even slightly interested in
finding and fixing the cause.

Regards,
Adam
Re: Debian 10, xen 4.11 reliability [ In reply to ]
On 7/14/20 1:16 AM, Adam Goryachev wrote:
>
> On 14/7/20 03:02, Hans van Kranenburg wrote:
>> Hi Casper,
>>
>> On 7/9/20 10:45 AM, Casper wrote:
>>> [...]
>>> Or problem with Debian Xen package as it not so popular anymore?
>>> Any suggestion what to test to figure out problem?
>
> BTW, I don't think is a general rule that Debian 10.4 with packages Xen
> 4.11 doesn't work.

True. It just works (tm), until you have some edge case hardware that
misbehaves, or you run into an edge case bug with a very specific
combination of non-default configuration here and there (or when you try
to use EFI, cough).

So, to add to the list:
* Run latest BIOS / cpu microcode that is available.
* Other firmware, e.g. for raid controller or whatever?
* Is the box using ECC memory? I mean, even a memory module that flips a
bit now and then can crash a server every few weeks... Run a memtest or
7zip benchmark or what was the thing that's very good at exposing memory
errors...
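
On the ECC question in the list above: where the dom0 kernel has an EDAC driver loaded for the memory controller, corrected and uncorrected error counts are exposed in sysfs under `/sys/devices/system/edac/mc/`. The sketch below just prints them; `edac_report` is a made-up name, and whether EDAC reports anything at all under Xen depends on hardware and driver, so an empty result proves nothing either way.

```shell
#!/bin/sh
# Print corrected/uncorrected ECC error counters per memory controller.

edac_report() {
    # $1 = EDAC base directory (default: the standard sysfs location)
    base=${1:-/sys/devices/system/edac/mc}
    found=0
    for mc in "$base"/mc*; do
        [ -r "$mc/ce_count" ] || continue
        found=1
        printf '%s corrected=%s uncorrected=%s\n' \
            "${mc##*/}" "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count" 2>/dev/null)"
    done
    [ "$found" -eq 1 ] || echo "no EDAC memory controllers found under $base"
}

# Usage: edac_report
```

A corrected count that keeps growing between runs would point at a memory module worth swapping before looking further at Xen or the kernel.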

Also, feel free to open a bug report in the Debian bug tracker, we're
willing to help, but expect that you have to do the work to gather all
info. I don't have a similar piece of hardware lying around here... What
distro package maintainers can do is help users to gather enough info to
have a good report that doesn't waste too much time of the upstream
developers.

Hans (also member of Debian Xen team)

P.S. About the EFI thing, apparently that often does not work, testers
wanted to figure out in what cases, and how to make it work!
Re: Debian 10, xen 4.11 reliability [ In reply to ]
On Thursday, July 9, 2020, Casper <kl@os.lv> wrote:
> Hello,
>
> I have used Debian/Xen dom0 for many years, Debian 7, 3.2.41 kernel with
Xen 4.1.4 on some servers still work very reliable with no problems, only
wanted to change new hard discs pushed me to install new Debian 10 with Xen
packages to newer version.
>
> Where I reinstall Debian 10, latest 4.19 kernel it comes with Xen
4.11.4-pre all crashes in 1-2weeks. In start I was blaming HW, but now it
repeats in old reliable servers too. My config is I have 2-3 hard discs per
server node and RAID1 with sw md raid, it crashes with md lost access for
disc. After reboot all works as it should.
>
> I tested different HW and discs, all the same problems.
> Loads are no big, just few testing DomU nodes.
> Any suggestion? I`m currently running latest Debian kernel 4.19 was
thinking to downgrade to test different kernel.
> Or problem with Debian Xen package as it not so popular anymore?
> Any suggestion what to test to figure out problem?
>
> Sincerely,
> Casper
>
>

Just a shot in the dark: which scheduler are you using? If credit2, try the
legacy credit scheduler.

https://wiki.gentoo.org/wiki/Xen#Xen_domU_hanging_with_Xen_4.12.2B
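
To see which scheduler is active, the `xen_scheduler` line of `xl info` can be picked out as sketched below (the function name is made up for illustration). As far as I know credit2 only became the default with Xen 4.12, so a stock 4.11 may well report `credit` already. Switching is done with the `sched=` option on the hypervisor command line.

```shell
#!/bin/sh
# Print the active Xen scheduler, reading `xl info` output on stdin.

xen_scheduler() {
    sed -n 's/^xen_scheduler[[:space:]]*:[[:space:]]*//p'
}

# Usage (as root in the dom0): xl info | xen_scheduler
#
# To force the older scheduler, add sched=credit to the Xen options, e.g.
# in GRUB_CMDLINE_XEN_DEFAULT, then run update-grub and reboot.
```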
Re: Debian 10, xen 4.11 reliability [ In reply to ]
Another blind shot and this may not be very helpful, but I have seen
general instability with 4.11 on certain hardware that was never fully
diagnosed. It appears to be magically resolved on newer versions.

I suspected this: https://xenbits.xen.org/xsa/advisory-294.html at the
time as some silicon was changed (but not underlying hardware
platform), introducing PCID in all cases. I did not prove this and it
was not a denial of service failure mode.

If you can, perhaps building from source or applying that patch could
help. Please let us know how you get on, issues like this are always
very frustrating.

On Tue, 14 Jul 2020 at 15:46, Tomas Mozes <hydrapolic@gmail.com> wrote:
> [...]
Re: Debian 10, xen 4.11 reliability [ In reply to ]
On 7/14/20 2:00 AM, Hans van Kranenburg wrote:
> On 7/14/20 1:16 AM, Adam Goryachev wrote:
>>
>> On 14/7/20 03:02, Hans van Kranenburg wrote:
>>> Hi Casper,
>>>
>>> On 7/9/20 10:45 AM, Casper wrote:
>>>> [...]
>>>> Or problem with Debian Xen package as it not so popular anymore?
>>>> Any suggestion what to test to figure out problem?
>>
>> BTW, I don't think is a general rule that Debian 10.4 with packages Xen
>> 4.11 doesn't work.
>
> True. It just works (tm), until you have some edge case hardware that
> misbehaves, or you run into an edge case bug with a very specific
> combination of non-default configuration here and there (or when you try
> to use EFI, cough).
>
> So, to add to the list:
> * Run latest BIOS / cpu microcode that is available.
> * Other firmware, e.g. for raid controller or whatever?
> * Is the box using ECC memory? I mean, even a memory module that flips a
> bit now and then can crash a server every few weeks... Run a memtest or
> 7zip benchmark or what was the thing that's very good at exposing memory
> errors...
>
> Also, feel free to open a bug report in the Debian bug tracker, we're
> willing to help, but expect that you have to do the work to gather all
> info. I don't have a similar piece of hardware lying around here... What
> distro package maintainers can do is help users to gather enough info to
> have a good report that doesn't waste too much time of the upstream
> developers.

Here is a bug I opened a week ago against Debian Buster:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964494

It looks like only newer versions of the kernel are a problem. We think the trigger is either ext3 or Xen.

The problem may not show up for weeks, and we do not know what triggers it.

If anyone has more data points to add that would help isolate the issue to one or the other, it would be appreciated.

--Sarah
Re: Debian 10, xen 4.11 reliability [ In reply to ]
On 7/16/20 5:57 AM, Sarah Newman wrote:
> On 7/14/20 2:00 AM, Hans van Kranenburg wrote:
>> On 7/14/20 1:16 AM, Adam Goryachev wrote:
>>>
>>> On 14/7/20 03:02, Hans van Kranenburg wrote:
>>>> Hi Casper,
>>>>
>>>> On 7/9/20 10:45 AM, Casper wrote:
>>>>> [...]
>>>>> Or problem with Debian Xen package as it not so popular anymore?
>>>>> Any suggestion what to test to figure out problem?
>>>
>>> BTW, I don't think is a general rule that Debian 10.4 with packages Xen
>>> 4.11 doesn't work.
>>
>> True. It just works (tm), until you have some edge case hardware that
>> misbehaves, or you run into an edge case bug with a very specific
>> combination of non-default configuration here and there (or when you try
>> to use EFI, cough).
>>
>> So, to add to the list:
>> * Run latest BIOS / cpu microcode that is available.
>> * Other firmware, e.g. for raid controller or whatever?
>> * Is the box using ECC memory? I mean, even a memory module that flips a
>> bit now and then can crash a server every few weeks... Run a memtest or
>> 7zip benchmark or what was the thing that's very good at exposing memory
>> errors...
>>
>> Also, feel free to open a bug report in the Debian bug tracker, we're
>> willing to help, but expect that you have to do the work to gather all
>> info. I don't have a similar piece of hardware lying around here... What
>> distro package maintainers can do is help users to gather enough info to
>> have a good report that doesn't waste too much time of the upstream
>> developers.
>
> Here is a bug I opened a week ago against Debian Buster:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964494
>
> It looks like only newer versions of the kernel are a problem. We think the trigger is either ext3 or Xen.
>
> The problem may not show up for weeks, and we do not know what triggers it.
>
> If anyone has more data points to add that would help isolate the issue to one or the other, it would be appreciated.

You're not running Debian Xen packages apparently, so I can't say much
about that part. Except that for the Debian stuff, we only use the
upstream stable-X.Y branches and never apply security patches from XSAs
ourselves manually. There are just too many ways to shoot yourself in
the foot. The upstream staging-X.Y branch is tested before the
commits get pushed into stable-X.Y. Debian security updates are done
with that, and the other bug fixes and dependent commits in the stable
branch just also go in at the same time. Doing this means that we make
our users run something that the upstream developers will not disapprove
of, whenever we need to ask them to help with something. (Yes, for the
careful reader, that actually means that the current
4.11.4+24-gddaaccbbab-1~deb10u1 in buster-security is 100% the same as
it would be if it were in buster-backports).

But, is that Linux 4.9 in the dom0? Begin by eliminating that. Our
mileage may vary, but at work, we skipped from Jessie to Buster (well,
actually to our own stretch-backports) because I really could not get
anything working with Linux 4.9 as dom0 kernel after the whole
Spectre/Meltdown stuff unfolded. We never got to the bottom of it, due
to a big lack of time and kernel debugging knowledge/experience, but
what I have seen is random Oopses, disk corruption and other things.

Are you using live migration?

So, why not get those dom0s to latest Xen 4.11 packages from Debian and
Linux 4.19? It's flying here, with several clusters of dozens of servers
and a few dozen TiB of mems, running thousands of domUs, without any
problem.

I agree with Ben that using ext3 nowadays should be discouraged because
of the amount of usage and testing decreasing.

But, I might have the luxury of working with a setup where we manage all
of it and have customers look at some GUI and have no idea about the
actual underlying systems. Having customers run anything they want is a
different slice of bread...

Anyway, the above is just some thinking out loud. I know that it's very
difficult to debug these kinds of things, because you need more failures
happening to be able to correlate, and a reliable reproduction scenario
would be the ultimate thing as a start to figure out what's actually
going wrong, but these are really difficult time consuming tasks.

Have fun,
Hans
Re: Debian 10, xen 4.11 reliability [ In reply to ]
On 7/16/20 2:34 PM, Hans van Kranenburg wrote:

> You're not running Debian Xen packages apparently, so I can't say much
> about that part.

> But, is that Linux 4.9 in the dom0? Begin by eliminating that.

We've been running Linux 4.9 for a long time, though we plan to upgrade soon.

The timing does not correlate, and far less than one percent of our users are having issues.

> Our
> mileage may vary, but at work, we skipped from Jessie to Buster (well,
> actually to our own stretch-backports) because I really could not get
> anything working with Linux 4.9 as dom0 kernel after the whole
> Spectre/Meltdown stuff unfolded. We never got to the bottom of it, due
> to a big lack of time and kernel debugging knowledge/experience, but
> what I have seen is random Oopses, disk corruption and other things.

There were panics in the dom0 which I traced to a network driver, and I fixed it.

This is the first time we've had complaints of file system corruption.

> Are you using live migration?

Not so recently that it would have affected the two systems with problems.

>
> So, why not get those dom0s to latest Xen 4.11 packages from Debian and
> Linux 4.19? It's flying here, with several clusters of dozens of servers
> and a few dozen TiB of mems, running thousands of domUs, without any
> problem.

Are your dom0's running the latest kernel version? Are they running ext3? What uptime have they had?

What about the domU's?

>
> I agree with Ben that using ext3 nowadays should be discouraged because
> of the amount of usage and testing decreasing.

Yes. I think Debian and Ubuntu are the only distributions where we might have users who are using an old file system with a new kernel, which is why
I'm focused on ext3. But I can't say for certain.

> But, I might have the luxury of working with a setup where we manage all
> of it and have customers look at some GUI and have no idea about the
> actual underlying systems. Having customers run anything they want is a
> different slice of bread...

It very much is.

>
> Anyway, the above is just some thinking out loud. I know that it's very
> difficult to debug these kinds of things, because you need more failures
> happening to be able to correlate, and a reliable reproduction scenario
> would be the ultimate thing as a start to figure out what's actually
> going wrong, but these are really difficult time consuming tasks.

We're trying.

Thanks, Sarah
Re: Debian 10, xen 4.11 reliability [ In reply to ]
Hi,

On 7/16/20 11:58 PM, Sarah Newman wrote:
> On 7/16/20 2:34 PM, Hans van Kranenburg wrote:
>
>> You're not running Debian Xen packages apparently, so I can't say much
>> about that part.
>
>> But, is that Linux 4.9 in the dom0? Begin by eliminating that.
>
> We've been running Linux 4.9 for a long time, though we plan to upgrade soon.
>
> The timing does not correlate, and far less than one percent of our users are having issues.
>
>> Our
>> mileage may vary, but at work, we skipped from Jessie to Buster (well,
>> actually to our own stretch-backports) because I really could not get
>> anything working with Linux 4.9 as dom0 kernel after the whole
>> Spectre/Meltdown stuff unfolded. We never got to the bottom of it, due
>> to a big lack of time and kernel debugging knowledge/experience, but
>> what I have seen is random Oopses, disk corruption and other things.
>
> There were panics in the dom0 which I traced to a network driver, and I fixed it.

Oh, wonderful, thanks! :)

> This is the first time we've had complaints of file system corruption.
>
>> Are you using live migration?
>
> Not so recently that it would have affected the two systems with problems.
>
>>
>> So, why not get those dom0s to latest Xen 4.11 packages from Debian and
>> Linux 4.19? It's flying here, with several clusters of dozens of servers
>> and a few dozen TiB of mems, running thousands of domUs, without any
>> problem.
>
> Are your dom0's running the latest kernel version? Are they running ext3? What uptime have they had?

There's certainly 4.19.118-2 based dom0 kernels in the mix, yes. Dom0
filesystem is ext4.

> What about the domU's?

Some quite heavily used domUs on these servers. And filesystems are
either ext4 or btrfs.

So, no ext3 anywhere, at all.

>> I agree with Ben that using ext3 nowadays should be discouraged because
>> of the amount of usage and testing decreasing.
>
> Yes. I think Debian and Ubuntu are the only distributions where we might have users who are using an old file system with a new kernel, which is why
> I'm focused on ext3. But I can't say for certain.

Interesting. I have no great ideas or anything right now, sorry.

I can of course grab a test domU here and create an ext3 fs on an extra
block device, make it do something and then see what happens after some
time...
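
A minimal sketch of what "make it do something" could look like, assuming the
ext3 test filesystem is mounted at some directory inside the test domU. The
function names and the file naming scheme are made up for illustration; the
idea is only to write data with known checksums and re-check it later:

```python
import hashlib
import os
from pathlib import Path


def write_test_files(root: Path, count: int = 100, size: int = 1 << 20) -> dict[str, str]:
    """Write `count` files of random data under `root`; return name -> SHA-256."""
    root.mkdir(parents=True, exist_ok=True)
    digests = {}
    for i in range(count):
        data = os.urandom(size)
        path = root / f"blob-{i:05d}.bin"
        with open(path, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # ask the kernel to push the data to the block device
        digests[path.name] = hashlib.sha256(data).hexdigest()
    return digests


def verify_test_files(root: Path, digests: dict[str, str]) -> list[str]:
    """Return the names of files that are missing, unreadable, or changed."""
    bad = []
    for name, expected in sorted(digests.items()):
        try:
            actual = hashlib.sha256((root / name).read_bytes()).hexdigest()
        except OSError:
            bad.append(name)
            continue
        if actual != expected:
            bad.append(name)
    return bad
```

Reads shortly after a write are likely served from the page cache, so a
verification pass only says something about the disk path after the
filesystem has been remounted or the domU rebooted; the digests would have to
be stored somewhere outside the filesystem under test.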

>> But, I might have the luxury of working with a setup where we manage all
>> of it and have customers look at some GUI and have no idea about the
>> actual underlying systems. Having customers run anything they want is a
>> different slice of bread...
>
> It very much is.
>
>>
>> Anyway, the above is just some thinking out loud. I know that it's very
>> difficult to debug these kinds of things, because you need more failures
>> happening to be able to correlate, and a reliable reproduction scenario
>> would be the ultimate thing as a start to figure out what's actually
>> going wrong, but these are really difficult time consuming tasks.
>
> We're trying.

Good luck

Hans
Re: Debian 10, xen 4.11 reliability [ In reply to ]
Hello,

I can report that my root and other filesystems are ext3 too. But is it
correct that Debian uses the ext4 driver to mount ext3?

[    8.314622] EXT4-fs (md0): mounting ext3 file system using the ext4 subsystem
[   10.192765] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)

All domUs use ext3, even the new Debian machines.

Casper

On 16.07.20 06:57, Sarah Newman wrote:
> On 7/14/20 2:00 AM, Hans van Kranenburg wrote:
>> On 7/14/20 1:16 AM, Adam Goryachev wrote:
>>>
>>> On 14/7/20 03:02, Hans van Kranenburg wrote:
>>>> Hi Casper,
>>>>
>>>> On 7/9/20 10:45 AM, Casper wrote:
>>>>> [...]
>>>>> Or problem with Debian Xen package as it not so popular anymore?
>>>>> Any suggestion what to test to figure out problem?
>>>
>>> BTW, I don't think it is a general rule that Debian 10.4 with the Xen 4.11
>>> packages doesn't work.
>>
>> True. It just works (tm), until you have some edge case hardware that
>> misbehaves, or you run into an edge case bug with a very specific
>> combination of non-default configuration here and there (or when you try
>> to use EFI, cough).
>>
>> So, to add to the list:
>> * Run latest BIOS / cpu microcode that is available.
>> * Other firmware, e.g. for raid controller or whatever?
>> * Is the box using ECC memory? I mean, even a memory module that flips a
>> bit now and then can crash a server every few weeks... Run a memtest or
>> 7zip benchmark or what was the thing that's very good at exposing memory
>> errors...
>>
>> Also, feel free to open a bug report in the Debian bug tracker, we're
>> willing to help, but expect that you have to do the work to gather all
>> info. I don't have a similar piece of hardware lying around here... What
>> distro package maintainers can do is help users to gather enough info to
>> have a good report that doesn't waste too much time of the upstream
>> developers.
>
> Here is a bug I opened a week ago against Debian Buster:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964494
>
> It looks like only newer versions of the kernel are a problem. We think
> the trigger is either ext3 or Xen.
>
> The problem may not show up for weeks, and we do not know what triggers it.
>
> If anyone has more data points to add that would help isolate the issue
> to one or the other, it would be appreciated.
>
> --Sarah
Re: Debian 10, xen 4.11 reliability [ In reply to ]
I was experimenting with the system a bit. It got only a day of uptime
before it crashed, this time with a different message:

Jul 20 19:20:28 test systemd[1]: Stopping Availability of block devices...
Jul 20 19:20:28 test systemd[1]: session-1.scope: Killing process 866 (screen) with signal SIGTERM.
Jul 20 19:20:28 test systemd[1]: session-1.scope: Killing process 867 (bash) with signal SIGTERM.
Jul 20 19:20:28 test systemd[1]: session-1.scope: Killing process 955 (bash) with signal SIGTERM.
Jul 20 19:20:28 test systemd[1]: session-1.scope: Killing process 30295 (ssh) with signal SIGTERM.
Jul 20 19:20:28 test systemd[1]: session-1.scope: Killing process 30298 (sshfs) with signal SIGTERM.
Jul 20 19:20:28 test systemd[1]: session-1.scope: Killing process 30307 (ssh) with signal SIGTERM.
Jul 20 19:20:28 test systemd[1]: session-1.scope: Killing process 30310 (sshfs) with signal SIGTERM.
Jul 20 19:20:28 test systemd[1]: session-1.scope: Killing process 2730 (xl) with signal SIGTERM.
Jul 20 19:20:28 test systemd[1]: session-1.scope: Killing process 3061 (xl) with signal SIGTERM.
Jul 20 19:20:28 test systemd[1]: Stopping Session 1 of user casper.
Jul 20 19:20:28 test systemd[1]: Stopping Session 352 of user casper.
Jul 20 19:20:28 test systemd[1]: Stopping LVM event activation on device 9:2...
Jul 20 19:20:28 test systemd[1]: Stopping Session 419 of user root.
Jul 20 19:20:28 test systemd[1]: lvm2-lvmpolld.socket: Succeeded.
Jul 20 19:20:28 test systemd[1]: Closed LVM2 poll daemon socket.
Jul 20 19:20:28 test systemd[1]: Stopped target Graphical Interface.
Jul 20 19:20:28 test systemd[1]: Stopped target Multi-User System.


Re: Debian 10, xen 4.11 reliability [ In reply to ]
On 7/20/20 1:53 AM, Casper wrote:
> Hello,
>
> I can report I have root and rest fs ext3 too, but is it correct Debian uses ext4 for mounting ext3?
>
> [    8.314622] EXT4-fs (md0): mounting ext3 file system using the ext4 subsystem
> [   10.192765] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
>
> All domU use ext3, even new debian machines.
>
> Casper

Hi Casper,

That still counts as ext3; the functionality has simply been taken over by the ext4 module.
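
For anyone who wants to find out which devices are affected across several
machines, here is a rough sketch that scans kernel log text for the message
quoted above. The function name is made up, and the pattern is based only on
the exact wording of that log line, so other kernel versions may phrase it
differently:

```python
import re

# Matches the kernel message emitted when the ext4 driver mounts an ext3 filesystem.
EXT3_VIA_EXT4 = re.compile(
    r"EXT4-fs \((?P<dev>[^)]+)\): mounting ext3 file system using the ext4 subsystem"
)


def ext3_devices(kernel_log: str) -> list[str]:
    """Return devices, in first-seen order, that were mounted as ext3 by the ext4 driver."""
    seen = []
    for match in EXT3_VIA_EXT4.finditer(kernel_log):
        dev = match.group("dev")
        if dev not in seen:
            seen.append(dev)
    return seen


if __name__ == "__main__":
    sample = (
        "[    8.314622] EXT4-fs (md0): mounting ext3 file system using the ext4 subsystem\n"
        "[   10.192765] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)\n"
    )
    print(ext3_devices(sample))
```

The log text could come from the output of dmesg or from a saved kern.log;
how it is collected is left out here on purpose.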

You should probably upgrade to ext4 per

https://debian-administration.org/article/643/Migrating_a_live_system_from_ext3_to_ext4_filesystem
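
As a rough outline only: the commonly documented offline conversion enables
the ext4 features with tune2fs and then forces a full e2fsck. The sketch
below just builds and prints those commands without running anything. It is
based on the usual e2fsprogs procedure and not copied from the article above;
the device name is a placeholder, and it assumes the filesystem is unmounted
and backed up first.

```python
import shlex


def ext3_to_ext4_plan(device: str) -> list[list[str]]:
    """Return the commands (not executed) for an offline ext3 -> ext4 conversion."""
    return [
        # Enable the on-disk features that distinguish ext4 from ext3.
        ["tune2fs", "-O", "extents,uninit_bg,dir_index", device],
        # A forced check is needed after changing features; -D also optimizes directories.
        ["e2fsck", "-f", "-D", device],
    ]


if __name__ == "__main__":
    # /dev/xvdb1 is a hypothetical device name used only for illustration.
    for cmd in ext3_to_ext4_plan("/dev/xvdb1"):
        print(shlex.join(cmd))
```

The matching /etc/fstab entry would then need its type changed from ext3 to
ext4. Existing files keep their old block mapping after the conversion; only
newly written files use extents.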

--Sarah