On 7/16/20 5:57 AM, Sarah Newman wrote:
> On 7/14/20 2:00 AM, Hans van Kranenburg wrote:
>> On 7/14/20 1:16 AM, Adam Goryachev wrote:
>>>
>>> On 14/7/20 03:02, Hans van Kranenburg wrote:
>>>> Hi Casper,
>>>>
>>>> On 7/9/20 10:45 AM, Casper wrote:
>>>>> [...]
>>>>> Or problem with Debian Xen package as it not so popular anymore?
>>>>> Any suggestion what to test to figure out problem?
>>>
>>> BTW, I don't think is a general rule that Debian 10.4 with packages Xen
>>> 4.11 doesn't work.
>>
>> True. It just works (tm), until you have some edge case hardware that
>> misbehaves, or you run into an edge case bug with a very specific
>> combination of non-default configuration here and there (or when you try
>> to use EFI, cough).
>>
>> So, to add to the list:
>> * Run latest BIOS / cpu microcode that is available.
>> * Other firmware, e.g. for raid controller or whatever?
>> * Is the box using ECC memory? I mean, even a memory module that flips a
>> bit now and then can crash a server every few weeks... Run a memtest or
>> 7zip benchmark or what was the thing that's very good at exposing memory
>> errors...
>>
>> Also, feel free to open a bug report in the Debian bug tracker, we're
>> willing to help, but expect that you have to do the work to gather all
>> info. I don't have a similar piece of hardware lying around here... What
>> distro package maintainers can do is help users to gather enough info to
>> have a good report that doesn't waste too much time of the upstream
>> developers.
>
> Here is a bug I opened a week ago against Debian Buster:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964494
>
> It looks like only newer versions of the kernel are a problem. We think the trigger is either ext3 or Xen.
>
> The problem may not show up for weeks, and we do not know what triggers it.
>
> If anyone has more data points to add that would help isolate the issue to one or the other, it would be appreciated.

You're not running Debian Xen packages apparently, so I can't say much
about that part. Except that for the Debian stuff, we only use the
upstream stable-X.Y branches and never apply security patches from XSAs
ourselves manually. There are just too many ways to shoot yourself in
the foot doing that. The upstream staging-X.Y branch is tested before
the commits get pushed into stable-X.Y. Debian security updates are done
with that, and the other bug fixes and dependent commits in the stable
branch just also go in at the same time. Doing this means that we make
our users run something that the upstream developers will not disapprove
of, whenever we need to ask them to help with something. (Yes, for the
careful reader, that actually means that the current
4.11.4+24-gddaaccbbab-1~deb10u1 in buster-security is 100% the same as
it would be if it were in buster-backports).
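
For reference, a minimal Python sketch of how that version string can be
read. It assumes the string follows a `git describe`-like layout
(upstream release, number of commits on top of that release, abbreviated
commit hash, Debian revision). The function name and the regex are only
illustrative; they are not part of any Debian tooling.

```python
import re

# Assumed layout: <upstream>+<commits>-g<commit hash>-<Debian revision>
VERSION_RE = re.compile(
    r"^(?P<upstream>\d+(?:\.\d+)+)"   # upstream release, e.g. 4.11.4
    r"\+(?P<commits>\d+)"             # commits on top of that release
    r"-g(?P<commit>[0-9a-f]+)"        # abbreviated git commit hash
    r"-(?P<revision>.+)$"             # Debian revision, e.g. 1~deb10u1
)


def split_xen_version(version):
    """Split a package version string into its assumed components."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError("unexpected version layout: %r" % version)
    parts = match.groupdict()
    parts["commits"] = int(parts["commits"])
    return parts


if __name__ == "__main__":
    print(split_xen_version("4.11.4+24-gddaaccbbab-1~deb10u1"))
```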

But, is that Linux 4.9 in the dom0? Begin by eliminating that. Our
mileage may vary, but at work, we skipped from Jessie to Buster (well,
actually to our own stretch-backports) because I really could not get
anything working with Linux 4.9 as dom0 kernel after the whole
Spectre/Meltdown stuff unfolded. We never got to the bottom of it, due
to a big lack of time and kernel debugging knowledge/experience, but
what I have seen is random Oopses, disk corruption and other things.

Are you using live migration?

So, why not get those dom0s to the latest Xen 4.11 packages from Debian
and Linux 4.19? It's flying here, with several clusters of dozens of
servers and a few dozen TiB of memory, running thousands of domUs,
without any problem.

I agree with Ben that using ext3 nowadays should be discouraged, because
the amount of usage and testing it gets is decreasing.

But, I might have the luxury of working with a setup where we manage all
of it and have customers look at some GUI and have no idea about the
actual underlying systems. Having customers run anything they want is a
different slice of bread...

Anyway, the above is just some thinking out loud. I know that it's very
difficult to debug these kinds of things, because you need more failures
happening to be able to correlate, and a reliable reproduction scenario
would be the ultimate thing as a start to figure out what's actually
going wrong, but these are really difficult, time-consuming tasks.

Have fun,
Hans