Mailing List Archive

Xen 4.18/ARM64 on Raspberry Pi 4B: VLAN traffic crashing Dom0
Hi Xen users list,

I've come across something strange. I recently set up Xen
(self-compiled, version 4.18) on a Raspberry Pi 4B. Until yesterday,
everything worked well with one DomU and one bridge.

Then I wanted to actually make use of the virtualization and started to
set up a second Debian Bookworm DomU (using xen-create-image) for
monitoring my systems with Zabbix. The bridge used for this setup was
the one bridging the hardware NIC. I installed Zabbix and set it up, and
everything went well: I could access the web interface without any problem.

Then I set up VLANs (using VLAN numbers 1 and 2, which I probably
shouldn't, but it's a working setup, and therefore...) to separate
network traffic between the DomUs. I made the existing bridge carry
VLAN 1 (bridge 1) and created a second bridge for VLAN 2 (bridge 2).
Using only bridge 1 / VLAN 1, everything works well; I can access the
Zabbix web interface without any noticeable issue. After switching the
Zabbix DomU to VLAN 2 / bridge 2, everything still seems to work: I can
ping different devices in my network from the Zabbix DomU and vice
versa, and I can ssh into the machine.

However, as soon as I remotely access the Zabbix web interface, the
complete system (DomUs and Dom0) becomes unresponsive and reboots after
a couple of seconds. This is reliably reproducible.

I didn't see any error message in any log (zabbix, DomU syslog, Dom0
syslog) except for the following lines immediately before the system
reboots on the Xen serial console:

(XEN) Watchdog timer fired for domain 0
(XEN) Hardware Dom0 shutdown: watchdog rebooting machine

As soon as I switch the Zabbix DomU back to bridge 1 (with or without
the VLAN setup), the web interface is accessible again after booting it.

So I assume that high traffic on the virtual NIC, in a VLAN setup with
more than one VLAN, hard-crashes the system. Of course, there might be
other causes I'm not aware of, but this seems to be the most likely
explanation.

I'd appreciate any hints on how to troubleshoot this and/or how to
proceed otherwise (bug report?).

Thanks,

Paul


xl info:

host : ***
release : 6.1.0-11-arm64
version : #1 SMP Debian 6.1.38-4 (2023-08-08)
machine : aarch64
nr_cpus : 4
max_cpu_id : 3
nr_nodes : 1
cores_per_socket : 1
threads_per_core : 1
cpu_mhz : 54.000
hw_caps : 00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000
virt_caps : hvm hap vpmu gnttab-v1
arm_sve_vector_length : 0
total_memory : 8043
free_memory : 5859
sharing_freed_memory : 0
sharing_used_memory : 0
outstanding_claims : 0
free_cpus : 0
xen_major : 4
xen_minor : 18
xen_extra : -unstable
xen_version : 4.18-unstable
xen_caps : xen-3.0-aarch64 xen-3.0-armv7l
xen_scheduler : credit2
xen_pagesize : 4096
platform_params : virt_start=0x0
xen_changeset : Fri Aug 11 09:59:49 2023 +0200 git:a9a3b432a8
xen_commandline : placeholder dom0_mem=1024M,max:1024M console=dtuart dtuart=serial1 sync_console no-real-mode edd=off
cc_compiler : gcc (Debian 12.2.0-14) 12.2.0
cc_compile_by : root
cc_compile_domain : ***
cc_compile_date : Mon Aug 14 22:05:30 CEST 2023
build_id : b21191ae0cd2ff49905c166a665dc30d9a4b1fbf
xend_config_format : 4
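To check the hypervisor options at a glance (for instance whether "loglvl=all" is already set), here is a minimal Python sketch, not part of any Xen tooling, that parses `xl info`-style "key : value" output, treating a line without a "key :" prefix as a wrapped continuation of the previous value. The function names are hypothetical.

```python
import re

# Matches "key : value" lines as printed by `xl info`.
KEY_VALUE = re.compile(r"^(\w+)\s+:\s*(.*)$")


def parse_xl_info(text: str) -> dict:
    """Parse `xl info` output; a line without 'key :' continues the previous value."""
    info = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = KEY_VALUE.match(line)
        if match:
            last_key = match.group(1)
            info[last_key] = match.group(2)
        elif last_key is not None:
            # Wrapped continuation of the previous value (e.g. hw_caps).
            info[last_key] = (info[last_key] + " " + line).strip()
    return info


def xen_cmdline_options(info: dict) -> list:
    """Split the hypervisor command line into individual options."""
    return info.get("xen_commandline", "").split()


def has_full_logging(info: dict) -> bool:
    """True if the Xen command line already contains loglvl=all."""
    return "loglvl=all" in xen_cmdline_options(info)
```

Feeding it the `xl info` text above, `has_full_logging(...)` would currently be False, since the command line has no `loglvl` option.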



cat /etc/network/interfaces (with VLAN setup, redacted for IPs):

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto enabcm6e4ei0
iface enabcm6e4ei0 inet manual
iface enabcm6e4ei0 inet6 manual

VLAN LAN
auto enabcm6e4ei0.1
iface enabcm6e4ei0.1 inet manual

VLAN DMZ_LAN
auto enabcm6e4ei0.2
iface enabcm6e4ei0.2 inet manual

#Bridge LAN
auto xenbr0
iface xenbr0 inet static
bridge_ports enabcm6e4ei0.1
address *.*.*.*/24
gateway *.*.*.*

iface xenbr0 inet6 static
bridge_ports enabcm6e4ei0.1
address *::*/64
gateway *::*
# use SLAAC to get global IPv6 address from the router
# we may not enable ipv6 forwarding, otherwise SLAAC gets disabled
autoconf 1
accept_ra 2

#Bridge DMZ_LAN
auto xenbr1
iface xenbr1 inet manual
bridge_ports enabcm6e4ei0.2
Re: Xen 4.18/ARM64 on Raspberry Pi 4B: VLAN traffic crashing Dom0
First, I need to mention I've never used bridges+VLANs this way, so I
might be missing the obvious!
I -think- it's a network problem, not a Xen one, but what do I know :)

I've often read that bridges in Dom0 should have some additional params.
They would go in the iface config, next to "bridge_ports", like this
(ifupdown does not support end-of-line comments, so each comment goes
on its own line):
# don't use STP (spanning tree protocol)
bridge_stp off
# don't wait for the ports to become available
bridge_waitport 0
# no forwarding delay
bridge_fd 0

You may also try to enable STP (iirc it's disabled by default on Linux
bridges).
But TBH, I'm not sure those params will help in this case.

I've also read that VLAN 1 is a bit "special", so better avoid it.
IIUC, untagged traffic would be auto-tagged as 1. Use IDs 2/3, or 10/11,
10/20, etc.
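For reference, 802.1Q VLAN IDs are 12 bits wide; 0 and 4095 are reserved, leaving 1 to 4094 usable, and many switches use VLAN 1 as the default (native) VLAN for untagged ports. A small Python sketch of a sanity check along those lines (the function name is hypothetical):

```python
def vlan_id_warnings(vid: int) -> list:
    """Return warnings for an 802.1Q VLAN ID; an empty list means no concerns."""
    if not 0 <= vid <= 4095:
        return [f"{vid} is outside the 12-bit 802.1Q VID range"]
    if vid in (0, 4095):
        return [f"{vid} is reserved by 802.1Q and cannot be used as a VLAN ID"]
    if vid == 1:
        return ["1 is the default/native VLAN on many switches; untagged traffic may end up there"]
    return []


for vid in (1, 2, 10, 20, 4095):
    print(vid, vlan_id_warnings(vid) or "ok")
```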

Other things to test:
- check the MAC addresses
- run tcpdump/Wireshark remote capture on the real NIC (enabcm6e4ei0)
*and* on the bridges, to see what really happens; maybe a network/broadcast
storm is filling Dom0's CPU/memory?
- add "loglvl=all" to the Xen command line to maybe get more info
- how are the interfaces configured in the DomUs and in the cfg files?
- test without IPv6
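With more verbose logging, the useful part is usually what Xen printed just before the watchdog message. A minimal Python sketch, assuming the serial console output is captured to a plain-text file by whatever terminal program is attached; the function and constant names are hypothetical:

```python
from collections import deque

MARKER = "(XEN) Watchdog timer fired"


def context_before(lines, marker=MARKER, n=50):
    """Return up to n lines preceding the first line containing marker, plus that line."""
    window = deque(maxlen=n + 1)
    for line in lines:
        window.append(line.rstrip("\n"))
        if marker in line:
            return list(window)
    return []  # marker not found
```

Usage would be something like `with open("xen-serial.log") as f: print("\n".join(context_before(f, n=100)))`, where the file name is only a placeholder.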

You could also show us the outputs of "ip a", "ip link show type bridge"
(brctl show), etc.

PS: I guess it's only in the mail, and should be harmless, but your
/etc/network/interfaces (/eni) has two lines, "VLAN LAN" and
"VLAN DMZ_LAN", that should be comments.

--
++
zithro / Cyril
Re: Xen 4.18/ARM64 on Raspberry Pi 4B: VLAN traffic crashing Dom0
Thanks for your time, zithro.

On 08.09.2023 at 17:08, zithro wrote:
> First, I need to mention I've never used bridges+VLANs this way, so I
> may miss the obvious !
> I -think- it's a network problem, not a Xen one, but what do I know :)

In the meantime, I also suspect that this is a general (Debian, perhaps
even Arm64-specific?) network problem. But I am not sure whether Xen's
involvement can be ruled out yet.

> I've often read that bridges on dom0 should have some additional params.
> They would be in the iface config, around "bridge_ports", like :
> bridge_stp off # dont use STP (spanning tree proto)
> bridge_waitport 0 # dont wait for port to be available
> bridge_fd 0 # no forward delay

Tried that, no change.

> You may also try to enable STP (iirc it's disabled by default on
> Linux bridges).
> But TBH, I'm not sure those params will help in this case.

Tried that, no change.

> I've also read the VLAN 1 is a bit "special", better avoid it.
> IIUC, untagged traffic would be auto tagged 1. Use ids 2/3, or 10/11,
> 10/20, etc.

I changed the VLAN numbers, first to 101, 102, 103, etc. That was when I
noticed another strange thing: VLANs with numbers >99 simply don't work
on my Raspberry Pi under Debian. VLAN 99 works; VLAN 100 (or any other
number >99 that I tried) doesn't. If I choose a number >99, the VLAN is
not configured and "ip a" doesn't list it. Other Debian systems on x64
don't show this behavior; there, it was no problem to set up VLANs >99.
So that's another data point suggesting something fishy about the
network on my Raspberry Pi system.
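One possible explanation for the >99 limit, offered as an assumption to verify rather than a diagnosis: Linux limits network interface names to 15 characters (IFNAMSIZ is 16 bytes including the terminating NUL). "enabcm6e4ei0" is already 12 characters, so "enabcm6e4ei0.99" fits exactly while "enabcm6e4ei0.100" does not, whereas the shorter NIC names common on x86 machines leave more room. The Python sketch below just does that arithmetic; the helper names are hypothetical.

```python
IFNAMSIZ = 16  # from <linux/if.h>; includes the trailing NUL byte
MAX_IFNAME_LEN = IFNAMSIZ - 1


def vlan_ifname(parent: str, vid: int) -> str:
    """Build the conventional '<parent>.<vid>' VLAN interface name."""
    return f"{parent}.{vid}"


def fits_ifnamsiz(name: str) -> bool:
    """True if the kernel can accept this interface name length."""
    return len(name) <= MAX_IFNAME_LEN


for vid in (10, 99, 100, 101):
    name = vlan_ifname("enabcm6e4ei0", vid)
    status = "ok" if fits_ifnamsiz(name) else "too long"
    print(f"{name:<18} {len(name):>2} chars  {status}")
```

If this turns out to be the cause, giving the VLAN device a shorter name should avoid it; with Debian's vlan package, for example, a stanza named like "vlan100" with a "vlan-raw-device enabcm6e4ei0" option is one way to do that (untested on this setup).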

Therefore, I've changed the VLANs to 10, 20, 30 etc., which worked. But
it didn't solve the initial problem of the crashing Dom0 and DomUs.

> Other stuff to test :
> - check MAC addresses

What should I check specifically? If you are thinking of duplicate MAC
addresses, why would everything work when both DomUs use the same VLAN
bridge?

> - use tcpdump/wireshark remote logging on the real NIC (enabcm6e4ei0)
> *and* the bridges, to see what really happens, maybe a network/broadcast
> storm, filling dom0 cpu/memory ?

Now, here it becomes really strange. I started tcpdump on Dom0, and
depending on which interface/bridge I captured on, the problem went
away: the DomU ran smoothly for hours, even while I accessed the Zabbix
web interface! Stopping the capture makes the system crash reproducibly
as soon as I access the Zabbix web interface.

Logging enabcm6e4ei0 (NIC): no crashes
Logging enabcm6e4ei0.10 (VLAN 10): instant crash
Logging enabcm6e4ei0.20 (VLAN 20): no crashes
Logging xenbr0 (on VLAN 10): instant crash
Logging xenbr1 (on VLAN 20): no crashes

I can't think of a rational explanation why logging the traffic on
certain interfaces/bridges should avoid the crash of the complete
system, while logging other interfaces/bridges doesn't. Any ideas?
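One side effect of tcpdump is that, unless it is run with "-p", it puts the captured interface into promiscuous mode. Whether that matters here is only a guess, but it is cheap to record which interfaces carry the flag while reproducing. A minimal Python sketch, assuming the usual Linux sysfs layout where /sys/class/net/<iface>/flags holds the interface flags as a hex string and IFF_PROMISC is 0x100; the function names are hypothetical:

```python
import os

IFF_PROMISC = 0x100  # from <linux/if.h>


def is_promisc(flags_text: str) -> bool:
    """Interpret the contents of /sys/class/net/<iface>/flags, e.g. '0x1103'."""
    return bool(int(flags_text.strip(), 16) & IFF_PROMISC)


def promisc_interfaces(sysfs_net: str = "/sys/class/net") -> dict:
    """Map each interface name to whether IFF_PROMISC is currently set."""
    result = {}
    for iface in sorted(os.listdir(sysfs_net)):
        try:
            with open(os.path.join(sysfs_net, iface, "flags")) as f:
                result[iface] = is_promisc(f.read())
        except OSError:
            continue  # entry without a readable flags file
    return result
```

Printing `promisc_interfaces()` in Dom0 once with a "good" capture running and once with a "bad" one would show whether the crash/no-crash pattern lines up with the promiscuous flag on enabcm6e4ei0, its VLAN devices, and the bridges.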

I checked the dumps of enabcm6e4ei0.10 and xenbr0 (where the system
crashes) with wireshark, nothing sticks out to me (but I am really no
expert in analyzing network traffic). I could send the dumps directly to
you, if you want to spend the time.

> - set "loglvl=all" to Xen cmdline to maybe get more info

Done, need to check results. (Serial interface is not connected right now.)

> - how are the interfaces configured in the domUs and in the cfg files ?

/etc/network/interfaces on the DomU on which zabbix is running:

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto enX0
iface enX0 inet static
address xx.xx.xx.xx/24
gateway xx.xx.xx.xx

iface enX0 inet6 static
address xxxx::xxxx:xxxx:xxxx:xxxx/64
gateway xxxx::xxxx:xxxx:xxxx:xxxx
# use SLAAC to get global IPv6 address from the router
# we may not enable ipv6 forwarding, otherwise SLAAC gets disabled
autoconf 1
accept_ra 2

vif line in the xl.cfg of the same DomU:

vif = [ 'mac=02:93:0B:61:A5:82,bridge=xenbr1,ip=xx.xx.xx.xx' ]
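On the MAC question, a quick sanity check is that each guest MAC is unicast (lowest bit of the first octet clear), either locally administered (second-lowest bit set) or from the Xen OUI 00:16:3e, and unique across all vif lines. A minimal Python sketch that splits a vif string on commas into key=value pairs (a simplification of the full xl vif syntax) and applies those checks; all names are hypothetical:

```python
XEN_OUI = (0x00, 0x16, 0x3E)  # MAC prefix used for Xen guests


def parse_vif(spec: str) -> dict:
    """Split an xl vif spec such as 'mac=...,bridge=...' into a dict (simplified)."""
    result = {}
    for item in spec.split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            result[key.strip()] = value.strip()
    return result


def mac_warnings(mac: str) -> list:
    """Return warnings for a MAC address intended for a guest NIC."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        return [f"{mac}: expected 6 octets"]
    warnings = []
    if octets[0] & 0x01:
        warnings.append(f"{mac}: multicast bit set, not valid for a NIC")
    if not (octets[0] & 0x02) and tuple(octets[:3]) != XEN_OUI:
        warnings.append(f"{mac}: globally administered and outside the Xen OUI")
    return warnings


def duplicate_macs(specs: list) -> set:
    """MACs that appear in more than one vif spec (case-insensitive)."""
    seen, dups = set(), set()
    for spec in specs:
        mac = parse_vif(spec).get("mac", "").lower()
        if not mac:
            continue
        if mac in seen:
            dups.add(mac)
        seen.add(mac)
    return dups
```

By these checks, 02:93:0B:61:A5:82 is a locally administered unicast address; the remaining question is only whether the other DomU's cfg uses the same one.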


> - test w/o IPv6

Tried that, no difference.

> You could also show us the outputs of "ip a", "ip link show type
> bridge" (brctl show), etc.

root@xxx:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: enabcm6e4ei0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether d8:3a:dd:28:39:4f brd ff:ff:ff:ff:ff:ff
    inet6 fe80::da3a:ddff:fe28:394f/64 scope link
       valid_lft forever preferred_lft forever
3: enabcm6e4ei0.10@enabcm6e4ei0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master xenbr0 state UP group default qlen 1000
    link/ether d8:3a:dd:28:39:4f brd ff:ff:ff:ff:ff:ff
4: enabcm6e4ei0.20@enabcm6e4ei0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master xenbr1 state UP group default qlen 1000
    link/ether d8:3a:dd:28:39:4f brd ff:ff:ff:ff:ff:ff
    inet6 xxxx::xxxx:xxxx:xxxx:xxxx/64 scope global dynamic mngtmpaddr
       valid_lft 86134sec preferred_lft 14134sec
    inet6 xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx/64 scope global dynamic mngtmpaddr
       valid_lft 86134sec preferred_lft 14134sec
    inet6 fe80::da3a:ddff:fe28:394f/64 scope link
       valid_lft forever preferred_lft forever
5: xenbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 02:28:b7:1f:ee:6d brd ff:ff:ff:ff:ff:ff
    inet xx.xx.xx.xx/24 brd xx.xx.xx.255 scope global xenbr0
       valid_lft forever preferred_lft forever
    inet6 xxxx::xxxx:xxxx:xxxx:xxxx/64 scope global dynamic mngtmpaddr
       valid_lft 86135sec preferred_lft 14135sec
    inet6 xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx/64 scope global dynamic mngtmpaddr
       valid_lft 86135sec preferred_lft 14135sec
    inet6 xxxx::xxxx:xxxx:xxxx:xxxx/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::28:b7ff:fe1f:ee6d/64 scope link
       valid_lft forever preferred_lft forever
6: xenbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c6:11:98:cb:32:bd brd ff:ff:ff:ff:ff:ff
    inet6 xxxx::xxxx:xxxx:xxxx:xxxx/64 scope global dynamic mngtmpaddr
       valid_lft 86280sec preferred_lft 14280sec
    inet6 xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx/64 scope global dynamic mngtmpaddr
       valid_lft 86280sec preferred_lft 14280sec
    inet6 fe80::c411:98ff:fecb:32bd/64 scope link
       valid_lft forever preferred_lft forever
7: vif1.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master xenbr0 state UP group default qlen 32
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcff:ffff:feff:ffff/64 scope link
       valid_lft forever preferred_lft forever
8: vif2.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master xenbr1 state UP group default qlen 32
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcff:ffff:feff:ffff/64 scope link
       valid_lft forever preferred_lft forever

root@xxx:~# ip link show type bridge
5: xenbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 02:28:b7:1f:ee:6d brd ff:ff:ff:ff:ff:ff
6: xenbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether c6:11:98:cb:32:bd brd ff:ff:ff:ff:ff:ff

root@xxx:~# brctl show
bridge name     bridge id               STP enabled     interfaces
xenbr0          8000.0228b71fee6d       no              enabcm6e4ei0.10
                                                        vif1.0
xenbr1          8000.c61198cb32bd       no              enabcm6e4ei0.20
                                                        vif2.0
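To double-check the wiring in this kind of output (each bridge should hold exactly one VLAN subinterface plus the expected vifs), here is a minimal Python sketch that parses `brctl show`, where indented continuation lines list further ports of the preceding bridge. It assumes the whitespace-separated column layout printed by bridge-utils; the function names are hypothetical.

```python
def parse_brctl_show(text: str) -> dict:
    """Map bridge name -> list of ports from `brctl show` output."""
    bridges = {}
    current = None
    for line in text.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        fields = line.split()
        if line[0].isspace():
            # Continuation line: another port of the current bridge.
            if current is not None:
                bridges[current].append(fields[0])
        else:
            current = fields[0]
            # Columns: name, bridge id, STP enabled, [first interface]
            bridges[current] = fields[3:4]
    return bridges


def vlan_ports(ports: list) -> list:
    """Ports that look like '<parent>.<vid>' VLAN subinterfaces (vifN.M excluded)."""
    return [p for p in ports if "." in p and not p.startswith("vif")]
```

For the output above this should yield xenbr0 with enabcm6e4ei0.10 and vif1.0, and xenbr1 with enabcm6e4ei0.20 and vif2.0, i.e. one VLAN device per bridge, which looks consistent with the intended setup.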

> PS: I guess it's only in the mail, and should be harmless, but you
> have two /eni stanzas "VLAN LAN" and "VLAN DMZ_LAN" that should be comments.
>
Sorry, copy/paste error, fixed. No difference.
Re: Xen 4.18/ARM64 on Raspberry Pi 4B: VLAN traffic crashing Dom0
On Thu, 2023-09-07 at 18:41 +0200, Paul Leiber wrote:
> However, as soon as I remotely access the zabbix web interface, the
> complete system (DomUs and Dom0) becomes unresponsive and reboots
> after a couple of seconds. This is reliably reproducable.
>
> I didn't see any error message in any log (zabbix, DomU syslog, Dom0
> syslog) except for the following lines immediately before the system
> reboots on the Xen serial console:
>
> (XEN) Watchdog timer fired for domain 0
> (XEN) Hardware Dom0 shutdown: watchdog rebooting machine

Hmm... Here you have a hint on how to study this issue. It seems that
this is not a Xen issue but a watchdog issue. You should configure the
watchdog to log why it is rebooting the machine.

You can also configure the watchdog to do something other than
rebooting the machine, because a reboot does not help your investigation.
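Some background, hedged: with Xen, the Dom0 watchdog is normally armed and periodically refreshed from Dom0 user space (typically by the xenwatchdogd daemon from the Xen tools), so "Watchdog timer fired for domain 0" suggests that Dom0 as a whole stopped making progress for longer than the timeout. Since the reboot wipes the evidence, one low-tech approach is to write a timestamp to persistent storage every second and, after the reboot, look at where the heartbeat stopped. A minimal Python sketch of that idea; the file path, interval, and function names are hypothetical choices:

```python
import os
import time
from typing import List, Optional, Tuple


def write_heartbeat(path: str, interval: float = 1.0, count: Optional[int] = None) -> None:
    """Append a Unix timestamp to `path` every `interval` seconds (forever if count is None)."""
    written = 0
    with open(path, "a", encoding="ascii") as f:
        while count is None or written < count:
            f.write(f"{time.time():.3f}\n")
            f.flush()
            os.fsync(f.fileno())  # push the line to disk so it survives a hard reset
            written += 1
            time.sleep(interval)


def read_beats(path: str) -> List[float]:
    """Read timestamps back, skipping lines truncated by the reset."""
    beats = []
    with open(path, encoding="ascii", errors="replace") as f:
        for line in f:
            try:
                beats.append(float(line))
            except ValueError:
                continue
    return beats


def largest_gap(beats: List[float]) -> Tuple[float, float, float]:
    """(gap, last beat before the gap, first beat after it); zeros if fewer than 2 beats."""
    if len(beats) < 2:
        return (0.0, 0.0, 0.0)
    return max((after - before, before, after) for before, after in zip(beats, beats[1:]))
```

Running `write_heartbeat("/var/local/heartbeat.log")` in Dom0 while reproducing, then calling `largest_gap(read_beats(...))` after the reboot, gives a timestamp to line up with the serial console log. Note that an fsync every second adds some wear on an SD card.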

Maybe it would also help to set up your system as a pure VLAN system
without Xen at all, but somehow make it run the same kind of apps you
run with Xen and the Zabbix web interface. Maybe running a web server is
enough to study your VLAN1 and VLAN2. Once you are sure that the VLAN1
and VLAN2 settings are OK and that VLANs themselves work on the
Raspberry, it is much easier to continue with Xen and Zabbix.
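A web server plus a client that repeatedly pulls a large response is probably close enough to the Zabbix access pattern for such a test. A minimal sketch using only the Python standard library; the 1 MiB payload, port 8080, repetition count, and names are arbitrary assumptions. In a real test the server would listen on the address of the VLAN under test while the client runs on another machine.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PAYLOAD = b"x" * (1 << 20)  # 1 MiB per response; arbitrary size


class BulkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Serve PAYLOAD on host:port from a background thread."""
    server = ThreadingHTTPServer((host, port), BulkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_repeatedly(url: str, times: int = 100) -> int:
    """Download url `times` times and return the total number of bytes received."""
    total = 0
    for _ in range(times):
        with urllib.request.urlopen(url, timeout=10) as resp:
            total += len(resp.read())
    return total
```

The same pair can then be run inside the VLAN 20 DomU, so the test stays identical with and without Xen.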

BR.

--
Reijo Korhonen, old school developer
Re: Xen 4.18/ARM64 on Raspberry Pi 4B: VLAN traffic crashing Dom0
On 16.09.2023 13:26, reijo.korhonen@gmail.com wrote:
> On Thu, 2023-09-07 at 18:41 +0200, Paul Leiber wrote:
>> However, as soon as I remotely access the zabbix web interface, the
>> complete system (DomUs and Dom0) becomes unresponsive and reboots
>> after a couple of seconds. This is reliably reproducable.
>>
>> I didn't see any error message in any log (zabbix, DomU syslog, Dom0
>> syslog) except for the following lines immediately before the system
>> reboots on the Xen serial console:
>>
>> (XEN) Watchdog timer fired for domain 0
>> (XEN) Hardware Dom0 shutdown: watchdog rebooting machine
> Hmm.. Here you have a hint, how to study this issue. Seems, that this
> is not Xe, issue, but it is watchdog issue. Your should set watchdog to
> log, why it is rebooting machine.
>
> You can also set or edit watchdog to do something else than rebooting
> machine, because rebooting does not help your study.

I tried to dig deeper into the cause of the watchdog triggering.
However, I didn't find any useful documentation on the web on how the
watchdog works or how to enable logging. Can anybody point me to useful
information on the Xen watchdog, please?

> Maybe it could also be helpful set your system to pure VLAN system
> without Xen at all, but set system somehow to run same kind apps your
> run with Xen and zabbix web interface. Maybe running Web server is
> enough to study your VLAN1 and VLAN2. When your are sure, that VLAN1
> and VLAN2 settings are OK and VLAN itself work with Raspberry OK, it is
> much easier to continue with Xen and zabbix.
>
I booted the system without Xen and set it up to use the VLAN 20 bridge
(the same one that leads to a reboot when the DomU uses it) as its
primary network interface. Everything seems to work; I could download
large files from the internet without any problem. The next step is to
follow your advice and reproduce the Zabbix setup (sigh).

Paul
Re: Xen 4.18/ARM64 on Raspberry Pi 4B: VLAN traffic crashing Dom0
On 11.10.2023 at 10:55, Paul Leiber wrote:
> I booted the system without Xen and set it up to use the VLAN 20 bridge
> (the same that leads to a reboot when using it in the DomU) as primary
> network interface. Everything seems to be working, I could download
> large files from the internet without any problem. Next step is then to
> follow your advice reproduce the Zabbix setup (sigh).

Setting up Zabbix on the bare Debian system (which would be Dom0 in the
Xen setup) showed that the same configuration (VLANs 10 and 20, bridges
1 and 2, with bridge 2 as the interface for Zabbix) works reliably
without Xen, with no reboots. This points to some Xen component as the
root cause.

I'm out of ideas, I'm taking this to xen-devel now.

Paul