Hi Bertrand,
On 3/21/24 17:15, Bertrand Marquis wrote:
> Hi John,
>
>> On 21 Mar 2024, at 17:05, John Ernberg <john.ernberg@actia.se> wrote:
>>
>> Hi Bertrand,
>>
>> On 3/21/24 08:41, Bertrand Marquis wrote:
>>> Hi,
>>>
>>>> On 20 Mar 2024, at 18:40, Julien Grall <julien@xen.org> wrote:
>>>>
>>>> Hi John,
>>>>
>>>> On 20/03/2024 16:24, John Ernberg wrote:
>>>>> Hi Bertrand,
>>>>> On 3/13/24 11:07, Bertrand Marquis wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On 8 Mar 2024, at 15:04, Julien Grall <julien@xen.org> wrote:
>>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> Thank you for the reply.
>>>>>>>
>>>>>>> On 08/03/2024 13:40, John Ernberg wrote:
>>>>>>>> On 3/7/24 00:07, Julien Grall wrote:
>>>>>>>>>> Ping on the watchdog discussion bits.
>>>>>>>>>
>>>>>>>>> Sorry for the late reply.
>>>>>>>>>
>>>>>>>>> On 06/03/2024 13:13, John Ernberg wrote:
>>>>>>>>>> On 2/9/24 14:14, John Ernberg wrote:
>>>>>>>>>>>
>>>>>>>>>>>> * IMX_SIP_TIMER_*: This seems to be related to the watchdog.
>>>>>>>>>>>> Shouldn't dom0 rely on the watchdog provided by Xen instead? So those
>>>>>>>>>>>> call will be used by Xen.
>>>>>>>>>>>
>>>>>>>>>>> That is indeed a watchdog SIP, and also for setting the SoC internal RTC
>>>>>>>>>>> if it is being used.
>>>>>>>>>>>
>>>>>>>>>>> I looked around if there was previous discussion and only really
>>>>>>>>>>> found [3].
>>>>>>>>>>> Is the xen/xen/include/watchdog.h header meant to be for this kind of
>>>>>>>>>>> watchdog support or is that more for the VM watchdog? Looking at the x86
>>>>>>>>>>> ACPI NMI watchdog it seems like the former, but I have never worked with
>>>>>>>>>>> x86 nor ACPI.
>>>>>>>>>
>>>>>>>>> include/watchdog.h contains helper to configure the watchdog for Xen. We
>>>>>>>>> also have per-VM watchdog and this is configured by the hypercall
>>>>>>>>> SCHEDOP_watchdog.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Currently forwarding it to Dom0 has not caused any watchdog resets with
>>>>>>>>>>> our watchdog timeout settings, our specific Dom0 setup and VM count.
>>>>>>>>>
>>>>>>>>> IIUC, the SMC API for the watchdog would be similar to the ACPI NMI
>>>>>>>>> watchdog. So I think it would make more sense if this is not exposed to
>>>>>>>>> dom0 (even if Xen is doing nothing with it).
>>>>>>>>>
>>>>>>>>> Can you try to hide the SMCs and check if dom0 still behave properly?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>> This SMC manages a hardware watchdog, if it's not pinged within a
>>>>>>>> specific interval the entire board resets.
>>>>>>>
>>>>>>> Do you know what's the default interval? Is it large enough so Xen + dom0 can boot (at least up to when the watchdog driver is initialized)?
>>>>>>>
>>>>>>>> If I block the SMCs the watchdog driver in Dom0 will fail to ping the
>>>>>>>> watchdog, triggering a board reset because the system looks to have
>>>>>>>> become unresponsive. The reason this patch set started is because we
>>>>>>>> couldn't ping the watchdog when running with Xen.
>>>>>>>> In our specific system the bootloader enables the watchdog as early as
>>>>>>>> possible so that we can get watchdog protection for as much of the boot
>>>>>>>> as we possibly can.
>>>>>>>> So, if we are to block the SMC from Dom0, then Xen needs to take over
>>>>>>>> the pinging. It could be implemented similarly to the NMI watchdog,
>>>>>>>> except that the system will reset if the ping is missed rather than
>>>>>>>> backtrace.
>>>>>>>> It would also mean that Xen will get a whole watchdog driver-category
>>>>>>>> due to the watchdog being vendor and sometimes even SoC specific when it
>>>>>>>> comes to Arm.
>>>>>>>> My understanding of the domain watchdog code is that today the domain
>>>>>>>> needs to call SCHEDOP_watchdog at least once to start the watchdog
>>>>>>>> timer. Since watchdog protection through the whole boot process is
>>>>>>>> desirable we'd need some core changes, such as an option to start the
>>>>>>>> domain watchdog on init. >
>>>>>>>> It's quite a big change to make
>>>>>>>
>>>>>>> For clarification, above you seem to mention two changes:
>>>>>>>
>>>>>>> 1) Allow Xen to use the HW watchdog
>>>>>>> 2) Allow the domain to use the watchdog early
>>>>>>>
>>>>>>> I am assuming by big change, you are referring to 2?
>>>>>>>
>>>>>>> , while I am not against doing it if it
>>>>>>>> makes sense, I now wonder if Xen should manage hardware watchdogs.
>>>>>>>> Looking at the domain watchdog code it looks like if a domain does not
>>>>>>>> get enough execution time, the watchdog will not be pinged enough and
>>>>>>>> the guest will be reset. So either watchdog approach requires Dom0 to
>>>>>>>> get execution time. Dom0 also needs to service all the PV backends it's
>>>>>>>> responsible for. I'm not sure it's valuable to add another layer of
>>>>>>>> watchdog for this scenario as the end result (checking that the entire
>>>>>>>> system works) is achieved without it as well.
>>>>>>>> So, before I try to find the time to make a proposal for moving the
>>>>>>>> hardware watchdog bit to Xen, do we really want it?
>>>>>>>
>>>>>>> Thanks for the details. Given that the watchdog is enabled by the bootloader, I think we want Xen to drive the watchdog for two reasons:
>>>>>>> 1) In true dom0less environment, dom0 would not exist
>>>>>>> 2) You are relying on Xen + Dom0 to boot (or at least enough to get the watchdog working) within the watchdog interval.
>>>>>>
>>>>>> Definitely we need to consider the case where there is no Dom0.
>>>>>>
>>>>>> I think there are in fact 3 different use cases here:
>>>>>> - watchdog fully driven in a domain (dom0 or another): would work if it is expected
>>>>>> that Xen + Domain boot time is under the watchdog initial refresh rate. I think this
>>>>>> could make sense on some applications where your system depends on a specific
>>>>>> domain to be properly booted to work. This would require an initial refresh time
>>>>>> configurable in the boot loader probably.
>>>>> This is our use-case. ^
>>>>> Our dom0 is monitoring and managing the other domains in our system.
>>>>> Without dom0 working the system isn't really working as a whole.
>>>>> @Julien: Would you be ok with the patch set continuing in the direction
>>>>> of the
>>>>> original proposal, letting another party (or me at a later time) implement
>>>>> the fully driven by Xen option?
>>>> I am concerned about this particular point from Bertrand:
>>>>
>>>> "would work if it is expected that Xen + Domain boot time is under the watchdog initial refresh rate."
>>>>
>>>> How will a user be able to figure out how to initially configure the watchdog? Is this something you can easily configure in the bootloader at runtime?
>>
>> If you go through the trouble of enabling the watchdog in the bootloader you
>> usually know what you're doing and set the timeout yourself.
>>
>> Since our systems can be mounted in really annoying locations (both in
>> installations and world-wise) we need as much auto-recovery as possible to
>> avoid people having to travel to collect a unit that just needed a reset due
>> to some unexpected obscure issue at startup.
>
> I definitely get the need do not get me wrong.
> I am just concerned by potential users having target restarting when using Xen because of that and not knowing why.
>
>>>
>>> Definitely here it would be better to have the watchdog turned off by default and document how to enable it in the firmware.
>>>
>>> Even if a long timeout is configured by default, a user could run into trouble if using a linux without watchdog or not running linux or using dom0less without a linux having access to it.
>>> I agree with Julien here that the concern could be that users would come to us instead of NXP if there is system is doing a reset without reasons after some seconds or minutes.
>>
>> I could add myself as a reviewer for the iMX parts if that helps routing
>> such
>> questions (and future patches) also to me. We have experience with the QXP,
>> and the QM (for the supported parts by this patch set) is identical.
>>
>> Would that help with the concerns?
>
> Definitely it could help.
I'll figure out how to include myself in the MAINTAINERS file for the
next spin.
>
>>
>>>
>>>>
>>>>
>>>> Overall, I am not for this approach at least in the current status. I would be more inclined if there are some documentations explaining how this is supposed to be configured on NXP, so others can use the code.
>>>>
>>>> Anyway, this is why we have multiple Arm maintainers for this kind of situation. If they other agrees with you, then they can ack the patch and this can be merged.
>>>
>>> I agree with Stefano that it would be good to have those board supported.
>>>
>>> One thing i could suggest until there is a watchdog driver inside Xen is to have a clear Warning at Xen boot on those boards in the console so that we could at least identify the reason easily if a reset occurs for someone.
>>
>> How do other SoCs deal with this currently? The iMX SoCs aren't the only
>> ones
>> with a watchdog, just the first one added to Xen that pings the watchdog
>> over
>> an SMC. What about the OMAPs, the R-Cars, Xilinx's, Exynos' and so on?
>
> As far as I know the watchdog is usually not active until a driver activates it.
> Which means that by default it will not fire.
> Having it activated by the bootloader by default is not usual.
> Now definitely on a lot of use cases the users are activating it in the booloader
> but their systems are design for it.
>
> Do I understand that the default boot loader configuration is not activating it on your side ?
We are using a bootloader called Punchboot [1] developed by one of our
employees. With Punchboot you have to explicitly write the code to
enable the watchdog yourself. In u-boot you need to enable the watchdog
drivers and configure the watchdog first before it is started. I don't
know how the situation is in other bootloaders such as BareBox.
Best regards // John Ernberg
[1]: https://github.com/jonasblixt/punchboot
On 3/21/24 17:15, Bertrand Marquis wrote:
> Hi John,
>
>> On 21 Mar 2024, at 17:05, John Ernberg <john.ernberg@actia.se> wrote:
>>
>> Hi Bertrand,
>>
>> On 3/21/24 08:41, Bertrand Marquis wrote:
>>> Hi,
>>>
>>>> On 20 Mar 2024, at 18:40, Julien Grall <julien@xen.org> wrote:
>>>>
>>>> Hi John,
>>>>
>>>> On 20/03/2024 16:24, John Ernberg wrote:
>>>>> Hi Bertrand,
>>>>> On 3/13/24 11:07, Bertrand Marquis wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On 8 Mar 2024, at 15:04, Julien Grall <julien@xen.org> wrote:
>>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> Thank you for the reply.
>>>>>>>
>>>>>>> On 08/03/2024 13:40, John Ernberg wrote:
>>>>>>>> On 3/7/24 00:07, Julien Grall wrote:
>>>>>>>>>> Ping on the watchdog discussion bits.
>>>>>>>>>
>>>>>>>>> Sorry for the late reply.
>>>>>>>>>
>>>>>>>>> On 06/03/2024 13:13, John Ernberg wrote:
>>>>>>>>>> On 2/9/24 14:14, John Ernberg wrote:
>>>>>>>>>>>
>>>>>>>>>>>> * IMX_SIP_TIMER_*: This seems to be related to the watchdog.
>>>>>>>>>>>> Shouldn't dom0 rely on the watchdog provided by Xen instead? So those
>>>>>>>>>>>> call will be used by Xen.
>>>>>>>>>>>
>>>>>>>>>>> That is indeed a watchdog SIP, and also for setting the SoC internal RTC
>>>>>>>>>>> if it is being used.
>>>>>>>>>>>
>>>>>>>>>>> I looked around if there was previous discussion and only really
>>>>>>>>>>> found [3].
>>>>>>>>>>> Is the xen/xen/include/watchdog.h header meant to be for this kind of
>>>>>>>>>>> watchdog support or is that more for the VM watchdog? Looking at the x86
>>>>>>>>>>> ACPI NMI watchdog it seems like the former, but I have never worked with
>>>>>>>>>>> x86 nor ACPI.
>>>>>>>>>
>>>>>>>>> include/watchdog.h contains helper to configure the watchdog for Xen. We
>>>>>>>>> also have per-VM watchdog and this is configured by the hypercall
>>>>>>>>> SCHEDOP_watchdog.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Currently forwarding it to Dom0 has not caused any watchdog resets with
>>>>>>>>>>> our watchdog timeout settings, our specific Dom0 setup and VM count.
>>>>>>>>>
>>>>>>>>> IIUC, the SMC API for the watchdog would be similar to the ACPI NMI
>>>>>>>>> watchdog. So I think it would make more sense if this is not exposed to
>>>>>>>>> dom0 (even if Xen is doing nothing with it).
>>>>>>>>>
>>>>>>>>> Can you try to hide the SMCs and check if dom0 still behave properly?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>> This SMC manages a hardware watchdog, if it's not pinged within a
>>>>>>>> specific interval the entire board resets.
>>>>>>>
>>>>>>> Do you know what's the default interval? Is it large enough so Xen + dom0 can boot (at least up to when the watchdog driver is initialized)?
>>>>>>>
>>>>>>>> If I block the SMCs the watchdog driver in Dom0 will fail to ping the
>>>>>>>> watchdog, triggering a board reset because the system looks to have
>>>>>>>> become unresponsive. The reason this patch set started is because we
>>>>>>>> couldn't ping the watchdog when running with Xen.
>>>>>>>> In our specific system the bootloader enables the watchdog as early as
>>>>>>>> possible so that we can get watchdog protection for as much of the boot
>>>>>>>> as we possibly can.
>>>>>>>> So, if we are to block the SMC from Dom0, then Xen needs to take over
>>>>>>>> the pinging. It could be implemented similarly to the NMI watchdog,
>>>>>>>> except that the system will reset if the ping is missed rather than
>>>>>>>> backtrace.
>>>>>>>> It would also mean that Xen will get a whole watchdog driver-category
>>>>>>>> due to the watchdog being vendor and sometimes even SoC specific when it
>>>>>>>> comes to Arm.
>>>>>>>> My understanding of the domain watchdog code is that today the domain
>>>>>>>> needs to call SCHEDOP_watchdog at least once to start the watchdog
>>>>>>>> timer. Since watchdog protection through the whole boot process is
>>>>>>>> desirable we'd need some core changes, such as an option to start the
>>>>>>>> domain watchdog on init. >
>>>>>>>> It's quite a big change to make
>>>>>>>
>>>>>>> For clarification, above you seem to mention two changes:
>>>>>>>
>>>>>>> 1) Allow Xen to use the HW watchdog
>>>>>>> 2) Allow the domain to use the watchdog early
>>>>>>>
>>>>>>> I am assuming by big change, you are referring to 2?
>>>>>>>
>>>>>>> , while I am not against doing it if it
>>>>>>>> makes sense, I now wonder if Xen should manage hardware watchdogs.
>>>>>>>> Looking at the domain watchdog code it looks like if a domain does not
>>>>>>>> get enough execution time, the watchdog will not be pinged enough and
>>>>>>>> the guest will be reset. So either watchdog approach requires Dom0 to
>>>>>>>> get execution time. Dom0 also needs to service all the PV backends it's
>>>>>>>> responsible for. I'm not sure it's valuable to add another layer of
>>>>>>>> watchdog for this scenario as the end result (checking that the entire
>>>>>>>> system works) is achieved without it as well.
>>>>>>>> So, before I try to find the time to make a proposal for moving the
>>>>>>>> hardware watchdog bit to Xen, do we really want it?
>>>>>>>
>>>>>>> Thanks for the details. Given that the watchdog is enabled by the bootloader, I think we want Xen to drive the watchdog for two reasons:
>>>>>>> 1) In true dom0less environment, dom0 would not exist
>>>>>>> 2) You are relying on Xen + Dom0 to boot (or at least enough to get the watchdog working) within the watchdog interval.
>>>>>>
>>>>>> Definitely we need to consider the case where there is no Dom0.
>>>>>>
>>>>>> I think there are in fact 3 different use cases here:
>>>>>> - watchdog fully driven in a domain (dom0 or another): would work if it is expected
>>>>>> that Xen + Domain boot time is under the watchdog initial refresh rate. I think this
>>>>>> could make sense on some applications where your system depends on a specific
>>>>>> domain to be properly booted to work. This would require an initial refresh time
>>>>>> configurable in the boot loader probably.
>>>>> This is our use-case. ^
>>>>> Our dom0 is monitoring and managing the other domains in our system.
>>>>> Without dom0 working the system isn't really working as a whole.
>>>>> @Julien: Would you be ok with the patch set continuing in the direction
>>>>> of the
>>>>> original proposal, letting another party (or me at a later time) implement
>>>>> the fully driven by Xen option?
>>>> I am concerned about this particular point from Bertrand:
>>>>
>>>> "would work if it is expected that Xen + Domain boot time is under the watchdog initial refresh rate."
>>>>
>>>> How will a user be able to figure out how to initially configure the watchdog? Is this something you can easily configure in the bootloader at runtime?
>>
>> If you go through the trouble of enabling the watchdog in the bootloader you
>> usually know what you're doing and set the timeout yourself.
>>
>> Since our systems can be mounted in really annoying locations (both in
>> installations and world-wise) we need as much auto-recovery as possible to
>> avoid people having to travel to collect a unit that just needed a reset due
>> to some unexpected obscure issue at startup.
>
> I definitely get the need do not get me wrong.
> I am just concerned by potential users having target restarting when using Xen because of that and not knowing why.
>
>>>
>>> Definitely here it would be better to have the watchdog turned off by default and document how to enable it in the firmware.
>>>
>>> Even if a long timeout is configured by default, a user could run into trouble if using a linux without watchdog or not running linux or using dom0less without a linux having access to it.
>>> I agree with Julien here that the concern could be that users would come to us instead of NXP if there is system is doing a reset without reasons after some seconds or minutes.
>>
>> I could add myself as a reviewer for the iMX parts if that helps routing
>> such
>> questions (and future patches) also to me. We have experience with the QXP,
>> and the QM (for the supported parts by this patch set) is identical.
>>
>> Would that help with the concerns?
>
> Definitely it could help.
I'll figure out how to include myself in the MAINTAINERS file for the
next spin.
>
>>
>>>
>>>>
>>>>
>>>> Overall, I am not for this approach at least in the current status. I would be more inclined if there are some documentations explaining how this is supposed to be configured on NXP, so others can use the code.
>>>>
>>>> Anyway, this is why we have multiple Arm maintainers for this kind of situation. If they other agrees with you, then they can ack the patch and this can be merged.
>>>
>>> I agree with Stefano that it would be good to have those board supported.
>>>
>>> One thing i could suggest until there is a watchdog driver inside Xen is to have a clear Warning at Xen boot on those boards in the console so that we could at least identify the reason easily if a reset occurs for someone.
>>
>> How do other SoCs deal with this currently? The iMX SoCs aren't the only
>> ones
>> with a watchdog, just the first one added to Xen that pings the watchdog
>> over
>> an SMC. What about the OMAPs, the R-Cars, Xilinx's, Exynos' and so on?
>
> As far as I know the watchdog is usually not active until a driver activates it.
> Which means that by default it will not fire.
> Having it activated by the bootloader by default is not usual.
> Now definitely on a lot of use cases the users are activating it in the booloader
> but their systems are design for it.
>
> Do I understand that the default boot loader configuration is not activating it on your side ?
We are using a bootloader called Punchboot [1] developed by one of our
employees. With Punchboot you have to explicitly write the code to
enable the watchdog yourself. In u-boot you need to enable the watchdog
drivers and configure the watchdog first before it is started. I don't
know how the situation is in other bootloaders such as BareBox.
Best regards // John Ernberg
[1]: https://github.com/jonasblixt/punchboot