Mailing List Archive

MX80 watchdog
Afternoon,

I've been upgrading some MX80 routers from 15.1, and they consistently
seem to fall over during periods of strenuous SSD access, or indeed once
during a "commit check".

We thought this might be due to the uptime (~1500 days), so we have been
rebooting them prior to the upgrade, which has mostly stopped the problem
from happening. Not completely, however - when it does happen they get
stuck for about an hour, after which they reboot and continue to work.


watchdog: scheduling fairness gone for 3540 seconds now.
(da1:umass-sim1:1:0:0): Synchronize cache failed, status == 0x34, scsi
status == 0x0
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
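
For anyone wanting to spot these stalls in captured console output, here
is a minimal Python sketch. It assumes the watchdog message always has
the wording shown above (inferred from this single sample, so other
releases may differ); `stall_seconds` and `longest_stall` are
hypothetical helper names, not part of any Juniper tooling.

```python
import re

# Pattern inferred from the one console line quoted above; this is an
# assumption, other Junos releases may word the message differently.
WATCHDOG_RE = re.compile(
    r"watchdog: scheduling fairness gone for (\d+) seconds now\."
)


def stall_seconds(console_text):
    """Return every stall duration (seconds) reported by the watchdog."""
    return [int(m.group(1)) for m in WATCHDOG_RE.finditer(console_text)]


def longest_stall(console_text):
    """Return the longest reported stall, or None if none was logged."""
    stalls = stall_seconds(console_text)
    return max(stalls) if stalls else None


# The console output from this message, used as sample input.
sample = """\
watchdog: scheduling fairness gone for 3540 seconds now.
(da1:umass-sim1:1:0:0): Synchronize cache failed, status == 0x34, scsi
status == 0x0
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
"""

if __name__ == "__main__":
    print(longest_stall(sample))
```

Run against a saved console log, this gives a quick way to see how long
each box was stuck before it rebooted.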


I'd like them to wait rather less than an hour. I see the watchdog can
be configured, but I can't find any useful documentation about exactly
what conditions make it fire or what the defaults are.

Currently there is no configuration under "system processes watchdog",
and it looks like it can be enabled or disabled, and the timeout set to
anything up to 3600 seconds.

So my question is: is it this watchdog that is resetting the thing after
an hour, and would it be reasonable to set the timeout to, say, 300
seconds so there is less downtime if it goes wrong?

Thanks,
--
Tom

:: www.portfast.co.uk / @portfast
:: hosted services, domains, virtual machines, consultancy
_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
Re: MX80 watchdog
Do you monitor RPD task memory use and FreeBSD process memory use?
Is it possible you are leaking memory over time and hitting DRAM
pressure around the 1500-day mark?
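
If you do have periodic memory samples, a least-squares slope is one
simple way to check for a slow leak. The Python sketch below assumes
you already collect (day, megabytes) pairs somehow; the sample numbers
and the 1000 MB limit are made up for illustration, and `leak_rate` and
`days_until` are hypothetical names.

```python
def leak_rate(samples):
    """Least-squares slope of (day, value) samples, in units per day."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def days_until(samples, limit):
    """Days from the last sample until `limit` at the fitted growth rate.

    Returns None when memory is flat or shrinking.
    """
    slope = leak_rate(samples)
    if slope <= 0:
        return None
    _, last_value = samples[-1]
    return max(0.0, (limit - last_value) / slope)


# Hypothetical data: (uptime in days, memory in MB). Not real measurements.
samples = [(0, 400.0), (100, 450.0), (200, 500.0), (300, 550.0)]

if __name__ == "__main__":
    print(leak_rate(samples))           # MB per day
    print(days_until(samples, 1000.0))  # days until the assumed limit
```

A steadily positive slope over months of samples would support the leak
theory; a flat one would point elsewhere.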

It might be this:
https://prsearch.juniper.net/problemreport/PR1099998

Initially, since you said it happens during strenuous SSD access, I was
thinking of the RE failover limits Junos has on disk I/O read/write
latency, which cause false-positive RE switchovers now and again
(more people have hit them than are aware of hitting them).
But in your case this can't possibly be the cause, because the MX80
doesn't have two REs. But for completeness:
https://www.juniper.net/documentation/us/en/software/junos/high-availability/topics/ref/statement/not-on-disk-underperform-edit-chassis.html

On Mon, 12 Jun 2023 at 18:35, Tom Bird via juniper-nsp
<juniper-nsp@puck.nether.net> wrote:
>
> Afternoon,
>
> I've been upgrading some MX80 routers to from 15.1, consistently they
> seem to fall over during periods of strenuous SSD access, or indeed once
> during a "commit check".
>
> We thought this might be due to the uptime (~1500 days) so have been
> rebooting them prior to the upgrade which has mostly stopped the problem
> from happening. Not completely, however - they get stuck for about an
> hour doing this, after which they reboot and continue to work.
>
>
> watchdog: scheduling fairness gone for 3540 seconds now.
> (da1:umass-sim1:1:0:0): Synchronize cache failed, status == 0x34, scsi
> status == 0x0
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
>
>
> I'd like it if they waited a bit less than an hour and see the watchdog
> can be configured but I can't find any useful documentation about
> exactly what conditions it would fire and what the defaults are.
>
> Currently there is no configuration under "system processes watchdog",
> and it looks like it can be enabled, disabled and the timeout set up to
> 3600 seconds.
>
> So my question is, is it this watchdog that is resetting the thing after
> an hour and would it be reasonable to set the timeout to say 300 seconds
> so there was less down time if it went wrong.
>
> Thanks,
> --
> Tom



--
++ytti