Mailing List Archive

Problems with SBD
Hi everyone,

I have a two-node system with SLES 11 SP3 (pacemaker-1.1.9-0.19.102,
corosync-1.4.5-0.18.15, sbd-1.1-0.13.153). Since December we have had
several reboots of the system due to SBD: on the 22nd, 24th and 26th. The
last reboot happened yesterday, January 3rd. The messages are the same
every time:
/var/log/messages:Jan 3 11:55:08 kernighan sbd: [7879]: info: Cancelling IO request due to timeout (rw=0)
/var/log/messages:Jan 3 11:55:08 kernighan sbd: [7879]: ERROR: mbox read failed in servant.
/var/log/messages:Jan 3 11:55:08 kernighan sbd: [7878]: WARN: Servant for /dev/sdc1 (pid: 7879) has terminated
/var/log/messages:Jan 3 11:55:08 kernighan sbd: [7878]: WARN: Servant for /dev/sdc1 outdated (age: 4)
/var/log/messages:Jan 3 11:55:08 kernighan sbd: [8183]: info: Servant starting for device /dev/sdc1
/var/log/messages:Jan 3 11:55:11 kernighan sbd: [8183]: info: Cancelling IO request due to timeout (rw=0)
/var/log/messages:Jan 3 11:55:11 kernighan sbd: [8183]: ERROR: Unable to read header from device 5
/var/log/messages:Jan 3 11:55:11 kernighan sbd: [8183]: ERROR: Not a valid header on /dev/sdc1
/var/log/messages:Jan 3 11:55:11 kernighan sbd: [7878]: WARN: Servant for /dev/sdc1 (pid: 8183) has terminated
/var/log/messages:Jan 3 11:55:11 kernighan sbd: [7878]: WARN: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)

The SBD device is an iSCSI drive shared by a Synology box.

Could anyone give me some guidance on what's happening, please?

Thanks in advance,

Oriol
Re: Problems with SBD
On 2015-01-04T19:49:58, Oriol Mula-Valls <omv.lists@gmail.com> wrote:

> I have a two node system with SLES 11 SP3 (pacemaker-1.1.9-0.19.102,
> corosync-1.4.5-0.18.15, sbd-1.1-0.13.153).
> [...]
> /var/log/messages:Jan 3 11:55:11 kernighan sbd: [7878]: WARN: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
>
> The sbd is an iscsi drive shared by synology box.
>
> Could anyone provide me some guidance on what's happening please?

Those are pretty clearly IO errors due to high latency. You may need to
increase the IO timeout, and/or figure out why the IO to your Synology
box sometimes stalls for multiple seconds. See the manpage for this; you
can add the required flag to /etc/sysconfig/sbd -> SBD_OPTS.

You also should use a stable name (/dev/disk/by-id/...) rather than
/dev/sdc1 - note that /dev/sdX may not be stable over reboots or iSCSI
restarts.
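
To find out which stable names point at the current device, something like the following sketch may help. It assumes GNU `readlink -f` is available; the function name `stable_names` is made up for illustration and is not part of sbd.

```shell
# List the symlinks in a directory (default: /dev/disk/by-id) that resolve
# to the same device node as the first argument.
stable_names() {
    dev=$(readlink -f "$1") || return 1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            printf '%s\n' "$link"
        fi
    done
}
```

Calling `stable_names /dev/sdc1` on the affected node should print the by-id links, if any exist for that partition; one of those can then replace /dev/sdc1 in the sbd configuration.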

Further, you can avoid the reboots by enabling the pacemaker
integration (-P). See the manpage for details on what that flag does.
That will be the default in later sbd versions for releases after SLE HA
11.
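
Put together, the three suggestions could look roughly like the fragment below. Treat it as a sketch under assumptions, not a tested configuration: the device path is a placeholder, the value 10 is only illustrative, and `-I` is what the manpage is believed to call the async IO read timeout in this sbd version, so please verify it with `man sbd` before use. Any options already present in SBD_OPTS should be kept.

```shell
# /etc/sysconfig/sbd (sketch only)

# Stable device name instead of /dev/sdc1. The part after by-id/ is a
# placeholder; substitute the real link name from your system.
SBD_DEVICE="/dev/disk/by-id/<your-iscsi-lun>-part1"

# -P     enable the pacemaker integration
# -I 10  assumed flag for the IO read timeout, in seconds; check the manpage
# Keep any options that were already set here.
SBD_OPTS="-P -I 10"
```

Raising the timeout only hides the symptom, so it is still worth finding out why the Synology box stalls.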



Regards,
Lars

--
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: Problems with SBD
Thanks a lot Lars. I took advantage of a crash last week to add the -P
parameter.

I'll read the sbd man page more carefully to work out how to increase the
IO timeout.

Kind regards,
Oriol

On Wed, Jan 7, 2015 at 12:09 PM, Lars Marowsky-Bree <lmb@suse.com> wrote:

> [...]