Mailing List Archive

[rfc] SBD with Pacemaker/Quorum integration
Hi all,

I have repeatedly had to deal with customer/partner scenarios where the
SAN was unreliable and outages were correlated across fabrics. The desire
was to avoid the self-fence in such cases, provided the cluster is
quorate and the node itself is not unhealthy.

This required SBD to link against Pacemaker's CIB and PE libraries, and
all that that implies - which meant sbd had to move out of cluster-glue,
or else we'd face a circular build dependency.

To give you a glimpse of the extended sbd code, you can check out
http://hg.linux-ha.org/sbd - the new Pacemaker integration is activated
via the "-P" option in /etc/sysconfig/sbd; otherwise sbd remains a
drop-in replacement for the previous versions.
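
(Purely as an illustration - the variable names follow the sysconfig
file we ship, the device path is made up - enabling it boils down to
something like

    # /etc/sysconfig/sbd
    SBD_DEVICE="/dev/disk/by-id/my-shared-disk"   # placeholder path
    SBD_OPTS="-P"                                 # enable the Pacemaker/quorum check

plus a restart of the cluster stack.)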


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: [rfc] SBD with Pacemaker/Quorum integration
On Thu, May 24, 2012 at 11:10 AM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> Hi all,
>
> I had to repeatedly deal with customer/partner scenarios where the SAN
> was unreliable, and outages were correlated across fabrics. The desire
> was to avoid the self-fence in such cases if the cluster is quorate and
> the node is not unhealthy.
>
> This required SBD to link against pacemaker's CIB and PE libraries, and
> all that that implies. Which meant sbd had to move out of cluster-glue,
> or else we'd face a build loop.
>
> To give you a glance of the extended sbd code, you can check out
> http://hg.linux-ha.org/sbd - the new Pacemaker integration is activated
> using the "-P" option in /etc/sysconfig/sbd, otherwise sbd remains a
> drop-in replacement for the previous versions.

Just as a suggestion: since you're already taking this out of glue,
would you mind also moving the repo to GitHub? It's just orders of
magnitude more straightforward to review and comment on code that way.

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now
Re: [rfc] SBD with Pacemaker/Quorum integration
On 2012-05-24T14:34:59, Florian Haas <florian@hastexo.com> wrote:

> > To give you a glance of the extended sbd code, you can check out
> > http://hg.linux-ha.org/sbd - the new Pacemaker integration is activated
> > using the "-P" option in /etc/sysconfig/sbd, otherwise sbd remains a
> > drop-in replacement for the previous versions.
> Just as a suggestion: since you're already taking this out of glue,
> would you mind also moving the repo to GitHub? It's just orders of
> magnitude more straightforward to review and comment on code that way.

I'll probably do that, but since I stripped it out of glue to start
with, sticking with hg was easier for the time being.

But yes, I am contemplating getting over my git aversion ;-)

That aside, what do you think of the idea/approach?


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: [rfc] SBD with Pacemaker/Quorum integration
Hi,

On Thu, May 24, 2012 at 02:34:59PM +0200, Florian Haas wrote:
> On Thu, May 24, 2012 at 11:10 AM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> > Hi all,
> >
> > I had to repeatedly deal with customer/partner scenarios where the SAN
> > was unreliable, and outages were correlated across fabrics. The desire
> > was to avoid the self-fence in such cases if the cluster is quorate and
> > the node is not unhealthy.
> >
> > This required SBD to link against pacemaker's CIB and PE libraries, and
> > all that that implies. Which meant sbd had to move out of cluster-glue,
> > or else we'd face a build loop.
> >
> > To give you a glance of the extended sbd code, you can check out
> > http://hg.linux-ha.org/sbd - the new Pacemaker integration is activated
> > using the "-P" option in /etc/sysconfig/sbd, otherwise sbd remains a
> > drop-in replacement for the previous versions.
>
> Just as a suggestion: since you're already taking this out of glue,
> would you mind also moving the repo to GitHub? It's just orders of
> magnitude more straightforward to review and comment on code that way.

This is OT and I apologize for the noise, but I need to chime
in here: it is really a matter of preference whether code review
is more comfortable in the mouse-oriented GitHub than by other
means (e.g. in an editor).

Cheers,

Dejan

> Cheers,
> Florian
>
> --
> Need help with High Availability?
> http://www.hastexo.com/now
Re: [rfc] SBD with Pacemaker/Quorum integration
On Thu, May 24, 2012 at 3:10 PM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> On 2012-05-24T14:34:59, Florian Haas <florian@hastexo.com> wrote:
>
>> > To give you a glance of the extended sbd code, you can check out
>> > http://hg.linux-ha.org/sbd - the new Pacemaker integration is activated
>> > using the "-P" option in /etc/sysconfig/sbd, otherwise sbd remains a
>> > drop-in replacement for the previous versions.
>> Just as a suggestion: since you're already taking this out of glue,
>> would you mind also moving the repo to GitHub? It's just orders of
>> magnitude more straightforward to review and comment on code that way.
>
> I'll probably do that, but since I stripped it out of glue to start
> with, sticking with hg was easier for the time being.
>
> But yes, I am contemplating to get over my git aversion ;-)
>
> That aside, what do you think of the idea/approach?

Um, right now I have no opinion. Your commit messages are pretty
terse, and there's no README in the repo. Mind adding one?

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now
Re: [rfc] SBD with Pacemaker/Quorum integration
On 2012-05-25T17:31:52, Florian Haas <florian@hastexo.com> wrote:

> > That aside, what do you think of the idea/approach?
> Um, right now I have no opinion. Your commit messages are pretty
> terse, and there's no README in the repo. Mind adding one?

Good point. I wasn't aware the commit messages were terse ;-)

To sketch this out:

Basically, though, SBD continues to work as it always did.

If you specify "-P" at daemon start-up (usually via SBD_OPTS in
/etc/sysconfig/sbd), the following will happen:

sbd will start (in addition to the worker processes that monitor the
disks) a process that signs in with pacemaker (and corosync). This
process monitors that the partition the local node is part of is
quorate, and that the local node (according to the CIB as run through
pengine) is "healthy".

If so, the master thread will not self-fence even if the majority of
devices is currently unavailable.

That's it, nothing more. Does that help?
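
(If a rough sketch helps: ignoring details, and keeping in mind that the
daemon uses the CIB/PE libraries rather than the CLI tools, the check
boils down to something like

    # illustrative shell approximation, not the actual code
    if ! majority_of_sbd_devices_readable; then        # hypothetical helper
        if [ "$(crm_node -q)" = "1" ] && local_node_healthy_per_cib; then
            : # quorate and clean per the CIB: suppress the self-fence, keep retrying
        else
            : # no quorum, or the node is unclean: self-fence as before
        fi
    fi

where majority_of_sbd_devices_readable and local_node_healthy_per_cib are
just stand-ins for the disk watchers and the CIB/pengine evaluation
described above.)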

It became needed because customers had scenarios with just one device
(which experienced intermittent problems), where MPIO acted up (I've
seen IO stuck for minutes), or even three devices where failures were
correlated. Then SBD would self-fence, and the customer would be unhappy.


(I have opinions on the last failure mode in particular. It seems to
arise specifically when customers have built setups with two HBAs, two
SANs, two storage arrays, but then cross-linked the SANs, connected the
HBAs to each, and the storage arrays too. That seems to frequently lead
to hiccups where the *entire* fabric is affected. I think this
cross-linking is a case of sham redundancy; it *looks* as if it makes
things more redundant, but in reality it reduces redundancy, since
faults are no longer independent. Alas, they've not wanted to change
that.)


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: [rfc] SBD with Pacemaker/Quorum integration
On Fri, May 25, 2012 at 5:41 PM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> On 2012-05-25T17:31:52, Florian Haas <florian@hastexo.com> wrote:
>
>> > That aside, what do you think of the idea/approach?
>> Um, right now I have no opinion. Your commit messages are pretty
>> terse, and there's no README in the repo. Mind adding one?
>
> Good point. I wasn't aware the commit messages were terse ;-)
>
> To sketch this out:
>
> Basically though SBD continues as it always did.
>
> If you specify "-P" to the daemon start-up (usually via
> /etc/sysconfig/sbd SBD_OPTS), the following will happen:
>
> sbd will start (in addition to the worker processes that monitor the
> disks) a process that signs in with pacemaker (and corosync). This
> process monitors that the partition the local node is part of is
> quorate, and that the local node (according to the CIB as run through
> pengine) is "healthy".
>
> If so, the master thread will not self-fence even if the majority of
> devices is currently unavailable.
>
> That's it, nothing more. Does that help?

It does. One naive question: what's the rationale of tying in with
Pacemaker's view of things? Couldn't you just consume the quorum and
membership information from Corosync alone?

> It became needed because customers had scenarios with just one device
> (which experienced intermittent problems), where MPIO acted up (I've
> seen IO stuck for minutes), or even three devices where failures were
> correlated. Then, SBD would self-fence, and the customer be unhappy.
>
>
> (I have opinions on particularly the last failure mode. This seems to
> arise specifically when customers have build setups with two HBAs, two
> SANs, two storages, but then cross-linked the SANs, connected the HBAs
> to each, and the storages too. That seems to frequently lead to
> hiccups where the *entire* fabric is affected. I'm thinking this
> cross-linking is a case of sham redundancy; it *looks* as if makes
> things more redundant, but in reality reduces it since faults are no
> longer independent. Alas, they've not wanted to change that.)

Henceforth, I'm going to dangle this thread in front of everyone who
believes their SAN can never fail. Thanks. :)

Are there any SUSEisms in SBD or would you expect it to be packageable
on any platform?

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now
Re: [rfc] SBD with Pacemaker/Quorum integration
On 2012-05-25T21:44:25, Florian Haas <florian@hastexo.com> wrote:

> > If so, the master thread will not self-fence even if the majority of
> > devices is currently unavailable.
> >
> > That's it, nothing more. Does that help?
>
> It does. One naive question: what's the rationale of tying in with
> Pacemaker's view of things? Couldn't you just consume the quorum and
> membership information from Corosync alone?

Yes and no.

On SLE HA 11 (which, alas, is still the prime motivator for this),
corosync actually gets that state from Pacemaker. And, ultimately, it is
Pacemaker's belief (from the CIB) that pengine bases its fencing
decisions on, so that's where we need to look.

Further, quorum isn't enough. Even if we have quorum, the local node
could still be dirty (as in: stop failures, unclean, ...) in ways that
imply it should self-fence, pronto.

Since this overrides the decision to self-fence if the devices are gone,
and thus a real poison pill may no longer be delivered, we must take
steps to minimize that risk.

But yes, what it does now is to sign in both with corosync/ais and
the CIB, querying quorum state from both.
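
(At the command line, the two views it cross-checks correspond roughly to

    corosync-quorumtool -s             # quorum/membership as corosync reports it
    cibadmin -Q | grep have-quorum     # the quorum flag Pacemaker records in the CIB

though, again, the daemon uses the libraries rather than these tools.)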

Fun anecdote: I originally thought being notification-driven might be
good enough - until the testers started SIGSTOPping corosync/cib and
complaining that the pacemaker watcher didn't pick up on that ;-)
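
(The test was nothing more elaborate than something like

    pkill -STOP corosync
    pkill -STOP cib

i.e. freezing the daemons without killing them, so that a purely
notification-driven watcher never hears anything change and has to poll
for liveness itself.)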

I know this is bound to have some holes. It can't perform a
comprehensive health check of pacemaker's stack; but then, this only
matters for as long as the loss of devices persists. During that
degraded phase, the system is a bit more fragile. I'm a bit wary of
this, because I'm *sure* the holes will all get reported one after
another and further contribute to the code obfuscation, but such is
reality ...

> > (I have opinions on particularly the last failure mode. This seems to
> > arise specifically when customers have build setups with two HBAs, two
> > SANs, two storages, but then cross-linked the SANs, connected the HBAs
> > to each, and the storages too. That seems to frequently lead to
> > hiccups where the *entire* fabric is affected. I'm thinking this
> > cross-linking is a case of sham redundancy; it *looks* as if makes
> > things more redundant, but in reality reduces it since faults are no
> > longer independent. Alas, they've not wanted to change that.)
>
> Henceforth, I'm going to dangle this thread in front of everyone who
> believes their SAN can never fail. Thanks. :)

Heh. Please dangle it in front of them and explain the benefits of
separation/isolation to them. ;-)

If they followed our recommendation - two independent SANs, plus a third
iSCSI device over the network (OK, OK, effectively that makes three
SANs) - they'd never experience this.

(Since that's how my lab is actually set up, I had some troubles
following the problems they reported initially. Oh, and *don't* get me
started on async IO handling in Linux.)
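
(For reference, that recommended layout ends up as three devices,
semicolon-separated, in the sysconfig file - paths invented for the
example:

    SBD_DEVICE="/dev/disk/by-id/san-a-lun;/dev/disk/by-id/san-b-lun;/dev/disk/by-id/iscsi-arbiter-lun"

so a hiccup on one fabric can only ever take out one of the three.)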

> Are there any SUSEisms in SBD or would you expect it to be packageable
> on any platform?

Should be packageable on every platform, though I admit that I've not
tried building the pacemaker module against anything but the
corosync+pacemaker+openais stuff we ship on SLE HA 11 so far.

I assume that this may need further work; at least the places I stole
code from had special treatment. And the source code to crm_node
(ccm_epoche.c) ... I *think* this may indicate opportunities for
improving the client libraries in pacemaker to hide all that stuff
better.



Best,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: [rfc] SBD with Pacemaker/Quorum integration
On Sat, May 26, 2012 at 5:56 AM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> On 2012-05-25T21:44:25, Florian Haas <florian@hastexo.com> wrote:
>
>> > If so, the master thread will not self-fence even if the majority of
>> > devices is currently unavailable.
>> >
>> > That's it, nothing more. Does that help?
>>
>> It does. One naive question: what's the rationale of tying in with
>> Pacemaker's view of things? Couldn't you just consume the quorum and
>> membership information from Corosync alone?
>
> Yes and no.
>
> On SLE HA 11 (which, alas, is still the prime motivator for this),
> corosync actually gets that state from Pacemaker. And, ultimately, it is
> Pacemaker's belief (from the CIB) that pengine bases its fencing
> decisions on, so that's where we need to look.
>
> Further, quorum isn't enough. If we have quorum, the local node could
> still be dirty (as in: stop failures, unclean, ...) that imply that it
> should self-fence, pronto.
>
> Since this overrides the decision to self-fence if the devices are gone,
> and thus a real poison pill may no longer be delivered, we must take
> steps to minimize that risk.
>
> But yes, what it does now is to sign in both with corosync/ais and
> the CIB, querying quorum state from both.
>
> Fun anecdote, I originally thought being notification-driven might be
> good enough - until the testers started SIGSTOPping corosync/cib and
> complaining that the pacemaker watcher didn't pick up on that ;-)
>
> I know this is bound to have some holes. It can't perform a
> comprehensive health check of pacemaker's stack; yet, this only matters
> for as long as the loss of devices persists. During that degraded phase,
> the system is a bit more fragile. I'm a bit weary of this, because I'm
> *sure* these will all get reported one after another and further
> contribute to the code obfuscation, but such is reality ...
>
>> > (I have opinions on particularly the last failure mode. This seems to
>> > arise specifically when customers have build setups with two HBAs, two
>> > SANs, two storages, but then cross-linked the SANs, connected the HBAs
>> > to each, and the storages too. That seems to frequently lead to
>> > hiccups where the *entire* fabric is affected. I'm thinking this
>> > cross-linking is a case of sham redundancy; it *looks* as if makes
>> > things more redundant, but in reality reduces it since faults are no
>> > longer independent. Alas, they've not wanted to change that.)
>>
>> Henceforth, I'm going to dangle this thread in front of everyone who
>> believes their SAN can never fail. Thanks. :)
>
> Heh. Please dangle it in front of them and explain the benefits of
> separation/isolation to them. ;-)
>
> If they followed our recommendation - 2 independent SANs, and a third
> iSCSI device over the network (okok, effectively that makes 3 SANs) -
> they'd never experience this.
>
> (Since that's how my lab is actually set up, I had some troubles
> following the problems they reported initially. Oh, and *don't* get me
> started on async IO handling in Linux.)
>
>> Are there any SUSEisms in SBD or would you expect it to be packageable
>> on any platform?
>
> Should be packageable on every platform, though I admit that I've not
> tried building the pacemaker module against anything but the
> corosync+pacemaker+openais stuff we ship on SLE HA 11 so far.
>
> I assume that this may need further work; at least the places I stole
> code from had special treatment. And the source code to crm_node
> (ccm_epoche.c) ... I *think* this may indicate opportunities for
> improving the client libraries in pacemaker to hide all that stuff
> better.

Yep, suggestions are welcome.
In theory it shouldn't be required, but in practice there are so many
membership/quorum combinations that sadly the compatibility code has
become worthy of a real API.
Re: [rfc] SBD with Pacemaker/Quorum integration
On Fri, May 25, 2012 at 9:56 PM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> Should be packageable on every platform, though I admit that I've not
> tried building the pacemaker module against anything but the
> corosync+pacemaker+openais stuff we ship on SLE HA 11 so far.

Are you expecting this to build without "-I/usr/include/libxml2"? It
didn't for me until I added that.

Florian

--
Need help with High Availability?
http://www.hastexo.com/now
Re: [rfc] SBD with Pacemaker/Quorum integration
On 2012-05-29T08:39:06, Florian Haas <florian@hastexo.com> wrote:

> > Should be packageable on every platform, though I admit that I've not
> > tried building the pacemaker module against anything but the
> > corosync+pacemaker+openais stuff we ship on SLE HA 11 so far.
> Are you expecting this to build without "-I/usr/include/libxml2"? It
> didn't for me, before I added that.

It builds here; I assume it failed for you because I still need to re-autofoo it ...
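
(Until that's sorted, something along these lines should do as a
stop-gap - assuming the Makefile honours CPPFLAGS, which I haven't
double-checked:

    make CPPFLAGS="$(xml2-config --cflags)"   # typically equivalent to -I/usr/include/libxml2

xml2-config ships with the libxml2 development package.)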


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: [rfc] SBD with Pacemaker/Quorum integration
On 2012-05-25T17:31:52, Florian Haas <florian@hastexo.com> wrote:

> Um, right now I have no opinion. Your commit messages are pretty
> terse, and there's no README in the repo. Mind adding one?

FWIW, there is now a manual page as well. That might help with
understanding what it is supposed to do.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
