Mailing List Archive

RA spec: explicit "probe" operation?
Hi all,

triggered by the linux-ha-dev discussion, I'd also like to open the
discussion on another item on the revised specification list; namely, a
dedicated "probe" operation.

To recap, right now, Pacemaker uses the "monitor" operation to check if
the resource is active at all (prior to starting anything, as part of
the discovery process on a node).

Now, at this stage, in an empty cluster, nothing else will be active
yet either; so something that would be an error later, at "start" for
example, may just be expected. (Such as a file missing or a command
returning a weird state because it tries to access shared storage.)

This, apparently, isn't all that easy to get right. A specific "probe"
operation, that is not tasked with verifying if the resource is healthy,
just if it is at all active, might be clearer, or at least that has been
suggested in the past.


The alternative would be to clarify the "monitor" semantics; "monitor"
almost never strikes me as the right place to return
"ERR_INSTALLED" or "ERR_CONFIGURED", unless the evidence is really
strong (such as syntax violations in the parameters, for example).
These requirements can only be checked in full when we're attempting the
operation that actually needs them; "monitor" isn't "validate-all", it's
meant to find out the state of the resource only. IMHO, most of the
"ocf_is_probe" checks indicate that the monitor op is trying to do too much.


Personally, I'm leaning towards the latter; I don't really like the
"probe" operation idea, but being a completely impartial moderator, I'm
alas forced to bring it up, however bad the idea might be. ;-) Any
comments?


Regards,
Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
ha-wg-technical mailing list
ha-wg-technical@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/ha-wg-technical
Re: RA spec: explicit "probe" operation?
On 2011-06-28 11:10, Lars Marowsky-Bree wrote:
> Personally, I'm leaning towards the latter; I don't really like the
> "probe" operation idea, but being a completely impartial moderator, I'm
> alas forced to bring it up, however bad the idea might be. ;-) Any
> comments?

How about instead defining specific instances when the cluster _must_
call validate-all (I think it never does, now, but feel free to correct
me), retain the definition of the probe operation as is (a monitor
action that does not recur), and then restrict monitor to only check for
resource status and failure, rather than correct configuration?

Cheers,
Florian
Re: RA spec: explicit "probe" operation?
On 2011-06-28T12:51:43, Florian Haas <florian.haas@linbit.com> wrote:

> How about instead defining specific instances when the cluster _must_
> call validate-all (I think it never does, now, but feel free to correct
> me),

I'm not sure I follow where you're going here - which such specific
instances come to mind? Basically, there are only two points where validate-all
makes sense:

a) automatically directly prior to an intended "start" - in which case
it is redundant, since the "start" can report exactly the same.

b) As a help for the UIs, to check if the parameters make sense and are
possibly even correct. (Which is what validate-all was intended for, to
provide deeper checking than a simple syntax check that the UI can
provide based on the data type.)

It'd probably be good to recommend that UIs actually do this when a
resource is added (and all its pre-requisites are running).

> retain the definition of the probe operation as is (a monitor action
> that does not recur),

Well, from the point of view of the definition of "monitor", that
doesn't matter. In theory, the repeat schedule for "monitor" wasn't
meant to ever be required, since "monitor" was intended to always
provide a correct result. Which leads me to:

> and then restrict monitor to only check for resource status and
> failure, rather than correct configuration?

This bit actually makes the problem go away indeed. The spec needs to
clarify that the "monitor" operation's primary goal is to ascertain the
state (running/failed/stopped/unknown).

That some monitors went beyond this (with the best of intentions) is
what actually is causing most of our problems in scenarios where this
doesn't make sense.


The RA can sometimes ascertain that the resource will never be able to
be started on that node unless the admin intervenes or the environment
is changed by other resources being brought online first. (Which is what
ERR_INSTALLED being returned by monitor_0 basically implies.)

Or that the semantics are completely broken (ip=430.a.49.2),
which is a valid cause for "ERR_CONFIGURED".

monitor_0 can reasonably check for this, iff carefully implemented (the
problem arises from those RAs that aren't, or where the logic has bugs);
splitting it off into a separate mandatory call to "validate-all" is not
necessarily a good idea, since it would double the number of startup
probe actions.
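
A minimal sketch of the kind of purely syntactic check that is safe even during a probe, because it depends on nothing but the parameter value itself. The function name validate_ipv4 is hypothetical; the return codes are the standard OCF values.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_CONFIGURED=6

# Hypothetical syntax check for an "ip" parameter. It looks only at the
# string, so it needs no other resource to be running.
validate_ipv4() {
    ip=$1
    # Reject anything that is not digits and single dots between fields.
    case "$ip" in
        *[!0-9.]*|*..*|.*|*.) return $OCF_ERR_CONFIGURED ;;
    esac
    # Split on dots into the function's positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $ip
    IFS=$old_ifs
    [ $# -eq 4 ] || return $OCF_ERR_CONFIGURED
    for octet in "$@"; do
        # At most three digits, and a value no larger than 255.
        [ ${#octet} -le 3 ] && [ "$octet" -le 255 ] || return $OCF_ERR_CONFIGURED
    done
    return $OCF_SUCCESS
}
```

With this sketch, the example value 430.a.49.2 is rejected with OCF_ERR_CONFIGURED regardless of what else is active on the node.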


That would be the third option; not change anything, make "ocf_is_probe"
(and how to detect them) official, and document how implementors have to
be really careful about going beyond the mere state check.
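
For reference, probe detection can be sketched as below. This is modeled loosely on the ocf_is_probe helper and is not a copy of it; the function name is_probe and the default of 0 for an unset interval are assumptions made for illustration.

```shell
#!/bin/sh
# A probe is a "monitor" action with an interval of 0.
# __OCF_ACTION and OCF_RESKEY_CRM_meta_interval are normally provided by
# the cluster and the RA environment; here they are ordinary variables.
is_probe() {
    [ "$__OCF_ACTION" = "monitor" ] &&
        [ "${OCF_RESKEY_CRM_meta_interval:-0}" = "0" ]
}
```

An RA would call this before deciding whether a missing prerequisite is worth reporting as an error.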


Regards,
Lars

Re: RA spec: explicit "probe" operation?
Hi,

On Tue, Jun 28, 2011 at 02:32:24PM +0200, Lars Marowsky-Bree wrote:
> On 2011-06-28T12:51:43, Florian Haas <florian.haas@linbit.com> wrote:
>
> > How about instead defining specific instances when the cluster _must_
> > call validate-all (I think it never does, now, but feel free to correct
> > me),
>
> I'm not sure I follow where you're going here - which such specific
> instances come to mind? Basically, there are only two points where validate-all
> makes sense:
>
> a) automatically directly prior to an intended "start" - in which case
> it is redundant, since the "start" can report exactly the same.
>
> b) As a help for the UIs, to check if the parameters make sense and are
> possibly even correct. (Which is what validate-all was intended for, to
> provide deeper checking than a simple syntax check that the UI can
> provide based on the data type.)
>
> It'd probably be good to recommend that UIs actually do this when a
> resource is added (and all its pre-requisites are running).

Not a bad idea, but how would the UI know that all requisite
resources are running? This is what we already discussed several
months ago, when I suggested that ptest somehow delivers the
dependencies, but everybody frowned on that (IIRC).

> > retain the definition of the probe operation as is (a monitor action
> > that does not recur),
>
> Well, from the point of view of the definition of "monitor", that
> doesn't matter. In theory, the repeat schedule for "monitor" wasn't
> meant to ever be required, since "monitor" was intended to always
> provide a correct result.

How do you mean "be required"? As something to check whether
it's a probe?

> Which leads me to:
>
> > and then restrict monitor to only check for resource status and
> > failure, rather than correct configuration?
>
> This bit actually makes the problem go away indeed. The spec needs to
> clarify that the "monitor" operation's primary goal is to ascertain the
> state (running/failed/stopped/unknown).
>
> That some monitors went beyond this (with the best of intentions) is
> what actually is causing most of our problems in scenarios where this
> doesn't make sense.

Are you suggesting that an RA shouldn't be doing deeper checks?
This could really be up for discussion, but so far the
idea was that an RA instance should do a bit more than just
check whether a process was running. After all, that's what
makes OCF RA better than LSB.

> The RA can sometimes ascertain that the resource will never be able to
> be started on that node unless the admin intervenes or the environment
> is changed by other resources being brought online first. (Which is what
> ERR_INSTALLED being returned by monitor_0 basically implies.)

Which doesn't work for all resources, as we recently discussed.
Some, such as oracle or db2, may even have binaries on shared
storage.

> Or that the semantics are completely broken (ip=430.a.49.2),
> which is a valid cause for "ERR_CONFIGURED".
>
> monitor_0 can reasonably check for this, iff carefully implemented (the
> problem arises from those RAs that aren't, or where the logic has bugs);
> splitting it off into a separate mandatory call to "validate-all" is not
> necessarily a good idea, since it would double the number of startup
> probe actions.
>
>
> That would be the third option; not change anything, make "ocf_is_probe"
> (and how to detect them) official, and document how implementors have to
> be really careful about going beyond the mere state check.

This is probably the most reasonable thing to do. Otherwise,
we'll go into changing all RAs, and that wouldn't be justified
in this case.

BTW, I have an (almost) ready RA driver for shell based RAs.
This driver takes care of probes, so that an RA using the driver
can split probe and monitor code. Actually, wrong handling of
probes was the main motivation to implement it.

Cheers,

Dejan

Re: RA spec: explicit "probe" operation?
On 2011-06-28T16:36:45, Dejan Muhamedagic <dejan@suse.de> wrote:

> > It'd probably be good to recommend that UIs actually do this when a
> > resource is added (and all its pre-requisites are running).
> Not a bad idea, but how would the UI know that all requisite
> resources are running? This is what we already discussed several
> months ago, when I suggested that ptest somehow delivers the
> dependencies, but everybody frowned on that (IIRC).

The admin could tell the UI to check the parameters (most of the time,
people will add new resources when their pre-requisites are running, so
that is easy). Or indeed, it could parse ptest output.

But that is actually beyond this spec discussion; what I was trying to
understand is when "validate-all" would be mandatory to run like Florian
suggested; the above is only an example where it _could_ be run.

> > Well, from the point of view of the definition of "monitor", that
> > doesn't matter. In theory, the repeat schedule for "monitor" wasn't
> > meant to ever be required, since "monitor" was intended to always
> > provide a correct result.
> How do you mean "be required"? As something to check whether
> it's a probe?

I meant "required to be known to the RA", sorry, I thought the context
was clear.

> Are you suggesting that an RA shouldn't be doing deeper checks?
> This could really be up for discussion, but so far the
> idea was that an RA instance should do a bit more than just
> check whether a process was running. After all, that's what
> makes OCF RA better than LSB.

No, that wasn't why we invented OCF RAs instead of going with LSB. The
main distinction is that OCF RAs take instance parameters. (And that we
got to define new actions for some.)

And that "monitor" _can_ do more, but it doesn't _need_ to; the primary
goal is to determine running/stopped/failed/unknown state, and it
shouldn't report anything else if it can't be sure about it.
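
A minimal sketch of a "monitor" that stays within that primary goal, assuming a daemon that records its PID in a pidfile. The function name demo_monitor and the PIDFILE path are hypothetical; the return codes are the standard OCF values. Treating a stale pidfile as "failed" is a modeling choice for illustration only.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Assumed pidfile location, overridable from the environment.
PIDFILE="${PIDFILE:-/tmp/demo-ra.pid}"

# State-only monitor: answers "running, stopped, or failed?" and never
# reports installation or configuration errors.
demo_monitor() {
    # No pidfile: the resource is not active on this node.
    [ -f "$PIDFILE" ] || return $OCF_NOT_RUNNING

    pid=$(cat "$PIDFILE" 2>/dev/null)
    # Pidfile present but no such process: report a failure.
    if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}
```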

> > The RA can sometimes ascertain that the resource will never be able to
> > be started on that node unless the admin intervenes or the environment
> > is changed by other resources being brought online first. (Which is what
> > ERR_INSTALLED being returned by monitor_0 basically implies.)
> Which doesn't work for all resources, as we recently discussed.
> Some, such as oracle or db2, may even have binaries on shared
> storage.

Which is exactly the point. I'm not sure where you're contradicting me?

> BTW, I have an (almost) ready RA driver for shell based RAs.
> This driver takes care of probes, so that an RA using the driver
> can split probe and monitor code. Actually, wrong handling of
> probes was the main motivation to implement it.

I'm not sure at all what you're talking about here; what is this?


Regards,
Lars

Re: RA spec: explicit "probe" operation?
On 06/28/2011 02:32 PM, Lars Marowsky-Bree wrote:
> On 2011-06-28T12:51:43, Florian Haas <florian.haas@linbit.com> wrote:
>
>> How about instead defining specific instances when the cluster _must_
>> call validate-all (I think it never does, now, but feel free to correct
>> me),
>
> I'm not sure I follow where you're going here - which such specific
> instances come to mind? Basically, there are only two points where validate-all
> makes sense:
>
> a) automatically directly prior to an intended "start" - in which case
> it is redundant, since the "start" can report exactly the same.
>
> b) As a help for the UIs, to check if the parameters make sense and are
> possibly even correct. (Which is what validate-all was intended for, to
> provide deeper checking than a simple syntax check that the UI can
> provide based on the data type.)

c) Whenever a node comes online. Then, the cluster could run
validate-all for all defined resources on the newly joined node. Since
validate-all would be the operation that is "allowed" to return
$OCF_ERR_CONFIGURED or $OCF_ERR_INSTALLED or $OCF_ERR_PERM, the cluster
would immediately know which nodes are eligible for running the
resource. It could, possibly, set an implicit -INF location constraint
for resources on ineligible nodes. Only then would it proceed to probe,
and only on the eligible nodes.
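
The proposed behavior could be sketched roughly as follows. Everything here is hypothetical: run_validate stands in for whatever mechanism would execute "validate-all" for one resource on the joining node, and the printed list stands in for the resources that would then be probed there.

```shell
#!/bin/sh
OCF_ERR_PERM=4
OCF_ERR_INSTALLED=5
OCF_ERR_CONFIGURED=6

# Decide from a validate-all exit code whether the node stays eligible.
# Returns 0 for eligible, 1 for ineligible (the implicit -INF case).
node_eligible() {
    case "$1" in
        "$OCF_ERR_CONFIGURED"|"$OCF_ERR_INSTALLED"|"$OCF_ERR_PERM") return 1 ;;
        *) return 0 ;;
    esac
}

# Print the resources that remain eligible on this node.
# run_validate is an assumed hook supplied by the caller.
eligible_resources() {
    for rsc in "$@"; do
        rc=0
        run_validate "$rsc" || rc=$?
        if node_eligible "$rc"; then
            echo "$rsc"
        fi
    done
}
```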

>> retain the definition of the probe operation as is (a monitor action
>> that does not recur),
>
> Well, from the point of view of the definition of "monitor", that
> doesn't matter. In theory, the repeat schedule for "monitor" wasn't
> meant to ever be required, since "monitor" was intended to always
> provide a correct result. Which leads me to:
>
>> and then restrict monitor to only check for resource status and
>> failure, rather than correct configuration?
>
> This bit actually makes the problem go away indeed. The spec needs to
> clarify that the "monitor" operation's primary goal is to ascertain the
> state (running/failed/stopped/unknown).

More precisely, running/runtime failure (as opposed to configuration
error or unsatisfied prerequisites)/stopped/unknown. Although I'm not so
sure if "unknown" is actually sensible. We have no "unknown" state at
this point, unless you are referring to the unmanaged state we get into
if stop fails and fencing is either not configured or fails as well.

> That some monitors went beyond this (with the best of intentions) is
> what actually is causing most of our problems in scenarios where this
> doesn't make sense.
>
>
> The RA can sometimes ascertain that the resource will never be able to
> be started on that node unless the admin intervenes or the environment
> is changed by other resources being brought online first. (Which is what
> ERR_INSTALLED being returned by monitor_0 basically implies.)
>
> Or that the semantics are completely broken (ip=430.a.49.2),
> which is a valid cause for "ERR_CONFIGURED".
>
> monitor_0 can reasonably check for this, iff carefully implemented (the
> problem arises from those RAs that aren't, or where the logic has bugs);
> splitting it off into a separate mandatory call to "validate-all" is not
> necessarily a good idea, since it would double the number of startup
> probe actions.

I don't think so; see the cluster behavior I suggested above.

Cheers,
Florian
Re: RA spec: explicit "probe" operation?
On 29/06/11 15:56, Florian Haas wrote:
> On 06/28/2011 02:32 PM, Lars Marowsky-Bree wrote:
>> On 2011-06-28T12:51:43, Florian Haas<florian.haas@linbit.com> wrote:
>>
>>> How about instead defining specific instances when the cluster _must_
>>> call validate-all (I think it never does, now, but feel free to correct
>>> me),
>>
>> I'm not sure I follow where you're going here - which such specific
>> instances come to mind? Basically, there are only two points where validate-all
>> makes sense:
>>
>> a) automatically directly prior to an intended "start" - in which case
>> it is redundant, since the "start" can report exactly the same.
>>
>> b) As a help for the UIs, to check if the parameters make sense and are
>> possibly even correct. (Which is what validate-all was intended for, to
>> provide deeper checking than a simple syntax check that the UI can
>> provide based on the data type.)
>
> c) Whenever a node comes online. Then, the cluster could run
> validate-all for all defined resources on the newly joined node. Since
> validate-all would be the operation that is "allowed" to return
> $OCF_ERR_CONFIGURED or $OCF_ERR_INSTALLED or $OCF_ERR_PERM, the cluster
> would immediately know which nodes are eligible for running the
> resource. It could, possibly, set an implicit -INF location constraint
> for resources on ineligible nodes. Only then would it proceed to probe,
> and only on the eligible nodes.

That might fail or get weird if one resource depends on another
(resource B's config is on cluster filesystem A), in which case we'd
need to take colocation and ordering constraints into account and only
run validate-all on resources whose dependencies are already running...?

Regards,

Tim
--
Tim Serong <tserong@novell.com>
Senior Clustering Engineer, OPS Engineering, Novell Inc.
Re: RA spec: explicit "probe" operation?
On 2011-06-29T07:56:48, Florian Haas <florian.haas@linbit.com> wrote:

> c) Whenever a node comes online. Then, the cluster could run
> validate-all for all defined resources on the newly joined node.

Like Tim said, this fails for exactly the same reason that "monitor_0"
has undesirable properties, I'm afraid.

Also, "monitor_0" would still need to run. (Theoretically, just because
it's misconfigured and shouldn't run there doesn't mean it isn't active
in a failed state.)

Unless we probe in dependency order, but even that would be wrong - we'd
possibly not probe for resources that are active out of turn. (Think
groups: a-b-c, c shouldn't be active if a-b aren't running, but an IP
could still have been started on boot.)


Regards,
Lars

Re: RA spec: explicit "probe" operation?
Random thought...

What about having monitor just be about started/stopped/failed (as
suggested above) and have the cluster automatically call validate-all
(which would check for tools and config options) before start ops?

On Wed, Jun 29, 2011 at 7:06 PM, Lars Marowsky-Bree <lmb@suse.de> wrote:
> On 2011-06-29T07:56:48, Florian Haas <florian.haas@linbit.com> wrote:
>
>> c) Whenever a node comes online. Then, the cluster could run
>> validate-all for all defined resources on the newly joined node.
>
> Like Tim said, this fails for exactly the same reason that "monitor_0"
> has undesirable properties, I'm afraid.
>
> Also, "monitor_0" would still need to run. (Theoretically, just because
> it's misconfigured and shouldn't run there doesn't mean it isn't active
> in a failed state.)
>
> Unless we probe in dependency order, but even that would be wrong - we'd
> possibly not probe for resources that are active out of turn. (Think
> groups: a-b-c, c shouldn't be active if a-b aren't running, but an IP
> could still have been started on boot.)

Re: RA spec: explicit "probe" operation?
On 2011-06-30T10:32:56, Andrew Beekhof <andrew@beekhof.net> wrote:

> Random thought...
>
> What about having monitor just be about started/stopped/failed (as
> suggested above) and have the cluster automatically call validate-all
> (which would check for tools and config options) before start ops?

What would the point be?



Re: RA spec: explicit "probe" operation?
On Thu, Jun 30, 2011 at 6:10 PM, Lars Marowsky-Bree <lmb@suse.de> wrote:
> On 2011-06-30T10:32:56, Andrew Beekhof <andrew@beekhof.net> wrote:
>
>> Random thought...
>>
>> What about having monitor just be about started/stopped/failed (as
>> suggested above) and have the cluster automatically call validate-all
>> (which would check for tools and config options) before start ops?
>
> What would the point be?

It would alleviate the part where RA writers need to know (and we need
to document) when to call validate-all.
Failures would also show up in the CIB (and therefore the tools) under
the validate-all op, not start - this might be slightly more helpful
for users' debugging.
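
A rough sketch of that sequence, seen from the cluster side. The hook run_op is hypothetical and stands in for whatever invokes a single RA action; the echoed line is only a stand-in for the operation that would be recorded in the CIB.

```shell
#!/bin/sh
# Run validate-all first and start only if it passes, so that a bad
# configuration is attributed to validate-all rather than to start.
# run_op is an assumed hook: run_op <resource> <action>.
start_with_validation() {
    rsc=$1
    rc=0
    run_op "$rsc" validate-all || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "$rsc validate-all rc=$rc"
        return "$rc"
    fi
    run_op "$rsc" start || rc=$?
    echo "$rsc start rc=$rc"
    return "$rc"
}
```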

Re: RA spec: explicit "probe" operation?
On 2011-06-30T19:07:42, Andrew Beekhof <andrew@beekhof.net> wrote:

> It would alleviate the part where RA writers need to know (and we need
> to document) when to call validate-all.

Uhm, RA writers don't need to know when to call validate-all.

> Failures would also show up in the CIB (and therefore the tools) under
> the validate-all op, not start - this might be slightly more helpful
> for users' debugging.

At the expense of doubling the number of operations we need to call for
"start", and doubling the effort - since, clearly, "start" needs to
check all these requirements again.

(Just like any op needs to check its own requirements; or at least
implicitly does, since it otherwise would fail to complete.)


Regards,
Lars

Re: RA spec: explicit "probe" operation?
On Thu, Jun 30, 2011 at 7:09 PM, Lars Marowsky-Bree <lmb@suse.de> wrote:
> On 2011-06-30T19:07:42, Andrew Beekhof <andrew@beekhof.net> wrote:
>
>> It would alleviate the part where RA writers need to know (and we need
>> to document) when to call validate-all.
>
> Uhm, RA writers don't need to know when to call validate-all.

I was referring to this (which apparently I only half read):

> a) automatically directly prior to an intended "start" - in which case
> it is redundant, since the "start" can report exactly the same.

Start can do this, but only if we educate RA writers to do so.
I'd favor doing it explicitly and automagically.

>
>> Failures would also show up in the CIB (and therefore the tools) under
>> the validate-all op, not start - this might be slightly more helpful
>> for users' debugging.
>
> At the expense of doubling the number of operations we need to call for
> "start", and doubling the effort - since, clearly, "start" needs to
> check all these requirements again.

Every RA already calls validate-all before really trying to start?

> (Just like any op needs to check its own requirements; or at least
> implicitly does, since it otherwise would fail to complete.)
>
Re: RA spec: explicit "probe" operation?
On Tue, Jun 28, 2011 at 11:02:32PM +0200, Lars Marowsky-Bree wrote:
> On 2011-06-28T16:36:45, Dejan Muhamedagic <dejan@suse.de> wrote:
>
> > > It'd probably be good to recommend that UIs actually do this when a
> > > resource is added (and all its pre-requisites are running).
> > Not a bad idea, but how would the UI know that all requisite
> > resources are running? This is what we already discussed several
> > months ago, when I suggested that ptest somehow delivers the
> > dependencies, but everybody frowned on that (IIRC).
>
> The admin could tell the UI to check the parameters (most of the time,
> people will add new resources when their pre-requisites are running, so
> that is easy). Or indeed, it could parse ptest output.
>
> But that is actually beyond this spec discussion; what I was trying to

Yes, it's a different matter.

> understand is when "validate-all" would be mandatory to run like Florian
> suggested; the above is only an example where it _could_ be run.
>
> > > Well, from the point of view of the definition of "monitor", that
> > > doesn't matter. In theory, the repeat schedule for "monitor" wasn't
> > > meant to ever be required, since "monitor" was intended to always
> > > provide a correct result.
> > How do you mean "be required"? As something to check whether
> > it's a probe?
>
> I meant "required to be known to the RA", sorry, I thought the context
> was clear.
>
> > Are you suggesting that an RA shouldn't be doing deeper checks?
> > This could really be up for discussion, but so far the
> > idea was that an RA instance should do a bit more than just
> > check whether a process was running. After all, that's what
> > makes OCF RA better than LSB.
>
> No, that wasn't why we invented OCF RAs instead of going with LSB. The
> main distinction is that OCF RAs take instance parameters. (And that we
> got to define new actions for some.)
>
> And that "monitor" _can_ do more, but it doesn't _need_ to; the primary
> goal is to determine running/stopped/failed/unknown state, and it
> shouldn't report anything else if it can't be sure about it.

Of course, the RA won't make up things :)

> > > The RA can sometimes ascertain that the resource will never be able to
> > > be started on that node unless the admin intervenes or the environment
> > > is changed by other resources being brought online first. (Which is what
> > > ERR_INSTALLED being returned by monitor_0 basically implies.)
> > Which doesn't work for all resources, as we recently discussed.
> > Some, such as oracle or db2, may even have binaries on shared
> > storage.
>
> Which is exactly the point. I'm not sure where you're contradicting me?

ERR_INSTALLED says that the resource can never run on this node.
In case of probes, that may not be true if for instance shared
storage is not mounted. So, in some RAs probe must return
NOT_RUNNING if some requirements are not fulfilled.
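
A minimal sketch of that rule, assuming a binary that may live on shared storage. The parameter name OCF_RESKEY_binary and the function names are hypothetical; the return codes are the standard OCF values.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
OCF_NOT_RUNNING=7

# A probe is a "monitor" action with an interval of 0.
is_probe() {
    [ "$__OCF_ACTION" = "monitor" ] &&
        [ "${OCF_RESKEY_CRM_meta_interval:-0}" = "0" ]
}

# Check for the binary, but stay modest about what its absence means.
check_binary() {
    if [ -x "$OCF_RESKEY_binary" ]; then
        return $OCF_SUCCESS
    fi
    if is_probe; then
        # During a probe the storage may simply not be mounted yet, so
        # report "not running here" rather than "can never run here".
        return $OCF_NOT_RUNNING
    fi
    # Outside a probe, a missing binary is a real installation problem.
    return $OCF_ERR_INSTALLED
}
```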

> > BTW, I have an (almost) ready RA driver for shell based RAs.
> > This driver takes care of probes, so that an RA using the driver
> > can split probe and monitor code. Actually, wrong handling of
> > probes was the main motivation to implement it.
>
> I'm not sure at all what you're talking about here; what is this?

The idea is to streamline RA development further. Basically, an
RA could look like this:

REQUIRED_PARAMS="config ..."
REQUIRED_BINARIES="b1 b2 ..."

xyz_metadata() {
...
}
xyz_start() {
...
}
xyz_stop() {
...
}
xyz_monitor_10() {
...
}
xyz_monitor() {
...
}
xyz_probe() {
...
}
xyz_validate_all() {
...
}

ocf_rarun "$@"

ocf_rarun would take care of all the boring details, checking
parameters, doing validation when needed, invoking the right
monitor (or probe), etc.
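
To illustrate the dispatch part only, here is a hypothetical sketch; it is not the actual ocf_rarun. It assumes the RA defines functions named <prefix>_<action>, takes the prefix as an explicit argument, and compares the interval suffix literally (no unit conversion). The names rarun_sketch and have_func are made up for this example.

```shell
#!/bin/sh
OCF_ERR_UNIMPLEMENTED=3

# True if a shell function (or command) with this name exists.
have_func() {
    command -v "$1" >/dev/null 2>&1
}

# rarun_sketch <prefix> <action>
# Picks <prefix>_probe for probes, <prefix>_monitor_<interval> when such a
# function exists, and otherwise falls back to <prefix>_<action>.
rarun_sketch() {
    prefix=$1
    action=$2
    interval=${OCF_RESKEY_CRM_meta_interval:-0}

    # Map action names such as "validate-all" to function names.
    handler="${prefix}_$(echo "$action" | sed 's/-/_/g')"
    if [ "$action" = "monitor" ]; then
        if [ "$interval" = "0" ] && have_func "${prefix}_probe"; then
            handler="${prefix}_probe"
        elif have_func "${prefix}_monitor_${interval}"; then
            handler="${prefix}_monitor_${interval}"
        fi
    fi

    have_func "$handler" || return $OCF_ERR_UNIMPLEMENTED
    "$handler"
}
```

An RA written against this sketch would end with something like rarun_sketch xyz "$1", leaving probe versus monitor selection to the dispatcher.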

Thanks,

Dejan

Re: RA spec: explicit "probe" operation?
On 2011-06-30T12:48:03, Dejan Muhamedagic <dejan@suse.de> wrote:

> > > > The RA can sometimes ascertain that the resource will never be able to
> > > > be started on that node unless the admin intervenes or the environment
> > > > is changed by other resources being brought online first. (Which is what
> > > > ERR_INSTALLED being returned by monitor_0 basically implies.)
> > > Which doesn't work for all resources, as we recently discussed.
> > > Some, such as oracle or db2, may even have binaries on shared
> > > storage.
> > Which is exactly the point. I'm not sure where you're contradicting me?
> ERR_INSTALLED says that the resource can never run on this node.
> In case of probes, that may not be true if for instance shared
> storage is not mounted. So, in some RAs probe must return
> NOT_RUNNING if some requirements are not fulfilled.

Which is exactly why "monitor" shouldn't return ERR_INSTALLED, if this
possibility exists. That was the whole point of this discussion?

Note that "binary not present", if one wants to be pedantic, is not
identical to "not running"; it is identical to "not startable or stoppable",
in all likelihood, but ps could still be used to see if the process is
around. On the other hand, "data share not mounted" probably is a pretty
good indicator of "not running".
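
As a rough sketch of that distinction (the parameter names datadir and pidfile and the xyz_ function names are made up for illustration, not taken from any existing RA): the monitor only reports state. A missing data share is treated as "not running", the binary is deliberately not checked at all, and ERR_INSTALLED is never returned from here.

```shell
#!/bin/sh
# Standard OCF exit codes.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Hypothetical parameters; defaults are placeholders for illustration.
: "${OCF_RESKEY_datadir:=/srv/xyz/data}"
: "${OCF_RESKEY_pidfile:=/var/run/xyzd.pid}"

# True if the pidfile names a process we can signal.
# (kill -0 also fails for processes owned by other users when we are
# not root; a real agent would need to account for that.)
xyz_process_alive() {
    [ -r "$OCF_RESKEY_pidfile" ] || return 1
    pid=$(cat "$OCF_RESKEY_pidfile")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

xyz_monitor() {
    # Data share not mounted: a pretty good indicator of "not running".
    [ -d "$OCF_RESKEY_datadir" ] || return $OCF_NOT_RUNNING

    # Whether the binary exists says nothing about whether the process
    # is around, so only the process itself is checked here.
    if xyz_process_alive; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}
```

Whether the binary is installed then only becomes a question for "start", which is the operation that actually needs it.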

> ocf_rarun "$@"

Ah, so a template function that takes care of the code that is shared
across resource agents. Yes, that's a good idea, but a completely
different discussion than this one.


Regards,
Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: RA spec: explicit "probe" operation? [ In reply to ]
On 2011-06-30 11:36, Andrew Beekhof wrote:
> On Thu, Jun 30, 2011 at 7:09 PM, Lars Marowsky-Bree <lmb@suse.de> wrote:
>> On 2011-06-30T19:07:42, Andrew Beekhof <andrew@beekhof.net> wrote:
>>
>>> It would alleviate the part where RA writers need to know (and we need
>>> to document) when to call validate-all.
>>
>> Uhm, RA writers don't need to know when to call validate-all.
>
> I was referring to this (which apparently I only half read):
>
>> a) automatically directly prior to an intended "start" - in which case
>> it is redundant, since the "start" can report exactly the same.
>
> Start can do this, but only if we educate RA writers to do so.
> I'd favor doing it explicitly and automagically.

I'm in full agreement with Andrew on this one.

Cheers,
Florian
Re: RA spec: explicit "probe" operation? [ In reply to ]
On Fri, Jul 1, 2011 at 5:47 PM, Florian Haas <florian.haas@linbit.com> wrote:
> On 2011-06-30 11:36, Andrew Beekhof wrote:
>> On Thu, Jun 30, 2011 at 7:09 PM, Lars Marowsky-Bree <lmb@suse.de> wrote:
>>> On 2011-06-30T19:07:42, Andrew Beekhof <andrew@beekhof.net> wrote:
>>>
>>>> It would alleviate the part where RA writers need to know (and we need
>>>> to document) when to call validate-all.
>>>
>>> Uhm, RA writers don't need to know when to call validate-all.
>>
>> I was referring to this (which apparently I only half read):
>>
>>> a) automatically directly prior to an intended "start" - in which case
>>> it is redundant, since the "start" can report exactly the same.
>>
>> Start can do this, but only if we educate RA writers to do so.
>> I'd favor doing it explicitly and automagically.
>
> I'm in full agreement with Andrew on this one.

Oh crap. Sorry everyone, I must be wrong ;-)
Re: RA spec: explicit "probe" operation? [ In reply to ]
On Fri, Jul 01, 2011 at 06:09:10PM +1000, Andrew Beekhof wrote:
> On Fri, Jul 1, 2011 at 5:47 PM, Florian Haas <florian.haas@linbit.com> wrote:
> > On 2011-06-30 11:36, Andrew Beekhof wrote:
> >> On Thu, Jun 30, 2011 at 7:09 PM, Lars Marowsky-Bree <lmb@suse.de> wrote:
> >>> On 2011-06-30T19:07:42, Andrew Beekhof <andrew@beekhof.net> wrote:
> >>>
> >>>> It would alleviate the part where RA writers need to know (and we need
> >>>> to document) when to call validate-all.
> >>>
> >>> Uhm, RA writers don't need to know when to call validate-all.
> >>
> >> I was referring to this (which apparently I only half read):
> >>
> >>> a) automatically directly prior to an intended "start" - in which case
> >>> it is redundant, since the "start" can report exactly the same.
> >>
> >> Start can do this, but only if we educate RA writers to do so.
> >> I'd favor doing it explicitly and automagically.
> >
> > I'm in full agreement with Andrew on this one.
>
> Oh crap. Sorry everyone, I must be wrong ;-)

I concur with both :)

Re: RA spec: explicit "probe" operation? [ In reply to ]
On 2011-07-01T18:09:10, Andrew Beekhof <andrew@beekhof.net> wrote:

> >>>> It would alleviate the part where RA writers need to know (and we need
> >>>> to document) when to call validate-all.
> >>> Uhm, RA writers don't need to know when to call validate-all.
> >> I was referring to this (which apparently I only half read):
> >>> a) automatically directly prior to an intended "start" - in which case
> >>> it is redundant, since the "start" can report exactly the same.
> >> Start can do this, but only if we educate RA writers to do so.
> >> I'd favor doing it explicitly and automagically.
> > I'm in full agreement with Andrew on this one.
> Oh crap. Sorry everyone, I must be wrong ;-)

You both actually are. ;-)

Either "start" succeeds at this point, or it doesn't. Calling
"validate-all" doesn't provide any additional information that calling
"start" directly wouldn't; it is superfluous.

That's obvious, isn't it?

"validate-all" makes sense in situations where you're not planning to do
anything else immediately; like, a UI trying to figure out if the user
gave it reasonably useful parameters.


Regards,
Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: RA spec: explicit "probe" operation? [ In reply to ]
On 2011-07-06 17:31, Lars Marowsky-Bree wrote:
> On 2011-07-01T18:09:10, Andrew Beekhof <andrew@beekhof.net> wrote:
>
>>>>>> It would alleviate the part where RA writers need to know (and we need
>>>>>> to document) when to call validate-all.
>>>>> Uhm, RA writers don't need to know when to call validate-all.
>>>> I was referring to this (which apparently I only half read):
>>>>> a) automatically directly prior to an intended "start" - in which case
>>>>> it is redundant, since the "start" can report exactly the same.
>>>> Start can do this, but only if we educate RA writers to do so.
>>>> I'd favor doing it explicitly and automagically.
>>> I'm in full agreement with Andrew on this one.
>> Oh crap. Sorry everyone, I must be wrong ;-)
>
> You both actually are. ;-)
>
> Either "start" succeeds at this point, or it doesn't. Calling
> "validate-all" doesn't provide any additional information that calling
> "start" directly wouldn't; it is superfluous.
>
> That's obvious, isn't it?

So obvious that it's typical to see a new contributor forget to do any
sort of validation, on start or elsewhere.

Just sayin'.

Florian
Re: RA spec: explicit "probe" operation? [ In reply to ]
On 2011-07-06T17:43:01, Florian Haas <florian.haas@linbit.com> wrote:

> > Either "start" succeeds at this point, or it doesn't. Calling
> > "validate-all" doesn't provide any additional information that calling
> > "start" directly wouldn't; it is superfluous.
> >
> > That's obvious, isn't it?
> So obvious that it's typical to see a new contributor forget to do any
> sort of validation, on start or elsewhere.
>
> Just sayin'.

Yes, but those RAs would still be broken.

I agree we ought to invent a place where validate-all _is_ called, and
have UIs that actually utilize it, but we shouldn't hide a broken action
implementation behind it.



Regards,
Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: RA spec: explicit "probe" operation? [ In reply to ]
On Thu, Jul 7, 2011 at 8:42 PM, Lars Marowsky-Bree <lmb@suse.de> wrote:
> On 2011-07-06T17:43:01, Florian Haas <florian.haas@linbit.com> wrote:
>
>> > Either "start" succeeds at this point, or it doesn't. Calling
>> > "validate-all" doesn't provide any additional information that calling
>> > "start" directly wouldn't; it is superfluous.
>> >
>> > That's obvious, isn't it?
>> So obvious that it's typical to see a new contributor forget to do any
>> sort of validation, on start or elsewhere.
>>
>> Just sayin'.
>
> Yes, but those RAs would still be broken.
>
> I agree we ought to invent a place where validate-all _is_ called, and
> have UIs that actually utilize it, but we shouldn't hide a broken action
> implementation behind it.
>

Either:

1) start must always call validate-all, or
2) RA writers duplicate the validate-all checks in start

Otherwise your assertion that:

> > Calling
> > "validate-all" doesn't provide any additional information that calling
> > "start" directly wouldn't; it is superfluous.

Is incorrect.

I think most of us are agreeing that neither is desirable and that

3) we automatically call validate-all before start

Is the better path forward.
Re: RA spec: explicit "probe" operation? [ In reply to ]
On 2011-07-08T10:54:17, Andrew Beekhof <andrew@beekhof.net> wrote:

> Either:
>
> 1) start must always call validate-all, or

RAs are supposed to be reasonably paranoid.

Validating the input on any request is just insanely good practice. (At
least as far as they are relevant to the request.)

As a somewhat silly example, consider validate-all checking that
"tmpdir" is non-empty and a valid directory. Consider that start removes
it. Consider what will happen if something goes wrong with the
parameter, or one of our users manages to call "start" manually. (You
know they will.)

The operation and the validation for the parameters that operation takes
belong in _one_ execution context, not split into two.
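
A minimal sketch of the tmpdir example, with the check living in the same execution context as the destructive step. The xyz_ names and the shared helper are hypothetical, and the path check is deliberately simplistic (a real RA would want something stricter, e.g. rejecting paths such as "/.."):

```shell
#!/bin/sh
# Standard OCF exit codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_CONFIGURED=6

# Shared check: tmpdir must be an absolute path containing at least one
# non-slash character (so not empty and not "/") and an existing directory.
xyz_validate_tmpdir() {
    case ${OCF_RESKEY_tmpdir:-} in
        /*[!/]*) ;;
        *) return $OCF_ERR_CONFIGURED ;;
    esac
    [ -d "$OCF_RESKEY_tmpdir" ] || return $OCF_ERR_CONFIGURED
    return $OCF_SUCCESS
}

# validate-all (e.g. for a UI) reuses the same check.
xyz_validate_all() {
    xyz_validate_tmpdir
}

xyz_start() {
    # Validate right here, even if validate-all ran earlier; "start" may
    # have been called manually or with different parameters.
    xyz_validate_tmpdir || return

    # The destructive step from the example: remove and recreate tmpdir.
    rm -rf -- "$OCF_RESKEY_tmpdir" && mkdir -p -- "$OCF_RESKEY_tmpdir" \
        || return $OCF_ERR_GENERIC

    # ... launching the actual service would go here ...
    return $OCF_SUCCESS
}
```

With this layout "validate-all" and "start" share one function, so there is no duplicated logic, and a manual "start" with an empty tmpdir fails with ERR_CONFIGURED before rm is ever reached.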

> Otherwise your assertion that:
>
> > > Calling
> > > "validate-all" doesn't provide any additional information that calling
> > > "start" directly wouldn't; it is superfluous.
>
> Is incorrect.

It is not. But if parameters are incorrect, start will fail, even if it
didn't explicitly validate them; otherwise, it can't reasonably expect
what will happen. Hence, it should validate them.

If "start" is called with wrong parameters, it _needs to cope with
that_. Consider:

Calling just "start" results in the service being up or not (with a
failure).

Calling validate-all + start results in the service being up or not
(with a, perhaps, somewhat more detailed failure). But since start is
still allowed to fail (obviously), something can still go wrong there;
and then, start needs to report that properly.

But the service will still be either up or down, and that is what we
actually want to know at that stage; from the point of view of the cluster
manager, it doesn't contain more information. And we still have a start
operation that needs to be prepared to handle a failure.

> I think most of us are agreeing that neither is desirable and that
>
> 3) we automatically call validate-all before start
>
> Is the better path forward.

No! Please, don't. It incurs a completely pointless fork and the whole
setup of the RA. We need to make clusters -faster-, not double the
amount of actions they need to run before starting a service!

Not to mention that it will lead people to write even sloppier start
operations than they already do.

I really really REALLY think you're on the wrong path here. As in,
really quite wrong. Please, reconsider and think that through.

validate-all was never meant to be an automatic operation. It was meant
as a UI aid, not as a mandatory step. We _can_ redefine that, but then
most likely all RAs out there will need fixing. (More than they already
do.) For little to negative gain.


Regards,
Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: RA spec: explicit "probe" operation? [ In reply to ]
On Sat, Jul 9, 2011 at 7:40 AM, Lars Marowsky-Bree <lmb@suse.de> wrote:
> On 2011-07-08T10:54:17, Andrew Beekhof <andrew@beekhof.net> wrote:
>
>> Either:
>>
>> 1) start must always call validate-all, or
>
> RAs are supposed to be reasonably paranoid.
>
> Validating the input on any request is just insanely good practice. (At
> least as far as they are relevant to the request.)
>
> As a somewhat silly example, consider validate-all checking that
> "tmpdir" is non-empty and a valid directory. Consider that start removes
> it.  Consider what will happen if something goes wrong with the
> parameter, or one of our users manages to call "start" manually. (You
> know they will.)
>
> The operation and the validation for the parameters that operation takes
> belong in _one_ execution context, not split into two.

So you're arguing that validate-all should go away?

>
>> Otherwise your assertion that:
>>
>> > > Calling
>> > > "validate-all" doesn't provide any additional information that calling
>> > > "start" directly wouldn't; it is superfluous.
>>
>> Is incorrect.
>
> It is not. But if parameters are incorrect, start will fail,

Without doubt. However not all failures (and error codes) are created equal.
As long as something is designed to check this stuff _and_ it's
actually being called - I don't actually care.

I just thought that place was validate-all.

> even if it
> didn't explicitly validate them; otherwise, it can't reasonably expect
> what will happen. Hence, it should validate them.
>
> If "start" is called with wrong parameters, it _needs to cope with
> that_. Consider:
>
> Calling just "start" results in the service being up or not (with a
> failure).
>
> Calling validate-all + start results in the service being up or not
> (with a, perhaps, somewhat more detailed failure). But since start is
> still allowed to fail (obviously), something can still go wrong there;
> and then, start needs to report that properly.

A "somewhat more detailed failure" is the part I care about.
How many "it doesn't start" emails/reports do we get? Far too many to count.

I think having the cluster abort before that point would make all our
lives easier as it would be more obvious that the error is related to
the config or install.
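
For illustration, a small sketch of how the exit code alone can already separate "config or install" problems from a plain start failure, wherever the checks end up living. The function names ocf_rc_describe and report_start_failure are hypothetical; the numeric codes and their meanings are the standard OCF ones.

```shell
#!/bin/sh
# Map a standard OCF exit code to a short description.
ocf_rc_describe() {
    case $1 in
        0) echo "success" ;;
        1) echo "generic error" ;;
        2) echo "invalid arguments" ;;
        3) echo "action not implemented" ;;
        4) echo "insufficient permissions" ;;
        5) echo "required software not installed" ;;
        6) echo "resource misconfigured" ;;
        7) echo "not running" ;;
        *) echo "unknown exit code $1" ;;
    esac
}

# Hypothetical helper: produce a report line for a failed start that
# says whether the problem looks like install/config or something else.
report_start_failure() {
    rsc=$1
    rc=$2
    case $rc in
        5|6) kind="install/config problem" ;;
        *)   kind="start failure" ;;
    esac
    echo "$rsc: $kind ($(ocf_rc_describe "$rc"), rc=$rc)"
}
```

For example, report_start_failure db1 6 prints "db1: install/config problem (resource misconfigured, rc=6)".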

>
> But the service will still be either up or down, and that is what we
> actually want to know at that stage; from the point of view of the cluster
> manager, it doesn't contain more information. And we still have a start
> operation that needs to be prepared to handle a failure.
>
>> I think most of us are agreeing that neither is desirable and that
>>
>> 3) we automatically call validate-all before start
>>
>> Is the better path forward.
>
> No! Please, don't. It incurs a completely pointless fork and the whole
> setup of the RA. We need to make clusters -faster-, not double the
> amount of actions they need to run before starting a service!

I think you're exaggerating the amount of overhead.

time(validate-all) != time(start) and time(fork) << time(start) in most cases.

> Not to mention that it will lead people to write even sloppier start
> operations than they already do.
>
> I really really REALLY think you're on the wrong path here. As in,
> really quite wrong. Please, reconsider and think that through.

Like I said above, I don't much care where the checks are located.
But having defined a method called validate-all, it seems the logical
place to put them.

If the logic should be executed as part of a start op and you don't
want 1), then we should remove the function completely.

> validate-all was never meant to be an automatic operation. It was meant
> as a UI aid, not as a mandatory step. We _can_ redefine that, but then
> most likely all RAs out there will need fixing. (More than they already
> do.) For little to negative gain.
Re: RA spec: explicit "probe" operation? [ In reply to ]
On Fri, Jul 08, 2011 at 11:40:03PM +0200, Lars Marowsky-Bree wrote:
> On 2011-07-08T10:54:17, Andrew Beekhof <andrew@beekhof.net> wrote:
>
> > Either:
> >
> > 1) start must always call validate-all, or
>
> RAs are supposed to be reasonably paranoid.
>
> Validating the input on any request is just insanely good practice. (At
> least as far as they are relevant to the request.)
>
> As a somewhat silly example, consider validate-all checking that
> "tmpdir" is non-empty and a valid directory. Consider that start removes
> it. Consider what will happen if something goes wrong with the
> parameter, or one of our users manages to call "start" manually. (You
> know they will.)
>
> The operation and the validation for the parameters that operation takes
> belong in _one_ execution context, not split into two.
>
> > Otherwise your assertion that:
> >
> > > > Calling
> > > > "validate-all" doesn't provide any additional information that calling
> > > > "start" directly wouldn't; it is superfluous.
> >
> > Is incorrect.
>
> It is not. But if parameters are incorrect, start will fail, even if it
> didn't explicitly validate them; otherwise, it can't reasonably expect
> what will happen. Hence, it should validate them.
>
> If "start" is called with wrong parameters, it _needs to cope with
> that_. Consider:
>
> Calling just "start" results in the service being up or not (with a
> failure).
>
> Calling validate-all + start results in the service being up or not
> (with a, perhaps, somewhat more detailed failure). But since start is
> still allowed to fail (obviously), something can still go wrong there;
> and then, start needs to report that properly.
>
> But the service will still be either up or down, and that is what we
> actually want to know at that stage; from the point of view of the cluster
> manager, it doesn't contain more information. And we still have a start
> operation that needs to be prepared to handle a failure.
>
> > I think most of us are agreeing that neither is desirable and that
> >
> > 3) we automatically call validate-all before start
> >
> > Is the better path forward.
>
> No! Please, don't. It incurs a completely pointless fork and the whole
> setup of the RA. We need to make clusters -faster-, not double the
> amount of actions they need to run before starting a service!
>
> Not to mention that it will lead people to write even sloppier start
> operations than they already do.
>
> I really really REALLY think you're on the wrong path here. As in,
> really quite wrong. Please, reconsider and think that through.
>
> validate-all was never meant to be an automatic operation. It was meant
> as a UI aid, not as a mandatory step. We _can_ redefine that, but then
> most likely all RAs out there will need fixing. (More than they already
> do.) For little to negative gain.

I'd agree.

Dejan

