Mailing List Archive

LRM bug
The LRM treats operation timeouts as ERROR:s - not just failed
operations that give warnings. This violates the meaning of ERROR:
messages in the code.

We reserved ERROR: messages for things that the software did not expect
- and therefore possibly could not be properly recovered from. In this
case, the behavior is perfectly expected and the condition will be
properly recovered from. It just means the operation in question failed.

A sample message:
ERROR: process_lrm_event: LRM operation agent-da:3_monitor_5000 (47) Timed Out (timeout=60000ms)

Because of this one message, you can't tell customers "If you ever have
an ERROR: message, the HA software has failed".

This ought to just be a warning, like any other failed action...

--
Alan Robertson <alanr@unix.sh> - @OSSAlanR

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: LRM bug
Hi Alan,

On Mon, Jul 30, 2012 at 10:14:27AM -0600, Alan Robertson wrote:
> The LRM treats operation timeouts as ERROR:s - not just failed
> operations that give warnings. This violates the meaning of ERROR:
> messages in the code.
>
> We reserved ERROR: messages for things that the software did not expect
> - and therefore possibly could not be properly recovered from. In this
> case, the behavior is perfectly expected and the condition will be
> properly recovered from. It just means the operation in question failed.
>
> A sample message:
> ERROR: process_lrm_event: LRM operation agent-da:3_monitor_5000 (47) Timed Out (timeout=60000ms)
>
> Because of this one message, you can't tell customers "If you ever have
> an ERROR: message, the HA software has failed".
>
> This ought to just be a warning, like any other failed action...

I guess that ERROR is used because resource agents use the same
severity when reporting failures they cannot recover from. In
this case, the RA won't log anything, so the lrmd does that on
its behalf. That seems OK to me. The other option would be to
remove the ERROR severity log messages in all RAs, because a
resource problem should normally always be recoverable.

Cheers,

Dejan

Re: LRM bug
On 08/07/2012 08:18 AM, Dejan Muhamedagic wrote:
> Hi Alan,
>
> On Mon, Jul 30, 2012 at 10:14:27AM -0600, Alan Robertson wrote:
>> The LRM treats operation timeouts as ERROR:s - not just failed
>> operations that give warnings. This violates the meaning of ERROR:
>> messages in the code.
>>
>> We reserved ERROR: messages for things that the software did not expect
>> - and therefore possibly could not be properly recovered from. In this
>> case, the behavior is perfectly expected and the condition will be
>> properly recovered from. It just means the operation in question failed.
>>
>> A sample message:
>> ERROR: process_lrm_event: LRM operation agent-da:3_monitor_5000 (47) Timed Out (timeout=60000ms)
>>
>> Because of this one message, you can't tell customers "If you ever have
>> an ERROR: message, the HA software has failed".
>>
>> This ought to just be a warning, like any other failed action...
> I guess that ERROR is used because resource agents use the same
> severity when reporting failures they cannot recover from. In
> this case, the RA won't log anything, so the lrmd does that on
> its behalf. That seems OK to me. The other option would be to
> remove the ERROR severity log messages in all RAs, because a
> resource problem should normally always be recoverable.

ERROR: messages should be reserved for things like "The
CRM gave me a command I didn't understand, or referenced a resource that
I don't know about" -- and similar things that really shouldn't happen.

Or that's how it seems to me anyway...
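
A minimal sketch of the severity policy proposed in this thread, written in
Python for illustration only. It is not the crmd or lrmd code: the outcome
names and the functions severity_for and log_op_result are hypothetical, and
the mapping simply assumes that failed or timed-out operations are expected
and recoverable, while unknown commands or resources are not.

```python
import logging
from enum import Enum, auto


class OpOutcome(Enum):
    """Hypothetical outcomes of a resource operation."""
    OK = auto()
    FAILED = auto()            # the operation ran and reported failure
    TIMED_OUT = auto()         # the operation exceeded its timeout
    UNKNOWN_COMMAND = auto()   # a command the software did not understand
    UNKNOWN_RESOURCE = auto()  # a reference to a resource it does not know


# Outcomes the software expects and will recover from (assumed policy).
_EXPECTED_FAILURES = {OpOutcome.FAILED, OpOutcome.TIMED_OUT}


def severity_for(outcome):
    """Map an operation outcome to a logging level under the assumed policy."""
    if outcome is OpOutcome.OK:
        return logging.INFO
    if outcome in _EXPECTED_FAILURES:
        return logging.WARNING
    # Anything else "really shouldn't happen", so it is logged as an error.
    return logging.ERROR


def log_op_result(logger, op_name, outcome):
    """Log the result of an operation and return the level that was used."""
    level = severity_for(outcome)
    logger.log(level, "LRM operation %s: %s", op_name, outcome.name)
    return level
```

Under this sketch, a monitor timeout such as the one in the sample message
would be logged at WARNING, so an ERROR in the log would still mean that
something unexpected happened.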


--
Alan Robertson <alanr@unix.sh> - @OSSAlanR

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
Re: LRM bug
On Tue, Aug 07, 2012 at 11:04:22PM -0600, Alan Robertson wrote:
> On 08/07/2012 08:18 AM, Dejan Muhamedagic wrote:
> > Hi Alan,
> >
> > On Mon, Jul 30, 2012 at 10:14:27AM -0600, Alan Robertson wrote:
> >> The LRM treats operation timeouts as ERROR:s - not just failed
> >> operations that give warnings. This violates the meaning of ERROR:
> >> messages in the code.
> >>
> >> We reserved ERROR: messages for things that the software did not expect
> >> - and therefore possibly could not be properly recovered from. In this
> >> case, the behavior is perfectly expected and the condition will be
> >> properly recovered from. It just means the operation in question failed.
> >>
> >> A sample message:
> >> ERROR: process_lrm_event: LRM operation agent-da:3_monitor_5000 (47) Timed Out (timeout=60000ms)
> >>
> >> Because of this one message, you can't tell customers "If you ever have
> >> an ERROR: message, the HA software has failed".
> >>
> >> This ought to just be a warning, like any other failed action...
> > I guess that ERROR is used because resource agents use the same
> > severity when reporting failures they cannot recover from. In
> > this case, the RA won't log anything, so the lrmd does that on
> > its behalf. That seems OK to me. The other option would be to
> > remove the ERROR severity log messages in all RAs, because a
> > resource problem should normally always be recoverable.
> The exceptions that print ERROR: should be relegated to things like "The
> CRM gave me a command I didn't understand, or referenced a resource that
> I don't know about" -- and similar things that really shouldn't happen.
>
> Or that's how it seems to me anyway...

It turns out that this message comes from the crmd, not the lrmd.
The lrmd itself issues just a warning.

I see your point, though I'd still be reluctant not to log an
error somewhere, because all other resource errors are logged at
that severity.

Cheers,

Dejan

