Mailing List Archive

IPMI stonith resource gets stuck
Hi,

I'm testing a 2-node Corosync (1.4.6) and Pacemaker (1.1.10+git20130802)
cluster on Debian 8.0 and having some problems with the stonith resources.

I've set up two external/ipmi resources on each node and wanted to test
how they would react by physically unplugging the IPMI device network
interfaces.

On the DC, no problem: the resource monitor fails, the stop op succeeds and,
due to the location constraints, the resource enters the Stopped state and
stays there, as expected. After replugging the network cable and cleaning up
the resource, it returns to its normal state.

On the slave node, the scenario is different: after the monitor op fails, the
stop op also fails for an unknown reason. The cluster then retries the stop
operation unsuccessfully until I make the node enter and exit standby mode.
Replugging the network cable on the IPMI device has no effect.

At least, that's what I figure is happening from these logs:

DC: http://pastebin.com/raw.php?i=QpwG6nea
Slave: http://pastebin.com/raw.php?i=3nesX8yJ
Config: http://pastebin.com/raw.php?i=3FrJuwWz

Any help tracking down the issue would be much appreciated.

Thanks!

--
Jérôme Charaoui
IT Technician
Collège de Maisonneuve


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: IPMI stonith resource gets stuck
Hi,

On Wed, Jan 28, 2015 at 01:53:17PM -0500, Jérôme Charaoui wrote:
> Hi,
>
> I'm testing a 2-node Corosync (1.4.6) and Pacemaker
> (1.1.10+git20130802) cluster on Debian 8.0 and having some problems
> with the stonith resources.
>
> I've set up two external/ipmi resources on each node and wanted to
> test how they would react by physically unplugging the IPMI device
> network interfaces.
>
> On the DC, no problem, the resource monitor fails, stop op succeeds
> and due to location constraints, as expected the resource enters the
> stop state and stays there. After replugging the network cable and
> cleaning up the resource, it gets restored to normal state.
>
> On the slave node, different scenario: after monitor op fails, stop
> op also fails for an unknown reason. The cluster then retries the

The stop operation for stonith devices does not involve the
device at all; it is purely a stonithd-internal operation, something
like "disable resource". From the "slave" logs, after a non-fatal assert,

Jan 28 12:04:22 [31422] scatlas01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 15705 to record non-fatal assert at logging.c:73 : Source ID 63 was not found when attempting to remove it

stonithd exits:

Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: st_child_term: Child 16540 timed out, sending SIGTERM
Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: stonith_shutdown: Terminating with 2 clients

Apparently, a number of stop operations were started for the same
resource, and they all exited (or were cancelled) around
12:29:09. There was probably some confusion in lrmd after
stonithd left. In short, you ran into a bug, but I suspect that
bug has been fixed in the meantime.

Beekhof and David Vossel should know.

Thanks,

Dejan

> stop operation unsuccessfully until I have the node enter/exit
> standby mode. Replugging the network cable on the IPMI device has no
> effect.
>
> At least, that's what I figure is happening from these logs:
>
> DC: http://pastebin.com/raw.php?i=QpwG6nea
> Slave: http://pastebin.com/raw.php?i=3nesX8yJ
> Config: http://pastebin.com/raw.php?i=3FrJuwWz
>
> Any help tracking down the issue would be much appreciated.
>
> Thanks!
>
> --
> Jérôme Charaoui
> Technicien informatique
> Collège de Maisonneuve
>
>

Re: IPMI stonith resource gets stuck
On 2015-01-30 07:49, Dejan Muhamedagic wrote:
> Hi,
>
> On Wed, Jan 28, 2015 at 01:53:17PM -0500, Jérôme Charaoui wrote:
>> Hi,
>>
>> I'm testing a 2-node Corosync (1.4.6) and Pacemaker
>> (1.1.10+git20130802) cluster on Debian 8.0 and having some problems
>> with the stonith resources.
>>
>> I've set up two external/ipmi resources on each node and wanted to
>> test how they would react by physically unplugging the IPMI device
>> network interfaces.
>>
>> On the DC, no problem, the resource monitor fails, stop op succeeds
>> and due to location constraints, as expected the resource enters the
>> stop state and stays there. After replugging the network cable and
>> cleaning up the resource, it gets restored to normal state.
>>
>> On the slave node, different scenario: after monitor op fails, stop
>> op also fails for an unknown reason. The cluster then retries the
>
> The stop operation for stonith devices does not involve the
> device at all, it's just stonithd operation, something like
> "disable resource". From the "slave" logs, after some abort,
>
> Jan 28 12:04:22 [31422] scatlas01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 15705 to record non-fatal assert at logging.c:73 : Source ID 63 was not found when attempting to remove it
>
> stonithd exits:
>
> Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: st_child_term: Child 16540 timed out, sending SIGTERM
> Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
> Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: stonith_shutdown: Terminating with 2 clients
>
> Apparently, a number of stop operations were started for the same
> resource, and they all exited (or were cancelled) around
> 12:29:09. There was probably some confusion in lrmd after
> stonithd left.

Thank you for looking at this, much appreciated.

The timeout issue intrigued me because I had noticed that ipmitool
sometimes takes over 10 seconds when attempting to execute an action
against a non-responding IPMI device over the lanplus interface.

So I had a look at the ipmi stonith plugin code and the ipmitool manpage
itself and noticed this little gem in the latter:

-R <count> Set the number of retries for lan/lanplus interface
(default=4).

I then added "-R 1" to the plugin's ipmitool_opts variable, and my
problem went away!
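For anyone wanting to try the same workaround, the change amounts to appending
the flag inside the options string that the external/ipmi plugin script builds.
This is only a sketch: the installed script's path and the exact contents of
ipmitool_opts vary by distribution, so the edit is demonstrated here on a
sample line rather than on the real file.

```shell
# Sketch only: the real plugin lives under the stonith plugin directory
# (e.g. /usr/lib/stonith/plugins/external/ipmi on Debian -- path assumed).
# The sample line below is an assumed shape of the variable; check your copy.
sample='ipmitool_opts="-I lanplus"'

# Append "-R 1" inside the closing quote so ipmitool gives up after one
# attempt instead of the default four lanplus retries, which can push a
# dead BMC well past the monitor/stop timeouts.
patched=$(printf '%s\n' "$sample" | sed 's/"$/ -R 1"/')
printf '%s\n' "$patched"   # ipmitool_opts="-I lanplus -R 1"
```

Remember that a subsequent package upgrade may overwrite a hand-edited plugin
script, so the change needs to be reapplied or carried as a local patch.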


> In short, you ran into a bug, but I guess that
> that bug got fixed in the meantime.

This bug report seems like a match:
https://github.com/ClusterLabs/pacemaker/pull/334

If I'm reading the changelog correctly, this fix was released in
1.1.12, correct?


> Beekhof and David Vossel should know.
>
> Thanks,
>
> Dejan
>
>> stop operation unsuccessfully until I have the node enter/exit
>> standby mode. Replugging the network cable on the IPMI device has no
>> effect.
>>
>> At least, that's what I figure is happening from these logs:
>>
>> DC: http://pastebin.com/raw.php?i=QpwG6nea
>> Slave: http://pastebin.com/raw.php?i=3nesX8yJ
>> Config: http://pastebin.com/raw.php?i=3FrJuwWz
>>
>> Any help tracking down the issue would be much appreciated.
>>
>> Thanks!
>>
>> --
>> Jérôme Charaoui
>> Technicien informatique
>> Collège de Maisonneuve



Re: IPMI stonith resource gets stuck
On 01/30/2015 05:03 PM, Jérôme Charaoui wrote:
>
> Thank you for looking at this, much appreciated.
>
> The timeout issue intrigued me because I had noticed that ipmitool
> sometimes takes over 10 seconds when attempting to execute an action
> against a non-responding IPMI device over the lanplus interface.
>
> So I had a look at the ipmi stonith plugin code and the ipmitool
> manpage itself and noticed this little gem in the latter:
>
> -R <count> Set the number of retries for lan/lanplus interface
> (default=4).
>
> I then went ahead and added "-R 1" in the plugin's ipmitool_opts
> variable, and my problem went away!

If you use the fence agent fence_ipmilan, you can set this with the
retry_on parameter (or --retry-on X when invoking it directly).
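For example, a fence_ipmilan resource with a single retry could be configured
roughly as below. This is a hedged sketch: the resource name, node name, IP
address and credentials are all placeholders, and the parameter spellings
should be checked against the fence_ipmilan version actually installed.

```shell
# Hypothetical configuration -- names, address and credentials are
# placeholders, not values from this thread.
# retry_on=1 limits fence_ipmilan to a single attempt so an unreachable
# BMC fails fast instead of stalling the monitor/stop operations.
pcs stonith create fence-node1 fence_ipmilan \
    ipaddr=10.0.0.11 login=admin passwd=secret lanplus=1 \
    retry_on=1 pcmk_host_list=node1
```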

m,
