Mailing List Archive

external/vcenter: don't fail when a machine is powered off
Hello,

While testing a new cluster we found the following behavior, which I
discussed on #linux-ha with "andreask" afterwards. We both agree the
behavior is wrong.

Bug scenario:
A 3-node cluster: 2 active nodes, plus 1 standby node that exists only
to bring the node count to 3.
When we powered off one of the machines (similar to pulling the power
cable from a machine), the cluster failed to fail over to the next node.

This is because of the following setting:
RESETPOWERON was set to 0, so a powered-off machine stays powered off.

With the current code path, a machine in the poweredOff state is
treated as a failure of the stonith reset operation. As a result, no
resources are started on the second node, and the machine stays in an
unclean state.
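
To make the failing path concrete, here is a minimal Perl sketch that
models only the branch visible in the context lines of the patch quoted
later in this thread. The function name reset_outcome_before_patch and
the returned strings are hypothetical stand-ins: the real plugin calls
PowerOnVM() or dielog() instead of returning a value, and any code
around this branch is not modelled here.

```perl
use strict;
use warnings;

# Hypothetical model of the pre-patch branch. $resetpoweron is undef
# when RESETPOWERON is not set in the environment.
sub reset_outcome_before_patch {
    my ($power_state, $resetpoweron) = @_;
    if ($power_state eq "poweredOff"
        && (!defined $resetpoweron || $resetpoweron ne "0")) {
        return "power_on";    # the plugin would call PowerOnVM() here
    }
    return "fail";            # the plugin would call dielog() here
}

print reset_outcome_before_patch("poweredOff", undef), "\n";  # power_on
print reset_outcome_before_patch("poweredOff", "0"), "\n";    # fail
```

The second call is the situation described above: the machine is already
off, RESETPOWERON is 0, and the reset is reported as failed.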

The analogy with real hardware and a power bar, and IMHO the correct
behavior:
---
If I pull the plug of node 1, node 2 will fence it with the power bar.
The power bar will power-cycle the socket without any result, because I
pulled the plug. But the fencing operation is a success, and all
resources are started on the second node.
---

A patch to fix this, with what I hope is a minimal change, is attached.

After finding this bug I got ill and have to stay at home for a few
days, so I don't have access to an environment to test this patch at
the moment.


Regards

Robbert Müller
Re: external/vcenter: don't fail when a machine is powered off
Hi,

On Mon, Oct 22, 2012 at 11:06:07AM +0200, Robbert Muller wrote:
> This is because the following setting:
> RESETPOWERON was set to 0, so a machine powered off stays powered off

Just to make sure: RESETPOWERON was set to 0 in the configuration?

> Patch to fix this with i hope a minimal change is attached.

Thanks for the patch. But we'll need to rework it a bit.

> After finding this bug i got ill and have to stay at home for a few
> days, so i don't have access to an environment to test this patch atm.

Get better soon!

Cheers,

Dejan


> diff -r 66f7442698e6 lib/plugins/stonith/external/vcenter
> --- a/lib/plugins/stonith/external/vcenter Mon Oct 15 15:59:57 2012 +0200
> +++ b/lib/plugins/stonith/external/vcenter Mon Oct 22 10:38:09 2012 +0200
> @@ -199,6 +199,8 @@
>      if ($powerState eq "poweredOff" && (! exists $ENV{'RESETPOWERON'} || $ENV{'RESETPOWERON'} ne 0)) {
>          $vm->PowerOnVM();
>          system("ha_log.sh", "info", "Machine $esx:$vm->{'name'} has been powered on");
> +    } elsif( $powerState eq "poweredOff" ) {
> +        system("ha_log.sh", "info", "Machine $esx:$vm->{'name'} is poweredoff and RESETPOWERON was disabled");
>      } else {
>          dielog("Could not complete $esx:$vm->{'name'} power cycle");
>      }

> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/

Re: external/vcenter: don't fail when a machine is powered off
Hi,

On 23-10-12 13:13, Dejan Muhamedagic wrote:
> Just to make sure: RESETPOWERON was set to 0 in the configuration?
Yes it is.

> Thanks for the patch. But we'll need to rework it a bit.

Could you tell me what is wrong with it? I am currently testing it in
our customer's environment, and it seems to work as expected.

>
>> After finding this bug i got ill and have to stay at home for a few
>> days, so i don't have access to an environment to test this patch atm.
>
> Get better soon!

Thanks, the antibiotics seem to have killed the infection, so I'm back at work.


Regards

Robbert
Re: external/vcenter: don't fail when a machine is powered off
On Tue, Oct 23, 2012 at 01:19:53PM +0200, Robbert Müller wrote:
> > Just to make sure: RESETPOWERON was set to 0 in the configuration?
> Yes it is.

OK.

> > Thanks for the patch. But we'll need to rework it a bit.
>
> Could you tell me what is wrong with it? i am currently testing it on
> our customers environment. And it seems to work as expected.

Functionally there is nothing wrong with it. It's just that the extra
elsif repeats part of the condition of the previous if, which can make
the code harder to follow. Please see, and if possible test, the
attached patch.
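
The reworked patch itself is not reproduced in this archive. Purely as
an illustration of the point about the repeated condition, and not as
the attached patch, one way to test the power state only once is to nest
the RESETPOWERON check inside it. The function name reset_outcome_nested
and the returned strings are hypothetical; the real plugin would call
PowerOnVM(), ha_log.sh, or dielog() at these points.

```perl
use strict;
use warnings;

# Hypothetical model: $resetpoweron is undef when RESETPOWERON is unset.
sub reset_outcome_nested {
    my ($power_state, $resetpoweron) = @_;
    if ($power_state eq "poweredOff") {
        # The power state is checked once; only the flag decides below.
        if (!defined $resetpoweron || $resetpoweron ne "0") {
            return "power_on";        # would call PowerOnVM()
        }
        return "already_off_ok";      # would only log, reset succeeds
    }
    return "fail";                    # would call dielog()
}

# "suspended" stands in for any other state that reaches this branch.
for my $case (["poweredOff", undef], ["poweredOff", "0"], ["suspended", "0"]) {
    my ($state, $flag) = @$case;
    printf "%-10s RESETPOWERON=%-5s -> %s\n",
        $state, defined $flag ? $flag : "unset",
        reset_outcome_nested($state, $flag);
}
```

The outcomes are the same as with the elsif version of the patch; only
the shape of the conditional differs.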

Cheers,

Dejan

