Mailing List Archive

stonith q
Hi

2-node cluster, running under VMware

CentOS 6.5

pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64


This is what I have in /etc/cluster/cluster.conf:

<fencedevices>
<fencedevice agent="fence_pcmk" name="pcmk"/>
</fencedevices>


And 'pcs config' shows:
stonith-enabled: false

How do I configure stonith to do an OS reboot, if I want to use that?

I found some SUSE documentation about a suicide device:
http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.fencing.html

but I am guessing that is something deprecated or SUSE-specific.

Alex

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: stonith q
On 01/11/14 06:27 PM, Alex Samad - Yieldbroker wrote:
> Hi
>
> 2 node cluster, running under vmware
>
> Centos 6.5
>
> pacemaker-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-cli-1.1.10-14.el6_5.3.x86_64
> pacemaker-1.1.10-14.el6_5.3.x86_64
>
>
> this is what I have in /etc/cluster/cluster.conf
>
> <fencedevices>
> <fencedevice agent="fence_pcmk" name="pcmk"/>
> </fencedevices>
>
>
> And pcs config
> stonith-enabled: false
>
> How do I configure stonith to do a os reboot if I want to use that.
>
> I found some suse documentation to a suicide drive
> http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.fencing.html
>
> but I am guess that is something deprecated or suse specific
>
> Alex

In cman's cluster.conf, you configure the fence device 'fence_pcmk', as
you have. That is a dummy/hook fence agent that simply passes fence
requests up to pacemaker to actually perform. Pacemaker will then tell
cman whether the fence succeeded or failed.

To make sure you have cluster.conf configured properly, it should look
something like this:

====
ccs -f /etc/cluster/cluster.conf --createcluster an-anvil-04
ccs -f /etc/cluster/cluster.conf --setcman two_node="1" expected_votes="1"
ccs -f /etc/cluster/cluster.conf --addnode an-a04n01.alteeve.ca
ccs -f /etc/cluster/cluster.conf --addnode an-a04n02.alteeve.ca
ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect an-a04n01.alteeve.ca
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect an-a04n02.alteeve.ca
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk an-a04n01.alteeve.ca pcmk-redirect port=an-a04n01.alteeve.ca
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk an-a04n02.alteeve.ca pcmk-redirect port=an-a04n02.alteeve.ca
ccs -f /etc/cluster/cluster.conf --setfencedaemon post_join_delay="30"
cat /etc/cluster/cluster.conf
====
<cluster config_version="10" name="an-anvil-04">
  <fence_daemon post_join_delay="30"/>
  <clusternodes>
    <clusternode name="an-a04n01.alteeve.ca" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="an-a04n01.alteeve.ca"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="an-a04n02.alteeve.ca" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="an-a04n02.alteeve.ca"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
====

Then you move over to pacemaker and configure stonith there. How you do
this will vary a bit for your fence agent. I use IPMI fencing, which
looks like this:

====
pcs cluster cib stonith_cfg
pcs -f stonith_cfg stonith create fence_n01_ipmi fence_ipmilan \
    pcmk_host_list="an-a04n01.alteeve.ca" ipaddr="an-a04n01.ipmi" \
    action="reboot" login="admin" passwd="Initial1" delay=15 \
    op monitor interval=10s
pcs -f stonith_cfg stonith create fence_n02_ipmi fence_ipmilan \
    pcmk_host_list="an-a04n02.alteeve.ca" ipaddr="an-a04n02.ipmi" \
    action="reboot" login="admin" passwd="Initial1" \
    op monitor interval=10s
pcs cluster cib-push stonith_cfg
pcs property set stonith-enabled=true
====

Hope this helps.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

Re: stonith q
> -----Original Message-----
> From: Digimer [mailto:lists@alteeve.ca]
> Sent: Sunday, 2 November 2014 9:49 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] stonith q
>
> On 01/11/14 06:27 PM, Alex Samad - Yieldbroker wrote:
> > Hi
> >
> > 2 node cluster, running under vmware
{snip}
> >
> > Alex
>
> In cman's cluster.conf, you configure the fence device 'fence_pcmk', as you
> have. That is a dummy/hook fence agent that simply passes fence requests
> up to pacemaker to actually perform. Pacemaker will then tell cman whether
> the fence succeeded or failed.
>
> To make sure you have cluster.conf configured properly, it should look
> something like this;
>
> ====
> ccs -f /etc/cluster/cluster.conf --createcluster an-anvil-04
> ccs -f /etc/cluster/cluster.conf --setcman two_node="1" expected_votes="1"
> ccs -f /etc/cluster/cluster.conf --addnode an-a04n01.alteeve.ca
> ccs -f /etc/cluster/cluster.conf --addnode an-a04n02.alteeve.ca
> ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
> ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect an-a04n01.alteeve.ca
> ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect an-a04n02.alteeve.ca
> ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk an-a04n01.alteeve.ca pcmk-redirect port=an-a04n01.alteeve.ca
> ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk an-a04n02.alteeve.ca pcmk-redirect port=an-a04n02.alteeve.ca
> ccs -f /etc/cluster/cluster.conf --setfencedaemon post_join_delay="30"
> cat /etc/cluster/cluster.conf
> ====
> <cluster config_version="10" name="an-anvil-04">
> <fence_daemon post_join_delay="30"/>
> <clusternodes>
> <clusternode name="an-a04n01.alteeve.ca" nodeid="1">
> <fence>
> <method name="pcmk-redirect">
> <device name="pcmk" port="an-a04n01.alteeve.ca"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="an-a04n02.alteeve.ca" nodeid="2">
> <fence>
> <method name="pcmk-redirect">
> <device name="pcmk" port="an-a04n02.alteeve.ca"/>
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <cman expected_votes="1" two_node="1"/>
> <fencedevices>
> <fencedevice agent="fence_pcmk" name="pcmk"/>
> </fencedevices>
> <rm>
> <failoverdomains/>
> <resources/>
> </rm>
> </cluster>
> ====

[Alex Samad - Yieldbroker]
Actually I do have that; I just pasted the initial bit in my mail.

>
> Then you move over to pacemaker and configure stonith there. How you do
> this will vary a bit for your fence agent. I use IPMI fencing, which looks like
> this:
>
> ====
> pcs cluster cib stonith_cfg
> pcs -f stonith_cfg stonith create fence_n01_ipmi fence_ipmilan pcmk_host_list="an-a04n01.alteeve.ca" ipaddr="an-a04n01.ipmi" action="reboot" login="admin" passwd="Initial1" delay=15 op monitor interval=10s
> pcs -f stonith_cfg stonith create fence_n02_ipmi fence_ipmilan pcmk_host_list="an-a04n02.alteeve.ca" ipaddr="an-a04n02.ipmi" action="reboot" login="admin" passwd="Initial1" op monitor interval=10s
> pcs cluster cib-push stonith_cfg
> pcs property set stonith-enabled=true
> ====
[Alex Samad - Yieldbroker]

The problem is that this uses IPMI, and I don't want to use that.

I would like to translate a stonith shutdown/reboot into an OS reboot command.

Thanks


Re: stonith q
On Sun, 2 Nov 2014 10:01:59 +0000, Alex Samad - Yieldbroker <Alex.Samad@yieldbroker.com> wrote:

> {snip}
> [Alex Samad - Yieldbroker]
>
> The problem is that this uses IPMI, and I don't want to use that.
>
> I would like to translate a stonith shutdown/reboot into an OS reboot command.
>

That hardly makes sense except in a pure test environment. Stonith is
needed when you do not know the state of the partner node, in which case
you cannot be sure your reboot/shutdown command will be executed, nor
that you can reach your partner at all.

If you are running under VMware, use stonith/vmware or stonith/vcenter.
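
On CentOS, the rough equivalent would be fence_vmware_soap, if your fence-agents package ships it. A minimal sketch only; the vCenter address, credentials, VM names and node names below are placeholders you would have to replace:

====
pcs cluster cib stonith_vmw_cfg
pcs -f stonith_vmw_cfg stonith create fence_node1_vmw fence_vmware_soap \
    ipaddr="vcenter.example.com" ssl="1" login="fenceuser" passwd="secret" \
    port="node1-vm-name" pcmk_host_list="node1.example.com" op monitor interval=60s
pcs -f stonith_vmw_cfg stonith create fence_node2_vmw fence_vmware_soap \
    ipaddr="vcenter.example.com" ssl="1" login="fenceuser" passwd="secret" \
    port="node2-vm-name" pcmk_host_list="node2.example.com" op monitor interval=60s
pcs cluster cib-push stonith_vmw_cfg
pcs property set stonith-enabled=true
====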

Re: stonith q
On 02/11/14 06:45 AM, Andrei Borzenkov wrote:
> {snip}
>
> That hardly makes sense except in pure test environment. Stonith is
> needed when you do not know state of partner node, in which case you
> cannot be sure your reboot/shutdown command will be executed, nor that
> you can reach your partner at all.
>
> If you are running under Vmware, use stonith/vmware or stonith/vcenter.

Andrei is correct. A stonith method must be external to the node and
work regardless of the state of the node. Try this: 'echo c >
/proc/sysrq-trigger' will crash the node. Any stonith method that
requires the OS to respond will fail and your cluster will hang.
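
A quick way to test that (a sketch only; the node name is a placeholder):

====
# On the node you want to kill (this hard-crashes the kernel immediately):
echo c > /proc/sysrq-trigger

# On the surviving node, watch the fence request and its result:
tail -f /var/log/messages

# You can also trigger a fence by hand from the survivor:
fence_node node1.example.com              # cman's fencing tool
stonith_admin --reboot node1.example.com  # pacemaker's stonith tool
====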

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

Re: stonith q
> -----Original Message-----
> From: Digimer [mailto:lists@alteeve.ca]
> Sent: Monday, 3 November 2014 3:26 AM
> To: The Pacemaker cluster resource manager; Alex Samad - Yieldbroker
> Subject: Re: [Pacemaker] stonith q
>
> On 02/11/14 06:45 AM, Andrei Borzenkov wrote:
> > On Sun, 2 Nov 2014 10:01:59 +0000, Alex Samad - Yieldbroker <Alex.Samad@yieldbroker.com> wrote:
> >
> >>
> >>
> >>> -----Original Message-----
> >>> From: Digimer [mailto:lists@alteeve.ca]
> >>> Sent: Sunday, 2 November 2014 9:49 AM
> >>> To: The Pacemaker cluster resource manager
> >>> Subject: Re: [Pacemaker] stonith q
> >>>
> >>> On 01/11/14 06:27 PM, Alex Samad - Yieldbroker wrote:
> >>>> Hi
{snip}
> >
> > That hardly makes sense except in pure test environment. Stonith is
> > needed when you do not know state of partner node, in which case you
> > cannot be sure your reboot/shutdown command will be executed, nor that
> > you can reach your partner at all.
> >
> > If you are running under Vmware, use stonith/vmware or stonith/vcenter.
>
> Andrei is correct. A stonith method must be external to the node and work
> regardless of the state of a node. Try this; 'echo c > /proc/sysrq-trigger' will
> crash the node. Any stonith method that requires the OS to respond will fail
> and your cluster will hang.

Yes, but VMware will restart the node in that circumstance.
I have had issues with my 2-node cluster where one node will remove itself because of lack of communication. It is very hard to track down, because it only happens every now and then, and only at 1am; I believe it is caused by backup traffic, or because the VM is starved of CPU cycles.

What I would like to see happen in that situation is for a reboot to be issued. I know that the node would respond, and I know that it would reconnect.

I read that there was a suicide option module that did what I wanted, but it's not available.

I don't want to set up a user ID for VMware for each node and configure that. I just want fenced to do a reboot via the OS of the node instead of just killing cman.

What I am hearing is that it's not available. Is it possible to hook a custom script to that event? I can write my own restart.


>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is
> trapped in the mind of a person without access to education?
Re: stonith q
On Sun, 2 Nov 2014 20:47:22 +0000, Alex Samad - Yieldbroker <Alex.Samad@yieldbroker.com> wrote:

> {snip}
>
> Yes but vmware will restart the node in that circumstance.
> I have had issues with my 2 node cluster where 1 node will remove itself because of lack of communication. Very hard to track down when it happens every now and then and only at 1am, I believe because of backup traffic or because its starved of cpu cycles.
>
> What I would like to see happen in that situation if for a reboot to be issued, I know that the node would respond and I know that it would reconnect.
>
> I read that there was a suicide option module that did what I wanted but its not available.
>
> I don't want to setup useid for vmware for each node and configure that. I just want fenced to do a reboot via the os of the node instead of just killing cman
>
> What I am hearing is that its not available. Is it possible to hook to a custom script on that event, I can write my own restart
>

Sure, you can write your own external stonith script.

Re: stonith q
{snip}
> > What I am hearing is that its not available. Is it possible to hook to
> > a custom script on that event, I can write my own restart
> >
>
> Sure you can write your own external stonith script.


Any pointers to a framework somewhere?
Does fenced have any handlers? I notice it logs a message in syslog and the cluster log; is there a chance to capture the event there?

A
Re: stonith q
On Mon, 3 Nov 2014 07:07:41 +0000, Alex Samad - Yieldbroker <Alex.Samad@yieldbroker.com> wrote:

> {snip}
> > > What I am hearing is that its not available. Is it possible to hook to
> > > a custom script on that event, I can write my own restart
> > >
> >
> > Sure you can write your own external stonith script.
>
>
> Any pointers to a frame work somewhere ?

I do not think there is a formal stonith agent developers' guide; take a
look at any existing agent like external/ipmi and modify it to suit your needs.

> Does fenced have any handlers, I notice it logs a message in syslog and cluster log is there a chance to capture the event there ?

I do not have experience with RH CMAN, sorry. But from what I
understand fenced and stonithd agents are compatible.

>
> A


Re: stonith q
On 04/11/14 03:55 AM, Andrei Borzenkov wrote:
> On Mon, 3 Nov 2014 07:07:41 +0000, Alex Samad - Yieldbroker <Alex.Samad@yieldbroker.com> wrote:
>
>> {snip}
>>>> What I am hearing is that its not available. Is it possible to hook to
>>>> a custom script on that event, I can write my own restart
>>>>
>>>
>>> Sure you can write your own external stonith script.
>>
>>
>> Any pointers to a frame work somewhere ?
>
> I do not think there is any formal stonith agent developers guide; take
> at any existing agent like external/ipmi and modify to suite your needs.
>
>> Does fenced have any handlers, I notice it logs a message in syslog and cluster log is there a chance to capture the event there ?
>
> I do not have experience with RH CMAN, sorry. But from what I
> understand fenced and stonithd agents are compatible.

https://fedorahosted.org/cluster/wiki/FenceAgentAPI

Note the return codes. Also, not listed there, is the requirement that
an agent print its XML validation data. You can see an example of what
this looks like by calling 'fence_ipmilan -o metadata' (or any other
fence_* agent).
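
Very roughly, the contract looks like this. This is an illustrative sketch only, not a tested agent; the name "fence_osreboot" is made up, and the ssh-based reboot has exactly the weakness discussed earlier, i.e. it only works while the target OS still answers:

====
#!/bin/bash
# Sketch of a FenceAgentAPI-style agent ("fence_osreboot" is a made-up name).
# fenced/stonithd pass options as key=value lines on stdin, e.g.:
#   action=reboot
#   port=node1.example.com
# Exit 0 means the fence succeeded, anything else means it failed.

action="reboot"
port=""

# Read key=value pairs from stdin.
while read -r line; do
    case "$line" in
        action=*) action="${line#action=}" ;;
        port=*)   port="${line#port=}" ;;
    esac
done

case "$action" in
    metadata)
        # Agents must be able to describe themselves in XML.
        cat <<EOF
<?xml version="1.0" ?>
<resource-agent name="fence_osreboot" shortdesc="Reboot a node via its own OS (sketch)">
  <parameters>
    <parameter name="port">
      <content type="string"/>
      <shortdesc lang="en">Name or address of the node to reboot</shortdesc>
    </parameter>
  </parameters>
  <actions>
    <action name="reboot"/>
    <action name="status"/>
    <action name="metadata"/>
  </actions>
</resource-agent>
EOF
        exit 0
        ;;
    status|monitor)
        ping -c 1 -w 2 "$port" >/dev/null 2>&1
        exit $?
        ;;
    reboot|off)
        # The weak point: this only works while the target OS still answers.
        # 'reboot -f' drops the ssh session, so the exit status is unreliable;
        # a real agent would verify the node actually went down before exiting 0.
        ssh -o ConnectTimeout=5 "root@$port" "reboot -f"
        exit 0
        ;;
    *)
        exit 1
        ;;
esac
====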

For the record, I think this is a bad idea.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

Re: stonith q
{snip}
> >> Any pointers to a frame work somewhere ?
> >
> > I do not think there is any formal stonith agent developers guide;
> > take at any existing agent like external/ipmi and modify to suite your
> needs.
> >
> >> Does fenced have any handlers, I notice it logs a message in syslog and
> cluster log is there a chance to capture the event there ?
> >
> > I do not have experience with RH CMAN, sorry. But from what I
> > understand fenced and stonithd agents are compatible.
>
> https://fedorahosted.org/cluster/wiki/FenceAgentAPI


Thanks

>
> Note the return codes. Also, not listed there, is the requirement that an
> agent print it's XML validation data. You can see example of what this looks
> like by calling 'fence_ipmilan -o metadata' (or any other
> fence_* agent).
>
> For the record, I think this is a bad idea.

So lots of people have said this is a bad idea, and maybe I am misunderstanding something.

From my observation of my 2-node cluster, when inter-cluster comms has an issue, one node kills the other node.
Let's say A + B.
A is currently running the resources, and B gets elected to die.
A signal is sent cman -> PK -> stonithd.

From the logs on server B I see fenced trying to kill server B, but I don't use any cman/stonith agents. I would like to capture that event and use an OS reboot.

So the problem I perceive is if server B is in a state where it can't run (OS locked up or crashed). I believe VMware will look after that; from experience I have seen it deal with that.

The issue is if B is running just enough that the VIP (one of the resources that PK looks after) is still up on B as well as on A, and B can't or will not shut down via the OS. I understand that, but I would still like to attempt a reboot at that point.

I have found a simpler solution: I actively poll to check whether the cluster is okay. I would prefer to fire a script on an event, but...
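
Roughly, the poll is just a cron job along these lines (a sketch of the idea only; the script path and interval are arbitrary, and of course it only helps while the OS is still healthy enough to run cron and reboot):

====
#!/bin/bash
# /usr/local/sbin/cluster-selfcheck (hypothetical path), run from cron, e.g.:
#   */5 * * * * root /usr/local/sbin/cluster-selfcheck

# If cman has gone away, reboot ourselves.
if ! cman_tool status >/dev/null 2>&1; then
    logger -t cluster-selfcheck "cman is down, rebooting"
    /sbin/reboot -f
fi

# If pacemaker is not answering, reboot ourselves.
if ! crm_mon -1 >/dev/null 2>&1; then
    logger -t cluster-selfcheck "pacemaker not responding, rebooting"
    /sbin/reboot -f
fi
====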

I'm also looking into why there is a comms problem, as it's 2 VMs on the same host on the same network. I think it's starvation of CPU cycles, as it's a dev setup.


>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is
> trapped in the mind of a person without access to education?
Re: stonith q
On 04/11/14 02:45 PM, Alex Samad - Yieldbroker wrote:
> {snip}
>>>> Any pointers to a frame work somewhere ?
>>>
>>> I do not think there is any formal stonith agent developers guide;
>>> take at any existing agent like external/ipmi and modify to suite your
>> needs.
>>>
>>>> Does fenced have any handlers, I notice it logs a message in syslog and
>> cluster log is there a chance to capture the event there ?
>>>
>>> I do not have experience with RH CMAN, sorry. But from what I
>>> understand fenced and stonithd agents are compatible.
>>
>> https://fedorahosted.org/cluster/wiki/FenceAgentAPI
>
>
> Thanks
>
>>
>> Note the return codes. Also, not listed there, is the requirement that an
>> agent print it's XML validation data. You can see example of what this looks
>> like by calling 'fence_ipmilan -o metadata' (or any other
>> fence_* agent).
>>
>> For the record, I think this is a bad idea.
>
> So lots of people have said this is bad idea and maybe I am miss understanding something.
>
> From my observation of my 2 node cluster, when inter cluster comms has an issues 1 node kills the other node.
> Lets say A + B.
> A is currently running the resources, B get elected to die.

Nothing is "selected". Both nodes will initiate a fence, but if you set
'delay="15"' for the node "A" fence method, the node B will pause for 15
seconds before acting on the fence request. If node A saw no delay on
node B, it will immediately proceed with the fence action. In this way,
node A will always be faster than node B, so node B will always lose in
a fence race like this.
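
If you created the stonith resources without it, the delay can be added afterwards with something like this (assuming your EL6 pcs version has the 'stonith update' subcommand; the resource name is from my earlier example):

====
pcs stonith update fence_n01_ipmi delay=15
====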

> A signal is sent cman -> PK -> stonithd

Correct (basically).

> From the logs on server B I see fenced trying to kill server B, but I don't use any cman/stonith agents. I would like to capture that event and use a OS reboot.

Then use a fabric fence method. These are ones where the network
connection(s) to the target node is (are) severed. Thus, node B will sit
there perpetually trying to fence node A, but failing because it can't
talk to its fence device (network switch, etc.). Then a human can come
in, examine the system, reboot the node, and unfence it once it has
rebooted, restoring its network connections.

I created a proof of concept fence agent doing this with D-Link switches:

https://github.com/digimer/fence_dlink_snmp

It should be easy enough to adapt it to, say, call the hypervisor/host
and use brctl to detach the virtual interfaces from the VM.
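
On a Linux-bridge host, the fence/unfence operations boil down to something like this (illustration only; the bridge and tap interface names are made up, and this does not translate directly to a VMware vSwitch):

====
# Fence: cut the VM off by pulling its tap device out of the bridge on the host.
brctl delif br0 vnet2

# Unfence: re-attach it once a human has checked and rebooted the VM.
brctl addif br0 vnet2
====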

Or, more easily, stick with power fencing and use an external log server.

> So the problem I perceive is if server B is in a state where it can't run OS locked up or crashed. I believe VMware will look after that, from experience I have seen it deal with that

I'm not sure I understand... I don't use VMWare, so maybe I am missing
something. If the node stops all processing, then it's possible the node
will be detected as faulty and will be rebooted. However, there are many
ways that nodes can fail. Secondly, unless something tells pacemaker
that the node is dead, it won't know and is not allowed to assume.

> The issue is if B is running enough to still have a VIP (one of the resources that PK looks after) is still on B and A and B can't or will not shutdown via the OS. I understand that, but I would like still attempt to reboot at that time

Your mistake here is assuming that the node will be operating in a
defined state. The whole idea of fencing is to put a node that is in an
unknown state into a known state. To do that, you must be able to fence
from entirely outside the node itself. If you depend on the node behaving
at all, your approach is flawed.

> I have found a simpler solution I actively poll to check if the cluster is okay. I would prefer to fire a script on an event but ..
>
> I'm also looking into why there is a comms problem as its 2 vm's on the same host on the same network, I think its starvation of cpu cycles as it’s a dev setup.

Why things went wrong is entirely secondary to fencing.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

Re: stonith q
> -----Original Message-----
> From: Digimer [mailto:lists@alteeve.ca]
> Sent: Wednesday, 5 November 2014 8:54 AM
> To: Alex Samad - Yieldbroker; Andrei Borzenkov
> Cc: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] stonith q
>
> On 04/11/14 02:45 PM, Alex Samad - Yieldbroker wrote:
> > {snip}
> >>>> Any pointers to a frame work somewhere ?
> >>>
> >>> I do not think there is any formal stonith agent developers guide;
> >>> take at any existing agent like external/ipmi and modify to suite
> >>> your
> >> needs.
> >>>
> >>>> Does fenced have any handlers, I notice it logs a message in syslog
> >>>> and
> >> cluster log is there a chance to capture the event there ?
> >>>
> >>> I do not have experience with RH CMAN, sorry. But from what I
> >>> understand fenced and stonithd agents are compatible.
> >>
> >> https://fedorahosted.org/cluster/wiki/FenceAgentAPI
> >
> >
> > Thanks
> >
> >>
> >> Note the return codes. Also, not listed there, is the requirement
> >> that an agent print it's XML validation data. You can see example of
> >> what this looks like by calling 'fence_ipmilan -o metadata' (or any
> >> other
> >> fence_* agent).
> >>
> >> For the record, I think this is a bad idea.
> >
> > So lots of people have said this is bad idea and maybe I am miss
> understanding something.
> >
> > From my observation of my 2 node cluster, when inter cluster comms has
> an issues 1 node kills the other node.
> > Lets say A + B.
> > A is currently running the resources, B get elected to die.
>
> Nothing is "selected". Both nodes will initiate a fence, but if you set
> 'delay="15"' for the node "A" fence method, the node B will pause for 15
> seconds before acting on the fence request. If node A saw no delay on node
> B, it will immediately proceed with the fence action. In this way, node A will
> always be faster than node B, so node B will always lose in a fence race like
> this.

Okay, maybe I am reading this wrong. Here is an example of what happened last night:

demorp1
=======
Nov 4 23:21:34 demorp1 corosync[23415]: [TOTEM ] A processor failed, forming new configuration.
Nov 4 23:21:36 demorp1 corosync[23415]: [CMAN ] quorum lost, blocking activity
Nov 4 23:21:36 demorp1 corosync[23415]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 4 23:21:36 demorp1 corosync[23415]: [QUORUM] Members[1]: 1
Nov 4 23:21:36 demorp1 corosync[23415]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 4 23:21:36 demorp1 corosync[23415]: [CPG ] chosen downlist: sender r(0) ip(10.172.218.51) ; members(old:2 left:1)
Nov 4 23:21:36 demorp1 corosync[23415]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 4 23:21:37 demorp1 corosync[23415]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 4 23:21:37 demorp1 corosync[23415]: [CPG ] chosen downlist: sender r(0) ip(10.172.218.51) ; members(old:1 left:0)
Nov 4 23:21:37 demorp1 corosync[23415]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 4 23:21:37 demorp1 kernel: dlm: closing connection to node 2
Nov 4 23:21:37 demorp1 corosync[23415]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 4 23:21:37 demorp1 corosync[23415]: [CMAN ] quorum regained, resuming activity
Nov 4 23:21:37 demorp1 corosync[23415]: [QUORUM] This node is within the primary component and will provide service.
Nov 4 23:21:37 demorp1 corosync[23415]: [QUORUM] Members[2]: 1 2
Nov 4 23:21:37 demorp1 corosync[23415]: [QUORUM] Members[2]: 1 2
Nov 4 23:21:37 demorp1 corosync[23415]: [CPG ] chosen downlist: sender r(0) ip(10.172.218.51) ; members(old:1 left:0)
Nov 4 23:21:37 demorp1 corosync[23415]: [MAIN ] Completed service synchronization, ready to provide service.

>>>>> I read this to mean that demorp2 killed this node >>> Nov 4 23:21:37 demorp1 corosync[23415]: cman killed by node 2 because we were killed by cman_tool or other application

Nov 4 23:21:37 demorp1 pacemakerd[24093]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 4 23:21:37 demorp1 pacemakerd[24093]: error: mcp_cpg_destroy: Connection destroyed
Nov 4 23:21:37 demorp1 stonith-ng[24100]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 4 23:21:37 demorp1 dlm_controld[23497]: cluster is down, exiting
Nov 4 23:21:37 demorp1 fenced[23483]: cluster is down, exiting

>>> This is what I would like to capture and do something with


Nov 4 23:21:37 demorp1 fenced[23483]: daemon cpg_dispatch error 2
Nov 4 23:21:37 demorp1 fenced[23483]: cpg_dispatch error 2
Nov 4 23:21:37 demorp1 gfs_controld[23559]: cluster is down, exiting
Nov 4 23:21:37 demorp1 gfs_controld[23559]: daemon cpg_dispatch error 2
Nov 4 23:21:37 demorp1 stonith-ng[24100]: error: stonith_peer_cs_destroy: Corosync connection terminated
Nov 4 23:21:37 demorp1 attrd[24101]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 4 23:21:37 demorp1 attrd[24101]: crit: attrd_cs_destroy: Lost connection to Corosync service!
Nov 4 23:21:37 demorp1 attrd[24101]: notice: main: Exiting...
Nov 4 23:21:37 demorp1 attrd[24101]: notice: main: Disconnecting client 0x14ab240, pid=24102...
Nov 4 23:21:37 demorp1 crmd[24102]: notice: peer_update_callback: Our peer on the DC is dead
Nov 4 23:21:37 demorp1 crmd[24102]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 4 23:21:37 demorp1 crmd[24102]: error: crmd_cs_destroy: connection terminated
Nov 4 23:21:37 demorp1 cib[24099]: warning: qb_ipcs_event_sendv: new_event_notification (24099-24100-12): Broken pipe (32)
Nov 4 23:21:37 demorp1 cib[24099]: warning: cib_notify_send_one: Notification of client crmd/0a81732f-ee8e-4e97-bd8e-a45e2f360a0f failed
Nov 4 23:21:37 demorp1 cib[24099]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 4 23:21:37 demorp1 cib[24099]: error: cib_cs_destroy: Corosync connection lost! Exiting.
Nov 4 23:21:37 demorp1 crmd[24102]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Nov 4 23:21:37 demorp1 attrd[24101]: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Nov 4 23:21:38 demorp1 lrmd[2434]: warning: qb_ipcs_event_sendv: new_event_notification (2434-24102-6): Bad file descriptor (9)
Nov 4 23:21:38 demorp1 lrmd[2434]: warning: send_client_notify: Notification of client crmd/3651ccf7-018a-4b0d-a6dc-f2513bd7bbe9 failed
Nov 4 23:21:38 demorp1 lrmd[2434]: warning: send_client_notify: Notification of client crmd/3651ccf7-018a-4b0d-a6dc-f2513bd7bbe9 failed
Nov 4 23:21:39 demorp1 kernel: dlm: closing connection to node 2
Nov 4 23:21:39 demorp1 kernel: dlm: closing connection to node 1


demorp2
=======
Nov 4 23:21:37 demorp2 corosync[1734]: [MAIN ] Corosync main process was not scheduled for 12117.8027 ms (threshold is 8000.0000 ms). Consider token timeout increase.
Nov 4 23:21:37 demorp2 corosync[1734]: [CMAN ] quorum lost, blocking activity
Nov 4 23:21:37 demorp2 corosync[1734]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 4 23:21:37 demorp2 corosync[1734]: [QUORUM] Members[1]: 2
Nov 4 23:21:37 demorp2 corosync[1734]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 4 23:21:37 demorp2 corosync[1734]: [CPG ] chosen downlist: sender r(0) ip(10.172.218.52) ; members(old:2 left:1)
Nov 4 23:21:37 demorp2 corosync[1734]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 4 23:21:37 demorp2 corosync[1734]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 4 23:21:37 demorp2 corosync[1734]: [CMAN ] quorum regained, resuming activity
Nov 4 23:21:37 demorp2 corosync[1734]: [QUORUM] This node is within the primary component and will provide service.
Nov 4 23:21:37 demorp2 corosync[1734]: [QUORUM] Members[2]: 1 2
Nov 4 23:21:37 demorp2 corosync[1734]: [QUORUM] Members[2]: 1 2
Nov 4 23:21:37 demorp2 corosync[1734]: [CPG ] chosen downlist: sender r(0) ip(10.172.218.51) ; members(old:1 left:0)
Nov 4 23:21:37 demorp2 corosync[1734]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 4 23:21:37 demorp2 crmd[2492]: warning: match_down_event: No match for shutdown action on demorp1
Nov 4 23:21:37 demorp2 crmd[2492]: notice: peer_update_callback: Stonith/shutdown of demorp1 not matched
Nov 4 23:21:37 demorp2 crmd[2492]: notice: cman_event_callback: Membership 400: quorum lost
Nov 4 23:21:37 demorp2 crmd[2492]: notice: cman_event_callback: Membership 400: quorum acquired
Nov 4 23:21:37 demorp2 crmd[2492]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Nov 4 23:21:37 demorp2 kernel: dlm: closing connection to node 1

>>>> this is what I believe is node 2 saying to kill node 1 >>>>>>> Nov 4 23:21:37 demorp2 fenced[1833]: telling cman to remove nodeid 1 from cluster

Nov 4 23:21:37 demorp2 fenced[1833]: receive_start 1:3 add node with started_count 1
Nov 4 23:21:51 demorp2 corosync[1734]: [MAIN ] Corosync main process was not scheduled for 10987.4082 ms (threshold is 8000.0000 ms). Consider token timeout increase.
Nov 4 23:21:51 demorp2 corosync[1734]: [TOTEM ] A processor failed, forming new configuration.
Nov 4 23:21:51 demorp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:0f:15:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
Nov 4 23:21:53 demorp2 corosync[1734]: [CMAN ] quorum lost, blocking activity
Nov 4 23:21:53 demorp2 corosync[1734]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 4 23:21:53 demorp2 corosync[1734]: [QUORUM] Members[1]: 2
Nov 4 23:21:53 demorp2 corosync[1734]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 4 23:21:53 demorp2 crmd[2492]: notice: cman_event_callback: Membership 404: quorum lost
Nov 4 23:21:53 demorp2 kernel: dlm: closing connection to node 1
Nov 4 23:21:53 demorp2 corosync[1734]: [CPG ] chosen downlist: sender r(0) ip(10.172.218.52) ; members(old:2 left:1)
Nov 4 23:21:53 demorp2 corosync[1734]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 4 23:21:53 demorp2 crmd[2492]: notice: crm_update_peer_state: cman_event_callback: Node demorp1[1] - state is now lost (was member)
Nov 4 23:21:53 demorp2 crmd[2492]: warning: match_down_event: No match for shutdown action on demorp1
Nov 4 23:21:53 demorp2 crmd[2492]: notice: peer_update_callback: Stonith/shutdown of demorp1 not matched
Nov 4 23:21:53 demorp2 crmd[2492]: warning: match_down_event: No match for shutdown action on demorp1
Nov 4 23:21:53 demorp2 crmd[2492]: notice: peer_update_callback: Stonith/shutdown of demorp1 not matched
Nov 4 23:21:53 demorp2 attrd[2490]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Nov 4 23:21:53 demorp2 attrd[2490]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (142)
Nov 4 23:21:53 demorp2 pengine[2491]: notice: unpack_config: On loss of CCM Quorum: Ignore
Nov 4 23:21:53 demorp2 pengine[2491]: notice: LogActions: Start ybrpip#011(demorp2)
Nov 4 23:21:53 demorp2 pengine[2491]: notice: process_pe_message: Calculated Transition 99: /var/lib/pacemaker/pengine/pe-input-3255.bz2
Nov 4 23:21:53 demorp2 crmd[2492]: notice: te_rsc_command: Initiating action 5: start ybrpip_start_0 on demorp2 (local)
Nov 4 23:21:53 demorp2 attrd[2490]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414871697)
Nov 4 23:21:53 demorp2 attrd[2490]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Nov 4 23:21:54 demorp2 IPaddr2(ybrpip)[25809]: INFO: Adding inet address 10.172.218.50/24 with broadcast address 10.172.218.255 to device eth0
Nov 4 23:21:54 demorp2 IPaddr2(ybrpip)[25809]: INFO: Bringing device eth0 up
Nov 4 23:21:54 demorp2 IPaddr2(ybrpip)[25809]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.172.218.50 eth0 10.172.218.50 auto not_used not_used



>
> > A signal is sent cman -> PK -> stonithd
>
> Correct (basically).
>
> > From the logs on server B I see fenced trying to kill server B, but I don't use
> any cman/stonith agents. I would like to capture that event and use a OS
> reboot.
>
> Then use a fabric fence method. These are ones where the network
> connection(s) to the target node is(are) severed. Thus, node B will sit there
> perpetually trying to fence node A, but failing because it can't talk to it's
> fence device (network switch, etc). Then a human can come in, examine the
> system, reboot the node and unfence the node once it has rebooted,
> restoring network connections.
>
> I created a proof of concept fence agent doing this with D-Link switches:
>
> https://github.com/digimer/fence_dlink_snmp
>
> It should be easy enough to adapt to, say, call the hypervisor/host and using
> brctl to detach the virtual interfaces to the VM.
Nice, but I am in a virtualised world; the VMs share a LUN.

My preference is not to allow each node access to VMware to shut down or isolate the other node.
There are issues with user IDs and passwords, and basically with security.

>
> Or, more easily, stick with power fencing and use an external log server.
>
> > So the problem I perceive is if server B is in a state where it can't
> > run OS locked up or crashed. I believe VMware will look after that,
> > from experience I have seen it deal with that
>
> I'm not sure I understand... I don't use VMWare, so maybe I am missing
> something. If the node stops all processing, then it's possible the node will be
> detected as faulty and will be rebooted. However, there are many ways that
> nodes can fail. Secondly, unless something tells pacemaker that the node is
> dead, it won't know and is not allowed to assume.

What I am trying to say is that there are a few states a node can be in:
1) okay
2) cluster not OK, but the OS is okay
3) cluster not OK, OS not okay, but the server is still ticking over
4) server is locked up

So for 2, if I have an agent that reboots, it will work.
For 3, this is the issue: the niche case where the OS reboot will potentially fail.
For 4, VMware has a way to detect this and will restart the VM.

I am willing to live with 3 as it is for now.



>
> > The issue is if B is running enough to still have a VIP (one of the
> > resources that PK looks after) is still on B and A and B can't or will
> > not shutdown via the OS. I understand that, but I would like still
> > attempt to reboot at that time
>
> You're mistake here is assuming that the node will be operating in a defined
> state. The whole idea of fencing is to put a node that is in an unknown state
> into a known state. To do that, you must be able to fence totally outside the
> node itself. If you depend on the node behaving at all, your approach is
> flawed.
Yes and no. I am willing to accept some states, as outlined above

>
> > I have found a simpler solution I actively poll to check if the cluster is okay. I
> would prefer to fire a script on an event but ..
> >
> > I'm also looking into why there is a comms problem as its 2 vm's on the
> same host on the same network, I think its starvation of cpu cycles as it’s a
> dev setup.
>
> Why things went wrong is entirely secondary to fencing.

True but it might help to deprioritise finding my fencing solution :)

>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is
> trapped in the mind of a person without access to education?
Re: stonith q
> On 5 Nov 2014, at 9:39 am, Alex Samad - Yieldbroker <Alex.Samad@yieldbroker.com> wrote:
>
>
>
>
>>>>>> I read to mean that demorp2 killed this node >>> Nov 4 23:21:37 demorp1 corosync[23415]: cman killed by node 2 because we were killed by cman_tool or other application
>
> Nov 4 23:21:37 demorp1 pacemakerd[24093]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Nov 4 23:21:37 demorp1 pacemakerd[24093]: error: mcp_cpg_destroy: Connection destroyed
> Nov 4 23:21:37 demorp1 stonith-ng[24100]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Nov 4 23:21:37 demorp1 dlm_controld[23497]: cluster is down, exiting
> Nov 4 23:21:37 demorp1 fenced[23483]: cluster is down, exiting
>
>>>> This is what I would like to capture and do something with
>

Unless you start rewriting parts of the dlm and fenced, you can't