Mailing List Archive

Pacemaker won't start after node was fenced
Had a failover of my active/passive cluster and now the passive node will
not rejoin the cluster.

2 nodes running Ubuntu 12.04
corosync 1.4.2-2, openais 1.1.4-4, pacemaker 1.1.6-2ubuntu3

Corosync ring membership is fine on both rings.

Tried stopping corosync/pacemaker and clearing /var/lib/heartbeat/crm/, then
restarting on the passive node, without success.

Tried rebooting the passive node (again - it was successfully fenced).

Tried updating pacemaker to the latest in the distro (1.1.6-2ubuntu3.3), then
went back on the passive node.

Tried putting the active node in maintenance mode and stopping pacemaker and
corosync on both nodes, then restarting on both nodes. Corosync came back
fine as before, but now I have the same problem on both nodes: pacemaker
does not start successfully. Both show exactly the same error now:
attrd: [24883]: ERROR: main: HA Signon failed.

Log:

Jan 27 01:09:59 Condor crmd: [24885]: info: crmd_init: Starting crmd
Jan 27 01:09:59 Condor cib: [24881]: info: validate_with_relaxng: Creating RNG parser context
Jan 27 01:09:59 Condor lrmd: [24882]: info: enabling coredumps
Jan 27 01:09:59 Condor lrmd: [24882]: info: Started.
Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.
Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: HA Signon failed
Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: Aborting startup
Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child process attrd exited (pid=24883, rc=100)
Jan 27 01:09:59 Condor pacemakerd: [24877]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes: Node Condor now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: init_ais_connection_classic: AIS connection established
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: get_ais_nodeid: Server details: id=167837962 uname=Condor cname=pcmk
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_new_peer: Node Condor now has id: 167837962
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_new_peer: Node 167837962 is now known as Condor
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: main: Starting stonith-ng mainloop
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_update_peer: Node Condor: id=167837962 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
Jan 27 01:09:59 Condor cib: [24881]: info: startCib: CIB Initialization completed successfully
Jan 27 01:09:59 Condor cib: [24881]: info: get_cluster_type: Cluster type is: 'openais'
Jan 27 01:09:59 Condor cib: [24881]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.
Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic: Connection to our AIS plugin (9) failed: unknown (100)
Jan 27 01:09:59 Condor cib: [24881]: CRIT: cib_init: Cannot sign in to the cluster... terminating
Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child process cib exited (pid=24881, rc=100)
Jan 27 01:09:59 Condor pacemakerd: [24877]: notice: pcmk_child_exit: Child process cib no longer wishes to be respawned
Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes: Node Condor now has process list: 00000000000000000000000000110212 (was 00000000000000000000000000110312)
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_update_peer: Node Condor: id=167837962 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110212 (new)
Jan 27 01:10:00 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
Jan 27 01:10:00 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Jan 27 01:10:00 Condor crmd: [24885]: info: crmd_init: Starting crmd's mainloop
Jan 27 01:10:01 Condor CRON[24888]: (root) CMD (/etc/init.d/watchdog -e >/dev/null 2>&1)
Jan 27 01:10:02 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
Jan 27 01:10:03 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
Jan 27 01:10:03 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
Jan 27 01:10:05 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
Jan 27 01:10:06 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
Jan 27 01:10:06 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 3 times... pause and retry
Jan 27 01:10:08 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
Jan 27 01:10:09 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
Jan 27 01:10:09 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 4 times... pause and retry
Jan 27 01:10:11 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
Jan 27 01:10:12 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
Jan 27 01:10:12 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 5 times... pause and retry

Jacob A. Smith
IT Manager
Argotec, LLC
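
For anyone reading the update_node_processes lines in the log above: a rough sketch of how the two "process list" values can be compared. It assumes the value is a hexadecimal bit mask with one bit per child daemon, which the thread itself does not state; the helper name dropped_bits is made up for illustration. The daemon labels are taken only from the pcmk_child_exit lines that precede each change in this log.

```python
def dropped_bits(was: str, now: str) -> int:
    """Return the bits set in `was` but cleared in `now`, reading both as hex.

    Assumption: the process list printed by pacemakerd is a hex bit mask.
    """
    return int(was, 16) & ~int(now, 16)


# Values copied from the two update_node_processes lines in the log.
# The daemon names come from the pcmk_child_exit line just before each one.
steps = [
    ("attrd", "00000000000000000000000000111312", "00000000000000000000000000110312"),
    ("cib", "00000000000000000000000000110312", "00000000000000000000000000110212"),
]

for daemon, was, now in steps:
    print(f"{daemon} exit cleared {dropped_bits(was, now):#x}")
```

Under that assumption each exit clears exactly one bit, which matches the log: attrd and cib are the only two children reported as exited, while crmd stays up and keeps retrying its CIB connection.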
Re: Pacemaker won't start after node was fenced
> On 27 Jan 2015, at 5:23 pm, Jake Smith <jsmith@argotec.com> wrote:
>
> Had a failover of my active/passive cluster and now the passive node will not rejoin the cluster.
>
> 2 nodes running Ubuntu 12.04
> coro 1.4.2-2, openais 1.1.4-4, pcmk 1.1.6-2ubuntu3
>
> Corosync ring membership is fine on both rings.
>
> Tried stopping coro/pace and clearing /var/lib/heartbeat/crm/ and then restarting on passive node without success.
> Tried rebooting passive node (again – it was successfully fenced)
> Tried updating pacemaker to latest in distro (1.1.6-2ubuntu3.3) then went back on passive node
> Tried putting active node in maintenance mode and stopping pacemaker and corosync on both nodes. Then restarting on both nodes. Corosync came back fine as before but now I have the same problem on both nodes with pacemaker not starting successfully. Both show exactly same now - attrd: [24883]: ERROR: main: HA Signon failed.
>
> Log:
> Jan 27 01:09:59 Condor crmd: [24885]: info: crmd_init: Starting crmd
> Jan 27 01:09:59 Condor cib: [24881]: info: validate_with_relaxng: Creating RNG parser context
> Jan 27 01:09:59 Condor lrmd: [24882]: info: enabling coredumps
> Jan 27 01:09:59 Condor lrmd: [24882]: info: Started.
> Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.

This seems to be the root of the errors.
Pacemaker looks a little old; could you consider updating?
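
A small sketch of how one might confirm that ordering in a longer syslog: it lists the ERROR/CRIT entries that appear after corosync logs "Invalid IPC credentials". The regular expression is written against the line layout seen in this thread only, and the function name failures_after_ipc_reject is hypothetical; other syslog formats would need a different pattern.

```python
import re

# Matches the two layouts seen in this thread:
#   "Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: HA Signon failed"
#   "Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"(?P<daemon>[\w-]+):?\s?\[(?P<pid>\d+)\]:\s+(?P<rest>.*)$"
)


def failures_after_ipc_reject(lines):
    """Return (daemon, message) for ERROR/CRIT entries logged after a
    corosync 'Invalid IPC credentials' line. Unparseable lines are skipped."""
    seen_reject = False
    failures = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        daemon, rest = m.group("daemon"), m.group("rest")
        if daemon == "corosync" and "Invalid IPC credentials" in rest:
            seen_reject = True
            continue
        if seen_reject and rest.startswith(("ERROR:", "CRIT:")):
            failures.append((daemon, rest))
    return failures


# Lines copied from the log in this thread.
sample = [
    "Jan 27 01:09:59 Condor lrmd: [24882]: info: Started.",
    "Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.",
    "Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: HA Signon failed",
    "Jan 27 01:09:59 Condor cib: [24881]: CRIT: cib_init: Cannot sign in to the cluster... terminating",
]
for daemon, message in failures_after_ipc_reject(sample):
    print(daemon, "->", message)
```

On the sample this reports attrd and cib, the two daemons whose sign-on is refused right after each rejection.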

> Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: HA Signon failed
> Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: Aborting startup
> Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child process attrd exited (pid=24883, rc=100)
> Jan 27 01:09:59 Condor pacemakerd: [24877]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
> Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes: Node Condor now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: init_ais_connection_classic: AIS connection established
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: get_ais_nodeid: Server details: id=167837962 uname=Condor cname=pcmk
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_new_peer: Node Condor now has id: 167837962
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_new_peer: Node 167837962 is now known as Condor
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: main: Starting stonith-ng mainloop
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_update_peer: Node Condor: id=167837962 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
> Jan 27 01:09:59 Condor cib: [24881]: info: startCib: CIB Initialization completed successfully
> Jan 27 01:09:59 Condor cib: [24881]: info: get_cluster_type: Cluster type is: 'openais'
> Jan 27 01:09:59 Condor cib: [24881]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
> Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.
> Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic: Connection to our AIS plugin (9) failed: unknown (100)
> Jan 27 01:09:59 Condor cib: [24881]: CRIT: cib_init: Cannot sign in to the cluster... terminating
> Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child process cib exited (pid=24881, rc=100)
> Jan 27 01:09:59 Condor pacemakerd: [24877]: notice: pcmk_child_exit: Child process cib no longer wishes to be respawned
> Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes: Node Condor now has process list: 00000000000000000000000000110212 (was 00000000000000000000000000110312)
> Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_update_peer: Node Condor: id=167837962 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110212 (new)
> Jan 27 01:10:00 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
> Jan 27 01:10:00 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
> Jan 27 01:10:00 Condor crmd: [24885]: info: crmd_init: Starting crmd's mainloop
> Jan 27 01:10:01 Condor CRON[24888]: (root) CMD (/etc/init.d/watchdog -e >/dev/null 2>&1)
> Jan 27 01:10:02 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> Jan 27 01:10:03 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
> Jan 27 01:10:03 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
> Jan 27 01:10:05 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> Jan 27 01:10:06 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
> Jan 27 01:10:06 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 3 times... pause and retry
> Jan 27 01:10:08 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> Jan 27 01:10:09 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
> Jan 27 01:10:09 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 4 times... pause and retry
> Jan 27 01:10:11 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> Jan 27 01:10:12 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
> Jan 27 01:10:12 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 5 times... pause and retry
>
> Jacob A. Smith
> IT Manager
> Argotec, LLC
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


Re: Pacemaker won't start after node was fenced
That will be tough, but I'll see if I can give it a try sometime soon.

I've had no luck tracking down that error, so I'm running out of other options :/

Jake

-----Original Message-----
From: Andrew Beekhof [mailto:andrew@beekhof.net]
Sent: Monday, February 23, 2015 7:43 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker won't start after node was fenced


> On 27 Jan 2015, at 5:23 pm, Jake Smith <jsmith@argotec.com> wrote:
>
> Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.

This seems to be the root of the errors.
Pacemaker looks a little old; could you consider updating?


