Mailing List Archive

Help understanding what went wrong
Hi


rpm -qa | grep pace
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64

CentOS 6.5


I have a 2-node cluster:
pcs config
Cluster Name: ybrp
Corosync Nodes:

Pacemaker Nodes:
wwwrp1 wwwrp2

Resources:
Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=10.32.43.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s
Operations: start on-fail=restart interval=0s timeout=60s (ybrpip-start-interval-0s)
monitor on-fail=restart interval=5s timeout=20s (ybrpip-monitor-interval-5s)
stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
Clone: ybrpstat-clone
Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1
Resource: ybrpstat (class=ocf provider=yb type=proxy)
Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpstat-monitor-interval-5s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
start ybrpstat-clone then start ybrpip (Mandatory) (id:order-ybrpstat-clone-ybrpip-mandatory)
Colocation Constraints:
ybrpip with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-ybrpstat-clone-INFINITY)

Cluster Properties:
cluster-infrastructure: cman
dc-version: 1.1.10-14.el6_5.3-368c726
last-lrm-refresh: 1414629002
no-quorum-policy: ignore
stonith-enabled: false



Basically, one node (wwwrp1) died from a hardware failure.

I can see in the logs that the cluster wants to bring the IP address over, and it seems to do it, but:



## This seems to be the IP address moving
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used


## This seems to be where it checks other stuff
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

# This is where the monitor times out, but it doesn't look like 20000ms elapsed: the initial try was at Oct 30 01:19:43
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)

I'm guessing that because it timed out, it went into failed mode. I need to know why it timed out; the script has never timed out before, in production or in testing...
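One way to find out why the monitor timed out is to instrument the agent so every monitor run logs its duration, which can then be correlated with cluster events. A minimal sketch; `proxy_monitor` and the log path are placeholders, not the real yb:proxy agent:

```shell
#!/bin/sh
# Sketch: time each monitor run and append the result to a log so slow
# runs can be spotted. proxy_monitor stands in for the real health check.

LOG=/tmp/ybrpstat-monitor.log

proxy_monitor() {
    # Replace with the agent's actual monitor logic.
    sleep 1
}

start=$(date +%s)
proxy_monitor
rc=$?
end=$(date +%s)
printf '%s monitor rc=%d duration=%ds\n' \
    "$(date '+%b %d %T')" "$rc" "$((end - start))" >> "$LOG"
```

Grepping that log for large durations would show whether the check itself is slow or whether something external is blocking it.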

Am I reading this right? The resource (the IP address) didn't fail over because ybrpstat was not running on wwwrp2, and the reason for that was that the monitor action failed/timed out?
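For recovery, once the cause is understood, the accumulated failures can be cleared and the monitor given more headroom. A sketch using pcs (resource names taken from the config above; the 60s value is an arbitrary example, not a recommendation):

```sh
# Inspect and clear the accumulated failures
pcs resource failcount show ybrpstat
pcs resource cleanup ybrpstat

# Raise the monitor timeout above the current 20s
pcs resource update ybrpstat op monitor interval=5s timeout=60s on-fail=restart
```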

Thanks
Alex


=== logs ===

Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start ybrpip#011(wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6827: /var/lib/pacemaker/pengine/pe-input-2.bz2
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start ybrpip#011(wwwrp2)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Stop ybrpstat:0#011(wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6828: /var/lib/pacemaker/pengine/pe-input-3.bz2
Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 7: start ybrpip_start_0 on wwwrp2 (local)
Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 1: stop ybrpstat_stop_0 on wwwrp1
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)
Oct 30 01:19:44 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (1)
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent update 12543: fail-count-ybrpstat=1
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414592384)
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent update 12545: last-failure-ybrpstat=1414592384
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart ybrpip#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6829: /var/lib/pacemaker/pengine/pe-input-4.bz2
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart ybrpip#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6830: /var/lib/pacemaker/pengine/pe-input-5.bz2

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: Help understanding what went wrong
Hi

Seems like I had the same problem on my prod cluster:

Oct 30 01:21:53 prodrp1 lrmd[2536]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 17918) timed out
Oct 30 01:21:53 prodrp1 lrmd[2536]: warning: operation_finished: ybrpstat_monitor_5000:17918 - timed out after 20000ms
Oct 30 01:21:53 prodrp1 crmd[2539]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (4384) Timed Out (timeout=20000ms)
Oct 30 01:21:53 prodrp1 attrd[2537]: notice: attrd_cs_dispatch: Update relayed from prodrp2

This is the first occurrence recorded on prodrp1.

The issue is that prodrp1 was the master for the IP.

I am wondering if it has something to do with it being a cloned resource?

So the cluster realises there is a problem, and in the middle of it realising that, the cluster check monitor fails because (assumption here) it does some sort of cluster-wide check!
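If the monitor really does make a cluster-wide call that can block while the cluster is reacting to a failure, one defensive option is to bound that call with its own, shorter timeout, so the agent returns a clean OCF error instead of being killed by lrmd at 20s. A runnable sketch, assuming coreutils `timeout` is available; `sleep 10` stands in for the blocking check and 2s is an illustrative per-check bound:

```shell
#!/bin/sh
# Bound a potentially blocking sub-check so the monitor never exceeds
# the operation timeout. `sleep 10` stands in for a cluster-wide call
# that can hang during a failover.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1

if timeout 2 sleep 10; then
    rc=$OCF_SUCCESS
else
    rc=$OCF_ERR_GENERIC   # the check hung or failed: report, don't block
fi
echo "monitor rc=$rc"
```

This way a hung dependency shows up as a fast, explicit monitor failure rather than an unexplained lrmd timeout.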

A

> -----Original Message-----
> From: Alex Samad - Yieldbroker [mailto:Alex.Samad@yieldbroker.com]
> Sent: Thursday, 30 October 2014 11:43 AM
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] Help understanding what went wrong
>
> [snip]

Re: Help understanding what went wrong
Another cluster, another set of logs.

This is more interesting because all the clusters are set up the same.

On this cluster none of the nodes failed, but I still got:
Oct 30 01:20:40 uatrp2 crmd[2595]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (17) Timed Out (timeout=20000ms)
Oct 30 01:22:20 uatrp1 crmd[2523]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (17) Timed Out (timeout=20000ms)

This has me really confused now!


Oct 30 01:19:30 uatrp1 crmd[2523]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Oct 30 01:19:30 uatrp1 pengine[2522]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:30 uatrp1 pengine[2522]: notice: process_pe_message: Calculated Transition 691: /var/lib/pacemaker/pengine/pe-input-816.bz2
Oct 30 01:19:30 uatrp1 crmd[2523]: notice: run_graph: Transition 691 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-816.bz2): Complete
Oct 30 01:19:30 uatrp1 crmd[2523]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 30 01:20:40 uatrp2 lrmd[2592]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 15638) timed out
Oct 30 01:20:40 uatrp2 lrmd[2592]: warning: operation_finished: ybrpstat_monitor_5000:15638 - timed out after 20000ms
Oct 30 01:20:40 uatrp2 crmd[2595]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (17) Timed Out (timeout=20000ms)
Oct 30 01:20:40 uatrp1 crmd[2523]: warning: update_failcount: Updating failcount for ybrpstat on uatrp2 after failed monitor: rc=1 (update=value++, time=1414592440)
Oct 30 01:20:40 uatrp1 crmd[2523]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 30 01:20:40 uatrp1 pengine[2522]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:20:41 uatrp1 pengine[2522]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:1 on uatrp2: unknown error (1)
Oct 30 01:20:41 uatrp1 pengine[2522]: notice: LogActions: Recover ybrpstat:1#011(Started uatrp2)
Oct 30 01:20:41 uatrp1 pengine[2522]: notice: process_pe_message: Calculated Transition 692: /var/lib/pacemaker/pengine/pe-input-817.bz2
Oct 30 01:20:41 uatrp1 crmd[2523]: notice: te_rsc_command: Initiating action 4: stop ybrpstat_stop_0 on uatrp2
Oct 30 01:20:41 uatrp2 proxy(ybrpstat)[15696]: INFO: proxy is not stopped by pacemaker return success
Oct 30 01:20:41 uatrp2 attrd[2593]: notice: attrd_cs_dispatch: Update relayed from uatrp1
Oct 30 01:20:41 uatrp2 attrd[2593]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (1)
Oct 30 01:20:41 uatrp2 crmd[2595]: notice: process_lrm_event: LRM operation ybrpstat_stop_0 (call=22, rc=0, cib-update=15, confirmed=true) ok
Oct 30 01:20:41 uatrp1 crmd[2523]: notice: te_rsc_command: Initiating action 13: start ybrpstat_start_0 on uatrp2
Oct 30 01:20:41 uatrp2 attrd[2593]: notice: attrd_perform_update: Sent update 9: fail-count-ybrpstat=1
Oct 30 01:20:41 uatrp2 attrd[2593]: notice: attrd_cs_dispatch: Update relayed from uatrp1
Oct 30 01:20:41 uatrp2 attrd[2593]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414592440)
Oct 30 01:20:41 uatrp2 attrd[2593]: notice: attrd_perform_update: Sent update 12: last-failure-ybrpstat=1414592440
Oct 30 01:21:01 uatrp2 lrmd[2592]: warning: child_timeout_callback: ybrpstat_start_0 process (PID 15705) timed out
Oct 30 01:21:01 uatrp2 lrmd[2592]: warning: operation_finished: ybrpstat_start_0:15705 - timed out after 20000ms
Oct 30 01:21:01 uatrp2 crmd[2595]: error: process_lrm_event: LRM operation ybrpstat_start_0 (26) Timed Out (timeout=20000ms)
Oct 30 01:21:01 uatrp1 crmd[2523]: warning: status_from_rc: Action 13 (ybrpstat_start_0) on uatrp2 failed (target: 0 vs. rc: 1): Error
Oct 30 01:21:01 uatrp1 crmd[2523]: warning: update_failcount: Updating failcount for ybrpstat on uatrp2 after failed start: rc=1 (update=INFINITY, time=1414592461)
Oct 30 01:21:01 uatrp1 crmd[2523]: warning: update_failcount: Updating failcount for ybrpstat on uatrp2 after failed start: rc=1 (update=INFINITY, time=1414592461)
Oct 30 01:21:01 uatrp1 crmd[2523]: notice: run_graph: Transition 692 (Complete=7, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-817.bz2): Stopped
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_cs_dispatch: Update relayed from uatrp1
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (INFINITY)
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_perform_update: Sent update 14: fail-count-ybrpstat=INFINITY
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_cs_dispatch: Update relayed from uatrp1
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414592461)
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_perform_update: Sent update 16: last-failure-ybrpstat=1414592461
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_cs_dispatch: Update relayed from uatrp1
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (INFINITY)
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_perform_update: Sent update 18: fail-count-ybrpstat=INFINITY
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_cs_dispatch: Update relayed from uatrp1
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414592461)
Oct 30 01:21:01 uatrp2 attrd[2593]: notice: attrd_perform_update: Sent update 20: last-failure-ybrpstat=1414592461
Oct 30 01:21:01 uatrp1 pengine[2522]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:21:01 uatrp1 pengine[2522]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:1 on uatrp2: unknown error (1)
Oct 30 01:21:01 uatrp1 pengine[2522]: notice: LogActions: Recover ybrpstat:1#011(Started uatrp2)
Oct 30 01:21:01 uatrp1 pengine[2522]: notice: process_pe_message: Calculated Transition 693: /var/lib/pacemaker/pengine/pe-input-818.bz2
Oct 30 01:21:01 uatrp1 pengine[2522]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:21:01 uatrp1 pengine[2522]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:1 on uatrp2: unknown error (1)
Oct 30 01:21:01 uatrp1 pengine[2522]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from uatrp2 after 1000000 failures (max=1000000)
Oct 30 01:21:01 uatrp1 pengine[2522]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from uatrp2 after 1000000 failures (max=1000000)
Oct 30 01:21:01 uatrp1 pengine[2522]: notice: LogActions: Stop ybrpstat:1#011(uatrp2)
Oct 30 01:21:01 uatrp1 pengine[2522]: notice: process_pe_message: Calculated Transition 694: /var/lib/pacemaker/pengine/pe-input-819.bz2
Oct 30 01:21:01 uatrp1 crmd[2523]: notice: te_rsc_command: Initiating action 3: stop ybrpstat_stop_0 on uatrp2
Oct 30 01:21:01 uatrp2 proxy(ybrpstat)[15755]: INFO: proxy is not stopped by pacemaker return success
Oct 30 01:21:01 uatrp2 crmd[2595]: notice: process_lrm_event: LRM operation ybrpstat_stop_0 (call=29, rc=0, cib-update=17, confirmed=true) ok
Oct 30 01:21:01 uatrp1 crmd[2523]: notice: run_graph: Transition 694 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-819.bz2): Complete
Oct 30 01:21:01 uatrp1 crmd[2523]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 30 01:22:20 uatrp1 lrmd[2520]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 23676) timed out
Oct 30 01:22:20 uatrp1 lrmd[2520]: warning: operation_finished: ybrpstat_monitor_5000:23676 - timed out after 20000ms
Oct 30 01:22:20 uatrp1 crmd[2523]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (17) Timed Out (timeout=20000ms)
Oct 30 01:22:20 uatrp1 crmd[2523]: warning: update_failcount: Updating failcount for ybrpstat on uatrp1 after failed monitor: rc=1 (update=value++, time=1414592540)
Oct 30 01:22:20 uatrp1 crmd[2523]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 30 01:22:20 uatrp1 attrd[2521]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (1)
Oct 30 01:22:20 uatrp1 attrd[2521]: notice: attrd_perform_update: Sent update 17: fail-count-ybrpstat=1
Oct 30 01:22:20 uatrp1 attrd[2521]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414592540)
Oct 30 01:22:20 uatrp1 pengine[2522]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:22:20 uatrp1 pengine[2522]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on uatrp1: unknown error (1)
Oct 30 01:22:20 uatrp1 pengine[2522]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:1 on uatrp2: unknown error (1)
Oct 30 01:22:20 uatrp1 pengine[2522]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from uatrp2 after 1000000 failures (max=1000000)
Oct 30 01:22:20 uatrp1 pengine[2522]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from uatrp2 after 1000000 failures (max=1000000)
Oct 30 01:22:20 uatrp1 pengine[2522]: notice: LogActions: Restart ybrpip#011(Started uatrp1)
Oct 30 01:22:20 uatrp1 pengine[2522]: notice: LogActions: Recover ybrpstat:0#011(Started uatrp1)
Oct 30 01:22:20 uatrp1 pengine[2522]: notice: process_pe_message: Calculated Transition 695: /var/lib/pacemaker/pengine/pe-input-820.bz2
Oct 30 01:22:20 uatrp1 attrd[2521]: notice: attrd_perform_update: Sent update 20: last-failure-ybrpstat=1414592540
Oct 30 01:22:20 uatrp1 pengine[2522]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:22:20 uatrp1 pengine[2522]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on uatrp1: unknown error (1)
Oct 30 01:22:20 uatrp1 pengine[2522]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:1 on uatrp2: unknown error (1)
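The fail-counts recorded above (fail-count-ybrpstat=1 on uatrp1, plus the failed start that scores uatrp2 at INFINITY) persist until cleared, so the clone stays forced away even after the underlying fault is fixed. Below is a hedged sketch of the usual inspection/cleanup commands for this pacemaker/pcs generation; the resource and node names come from the thread, the `run` wrapper is purely illustrative, and option spellings should be checked against the local man pages:

```shell
#!/bin/sh
# Illustrative sketch (not from the thread): inspect and clear the
# failure history that is forcing ybrpstat-clone away from a node.
# The wrapper echoes each command when the cluster tools are absent,
# so the sketch doubles as a harmless dry run on a non-cluster box.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@" || echo "rc=$? (expected to fail outside a live cluster)"
    else
        echo "would run: $*"
    fi
}

# Query the fail-count attribute for ybrpstat on uatrp2.
run crm_failcount -G -r ybrpstat -N uatrp2

# Wipe the failure history so the policy engine re-evaluates placement.
run pcs resource cleanup ybrpstat
```

Once the cleanup succeeds, `pcs status` should show the clone eligible on both nodes again.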





> -----Original Message-----
> From: Alex Samad - Yieldbroker
> Sent: Thursday, 30 October 2014 12:06 PM
> To: pacemaker@oss.clusterlabs.org
> Subject: RE: Help understanding what went wrong
>
> Hi
>
> Seems like I had the same problem on my prod cluster
>
> Oct 30 01:21:53 prodrp1 lrmd[2536]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 17918) timed out
> Oct 30 01:21:53 prodrp1 lrmd[2536]: warning: operation_finished: ybrpstat_monitor_5000:17918 - timed out after 20000ms
> Oct 30 01:21:53 prodrp1 crmd[2539]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (4384) Timed Out (timeout=20000ms)
> Oct 30 01:21:53 prodrp1 attrd[2537]: notice: attrd_cs_dispatch: Update relayed from prodrp2
>
> This is the first occurrence recorded on prodrp1.
>
> The issue is that prodrp1 was the master for the IP.
>
> I am wondering if it has something to do with it being a cloned resource?
>
> So the cluster realises there is a problem, and in the middle of it doing
> so, the cluster check monitor fails because (assumption here) it does some
> sort of cluster-wide check!
>
> A
>
> > -----Original Message-----
> > From: Alex Samad - Yieldbroker [mailto:Alex.Samad@yieldbroker.com]
> > Sent: Thursday, 30 October 2014 11:43 AM
> > To: pacemaker@oss.clusterlabs.org
> > Subject: [Pacemaker] Help understanding what went wrong
> >
> > Hi
> >
> >
> > rpm -qa | grep pace
> > pacemaker-libs-1.1.10-14.el6_5.3.x86_64
> > pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
> > pacemaker-cli-1.1.10-14.el6_5.3.x86_64
> > pacemaker-1.1.10-14.el6_5.3.x86_64
> >
> > centos 6.5
> >
> >
> > I have a 2 node cluster
> > pcs config
> > Cluster Name: ybrp
> > Corosync Nodes:
> >
> > Pacemaker Nodes:
> > wwwrp1 wwwrp2
> >
> > Resources:
> > Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
> > Attributes: ip=10.32.43.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
> > Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s
> > Operations: start on-fail=restart interval=0s timeout=60s (ybrpip-start-interval-0s)
> > monitor on-fail=restart interval=5s timeout=20s (ybrpip-monitor-interval-5s)
> > stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
> > Clone: ybrpstat-clone
> > Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1
> > Resource: ybrpstat (class=ocf provider=yb type=proxy)
> > Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpstat-monitor-interval-5s)
> >
> > Stonith Devices:
> > Fencing Levels:
> >
> > Location Constraints:
> > Ordering Constraints:
> > start ybrpstat-clone then start ybrpip (Mandatory) (id:order-ybrpstat-clone-ybrpip-mandatory)
> > Colocation Constraints:
> > ybrpip with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-ybrpstat-clone-INFINITY)
> >
> > Cluster Properties:
> > cluster-infrastructure: cman
> > dc-version: 1.1.10-14.el6_5.3-368c726
> > last-lrm-refresh: 1414629002
> > no-quorum-policy: ignore
> > stonith-enabled: false
> >
> >
> >
> > Basically one node (wwwrp1) died from a hardware failure.
> >
> > I can see in the logs that the cluster wants to bring the IP address
> > over, and it seems to do it, but
> >
> >
> >
> > ## this seems to be the ip address moving
> > Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
> > Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
> > Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
> >
> >
> > ## this seems to be where it checks other stuff
> > Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
> > Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
> > Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
> > Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
> > Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> >
> > # this is where monitor times out, but it doesn't look like 20000ms .. the initial try was at Oct 30 01:19:43
> > Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
> > Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
> > Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
> > Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)
> >
> > I'm guessing because it timed out it went into failed mode. I need to
> > know why it timed out. The script has never timed out before, in
> > production or in testing...
> >
> > Am I reading this right? The resource (the IP address) didn't fail over
> > because there was no ybrpstat running on wwwrp2, and the reason for that
> > was that the monitor action failed/timed out.
> >
> > Thanks
> > Alex
> >
> >
> > === logs ==
> >
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start ybrpip#011(wwwrp1)
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp1)
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6827: /var/lib/pacemaker/pengine/pe-input-2.bz2
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start ybrpip#011(wwwrp2)
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Stop ybrpstat:0#011(wwwrp1)
> > Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6828: /var/lib/pacemaker/pengine/pe-input-3.bz2
> > Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 7: start ybrpip_start_0 on wwwrp2 (local)
> > Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 1: stop ybrpstat_stop_0 on wwwrp1
> > Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
> > Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
> > Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
> > Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
> > Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
> > Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
> > Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
> > Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
> > Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
> > Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
> > Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)
> > Oct 30 01:19:44 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> > Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (1)
> > Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent update 12543: fail-count-ybrpstat=1
> > Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414592384)
> > Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent update 12545: last-failure-ybrpstat=1414592384
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart ybrpip#011(Started wwwrp2)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6829: /var/lib/pacemaker/pengine/pe-input-4.bz2
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart ybrpip#011(Started wwwrp2)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
> > Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6830: /var/lib/pacemaker/pengine/pe-input-5.bz2
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org

Re: Help understanding what went wrong
> On 30 Oct 2014, at 11:42 am, Alex Samad - Yieldbroker <Alex.Samad@yieldbroker.com> wrote:
>
> Am I reading this right? The resource (the IP address) didn't fail over because there was no ybrpstat running on wwwrp2, and the reason for that was that the monitor action failed/timed out

That's pretty much correct.
Unfortunately the only thing that can say why the monitor operation took so long is your resource agent.
Pacemaker has no knowledge of what the agent is doing, so can only report the fact that it didn't complete in time.
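
Since only the agent can explain the stall, one defensive pattern (a sketch under assumptions; the real ocf:yb:proxy agent's internals aren't shown in this thread) is to give the agent's health check its own deadline, shorter than the operation's 20s timeout, so a hung cluster-wide check becomes a fast, logged failure rather than a silent lrmd kill:

```shell
#!/bin/sh
# Hedged sketch, not the real ocf:yb:proxy agent: cap the health check
# with GNU coreutils `timeout` so a hang surfaces as a quick, logged
# OCF error instead of running until lrmd kills it at 20000ms.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

proxy_monitor() {
    # `sleep 10` stands in for whatever status check the agent really
    # performs; `timeout 2` kills it after 2 seconds (exit code 124).
    if timeout 2 sleep 10; then
        return $OCF_SUCCESS
    fi
    # Record the why in the agent's own words, so the logs say more
    # than a generic timeout.
    echo "proxy_monitor: health check exceeded its 2s deadline" >&2
    return $OCF_ERR_GENERIC
}

proxy_monitor
echo "monitor rc=$?"
```

Logging the deadline breach from inside the agent gives syslog the missing "why" that lrmd's generic timeout message cannot supply.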
Re: Help understanding what went wrong
Thanks, more investigating to do



> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew@beekhof.net]
> Sent: Thursday, 30 October 2014 1:28 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Help understanding what went wrong
>
>
> > On 30 Oct 2014, at 11:42 am, Alex Samad - Yieldbroker
> <Alex.Samad@yieldbroker.com> wrote:
> >
> > Am I reading this right? The resource (the IP address) didn't fail over
> > because there was no ybrpstat running on wwwrp2, and the reason for that
> > was that the monitor action failed/timed out
>
> That's pretty much correct.
> Unfortunately the only thing that can say why the monitor operation took so
> long is your resource agent.
> Pacemaker has no knowledge of what the agent is doing, so can only report
> the fact that it didn't complete in time.
