Mailing List Archive

Segfault on monitor resource
Hi!

I'm writing here because two days ago I experienced a strange problem in my
Pacemaker cluster.
Everything was working fine until, suddenly, a segfault happened in the Nginx
monitor resource:

Jan 25 03:55:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition 7551
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-90.bz2): Complete
Jan 25 03:55:24 lb02 crmd: [9975]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 25 04:00:08 lb02 cib: [9971]: info: cib_stats: Processed 1 operations
(0.00us average, 0% utilization) in the last 10min
Jan 25 04:10:24 lb02 crmd: [9975]: info: crm_timer_popped: PEngine Recheck
Timer (I_PE_CALC) just popped (900000ms)
Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Jan 25 04:10:24 lb02 crmd: [9975]: info: do_state_transition: Progressed to
state S_POLICY_ENGINE after C_TIMER_POPPED
Jan 25 04:10:24 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
Jan 25 04:10:24 lb02 pengine: [10028]: notice: common_apply_stickiness:
Ldirector-rsc can fail 999997 more times on lb02 before being forced off
Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Jan 25 04:10:24 lb02 pengine: [10028]: notice: process_pe_message:
Transition 7552: PEngine Input stored in: /var/lib/pengine/pe-input-90.bz2
Jan 25 04:10:24 lb02 crmd: [9975]: info: do_te_invoke: Processing graph
7552 (ref=pe_calc-dc-1422155424-7644) derived from
/var/lib/pengine/pe-input-90.bz2
Jan 25 04:10:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition 7552
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-90.bz2): Complete
Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]


Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
(Nginx-rsc:monitor:stderr) Segmentation fault ******* here it starts

As you can see in the last line.
And then:

Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
(Nginx-rsc:monitor:stderr) Killed
/usr/lib/ocf/resource.d//heartbeat/nginx: 910:
/usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork

I guess Nginx was killed here.
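A detail that supports the signal theory: a POSIX shell reports a child killed by a signal with exit status 128 plus the signal number, so the shell interpreting the nginx RA would have seen 139 for the "Segmentation fault" and 137 for "Killed" (SIGKILL, the usual OOM-killer signature). A self-contained sketch, not taken from the cluster itself:

```shell
# A child terminated by a signal exits with 128 + the signal number;
# this is how the shell running the nginx RA sees its children die.
rc1=$(sh -c 'kill -11 $$' 2>/dev/null; echo $?)  # child sends itself SIGSEGV (11)
echo "after SIGSEGV: $rc1"                       # 128 + 11 = 139
rc2=$(sh -c 'kill -9 $$' 2>/dev/null; echo $?)   # child sends itself SIGKILL (9)
echo "after SIGKILL: $rc2"                       # 128 + 9 = 137
```

"Killed" plus "Cannot fork" at line 910 of the RA both point toward resource exhaustion on the node rather than a bug in the RA script itself.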

And then I get some other errors, until Pacemaker decides to move the
resources to the other node:

Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation
Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false)
invalid parameter
Jan 25 04:10:30 lb02 crmd: [9975]: info: process_graph_event: Detected
action Nginx-rsc_monitor_10000 from a different transition: 5739 vs. 7552
Jan 25 04:10:30 lb02 crmd: [9975]: info: abort_transition_graph:
process_graph_event:476 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=Nginx-rsc_last_failure_0,
magic=0:2;4:5739:0:42d1ed53-9686-4174-84e7-d2c230ed8832, cib=
3.14.40) : Old event
Jan 25 04:10:30 lb02 crmd: [9975]: WARN: update_failcount: Updating
failcount for Nginx-rsc on lb02 after failed monitor: rc=2 (update=value++,
time=1422155430)
Jan 25 04:10:30 lb02 crmd: [9975]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Jan 25 04:10:30 lb02 attrd: [9974]: info: log-rotate detected on logfile
/var/log/ha-log
Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-Nginx-rsc (1)
Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing
Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid
parameter' (rc=2)
Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
failed op Nginx-rsc_last_failure_0 on lb02: invalid parameter (2)
Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: common_apply_stickiness:
Ldirector-rsc can fail 999997 more times on lb02 before being forced off
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
IP-rsc_mysql (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
IP-rsc_nginx (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
IP-rsc_nginx6 (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
IP-rsc_elasticsearch (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move
Ldirector-rsc (Started lb02 -> lb01)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move
Nginx-rsc (Started lb02 -> lb01)
Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_perform_update: Sent
update 23: fail-count-Nginx-rsc=1
Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-Nginx-rsc (1422155430)

I see that Pacemaker is complaining about some errors like "invalid
parameter", for example in these lines:

Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation
Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false)
invalid parameter

Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing
Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid
parameter' (rc=2)

It sounds (to me) like a syntax problem in the resource definitions, but I've
checked the config with crm_verify and there is no error:

root# (S) crm_verify -LVV
root# (S)

So I'm just wondering why Pacemaker is complaining about an invalid
parameter.
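For reference, the rc values in those messages are standard OCF resource-agent exit codes, not something derived from the resource definitions, which is consistent with crm_verify finding nothing. A small lookup sketch (the `ocf_rc_name` helper is illustrative only, not a real cluster tool):

```shell
# Exit codes from the OCF resource-agent API, as Pacemaker logs them.
ocf_rc_name() {
  case "$1" in
    0) echo "OCF_SUCCESS" ;;
    1) echo "OCF_ERR_GENERIC" ;;
    2) echo "OCF_ERR_ARGS - logged as 'invalid parameter'" ;;
    7) echo "OCF_NOT_RUNNING - logged as 'not running'" ;;
    *) echo "rc=$1 - see the OCF spec for the full list" ;;
  esac
}
ocf_rc_name 2   # the Nginx-rsc monitor failure above
ocf_rc_name 7   # the Ldirector-rsc failure above
```

So rc=2 means the monitor script itself exited with status 2; given the segfault right before it, the "invalid parameter" wording likely reflects a crashing agent rather than a real configuration problem.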

These are my CIB objects:

node $id="43b2c5a1-9552-4438-962b-6e98a2dd67c7" lb01
node $id="68328520-68e0-42fd-9adf-062655691643" lb02
primitive IP-rsc_elasticsearch ocf:heartbeat:IPaddr2 \
params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
primitive IP-rsc_elasticsearch6 ocf:heartbeat:IPv6addr \
params ipv6addr="xxxxxxxxxxxxxxxx" \
op monitor interval="10s"
primitive IP-rsc_mysql ocf:heartbeat:IPaddr2 \
params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
primitive IP-rsc_mysql6 ocf:heartbeat:IPv6addr \
params ipv6addr="xxxxxxxxxxxxxx" \
op monitor interval="10s"
primitive IP-rsc_nginx ocf:heartbeat:IPaddr2 \
params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
primitive IP-rsc_nginx6 ocf:heartbeat:IPv6addr \
params ipv6addr="xxxxxxxxxxxxxx" \
op monitor interval="10s"
primitive Ldirector-rsc ocf:heartbeat:ldirectord \
op monitor interval="10s" timeout="30s"
primitive Nginx-rsc ocf:heartbeat:nginx \
op monitor interval="10s" timeout="30s"
location cli-standby-IP-rsc_elasticsearch6 IP-rsc_elasticsearch6 \
rule $id="cli-standby-rule-IP-rsc_elasticsearch6" -inf: #uname eq lb01
location cli-standby-IP-rsc_mysql IP-rsc_mysql \
rule $id="cli-standby-rule-IP-rsc_mysql" -inf: #uname eq lb01
location cli-standby-IP-rsc_mysql6 IP-rsc_mysql6 \
rule $id="cli-standby-rule-IP-rsc_mysql6" -inf: #uname eq lb01
location cli-standby-IP-rsc_nginx IP-rsc_nginx \
rule $id="cli-standby-rule-IP-rsc_nginx" -inf: #uname eq lb01
location cli-standby-IP-rsc_nginx6 IP-rsc_nginx6 \
rule $id="cli-standby-rule-IP-rsc_nginx6" -inf: #uname eq lb01
colocation hcu_c inf: Nginx-rsc Ldirector-rsc IP-rsc_mysql IP-rsc_nginx
IP-rsc_nginx6 IP-rsc_elasticsearch
order hcu_o inf: IP-rsc_nginx IP-rsc_nginx6 IP-rsc_mysql Ldirector-rsc
Nginx-rsc IP-rsc_elasticsearch
property $id="cib-bootstrap-options" \
dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false"


Do you have some hints that I can follow?

Thanks in advance!

Oscar
Re: Segfault on monitor resource [ In reply to ]
Oh, I forgot some important details:

root# (S) crm status
============
Last updated: Mon Jan 26 18:21:35 2015
Last change: Sun Jan 25 05:19:13 2015 via crm_resource on lb01
Stack: Heartbeat
Current DC: lb01 (43b2c5a1-9552-4438-962b-6e98a2dd67c7) - partition with
quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
8 Resources configured.
============

Online: [ lb01 lb02 ]

IP-rsc_mysql (ocf::heartbeat:IPaddr2): Started lb02
IP-rsc_nginx (ocf::heartbeat:IPaddr2): Started lb02
IP-rsc_nginx6 (ocf::heartbeat:IPv6addr): Started lb02
IP-rsc_mysql6 (ocf::heartbeat:IPv6addr): Started lb02
IP-rsc_elasticsearch6 (ocf::heartbeat:IPv6addr): Started lb02
IP-rsc_elasticsearch (ocf::heartbeat:IPaddr2): Started lb02
Ldirector-rsc (ocf::heartbeat:ldirectord): Started lb02
Nginx-rsc (ocf::heartbeat:nginx): Started lb02


This is running on:

Debian 7.8
pacemaker 1.1.7-1

Re: Segfault on monitor resource [ In reply to ]
Maybe you can use sar to check whether your server was short on resources?

Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
(Nginx-rsc:monitor:stderr) Killed
/usr/lib/ocf/resource.d//heartbeat/nginx: 910:
/usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork
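Besides sar, the kernel log records OOM-killer activity, which would explain both the "Killed" and the "Cannot fork". A sketch with invented sample lines (on a real node you would grep /var/log/kern.log or the output of dmesg instead):

```shell
# Simulated kernel-log check for memory exhaustion; the two sample lines
# below are made up for illustration.
cat > /tmp/kern.sample <<'EOF'
Jan 25 04:10:29 lb02 kernel: nginx invoked oom-killer: gfp_mask=0x201da, order=0
Jan 25 04:10:29 lb02 kernel: Out of memory: Kill process 4321 (nginx) score 512
EOF
grep -cEi 'oom-killer|out of memory' /tmp/kern.sample   # prints 2
```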



--
this is my life and I live it as long as God wills

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: Segfault on monitor resource [ In reply to ]
Hi,

On Mon, Jan 26, 2015 at 06:20:35PM +0100, Oscar Salvador wrote:
> Hi!
>
> Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
> (Nginx-rsc:monitor:stderr) Segmentation fault ******* here it starts

What exactly did segfault? Do you have a core dump to examine?

> Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
> (Nginx-rsc:monitor:stderr) Killed
> /usr/lib/ocf/resource.d//heartbeat/nginx: 910:
> /usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork

This could be related to the segfault, or to some other serious
system error.

> I see that Pacemaker is complaining about some errors like "invalid
> parameter", for example in these lines:

That error code is what the nginx RA exited with. It's unusual,
but perhaps also due to the segfault.
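For context, Pacemaker's "invalid parameter" and "not running" strings are just its names for the OCF exit codes the agents returned. A sketch naming only the codes that appear in these logs (per the OCF resource-agent convention):

```shell
# OCF resource-agent exit codes seen in the logs above (OCF spec names).
OCF_SUCCESS=0         # monitor: resource healthy
OCF_ERR_GENERIC=1     # unspecified failure
OCF_ERR_ARGS=2        # "invalid parameter" -- what the nginx RA returned
OCF_NOT_RUNNING=7     # "not running" -- the Ldirector-rsc failure above
echo "rc=2 means OCF_ERR_ARGS"
```

So rc=2 does not necessarily mean the CIB configuration is wrong; it is whatever the agent (or a command it ran) happened to exit with.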

Thanks,

Dejan

> Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation
> Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false)
> invalid parameter
>
> Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing
> Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid
> parameter' (rc=2)
>
> It sounds (to me) like a syntax problem in the resource definitions, but I've
> checked the config with crm_verify and there is no error:
>
> root# (S) crm_verify -LVV
> root# (S)
>
> So I'm just wondering why pacemaker is complaining about an invalid
> parameter.
>
> These are my CIB objects:
>
> node $id="43b2c5a1-9552-4438-962b-6e98a2dd67c7" lb01
> node $id="68328520-68e0-42fd-9adf-062655691643" lb02
> primitive IP-rsc_elasticsearch ocf:heartbeat:IPaddr2 \
> params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> primitive IP-rsc_elasticsearch6 ocf:heartbeat:IPv6addr \
> params ipv6addr="xxxxxxxxxxxxxxxx" \
> op monitor interval="10s"
> primitive IP-rsc_mysql ocf:heartbeat:IPaddr2 \
> params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> primitive IP-rsc_mysql6 ocf:heartbeat:IPv6addr \
> params ipv6addr="xxxxxxxxxxxxxx" \
> op monitor interval="10s"
> primitive IP-rsc_nginx ocf:heartbeat:IPaddr2 \
> params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> primitive IP-rsc_nginx6 ocf:heartbeat:IPv6addr \
> params ipv6addr="xxxxxxxxxxxxxx" \
> op monitor interval="10s"
> primitive Ldirector-rsc ocf:heartbeat:ldirectord \
> op monitor interval="10s" timeout="30s"
> primitive Nginx-rsc ocf:heartbeat:nginx \
> op monitor interval="10s" timeout="30s"
> location cli-standby-IP-rsc_elasticsearch6 IP-rsc_elasticsearch6 \
> rule $id="cli-standby-rule-IP-rsc_elasticsearch6" -inf: #uname eq lb01
> location cli-standby-IP-rsc_mysql IP-rsc_mysql \
> rule $id="cli-standby-rule-IP-rsc_mysql" -inf: #uname eq lb01
> location cli-standby-IP-rsc_mysql6 IP-rsc_mysql6 \
> rule $id="cli-standby-rule-IP-rsc_mysql6" -inf: #uname eq lb01
> location cli-standby-IP-rsc_nginx IP-rsc_nginx \
> rule $id="cli-standby-rule-IP-rsc_nginx" -inf: #uname eq lb01
> location cli-standby-IP-rsc_nginx6 IP-rsc_nginx6 \
> rule $id="cli-standby-rule-IP-rsc_nginx6" -inf: #uname eq lb01
> colocation hcu_c inf: Nginx-rsc Ldirector-rsc IP-rsc_mysql IP-rsc_nginx
> IP-rsc_nginx6 IP-rsc_elasticsearch
> order hcu_o inf: IP-rsc_nginx IP-rsc_nginx6 IP-rsc_mysql Ldirector-rsc
> Nginx-rsc IP-rsc_elasticsearch
> property $id="cib-bootstrap-options" \
> dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="false"
>
>
> Do you have some hints that I can follow?
>
> Thanks in advance!
>
> Oscar

> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


Re: Segfault on monitor resource [ In reply to ]
Hi,

I've checked the resource graphs I have, and the resources looked fine, so I
don't think it's a problem caused by high memory usage or anything like that.
Unfortunately I don't have a core dump to analyze (I'll enable core dumps for
any future occurrence), so the only thing I have are the logs.

From the line below, I thought it was the process in charge of monitoring
nginx that was killed by a segfault:

RA output: (Nginx-rsc:monitor:stderr) Segmentation fault


I've checked the Nginx logs, and there is nothing of note there; in fact
there is no activity at all, so I think something internal must have
caused the failure.
I'll enable core dumps; it's the only thing I can do for now.
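Enabling core dumps for ad-hoc debugging could look like this (a sketch for a Linux node; for persistence you would also add a `soft core unlimited` line to /etc/security/limits.conf, and the values here are assumptions, not from this thread):

```shell
# Sketch: lift the core-size limit so a crashing process leaves a core
# file behind (current shell only; limits.conf is needed for persistence).
ulimit -c unlimited                   # lift the soft core-size limit
ulimit -c                             # -> "unlimited" if it took effect
cat /proc/sys/kernel/core_pattern     # where/how the kernel writes cores
```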

Thank you very much

Oscar

2015-01-27 10:39 GMT+01:00 Dejan Muhamedagic <dejanmm@fastmail.fm>:

> What exactly did segfault? Do you have a core dump to examine?
Re: Segfault on monitor resource [ In reply to ]
On Tue, Jan 27, 2015 at 03:18:13PM +0100, Oscar Salvador wrote:
> Hi,
>
> I've checked the resource graphs I have, and the resources looked fine, so I
> don't think it's a problem caused by high memory usage or anything like that.
> Unfortunately I don't have a core dump to analyze (I'll enable core dumps for
> any future occurrence), so the only thing I have are the logs.
>
> From the line below, I thought it was the process in charge of monitoring
> nginx that was killed by a segfault:
>
> RA output: (Nginx-rsc:monitor:stderr) Segmentation fault

This is just output captured during the execution of the RA
monitor action. It could have been anything invoked by the RA (which is
itself just a shell script) that segfaulted.
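If it happens again, one way to pin down which command inside the agent crashed is to trace the monitor action by hand (a sketch; the agent path is the one from the logs above, and `trace.log` is an arbitrary name):

```shell
# Sketch: run the RA's monitor action under xtrace; the last '+' lines
# in the trace show the command that was executing when the shell
# reported the signal. (Agent path taken from the logs in this thread.)
sh -x /usr/lib/ocf/resource.d/heartbeat/nginx monitor 2>trace.log || true
tail -n 5 trace.log
```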

Thanks,

Dejan

Re: Segfault on monitor resource [ In reply to ]
2015-01-27 17:58 GMT+01:00 Dejan Muhamedagic <dejanmm@fastmail.fm>:

> This is just output captured during the execution of the RA
> monitor action. It could have been anything invoked by the RA (which is
> itself just a shell script) that segfaulted.
>

Hi,

Yes, I see.
I've enabled core dumps on the system, so next time I'll be able to
check what caused it.

Thank you very much
Oscar Salvador


>
> Thanks,
>
> Dejan
>
> > I've checked the Nginx logs, and there is nothing worth there, actually
> > there is no activity, so I think it has to be something internal what
> > caused the failure.
> > I'll enable coredumps, it's the only thing I can do for now.
> >
> > Thank you very much
> >
> > Oscar
> >
> > 2015-01-27 10:39 GMT+01:00 Dejan Muhamedagic <dejanmm@fastmail.fm>:
> >
> > > Hi,
> > >
> > > On Mon, Jan 26, 2015 at 06:20:35PM +0100, Oscar Salvador wrote:
> > > > Hi!
> > > >
> > > > I'm writing here because two days ago I experienced a strange
> problem in
> > > my
> > > > Pacemaker Cluster.
> > > > Everything was working fine, till suddenly a Segfault in Nginx
> monitor
> > > > resource happened:
> > > >
> > > > Jan 25 03:55:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition
> > > 7551
> > > > (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> > > > Source=/var/lib/pengine/pe-input-90.bz2): Complete
> > > > Jan 25 03:55:24 lb02 crmd: [9975]: notice: do_state_transition: State
> > > > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> > > > cause=C_FSA_INTERNAL origin=notify_crmd ]
> > > > Jan 25 04:00:08 lb02 cib: [9971]: info: cib_stats: Processed 1
> operations
> > > > (0.00us average, 0% utilization) in the last 10min
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: info: crm_timer_popped: PEngine
> > > Recheck
> > > > Timer (I_PE_CALC) just popped (900000ms)
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
> > > > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> > > cause=C_TIMER_POPPED
> > > > origin=crm_timer_popped ]
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: info: do_state_transition:
> Progressed
> > > to
> > > > state S_POLICY_ENGINE after C_TIMER_POPPED
> > > > Jan 25 04:10:24 lb02 pengine: [10028]: WARN: unpack_rsc_op:
> Processing
> > > > failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
> > > > Jan 25 04:10:24 lb02 pengine: [10028]: notice:
> common_apply_stickiness:
> > > > Ldirector-rsc can fail 999997 more times on lb02 before being forced
> off
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
> > > > transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
> input=I_PE_SUCCESS
> > > > cause=C_IPC_MESSAGE origin=handle_response ]
> > > > Jan 25 04:10:24 lb02 pengine: [10028]: notice: process_pe_message:
> > > > Transition 7552: PEngine Input stored in:
> > > /var/lib/pengine/pe-input-90.bz2
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: info: do_te_invoke: Processing
> graph
> > > > 7552 (ref=pe_calc-dc-1422155424-7644) derived from
> > > > /var/lib/pengine/pe-input-90.bz2
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition
> > > 7552
> > > > (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> > > > Source=/var/lib/pengine/pe-input-90.bz2): Complete
> > > > Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
> > > > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> > > > cause=C_FSA_INTERNAL origin=notify_crmd ]
> > > >
> > > >
> > > > Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
> > > > (Nginx-rsc:monitor:stderr) Segmentation fault ******* here it
> starts
> > >
> > > What exactly did segfault? Do you have a core dump to examine?
> > >
> > > > As you can see, the last line.
> > > > And then:
> > > >
> > > > Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
> > > > (Nginx-rsc:monitor:stderr) Killed
> > > > /usr/lib/ocf/resource.d//heartbeat/nginx: 910:
> > > > /usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork
> > >
> > > This could be related to the segfault, or due to other serious
> > > system error.
> > >
> > > > I guess here Nginx was killed.
> > > >
> > > > And then I have some others errors till Pacemaker decide to move the
> > > > resources to the node:
> > > >
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM
> operation
> > > > Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633,
> confirmed=false)
> > > > invalid parameter
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: info: process_graph_event: Detected
> > > > action Nginx-rsc_monitor_10000 from a different transition: 5739 vs. 7552
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: info: abort_transition_graph:
> > > > process_graph_event:476 - Triggered transition abort (complete=1,
> > > > tag=lrm_rsc_op, id=Nginx-rsc_last_failure_0,
> > > > magic=0:2;4:5739:0:42d1ed53-9686-4174-84e7-d2c230ed8832, cib=3.14.40) : Old event
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: WARN: update_failcount: Updating
> > > > failcount for Nginx-rsc on lb02 after failed monitor: rc=2 (update=value++,
> > > > time=1422155430)
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: notice: do_state_transition: State
> > > > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> > > > origin=abort_transition_graph ]
> > > > Jan 25 04:10:30 lb02 attrd: [9974]: info: log-rotate detected on logfile
> > > > /var/log/ha-log
> > > > Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending
> > > > flush op to all hosts for: fail-count-Nginx-rsc (1)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing
> > > > Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid
> > > > parameter' (rc=2)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
> > > > failed op Nginx-rsc_last_failure_0 on lb02: invalid parameter (2)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
> > > > failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: common_apply_stickiness:
> > > > Ldirector-rsc can fail 999997 more times on lb02 before being forced off
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_mysql (lb02)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_nginx (lb02)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_nginx6 (lb02)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_elasticsearch (lb02)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move Ldirector-rsc (Started lb02 -> lb01)
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move Nginx-rsc (Started lb02 -> lb01)
> > > > Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_perform_update: Sent
> > > > update 23: fail-count-Nginx-rsc=1
> > > > Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending
> > > > flush op to all hosts for: last-failure-Nginx-rsc (1422155430)
> > > >
> > > > I see that Pacemaker is complaining about some errors like "invalid
> > > > parameter", for example in these lines:
> > >
> > > That error code is what the nginx RA exited with. It's unusual,
> > > but perhaps also due to the segfault.
> > >
> > > Thanks,
> > >
> > > Dejan
> > >
> > > > Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation
> > > > Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false)
> > > > invalid parameter
> > > >
> > > > Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing
> > > > Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid
> > > > parameter' (rc=2)
> > > >
> > > > It sounds (to me) like a syntax problem in the resource definitions,
> > > > but I've checked the config with crm_verify and there is no error:
> > > >
> > > > root# (S) crm_verify -LVV
> > > > root# (S)
> > > >
> > > > So I'm just wondering why pacemaker is complaining about an invalid
> > > > parameter.
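One way to see what the agent actually returns, independent of Pacemaker, is to run its monitor action by hand the way lrmd would. A hedged sketch using the agent path from the log messages; the guard makes it a harmless no-op on machines where the agent is not installed:

```shell
# Resource agent path as it appears in the log messages above.
RA=/usr/lib/ocf/resource.d/heartbeat/nginx

if [ -x "$RA" ]; then
    # OCF_ROOT is needed by the ocf-shellfuncs library the agent sources;
    # OCF_RESOURCE_INSTANCE mimics what lrmd would set for this resource.
    OCF_ROOT=/usr/lib/ocf OCF_RESOURCE_INSTANCE=Nginx-rsc "$RA" monitor
    echo "monitor exit code: $?"
else
    echo "nginx agent not installed on this host; nothing to run"
fi
```

If the hand-run monitor exits 0 or 7 while the cluster saw 2, that supports the theory that the rc=2 was a one-off artifact of the segfault/fork failure rather than a configuration error.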
> > > >
> > > > These are my CIB objects:
> > > >
> > > > node $id="43b2c5a1-9552-4438-962b-6e98a2dd67c7" lb01
> > > > node $id="68328520-68e0-42fd-9adf-062655691643" lb02
> > > > primitive IP-rsc_elasticsearch ocf:heartbeat:IPaddr2 \
> > > > params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> > > > primitive IP-rsc_elasticsearch6 ocf:heartbeat:IPv6addr \
> > > > params ipv6addr="xxxxxxxxxxxxxxxx" \
> > > > op monitor interval="10s"
> > > > primitive IP-rsc_mysql ocf:heartbeat:IPaddr2 \
> > > > params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> > > > primitive IP-rsc_mysql6 ocf:heartbeat:IPv6addr \
> > > > params ipv6addr="xxxxxxxxxxxxxx" \
> > > > op monitor interval="10s"
> > > > primitive IP-rsc_nginx ocf:heartbeat:IPaddr2 \
> > > > params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> > > > primitive IP-rsc_nginx6 ocf:heartbeat:IPv6addr \
> > > > params ipv6addr="xxxxxxxxxxxxxx" \
> > > > op monitor interval="10s"
> > > > primitive Ldirector-rsc ocf:heartbeat:ldirectord \
> > > > op monitor interval="10s" timeout="30s"
> > > > primitive Nginx-rsc ocf:heartbeat:nginx \
> > > > op monitor interval="10s" timeout="30s"
> > > > location cli-standby-IP-rsc_elasticsearch6 IP-rsc_elasticsearch6 \
> > > > rule $id="cli-standby-rule-IP-rsc_elasticsearch6" -inf: #uname eq lb01
> > > > location cli-standby-IP-rsc_mysql IP-rsc_mysql \
> > > > rule $id="cli-standby-rule-IP-rsc_mysql" -inf: #uname eq lb01
> > > > location cli-standby-IP-rsc_mysql6 IP-rsc_mysql6 \
> > > > rule $id="cli-standby-rule-IP-rsc_mysql6" -inf: #uname eq lb01
> > > > location cli-standby-IP-rsc_nginx IP-rsc_nginx \
> > > > rule $id="cli-standby-rule-IP-rsc_nginx" -inf: #uname eq lb01
> > > > location cli-standby-IP-rsc_nginx6 IP-rsc_nginx6 \
> > > > rule $id="cli-standby-rule-IP-rsc_nginx6" -inf: #uname eq lb01
> > > > colocation hcu_c inf: Nginx-rsc Ldirector-rsc IP-rsc_mysql IP-rsc_nginx
> > > > IP-rsc_nginx6 IP-rsc_elasticsearch
> > > > order hcu_o inf: IP-rsc_nginx IP-rsc_nginx6 IP-rsc_mysql Ldirector-rsc
> > > > Nginx-rsc IP-rsc_elasticsearch
> > > > property $id="cib-bootstrap-options" \
> > > > dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> > > > cluster-infrastructure="Heartbeat" \
> > > > stonith-enabled="false"
> > > >
> > > >
> > > > Do you have some hints that I can follow?
> > > >
> > > > Thanks in advance!
> > > >
> > > > Oscar
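As a practical hint for the question above: once the underlying problem (the segfault / fork failure) is addressed, the recorded monitor failure has to be cleaned up before Pacemaker will consider lb02 for Nginx-rsc again. A hedged sketch with the standard Pacemaker CLI tools of that era, guarded so it does nothing on hosts without them:

```shell
if command -v crm_resource >/dev/null 2>&1; then
    crm_mon -1 -f                 # one-shot cluster status including fail counts
    crm_resource -C -r Nginx-rsc  # clear failcount and failed-op history
else
    echo "Pacemaker CLI not available on this host"
fi
cleanup_attempted=yes
```

The cleanup removes the `fail-count-Nginx-rsc` and `last-failure-Nginx-rsc` attributes that attrd is shown setting in the logs above.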
> > >
> > > > _______________________________________________
> > > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > >
> > > > Project Home: http://www.clusterlabs.org
> > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > Bugs: http://bugs.clusterlabs.org