Mailing List Archive

Issues with ocf_run on CentOS7 with pgsql resource agent
I'm running into an issues with the pgsql resource agent running on CentOS7. The issue is when the pgsql resource agent attempts a call to runasowner it uses a su as user postgres in a call to ocf_run, this is causing what appears to be a new session on CentOS7.

We've narrowed it down to the ocf_run method in ocf-shellfuncs in executing the command:
output=`"$@" 2>&1`

Snippet of /var/log/messages:
Nov 20 21:29:05 ip-10-0-0-4 pengine[2041]: notice: LogActions: Stop pgsql:0 (pcmk-1)
Nov 20 21:29:05 ip-10-0-0-4 pengine[2041]: notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-205.bz2
Nov 20 21:29:05 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command: Initiating action 47: notify pgsql_pre_notify_stop_0 on pcmk-1 (local)
Nov 20 21:29:05 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command: Initiating action 48: notify pgsql_pre_notify_stop_0 on pcmk-2
Nov 20 21:29:05 ip-10-0-0-4 pacemakerd[2412]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Nov 20 21:29:05 ip-10-0-0-4 su: (to postgres) root on none
Nov 20 21:29:05 ip-10-0-0-4 systemd: Created slice user-26.slice.
Nov 20 21:29:05 ip-10-0-0-4 systemd: Starting Session c4 of user postgres.
Nov 20 21:29:05 ip-10-0-0-4 systemd: Started Session c4 of user postgres.
Nov 20 21:29:30 ip-10-0-0-4 crmd[2042]: notice: process_lrm_event: LRM operation pgsql_notify_0 (call=17, rc=0, cib-update=0, confirmed=true) ok
Nov 20 21:29:30 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command: Initiating action 1: stop pgsql_stop_0 on pcmk-1 (local)
Nov 20 21:29:30 ip-10-0-0-4 pacemakerd[2464]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Nov 20 21:29:30 ip-10-0-0-4 su: (to postgres) root on none
Nov 20 21:29:30 ip-10-0-0-4 systemd: Created slice user-26.slice.
Nov 20 21:29:30 ip-10-0-0-4 systemd: Starting Session c5 of user postgres.
Nov 20 21:29:30 ip-10-0-0-4 systemd: Started Session c5 of user postgres.

Software versions:
pacemaker 1.1.10
resource-agents 3.9.5
corosync 2.3.3-2

Has anyone else experienced this issue or know of what might be going wrong to cause new sessions to be created?

Thanks,
Brendan
Re: Issues with ocf_run on CentOS7 with pgsql resource agent [ In reply to ]
On Fri, Nov 21, 2014 at 12:36 AM, Brendan Reekie <breekie@sandvine.com> wrote:
> I’m running into an issues with the pgsql resource agent running on CentOS7.
> The issue is when the pgsql resource agent attempts a call to runasowner it
> uses a su as user postgres in a call to ocf_run, this is causing what
> appears to be a new session on CentOS7.
>
>
>
> We’ve narrowed it down to the ocf_run method in ocf-shellfuncs in executing
> the command:
>
> output=`"$@" 2>&1`
>
>
>
> Snippet of /var/log/messages:
>
> Nov 20 21:29:05 ip-10-0-0-4 pengine[2041]: notice: LogActions: Stop
> pgsql:0 (pcmk-1)
>
> Nov 20 21:29:05 ip-10-0-0-4 pengine[2041]: notice: process_pe_message:
> Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-205.bz2
>
> Nov 20 21:29:05 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command: Initiating
> action 47: notify pgsql_pre_notify_stop_0 on pcmk-1 (local)
>
> Nov 20 21:29:05 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command: Initiating
> action 48: notify pgsql_pre_notify_stop_0 on pcmk-2
>
> Nov 20 21:29:05 ip-10-0-0-4 pacemakerd[2412]: notice: crm_add_logfile:
> Additional logging available in /var/log/pacemaker.log
>
> Nov 20 21:29:05 ip-10-0-0-4 su: (to postgres) root on none
>
> Nov 20 21:29:05 ip-10-0-0-4 systemd: Created slice user-26.slice.
>
> Nov 20 21:29:05 ip-10-0-0-4 systemd: Starting Session c4 of user postgres.
>
> Nov 20 21:29:05 ip-10-0-0-4 systemd: Started Session c4 of user postgres.
>
> Nov 20 21:29:30 ip-10-0-0-4 crmd[2042]: notice: process_lrm_event: LRM
> operation pgsql_notify_0 (call=17, rc=0, cib-update=0, confirmed=true) ok
>
> Nov 20 21:29:30 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command: Initiating
> action 1: stop pgsql_stop_0 on pcmk-1 (local)
>
> Nov 20 21:29:30 ip-10-0-0-4 pacemakerd[2464]: notice: crm_add_logfile:
> Additional logging available in /var/log/pacemaker.log
>
> Nov 20 21:29:30 ip-10-0-0-4 su: (to postgres) root on none
>
> Nov 20 21:29:30 ip-10-0-0-4 systemd: Created slice user-26.slice.
>
> Nov 20 21:29:30 ip-10-0-0-4 systemd: Starting Session c5 of user postgres.
>
> Nov 20 21:29:30 ip-10-0-0-4 systemd: Started Session c5 of user postgres.
>
>
>
> Software versions:
>
> pacemaker 1.1.10
>
> resource-agents 3.9.5
>
> corosync 2.3.3-2
>
>
>
> Has anyone else experienced this issue or know of what might be going wrong
> to cause new sessions to be created?
>

Session is created by pam_systemd. What exact problem does it cause?
So far you did not mention any.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: Issues with ocf_run on CentOS7 with pgsql resource agent [ In reply to ]
-----Original Message-----
From: Andrei Borzenkov [mailto:arvidjaar@gmail.com]
Sent: Friday, November 21, 2014 7:13 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Issues with ocf_run on CentOS7 with pgsql resource agent

On Fri, Nov 21, 2014 at 12:36 AM, Brendan Reekie <breekie@sandvine.com> wrote:
>> I’m running into an issues with the pgsql resource agent running on CentOS7.
>> The issue is when the pgsql resource agent attempts a call to
>> runasowner it uses a su as user postgres in a call to ocf_run, this is
>> causing what appears to be a new session on CentOS7.
>>
>>
>>
>> We’ve narrowed it down to the ocf_run method in ocf-shellfuncs in
>> executing the command:
>>
>> output=`"$@" 2>&1`
>>
>>
>>
>> Snippet of /var/log/messages:
>>
>> Nov 20 21:29:05 ip-10-0-0-4 pengine[2041]: notice: LogActions: Stop
>> pgsql:0 (pcmk-1)
>>
>> Nov 20 21:29:05 ip-10-0-0-4 pengine[2041]: notice: process_pe_message:
>> Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-205.bz2
>>
>> Nov 20 21:29:05 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command:
>> Initiating action 47: notify pgsql_pre_notify_stop_0 on pcmk-1 (local)
>>
>> Nov 20 21:29:05 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command:
>> Initiating action 48: notify pgsql_pre_notify_stop_0 on pcmk-2
>>
>> Nov 20 21:29:05 ip-10-0-0-4 pacemakerd[2412]: notice: crm_add_logfile:
>> Additional logging available in /var/log/pacemaker.log
>>
>> Nov 20 21:29:05 ip-10-0-0-4 su: (to postgres) root on none
>>
>> Nov 20 21:29:05 ip-10-0-0-4 systemd: Created slice user-26.slice.
>>
>> Nov 20 21:29:05 ip-10-0-0-4 systemd: Starting Session c4 of user postgres.
>>
>> Nov 20 21:29:05 ip-10-0-0-4 systemd: Started Session c4 of user postgres.
>>
>> Nov 20 21:29:30 ip-10-0-0-4 crmd[2042]: notice: process_lrm_event: LRM
>> operation pgsql_notify_0 (call=17, rc=0, cib-update=0, confirmed=true)
>> ok
>>
>> Nov 20 21:29:30 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command:
>> Initiating action 1: stop pgsql_stop_0 on pcmk-1 (local)
>>
>> Nov 20 21:29:30 ip-10-0-0-4 pacemakerd[2464]: notice: crm_add_logfile:
>> Additional logging available in /var/log/pacemaker.log
>>
>> Nov 20 21:29:30 ip-10-0-0-4 su: (to postgres) root on none
>>
>> Nov 20 21:29:30 ip-10-0-0-4 systemd: Created slice user-26.slice.
>>
>> Nov 20 21:29:30 ip-10-0-0-4 systemd: Starting Session c5 of user postgres.
>>
>> Nov 20 21:29:30 ip-10-0-0-4 systemd: Started Session c5 of user postgres.
>>
>>
>>
>> Software versions:
>>
>> pacemaker 1.1.10
>>
>> resource-agents 3.9.5
>>
>> corosync 2.3.3-2
>>
>>
>>
>> Has anyone else experienced this issue or know of what might be going
>> wrong to cause new sessions to be created?
>>

>Session is created by pam_systemd. What exact problem does it cause?
>So far you did not mention any.

Sorry Andrei on not clarifying. The issue that we observed is that the pgsql resource agent will eventually timeout (if the defaults are left at 60 seconds). Increasing the timeout to several minutes the master/slave are correctly identified and started correctly.

In the above snippet from /var/log/messages it's taking approximately 25 seconds to execute when su to postgres.

It seems any call to the pgsql resource agent that invokes the runasowner encounters a 25 second delay. For the corresponding time range the /var/log/pacemaker.log file shows that the notify call 17 started at 21:29:05 but didn't complete to 21:29:30.

Relevant logs:
Nov 20 21:29:05 [2039] pcmk-1 lrmd: info: log_execute: executing - rsc:pgsql action:notify call_id:17
Nov 20 21:29:30 [2039] pcmk-1 lrmd: info: log_finished: finished - rsc:pgsql action:notify call_id:17 pid:2401 exit-code:0 exec-time:25130ms queue-time:1ms

Full log during that time:
Nov 20 21:29:05 [2039] pcmk-1 lrmd: info: log_finished: finished - rsc:pgsql action:notify call_id:16 pid:2320 exit-code:0 exec-time:25128ms queue-time:0ms
Nov 20 21:29:05 [2042] pcmk-1 crmd: info: match_graph_event: Action pgsql_notify_0 (50) confirmed on pcmk-1 (rc=0)
Nov 20 21:29:05 [2042] pcmk-1 crmd: notice: process_lrm_event: LRM operation pgsql_notify_0 (call=16, rc=0, cib-update=0, confirmed=true) ok
Nov 20 21:29:05 [2042] pcmk-1 crmd: info: match_graph_event: Action pgsql_notify_0 (51) confirmed on pcmk-2 (rc=0)
Nov 20 21:29:05 [2042] pcmk-1 crmd: notice: run_graph: Transition 0 (Complete=19, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-204.bz2): Stopped
Nov 20 21:29:05 [2042] pcmk-1 crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 20 21:29:05 [2037] pcmk-1 cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/30, version=0.291.20)
Nov 20 21:29:05 [2041] pcmk-1 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: determine_online_status: Node pcmk-1 is online
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: determine_online_status: Node pcmk-2 is online
Nov 20 21:29:05 [2041] pcmk-1 pengine: warning: unpack_rsc_op: Processing failed op start for pgsql:0 on pcmk-1: unknown error (1)
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: group_print: Resource Group: master-group
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_print: vip-master (ocf::heartbeat:IPaddr2): Stopped
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_print: vip-rep (ocf::heartbeat:IPaddr2): Stopped
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: clone_print: Master/Slave Set: msPostgresql [pgsql]
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_print: pgsql (ocf::heartbeat:pgsql): FAILED pcmk-1
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: short_print: Slaves: [ pcmk-2 ]
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: get_failcount_full: pgsql:0 has failed INFINITY times on pcmk-1
Nov 20 21:29:05 [2041] pcmk-1 pengine: warning: common_apply_stickiness: Forcing msPostgresql away from pcmk-1 after 1000000 failures (max=1)
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: get_failcount_full: pgsql:1 has failed INFINITY times on pcmk-1
Nov 20 21:29:05 [2041] pcmk-1 pengine: warning: common_apply_stickiness: Forcing msPostgresql away from pcmk-1 after 1000000 failures (max=1)
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_color: Resource pgsql:0 cannot run anywhere
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: master_color: msPostgresql: Promoted 0 instances of a possible 1 to master
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: rsc_merge_weights: vip-master: Rolling back scores from vip-rep
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_color: Resource vip-master cannot run anywhere
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_color: Resource vip-rep cannot run anywhere
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: RecurringOp: Start recurring monitor (4s) for pgsql:1 on pcmk-2
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: RecurringOp: Start recurring monitor (4s) for pgsql:1 on pcmk-2
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: LogActions: Leave vip-master (Stopped)
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: LogActions: Leave vip-rep (Stopped)
Nov 20 21:29:05 [2041] pcmk-1 pengine: notice: LogActions: Stop pgsql:0 (pcmk-1)
Nov 20 21:29:05 [2041] pcmk-1 pengine: info: LogActions: Leave pgsql:1 (Slave pcmk-2)
Nov 20 21:29:05 [2041] pcmk-1 pengine: notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-205.bz2
Nov 20 21:29:05 [2042] pcmk-1 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Nov 20 21:29:05 [2042] pcmk-1 crmd: info: do_te_invoke: Processing graph 1 (ref=pe_calc-dc-1416518945-23) derived from /var/lib/pacemaker/pengine/pe-input-205.bz2
Nov 20 21:29:05 [2042] pcmk-1 crmd: notice: te_rsc_command: Initiating action 47: notify pgsql_pre_notify_stop_0 on pcmk-1 (local)
Nov 20 21:29:05 [2042] pcmk-1 crmd: info: do_lrm_rsc_op: Performing key=47:1:0:804110e4-5de5-4a3a-9fd7-2fb96177d7ac op=pgsql_notify_0
Nov 20 21:29:05 [2039] pcmk-1 lrmd: info: log_execute: executing - rsc:pgsql action:notify call_id:17
Nov 20 21:29:05 [2042] pcmk-1 crmd: notice: te_rsc_command: Initiating action 48: notify pgsql_pre_notify_stop_0 on pcmk-2
Nov 20 21:29:05 [2412] pcmk-1 pacemakerd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/root
Nov 20 21:29:05 [2412] pcmk-1 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Nov 20 21:29:30 [2042] pcmk-1 crmd: info: match_graph_event: Action pgsql_notify_0 (48) confirmed on pcmk-2 (rc=0)
Nov 20 21:29:30 [2039] pcmk-1 lrmd: info: log_finished: finished - rsc:pgsql action:notify call_id:17 pid:2401 exit-code:0 exec-time:25130ms queue-time:1ms
Nov 20 21:29:30 [2042] pcmk-1 crmd: info: match_graph_event: Action pgsql_notify_0 (47) confirmed on pcmk-1 (rc=0)
Nov 20 21:29:30 [2042] pcmk-1 crmd: notice: process_lrm_event: LRM operation pgsql_notify_0 (call=17, rc=0, cib-update=0, confirmed=true) ok
Nov 20 21:29:30 [2042] pcmk-1 crmd: notice: te_rsc_command: Initiating action 1: stop pgsql_stop_0 on pcmk-1 (local)
Nov 20 21:29:30 [2042] pcmk-1 crmd: info: do_lrm_rsc_op: Performing key=1:1:0:804110e4-5de5-4a3a-9fd7-2fb96177d7ac op=pgsql_stop_0
Nov 20 21:29:30 [2039] pcmk-1 lrmd: info: log_execute: executing - rsc:pgsql action:stop call_id:18
Nov 20 21:29:30 [2464] pcmk-1 pacemakerd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/root
Nov 20 21:29:30 [2464] pcmk-1 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2


The other observation is that a call from a test script (ocftest) that calls the ocf_run with the same parameters as the pgsql resource agent doesn't encounter the same delays. The following is the test script that I've used:

#!/bin/sh

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs


OCF_RESKEY_pgdba="postgres"
OCF_RESKEY_pgdata="/var/lib/pgsql/9.3/data/"

#
# Run the given command in the Resource owner environment...
#
runasowner() {
local quietrun=""
local loglevel="-err"
local var

for var in 1 2
do
case "$1" in
"-q")
quietrun="-q"
shift 1;;
"warn"|"err")
loglevel="-$1"
shift 1;;
*)
;;
esac
done

ocf_log info "&&& su called runasowner: $*"
ocf_run $quietrun $loglevel su $OCF_RESKEY_pgdba -c "cd $OCF_RESKEY_pgdata; $*"
}

runasowner "test -w $OCF_RESKEY_pgdata";


When executing the script multiple times, the 25 second delay is not observed and nor is what appear to be a new session being created. Output of /var/log/messages:
Nov 21 14:54:23 ip-10-0-0-4 su: (to postgres) centos on none
Nov 21 14:54:24 ip-10-0-0-4 su: (to postgres) centos on none
Nov 21 14:54:25 ip-10-0-0-4 su: (to postgres) centos on none

Stdout capture:
[root@pcmk-1 heartbeat]# ./ocftest
INFO: &&& su called runasowner: test -w /var/lib/pgsql/9.3/data/
[root@pcmk-1 heartbeat]# ./ocftest
INFO: &&& su called runasowner: test -w /var/lib/pgsql/9.3/data/
[root@pcmk-1 heartbeat]# ./ocftest
INFO: &&& su called runasowner: test -w /var/lib/pgsql/9.3/data/

Any ideas on what would be causing the delays?
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: Issues with ocf_run on CentOS7 with pgsql resource agent [ In reply to ]
Hi,

On Fri, Nov 21, 2014 at 03:00:02PM +0000, Brendan Reekie wrote:
> -----Original Message-----
> From: Andrei Borzenkov [mailto:arvidjaar@gmail.com]
> Sent: Friday, November 21, 2014 7:13 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Issues with ocf_run on CentOS7 with pgsql resource agent
>
> On Fri, Nov 21, 2014 at 12:36 AM, Brendan Reekie <breekie@sandvine.com> wrote:
> >> I’m running into an issues with the pgsql resource agent running on CentOS7.
> >> The issue is when the pgsql resource agent attempts a call to
> >> runasowner it uses a su as user postgres in a call to ocf_run, this is
> >> causing what appears to be a new session on CentOS7.

Not sure what do you mean by "new session".

> >> We’ve narrowed it down to the ocf_run method in ocf-shellfuncs in
> >> executing the command:

Why do you think that it really has to do with the way it is run
and not with what is actually run (I suppose that it's some pgsql
call)?

Do you also know what was being executed at this point (the
content of $@)? I can see several runasowner() invocations the
RA. Could you also trace the RA (see: crm resource help trace)?
It should include the timestamps.

FWIW, it looks like ocf_run (which is cpu intensive) is a bit
abused in the RA. One example:

runasowner "kill -s 0 $PID >/dev/null 2>&1"

Thanks,

Dejan


> >> output=`"$@" 2>&1`
> >>
> >>
> >>
> >> Snippet of /var/log/messages:
> >>
> >> Nov 20 21:29:05 ip-10-0-0-4 pengine[2041]: notice: LogActions: Stop
> >> pgsql:0 (pcmk-1)
> >>
> >> Nov 20 21:29:05 ip-10-0-0-4 pengine[2041]: notice: process_pe_message:
> >> Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-205.bz2
> >>
> >> Nov 20 21:29:05 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command:
> >> Initiating action 47: notify pgsql_pre_notify_stop_0 on pcmk-1 (local)
> >>
> >> Nov 20 21:29:05 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command:
> >> Initiating action 48: notify pgsql_pre_notify_stop_0 on pcmk-2
> >>
> >> Nov 20 21:29:05 ip-10-0-0-4 pacemakerd[2412]: notice: crm_add_logfile:
> >> Additional logging available in /var/log/pacemaker.log
> >>
> >> Nov 20 21:29:05 ip-10-0-0-4 su: (to postgres) root on none
> >>
> >> Nov 20 21:29:05 ip-10-0-0-4 systemd: Created slice user-26.slice.
> >>
> >> Nov 20 21:29:05 ip-10-0-0-4 systemd: Starting Session c4 of user postgres.
> >>
> >> Nov 20 21:29:05 ip-10-0-0-4 systemd: Started Session c4 of user postgres.
> >>
> >> Nov 20 21:29:30 ip-10-0-0-4 crmd[2042]: notice: process_lrm_event: LRM
> >> operation pgsql_notify_0 (call=17, rc=0, cib-update=0, confirmed=true)
> >> ok
> >>
> >> Nov 20 21:29:30 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command:
> >> Initiating action 1: stop pgsql_stop_0 on pcmk-1 (local)
> >>
> >> Nov 20 21:29:30 ip-10-0-0-4 pacemakerd[2464]: notice: crm_add_logfile:
> >> Additional logging available in /var/log/pacemaker.log
> >>
> >> Nov 20 21:29:30 ip-10-0-0-4 su: (to postgres) root on none
> >>
> >> Nov 20 21:29:30 ip-10-0-0-4 systemd: Created slice user-26.slice.
> >>
> >> Nov 20 21:29:30 ip-10-0-0-4 systemd: Starting Session c5 of user postgres.
> >>
> >> Nov 20 21:29:30 ip-10-0-0-4 systemd: Started Session c5 of user postgres.
> >>
> >>
> >>
> >> Software versions:
> >>
> >> pacemaker 1.1.10
> >>
> >> resource-agents 3.9.5
> >>
> >> corosync 2.3.3-2
> >>
> >>
> >>
> >> Has anyone else experienced this issue or know of what might be going
> >> wrong to cause new sessions to be created?
> >>
>
> >Session is created by pam_systemd. What exact problem does it cause?
> >So far you did not mention any.
>
> Sorry Andrei on not clarifying. The issue that we observed is that the pgsql resource agent will eventually timeout (if the defaults are left at 60 seconds). Increasing the timeout to several minutes the master/slave are correctly identified and started correctly.
>
> In the above snippet from /var/log/messages it's taking approximately 25 seconds to execute when su to postgres.
>
> It seems any call to the pgsql resource agent that invokes the runasowner encounters a 25 second delay. For the corresponding time range the /var/log/pacemaker.log file shows that the notify call 17 started at 21:29:05 but didn't complete to 21:29:30.
>
> Relevant logs:
> Nov 20 21:29:05 [2039] pcmk-1 lrmd: info: log_execute: executing - rsc:pgsql action:notify call_id:17
> Nov 20 21:29:30 [2039] pcmk-1 lrmd: info: log_finished: finished - rsc:pgsql action:notify call_id:17 pid:2401 exit-code:0 exec-time:25130ms queue-time:1ms
>
> Full log during that time:
> Nov 20 21:29:05 [2039] pcmk-1 lrmd: info: log_finished: finished - rsc:pgsql action:notify call_id:16 pid:2320 exit-code:0 exec-time:25128ms queue-time:0ms
> Nov 20 21:29:05 [2042] pcmk-1 crmd: info: match_graph_event: Action pgsql_notify_0 (50) confirmed on pcmk-1 (rc=0)
> Nov 20 21:29:05 [2042] pcmk-1 crmd: notice: process_lrm_event: LRM operation pgsql_notify_0 (call=16, rc=0, cib-update=0, confirmed=true) ok
> Nov 20 21:29:05 [2042] pcmk-1 crmd: info: match_graph_event: Action pgsql_notify_0 (51) confirmed on pcmk-2 (rc=0)
> Nov 20 21:29:05 [2042] pcmk-1 crmd: notice: run_graph: Transition 0 (Complete=19, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-204.bz2): Stopped
> Nov 20 21:29:05 [2042] pcmk-1 crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> Nov 20 21:29:05 [2037] pcmk-1 cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/30, version=0.291.20)
> Nov 20 21:29:05 [2041] pcmk-1 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: determine_online_status: Node pcmk-1 is online
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: determine_online_status: Node pcmk-2 is online
> Nov 20 21:29:05 [2041] pcmk-1 pengine: warning: unpack_rsc_op: Processing failed op start for pgsql:0 on pcmk-1: unknown error (1)
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: group_print: Resource Group: master-group
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_print: vip-master (ocf::heartbeat:IPaddr2): Stopped
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_print: vip-rep (ocf::heartbeat:IPaddr2): Stopped
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: clone_print: Master/Slave Set: msPostgresql [pgsql]
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_print: pgsql (ocf::heartbeat:pgsql): FAILED pcmk-1
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: short_print: Slaves: [ pcmk-2 ]
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: get_failcount_full: pgsql:0 has failed INFINITY times on pcmk-1
> Nov 20 21:29:05 [2041] pcmk-1 pengine: warning: common_apply_stickiness: Forcing msPostgresql away from pcmk-1 after 1000000 failures (max=1)
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: get_failcount_full: pgsql:1 has failed INFINITY times on pcmk-1
> Nov 20 21:29:05 [2041] pcmk-1 pengine: warning: common_apply_stickiness: Forcing msPostgresql away from pcmk-1 after 1000000 failures (max=1)
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_color: Resource pgsql:0 cannot run anywhere
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: master_color: msPostgresql: Promoted 0 instances of a possible 1 to master
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: rsc_merge_weights: vip-master: Rolling back scores from vip-rep
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_color: Resource vip-master cannot run anywhere
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: native_color: Resource vip-rep cannot run anywhere
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: RecurringOp: Start recurring monitor (4s) for pgsql:1 on pcmk-2
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: RecurringOp: Start recurring monitor (4s) for pgsql:1 on pcmk-2
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: LogActions: Leave vip-master (Stopped)
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: LogActions: Leave vip-rep (Stopped)
> Nov 20 21:29:05 [2041] pcmk-1 pengine: notice: LogActions: Stop pgsql:0 (pcmk-1)
> Nov 20 21:29:05 [2041] pcmk-1 pengine: info: LogActions: Leave pgsql:1 (Slave pcmk-2)
> Nov 20 21:29:05 [2041] pcmk-1 pengine: notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-205.bz2
> Nov 20 21:29:05 [2042] pcmk-1 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Nov 20 21:29:05 [2042] pcmk-1 crmd: info: do_te_invoke: Processing graph 1 (ref=pe_calc-dc-1416518945-23) derived from /var/lib/pacemaker/pengine/pe-input-205.bz2
> Nov 20 21:29:05 [2042] pcmk-1 crmd: notice: te_rsc_command: Initiating action 47: notify pgsql_pre_notify_stop_0 on pcmk-1 (local)
> Nov 20 21:29:05 [2042] pcmk-1 crmd: info: do_lrm_rsc_op: Performing key=47:1:0:804110e4-5de5-4a3a-9fd7-2fb96177d7ac op=pgsql_notify_0
> Nov 20 21:29:05 [2039] pcmk-1 lrmd: info: log_execute: executing - rsc:pgsql action:notify call_id:17
> Nov 20 21:29:05 [2042] pcmk-1 crmd: notice: te_rsc_command: Initiating action 48: notify pgsql_pre_notify_stop_0 on pcmk-2
> Nov 20 21:29:05 [2412] pcmk-1 pacemakerd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/root
> Nov 20 21:29:05 [2412] pcmk-1 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Nov 20 21:29:30 [2042] pcmk-1 crmd: info: match_graph_event: Action pgsql_notify_0 (48) confirmed on pcmk-2 (rc=0)
> Nov 20 21:29:30 [2039] pcmk-1 lrmd: info: log_finished: finished - rsc:pgsql action:notify call_id:17 pid:2401 exit-code:0 exec-time:25130ms queue-time:1ms
> Nov 20 21:29:30 [2042] pcmk-1 crmd: info: match_graph_event: Action pgsql_notify_0 (47) confirmed on pcmk-1 (rc=0)
> Nov 20 21:29:30 [2042] pcmk-1 crmd: notice: process_lrm_event: LRM operation pgsql_notify_0 (call=17, rc=0, cib-update=0, confirmed=true) ok
> Nov 20 21:29:30 [2042] pcmk-1 crmd: notice: te_rsc_command: Initiating action 1: stop pgsql_stop_0 on pcmk-1 (local)
> Nov 20 21:29:30 [2042] pcmk-1 crmd: info: do_lrm_rsc_op: Performing key=1:1:0:804110e4-5de5-4a3a-9fd7-2fb96177d7ac op=pgsql_stop_0
> Nov 20 21:29:30 [2039] pcmk-1 lrmd: info: log_execute: executing - rsc:pgsql action:stop call_id:18
> Nov 20 21:29:30 [2464] pcmk-1 pacemakerd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/root
> Nov 20 21:29:30 [2464] pcmk-1 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>
>
> The other observation is that a call from a test script (ocftest) that calls the ocf_run with the same parameters as the pgsql resource agent doesn't encounter the same delays. The following is the test script that I've used:
>
> #!/bin/sh
>
> : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
> . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
>
>
> OCF_RESKEY_pgdba="postgres"
> OCF_RESKEY_pgdata="/var/lib/pgsql/9.3/data/"
>
> #
> # Run the given command in the Resource owner environment...
> #
> runasowner() {
> local quietrun=""
> local loglevel="-err"
> local var
>
> for var in 1 2
> do
> case "$1" in
> "-q")
> quietrun="-q"
> shift 1;;
> "warn"|"err")
> loglevel="-$1"
> shift 1;;
> *)
> ;;
> esac
> done
>
> ocf_log info "&&& su called runasowner: $*"
> ocf_run $quietrun $loglevel su $OCF_RESKEY_pgdba -c "cd $OCF_RESKEY_pgdata; $*"
> }
>
> runasowner "test -w $OCF_RESKEY_pgdata";
>
>
> When executing the script multiple times, the 25 second delay is not observed and nor is what appear to be a new session being created. Output of /var/log/messages:
> Nov 21 14:54:23 ip-10-0-0-4 su: (to postgres) centos on none
> Nov 21 14:54:24 ip-10-0-0-4 su: (to postgres) centos on none
> Nov 21 14:54:25 ip-10-0-0-4 su: (to postgres) centos on none
>
> Stdout capture:
> [root@pcmk-1 heartbeat]# ./ocftest
> INFO: &&& su called runasowner: test -w /var/lib/pgsql/9.3/data/
> [root@pcmk-1 heartbeat]# ./ocftest
> INFO: &&& su called runasowner: test -w /var/lib/pgsql/9.3/data/
> [root@pcmk-1 heartbeat]# ./ocftest
> INFO: &&& su called runasowner: test -w /var/lib/pgsql/9.3/data/
>
> Any ideas on what would be causing the delays?
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org