Mailing List Archive

A question of pgsql resource
Hi All,

I'm using the pgsql resource agent (resource-agents-3.9.5-9) on Fedora 20.

I'm testing various failure patterns in a pgsql replicated cluster using it.

It seems that if the master PostgreSQL process is suspended for a long time,
the monitor and demote operations time out, and the cluster cannot fail over until the process resumes.

-----Cluster status after the master demotion timed out-----
Online: [ server1 server2 ]

 Master/Slave Set: msPostgresql [pgsql]
     pgsql (ocf::heartbeat:pgsql): FAILED server2
     Stopped: [ server1 ]
 Clone Set: ping-gw-rsc-clone [ping-gw-rsc]
     Started: [ server1 server2 ]

Node Attributes:
* Node server1:
    + master-pgsql      : -INFINITY
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status      : STOP
    + ping-gw1          : 100
* Node server2:
    + master-pgsql      : -INFINITY
    + pgsql-data-status : LATEST
    + pgsql-status      : PRI
    + ping-gw1          : 100

Migration summary:
* Node server1:
* Node server2:
    pgsql: migration-threshold=1 fail-count=2 last-failure='Fri Apr 11 14:07:43 2014'

Failed actions:
    pgsql_demote_0 on server2 'unknown error' (1): call=77, status=Timed Out, last-rc-change='Fri Apr 11 14:06:43 2014', queued=1ms, exec=60001ms
-------------------------------------------------------

I think pgsql_real_stop() should send SIGKILL to PostgreSQL when the immediate shutdown command (pg_ctl -m i) times out.

What do you think about this idea?

Regards,

Naoya

---
Naoya Anzai
Engineering Department
NEC Solution Innovators, Ltd.
E-Mail: anzai-naoya@mxu.nes.nec.co.jp
---


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: A question of pgsql resource
Hi Anzai-san

2014-05-16 14:01 GMT+09:00 Naoya Anzai <anzai-naoya@mxu.nes.nec.co.jp>:
> I think pgsql_real_stop() should send SIGKILL to PostgreSQL when the immediate shutdown command (pg_ctl -m i) times out.
>
> What do you think about this idea?

I think it makes sense.
But I would like you to keep the current stop procedure,
because I think it's safer to use STONITH.
If you implement this, could you add it as a new parameter?


BTW, is it true that the cause of the timeout is not the "while" loop but "pg_ctl (-m i)"?
If it is "pg_ctl (-m i)", you need to use a timeout parameter, or you can use
exec_with_timeout().
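
As a rough illustration of bounding a command's run time, here is a sketch built on the coreutils timeout command. run_with_timeout is a hypothetical wrapper and is not the agent's exec_with_timeout(), whose implementation is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical wrapper, NOT the agent's exec_with_timeout().
# Runs a command under coreutils `timeout` (default signal: SIGTERM).
# Returns the command's exit status, or 124 if the limit was reached.
run_with_timeout() {
    limit=$1
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}
```

A call might look like `run_with_timeout 10 pg_ctl -D "$OCF_RESKEY_pgdata" -m i stop`. pg_ctl also has its own -t option, which limits how long it waits when -w is given.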

Thanks,
Takatoshi MATSUO
Re: A question of pgsql resource
Hi Matsuo-san

Thank you for your response.

>But I would like you to keep the current stop procedure,
>because I think it's safer to use STONITH.
>If you implement this, could you add it as a new parameter?

Sorry,
I don't fully understand Pacemaker yet.

In Pacemaker,
is there a feature that makes a node execute STONITH when
a resource agent on the other node times out?

or

Is there a feature that makes a node commit "harakiri" when
its own resource agent times out?

I was thinking that the resource agent must implement these functions,
because Pacemaker does not provide them.
Is this wrong?

>BTW, is it true that the cause of the timeout is not the "while" loop but "pg_ctl (-m i)"?
>If it is "pg_ctl (-m i)", you need to use a timeout parameter, or you can use
>exec_with_timeout().

In my test,
I sent SIGSTOP to all of the master's PostgreSQL processes:
# killall -SIGSTOP postgres
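
To confirm that a process really is suspended during such a test, its state can be read from /proc. This is a Linux-only sketch, and proc_state is a hypothetical helper name.

```shell
#!/bin/sh
# Linux-only sketch: print the kernel's one-letter state for a PID,
# taken from the "State:" line of /proc/<pid>/status.
# "T" means the process is stopped by a signal such as SIGSTOP.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status" 2>/dev/null
}

# Example with a throwaway process:
#   sleep 300 &
#   kill -STOP $!; sleep 1; proc_state $!   # prints T while suspended
#   kill -CONT $!                           # resume it again
```

The suspended PostgreSQL processes can be resumed afterwards with `killall -SIGCONT postgres`.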

In fact,
it looks like the resource agent timed out in pgsql_real_monitor() after pg_ctl (-m i) had timed out.

--- actual point where the pgsql resource agent hangs ----

output=`su $OCF_RESKEY_pgdba -c "cd $OCF_RESKEY_pgdata; \
    $OCF_RESKEY_psql $psql_options -U $OCF_RESKEY_pgdba \
    -Atc \"${CHECK_MS_SQL}\""`

-----------------------------------------------
It's the psql call.


Regards,

Naoya

---
Naoya Anzai
Engineering Department
NEC Solution Innovators, Ltd.
E-Mail: anzai-naoya@mxu.nes.nec.co.jp
---



Re: A question of pgsql resource
On 19 May 2014, at 8:12 pm, Naoya Anzai <anzai-naoya@mxu.nes.nec.co.jp> wrote:

> Hi Matsuo-san
>
> Thank you for your response.
>
>> But I would like you to keep the current stop procedure,
>> because I think it's safer to use STONITH.
>> If you implement this, could you add it as a new parameter?
>
> Sorry,
> I don't fully understand Pacemaker yet.
>
> In Pacemaker,
> is there a feature that makes a node execute STONITH when
> a resource agent on the other node times out?

Yes. Specify on-fail=fence for that operation.

Re: A question of pgsql resource
Hi Andrew and Matsuo-san

> > In Pacemaker,
> > is there a feature that makes a node execute STONITH when
> > a resource agent on the other node times out?
>
> Yes. Specify on-fail=fence for that operation.

I configured a STONITH device (fence_ipmilan) on each node
and changed on-fail for the pgsql resource's stop operation from "block" to "fence".

Now failover on a resource timeout works fine!

So I think pgsql_real_stop() doesn't need to be modified,
because Pacemaker already provides this feature.
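
For reference, a change of this kind could look roughly like the following, assuming the pcs command-line tool and the resource name "pgsql" from the status output earlier in the thread. The 60s timeout is an assumption, not a value taken from the actual configuration, and the fence device itself must already be defined.

```shell
# Sketch only: assumes pcs and a primitive named "pgsql".
# Fencing must be enabled for on-fail=fence to have any effect.
pcs property set stonith-enabled=true

# Fence the node if the stop operation fails or times out
# (timeout=60s is an assumed value; keep the one already configured).
pcs resource update pgsql op stop interval=0s timeout=60s on-fail=fence
```

With crmsh, the equivalent would be setting on-fail="fence" in the "op stop" clause of the primitive definition.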

Thanks a lot!

Naoya

---
Naoya Anzai
Engineering Department
NEC Solution Innovators, Ltd.
E-Mail: anzai-naoya@mxu.nes.nec.co.jp
---