Mailing List Archive

pgsql resource agent in status "Stopped" after crm resource cleanup
Hi

I'm currently building a two-node, DRBD-backed PostgreSQL setup on Debian
Wheezy and I'm testing how Pacemaker reacts to specific failure scenarios.

One thing I tested is currently driving me crazy. When I manually stop
PostgreSQL through pg_ctl, or just kill the master process to simulate a
crash, the pgsql resource agent correctly detects the error and restarts
PostgreSQL.

The problem I have arises when I later call 'crm resource cleanup
pgsql' to delete the failcount and the failed tasks: the pgsql resource
shows up as Stopped, but in reality it is still running fine. I have
the same problem when I delete the failcount separately and then do the
cleanup.

The problem seems to be that pgsql_monitor runs into a timeout:
Feb 21 12:47:59 vm-db-01 crmd: [6494]: WARN: cib_action_update:
rsc_op 44: pgsql_monitor_30000 on vm-db-01 timed out

After the timeout pgsql is being restarted, and the interesting thing
is that I can delete the failed action from the timeout without a
problem.

Does anyone have an idea what the problem could be in this case?

Best regards
Lukas

--
Adfinis SyGroup AG
Lukas Grossar, System Engineer

Keltenstrasse 98 | CH-3018 Bern
Tel. 031 550 31 11 | Direkt 031 550 31 06
Re: pgsql resource agent in status "Stopped" after crm resource cleanup
On 21 Feb 2014, at 10:55 pm, Lukas Grossar <lukas.grossar@adfinis-sygroup.ch> wrote:

> Hi
>
> I'm currently building a two-node, DRBD-backed PostgreSQL setup on Debian
> Wheezy and I'm testing how Pacemaker reacts to specific failure scenarios.
>
> One thing I tested is currently driving me crazy. When I manually stop
> PostgreSQL through pg_ctl, or just kill the master process to simulate a
> crash, the pgsql resource agent correctly detects the error and restarts
> PostgreSQL.
>
> The problem I have arises when I later call 'crm resource cleanup
> pgsql' to delete the failcount and the failed tasks: the pgsql resource
> shows up as Stopped, but in reality it is still running fine. I have
> the same problem when I delete the failcount separately and then do the
> cleanup.
>
> The problem seems to be that pgsql_monitor runs into a timeout:
> Feb 21 12:47:59 vm-db-01 crmd: [6494]: WARN: cib_action_update:
> rsc_op 44: pgsql_monitor_30000 on vm-db-01 timed out
>
> After the timeout pgsql is being restarted, and the interesting thing
> is that I can delete the failed action from the timeout without a
> problem.
>
> Does anyone have an idea what the problem could be in this case?

Not without more logs. You'd probably want to turn on 'set -x' in the resource agent to see why it can't complete.
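
As a rough sketch of what that tracing could look like (this is only an
illustration, not code from the real pgsql agent: the trace file path and
the demo_monitor function are made up, and in practice you would put the
redirect and 'set -x' near the top of your copy of the agent script):

```shell
#!/bin/sh
# Hypothetical trace file; any path the cluster user can write to works.
TRACE_LOG="${TRACE_LOG:-/tmp/pgsql-ra-trace.log}"

# Stand-in for the agent's monitor action (assumption, not the real code).
demo_monitor() {
    status="running"
    echo "pgsql is $status"
}

# Run the action in a subshell so the redirect and tracing stay contained:
# stderr is appended to the trace file, then 'set -x' makes the shell write
# every command it executes to stderr before running it.
( exec 2>>"$TRACE_LOG"; set -x; demo_monitor )
```

After the next monitor run, the trace file should show each command the
action executed, which usually makes it clear where it hangs or bails out.
Remember to remove the tracing again afterwards, since the file grows with
every operation.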