Mailing List Archive

Occasional nonsensical resource agent errors, redux
Hi all:

In July, list member Ken Gaillot reported occasional nonsensical resource
agent errors using Pacemaker (
http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).

We're seeing similar issues with our install. We have a 2 node
corosync/pacemaker failover configuration that is using the
ocf:heartbeat:IPaddr2
resource agent extensively. About once a week, we'll get an error like
this, out of the blue:

Nov 1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't
find command: ip

It goes without saying that the ip command hasn't gone anywhere and all the
paths are configured correctly.

We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64
inside of a xen container.

Any thoughts from folks on what might be happening or how we can get
additional debug information to help figure out what's triggering this?

Best,
--
Patrick Michael Kane
<pmk@wawd.com>
Re: Occasional nonsensical resource agent errors, redux
> On 1 Nov 2014, at 11:03 pm, Patrick Kane <pmk@wawd.com> wrote:
>
> Hi all:
>
> In July, list member Ken Gaillot reported occasional nonsensical resource agent errors using Pacemaker (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
>
> We're seeing similar issues with our install. We have a 2 node corosync/pacemaker failover configuration that is using the ocf:heartbeat:IPaddr2 resource agent extensively. About once a week, we'll get an error like this, out of the blue:
>
> Nov 1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip
>
> It goes without saying that the ip command hasn't gone anywhere and all the paths are configured correctly.
>
> We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 inside of a xen container.
>
> Any thoughts from folks on what might be happening or how we can get additional debug information to help figure out what's triggering this?

It's pretty much in the hands of the agent.
You could perhaps find the call that looks for ip and wrap it in a
set -x/set +x block; that way you'd know exactly why it thinks the
binary is missing.
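
Something like this, roughly (untested; the exact spot and the variable
name are assumptions that depend on your resource-agents version):

    # in /usr/lib/ocf/resource.d/heartbeat/IPaddr2, around whatever checks for ip
    set -x                   # trace each command to stderr, which ends up in the lrmd log
    check_binary $IP2UTIL    # the call that logs "couldn't find command: ip"
    set +x
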
Re: Occasional nonsensical resource agent errors, redux
On Mon, 3 Nov 2014 13:32:45 +1100, Andrew Beekhof <andrew@beekhof.net> wrote:

>
> > On 1 Nov 2014, at 11:03 pm, Patrick Kane <pmk@wawd.com> wrote:
> >
> > Hi all:
> >
> > In July, list member Ken Gaillot reported occasional nonsensical resource agent errors using Pacemaker (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
> >
> > We're seeing similar issues with our install. We have a 2 node corosync/pacemaker failover configuration that is using the ocf:heartbeat:IPaddr2 resource agent extensively. About once a week, we'll get an error like this, out of the blue:
> >
> > Nov 1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip
> >
> > It goes without saying that the ip command hasn't gone anywhere and all the paths are configured correctly.
> >
> > We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 inside of a xen container.
> >
> > Any thoughts from folks on what might be happening or how we can get additional debug information to help figure out what's triggering this?
>
> its pretty much in the hands of the agent.

Actually the message seems to be output by the check_binary()
function, which is part of the framework.
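
For reference, paraphrasing ocf-binaries from memory (check your
installed copy, it may differ slightly):

    have_binary() {
        local bin=`echo $1 | sed -e 's/ -.*//'`
        test -x "`which $bin 2>/dev/null`"
    }

    check_binary() {
        if ! have_binary "$1"; then
            ocf_log err "Setup problem: couldn't find command: $1"
            exit $OCF_ERR_INSTALLED
        fi
    }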

> you could perhaps find the call that looks for ip and wrap it in a set -x/set +x block
> that way you'd know exactly why it thinks the binary is missing


Re: Occasional nonsensical resource agent errors, redux
Hi,

On Mon, Nov 03, 2014 at 08:46:00AM +0300, Andrei Borzenkov wrote:
> On Mon, 3 Nov 2014 13:32:45 +1100, Andrew Beekhof <andrew@beekhof.net> wrote:
>
> >
> > > On 1 Nov 2014, at 11:03 pm, Patrick Kane <pmk@wawd.com> wrote:
> > >
> > > Hi all:
> > >
> > > In July, list member Ken Gaillot reported occasional nonsensical resource agent errors using Pacemaker (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
> > >
> > > We're seeing similar issues with our install. We have a 2 node corosync/pacemaker failover configuration that is using the ocf:heartbeat:IPaddr2 resource agent extensively. About once a week, we'll get an error like this, out of the blue:
> > >
> > > Nov 1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip
> > >
> > > It goes without saying that the ip command hasn't gone anywhere and all the paths are configured correctly.
> > >
> > > We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 inside of a xen container.
> > >
> > > Any thoughts from folks on what might be happening or how we can get additional debug information to help figure out what's triggering this?
> >
> > its pretty much in the hands of the agent.
>
> Actually the message seems to be output by check_binary() function
> which is part of framework.

Someone complained on IRC about this issue (another resource
agent though, I think Xen) and they said that which(1) was not
able to find the program. I'd suggest running strace (or ltrace)
on which(1) at that point (it's in ocf-shellfuncs).

The which(1) utility is a simple tool: it splits the PATH
environment variable and stats the program name appended to each
of the paths. PATH somehow corrupted or filesystem misbehaving?
My guess is that it's the former.
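
I.e., roughly this and nothing more (an illustration, not the real code):

    # more or less what which(1) does for "ip"
    oldifs=$IFS; IFS=:
    for dir in $PATH; do
        if [ -x "$dir/ip" ]; then
            echo "$dir/ip"
            break
        fi
    done
    IFS=$oldifs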

BTW, was there an upgrade of some kind before this started
happening?

Thanks,

Dejan

> > you could perhaps find the call that looks for ip and wrap it in a set -x/set +x block
> > that way you'd know exactly why it thinks the binary is missing

Re: Occasional nonsensical resource agent errors, redux
On Mon, 3 Nov 2014 15:26:34 +0100, Dejan Muhamedagic <dejanmm@fastmail.fm> wrote:

> Hi,
>
> On Mon, Nov 03, 2014 at 08:46:00AM +0300, Andrei Borzenkov wrote:
> > On Mon, 3 Nov 2014 13:32:45 +1100, Andrew Beekhof <andrew@beekhof.net> wrote:
> >
> > >
> > > > On 1 Nov 2014, at 11:03 pm, Patrick Kane <pmk@wawd.com> wrote:
> > > >
> > > > Hi all:
> > > >
> > > > In July, list member Ken Gaillot reported occasional nonsensical resource agent errors using Pacemaker (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
> > > >
> > > > We're seeing similar issues with our install. We have a 2 node corosync/pacemaker failover configuration that is using the ocf:heartbeat:IPaddr2 resource agent extensively. About once a week, we'll get an error like this, out of the blue:
> > > >
> > > > Nov 1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip
> > > >
> > > > It goes without saying that the ip command hasn't gone anywhere and all the paths are configured correctly.
> > > >
> > > > We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 inside of a xen container.
> > > >
> > > > Any thoughts from folks on what might be happening or how we can get additional debug information to help figure out what's triggering this?
> > >
> > > its pretty much in the hands of the agent.
> >
> > Actually the message seems to be output by check_binary() function
> > which is part of framework.
>
> Someone complained in the IRC about this issue (another resource
> agent though, I think Xen) and they said that which(1) was not
> able to find the program. I'd suggest to do strace (or ltrace)
> of which(1) at that point (it's in ocf-shellfuncs).
>
> The which(1) utility is a simple tool: it splits the PATH
> environment variable and stats the program name appended to each
> of the paths. PATH somehow corrupted or filesystem misbehaving?
> My guess is that it's the former.
>

As it is called quite often, I'd instrument have_binary to dump the
full environment and shell variables when "which" fails for a binary
we know is installed, and also rerun it under strace at that point.
Running it under strace every time would probably produce far too
much output.
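
Something along these lines (untested sketch; the binary list and the
log paths are only examples):

    have_binary() {
        local bin=`echo $1 | sed -e 's/ -.*//'`
        if ! which "$bin" >/dev/null 2>&1; then
            case "$bin" in
            ip|xm|awk)    # binaries we know are installed
                # dump environment and shell variables for later inspection
                {
                    date
                    echo "PATH=$PATH"
                    env
                    set
                } >>/var/log/have_binary-debug.log 2>&1
                # rerun the failing lookup under strace
                strace -f -o /var/log/which-strace.$$.log which "$bin" >/dev/null 2>&1
                ;;
            esac
        fi
        test -x "`which $bin 2>/dev/null`"
    }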

> BTW, was there an upgrade of some kind before this started
> happening?
>
> Thanks,
>
> Dejan
>
> > > you could perhaps find the call that looks for ip and wrap it in a set -x/set +x block
> > > that way you'd know exactly why it thinks the binary is missing


Re: Occasional nonsensical resource agent errors, redux
On 11/03/2014 09:26 AM, Dejan Muhamedagic wrote:
> On Mon, Nov 03, 2014 at 08:46:00AM +0300, Andrei Borzenkov wrote:
>> On Mon, 3 Nov 2014 13:32:45 +1100, Andrew Beekhof <andrew@beekhof.net> wrote:
>>
>>>
>>>> On 1 Nov 2014, at 11:03 pm, Patrick Kane <pmk@wawd.com> wrote:
>>>>
>>>> Hi all:
>>>>
>>>> In July, list member Ken Gaillot reported occasional nonsensical resource agent errors using Pacemaker (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
>>>>
>>>> We're seeing similar issues with our install. We have a 2 node corosync/pacemaker failover configuration that is using the ocf:heartbeat:IPaddr2 resource agent extensively. About once a week, we'll get an error like this, out of the blue:
>>>>
>>>> Nov 1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip
>>>>
>>>> It goes without saying that the ip command hasn't gone anywhere and all the paths are configured correctly.
>>>>
>>>> We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 inside of a xen container.
>>>>
>>>> Any thoughts from folks on what might be happening or how we can get additional debug information to help figure out what's triggering this?
>>>
>>> its pretty much in the hands of the agent.
>>
>> Actually the message seems to be output by check_binary() function
>> which is part of framework.
>
> Someone complained in the IRC about this issue (another resource
> agent though, I think Xen) and they said that which(1) was not
> able to find the program. I'd suggest to do strace (or ltrace)
> of which(1) at that point (it's in ocf-shellfuncs).
>
> The which(1) utility is a simple tool: it splits the PATH
> environment variable and stats the program name appended to each
> of the paths. PATH somehow corrupted or filesystem misbehaving?
> My guess is that it's the former.
>
> BTW, was there an upgrade of some kind before this started
> happening?

I was hoping to have something useful before posting another update, but
since it's come up again, here's what we've found so far:

* The most common manifestation is the "couldn't find command" error. In
various instances it "couldn't find" xm, ip or awk. However, we've seen
two other variations:

lrmd: [3363]: info: RA output: (pan:monitor:stderr) en-destroy: bad
variable name

and

lrmd: [2145]: info: RA output: (ldap-ip:monitor:stderr)
/usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1:
/usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied

The RA in the first case does not use the string "en-destroy" at all; it
does call a command "xen-destroy". That, to me, is a strong suggestion
of memory corruption somewhere, whether in the RA, the shell, lrmd or a
library used by one of those.

* I have not found any bugs in the RA or its included files.

* I tried setting "debug: on" in corosync.conf, but that did not give
any additional useful information. The resource agent error is still the
first unusual message in the sequence. Here is an example, giving one
successful monitor run and then an occurrence of the issue (the nodes
are a pair of Xen dom0s including pisces, running two Xen domU resources
pan and nemesis):

Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 372)
Sep 13 20:16:56 pisces lrmd: [372]: debug: perform_ra_op: resetting
scheduler class to SCHED_OTHER
Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:nemesis monitor[32] (pid
409)
Sep 13 20:16:56 pisces lrmd: [409]: debug: perform_ra_op: resetting
scheduler class to SCHED_OTHER
Sep 13 20:16:56 pisces lrmd: [3509]: info: operation monitor[21] on pan
for client 3512: pid 372 exited with return code 0
Sep 13 20:16:57 pisces lrmd: [3509]: info: operation monitor[32] on
nemesis for client 3512: pid 409 exited with return code 0
Sep 13 20:17:06 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 455)
Sep 13 20:17:06 pisces lrmd: [455]: debug: perform_ra_op: resetting
scheduler class to SCHED_OTHER
Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output:
(pan:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output:
(pan:monitor:stderr) en-destroy: bad variable name
Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr)
Sep 13 20:17:07 pisces lrmd: [3509]: info: operation monitor[21] on pan
for client 3512: pid 455 exited with return code 2

* I tried reverting several security updates applied in the month or so
before we first saw the issue. Reverting the Debian kernel packages to
3.2.57-3 and then 3.2.54-2 did not help, nor did reverting libxml2 to
libxml2 2.8.0+dfsg1-7+nmu2. None of the other updates from that time
look like they could have any effect.

* Regarding libxml2, I did find that Debian had backported an upstream
patch into its 2.8.0+dfsg1-7+nmu3 that introduced a memory corruption
bug, which upstream later corrected (the bug never made it into an
upstream release, but Debian had backported a specific changeset). I
submitted that as Debian Bug #765770 which was just fixed last week. I
haven't had a chance to apply that to the affected servers yet, but as
mentioned above, reverting to the libxml2 before the introduced bug did
not fix the issue.

* I have not found a way to intentionally reproduce the issue. :-( We
have had 10 occurrences across 3 two-node clusters in five months. Some
of the nodes have had only one occurrence during that time, but one pair
gets most of them. With the time between occurrences, it's hard to
do something like strace on lrmd, though that's probably a good way
forward, scripting something to deal with the output reasonably (a
rough sketch of what I have in mind is at the end of this message).

* There does not seem to be any correlation with how long the node has
been up. Checking RAM usage of corosync and lrmd on all nodes over about
two weeks shows little to no change, so I don't suspect a leak. Most of
our errors have occurred in the Xen RA, but probably only because that's
the RA we use most; we've also seen it in IPaddr2.

* My next idea would be to compile/install the latest versions of at
least pacemaker and the resource agents. However I am in the middle of
changing jobs, and unfortunately do not have much time left for this. My
new job will have plenty of time to spend on pacemaker ;-) so I may be
able to give updates later. Debian's "jessie" release freezes this week,
so I'm hoping that I will have time to at least get a test cluster up
running the somewhat newer versions in that (pacemaker 1.1.10, corosync
1.4.6).
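
Re the strace-on-lrmd idea above, what I have in mind is roughly this
(untested; the paths and retention are arbitrary):

    # follow lrmd and everything it forks (one output file per pid), so the
    # RA's shell gets its own trace; -tt adds timestamps, -s 200 longer strings
    mkdir -p /var/log/lrmd-trace
    strace -ff -tt -s 200 -o /var/log/lrmd-trace/lrmd -p $(pidof lrmd) &

    # prune old per-pid traces from cron so the disk survives
    find /var/log/lrmd-trace -name 'lrmd.*' -mmin +120 -delete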

-- Ken Gaillot <kjgaillo@gleim.com>
Network Operations Center, Gleim Publications

Re: Occasional nonsensical resource agent errors, redux
Hi,

On Mon, Nov 03, 2014 at 11:36:40AM -0500, Ken Gaillot wrote:
> On 11/03/2014 09:26 AM, Dejan Muhamedagic wrote:
>> On Mon, Nov 03, 2014 at 08:46:00AM +0300, Andrei Borzenkov wrote:
>>> On Mon, 3 Nov 2014 13:32:45 +1100, Andrew Beekhof <andrew@beekhof.net> wrote:
>>>
>>>>
>>>>> On 1 Nov 2014, at 11:03 pm, Patrick Kane <pmk@wawd.com> wrote:
>>>>>
>>>>> Hi all:
>>>>>
>>>>> In July, list member Ken Gaillot reported occasional nonsensical resource agent errors using Pacemaker (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
>>>>>
>>>>> We're seeing similar issues with our install. We have a 2 node corosync/pacemaker failover configuration that is using the ocf:heartbeat:IPaddr2 resource agent extensively. About once a week, we'll get an error like this, out of the blue:
>>>>>
>>>>> Nov 1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip
>>>>>
>>>>> It goes without saying that the ip command hasn't gone anywhere and all the paths are configured correctly.
>>>>>
>>>>> We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 inside of a xen container.
>>>>>
>>>>> Any thoughts from folks on what might be happening or how we can get additional debug information to help figure out what's triggering this?
>>>>
>>>> its pretty much in the hands of the agent.
>>>
>>> Actually the message seems to be output by check_binary() function
>>> which is part of framework.
>>
>> Someone complained in the IRC about this issue (another resource
>> agent though, I think Xen) and they said that which(1) was not
>> able to find the program. I'd suggest to do strace (or ltrace)
>> of which(1) at that point (it's in ocf-shellfuncs).
>>
>> The which(1) utility is a simple tool: it splits the PATH
>> environment variable and stats the program name appended to each
>> of the paths. PATH somehow corrupted or filesystem misbehaving?
>> My guess is that it's the former.
>>
>> BTW, was there an upgrade of some kind before this started
>> happening?
>
> I was hoping to have something useful before posting another update, but
> since it's come up again, here's what we've found so far:
>
> * The most common manifestation is the "couldn't find command" error. In
> various instances it "couldn't find" xm, ip or awk. However, we've seen
> two other variations:
>
> lrmd: [3363]: info: RA output: (pan:monitor:stderr) en-destroy: bad
> variable name
>
> and
>
> lrmd: [2145]: info: RA output: (ldap-ip:monitor:stderr)
> /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1:
> /usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied
>
> The RA in the first case does not use the string "en-destroy" at all; it
> does call a command "xen-destroy". That, to me, is a strong suggestion
> of memory corruption somewhere, whether in the RA, the shell, lrmd or a
> library used by one of those.

Scary. Shell and lrmd are two obvious candidates. I assume that
none of them would cause a segfault if trampling through the
memory where a copy of the running script resides.

> * I have not found any bugs in the RA or its included files.
>
> * I tried setting "debug: on" in corosync.conf, but that did not give
> any additional useful information. The resource agent error is still the
> first unusual message in the sequence. Here is an example, giving one
> successful monitor run and then an occurrence of the issue (the nodes
> are a pair of Xen dom0s including pisces, running two Xen domU resources
> pan and nemesis):
>
> Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 372)
> Sep 13 20:16:56 pisces lrmd: [372]: debug: perform_ra_op: resetting
> scheduler class to SCHED_OTHER
> Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:nemesis monitor[32] (pid
> 409)
> Sep 13 20:16:56 pisces lrmd: [409]: debug: perform_ra_op: resetting
> scheduler class to SCHED_OTHER
> Sep 13 20:16:56 pisces lrmd: [3509]: info: operation monitor[21] on pan
> for client 3512: pid 372 exited with return code 0
> Sep 13 20:16:57 pisces lrmd: [3509]: info: operation monitor[32] on
> nemesis for client 3512: pid 409 exited with return code 0
> Sep 13 20:17:06 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 455)
> Sep 13 20:17:06 pisces lrmd: [455]: debug: perform_ra_op: resetting
> scheduler class to SCHED_OTHER
> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output:
> (pan:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:

This "local" seems to be from ocf-binaries:have_binary():

71 local bin=`echo $1 | sed -e 's/ -.*//'`

> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output:
> (pan:monitor:stderr) en-destroy: bad variable name
> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr)
> Sep 13 20:17:07 pisces lrmd: [3509]: info: operation monitor[21] on pan
> for client 3512: pid 455 exited with return code 2
>
> * I tried reverting several security updates applied in the month or so
> before we first saw the issue. Reverting the Debian kernel packages to
> 3.2.57-3 and then 3.2.54-2 did not help, nor did reverting libxml2 to
> libxml2 2.8.0+dfsg1-7+nmu2.

I suppose that you restarted the cluster stack after update :)

> None of the other updates from that time
> look like they could have any effect.
>
> * Regarding libxml2, I did find that Debian had backported an upstream
> patch into its 2.8.0+dfsg1-7+nmu3 that introduced a memory corruption
> bug, which upstream later corrected (the bug never made it into an
> upstream release, but Debian had backported a specific changeset). I
> submitted that as Debian Bug #765770 which was just fixed last week. I
> haven't had a chance to apply that to the affected servers yet, but as
> mentioned above, reverting to the libxml2 before the introduced bug did
> not fix the issue.
>
> * I have not found a way to intentionally reproduce the issue. :-( We
> have had 10 occurrences across 3 two-node clusters in five months. Some
> of the nodes have had only one occurrence during that time, but one pair
> gets the most of them. With the time between occurrences, it's hard to
> do something like strace on lrmd, though that's probably a good way
> forward, scripting something to deal with the output reasonably.

Perhaps dumping core of both lrmd and the shell when this
happens would help. Are the most affected nodes in any way
significantly different from the others? By CIB size perhaps?
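
E.g. something like this (Debian-ish paths, only an example; gcore
needs gdb installed):

    # allow cores from lrmd and its children; set before the cluster stack starts
    ulimit -c unlimited
    echo '/var/crash/core.%e.%p.%t' >/proc/sys/kernel/core_pattern

    # or take a snapshot of a still-running lrmd without killing it
    gcore -o /var/crash/lrmd $(pidof lrmd)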

> * There does not seem to be any correlation with how long the node has
> been up. Checking RAM usage of corosync and lrmd on all nodes over about
> two weeks shows little to no change, so I don't suspect a leak. Most of
> our errors have occurred in the Xen RA, but probably only because that's
> the RA we use most; we've also seen it in IPaddr2.
>
> * My next idea would be to compile/install the latest versions of at
> least pacemaker and the resource agents. However I am in the middle of
> changing jobs, and unfortunately do not have much time left for this. My
> new job will have plenty of time to spend on pacemaker ;-) so I may be
> able to give updates later. Debian's "jessie" release freezes this week,
> so I'm hoping that I will have time to at least get a test cluster up
> running the somewhat newer versions in that (pacemaker 1.1.10, corosync
> 1.4.6).

Did you open a bug report with debian?

Cheers,

Dejan

> -- Ken Gaillot <kjgaillo@gleim.com>
> Network Operations Center, Gleim Publications

Re: Occasional nonsensical resource agent errors, redux
On 11/04/2014 11:02 AM, Dejan Muhamedagic wrote:
>>>>>> On 1 Nov 2014, at 11:03 pm, Patrick Kane <pmk@wawd.com> wrote:
>>>>>>
>>>>>> Hi all:
>>>>>>
>>>>>> In July, list member Ken Gaillot reported occasional nonsensical resource agent errors using Pacemaker (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
>>>>>>

<snip>

>> I was hoping to have something useful before posting another update, but
>> since it's come up again, here's what we've found so far:
>>
>> * The most common manifestation is the "couldn't find command" error. In
>> various instances it "couldn't find" xm, ip or awk. However, we've seen
>> two other variations:
>>
>> lrmd: [3363]: info: RA output: (pan:monitor:stderr) en-destroy: bad
>> variable name
>>
>> and
>>
>> lrmd: [2145]: info: RA output: (ldap-ip:monitor:stderr)
>> /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1:
>> /usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied
>>
>> The RA in the first case does not use the string "en-destroy" at all; it
>> does call a command "xen-destroy". That, to me, is a strong suggestion
>> of memory corruption somewhere, whether in the RA, the shell, lrmd or a
>> library used by one of those.
>
> Scary. Shell and lrmd are two obvious candidates. I assume that
> none of them would cause a segfault if trampling through the
> memory where a copy of the running script resides.
>
>> * I have not found any bugs in the RA or its included files.
>>
>> * I tried setting "debug: on" in corosync.conf, but that did not give
>> any additional useful information. The resource agent error is still the
>> first unusual message in the sequence. Here is an example, giving one
>> successful monitor run and then an occurrence of the issue (the nodes
>> are a pair of Xen dom0s including pisces, running two Xen domU resources
>> pan and nemesis):
>>
>> Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 372)
>> Sep 13 20:16:56 pisces lrmd: [372]: debug: perform_ra_op: resetting
>> scheduler class to SCHED_OTHER
>> Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:nemesis monitor[32] (pid
>> 409)
>> Sep 13 20:16:56 pisces lrmd: [409]: debug: perform_ra_op: resetting
>> scheduler class to SCHED_OTHER
>> Sep 13 20:16:56 pisces lrmd: [3509]: info: operation monitor[21] on pan
>> for client 3512: pid 372 exited with return code 0
>> Sep 13 20:16:57 pisces lrmd: [3509]: info: operation monitor[32] on
>> nemesis for client 3512: pid 409 exited with return code 0
>> Sep 13 20:17:06 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 455)
>> Sep 13 20:17:06 pisces lrmd: [455]: debug: perform_ra_op: resetting
>> scheduler class to SCHED_OTHER
>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output:
>> (pan:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
>
> This "local" seems to be from ocf-binaries:have_binary():
>
> 71 local bin=`echo $1 | sed -e 's/ -.*//'`

Agreed, nothing unusual there, reinforces suspicion of memory corruption.

>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output:
>> (pan:monitor:stderr) en-destroy: bad variable name
>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr)
>> Sep 13 20:17:07 pisces lrmd: [3509]: info: operation monitor[21] on pan
>> for client 3512: pid 455 exited with return code 2
>>
>> * I tried reverting several security updates applied in the month or so
>> before we first saw the issue. Reverting the Debian kernel packages to
>> 3.2.57-3 and then 3.2.54-2 did not help, nor did reverting libxml2 to
>> libxml2 2.8.0+dfsg1-7+nmu2.
>
> I suppose that you restarted the cluster stack after update :)

Yep, w/reboot for kernel reverts. Some of the libxml2 reverts got full
reboots as well b/c they were done with other maintenance.

>> None of the other updates from that time
>> look like they could have any effect.
>>
>> * Regarding libxml2, I did find that Debian had backported an upstream
>> patch into its 2.8.0+dfsg1-7+nmu3 that introduced a memory corruption
>> bug, which upstream later corrected (the bug never made it into an
>> upstream release, but Debian had backported a specific changeset). I
>> submitted that as Debian Bug #765770 which was just fixed last week. I
>> haven't had a chance to apply that to the affected servers yet, but as
>> mentioned above, reverting to the libxml2 before the introduced bug did
>> not fix the issue.
>>
>> * I have not found a way to intentionally reproduce the issue. :-( We
>> have had 10 occurrences across 3 two-node clusters in five months. Some
>> of the nodes have had only one occurrence during that time, but one pair
>> gets the most of them. With the time between occurrences, it's hard to
>> do something like strace on lrmd, though that's probably a good way
>> forward, scripting something to deal with the output reasonably.
>
> Perhaps dumping core of both lrmd and the shell when this
> happens would help. Are the most affected nodes in any way
> significantly different from the others? By CIB size perhaps?

It's actually simpler. An overview of our setup is:

Cluster #1 (with 6 of the 10 failures): Xen dom0s as nodes, two Xen
domUs as resources (DRBD is configured outside pacemaker, Xen handles
primary/secondary roles)

Cluster #2 (2/10 failures): Xen dom0s as nodes, DLM resource, CLVM
resource, DRBD resource, LVM volume group resource, 10 Xen domU resources

Cluster #3 (2/10 failures): Xen domUs as nodes, DNS daemon resource,
LDAP daemon resource, (2x) IPaddr2 resources

>> * There does not seem to be any correlation with how long the node has
>> been up. Checking RAM usage of corosync and lrmd on all nodes over about
>> two weeks shows little to no change, so I don't suspect a leak. Most of
>> our errors have occurred in the Xen RA, but probably only because that's
>> the RA we use most; we've also seen it in IPaddr2.
>>
>> * My next idea would be to compile/install the latest versions of at
>> least pacemaker and the resource agents. However I am in the middle of
>> changing jobs, and unfortunately do not have much time left for this. My
>> new job will have plenty of time to spend on pacemaker ;-) so I may be
>> able to give updates later. Debian's "jessie" release freezes this week,
>> so I'm hoping that I will have time to at least get a test cluster up
>> running the somewhat newer versions in that (pacemaker 1.1.10, corosync
>> 1.4.6).
>
> Did you open a bug report with debian?

Not for the resource agent issue. I'd have to narrow it down to a
particular package, and rule out (or reference) an upstream bug.

-- Ken Gaillot <kjgaillo@gleim.com>
Network Operations Center, Gleim Publications

Re: Occasional nonsensical resource agent errors, redux
> On 5 Nov 2014, at 4:21 am, Ken Gaillot <kjgaillo@gleim.com> wrote:
>
>>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output:
>>> (pan:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
>>
>> This "local" seems to be from ocf-binaries:have_binary():
>>
>> 71 local bin=`echo $1 | sed -e 's/ -.*//'`
>
> Agreed, nothing unusual there, reinforces suspicion of memory corruption.
>
>>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output:
>>> (pan:monitor:stderr) en-destroy: bad variable name
>>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr)
>>> Sep 13 20:17:07 pisces lrmd: [3509]: info: operation monitor[21] on pan
>>> for client 3512: pid 455 exited with return code 2

Not that pacemaker couldn't cause this, but I'm curious to know if anyone has seen this on bare metal...
VMs, with their backing stores and COW overlays seem like a reasonable candidate based on my experiences.


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org