Mailing List Archive

IPsrcaddr bug, and fix recommendation
Hi,

I have a cluster configuration with two IPsrcaddr resources (e.g. IP addresses "A" and "B").
They are configured with two different addresses and are never supposed to run on the same nodes: "A" can run on nodes N1 and N2, "B" can run on N3 and N4.

My problem is that in some cases crm_mon shows an IPsrcaddr resource running on a node where it shouldn't be, and of course it is in an unmanaged state and cannot be stopped.
For instance:
IP address "A" is started, unmanaged, on node N3.

I am using Pacemaker 1.1.6 on a Debian system, with the latest RA from GitHub.

I checked the RA, and here are my findings.


- When status is called, it calls the srca_read() function.

- srca_read() returns 2 if a source IP is set on the given node, but with a different IP address.

- srca_status(), when it gets 2 from srca_read(), returns $OCF_ERR_GENERIC.
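
The three findings above can be modeled in a small, self-contained sketch. This is not the agent's actual source: `read_src_state` and `status_code` are hypothetical stand-ins for srca_read() and srca_status(), the addresses are passed in as arguments instead of being read from the routing table, and the meaning of return value 1 ("no source address set") is an assumption. Only the numeric OCF return codes are the standard ones.

```shell
#!/bin/sh
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical stand-in for srca_read().
# $1 = source address currently set on the default route ("" if none)
# $2 = address this resource is configured with
# Returns 0 on a match, 1 if no source address is set (assumed),
# 2 if a different source address is set.
read_src_state() {
    current=$1
    wanted=$2
    [ -z "$current" ] && return 1
    [ "$current" = "$wanted" ] && return 0
    return 2
}

# Hypothetical stand-in for srca_status(): maps the result above
# to an OCF return code, as described in the findings.
status_code() {
    read_src_state "$1" "$2"
    case $? in
        0) return $OCF_SUCCESS ;;
        1) return $OCF_NOT_RUNNING ;;
        *) return $OCF_ERR_GENERIC ;;
    esac
}
```

In this model, checking resource "A" on a node whose default route carries address "B" yields OCF_ERR_GENERIC (1) rather than OCF_NOT_RUNNING (7), which matches the behavior reported here.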

As a result, in my case IP "B" is running on N3, which is OK, but crm_mon reports that IP "A" is also running on N3 (unmanaged). For some reason this is how OCF_ERR_GENERIC is interpreted.
This is definitely a bug; the question is whether it is in Pacemaker or in the RA.
If I change the script to return $OCF_NOT_RUNNING instead of $OCF_ERR_GENERIC, it works properly.

What is the proper behavior in this case?
My recommendation is to fix the RA so that srca_read() returns 1 if there is a source IP on the node but it is not the queried one.

In this case the RA would return $OCF_NOT_RUNNING.



Cheers,
Attila
Re: IPsrcaddr bug, and fix recommendation
Hi,

On Thu, Dec 20, 2012 at 08:03:32PM +0100, Attila Megyeri wrote:
> What is the proper behavior in this case?
> My recommendation is to fix the RA so that srca_read() returns 1, if there is a srcip on the node, but it is not the queried one.

The comment in the agent says:

# NOTES:
#
# 1) There must be one and not more than 1 default route! Mainly because
# I can't see why you should have more than one. And if there is more
# than one, we would have to box clever to find out which one is to be
# modified, or we would have to pass its identity as an argument.
#

This should actually be in the meta-data, as it is obviously
intended for users.

It looks like your use case doesn't fit this description, right?
Perhaps we could add a parameter like "allow_multiple_default_routes".

Thanks,

Dejan



_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: IPsrcaddr bug, and fix recommendation
Hi Dejan,

-----Original Message-----
From: linux-ha-dev-bounces@lists.linux-ha.org [mailto:linux-ha-dev-bounces@lists.linux-ha.org] On Behalf Of Dejan Muhamedagic
Sent: Monday, December 24, 2012 11:07 AM
To: linux-ha-dev@lists.linux-ha.org
Subject: Re: [Linux-ha-dev] IPsrcaddr bug, and fix recommendation

It looks like your use case doesn't fit this description, right?
Perhaps we could add a parameter like "allow_multiple_default_routes".

Thanks,

Dejan


On the host where the resource is running I have only one default gateway. The counterpart of this host (the other node) uses a different default gateway, but I do not think this should be a limitation (that host has a single default gateway as well).
The srca_read() function does not fail in the steps that check the default gateway. It runs to the last line, where 2 is returned, although this is not a generic error; rather, the source IP is simply not running on the node.
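
To illustrate the two separate checks being distinguished here, the following sketch inspects routing-table text the way such a check might: it counts the default routes and extracts the `src` attribute of the default route. The sample text, its addresses, and the function names `count_default_routes` and `default_route_src` are assumptions made for illustration; on a live node the input would come from `ip route show` instead.

```shell
#!/bin/sh
# Sample routing table in the format printed by `ip route show`.
# The addresses are made up for illustration.
sample_routes='default via 192.168.1.1 dev eth0 src 192.168.1.50
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.11'

# Count the lines that describe a default route.
count_default_routes() {
    printf '%s\n' "$1" | grep -c '^default '
}

# Print the value following "src" on the first default route line.
# Prints nothing if that route has no src attribute.
default_route_src() {
    printf '%s\n' "$1" | awk '$1 == "default" {
        for (i = 2; i < NF; i++) if ($i == "src") { print $(i + 1); exit }
        exit
    }'
}
```

With the sample above, `count_default_routes "$sample_routes"` prints 1 and `default_route_src "$sample_routes"` prints 192.168.1.50. A node can therefore pass the "exactly one default route" check and still carry a source address other than the one a given resource expects.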


Thanks,

Attila







Re: IPsrcaddr bug, and fix recommendation
Hi Attila,

Sorry for the delay, somehow missed your message.

On Fri, Dec 28, 2012 at 12:52:22PM +0100, Attila Megyeri wrote:
> On the host where the resource is running I have only one default gateway. The counterpart of this host (the other node) uses a different default gateway, but I do not think this should be a limitation (that host has a single default gateway as well).

The "must be one and not more than 1" should also say
"cluster-wide".

> The srca_read() function does not fail in the steps that check the default gateway. It runs to the last line, where 2 is returned, although this is not a generic error; rather, the source IP is simply not running on the node.

The exit code 2 signifies that the default route has an
unexpected address.

I think that it works as designed. As mentioned earlier, we can
extend the resource agent to support clusters with multiple
default routes, but that would need to be done with an extra
configuration parameter. Patches welcome :)
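
As a rough sketch of what such an opt-in parameter could look like: the parameter name below is only the one floated earlier in this thread and does not exist in the agent, and `status_from_read_result` is a hypothetical helper. The sketch relies on the OCF convention that resource parameters reach the agent as `OCF_RESKEY_<name>` environment variables, and it keeps the current behavior unless the parameter is set to true.

```shell
#!/bin/sh
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical parameter, not part of the agent. Defaults to false
# so that existing configurations behave as before.
: "${OCF_RESKEY_allow_multiple_default_routes:=false}"

# $1 = result of the source-address check:
#      0 = expected address set, 1 = none set, 2 = a different address set
status_from_read_result() {
    case $1 in
        0) return $OCF_SUCCESS ;;
        1) return $OCF_NOT_RUNNING ;;
        2)
            # Opt-in: treat a foreign source address as "not running here".
            if [ "$OCF_RESKEY_allow_multiple_default_routes" = "true" ]; then
                return $OCF_NOT_RUNNING
            fi
            return $OCF_ERR_GENERIC
            ;;
        *) return $OCF_ERR_GENERIC ;;
    esac
}
```

Keeping the default at false means clusters that rely on the single cluster-wide default route assumption would still see an error when the address is unexpected.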

Thanks,

Dejan
