Mailing List Archive

[OCTAVIA][QUEENS][KOLLA] - Amphora to Health-manager invalid UDP heartbeat.
Hi guys,

I'm wrapping up my PoC of Octavia, and after solving a few issues with my
configuration I'm close to having a properly working setup.
However, I'm facing a small yet annoying bug: the health-manager receives
the amphora heartbeat UDP packets but considers them invalid and drops
them.

Here is the message that shows up in the logs:

2018-10-23 13:53:21.844 25 WARNING octavia.amphorae.backends.health_daemon.status_message [-] calculated hmac: faf73e41a0f843b826ee581c3995b7f7e56b5e5a294fca0b84eda426766f8415 not equal to msg hmac: 6137613337316432636365393832376431343337306537353066626130653261 dropping packet

This comes from this part of the health-manager code:

https://docs.openstack.org/octavia/pike/_modules/octavia/amphorae/backends/health_daemon/status_message.html#get_payload

The annoying thing is that I don't understand why the UDP packet is
considered stale, nor how I can reproduce the payload that is sent to the
health manager.
I'd like to write a simple Python program to simulate the heartbeat
payload, but I don't know exactly what the message looks like and I think
I'm missing some information.
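
Here is roughly what I have in mind for the simulator, based on reading the
linked status_message code: the heartbeat seems to be a zlib-compressed JSON
document with an HMAC-SHA256 of the compressed bytes appended, sent as a
single UDP datagram. The fields, key, address and port below are just
illustrative placeholders, and the digest encoding (hex vs. raw bytes)
apparently differs between releases, so this is an approximation rather than
the exact wire format:

import hashlib
import hmac
import json
import socket
import zlib

HEARTBEAT_KEY = 'insecure-example-key'  # must match heartbeat_key on both sides
HM_ADDR = ('192.0.2.10', 5555)          # health manager IP and UDP port (examples)

# Illustrative status document; the real amphora sends more fields.
msg = {'id': 'amphora-uuid', 'seq': 1, 'listeners': {}, 'ver': 1}

payload = zlib.compress(json.dumps(msg).encode('utf-8'))
digest = hmac.new(HEARTBEAT_KEY.encode('utf-8'), payload, hashlib.sha256)

# Older amphorae appear to append the hex digest, newer ones the raw digest.
envelope = payload + digest.hexdigest().encode('utf-8')

socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(envelope, HM_ADDR)

Computing the HMAC both ways over a captured packet should at least tell me
which side is producing the unexpected digest.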

Both the health manager and the amphora use the same heartbeat_key, and they
can reach each other on the network, as the initial health-manager to amphora
connection on port 9443 works.

As a consequence, my load balancer is stuck in PENDING_UPDATE.

Do you have any idea how I can handle this, or has anyone else already run
into it?

Kind regards,
G.
Re: [OCTAVIA][QUEENS][KOLLA] - Amphora to Health-manager invalid UDP heartbeat.
Are the controller and the amphora using the same version of Octavia?

We had a python3 issue where we had to change the HMAC digest used. If
your controller is running an older version of Octavia than your
amphora images, it may not have the compatibility code to support the
new format. The compatibility code is here:
https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/health_daemon/status_message.py#L56

There is also a release note about the issue here:
https://docs.openstack.org/releasenotes/octavia/rocky.html#upgrade-notes
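
To illustrate the idea (this is a hand-written sketch, not the actual
Octavia code): the receiver strips the trailing digest and accepts either
the hex-encoded HMAC sent by older amphorae or the raw binary digest used
by newer ones, so a version skew on either side produces exactly the
"calculated hmac ... not equal to msg hmac" warning you are seeing.

import hashlib
import hmac

def check_envelope(envelope, key):
    # A SHA-256 hexdigest is 64 bytes on the wire, the raw digest is 32.
    for digest_len, to_wire in (
            (64, lambda d: d.hexdigest().encode('utf-8')),  # old format
            (32, lambda d: d.digest())):                    # new format
        payload, trailer = envelope[:-digest_len], envelope[-digest_len:]
        calc = to_wire(hmac.new(key.encode('utf-8'), payload, hashlib.sha256))
        if hmac.compare_digest(calc, trailer):
            return payload  # caller would zlib-decompress and json-load this
    raise ValueError('HMAC mismatch, dropping packet')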

If that is not the issue, I would double check the heartbeat_key in
the health manager configuration files and inside one of the amphora.
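
If it helps, and assuming default paths (they may differ in a Kolla
deployment, where the files live inside the octavia containers), you can
compare the key with something like:

# On a controller (e.g. inside the health manager container under Kolla):
grep heartbeat_key /etc/octavia/octavia.conf

# Inside one of the amphorae:
grep heartbeat_key /etc/octavia/amphora-agent.conf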

Note, that this key is only used for health heartbeats and stats, it
is not used for the controller to amphora communication on port 9443.

Also, load balancers cannot get "stuck" in PENDING_* states unless
someone has killed the controller process that was actively working on
that load balancer. By killed I mean a non-graceful shutdown of the
process that was in the middle of working on the load balancer.
Otherwise all code paths lead back to ACTIVE or ERROR status after it
finishes the work or gives up retrying the requested action. Check
your controller logs to make sure this load balancer is not still
being worked on by one of the controllers. The default retry timeouts
(some are up to 25 minutes) are very long (it will keep trying to
accomplish the request) to accommodate very slow (virtual box) hosts
and the test gates. You will want to tune those down for a production
deployment.
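
As a starting point, these are the sort of options to look at in
octavia.conf; the option names are from the configuration reference as I
remember them and the values below are only examples, so check your
release's sample config before applying them:

[haproxy_amphora]
# Defaults of roughly 300 retries x 5 seconds give the ~25 minute worst case.
connection_max_retries = 60
connection_retry_interval = 2

[controller_worker]
# How long to wait for a newly booted amphora to become reachable/ACTIVE.
amp_active_retries = 30
amp_active_wait_sec = 5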

Michael


_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
Re: [OCTAVIA][QUEENS][KOLLA] - Amphora to Health-manager invalid UDP heartbeat.
Hi Michael,

Thanks a lot for all the details about the transitions between the
different states. Indeed, as you said, my LB went from PENDING_UPDATE to
ACTIVE, but it was still showing an OFFLINE status this morning, as the HM
was still dropping the UDP packets it received.

When I mentioned the health manager reaching the amphora on port 9443, I of
course didn't mean that that connection uses the heartbeat key.


I just had a look at my amphora and Octavia CP (control plane) versions, and
they are a little out of sync: my amphora agent reports %prog 3.0.0.0b4.dev6
while my Octavia CP services report %prog 2.0.1.

I've just updated to stable/rocky this morning, which brings them to %prog
3.0.1.
I'll check whether I still hit the issue, but for now it seems to have
vanished, as I now get the following messages:

2018-10-24 11:58:54.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:58:57.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:59:00.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:59:03.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:59:04.557 23 DEBUG octavia.amphorae.drivers.health.heartbeat_udp [-] Received packet from ('172.27.201.105', 48342) dorecv /usr/lib/python2.7/site-packages/octavia/amphorae/drivers/health/heartbeat_udp.py:187
2018-10-24 11:59:04.619 45 DEBUG octavia.controller.healthmanager.health_drivers.update_db [-] Health Update finished in: 0.0600640773773 seconds update_health /usr/lib/python2.7/site-packages/octavia/controller/healthmanager/health_drivers/update_db.py:93

I'll keep you posted as I investigate further, but so far the issue seems to
be resolved. I'll also tune down the timeouts a bit, as my LB takes a very
long time to create listeners/pools and reach an ONLINE status.

Thanks a lot!
