
[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #4 from WJCarpenter <bill-apache@carpenter.org> ---
I've spent quite a bit of time over the last few days trying to understand what
goes on here. I have reproduced the memory growth on both Windows and Linux
with Apache httpd 2.4.46. It doesn't depend on health check failures, though
they can affect the timing to make it more likely. (In my testing, I dummied
out the actual check in hc_check() and just put in a fixed delay to avoid
vagaries of network stuff.)

My conclusion is that it's strongly related to bug #62318
(https://bz.apache.org/bugzilla/show_bug.cgi?id=62318). The fix for that didn't
account for an important edge case. When should we mark a worker as "updated",
where "updated" means the last time that worker was health checked? Originally,
the worker was marked as "updated" when the check completed, though the
timestamp used was the time the check started. Any watchdog firing while the
check was still running would therefore trigger an additional check for that
worker, because the trigger condition examines the "updated" timestamp, which
kept its old value until the check completed.

The previous fix used the same timestamp for "updated", but it changed when the
timestamp was applied. It applied the timestamp at the beginning of the check.
That (mostly) prevents additional checks from being triggered while the first
one is running.
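
To make the sequencing concrete, the flow is roughly the following (my
paraphrase of the logic in mod_proxy_hcheck.c, not a verbatim excerpt; the
variable names are approximate):

    /* hc_watchdog_callback(): fires every ~100ms and looks at each worker */
    if (worker->s->method != NONE
        && now > worker->s->updated + worker->s->interval) {
        baton->now = now;     /* the trigger time rides along with the task */
        apr_thread_pool_push(hctp, hc_check, baton,
                             APR_THREAD_TASK_PRIORITY_NORMAL, NULL);
        /* "updated" is not touched here, so while the task sits in the
           queue, every later firing still sees this worker as overdue. */
    }

    /* hc_check(): runs on a pool thread, possibly long after being queued */
    worker->s->updated = baton->now;  /* applied only when the task starts */
    /* ... perform the actual health check ... */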

The edge case that isn't accounted for is when all of the threads in the pool
used for health checks are busy. In that case, the "updated" timestamp is not
applied until the health check task is popped from the queue and given to a
thread to run. Let's look at an example (which I actually used experimentally).
Suppose it takes 5 seconds to do a health check. That's a long time, but not
preposterous. The watchdog callback runs every 100ms, and the default thread
pool is 16 threads. With just a single worker to check, 49 additional checks
will be triggered before the first check finishes. Each will carry a timestamp
just 100ms later than the one just before it. The first 16 checks will be given
to threads for immediate execution, with the final one updating "updated" 1.5
seconds later than the first one. The other 34 or so will be queued until
threads free up to run them. Each of those queued health checks carries along
with it a timestamp to be applied to "updated", and the timestamp reflects the
time that the health check task was enqueued. There's just one "updated" field
for the worker, regardless of how many health check tasks there are for that
worker. The "updated" timestamp will always reflect the watchdog trigger time
of the health check task that started most recently. In our example, the
"updated" timestamp eventually ends up about 5 seconds later than when the
first check started.
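
For anyone who wants to sanity-check those numbers, the arithmetic is just
this (a throwaway snippet using the parameters from the example; nothing here
comes from the httpd source):

    #include <stdio.h>

    int main(void)
    {
        const int watchdog_period_ms = 100;   /* watchdog fires every 100ms */
        const int check_duration_ms  = 5000;  /* one health check takes 5s  */
        const int pool_threads       = 16;    /* default ProxyHCTPSize      */

        /* Firings while the first check is still running; each one pushes
           another task for the same worker. */
        int total_tasks      = check_duration_ms / watchdog_period_ms; /* 50 */
        int additional_tasks = total_tasks - 1;                        /* 49 */
        int queued_tasks     = total_tasks - pool_threads;             /* 34 */
        int last_dispatch_ms = (pool_threads - 1) * watchdog_period_ms;

        printf("tasks pushed: %d (%d beyond the first)\n",
               total_tasks, additional_tasks);
        printf("queued behind busy threads: %d\n", queued_tasks);
        printf("16th task starts %d ms after the first\n", last_dispatch_ms);
        return 0;
    }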

With the default health check interval ("hcinterval") of 60 seconds, things
settle down until the 65 second point or so. Then the cycle repeats. There is
wasted effort of redundant health checks (which might foil expectations for
"hcfails" and "hcpasses"), and the "hcinterval" is effectively changed from 60
seconds to about 65 seconds. Those things are undesirable but not a tragedy.

Having just a single worker is unlikely. For every worker of some kind getting
health checks, the above scenario plays out. It's easy to get hundreds of
health check tasks queued up waiting for threads to process them. Still, the
net effect would be merely stretching the "hcinterval" time for each worker by
the time it takes to do a check. It's 5 seconds in our example, but it would be
way less for a healthy system. If the system is not healthy, the time it takes
to do a health check can be shorter or longer. If we try to connect to a port with
no listener process, then we will very quickly get a "connection refused". If
an entire server is down, we won't get any response at all and will have to
wait for a self-imposed socket timeout (from the "connectiontimeout" parameter,
default 60s). Or, the process on the server might be in a weird state so that
the health check takes a long time to complete without causing a socket timeout
(from the "timeout" parameter, default 60s). If the time to complete the health
check exceeds the "hcinterval" for the worker, we will never catch up. A
growing number of health check tasks will be queued for that worker. Each
queued health check task is holding a chunk of heap memory, and that is what
accounts for the memory growth. (As one anecdotal data point, on Windows my
httpd normally runs at 15-16 MB process size. With my pathological test case,
it's easy for me to see it grow to a couple of GB over a couple hours or so
with many tens of thousands of trapped health check tasks in the queue.)
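
To put rough numbers on the pathological case: once the backlog is old enough
that the worker looks overdue on every single watchdog firing, a new task is
pushed every 100ms while the 16 pool threads can only retire about 16 tasks
per "connectiontimeout". A quick back-of-envelope snippet (my own arithmetic
using the default parameters mentioned above, not anything taken from the
httpd source):

    #include <stdio.h>

    int main(void)
    {
        const double watchdog_period = 0.1;      /* s between watchdog firings */
        const double check_duration  = 60.0;     /* s: check blocked until the
                                                    "connectiontimeout" expires */
        const int    pool_threads    = 16;       /* default ProxyHCTPSize */
        const double run_time        = 2 * 3600; /* "a couple hours", seconds */

        double enqueue_rate = 1.0 / watchdog_period;          /* ~10 tasks/s   */
        double drain_rate   = pool_threads / check_duration;  /* ~0.27 tasks/s */
        double backlog      = (enqueue_rate - drain_rate) * run_time;

        printf("enqueue ~%.0f/s, drain ~%.2f/s, backlog after %.0f s: ~%.0f tasks\n",
               enqueue_rate, drain_rate, run_time, backlog);
        return 0;
    }

That comes out to roughly 70,000 queued tasks after a couple of hours, which
is in the same ballpark as the task counts and memory growth I observed.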

So, what can be done to work around or fix this edge case?

* Workaround 1: Have at least as many threads in the thread pool
(ProxyHCTPSize) as you have distinct workers getting health checks. That way,
health check tasks will not usually get queued (though it still could happen
occasionally, so maybe allocate some extra threads). The threads will be idle
most of the time.

* Workaround 2: Make your health check intervals ("hcinterval") long enough to
avoid the pathological case. That is, health checks will always complete, pass
or fail, before the next check is due. Of course, you lose the benefit of more
frequent health checks.

* Workaround 3: Don't use the thread pool for health checks, which you can do
by setting ProxyHCTPSize=0. That will make all health checks synchronous, which
is generally pretty undesirable.

* Fix 1: Apply the timestamp to "updated" before sending the health check task
to the pool. In mod_proxy_hcheck.c, that's in the function
hc_watchdog_callback() before the call to apr_thread_pool_push(). That will
keep most additional health check tasks from being created for the same worker,
though that can still happen if a health check takes longer than the health
check interval.

* Fix 2: Put in an explicit check that a worker has a health check in flight,
and don't create any new checks as long as that's true. This needs only a
single bit flag in the worker data blob that already exists. Set that flag on
the worker before you call apr_thread_pool_push() and clear the flag at the
completion of hc_check(). I don't see the value in having simultaneous
in-flight health checks for the same worker, but maybe someone else does.

If I were deciding, I would do both Fix 1 and Fix 2, though you could argue
that Fix 2 makes Fix 1 unnecessary (depending on how you feel about the meaning
of "hcinterval", as mentioned in the discussion of bug #62318). I have
prototyped both fixes and verified that they resolve the problem (or at least
make it much harder to trigger).
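
For the record, the prototype amounts to something like this (schematic only,
not a polished patch against the real source; "hc_in_flight" is just a
placeholder name for the new flag, wherever that bit ends up living):

    /* hc_watchdog_callback(), where the task is created */
    if (worker->s->method != NONE
        && now > worker->s->updated + worker->s->interval
        && !worker->s->hc_in_flight) {    /* Fix 2: skip if a check is pending */
        worker->s->updated = now;         /* Fix 1: stamp before queueing      */
        worker->s->hc_in_flight = 1;      /* Fix 2: mark the check in flight   */
        baton->now = now;
        apr_thread_pool_push(hctp, hc_check, baton,
                             APR_THREAD_TASK_PRIORITY_NORMAL, NULL);
    }

    /* hc_check(), at completion */
    worker->s->hc_in_flight = 0;          /* Fix 2: allow the next check */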

[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #5 from Jim Jagielski <jim@apache.org> ---
Thanks for the incredible bug report and the excellent investigation. I agree
that Fixes 1+2 make the most sense. I'll look into adding them asap.
