
mpm_event old gen workers not stopping as they should
Hi!

We've been having recurring problems with mpm_event, where old
workers/processes sometimes stick around after server reloads
triggered by cert updates (most often because we use Let's Encrypt
certs) or, more seldom, config changes. At least, that's what we think
is happening.

I seem to remember fixes on this theme some time ago, but it doesn't
seem to be perfect yet, at least not in our setup.

I currently see this on a Ubuntu 18.04 machine, from server-status
(raw cut&paste, so the formatting isn't the best):

--------------------8<---------------------------
Server Version: Apache/2.4.46 (Unix) OpenSSL/1.1.1
Server MPM: event
Server Built: Oct 21 2020 13:48:44

Current Time: Tuesday, 30-Mar-2021 16:13:23 CEST
Restart Time: Sunday, 21-Feb-2021 12:15:56 CET
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 37 days 2 hours 57 minutes 26 seconds
Server load: 0.58 1.45 1.51
Total accesses: 266154607 - Total Traffic: 51438.2 GB - Total Duration: 24778100508
CPU Usage: u64484.4 s43175.7 cu226122 cs121786 - 14.2% CPU load
83 requests/sec - 16.4 MB/second - 202.7 kB/request - 93.0966 ms/request
13 requests currently being processed, 762 idle workers

                            Connections       Threads     Async connections
Slot  PID     Stopping      total  accepting  busy  idle  writing  keep-alive  closing
0     970153  no (old gen)  2      yes        1     63    0        0           0
1     360080  no (old gen)  0      yes        0     64    0        0           0
2     770373  no (old gen)  0      yes        0     64    0        0           0
3     810318  no (old gen)  0      yes        0     64    0        0           0
4     921494  no            0      yes        0     64    0        0           0
5     970233  no (old gen)  0      yes        0     64    0        0           0
7     612077  no (old gen)  0      yes        0     64    0        0           0
8     49423   no (old gen)  0      yes        0     64    0        0           0
9     49521   no (old gen)  1      yes        0     64    0        0           0
10    955271  no (old gen)  0      yes        0     64    0        0           0
13    955426  no (old gen)  2      yes        0     64    0        0           0
14    154811  no            0      yes        0     64    0        0           0
15    558125  no (old gen)  3      yes        3     61    0        1           0
16    558205  no (old gen)  0      yes        0     64    0        0           0
17    603555  no (old gen)  2      yes        3     61    0        0           0
18    558451  no (old gen)  0      yes        0     64    0        0           0
19    587269  no (old gen)  0      yes        0     64    0        0           0
22    955577  no (old gen)  0      yes        0     64    0        0           0
24    538389  no            0      yes        0     64    0        0           0
26    538401  no            0      yes        0     64    0        0           0
28    538435  no            0      yes        0     64    0        0           0
36    538979  no            0      yes        0     64    0        0           0
51    540034  no            0      yes        0     64    0        0           0
60    540326  no            0      yes        0     64    0        0           0
62    540379  no            7      yes        6     58    0        0           0
66    540457  no            0      yes        0     64    0        0           0
73    540666  no            0      yes        0     64    0        0           0
75    540721  no            0      yes        0     64    0        0           0
Sum   28      0             17                13    1779  0        1           0
--------------------8<---------------------------

So, what strikes me as very odd is that we have a bunch of PIDs that
are marked as "old gen", but are not stopping (and thus still
accepting new connections). Shouldn't "old gen" processes by default
stop accepting new connections?

Things become very unfun when the old processes sometimes serve
connections while holding on to an expired Let's Encrypt certificate.
Murphy ensures that our tests never hit the old pids, but users always
do...

The start times of the "old gen" processes vary a bit:

--------------------8<---------------------------
970153 Mar28
360080 Mar22
770373 Mar19
810318 Mar28
970233 Mar28
612077 Mar28
49423 Mar29
49521 Mar29
955271 Mar28
955426 Mar28
558125 Mar20
558205 Mar20
603555 Mar20
558451 Mar20
587269 Mar28
955577 Mar28
--------------------8<---------------------------

Although I don't know if you can draw any conclusions from that...

In any case, I'm in a bit of deep water trying to figure out where
this problem might originate. And reproducing it is hard; we're only
seeing it occasionally...

Ideas?


/Nikke
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | nikke@acc.umu.se
---------------------------------------------------------------------------
The greatest productive force is selfishness!
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Re: mpm_event old gen workers not stopping as they should
Hi Niklas,

On Tue, Mar 30, 2021 at 4:34 PM Niklas Edmundsson <nikke@acc.umu.se> wrote:
>
> I currently see this on a Ubuntu 18.04 machine, from server-status
> (raw cut&paste, so the formatting isn't the best):
>
> --------------------8<---------------------------
> Server Version: Apache/2.4.46 (Unix) OpenSSL/1.1.1

Is that Ubuntu's httpd? It looks like Bionic is bundling an older one.

Also, which modules are you loading?
What kind of traffic (static, cgi, proxy, websocket, http2..)?
How long do connections last?

>
> So, what strikes me as very odd is that we have a bunch of PIDs that
> are marked as "old gen", but are not stopping (and thus still
> accepting new connections). Shouldn't "old gen" processes by default
> stop accepting new connections?

Indeed, provided they get the signal to stop..
Could you set the global LogLevel to trace6 please and provide the error_log?
If all your virtual hosts have their own LogLevel, this shouldn't
affect the verbosity of the traffic, only the administration.
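
For example, something like this (a sketch; the vhost below is a
placeholder):

--------------------8<---------------------------
# Main server (global) level: very verbose, which with the per-vhost
# overrides below mostly affects administrative logging (startup,
# reload, MPM process management).
LogLevel trace6

<VirtualHost *:443>
    ServerName www.example.org    # placeholder name
    # Per-vhost override keeps request processing at normal verbosity.
    LogLevel warn
</VirtualHost>
--------------------8<---------------------------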


Regards;
Yann.
Re: mpm_event old gen workers not stopping as they should
On Tue, 30 Mar 2021, Yann Ylavic wrote:

> Hi Niklas,
>
> On Tue, Mar 30, 2021 at 4:34 PM Niklas Edmundsson <nikke@acc.umu.se> wrote:
>>
>> I currently see this on a Ubuntu 18.04 machine, from server-status
>> (raw cut&paste, so the formatting isn't the best):
>>
>> --------------------8<---------------------------
>> Server Version: Apache/2.4.46 (Unix) OpenSSL/1.1.1
>
> Is that Ubuntu's httpd? It looks like Bionic is bundling an older one.

It's our own build of the unpatched httpd.apache.org source. We're in
the process of rolling these hosts over to Ubuntu 20.04 LTS, which has
the same httpd version but an OS-specific build.

> Also, which modules are you loading?

server-info says:

Loaded Modules

core.c, event.c, http_core.c, mod_access_compat.c, mod_alias.c,
mod_authn_core.c, mod_authz_core.c, mod_authz_host.c, mod_autoindex.c,
mod_cache.c, mod_cache_disk_largefile.c, mod_dir.c, mod_env.c,
mod_expires.c, mod_include.c, mod_info.c, mod_log_config.c,
mod_logio.c, mod_mime.c, mod_reqtimeout.c, mod_rewrite.c,
mod_setenvif.c, mod_so.c, mod_socache_shmcb.c, mod_ssl.c,
mod_status.c, mod_systemd.c, mod_unixd.c, mod_version.c,

Out of these, mod_cache_disk_largefile is nonstandard; it's our own
take on mod_cache_disk to suit our workload. It feels unlikely to be
causing these side effects, although anything is possible, I guess...

> What kind of traffic (static, cgi, proxy, websocket, http2..)?

Static files and directory listings, over http and https; this is one
of the frontend hosts for ftp.acc.umu.se (the file archive run by the
Academic Computer Club, Umeå University).

> How long do connections last?

The frontends serve only files smaller than 4 MiB; anything bigger
gets redirected to offload hosts. There are some really slow
downloads, but they tend to finish sooner rather than later...

Looking now, I see that we have set GracefulShutdownTimeout 3600 for
these frontend hosts, and I seem to remember that the timeout was
chosen to allow the vast majority of transfers to finish.

>> So, what strikes me as very odd is that we have a bunch of PIDs that
>> are marked as "old gen", but are not stopping (and thus still
>> accepting new connections). Shouldn't "old gen" processes by default
>> stop accepting new connections?
>
> Indeed, provided they get the signal to stop..

Ah, so the "old gen" marker in server-status isn't the same mechanism
that's used by the processes for the accepting/stopping logic?

> Could you set the global LogLevel to trace6 please and provide the error_log?
> If all your virtual hosts have their own LogLevel, this shouldn't
> affect the verbosity of the traffic, only the administration.

Ah, nice trick. I'll get that sorted if I get bored over Easter (I
don't think we set LogLevel per-vhost, so I'll likely need to
rearrange the config a bit).


/Nikke
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | nikke@acc.umu.se
---------------------------------------------------------------------------
Pain: Giving a Klingon a noogie.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Re: mpm_event old gen workers not stopping as they should
On Tue, Mar 30, 2021 at 8:37 PM Niklas Edmundsson <nikke@acc.umu.se> wrote:
>
> Out of these, mod_cache_disk_largefile is nonstandard; it's our own take
> on mod_cache_disk to suit our workload. It feels unlikely to be causing
> these side effects, although anything is possible, I guess...

Does this module create threads?

Also, is mpm_event loaded dynamically (LoadModule) or compiled
statically (builtin)?

>
> Ah, so the "old gen" marker in server-status isn't the same mechanism
> that's used by the processes for the accepting/stopping logic?

The "old gen" marks the processes that have an older generation number
than the one of the latest reload.
If some children processes somehow miss the "signal" to stop,
mod_status will still mark them as "old gen" because their generation
number in the scoreboard is old, but still those children will
continue to run as if nothing happened.
It shouldn't happen, but as you say anything is possible..
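
For reference, the relevant mod_status logic is roughly the following
(a paraphrased sketch, not a verbatim quote of the source):

--------------------8<---------------------------
/* The "Stopping" column comes from the scoreboard's quiescing flag;
 * "(old gen)" is appended purely because the slot's generation number
 * differs from the current MPM generation. */
int mpm_generation;
ap_mpm_query(AP_MPMQ_GENERATION, &mpm_generation);

process_score *ps_record = ap_get_scoreboard_process(i);
ap_rprintf(r, "<td>%s%s</td>",
           ps_record->quiescing ? "yes" : "no",
           ps_record->generation != mpm_generation ? " (old gen)" : "");
--------------------8<---------------------------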


Regards;
Yann.
Re: mpm_event old gen workers not stopping as they should
On Tue, 30 Mar 2021, Yann Ylavic wrote:

> On Tue, Mar 30, 2021 at 8:37 PM Niklas Edmundsson <nikke@acc.umu.se> wrote:
>>
>> Out of these, mod_cache_disk_largefile is nonstandard; it's our own take
>> on mod_cache_disk to suit our workload. It feels unlikely to be causing
>> these side effects, although anything is possible, I guess...
>
> Does this module create threads?

Gah. Got it in one. Yes, it creates threads dynamically to cache large
files in the background, and indeed, looking at the code, it doesn't
mask signals. At this point I fully expect this to be the issue,
although I don't really understand why it should be...

AIUI, event.c calls apr_setup_signal_thread(), which masks all signal
delivery; that mask should in turn be inherited by all subsequently
created threads, whether created by event.c or otherwise...

do_bgcopy() is our code that creates the threads, if you're interested:
https://github.com/accumu/mod_cache_disk_largefile/blob/master/mod_cache_disk_largefile.c#L3065
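
For the record, explicitly masking signals around thread creation
would look something like this (a hypothetical sketch, not the actual
do_bgcopy() code):

--------------------8<---------------------------
#include <pthread.h>
#include <signal.h>

/* Hypothetical sketch: block all signals before pthread_create() so
 * the new background thread starts with a fully blocked mask
 * (mirroring what apr_setup_signal_thread() arranges for the MPM's
 * own threads), then restore the creating thread's previous mask. */
static int spawn_bg_thread(void *(*fn)(void *), void *arg)
{
    sigset_t all, saved;
    pthread_t tid;
    int rc;

    sigfillset(&all);
    pthread_sigmask(SIG_BLOCK, &all, &saved);
    rc = pthread_create(&tid, NULL, fn, arg);
    pthread_sigmask(SIG_SETMASK, &saved, NULL);
    return rc;
}
--------------------8<---------------------------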

> Also, is mpm_event loaded dynamically (LoadModule) or compiled
> statically (builtin)?

It's loaded dynamically by:

LoadModule mpm_event_module libexec/mod_mpm_event.so


>> Ah, so the "old gen" marker in server-status isn't the same mechanism
>> that's used by the processes for the accepting/stopping logic?
>
> The "old gen" marks the processes that have an older generation number
> than the one of the latest reload.
> If some children processes somehow miss the "signal" to stop,
> mod_status will still mark them as "old gen" because their generation
> number in the scoreboard is old, but still those children will
> continue to run as if nothing happened.

OK, makes sense...

> It shouldn't happen, but as you say anything is possible..

Yup, especially if it's home-grown code that's to blame ;)


/Nikke
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | nikke@acc.umu.se
---------------------------------------------------------------------------
"Besides, you look good in a dress." -Riker, to Worf
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Re: mpm_event old gen workers not stopping as they should
I was ambivalent about asking this in this thread or starting a new
one, but since this thread has some background on our setup, I opted
to do it here.

On Tue, 30 Mar 2021, Niklas Edmundsson wrote:
> On Tue, 30 Mar 2021, Yann Ylavic wrote:
>> On Tue, Mar 30, 2021 at 8:37 PM Niklas Edmundsson <nikke@acc.umu.se> wrote:
>>>
>>> Out of these, mod_cache_disk_largefile is nonstandard; it's our own take
>>
>> Does this module create threads?
>
> Gah. Got it in one. Yes, it creates threads dynamically to cache large files
> in the background, and indeed, looking at the code, it doesn't mask
> signals. At this point I fully expect this to be the issue, although I don't
> really understand why it should be...
>
> AIUI, event.c calls apr_setup_signal_thread(), which masks all signal
> delivery; that mask should in turn be inherited by all subsequently created
> threads, whether created by event.c or otherwise...

Could our spawning short-lived threads in our own module in any way
cause mpm_event to behave like this when it runs out of workers? From
the trace6 logging on the main server (not vhosts):

[Mon Apr 12 13:17:37.815553 2021] [mpm_event:trace1] [pid 583332:tid 139762336704256] event.c(1811): Idle workers: 0

Notice the time gap here: NOTHING is logged by mpm_event in this time
frame, not even the every-second trace6 "connections:" messages that
are otherwise present...

[Mon Apr 12 13:37:40.314720 2021] [mpm_event:debug] [pid 583332:tid 139762336704256] event.c(485): AH00457: Accepting new connections again: 62 active conns (0 lingering/0 clogged/0 suspended), 2 idle workers

At this point in time the server broke out of the catatonic state, and
a few seconds later it started logging "workers shutdown" messages,
ie:

[Mon Apr 12 13:37:43.042779 2021] [mpm_event:trace6] [pid 581994:tid 139762336704256] event.c(1635): 11/64 workers shutdown

as if it were whiplashing from total overload to cleaning up surplus
workers...

event.c has the following config (from server-info); it shouldn't be
too exotic:

ListenBacklog 4000
Listen 80
Listen 443
GracefulShutdownTimeout 3600
ServerLimit 200
StartServers 1
MinSpareThreads 40
MaxSpareThreads 1024
ThreadLimit 64
ThreadsPerChild 64
MaxRequestWorkers 4096
AsyncRequestWorkerFactor 10
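
For context, if I read the mpm_event docs right: with ThreadsPerChild
64 and AsyncRequestWorkerFactor 10, a process stops accepting new
connections once it manages more than ThreadsPerChild +
(AsyncRequestWorkerFactor * idle workers) of them, i.e. at most
(10 + 1) * 64 = 704 connections per process, and (10 + 1) * 4096 =
45056 concurrent connections overall at MaxRequestWorkers 4096. So the
AH00457 "Accepting new connections again" message above should mean we
actually bumped into that limit.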


/Nikke
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | nikke@acc.umu.se
---------------------------------------------------------------------------
Windows isn't CrippleWare -- it's "Functionally Challenged".
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=