Apr 14, 2022, 6:43 AM
Post #8 of 10
(734 views)
Permalink
On Thu, Apr 14, 2022 at 1:44 PM Ruediger Pluem <rpluem@apache.org> wrote:
>
> On 4/14/22 1:16 PM, Yann Ylavic wrote:
> > On Thu, Apr 14, 2022 at 12:09 PM Ruediger Pluem <rpluem@apache.org> wrote:
> >>
> >> On 4/14/22 11:52 AM, Yann Ylavic wrote:
> >>>
> >>> Any signal that killed the process would mean something unexpected
> >>> happened since we clean_child_exit() otherwise?
> >>
> >> Hm. You mean that the case for
> >>
> >> case SIGTERM:
> >> case SIGHUP:
> >> case AP_SIG_GRACEFUL:
> >>
> >> in ap_process_child_status
> >> never triggers as the child catches these and does a clean_child_exit()?
> >
> > Yes
> >
> >>
> >> If yes, the above seems to be a simple solution, but it would also capture SIGKILL which might
> >> have been issued by our parent. Does this matter?
> >
> > I think our SIGKILLs (for ungraceful restart) are "captured" by
> > ap_reclaim_child_processes(), that is outside server_main_loop() so it
> > should be OK..
>
> Good point. Then I am fine with your proposed patch.
Thinking more about it, I came to something like this:
Index: server/mpm/event/event.c
===================================================================
--- server/mpm/event/event.c (revision 1899818)
+++ server/mpm/event/event.c (working copy)
@@ -3382,6 +3382,7 @@ static void server_main_loop(int remaining_childre
{
int num_buckets = retained->mpm->num_buckets;
int max_daemon_used = 0;
+ int successive_children_signaled = 0;
int child_slot;
apr_exit_why_e exitwhy;
int status, processed_status;
@@ -3460,11 +3461,26 @@ static void server_main_loop(int remaining_childre
/* Don't perform idle maintenance when a child dies,
* only do it when there's a timeout. Remember only a
* finite number of children can die, and it's pretty
- * pathological for a lot to die suddenly.
+ * pathological for a lot to die suddenly. If that happens
+ * anyway, protect against pathological segfault flood by not
+ * restarting more than 3 children if no timeout happened in
+ * between, otherwise we let the usual maintenance go.
*/
- continue;
+ if (child_slot < 0 || !APR_PROC_CHECK_SIGNALED(exitwhy)) {
+ continue;
+ }
+ if (++successive_children_signaled >= 3) {
+ apr_sleep(apr_time_from_sec(1));
+ }
+ else {
+ remaining_children_to_start++;
+ }
}
- else if (remaining_children_to_start) {
+ else {
+ successive_children_signaled = 0;
+ }
+
+ if (remaining_children_to_start) {
/* we hit a 1 second timeout in which none of the previous
* generation of children needed to be reaped... so assume
* they're all done, and pick up the slack if any is left.
--
I have not tested it yet but possibly without this if many children
segfault successively we might enter a busy fork() loop.
The above would restart killed children immediately (using the
remaining_children_to_start mechanism) only if there are less than 3
successively, otherwise sleep 1s and
perform_idle_server_maintenance().
WDYT?
> >
> >>
> >> Do we need something for processes that die due to MaxConnectionsPerChild or do we expect them to die at a slow path where the
> >> current approach is fine?
> >
> > Hm, good question. Restarting them if not necessary from a maintenance
> > metrics perspective is no better/worse than waiting for a maintenance
> > cycle, so I'd say let them be maintained as we do now..
> > Ideally (maybe) we'd pass a timeout to ap_wait_or_timeout() to make
> > sure that maintenance cycles really happen every 1s, regardless of
> > exited processes in the meantime (i.e. handle a "remaining" timeout
> > instead of the hardcoded 1s which currently may result in
> > 0.9s+waitpid()+0.9s+waitpid()+timeout thus here ~2s but possibly more
> > between two maintenance cycles).
>
> Hm, but ap_wait_or_timeout only waits if there is nothing to reap, which would mean we would wait at max 1 s (+ processing time
> for multiple ap_wait_or_timeout calls) until we perform the idle server maintenance, or do I miss your point?
Yeah, I should have looked at ap_wait_or_timeout() better, it's
actually an ap_wait_or_sleep()..
Regards;
Yann.