Mailing List Archive

child exit on self-proxy
In my h2 test suite, I do a setup where I use proxy configs against the server itself. We seem to have a problem performing a clean child exit with that. Tested in 2.4.48 and trunk:
- run tests with several graceful restarts
- no proxied request, clean exit
- with proxied requests
AH00045: child process 53682 still did not exit, sending a SIGTERM
AH00045: child process 53682 still did not exit, sending a SIGTERM
[often stops here, but sometimes continues:]
...
AH00046: child process 53682 still did not exit, sending a SIGKILL

Question:
- is such a test setup doomed to fail in general?
- are we sure that we cannot encounter such states in "normal" setups?

If someone wants a log at whatever LogLevels, let me know. It seems highly reproducible.

- Stefan
Re: child exit on self-proxy
On 7/7/21 11:45 AM, Stefan Eissing wrote:
> In my h2 test suite, I do a setup where I use proxy configs against the server itself. We seem to have a problem performing a clean child exit with that. Tested in 2.4.48 and trunk:
> - run tests with several graceful restarts
> - no proxied request, clean exit
> - with proxied requests
> AH00045: child process 53682 still did not exit, sending a SIGTERM
> AH00045: child process 53682 still did not exit, sending a SIGTERM
> [often stops here, but sometimes continues:]
> ...
> AH00046: child process 53682 still did not exit, sending a SIGKILL
>
> Question:
> - is such a test setup doomed to fail in general?
> - are we sure that we cannot encounter such states in "normal" setups?
>
> If someone wants a log at whatever LogLevels, let me know. It seems highly reproducible.

Do you have stack traces where these processes are hanging and a simple config that causes this?

Regards

Rüdiger
Re: child exit on self-proxy
> Am 07.07.2021 um 11:49 schrieb Ruediger Pluem <rpluem@apache.org>:
>
>
>
> On 7/7/21 11:45 AM, Stefan Eissing wrote:
>> In my h2 test suite, I do a setup where I use proxy configs against the server itself. We seem to have a problem performing a clean child exit with that. Tested in 2.4.48 and trunk:
>> - run tests with several graceful restarts
>> - no proxied request, clean exit
>> - with proxied requests
>> AH00045: child process 53682 still did not exit, sending a SIGTERM
>> AH00045: child process 53682 still did not exit, sending a SIGTERM
>> [often stops here, but sometimes continues:]
>> ...
>> AH00046: child process 53682 still did not exit, sending a SIGKILL
>>
>> Question:
>> - is such a test setup doomed to fail in general?
>> - are we sure that we cannot encounter such states in "normal" setups?
>>
>> If someone wants a log at whatever LogLevels, let me know. It seems highly reproducible.
>
> Do you have stack traces where these processes are hanging and a simple config that causes this?

I added a TRACE1 log in event.c before/after join_workers (line 2921) and see:

[Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 4493635072] event.c(2921): graceful termination received, join workers
[Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 123145435672576] event.c(1799): All workers are busy or dying, will close 1 keep-alive connections
[Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 123145435672576] event.c(1799): All workers are busy or dying, will close 0 keep-alive connections
[Wed Jul 07 10:06:10.787777 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
[Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
[Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
[Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] AH00046: child process 72886 still did not exit, sending a SIGKILL

So, I assume the keep-alive connection is the mod_proxy_http connection to the server itself. Since join_workers() never returns, there seems to be a thread not finishing.
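
(For reference, the added trace amounts to something like this around the join_workers() call in child_main(); a sketch, not the exact patch:)

ap_log_error(APLOG_MARK, APLOG_TRACE1, 0, ap_server_conf,
             "graceful termination received, join workers");
join_workers(ts->listener, threads);
ap_log_error(APLOG_MARK, APLOG_TRACE1, 0, ap_server_conf,
             "join workers done");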

On another run, I got a stacktrace of the child:

Call graph:
264 Thread_36766542 DispatchQueue_1: com.apple.main-thread (serial)
+ 264 start (in libdyld.dylib) + 1 [0x7fff2032cf5d]
+ 264 main (in httpd) + 2278 [0x104729b86] main.c:862
+ 264 ap_run_mpm (in httpd) + 75 [0x10473917b] mpm_common.c:100
+ 264 event_run (in mod_mpm_event.so) + 2994 [0x10502ae62] event.c:3398
+ 264 make_child (in mod_mpm_event.so) + 436 [0x10502bbc4] event.c:2997
+ 264 child_main (in mod_mpm_event.so) + 1734 [0x10502c2e6] event.c:2924
+ 264 join_workers (in mod_mpm_event.so) + 386 [0x10502cc72] event.c:2717
+ 264 apr_thread_join (in libapr-2.0.dylib) + 44 [0x1048b347c] thread.c:256
+ 264 _pthread_join (in libsystem_pthread.dylib) + 362 [0x7fff20312f60]
+ 264 __ulock_wait (in libsystem_kernel.dylib) + 10 [0x7fff202dd9ee]
264 Thread_36766548
+ 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
+ 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
+ 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
+ 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
+ 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
+ 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
264 Thread_36766549
+ 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
+ 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
+ 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
+ 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
+ 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
+ 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
264 Thread_36766550
+ 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
+ 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
+ 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
+ 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
+ 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
+ 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
264 Thread_36766551
+ 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
+ 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
+ 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
+ 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
+ 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
+ 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
264 Thread_36766565
+ 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
+ 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
+ 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
+ 264 worker_thread (in mod_mpm_event.so) + 380 [0x10502cf2c] event.c:2340
+ 264 ap_queue_pop_something (in httpd) + 142 [0x10473cf3e] mpm_fdqueue.c:447
+ 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
+ 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
264 Thread_36766579
264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
264 listener_thread (in mod_mpm_event.so) + 1129 [0x10502e069] event.c:1940
264 impl_pollset_poll (in libapr-2.0.dylib) + 114 [0x1048aec82] kqueue.c:272
264 kevent (in libsystem_kernel.dylib) + 10 [0x7fff202e0c4a]



>
> Regards
>
> Rüdiger
Re: child exit on self-proxy
On Wed, Jul 7, 2021 at 12:16 PM Stefan Eissing
<stefan.eissing@greenbytes.de> wrote:
>
> I added a TRACE1 log in event.c before/after join_workers (line 2921) and see:
>
> [Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 4493635072] event.c(2921): graceful termination received, join workers
> [Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 123145435672576] event.c(1799): All workers are busy or dying, will close 1 keep-alive connections
> [Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 123145435672576] event.c(1799): All workers are busy or dying, will close 0 keep-alive connections
> [Wed Jul 07 10:06:10.787777 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
> [Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
> [Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
> [Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] AH00046: child process 72886 still did not exit, sending a SIGKILL
>
> So, I assume the keep-alive connection is the mod_proxy_http connection to the server itself. Since join_workers() never returns, there seems to be a thread not finishing.

No, MPM event doesn't handle mod_proxy kept alive connections, those
are client connections.

>
> On another run, I got a stacktrace of the child:
>
> Call graph:
> 264 Thread_36766542 DispatchQueue_1: com.apple.main-thread (serial)
> + 264 start (in libdyld.dylib) + 1 [0x7fff2032cf5d]
> + 264 main (in httpd) + 2278 [0x104729b86] main.c:862
> + 264 ap_run_mpm (in httpd) + 75 [0x10473917b] mpm_common.c:100
> + 264 event_run (in mod_mpm_event.so) + 2994 [0x10502ae62] event.c:3398
> + 264 make_child (in mod_mpm_event.so) + 436 [0x10502bbc4] event.c:2997
> + 264 child_main (in mod_mpm_event.so) + 1734 [0x10502c2e6] event.c:2924
> + 264 join_workers (in mod_mpm_event.so) + 386 [0x10502cc72] event.c:2717
> + 264 apr_thread_join (in libapr-2.0.dylib) + 44 [0x1048b347c] thread.c:256
> + 264 _pthread_join (in libsystem_pthread.dylib) + 362 [0x7fff20312f60]
> + 264 __ulock_wait (in libsystem_kernel.dylib) + 10 [0x7fff202dd9ee]

OK, that's where the child waits for worker threads.

> 264 Thread_36766548
> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
> 264 Thread_36766549
> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
> 264 Thread_36766550
> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
> 264 Thread_36766551
> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]

The above mod_http2 threads are not accounted for by the MPM I suppose
(they should be stopped by the future child_exit hook, but still the
MPM will only care about its own worker threads, i.e. those running
the process_connection hook).

> 264 Thread_36766565
> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
> + 264 worker_thread (in mod_mpm_event.so) + 380 [0x10502cf2c] event.c:2340
> + 264 ap_queue_pop_something (in httpd) + 142 [0x10473cf3e] mpm_fdqueue.c:447
> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]

This worker didn't get the signal to stop from ap_queue_term()..
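
(For context, ap_queue_term() in server/mpm_fdqueue.c is what makes blocked workers exit; roughly, as a paraphrased sketch rather than an exact copy:)

apr_status_t ap_queue_term(fd_queue_t *queue)
{
    apr_status_t rv;

    if ((rv = apr_thread_mutex_lock(queue->one_big_mutex)) != APR_SUCCESS) {
        return rv;
    }
    /* mark the queue dead; poppers re-check this once they wake up */
    queue->terminated = 1;
    if ((rv = apr_thread_mutex_unlock(queue->one_big_mutex)) != APR_SUCCESS) {
        return rv;
    }
    /* wake every thread blocked in ap_queue_pop_something() */
    return ap_queue_interrupt_all(queue);
}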

> 264 Thread_36766579
> 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
> 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
> 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
> 264 listener_thread (in mod_mpm_event.so) + 1129 [0x10502e069] event.c:1940
> 264 impl_pollset_poll (in libapr-2.0.dylib) + 114 [0x1048aec82] kqueue.c:272
> 264 kevent (in libsystem_kernel.dylib) + 10 [0x7fff202e0c4a]

.. and that's because the listener didn't exit either, hmm.

Your system sets listener_is_wakeable (AH02471), right?
Does wakeup_listener() get called for the graceful restart that hangs?

Cheers;
Yann.
Re: child exit on self-proxy
Was busy on other things. Will try to get answers...

> Am 07.07.2021 um 13:51 schrieb Yann Ylavic <ylavic.dev@gmail.com>:
>
> On Wed, Jul 7, 2021 at 12:16 PM Stefan Eissing
> <stefan.eissing@greenbytes.de> wrote:
>>
>> I added a TRACE1 log in event.c before/after join_workers (line 2921) and see:
>>
>> [Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 4493635072] event.c(2921): graceful termination received, join workers
>> [Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 123145435672576] event.c(1799): All workers are busy or dying, will close 1 keep-alive connections
>> [Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 123145435672576] event.c(1799): All workers are busy or dying, will close 0 keep-alive connections
>> [Wed Jul 07 10:06:10.787777 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] AH00046: child process 72886 still did not exit, sending a SIGKILL
>>
>> So, I assume the keep-alive connection is the mod_proxy_http connection to the server itself. Since join_workers() never returns, there seems to be a thread not finishing.
>
> No, MPM event doesn't handle mod_proxy kept alive connections, those
> are client connections.
>
>>
>> On another run, I got a stacktrace of the child:
>>
>> Call graph:
>> 264 Thread_36766542 DispatchQueue_1: com.apple.main-thread (serial)
>> + 264 start (in libdyld.dylib) + 1 [0x7fff2032cf5d]
>> + 264 main (in httpd) + 2278 [0x104729b86] main.c:862
>> + 264 ap_run_mpm (in httpd) + 75 [0x10473917b] mpm_common.c:100
>> + 264 event_run (in mod_mpm_event.so) + 2994 [0x10502ae62] event.c:3398
>> + 264 make_child (in mod_mpm_event.so) + 436 [0x10502bbc4] event.c:2997
>> + 264 child_main (in mod_mpm_event.so) + 1734 [0x10502c2e6] event.c:2924
>> + 264 join_workers (in mod_mpm_event.so) + 386 [0x10502cc72] event.c:2717
>> + 264 apr_thread_join (in libapr-2.0.dylib) + 44 [0x1048b347c] thread.c:256
>> + 264 _pthread_join (in libsystem_pthread.dylib) + 362 [0x7fff20312f60]
>> + 264 __ulock_wait (in libsystem_kernel.dylib) + 10 [0x7fff202dd9ee]
>
> OK, that's where the child waits for worker threads.
>
>> 264 Thread_36766548
>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
>> 264 Thread_36766549
>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
>> 264 Thread_36766550
>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
>> 264 Thread_36766551
>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
>
> The above mod_http2 threads are not accounted for by the MPM I suppose
> (they should be stopped by the future child_exit hook, but still the
> MPM will only care about its own worker threads, i.e. those running
> the process_connection hook).
>
>> 264 Thread_36766565
>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> + 264 worker_thread (in mod_mpm_event.so) + 380 [0x10502cf2c] event.c:2340
>> + 264 ap_queue_pop_something (in httpd) + 142 [0x10473cf3e] mpm_fdqueue.c:447
>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
>
> This worker didn't get the signal to stop from ap_queue_term()..
>
>> 264 Thread_36766579
>> 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> 264 listener_thread (in mod_mpm_event.so) + 1129 [0x10502e069] event.c:1940
>> 264 impl_pollset_poll (in libapr-2.0.dylib) + 114 [0x1048aec82] kqueue.c:272
>> 264 kevent (in libsystem_kernel.dylib) + 10 [0x7fff202e0c4a]
>
> .. and that's because the listener didn't exit either, hmm.
>
> Your system sets listener_is_wakeable (AH02471), right?
> Does wakeup_listener() get called for the graceful restart that hangs?
>
> Cheers;
> Yann.
Re: child exit on self-proxy
> Am 07.07.2021 um 13:51 schrieb Yann Ylavic <ylavic.dev@gmail.com>:
>
> On Wed, Jul 7, 2021 at 12:16 PM Stefan Eissing
> <stefan.eissing@greenbytes.de> wrote:
>>
>> I added a TRACE1 log in event.c before/after join_workers (line 2921) and see:
>>
>> [Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 4493635072] event.c(2921): graceful termination received, join workers
>> [Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 123145435672576] event.c(1799): All workers are busy or dying, will close 1 keep-alive connections
>> [Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 123145435672576] event.c(1799): All workers are busy or dying, will close 0 keep-alive connections
>> [Wed Jul 07 10:06:10.787777 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] AH00046: child process 72886 still did not exit, sending a SIGKILL
>>
>> So, I assume the keep-alive connection is the mod_proxy_http connection to the server itself. Since join_workers() never returns, there seems to be a thread not finishing.
>
> No, MPM event doesn't handle mod_proxy kept alive connections, those
> are client connections.
>
>>
>> On another run, I got a stacktrace of the child:
>>
>> Call graph:
>> 264 Thread_36766542 DispatchQueue_1: com.apple.main-thread (serial)
>> + 264 start (in libdyld.dylib) + 1 [0x7fff2032cf5d]
>> + 264 main (in httpd) + 2278 [0x104729b86] main.c:862
>> + 264 ap_run_mpm (in httpd) + 75 [0x10473917b] mpm_common.c:100
>> + 264 event_run (in mod_mpm_event.so) + 2994 [0x10502ae62] event.c:3398
>> + 264 make_child (in mod_mpm_event.so) + 436 [0x10502bbc4] event.c:2997
>> + 264 child_main (in mod_mpm_event.so) + 1734 [0x10502c2e6] event.c:2924
>> + 264 join_workers (in mod_mpm_event.so) + 386 [0x10502cc72] event.c:2717
>> + 264 apr_thread_join (in libapr-2.0.dylib) + 44 [0x1048b347c] thread.c:256
>> + 264 _pthread_join (in libsystem_pthread.dylib) + 362 [0x7fff20312f60]
>> + 264 __ulock_wait (in libsystem_kernel.dylib) + 10 [0x7fff202dd9ee]
>
> OK, that's where the child waits for worker threads.
>
>> 264 Thread_36766548
>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
>> 264 Thread_36766549
>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
>> 264 Thread_36766550
>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
>> 264 Thread_36766551
>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> + 264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
>
> The above mod_http2 threads are not accounted for by the MPM I suppose
> (they should be stopped by the future child_exit hook, but still the
> MPM will only care about its own worker threads, i.e. those running
> the process_connection hook).

Yes, the h2 worker threads currently get cleaned up in a pool pre-cleanup. Those are OK to see, as that has not happened yet.
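
(The registration is roughly this one-liner in h2_workers.c; the callback name is an assumption:)

/* runs before the pool's memory is torn down, so the slot threads can
 * still be signalled and joined */
apr_pool_pre_cleanup_register(pool, workers, workers_pool_cleanup);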

>
>> 264 Thread_36766565
>> + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> + 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> + 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> + 264 worker_thread (in mod_mpm_event.so) + 380 [0x10502cf2c] event.c:2340
>> + 264 ap_queue_pop_something (in httpd) + 142 [0x10473cf3e] mpm_fdqueue.c:447
>> + 264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
>> + 264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
>
> This worker didn't get the signal to stop from ap_queue_term()..
>
>> 264 Thread_36766579
>> 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
>> 264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
>> 264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
>> 264 listener_thread (in mod_mpm_event.so) + 1129 [0x10502e069] event.c:1940
>> 264 impl_pollset_poll (in libapr-2.0.dylib) + 114 [0x1048aec82] kqueue.c:272
>> 264 kevent (in libsystem_kernel.dylib) + 10 [0x7fff202e0c4a]
>
> .. and that's because the listener didn't exit either, hmm.
>
> Your system sets listener_is_wakeable (AH02471), right?

once at process start:
[Wed Jul 07 10:11:32.368594 2021] [mpm_event:debug] [pid 73128:tid 123145437016064] event.c(2611): AH02471: start_threads: Using kqueue (wakeable)

> Does wakeup_listener() get called for the graceful restart that hangs?
>
> Cheers;
> Yann.

Inserted some debug logging into event.c (and of course, the more I insert, the lower the likelihood of it happening).
Also, I had apr trunk in srclib and switched to branches/1.7.x - same behaviour.
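
The instrumentation amounts to roughly this (a sketch reconstructed from the log tags, not the actual patch; the surrounding wakeup_listener() logic is simplified):

static void wakeup_listener(void)
{
    ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, ap_server_conf,
                 "AH: wakeup_listener: start");
    listener_may_exit = 1;
    disable_listensocks();

    /* unblock the listener if it's poll()ing */
    if (event_pollset && listener_is_wakeable) {
        apr_pollset_wakeup(event_pollset);
        ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, ap_server_conf,
                     "AH: wakeup_listener: apr_pollset_wakeup");
    }

    /* unblock the listener if it's waiting for a worker */
    if (worker_queue_info) {
        ap_queue_info_term(worker_queue_info);
        ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, ap_server_conf,
                     "AH: wakeup_listener: ap_queue_info_term");
    }

    /* in case it's blocked in a syscall, signal the thread */
    if (listener_os_thread) {
        pthread_kill(*listener_os_thread, LISTENER_SIGNAL);
        ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, ap_server_conf,
                     "AH: wakeup_listener: pthread_kill");
    }
    ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, ap_server_conf,
                 "AH: wakeup_listener: end");
}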

End of log for a hanging process:

[Wed Jul 07 14:35:12.300919 2021] [mpm_event:debug] [pid 80258:tid 4439064064] event.c(599): AH: wakeup_listener: start
[Wed Jul 07 14:35:12.300938 2021] [mpm_event:debug] [pid 80258:tid 4439064064] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
[Wed Jul 07 14:35:12.300952 2021] [mpm_event:debug] [pid 80258:tid 4439064064] event.c(610): AH: wakeup_listener: ap_queue_info_term
[Wed Jul 07 14:35:12.301082 2021] [mpm_event:trace1] [pid 80258:tid 123145536962560] event.c(1809): All workers are busy or dying, will close 1 keep-alive connections
[Wed Jul 07 14:35:12.301194 2021] [mpm_event:debug] [pid 80258:tid 4439064064] event.c(629): AH: wakeup_listener: pthread_kill
[Wed Jul 07 14:35:12.301433 2021] [mpm_event:debug] [pid 80258:tid 4439064064] event.c(635): AH: wakeup_listener: end
[Wed Jul 07 14:35:12.301444 2021] [mpm_event:trace1] [pid 80258:tid 4439064064] event.c(2933): graceful termination received, join workers
[Wed Jul 07 14:35:16.501276 2021] [mpm_event:trace1] [pid 80258:tid 123145536962560] event.c(1809): All workers are busy or dying, will close 0 keep-alive connections

macOS process analysis:
Call graph:
2942 Thread_36817225 DispatchQueue_1: com.apple.main-thread (serial)
+ 2942 start (in libdyld.dylib) + 1 [0x7fff2032cf5d]
+ 2942 main (in httpd) + 2278 [0x107ec3b86] main.c:862
+ 2942 ap_run_mpm (in httpd) + 75 [0x107ed317b] mpm_common.c:100
+ 2942 event_run (in mod_mpm_event.so) + 2994 [0x1087c4b72] event.c:3410
+ 2942 make_child (in mod_mpm_event.so) + 436 [0x1087c58d4] event.c:3009
+ 2942 child_main (in mod_mpm_event.so) + 1734 [0x1087c5ff6] event.c:2936
+ 2942 join_workers (in mod_mpm_event.so) + 197 [0x1087c67e5] event.c:2729
+ 2942 apr_thread_join (in libapr-2.0.dylib) + 44 [0x10804d47c] thread.c:256
+ 2942 _pthread_join (in libsystem_pthread.dylib) + 362 [0x7fff20312f60]
+ 2942 __ulock_wait (in libsystem_kernel.dylib) + 10 [0x7fff202dd9ee]
2942 Thread_36817227
+ 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
+ 2942 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
+ 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] thread.c:148
+ 2942 slot_run (in mod_http2.so) + 189 [0x108733a8d] h2_workers.c:260
+ 2942 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
+ 2942 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
2942 Thread_36817228
+ 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
+ 2942 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
+ 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] thread.c:148
+ 2942 slot_run (in mod_http2.so) + 189 [0x108733a8d] h2_workers.c:260
+ 2942 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
+ 2942 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
2942 Thread_36817229
+ 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
+ 2942 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
+ 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] thread.c:148
+ 2942 slot_run (in mod_http2.so) + 189 [0x108733a8d] h2_workers.c:260
+ 2942 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
+ 2942 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
2942 Thread_36817230
+ 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
+ 2942 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
+ 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] thread.c:148
+ 2942 slot_run (in mod_http2.so) + 189 [0x108733a8d] h2_workers.c:260
+ 2942 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
+ 2942 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
2942 Thread_36817254
+ 2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
+ 2942 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
+ 2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] thread.c:148
+ 2942 worker_thread (in mod_mpm_event.so) + 300 [0x1087c6a4c] event.c:2350
+ 2942 ap_queue_pop_something (in httpd) + 142 [0x107ed6f3e] mpm_fdqueue.c:447
+ 2942 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
+ 2942 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
2942 Thread_36817257
2942 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
2942 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
2942 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] thread.c:148
2942 listener_thread (in mod_mpm_event.so) + 1148 [0x1087c7c2c] event.c:1950
2942 impl_pollset_poll (in libapr-2.0.dylib) + 114 [0x108048c82] kqueue.c:272
2942 kevent (in libsystem_kernel.dylib) + 10 [0x7fff202e0c4a]

Total number in stack (recursive counted multiple, when >=5):
6 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
6 dummy_worker (in libapr-2.0.dylib) + 30 [0x10804d3ee] thread.c:148
6 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
5 __psynch_cvwait (in libsystem_kernel.dylib) + 0 [0x7fff202decd4]
5 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]

Sort by top of stack, same collapsed (when >= 5):
__psynch_cvwait (in libsystem_kernel.dylib) 14710
__ulock_wait (in libsystem_kernel.dylib) 2942
kevent (in libsystem_kernel.dylib) 2942
Re: child exit on self-proxy
In order to reproduce the logs I see for this weird behaviour, I'll attach the patch I made:
Re: child exit on self-proxy
On 7/7/21 5:25 PM, Stefan Eissing wrote:
> In order to reproduce the logs I see for this weird behaviour, I'll attach the patch I made:
>
>
>
>
> With this, I see in a hanging process:
>
> [Wed Jul 07 15:20:23.014980 2021] [core:trace1] [pid 42471:tid 123145448968192] core_filters.c(429): (32)Broken pipe: [client 127.0.0.1:49824] core_output_filter: writing data to the network
> [Wed Jul 07 15:20:23.044541 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(599): AH: wakeup_listener: start
> [Wed Jul 07 15:20:23.044601 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
> [Wed Jul 07 15:20:23.044715 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(610): AH: wakeup_listener: ap_queue_info_term
> [Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid 123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=0

Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the actual value of listener_may_exit in the log message?

> [Wed Jul 07 15:20:23.044817 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(629): AH: wakeup_listener: pthread_kill
> [Wed Jul 07 15:20:23.044954 2021] [mpm_event:trace1] [pid 42471:tid 123145461309440] event.c(1809): All workers are busy or dying, will close 1 keep-alive connections
> [Wed Jul 07 15:20:23.044987 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(635): AH: wakeup_listener: end
> [Wed Jul 07 15:20:23.045154 2021] [mpm_event:trace1] [pid 42471:tid 4525448704] event.c(2940): graceful termination received, join workers
> [Wed Jul 07 15:20:25.927437 2021] [mpm_event:debug] [pid 42471:tid 123145461309440] event.c(1985): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=0

Looks like we get back in the poll and have to wait for the timeout.

> [Wed Jul 07 15:20:25.927468 2021] [mpm_event:trace1] [pid 42471:tid 123145461309440] event.c(1809): All workers are busy or dying, will close 0 keep-alive connections
>
> Which says (to my eyes) that the pollset_wakeup happened, but the listener saw "listener_may_exit == 0". Is the volatile maybe not enough?

Regards

Rüdiger
Re: child exit on self-proxy
> Am 07.07.2021 um 17:39 schrieb Ruediger Pluem <rpluem@apache.org>:
>
>
>
> On 7/7/21 5:25 PM, Stefan Eissing wrote:
>> In order to reproduce the logs I see for this weird behaviour, I'll attach the patch I made:
>>
>>
>>
>>
>> With this, I see in a hanging process:
>>
>> [Wed Jul 07 15:20:23.014980 2021] [core:trace1] [pid 42471:tid 123145448968192] core_filters.c(429): (32)Broken pipe: [client 127.0.0.1:49824] core_output_filter: writing data to the network
>> [Wed Jul 07 15:20:23.044541 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(599): AH: wakeup_listener: start
>> [Wed Jul 07 15:20:23.044601 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
>> [Wed Jul 07 15:20:23.044715 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(610): AH: wakeup_listener: ap_queue_info_term
>> [Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid 123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=0
>
> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the actual value of listener_may_exit in the log message?

I put it in statically, since that was the value in the "if ()" 6 lines above; it should have been visible then and there for this to work.

>
>> [Wed Jul 07 15:20:23.044817 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(629): AH: wakeup_listener: pthread_kill
>> [Wed Jul 07 15:20:23.044954 2021] [mpm_event:trace1] [pid 42471:tid 123145461309440] event.c(1809): All workers are busy or dying, will close 1 keep-alive connections
>> [Wed Jul 07 15:20:23.044987 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(635): AH: wakeup_listener: end
>> [Wed Jul 07 15:20:23.045154 2021] [mpm_event:trace1] [pid 42471:tid 4525448704] event.c(2940): graceful termination received, join workers
>> [Wed Jul 07 15:20:25.927437 2021] [mpm_event:debug] [pid 42471:tid 123145461309440] event.c(1985): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=0
>
> Looks like we get back in the poll and have to wait for the timeout.
>
>> [Wed Jul 07 15:20:25.927468 2021] [mpm_event:trace1] [pid 42471:tid 123145461309440] event.c(1809): All workers are busy or dying, will close 0 keep-alive connections
>>
>> Which says (to my eyes) that the pollset_wakeup happened, but the listener saw "listener_may_exit == 0". Is the volatile maybe not enough?
>
> Regards
>
> Rüdiger
Re: child exit on self-proxy
On Wed, Jul 7, 2021 at 5:39 PM Ruediger Pluem <rpluem@apache.org> wrote:
>
>
>
> On 7/7/21 5:25 PM, Stefan Eissing wrote:
> > In order to reproduce the logs I see for this weird behaviour, I'll attach the patch I made:
> >
> >
> >
> >
> > With this, I see in a hanging process:
> >
> > [Wed Jul 07 15:20:23.014980 2021] [core:trace1] [pid 42471:tid 123145448968192] core_filters.c(429): (32)Broken pipe: [client 127.0.0.1:49824] core_output_filter: writing data to the network
> > [Wed Jul 07 15:20:23.044541 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(599): AH: wakeup_listener: start
> > [Wed Jul 07 15:20:23.044601 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
> > [Wed Jul 07 15:20:23.044715 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(610): AH: wakeup_listener: ap_queue_info_term
> > [Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid 123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=0
>
> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the actual value of listener_may_exit in the log message?

It would be interesting to log apr_atomic_read32(&connection_count) too.
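
(e.g. a debug line of this shape, using the event.c globals; sketch only:)

rc = apr_pollset_poll(event_pollset, timeout_interval, &num, &out_pfd);
ap_log_error(APLOG_MARK, APLOG_DEBUG, rc, ap_server_conf,
             "AH: pollset returned listener_may_exit=%d connection_count=%u",
             listener_may_exit,
             apr_atomic_read32(&connection_count));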
Re: child exit on self-proxy
On Wed, Jul 7, 2021 at 5:45 PM Stefan Eissing
<stefan.eissing@greenbytes.de> wrote:
>
> > Am 07.07.2021 um 17:39 schrieb Ruediger Pluem <rpluem@apache.org>:
> >
> > Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the actual value of listener_may_exit in the log message?
>
> I put it in statically, since that was the value in the "if ()" 6 lines above; it should have been visible then and there for this to work.

The above if () does not always break.
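
(The listener loop is shaped roughly like this, simplified: poll() returning with listener_may_exit set only breaks out of the loop once the connection count has drained.)

for (;;) {
    rc = apr_pollset_poll(event_pollset, timeout_interval, &num, &out_pfd);
    if (listener_may_exit) {
        apr_uint32_t conns = apr_atomic_read32(&connection_count);
        close_listeners();
        if (terminate_mode == ST_UNGRACEFUL || conns == 0)
            break;  /* otherwise keep polling until connections drain */
    }
    /* ... dispatch poll results to workers ... */
}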


Cheers;
Yann.
Re: child exit on self-proxy
The things I do for you...

[Wed Jul 07 15:48:52.591983 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(599): AH: wakeup_listener: start
[Wed Jul 07 15:48:52.592000 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
[Wed Jul 07 15:48:52.592046 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(610): AH: wakeup_listener: ap_queue_info_term
[Wed Jul 07 15:48:52.592188 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(629): AH: wakeup_listener: pthread_kill
[Wed Jul 07 15:48:52.592166 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=2
[Wed Jul 07 15:48:52.592523 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(635): AH: wakeup_listener: end
[Wed Jul 07 15:48:52.592727 2021] [mpm_event:trace1] [pid 48747:tid 4385197568] event.c(2942): graceful termination received, join workers
[Wed Jul 07 15:48:52.592515 2021] [mpm_event:trace1] [pid 48747:tid 123145502474240] event.c(1809): All workers are busy or dying, will close 2 keep-alive connections
[Wed Jul 07 15:48:52.593381 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): AH: pollset returned listener_may_exit=1 connection_count=2
[Wed Jul 07 15:48:52.593478 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=1
[Wed Jul 07 15:48:57.552623 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=1 connection_count=1
[Wed Jul 07 15:48:57.552658 2021] [mpm_event:trace1] [pid 48747:tid 123145502474240] event.c(1809): All workers are busy or dying, will close 0 keep-alive connections
Re: child exit on self-proxy
Does this attached patch help? This is to verify a hypothesis where
killed keepalive connections are still in the defer_linger_chain
without a worker to handle them.

I don't see "interrupted, but continuing" triggering in your log but
possibly it happened earlier..
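
(For context, "deferring" means pushing the connection state onto an atomic chain that a worker is expected to pop; simplified sketch, the function name is mine:)

static void defer_connection(event_conn_state_t *cs)
{
    event_conn_state_t *chain;
    do {
        cs->chain = chain = defer_linger_chain;
    } while (apr_atomic_casptr((void *)&defer_linger_chain,
                               cs, chain) != chain);
    /* if no worker ever pops the chain, the connection is never
     * closed and connection_count never drops */
}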

On Wed, Jul 7, 2021 at 5:51 PM Stefan Eissing
<stefan.eissing@greenbytes.de> wrote:
>
> The things I do for you...
>
> [Wed Jul 07 15:48:52.591983 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(599): AH: wakeup_listener: start
> [Wed Jul 07 15:48:52.592000 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
> [Wed Jul 07 15:48:52.592046 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(610): AH: wakeup_listener: ap_queue_info_term
> [Wed Jul 07 15:48:52.592188 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(629): AH: wakeup_listener: pthread_kill
> [Wed Jul 07 15:48:52.592166 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=2
> [Wed Jul 07 15:48:52.592523 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(635): AH: wakeup_listener: end
> [Wed Jul 07 15:48:52.592727 2021] [mpm_event:trace1] [pid 48747:tid 4385197568] event.c(2942): graceful termination received, join workers
> [Wed Jul 07 15:48:52.592515 2021] [mpm_event:trace1] [pid 48747:tid 123145502474240] event.c(1809): All workers are busy or dying, will close 2 keep-alive connections
> [Wed Jul 07 15:48:52.593381 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): AH: pollset returned listener_may_exit=1 connection_count=2
> [Wed Jul 07 15:48:52.593478 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=1
> [Wed Jul 07 15:48:57.552623 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=1 connection_count=1
> [Wed Jul 07 15:48:57.552658 2021] [mpm_event:trace1] [pid 48747:tid 123145502474240] event.c(1809): All workers are busy or dying, will close 0 keep-alive connections
>
>
>
> Btw. I opened a bottle of nice wine - this is entertaining. ;-)
>
> - Stefan
>
> > Am 07.07.2021 um 17:48 schrieb Yann Ylavic <ylavic.dev@gmail.com>:
> >
> > On Wed, Jul 7, 2021 at 5:45 PM Stefan Eissing
> > <stefan.eissing@greenbytes.de> wrote:
> >>
> >>> Am 07.07.2021 um 17:39 schrieb Ruediger Pluem <rpluem@apache.org>:
> >>>
> >>> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the actual value of listener_may_exit in the log message?
> >>
> >> I put it in statically, since that was the value in the "if ()" 6 lines above; it should have been visible then and there for this to work.
> >
> > The above if () does not always break.
> >
> >
> > Cheers;
> > Yann.
>
Re: child exit on self-proxy
> Am 07.07.2021 um 18:34 schrieb Yann Ylavic <ylavic.dev@gmail.com>:
>
> Does this attached patch help? This is to verify a hypothesis where
> killed keepalive connections are still in the defer_linger_chain
> without a worker to handle them.
>
> I don't see "interrupted, but continuing" triggering in your log but
> possibly it happened earlier..


[Wed Jul 07 16:48:04.237865 2021] [mpm_event:debug] [pid 55134:tid 4643098112] event.c(599): AH: wakeup_listener: start
[Wed Jul 07 16:48:04.237895 2021] [mpm_event:debug] [pid 55134:tid 4643098112] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
[Wed Jul 07 16:48:04.238042 2021] [mpm_event:debug] [pid 55134:tid 4643098112] event.c(610): AH: wakeup_listener: ap_queue_info_term
[Wed Jul 07 16:48:04.238224 2021] [mpm_event:debug] [pid 55134:tid 123145362530304] event.c(1971): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=1
[Wed Jul 07 16:48:04.238381 2021] [mpm_event:trace1] [pid 55134:tid 123145362530304] event.c(1809): All workers are busy or dying, will close 1 keep-alive connections
[Wed Jul 07 16:48:04.238196 2021] [mpm_event:debug] [pid 55134:tid 4643098112] event.c(629): AH: wakeup_listener: pthread_kill
[Wed Jul 07 16:48:04.238513 2021] [mpm_event:debug] [pid 55134:tid 123145362530304] event.c(1971): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=1
[Wed Jul 07 16:48:04.238506 2021] [mpm_event:debug] [pid 55134:tid 4643098112] event.c(635): AH: wakeup_listener: end
[Wed Jul 07 16:48:04.238555 2021] [mpm_event:trace1] [pid 55134:tid 4643098112] event.c(2928): graceful termination received, join workers
[Wed Jul 07 16:48:08.443130 2021] [mpm_event:debug] [pid 55134:tid 123145362530304] event.c(1971): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=1 connection_count=1
[Wed Jul 07 16:48:08.443149 2021] [mpm_event:trace1] [pid 55134:tid 123145362530304] event.c(1809): All workers are busy or dying, will close 0 keep-alive connections

>
> On Wed, Jul 7, 2021 at 5:51 PM Stefan Eissing
> <stefan.eissing@greenbytes.de> wrote:
>>
>> The things I do for you...
>>
>> [Wed Jul 07 15:48:52.591983 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(599): AH: wakeup_listener: start
>> [Wed Jul 07 15:48:52.592000 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
>> [Wed Jul 07 15:48:52.592046 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(610): AH: wakeup_listener: ap_queue_info_term
>> [Wed Jul 07 15:48:52.592188 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(629): AH: wakeup_listener: pthread_kill
>> [Wed Jul 07 15:48:52.592166 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=2
>> [Wed Jul 07 15:48:52.592523 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(635): AH: wakeup_listener: end
>> [Wed Jul 07 15:48:52.592727 2021] [mpm_event:trace1] [pid 48747:tid 4385197568] event.c(2942): graceful termination received, join workers
>> [Wed Jul 07 15:48:52.592515 2021] [mpm_event:trace1] [pid 48747:tid 123145502474240] event.c(1809): All workers are busy or dying, will close 2 keep-alive connections
>> [Wed Jul 07 15:48:52.593381 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): AH: pollset returned listener_may_exit=1 connection_count=2
>> [Wed Jul 07 15:48:52.593478 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=1
>> [Wed Jul 07 15:48:57.552623 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=1 connection_count=1
>> [Wed Jul 07 15:48:57.552658 2021] [mpm_event:trace1] [pid 48747:tid 123145502474240] event.c(1809): All workers are busy or dying, will close 0 keep-alive connections
>>
>>
>>
>> Btw. I opened a bottle of nice wine - this is entertaining. ;-)
>>
>> - Stefan
>>
>>> Am 07.07.2021 um 17:48 schrieb Yann Ylavic <ylavic.dev@gmail.com>:
>>>
>>> On Wed, Jul 7, 2021 at 5:45 PM Stefan Eissing
>>> <stefan.eissing@greenbytes.de> wrote:
>>>>
>>>>> Am 07.07.2021 um 17:39 schrieb Ruediger Pluem <rpluem@apache.org>:
>>>>>
>>>>> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the actual value of listener_may_exit in the log message?
>>>>
>>>> I put it in statically, since that was the value in the "if ()" 6 lines above; it should have been visible then and there for this to work.
>>>
>>> The above if () does not always break.
>>>
>>>
>>> Cheers;
>>> Yann.
>>
> <event_EINTR.diff>
Re: child exit on self-proxy
I knew some day I'd have to learn more about mpm_event. =)

Adding more DEBUGs, I see in the example below that 2 connections were idling at the start of the graceful restart and they get added to the linger_chain. 2 workers are then woken up and process the sockets. connection_count stays at 2, however. As I read it, that count drops when the connection pool is destroyed/cleaned up. This normally seems to happen via the worker_queue_info, but in this example it just does not happen.

Is this a correct read?
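
(That reading matches the mechanism: the count is only decremented by a cleanup registered on the connection's pool, roughly like this sketch:)

apr_atomic_inc32(&connection_count);
apr_pool_cleanup_register(c->pool, cs, decrement_connection_count,
                          apr_pool_cleanup_null);
/* i.e. connection_count only drops once the connection's pool is
 * destroyed or recycled */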

[Thu Jul 08 09:39:32.423198 2021] [mpm_event:trace1] [pid 9272:tid 4494732800] event.c(2973): graceful termination received, join workers
[Thu Jul 08 09:39:32.423032 2021] [mpm_event:debug] [pid 9272:tid 123145324126208] event.c(2250): (4)Interrupted system call: AH: listener: process defer_linger_chain have_idle_worker=1 listener_may_exit=1 connection_count=2
[Thu Jul 08 09:39:32.422765 2021] [mpm_event:debug] [pid 9272:tid 123145321443328] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.422251 2021] [mpm_event:debug] [pid 9272:tid 123145311784960] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.421638 2021] [mpm_event:debug] [pid 9272:tid 123145318760448] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.423541 2021] [mpm_event:debug] [pid 9272:tid 123145324126208] event.c(2259): (4)Interrupted system call: AH: listener: process defer_linger_chain pushed to worker, listener_may_exit=1 connection_count=2
[Thu Jul 08 09:39:32.422212 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(2408): (4)Interrupted system call: AH: worker pop_something, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.421806 2021] [mpm_event:debug] [pid 9272:tid 123145312858112] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.422789 2021] [mpm_event:debug] [pid 9272:tid 123145323053056] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.423346 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.423755 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.424230 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2402): (4)Interrupted system call: AH: worker read worker_queue, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424373 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(2402): (4)Interrupted system call: AH: worker read worker_queue, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424384 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2408): AH: worker pop_something, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424477 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2338): AH: process_defer_linger, processing socket, connection_count=2
[Thu Jul 08 09:39:32.424653 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2342): AH: process_defer_linger, processed socket, connection_count=2
[Thu Jul 08 09:39:32.424664 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2338): AH: process_defer_linger, processing socket, connection_count=2
[Thu Jul 08 09:39:32.424680 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2342): AH: process_defer_linger, processed socket, connection_count=2
[Thu Jul 08 09:39:32.424708 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2319): AH: process_defer_linger, nothing more to do
[Thu Jul 08 09:39:32.424754 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2402): AH: worker read worker_queue, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424756 2021] [mpm_event:debug] [pid 9272:tid 123145324126208] event.c(1978): AH: pollset returned listener_may_exit=1 connection_count=2
[Thu Jul 08 09:39:32.424830 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(2408): AH: worker pop_something, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424910 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(808): AH: decrement_connection_count: last_connection=0 listener_may_exit=1 connection_count=1
[Thu Jul 08 09:39:32.424924 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(816): AH: decrement_connection_count: pollset_wakeup
[Thu Jul 08 09:39:32.424958 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(2319): AH: process_defer_linger, nothing more to do
[Thu Jul 08 09:39:32.424960 2021] [mpm_event:debug] [pid 9272:tid 123145324126208] event.c(1978): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=1
[Thu Jul 08 09:39:37.349569 2021] [mpm_event:debug] [pid 9272:tid 123145324126208] event.c(1978): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=1 connection_count=1
[Thu Jul 08 09:39:37.349661 2021] [mpm_event:trace1] [pid 9272:tid 123145324126208] event.c(1816): All workers are busy or dying, will close 0 keep-alive connections
Re: child exit on self-proxy
On Thu, Jul 8, 2021 at 11:47 AM Stefan Eissing
<stefan.eissing@greenbytes.de> wrote:
>
> I knew some day I'd have to learn more about mpm_event. =)
>
> Adding more DEBUGs, I see in the example below that 2 connections were idling at the start of the graceful restart and they get added to the linger_chain. 2 workers are then woken up and process the sockets. connection_count stays at 2, however. As I read it, that count drops when the connection pool is destroyed/cleaned up. This normally seems to happen via the worker_queue_info, but in this example it just does not happen.
>
> Is this a correct read?

You proxy to a local server, so the 2 connections are the incoming
ones on the local proxy vhost and the local server vhost.
But mod_proxy backend connections are recycled and end up in a reslist
which is not checked until reuse.
So when, on graceful restart, the local server starts a lingering close
on its kept-alive connection, it's just ignored by mod_proxy and the
MAX_SECS_TO_LINGER timeout (30s) applies.

I think we shouldn't do lingering on keepalive connections when they
expire or get killed (graceful or max workers); this does not help the
client anyway because any data sent on the connection is doomed, and the
sooner we RESET the faster it will retry on another connection.

So how about the attached patch that closes the connections when their
keepalive expires (including when it's shortened)?
There are other changes in there, like more trace logs that helped me
debug things, they are worth it too IMHO.
Also I changed reqevents = POLLIN to POLLIN|POLLHUP (by generalizing
and using update_reqevents_from_sense() in more places), because I'm
not sure that we catch the connections closed remotely otherwise
(connection_count should go down to 1 more quickly in the local
proxy+server case because the local server's client connection should
have responded to the lingering close almost instantly, but this one
seems to timeout too).
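
(The gist, as a simplified sketch of the idea rather than the actual attached diff; helper names are assumptions:)

static void close_connection(event_conn_state_t *cs)
{
    /* instead of start_lingering_close*() and its up-to-30s
     * MAX_SECS_TO_LINGER wait */
    apr_socket_close(cs->pfd.desc.s);
    /* recycling the pool runs its cleanups, which is what finally
     * drops connection_count */
    ap_queue_info_push_pool(worker_queue_info, cs->p);
}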

Cheers;
Yann.
Re: child exit on self-proxy
> Am 08.07.2021 um 14:14 schrieb Yann Ylavic <ylavic.dev@gmail.com>:
>
> On Thu, Jul 8, 2021 at 11:47 AM Stefan Eissing
> <stefan.eissing@greenbytes.de> wrote:
>>
>> I knew some day I'd have to learn more about mpm_event. =)
>>
>> Adding more DEBUGs, I see in the example below that 2 connections were idling at the start of the graceful restart and they get added to the linger_chain. 2 workers are then woken up and process the sockets. connection_count stays at 2, however. As I read it, that count drops when the connection pool is destroyed/cleaned up. This normally seems to happen via the worker_queue_info, but in this example it just does not happen.
>>
>> Is this a correct read?
>
> You proxy to a local server, so the 2 connections are the incoming
> ones on the local proxy vhost and the local server vhost.
> But mod_proxy backend connections are recycled and end up in a reslist
> which is not checked until reuse.
> So when, on graceful restart, the local server starts a lingering close
> on its kept-alive connection, it's just ignored by mod_proxy and the
> MAX_SECS_TO_LINGER timeout (30s) applies.
>
> I think we shouldn't do lingering on keepalive connections when they
> expire or get killed (graceful or max workers); this does not help the
> client anyway because any data sent on the connection is doomed, and the
> sooner we RESET the faster it will retry on another connection.
>
> So how about the attached patch that closes the connections when their
> keepalive expires (including when it's shortened)?
> There are other changes in there, like more trace logs that helped me
> debug things, they are worth it too IMHO.
> Also I changed reqevents = POLLIN to POLLIN|POLLHUP (by generalizing
> and using update_reqevents_from_sense() in more places), because I'm
> not sure that we catch connections closed remotely otherwise
> (connection_count should go down to 1 more quickly in the local
> proxy+server case, because the local server's client connection should
> have responded to the lingering close almost instantly, but this one
> seems to time out too).
>
> Cheers;
> Yann.
> <event_ka_no_linger.diff>


This seems to be it! Yann strikes again! And I learned something...\o/

I needed to make small tweaks, because not all of the previous close calls
checked the return value and I got assert failures with your patch. I attach
my version below. Instead of the param, there could also be 2 functions, one
without checks. A matter of taste.

So, let me try to summarise:

- The effect was triggered by keep-alive connections whose remote endpoints
were slumbering in a proxy connection pool, where no one was monitoring the
sockets or reacting to TCP FIN packets.
- This triggered the long LINGER timeout, which blocked the listener from
exiting and ultimately led to the parent killing the child.
- This situation could also happen, I assume, when a client drops from
the network, e.g. a cell phone entering bad coverage, where no one
in the network path really knows what to do? Or a network connectivity
issue, or a computer crashing behind a NAT? (see the keepalive sketch below)
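
As an aside, the kernel-side mitigation that exists for such vanished
peers is TCP keepalive; with plain APR that is just the following
(schematic; tuning the probe timing, TCP_KEEPIDLE and friends, would
need a native setsockopt, not shown here):

/* Let the kernel probe idle peers that disappeared without a FIN
 * (bad coverage, NAT timeout, crashed host) and turn them into errors. */
apr_socket_opt_set(sock, APR_SO_KEEPALIVE, 1);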

Cheers, Stefan
Re: child exit on self-proxy [ In reply to ]
On Thu, Jul 8, 2021 at 2:55 PM Stefan Eissing
<stefan.eissing@greenbytes.de> wrote:
>
> This seems to be it! Yann strikes again! And I learned something...\o/

Thanks Stefan :)

>
> I needed to make small tweaks, because not all of the previous close calls
> checked the return value and I got assert failures with your patch.

Interesting, what error value was that? EBADF, EAGAIN?

> I attach my version
> below. Instead of the param, there could also be 2 functions, one without
> checks. A matter of taste.

Depending on the error cases, we might want to fix something, or maybe
ignore or relax the assertion.
Let's see...

>
> So, let me try to summarise:
>
> - The effect was triggered by keep-alive connections whose remote endpoints
> were slumbering in a proxy connection pool, where no one was monitoring the
> sockets or reacting to TCP FIN packets.
> - This triggered the long LINGER timeout, which blocked the listener from
> exiting and ultimately led to the parent killing the child.
> - This situation could also happen, I assume, when a client drops from
> the network, e.g. a cell phone entering bad coverage, where no one
> in the network path really knows what to do? Or a network connectivity
> issue, or a computer crashing behind a NAT?

Exactly, I think that closing kept-alive connections is the right
thing to do here; lingering close helps neither MPM event nor the
client.
I first tried to use the short linger timeout instead of the long one,
but I don't see a reason to linger those connections, even a little.
Other opinions maybe?


Cheers;
Yann.
Re: child exit on self-proxy [ In reply to ]
> Am 08.07.2021 um 15:33 schrieb Yann Ylavic <ylavic.dev@gmail.com>:
>
> On Thu, Jul 8, 2021 at 2:55 PM Stefan Eissing
> <stefan.eissing@greenbytes.de> wrote:
>>
>> This seems to be it! Yann strikes again! And I learned something...\o/
>
> Thanks Stefan :)
>
>>
>> I needed to make small tweaks, because not all of the previous close calls
>> checked the return value and I got assert failures with your patch.
>
> Interesting, what error value was that? EBADF, EAGAIN?

EBADF is what I saw.

>
>> I attach my version
>> below. Instead of the param, there could also be 2 functions, one without
>> checks. A matter of taste.
>
> Depending on the error cases, we might want to fix something, or maybe
> ignore or relax the assertion.
> Let's see...
>
>>
>> So, let me try to summarise:
>>
>> - The effect was triggered by keep-alive connections whose remote endpoints
>> were slumbering in a proxy connection pool, where no one was monitoring the
>> sockets or reacting to TCP FIN packets.
>> - This triggered the long LINGER timeout, which blocked the listener from
>> exiting and ultimately led to the parent killing the child.
>> - This situation could also happen, I assume, when a client drops from
>> the network, e.g. a cell phone entering bad coverage, where no one
>> in the network path really knows what to do? Or a network connectivity
>> issue, or a computer crashing behind a NAT?
>
> Exactly, I think that closing kept-alive connections is the right
> thing to do here; lingering close helps neither MPM event nor the
> client.
> I first tried to use the short linger timeout instead of the long one,
> but I don't see a reason to linger those connections, even a little.
> Other opinions maybe?
>
>
> Cheers;
> Yann.
Re: child exit on self-proxy [ In reply to ]
> I first tried to use the short linger timeout instead of the long one,
> but I don't see a reason to linger those connections, even a little.

IIUC this would only be useful for a KeepAliveTimeout of something like
2s or less, which is kind of academic.
I would go with whatever is simpler [to maintain].
Re: child exit on self-proxy [ In reply to ]
On Thu, Jul 8, 2021 at 3:35 PM Stefan Eissing
<stefan.eissing@greenbytes.de> wrote:
> >>
> >> I needed to make small tweaks, because not all of the previous close calls
> >> checked the return value and I got assert failures with your patch.
> >
> > Interesting, what error value was that? EBADF, EAGAIN?
>
> EBADF is what I saw.

Could you track somehow where the first close happened?
Maybe:
static void close_socket_nonblocking_ex(apr_socket_t *csd, const char *from)
{
    apr_status_t rv;
    apr_os_sock_t fd = -1;

    /* remember the OS-level fd so we can still log it after the close */
    apr_os_sock_get(&fd, csd);

    apr_socket_timeout_set(csd, 0);
    rv = apr_socket_close(csd);
    if (rv != APR_SUCCESS) {
        ap_log_error(APLOG_MARK, APLOG_ERR, rv, ap_server_conf, APLOGNO(00468)
                     "error closing socket %i/%pp from %s", (int)fd, csd, from);
        /*AP_DEBUG_ASSERT(0);*/
    }
    else {
        ap_log_error(APLOG_MARK, APLOG_DEBUG, rv, ap_server_conf,
                     "socket %i/%pp closed from %s", (int)fd, csd, from);
    }
}
#define close_socket_nonblocking(csd) \
    close_socket_nonblocking_ex(csd, __FUNCTION__)

to see if close_socket_nonblocking() is called twice or something.
Re: child exit on self-proxy [ In reply to ]
I see many of those:
[Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 123145411510272] (9)Bad file descriptor: AH00468: error closing socket -1/7ff9cf0086b0 from process_socket

which come from event.c#1263: rc = start_lingering_close_blocking(cs);
calling event.c#864: kill_connection_ex(cs, from);
and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);

Patch attached.
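
For illustration, one way such a double close can be guarded in the
caller's state (schematic; the 'closed' flag here is hypothetical, not a
real event_conn_state_t field):

static void close_socket_once(event_conn_state_t *cs)
{
    if (cs->closed) {
        return;                 /* second caller: already closed, no EBADF */
    }
    cs->closed = 1;
    apr_socket_timeout_set(cs->pfd.desc.s, 0);
    apr_socket_close(cs->pfd.desc.s);
}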
Re: child exit on self-proxy [ In reply to ]
On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing
<stefan.eissing@greenbytes.de> wrote:
>
> I see many of those:
> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 123145411510272] (9)Bad file descriptor: AH00468: error closing socket -1/7ff9cf0086b0 from process_socket
>
> which come from event.c#1263: rc = start_lingering_close_blocking(cs);
> calling event.c#864: kill_connection_ex(cs, from);
> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);

OK, so I think this could be addressed by the attached patch.
It is the same as v0, plus some changes in ap_start_lingering_close()
(from server/connection.c) so that it does not close the socket on error.
ap_start_lingering_close() did not close the socket consistently, so
the caller had to call apr_socket_close() anyway and would sometimes fail.
The close on failure is moved to ap_lingering_close(), which is the one
that should care.
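
Schematically, the ownership rule this goes for (names are illustrative,
not the actual connection.c functions):

/* The helper only starts the linger; it never closes the socket,
 * whatever the outcome. */
static apr_status_t start_linger(apr_socket_t *csd)
{
    return apr_socket_shutdown(csd, APR_SHUTDOWN_WRITE);
}

/* The single owner of the descriptor closes it exactly once. */
static void lingering_close(apr_socket_t *csd)
{
    if (start_linger(csd) != APR_SUCCESS) {
        apr_socket_close(csd);  /* sole close on the failure path */
        return;
    }
    /* ... drain pending data until FIN or timeout ... */
    apr_socket_close(csd);
}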

Is the AP_DEBUG_ASSERT(0) still triggering with this?
Re: child exit on self-proxy [ In reply to ]
*stands in the pentagram of blood drawn by Yann*

"I'll do thy bidding and report back!"

> Am 08.07.2021 um 20:17 schrieb Yann Ylavic <ylavic.dev@gmail.com>:
>
> On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing
> <stefan.eissing@greenbytes.de> wrote:
>>
>> I see many of those:
>> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 123145411510272] (9)Bad file descriptor: AH00468: error closing socket -1/7ff9cf0086b0 from process_socket
>>
>> which come from event.c#1263: rc = start_lingering_close_blocking(cs);
>> calling event.c#864: kill_connection_ex(cs, from);
>> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);
>
> OK, so I think this could be addressed by the attached patch.
> It is the same as v0, plus some changes in ap_start_lingering_close()
> (from server/connection.c) so that it does not close the socket on error.
> ap_start_lingering_close() did not close the socket consistently, so
> the caller had to call apr_socket_close() anyway and would sometimes fail.
> The close on failure is moved to ap_lingering_close(), which is the one
> that should care.
>
> Is the AP_DEBUG_ASSERT(0) still triggering with this?
> <event_ka_no_lingerv3.diff>
