Mailing List Archive

pytest results for 2.4.59
Hi there,

first of all: although I saw a few pytest failures, I think the results
are overall fine and good enough for release.

I first had to figure out that I need to build the h2ws websocket client
during the httpd build (for the websocket tests) and use the right
multipart Python module ("python-multipart" instead of "multipart").

I see 4 failures:

A] two with "AssertionError: request not found in
/tmp/esupport-testdir/pytest-event-310/gen/apache/logs/test_...":

__________________________ TestTiming.test_h2_009_01
___________________________
self = <http2.test_009_timing.TestTiming object at 0x7f01935bdb10>
env = <http2.env.H2TestEnv object at 0x7f019305cdf0>

def test_h2_009_01(self, env):
...

> assert found, f'request not found in {TestTiming.LOGFILE}'
E AssertionError: request not found in
/tmp/esupport-testdir/pytest-event-310/gen/apache/logs/test_009
E assert False

modules/http2/test_009_timing.py:46: AssertionError

and

__________________________ TestTiming.test_h2_009_02
___________________________

self = <http2.test_009_timing.TestTiming object at 0x7f01935bdd20>
env = <http2.env.H2TestEnv object at 0x7f019305cdf0>

def test_h2_009_02(self, env):
...
> assert found, f'request not found in {TestTiming.LOGFILE}'
E AssertionError: request not found in
/tmp/esupport-testdir/pytest-event-310/gen/apache/logs/test_009
E assert False

modules/http2/test_009_timing.py:74: AssertionError


I need to investigate further whether the log file is missing or does
not have the right contents. The failure should not be critical in itself.


B] buffer test failure TestBuffering.test_h2_712_02

self = <http2.test_712_buffering.TestBuffering object at 0x7f01934b2410>
env = <http2.env.H2TestEnv object at 0x7f019305cdf0>

def test_h2_712_02(self, env):
...
> piper.stutter_check(chunks, stutter)

modules/http2/test_712_buffering.py:48:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _

self = CurlPiper[exitcode=0, stderr=['14:46:27.261890 == Info: Added
cgi.tests.httpd.apache.org:5001:127.0.0.1 to DNS
cache\n....tests.httpd.apache.org left intact\n'],
stdout=['chunk-000-0123456789\nchunk-001-0123456789\nchunk-002-0123456789\n']]
chunks = ['chunk-000-0123456789\n', 'chunk-001-0123456789\n',
'chunk-002-0123456789\n']
stutter = datetime.timedelta(seconds=1)

def stutter_check(self, chunks: [str], stutter: datetime.timedelta):
...
# received as many chunks as we sent
> assert len(chunks) == len(recv_times), "received response not
in {0} chunks, but {1}".format(
len(chunks), len(recv_times))
E AssertionError: received response not in 3 chunks, but 4

pyhttpd/curl.py:118: AssertionError
----------------------------- Captured stderr call
-----------------------------
starting: ['curl', '-s', '--path-as-is', '-D',
'/tmp/esupport-testdir/pytest-event-310/gen/curl.headers.438',
'--cacert',
'/tmp/esupport-testdir/pytest-event-310/gen/apache/ca/ca.rsa4096.cert.pem',
'--resolve', 'cgi.tests.httpd.apache.org:5001:127.0.0.1', '-H',
'AP-Test-Name: test_h2_712_02', '--connect-timeout', '5', '-T', '-',
'-X', 'POST', '--trace-ascii', '%', '--trace-time',
'https://cgi.tests.httpd.apache.org:5001/h2proxy/h2test/echo']


Here I have no idea where the difference in the chunk count comes from.
Maybe the test assumptions are too strict?


C] a single websocket test failure
TestWebSockets.test_h2_800_04_non_ws_resource

self = <http2.test_800_websockets.TestWebSockets object at 0x7f01934e56c0>
env = <http2.env.H2TestEnv object at 0x7f019305cdf0>, ws_server = None

def test_h2_800_04_non_ws_resource(self, env: H2TestEnv, ws_server):
r, infos, frames = ws_run(env, path='/alive.json')
assert r.exit_code == 0, f'{r}'
> assert infos == ['[1] :status: 502', '[1] EOF'], f'{r}'
E AssertionError: ExecResult[code=0, args=['/path/to/h2ws', '-vv',
'-c', 'localhost:5002',
'ws://cgi.tests.httpd.apache.org:5002/alive.json', 'ws-stdin']
E ----stdout---------------------------------------
E ----stderr---------------------------------------
E
E assert ['[1] :status...F', '[1] RST'] == ['[1] :status...2',
'[1] EOF']
E Left contains one more item: '[1] RST'
E Full diff:
E - ['[1] :status: 502', '[1] EOF']
E + ['[1] :status: 502', '[1] EOF', '[1] RST']
E ? +++++ ++++++

modules/http2/test_800_websockets.py:178: AssertionError

All in all the results are much better than what I achieved for the
previous releases!

Best regards,

Rainer
Re: pytest results for 2.4.59
I think I fixed all test failures, hopefully in the correct way. More
eyes welcome.

I have a few additional sporadic ERRORS:

A] ERROR during teardown check for log file errors or warnings (twice):

04.04.2024 21:14:42.205465 _______________ ERROR at teardown of
TestStatus.test_md_920_020 ________________
04.04.2024 21:14:42.205465 ERROR
modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...

04.04.2024 21:14:42.205465 E AssertionError: apache logged 1
errors and 0 warnings:
04.04.2024 21:14:42.205465 E [Thu Apr 04 21:12:29.381511 2024]
[md:error] [pid 4169] (22)Invalid argument: no certificates in non-empty
chain
/path/to/gen/apache/md/staging/one.test.test-md-702-070-1712257797.org/pubcert.pem


04.04.2024 21:03:26.382051 _______________ ERROR at teardown of
TestStatus.test_md_920_020 ________________
04.04.2024 21:03:26.382360 ERROR
modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...

04.04.2024 21:03:26.382051 E AssertionError: apache logged 1
errors and 1 warnings:
04.04.2024 21:03:26.382051 E [Thu Apr 04 21:00:48.924286 2024]
[md:error] [pid 8717:tid 139629962274560] (20014)Internal error
(specific information not available): test-md-702-041-1712256790.org:
asked to retrieve chain, but no order in context
04.04.2024 21:03:26.382051 E [Thu Apr 04 21:00:48.924229 2024]
[md:warn] [pid 8717:tid 139629962274560] error generate pkey RSA 3072

B] Hanging httpd child processes

This happens only on RHEL 9 with the worker MPM and can be noticed by a
dramatic slowdown of the tests. There are a lot of messages

AH00045: child process 1067703 still did not exit, sending a SIGTERM

and

AH00276: the listener thread didn't exit

It happened in

modules/core/test_001_encoding.py::TestEncoding::test_core_001_20[test2-/10%25abnormal.txt-200]

modules/md/test_920_status.py::TestStatus::test_md_920_020

modules/proxy/test_02_unix.py::TestProxyUds::test_proxy_02_003[mixed-500]

but I don't know whether it might also happen elsewhere, because it is
sporadic.

What I see in the error logs for one hanging child process:

- most threads terminate with

[Thu Apr 04 22:42:59.617953 2024] [ssl:trace3] [pid 1067703:tid
140619680433728] ssl_engine_kernel.c(2223): [client 127.0.0.1:40686]
OpenSSL: Write: SSL negotiation finished successfully
[Thu Apr 04 22:42:59.617972 2024] [ssl:trace6] [pid 1067703:tid
140619680433728] ssl_engine_io.c(154): [client 127.0.0.1:40686]
bio_filter_out_write: flush
[Thu Apr 04 22:42:59.617981 2024] [ssl:debug] [pid 1067703:tid
140619680433728] ssl_engine_io.c(1146): [client 127.0.0.1:40686]
AH02001: Connection closed to child 0 with standard shutdown (server
test1.tests.httpd.apache.org:443)

- watchdog thread terminates (?) with

[Thu Apr 04 22:43:00.902666 2024] [md:debug] [pid 1067703:tid
140619697219136] md_reg.c(1163): test-md-810-003a-1712260944.org:
staging done
[Thu Apr 04 22:43:00.903951 2024] [md:notice] [pid 1067703:tid
140619697219136] AH10059: The Managed Domain
test-md-810-003a-1712260944.org has been setup and changes will be
activated on next (graceful) server restart.
[Thu Apr 04 22:43:00.904418 2024] [md:debug] [pid 1067703:tid
140619697219136] mod_md_drive.c(229): AH10107: next run in 11 hours 59
minutes 58 seconds
[Thu Apr 04 22:43:01.204981 2024] [md:debug] [pid 1067703:tid
140619697219136] mod_md_drive.c(236): AH10058: md watchdog stopping
[Thu Apr 04 22:43:01.205094 2024] [watchdog:debug] [pid 1067703:tid
140619697219136] mod_watchdog.c(257): AH02973: Singleton Watchdog
(_md_renew_) stopping

- one worker thread seems not to stop:

[Thu Apr 04 22:42:59.768569 2024] [core:trace5] [pid 1067703:tid
140619672041024] protocol.c(714): [client 127.0.0.1:48748] Request
received from client: GET
/.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E
HTTP/1.1
[Thu Apr 04 22:42:59.768667 2024] [md:debug] [pid 1067703:tid
140619672041024] mod_md.c(1385): [client 127.0.0.1:48748] loading
challenge for test-md-810-003a-1712260944.org
(/.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E)
[Thu Apr 04 22:42:59.768698 2024] [http:trace3] [pid 1067703:tid
140619672041024] http_filters.c(1141): [client 127.0.0.1:48748] Response
sent with status 200, headers:
[Thu Apr 04 22:42:59.768706 2024] [http:trace5] [pid 1067703:tid
140619672041024] http_filters.c(1150): [client 127.0.0.1:48748] Date:
Thu, 04 Apr 2024 20:42:59 GMT
[Thu Apr 04 22:42:59.768712 2024] [http:trace5] [pid 1067703:tid
140619672041024] http_filters.c(1153): [client 127.0.0.1:48748]
Server: Apache/2.4.59 (Unix) OpenSSL/3.1.5
[Thu Apr 04 22:42:59.768718 2024] [http:trace4] [pid 1067703:tid
140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
Content-Length: 88
[Thu Apr 04 22:42:59.768724 2024] [http:trace4] [pid 1067703:tid
140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
Connection: close
[Thu Apr 04 22:42:59.769616 2024] [core:trace6] [pid 1067703:tid
140619672041024] core_filters.c(828): [client 127.0.0.1:48748]
writev_nonblocking: 227/227

I do not know whether it is always this request, but that is one
example I could extract from the (trace8) error log.

No idea why I can only observe this (so far) on RHEL 9 with worker.
Not with event or prefork, and not on RHEL 7 and 8 or SLES 12 and 15.

Best regards,

Rainer
Re: pytest results for 2.4.59
On 4/5/24 12:59 AM, Rainer Jung wrote:
> I think I fixed all test failures, hopefully in the correct way. More eyes welcome.
>
> I have a few additional sporadic ERRORS:
>
> A] ERROR during teardown check for log file errors or warnings (twice):
>
> 04.04.2024 21:14:42.205465 _______________ ERROR at teardown of TestStatus.test_md_920_020 ________________
> 04.04.2024 21:14:42.205465 ERROR modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...
>
> 04.04.2024 21:14:42.205465 E       AssertionError: apache logged 1 errors and 0 warnings:
> 04.04.2024 21:14:42.205465 E         [Thu Apr 04 21:12:29.381511 2024] [md:error] [pid 4169] (22)Invalid argument: no certificates
> in non-empty chain /path/to/gen/apache/md/staging/one.test.test-md-702-070-1712257797.org/pubcert.pem
>
>
> 04.04.2024 21:03:26.382051 _______________ ERROR at teardown of TestStatus.test_md_920_020 ________________
> 04.04.2024 21:03:26.382360 ERROR modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...
>
> 04.04.2024 21:03:26.382051 E       AssertionError: apache logged 1 errors and 1 warnings:
> 04.04.2024 21:03:26.382051 E         [Thu Apr 04 21:00:48.924286 2024] [md:error] [pid 8717:tid 139629962274560] (20014)Internal
> error (specific information not available): test-md-702-041-1712256790.org: asked to retrieve chain, but no order in context
> 04.04.2024 21:03:26.382051 E         [Thu Apr 04 21:00:48.924229 2024] [md:warn] [pid 8717:tid 139629962274560] error generate
> pkey RSA 3072
>
> B] Hanging httpd child processes
>
> This happens only on RHEL 9 with worker MPM and can be notices by a dramatic slowdown of the tests. There's a lot of messages
>
> AH00045: child process 1067703 still did not exit, sending a SIGTERM
>
> and
>
> AH00276: the listener thread didn't exit
>
> It happened in
>
> modules/core/test_001_encoding.py::TestEncoding::test_core_001_20[test2-/10%25abnormal.txt-200]
>
> modules/md/test_920_status.py::TestStatus::test_md_920_020
>
> modules/proxy/test_02_unix.py::TestProxyUds::test_proxy_02_003[mixed-500]
>
> but I don't know, whether it might happen elsewhere also, because it is sporadic.
>
> What I see in the error logs for one hanging child process:
>
> - most threads terminate with
>
> [Thu Apr 04 22:42:59.617953 2024] [ssl:trace3] [pid 1067703:tid 140619680433728] ssl_engine_kernel.c(2223): [client
> 127.0.0.1:40686] OpenSSL: Write: SSL negotiation finished successfully
> [Thu Apr 04 22:42:59.617972 2024] [ssl:trace6] [pid 1067703:tid 140619680433728] ssl_engine_io.c(154): [client 127.0.0.1:40686]
> bio_filter_out_write: flush
> [Thu Apr 04 22:42:59.617981 2024] [ssl:debug] [pid 1067703:tid 140619680433728] ssl_engine_io.c(1146): [client 127.0.0.1:40686]
> AH02001: Connection closed to child 0 with standard shutdown (server test1.tests.httpd.apache.org:443)
>
> - watchdog thread terminates (?) with
>
> [Thu Apr 04 22:43:00.902666 2024] [md:debug] [pid 1067703:tid 140619697219136] md_reg.c(1163): test-md-810-003a-1712260944.org:
> staging done
> [Thu Apr 04 22:43:00.903951 2024] [md:notice] [pid 1067703:tid 140619697219136] AH10059: The Managed Domain
> test-md-810-003a-1712260944.org has been setup and changes will be activated on next (graceful) server restart.
> [Thu Apr 04 22:43:00.904418 2024] [md:debug] [pid 1067703:tid 140619697219136] mod_md_drive.c(229): AH10107: next run in 11 hours
> 59 minutes 58 seconds
> [Thu Apr 04 22:43:01.204981 2024] [md:debug] [pid 1067703:tid 140619697219136] mod_md_drive.c(236): AH10058: md watchdog stopping
> [Thu Apr 04 22:43:01.205094 2024] [watchdog:debug] [pid 1067703:tid 140619697219136] mod_watchdog.c(257): AH02973: Singleton
> Watchdog (_md_renew_) stopping
>
> - one worker thread seems not to stop:
>
> [Thu Apr 04 22:42:59.768569 2024] [core:trace5] [pid 1067703:tid 140619672041024] protocol.c(714): [client 127.0.0.1:48748]
> Request received from client: GET /.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E HTTP/1.1
> [Thu Apr 04 22:42:59.768667 2024] [md:debug] [pid 1067703:tid 140619672041024] mod_md.c(1385): [client 127.0.0.1:48748] loading
> challenge for test-md-810-003a-1712260944.org (/.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E)
> [Thu Apr 04 22:42:59.768698 2024] [http:trace3] [pid 1067703:tid 140619672041024] http_filters.c(1141): [client 127.0.0.1:48748]
> Response sent with status 200, headers:
> [Thu Apr 04 22:42:59.768706 2024] [http:trace5] [pid 1067703:tid 140619672041024] http_filters.c(1150): [client 127.0.0.1:48748]  
> Date: Thu, 04 Apr 2024 20:42:59 GMT
> [Thu Apr 04 22:42:59.768712 2024] [http:trace5] [pid 1067703:tid 140619672041024] http_filters.c(1153): [client 127.0.0.1:48748]
> Server: Apache/2.4.59 (Unix) OpenSSL/3.1.5
> [Thu Apr 04 22:42:59.768718 2024] [http:trace4] [pid 1067703:tid 140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
> Content-Length: 88
> [Thu Apr 04 22:42:59.768724 2024] [http:trace4] [pid 1067703:tid 140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
> Connection: close
> [Thu Apr 04 22:42:59.769616 2024] [core:trace6] [pid 1067703:tid 140619672041024] core_filters.c(828): [client 127.0.0.1:48748]
> writev_nonblocking: 227/227
>
> I do not know, whether it is always this request, but that is one example I could extract from the (trace8) error log.
>
> No idea, why I can only observe this (until now) on RHEL 9 with worker. Not woth event or prefork, not on RHEL 7 and 8 and SLES 12
> and 15.

Are you able to provide a stacktrace of the hanging process (thread apply all bt full)?

Regards

Rüdiger
Re: pytest results for 2.4.59
On 4/5/24 00:59, Rainer Jung wrote:
> This happens only on RHEL 9 with worker MPM and can be notices by a
> dramatic slowdown of the tests. There's a lot of messages
>
> AH00045: child process 1067703 still did not exit, sending a SIGTERM

I have noticed those too on Fedora 39; I am planning to have a look...

--
Cheers

Jean-Frederic
Re: pytest results for 2.4.59
On 4/5/24 07:55, Ruediger Pluem wrote:
>
>
> On 4/5/24 12:59 AM, Rainer Jung wrote:
>> I think I fixed all test failures, hopefully in the correct way. More eyes welcome.
>>
>> I have a few additional sporadic ERRORS:
>>
>> A] ERROR during teardown check for log file errors or warnings (twice):
>>
>> 04.04.2024 21:14:42.205465 _______________ ERROR at teardown of TestStatus.test_md_920_020 ________________
>> 04.04.2024 21:14:42.205465 ERROR modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...
>>
>> 04.04.2024 21:14:42.205465 E       AssertionError: apache logged 1 errors and 0 warnings:
>> 04.04.2024 21:14:42.205465 E         [Thu Apr 04 21:12:29.381511 2024] [md:error] [pid 4169] (22)Invalid argument: no certificates
>> in non-empty chain /path/to/gen/apache/md/staging/one.test.test-md-702-070-1712257797.org/pubcert.pem
>>
>>
>> 04.04.2024 21:03:26.382051 _______________ ERROR at teardown of TestStatus.test_md_920_020 ________________
>> 04.04.2024 21:03:26.382360 ERROR modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...
>>
>> 04.04.2024 21:03:26.382051 E       AssertionError: apache logged 1 errors and 1 warnings:
>> 04.04.2024 21:03:26.382051 E         [Thu Apr 04 21:00:48.924286 2024] [md:error] [pid 8717:tid 139629962274560] (20014)Internal
>> error (specific information not available): test-md-702-041-1712256790.org: asked to retrieve chain, but no order in context
>> 04.04.2024 21:03:26.382051 E         [Thu Apr 04 21:00:48.924229 2024] [md:warn] [pid 8717:tid 139629962274560] error generate
>> pkey RSA 3072
>>
>> B] Hanging httpd child processes
>>
>> This happens only on RHEL 9 with worker MPM and can be notices by a dramatic slowdown of the tests. There's a lot of messages
>>
>> AH00045: child process 1067703 still did not exit, sending a SIGTERM
>>
>> and
>>
>> AH00276: the listener thread didn't exit
>>
>> It happened in
>>
>> modules/core/test_001_encoding.py::TestEncoding::test_core_001_20[test2-/10%25abnormal.txt-200]
>>
>> modules/md/test_920_status.py::TestStatus::test_md_920_020
>>
>> modules/proxy/test_02_unix.py::TestProxyUds::test_proxy_02_003[mixed-500]
>>
>> but I don't know, whether it might happen elsewhere also, because it is sporadic.
>>
>> What I see in the error logs for one hanging child process:
>>
>> - most threads terminate with
>>
>> [Thu Apr 04 22:42:59.617953 2024] [ssl:trace3] [pid 1067703:tid 140619680433728] ssl_engine_kernel.c(2223): [client
>> 127.0.0.1:40686] OpenSSL: Write: SSL negotiation finished successfully
>> [Thu Apr 04 22:42:59.617972 2024] [ssl:trace6] [pid 1067703:tid 140619680433728] ssl_engine_io.c(154): [client 127.0.0.1:40686]
>> bio_filter_out_write: flush
>> [Thu Apr 04 22:42:59.617981 2024] [ssl:debug] [pid 1067703:tid 140619680433728] ssl_engine_io.c(1146): [client 127.0.0.1:40686]
>> AH02001: Connection closed to child 0 with standard shutdown (server test1.tests.httpd.apache.org:443)
>>
>> - watchdog thread terminates (?) with
>>
>> [Thu Apr 04 22:43:00.902666 2024] [md:debug] [pid 1067703:tid 140619697219136] md_reg.c(1163): test-md-810-003a-1712260944.org:
>> staging done
>> [Thu Apr 04 22:43:00.903951 2024] [md:notice] [pid 1067703:tid 140619697219136] AH10059: The Managed Domain
>> test-md-810-003a-1712260944.org has been setup and changes will be activated on next (graceful) server restart.
>> [Thu Apr 04 22:43:00.904418 2024] [md:debug] [pid 1067703:tid 140619697219136] mod_md_drive.c(229): AH10107: next run in 11 hours
>> 59 minutes 58 seconds
>> [Thu Apr 04 22:43:01.204981 2024] [md:debug] [pid 1067703:tid 140619697219136] mod_md_drive.c(236): AH10058: md watchdog stopping
>> [Thu Apr 04 22:43:01.205094 2024] [watchdog:debug] [pid 1067703:tid 140619697219136] mod_watchdog.c(257): AH02973: Singleton
>> Watchdog (_md_renew_) stopping
>>
>> - one worker thread seems not to stop:
>>
>> [Thu Apr 04 22:42:59.768569 2024] [core:trace5] [pid 1067703:tid 140619672041024] protocol.c(714): [client 127.0.0.1:48748]
>> Request received from client: GET /.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E HTTP/1.1
>> [Thu Apr 04 22:42:59.768667 2024] [md:debug] [pid 1067703:tid 140619672041024] mod_md.c(1385): [client 127.0.0.1:48748] loading
>> challenge for test-md-810-003a-1712260944.org (/.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E)
>> [Thu Apr 04 22:42:59.768698 2024] [http:trace3] [pid 1067703:tid 140619672041024] http_filters.c(1141): [client 127.0.0.1:48748]
>> Response sent with status 200, headers:
>> [Thu Apr 04 22:42:59.768706 2024] [http:trace5] [pid 1067703:tid 140619672041024] http_filters.c(1150): [client 127.0.0.1:48748]
>> Date: Thu, 04 Apr 2024 20:42:59 GMT
>> [Thu Apr 04 22:42:59.768712 2024] [http:trace5] [pid 1067703:tid 140619672041024] http_filters.c(1153): [client 127.0.0.1:48748]
>> Server: Apache/2.4.59 (Unix) OpenSSL/3.1.5
>> [Thu Apr 04 22:42:59.768718 2024] [http:trace4] [pid 1067703:tid 140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
>> Content-Length: 88
>> [Thu Apr 04 22:42:59.768724 2024] [http:trace4] [pid 1067703:tid 140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
>> Connection: close
>> [Thu Apr 04 22:42:59.769616 2024] [core:trace6] [pid 1067703:tid 140619672041024] core_filters.c(828): [client 127.0.0.1:48748]
>> writev_nonblocking: 227/227
>>
>> I do not know, whether it is always this request, but that is one example I could extract from the (trace8) error log.
>>
>> No idea, why I can only observe this (until now) on RHEL 9 with worker. Not woth event or prefork, not on RHEL 7 and 8 and SLES 12
>> and 15.
>
> Are you able to provide a stacktrace of the hanging process (thread apply all bt full)?

(gdb) bt full
#0 0x00007f8b4cbda087 in __GI___select (nfds=nfds@entry=0,
readfds=readfds@entry=0x0, writefds=writefds@entry=0x0,
exceptfds=exceptfds@entry=0x0,
timeout=timeout@entry=0x7ffd961c9d50) at
../sysdeps/unix/sysv/linux/select.c:69
sc_ret = -4
sc_cancel_oldtype = 0
sc_ret = <optimized out>
s = <optimized out>
us = <optimized out>
ns = <optimized out>
ts64 = {tv_sec = 0, tv_nsec = 155601612}
pts64 = 0x7ffd961c9cf0
r = <optimized out>
#1 0x00007f8b4cd19242 in apr_sleep (t=t@entry=500000) at
time/unix/time.c:249
tv = {tv_sec = 0, tv_usec = 500000}
#2 0x0000000000440813 in join_workers (listener=0x7f8b44004e50,
threads=threads@entry=0xdf6f70, mode=mode@entry=2) at worker.c:1069
iter = 7
i = <optimized out>
rv = <optimized out>
thread_rv = 0
#3 0x00000000004413bd in child_main
(child_num_arg=child_num_arg@entry=2, child_bucket=child_bucket@entry=0)
at worker.c:1310
threads = 0xdf6f70
rv = 1
ts = 0xdf7040
thread_attr = 0xe03170
start_thread_id = 0xdf70a0
i = <optimized out>
#4 0x00000000004416fa in make_child (s=0xe08f80, slot=slot@entry=2,
bucket=0) at worker.c:1376
pid = 0
#5 0x000000000044179e in startup_children (number_to_start=1) at
worker.c:1403
i = 2
#6 0x000000000044290d in worker_run (_pconf=<optimized out>,
plog=0xdc67d0, s=0xe08f80) at worker.c:1928
listen_buckets = 0xdf9df0
num_buckets = 1
remaining_children_to_start = <optimized out>
rv = <optimized out>
id = "0", '\000' <repeats 14 times>
i = <optimized out>
#7 0x0000000000456580 in ap_run_mpm (pconf=pconf@entry=0xd95ca0,
plog=0xdc67d0, s=0xe08f80) at mpm_common.c:102
pHook = <optimized out>
n = 0
rv = -1
#8 0x0000000000433575 in main (argc=<optimized out>, argv=<optimized
c = 102 'f'
showcompile = <optimized out>
showdirectives = <optimized out>
confname = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--
def_server_root = <optimized out>
temp_error_log = <optimized out>
error = <optimized out>
process = 0xd95c70
pconf = 0xd95ca0
plog = 0xdc67d0
ptemp = 0xe3dec0
pcommands = <optimized out>
opt = 0xdc6740
rv = <optimized out>
mod = <optimized out>
opt_arg = 0x7ffd961cab64
"/home/jfclere/httpd-trunk/test/pyhttpd/../gen/apache/conf/httpd.conf"
signal_server = <optimized out>
rc = <optimized out>

Weird... only one thread

>
> Regards
>
> Rüdiger
>

--
Cheers

Jean-Frederic
Re: pytest results for 2.4.59
On 4/5/24 07:55, Ruediger Pluem wrote:
>
>
> On 4/5/24 12:59 AM, Rainer Jung wrote:
>> I think I fixed all test failures, hopefully in the correct way. More eyes welcome.
>>
>> I have a few additional sporadic ERRORS:
>>
>> A] ERROR during teardown check for log file errors or warnings (twice):
>>
>> 04.04.2024 21:14:42.205465 _______________ ERROR at teardown of TestStatus.test_md_920_020 ________________
>> 04.04.2024 21:14:42.205465 ERROR modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...
>>
>> 04.04.2024 21:14:42.205465 E       AssertionError: apache logged 1 errors and 0 warnings:
>> 04.04.2024 21:14:42.205465 E         [Thu Apr 04 21:12:29.381511 2024] [md:error] [pid 4169] (22)Invalid argument: no certificates
>> in non-empty chain /path/to/gen/apache/md/staging/one.test.test-md-702-070-1712257797.org/pubcert.pem
>>
>>
>> 04.04.2024 21:03:26.382051 _______________ ERROR at teardown of TestStatus.test_md_920_020 ________________
>> 04.04.2024 21:03:26.382360 ERROR modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...
>>
>> 04.04.2024 21:03:26.382051 E       AssertionError: apache logged 1 errors and 1 warnings:
>> 04.04.2024 21:03:26.382051 E         [Thu Apr 04 21:00:48.924286 2024] [md:error] [pid 8717:tid 139629962274560] (20014)Internal
>> error (specific information not available): test-md-702-041-1712256790.org: asked to retrieve chain, but no order in context
>> 04.04.2024 21:03:26.382051 E         [Thu Apr 04 21:00:48.924229 2024] [md:warn] [pid 8717:tid 139629962274560] error generate
>> pkey RSA 3072
>>
>> B] Hanging httpd child processes
>>
>> This happens only on RHEL 9 with worker MPM and can be notices by a dramatic slowdown of the tests. There's a lot of messages
>>
>> AH00045: child process 1067703 still did not exit, sending a SIGTERM
>>
>> and
>>
>> AH00276: the listener thread didn't exit
>>
>> It happened in
>>
>> modules/core/test_001_encoding.py::TestEncoding::test_core_001_20[test2-/10%25abnormal.txt-200]
>>
>> modules/md/test_920_status.py::TestStatus::test_md_920_020
>>
>> modules/proxy/test_02_unix.py::TestProxyUds::test_proxy_02_003[mixed-500]
>>
>> but I don't know, whether it might happen elsewhere also, because it is sporadic.
>>
>> What I see in the error logs for one hanging child process:
>>
>> - most threads terminate with
>>
>> [Thu Apr 04 22:42:59.617953 2024] [ssl:trace3] [pid 1067703:tid 140619680433728] ssl_engine_kernel.c(2223): [client
>> 127.0.0.1:40686] OpenSSL: Write: SSL negotiation finished successfully
>> [Thu Apr 04 22:42:59.617972 2024] [ssl:trace6] [pid 1067703:tid 140619680433728] ssl_engine_io.c(154): [client 127.0.0.1:40686]
>> bio_filter_out_write: flush
>> [Thu Apr 04 22:42:59.617981 2024] [ssl:debug] [pid 1067703:tid 140619680433728] ssl_engine_io.c(1146): [client 127.0.0.1:40686]
>> AH02001: Connection closed to child 0 with standard shutdown (server test1.tests.httpd.apache.org:443)
>>
>> - watchdog thread terminates (?) with
>>
>> [Thu Apr 04 22:43:00.902666 2024] [md:debug] [pid 1067703:tid 140619697219136] md_reg.c(1163): test-md-810-003a-1712260944.org:
>> staging done
>> [Thu Apr 04 22:43:00.903951 2024] [md:notice] [pid 1067703:tid 140619697219136] AH10059: The Managed Domain
>> test-md-810-003a-1712260944.org has been setup and changes will be activated on next (graceful) server restart.
>> [Thu Apr 04 22:43:00.904418 2024] [md:debug] [pid 1067703:tid 140619697219136] mod_md_drive.c(229): AH10107: next run in 11 hours
>> 59 minutes 58 seconds
>> [Thu Apr 04 22:43:01.204981 2024] [md:debug] [pid 1067703:tid 140619697219136] mod_md_drive.c(236): AH10058: md watchdog stopping
>> [Thu Apr 04 22:43:01.205094 2024] [watchdog:debug] [pid 1067703:tid 140619697219136] mod_watchdog.c(257): AH02973: Singleton
>> Watchdog (_md_renew_) stopping
>>
>> - one worker thread seems not to stop:
>>
>> [Thu Apr 04 22:42:59.768569 2024] [core:trace5] [pid 1067703:tid 140619672041024] protocol.c(714): [client 127.0.0.1:48748]
>> Request received from client: GET /.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E HTTP/1.1
>> [Thu Apr 04 22:42:59.768667 2024] [md:debug] [pid 1067703:tid 140619672041024] mod_md.c(1385): [client 127.0.0.1:48748] loading
>> challenge for test-md-810-003a-1712260944.org (/.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E)
>> [Thu Apr 04 22:42:59.768698 2024] [http:trace3] [pid 1067703:tid 140619672041024] http_filters.c(1141): [client 127.0.0.1:48748]
>> Response sent with status 200, headers:
>> [Thu Apr 04 22:42:59.768706 2024] [http:trace5] [pid 1067703:tid 140619672041024] http_filters.c(1150): [client 127.0.0.1:48748]
>> Date: Thu, 04 Apr 2024 20:42:59 GMT
>> [Thu Apr 04 22:42:59.768712 2024] [http:trace5] [pid 1067703:tid 140619672041024] http_filters.c(1153): [client 127.0.0.1:48748]
>> Server: Apache/2.4.59 (Unix) OpenSSL/3.1.5
>> [Thu Apr 04 22:42:59.768718 2024] [http:trace4] [pid 1067703:tid 140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
>> Content-Length: 88
>> [Thu Apr 04 22:42:59.768724 2024] [http:trace4] [pid 1067703:tid 140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
>> Connection: close
>> [Thu Apr 04 22:42:59.769616 2024] [core:trace6] [pid 1067703:tid 140619672041024] core_filters.c(828): [client 127.0.0.1:48748]
>> writev_nonblocking: 227/227
>>
>> I do not know, whether it is always this request, but that is one example I could extract from the (trace8) error log.
>>
>> No idea, why I can only observe this (until now) on RHEL 9 with worker. Not woth event or prefork, not on RHEL 7 and 8 and SLES 12
>> and 15.
>
> Are you able to provide a stacktrace of the hanging process (thread apply all bt full)?

In server/mpm/worker/worker.c (join_workers) pthread_kill() returns
ENOENT and EINTR. We keep waiting in those two cases; probably not
waiting on ENOENT helps... But that sounds like a workaround for the
problem, or just some timing issue?
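
For reference, the wait loop in question looks roughly like this (a
simplified paraphrase of join_workers() in worker.c, not a verbatim
copy; the surrounding code is trimmed):

/* The child probes the listener thread with pthread_kill(thread, 0)
 * and only stops waiting once that returns ESRCH.  If pthread_kill()
 * keeps returning 0 for a thread that has in fact exited, all
 * iterations are used up and AH00276 ("the listener thread didn't
 * exit") is logged. */
iter = 0;
while (iter < 10 &&
#ifdef HAVE_PTHREAD_KILL
       pthread_kill(*listener_os_thread, 0)
#else
       kill(ap_my_pid, 0)
#endif
       != ESRCH) {
    mode = ST_UNGRACEFUL;
    wakeup_listener();                    /* ask the listener to stop */
    apr_sleep(apr_time_make(0, 500000));  /* the 0.5s sleep seen in the backtraces */
    iter++;
}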

>
> Regards
>
> Rüdiger
>

--
Cheers

Jean-Frederic
Re: pytest results for 2.4.59
On 4/5/24 07:55, Ruediger Pluem wrote:
>
>
> On 4/5/24 12:59 AM, Rainer Jung wrote:
>> I think I fixed all test failures, hopefully in the correct way. More eyes welcome.
>>
>> I have a few additional sporadic ERRORS:
>>
>> A] ERROR during teardown check for log file errors or warnings (twice):
>>
>> 04.04.2024 21:14:42.205465 _______________ ERROR at teardown of TestStatus.test_md_920_020 ________________
>> 04.04.2024 21:14:42.205465 ERROR modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...
>>
>> 04.04.2024 21:14:42.205465 E       AssertionError: apache logged 1 errors and 0 warnings:
>> 04.04.2024 21:14:42.205465 E         [Thu Apr 04 21:12:29.381511 2024] [md:error] [pid 4169] (22)Invalid argument: no certificates
>> in non-empty chain /path/to/gen/apache/md/staging/one.test.test-md-702-070-1712257797.org/pubcert.pem
>>
>>
>> 04.04.2024 21:03:26.382051 _______________ ERROR at teardown of TestStatus.test_md_920_020 ________________
>> 04.04.2024 21:03:26.382360 ERROR modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...
>>
>> 04.04.2024 21:03:26.382051 E       AssertionError: apache logged 1 errors and 1 warnings:
>> 04.04.2024 21:03:26.382051 E         [Thu Apr 04 21:00:48.924286 2024] [md:error] [pid 8717:tid 139629962274560] (20014)Internal
>> error (specific information not available): test-md-702-041-1712256790.org: asked to retrieve chain, but no order in context
>> 04.04.2024 21:03:26.382051 E         [Thu Apr 04 21:00:48.924229 2024] [md:warn] [pid 8717:tid 139629962274560] error generate
>> pkey RSA 3072
>>
>> B] Hanging httpd child processes
>>
>> This happens only on RHEL 9 with worker MPM and can be notices by a dramatic slowdown of the tests. There's a lot of messages
>>
>> AH00045: child process 1067703 still did not exit, sending a SIGTERM
>>
>> and
>>
>> AH00276: the listener thread didn't exit
>>
>> It happened in
>>
>> modules/core/test_001_encoding.py::TestEncoding::test_core_001_20[test2-/10%25abnormal.txt-200]
>>
>> modules/md/test_920_status.py::TestStatus::test_md_920_020
>>
>> modules/proxy/test_02_unix.py::TestProxyUds::test_proxy_02_003[mixed-500]
>>
>> but I don't know, whether it might happen elsewhere also, because it is sporadic.
>>
>> What I see in the error logs for one hanging child process:
>>
>> - most threads terminate with
>>
>> [Thu Apr 04 22:42:59.617953 2024] [ssl:trace3] [pid 1067703:tid 140619680433728] ssl_engine_kernel.c(2223): [client
>> 127.0.0.1:40686] OpenSSL: Write: SSL negotiation finished successfully
>> [Thu Apr 04 22:42:59.617972 2024] [ssl:trace6] [pid 1067703:tid 140619680433728] ssl_engine_io.c(154): [client 127.0.0.1:40686]
>> bio_filter_out_write: flush
>> [Thu Apr 04 22:42:59.617981 2024] [ssl:debug] [pid 1067703:tid 140619680433728] ssl_engine_io.c(1146): [client 127.0.0.1:40686]
>> AH02001: Connection closed to child 0 with standard shutdown (server test1.tests.httpd.apache.org:443)
>>
>> - watchdog thread terminates (?) with
>>
>> [Thu Apr 04 22:43:00.902666 2024] [md:debug] [pid 1067703:tid 140619697219136] md_reg.c(1163): test-md-810-003a-1712260944.org:
>> staging done
>> [Thu Apr 04 22:43:00.903951 2024] [md:notice] [pid 1067703:tid 140619697219136] AH10059: The Managed Domain
>> test-md-810-003a-1712260944.org has been setup and changes will be activated on next (graceful) server restart.
>> [Thu Apr 04 22:43:00.904418 2024] [md:debug] [pid 1067703:tid 140619697219136] mod_md_drive.c(229): AH10107: next run in 11 hours
>> 59 minutes 58 seconds
>> [Thu Apr 04 22:43:01.204981 2024] [md:debug] [pid 1067703:tid 140619697219136] mod_md_drive.c(236): AH10058: md watchdog stopping
>> [Thu Apr 04 22:43:01.205094 2024] [watchdog:debug] [pid 1067703:tid 140619697219136] mod_watchdog.c(257): AH02973: Singleton
>> Watchdog (_md_renew_) stopping
>>
>> - one worker thread seems not to stop:
>>
>> [Thu Apr 04 22:42:59.768569 2024] [core:trace5] [pid 1067703:tid 140619672041024] protocol.c(714): [client 127.0.0.1:48748]
>> Request received from client: GET /.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E HTTP/1.1
>> [Thu Apr 04 22:42:59.768667 2024] [md:debug] [pid 1067703:tid 140619672041024] mod_md.c(1385): [client 127.0.0.1:48748] loading
>> challenge for test-md-810-003a-1712260944.org (/.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E)
>> [Thu Apr 04 22:42:59.768698 2024] [http:trace3] [pid 1067703:tid 140619672041024] http_filters.c(1141): [client 127.0.0.1:48748]
>> Response sent with status 200, headers:
>> [Thu Apr 04 22:42:59.768706 2024] [http:trace5] [pid 1067703:tid 140619672041024] http_filters.c(1150): [client 127.0.0.1:48748]
>> Date: Thu, 04 Apr 2024 20:42:59 GMT
>> [Thu Apr 04 22:42:59.768712 2024] [http:trace5] [pid 1067703:tid 140619672041024] http_filters.c(1153): [client 127.0.0.1:48748]
>> Server: Apache/2.4.59 (Unix) OpenSSL/3.1.5
>> [Thu Apr 04 22:42:59.768718 2024] [http:trace4] [pid 1067703:tid 140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
>> Content-Length: 88
>> [Thu Apr 04 22:42:59.768724 2024] [http:trace4] [pid 1067703:tid 140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
>> Connection: close
>> [Thu Apr 04 22:42:59.769616 2024] [core:trace6] [pid 1067703:tid 140619672041024] core_filters.c(828): [client 127.0.0.1:48748]
>> writev_nonblocking: 227/227
>>
>> I do not know, whether it is always this request, but that is one example I could extract from the (trace8) error log.
>>
>> No idea, why I can only observe this (until now) on RHEL 9 with worker. Not woth event or prefork, not on RHEL 7 and 8 and SLES 12
>> and 15.
>
> Are you able to provide a stacktrace of the hanging process (thread apply all bt full)?

It seems pthread_kill(t, 0) returns 0 even though the thread t has
exited... older versions of Fedora return 3 (ESRCH) there (I have tried
fc28).

A small example (taken from the internet) behaves like httpd: it
reports threads as still running even though they have exited...
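
Something along these lines (my own minimal sketch, not necessarily the
exact example) is enough to show it; on an affected glibc it prints 0,
on older systems it prints ESRCH (3):

/* build: cc -pthread repro.c -o repro */
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;                 /* the thread exits right away */
}

int main(void)
{
    pthread_t t;
    int rc;

    pthread_create(&t, NULL, worker, NULL);
    sleep(1);                    /* let the thread exit; it is not joined yet */

    /* Probe the exited-but-unjoined thread, like join_workers() does.
     * fc28 / older glibc report ESRCH (3) here, recent glibc returns 0. */
    rc = pthread_kill(t, 0);
    printf("pthread_kill(t, 0) = %d (%s)\n", rc, rc ? strerror(rc) : "ok");

    pthread_join(t, NULL);
    return 0;
}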

I have created https://bugzilla.redhat.com/show_bug.cgi?id=2273757 in
case it is a fedora/rhel bug.

>
> Regards
>
> Rüdiger
>

--
Cheers

Jean-Frederic
Re: pytest results for 2.4.59
On Sat, Apr 6, 2024 at 10:46 AM jean-frederic clere <jfclere@gmail.com> wrote:
>
> On 4/5/24 07:55, Ruediger Pluem wrote:
> >
> > Are you able to provide a stacktrace of the hanging process (thread apply all bt full)?
>
> It seems pthread_kill(t, 0) returns 0 even the thread t has exited...
> older version of fedora will return 3 (I have tried fc28)

If pthread_kill() does not work we probably should use the global
"dying" variable like in mpm_event.
But it's not clear from your earlier "bt full" whether there are other
threads; could you try "thread apply all bt full" instead to show all
the threads?
It's clear from the main thread's backtrace that it's waiting for the
listener in the "iter" loop, but nothing tells whether the listener has
already exited or not. The listener for instance could be waiting
indefinitely in apr_pollset_poll() at this point, and since there is no
pollset wakeup in mpm_worker I don't think that wakeup_listener() can
help here.
So maybe we need to add an apr_pollset_wakeup() in wakeup_listener()
too, like in mpm_event.

Overall something like the attached patch?
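
Roughly this shape (sketch only, the attached patch is the authoritative
version; the pollset variable name here is illustrative, since in
worker.c the pollset currently lives inside the listener thread):

/* (1) create the worker pollset with APR_POLLSET_WAKEABLE so that
 *     apr_pollset_wakeup() can interrupt a blocking apr_pollset_poll() */
rv = apr_pollset_create(&worker_pollset, num_listensocks, pchild,
                        APR_POLLSET_WAKEABLE);

/* (2) in wakeup_listener(), poke the pollset in addition to the existing
 *     signal logic, so a listener sitting in apr_pollset_poll() with no
 *     incoming events still returns */
static void wakeup_listener(void)
{
    listener_may_exit = 1;
    /* ... existing wakeup logic (pthread_kill(LISTENER_SIGNAL), etc.) ... */
    if (worker_pollset) {
        apr_pollset_wakeup(worker_pollset);
    }
}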


Regards;
Yann.
Re: pytest results for 2.4.59
On 4/6/24 13:10, Yann Ylavic wrote:
> On Sat, Apr 6, 2024 at 10:46?AM jean-frederic clere <jfclere@gmail.com> wrote:
>>
>> On 4/5/24 07:55, Ruediger Pluem wrote:
>>>
>>> Are you able to provide a stacktrace of the hanging process (thread apply all bt full)?
>>
>> It seems pthread_kill(t, 0) returns 0 even the thread t has exited...
>> older version of fedora will return 3 (I have tried fc28)
>
> If pthread_kill() does not work we probably should use the global
> "dying" variable like in mpm_event.
> But it's not clear from your earlier "bt full" whether there are other
> threads, could you try "thread apply all bt full" instead to show all
> the threads?

(gdb) thread apply all bt full

Thread 1 (Thread 0x7ffbf3f5ad40 (LWP 2891875)):
#0 0x00007ffbf429b087 in __GI___select (nfds=nfds@entry=0,
readfds=readfds@entry=0x0, writefds=writefds@entry=0x0,
exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff56cbb0b0) at
../sysdeps/unix/sysv/linux/select.c:69
sc_ret = -4
sc_cancel_oldtype = 0
sc_ret = <optimized out>
s = <optimized out>
us = <optimized out>
ns = <optimized out>
ts64 = {tv_sec = 0, tv_nsec = 155950744}
pts64 = 0x7fff56cbb050
r = <optimized out>
#1 0x00007ffbf43d9d92 in apr_sleep (t=t@entry=500000) at
time/unix/time.c:249
tv = {tv_sec = 0, tv_usec = 500000}
#2 0x0000000000440733 in join_workers (listener=0x87c170,
threads=threads@entry=0x91e150, mode=mode@entry=2) at worker.c:1069
iter = 7
i = <optimized out>
rv = <optimized out>
thread_rv = 0
#3 0x00000000004412d9 in child_main
(child_num_arg=child_num_arg@entry=0, child_bucket=child_bucket@entry=0)
at worker.c:1310
threads = 0x91e150
rv = 1
ts = 0x815a78
thread_attr = 0x815a98
start_thread_id = 0x815b08
i = <optimized out>
#4 0x000000000044161a in make_child (s=0x818d00, slot=slot@entry=0,
bucket=0) at worker.c:1376
pid = 0
#5 0x00000000004416be in startup_children (number_to_start=3) at
worker.c:1403
i = 0
#6 0x00000000004428f9 in worker_run (_pconf=<optimized out>,
plog=0x81b998, s=0x818d00) at worker.c:1928
listen_buckets = 0x875480
num_buckets = 1
remaining_children_to_start = <optimized out>
rv = <optimized out>
id = "0\000\000\000\000\000\000\000\t\000\000\000\000\000\000"
i = <optimized out>
#7 0x0000000000456930 in ap_run_mpm (pconf=pconf@entry=0x7ec3e8,
plog=0x81b998, s=0x818d00) at mpm_common.c:102
pHook = <optimized out>
n = 0
rv = -1
#8 0x000000000043350e in main (argc=<optimized out>, argv=<optimized
c = 102 'f'
showcompile = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--
showdirectives = <optimized out>
confname = <optimized out>
def_server_root = <optimized out>
temp_error_log = <optimized out>
error = <optimized out>
process = 0x7ea4c8
pconf = 0x7ec3e8
plog = 0x81b998
ptemp = 0x815678
pcommands = <optimized out>
opt = 0x810ef0
rv = <optimized out>
mod = <optimized out>
opt_arg = 0x7fff56cbcb64
"/home/jfclere/httpd-trunk/test/pyhttpd/../gen/apache/conf/httpd.conf"
signal_server = <optimized out>
rc = <optimized out>
(gdb)

I have added a kill(pid, SIGABRT); in server/mpm_unix.c after the
ap_log_error() as it is not easy to get a core otherwise.
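
Concretely, that debugging hack is just one added line (sketch; the
surrounding mpm_unix.c code is not reproduced here):

/* right after the ap_log_error(... APLOGNO(00045) "child process ...
 * still did not exit, sending a SIGTERM" ...) call: */
kill(pid, SIGABRT);  /* debugging only: make the stuck child dump core
                      * so it can be inspected with gdb afterwards */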


> It's clear from the main thread's backtrace that it's waiting for the
> listener in the "iter" loop, but nothing tells if the listener already
> exited or not. The listener for instance could be waiting indefinitely
> apr_pollset_poll() at this point, and since there is no pollset wakeup
> in mpm_worker I don't think that wakeup_listener() can help here.

According to my tests using assert(0) at different locations in
join_workers(), the listener thread is stopped by wakeup_listener(),
but pthread_kill() doesn't report that.


> So maybe we need to add an apr_pollset_wakeup() in wakeup_listener()
> too, like in mpm_event too.
>
> Overall something like the attached patch?

Yes the attached patch helps

>
>
> Regards;
> Yann.

--
Cheers

Jean-Frederic
Re: pytest results for 2.4.59
Hi Jean-Frederic and all,

you didn't write at what point in time you took the thread dump. I
always see the SIGTERM messages logged during the last test in each
group (http2, md, ...), simply because that is when teardown checks the
logs for error messages. By the time the test complains, it has already
started to kill the children, and at least during my test runs it
succeeds in killing them (I think). So finding a good point in time to
attach the debugger and see the right situation might not be easy?

When you say Yann's patch helps, does that mean in particular that there
are no more SIGTERM messages in the logs resp. no more teardown checks
failing?

Best regards,

Rainer

On 06.04.24 at 17:32, jean-frederic clere wrote:
> On 4/6/24 13:10, Yann Ylavic wrote:
>> On Sat, Apr 6, 2024 at 10:46?AM jean-frederic clere
>> <jfclere@gmail.com> wrote:
>>>
>>> On 4/5/24 07:55, Ruediger Pluem wrote:
>>>>
>>>> Are you able to provide a stacktrace of the hanging process (thread
>>>> apply all bt full)?
>>>
>>> It seems pthread_kill(t, 0) returns 0 even the thread t has exited...
>>> older version of fedora will return 3 (I have tried fc28)
>>
>> If pthread_kill() does not work we probably should use the global
>> "dying" variable like in mpm_event.
>> But it's not clear from your earlier "bt full" whether there are other
>> threads, could you try "thread apply all bt full" instead to show all
>> the threads?
>
> (gdb) thread apply all bt full
>
> Thread 1 (Thread 0x7ffbf3f5ad40 (LWP 2891875)):
> #0  0x00007ffbf429b087 in __GI___select (nfds=nfds@entry=0,
> readfds=readfds@entry=0x0, writefds=writefds@entry=0x0,
> exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff56cbb0b0) at
> ../sysdeps/unix/sysv/linux/select.c:69
>         sc_ret = -4
>         sc_cancel_oldtype = 0
>         sc_ret = <optimized out>
>         s = <optimized out>
>         us = <optimized out>
>         ns = <optimized out>
>         ts64 = {tv_sec = 0, tv_nsec = 155950744}
>         pts64 = 0x7fff56cbb050
>         r = <optimized out>
> #1  0x00007ffbf43d9d92 in apr_sleep (t=t@entry=500000) at
> time/unix/time.c:249
>         tv = {tv_sec = 0, tv_usec = 500000}
> #2  0x0000000000440733 in join_workers (listener=0x87c170,
> threads=threads@entry=0x91e150, mode=mode@entry=2) at worker.c:1069
>         iter = 7
>         i = <optimized out>
>         rv = <optimized out>
>         thread_rv = 0
> #3  0x00000000004412d9 in child_main
> (child_num_arg=child_num_arg@entry=0, child_bucket=child_bucket@entry=0)
> at worker.c:1310
>         threads = 0x91e150
>         rv = 1
>         ts = 0x815a78
>         thread_attr = 0x815a98
>         start_thread_id = 0x815b08
>         i = <optimized out>
> #4  0x000000000044161a in make_child (s=0x818d00, slot=slot@entry=0,
> bucket=0) at worker.c:1376
>         pid = 0
> #5  0x00000000004416be in startup_children (number_to_start=3) at
> worker.c:1403
>         i = 0
> #6  0x00000000004428f9 in worker_run (_pconf=<optimized out>,
> plog=0x81b998, s=0x818d00) at worker.c:1928
>         listen_buckets = 0x875480
>         num_buckets = 1
>         remaining_children_to_start = <optimized out>
>         rv = <optimized out>
>         id = "0\000\000\000\000\000\000\000\t\000\000\000\000\000\000"
>         i = <optimized out>
> #7  0x0000000000456930 in ap_run_mpm (pconf=pconf@entry=0x7ec3e8,
> plog=0x81b998, s=0x818d00) at mpm_common.c:102
>         pHook = <optimized out>
>         n = 0
>         rv = -1
> #8  0x000000000043350e in main (argc=<optimized out>, argv=<optimized
> out>) at main.c:882
>         c = 102 'f'
>         showcompile = <optimized out>
> --Type <RET> for more, q to quit, c to continue without paging--
>         showdirectives = <optimized out>
>         confname = <optimized out>
>         def_server_root = <optimized out>
>         temp_error_log = <optimized out>
>         error = <optimized out>
>         process = 0x7ea4c8
>         pconf = 0x7ec3e8
>         plog = 0x81b998
>         ptemp = 0x815678
>         pcommands = <optimized out>
>         opt = 0x810ef0
>         rv = <optimized out>
>         mod = <optimized out>
>         opt_arg = 0x7fff56cbcb64
> "/home/jfclere/httpd-trunk/test/pyhttpd/../gen/apache/conf/httpd.conf"
>         signal_server = <optimized out>
>         rc = <optimized out>
> (gdb)
>
> I have added a kill(pid, SIGABRT); in server/mpm_unix.c after the
> ap_log_error() as it is not easy to get a core otherwise.
>
>
>> It's clear from the main thread's backtrace that it's waiting for the
>> listener in the "iter" loop, but nothing tells if the listener already
>> exited or not. The listener for instance could be waiting indefinitely
>> apr_pollset_poll() at this point, and since there is no pollset wakeup
>> in mpm_worker I don't think that wakeup_listener() can help here.
>
> According to my tests using assert(0) in the join_workers() in different
> location, the listener thread is stopped by wakeup_listener() but the
> pthread_kill() doesn't report that.
>
>
>> So maybe we need to add an apr_pollset_wakeup() in wakeup_listener()
>> too, like in mpm_event too.
>>
>> Overall something like the attached patch?
>
> Yes the attached patch helps
>
>>
>>
>> Regards;
>> Yann.
Re: pytest results for 2.4.59
On 4/6/24 20:02, Rainer Jung wrote:
> Hi Jean-Frederic and all,
>
> you didn't write at what point in time you take the thread dump. I see
> the SIGTERM messages logged during test execution always during the last
> test in each group (http2, md, ...) just because that is the time the
> logs are checked by teardown for error messages. At the time the test
> complains it already starts to kill the children and at least during my
> test runs it success with killing them (I think). So finding a good
> point in time to attach the debugger and see the right situation might
> not be easy?

I might have taken the thread dump late, because I took it just after
the ap_log_error("sending a SIGTERM"). At that time there is only one
thread running, and it is "waiting" for the listener thread that has
already stopped. That is where I found that the behavior of
pthread_kill() changed in "recent" Fedora/RHEL.

Are you suggesting I should also try to get dumps earlier in the
shutdown process?

>
> When you say Yann's patch helps, it means especially there are not more
> SIGTERM messages in the logs resp. no more teardown checks failing?

Yes, with Yann's patch the "AH00045: child process nnnnn still did not
exit, sending a SIGTERM" messages are gone and the teardown checks are
passing.

>
> Best regards,
>
> Rainer
>
> Am 06.04.24 um 17:32 schrieb jean-frederic clere:
>> On 4/6/24 13:10, Yann Ylavic wrote:
>>> On Sat, Apr 6, 2024 at 10:46?AM jean-frederic clere
>>> <jfclere@gmail.com> wrote:
>>>>
>>>> On 4/5/24 07:55, Ruediger Pluem wrote:
>>>>>
>>>>> Are you able to provide a stacktrace of the hanging process (thread
>>>>> apply all bt full)?
>>>>
>>>> It seems pthread_kill(t, 0) returns 0 even the thread t has exited...
>>>> older version of fedora will return 3 (I have tried fc28)
>>>
>>> If pthread_kill() does not work we probably should use the global
>>> "dying" variable like in mpm_event.
>>> But it's not clear from your earlier "bt full" whether there are other
>>> threads, could you try "thread apply all bt full" instead to show all
>>> the threads?
>>
>> (gdb) thread apply all bt full
>>
>> Thread 1 (Thread 0x7ffbf3f5ad40 (LWP 2891875)):
>> #0  0x00007ffbf429b087 in __GI___select (nfds=nfds@entry=0,
>> readfds=readfds@entry=0x0, writefds=writefds@entry=0x0,
>> exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff56cbb0b0)
>> at ../sysdeps/unix/sysv/linux/select.c:69
>>          sc_ret = -4
>>          sc_cancel_oldtype = 0
>>          sc_ret = <optimized out>
>>          s = <optimized out>
>>          us = <optimized out>
>>          ns = <optimized out>
>>          ts64 = {tv_sec = 0, tv_nsec = 155950744}
>>          pts64 = 0x7fff56cbb050
>>          r = <optimized out>
>> #1  0x00007ffbf43d9d92 in apr_sleep (t=t@entry=500000) at
>> time/unix/time.c:249
>>          tv = {tv_sec = 0, tv_usec = 500000}
>> #2  0x0000000000440733 in join_workers (listener=0x87c170,
>> threads=threads@entry=0x91e150, mode=mode@entry=2) at worker.c:1069
>>          iter = 7
>>          i = <optimized out>
>>          rv = <optimized out>
>>          thread_rv = 0
>> #3  0x00000000004412d9 in child_main
>> (child_num_arg=child_num_arg@entry=0,
>> child_bucket=child_bucket@entry=0) at worker.c:1310
>>          threads = 0x91e150
>>          rv = 1
>>          ts = 0x815a78
>>          thread_attr = 0x815a98
>>          start_thread_id = 0x815b08
>>          i = <optimized out>
>> #4  0x000000000044161a in make_child (s=0x818d00, slot=slot@entry=0,
>> bucket=0) at worker.c:1376
>>          pid = 0
>> #5  0x00000000004416be in startup_children (number_to_start=3) at
>> worker.c:1403
>>          i = 0
>> #6  0x00000000004428f9 in worker_run (_pconf=<optimized out>,
>> plog=0x81b998, s=0x818d00) at worker.c:1928
>>          listen_buckets = 0x875480
>>          num_buckets = 1
>>          remaining_children_to_start = <optimized out>
>>          rv = <optimized out>
>>          id = "0\000\000\000\000\000\000\000\t\000\000\000\000\000\000"
>>          i = <optimized out>
>> #7  0x0000000000456930 in ap_run_mpm (pconf=pconf@entry=0x7ec3e8,
>> plog=0x81b998, s=0x818d00) at mpm_common.c:102
>>          pHook = <optimized out>
>>          n = 0
>>          rv = -1
>> #8  0x000000000043350e in main (argc=<optimized out>, argv=<optimized
>> out>) at main.c:882
>>          c = 102 'f'
>>          showcompile = <optimized out>
>> --Type <RET> for more, q to quit, c to continue without paging--
>>          showdirectives = <optimized out>
>>          confname = <optimized out>
>>          def_server_root = <optimized out>
>>          temp_error_log = <optimized out>
>>          error = <optimized out>
>>          process = 0x7ea4c8
>>          pconf = 0x7ec3e8
>>          plog = 0x81b998
>>          ptemp = 0x815678
>>          pcommands = <optimized out>
>>          opt = 0x810ef0
>>          rv = <optimized out>
>>          mod = <optimized out>
>>          opt_arg = 0x7fff56cbcb64
>> "/home/jfclere/httpd-trunk/test/pyhttpd/../gen/apache/conf/httpd.conf"
>>          signal_server = <optimized out>
>>          rc = <optimized out>
>> (gdb)
>>
>> I have added a kill(pid, SIGABRT); in server/mpm_unix.c after the
>> ap_log_error() as it is not easy to get a core otherwise.
>>
>>
>>> It's clear from the main thread's backtrace that it's waiting for the
>>> listener in the "iter" loop, but nothing tells if the listener already
>>> exited or not. The listener for instance could be waiting indefinitely
>>> apr_pollset_poll() at this point, and since there is no pollset wakeup
>>> in mpm_worker I don't think that wakeup_listener() can help here.
>>
>> According to my tests using assert(0) in the join_workers() in
>> different location, the listener thread is stopped by
>> wakeup_listener() but the pthread_kill() doesn't report that.
>>
>>
>>> So maybe we need to add an apr_pollset_wakeup() in wakeup_listener()
>>> too, like in mpm_event too.
>>>
>>> Overall something like the attached patch?
>>
>> Yes the attached patch helps
>>
>>>
>>>
>>> Regards;
>>> Yann.

--
Cheers

Jean-Frederic
Re: pytest results for 2.4.59
On 07.04.24 at 09:42, jean-frederic clere wrote:
> On 4/6/24 20:02, Rainer Jung wrote:
>> Hi Jean-Frederic and all,
>>
>> you didn't write at what point in time you take the thread dump. I see
>> the SIGTERM messages logged during test execution always during the
>> last test in each group (http2, md, ...) just because that is the time
>> the logs are checked by teardown for error messages. At the time the
>> test complains it already starts to kill the children and at least
>> during my test runs it success with killing them (I think). So finding
>> a good point in time to attach the debugger and see the right
>> situation might not be easy?
>
> I might have taken the thread dump late because I have taken it just
> after the ap_log_error("sending a SIGTERM") at that time there is only
> one thread running and it is "waiting" for the listener thread that has
> already stopped. There I have found that the behavior of pthread_kill()
> changed in "recent" fedora/rhel.
>
> Are you suggesting I should also try to get dumps earlier in the
> shutdown process?
>
>>
>> When you say Yann's patch helps, it means especially there are not
>> more SIGTERM messages in the logs resp. no more teardown checks failing?
>
> Yes with Yann's patch, the "AH00045: child process nnnnn still did not
> exit, sending a SIGTERM" messages are gone and teardown checks are passing.

I could now also do some tests, and for me the SIGTERM messages for
worker on RHEL 9 during pytest are also gone with Yann's worker MPM
patch. Plus I didn't get any new failures. The Perl test framework is
fine, and pytest shows only some of the few occasional failures and
errors I had also seen before.

So it would be nice if we applied Yann's patch to the worker MPM.

Thanks and best regards,

Rainer
Re: pytest results for 2.4.59
On Sun, Apr 7, 2024 at 2:16 PM Rainer Jung <rainer.jung@kippdata.de> wrote:
>
> Am 07.04.24 um 09:42 schrieb jean-frederic clere:
> > On 4/6/24 20:02, Rainer Jung wrote:
> >
> >> When you say Yann's patch helps, it means especially there are not
> >> more SIGTERM messages in the logs resp. no more teardown checks failing?
> >
> > Yes with Yann's patch, the "AH00045: child process nnnnn still did not
> > exit, sending a SIGTERM" messages are gone and teardown checks are passing.
>
> I could now also do some tests and for me the SIGTERM messages for
> worker on RHEL 9 during pytest are also gone with Yann's woker MPM
> patch. Plus I didn't get any new failures. Perl test framework is fine
> and pytest only some of the few occasional failures and errors I had
> also seen before.
>
> So it would be nice if we would apply Yann's patch for the worker mpm.

Thanks for testing, now in r1916926.
In the end I did not use the "good methods" (APR_POLLSET_NODEFAULT) to
create the pollset (like in mpm_event), but left the pollset
implementation to be selected as before. It now uses
APR_POLLSET_WAKEABLE though, to help wakeup_listener().
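
I.e. roughly this distinction (sketch, not the literal r1916926 diff):

/* mpm_event style, not used here: probe specific "good" methods and
 * refuse the generic fallback */
rv = apr_pollset_create_ex(&pollset, size, pchild,
                           APR_POLLSET_WAKEABLE | APR_POLLSET_NODEFAULT,
                           APR_POLLSET_EPOLL /* then kqueue, port, ... */);

/* what the worker change does instead: keep the default method
 * selection and only add the wakeup capability, so wakeup_listener()
 * can interrupt a blocking apr_pollset_poll() via apr_pollset_wakeup() */
rv = apr_pollset_create(&pollset, size, pchild, APR_POLLSET_WAKEABLE);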


Regards;
Yann.
Re: pytest results for 2.4.59
On Sat, Apr 6, 2024 at 10:46 AM jean-frederic clere <jfclere@gmail.com> wrote:
>
> It seems pthread_kill(t, 0) returns 0 even the thread t has exited...
> older version of fedora will return 3 (I have tried fc28)
>
> The small example (taken from the internet) seems to behave like httpd
> reporting running threads that have exited...
>
> I have created https://bugzilla.redhat.com/show_bug.cgi?id=2273757 in
> case it is a fedora/rhel bug.

FWIW I find the new pthread_kill() behaviour completely useless. ESRCH
was useful for threads that had already exited but were not yet joined
(i.e. "inactive" per POSIX), while for threads that are joined/detached
(i.e. "dead" or at "end of lifetime" as documented by POSIX) it is
pointless anyway, because any pthread_ function usage is already
undefined behaviour on those (IOW, no one should do this).
So glibc conforming to POSIX here by removing ESRCH is just pointless;
they killed pthread_kill(, 0) for no/bad reasons.