Mailing List Archive

test-harness-compat.t and SIGINT
TL;DR:

Two questions:

1) has anyone seen cpan/Test-Harness/t/compat/test-harness-compat.t fail
recently?

2) Does anyone know how (on Linux) a process can receive a kernel-sourced
SIGINT signal, when not caused by typing ^C on the terminal?

Long version:

I've been working on a branch (not yet pushed) which unwraps and
optimises pp_aassign() on PERL_RC_STACK builds. A couple of times I've
seen test-harness-compat.t fail. I suspect that it may be related to
changes or bugs in my branch, hence the first question: has anyone seen
this test file fail in blead recently, or is it just me? This is on a
threaded, debugging build.

I've had extreme difficulty reproducing and diagnosing this issue.
Eventually on a threaded, debugging build on my pp_aasign() branch, I got
this command line to fail the test in about on in every 100-200 runs on
average:

$ cd cpan/Test-Harness
$ time while PERL_HASH_SEED=0 PERL_HASH_SEED_DEBUG=1 ../../perl -I../../lib t/compat/test-harness-compat.t| grep -q '^ok 120'; do date; done

The symptoms are that the process does a 255 exit after test 24, with no
error indication. Reducing the test script has been impossible: the
slightest change to the script and the problem goes away. Running under
valgrind or ASAN and the problem goes away.

I eventually managed to get it to fail while under strace -f. This shows
that the main process and a child is getting killed by a kernel-generated
SIGINT. The last few lines in the trace file are:

1900436 newfstatat(AT_FDCWD, "../../lib/Scalar/Util.pm", <unfinished ...>
1900400 <... pselect6 resumed>) = ? ERESTARTNOHAND (To be restarted if no handler)
1900400 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
1900436 <... newfstatat resumed>{st_mode=S_IFREG|0444, st_size=10761, ...}, 0) = 0
1900436 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
1900436 +++ killed by SIGINT +++
1900400 +++ killed by SIGINT +++

1900400 is the main process.

Nothing in the trace file shows any process doing a kill(...). The OS log
doesn't show anything like an OOM killer killing off processes.

About the only thing I can think of is that, in some mysterious fashion
(i.e. not via kill(pid, SIGINT)), the Test::Harness test is simulating a
^C to see how it gets handled, and due to some bug or race condition or
something it's not getting caught properly.

I'm bemused.

--
Monto Blanco... scorchio!
Re: test-harness-compat.t and SIGINT [ In reply to ]
On 11/13/23 10:36, Dave Mitchell wrote:
> TL;DR:
>
> Two questions:
>
> 1) has anyone seen cpan/Test-Harness/t/compat/test-harness-compat.t fail
> recently?
>
> 2) Does anyone know how (on Linux) a process can receive a kernel-sourced
> SIGINT signal, when not caused by typing ^C on the terminal?
>
> Long version:
>
> I've been working on a branch (not yet pushed) which unwraps and
> optimises pp_aassign() on PERL_RC_STACK builds. A couple of times I've
> seen test-harness-compat.t fail. I suspect that it may be related to
> changes or bugs in my branch, hence the first question: has anyone seen
> this test file fail in blead recently, or is it just me? This is on a
> threaded, debugging build.

No recent failures for that file reported at
https://perl5.test-smoke.org/matrix. In particular, I've recently been
building on FreeBSD-13 with -Accflags='-DPERL_RC_STACK' and haven't seen
any failures in that file.

[snip]
Re: test-harness-compat.t and SIGINT [ In reply to ]
On Mon, Nov 13, 2023 at 03:36:43PM +0000, Dave Mitchell wrote:
> TL;DR:
>
> Two questions:
>
> 1) has anyone seen cpan/Test-Harness/t/compat/test-harness-compat.t fail
> recently?

I haven't seen such failures.

> 2) Does anyone know how (on Linux) a process can receive a kernel-sourced
> SIGINT signal, when not caused by typing ^C on the terminal?
...
>
> I eventually managed to get it to fail while under strace -f. This shows
> that the main process and a child is getting killed by a kernel-generated
> SIGINT. The last few lines in the trace file are:
>
> 1900436 newfstatat(AT_FDCWD, "../../lib/Scalar/Util.pm", <unfinished ...>
> 1900400 <... pselect6 resumed>) = ? ERESTARTNOHAND (To be restarted if no handler)
> 1900400 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
> 1900436 <... newfstatat resumed>{st_mode=S_IFREG|0444, st_size=10761, ...}, 0) = 0
> 1900436 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
> 1900436 +++ killed by SIGINT +++
> 1900400 +++ killed by SIGINT +++
>
> 1900400 is the main process.
>
> Nothing in the trace file shows any process doing a kill(...). The OS log
> doesn't show anything like an OOM killer killing off processes.

kill() would set si_code == SI_USER.

From what I can see the OOM killer sends SIGKILL:

https://github.com/torvalds/linux/blob/9bacdd8996c77c42ca004440be610692275ff9d0/mm/oom_kill.c#L943

> About the only thing I can think of is that, in some mysterious fashion
> (i.e. not via kill(pid, SIGINT)), the Test::Harness test is simulating a
> ^C to see how it gets handled, and due to some bug or race condition or
> something it's not getting caught properly.

The only way I could get si_code=SI_KERNEL was with ^C, which I'm
pretty sure you're not doing yourself :)

I don't see any evidence that Test::Harness is using a pty to run the
tests, and sending a signal via that (either by sending \cC, via
TIOCSIG or TIOCSTI).

So I'm clueless.

Tony
Re: test-harness-compat.t and SIGINT [ In reply to ]
On Tue, Nov 14, 2023 at 4:23?AM Tony Cook <tony@develop-help.com> wrote:

> On Mon, Nov 13, 2023 at 03:36:43PM +0000, Dave Mitchell wrote:
> > TL;DR:
> >
> > Two questions:
> >
> > 1) has anyone seen cpan/Test-Harness/t/compat/test-harness-compat.t fail
> > recently?
>
> I haven't seen such failures.
>
> > 2) Does anyone know how (on Linux) a process can receive a kernel-sourced
> > SIGINT signal, when not caused by typing ^C on the terminal?
> ...
> >
> > I eventually managed to get it to fail while under strace -f. This shows
> > that the main process and a child is getting killed by a kernel-generated
> > SIGINT. The last few lines in the trace file are:
> >
> > 1900436 newfstatat(AT_FDCWD, "../../lib/Scalar/Util.pm", <unfinished
> ...>
> > 1900400 <... pselect6 resumed>) = ? ERESTARTNOHAND (To be
> restarted if no handler)
> > 1900400 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
> > 1900436 <... newfstatat resumed>{st_mode=S_IFREG|0444, st_size=10761,
> ...}, 0) = 0
> > 1900436 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
> > 1900436 +++ killed by SIGINT +++
> > 1900400 +++ killed by SIGINT +++
> >
> > 1900400 is the main process.
> >
> > Nothing in the trace file shows any process doing a kill(...). The OS log
> > doesn't show anything like an OOM killer killing off processes.
>
> kill() would set si_code == SI_USER.
>
> From what I can see the OOM killer sends SIGKILL:
>
>
> https://github.com/torvalds/linux/blob/9bacdd8996c77c42ca004440be610692275ff9d0/mm/oom_kill.c#L943
>
> > About the only thing I can think of is that, in some mysterious fashion
> > (i.e. not via kill(pid, SIGINT)), the Test::Harness test is simulating a
> > ^C to see how it gets handled, and due to some bug or race condition or
> > something it's not getting caught properly.
>
> The only way I could get si_code=SI_KERNEL was with ^C, which I'm
> pretty sure you're not doing yourself :)
>
> I don't see any evidence that Test::Harness is using a pty to run the
> tests, and sending a signal via that (either by sending \cC, via
> TIOCSIG or TIOCSTI).
>
> So I'm clueless.
>

I had come to the same conclusions, and am also clueless.

Leon
Re: test-harness-compat.t and SIGINT [ In reply to ]
On Tue, Nov 14, 2023 at 09:53:13AM +1100, Tony Cook wrote:
> The only way I could get si_code=SI_KERNEL was with ^C, which I'm
> pretty sure you're not doing yourself :)

I'm suddenly not so sure! It's possible I may have got my trace files
mixed up and I was looking at an earlier run where I had in fact stopped
the run with ^C at some point. I've just tried re-running with strace, and
haven't been able to get a failure all night.

I'll get back to you all if and when I get some further data.

--
"There's something wrong with our bloody ships today, Chatfield."
-- Admiral Beatty at the Battle of Jutland, 31st May 1916.
Re: test-harness-compat.t and SIGINT [ In reply to ]
On Tue, Nov 14, 2023 at 07:42:05AM +0000, Dave Mitchell wrote:
> I'm suddenly not so sure! It's possible I may have got my trace files
> mixed up and I was looking at an earlier run where I had in fact stopped
> the run with ^C at some point. I've just tried re-running with strace, and
> haven't been able to get a failure all night.

TL;DR: nothing to see here.

Yeah, the SIGINT was a complete red herring - I was probably looking at
the wrong trace file.

I eventually tracked it down to a bug when SvREFCNT(&PL_sv_undef) == 1,
which was causing code to falsely assume it could steal the SV rather
than copying it when storing the value in a hash.

This didn't happen often because even on debugging builds, sv_undef's RC
is initialised to 1000, and only rarely reached 1.

I still don't know why reproducing the bug in test-harness-compat.t
was so hard; but that test file system()s various Test::Harness processes
which analyse the result of various good or bad TAP outputs. Perhaps a
timing issue makes Test::Harness behave slightly differently, so when perl
is going slower it doesn't trigger?

Anyway, all fixed now.

--
Art is anything that has a label (especially if the label is "untitled 1")