Mailing List Archive

[perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams
# New Ticket Created by Tony Cook
# Please include the string: [perl #125760]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org/Ticket/Display.html?id=125760 >


One of the few remaining warts[1] in Perl's Unicode support is how
sysread() and syswrite() behave on streams with a unicode layer.

First sysread():

For example:

open my $fh, "<:utf8", "filewithutf8.txt" or die;
my $buf;
sysread $fh, $buf, 1000;

will reads up to 1000 unvalidated UTF-8[2] *characters* from the stream.
That seems all fine and good, but the following:

open my $fh, "<:encoding(UCS-2BE)", "filewithucs2be.txt" or die;
my $buf;
sysread $fh, $buf, 1000;

does exactly the same thing - only the fact that the stream is unicode
flagged (i.e., has the PERLIO_F_UTF8 flag) is referenced, the actual
layers are ignored.

This behaviour is mostly documented by:

Note that if the filehandle has been marked as C<:utf8> Unicode
characters are read instead of bytes (the LENGTH, OFFSET, and the
return value of sysread() are in Unicode characters).
The C<:encoding(...)> layer implicitly introduces the C<:utf8> layer.
See L</binmode>, L</open>, and the C<open> pragma, L<open>.

which skips mentioning that the "Unicode characters" read are always
UTF-8 encoded.

This, beyond the broken :utf8 layer itself, is one of the few pure
perl vectors for badly encoded SVf_UTF8 strings in the perl
interpreter.

Also it can be confusing, even an experienced CPAN author managed to get
it wrong[3].

My suggestion is that (eventually) sysread() on a file with the
PERLIO_F_UTF8 flag on should either do a simple octet read, as it does
without that flag, or fail.

For the transition sysread() would warn when passed a handle with
PERLIO_F_UTF8, presumably something like "sysread() on a unicode
handle is deprecated".

So what's the desired behaviour after the transition:

1) sysread() would act as if the flag was not there, completely
ignoring the layers rather than ignoring the layers *except* for
the flag.

This has the advantage that sysread() behaves consistently after
the change. It may however make code that depends on the old
behaviour silently misbehave.

2) sysread() fails, probably with EINVAL.

While sysread() becomes no longer useful on handles with the flag,
mixing low- and high-level I/O is generally unsafe anyway, and
PerlIO layers are pretty much a high-level construct, so there
isn't much lost.

It prevents most silent mis-behaviour while remaining true to
sysread()'s contract to read bytes from a file.

3) sysread() croaks.

Similar to 2), but with more emphasis.

Then syswrite():

Unlike sysread(), syswrite() doesn't act a a vector for producing
corrupt internal perl data structures, but it does have the same issue
that it pays attention to only part of the layer state for the handle.

For example:

open my $fh, ">:utf8", "filetobeutf8.txt" or die;
my $data = "\x{101}";
syswrite $fh, $data;

will write UTF-8 encoded data to the file, which is fine, but:

open my $fh, ">:encoding(UCS-2BE)", "filetobeucs2.txt" or die;
my $data = "\x{101}";
syswrite $fh, $data;

does the same thing.

I believe if syswrite() is going to ignore any of the layer state of
the handle, it should ignore it all, so the examples above would throw
an exception, just as they do for handles without the flag when
syswrite() is called with wide characters.

Again, for the transition, syswrite() should produce a deprecation
warning.

Tony

[1] the other I can think of is that the :utf8 PerlIO layer doesn't
validate, it's just a flag

[2] or utf8, perl's internal encoding

[3] https://rt.cpan.org/Public/Bug/Display.html?id=83126 and a few
other tickets for the same distribution, and
https://rt.perl.org/Ticket/Display.html?id=121870
Re: [perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
On 08/06/2015 09:00 AM, Tony Cook (via RT) wrote:
> # New Ticket Created by Tony Cook
> # Please include the string: [perl #125760]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt.perl.org/Ticket/Display.html?id=125760 >
>
>
> One of the few remaining warts[1] in Perl's Unicode support is how
> sysread() and syswrite() behave on streams with a unicode layer.
>

Having read the excellent analysis, my 2c is that both failure cases for
sysread and syswrite should ultimately croak.

I do not have an informed opinion on what the deprecation cycle would
look like, as it is likely very beneficial to exercise the croaking as
early as 5.23.x on a CPAN smoke, yet it is clearly too early for code in
the wild.
[perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
I think I am the original vector of these vectors... and I think I was wrong. Just make them croak.


---
via perlbug: queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=125760
Re: [perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
* Peter Rabbitson <rabbit-p5p@rabbit.us> [2015-08-06T03:12:26]
> Having read the excellent analysis, my 2c is that both failure cases for
> sysread and syswrite should ultimately croak.

Yes. Thanks, Tony, and I agree.

> I do not have an informed opinion on what the deprecation cycle would look
> like, as it is likely very beneficial to exercise the croaking as early as
> 5.23.x on a CPAN smoke, yet it is clearly too early for code in the wild.

We should definitely get the warnings in place soon.

I think it would be beneficial if we had a way to mark any deprecation warning
as fatal, process wide, for the purpose of smoking (and other places, like
integration testing), but I think it needs more thought than a "hey it would be
neat" from me.

But if we're going to make it croak in 5.28, time to make it warn now.

--
rjbs
[perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
On Thu Aug 06 16:12:45 2015, perl.p5p@rjbs.manxome.org wrote:
> But if we're going to make it croak in 5.28, time to make it warn now.

Patch attached.

As chansen mentioned in #p5p, send() and recv() have the same issue, so the patch
also deprecates them on :utf8 handles.

Tony


---
via perlbug: queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=125760
[perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
On Sun Aug 09 23:23:05 2015, tonyc wrote:
> On Thu Aug 06 16:12:45 2015, perl.p5p@rjbs.manxome.org wrote:
> > But if we're going to make it croak in 5.28, time to make it warn
> > now.
>
> Patch attached.
>
> As chansen mentioned in #p5p, send() and recv() have the same issue,
> so the patch
> also deprecates them on :utf8 handles.

Applied as fb10a8a78bba7573de4629b739bfe81cd42e78c9.

Leaving open, as I expect *something* will break.

Tony


---
via perlbug: queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=125760
Re: [perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
On Mon, Aug 17, 2015 at 8:40 AM, Tony Cook via RT <perlbug-followup@perl.org
> wrote:

> On Sun Aug 09 23:23:05 2015, tonyc wrote:
> > On Thu Aug 06 16:12:45 2015, perl.p5p@rjbs.manxome.org wrote:
> > > But if we're going to make it croak in 5.28, time to make it warn
> > > now.
> >
> > Patch attached.
> >
> > As chansen mentioned in #p5p, send() and recv() have the same issue,
> > so the patch
> > also deprecates them on :utf8 handles.
>
> Applied as fb10a8a78bba7573de4629b739bfe81cd42e78c9.
>
> Leaving open, as I expect *something* will break.
>
> Tony
>

Among commonly used modules, I don't expect File::Slurp to like this, then
again the beast is so broken by design that that may be a feature.

Leon
Re: [perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
On Sun, Aug 16, 2015 at 11:40:06PM -0700, Tony Cook via RT wrote:
> On Sun Aug 09 23:23:05 2015, tonyc wrote:
> > On Thu Aug 06 16:12:45 2015, perl.p5p@rjbs.manxome.org wrote:
> > > But if we're going to make it croak in 5.28, time to make it warn
> > > now.
> >
> > Patch attached.
> >
> > As chansen mentioned in #p5p, send() and recv() have the same issue,
> > so the patch
> > also deprecates them on :utf8 handles.
>
> Applied as fb10a8a78bba7573de4629b739bfe81cd42e78c9.

It's causing lib/warnings.t to fail in smokes that do PERL_UNICODE=""

e.g.


[davem@robin t]$ PERL_UNICODE="" ./perl harness ../lib/warnings.t
../lib/warnings.t .. 617/876 PROG:
# pp_sys.c [pp_sysread]
use warnings 'io' ;
if ($^O eq 'dos') {
print <<EOM ;
SKIPPED
# skipped on dos
EOM
exit ;
}
my $file = "./xcv" ;
open(F, ">$file") ;
my $a = sysread(F, $a,10) ;
no warnings 'io' ;
my $a = sysread(F, $a,10) ;
close F ;
use warnings 'io' ;
sysread(F, $a, 10);
read(F, $a, 10);
sysread(NONEXISTENT, $a, 10);
read(NONEXISTENT, $a, 10);
unlink $file ;
EXPECTED:
Filehandle F opened only for output at - line 12.
sysread() on closed filehandle F at - line 17.
read() on closed filehandle F at - line 18.
sysread() on unopened filehandle NONEXISTENT at - line 19.
read() on unopened filehandle NONEXISTENT at - line 20.
GOT:
sysread() is deprecated on :utf8 handles at - line 12.
Filehandle F opened only for output at - line 12.
sysread() is deprecated on :utf8 handles at - line 14.
sysread() on closed filehandle F at - line 17.
read() on closed filehandle F at - line 18.
sysread() on unopened filehandle NONEXISTENT at - line 19.
read() on unopened filehandle NONEXISTENT at - line 20.
# Failed test 673 - at lib/warnings/pp_sys line 625
PROG:
use warnings 'deprecated';
open my $fh, "<", "../harness" or die "# $!";
my $buf;
sysread $fh, $buf, 10;
binmode $fh, ':utf8';
sysread $fh, $buf, 10;
EXPECTED:
sysread() is deprecated on :utf8 handles at - line 6.
GOT:
sysread() is deprecated on :utf8 handles at - line 4.
sysread() is deprecated on :utf8 handles at - line 6.
# Failed test 689 - sysread() deprecated on :utf8 at lib/warnings/pp_sys line 943
PROG:
my $file = "syswwarn.tmp";
use warnings 'deprecated';
open my $fh, ">", $file or die "# $!";
syswrite $fh, 'ABC';
binmode $fh, ':utf8';
syswrite $fh, 'ABC';
close $fh;
unlink $file;
EXPECTED:
syswrite() is deprecated on :utf8 handles at - line 6.
GOT:
syswrite() is deprecated on :utf8 handles at - line 4.
syswrite() is deprecated on :utf8 handles at - line 6.
# Failed test 690 - syswrite() deprecated on :utf8 at lib/warnings/pp_sys line 953
../lib/warnings.t .. Failed 3/876 subtests
(less 2 skipped subtests: 871 okay)

Test Summary Report
-------------------
../lib/warnings.t (Wstat: 0 Tests: 876 Failed: 3)
Failed tests: 673, 689-690
Files=1, Tests=876, 11 wallclock secs ( 0.31 usr 0.02 sys + 8.64 cusr 2.42 csys = 11.39 CPU)
Result: FAIL
[davem@robin t]$


--
Indomitable in retreat, invincible in advance, insufferable in victory
-- Churchill on Montgomery
[perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
On Wed Aug 19 01:43:50 2015, davem wrote:
> On Sun, Aug 16, 2015 at 11:40:06PM -0700, Tony Cook via RT wrote:
> > On Sun Aug 09 23:23:05 2015, tonyc wrote:
> > > On Thu Aug 06 16:12:45 2015, perl.p5p@rjbs.manxome.org wrote:
> > > > But if we're going to make it croak in 5.28, time to make it warn
> > > > now.
> > >
> > > Patch attached.
> > >
> > > As chansen mentioned in #p5p, send() and recv() have the same
> > > issue,
> > > so the patch
> > > also deprecates them on :utf8 handles.
> >
> > Applied as fb10a8a78bba7573de4629b739bfe81cd42e78c9.
>
> It's causing lib/warnings.t to fail in smokes that do PERL_UNICODE=""

Thanks, fixed in 60e67244ff.

Tony

---
via perlbug: queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=125760
Re: [perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
On Tue, Sep 25, 2018 at 3:26 AM Tony Cook via RT
<perlbug-followup@perl.org> wrote:
>
> On Sun, 15 Oct 2017 22:23:32 -0700, tonyc wrote:
> > Added to the 5.30 blockers ticket, since these should start croaking then.
>
> Patch to make them fatal attached, I'll apply this Soon(tm).

I should point out that File::Slurp still hasn't been fix to not use
this misfeature (despite a ticket being open about it for half a
decade). File::Slurp currently has 636 direct dependents (and an
unknown bug likely high number of indirect dependencies).

This change will break CPAN given the current state of File::Slurp.

Leon

Leon
Re: [perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
On Wed, Oct 10, 2018 at 3:14 AM Tony Cook via RT
<perlbug-followup@perl.org> wrote:
>
> On Tue, 25 Sep 2018 10:27:33 -0700, LeonT wrote:
> > On Tue, Sep 25, 2018 at 3:26 AM Tony Cook via RT
> > <perlbug-followup@perl.org> wrote:
> > >
> > > On Sun, 15 Oct 2017 22:23:32 -0700, tonyc wrote:
> > > > Added to the 5.30 blockers ticket, since these should start
> > > > croaking then.
> > >
> > > Patch to make them fatal attached, I'll apply this Soon(tm).
> >
> > I should point out that File::Slurp still hasn't been fix to not use
> > this misfeature (despite a ticket being open about it for half a
> > decade). File::Slurp currently has 636 direct dependents (and an
> > unknown bug likely high number of indirect dependencies).
> >
> > This change will break CPAN given the current state of File::Slurp.
>
> Code using File::Slurp with :utf8 is already broken, it just won't be
> silently broken now.
>
> Tony

True, but it's now also uninstallable for all File::Slurp users that
don't use :utf8 (which is probably the majority).

Leon
Re: [perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams [ In reply to ]
On Sun, Nov 11, 2018 at 05:24:51AM -0800, Thorsten Sch?ning via RT wrote:
> Am Sun, 09 Aug 2015 23:23:05 -0700, tonyc schrieb:
> > if ((fp_utf8 = PerlIO_isutf8(IoIFP(io))) && !IN_BYTES) {
> > + if (PL_op->op_type == OP_SYSREAD || PL_op->op_type == OP_RECV) {
> > + Perl_ck_warner(aTHX_ packWARN(WARN_DEPRECATED),
> > + "%s() is deprecated on :utf8 handles",
> > + OP_DESC(PL_op));
> > + }
>
> Could someone please explain me if the following change is valid:
>
> > binmode($DB::OUT, ':utf-8')
> to
> > binmode($DB::OUT, ':encoding(UTF-8)')

No, that won't prevent the warning.

For example, with a default 5.28.0 build:

$ ./perl -Ilib -e 'open my $f, ">", "foo"; binmode $f; binmode $f, ":encoding(UTF8)"; syswrite($f, "foo")'
syswrite() is deprecated on :utf8 handles. This will be a fatal error in Perl 5.30 at -e line 1.

In blead this throws an error instead.

>
> >From my understanding it is not because of statements like the following:
>
> > +(W deprecated) The sysread(), recv(), syswrite() and send() operators
> > +are deprecated on handles that have the C<:utf8> layer, either
> > +explicitly, or implicitly, eg., with the C<:encoding(UTF-16LE)> layer.
>
> https://rt.perl.org/Public/Ticket/Attachment/1360169/728149/0001-perl-125760-deprecate-sys-read-write-send-recv-on-ut.patch
>
> > +Note that if the filehandle has been marked as C<:utf8>, C<sysread> will
> > +throw an exception. The C<:encoding(...)> layer implicitly
> > introduces the C<:utf8> layer.</sysread>
>
> https://rt.perl.org/Public/Ticket/Attachment/1583747/827814/0001-perl-133170-fatalize-sysread-syswrite-recv-send-on-u.patch
>
> But the change seems to fix the warnings. Shouldn't that change still result in the ":utf8"-flag being applied and therefore being still wrong? The other attached patches of this bug e.g. regarding tests use ":raw" only, like I understand this bug as well.

I'm not sure what you're saying here, if you could provide simple but
complete code that doesn't behave how you expect, the result you
expect and the result you got, we can decide if we have a code bug or
perhaps a documentation bug.

Tony