Mailing List Archive

Re: "use v5.36.0" should imply UTF-8 encoded source
On Mon, Aug 2, 2021 at 11:31 AM Felipe Gasper <felipe@felipegasper.com>
wrote:

>
>
> > On Aug 2, 2021, at 11:17 AM, Veesh Goldman <rabbiveesh@gmail.com> wrote:
> >
> >
> >
> >
> > My point is still that this:
> >
> > -----
> > use v5.36;
> > print 'Hello, world!';
> > -----
> >
> > … should not be “subtly wrong”.
> >
> > -F
> >
> > Since 5.36 is meant to turn on warnings, this will be explicitly wrong,
> not subtly.
> >
> > Perhaps the "wide character" warning is too unclear, but we can always
> improve the text to include a doc link as such.
>
> There’s no “wide character” warning when there happen to be no wide
> characters.
>
> >
> > What compels me more is the following example.
> > Let's say I'm looking for customers in my database named josé. Easy,
> I'll use DBIC:
> >
> > $customer_rs->search({ name => 'josé' })
> >
> > But when I run it, I get nothing. That's because the various DBDs will
> handle encoding and decoding for you, bc perl is meant to deal with text in
> userland.
>
> Which DBDs?
>
> - DBD::SQLite is bytes by default, but it has the SvPV bug (i.e., it sends
> the internal PV to SQLite).
>
> - DBD::mysql is also bytes w/ SvPV bug by default.
>
> (I haven’t tried DBD::Pg.)
>

DBD::mysql has the unicode bug due to long-standing issues. DBD::MariaDB
was forked for this reason.

DBD::MariaDB, DBD::SQLite, and DBD::Pg are used with the unicode option in
any modern programs. Thus they expect decoded strings.


> > Had utf8 been turned on, then I would've started with text, not bytes,
> and found my customers instead of mojibake (though on the other hand, the
> non utf8 is a great way to find double encoded text).
> >
> > I think this is a more realistic example than printing a string literal,
> where the behavior is surprising and conceptually inconsistent.
>
> Why would you query on a string constant? More likely you’ll be accepting
> $name via some input, in which case you have to decode it. But if you tried
> it with a constant you may be confused at why you *didn’t* have to decode
> it there.


You are making a lot of assumptions about other peoples' code and thought
processes based on your own experience, which is not the way many
people approach these problems. And that is why we are considering this; to
make the defaults match more people's assumptions.

-Dan
Re: "use v5.36.0" should imply UTF-8 encoded source
Postgres defaults to chars when the client encoding is utf8.

On Mon, Aug 2, 2021, 11:46 Felipe Gasper <felipe@felipegasper.com> wrote:

>
>
> > On Aug 2, 2021, at 11:25 AM, Dan Book <grinnz@gmail.com> wrote:
> >
> > STDOUT and STDERR expect bytes unless one does "use open ':std', IO =>
> ':encoding(UTF-8)';" which changes the assumption of those interfaces so
> isn't great. DBI drivers, Mojolicious interfaces, etc expect characters.
>
> Of note: DBI drivers, by default, seem more predominantly to expect
> *bytes*, not characters. At least, SQLite, MySQL, and PostgreSQL do.
>
> -F
>
>
Re: "use v5.36.0" should imply UTF-8 encoded source
> On Aug 2, 2021, at 11:53 AM, Dan Book <grinnz@gmail.com> wrote:
>
> On Mon, Aug 2, 2021 at 11:31 AM Felipe Gasper <felipe@felipegasper.com> wrote:
>
>
> > On Aug 2, 2021, at 11:17 AM, Veesh Goldman <rabbiveesh@gmail.com> wrote:
> >
> >
> >
> >
> > My point is still that this:
> >
> > -----
> > use v5.36;
> > print 'Hello, world!';
> > -----
> >
> > … should not be “subtly wrong”.
> >
> > -F
> >
> > Since 5.36 is meant to turn on warnings, this will be explicitly wrong, not subtly.
> >
> > Perhaps the "wide character" warning is too unclear, but we can always improve the text to include a doc link as such.
>
> There’s no “wide character” warning when there happen to be no wide characters.
>
> >
> > What compels me more is the following example.
> > Let's say I'm looking for customers in my database named josé. Easy, I'll use DBIC:
> >
> > $customer_rs->search({ name => 'josé' })
> >
> > But when I run it, I get nothing. That's because the various DBDs will handle encoding and decoding for you, bc perl is meant to deal with text in userland.
>
> Which DBDs?
>
> - DBD::SQLite is bytes by default, but it has the SvPV bug (i.e., it sends the internal PV to SQLite).
>
> - DBD::mysql is also bytes w/ SvPV bug by default.
>
> (I haven’t tried DBD::Pg.)
>
> DBD::mysql has the unicode bug due to long standing issues. DBD::MariaDB was forked for this reason.
>
> DBD::MariaDB, DBD::SQLite, and DBD::Pg are used with the unicode option in any modern programs. Thus they expect decoded strings.

None of these modules’ documentation says “all new code should enable this”, so if indeed “any modern programs” should be set up that way, it seems a rather cargo-cult-ish thing.

I would say, respectfully, that you yourself are “making a lot of assumptions about other peoples' code”, etc. etc.

>
> > Had utf8 been turned on, then I would've started with text, not bytes, and found my customers instead of mojibake (though on the other hand, the non utf8 is a great way to find double encoded text).
> >
> > I think this is a more realistic example than printing a string literal, where the behavior is surprising and conceptually inconsistent.
>
> Why would you query on a string constant? More likely you’ll be accepting $name via some input, in which case you have to decode it. But if you tried it with a constant you may be confused at why you *didn’t* have to decode it there.
>
> You are making a lot of assumptions about other peoples' code and thought processes based on your own experience, which is not the way many people approach these problems. And that is why we are considering this; to make the defaults match more people's assumptions.

Making defaults match assumptions is a great thing. I just think newcomers to the language would make assumptions about what `print 'Hello, world!'` does before they reason about DBI etc. Most of those newcomers will hail from JS or Python, where this stuff “just works”.

It basically seems like all the right people are on board with the notion that “Hello, world” in “modern” Perl will look thus:

-----
use v5.36;
use Encode;
print Encode::encode_utf8('Hello, world!');
-----

… and any ensuing explanation will have to discuss character encoding, and the fact that Perl can’t tell text from bytes. Right away this simple example draws attention to one of Perl’s more frustration-prone qualities.

Respectfully, I just can’t see how this improves the language, and I’m surprised more folks aren’t voicing similar thoughts. I’d love to be wrong; I guess we’ll see.

cheers,
-Felipe
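
The point about silent breakage can be reproduced without a UTF-8 source file. The following is a minimal sketch, not part of the original message: it assumes an in-memory filehandle with no encoding layer as a stand-in for STDOUT, and the helper name bytes_printed is hypothetical. A string whose characters all fit in one byte is written as-is with no warning, so "josé" comes out as a Latin-1 byte rather than UTF-8; only a character above 0xFF triggers "Wide character".

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: print $text to a layer-less in-memory handle
# (standing in for STDOUT) and return the bytes written plus any warnings.
sub bytes_printed {
    my ($text) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buffer = '';
    open my $fh, '>', \$buffer or die "open: $!";
    print {$fh} $text;
    close $fh or die "close: $!";
    return ($buffer, @warnings);
}

# "\x{e9}" is what the literal 'é' becomes under "use utf8";
# escapes are used here so this example stays pure ASCII.
my ($latin1, @w1) = bytes_printed("jos\x{e9}");
my ($wide,   @w2) = bytes_printed("\x{263A}");
my ($utf8,   @w3) = bytes_printed(Encode::encode_utf8("jos\x{e9}"));

printf "unencoded: %vX, warnings: %d\n", $latin1, scalar @w1;
printf "wide char: %vX, warnings: %d\n", $wide,   scalar @w2;
printf "encoded:   %vX, warnings: %d\n", $utf8,   scalar @w3;
```

Only the third call writes valid UTF-8 without a warning, which is the explicit encode step the message objects to having in a "Hello, world" program.
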
Re: "use v5.36.0" should imply UTF-8 encoded source
On Tue, 3 Aug 2021 at 01:17, Felipe Gasper <felipe@felipegasper.com> wrote:

>
>
> > On Aug 2, 2021, at 11:53 AM, Dan Book <grinnz@gmail.com> wrote:
> >
> > On Mon, Aug 2, 2021 at 11:31 AM Felipe Gasper <felipe@felipegasper.com>
> wrote:
> >
> >
> > > On Aug 2, 2021, at 11:17 AM, Veesh Goldman <rabbiveesh@gmail.com>
> wrote:
> > >
> > >
> > >
> > >
> > > My point is still that this:
> > >
> > > -----
> > > use v5.36;
> > > print 'Hello, world!';
> > > -----
> > >
> > > … should not be “subtly wrong”.
> > >
> > > -F
> > >
> > > Since 5.36 is meant to turn on warnings, this will be explicitly
> wrong, not subtly.
> > >
> > > Perhaps the "wide character" warning is too unclear, but we can always
> improve the text to include a doc link as such.
> >
> > There’s no “wide character” warning when there happen to be no wide
> characters.
> >
> > >
> > > What compels me more is the following example.
> > > Let's say I'm looking for customers in my database named josé. Easy,
> I'll use DBIC:
> > >
> > > $customer_rs->search({ name => 'josé' })
> > >
> > > But when I run it, I get nothing. That's because the various DBDs will
> handle encoding and decoding for you, bc perl is meant to deal with text in
> userland.
> >
> > Which DBDs?
> >
> > - DBD::SQLite is bytes by default, but it has the SvPV bug (i.e., it
> sends the internal PV to SQLite).
> >
> > - DBD::mysql is also bytes w/ SvPV bug by default.
> >
> > (I haven’t tried DBD::Pg.)
> >
> > DBD::mysql has the unicode bug due to long standing issues. DBD::MariaDB
> was forked for this reason.
> >
> > DBD::MariaDB, DBD::SQLite, and DBD::Pg are used with the unicode option
> in any modern programs. Thus they expect decoded strings.
>
> None of these modules’ documentation says “all new code should enable
> this”, so if indeed “any modern programs” should be set up that way, it
> seems a rather cargo-cult-ish thing.
>
> I would say, respectfully, that you yourself are “making a lot of
> assumptions about other peoples' code”, etc. etc.
>
> >
> > > Had utf8 been turned on, then I would've started with text, not bytes,
> and found my customers instead of mojibake (though on the other hand, the
> non utf8 is a great way to find double encoded text).
> > >
> > > I think this is a more realistic example than printing a string
> literal, where the behavior is surprising and conceptually inconsistent.
> >
> > Why would you query on a string constant? More likely you’ll be
> accepting $name via some input, in which case you have to decode it. But if
> you tried it with a constant you may be confused at why you *didn’t* have
> to decode it there.
> >
> > You are making a lot of assumptions about other peoples' code and
> thought processes based on your own experience, which is not the way many
> people approach these problems. And that is why we are considering this; to
> make the defaults match more people's assumptions.
>
> Making defaults match assumptions is a great thing. I just think newcomers
> to the language would make assumptions about what `print 'Hello, world!'`
> does before they reason about DBI etc. Most of those newcomers will hail
> from JS or Python, where this stuff “just works”.
>
> It basically seems like all the right people are on board with the notion
> that “Hello, world” in “modern” Perl will look thus:
>
> -----
> use v5.36;
> use Encode;
> print Encode::encode_utf8('Hello, world!');
> -----
>
> … and any ensuing explanation will have to discuss character encoding, and
> the fact that Perl can’t tell text from bytes. Right away this simple
> example draws attention to one of Perl’s more frustration-prone qualities.
>
> Respectfully, I just can’t see how this improves the language, and I’m
> surprised more folks aren’t voicing similar thoughts. I’d love to be wrong;
> I guess we’ll see.
>

In core, the official answer is "use utf8" with "binmode":

https://perldoc.perl.org/perluniintro

Outside core Perl, one of the commonly shared guides is
Re: "use v5.36.0" should imply UTF-8 encoded source
My apologies, hit send too soon:

On Tue, 3 Aug 2021 at 01:23, Tom Molesworth <tom@deriv.com> wrote:

> On Tue, 3 Aug 2021 at 01:17, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>>
>> -----
>> use v5.36;
>> use Encode;
>> print Encode::encode_utf8('Hello, world!');
>> -----
>>
>> … and any ensuing explanation will have to discuss character encoding,
>> and the fact that Perl can’t tell text from bytes. Right away this simple
>> example draws attention to one of Perl’s more frustration-prone qualities.
>>
>> Respectfully, I just can’t see how this improves the language, and I’m
>> surprised more folks aren’t voicing similar thoughts. I’d love to be wrong;
>> I guess we’ll see.
>>
>
> In core, the official answer is "use utf8" with "binmode":
>
> https://perldoc.perl.org/perluniintro
>
> Outside core Perl, one of the commonly shared guides is
>
>
... this summary by tchrist:

https://stackoverflow.com/a/6163129

A general mantra of "transcode at the edges" is something I personally find
more helpful than adding encode/decode operations throughout the core of
the code.

Enabling utf8 by default (lexical scope) conflicts with our inability to
enable the `encoding(UTF-8)` binmode by default (global scope).

Nice as utf8-everywhere might be - for some use-cases, at least - I don't
think there has been any viable option provided to resolve this?
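
"Transcode at the edges" might look roughly like the sketch below. It is an illustration only: in-memory data stands in for real input and output, and the variable names are hypothetical. Bytes are decoded once on the way in, the core of the program works on characters, and an :encoding layer encodes on the way out (binmode(STDOUT, ':encoding(UTF-8)') would play the same role for STDOUT).

```perl
use strict;
use warnings;
use Encode ();

# Edge 1 (input): bytes arrive from outside; decode them exactly once.
my $input_bytes = "jos\xc3\xa9";    # UTF-8 bytes for "josé"
my $name = Encode::decode('UTF-8', $input_bytes,
    Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Core: operate on characters only, with no encode/decode calls.
my $shout = uc $name;               # four characters, last one is U+00C9

# Edge 2 (output): let an :encoding layer encode on the way out.
my $output_bytes = '';
open my $out, '>:encoding(UTF-8)', \$output_bytes or die "open: $!";
print {$out} $shout;
close $out or die "close: $!";

printf "characters: %d, bytes written: %d\n",
    length $shout, length $output_bytes;
```

The layer in this sketch is attached per handle, which is the scoping mismatch raised above: "use utf8" is lexical, while a default layer for STDOUT would be global.
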
Re: "use v5.36.0" should imply UTF-8 encoded source
Dan Book <grinnz@gmail.com> writes:

> DBD::MariaDB, DBD::SQLite, and DBD::Pg are used with the unicode
> option in any modern programs. Thus they expect decoded strings.

As far as DBD::SQLite is concerned, this is only half-true. In the
current version 1.70 there have been changes in how to declare unicode
handling, but even with DBD_SQLITE_STRING_MODE_UNICODE_STRICT you can
feed it UTF-8 encoded byte sequences and it "just works" (but maybe
shouldn't).

You see the downside of this when you have a non-ASCII literal in an
iso-latin-1 encoded Perl source (e.g. "ä" or "\x{e4}"). For Perl, it is
the same character as "\N{LATIN SMALL LETTER A WITH DIAERESIS}", but if
you feed both to the database you get different results.

Veesh could change his source (if in a latin-1 encoded file)
$customer_rs->search({ name => 'josé' })
to
$customer_rs->search({ name => decode('iso-8859-1','josé') })
to make it work.

It seems that the driver still inspects the infamous UTF-8-flag to
decide whether a literal is encoded or not.

This issue goes away when source files are encoded (and assumed to be
encoded) in UTF-8. But "working around driver quirks" is, in my opinion,
no good motivation for the change.

> Why would you query on a string constant? More likely you’ll be
> accepting $name via some input, in which case you have to decode
> it. But if you tried it with a constant you may be confused at why
> you *didn’t* have to decode it there.

I've seen that problem when feeding data from iso-latin-1 encoded input.
"You have to decode it" nails it, and you do have to decode it in this
example, too.
--
Cheers,
haj
Re: "use v5.36.0" should imply UTF-8 encoded source
On Mon, Aug 2, 2021 at 4:28 PM Harald Jörg <haj@posteo.de> wrote:

> Dan Book <grinnz@gmail.com> writes:
>
> > DBD::MariaDB, DBD::SQLite, and DBD::Pg are used with the unicode
> > option in any modern programs. Thus they expect decoded strings.
>
> As far as DBD::SQLite is concerned, this is only half-true. In the
> current version 1.70 there have been changes how to declare unicode
> handling, but even with DBD_SQLITE_STRING_MODE_UNICODE_STRICT you can
> feed it UTF-8 encoded byte sequences and it "just works" (but maybe
> shouldn't).
>
> You see the downside of this when you have a non-ASCII literal in a
> iso-latin-1 encoded Perl source (e.g. "ä" or "\x{e4}"). For Perl, it is
> the same character as "\N{LATIN SMALL LETTER A WITH DIAERESIS}", but if
> you feed both to the database you get different results.
>

I don't think this is correct. Mojo::SQLite has many tests to ensure in
unicode-mode that it treats strings consistently.


> Veesh could change his source (if in a latin-1 encoded file)
> $customer_rs->search({ name => 'josé' })
> to
> $customer_rs->search({ name => decode('iso-8859-1','josé') })
> to make it work.
>

This code makes no difference, decoding from iso-8859-1 is a no-op in Perl
strings (aside from considering "bytes" outside the single-byte encoding
range as errors/replacement characters).
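
The no-op claim can be checked with core modules alone. A minimal sketch, assuming the literal is written as an escape so the example does not depend on the source file's encoding:

```perl
use strict;
use warnings;
use Encode ();

# What a Latin-1 source literal "ä" holds when "use utf8" is not in effect.
my $literal = "\xe4";

# Decoding as ISO-8859-1 maps each byte 0x00-0xFF to the same code point,
# so the resulting string compares equal to the input.
my $from_latin1 = Encode::decode('iso-8859-1', $literal);

# Decoding UTF-8 bytes is different: two bytes become one character.
my $from_utf8 = Encode::decode('UTF-8', "\xc3\xa4");

printf "latin-1 decode unchanged:          %s\n",
    $from_latin1 eq $literal ? 'yes' : 'no';
printf "utf-8 decode gives same character: %s\n",
    $from_utf8 eq $literal ? 'yes' : 'no';
```
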


> It seems that the driver still inspects the infamous UTF-8-flag to
> decide whether a literal is encoded or not.
>

This is not the case.

use strict;
use warnings;
use DBI;
use DBD::SQLite;
use DBD::SQLite::Constants ':dbd_sqlite_string_mode';

my %options = (RaiseError => 1, AutoInactiveDestroy => 1,
    sqlite_string_mode => DBD_SQLITE_STRING_MODE_UNICODE_FALLBACK);
my $db = DBI->connect('dbi:SQLite:dbname=:memory:', undef, undef,
    \%options);

my $str = "\xe4";

# Internal representation: one byte, UTF-8 flag off
utf8::downgrade $str;
printf "%vX (length: %d)\n", $db->selectrow_array('SELECT ?, length(?)',
    undef, $str, $str);
# prints: E4 (length: 1)

# Internal representation: UTF-8, flag on; same character
utf8::upgrade $str;
printf "%vX (length: %d)\n", $db->selectrow_array('SELECT ?, length(?)',
    undef, $str, $str);
# prints: E4 (length: 1)


> This issue goes away when source files are encoded (and assumed to be
> encoded in UTF-8. But "working around driver quirks" is in my opinion
> no good motivation for the change.
>

-Dan
Re: "use v5.36.0" should imply UTF-8 encoded source
On Mon, Aug 2, 2021 at 5:32 PM Dan Book <grinnz@gmail.com> wrote:

> On Mon, Aug 2, 2021 at 4:28 PM Harald Jörg <haj@posteo.de> wrote:
>
>> Dan Book <grinnz@gmail.com> writes:
>>
>> > DBD::MariaDB, DBD::SQLite, and DBD::Pg are used with the unicode
>> > option in any modern programs. Thus they expect decoded strings.
>>
>> As far as DBD::SQLite is concerned, this is only half-true. In the
>> current version 1.70 there have been changes how to declare unicode
>> handling, but even with DBD_SQLITE_STRING_MODE_UNICODE_STRICT you can
>> feed it UTF-8 encoded byte sequences and it "just works" (but maybe
>> shouldn't).
>>
>> You see the downside of this when you have a non-ASCII literal in a
>> iso-latin-1 encoded Perl source (e.g. "ä" or "\x{e4}"). For Perl, it is
>> the same character as "\N{LATIN SMALL LETTER A WITH DIAERESIS}", but if
>> you feed both to the database you get different results.
>>
>
> I don't think this is correct. Mojo::SQLite has many tests to ensure in
> unicode-mode that it treats strings consistently.
>
>
>> Veesh could change his source (if in a latin-1 encoded file)
>> $customer_rs->search({ name => 'josé' })
>> to
>> $customer_rs->search({ name => decode('iso-8859-1','josé') })
>> to make it work.
>>
>
> This code makes no difference, decoding from iso-8859-1 is a no-op in Perl
> strings (aside from considering "bytes" outside the single-byte encoding
> range as errors/replacement characters).
>
>
>> It seems that the driver still inspects the infamous UTF-8-flag to
>> decide whether a literal is encoded or not.
>>
>
> This is not the case.
>
> use strict;
> use warnings;
> use DBD::SQLite;
> use DBD::SQLite::Constants ':dbd_sqlite_string_mode';
>
> my %options = (RaiseError => 1, AutoInactiveDestroy => 1,
> sqlite_string_mode => DBD_SQLITE_STRING_MODE_UNICODE_FALLBACK);
> my $db = DBI->connect('dbi:SQLite:dbname=:memory:', undef, undef,
> \%options);
>
> my $str = "\xe4";
>
> utf8::downgrade $str;
> printf "%vX (length: %d)\n", $db->selectrow_array('SELECT ?, length(?)',
> undef, $str, $str);
> # prints: E4 (length: 1)
>
> utf8::upgrade $str;
> printf "%vX (length: %d)\n", $db->selectrow_array('SELECT ?, length(?)',
> undef, $str, $str);
> # prints: E4 (length: 1)
>

And for completeness if you do the same test with the UTF-8 encoded bytes
"\xc3\xa4" you get consistent results as well: C3.A4 (length: 2)

-Dan
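
The Perl side of that comparison can be seen without a database. A minimal sketch using only core functions (the helper name describe is hypothetical): upgrading or downgrading changes the internal storage and the UTF8 flag, but not the characters or the length, so a driver that ignores the flag should see the same string either way.

```perl
use strict;
use warnings;

# Hypothetical helper: show a string's code points and character length.
sub describe {
    my ($str) = @_;
    return sprintf '%vX (length: %d)', $str, length $str;
}

my $char  = "\xe4";        # one character, U+00E4
my $bytes = "\xc3\xa4";    # two characters: the UTF-8 encoding of U+00E4

for my $original ($char, $bytes) {
    my ($down, $up) = ($original, $original);
    utf8::downgrade($down);    # stored one byte per character, flag off
    utf8::upgrade($up);        # stored as UTF-8 internally, flag on
    printf "flag off: %s | flag on: %s\n", describe($down), describe($up);
}
```
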
Re: "use v5.36.0" should imply UTF-8 encoded source
2021-8-2 23:43 Felipe Gasper <felipe@felipegasper.com>:

>
> use v5.36;
> print 'Hello, world!';
>
>
I think this is the right code in Perl.

Is the combination of "use utf8" and "print ascii code characters" the
right code?
Re: "use v5.36.0" should imply UTF-8 encoded source
Dan Book <grinnz@gmail.com> writes:

> On Mon, Aug 2, 2021 at 4:28 PM Harald Jörg <haj@posteo.de> wrote:
>
> > Dan Book <grinnz@gmail.com> writes:
> >
> > > DBD::MariaDB, DBD::SQLite, and DBD::Pg are used with the unicode
> > > option in any modern programs. Thus they expect decoded strings.
> >
> > As far as DBD::SQLite is concerned, this is only half-true. In the
> > current version 1.70 there have been changes how to declare unicode
> > handling, but even with DBD_SQLITE_STRING_MODE_UNICODE_STRICT you can
> > feed it UTF-8 encoded byte sequences and it "just works" (but maybe
> > shouldn't).
> > [...]
> I don't think this is correct. Mojo::SQLite has many tests to ensure
> in unicode-mode that it treats strings consistently.

Thanks for clarifying! I compared your code to my failing tests (as
attached to https://github.com/DBD-SQLite/DBD-SQLite/issues/83) and
found that I fell victim to a typo in the DBD::SQLite docs:

$dbh->{string_mode} = DBD_SQLITE_STRING_MODE_UNICODE_FALLBACK;

That key should read $dbh->{sqlite_string_mode}. So, in effect, I ran
the test with the old (broken) default encoding which is known to fail.
--
Thanks again,
haj
Re: "use v5.36.0" should imply UTF-8 encoded source
On 2021/07/30 07:45, Ricardo Signes wrote:
> Porters,
>
> I propose that "use v5.36.0" should imply that the source code is,
> subsequently, UTF-8 encoded.

How about -CL, as that seems to do the same and work with most of
the cases others brought up...
---
Though, the attached seems to work with or without that switch, not
exactly sure why though (has utf8 in the source)...
Re: "use v5.36.0" should imply UTF-8 encoded source
> On Aug 3, 2021, at 17:35, L A Walsh <Astara@tlinx.org> wrote:
>
> On 2021/07/30 07:45, Ricardo Signes wrote:
>> Porters,
>>
>> I propose that "use v5.36.0" should imply that the source code is, subsequently, UTF-8 encoded.
>
> How about -CL, as that seems to do the same and work with most of
> the cases others brought up...
> ---
> Though, the attached seems to work with or without that switch, not
> exactly sure why though (has utf8 in the source)...

It works because of character encoding bugs in the Perl builtins that you’re using. See CPAN’s Sys::Binmode for the fix.

-F

> #!/usr/bin/perl
> use warnings; use strict;
> sub P(@);
> if (0) {
> eval 'use P';
> } else {
> eval ' sub P(@) { printf @_ if @_; print "\n"; } ';
> }
>
> my $x="é";
> my $pi="?";
> my $u8tmp="????????????";
> my $tmp="/tmp";
> my $epee="épée";
> P "x=%s, u8tmp=%s", $x, $u8tmp;
> my $newtmp="$tmp/$u8tmp";
> if ( ! -d $newtmp) {
> mkdir $newtmp or die P "mkdir %s returned %s", $newtmp, $!;
> }
> system("touch \"$newtmp/$x\" ");
> chdir $newtmp;
> system("touch \"$newtmp/$epee\" ");
> my $dot=".";
> opendir my $dh,$dot or die P "opendir of %s failed: %s", $dot,$!;
> while (defined($_=readdir($dh))) {
> P "found %s", $_;
> };
> closedir $dh || die P "closedir failed with %s",$!;
> #if (-e $tmp) {
> # rmdir "$tmp" || die P "rmdir of %s: %s", $tmp, $!
> #}
Re: "use v5.36.0" should imply UTF-8 encoded source
On August 3, 2021 13:26, Felipe Gasper wrote:
> It sounds like you want STDIN/STDOUT/STDERR and @ARGV to be decoded by default, in addition to the source code. TBH that makes more sense to me than the present proposal since it would preserve parity of character encoding for simple programs, though I think it would still sow confusion since other filehandles will remain binary by default:

Are other filehandles binary by default? I thought they were text by default, with the “:crlf” layer applied on those operating systems where it is appropriate. Perhaps in 2021 “text by default” should include encoding.

--
Aaron Priven, aaron@priven.com, www.priven.com/aaron
Re: "use v5.36.0" should imply UTF-8 encoded source
On Wed, Aug 4, 2021 at 6:59 PM Aaron Priven <aaron@priven.com> wrote:

> On August 3, 2021 13:26, Felipe Gasper wrote:
> > It sounds like you want STDIN/STDOUT/STDERR and @ARGV to be decoded by
> default, in addition to the source code. TBH that makes more sense to me
> than the present proposal since it would preserve parity of character
> encoding for simple programs, though I think it would still sow confusion
> since other filehandles will remain binary by default:
>
> Are other filehandles binary by default? I thought they were text by
> default, with the “:crlf" layer applied on those operating systems where it
> is appropriate. Perhaps in 2021 “text by default” should include encoding.
>

They are "text by default" in the ASCII sense, not the Unicode sense. The
:crlf layer is enabled by default on Windows and translates CR LF to LF,
but there is no default translation of bytes to characters. So you need to
use binmode or :raw to make a filehandle binary-compatible on Windows, but
you also need to apply an :encoding layer if you want to read/write
characters instead of bytes.

-Dan
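
The layer distinction described above can be sketched as follows. This is an illustration under stated assumptions: a temporary file holds known bytes, the helper name slurp_with is hypothetical, and the layers are spelled out explicitly so that the platform default does not matter.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write known bytes: the UTF-8 encoding of "ä", then CR LF.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xc3\xa4\r\n";
close $tmp_fh or die "close: $!";

# Hypothetical helper: read the whole file through the given layers.
sub slurp_with {
    my ($layers) = @_;
    open my $fh, "<$layers", $path or die "open: $!";
    local $/;
    my $data = <$fh>;
    close $fh;
    return $data;
}

# :raw                  bytes exactly as stored
# :crlf                 CR LF becomes LF, bytes are still not decoded
# :raw:encoding(UTF-8)  bytes are decoded into characters, CR LF untouched
printf "%-22s %vX\n", $_, slurp_with($_)
    for ':raw', ':crlf', ':raw:encoding(UTF-8)';
```

Line-ending translation and character decoding are separate layers here, so a handle can have either, both, or neither.
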
Re: "use v5.36.0" should imply UTF-8 encoded source
> On Aug 4, 2021, at 4:33 PM, Dan Book <grinnz@gmail.com> wrote:
> They are "text by default" in the ASCII sense, not the Unicode sense. The :crlf layer is enabled by default on Windows and translates CR LF to LF, but there is no default translation of bytes to characters. So you need to use binmode or :raw to make a filehandle binary-compatible on Windows, but you also need to apply an :encoding layer if you want to read/write characters instead of bytes.

I don’t think it’s true that there’s no default treatment of bytes as characters. By default, perl treats bytes as Latin-1. So if you open a file without an encoding layer, read some data, and then output it to a file opened with an encoding layer, that encoding layer will assume that the data being output is in Latin-1, and convert that to characters accordingly.

So an open filehandle is, in perl, a text filehandle using encoding Latin-1, unless a layer or binmode is used. It wouldn’t be unreasonable to decide that, in some future version of perl, an open filehandle would be treated as a text filehandle using encoding UTF-8 instead.

The problem, of course, is that on some but not all operating systems the text filehandle returned by open can be used as a binary filehandle without loss. Conceptually it’s a text filehandle, the meaning in the perl language is that it’s a text filehandle, but people misuse it as a binary one because there’s no actual breakage, as long as it’s only running on those operating systems.

So I understand that in practice, it might well be more trouble than it’s worth to have “use 5.036” or even “use v7” make perl default to text filehandles using the UTF-8 encoding, instead of defaulting to text filehandles using the Latin-1 encoding. But I think it’s worth considering.

--
Aaron Priven, aaron@priven.com, www.priven.com/aaron
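
The round trip described in this message can be sketched with in-memory filehandles (an assumption made here so the example needs no files). A byte read without an encoding layer is taken as the code point with the same number, which coincides with Latin-1, when it is later written through a UTF-8 layer:

```perl
use strict;
use warnings;

# Read one byte (0xE4) through a handle with no encoding layer.
my $source = "\xe4";
open my $in, '<:raw', \$source or die "open: $!";
my $data = do { local $/; <$in> };
close $in;

# Write it through a UTF-8 encoding layer.
my $sink = '';
open my $out, '>:encoding(UTF-8)', \$sink or die "open: $!";
print {$out} $data;
close $out or die "close: $!";

# The byte was treated as code point U+00E4 and encoded as two bytes.
printf "read: %vX, written: %vX\n", $data, $sink;
```

If the input byte had been one half of a UTF-8 sequence rather than Latin-1 text, the same mechanism would produce double-encoded output.
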
Re: "use v5.36.0" should imply UTF-8 encoded source
On 8/6/21 5:34 PM, Aaron Priven wrote:
>> On Aug 4, 2021, at 4:33 PM, Dan Book <grinnz@gmail.com
>> <mailto:grinnz@gmail.com>> wrote:
>> They are "text by default" in the ASCII sense, not the Unicode sense.
>> The :crlf layer is enabled by default on Windows and translates CR LF
>> to LF, but there is no default translation of bytes to characters. So
>> you need to use binmode or :raw to make a filehandle binary-compatible
>> on Windows, but you also need to apply an :encoding layer if you want
>> to read/write characters instead of bytes.
>
> I don’t think it’s true that there’s no default treatment of bytes as
> characters. By default, perl treats bytes as Latin-1. So if you open a
> file without an encoding layer, read some data, and then output it to a
> file opened with an encoding layer, that encoding layer will assume that
> the data being output is in Latin-1, and convert that to characters
> accordingly.

Perl doesn't treat bytes as Latin-1 by default. It treats
non-ASCII-range bytes as not being in any character set. All such bytes
match \W in patterns, for example, and uc etc. return the input unchanged.
Getting a Latin-1 treatment requires the unicode_strings feature, or
converting the string to UTF-8.
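
A minimal sketch of that difference, using only core Perl (the hash names are hypothetical, and "no feature" is spelled out only to make the default explicit):

```perl
use strict;
use warnings;

my $byte_str = "\xe4";    # non-ASCII, stored without the UTF8 flag

my (%native, %upgraded, %unicode);
{
    no feature 'unicode_strings';    # the default, stated explicitly
    %native = (word => ($byte_str =~ /\w/ ? 1 : 0), upper => uc $byte_str);

    my $copy = $byte_str;
    utf8::upgrade($copy);            # same character, stored as UTF-8
    %upgraded = (word => ($copy =~ /\w/ ? 1 : 0), upper => uc $copy);
}
{
    use feature 'unicode_strings';
    %unicode = (word => ($byte_str =~ /\w/ ? 1 : 0), upper => uc $byte_str);
}

for my $case ([ 'default'         => \%native   ],
              [ 'upgraded string' => \%upgraded ],
              [ 'unicode_strings' => \%unicode  ]) {
    my ($label, $result) = @$case;
    printf "%-16s matches \\w: %d, uc: %vX\n",
        $label, $result->{word}, $result->{upper};
}
```

Only the first case leaves 0xE4 untouched; both the feature and the upgraded string give it the semantics of U+00E4.
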

>
> So an open filehandle is, in perl, a text filehandle using encoding
> Latin-1, unless a layer or binmode is used.  It wouldn’t be unreasonable
> to decide that, in some future version of perl, an open filehandle would
> be treated as a text filehandle using encoding UTF-8 instead.
>
> The problem, of course, is that on some but not all operating systems
> the text filehandle returned by open can be used as a binary filehandle
> without loss. /Conceptually/ it’s a text filehandle, the meaning in the
> perl language is that it’s a text filehandle, but people misuse it as a
> binary one because there’s no actual breakage, as long as it’s only
> running on those operating systems.
>
> So I understand that in practice, it might well be more trouble than
> it’s worth to have “use 5.036” or even “use v7” make perl default to
> text filehandles using the UTF-8 encoding, instead of defaulting to text
> filehandles using the Latin-1 encoding. But I think it’s worth considering.
>
> --
> Aaron Priven, aaron@priven.com <mailto:aaron@priven.com>,
> www.priven.com/aaron <http://www.priven.com/aaron>
