Mailing List Archive

elevator pitch for IV vs PV serialisation
So, we currently say in "What needs an RFC? What can just be a PR?"

1. Language changes (feature changes to the parser, tokeniser)
2. Command line options
3. Adding/removing warnings (entries in `perldiag.pod`)
4. Significant changes to when an existing warning triggers

https://github.com/Perl/RFCs/blob/master/docs/process.md


The previous PSC had wondered whether we should add

5. Bug fixes that have the potential for significant runtime behaviour changes
that can't be warned about and break things


So, here goes

Problem - perl caches string, integer and floating point representations of
values, and does this on reads. Hence serialisers can't distinguish between
values that started as numbers but got used as strings, and the other way.

$ perl -MDevel::Peek -e '$a = 42; $b = "$a"; Dump $a'
SV = PVIV(0x556b723ebad0) at 0x556b723ee9b8
REFCNT = 1
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 42
PV = 0x556b723df720 "42"\0
CUR = 2
LEN = 10
$ perl -MDevel::Peek -e '$a = "42"; $b = $a + 0; Dump $a'
SV = PVIV(0x556abce6aa10) at 0x556abce6d8f8
REFCNT = 1
FLAGS = (IOK,POK,IsCOW,pIOK,pPOK)
IV = 42
PV = 0x556abceaa690 "42"\0
CUR = 2
LEN = 10
COW_REFCNT = 1

(ignore that IsCOW. You can't rely on that)


Proposed solution:

Don't set SVf_POK in Perl_sv_2pv_flags() when converting IVs to PVs.
Basically this:

diff --git a/sv.c b/sv.c
index 3c69dece01..307846fb45 100644
--- a/sv.c
+++ b/sv.c
@@ -3141,7 +3141,19 @@ Perl_sv_2pv_flags(pTHX_ SV *const sv, STRLEN *const lp, const U32 flags)
Move(ptr, s, len, char);
s += len;
*s = '\0';
- SvPOK_on(sv);
+ /* We used to call SvPOK_on(). Whilst this is fine for (most) Perl code,
+ it means that after this stringification is cached, there is no way
+ to distinguish between values originally assigned as $a = 42; and
+ $a = "42"; (or results of string operators vs numeric operators)
+ where the value has subsequently been used in the other "context"
+ and had a value cached.
+ This (somewhat) hack means that we retain the cached stringification,
+ but the existing SvPV() macros end up entering this function (which
+ returns the cached value) instead of using it directly inline.
+ However, the result is that if a value is SVf_IOK|SVf_POK then it
+ originated as "42", whereas if it's SVf_IOK then it originated as 42.
+ (ignore SVp_IOK and SVp_POK) */
+ SvPOKp_on(sv);
}
else if (SvNOK(sv)) {
if (SvTYPE(sv) < SVt_PVNV)


Benefits of this:

$ ./perl -Ilib -MDevel::Peek -e '$a = 42; $b = "$a"; Dump $a'
SV = PVIV(0x557eacb3b2d0) at 0x557eacb3a7a0
REFCNT = 1
FLAGS = (IOK,pIOK,pPOK)
IV = 42
PV = 0x557eacb841d0 "42"\0
CUR = 2
LEN = 10
$ ./perl -Ilib -MDevel::Peek -e '$a = "42"; $b = $a + 0; Dump $a'
SV = PVIV(0x564aae2d82d0) at 0x564aae2d77a0
REFCNT = 1
FLAGS = (IOK,POK,IsCOW,pIOK,pPOK)
IV = 42
PV = 0x564aae320f60 "42"\0
CUR = 2
LEN = 10
COW_REFCNT = 1


if (flags & SVf_IOK|SVf_NOK) && !(flags & SVf_POK)
serealise as number
else if (flags & SVf_POK)
serealise as string
else
the existing guesswork ...


Potential problems:

Something is bound to break.

As to existing core code, the branch I pushed also has this hunk:

diff --git a/dist/Data-Dumper/Dumper.xs b/dist/Data-Dumper/Dumper.xs
index 6d73beca30..78005d3726 100644
--- a/dist/Data-Dumper/Dumper.xs
+++ b/dist/Data-Dumper/Dumper.xs
@@ -1569,7 +1569,7 @@ Data_Dumper_Dumpxs(href, ...)
else
(void)SvOK_off(name);

- if (SvPOK(name)) {
+ if (SvPOKp(name)) {
if ((SvPVX_const(name))[0] == '*') {
if (SvROK(val)) {
switch (SvTYPE(SvRV(val))) {


because, sigh, Data::Dumper is testing SvPOK.


However, it's wrong. The change exposes an existing bug. It's easier to
demonstrate with NVs:

$ /media/Tux/perls/bin/perl5.18.0 -MData::Dumper -e 'print Data::Dumper->Dumpxs(["pi"], [4.5/1.5])'
$3 = 'pi';
$ /media/Tux/perls/bin/perl5.20.0 -MData::Dumper -e 'print Data::Dumper->Dumpxs(["pi"], [4.5/1.5])'
$VAR1 = 'pi';


but it also applies to string overloading. I've just uploaded Data::Dumper
2.182_51 to CPAN with a general fix.


Anyway, changing this is likely to expose other problems, although I'll
guess that 50% will be existing bugs that can be tickled on 5.34.0 with
carefully crafted test casses.


Prototype implementation in https://github.com/nwc10/perl5/tree/IV-vs-PV

Please don't merge this (yet), as the DD fix should go into blead first
(once 2.183 is released) then these two commits get rebased on that.


Nicholas Clark
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Sat, 3 Jul 2021 13:40:06 +0000
Nicholas Clark <nick@ccl4.org> wrote:

> Problem - perl caches string, integer and floating point
> representations of values, and does this on reads. Hence serialisers
> can't distinguish between values that started as numbers but got used
> as strings, and the other way.

Typified by JSON encoders, which is where most folks will probably be
familiar with it. Though variations on the same theme (YAML, MsgPack,
even my own Tangence) will suffer the same problem.

...
> Proposed solution:
>
> Don't set SVf_POK in Perl_sv_2pv_flags() when converting IVs to PVs.

Sounds simple and elegant.

If this appears to work then I'm definitely in favour of this - it has
the potential to properly fix that long-standing issue.

> Anyway, changing this is likely to expose other problems, although
> I'll guess that 50% will be existing bugs that can be tickled on
> 5.34.0 with carefully crafted test casses.

This feels of the same order of magnitude as the unescaped `{` in
regexps thing. Something that, eh... someone's code was probably
written wrong technically, but it's arguable that core perl ought to
have documented it better. Expect lots of sore grumbling and little
fixes here and there to finally smooth things over.


But overall, yes - :+1: from me.

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Sat, Jul 3, 2021 at 9:40 AM Nicholas Clark <nick@ccl4.org> wrote:

> Proposed solution:
>
> Don't set SVf_POK in Perl_sv_2pv_flags() when converting IVs to PVs.
>

Forgive my tangential familiarity with this problem space, but a few
questions come to mind:

Could this have a significant performance impact since Perl and XS modules
will no longer trust the cached string value of such scalars?

Doesn't pPOK mean that the string value is not to be considered correct and
thus this would change the existing paradigm of how string/numeric values
work?

If either of those are true, could this be accomplished by instead adding a
new flag that means what we want it to mean (this was originally a
number/string)?

-Dan
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Sat, 3 Jul 2021 11:19:30 -0400
Dan Book <grinnz@gmail.com> wrote:

> If either of those are true, could this be accomplished by instead
> adding a new flag that means what we want it to mean (this was
> originally a number/string)?

The longstanding problem that there are no spare bits in SvFLAGS to use
for this :/ If there had been we'd have solved it ages ago. My recent
"can we get more SV types?" was an attempt at this, before Nick pointed
out -how many- more types we'd need, because of all of the subtypes
(e.g. SvPVMG)

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Sat, Jul 3, 2021 at 6:40 AM Nicholas Clark <nick@ccl4.org> wrote:

> Anyway, changing this is likely to expose other problems, although I'll
> guess that 50% will be existing bugs that can be tickled on 5.34.0 with
> carefully crafted test casses.
>

Storable should get a version bump, as since its serialization of data will
change, one should not be able to count on deserializing data on
perl-latest + Storable-latest that was serialized on perl-5.34.0 +
Storable-latest. I haven't tested this, but I expect that there are some
permutations of data where the data will not properly round-trip. (Since
Storable is already documented as its (de)serialization being
backwards-compatible but not forward-compatible, I reckon a simple version
bump should be sufficient.)

Other serializers/deserializers should also be checked (Sereal?) but since
they are not core, other mitigation steps will have to be taken.

That said, I am very ++++ on this idea. I have code that is very finicky
about checking these flags, which restricts just what can be done with data
that goes through this code, and it will be nice for those restrictions to
be loosened a bit and still count on the flags being preserved.
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On 2021-07-03 6:40 a.m., Nicholas Clark wrote:
> So, we currently say in "What needs an RFC? What can just be a PR?"
>
> 1. Language changes (feature changes to the parser, tokeniser)
> 2. Command line options
> 3. Adding/removing warnings (entries in `perldiag.pod`)
> 4. Significant changes to when an existing warning triggers
>
> https://github.com/Perl/RFCs/blob/master/docs/process.md
>
> The previous PSC had wondered whether we should add
>
> 5. Bug fixes that have the potential for significant runtime behaviour changes
> that can't be warned about and break things
>
> So, here goes
>
> Problem - perl caches string, integer and floating point representations of
> values, and does this on reads. Hence serialisers can't distinguish between
> values that started as numbers but got used as strings, and the other way.

An RFC can definitely be a PR, especially if its simple. The only real reason
not to use a PR as an RFC is if the proposer doesn't know how to implement the
code or if it would take too much effort to bother with if the idea isn't sound,
and so a non-PR RFC is done to avoid unnecessary work.

For what I want, having my own strongly typed serialization format
https://github.com/muldis/Muldis_Object_Notation it would be a boon to Perl
users for Perl scalars to be completely unambiguous of which of these Perl data
types the scalar started out as:

- undef
- boolean
- integer
- floating point number
- octet string
- character string

With regards to boolean, I mean that if the scalar value was the result of a
boolean operation such as (1 == 1) or (1 == 0) then simple analysis of the
scalar can distinguish this fact versus if the integer 1 or the empty string
were explicitly used instead. So how the value was created is tracked somehow.

Likewise, integer and float we consider those representations canonical if they
were the result of unquoted numeric syntax like 1 vs 1.0 or the result of a
mathematical operation.

Likewise, octet string is indicated if the scalar was produced by an operator
that was working in terms of octets such as a file read using binmode.

For what I want, I already have a backup way for Perl developers to be explicit
and unambiguous of what Muldis Object Notation type they want to represent, for
example ['Boolean', (1==1)] or ['Integer',
'45363254364326432634354325432234523'] vs ['Text','34543241'], but I want Perl
developers to not have to use such more verbose forms when the Perl scalar
itself can be unambiguously interpreted as exactly 1 of the stronger types.

-- Darren Duncan
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Sat, Jul 03, 2021 at 11:19:30AM -0400, Dan Book wrote:
> On Sat, Jul 3, 2021 at 9:40 AM Nicholas Clark <nick@ccl4.org> wrote:
>
> > Proposed solution:
> >
> > Don't set SVf_POK in Perl_sv_2pv_flags() when converting IVs to PVs.
> >
>
> Forgive my tangential familiarity with this problem space, but a few
> questions come to mind:
>
> Could this have a significant performance impact since Perl and XS modules
> will no longer trust the cached string value of such scalars?

The second commit on the branch change the SvPV macro (and all the rest)
to trust the cached value in this specific case.

To this:

#define SvPV_flags(sv, len, flags) \
((((SvFLAGS(sv) & (SVf_POK|SVs_GMG)) == SVf_POK) \
|| ((SvFLAGS(sv) & (SVf_IOK|SVp_POK|SVs_GMG)) == (SVf_IOK|SVp_POK))) \
? ((len = SvCUR(sv)), SvPVX(sv)) : sv_2pv_flags(sv, &len, flags))


from this:

#define SvPV_flags(sv, len, flags) \
(((SvFLAGS(sv) & (SVf_POK|SVs_GMG)) == SVf_POK) \
? ((len = SvCUR(sv)), SvPVX(sv)) : sv_2pv_flags(sv, &len, flags))


So, I doubt "significant" - there's one extra flags test (and I guess
branch) inlined everywhere, but no change in function calls

> Doesn't pPOK mean that the string value is not to be considered correct and
> thus this would change the existing paradigm of how string/numeric values
> work?

Code that asks "give me the value of this scalar as a string" or
"give me the value of this scalar as an integer" will be unchanged.

Code that thinks that SVf_POK matters is (likely) already broken with
objects that overload stringification.

(eg Data::Dumper in one place - until I push a fix tomorrow)

Code that tries to figure out whether a value is really a string or a number
might take a different path. Which was kind of the intent :-)

(But only kind of. The intent was to enable such code to get a "right"
answer reliably. Some code might start deciding that values are integers
where previously it was thinking string. But this will *only* be for values
that were *written* as integers and then *read* later in a string context.
So the value is equally correctly representable as either.)



So yes, I think that this stuff might expose problems, but I'm relatively
confident that it will be in code which is already fragile, and not in most
code.


On Sat, Jul 03, 2021 at 04:28:22PM +0100, Paul "LeoNerd" Evans wrote:
> On Sat, 3 Jul 2021 11:19:30 -0400
> Dan Book <grinnz@gmail.com> wrote:
>
> > If either of those are true, could this be accomplished by instead
> > adding a new flag that means what we want it to mean (this was
> > originally a number/string)?
>
> The longstanding problem that there are no spare bits in SvFLAGS to use
> for this :/ If there had been we'd have solved it ages ago. My recent
> "can we get more SV types?" was an attempt at this, before Nick pointed
> out -how many- more types we'd need, because of all of the subtypes
> (e.g. SvPVMG)

Changing the available SV types also has the potential to break any XS
code on CPAN that has a switch statement on all SV types (or all scalar
SV types). I'm not saying that anyone *should* do this (instead of using
API calls and flag tests to work things out), but authors do.

Nicholas Clark
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
??, 4 ???. 2021 ?. ? 22:51, Nicholas Clark <nick@ccl4.org>:
>Code that tries to figure out whether a value is really a string or a number
might take a different path.

I think we should carefully consider the effect on darkpan/custom
serializers. They need to do exactly this, and we can't predict the
ways end systems will break - external APIs in statically typed
languages care about proper types a lot. And the breakage may not show
immediately.

Best regards,
Sergey Aleynikov
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Mon, Jul 05, 2021 at 10:01:46AM +0300, Sergey Aleynikov wrote:
> ??, 4 ???. 2021 ?. ? 22:51, Nicholas Clark <nick@ccl4.org>:
> >Code that tries to figure out whether a value is really a string or a number
> might take a different path.
>
> I think we should carefully consider the effect on darkpan/custom
> serializers. They need to do exactly this, and we can't predict the
> ways end systems will break - external APIs in statically typed
> languages care about proper types a lot. And the breakage may not show
> immediately.

I'm not even sure *how* we can do this any better than the code we *can* see
on CPAN. This code doesn't change any *APIs*. It changes internal
implementation details that aren't documented. Some XS code will always end
up relying on implementation details. It's just in this case that we are
deliberate changing some implementation details that are sort-of intended to
be relied upon *in one specific case*.

We can't see such code. We can only estimate the chance of breakage based on
the breakage we do see on the code we can see. For CPAN, Storable, Sereal
and JSON::XS all pass their tests with these changes (along with all there
dependencies)

(as does the core, now that I've fixed the bug in Data::Dumper. That XS code
already failed with overloaded stringification and pure floating point
values)


So far, this change has exposed 1 real bug, and not broken some of the code
that we thought most likely to be affected.

Nicholas Clark
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Sat, Jul 03, 2021 at 09:53:50AM -0700, Karen Etheridge wrote:
> On Sat, Jul 3, 2021 at 6:40 AM Nicholas Clark <nick@ccl4.org> wrote:
>
> > Anyway, changing this is likely to expose other problems, although I'll
> > guess that 50% will be existing bugs that can be tickled on 5.34.0 with
> > carefully crafted test casses.
> >
>
> Storable should get a version bump, as since its serialization of data will
> change, one should not be able to count on deserializing data on
> perl-latest + Storable-latest that was serialized on perl-5.34.0 +
> Storable-latest. I haven't tested this, but I expect that there are some
> permutations of data where the data will not properly round-trip. (Since
> Storable is already documented as its (de)serialization being
> backwards-compatible but not forward-compatible, I reckon a simple version
> bump should be sufficient.)

Yes, thanks for mentioning this. I hadn't thought too hard about this.

So then I went digging...

Storable doesn't actually round trip *representation*. We already have things
like this:

$ perl -MDevel::Peek -MStorable=freeze,thaw -e 'my $u = ~0; Dump($u); my $normal = thaw(freeze(\$u)); Dump $$normal'
SV = IV(0x560407aa4a08) at 0x560407aa4a18
REFCNT = 1
FLAGS = (IOK,pIOK,IsUV)
UV = 18446744073709551615
SV = PV(0x560407a76e00) at 0x560407a9a7f0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x560407aa6f10 "18446744073709551615"\0
CUR = 20
LEN = 22


The value is preserved, but not the flags that show how it started.

None of Storable's code changes, so it isn't adding new tag types or anything
existing Storable can't read. Storable isn't set up to try to distinguish
numbers vs strings, so its XS code doesn't benefit from changing.

I think that the only difference is the path taken through this C:


} else if (flags & SVf_POK) {
/* public string - go direct to string read. */
goto string_readlen;
} else if (
#if PERL_VERSION_LT(5,7,0)
/* For 5.6 and earlier NV flag trumps IV flag, so only use integer
direct if NV flag is off. */
(flags & (SVf_NOK | SVf_IOK)) == SVf_IOK
#else
/* 5.7 rules are that if IV public flag is set, IV value is as
good, if not better, than NV value. */
flags & SVf_IOK
#endif
) {
iv = SvIV(sv);
/*
* Will come here from below with iv set if double is an integer.
*/
integer:


(and for the tests I wrote, this was all I could detect) - for integer
values that had been read in a string context (and are not "too large" as
per its current heuristics), previously Storable would store them with the
tag SX_SCALAR, and now it will store then with the tag SX_INTEGER, and
the reading code will create them as SVt_IV now, instead of SVt_PV
previously.

> Other serializers/deserializers should also be checked (Sereal?) but since
> they are not core, other mitigation steps will have to be taken.

Sereal passes all its test. As does Cpanel::JSON:::XS (and JSON::XS)

> That said, I am very ++++ on this idea. I have code that is very finicky
> about checking these flags, which restricts just what can be done with data
> that goes through this code, and it will be nice for those restrictions to
> be loosened a bit and still count on the flags being preserved.


Thanks ETHER (and others) for the useful feedback. As no-one actually had
a firm opinion of whether this really needed an RFC, or if PR code review
was good enough, I took an "executive decision" (ie "sod it") and make a PR.

https://github.com/Perl/perl5/pull/18958


If that *wasn't* what was wanted, I missed anyone saying so. :-)


Nicholas Clark
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Sat, Jul 03, 2021 at 01:40:06PM +0000, Nicholas Clark wrote:
> Problem - perl caches string, integer and floating point representations of
> values, and does this on reads. Hence serialisers can't distinguish between
> values that started as numbers but got used as strings, and the other way.

Is the 'original' type actual useful and/or well-defined? For example in

while (<DATA>) {
chomp;
my ($x,$y, $description) = split;
}
__END_
1,2,foo bar
3,4,baz boz

should $x's original value be regarded as an integer or a string?
If instead one were to use an XS-based CSV parser, should that parser
be returning IOK_only-SVs for the first two columns?

I'm not arguing either way, and I'm sure others have thought about this a
lot more than me. I just have this uneasy feeling that claiming that
serialisers "need" to know the "original" type is an unreasonable demand.

--
The Enterprise is involved in a bizarre time-warp experience which is in
some way unconnected with the Late 20th Century.
-- Things That Never Happen in "Star Trek" #14
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
> On Jul 5, 2021, at 12:33 PM, Dave Mitchell <davem@iabyn.com> wrote:
>
> On Sat, Jul 03, 2021 at 01:40:06PM +0000, Nicholas Clark wrote:
>> Problem - perl caches string, integer and floating point representations of
>> values, and does this on reads. Hence serialisers can't distinguish between
>> values that started as numbers but got used as strings, and the other way.
>
> Is the 'original' type actual useful and/or well-defined? For example in
>
> while (<DATA>) {
> chomp;
> my ($x,$y, $description) = split;
> }
> __END_
> 1,2,foo bar
> 3,4,baz boz
>
> should $x's original value be regarded as an integer or a string?

FWIW I would say that all of the scalars are strings here. Everything `split` returns would be a string (POK_only or POK_only_UTF8).

> If instead one were to use an XS-based CSV parser, should that parser
> be returning IOK_only-SVs for the first two columns?

I would think no difference from the `split` implementation since CSV (RFC 4180) doesn’t seem to differentiate.

-F
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Mon, 5 Jul 2021 17:33:48 +0100, Dave Mitchell <davem@iabyn.com>
wrote:

> On Sat, Jul 03, 2021 at 01:40:06PM +0000, Nicholas Clark wrote:
> > Problem - perl caches string, integer and floating point
> > representations of values, and does this on reads. Hence
> > serialisers can't distinguish between values that started as
> > numbers but got used as strings, and the other way.
>
> Is the 'original' type actual useful and/or well-defined? For example
> in
>
> while (<DATA>) {
> chomp;
> my ($x,$y, $description) = split;

^ => /,/
> }
> __END_
^ _
> 1,2,foo bar
> 3,4,baz boz
>
> should $x's original value be regarded as an integer or a string?
> If instead one were to use an XS-based CSV parser, should that parser
> be returning IOK_only-SVs for the first two columns?

That is already possible, but only if specifically asked for:

use 5.16.2;
use Text::CSV_XS;
use Data::Peek;

my $csv = Text::CSV_XS->new;
$csv->types ([
Text::CSV_XS::IV (),
Text::CSV_XS::IV (),
Text::CSV_XS::PV (),
]);

while (my $row = $csv->getline (*DATA)) {
DDump ($row->[0]);
}
__END__
1,2,foo bar
3,4,baz boz


SV = PVIV(0x10f4818) at 0x10ed808
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 1
PV = 0x10e8160 "1"\0
CUR = 1
LEN = 10
SV = PVIV(0x10f4818) at 0x10ed700
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 3
PV = 0x2c78c60 "3"\0
CUR = 1
LEN = 10


And that can be changed to be IV-only
I can even imagine an attribute that tells the parser to return an
IV-only if the cell matches ^-[0-9]+$, but that would certainly slow
down

> I'm not arguing either way, and I'm sure others have thought about
> this a lot more than me. I just have this uneasy feeling that
> claiming that serialisers "need" to know the "original" type is an
> unreasonable demand.

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Mon, 5 Jul 2021 17:33:48 +0100
Dave Mitchell <davem@iabyn.com> wrote:

> Is the 'original' type actual useful and/or well-defined? For example in
>
> while (<DATA>) {
> chomp;
> my ($x,$y, $description) = split;
> }
> __END_
> 1,2,foo bar
> 3,4,baz boz
>
> should $x's original value be regarded as an integer or a string?
> If instead one were to use an XS-based CSV parser, should that parser
> be returning IOK_only-SVs for the first two columns?
>
> I'm not arguing either way, and I'm sure others have thought about this a
> lot more than me. I just have this uneasy feeling that claiming that
> serialisers "need" to know the "original" type is an unreasonable demand.

This is a very interesting case. How does one convert a string returned
by split() to a number? Are we going to expose flag-manipulation
functions to the users?.

It seems we're trying to promote IOK/POK/NOK from internal flags to
semi-user-visible semi-types and I don't like it. That's not how the
language works. The proposal basically adds a new, hidden, *immutable*
property to variables. I'm sure that will result in a lot of confusion.

More importantly, I believe the only thing that matters to the
serializers is what the variable contains *now*, not how it started. We
have accept that that without explicit type hints, it's impossible to
tell that.
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
> On Jul 5, 2021, at 2:19 PM, Tomasz Konojacki <me@xenu.pl> wrote:
>
> On Mon, 5 Jul 2021 17:33:48 +0100
> Dave Mitchell <davem@iabyn.com> wrote:
>
>> Is the 'original' type actual useful and/or well-defined? For example in
>>
>> while (<DATA>) {
>> chomp;
>> my ($x,$y, $description) = split;
>> }
>> __END_
>> 1,2,foo bar
>> 3,4,baz boz
>>
>> should $x's original value be regarded as an integer or a string?
>> If instead one were to use an XS-based CSV parser, should that parser
>> be returning IOK_only-SVs for the first two columns?
>>
>> I'm not arguing either way, and I'm sure others have thought about this a
>> lot more than me. I just have this uneasy feeling that claiming that
>> serialisers "need" to know the "original" type is an unreasonable demand.
>
> This is a very interesting case. How does one convert a string returned
> by split() to a number? Are we going to expose flag-manipulation
> functions to the users?.

Currently if you do $foo += 0 then $foo will be IOK_only. It’s a hacky, but widely used, trick to coerce reliable-ish types from JSON et al.

-F
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Mon, 5 Jul 2021 09:58:22 +0000
Nicholas Clark <nick@ccl4.org> wrote:

> Thanks ETHER (and others) for the useful feedback. As no-one actually
> had a firm opinion of whether this really needed an RFC, or if PR
> code review was good enough, I took an "executive decision" (ie "sod
> it") and make a PR.
>
> https://github.com/Perl/perl5/pull/18958
>
>
> If that *wasn't* what was wanted, I missed anyone saying so. :-)

With my newly-acquired PSC hat on:

LGTM

As in: This feels like a "bugfix" - albeit a fix for a very
longstanding conceptual bug, but still. It's not changing anything
Perl-visible, just some inner details visible to XS that a few modules
may have inadvertently been relying on. It puts the language in a much
better place, so yes - good to move it forward.

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On 2021-07-05 9:33 a.m., Dave Mitchell wrote:
> On Sat, Jul 03, 2021 at 01:40:06PM +0000, Nicholas Clark wrote:
>> Problem - perl caches string, integer and floating point representations of
>> values, and does this on reads. Hence serialisers can't distinguish between
>> values that started as numbers but got used as strings, and the other way.
>
> Is the 'original' type actual useful and/or well-defined? For example in
>
> while (<DATA>) {
> chomp;
> my ($x,$y, $description) = split;
> }
> __END_
> 1,2,foo bar
> 3,4,baz boz
>
> should $x's original value be regarded as an integer or a string?
> If instead one were to use an XS-based CSV parser, should that parser
> be returning IOK_only-SVs for the first two columns?
>
> I'm not arguing either way, and I'm sure others have thought about this a
> lot more than me. I just have this uneasy feeling that claiming that
> serialisers "need" to know the "original" type is an unreasonable demand.

Unless CSV itself is unambiguously strongly typed, then all 3 of these are 100%
original value is character strings, there are no integers here.

For what matters for Perl internals, "started as" is entirely about what logic
produced the value.

my $x = '29' . '95';

my $y = 29 + 95;

my $z = 29.0 + 95.0;

my $w = (1 == 1);

In the above examples, $x "started as" a character string while $y "started as"
an integer while $z "started as" a float while $w "started as" a boolean.

Your CSV example is string processing so its result "started as" a string. Any
other interpretation is for user code to manipulate and not Perl itself

-- Darren Duncan
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Sat, Jul 03, 2021 at 09:53:50AM -0700, Karen Etheridge wrote:

> Other serializers/deserializers should also be checked (Sereal?) but since
> they are not core, other mitigation steps will have to be taken.
>
> That said, I am very ++++ on this idea. I have code that is very finicky
> about checking these flags, which restricts just what can be done with data
> that goes through this code, and it will be nice for those restrictions to
> be loosened a bit and still count on the flags being preserved.

I'm very ++++ too even though it's probably going to break my tests.
It'll do that cos my code is (I think!) merely as correct as currently
possible, not *really* correct, so it's the good kind of breakage.

--
David Cantrell | Hero of the Information Age

All principles of gravity are negated by fear
-- Cartoon Law IV
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Sat, 3 Jul 2021 at 15:40, Nicholas Clark <nick@ccl4.org> wrote:

> So, we currently say in "What needs an RFC? What can just be a PR?"
>
> 1. Language changes (feature changes to the parser, tokeniser)
> 2. Command line options
> 3. Adding/removing warnings (entries in `perldiag.pod`)
> 4. Significant changes to when an existing warning triggers
>
> https://github.com/Perl/RFCs/blob/master/docs/process.md
>
>
> The previous PSC had wondered whether we should add
>
> 5. Bug fixes that have the potential for significant runtime behaviour
> changes
> that can't be warned about and break things
>
>
> So, here goes
>
> Problem - perl caches string, integer and floating point representations of
> values, and does this on reads. Hence serialisers can't distinguish between
> values that started as numbers but got used as strings, and the other way.
>
> $ perl -MDevel::Peek -e '$a = 42; $b = "$a"; Dump $a'
> SV = PVIV(0x556b723ebad0) at 0x556b723ee9b8
> REFCNT = 1
> FLAGS = (IOK,POK,pIOK,pPOK)
> IV = 42
> PV = 0x556b723df720 "42"\0
> CUR = 2
> LEN = 10
> $ perl -MDevel::Peek -e '$a = "42"; $b = $a + 0; Dump $a'
> SV = PVIV(0x556abce6aa10) at 0x556abce6d8f8
> REFCNT = 1
> FLAGS = (IOK,POK,IsCOW,pIOK,pPOK)
> IV = 42
> PV = 0x556abceaa690 "42"\0
> CUR = 2
> LEN = 10
> COW_REFCNT = 1
>
> (ignore that IsCOW. You can't rely on that)
>
>
> Proposed solution:
>
> Don't set SVf_POK in Perl_sv_2pv_flags() when converting IVs to PVs.
>

I have not had the impression that the problem with processing flags comes
from stringifying things that start out as IV or NV.

As far as I can tell It is the reverse. When we numify a string, we, we set
it to be IOK or NOK in circumstances we shouldnt.

$ perl -MDevel::Peek -le'use warnings; my $str=" 1 "; my $v= 0+$str;
Dump($str)'
SV = PVIV(0x12af230) at 0x12b4b08
REFCNT = 1
FLAGS = (IOK,POK,IsCOW,pIOK,pPOK)
IV = 1
PV = 0x12b8170 " 1 "\0
CUR = 5
LEN = 10
COW_REFCNT = 1

$ perl -MDevel::Peek -le'use warnings; my $str=" 1p "; my $v= 0+$str;
Dump($str)'
Argument " 1p " isn't numeric in addition (+) at -e line 1.
SV = PVNV(0x2262320) at 0x2284b08
REFCNT = 1
FLAGS = (POK,IsCOW,pIOK,pNOK,pPOK)
IV = 1
NV = 1
PV = 0x2288170 " 1p "\0
CUR = 6
LEN = 10
COW_REFCNT = 1

The problem comes from the first case, where the SV is marked POK and IOK
after numifying the string. It is an example of where we do the wrong thing.

It is actually relatively easy to check if the IV slot can be trusted when
you ONLY assume existing Perl internal semantics. You check if the string
starts or ends with whitespace, or starts with a 0 and is longer than 1
character.

Where you get nailed however is on dualvars. Dualvars mess with everything.
It means you simply can never ever trust the flags in the SV for type data,
and fixing core doesnt really work semantic wise.

So it feels to me what we really need is a flag that says the var is a dual
var, and we should change the PV->(I|N)V conversion logic to not IOK or NOK
when there are leading or trailing spaces.










> Basically this:
>
> diff --git a/sv.c b/sv.c
> index 3c69dece01..307846fb45 100644
> --- a/sv.c
> +++ b/sv.c
> @@ -3141,7 +3141,19 @@ Perl_sv_2pv_flags(pTHX_ SV *const sv, STRLEN *const
> lp, const U32 flags)
> Move(ptr, s, len, char);
> s += len;
> *s = '\0';
> - SvPOK_on(sv);
> + /* We used to call SvPOK_on(). Whilst this is fine for (most)
> Perl code,
> + it means that after this stringification is cached, there is
> no way
> + to distinguish between values originally assigned as $a = 42;
> and
> + $a = "42"; (or results of string operators vs numeric
> operators)
> + where the value has subsequently been used in the other
> "context"
> + and had a value cached.
> + This (somewhat) hack means that we retain the cached
> stringification,
> + but the existing SvPV() macros end up entering this function
> (which
> + returns the cached value) instead of using it directly inline.
> + However, the result is that if a value is SVf_IOK|SVf_POK then
> it
> + originated as "42", whereas if it's SVf_IOK then it originated
> as 42.
> + (ignore SVp_IOK and SVp_POK) */
> + SvPOKp_on(sv);
> }
> else if (SvNOK(sv)) {
> if (SvTYPE(sv) < SVt_PVNV)
>
>
> Benefits of this:
>
> $ ./perl -Ilib -MDevel::Peek -e '$a = 42; $b = "$a"; Dump $a'
> SV = PVIV(0x557eacb3b2d0) at 0x557eacb3a7a0
> REFCNT = 1
> FLAGS = (IOK,pIOK,pPOK)
> IV = 42
> PV = 0x557eacb841d0 "42"\0
> CUR = 2
> LEN = 10
> $ ./perl -Ilib -MDevel::Peek -e '$a = "42"; $b = $a + 0; Dump $a'
> SV = PVIV(0x564aae2d82d0) at 0x564aae2d77a0
> REFCNT = 1
> FLAGS = (IOK,POK,IsCOW,pIOK,pPOK)
> IV = 42
> PV = 0x564aae320f60 "42"\0
> CUR = 2
> LEN = 10
> COW_REFCNT = 1
>
>
> if (flags & SVf_IOK|SVf_NOK) && !(flags & SVf_POK)
> serealise as number
> else if (flags & SVf_POK)
> serealise as string
> else
> the existing guesswork ...
>
>
> Potential problems:
>
> Something is bound to break.
>
> As to existing core code, the branch I pushed also has this hunk:
>
> diff --git a/dist/Data-Dumper/Dumper.xs b/dist/Data-Dumper/Dumper.xs
> index 6d73beca30..78005d3726 100644
> --- a/dist/Data-Dumper/Dumper.xs
> +++ b/dist/Data-Dumper/Dumper.xs
> @@ -1569,7 +1569,7 @@ Data_Dumper_Dumpxs(href, ...)
> else
> (void)SvOK_off(name);
>
> - if (SvPOK(name)) {
> + if (SvPOKp(name)) {
> if ((SvPVX_const(name))[0] == '*') {
> if (SvROK(val)) {
> switch (SvTYPE(SvRV(val))) {
>
>
> because, sigh, Data::Dumper is testing SvPOK.
>
>
> However, it's wrong. The change exposes an existing bug. It's easier to
> demonstrate with NVs:
>
> $ /media/Tux/perls/bin/perl5.18.0 -MData::Dumper -e 'print
> Data::Dumper->Dumpxs(["pi"], [4.5/1.5])'
> $3 = 'pi';
> $ /media/Tux/perls/bin/perl5.20.0 -MData::Dumper -e 'print
> Data::Dumper->Dumpxs(["pi"], [4.5/1.5])'
> $VAR1 = 'pi';
>

This reminds me that there is an interesting special case with exponents.
Try numifying various engineering notation.

norole:yorton@psykopsis:master:/git_tree/DFA-Trim$ perl -MDevel::Peek
-le'use warnings; my $str="0E0"; my $v= 0+$str; Dump($str)'
SV = PVNV(0x12a2320) at 0x12c4af8
REFCNT = 1
FLAGS = (IOK,NOK,POK,IsCOW,pIOK,pNOK,pPOK)
IV = 0
NV = 0
PV = 0x12c8160 "0E0"\0
CUR = 3
LEN = 10
COW_REFCNT = 1

$ perl -MDevel::Peek -le'use warnings; my $str="1E0"; my $v= 0+$str;
Dump($str)'
SV = PVNV(0x7fd320) at 0x81faf8
REFCNT = 1
FLAGS = (IOK,NOK,POK,IsCOW,pIOK,pNOK,pPOK)
IV = 1
NV = 1
PV = 0x823160 "1E0"\0
CUR = 3
LEN = 10
COW_REFCNT = 1

$ perl -MDevel::Peek -le'use warnings; my $str="1E3"; my $v= 0+$str;
Dump($str)'
SV = PVNV(0xc45320) at 0xc67af8
REFCNT = 1
FLAGS = (IOK,NOK,POK,IsCOW,pIOK,pNOK,pPOK)
IV = 1000
NV = 1000
PV = 0xc6b160 "1E3"\0
CUR = 3
LEN = 10
COW_REFCNT = 1

$ perl -MDevel::Peek -le'use warnings; my $str="1.3E3"; my $v= 0+$str;
Dump($str)'
SV = PVNV(0x140a320) at 0x142cb08
REFCNT = 1
FLAGS = (IOK,NOK,POK,IsCOW,pIOK,pNOK,pPOK)
IV = 1300
NV = 1300
PV = 0x1430170 "1.3E3"\0
CUR = 5
LEN = 10
COW_REFCNT = 1

$ perl -MDevel::Peek -le'use warnings; my $str=" 1.3E3 "; my $v= 0+$str;
Dump($str)'
SV = PVNV(0x1449320) at 0x146bb08
REFCNT = 1
FLAGS = (IOK,NOK,POK,IsCOW,pIOK,pNOK,pPOK)
IV = 1300
NV = 1300
PV = 0x146f170 " 1.3E3 "\0
CUR = 7
LEN = 10
COW_REFCNT = 1

>
>
> but it also applies to string overloading. I've just uploaded Data::Dumper
> 2.182_51 to CPAN with a general fix.
>
>
> Anyway, changing this is likely to expose other problems, although I'll
> guess that 50% will be existing bugs that can be tickled on 5.34.0 with
> carefully crafted test casses.
>
>
> Prototype implementation in https://github.com/nwc10/perl5/tree/IV-vs-PV
>
> Please don't merge this (yet), as the DD fix should go into blead first
> (once 2.183 is released) then these two commits get rebased on that.
>

I actually dont think this will help much. The operation it saves is
essentially constant time. What it doesn't solve is the main problem, true
dual vars where the SV has been deliberately set to be IOK and POK at the
same time.

For example $).

$ perl -MDevel::Peek -le'$)="1 1"; print Dump($))'
SV = PVMG(0x1518fc0) at 0x14efae8
REFCNT = 1
FLAGS = (GMG,SMG,POK,IsCOW,pPOK)
IV = 0
NV = 0
PV = 0x1503770 "1 1"\0
CUR = 3
LEN = 10
COW_REFCNT = 1
MAGIC = 0x15263b0
MG_VIRTUAL = &PL_vtbl_sv
MG_TYPE = PERL_MAGIC_sv(\0)
MG_OBJ = 0x14efa58
MG_LEN = 1
MG_PTR = 0x152fef0 ")"

$ perl -MDevel::Peek -le'$)="1 1"; print $); print Dump($))'
1000 4 24 27 30 46 115 127 1000
SV = PVMG(0x1300fc0) at 0x12d7ae8
REFCNT = 1
FLAGS = (GMG,SMG,IOK,POK,pIOK,pPOK)
IV = 1000
NV = 0
PV = 0x12cf670 "1000 4 24 27 30 46 115 127 1000"\0
CUR = 31
LEN = 40
MAGIC = 0x130e3b0
MG_VIRTUAL = &PL_vtbl_sv
MG_TYPE = PERL_MAGIC_sv(\0)
MG_OBJ = 0x12d7a58
MG_LEN = 1
MG_PTR = 0x1317ef0 ")"

And also $!:

$ perl -MDevel::Peek -le'$!=123; print 0+$!; print $!; print Dump($!)'
123
No medium found
SV = PVMG(0x1a3dfd0) at 0x1a14af8
REFCNT = 1
FLAGS = (GMG,SMG,NOK,POK,pNOK,pPOK)
IV = 123
NV = 123
PV = 0x1a28780 "No medium found"\0
CUR = 15
LEN = 17
MAGIC = 0x1a1fe90
MG_VIRTUAL = &PL_vtbl_sv
MG_TYPE = PERL_MAGIC_sv(\0)
MG_OBJ = 0x1a14a68
MG_LEN = 1
MG_PTR = 0x1a54f00 "!"

Notice that one is a dual-var AND has magic. But its a PV/NV dualvar, for
some reason it doesnt set the IOK flag.

IMO we should havea dual flag, and be done with this nonsense, and
something should only be POK and IOK or POK and NOK if you can take the IV
and create the PV from it, and vice versa with no loss. Same for NV.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Mon, 5 Jul 2021 at 18:34, Dave Mitchell <davem@iabyn.com> wrote:

> On Sat, Jul 03, 2021 at 01:40:06PM +0000, Nicholas Clark wrote:
> > Problem - perl caches string, integer and floating point representations
> of
> > values, and does this on reads. Hence serialisers can't distinguish
> between
> > values that started as numbers but got used as strings, and the other
> way.
>
> Is the 'original' type actual useful and/or well-defined? For example in
>
> while (<DATA>) {
> chomp;
> my ($x,$y, $description) = split;
> }
> __END_
> 1,2,foo bar
> 3,4,baz boz
>
> should $x's original value be regarded as an integer or a string?
> If instead one were to use an XS-based CSV parser, should that parser
> be returning IOK_only-SVs for the first two columns?
>
> I'm not arguing either way, and I'm sure others have thought about this a
> lot more than me. I just have this uneasy feeling that claiming that
> serialisers "need" to know the "original" type is an unreasonable demand.
>

From the point of view of perl to perl serialization I agree the original
type is uninteresting, what is interesting is whether the PV is a canonical
representation of the NV or IV.

If I can take the NV/IV and produce the PV exactly, and I can take the PV
and produce the NV/IV exactly then it doesn't matter which is choose for
serialization.

In Sereal we prefer the numeric representation, especially for IV's. But we
want to detect when the IV can not be used to reconsitute the PV.

Where the "type" of the value comes in is when you are doing interop with
another language, but its not clear to me that knowing the original type
is really helpful, you want to know the expected type for the serialization
you are producing, which usually means some kind of schema if you are to
do it safely. Eg, consider booleans do not interop well at all.

cheers,
Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: elevator pitch for IV vs PV serialisation [ In reply to ]
On Mon, 5 Jul 2021 at 11:58, Nicholas Clark <nick@ccl4.org> wrote:

>
> Sereal passes all its test.


Yeah, i wouldnt expect this to change Sereal at all, it already follows the
logic you described. I dont think it helps it at all either as i mentioned
previously the case that is problematic is intentional dualvars, exponents,
and strings with leading and trailing whitespace. All can end up in a state
where they are POK/IOK/NOK but the IV/NV cannot be used to faithfully
reconstruct the PV, and worse, there is no constant time operation to
detect if this is the case, you have to scan the string, or convert the
IV/NV to a string, and then compare to see if you have a dualvar. :-(

What we need is a patch that makes sure that the IOK/NOK flags are NOT set
when the PV->IV or NV operation is not "perfect"/"canonical", if instead we
set them to pIOK and pNOK then we would be fine. But last i looked into it
the flags actually mean "should I warn when I use this for this operation",
not so much whether the operation is canonical or not.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"