Mailing List Archive

qr'$foo' and unescaped $ signs
One of the obscure features of regular expressions is that if one uses a
single quote as the delimiter, no interpolation takes place.

One result of this - there's a Data::Dumper bug about handling them:

https://rt.cpan.org/Public/Bug/Display.html?id=84569

You get this:

$ perl -MData::Dumper -w
print Dumper(qr'$foo');
__END__
$VAR1 = qr/(?^:$foo)/;


That is not correct (and it's still the behaviour in blead).

However, I'm struggling: is this a core bug, or a DD bug? How do these
crazy regexes really work?

Say I take this:

$ cat ~/test/single.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Devel::Peek;

use re 'debug';

my $re = qr'$foo';

print "$re\n";
print Dumper($re);
++$Data::Dumper::Useperl;
print Dumper($re);

print "\$foo" =~ $re ? "Match\n" : "not\n";

Dump($re);

__END__

$ perl ~/test/single.pl
Compiling REx "$foo"
Final program:
1: SEOL (2)
2: EXACT <foo> (4)
4: END (0)
anchored "foo" at 0..0 (checking anchored) minlen 3
(?^:$foo)
$VAR1 = qr/$foo/;
$VAR1 = qr/$foo/;
Matching REx "$foo" against "$foo"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..4] gave 1
Found anchored substr "foo" at offset 1 (rx_origin now 1)...
(multiline anchor test skipped)
try at offset...
Intuit: Successfully guessed: match at offset 1
1 <$> <foo> | 0| 1:SEOL(2)
| 0| failed...
Match failed
not
SV = IV(0xbc3920) at 0xbc3920
REFCNT = 1
FLAGS = (ROK)
RV = 0xbbe138
SV = REGEXP(0xcb5944) at 0xbbe138
REFCNT = 1
FLAGS = (OBJECT,POK,FAKE,pPOK)
PV = 0xcbf1f8 "(?^:$foo)"
...



Compare this with *trying* to re-write qr'$foo' as qr/\$foo/:

$ cat ~/test/normal.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Devel::Peek;

use re 'debug';

my $re = qr/\$foo/;

print "$re\n";
print Dumper($re);
++$Data::Dumper::Useperl;
print Dumper($re);

print "\$foo" =~ $re ? "Match\n" : "not\n";

Dump($re);

__END__

$ perl ~/test/normal.pl
Compiling REx "\$foo"
Final program:
1: EXACT <$foo> (3)
3: END (0)
anchored "$foo" at 0..0 (checking anchored isall) minlen 4
(?^:\$foo)
$VAR1 = qr/\$foo/;
$VAR1 = qr/\$foo/;
Matching REx "\$foo" against "$foo"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..4] gave 0
Found anchored substr "$foo" at offset 0 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
Match
SV = IV(0x1726920) at 0x1726920
REFCNT = 1
FLAGS = (ROK)
RV = 0x1721138
SV = REGEXP(0x18189d4) at 0x1721138
REFCNT = 1
FLAGS = (OBJECT,POK,FAKE,pPOK)
PV = 0x1821e88 "(?^:\\$foo)"
...




My *first* thought when looking at all the Devel::Peek output was "there is
nothing recording that the regex was written with '' - this is wrong?"

And as there's no hint about the '' in the internal state, this must be
buggy because anything trying to interpolate them is going to mistake
'$foo' for $foo and violate strict.

So I played, and nothing goes wrong.


After quite a while, my *second* thought was "hang on - those two aren't
the same regular expression" - ie qr'$foo' isn't qr/\$foo/


If you look at the regex debug output, qr'$foo' is

1) the anchor $
2) the 3 character fixed string foo


while qr/\$foo/ is

2) the 4 character fixed string $foo


at which point it's starting to make slightly more sense.

(the second regex differs in that it has the "isall" flag set on it, which
isn't because of the '' vs //, but because it's entirely a literal string,
so the result from the optimiser can be used directly, without hitting the
main regex engine.)



So, I think what matters here is that scalar variable interpolation happens
early in regex compilation, and only happens once. If you interpolate a
regex, the *literal* content of that regex is interpolated (however strange)
but isn't then subject to another round of variable interpolation.
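
A minimal sketch of that single round of interpolation, assuming a reasonably
recent perl; the variable names are illustrative only and are not taken from
the thread. The literal text of the inner regex is spliced into the outer one,
and strict never sees a variable called $foo:

```perl
use strict;
use warnings;

# qr'' suppresses interpolation, so this is the anchor $ followed by "foo".
my $inner = qr'$foo';

# Interpolating $inner splices in its literal text. That text is not scanned
# a second time for variables, so strict has nothing to complain about.
my $outer = qr/bar$inner/;

print "$inner\n";   # (?^:$foo)
print "$outer\n";   # (?^:bar(?^:$foo))
```

If the spliced text were re-scanned, the second qr would fail to compile under
strict, because no $foo is declared.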

So, I'm right in thinking

1) the perl core is correct in what it stores into the C structures?
2) what Devel::Peek reports is sane?
3) this all works with interpolation?
4) if I see \$ in the C structures, it came from \$ in qr// qr"" etc?

(even if the match itself makes no sense, and won't work without //m.)
(In which case, the bug *is* in Data::Dumper.)

So I guess, the question is:

If the PV stored for a regex has an *unescaped* dollar sign in it it *has*
to have originated from a regex written with '' ?

ie for all the other pattern delimiters, which are doing what perlre calls
"double-quotish context", the two literals in the pattern \$ are passed on
downwards as \$ - ie they don't just suppress interpolation in the perl parser,
but also are passed down to regcomp.c as a literal pair \$


meaning that if one scans the internal string of the regex (stored as the PV)
and finds an unescaped $ anywhere (other than the last character) then

1) the *only* way that could have got there was by being written as qr''
2) the *only* way to convert that back to a regex is to output qr''
(ie there exists no escaping syntax capable of recreating it inside qr//)



I suspect that this rabbit hole goes deeper. What did I miss? What did I get
wrong?

Nicholas Clark
Re: qr'$foo' and unescaped $ signs
On Tue, Jun 29, 2021 at 4:31 PM Nicholas Clark <nick@ccl4.org> wrote:

> So I guess, the question is:
>
> If the PV stored for a regex has an *unescaped* dollar sign in it it *has*
> to have originated from a regex written with '' ?
>

I'm not sure I follow you all the way, but I don't think so. It could
come from an interpolated variable:

eirik@rustybox[17:33:04]~$ perl 2>&1 | grep PV
use Devel::Peek;
Dump($_) for qr'$foo', qr/${\q($)}foo/;
__END__
PV = 0x55d52be518a0 "(?^:$foo)"
PV = 0x55d52be518a0 "(?^:$foo)"
PV = 0x55d52be714c0 "(?^:$foo)"
PV = 0x55d52be714c0 "(?^:$foo)"
eirik@rustybox[17:33:24]~$


> 1) the *only* way that could have got there was by being written as qr''
>

If I read you right, no, not quite.


> 2) the *only* way to convert that back to a regex is to output qr''
> (ie there exists no escaping syntax capable of recreating it inside
> qr//)
>

The only *other* way I know of, is through a variable.

Perhaps Data::Dumper should output qr/(?^:${\q($)}foo)/; instead?


Eirik
Re: qr'$foo' and unescaped $ signs
On Tue, Jun 29, 2021 at 11:39 AM Eirik Berg Hanssen <
Eirik-Berg.Hanssen@allverden.no> wrote:

>
>
> On Tue, Jun 29, 2021 at 4:31 PM Nicholas Clark <nick@ccl4.org> wrote:
>
>> So I guess, the question is:
>>
>> If the PV stored for a regex has an *unescaped* dollar sign in it it *has*
>> to have originated from a regex written with '' ?
>>
>
> I'm not sure I follow you all the way, but I don't think so. It could
> come from an interpolated variable:
>

Or the last character of the regex, which is quite common.

-Dan
Re: qr'$foo' and unescaped $ signs
On Tue, Jun 29, 2021 at 11:50:03AM -0400, Dan Book wrote:
> On Tue, Jun 29, 2021 at 11:39 AM Eirik Berg Hanssen <
> Eirik-Berg.Hanssen@allverden.no> wrote:
>
> >
> >
> > On Tue, Jun 29, 2021 at 4:31 PM Nicholas Clark <nick@ccl4.org> wrote:
> >
> >> So I guess, the question is:
> >>
> >> If the PV stored for a regex has an *unescaped* dollar sign in it it *has*
> >> to have originated from a regex written with '' ?
> >>
> >
> > I'm not sure I follow you all the way, but I don't think so. It could
> > come from an interpolated variable:
> >
>
> Or the last character of the regex, which is quite common.

That case was actually in the text of my original message, which Eirik didn't
quote.

Nicholas Clark
Re: qr'$foo' and unescaped $ signs
On Tue, Jun 29, 2021 at 05:38:48PM +0200, Eirik Berg Hanssen wrote:
> On Tue, Jun 29, 2021 at 4:31 PM Nicholas Clark <nick@ccl4.org> wrote:
>
> > So I guess, the question is:
> >
> > If the PV stored for a regex has an *unescaped* dollar sign in it it *has*
> > to have originated from a regex written with '' ?
> >
>
> I'm not sure I follow you all the way, but I don't think so. It could
> come from an interpolated variable:

You followed enough to consider a case that I had not. Thanks...

> eirik@rustybox[17:33:04]~$ perl 2>&1 | grep PV
> use Devel::Peek;
> Dump($_) for qr'$foo', qr/${\q($)}foo/;
> __END__
> PV = 0x55d52be518a0 "(?^:$foo)"
> PV = 0x55d52be518a0 "(?^:$foo)"
> PV = 0x55d52be714c0 "(?^:$foo)"
> PV = 0x55d52be714c0 "(?^:$foo)"
> eirik@rustybox[17:33:24]~$
>
>
> > 1) the *only* way that could have got there was by being written as qr''
> >
>
> If I read you right, no, not quite.

You did.

> > 2) the *only* way to convert that back to a regex is to output qr''
> > (ie there exists no escaping syntax capable of recreating it inside
> > qr//)
> >
>
> The only *other* way I know of, is through a variable.
>
> Perhaps Data::Dumper should output qr/(?^:${\q($)}foo)/; instead?

Thanks. That is something I'd not considered, and likely a solution.
I thought we were stuck: the possible solution I had (to use qr'') then
fails if there's Unicode, because the XS code really wants to use \x{...}
escapes to encode that to keep the output as ASCII.

But you've suggested a syntax (hairy, but workable) where one can generate
an equivalent regex that has the literal $ in it, but is written in 7 bits.


ie, start with this, written with a shell heredoc to handle the ''s:

$ ./perl -Ilib -C63 -MDevel::Peek -w <<EOT
Dump(qr'$ ☃')
EOT
SV = IV(0x1332360) at 0x1332370
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x1332520
SV = REGEXP(0x13663a0) at 0x1332520
REFCNT = 1
FLAGS = (OBJECT,POK,FAKE,pPOK,UTF8)
PV = 0x1359a30 "(?^u:$ \342\230\203)" [UTF8 "(?^u:$ \x{2603})"]
CUR = 11
LEN = 0
...


but instead, dump it like this:

./perl -Ilib -C63 -MDevel::Peek -we 'Dump(qr/${\q($)} \x{2603}/)'
SV = IV(0x15e9aa0) at 0x15e9ab0
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x15e9af8
SV = REGEXP(0x1604470) at 0x15e9af8
REFCNT = 1
FLAGS = (OBJECT,POK,FAKE,pPOK,UTF8)
PV = 0x15df630 "(?^u:$ \\x{2603})" [UTF8 "(?^u:$ \\x{2603})"]
CUR = 16
LEN = 0
...

I think that this generalises.

Thanks.

There might be a winning move after all.

Nicholas Clark
Re: qr'$foo' and unescaped $ signs
On Tue, Jun 29, 2021 at 02:30:01PM +0000, Nicholas Clark wrote:
> One of the obscure features of regular expressions is that if one uses a
> single quote as the delimiter, no interpolation takes place.
>
> One result of this - there's a Data::Dumper bug about handling them:
>
> https://rt.cpan.org/Public/Bug/Display.html?id=84569
>
> You get this:
>
> $ perl -MData::Dumper -w
> print Dumper(qr'$foo');
> __END__
> $VAR1 = qr/(?^:$foo)/;

> meaning that if one scans the internal string of the regex (stored as the PV)
> and finds an unescaped $ anywhere (other than the last character) then

or $) or $|

These two in qr// don't interpolate the magic variables.
I think that all the others do.

This doesn't seem to be documented anywhere
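
A small sketch of the two cases named above, assuming a reasonably recent
perl (the variable names are illustrative only). In both patterns the $ is
kept as an anchor rather than being read as the start of $| or $):

```perl
use strict;
use warnings;

my $alt   = qr/foo$|bar/;   # anchor, then an alternation
my $group = qr/(foo$)/;     # anchor, then the end of a capture group

print "$alt\n";     # (?^:foo$|bar)
print "$group\n";   # (?^:(foo$))

print "foo" =~ $alt   ? "Match\n" : "not\n";
print "bar" =~ $alt   ? "Match\n" : "not\n";
print "foo" =~ $group ? "Match\n" : "not\n";
```

So an unescaped $ directly before | or ) in the stored PV can have come from
an ordinary qr//, which is why those two cases need excluding from the scan.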

> 1) the *only* way that could have got there was by being written as qr''
> 2) the *only* way to convert that back to a regex is to output qr''
> (ie there exists no escaping syntax capable of recreating it inside qr//)

> I suspect that this rabbit hole goes deeper. What did I miss? What did I get
> wrong?

I missed the new bit above.

Also, I can't find any way to make a regex with an embedded $ somewhere in
the middle that actually matches any string. Even with qr''m so that the $
will match a "\n" in the middle of a string.

It seems that if $ is matched, it needs to be the (logical) end of the
pattern, else the entire pattern fails to match. By "logical", I mean that
other text can follow in the regex, but it needs to be other alternations, or
something else that doesn't contribute to matching. It can't be more
characters, even if they are present on the next line.

basically, is there any string that qr'$foo'm will match?

Am I right?
And is *that* documented anywhere?

Nicholas Clark
Re: qr'$foo' and unescaped $ signs
On 6/30/21 4:41 AM, Nicholas Clark wrote:
> On Tue, Jun 29, 2021 at 02:30:01PM +0000, Nicholas Clark wrote:
>> One of the obscure features of regular expressions is that if one uses a
>> single quote as the delimiter, no interpolation takes place.
>>
>> One result of this - there's a Data::Dumper bug about handling them:
>>
>> https://rt.cpan.org/Public/Bug/Display.html?id=84569
>>
>> You get this:
>>
>> $ perl -MData::Dumper -w
>> print Dumper(qr'$foo');
>> __END__
>> $VAR1 = qr/(?^:$foo)/;
>
>> meaning that if one scans the internal string of the regex (stored as the PV)
>> and finds an unescaped $ anywhere (other than the last character) then
>
> or $) or $|
>
> These two in qr// don't interpolate the magic variables.
> I think that all the others do.
>
> This doesn't seem to be documented anywhere
>
>> 1) the *only* way that could have got there was by being written as qr''
>> 2) the *only* way to convert that back to a regex is to output qr''
>> (ie there exists no escaping syntax capable of recreating it inside qr//)
>
>> I suspect that this rabbit hole goes deeper. What did I miss? What did I get
>> wrong?
>
> I missed the new bit above.
>
> Also, I can't find any way to make a regex with an embedded $ somewhere in
> the middle that actually matches any string. Even with qr''m so that the $
> will match a "\n" in the middle of a string.
>
> It seems that if $ is matched, it needs to be the (logical) end of the
> pattern, else the entire pattern fails to match. By "logical", I mean that
> other text can follow in the regex, but it needs to be other alternations, or
> something else that doesn't contribute to matching. It can't be more
> characters, even if they are present on the next line.
>
> basically, is there any string that qr'$foo'm will match?
>
> Am I right?
> And is *that* documented anywhere?
>
> Nicholas Clark
>

$ blead -Dr -le "qr'$foo'm"
Compiling REx ""
Final program:
1: NOTHING (2)
2: END (0)
minlen 0
Enabling $` $& $' support (0x7).

From regcomp.sym:
NOTHING NOTHING, no ; Match empty string.

$ blead -Dr -le "qr'\$foo'm"
Compiling REx "$foo"
Final program:
1: MEOL (2)
2: EXACT <foo> (4)
4: END (0)
anchored "foo" at 0..0 (checking anchored) minlen 3

regcomp.sym:
SEOL EOL, no ; Match "" at end of line: /$/
MEOL EOL, no ; Same, assuming multiline: /$/m
Re: qr'$foo' and unescaped $ signs
On Wed, Jun 30, 2021 at 06:18:13AM -0600, Karl Williamson wrote:

> $ blead -Dr -le "qr'$foo'm"
> Compiling REx ""
> Final program:
> 1: NOTHING (2)
> 2: END (0)
> minlen 0
> Enabling $` $& $' support (0x7).

I think that your Unix shell quoting is not doing you favours here.

I think that for this example the shell will replace $foo with an empty
string, as the shell is processing the "" quotes. So your perl one-liner
is not what you intended. (I believe).

And in the other, \$ is processed by the shell, not perl's parser.

Nicholas Clark
Re: qr'$foo' and unescaped $ signs
On Wed, Jun 30, 2021 at 12:42 PM Nicholas Clark <nick@ccl4.org> wrote:

> It seems that if $ is matched, it needs to be the (logical) end of the
> pattern, else the entire pattern fails to match. By "logical", I mean that
> other text can follow in the regex, but it needs to be other alternations, or
> something else that doesn't contribute to matching. It can't be more
> characters, even if they are present on the next line.
>

You can match more characters, but you need to consume the newline
first. (The $ is a zero-width lookahead assertion – it does not consume
anything.)

eirik@rustybox[14:54:32]~$ perl
use 5.16.0;
for (qr'$\sbar'm, qr'$.*bar'ms) {
say "foo\nbar" =~ $_ ? "Matched: $&" : "Nope"
}
__END__
Matched:
bar
Matched:
bar
eirik@rustybox[15:00:14]~$

> basically, is there any string that qr'$foo'm will match?
>

No, no more than there is a string that qr/(?=bar)foo/ will match. Need
to consume that bar first.


Eirik
Re: qr'$foo' and unescaped $ signs
On 6/30/21 6:22 AM, Nicholas Clark wrote:
> On Wed, Jun 30, 2021 at 06:18:13AM -0600, Karl Williamson wrote:
>
>> $ blead -Dr -le "qr'$foo'm"
>> Compiling REx ""
>> Final program:
>> 1: NOTHING (2)
>> 2: END (0)
>> minlen 0
>> Enabling $` $& $' support (0x7).
>
> I think that your Unix shell quoting is not doing you favours here.
>
> I think that for this example the shell will replace $foo with an empty
> string, as the shell is processing the "" quotes. So your perl one-liner
> is not what you intended. (I believe).
>
> And in the other, \$ is processed by the shell, not perl's parser.
>
> Nicholas Clark
>

Sorry. Hopefully this is better

§ cat nwc.pl
qr'$foo'm;
qr'\$foo'm;

§ blead -Dr nwc.pl
Compiling REx "$foo"
Final program:
1: MEOL (2)
2: EXACT <foo> (4)
4: END (0)
anchored "foo" at 0..0 (checking anchored) minlen 3
Compiling REx "\$foo"
Final program:
1: EXACT <$foo> (3)
3: END (0)
anchored "$foo" at 0..0 (checking anchored isall) minlen 4
Enabling $` $& $' support (0x7).
Re: qr'$foo' and unescaped $ signs
On Wed, Jun 30, 2021 at 03:05:04PM +0200, Eirik Berg Hanssen wrote:
> On Wed, Jun 30, 2021 at 12:42 PM Nicholas Clark <nick@ccl4.org> wrote:
>
> > It seems that if $ is matched, it needs to be the (logical) end of the
> > pattern, else the entire pattern fails to match. By "logical", I mean that
> > other text can follow in the regex, but it needs to be other alternations,
> > or something else that doesn't contribute to matching. It can't be more
> > characters, even if they are present on the next line.
> >
>
> You can match more characters, but you need to consume the newline
> first. (The $ is a zero-width lookahead assertion – it does not consume
> anything.)

D'oh! Thanks. Yes. That was what I goofed.


So, based on your suggestion I've written what I think is a fix, and
pushed it to https://github.com/nwc10/perl5/tree/DD-84569

The pure-Perl implementation is this:

$pat =~ s <
            (\\.)           # anything backslash escaped
          | (\$)(?![)|]|\z) # any unescaped $, except $| $) and end
          | /               # any unescaped /
        >
        {
            $1 ? $1
                : $2 ? '${\q($)}'
                : '\\/'
        }gex;
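
A hedged sketch of how that substitution behaves, assuming it is wrapped in a
small helper. The name escape_for_qr is hypothetical and is not part of
Data::Dumper; the regex is the one above, with the comments reworded so that
they contain no $ characters:

```perl
use strict;
use warnings;

# Hypothetical helper (not Data::Dumper's API) wrapping the substitution above.
sub escape_for_qr {
    my ($pat) = @_;
    $pat =~ s{
          (\\.)             # anything backslash escaped: keep unchanged
        | (\$)(?![)|]|\z)   # unescaped dollar that qr// would try to interpolate
        | /                 # unescaped slash, which would end qr//
    }{
          $1 ? $1
        : $2 ? '${\q($)}'
        :      '\\/'
    }gex;
    return $pat;
}

my $escaped = escape_for_qr('$foo');
print "$escaped\n";    # ${\q($)}foo

# Round trip: the rebuilt regex stringifies the same as the qr'' original.
my $original = qr'$foo';
my $rebuilt  = eval "qr/$escaped/" or die $@;
print "$rebuilt" eq "$original" ? "same\n" : "different\n";
```

The string eval stands in for whoever later evaluates the dumped output; the
rebuilt regex carries the same literal $ as the qr'' original, while the
source text needs no single-quote delimiters.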


The C implementation might actually be easier to read. :-)

It passes tests on blead. I've not *yet* tested it against older Perls,
so I'm not going to roll a CPAN dev release yet.

(Also, a fresh head tomorrow might spot some bugs or corner cases that I
missed)


Thanks for your insights on this.

Nicholas Clark
Re: qr'$foo' and unescaped $ signs
On Wed, Jun 30, 2021 at 03:23:53PM +0000, Nicholas Clark wrote:

> It passes tests on blead. I've not *yet* tested it against older Perls,
> so I'm not going to roll a CPAN dev release yet.

2.183 escaped to CPAN this morning.

Nicholas Clark
Re: qr'$foo' and unescaped $ signs
On 7/5/21 6:09 AM, Nicholas Clark wrote:
> On Wed, Jun 30, 2021 at 03:23:53PM +0000, Nicholas Clark wrote:
>
>> It passes tests on blead. I've not *yet* tested it against older Perls,
>> so I'm not going to roll a CPAN dev release yet.
>
> 2.183 escaped to CPAN this morning.
>
> Nicholas Clark
>

Does that make this ticket closable?
https://rt.cpan.org/Ticket/Display.html?id=84569#txn-2031471

Thank you very much.
Jim Keenan
Re: qr'$foo' and unescaped $ signs
On Mon, Jul 05, 2021 at 07:35:10AM -0400, James E Keenan wrote:
> On 7/5/21 6:09 AM, Nicholas Clark wrote:
> > On Wed, Jun 30, 2021 at 03:23:53PM +0000, Nicholas Clark wrote:
> >
> > > It passes tests on blead. I've not *yet* tested it against older Perls,
> > > so I'm not going to roll a CPAN dev release yet.
> >
> > 2.183 escaped to CPAN this morning.
> >
> > Nicholas Clark
> >
>
> Does that make this ticket closable?
> https://rt.cpan.org/Ticket/Display.html?id=84569#txn-2031471

I checked again, and the metadata has now caught up, meaning that "2.183" is
now in the "Fixed in" choices, so I have set that and closed it.

It seems that it takes a few hours for rt.cpan.org to get in sync with
PAUSE, so I didn't close it as soon as I made the upload.

Nicholas Clark