One of the obscure features of regular expressions is that if one uses a
single quote as the delimiter, no interpolation takes place.
One result of this is a Data::Dumper bug about handling such patterns:
https://rt.cpan.org/Public/Bug/Display.html?id=84569
You get this:
$ perl -MData::Dumper -w
print Dumper(qr'$foo');
__END__
$VAR1 = qr/(?^:$foo)/;
That is not correct (and it's still the behaviour in blead).
However, I'm struggling: is this a core bug or a DD bug? How do these
crazy regexes really work?
Say I take this:
$ cat ~/test/single.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Devel::Peek;
use re 'debug';
my $re = qr'$foo';
print "$re\n";
print Dumper($re);
++$Data::Dumper::Useperl;
print Dumper($re);
print "\$foo" =~ $re ? "Match\n" : "not\n";
Dump($re);
__END__
$ perl ~/test/single.pl
Compiling REx "$foo"
Final program:
1: SEOL (2)
2: EXACT <foo> (4)
4: END (0)
anchored "foo" at 0..0 (checking anchored) minlen 3
(?^:$foo)
$VAR1 = qr/$foo/;
$VAR1 = qr/$foo/;
Matching REx "$foo" against "$foo"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..4] gave 1
Found anchored substr "foo" at offset 1 (rx_origin now 1)...
(multiline anchor test skipped)
try at offset...
Intuit: Successfully guessed: match at offset 1
1 <$> <foo> | 0| 1:SEOL(2)
| 0| failed...
Match failed
not
SV = IV(0xbc3920) at 0xbc3920
REFCNT = 1
FLAGS = (ROK)
RV = 0xbbe138
SV = REGEXP(0xcb5944) at 0xbbe138
REFCNT = 1
FLAGS = (OBJECT,POK,FAKE,pPOK)
PV = 0xcbf1f8 "(?^:$foo)"
...
Compare this with *trying* to re-write qr'$foo' as qr/\$foo/:
$ cat ~/test/normal.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Devel::Peek;
use re 'debug';
my $re = qr/\$foo/;
print "$re\n";
print Dumper($re);
++$Data::Dumper::Useperl;
print Dumper($re);
print "\$foo" =~ $re ? "Match\n" : "not\n";
Dump($re);
__END__
$ perl ~/test/normal.pl
Compiling REx "\$foo"
Final program:
1: EXACT <$foo> (3)
3: END (0)
anchored "$foo" at 0..0 (checking anchored isall) minlen 4
(?^:\$foo)
$VAR1 = qr/\$foo/;
$VAR1 = qr/\$foo/;
Matching REx "\$foo" against "$foo"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..4] gave 0
Found anchored substr "$foo" at offset 0 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
Match
SV = IV(0x1726920) at 0x1726920
REFCNT = 1
FLAGS = (ROK)
RV = 0x1721138
SV = REGEXP(0x18189d4) at 0x1721138
REFCNT = 1
FLAGS = (OBJECT,POK,FAKE,pPOK)
PV = 0x1821e88 "(?^:\\$foo)"
...
My *first* thought when looking at all the Devel::Peek output was "there is
nothing recording that the regex was written with '' - this is wrong?"
And as there's no hint about the '' in the internal state, this must be
buggy because anything trying to interpolate them is going to mistake
'$foo' for $foo and violate strict.
So I played, and nothing goes wrong.
After quite a while, my *second* thought was "hang on - those two aren't
the same regular expression" - ie qr'$foo' isn't qr/\$foo/
If you look at the regex debug output, qr'$foo' is
1) the anchor $
2) the 3 character fixed string foo
while qr/\$foo/ is
1) the 4 character fixed string $foo
at which point it's starting to make slightly more sense.
(the second regex also differs in that it has the "isall" flag set on it,
which isn't because of the '' vs //, but because it's entirely a literal
string, so the result from the optimiser can be used directly, without
hitting the main regex engine.)
So, I think what matters here is that scalar variable interpolation happens
early in regex compilation, and only happens once. If you interpolate a
regex, the *literal* content of that regex is interpolated (however strange)
but isn't then subject to another round of variable interpolation.
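
A minimal sketch of that "only once" behaviour, assuming nothing beyond
core perl. The names $inner and $outer are made up for illustration. No
variable called $foo exists in the file, so if the spliced-in text were
scanned for variables a second time, strict would refuse to compile it:

```perl
use strict;
use warnings;

# Single-quote delimiters: the pattern text is the four characters $foo,
# and no variable lookup happens here.
my $inner = qr'$foo';

# $inner itself is interpolated (once). Its stringified text is spliced
# in as-is and is not rescanned for further variables.
my $outer = qr/bar$inner/;

print "$outer\n";
print "bar\$foo" =~ $outer ? "Match\n" : "not\n";
```

The script compiles cleanly under strict, which is consistent with the
interpolated text being treated as pattern source rather than as
double-quotish text.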
So, am I right in thinking
1) the perl core is correct in what it stores into the C structures?
2) what Devel::Peek reports is sane?
3) this all works with interpolation?
4) if I see \$ in the C structures, it came from \$ in qr// qr"" etc?
(even if the match itself makes no sense, and won't work without //m.)
(In which case, the bug *is* in Data::Dumper.)
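
One way to probe the Data::Dumper side is a round trip: dump the regex,
eval the dump, and compare stringifications. This is only a sketch. The
variable names are illustrative, and which branch it takes depends on the
installed Data::Dumper version, so no expected output is given:

```perl
use strict;
use warnings;
use Data::Dumper;

my $re = qr'$foo';

# Terse makes Dumper emit a bare expression rather than "$VAR1 = ...;",
# so the result of the string eval can be assigned directly.
local $Data::Dumper::Terse = 1;
my $dumped = Dumper($re);

# The string eval inherits "use strict" from this file, so a dump that
# mentions an undeclared $foo fails to compile instead of silently
# interpolating an empty string.
my $copy = eval $dumped;

if (!defined $copy) {
    print "round trip failed to compile: $@";
}
elsif ("$copy" eq "$re") {
    print "round trip preserved the pattern\n";
}
else {
    print "round trip changed the pattern: $re became $copy\n";
}
```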
So I guess, the question is:
If the PV stored for a regex has an *unescaped* dollar sign in it, does it
*have* to have originated from a regex written with ''?
ie for all the other pattern delimiters, which are doing what perlre calls
"double-quotish context", the two literal characters \$ in the pattern are
passed on downwards as \$ - ie they don't just suppress interpolation in the
perl parser, but are also passed down to regcomp.c as a literal pair \$
meaning that if one scans the internal string of the regex (stored as the PV)
and finds an unescaped $ anywhere (other than the last character) then
1) the *only* way that could have got there was by being written as qr''
2) the *only* way to convert that back to a regex is to output qr''
(ie there exists no escaping syntax capable of recreating it inside qr//)
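
The scan described above could be sketched like this. The helper name
has_unescaped_dollar is hypothetical and is not part of Data::Dumper or
the core. It is deliberately naive: it only tracks backslash escapes, and
does not special-case character classes or a trailing $. It reports that
an unescaped $ is present, not how it got there:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the stringified regex contains a '$' that
# is not escaped by an immediately preceding backslash.
sub has_unescaped_dollar {
    my ($re) = @_;
    my $escaped = 0;
    for my $ch (split //, "$re") {
        if ($escaped)    { $escaped = 0; next }   # char after a backslash
        if ($ch eq '\\') { $escaped = 1; next }   # start of an escape pair
        return 1 if $ch eq '$';
    }
    return 0;
}

my $single = has_unescaped_dollar(qr'$foo');
my $normal = has_unescaped_dollar(qr/\$foo/);

print "qr'\$foo':    ", ($single ? "unescaped \$" : "no unescaped \$"), "\n";
print "qr/\\\$foo/:  ", ($normal ? "unescaped \$" : "no unescaped \$"), "\n";
```

A dumper could use a check along these lines to decide when qr// is not a
safe way to write the pattern back out, if the assumption above holds.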
I suspect that this rabbit hole goes deeper. What did I miss? What did I get
wrong?
Nicholas Clark