Mailing List Archive: RFC naming the 0th match in regular expressions

perl5-porters at perl

Sep 6, 2021, 6:37 PM

Post #1 of 12 (617 views)

Hi all,

I brought this up on PCRE2 (at
<https://github.com/PhilipHazel/pcre2/issues/15>), but the author
rightfully pointed out that it's something Perl doesn't do.

1. The (minor) problem is that a regular expression cannot begin
directly with a naming group, it must be parenthesised. This makes
expressions that use named patterns produce result sets that are 1
item longer than necessary, and contain a duplicate member (0 and
1). It is more pronounced when using the "g" modifier, because
iterating the matches then has multiple 0 groups that cannot be
addressed using a semantic name.
2. The proposed syntax is to allow a pattern like
/?<name>.../
rather than requiring
/(?<name>...)/
3. The benefits are consistent semantic access of match result members,
and smaller result sets. I'm sure results are already
memory-efficient with named and numbered groups pointing to the same
data, but they have more output when examined or used.
4. From basic testing with PCRE, expressions that begin with ? fail to
compile, but I'm not certain that's the case for all regexp usage in
Perl. It looks like a backwards-compatible change.
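
A minimal sketch of the duplication described in point 1, using the syntax that exists today. The sample string and the group names (pair, k, v) are made up for illustration:

```perl
use strict;
use warnings;

my $str = 'key=value';

# A list-context match returns the numbered groups ($1, $2, $3).
my @groups = $str =~ /(?<pair>(?<k>\w+)=(?<v>\w+))/;

my $whole = $&;          # the entire match (the "0th" group)
my $first = $1;          # same text again, because the name needs parens
my $named = $+{pair};    # and once more, reached by name

print "whole=$whole first=$first named=$named\n";
print "numbered groups: ", scalar(@groups), "\n";
```

Here the outer named group exists only so the whole match has a name, and it takes up slot 1 in the numbered results.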

Regards,
Roy.

Re: RFC naming the 0th match in regular expressions [ In reply to ]

grinnz at gmail

Sep 8, 2021, 8:56 PM

Post #2 of 12 (617 views)

On Wed, Sep 8, 2021 at 9:44 PM Roy via perl5-porters <perl5-porters@perl.org>
wrote:

> Hi all,
>
> I brought this up on PCRE2 (at
> <https://github.com/PhilipHazel/pcre2/issues/15>), but the author
> rightfully pointed out that it's something Perl doesn't do.
>
> 1. The (minor) problem is that a regular expression cannot begin
> directly with a naming group, it must be parenthesised. This makes
> expressions that use named patterns produce result sets that are 1
> item longer than necessary, and contain a duplicate member (0 and
> 1). It is more pronounced when using the "g" modifier, because
> iterating the matches then has multiple 0 groups that cannot be
> addressed using a semantic name.
> 2. The proposed syntax is to allow a pattern like
> /?<name>.../
> rather than requiring
> /(?<name>...)/
> 3. The benefits are consistent semantic access of match result members,
> and smaller result sets. I'm sure results are already
> memory-efficient with named and numbered groups pointing to the same
> data, but they have more output when examined or used.
> 4. From basic testing with PCRE, expressions that begin with ? fail to
> compile, but I'm not certain that's the case for all regexp usage in
> Perl. It looks like a backwards-compatible change.
>

It is available syntax, but it would be inconsistent to allow it at the
beginning of a regex and not in the middle, particularly since regexes can
be combined.

-Dan

Re: RFC naming the 0th match in regular expressions [ In reply to ]

perl5-porters at perl

Sep 8, 2021, 9:52 PM

Post #3 of 12 (617 views)

On 9/9/21 13:26, Dan Book wrote:
> On Wed, Sep 8, 2021 at 9:44 PM Roy via perl5-porters
> <perl5-porters@perl.org> wrote:
>
> 2. The proposed syntax is to allow a pattern like
> /?<name>.../
> rather than requiring
> /(?<name>...)/
>
>
> It is available syntax, but it would be inconsistent to allow it at
> the beginning of a regex and not in the middle, particularly since
> regexes can be combined.

All patterns are, in a sense, surrounded by invisible implicit
parentheses corresponding to the whole string. Opening those implicit
parens with a name is as logical as numbering it 0. Surely, one can't
expect to combine arbitrary patterns, without knowing their contents,
and expect them to be valid. If one was concatenating partial patterns,
and you wanted to use this feature, you'd concatenate an opening name
group partial + partial 0 + ... + partial n. Starting a name in the
middle of a combined pattern, without corresponding parens, makes no
more sense than expecting an anchor in the middle of a word to work,
e.g. /foo\Abar/

--
Regards,
Roy

Re: RFC naming the 0th match in regular expressions [ In reply to ]

grinnz at gmail

Sep 9, 2021, 1:20 AM

Post #4 of 12 (617 views)

On Thu, Sep 9, 2021 at 12:53 AM Roy <roy-orbison@devo.net.au> wrote:

> On 9/9/21 13:26, Dan Book wrote:
>
> On Wed, Sep 8, 2021 at 9:44 PM Roy via perl5-porters <
> perl5-porters@perl.org> wrote:
>
>> 2. The proposed syntax is to allow a pattern like
>> /?<name>.../
>> rather than requiring
>> /(?<name>...)/
>
>
> It is available syntax, but it would be inconsistent to allow it at the
> beginning of a regex and not in the middle, particularly since regexes can
> be combined.
>
>
> All patterns are, in a sense, surrounded by invisible implicit parentheses
> corresponding to the whole string. Opening those implicit parens with a
> name is as logical as numbering it 0. Surely, one can't expect to combine
> arbitrary patterns, without knowing their contents, and expect them to be
> valid. If one was concatenating partial patterns, and you wanted to use
> this feature, you'd concatenate an opening name group partial + partial 0 +
> ... + partial n. Starting a name in the middle of a combined pattern,
> without corresponding parens, makes no more sense than expecting an anchor
> in the middle of a word to work, e.g. /foo\Abar/
>

That's not how it works in Perl, where strings and regular expressions are
regularly interpolated.

-Dan

Re: RFC naming the 0th match in regular expressions [ In reply to ]

perl5-porters at perl

Sep 9, 2021, 2:16 AM

Post #5 of 12 (617 views)

On 9/9/21 17:50, Dan Book wrote:
> On Thu, Sep 9, 2021 at 12:53 AM Roy <roy-orbison@devo.net.au> wrote:
>
>> On 9/9/21 13:26, Dan Book wrote:
>>
>> On Wed, Sep 8, 2021 at 9:44 PM Roy via perl5-porters <
>> perl5-porters@perl.org> wrote:
>>
>>> 2. The proposed syntax is to allow a pattern like
>>> /?<name>.../
>>> rather than requiring
>>> /(?<name>...)/
>>
>> It is available syntax, but it would be inconsistent to allow it at the
>> beginning of a regex and not in the middle, particularly since regexes can
>> be combined.
>>
>>
>> All patterns are, in a sense, surrounded by invisible implicit parentheses
>> corresponding to the whole string. Opening those implicit parens with a
>> name is as logical as numbering it 0. Surely, one can't expect to combine
>> arbitrary patterns, without knowing their contents, and expect them to be
>> valid. If one was concatenating partial patterns, and you wanted to use
>> this feature, you'd concatenate an opening name group partial + partial 0 +
>> ... + partial n. Starting a name in the middle of a combined pattern,
>> without corresponding parens, makes no more sense than expecting an anchor
>> in the middle of a word to work, e.g. /foo\Abar/
>>
> That's not how it works in Perl, where strings and regular expressions are
> regularly interpolated.

What, exactly, is "not how it works"? Are you talking about something
other than building regexen from variables, etc.?

This test script:

$a = 'fooo';
$b = '?<bar>baz';
if ('fooobaz' =~ /$a$b/) {
   print "Bare name groups are a thing, not backwards compatible.\n";
}
elsif ('foo<bar>baz' =~ /$a$b/) {
   print "Not a name group. Same as current and in proposal.\n";
}
if ('bazfooo' =~ /$b$a/) {
   print "Hooray! Named 0!\n";
}

Produces:

Not a name group. Same as current and in proposal.
Quantifier follows nothing in regex; marked by <-- HERE in m/? <-- HERE
<bar>bazfooo/ at test.pl line 9.

--
Regards,
Roy

Re: RFC naming the 0th match in regular expressions [ In reply to ]

hv at crypt

Sep 9, 2021, 3:52 AM

Post #6 of 12 (617 views)

Smylers <Smylers@stripey.com> wrote:
:Currently you can wrap a pattern:
:
: $pat = "(?:$pat)";
:
:and it will continue to match whatever it did before:
:
: /$pat/

It's more than that: any compiled regular expression will stringify to
a very similar wrapped form that also encapsulates the flags with which
the regexp was compiled, and I'd say we pretty much require _that_ to be
interpreted as an identical regexp:

% perl -E 'say qr{pat}'
(?^u:pat)
%

It might be possible to extend Roy's syntax such that /?<name>/ means
the same as /(?flags:?<name>)/ but that would probably have to be
extended recursively, so I'm not sure we'd want to do that.

What might fit better - and as far as I can see would allow the desired
behaviour to be achieved just as easily - would be a new flag that
suppresses the zeroth capture.
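
A small sketch of the stringification behaviour mentioned above, for anyone who wants to try it. The variable names are illustrative only, and the exact flag letters in the stringified form depend on the flags and features in effect:

```perl
use strict;
use warnings;

my $qr   = qr/pat/i;
my $text = "$qr";    # stringified form, e.g. (?^i:pat)

# The stringified form is itself a valid pattern that carries its own
# flags, so it can be interpolated into a larger pattern.
my $outer = qr/x${text}y/;

my $flags_kept  = 'xPATy' =~ $outer ? 1 : 0;   # inner /i still applies
my $outer_exact = 'XpatY' =~ $outer ? 1 : 0;   # outer part stays case-sensitive

print "$text\n";
print "flags_kept=$flags_kept outer_exact=$outer_exact\n";
```

Anything that is only legal as the very first thing in a pattern would no longer be first once it sits inside that (?^...:) wrapper.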

Hugo

Re: RFC naming the 0th match in regular expressions [ In reply to ]

Smylers at stripey

Sep 9, 2021, 4:07 AM

Post #7 of 12 (617 views)

Roy via perl5-porters writes:

> Starting a name in the middle of a combined pattern, without
> corresponding parens, makes no more sense than expecting an anchor in
> the middle of a word to work, e.g. /foo\Abar/

Possibly not, but Perl currently allows several things that don't
(necessarily always) make sense. Including doing:

/$prefix$pattern/

Now if $pattern starts with \A, then that *probably* won't match. But it
might: $prefix might sometimes be empty; or a zero-width lookahead
assertion; or end with a | symbol.

But even if it doesn't match, it will compile as a regexp, and the Perl
program can continue doing whatever it does when matching fails.
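
As a rough illustration of those cases (the prefixes and the subject string 'bar' are just examples I picked), every combination below compiles, and only the match result differs:

```perl
use strict;
use warnings;

my $pattern = '\Abar';
my %result;

for my $prefix ('', 'foo', 'foo|', '(?=bar)') {
    # eval traps a compile failure instead of letting it kill the program.
    my $re = eval { qr/$prefix$pattern/ };
    $result{$prefix} = !defined($re)  ? 'syntax error'
                     : 'bar' =~ $re   ? 'match'
                     :                  'no match';
    printf "%-10s %s\n", "'$prefix'", $result{$prefix};
}
```

None of the four reports a syntax error; the 'foo' prefix simply fails to match.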

Whereas if you introduce syntax that is permitted at the start of a
pattern but not anywhere else in it, you can now have the situation
where $prefix and $pattern are both valid regexps, yet concatenating
them as above yields a syntax error, crashing the program.

It may not make any more sense, but it does have a different outcome.

> All patterns are, in a sense, surrounded by invisible implicit
> parentheses corresponding to the whole string.

As such, it can sometimes make sense to make those parentheses
explicit, for instance when preparing a pattern for potential embedding
elsewhere.

Currently you can wrap a pattern:

$pat = "(?:$pat)";

and it will continue to match whatever it did before:

/$pat/

Again, that's true if the pattern starts with \A. It's not true if it
starts with something that is a syntax error other than right at the
beginning.

I'm not against the issue you raise being addressed in some way. It may
even be that the solution you proposed, on reflection, turns out to be the
best way of doing so. But it's important that potential downsides are
identified and considered, not just dismissed as something that
shouldn't happen anyway.

Best wishes

Smylers

Re: RFC naming the 0th match in regular expressions [ In reply to ]

demerphq at gmail

Sep 9, 2021, 4:38 AM

Post #8 of 12 (617 views)

On Thu, 9 Sept 2021 at 03:44, Roy via perl5-porters <perl5-porters@perl.org>
wrote:

> Hi all,
>
> I brought this up on PCRE2 (at
> <https://github.com/PhilipHazel/pcre2/issues/15>), but the author
> rightfully pointed out that it's something Perl doesn't do.
>
> 1. The (minor) problem is that a regular expression cannot begin
> directly with a naming group, it must be parenthesised. This makes
> expressions that use named patterns produce result sets that are 1
> item longer than necessary, and contain a duplicate member (0 and
> 1). It is more pronounced when using the "g" modifier, because
> iterating the matches then has multiple 0 groups that cannot be
> addressed using a semantic name.
> 2. The proposed syntax is to allow a pattern like
> /?<name>.../
> rather than requiring
> /(?<name>...)/
> 3. The benefits are consistent semantic access of match result members,
> and smaller result sets. I'm sure results are already
> memory-efficient with named and numbered groups pointing to the same
> data, but they have more output when examined or used.
> 4. From basic testing with PCRE, expressions that begin with ? fail to
> compile, but I'm not certain that's the case for all regexp usage in
> Perl. It looks like a backwards-compatible change.
>

Besides the part about wanting to have special syntax at the start of the
pattern which IMO doesn't fly for the reasons already stated, I really don't
understand what you want. Is it that you want to be able to provide a named
capture name for the entire match so you can access it via %+ and %-
instead of via $&? For a bunch of historical reasons "the entire match" is
special, even though it is logically under the hood equivalent to the 0th
capture buffer, we use a special var name for it, and do not normally
consider it to be a "capture buffer". $& may even have performance
implications. The only place where the "0" buffer is referenced in perl
is @- and @+. It is not exposed via %+ and %- and nor is it exposed
in @^CAPTURE.
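
To make that concrete, a short sketch (the string and the group name w are arbitrary) showing where index 0 is visible and where it is not:

```perl
use strict;
use warnings;

my $str = 'hello world';
$str =~ /(?<w>wor)ld/ or die "no match";

# @- and @+ hold start and end offsets; index 0 is the entire match.
my $whole = substr($str, $-[0], $+[0] - $-[0]);
my $first = substr($str, $-[1], $+[1] - $-[1]);

# %+ only lists named groups, so the entire match is absent from it.
my @names = sort keys %+;

print "whole='$whole' first='$first' names=@names\n";
```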

cheers,
Yves

Re: RFC naming the 0th match in regular expressions [ In reply to ]

perl5-porters at perl

Sep 9, 2021, 8:33 PM

Post #9 of 12 (617 views)

On 9/9/21 21:08, demerphq wrote:
> On Thu, 9 Sept 2021 at 03:44, Roy via perl5-porters
> <perl5-porters@perl.org> wrote:
>
> Hi all,
>
> I brought this up on PCRE2 (at
> <https://github.com/PhilipHazel/pcre2/issues/15>), but the author
> rightfully pointed out that it's something Perl doesn't do.
>
> 1. The (minor) problem is that a regular expression cannot begin
> directly with a naming group, it must be parenthesised. This makes
> expressions that use named patterns produce result sets that are 1
> item longer than necessary, and contain a duplicate member (0 and
> 1). It is more pronounced when using the "g" modifier, because
> iterating the matches then has multiple 0 groups that cannot be
> addressed using a semantic name.
> 2. The proposed syntax is to allow a pattern like
> /?<name>.../
> rather than requiring
> /(?<name>...)/
> 3. The benefits are consistent semantic access of match result
> members,
> and smaller result sets. I'm sure results are already
> memory-efficient with named and numbered groups pointing to
> the same
> data, but they have more output when examined or used.
> 4. From basic testing with PCRE, expressions that begin with ?
> fail to
> compile, but I'm not certain that's the case for all regexp
> usage in
> Perl. It looks like a backwards-compatible change.
>
>
> Besides the part about wanting to have special syntax at the start of
> the pattern which IMO doesn't fly for the reasons already stated,

I don't see how those reasons explain how it wouldn't work. Perl
enabling interpolation in regex literals doesn't guarantee that all
patterns created with interpolation are valid. I'm not proposing
anything that would break existing code.

To me, asserting ?<name> would have to be supported mid-pattern, without
parens, is as illogical as trying to support \A and \z mid-pattern.

I think what Dan meant by "that's not how it works" was that there are
no implicit parens surrounding patterns because patterns can be
combined. However, upon the _execution_ of any such pattern, the
delimiters of the combination, e.g. /$x$y$z/, are analogous to a
grouping, e.g. ($x$y$z), where those parens correspond to the 0 buffer.
All my proposal suggests is that that grouping be directly nameable.

> I really don't understand what you want. Is it that you want to be able
> to provide a named capture name for the entire match
Pretty much. I think this affects PCRE2 more than Perl.
> so you can access it via %+ and %- instead of via $&? For a bunch of
> historical reasons "the entire match" is special, even though it is
> logically under the hood equivalent to the 0th capture buffer, we use
> a special var name for it, and do not normally consider it to be a
> "capture buffer". $& may even have performance implications. The only
> place where the "0" buffer is referenced in perl is @- and @+. It is
> not exposed via %+ and %- and nor is it exposed in @^CAPTURE.

This is not at all intended to disrupt the Perl way of doing things. The
exposure of buffers in $&, $1 .. $n, $+{name}, etc. would be exactly the
same as if someone had used the same pattern with explicit parens
/(?<name>...)/. The only discernible difference to the programmer would
be the length of the numbered matches being one less, because they'd all
be shifted down so $1 would no longer be a duplicate of $&.
I demonstrated it as best I could in this comment:
https://github.com/PhilipHazel/pcre2/issues/15#issuecomment-912940482

It shouldn't affect named backreferences, either, because /?<x>foo\g{x}/
would be as bad as /(?<x>foo\g{x})/

--
Regards,
Roy

Re: RFC naming the 0th match in regular expressions [ In reply to ]

demerphq at gmail

Sep 13, 2021, 10:32 AM

Post #10 of 12 (617 views)

On Fri, 10 Sept 2021 at 05:33, Roy <roy-orbison@devo.net.au> wrote:

> On 9/9/21 21:08, demerphq wrote:
> > On Thu, 9 Sept 2021 at 03:44, Roy via perl5-porters
> > <perl5-porters@perl.org> wrote:
> >
> > Hi all,
> >
> > I brought this up on PCRE2 (at
> > <https://github.com/PhilipHazel/pcre2/issues/15>), but the author
> > rightfully pointed out that it's something Perl doesn't do.
> >
> > 1. The (minor) problem is that a regular expression cannot begin
> > directly with a naming group, it must be parenthesised. This
> makes
> > expressions that use named patterns produce result sets that are
> 1
> > item longer than necessary, and contain a duplicate member (0 and
> > 1). It is more pronounced when using the "g" modifier, because
> > iterating the matches then has multiple 0 groups that cannot be
> > addressed using a semantic name.
> > 2. The proposed syntax is to allow a pattern like
> > /?<name>.../
> > rather than requiring
> > /(?<name>...)/
> > 3. The benefits are consistent semantic access of match result
> > members,
> > and smaller result sets. I'm sure results are already
> > memory-efficient with named and numbered groups pointing to
> > the same
> > data, but they have more output when examined or used.
> > 4. From basic testing with PCRE, expressions that begin with ?
> > fail to
> > compile, but I'm not certain that's the case for all regexp
> > usage in
> > Perl. It looks like a backwards-compatible change.
> >
> >
> > Besides the part about wanting to have special syntax at the start of
> > the pattern which IMO doesn't fly for the reasons already stated,
>
> I don't see how those reasons explain how it wouldn't work. Perl
> enabling interpolation in regex literals doesn't guarantee that all
> patterns created with interpolation are valid. I'm not proposing
> anything that would break existing code.
>
> To me, asserting ?<name> would have to be supported mid-pattern, without
> parens, is as illogical as trying to support \A and \z mid-pattern.
>
> I think what Dan meant by "that's not how it works" was that there are
> no implicit parens surrounding patterns because patterns can be
> combined. However, upon the _execution_ of any such pattern, the
> delimiters of the combination, e.g. /$x$y$z/, are analogous to a
> grouping, e.g. ($x$y$z), where those parens correspond to the 0 buffer.
> All my proposal suggests is that that grouping be directly nameable.
>

There is a strong expectation that:

$str=~/PAT/;

and

my $qr= qr/PAT/;
$str=~/(?:$qr)/

will always be expected to match the same.

Basically PAT and (?:PAT) have to match the same thing.

Which means that making something be legal only as the first part of a
pattern doesn't fly.

> > I really don't understand what you want. Is it that you want to be able
> > to provide a named capture name for the entire match
> Pretty much. I think this affects PCRE2 more than Perl.
> > so you can access it via %+ and %- instead of via $&? For a bunch of
> > historical reasons "the entire match" is special, even though it is
> > logically under the hood equivalent to the 0th capture buffer, we use
> > a special var name for it, and do not normally consider it to be a
> > "capture buffer". $& may even have performance implications. The only
> > place where the "0" buffer is referenced in perl is @- and @+. It is
> > not exposed via %+ and %- and nor is it exposed in @^CAPTURE.
>
> This is not at all intended to disrupt the Perl way of doing things. The
> exposure of buffers in $&, $1 .. $n, $+{name}, etc. would be exactly the
> same as if someone had used the same pattern with explicit parens
> /(?<name>...)/. The only discernible difference to the programmer would
> be the length of the numbered matches being one less, because they'd all
> be shifted down so $1 would no longer be a duplicate of $&.
> I demonstrated it as best I could in this comment:
> https://github.com/PhilipHazel/pcre2/issues/15#issuecomment-912940482
>
> It shouldn't affect named backreferences, either, because /?<x>foo\g{x}/
> would be as bad as /(?<x>foo\g{x})/
>

I am not saying you are trying to disrupt anything. I am trying to figure
out if we can give you what you want, even if I am resistant to giving you
what you asked for. I am not convinced they are the same thing.

Eg, if I made it so that

use re full_capture_name => "foo";

/PAT/

would "populate"

$+{foo}

as expected then it would satisfy your need right? Why does it have to be
this weird syntax at the start of the pattern?
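
For comparison, the closest thing available today is to wrap the pattern in a named group by hand. A sketch only: with_full_name is a hypothetical helper, not part of re or any module, and the pattern is made up. Note that it still shifts the numbered groups up by one, which is the original complaint:

```perl
use strict;
use warnings;

# Hypothetical helper: wrap any pattern in a named group covering all of it.
sub with_full_name {
    my ($name, $pat) = @_;
    my $src = "(?<$name>$pat)";
    return qr/$src/;
}

my $re = with_full_name('foo', qr/[a-z]+(\d+)/);
'abc123' =~ $re or die "no match";

# $+{foo} and $1 both hold the entire match; the inner group moved to $2.
my ($named, $one, $two) = ($+{foo}, $1, $2);
print "named=$named \$1=$one \$2=$two\n";
```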

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: RFC naming the 0th match in regular expressions [ In reply to ]

perl5-porters at perl

Sep 13, 2021, 5:22 PM

Post #11 of 12 (617 views)

On 14/9/21 03:02, demerphq wrote:
> There is a strong expectation that:
>
> $str=~/PAT/;
>
> and
>
> my $qr= qr/PAT/;
> $str=~/(?:$qr)/
>
> will always be expected to match the same.
>
> Basically PAT and (?:PAT) have to match the same thing.
>
> Which means that making something be legal only as the first part of a
> pattern doesn't fly.

I didn't know that. In that case, the whole thing is moot because it's a
BC break.

Thanks for the consideration.

--
Cheers,
Roy

Re: RFC naming the 0th match in regular expressions [ In reply to ]

perl5-porters at perl

Sep 14, 2021, 11:59 PM

Post #12 of 12 (617 views)

On 9/9/21 20:22, hv@crypt.org wrote:
> Smylers <Smylers@stripey.com> wrote:
> :Currently you can wrap a pattern:
> :
> : $pat = "(?:$pat)";
> :
> :and it will continue to match whatever it did before:
> :
> : /$pat/
>
> It's more than that: any compiled regular expression will stringify to
> a very similar wrapped form that also encapsulates the flags with which
> the regexp was compiled, and I'd say we pretty much require _that_ to be
> interpreted as an identical regexp:
>
> % perl -E 'say qr{pat}'
> (?^u:pat)
> %
>
> It might be possible to extend Roy's syntax such that /?<name>/ means
> the same as /(?flags:?<name>)/ but that would probably have to be
> extended recursively, so I'm not sure we'd want to do that.
>
> What might fit better - and as far as I can see would allow the desired
> behaviour to be achieved just as easily - would be a new flag that
> suppresses the zeroth capture.

Sorry, missed this one (was flagged spam).

An additional modifier might be good as then PCRE could use it, along
with /n, to remove all numbered matches, yet maintain compatibility. I'm
not sure of much direct benefit to Perl because it already separates
named matches into %+, unless using those modifiers prevents overwriting
$&, $1, .., $9 from an earlier operation.
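
For reference, a quick sketch of what /n does in Perl as it stands, as I understand perlre (needs 5.22 or later; the date string and the group name month are just examples). Plain parentheses stop capturing, but a named group is still numbered as well as named:

```perl
use strict;
use warnings;

# Under /n the first pair of parens only groups; the named group captures.
'2021-09' =~ /(\d+)-(?<month>\d+)/n or die "no match";

my $month       = $+{month};
my $one         = $1;                    # the named group, not the year
my $two_defined = defined $2 ? 1 : 0;    # nothing was captured as $2

print "month=$month \$1=$one \$2 defined? $two_defined\n";
```

So /n on its own does not remove the numbered entry for a named group.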

--
Cheers,
Roy