Mailing List Archive

parsing perl (perly) - maintenance and evolution
Hi list,

I'm a wanna-be contributor with (almost) no clue about the perl
sources.
I'm also not a native speaker, so I may sometimes use wording that means
something different from what I intend.

I noticed a few "complaints" about perly.y (including Sawyer's talk on CoC).
Since this overlaps with my other work, I have started on some
refactoring / beautification of perly.y, beginning with pull request
#18036.

My goal is to make the whole parser (including actions) declarative, so that
the declaration can also be used to generate supporting tools (e.g. syntax
highlighting in Emacs).

I believe bison's GLR parser can help with some technical problems. For
example, it provides a trick to conditionally include or exclude an
alternative rule (using a snippet from LeoNerd's branch):

| %?{ FEATURE_FINALLY_BLOCK_IS_ENABLED } FINALLY mblock

I think this feature should allow perl5 / perl7 / perlX specific rules in
a single grammar, simplifying evolution and maintenance. The drawback is
that perl doesn't use the bison parser, only its tables, so some larger
refactoring would be needed.

So I'd like to ask for your opinions and goals first, to avoid wasting
time.

Best regards,
Brano
Re: parsing perl (perly) - maintenance and evolution
On Tue, Aug 11, 2020 at 07:21:34AM +0200, Branislav Zahradník wrote:
> I'm a wanna-be-contributor having (almost) almost no clue about perl
> sources.

Welcome!

> I noticed a few "complaints" about perly.y (including Sawyer's talk on CoC).
> As far as it correlates with my other work I started with an attempt to do
> some refactoring / beautification of perly.y, starting with pull request
> #18036.

A word of warning. Perl's code base is stable and rather old, so we are
very cautious about accepting code refactoring PRs. Unless there is a good
reason for the refactor, we prefer to leave things as they are.
Refactoring can often introduce subtle bugs, and often the diffs tend to
be large and hard to visually spot mistakes in.

So in particular we tend to avoid cosmetic fixes such as white space
changes, unless there's a good technical reason for it (such as if the
code indentation is wrong and visually misleading).

In situations where such changes are necessary, we prefer the cosmetic
changes to be in a separate commit. For example if a function needs to be
moved to different location in the source file and modified slightly, we
would expect that to be done as two commits: the first just moves the
function and makes absolutely no other changes (with the commit message
explaining that there are no changes) then the second commit makes the
modification. If both actions were in a single commit then in the diff
you would just see a bunch of '-' lines then a bunch of '+' lines, and the
modification becomes very hard to spot.

> My goal is to make the whole parser (including actions) declarative

Can you describe what you mean here in a lot more detail?

> I believe bison's GLR parser can help with some technical problems, for
> example it provides trick to conditionally include/exclude alternative
> rule, for example (using snippet from LeoNerd's branch):
>
> | %?{ FEATURE_FINALLY_BLOCK_IS_ENABLED } FINALLY mblock

I don't think that particularly helps here. If we used that syntax, then
code which used the FINALLY keyword while not within the scope of 'use
feature "finally"' would just get a generic bison 'syntax error' message.
We prefer to do stuff like this (taken from the current perly.y):

subsigguts:
        {
            .....
            if (!FEATURE_SIGNATURES_IS_ENABLED)
                Perl_croak(aTHX_ "Experimental "
                    "subroutine signatures not enabled");

which gives a more meaningful error message.
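
To make the contrast concrete, here is a minimal Python sketch (not perl's
actual code). The token names, the feature set and the FINALLY error text are
all hypothetical; it only illustrates why a predicate-gated alternative tends
to end in a generic error, while a check inside the action can name the
missing feature.

```python
class ParseError(Exception):
    """Raised by the toy parsers below."""


KNOWN_TOKENS = {"TRY", "CATCH", "FINALLY"}


def parse_with_predicate(tokens, features):
    """Predicate style: when the feature is off the FINALLY alternative
    does not exist at all, so only a generic error can be reported."""
    allowed = {"TRY", "CATCH"}
    if "finally" in features:
        allowed.add("FINALLY")
    for tok in tokens:
        if tok not in allowed:
            raise ParseError("syntax error")
    return "ok"


def parse_with_action_check(tokens, features):
    """Action style: the rule always matches, and its action checks the
    feature and reports a specific (hypothetical) message."""
    for tok in tokens:
        if tok not in KNOWN_TOKENS:
            raise ParseError("syntax error")
        if tok == "FINALLY" and "finally" not in features:
            raise ParseError("Experimental finally blocks not enabled")
    return "ok"


def error_message(parser, tokens, features):
    """Return the error text produced by a parser, or None on success."""
    try:
        parser(tokens, features)
    except ParseError as exc:
        return str(exc)
    return None
```

With the feature disabled, error_message(parse_with_predicate, ...) yields
only "syntax error", whereas the action-style parser reports which feature
is missing.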

> I think this feature should allow perl5 / perl7 / perlX specific rules in
> single grammar simplifying evolution and maintenance.

I'm not convinced that GLR will help us here.

> Drawback is, perl
> doesn't use bison parser, only it's tables, so there is some larger
> refactoring pending.

Recreating perl's custom perly.c based on a GLR perly.c then backporting
all of perl's customisations would be quite a lot of work.

Note that GLR's 'follow multiple paths' feature (i.e. the 'G' in 'GLR')
probably can't usefully be used by perl: perl's lexer is highly
context-sensitive and so wouldn't work if the parser's actions were
deferred. As a trivial example, the lexer treats the '/' as either a divide
or as the start of a pattern match depending on whether the current parser
state is XOPERATOR (expecting an operator) or XSTATE (expecting a
statement).
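
As a rough illustration of that dependency, the following Python sketch
(a toy, not toke.c) lexes a '/' differently according to an expectation
value supplied by the caller. The function name and the returned token names
are hypothetical, and the pattern scan deliberately ignores escapes.

```python
XOPERATOR = "XOPERATOR"  # an operator is expected next
XSTATE = "XSTATE"        # a statement is expected next


def lex_slash(source, pos, expect):
    """Return (token_type, text, next_pos) for the '/' at source[pos]."""
    if source[pos] != "/":
        raise ValueError("lex_slash must be called on a '/' character")
    if expect == XOPERATOR:
        # After a complete term, '/' can only be the division operator.
        return ("DIVIDE", "/", pos + 1)
    # Otherwise treat it as the opening delimiter of a pattern match and
    # scan naively to the next '/'.
    end = source.index("/", pos + 1)
    return ("MATCH", source[pos + 1:end], end + 1)
```

The same input characters produce different tokens depending on `expect`,
which is why the lexer needs the parser's state to be up to date at the
moment each token is read.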

Also, I would want to be reassured that using GLR won't incur any
performance penalty, since perl is interpreted and incurs the lexer and
parser cost for every execution.

Note that most of the complexity of perl's parsing is actually in the
lexer rather than the parser:

$ wc -l perly.y toke.c
1415 perly.y
13229 toke.c

Perl's lexer is the thing that is the biggest mess - apart from the
table-generated keyword recognition, almost every part of the lexer is
hand-crafted C code. For example it has to handle all the complexities of
double-quoted strings, e.g.:

"abc$a[i+$j]\Q\U$b[foo($c)]\Eboo";
/val=$a[$i] [abc] $/;

which the lexer converts to streams of tokens as if the input had been

'abc' . $a[i+$j] . quotemeta(uc($b[foo($c)]) . "boo");
match('val=', $a[$i], ' [abc] $');

In "abc$a[i+$j] ...", while scanning the string, the lexer has to
switch back into "parser mode" to fully parse the arbitrarily complex
$a[i+$j] expression, then when the sub-parse has finished, switch back to
scanning the rest of the string.
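
A much simplified Python sketch of that back-and-forth follows. It is an
assumption-laden toy: it recognises only `$name` optionally followed by one
bracketed subscript, finds the end of the subscript by counting brackets
instead of really sub-parsing, and ignores \Q, \U, \E and every other
escape. The function name and the "STR"/"EXPR" labels are hypothetical.

```python
import re

_VAR = re.compile(r"\$[A-Za-z_]\w*")


def split_interpolated(body):
    """Split a double-quoted string body into ("STR", text) and
    ("EXPR", text) parts."""
    parts = []
    literal = []
    i = 0
    while i < len(body):
        m = _VAR.match(body, i)
        if not m:
            # Plain character: stay in "string scanning" mode.
            literal.append(body[i])
            i += 1
            continue
        if literal:
            parts.append(("STR", "".join(literal)))
            literal = []
        j = m.end()
        if j < len(body) and body[j] == "[":
            # "Sub-parse": consume up to the matching close bracket.
            depth = 0
            while j < len(body):
                if body[j] == "[":
                    depth += 1
                elif body[j] == "]":
                    depth -= 1
                    if depth == 0:
                        j += 1
                        break
                j += 1
        parts.append(("EXPR", body[m.start():j]))
        i = j  # back to scanning the rest of the string
    if literal:
        parts.append(("STR", "".join(literal)))
    return parts
```

Even this toy has to track nesting inside the subscript; the real lexer must
cope with arbitrary expressions there, including further quoted strings.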

In an ideal world the lexer should be re-written to use some sort of
super-flex rule/table based system. But that would be a huge job.

--
In my day, we used to edit the inodes by hand. With magnets.
Re: parsing perl (perly) - maintenance and evolution
On Tue, 11 Aug 2020 at 10:17, Dave Mitchell <davem@iabyn.com> wrote:

> On Tue, Aug 11, 2020 at 07:21:34AM +0200, Branislav Zahradník wrote:
> > I'm a wanna-be-contributor having (almost) almost no clue about perl
> > sources.
>
> Welcome!
>
> > I noticed a few "complaints" about perly.y (including Sawyer's talk on
> CoC).
> > As far as it correlates with my other work I started with an attempt to
> do
> > some refactoring / beautification of perly.y, starting with pull request
> > #18036.
>
> A word of warning. Perl's code base is stable and rather old, so we are
> very cautious about accepting code refactoring PRs. Unless there is a good
> reason for the refactor, we prefer to leave things as they are.
> Refactoring can often introduce subtle bugs, and often the diffs tend to
> be large and hard to visually spot mistakes in.
>

This is nothing extraordinary in a legacy code base.
My current intention is to make perly.y and co. easier to understand (and
thus to maintain).


>
> > My goal is to make the whole parser (including actions) declarative
>
> Can you describe what you mean here in a lot more detail?
>

I'll try, though this is something I'm not good at.
My vision is to have a declarative description of the language (something
BNF-like), with generators for (given the current state of affairs):
- perly.y
- toke.c
- regen/keywords.pl
- cperl-mode.el
- ...

There should also be introspection to identify dead rules. For example, the
DOROP token (the err keyword) was removed a long time ago, but its rule
still exists.
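
One way such introspection could work, sketched in Python under the
assumption that the grammar is available as plain data: flag every
alternative that mentions a terminal the lexer never emits. The grammar
below is a toy (not perly.y), and the function name and data layout are
hypothetical.

```python
def find_dead_rules(grammar, lexer_tokens):
    """Return (lhs, alternative, missing_tokens) for every alternative
    that uses a terminal the lexer can never produce.

    grammar maps a nonterminal to a list of alternatives, each a list of
    symbols; any symbol that is not a key of grammar is a terminal.
    """
    dead = []
    for lhs, alternatives in grammar.items():
        for alt in alternatives:
            missing = [
                sym for sym in alt
                if sym not in grammar and sym not in lexer_tokens
            ]
            if missing:
                dead.append((lhs, alt, missing))
    return dead


# Toy grammar: DOROP is still referenced, but the lexer no longer emits it.
toy_grammar = {
    "termbinop": [
        ["term", "OROR", "term"],
        ["term", "DOROP", "term"],
    ],
    "term": [["WORD"]],
}
toy_lexer_tokens = {"OROR", "WORD"}
dead_rules = find_dead_rules(toy_grammar, toy_lexer_tokens)
```

A single declarative source for both the grammar and the token set would
make this kind of check cheap to run on every regen.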


>
> > I believe bison's GLR parser can help with some technical problems, for
> > example it provides trick to conditionally include/exclude alternative
> > rule, for example (using snippet from LeoNerd's branch):
> >
> > | %?{ FEATURE_FINALLY_BLOCK_IS_ENABLED } FINALLY mblock
>
> I don't think that particularly helps here. If we used that syntax, then
> code which used the FINALLY keyword while not within the scope of 'use
> feature "finally"' would just get a generic bison 'syntax error' message.
> We prefer to do stuff like this (taken from the current perly.y):
>
> subsigguts:
> {
> .....
> if (!FEATURE_SIGNATURES_IS_ENABLED)
> Perl_croak(aTHX_ "Experimental "
> "subroutine signatures not enabled");
>
>
I agree my example is trivial, if you move such logic into the lexer.
But it is only one of the problems to solve. What about the following?

method-call-operator:
%?{ FEATURE_NEW_METHOD_CALL_OPERATOR } DOT
| %?{! FEATURE_NEW_METHOD_CALL_OPERATOR} ARROW


Another example: let's say indirect method calls are removed in perl 8.
GLR will allow you to have all rules / alternatives in one grammar (so it
can parse perl 5, perl 7, and perl 8), well separated by these predicates.
You can still have such notifications: GLR will execute actions once the
proper path is identified.

> which gives a more meaningful error message.
>

> > I think this feature should allow perl5 / perl7 / perlX specific rules in
> > single grammar simplifying evolution and maintenance.
>
> I'm not convinced that GLR will help us here.
>

> > Drawback is, perl
> > doesn't use bison parser, only it's tables, so there is some larger
> > refactoring pending.
>
> Recreating perl's custom perly.c based on a GLR perly.c then backporting
> all of perl's customisations would be quite a lot of work.
>
> Note that GLR's 'follow multiple paths' feature (i.e. the 'G' in 'GLR')
> probably can't usefully be used by perl: perl's lexer is highly
> context-sensitive and so wouldn't work if the parser's actions were
> deferred. As a trivial example, the lexer treats the '/' as either a divide
> or as the start of a pattern match depending on whether the current parser
> state is XOPERATOR (expecting an operator) or XSTATE (expecting a
> statement).
>
> Also, I would want to be reassured that using GLR won't incur any
> performance penalty, since perl is interpreted and incurs the lexer and
> parser cost for every execution.
>
> Note that most of the complexity of perl's parsing is actually in the
> lexer rather than the parser:
>
> $ wc -l perly.y toke.c
> 1415 perly.y
> 13229 toke.c
>
> In an ideal world the lexer should be re-written to use some sort of
> super-flex rule/table based system. But that would be a huge job.
>

I agree with your description of the current state.

The advantage of GLR is exactly the thing you consider a disadvantage:
it can take multiple paths, and the future may have (and probably will have)
different paths.


I'm not sure I can solve all the issues right now. Maybe what perl really
needs is a different parser (IMHO the best would be a push GLR parser ...
something like Marpa?).

GLR is an idea worth discussing, given that bison is already used ...
somehow.
I'm open to participating in its implementation (but don't want to waste my
effort without consensus).



>
> --
> In my day, we used to edit the inodes by hand. With magnets.
>
Re: parsing perl (perly) - maintenance and evolution
I have summarized the changes I have in mind so far (with some examples):
https://gist.github.com/happy-barney/90de5cedee52a7855c0a780a3649cf7d
Re: parsing perl (perly) - maintenance and evolution
On Tue, Aug 11, 2020 at 4:17 AM Dave Mitchell <davem@iabyn.com> wrote:
>
> On Tue, Aug 11, 2020 at 07:21:34AM +0200, Branislav Zahradník wrote:
> > I'm a wanna-be-contributor having (almost) almost no clue about perl
> > sources.
>
> Welcome!
>
> > I noticed a few "complaints" about perly.y (including Sawyer's talk on CoC).
> > As far as it correlates with my other work I started with an attempt to do
> > some refactoring / beautification of perly.y, starting with pull request
> > #18036.
>
> A word of warning. Perl's code base is stable and rather old, so we are
> very cautious about accepting code refactoring PRs. Unless there is a good
> reason for the refactor, we prefer to leave things as they are.
> Refactoring can often introduce subtle bugs, and often the diffs tend to
> be large and hard to visually spot mistakes in.
>

Hi. So it's interesting that this has come up. I've just spent some
time inside of Perl's toke.c/perly.y for a similar reason, although I
took a different direction. Over the past few weeks I've worked on
getting an alternative perl parser[1] proof of concept working. The
module is currently able to parse a tiny subset of perl and is able to
feed perl an opcode tree using the pluggable keyword feature. The
parsing is guarded with a use statement, so it only takes over parsing
when requested, and the module so far appears to work.

I think there are some potential benefits to this module, primarily
having an interface that can both run code as well as produce a richer
data structure for interested parties, like code editors or linting
tools. I also know there's a lot of work to do; I've already dealt
with some of the great complexity in toke.c that Dave mentioned. There
is also the obvious question of how close to 100% accuracy it can get,
and the less obvious question of how accurate it needs to be in order
to be useful.

While I'm excited to work on this, I wanted to make sure that there is
interest in the idea before investing a lot of time into this module.
That is why I'm sending this email today, to ask if there are others
that think they would be interested in the final product of this
undertaking.

-Jon Gentle

[1] https://github.com/atrodo/Plywood
Re: parsing perl (perly) - maintenance and evolution
On Tue, Aug 11, 2020 at 12:46:23PM +0200, Branislav Zahradník wrote:
> I'll try, though this is something I'm not good at.
> My vision is to have declarative description of language (something bnf
> like)
> with generators of (current state of affairs):
> - perly.y
> - toke.c
> - regen/keywords.pl
> - cperl-mode.el

I think you'll find it very hard indeed to make the lexer (toke.c)
table-driven using something like flex. Don't get me wrong, I would be
happy if this were done, but I wouldn't want you to waste your time
getting 90% there and then finding that the final 10% is not possible and
so abandoning the attempt. In perl, there have been many attempts at
parsing etc. that can parse 90% of perl, but can't do the hard last
10%. To give you an example, the lexer should be able to tokenize the
following. It includes code embedded within string-context stuff which
itself includes strings.

$s =~ /
abc
\Q
$patterns[ foo(<<~STR1, <<~STR2) ]
this is string 1
STR1
this is string 2:
$foo[.
$a =~ m{
foo
(?{ $x = <<~BAR })
blah
BAR
}
]
STR2
\U $x[$y+$$z] \E
$a[$abc]
a[$abc]
\E
xyz
/x;


> I'm not sure I can solve all issues right now. Maybe what perl really needs
> is a different parser (imho the best will be push GLR parser ... something
> like marpa?).
>
> GLR is an idea to consider a talk about based on the fact that bison is
> already used ... somehow.
> I'm open to participate on its implementation (but don't want to waste my
> effort without consensus)


I am ok with moving to GLR as long as:

1) it's supported by the versions of bison likely to be found on a
developer's PC;
2) the performance and memory usage of perl's compilation phase is not
impacted;
3) the customisations in perly.c are not lost: such as the one which
produces debugging output under 'perl -Dpv'.
4) it doesn't do too much harm to blame annotation; e.g. given a code
comment somewhere, it shouldn't be too hard to look back through the
commit history to find where that line first appeared.

I am ok with moving to a more general declaration-driven parsing system
if the (1)..(4) above apply, and as long we don't get stuck half way
through. That is to say if the project gets abandoned half-way through
(e.g. because you lose interest or because it becomes technically
impossible to finish) what is left in blead should be in a consistent
state and understandable and modifiable by people who come later.

--
If life gives you lemons, you'll probably develop a citric acid allergy.
Re: parsing perl (perly) - maintenance and evolution
On Mon, 24 Aug 2020 at 13:22, Dave Mitchell <davem@iabyn.com> wrote:

> On Tue, Aug 11, 2020 at 12:46:23PM +0200, Branislav Zahradník wrote:
> > I'll try, though this is something I'm not good at.
> > My vision is to have declarative description of language (something bnf
> > like)
> > with generators of (current state of affairs):
> > - perly.y
> > - toke.c
> > - regen/keywords.pl
> > - cperl-mode.el
>
> I think you'll find it very hard indeed to make the lexer (toke.c)
>

I'm well aware of that. My goal is not to get rid of toke.c (as a layer),
but to make it declarative.


> I am ok with moving to GLR as long as:
>
> 1) it's supported by the versions of bison likely to be found on a
> developers PC;
>

The current minimum bison version (2.4) already supports GLR parsers.


> 2) the performance and memory usage of perl's compilation phase is not
> impacted;
>

GLR is slower than LALR because it will have larger tables and more state
data to store, and with the intended usage it will execute predicates every
time it goes through the rule; internally, every excluded alternative will
raise a parse error.

On the other hand, it will improve the maintainability of the parser,
especially for features that add alternative paths. Some of them can be
solved in the lexer (for example, LeoNerd's FINALLY).

Anyway, as you mentioned earlier, I may fail in the lexer part, so let's
postpone the decision.
I expect there will temporarily be two parsers for evaluation (perly.*
and perly-glr.*).

> 3) the customisations in perly.c are not lost: such as the one which
> produces debugging output under 'perl -Dpv'.
>

Ack.


> 4) it doesn't do too much harm to blame annotaion; e.g. given a code
> comment somewhere, it shouldn't be too hard to look back though the
> commit history to find where that line first appeared.
>

GLR and LALR grammars share the same format; the only difference is one
pragma.


> I am ok with moving to a more general declaration-driven parsing system
> if the (1)..(4) above apply, and as long we don't get stuck half way
> through. That is to say if the project gets abandoned half-way through
> (e.g. because you lose interest or because it becomes technically
> impossible to finish) what is be left in blead should be in a consistent
> state and understandable and modifiable by people who come later.
>

I agree with that.


>
> --
> If life gives you lemons, you'll probably develop a citric acid allergy.
>