Mailing List Archive

API function: matching a string with a regexp
C API question.

I have a string in an SV, and a quoted regexp object in another SV. I
want to know (as a boolean?) if the string matches the regexp.
Optionally I suppose it would be nice if captures were made and $1 etc.
were set, but if not I don't really mind. It's just a true/false test.

I would have imagined there'd be some sort of API function with a
signature something like

bool regexp_match(SV *rx, SV *s);

but to my observation there doesn't seem to be anything like that.

Stuck for ideas, I went off to read how pp_match does it; being the
actual PP func behind Perl syntax like `$str =~ $pat`. Again I had
somehow expected that it would receive the string and the pattern as
two SVs on the stack and it would just do its thing, so maybe I could
hack it up by pushing my values to the stack and fake-calling the pp
func directly.

But ho-boy was I wrong.

The way that `$str =~ m/foo/` works is that the pattern is a
compiletime constant, so it becomes part of the OP_MATCH optree node
itself. Variable patterns, like `$str =~ $pat`, are compiled into an
OP_MATCH + OP_REGCOMP, where the regcomp takes the pattern off the
stack and "compiles" it into a regexp, which is stored *in the actual
OP_MATCH node* of the optree. On thready perls, this goes via the pad
of course.

At leont's suggestion I then went to look at how pp_smartmatch does it,
and that has its own (static) set of functions to create a temporary
PMOP out of a pattern, which it then matches against.

This doesn't feel like a very effective API, and makes it really
nontrivial to implement such a function as outlined above.

I wonder if we want to tidy this up somewhat? Perhaps we could extract
the contents of pp_match into a helper function that then doesn't need
to have this weird communication-at-a-distance via the PMOP structure
itself, and that function can then be called by pp_match,
pp_smartmatch, and generally exposed as API for others to use.

Alternatively: Does anyone know of another easier way that I might have
missed buried somewhere in the API?

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: API function: matching a string with a regexp
On Fri, 7 Jul 2023 at 13:15, Paul "LeoNerd" Evans <leonerd@leonerd.org.uk>
wrote:

> C API question.
>
> I have a string in an SV, and a quoted regexp object in another SV. I
> want to know (as a boolean?) if the string matches the regexp.
> Optionally I suppose it would be nice if captures were made and $1 etc.
> were set, but if not I don't really mind. It's just a true/false test.
>
> I would have imagined there'd be some sort of API function with a
> signature something like
>
> bool regexp_match(SV *rx, SV *s);
>
> but to my observation there doesn't seem to be anything like that.
>

Sure there is. You should read the docs on this. pod/perlreapi.pod is the
place to start. See also regexp.h line 267 for the definition of the regex
engine API.


>
> Stuck for ideas, I went off to read how pp_match does it; being the
> actual PP func behind Perl syntax like `$str =~ $pat`. Again I had
> somehow expected that it would receive the string and the pattern as
> two SVs on the stack and it would just do its thing, so maybe I could
> hack it up by pushing my values to the stack and fake-calling the pp
> func directly.
>
> But ho-boy was I wrong.
>
> The way that `$str =~ m/foo/` works is that the pattern is a
> compiletime constant, so it becomes part of the OP_MATCH optree node
> itself. Variable patterns, like `$str =~ $pat`, are compiled into an
> OP_MATCH + OP_REGCOMP, where the regcomp takes the pattern off the
> stack and "compiles" it into a regexp, which is stored *in the actual
> OP_MATCH node* of the optree. On thready perls, this goes via the pad
> of course.
>
> At leont's suggestion I then went to look at how pp_smartmatch does it,
> and that has its own (static) set of functions to create a temporary
> PMOP out of a pattern, which it then matches against.
>
> This doesn't feel like a very effective API, and makes it really
> nontrivial to implement such a function as outlined above.
>

Well, it is more complex than a simple call because of a lot of factors.
Some patterns have elements that mean you can use the "intuit" interface,
and some don't. Some patterns are executed inside of loops, and some aren't
etc. The opcodes handle this stuff properly, ensuring that where
appropriate we call CALLREGINTUIT() followed by CALLREGEXEC() and respect
various flags about the patterns being matched.


> I wonder if we want to tidy this up somewhat? Perhaps we could extract
> the contents of pp_match into a helper function that then doesn't need
> to have this weird communication-at-a-distance via the PMOP structure
> itself, and that function can then be called by pp_match,
> pp_smartmatch, and generally exposed as API for others to use.
>

We have an entire documented API for regexes and regex engines. Some of the
"weird communication-at-a-distance" is unforced historical baggage we
haven't cleaned up yet and yes sure it would be nice to see that removed,
but some is useful or necessary stuff that deals with some of the
subtleties involved with matching, such as ensuring that the right regex
engine is used (we distribute two with perl itself, and others can supply
their own as well), that intuit is used when appropriate, etc.

Perl integrates matching into its syntax in an intuitive way which means
that people don't have to do clunky python/java style "construct a compiled
regex, and then match and then inspect the matched object afterwards". But
at a low level we have a clean and documented api for allowing alternative
regex engines to be plugged in.


>
> Alternatively: Does anyone know of another easier way that I might have
> missed buried somewhere in the API?


Like I said, have a look at pod/perlreapi.pod, but a short answer is you
want to use CALLREGEXEC() maybe preceded by CALLREGINTUIT(). The
definitions for the functional interface are documented in
perlreapi.pod and in regexp.h line 267 (also in perl.h line 264).

Start with the pod tho. Karl, Hv, Dave and I have put a fair bit of effort
into documenting the regex engine in various pod files. Most of the
internals are documented in the following two pod files:

pod/perlreapi.pod
pod/perlreguts.pod

And the user facing bits are documented in

pod/perlre.pod
pod/perlreref.pod
pod/perlrebackslash.pod
pod/perlretut.pod
pod/perlrecharclass.pod
pod/perlrequick.pod

Let me know if you need more than this to move forward with things.

cheers,
Yves




--
perl -Mre=debug -e "/just|another|perl|hacker/"