Mailing List Archive: Revisiting trim

Re: Revisiting trim [ In reply to ]

May 28, 2021, 1:30 PM

Post #26 of 39 (784 views)

On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com> wrote:

> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>
> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of
> those, depends
>
> > Is /that/ the worst possible way ? or if not *the* worst, was there a
> better way all along ? (*)
>
> That's a very reasonable way of doing it which may very well be the
> best way (though you dropped an "s" on the second "s///").
>
> They were probably referring to a tendency of many programmers to
> obsess with trimming the left and right with a single s/// operation,
> which will result in a hairy, unreadable solution that won't perform
> any better than just doing it in two steps.
>

Is this really slower? Is this really hairier and less readable than the
two step approach?

$reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually
full-trim a reference identifier

--
"Lay off that whiskey, and let that cocaine be!" -- Johnny Cash

Re: Revisiting trim [ In reply to ]

ambs at zbr

May 28, 2021, 1:33 PM

Post #27 of 39 (784 views)

------- Original Message -------
On Friday, May 28th, 2021 at 21:30, David Nicol <davidnicol@gmail.com> wrote:

> On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com> wrote:
>
>> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>>
>>> $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends
>>
>>> Is /that/ the worst possible way ? or if not *the* worst, was there a better way all along ? (*)
>>
>> That's a very reasonable way of doing it which may very well be the
>> best way (though you dropped an "s" on the second "s///").
>>
>> They were probably referring to a tendency of many programmers to
>> obsess with trimming the left and right with a single s/// operation,
>> which will result in a hairy, unreadable solution that won't perform
>> any better than just doing it in two steps.
>
> Is this really slower? Is this really hairier and less readable than the two step approach?
>
> $reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually full-trim a reference identifier

Probably still slower, but usually I write $foo =~ s/^\s*|\s*$//g;

Re: Revisiting trim [ In reply to ]

grinnz at gmail

May 28, 2021, 1:35 PM

Post #28 of 39 (784 views)

On Fri, May 28, 2021 at 4:31 PM David Nicol <davidnicol@gmail.com> wrote:

>
>
> On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com> wrote:
>
>> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>>
>> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of
>> those, depends
>>
>> > Is /that/ the worst possible way ? or if not *the* worst, was there a
>> better way all along ? (*)
>>
>> That's a very reasonable way of doing it which may very well be the
>> best way (though you dropped an "s" on the second "s///").
>>
>> They were probably referring to a tendency of many programmers to
>> obsess with trimming the left and right with a single s/// operation,
>> which will result in a hairy, unreadable solution that won't perform
>> any better than just doing it in two steps.
>>
>
> Is this really slower? Is this really hairier and less readable than the
> two step approach?
>
> $reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually
> full-trim a reference identifier
>

Yes and (IMO) yes.

-Dan

Re: Revisiting trim [ In reply to ]

aw at ice-sa

May 28, 2021, 2:04 PM

Post #29 of 39 (784 views)

On 28.05.2021 22:33, Alberto Simões wrote:
>
> ------- Original Message -------
> On Friday, May 28th, 2021 at 21:30, David Nicol <davidnicol@gmail.com> wrote:
>
>>
>>
>> On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com
>> <mailto:doomvox@gmail.com>> wrote:
>>
>> André Warnier (tomcat/perl) <aw@ice-sa.com <mailto:aw@ice-sa.com>> wrote:
>>
>> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends
>>
>> > Is /that/ the worst possible way ? or if not *the* worst, was there a better way
>> all along ? (*)
>>
>> That's a very reasonable way of doing it which may very well be the
>> best way (though you dropped an "s" on the second "s///").
>>
>> They were probably referring to a tendency of many programmers to
>> obsess with trimming the left and right with a single s/// operation,
>> which will result in a hairy, unreadable solution that won't perform
>> any better than just doing it in two steps.
>>
>>
>> Is this really slower? Is this really hairier and less readable than the two step approach?
>>
>> $reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually full-trim a
>> reference identifier
>>
>
> Probably still slower, but usually I write $foo =~ s/^\s*|\s*$//g;
>
>
These will probably give a field day to whoever previously wrote about "the worst way
possible" .. :-)

The perl regex engine is a wonderful thing, and my interpretation may be wrong, but I
would tend to intuit that if you use captures (as in the first above) or alternatives (as
in the second) or \s* (which may mean "nothing or something", in both), it is bound to be
somewhat less efficient than if you give the regex engine something definite to look for,
like "^\s+" and "\s+$" (although this one, if the target is utf8 and it works backward
from the end, may be quite hairy too).
But whether that compensates for one assignment instead of two, I don't have a clue.

Anyway, it kind of makes the case for optimal (l|r|)trimmed() functions, to help us all
poor mere perl programmers.

I'm interested in a guru comment though.

Re: Revisiting trim [ In reply to ]

doomvox at gmail

May 28, 2021, 3:52 PM

Post #30 of 39 (784 views)

Some quick-and-dirty benchmarking, trimming 100,000 short strings:

case 1:
$line =~ s/^\s+//;
$line =~ s/\s+$//;
# real 0m1.427s

case 2:
$line =~ s/^\s*(.+?)\s*$/$1/;
# real 0m1.853s

case 3:
$line =~ s/^\s*|\s*$//g;
# real 0m2.864s

So, case 2 is 30% slower, case 3 is 100% slower.

There's a simple fix that improves case 3 quite a bit:

case 4:
$line =~ s/^\s+|\s+$//g;
# real 0m1.704s

However: I took it very easy on this case using short lines... it's
very sensitive to line length (that /g is checking every point in the
string) and it slows down by a factor of ten with lines that are only
around 80 chars long.

Anyway, these speed penalties are Not Good, but they're also not
(usually) a reason to care.
Granted I was exaggerating calling these hairy and
unreadable, but I think they're all harder to read.

(For example, with "case 3", my first thought was it was
broken and wouldn't strip trailing whitespace if it
had stripped leading whitespace, but then I noticed the /g.
And further, it's using a * instead of a +, so without the /g
it *never* strips trailing space: so there were two things
I didn't understand.)
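
The interplay of the /g flag with * versus + described above can be checked directly. The following is a minimal sketch; the helper name apply_variant and the two sample strings are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Apply one substitution variant to a copy of the input and return the result.
# The variant names are hypothetical labels for this sketch only.
sub apply_variant {
    my ($variant, $str) = @_;
    if    ($variant eq 'star_g') { $str =~ s/^\s*|\s*$//g }
    elsif ($variant eq 'star')   { $str =~ s/^\s*|\s*$//  }
    elsif ($variant eq 'plus_g') { $str =~ s/^\s+|\s+$//g }
    elsif ($variant eq 'plus')   { $str =~ s/^\s+|\s+$//  }
    return $str;
}

for my $variant (qw(star_g star plus_g plus)) {
    for my $input ("  both  ", "trailing  ") {
        printf "%-7s [%s] -> [%s]\n",
            $variant, $input, apply_variant($variant, $input);
    }
}
```

Without /g, the * form always succeeds at position 0 (possibly with an empty match), so it never gets as far as the trailing whitespace. The + form without /g strips whichever end it finds first.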

The thing you should ask yourself as a perl programmer is
"what did I think I would gain from doing this in one
line?".

The key point for the perl5-porters though is that there
is indeed a need for a built-in trim.
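
For anyone who wants to repeat the comparison of the four cases in-process rather than with the shell's time, here is a rough sketch using the core Benchmark module. The sample line and the iteration count are assumptions, not the inputs used for the numbers above, and each case pays the same small overhead for copying the string.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The four cases from the post, each working on a private copy.
my %trim = (
    case1 => sub { my $s = shift; $s =~ s/^\s+//; $s =~ s/\s+$//; $s },
    case2 => sub { my $s = shift; $s =~ s/^\s*(.+?)\s*$/$1/;      $s },
    case3 => sub { my $s = shift; $s =~ s/^\s*|\s*$//g;           $s },
    case4 => sub { my $s = shift; $s =~ s/^\s+|\s+$//g;           $s },
);

my $line = "   some short line   ";   # sample input (an assumption)

# Wrap each case so that cmpthese can call it without arguments.
my %tests;
for my $name (keys %trim) {
    my $f = $trim{$name};
    $tests{$name} = sub { $f->($line) };
}

# 200_000 iterations per case is arbitrary; raise it for steadier numbers.
cmpthese(200_000, \%tests);
```

Varying the length of $line is the interesting experiment here, since the post notes that case 4 in particular is sensitive to it.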

Re: Revisiting trim [ In reply to ]

perl5-porters at perl

May 28, 2021, 7:51 PM

Post #31 of 39 (784 views)

The following way to "trim" using "split" seems to provide a constant-time solution, not dependent on the length of the string. Although I don't know how "split" is implemented, its constancy is not surprising.

In fact, the time spent in the filthy way I'm generating the strings very quickly overtakes the time needed to run this.

# bench.sh
for NUM in $(seq 1 100);
do
STRING=$(perl -e "printf qq{ %s }, ' a b ' x $NUM")
time perl x.pl "$STRING" 2>&1 | grep real
done

# x.pl
my $foo = $ARGV[0];
my $trimmed = (split /^\s*|\s*$/, $foo)[-1];
print qq{'$trimmed'\n}; # <- commenting out provides no benefit timewise

Excerpt of output ('real' bounces between 7ms and 16ms, indicating a sensitivity to the macOS process scheduler itself, which is even more indicative of the efficiency of this solution):

real 0m0.007s
user 0m0.002s
sys 0m0.004s
'a ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba b'

real 0m0.007s
user 0m0.002s
sys 0m0.003s
'a ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba b'

real 0m0.015s
user 0m0.003s
sys 0m0.006s
'a ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba b'

real 0m0.007s
user 0m0.002s
sys 0m0.003s
'a ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba b'

real 0m0.008s
user 0m0.002s
sys 0m0.004s
'a ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba b'

real 0m0.008s
user 0m0.002s
sys 0m0.003s
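
A caveat on the measurement above: timing a whole perl process per string likely measures mostly interpreter start-up, so it may say little about the split itself. Below is a minimal sketch that times the same split in-process instead. The function name split_trim, the repeat counts, and the 'a b' input shape (guessed from the output shown) are assumptions.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Same expression as in x.pl. Taking the last field works whether or not
# the string has leading whitespace, because a non-empty match at the
# start yields an empty first field and trailing empty fields are dropped.
sub split_trim {
    my ($str) = @_;
    my $last = (split /^\s*|\s*$/, $str)[-1];
    return $last;
}

for my $num (1, 10, 100, 1000) {
    my $input = ' ' . ('a b' x $num) . ' ';   # assumed input shape
    my $t0 = [gettimeofday];
    split_trim($input) for 1 .. 10_000;
    printf "%5d repeats: %.4fs\n", $num, tv_interval($t0);
}
```

If the timings grow with $num, the approach is not constant-time after all.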

Cheers,
Brett

--
oodler@cpan.org
https://github.com/oodler577
#pdl #p5p #p7-dev #native @ irc.perl.org


Re: Revisiting trim [ In reply to ]

perl5-porters at perl

May 28, 2021, 9:45 PM

Post #32 of 39 (784 views)

On Sat, 29 May 2021 at 04:31, David Nicol <davidnicol@gmail.com> wrote:

> On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com> wrote:
>
>> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>>
>> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of
>> those, depends
>>
>> > Is /that/ the worst possible way ? or if not *the* worst, was there a
>> better way all along ? (*)
>>
>> That's a very reasonable way of doing it which may very well be the
>> best way (though you dropped an "s" on the second "s///").
>>
>> They were probably referring to a tendency of many programmers to
>> obsess with trimming the left and right with a single s/// operation,
>> which will result in a hairy, unreadable solution that won't perform
>> any better than just doing it in two steps.
>>
>
> Is this really slower? Is this really hairier and less readable than the
> two step approach?
>
> $reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually
> full-trim a reference identifier
>

It's not equivalent - you're missing the /s modifier, so a string such as
" example\ntext " would not match.

So, yes - there's a discussion about benchmarking and whether it's more
efficient, but given that the minor detail that it's doing something
*different* appears to have escaped attention, I'd suggest that this was
indeed less readable?

(also yes, I appreciate that this is $reference_identifier and earlier we
had $stripped_line, but the purpose of trimmed() is "remove whitespace from
the start and end of a string", so "similar problem domain with different
details" just muddies the discussion)
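
The difference is easy to reproduce. Here is a small sketch comparing the capture form with and without /s against the two-step form on the string mentioned above. The sub names are made up for the example.

```perl
use strict;
use warnings;

# Capture form as posted: "." does not match a newline.
sub trim_capture   { my $s = shift; $s =~ s/^\s*(.+?)\s*$/$1/;  return $s }

# Same pattern with /s, so "." also matches newlines.
sub trim_capture_s { my $s = shift; $s =~ s/^\s*(.+?)\s*$/$1/s; return $s }

# Two-step form from earlier in the thread.
sub trim_two_step  { my $s = shift; $s =~ s/^\s+//; $s =~ s/\s+$//; return $s }

# Make the newline visible when printing.
sub show { my $s = shift; $s =~ s/\n/\\n/g; return "[$s]" }

my $input = " example\ntext ";
print "capture     : ", show(trim_capture($input)),   "\n";
print "capture + /s: ", show(trim_capture_s($input)), "\n";
print "two step    : ", show(trim_two_step($input)),  "\n";
```

Without /s the pattern fails to match at all on this input, so the string comes back unchanged rather than partly trimmed.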

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 29, 2021, 12:37 AM

Post #33 of 39 (784 views)

On Fri, 28 May 2021 at 12:02, André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
> $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends

This is as correct a way to do it as you can do in perl regex. I'd
probably replace the $ with \z to be absolutely clear on my intent. I
had to go double check the behavior of $ here, where \z is
unambiguous.

The point here is that people often write this:

$stripped_line =~ s/^\s+|\s+$//g;

which causes the regex engine to perform the scan through the string
in a really inefficient way. Your splitting it into two calls avoids
the main mistake that people make.

But this question also illustrates the problem here. The regex engine
doesn't know how to go backwards. Even for the split form of the regex
the *second* regex, the one that does the rtrim() functionality, is
the problem performance wise. The regex engine will do a scan of the
whole string, every time it finds a space character it will scan
forward until it finds either a non-space, or the end of the string.
There is some cleverness in the engine to make this case not be
quadratic, but it's not far off. The run time will be proportional to
the length of the string and the number of space/non-space sequences it
contains.

So the reason to add trimmed() to the language at an optimization
level is that while it's hard to teach the regex engine to go
backwards, it's not hard to create a custom DFA or similar logic that
scans through the string from the right and finds the rightmost
non-space character in the string. For instance even doing a naïve
implementation of using the utf8-skip-backwards-one-character logic
would be O(N) where N is the number of whitespace characters at the
end of the string.

This performance issue with rtrim() I would argue supports your point,
adding trim() without rtrim() is to a certain extent a missed
opportunity. Stripping whitespace from the end of the string will
still be inefficient and difficult to read. Eg, consider I would call
myself a regex expert, but every time someone posts this pattern with
$ in it I have to double check the rules. Making people use an
inefficient and cryptic regex for a common task seems undesirable.
The cryptic argument applies for ltrim(), but that at least *is*
efficient in the regex engine.

cheers
Yves
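
For readers who also have to look up the rules for $ each time, here is a short sketch of where $ and \z differ. The sub names and sample strings are illustrative assumptions. For plain right-trimming with \s+ the two give the same result, because \s+ consumes the final newline itself. The difference shows up in a bare match and under /m.

```perl
use strict;
use warnings;

sub rtrim_dollar   { my $s = shift; $s =~ s/\s+$//;  return $s }
sub rtrim_z        { my $s = shift; $s =~ s/\s+\z//; return $s }
sub rtrim_dollar_m { my $s = shift; $s =~ s/\s+$//m; return $s }

# Make newlines visible when printing.
sub show { my $s = shift; $s =~ s/\n/\\n/g; return "[$s]" }

# $ may match just before a final newline; \z only at the very end.
print 'foo\n =~ /foo$/  : ', ("foo\n" =~ /foo$/  ? "match" : "no match"), "\n";
print 'foo\n =~ /foo\z/ : ', ("foo\n" =~ /foo\z/ ? "match" : "no match"), "\n";

# Same result for trimming, since \s+ swallows the newline.
print "rtrim with \$  : ", show(rtrim_dollar("foo \n")), "\n";
print "rtrim with \\z : ", show(rtrim_z("foo \n")),      "\n";

# Under /m, $ also matches before an embedded newline, so only the
# first line's trailing whitespace is removed.
print "rtrim with \$ and /m: ", show(rtrim_dollar_m("a \nb ")), "\n";
```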

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 29, 2021, 12:46 AM

Post #34 of 39 (784 views)

On Fri, 28 May 2021 at 22:31, David Nicol <davidnicol@gmail.com> wrote:
>
>
>
> On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com> wrote:
>>
>> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>>
>> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends
>>
>> > Is /that/ the worst possible way ? or if not *the* worst, was there a better way all along ? (*)
>>
>> That's a very reasonable way of doing it which may very well be the
>> best way (though you dropped an "s" on the second "s///").
>>
>> They were probably referring to a tendency of many programmers to
>> obsess with trimming the left and right with a single s/// operation,
>> which will result in a hairy, unreadable solution that won't perform
>> any better than just doing it in two steps.
>
>
> Is this really slower? Is this really hairier and less readable than the two step approach?
>
> $reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually full-trim a reference identifier

This avoids the killer aspect of s/^\s+|\s+$//g, but it still scales
in proportion to the length of the string and the number of space/
non-space sequences in the string. The overhead will be quite a bit
higher, and I assume you want to make the . match newlines? Consider
that this won't work the same as the other examples on a string like
" foo\nbar ".

The reality is that the regex engine is a crappy way to do this
particular task. To do it right you want to start from the right hand
side and search left, such that your performance is proportional to
the number of characters being removed. The regex engine no matter how
you slice it is going to go left to right, and is thus at best going
to be proportional to the length of the string overall.

TBH, I would not be surprised if:

chop($str) while $str=~/\s\z/;

or

1 while $str=~s/\s\z//;

is actually one of the fastest ways to do this with a regex. I
believe in these cases the regex engine does actually use the
utf8-skip-backwards macros (eg it knows how to find the position that
is K characters before the end of the string to see if they match a
space character, it does not know how to scan from the right to find
the maximal set of space characters).

So yes, frankly as someone intimate with the regex engine I would say
that this is a task that people should NOT use the regex engine for at
all. Unfortunately to do this really right as a function you need to
do it in C.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
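
The two loops above, plus a regex-free variant of the "start from the right" idea, can be sketched as follows. The sub names are invented for this example, and rtrim_scan is only a pure-Perl illustration of scanning leftwards from the end, not the C implementation being argued for.

```perl
use strict;
use warnings;

# The two hacks from the post, wrapped as functions.
sub rtrim_chop  { my $s = shift; chop($s) while $s =~ /\s\z/; return $s }
sub rtrim_subst { my $s = shift; 1 while $s =~ s/\s\z//;      return $s }

# Walk left from the end one character at a time, then cut once.
# The work done depends on the number of trailing whitespace
# characters, not on the overall length of the string.
sub rtrim_scan {
    my ($s) = @_;
    my $end = length $s;
    $end-- while $end > 0 && substr($s, $end - 1, 1) =~ /\s/;
    return substr($s, 0, $end);
}

for my $f (\&rtrim_chop, \&rtrim_subst, \&rtrim_scan) {
    printf "[%s]\n", $f->("  a b \t ");
}
```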

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 29, 2021, 12:56 AM

Post #35 of 39 (784 views)

On Sat, 29 May 2021 at 00:52, Joseph Brenner <doomvox@gmail.com> wrote:
>
> Some quick-and-dirty benchmarking, trimming 100,000 short strings:
>
> case 1:
> $line =~ s/^\s+//;
> $line =~ s/\s+$//;
> # real 0m1.427s
>
> case 2:
> $line =~ s/^\s*(.+?)\s*$/$1/;
> # real 0m1.853s
>
> case 3:
> $line =~ s/^\s*|\s*$//g;
> # real 0m2.864s
>
> So, case 2 is 30% slower, case 3 is 100% slower.
>
> There's a simple fix that improves case 3 quite a bit:
>
> case 4:
> $line =~ s/^\s+|\s+$//g;
> # real 0m1.704s
>
> However: I took it very easy on this case using short lines... it's
> very sensitive to line length (that /g is checking every point in the
> string) and it slows down by a factor of ten with lines that are only
> around 80 chars long.

THIS is the key point here. Run your benchmarks over strings of length
1, 10, 100, 1000, and include the examples I posted in another mail:

1 while $str=~s/\s\z//;
chop($str) while $str=~m/\s\z/;

Also do it on strings like this:

" ". ("x" x $length) . " ";

and also try strings constructed like this for various $k.

my $str;
for my $l (1..$k) {
$str.=" " . ("x") x $l;
}

The point here is that performance of the regex based solutions will
generally be determined by the length of the string, and the number of
space non-space sequences. My two hacks above are designed to avoid
that and stay proportional to the number of characters that are
*removed* from the string, but they will still be worse than doing it
right in C code.

Yves
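
The two input shapes described above can be wrapped in small generators so a benchmark can loop over them. This is only a sketch: the sub names padded_run and staircase are made up, and the sizes are the ones suggested in the post.

```perl
use strict;
use warnings;

# A single run of non-space characters with one space on each side:
# " " . ("x" x $length) . " "
sub padded_run {
    my ($length) = @_;
    return " " . ("x" x $length) . " ";
}

# Runs of "x" of growing length, each preceded by a space, as in the
# loop over 1..$k in the post.
sub staircase {
    my ($k) = @_;
    my $str = '';
    $str .= " " . ("x" x $_) for 1 .. $k;
    return $str;
}

for my $n (1, 10, 100, 1000) {
    printf "n=%4d  padded_run length=%d  staircase length=%d\n",
        $n, length padded_run($n), length staircase($n);
}
```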

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 29, 2021, 1:00 AM

Post #36 of 39 (784 views)

On Sat, 29 May 2021 at 09:56, demerphq <demerphq@gmail.com> wrote:
>
> On Sat, 29 May 2021 at 00:52, Joseph Brenner <doomvox@gmail.com> wrote:
> >
> > Some quick-and-dirty benchmarking, trimming 100,000 short strings:
> >
> > case 1:
> > $line =~ s/^\s+//;
> > $line =~ s/\s+$//;
> > # real 0m1.427s
> >
> > case 2:
> > $line =~ s/^\s*(.+?)\s*$/$1/;
> > # real 0m1.853s
> >
> > case 3:
> > $line =~ s/^\s*|\s*$//g;
> > # real 0m2.864s
> >
> > So, case 2 is 30% slower, case 3 is 100% slower.
> >
> > There's a simple fix that improves case 3 quite a bit:
> >
> > case 4:
> > $line =~ s/^\s+|\s+$//g;
> > # real 0m1.704s
> >
> > However: I took it very easy on this case using short lines... it's
> > very sensitive to line length (that /g is checking every point in the
> > string) and it slows down by a factor of ten with lines that are only
> > around 80 chars long.
>
> THIS is the key point here. Run your benchmarks over strings of length
> 1, 10, 100, 1000, and include the examples I posted in another mail:
>
> 1 while $str=~s/\s\z//;
> chop($str) while $str=~m/\s\z/;
>
> Also do it on strings like this:

Here I meant to say do it on sequences of space/non-space with both
the space and non-space being longer and longer. Eg:

" ". (((" " x $l1) . ("Q" x $l2)) x $l3) . " ";

You will see that most of the regex versions degrade terribly. I'm on
the wrong computer or I'd post the results, but I bet my hacks above
beat them all once the string gets over a certain size, if not hands
down.

Yves
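
Since no results were posted, here is a rough sketch of how the comparison could be run with the core Benchmark module on strings of the shape given above. The sizes, the iteration count and the names make_input, regex_plus, loop_subst and loop_chop are assumptions chosen to keep the run short. No outcome is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build " " . (((" " x $l1) . ("Q" x $l2)) x $l3) . " " as in the post.
sub make_input {
    my ($l1, $l2, $l3) = @_;
    return " " . (((" " x $l1) . ("Q" x $l2)) x $l3) . " ";
}

# Right-trim candidates, each working on a private copy.
my %rtrim = (
    regex_plus => sub { my $s = shift; $s =~ s/\s+$//;              $s },
    loop_subst => sub { my $s = shift; 1 while $s =~ s/\s\z//;      $s },
    loop_chop  => sub { my $s = shift; chop($s) while $s =~ /\s\z/; $s },
);

for my $size (5, 10, 20) {
    my $input = make_input($size, $size, $size);
    printf "l1 = l2 = l3 = %d (length %d)\n", $size, length $input;

    my %tests;
    for my $name (keys %rtrim) {
        my $f = $rtrim{$name};
        $tests{$name} = sub { $f->($input) };
    }
    cmpthese(20_000, \%tests);
}
```

Benchmark may warn about too few iterations for the smallest size. Raising the count or the sizes gives steadier numbers at the cost of a longer run.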

Re: Revisiting trim [ In reply to ]

hv at crypt

May 30, 2021, 2:09 PM

Post #37 of 39 (784 views)

Karl Williamson <public@khwilliamson.com> wrote:
:On 5/29/21 1:37 AM, demerphq wrote:
[...]
:> But this question also illustrates the problem here. The regex engine
:> doesn't know how to go backwards. [...]
:
:Maybe you and I should have a chat about what can and should be done to
:improve the matching speed of right-anchored patterns.
:
:I suppose it is theoretically possible to create reverse
:Perl_re_intuit_start() and S_find_byclass() functions, if one could wrap
:one's mind around that, though the libc support is limited. But I could
:be wrong about the feasibility and it would be more work than anyone
:would care to undertake.

FWIW, I think it is probably impossible for the general case of /pat\z/,
but for restricted cases (primarily those without captures) it might not
be so hard.

:But there are things that could be done. It had never occurred to me
:before that the hop_back functions could be called with large numbers.
:Backing up in a UTF-8 string could be improved by a factor of 8 by doing
:per-word operations. (You load a whole word. One can isolate and count
:the continuation bytes in it by some shifting/masking/ etc operations.
:Everything that isn't a continuation byte marks a character.)
:Similarly, functions like S_find_next_masked() could have a
:corresponding reversed version, though slower on UTF-8 than the forward
:because of the forward bias of UTF-8.

Yes, that sounds good.

Hugo

Re: Revisiting trim [ In reply to ]

public at khwilliamson

May 30, 2021, 2:20 PM

Post #38 of 39 (784 views)

On 5/29/21 1:37 AM, demerphq wrote:
> On Fri, 28 May 2021 at 12:02, André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>> $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends
>
> This is as correct a way to do it as you can do in perl regex. I'd
> probably replace the $ with \z to be absolutely clear on my intent. I
> had to go double check the behavior of $ here, where \z is
> unambiguous.
>
> The point here is that people often write this:
>
> $stripped_line =~ s/^\s+|\s+$//g;
>
> which causes the regex engine to perform the scan through the string
> in a really inefficient way. Your splitting it into two calls avoids
> the main mistake that people make.
>
> But this question also illustrates the problem here. The regex engine
> doesn't know how to go backwards. Even for the split form of the regex
> the *second* regex, the one that does the rtrim() functionality, is
> the problem performance wise. The regex engine will do a scan of the
> whole string, every time it finds a space character it will scan
> forward until it finds either a non-space, or the end of the string.
> There is some cleverness in the engine to make this case not be
> quadratic, but it's not far off. The run time will be proportional to
> the length of the string and the number of space/non-space sequences it
> contains.
>
> So the reason to add trimmed() to the language at an optimization
> level is that while its hard to teach the regex engine to go
> backwards, its not hard to create a custom dfa or similar logic that
> scans through the string from the right and finds the rightmost
> non-space character in the string. For instance even doing a naïve
> implementation of using the utf8-skip-backwards-one-character logic
> would be O(N) where N is the number of characters at the end of the
> string.
>
> This performance issue with rtrim() I would argue supports your point,
> adding trim() without rtrim() is to a certain extent a missed
> opportunity. Stripping whitespace from the end of the string will
> still be inefficient and difficult to read. Eg, consider I would call
> myself a regex expert, but every time someone posts this pattern with
> $ in it I have to double check the rules. Making people use an
> inefficient and cryptic regex for a common task seems undesirable.
> The cryptic argument applies for ltrim(), but that at least *is*
> efficient in the regex engine.
>

Maybe you and I should have a chat about what can and should be done to
improve the matching speed of right-anchored patterns.

I suppose it is theoretically possible to create reverse
Perl_re_intuit_start() and S_find_byclass() functions, if one could wrap
one's mind around that, though the libc support is limited. But I could
be wrong about the feasibility and it would be more work than anyone
would care to undertake.

But there are things that could be done. It had never occurred to me
before that the hop_back functions could be called with large numbers.
Backing up in a UTF-8 string could be improved by a factor of 8 by doing
per-word operations. (You load a whole word. One can isolate and count
the continuation bytes in it by some shifting/masking/ etc operations.
Everything that isn't a continuation byte marks a character.)
Similarly, functions like S_find_next_masked() could have a
corresponding reversed version, though slower on UTF-8 than the forward
because of the forward bias of UTF-8.
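
The continuation-byte observation can be illustrated in a few lines of Perl. This sketch works one byte at a time, so it shows only the counting rule (a byte of the form 10xxxxxx continues a character, anything else starts one) and not the per-word speed-up proposed above. The function name and sample string are made up for the example.

```perl
use strict;
use warnings;

# Count the characters in a string of UTF-8 encoded bytes by counting
# every byte that is NOT a continuation byte (bit pattern 10xxxxxx).
sub count_chars_in_utf8_bytes {
    my ($bytes) = @_;
    my $count = 0;
    for my $byte (unpack 'C*', $bytes) {
        $count++ unless ($byte & 0xC0) == 0x80;
    }
    return $count;
}

# "naive" with a diaeresis, a space, and U+2603 SNOWMAN: 7 characters.
my $text  = "na\x{EF}ve \x{2603}";
my $bytes = $text;
utf8::encode($bytes);    # turn the characters into UTF-8 bytes in place

printf "%d bytes, %d characters\n",
    length $bytes, count_chars_in_utf8_bytes($bytes);
```

A per-word version would apply the same mask to eight bytes at once and count the surviving bits, which is where the suggested factor of 8 would come from.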

Re: Revisiting trim [ In reply to ]

aw at ice-sa

Jun 1, 2021, 8:39 AM

Post #39 of 39 (781 views)

On 29.05.2021 00:52, Joseph Brenner wrote:
> The key point for the perl5-porters though is that there
> is indeed a need for a built-in trim.

But - just talking for me personally - if it always trims both ends unconditionally, then
at least 75% of the times I'd wish to use it, I wouldn't be able to.
Which is ok for me, I can continue to use "s/\s+$//" in those cases.
But maybe not so ok for the trees in the Amazon, the coral reefs, and the Pacific atolls.

For the sake of it, I created a naive rtrim() function in pure perl (well, using chop()),
and this already seems to run about 50% faster than the "s/\s+$//" regex;
Here it goes :

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );

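sub rtrim {
    my $str = shift;
    my $last = '';
    # chop characters off the end until one of them is not whitespace
    # (chop on an empty string returns '', which also ends the loop)
    while (1) {
        $last = chop($str);
        last unless $last =~ /\s/;
    }
    # put back the non-whitespace character that ended the loop
    $str .= $last;
    return $str;
}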

my $string1 = " a b " x 10;
#print "string1 [",length($string1),"][",$string1,"]\n";
my $string10 = ($string1 x 10) . (" " x 10);
#print "string10 [",length($string10),"][",$string10,"]\n";
my $string100 = $string10 x 10 . (" " x 100);
#print "string100 [",length($string100),"][",$string100,"]\n";

my $res = '';
my ($T0,$T1);

print "regex :\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
$res = $string1;
$res =~ s/\s+$//;
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string1 : ",sprintf("%.4f",$T1),"s\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
$res = $string10;
$res =~ s/\s+$//;
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string10 : ",sprintf("%.4f",$T1),"s\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
$res = $string100;
$res =~ s/\s+$//;
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string100 : ",sprintf("%.4f",$T1),"s\n";

print "function :\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
$res = $string1;
$res = rtrim($res);
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string1 : ",sprintf("%.4f",$T1),"s\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
$res = $string10;
$res = rtrim($res);
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string10 : ",sprintf("%.4f",$T1),"s\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
$res = $string100;
$res = rtrim($res);
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string100 : ",sprintf("%.4f",$T1),"s\n";
#print " length [",length($res),"][",$res,"]\n"; # just to check

exit;

Prints :

regex :
string1 : 0.0032s
string10 : 0.0104s
string100 : 0.0911s
function :
string1 : 0.0015s
string10 : 0.0056s
string100 : 0.0501s

(Strawberry perl 5, version 28, subversion 2 (v5.28.2) built for MSWin32-x64-multi-thread)
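
The six near-identical timing blocks in the script above could be collapsed into a loop. This is a sketch only: it reuses the rtrim() function and the three test strings from the post, while the rtrim_regex wrapper and the table layout are additions. Wrapping the regex in a sub adds call overhead that the original regex timings did not include, so the numbers are not directly comparable with those printed above.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# rtrim() as in the post.
sub rtrim {
    my $str  = shift;
    my $last = '';
    while (1) {
        $last = chop($str);
        last unless $last =~ /\s/;
    }
    $str .= $last;
    return $str;
}

# Hypothetical wrapper so both approaches share one calling convention.
sub rtrim_regex { my $s = shift; $s =~ s/\s+$//; return $s }

my $string1   = " a b " x 10;
my $string10  = ($string1 x 10) . (" " x 10);
my $string100 = $string10 x 10 . (" " x 100);

my @inputs = (
    [ string1   => $string1 ],
    [ string10  => $string10 ],
    [ string100 => $string100 ],
);
my %impl = ( regex => \&rtrim_regex, function => \&rtrim );

for my $name ('regex', 'function') {
    print "$name :\n";
    for my $pair (@inputs) {
        my ($label, $input) = @$pair;
        my $t0 = [gettimeofday];
        my $res;
        $res = $impl{$name}->($input) for 1 .. 1000;
        printf "%s : %.4fs\n", $label, tv_interval($t0);
    }
}
```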