Mailing List Archive: Revisiting trim

Revisiting trim

May 26, 2021, 1:20 PM

Post #1 of 39 (1531 views)

As a reminder, after a lot of discussion, there was broad agreement that we should add the capability for trimming a string, and we should keep it simple: trim whitespace from both ends of the string, with no parameterisation. Here "whitespace" means [:space:], so consistent Unicode semantics, whatever the internal encoding of the string.

But there was disagreement on whether it should trim in place, or return the trimmed string. Maybe we should have two functions: trim() to edit in place, and trimmed() to return the trimmed string. Rik, Sawyer, and Neil were all fans of the in-place edit.

After lots of discussion on github and p5p, the steering council also solicited input from people with a lot of experience training people and/or writing books to explain Perl.

The end result of all of that is:

• We think we should add a single function to Perl.
• It should return the trimmed string.
• It should be called "trimmed", not "trim".

This is not a mandate, but it is our recommendation.

What convinced us on the naming issue was an example from Tom Christiansen of a large company that had a similar internal discussion, and found that people were confused over how they expected it to work, until they changed the name from "trim" to "trimmed". This also fits in with one of Larry's original desires, that Perl be a readable language. The function returns a trimmed version of its input:

$x = trimmed $y; # "x is a trimmed version of y"

We think that "trim" would have been the right name for editing in place:

trim $x;
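
As a rough pure-Perl sketch of the two behaviours being contrasted here (this is not the proposed builtin: `trimmed` and `trim_in_place` below are ordinary user-level subs, the second name is made up for illustration, and `\s` under `/u` is assumed to be a close enough stand-in for the whitespace class described above):

```perl
use strict;
use warnings;

# Returns a trimmed copy; the caller's variable is left untouched.
sub trimmed {
    my ($str) = @_;
    $str =~ s/\A\s+//u;   # strip leading whitespace
    $str =~ s/\s+\z//u;   # strip trailing whitespace
    return $str;
}

# Edits its argument in place through the @_ alias (hypothetical name).
sub trim_in_place {
    $_[0] =~ s/\A\s+//u;
    $_[0] =~ s/\s+\z//u;
    return;
}

my $y = "  hello world \n";
my $x = trimmed($y);      # $x is a trimmed version of $y; $y is unchanged
trim_in_place($y);        # $y itself is now trimmed
```

With the returning form the input can be any expression; the in-place form only makes sense on a variable.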

There are only two distributions on CPAN that define a function called "trimmed", so we suspect that this will clash with less code in the wild. There are plenty of distributions that define a trim(), but not all of them perform the function we're talking about here.

It may be tempting to claim that if we called it "trim" rather than "trimmed", then it "would just work" for the majority of people with an existing trim function (whether imported from CPAN or home-brewed). But whatever a new builtin is called, existing code would not use it by default, so everyone would have to make edits to enable it, no matter what name we give it.

There's a related topic of namespaces for functions like this. We're deliberately not addressing that here, as we suspect it will fall out of the broader review of text/string processing capabilities, which we want completed before 5.36 (i.e. before trimmed would be included in a stable release anyway).

We're now interested to hear what p5p thinks of this.

Neil, Rik, Nick

Re: Revisiting trim [ In reply to ]

uri at stemsystems

May 26, 2021, 1:55 PM

Post #2 of 39 (1531 views)

On 5/26/21 4:20 PM, Neil Bowers wrote:
> As a reminder, after a lot of discussion, there was broad agreement
> that we should add the capability for trimming a string, and we should
> keep it simple: trim whitespace from both ends of the string, with no
> parameterisation. Here "whitespace" means [:space:], so consistent
> Unicode semantics, whatever the internal encoding of the string.
>
> What convinced us on the naming issue was an example from Tom
> Christiansen of a large company that had a similar internal
> discussion, and found that people were confused over how they expected
> it to work, until they changed the name from "trim" to "trimmed". This
> also fits in with one of Larry's original desires, that Perl be a
> readable language. The function returns a trimmed version of its input:
>
> $x = trimmed $y; # "x is a trimmed version of y"
>
> We think that "trim" would have been the right name for editing in place:
>
> trim $x;
>
what about trimmed using context? in a void context it trims in place.
in scalar (or non-void) it returns the trimmed string and leaves the
input unchanged. only one new function and we use context everywhere so
it is familiar.
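
For readers unfamiliar with the idea, a context-sensitive version could be sketched in pure Perl with `wantarray`. The sub name `ctx_trim` is hypothetical and this only illustrates the suggestion, not anything the PSC proposed:

```perl
use strict;
use warnings;

# Hypothetical context-sensitive trim; NOT the proposed builtin.
sub ctx_trim {
    if (defined wantarray) {
        # Scalar or list context: work on a copy and return it.
        my $copy = $_[0];
        $copy =~ s/\A\s+|\s+\z//g;
        return $copy;
    }
    # Void context: modify the caller's variable through the @_ alias.
    $_[0] =~ s/\A\s+|\s+\z//g;
    return;
}

my $in  = "  padded  ";
my $out = ctx_trim($in);   # non-void: $in unchanged, $out is trimmed
ctx_trim($in);             # void: $in is trimmed in place
```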

uri

Re: Revisiting trim [ In reply to ]

May 26, 2021, 2:13 PM

Post #3 of 39 (1531 views)

On Wednesday, May 26th, 2021 at 21:55, Uri Guttman <uri@stemsystems.com> wrote:

> what about trimmed using context? in a void context it trims in place. in scalar (or non-void) it returns the trimmed string and leaves the input unchanged. only one new function and we use context everywhere so it is familiar.

While I like the idea, we would need to make other methods to have similar behavior:
- lc / uc / ucfirst
- chomp / chop
and probably others.

My 5 cents

Re: Revisiting trim [ In reply to ]

scott at perturb

May 26, 2021, 2:16 PM

Post #4 of 39 (1531 views)

On 5/26/21 2:13 PM, Alberto Simões wrote:
>
> On Wednesday, May 26th, 2021 at 21:55, Uri Guttman
> <uri@stemsystems.com> wrote:
>
>>
>> what about trimmed using context? in a void context it trims in
>> place. in scalar (or non-void) it returns the trimmed string and
>> leaves the input unchanged. only one new function and we use context
>> everywhere so it is familiar.
>>
>>
> While I like the idea, we would need to make other methods to have
> similar behavior:
> - lc / uc / ucfirst
> - chomp / chop
> and probably others.
>
> My 5 cents

Using context for trim() has already been discussed several times and
shot down.

- Scott

Re: Revisiting trim [ In reply to ]

uri at stemsystems

May 26, 2021, 2:50 PM

Post #5 of 39 (1531 views)

On 5/26/21 5:13 PM, Alberto Simões wrote:
>
> On Wednesday, May 26th, 2021 at 21:55, Uri Guttman
> <uri@stemsystems.com> wrote:
>
>>
>> what about trimmed using context? in a void context it trims in
>> place. in scalar (or non-void) it returns the trimmed string and
>> leaves the input unchanged. only one new function and we use context
>> everywhere so it is familiar.
>>
>>
> While I like the idea, we would need to make other methods to have
> similar behavior:
> - lc / uc / ucfirst
> - chomp / chop
> and probably others.
>
>

chop/chomp already have return values so they can't be changed. the
lc/uc ones could have void context to work.

another issue is that void context needs an lvalue as an argument so it
can modify in place.
the pass results versions (in non-void context) can take any expression.
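
Both points can be seen with existing Perl. The sketch below uses a made-up helper, `strip_in_place`, to stand in for a void-context trim:

```perl
use strict;
use warnings;

# chomp already uses its return value: the number of characters removed.
my $line    = "text\n";
my $removed = chomp($line);      # $removed is 1, $line is "text"

# Hypothetical helper that edits its argument through the @_ alias.
sub strip_in_place {
    $_[0] =~ s/\A\s+|\s+\z//g;
    return;
}

my $ok = "  fine  ";
strip_in_place($ok);             # works: $ok is a modifiable variable

# Passing a literal fails at run time, because constants are read-only.
my $error = '';
eval { strip_in_place("  literal  "); 1 } or $error = $@;
# $error now begins "Modification of a read-only value attempted"
```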

uri

Re: Revisiting trim [ In reply to ]

perl5-porters at perl

May 27, 2021, 12:56 AM

Post #6 of 39 (1531 views)

------- Original Message -------
On Wednesday, May 26, 2021 3:20 PM, Neil Bowers <neilb@neilb.org> wrote:

> As a reminder, after a lot of discussion, there was broad agreement that we should add the capability for trimming a string, and we should keep it simple: trim whitespace from both ends of the string, with no parameterisation. Here "whitespace" means [:space:], so consistent Unicode semantics, whatever the internal encoding of the string.
>
> But there was disagreement on whether it should trim in place, or return the trimmed string. Maybe we should have two functions: trim() to edit in place, and trimmed() to return the trimmed string. Rik, Sawyer, and Neil were all fans of the in-place edit.
>
> After lots of discussion on github and p5p, the steering council also solicited input from people with a lot of experience training people and/or writing books to explain Perl.
>
> The end result of all of that is:
>
> - We think we should add a single function to Perl.
> - It should return the trimmed string.
> - It should be called "trimmed", not "trim".
>
> This is not a mandate, but it is our recommendation.

"trimmed" is fair, though I suspect people are going to want:

* trim also
* someone will ask for "chomped" and/or "tromp"

Please don't implement this in perl core, though. Please. Seriously.

> What convinced us on the naming issue was an example from Tom Christiansen of a large company that had a similar internal discussion, and found that people were confused over how they expected it to work, until they changed the name from "trim" to "trimmed". This also fits in with one of Larry's original desires, that Perl be a readable language. The function returns a trimmed version of its input:
>
> $x = trimmed $y; # "x is a trimmed version of y"
>
> We think that "trim" would have been the right name for editing in place:
>
> trim $x;
>
> There are only two distributions on CPAN that define a function called "trimmed", so we suspect that this will clash with less code in the wild. There are plenty of distributions that define a trim(), but not all of them perform the function we're talking about here.
>
> It may be tempting to claim that if we called it "trim" rather than "trimmed", then it "would just work" for the majority of people with an existing trim function (whether imported from CPAN or home-brewed). But whatever a new builtin is called, existing code would not use it by default, so everyone would have to make edits to enable it, no matter what name we give it.
>
> There's a related topic of namespaces for functions like this. We're deliberately not addressing that here, as we suspect it will fall out of the broader review of text/string processing capabilities, which we want completed before 5.36 (i.e. before trimmed would be included in a stable release anyway).

I think waiting on this discussion/decision is a terrible mistake. All you need is a dual-life module that exports stuff via the familiar "use feature" syntax (or better under Experimental::*). Shoving things in perl core should be the absolute last stop for proven functionality that needs to be fast or for fundamental capabilities of the perl runtime needed to support useful and interesting features that are themselves implemented at much higher levels.

Cheers,
Brett

> We're now interested to hear what p5p thinks of this.
>
> Neil, Rik, Nick

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 27, 2021, 1:57 AM

Post #7 of 39 (1531 views)

On Thu, 27 May 2021 at 09:57, mah.kitteh via perl5-porters
<perl5-porters@perl.org> wrote:
> I think waiting on this discussion/decision is a terrible mistake. All you need is a ...

Given the problems the world faces I would not say "terrible" but I
agree it is a mistake. I would suggest a different "all you need is a"
than you I think (although maybe what you mean by Experimental::*
would cover my view), but I think sorting out where things should go
is a far more important decision than "should we add trim(med) to the
core". For me if it's put in the right place where it can't cause
language conflict and is part of building an orderly and viable future
then I have no issue with it being in core. The *where* is the problem
for me.

Anyway, as for the proposal, if a trim like function is going to be
added to the standard keyword set, I think doing it as trimmed() with
the semantics outlined in this thread is at least a touch more
palatable than trim() which will definitely cause trouble all over.

But I really really appeal to those in charge these days to address
this issue of where new functional (not control) keywords go and how
it can be done in a forwards and backwards compatible way (meaning use
feature is out). I feel really strongly that a proper decision on that
subject will make all the rest of the debates on other functions much
less controversial. Eg, so we have trimmed(), when (and where) do
ltrimmed and rtrimmed get added? Do we just endlessly accrete new
keywords into the main part of the language? It just seems to create
so much unnecessary acrimony. Figure out a clean way to resolve the
forwards/backwards compatibility issue (which is pretty easy with well
chosen namespaces) and IMO almost all of the acrimony will go away.
Why should anyone care if a new speciality function gets added to a
fenced off namespace?

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Revisiting trim [ In reply to ]

walde.christian at gmail

May 27, 2021, 3:22 AM

Post #8 of 39 (1531 views)

On Wed, 26 May 2021 22:20:22 +0200, Neil Bowers <neilb@neilb.org> wrote:

> $x = trimmed $y; # "x is a trimmed version of y"
>
> We're now interested to hear what p5p thinks of this.

Sounds perfect to me. Making the functionality as accessible as possible is the goal, and that is a perfectly good name and limits the scope of changes to the minimum possible.

--
With regards,
Christian Walde

Re: Revisiting trim [ In reply to ]

scott at perturb

May 27, 2021, 9:36 AM

Post #9 of 39 (1531 views)

On 5/26/21 1:20 PM, Neil Bowers wrote:
>
> * We think we should add a single function to Perl.
> * It should return the trimmed string.
> * It should be called "trimmed", not "trim".
>
> This is not a mandate, but it is our recommendation.

As the original author of this I appreciate the clear and concise
response. Thank you PSC for continuing to meet and discuss these issues
and provide guidance.

Personally I'd prefer *trim()* instead of *trimmed()* just for
consistency with other languages:

* PHP = trim() <https://www.php.net/manual/en/function.trim.php>
* Javascript = trim()
<https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim>
* Raku = trim() <https://docs.raku.org/routine/trim>
* Go = trim()
<https://www.geeksforgeeks.org/how-to-trim-a-string-in-golang/>
* Vimscript = trim()
<https://github.com/vim/vim/commit/295ac5ab5e840af6051bed5ec9d9acc3c73445de>
* PowerShell = trim()
<https://devblogs.microsoft.com/scripting/trim-your-strings-with-powershell/>
* VBA = trim() <https://trumpexcel.com/vba-trim/>
* C# = trim()
<https://www.c-sharpcorner.com/uploadfile/mahesh/trim-string-in-C-Sharp/>
* String::Util = trim()
<https://metacpan.org/pod/String::Util#trim($string),-ltrim($string),-rtrim($string)>
* Text::Trim = trim() <https://metacpan.org/pod/Text::Trim>
* Lisp = string_trim() <http://clhs.lisp.se/Body/f_stg_tr.htm>
* Python = strip()
<https://www.journaldev.com/23625/python-trim-string-rstrip-lstrip-strip>
* Ruby = strip()
<https://ruby-doc.org/core-2.7.1/String.html#method-i-strip>

If we go with *trimmed()* we'll definitely be an outlier. Since the PSC
has agreed this is a valuable feature, and should be included in core
(bike shedding is done), the only thing left to debate before we have a
final implementation is the name.

I'd like to begin work in earnest next Monday on this feature. Can we
debate the best name here for a couple days so I can begin work on the
final feature? I have a large rebase on the PR to do, and some other
tweaking.

- Scott

Re: Revisiting trim [ In reply to ]

perl5-porters at perl

May 27, 2021, 9:44 AM

Post #10 of 39 (1531 views)

--
oodler@cpan.org
https://github.com/oodler577
#pdl #p5p #p7-dev #native @ irc.perl.org

------- Original Message -------
On Thursday, May 27, 2021 11:36 AM, Scott Baker <scott@perturb.org> wrote:

> On 5/26/21 1:20 PM, Neil Bowers wrote:
>
>> - We think we should add a single function to Perl.
>> - It should return the trimmed string.
>> - It should be called "trimmed", not "trim".
>>
>> This is not a mandate, but it is our recommendation.
>
> As the original author of this I appreciate the clear and concise response. Thank you PSC for continuing to meet and discuss these issues and provide guidance.
>
> Personally I'd prefer trim() instead of trimmed() just for consistency with other languages:
>
> - PHP = [trim()](https://www.php.net/manual/en/function.trim.php)
>
> - Javascript = [trim()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim)
>
> - Raku = [trim()](https://docs.raku.org/routine/trim)
>
> - Go = [trim()](https://www.geeksforgeeks.org/how-to-trim-a-string-in-golang/)
>
> - Vimscript = [trim()](https://github.com/vim/vim/commit/295ac5ab5e840af6051bed5ec9d9acc3c73445de)
>
> - PowerShell = [trim()](https://devblogs.microsoft.com/scripting/trim-your-strings-with-powershell/)
>
> - VBA = [trim()](https://trumpexcel.com/vba-trim/)
>
> - C# = [trim()](https://www.c-sharpcorner.com/uploadfile/mahesh/trim-string-in-C-Sharp/)
>
> - String::Util = [trim()](https://metacpan.org/pod/String::Util#trim($string),-ltrim($string),-rtrim($string))
>
> - Text::Trim = [trim()](https://metacpan.org/pod/Text::Trim)
>
> - Lisp = [string_trim()](http://clhs.lisp.se/Body/f_stg_tr.htm)
>
> - Python = [strip()](https://www.journaldev.com/23625/python-trim-string-rstrip-lstrip-strip)
>
> - Ruby = [strip()](https://ruby-doc.org/core-2.7.1/String.html#method-i-strip)
>
> If we go with trimmed() we'll definitely be an outlier. Since the PSC has agreed this is a valuable feature, and should be included in core (bike shedding is done), the only thing left to debate before we have a final implementation is the name.

a. being an outlier is an indicator that you're in the lead, generally; PHP, e.g., has a whole set of functions that are interfaces to PCRE... though I forget what the "P" in that means...

b. "be included in core (bike shedding is done)" - WAY too early to state this, and it's therefore ambiguous what the first part of this statement even refers to or means

> I'd like to begin work in earnest next Monday on this feature. Can we debate the best name here for a couple days so I can begin work on the final feature? I have a large rebase on the PR to do, and some other tweaking.

This is quite presumptuous. There has been no conversation on where to place this. It's very concerning to me that there has also been very little discussion about "where" to place this "single" (yeah right) core feature. At this point, and mainly due to the pressure and rush being applied to this, my general concern as I said last night is not necessarily "trim" as the POC is currently implemented; but what comes after "trim" and how it's handled - string related or not. So what's the rush? No rush exists other than the proof of concept work facing potential bit rot. That's not really perl's problem.

Brett

> - Scott

Re: Revisiting trim [ In reply to ]

May 27, 2021, 9:59 AM

Post #11 of 39 (1531 views)

On Thu, 27 May 2021 16:44:42 +0000
"mah.kitteh via perl5-porters" <perl5-porters@perl.org> wrote:

> This is quite presumptuous. There has been no conversation on where to place this. It's very concerning to me that there has also been very little discussion about "where" to place this "single" (yeah right) core feature. At this point, and mainly due to the pressure and rush being applied to this, my general concern as I said last night is not necessarily "trim" as the POC is currently implemented; but what comes after "trim" and how it's handled - string related or not. So what's the rush? No rush exists other than the proof of concept work facing potential bit rot. That's not really perl's problem.

There was a lot of conversation. There are literally *hundreds* of posts
about trim on p5p and github. The discussion has been going for almost a
year now.

https://github.com/Perl/perl5/issues/17952
https://github.com/Perl/perl5/pull/17999

In chronological order:

https://www.nntp.perl.org/group/perl.perl5.porters/2020/07/msg258058.html
https://www.nntp.perl.org/group/perl.perl5.porters/2020/11/msg258544.html
https://www.nntp.perl.org/group/perl.perl5.porters/2021/02/msg259118.html
https://www.nntp.perl.org/group/perl.perl5.porters/2021/03/msg259427.html
https://www.nntp.perl.org/group/perl.perl5.porters/2021/03/msg259615.html

Re: Revisiting trim [ In reply to ]

perl5-porters at perl

May 27, 2021, 10:09 AM

Post #12 of 39 (1531 views)

------- Original Message -------
On Thursday, May 27, 2021 11:59 AM, Tomasz Konojacki <me@xenu.pl> wrote:

> On Thu, 27 May 2021 16:44:42 +0000
> "mah.kitteh via perl5-porters" perl5-porters@perl.org wrote:
>
> > This is quite presumptuous. There has been no conversation on where to place this. It's very concerning to me that there has also been very little discussion about "where" to place this "single" (yeah right) core feature. At this point, and mainly due to the pressure and rush being applied to this, my general concern as I said last night is not necessarily "trim" as the POC is currently implemented; but what comes after "trim" and how it's handled - string related or not. So what's the rush? No rush exists other than the proof of concept work facing potential bit rot. That's not really perl's problem.
>
> There was a lot of conversation. There are literally *hundreds* of posts
> about trim on p5p and github. The discussion has been going for almost a
> year now.
>
> https://github.com/Perl/perl5/issues/17952
> https://github.com/Perl/perl5/pull/17999
>
> In chronological order:
>
> https://www.nntp.perl.org/group/perl.perl5.porters/2020/07/msg258058.html
> https://www.nntp.perl.org/group/perl.perl5.porters/2020/11/msg258544.html
> https://www.nntp.perl.org/group/perl.perl5.porters/2021/02/msg259118.html
> https://www.nntp.perl.org/group/perl.perl5.porters/2021/03/msg259427.html
> https://www.nntp.perl.org/group/perl.perl5.porters/2021/03/msg259615.html

I am aware of those, but I appreciate you taking the time to provide the links.

What I can't seem to find is the conversation on why it needs to be implemented at such a low level. If I understood this particular piece with some clarity then I'd be happy to never post about "trim" again.

Cheers,
Brett

Re: Revisiting trim [ In reply to ]

leonerd at leonerd

May 27, 2021, 10:13 AM

Post #13 of 39 (1531 views)

On Thu, 27 May 2021 17:09:34 +0000
"mah.kitteh via perl5-porters" <perl5-porters@perl.org> wrote:

> What I can't seem to find is the conversation on why it needs to be
> implemented at such a low level. If I understood this particular
> piece with some clarity then I'd be happy to never post about "trim"
> again.

It doesn't. It'd be great if core perl had a _much_ lighter-weight
mechanism for doing all of this. If maybe someone could write a ~10line
C function to attach to e.g. `string::trim` in a way that doesn't
require a _huge_ disturbance to keywords and parsing and opcodes and
everything else, then we could use that same mechanism to apply a huge
number more core functions under namespaces like string::, math::,
scalar:: and so on, and have a lot more useful utility functions
around, without all this heavyweight stuff.

That mechanism doesn't exist.

Yet.

Want to write it? ;)
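
To make the namespace idea concrete at the Perl level only: the sketch below uses a hypothetical package, `My::String`, to show that a fully qualified function does not collide with a `trim` the caller already has. The mechanism described above would be implemented in C and does not exist; this is merely an analogy.

```perl
use strict;
use warnings;

# Hypothetical namespace standing in for something like string::trim.
package My::String {
    sub trim {
        my ($str) = @_;
        $str =~ s/\A\s+|\s+\z//g;
        return $str;
    }
}

# A trim() already defined in the caller's package is unaffected.
sub trim { return "local trim: $_[0]" }

my $namespaced = My::String::trim("  spaced  ");   # fully qualified call
my $own        = trim("  spaced  ");               # the caller's own sub
```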

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Re: Revisiting trim [ In reply to ]

perl5-porters at perl

May 27, 2021, 10:16 AM

Post #14 of 39 (1531 views)

------- Original Message -------
On Thursday, May 27, 2021 12:13 PM, Paul "LeoNerd" Evans <leonerd@leonerd.org.uk> wrote:

> On Thu, 27 May 2021 17:09:34 +0000
> "mah.kitteh via perl5-porters" perl5-porters@perl.org wrote:
>
> > What I can't seem to find is the conversation on why it needs to be
> > implemented at such a low level. If I understood this particular
> > piece with some clarity then I'd be happy to never post about "trim"
> > again.
>
> It doesn't. It'd be great if core perl had a _much_ lighter-weight
> mechanism for doing all of this. If maybe someone could write a ~10line
> C function to attach to e.g. `string::trim` in a way that doesn't
> require a huge disturbance to keywords and parsing and opcodes and
> everything else, then we could use that same mechanism to apply a huge
> number more core functions under namespaces like string::, math::,
> scalar:: and so on, and have a lot more useful utility functions
> around, without all this heavyweight stuff.

Thank you, I can get behind this.

>
> That mechanism doesn't exist.
>
> Yet.
>
> Want to write it? ;)

No.

Cheers,
Brett

>
> --
>
> Paul "LeoNerd" Evans
>
> leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
> http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Re: Revisiting trim [ In reply to ]

davidnicol at gmail

May 27, 2021, 11:00 AM

Post #15 of 39 (1531 views)

I'm opposed to extending core, unless everything in for instance List::Util
is going to start being in there too. What's wrong with including some kind
of standard String::Util that could have all these things that can easily
be written with a single s/// ?

trim
rtrim
ltrim
concat
extend
...

and everything would operate in-place in void context and return copies
otherwise?
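
For reference, the one-sided variants really are a single substitution each. A minimal sketch, with `ltrim` and `rtrim` as plain user-level subs (not an existing standard module) and the `/r` flag from Perl 5.14 returning a copy:

```perl
use strict;
use warnings;

# Each helper is one s///; /r returns the modified copy and leaves
# the argument untouched.
sub ltrim { return $_[0] =~ s/\A\s+//r }
sub rtrim { return $_[0] =~ s/\s+\z//r }

my $raw   = "\t indented line \n";
my $left  = ltrim($raw);   # "indented line \n"
my $right = rtrim($raw);   # "\t indented line"
```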

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 27, 2021, 12:13 PM

Post #16 of 39 (1531 views)

On Thu, 27 May 2021, 19:10 mah.kitteh via perl5-porters, <
perl5-porters@perl.org> wrote:

> ------- Original Message -------
> On Thursday, May 27, 2021 11:59 AM, Tomasz Konojacki <me@xenu.pl> wrote:
>
> > On Thu, 27 May 2021 16:44:42 +0000
> > "mah.kitteh via perl5-porters" perl5-porters@perl.org wrote:
> >
> > > This is quite presumptuous. There has been no conversation on where to
> > > place this. It's very concerning to me that there has also been very little
> > > discussion about "where" to place this "single" (yeah right) core feature.
> > > At this point, and mainly due to the pressure and rush being applied to
> > > this, my general concern as I said last night is not necessarily "trim" as
> > > the POC is currently implemented; but what comes after "trim" and how it's
> > > handled - string related or not. So what's the rush? No rush exists other
> > > than the proof of concept work facing potential bit rot. That's not really
> > > perl's problem.
> >
> > There was a lot of conversation. There are literally *hundreds* of posts
> > about trim on p5p and github. The discussion has been going for almost a
> > year now.
> >
> > https://github.com/Perl/perl5/issues/17952
> > https://github.com/Perl/perl5/pull/17999
> >
> > In chronological order:
> >
> > https://www.nntp.perl.org/group/perl.perl5.porters/2020/07/msg258058.html
> > https://www.nntp.perl.org/group/perl.perl5.porters/2020/11/msg258544.html
> > https://www.nntp.perl.org/group/perl.perl5.porters/2021/02/msg259118.html
> > https://www.nntp.perl.org/group/perl.perl5.porters/2021/03/msg259427.html
> > https://www.nntp.perl.org/group/perl.perl5.porters/2021/03/msg259615.html
>
> I am aware of those, but I appreciate you taking the time to provide the
> links.
>
> What I can't seem to find is the conversation on why it needs to be
> implemented at such a low level. If I understood this particular piece with
> some clarity then I'd be happy to never post about "trim" again.
>

If you mean "why does this warrant C level implementation" then there are a
couple of answers, the simplest one being that the particular type of regex
engine we use doesn't deal with this type of pattern well. A more complex
version would be it is not a DFA and does not know how to match utf8
backwards and it is non trivial to teach it to do so. And people tend to
write the worst possible regexen to do it anyway. The end result is that
trimming strings can be a surprisingly expensive task if not done artfully,
and the code to do it is pretty cryptic so having a function really helps
performance and code clarity.
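
To illustrate the kind of hand-written code in question, here are three idioms commonly seen in the wild (the sub names are illustrative). They return the same result, but how much work the regex engine does for each can differ considerably depending on the input; no timings are claimed here.

```perl
use strict;
use warnings;

# Two anchored substitutions.
sub trim_two_pass {
    my ($s) = @_;
    $s =~ s/\A\s+//;
    $s =~ s/\s+\z//;
    return $s;
}

# One alternation with /g.
sub trim_alternation {
    my ($s) = @_;
    $s =~ s/\A\s+|\s+\z//g;
    return $s;
}

# Capture-and-replace; the lazy .*? forces a retry at every position.
sub trim_capture {
    my ($s) = @_;
    $s =~ s/\A\s*(.*?)\s*\z/$1/s;
    return $s;
}

my $input   = "  some   text  with   gaps \n";
my @results = (
    trim_two_pass($input),
    trim_alternation($input),
    trim_capture($input),
);
# All three entries of @results hold the same trimmed string.
```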

Having said that, making the function return a result and not do inplace
edit is a massive speed penalty and will likely mean that those using
custom xs already to do this (my workplace) won't migrate. At least for us
the point is to do it quickly, not to do it in a more self explanatory way.

Anyway, I just wanted to point out that doing trim properly in perl with
its bifocal strings and taking account of utf8 and unicode whitespace rules
is not quite as trivial as it might sound.
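
A small demonstration of the encoding dependence mentioned here, using U+00A0 NO-BREAK SPACE. The explicit `/d` and `/u` modifiers are used only to make the two rule sets visible; this is a sketch of the general issue, not of how the proposed function is implemented.

```perl
use strict;
use warnings;

# U+00A0 NO-BREAK SPACE followed by text, as a native 8-bit string.
my $native = "\xA0text";

# The same characters, but stored internally as UTF-8.
my $upgraded = $native;
utf8::upgrade($upgraded);

# Under legacy /d rules the answer depends on the internal encoding.
my $d_native   = ($native   =~ /\A\s/d) ? 1 : 0;   # 0
my $d_upgraded = ($upgraded =~ /\A\s/d) ? 1 : 0;   # 1

# Under /u (Unicode rules) both representations behave the same.
my $u_native   = ($native   =~ /\A\s/u) ? 1 : 0;   # 1
my $u_upgraded = ($upgraded =~ /\A\s/u) ? 1 : 0;   # 1
```

The two strings compare equal with `eq`, yet a naive `s/^\s+//` can treat them differently.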

Yves

>

Re: Revisiting trim [ In reply to ]

leonerd at leonerd

May 27, 2021, 1:16 PM

Post #17 of 39 (1531 views)

On Thu, 27 May 2021 21:13:35 +0200
demerphq <demerphq@gmail.com> wrote:

> Having said that, making the function return a result and not do
> inplace edit is a massive speed penalty and will likely mean that
> those using custom xs already to do this (my workplace) won't
> migrate. At least for us the point is to do it quickly, not to do it
> in a more self explanatory way.

The implementation already detects if target SV == source SV, and edits
in-place if that is the case.

$str = trim $str;

will be an inplace edit.

Don't conflate "the user must write `trim $str` as a mutating keyword"
with "the implementation will mutate an existing SV inplace".
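
As a pure-Perl analogy only (the real check happens in C inside the op, and `trim_into` is a made-up helper taking references to the destination and source scalars), the distinction looks roughly like this:

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Hypothetical helper: trims the source into the destination, editing
# directly when both references point at the same scalar.
sub trim_into {
    my ($dst_ref, $src_ref) = @_;
    if (refaddr($dst_ref) == refaddr($src_ref)) {
        # Same scalar on both sides: edit it directly, no copy.
        $$dst_ref =~ s/\A\s+|\s+\z//g;
        return 'in-place';
    }
    # Different scalars: copy first, then trim the copy.
    ($$dst_ref = $$src_ref) =~ s/\A\s+|\s+\z//g;
    return 'copied';
}

my $str      = "  value  ";
my $how_same = trim_into(\$str, \$str);        # like: $str = trim $str;

my $other    = "  value  ";
my $result;
my $how_diff = trim_into(\$result, \$other);   # like: $result = trim $other;
```

The caller writes the same returning form in both cases; only the work done behind it differs.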

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Re: Revisiting trim [ In reply to ]

benkasminbullock at gmail

May 27, 2021, 2:54 PM

Post #18 of 39 (1531 views)

On Fri, 28 May 2021 at 01:36, Scott Baker <scott@perturb.org> wrote:

> Personally I'd prefer *trim()* instead of *trimmed()* just for
> consistency with other languages:
>
> - PHP = trim() <https://www.php.net/manual/en/function.trim.php>
> - Javascript = trim()
> <https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim>
>
In Javascript it's actually a method on a string so " string ".trim()

https://www.w3schools.com/jsref/jsref_trim_string.asp

>
> - Raku = trim() <https://docs.raku.org/routine/trim>
> - Go = trim()
> <https://www.geeksforgeeks.org/how-to-trim-a-string-in-golang/>
>
There is strings.Trim but it is not equivalent. The equivalent to your
Perl proposal is strings.TrimSpace:

https://golang.org/pkg/strings/#TrimSpace

>
> - Vimscript = trim()
> <https://github.com/vim/vim/commit/295ac5ab5e840af6051bed5ec9d9acc3c73445de>
> - PowerShell = trim()
> <https://devblogs.microsoft.com/scripting/trim-your-strings-with-powershell/>
> - VBA = trim() <https://trumpexcel.com/vba-trim/>
>
It's actually called Trim (with a capital letter) in VBA. There is also
LTrim and RTrim.

>
> - C# = trim()
> <https://www.c-sharpcorner.com/uploadfile/mahesh/trim-string-in-C-Sharp/>
> - String::Util = trim()
> <https://metacpan.org/pod/String::Util#trim($string),-ltrim($string),-rtrim($string)>
> - Text::Trim = trim() <https://metacpan.org/pod/Text::Trim>
> - Lisp = string_trim() <http://clhs.lisp.se/Body/f_stg_tr.htm>
>
>
Lisp doesn't usually use underscores to separate words and it doesn't use
final brackets for arguments to functions either so this cannot be right.

>
> - Python = strip()
> <https://www.journaldev.com/23625/python-trim-string-rstrip-lstrip-strip>
> - Ruby = strip()
> <https://ruby-doc.org/core-2.7.1/String.html#method-i-strip>
>
This does scotch the argument that "trim" is the most familiar form for
other programming languages.

>
>
> If we go with *trimmed()* we'll definitely be an outlier.
>
Perl is already an outlier, who else uses "next" and "last" instead of
"break" and "continue", or uses -> for members rather than .?

> Since the PSC has agreed this is a valuable feature, and should be
> included in core (bike shedding is done), the only thing left to debate
> before we have a final implementation is the name.
>
To avoid over-lengthy discussions, in my opinion the person who does the
implementation should basically have the right to choose the name at the
experimental stage. Then, if the name causes a problem in practice, it can
be changed. But a lot of these discussions on this mailing list have
involved imaginary people with imagined problems.

> I'd like to being work in earnest next Monday on this feature. Can we
> debate the best name here for a couple days so I can begin work on the
> final feature? I have a large rebase on the PR to do, and some other
> tweaking.
>
The very best possible name for this function is "trim". Or whatever you
want to call it.

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 28, 2021, 12:26 AM

Post #19 of 39 (1531 views)

On Thu, 27 May 2021 at 22:17, Paul "LeoNerd" Evans
<leonerd@leonerd.org.uk> wrote:
>
> On Thu, 27 May 2021 21:13:35 +0200
> demerphq <demerphq@gmail.com> wrote:
>
> > Having said that, making the function return a result and not do
> > inplace edit is a massive speed penalty and will likely mean that
> > those using custom xs already to do this (my workplace) won't
> > migrate. At least for us the point is to do it quickly, not to do it
> > in a more self explanatory way.
>
> The implementation already detects if target SV == source SV, and edits
> in-place if that is the case.
>
> $str = trim $str;
>
> will be an inplace edit.
>
> Don't conflate "the user must write `trim $str` as a mutating keyword"
> with "the implementation will mutate an existing SV inplace".

Ah, so that means this implementation is hairier than it would need
to be if the argument were modified in place without this type of
detection. It also explains one of your other comments that didn't make
sense to me.

Thanks,

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Revisiting trim [ In reply to ]

May 28, 2021, 3:01 AM

Post #20 of 39 (1531 views)

Hi.
As a long-time perl *applications* programmer, I'd like to contribute a couple of things
and ask a question.

1) maybe 50% of the usage of perl I've had over the last 30 years (and probably 95% of the
CPU time used with it over that same time) has consisted of processing text (historically
Terabytes of it, and still Gigabytes of it every day) in more or less complex ways,
something which perl has always been particularly good at. "Good" being understood as "you
can do anything with it" and "fast".

2) if there would be a trim() (or trimmed()) function directly in the base language, it
would be welcome, not only for its functionality itself, but as a way to avoid those
ever-recurring nagging comments from non-perl people about how "unnecessarily
complicated/clumsy" this looks in perl, as compared to all these "more modern"
languages where it is built-in. (So, see this at least in part as a little drop in the
general bucket of avoiding things which could discourage new potential perl aficionados).

3) many many times when processing textual data, it is convenient and/or necessary to
strip *trailing* spaces, /without/ stripping *leading* spaces. Trailing spaces are
generally not significant and mostly use up disk/memory space unnecessarily.
But leading spaces often fulfill some need for alignment or syntax, and should not always
be stripped. Thus, if a single trimmed() function was provided, which always trims both
sides, it would in my view be insufficient, make its usage quite conditional, and even
sometimes make the deciphering of code (written by someone else) more difficult.
(Like : did they *know* that it trims both sides ? or was that a typo ?). And it would
still leave the "trim only trailing spaces" functionality to be expressed differently,
which sounds a bit awkward, even if it quite fits the basic TIMTOWTDI perl philosophy.
In other words, I would strongly favor either 3 functions (trimmed, rtrimmed, ltrimmed) or
trimmed($subject{,"L(eft)"|"B(oth)"|"R(ight)"}), with the default being Both.
(which kind of suggests 1|0|-1 instead as a 2nd optional argument, a bit like substr() and
co. where "-1" tends to mean "start from the end backwards", no ?)
(And maybe ltrimmed and rtrimmed can just be internal "aliases" to trimmed)

4) due to the expectations of vintage perl programmers as regards perl's
text-processing prowess (see above), *if* such function(s) were to be provided, one would
expect it/them to be at least as fast as the best ("unnecessarily complicated/clumsy
looking") regex achieving the same thing.
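The one-function-with-a-side-argument variant from point 3 could look roughly like the sketch below. It is pure Perl and for illustration only: the name trimmed_side and the 'L'/'R'/'B' codes are hypothetical, not part of any proposed builtin, and it makes no claim about speed.

```perl
use strict;
use warnings;

# Hypothetical helper, not a Perl builtin: returns a trimmed copy of $str.
# $side is 'L' (leading only), 'R' (trailing only) or 'B' (both, the default).
sub trimmed_side {
    my ($str, $side) = @_;
    $side //= 'B';
    die "side must be L, R or B\n" unless $side =~ /\A[LRB]\z/;
    $str =~ s/\A\s+// if $side ne 'R';   # strip leading whitespace
    $str =~ s/\s+\z// if $side ne 'L';   # strip trailing whitespace
    return $str;
}

my $line = "   indented value   ";
printf "[%s]\n", trimmed_side($line);        # both ends trimmed
printf "[%s]\n", trimmed_side($line, 'R');   # leading indent kept
printf "[%s]\n", trimmed_side($line, 'L');   # trailing spaces kept
```

Because the argument is copied into $str, the caller's variable is left untouched, which matches the "return the trimmed string" behaviour discussed in this thread.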

And finally, the question : several times in this discussion I have read that, left to
their own devices currently (meaning with regexp), naive perl programmers do it "in the
worst way possible".
Now which way is that ?
I admit that for 30+ years, I have been doing this without much thinking about it (once I
got over my initial wonder 30 years ago at there not being a trim() function) :

my $line = <>; # e.g.
my $stripped_line = $line; # keep the original as is, work on a copy
$stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends

Is /that/ the worst possible way ? or if not *the* worst, was there a better way all along
? (*)

(I should probably add that in 30 years, I have probably not written a single perl program
where some form of the above trimming did not happen).

(*) if yes, knowing this from the beginning would probably have helped avoiding the
current climate crisis

On 28.05.2021 09:26, demerphq wrote:
> On Thu, 27 May 2021 at 22:17, Paul "LeoNerd" Evans
> <leonerd@leonerd.org.uk> wrote:
>>
>> On Thu, 27 May 2021 21:13:35 +0200
>> demerphq <demerphq@gmail.com> wrote:
>>
>>> Having said that, making the function return a result and not do
>>> inplace edit is a massive speed penalty and will likely mean that
>>> those using custom xs already to do this (my workplace) won't
>>> migrate. At least for us the point is to do it quickly, not to do it
>>> in a more self explanatory way.
>>
>> The implementation already detects if target SV == source SV, and edits
>> in-place if that is the case.
>>
>> $str = trim $str;
>>
>> will be an inplace edit.
>>
>> Don't conflate "the user must write `trim $str` as a mutating keyword"
>> with "the implementation will mutate an existing SV inplace".
>
> Ah, so that would be this implementation is hairier than it would need
> to be if the argument was modified in place without this type of
> detection, it also explains one of your other comments that didnt make
> sense to me.
>
> Thanks,
>
> Yves
>
>

Re: Revisiting trim [ In reply to ]

doomvox at gmail

May 28, 2021, 9:25 AM

Post #21 of 39 (1531 views)

André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:

> $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends

> Is /that/ the worst possible way ? or if not *the* worst, was there a better way all along ? (*)

That's a very reasonable way of doing it which may very well be the
best way (though you dropped an "s" on the second "s///").

They were probably referring to a tendency of many programmers to
obsess with trimming the left and right with a single s/// operation,
which will result in a hairy, unreadable solution that won't perform
any better than just doing it in two steps.

I've no strong feelings on the "trim" discussion, but I think you
argue well that the "rtrim" case is pretty common.

I think tchrist probably has a point about the clarity of "trimmed",
but I suspect if it'd been up to Larry Wall, he'd have gone with the
shortest form. For some reason "trim", "trim('R')" and "trim('L')"
seem perlish to me (though I gather "parameterization" is supposed to
be off the table at this point, so an R/L argument would be
controversial, too).

I see that in Raku, the routines are called "trim", "trim-leading" and
"trim-trailing". (None of these trim in-place, to do that you'd use
this idiom: "$line.=trim;").

My apologies if it seems like we're re-opening old discussions at this
point, but it's a problem in these debates that there's no easy way to
review what's already been talked to death.

Re: Revisiting trim [ In reply to ]

grinnz at gmail

May 28, 2021, 9:31 AM

Post #22 of 39 (1531 views)

On Fri, May 28, 2021 at 12:26 PM Joseph Brenner <doomvox@gmail.com> wrote:

> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>
> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of
> those, depends
>
> > Is /that/ the worst possible way ? or if not *the* worst, was there a
> better way all along ? (*)
>
> That's a very reasonable way of doing it which may very well be the
> best way (though you dropped an "s" on the second "s///").
>
> They were probably referring to a tendency of many programmers to
> obsess with trimming the left and right with a single s/// operation,
> which will result in a hairy, unreadable solution that won't perform
> any better than just doing it in two steps.
>
> I've no strong feelings on the "trim" discussion, but I think you
> argue well that the "rtrim" case is pretty common.
>
> I think tchrist probably has a point about the clarity of "trimmed",
> but I suspect if it'd been up to Larry Wall, he'd have gone with the
> shortest form. For some reason "trim", "trim('R')" and "trim('L')"
> seem perlish to me (though I gather "parameterization" is supposed to
> be off the table at this point, so an R/L argument would be
> controversial, too).
>
> I see that in Raku, the routines are called "trim", "trim-leading" and
> "trim-trailing". (None of these trim in-place, to do that you'd use
> this idiom: "$line.=trim;").
>
> My apologies if it seems like we're re-opening old discussions at this
> point, but it's a problem in these debates that there's no easy way to
> review what's already been talked to death.
>

My two cents on the parameterized trims:

1) trim-right and trim-left are certainly reasonable use cases, *however*
they are not as common a need across CPAN and general code.

2) The Perlish way is to add an option rather than similar functions with
slightly different names.

3) Such an option or additional functions can be added later; even possibly
during the two-year experimental window of the feature.

-Dan

Re: Revisiting trim [ In reply to ]

perl5-porters at perl

May 28, 2021, 9:43 AM

Post #23 of 39 (1531 views)

------- Original Message -------
On Friday, May 28, 2021 11:25 AM, Joseph Brenner <doomvox@gmail.com> wrote:

> André Warnier (tomcat/perl) aw@ice-sa.com wrote:
>
> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends
>
> > Is /that/ the worst possible way ? or if not the worst, was there a better way all along ? (*)
>
> That's a very reasonable way of doing it which may very well be the
> best way (though you dropped an "s" on the second "s///").
>
> They were probably referring to a tendency of many programmers to
> obsess with trimming the left and right with a single s/// operation,
> which will result in a hairy, unreadable solution that won't perform
> any better than just doing it in two steps.

This is a good and generally applicable point for a lot of things; it goes to the heart of "premature optimization is the root of all evil*"...

* except for 3% of the time when it's a trivial optimization

>
> I've no strong feelings on the "trim" discussion, but I think you
> argue well that the "rtrim" case is pretty common.
>
> I think tchrist probably has a point about the clarity of "trimmed",
> but I suspect if it'd been up to Larry Wall, he'd have gone with the
> shortest form. For some reason "trim", "trim('R')" and "trim('L')"
> seem perlish to me (though I gather "parameterization" is supposed to
> be off the table at this point, so an R/L argument would be
> controversial, too).

The length of the function is proportional to the frequency of use, this is the "Huffman encoding" aspect of "WWLD" (what would Larry do?).

Related to this discussion, something that might not have been brought up; just for more context and information:

* 2 chars - uc [used primarily to normalize input from what I've seen]
* 7 chars - ucfirst [pretty sure I have *never* used this on purpose]
* 2 chars - lc [used same way generally as uc]

It's worth noting that they return the affected value and are non-destructive. But since 'trim' has most often been couched in terms of 'chomp', that is what's defining that whole part of the discussion.

>
> I see that in Raku, the routines are called "trim", "trim-leading" and
> "trim-trailing". (None of these trim in-place, to do that you'd use
> this idiom: "$line.=trim;").
>
> My apologies if it seems like we're re-opening old discussions at this
> point, but it's a problem in these debates that there's no easy way to
> review what's already been talked to death.

This horse is not dead. For me the most important aspect, as I've stated, is the precedent this can set (for good or ill) regarding but not limited to:

* a coherent and consistent strategy for DWIM string functions (which has been recognized by the PSC, tyvm)
* the question of *where* to put things (core vs CPAN/dual-life, namespaces, etc)
* and refining how "features" or "experiments" are handled wrt, among other things, backward and future compatibility (also seems to have been recognized by the PSC; again tyvm)

So this is not about 'trim'; it truly is about what comes after. And since we have this opportunity now to take a step back, it's worth discussing. The question of trim being efficacious is a part of this discussion; but not the "real" discussion IMO.

Cheers,
Brett

Re: Revisiting trim [ In reply to ]

aaron at priven

May 28, 2021, 10:49 AM

Post #24 of 39 (1531 views)

> On May 27, 2021, at 1:57 AM, demerphq <demerphq@gmail.com> wrote:
> Do we just endlessly accrete new keywords into the main part of the language?

Yes. And we appreciate how much more useful the language gets over time because of it.

--
Aaron Priven, aaron@priven.com, www.priven.com/aaron

Re: Revisiting trim [ In reply to ]

May 28, 2021, 12:31 PM

Post #25 of 39 (1531 views)

On 28.05.2021 18:31, Dan Book wrote:
> My two cents on the parameterized trims:
>
> 1) trim-right and trim-left are certainly reasonable use cases, *however* they are not as
> common a need across CPAN and general code.

That's one way of looking at it.

I understand that you need a criterion to estimate the usefulness and/or appeal of a
proposed new keyword/function in the language. But maybe counting how often it appears in a
(even large) set of code does not always tell the whole story ?

Another way would be to wonder at how often such code might be *executed*.

As a trivial and circumstantial example if I may :
Earlier this week I exported an SQL Server table of 157 million rows at 25 columns per
row, initially as a 14 GB CSV file. For reasons I shall not get into here, all the columns
came out as fixed-length, values right-appended with spaces. The ultimate goal was to
convert this to JSON, so to avoid a lot of unnecessary volume (JSON is already a lot more
verbose than CSV), I chose to individually right-trim every column in every CSV line
first. The program thus ran "s/\s+$//" 157 M x 25 = 3,925,000,000 times.

However, the "s/\s+$//" expression appears only once in the source of the program.

P.S.
Understand that I am certainly not complaining about the efficiency of perl and
"s/\s+$//". They both did their job perfectly, and pretty fast too (close to the time it
took to just read that file with "wc -l", and much less time than it took to export the
CSV file in the first place).
But if a dedicated rtrimmed() function, in addition to being slightly more elegant, would
happen also to be 25% faster than the regex above, I wouldn't say no to it. I might even
write a perl program to look into all our data-intensive programs and flag all its
potential uses.
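For readers who want to picture the per-column right-trim described above, here is a minimal sketch. It assumes a plain comma separator with no quoted or embedded commas, which is an assumption made for the example and not a detail of the actual export; the name rtrim_fields is made up.

```perl
use strict;
use warnings;

# Right-trims every field of one CSV line and returns the rebuilt line.
# Assumption for this sketch: fields never contain commas or quotes.
sub rtrim_fields {
    my ($csv_line) = @_;
    chomp $csv_line;
    my @fields = split /,/, $csv_line, -1;   # limit -1 keeps trailing empty fields
    s/\s+$// for @fields;                    # trailing whitespace only; leading is kept
    return join ',', @fields;
}

print rtrim_fields("42   ,Smith     ,  left-padded   ,\n"), "\n";
```

Real CSV with quoting would need a proper parser such as Text::CSV; the point here is only that the regex appears once in the source while running once per field.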

Re: Revisiting trim [ In reply to ]

davidnicol at gmail

May 28, 2021, 1:30 PM

Post #26 of 39 (785 views)

On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com> wrote:

> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>
> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of
> those, depends
>
> > Is /that/ the worst possible way ? or if not *the* worst, was there a
> better way all along ? (*)
>
> That's a very reasonable way of doing it which may very well be the
> best way (though you dropped an "s" on the second "s///").
>
> They were probably referring to a tendency of many programmers to
> obsess with trimming the left and right with a single s/// operation,
> which will result in a hairy, unreadable solution that won't perform
> any better than just doing it in two steps.
>

Is this really slower? Is this really hairier and less readable than the
two step approach?

$reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually full-trim a reference identifier

--
"Lay off that whiskey, and let that cocaine be!" -- Johnny Cash

Re: Revisiting trim [ In reply to ]

May 28, 2021, 1:33 PM

Post #27 of 39 (785 views)

------- Original Message -------
On Friday, May 28th, 2021 at 21:30, David Nicol <davidnicol@gmail.com> wrote:

> On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com> wrote:
>
>> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>>
>>> $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends
>>
>>> Is /that/ the worst possible way ? or if not *the* worst, was there a better way all along ? (*)
>>
>> That's a very reasonable way of doing it which may very well be the
>> best way (though you dropped an "s" on the second "s///").
>>
>> They were probably referring to a tendency of many programmers to
>> obsess with trimming the left and right with a single s/// operation,
>> which will result in a hairy, unreadable solution that won't perform
>> any better than just doing it in two steps.
>
> Is this really slower? Is this really hairier and less readable than the two step approach?
>
> $reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually full-trim a reference identifier

Probably still slower, but usually I write $foo =~ s/^\s*|\s*$//g;

Re: Revisiting trim [ In reply to ]

grinnz at gmail

May 28, 2021, 1:35 PM

Post #28 of 39 (785 views)

On Fri, May 28, 2021 at 4:31 PM David Nicol <davidnicol@gmail.com> wrote:

>
>
> On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com> wrote:
>
>> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>>
>> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of
>> those, depends
>>
>> > Is /that/ the worst possible way ? or if not *the* worst, was there a
>> better way all along ? (*)
>>
>> That's a very reasonable way of doing it which may very well be the
>> best way (though you dropped an "s" on the second "s///").
>>
>> They were probably referring to a tendency of many programmers to
>> obsess with trimming the left and right with a single s/// operation,
>> which will result in a hairy, unreadable solution that won't perform
>> any better than just doing it in two steps.
>>
>
> Is this really slower? Is this really hairier and less readable than the
> two step approach?
>
> $reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually
> full-trim a reference identifier
>
> Yes and (IMO) yes.
>

-Dan

Re: Revisiting trim [ In reply to ]

May 28, 2021, 2:04 PM

Post #29 of 39 (785 views)

On 28.05.2021 22:33, Alberto Simões wrote:
>
> ------- Original Message -------
> On Friday, May 28th, 2021 at 21:30, David Nicol <davidnicol@gmail.com> wrote:
>
>>
>>
>> On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com
>> <mailto:doomvox@gmail.com>> wrote:
>>
>> André Warnier (tomcat/perl) <aw@ice-sa.com <mailto:aw@ice-sa.com>> wrote:
>>
>> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends
>>
>> > Is /that/ the worst possible way ? or if not *the* worst, was there a better way
>> all along ? (*)
>>
>> That's a very reasonable way of doing it which may very well be the
>> best way (though you dropped an "s" on the second "s///").
>>
>> They were probably referring to a tendency of many programmers to
>> obsess with trimming the left and right with a single s/// operation,
>> which will result in a hairy, unreadable solution that won't perform
>> any better than just doing it in two steps.
>>
>>
>> Is this really slower? Is this really hairier and less readable than the two step approach?
>>
>> $reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually full-trim a
>> reference identifier
>>
>
> Probably still slower, but usually I write $foo =~ s/^\s*|\s*$//g;
>
>
These will probably give a field day to whoever previously wrote about "the worst way
possible" .. :-)

The perl regex engine is a wonderful thing, and my interpretation may be wrong, but I
would tend to intuit that if you use captures (as in the first above) or alternatives (as
in the second) or \s* (which may mean "nothing or something", in both), it is bound to be
somewhat less efficient than if you give the regex engine something definite to look for,
like "^\s+" and "\s+$" (although this one, if the target is utf8 and it works backward
from the end, may be quite hairy too).
But whether that compensates for one assignment instead of two, I don't have a clue.

Anyway, it kind of makes the case for optimal (l|r|)trimmed() functions, to help us all
poor mere perl programmers.

I'm interested in a guru comment though.

Re: Revisiting trim [ In reply to ]

doomvox at gmail

May 28, 2021, 3:52 PM

Post #30 of 39 (785 views)

Some quick-and-dirty benchmarking, trimming 100,000 short strings:

case 1:
$line =~ s/^\s+//;
$line =~ s/\s+$//;
# real 0m1.427s

case 2:
$line =~ s/^\s*(.+?)\s*$/$1/;
# real 0m1.853s

case 3:
$line =~ s/^\s*|\s*$//g;
# real 0m2.864s

So, case 2 is 30% slower, case 3 is 100% slower.

There's a simple fix that improves case 3 quite a bit:

case 4:
$line =~ s/^\s+|\s+$//g;
# real 0m1.704s

However: I took it very easy on this case using short lines... it's
very sensitive to line length (that /g is checking every point in the
string) and it slows down by a factor of ten with lines that are only
around 80 chars long.

Anyway, these speed penalties are Not Good, but they're also not
(usually) a reason to care.
Granted I was exaggerating calling these hairy and
unreadable, but I think they're all harder to read.

(For example, with "case 3", my first thought was it was
broken and wouldn't strip trailing whitespace if it
had stripped leading whitespace, but then I noticed the /g.
And further, it's using a * instead of a +, so without the /g
it *never* strips trailing space: so there were two things
I didn't understand.)

The thing you should ask yourself as a perl programmer is
"what did I think I would gain from doing this in one
line?".

The key point for the perl5-porters though is that there
is indeed a need for a built-in trim.
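A sketch of how the four cases above could be compared with the core Benchmark module instead of wall-clock "time". The sub names, the sample string and the iteration count are placeholders picked for illustration; no timings are claimed here, since results depend on the perl version and on line length.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The four variants from the post, each working on a copy of its argument.
my %trim = (
    two_step => sub { my $s = shift; $s =~ s/^\s+//; $s =~ s/\s+$//; $s },   # case 1
    capture  => sub { my $s = shift; $s =~ s/^\s*(.+?)\s*$/$1/;      $s },   # case 2
    alt_star => sub { my $s = shift; $s =~ s/^\s*|\s*$//g;           $s },   # case 3
    alt_plus => sub { my $s = shift; $s =~ s/^\s+|\s+$//g;           $s },   # case 4
);

my $sample = "   some short line   ";

# Sanity check first: on this single-line sample all variants must agree.
for my $name (sort keys %trim) {
    my $got = $trim{$name}->($sample);
    die "$name produced [$got]\n" unless $got eq 'some short line';
}

# Wrap each variant so that it is called with the sample string.
my %bench;
for my $name (keys %trim) {
    my $code = $trim{$name};
    $bench{$name} = sub { $code->($sample) };
}

cmpthese(200_000, \%bench);   # prints a rate comparison table
```

Changing $sample to a longer line (80 characters or more) is the interesting experiment, given the remark about sensitivity to line length.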

Re: Revisiting trim [ In reply to ]

perl5-porters at perl

May 28, 2021, 7:51 PM

Post #31 of 39 (785 views)

The following way to "trim" using "split" seems to provide a constant time solution, not dependent on the length of the string. Although I don't know how "split" is implemented, its constancy is not surprising.

In fact, the filthy way I'm generating strings necessarily overtakes the amount of time to run this very quickly.

# bench.sh
for NUM in $(seq 1 100);
do
STRING=$(perl -e "printf qq{ %s }, ' a b ' x $NUM")
time perl x.pl "$STRING" 2>&1 | grep real
done

# x.pl
my $foo = $ARGV[0];
my $trimmed = (split /^\s*|\s*$/, $foo)[-1];
print qq{'$trimmed'\n}; # <- commenting out provides no benefit timewise

excerpt of output ('real' bounces between 7ms and 16ms, indicating a sensitivity to the macOS process scheduler itself, which is even more indicative of the efficiency of this solution):

real 0m0.007s
user 0m0.002s
sys 0m0.004s
'a ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba b'

real 0m0.007s
user 0m0.002s
sys 0m0.003s
'a ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba b'

real 0m0.015s
user 0m0.003s
sys 0m0.006s
'a ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba b'

real 0m0.007s
user 0m0.002s
sys 0m0.003s
'a ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba b'

real 0m0.008s
user 0m0.002s
sys 0m0.004s
'a ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba b'

real 0m0.008s
user 0m0.002s
sys 0m0.003s

Cheers,
Brett

--
oodler@cpan.org
https://github.com/oodler577
#pdl #p5p #p7-dev #native @ irc.perl.org


------- Original Message -------
On Friday, May 28, 2021 5:52 PM, Joseph Brenner <doomvox@gmail.com> wrote:

> Some quick-and-dirty benchmarking, trimming 100,000 short strings:
>
> case 1:
> $line =~ s/^\s+//;
> $line =~ s/\s+$//;
>
> real 0m1.427s
>
> ==============
>
> case 2:
> $line =~ s/^\s*(.+?)\s*$/$1/;
>
> real 0m1.853s
>
> ==============
>
> case 3:
> $line =~ s/^\s*|\s*$//g;
>
> real 0m2.864s
>
> ==============
>
> So, case 2 is 30% slower, case 3 is 100% slower.
>
> There's a simple fix that improves case 3 quite a bit:
>
> case 4:
> $line =~ s/^\s+|\s+$//g;
>
> real 0m1.704s
>
> ==============
>
> However: I took it very easy on this case using short lines... it's
> very sensitive to line length (that /g is checking every point in the
> string) and it slows down by a factor of ten with lines that are only
> around 80 chars long.
>
> Anyway, these speed penalties are Not Good, but they're also not
> (usually) a reason to care.
> Granted I was exaggerating calling these hairy and
> unreadable, but I think they're all harder to read.
>
> (For example, with "case 3", my first thought was it was
> broken and wouldn't strip trailing whitespace if it
> had stripped leading whitespace, but then I noticed the /g.
> And further, it's using a * instead of a +, so without the /g
> it never strips trailing space: so there were two things
> I didn't understand.)
>
> The thing you should ask yourself as a perl programmer is
> "what did I think I would gain from doing this in one
> line?".
>
> The key point for the perl5-porters though is that there
> is indeed a need for a built-in trim.

Re: Revisiting trim [ In reply to ]

perl5-porters at perl

May 28, 2021, 9:45 PM

Post #32 of 39 (785 views)

On Sat, 29 May 2021 at 04:31, David Nicol <davidnicol@gmail.com> wrote:

> On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com> wrote:
>
>> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>>
>> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of
>> those, depends
>>
>> > Is /that/ the worst possible way ? or if not *the* worst, was there a
>> better way all along ? (*)
>>
>> That's a very reasonable way of doing it which may very well be the
>> best way (though you dropped an "s" on the second "s///").
>>
>> They were probably referring to a tendency of many programmers to
>> obsess with trimming the left and right with a single s/// operation,
>> which will result in a hairy, unreadable solution that won't perform
>> any better than just doing it in two steps.
>>
>
> Is this really slower? Is this really hairier and less readable than the
> two step approach?
>
> $reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually
> full-trim a reference identifier
>

It's not equivalent - you're missing the /s modifier, so a string such as
" example\ntext " would not match.

So, yes - there's a discussion about benchmarking and whether it's more
efficient, but given the minor detail that it's doing something *different*
appears to have escaped attention, I'd suggest that this was indeed less
readable?

(also yes, I appreciate that this is $reference_identifier and earlier we
had $stripped_line, but the purpose of trimmed() is "remove whitespace from
the start and end of a string", so "similar problem domain with different
details" just muddies the discussion)
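The difference can be checked directly. The sketch below uses the example string from this message; the variable names are made up for the demonstration.

```perl
use strict;
use warnings;

my $input = " example\ntext ";

# Single-regex form as posted: without /s, "." cannot cross the newline,
# so the anchored pattern fails and the string is left unchanged.
(my $one_regex = $input) =~ s/^\s*(.+?)\s*$/$1/;

# Same pattern with /s: "." now matches the newline and both ends are trimmed.
(my $one_regex_s = $input) =~ s/^\s*(.+?)\s*$/$1/s;

# Two-step form: unaffected by the embedded newline.
(my $two_step = $input) =~ s/^\s+//;
$two_step =~ s/\s+$//;

print "without /s: ", ($one_regex   eq $input          ? "unchanged" : "trimmed"),     "\n";
print "with /s:    ", ($one_regex_s eq "example\ntext" ? "trimmed"   : "not trimmed"), "\n";
print "two-step:   ", ($two_step    eq "example\ntext" ? "trimmed"   : "not trimmed"), "\n";
```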

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 29, 2021, 12:37 AM

Post #33 of 39 (785 views)

On Fri, 28 May 2021 at 12:02, André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
> $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends

This is as correct a way to do it as you can do in perl regex. I'd
probably replace the $ with \z to be absolutely clear on my intent. I
had to go double check the behavior of $ here, where \z is
unambiguous.
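For anyone else who has to double check: without /m, $ matches at the very end of the string and also just before a final newline, while \z matches only at the very end. A small illustration, with a made-up sample string:

```perl
use strict;
use warnings;

my $s = "value\n";

# $ is allowed to match just before the trailing newline.
my $dollar_matches = ($s =~ /value$/)  ? 1 : 0;

# \z insists on the true end of the string, so the newline gets in the way.
my $z_matches      = ($s =~ /value\z/) ? 1 : 0;

print "with \$:  $dollar_matches\n";
print "with \\z: $z_matches\n";
```

For the trim substitutions themselves the result is the same either way, because \s+ consumes the trailing newline before the anchor is tested; \z simply makes the intent explicit.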

The point here is that people often write this:

$stripped_line =~ s/^\s+|\s+$//g;

which causes the regex engine to perform the scan through the string
in a really inefficient way. Your splitting it into two calls avoids
the main mistake that people make.

But this question also illustrates the problem here. The regex engine
doesn't know how to go backwards. Even for the split form of the regex
the *second* regex, the one that does the rtrim() functionality, is
the problem performance wise. The regex engine will do a scan of the
whole string; every time it finds a space character it will scan
forward until it finds either a non-space character or the end of the string.
There is some cleverness in the engine to make this case not be
quadratic, but it's not far off. The run time will be proportional to
the length of the string and the number of space/non-space sequences it
contains.

So the reason to add trimmed() to the language at an optimization
level is that while it's hard to teach the regex engine to go
backwards, it's not hard to create a custom DFA or similar logic that
scans through the string from the right and finds the rightmost
non-space character in the string. For instance even a naïve
implementation using the utf8-skip-backwards-one-character logic
would be O(N) where N is the number of whitespace characters at the
end of the string.
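The right-to-left idea can be illustrated in pure Perl, as below. This is only a sketch of the scanning order: the name rtrimmed_scan is hypothetical, and a substr loop in Perl is not claimed to be fast (a real implementation would live in C, as noted elsewhere in this thread).

```perl
use strict;
use warnings;

# Hypothetical pure-Perl rtrim that walks left from the end of the string,
# stopping at the first non-whitespace character it meets.
sub rtrimmed_scan {
    my ($str) = @_;
    my $end = length $str;
    # Step back one character at a time while the character before $end is whitespace.
    $end-- while $end > 0 && substr($str, $end - 1, 1) =~ /\s/;
    return substr($str, 0, $end);
}

printf "[%s]\n", rtrimmed_scan("  keep the indent \t ");
```

The number of loop iterations depends only on how much trailing whitespace there is, not on the total length of the string, which is the property being argued for.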

This performance issue with rtrim() I would argue supports your point,
adding trim() without rtrim() is to a certain extent a missed
opportunity. Stripping whitespace from the end of the string will
still be inefficient and difficult to read. E.g., consider that I would call
myself a regex expert, yet every time someone posts this pattern with
$ in it I have to double check the rules. Making people use an
inefficient and cryptic regex for a common task seems undesirable.
The cryptic argument applies for ltrim(), but that at least *is*
efficient in the regex engine.

cheers
Yves

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 29, 2021, 12:46 AM

Post #34 of 39 (785 views)

On Fri, 28 May 2021 at 22:31, David Nicol <davidnicol@gmail.com> wrote:
>
>
>
> On Fri, May 28, 2021 at 11:25 AM Joseph Brenner <doomvox@gmail.com> wrote:
>>
>> André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>>
>> > $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends
>>
>> > Is /that/ the worst possible way ? or if not *the* worst, was there a better way all along ? (*)
>>
>> That's a very reasonable way of doing it which may very well be the
>> best way (though you dropped an "s" on the second "s///").
>>
>> They were probably referring to a tendency of many programmers to
>> obsess with trimming the left and right with a single s/// operation,
>> which will result in a hairy, unreadable solution that won't perform
>> any better than just doing it in two steps.
>
>
> Is this really slower? Is this really hairier and less readable than the two step approach?
>
> $reference_identifier =~ s/^\s*(.+?)\s*$/$1/; # how I usually full-trim a reference identifier

This avoids the killer aspect of s/^\s+|\s+$//g, but it still scales
in proportion to the length of the string and the number of
space/non-space sequences in the string. The overhead will be quite a
bit higher, and I assume you want to make the . match newlines? Note
that this won't work the same as the other examples on a string like
" foo\nbar ".
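
A short sketch of that difference, using hypothetical wrapper names around the patterns from this thread. Without /s the single-substitution pattern cannot cross the embedded newline, so it fails to match and leaves the string untouched.

```perl
use strict;
use warnings;

# Hypothetical wrappers around the patterns discussed in this thread.
sub trim_single   { my ($s) = @_; $s =~ s/^\s*(.+?)\s*$/$1/;  return $s }
sub trim_single_s { my ($s) = @_; $s =~ s/^\s*(.+?)\s*$/$1/s; return $s }
sub trim_two_step { my ($s) = @_; $s =~ s/^\s+//; $s =~ s/\s+$//; return $s }

my $input = "  foo\nbar  ";   # 11 characters, newline in the middle

my @variants = (
    [ single   => \&trim_single   ],   # "." stops at "\n": no match at all
    [ single_s => \&trim_single_s ],   # /s lets "." match "\n"
    [ two_step => \&trim_two_step ],
);

for my $variant (@variants) {
    my ($name, $code) = @$variant;
    my $out = $code->($input);
    printf "%-9s changed=%s length=%d\n",
        $name, ($out eq $input ? "no" : "yes"), length($out);
}
```

On this input only the /s form and the two-step form return "foo\nbar".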

The reality is that the regex engine is a crappy way to do this
particular task. To do it right you want to start from the right hand
side and search left, such that your performance is proportional to
the number of characters being removed. The regex engine, no matter how
you slice it, is going to go left to right, and is thus at best going
to be proportional to the length of the string overall.

TBH, I would not be surprised if:

chop($str) while $str=~/\s\z/;

or

1 while $str=~s/\s\z//;

is actually one of the fastest ways to do this with a regex. I
believe in these cases the regex engine does actually use the
utf8-skip-backwards macros (e.g. it knows how to find the position that
is K characters before the end of the string to see if they match a
space character, but it does not know how to scan from the right to
find the maximal set of space characters).
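
For anyone who wants to try the two one-liners, here they are wrapped as functions with a quick agreement check. The function names are hypothetical and this says nothing about speed; it only confirms that both loops strip the same trailing whitespace.

```perl
use strict;
use warnings;

# Hypothetical wrappers around the two one-liners above; each works on a copy.
sub rtrim_chop {
    my ($str) = @_;
    chop($str) while $str =~ /\s\z/;   # test only the last character, then drop it
    return $str;
}

sub rtrim_subst {
    my ($str) = @_;
    1 while $str =~ s/\s\z//;          # remove one trailing whitespace character per pass
    return $str;
}

for my $case ("foo", "foo  ", " foo\nbar \t\n", "   ") {
    my $by_chop  = rtrim_chop($case);
    my $by_subst = rtrim_subst($case);
    printf "length %d -> %d (%s)\n", length($case), length($by_chop),
        $by_chop eq $by_subst ? "both agree" : "MISMATCH";
}
```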

So yes, frankly as someone intimate with the regex engine I would say
that this is a task that people should NOT use the regex engine for at
all. Unfortunately to do this really right as a function you need to
do it in C.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 29, 2021, 12:56 AM

Post #35 of 39 (785 views)

On Sat, 29 May 2021 at 00:52, Joseph Brenner <doomvox@gmail.com> wrote:
>
> Some quick-and-dirty benchmarking, trimming 100,000 short strings:
>
> case 1:
> $line =~ s/^\s+//;
> $line =~ s/\s+$//;
> # real 0m1.427s
>
> case 2:
> $line =~ s/^\s*(.+?)\s*$/$1/;
> # real 0m1.853s
>
> case 3:
> $line =~ s/^\s*|\s*$//g;
> # real 0m2.864s
>
> So, case 2 is 30% slower, case 3 is 100% slower.
>
> There's a simple fix that improves case 3 quite a bit:
>
> case 4:
> $line =~ s/^\s+|\s+$//g;
> # real 0m1.704s
>
> However: I took it very easy on this case using short lines... it's
> very sensitive to line length (that /g is checking every point in the
> string) and it slows down by a factor of ten with lines that are only
> around 80 chars long.

THIS is the key point here. Run your benchmarks over strings of length
1, 10, 100, 1000, and include the examples I posted in another mail:

1 while $str=~s/\s\z//;
chop($str) while $str=~m/\s\z/;

Also do it on strings like this:

" ". ("x" x $length) . " ";

and also try strings constructed like this for various $k.

my $str;
for my $l (1..$k) {
    $str .= " " . ("x" x $l);
}

The point here is that performance of the regex based solutions will
generally be determined by the length of the string, and the number of
space/non-space sequences. My two hacks above are designed to avoid
that and stay proportional to the number of characters that are
*removed* from the string, but they will still be worse than doing it
right in C code.
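
A possible harness for running that comparison, sketched with the core Benchmark module. It covers only the trailing-whitespace variants and the simple padded string; the sub names, the variant labels, and the choice of one CPU second per variant are all assumptions, and no results are implied.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Trailing-whitespace trimmers to compare; each works on a copy.
# The labels are made up for this sketch.
sub rtrim_variants {
    return (
        regex_dollar => sub { my $s = shift; $s =~ s/\s+$//;  $s },
        regex_z      => sub { my $s = shift; $s =~ s/\s+\z//; $s },
        subst_loop   => sub { my $s = shift; 1 while $s =~ s/\s\z//; $s },
        chop_loop    => sub { my $s = shift; chop($s) while $s =~ /\s\z/; $s },
    );
}

# Builds " " . ("x" x $length) . " ", as suggested above.
sub padded_string {
    my ($length) = @_;
    return " " . ("x" x $length) . " ";
}

# Prints one comparison table per requested length.
sub run_benchmarks {
    my @lengths = @_;
    my %variant = rtrim_variants();
    for my $length (@lengths) {
        my $input = padded_string($length);
        my %tests;
        for my $name (keys %variant) {
            my $code = $variant{$name};
            $tests{$name} = sub { $code->($input) };
        }
        print "length $length:\n";
        cmpthese(-1, \%tests);   # negative count: at least 1 CPU second each
    }
}

# Only benchmark when lengths are given on the command line.
run_benchmarks(@ARGV) if @ARGV;
```

Saved as, say, bench.pl, it would be run as "perl bench.pl 1 10 100 1000".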

Yves

Re: Revisiting trim [ In reply to ]

demerphq at gmail

May 29, 2021, 1:00 AM

Post #36 of 39 (785 views)

On Sat, 29 May 2021 at 09:56, demerphq <demerphq@gmail.com> wrote:
>
> On Sat, 29 May 2021 at 00:52, Joseph Brenner <doomvox@gmail.com> wrote:
> >
> > Some quick-and-dirty benchmarking, trimming 100,000 short strings:
> >
> > case 1:
> > $line =~ s/^\s+//;
> > $line =~ s/\s+$//;
> > # real 0m1.427s
> >
> > case 2:
> > $line =~ s/^\s*(.+?)\s*$/$1/;
> > # real 0m1.853s
> >
> > case 3:
> > $line =~ s/^\s*|\s*$//g;
> > # real 0m2.864s
> >
> > So, case 2 is 30% slower, case 3 is 100% slower.
> >
> > There's a simple fix that improves case 3 quite a bit:
> >
> > case 4:
> > $line =~ s/^\s+|\s+$//g;
> > # real 0m1.704s
> >
> > However: I took it very easy on this case using short lines... it's
> > very sensitive to line length (that /g is checking every point in the
> > string) and it slows down by a factor of ten with lines that are only
> > around 80 chars long.
>
> THIS is the key point here. Run your benchmarks over strings of length
> 1, 10, 100, 1000, and include the examples I posted in another mail:
>
> 1 while $str=~s/\s\z//;
> chop($str) while $str=~m/\s\z/;
>
> Also do it on strings like this:

Here I meant to say do it on sequences of space/non-space with both
the space and non-space being longer and longer. Eg:

" ". (((" " x $l1) . ("Q" x $l2)) x $l3) . " ";

You will see that most of the regex versions degrade terribly. I'm on
the wrong computer or I'd post the results, but I bet my hacks above
beat them all once the string gets over a certain size, if not hands
down.
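
A small generator for those inputs, in case it saves someone typing. alternating_runs is a hypothetical name; the sketch just builds the string from the expression above and reports its length and how many whitespace runs it contains.

```perl
use strict;
use warnings;

# Hypothetical helper: $l3 repetitions of ($l1 spaces followed by $l2 "Q"s),
# wrapped in one extra space on each side.
sub alternating_runs {
    my ($l1, $l2, $l3) = @_;
    return " " . (((" " x $l1) . ("Q" x $l2)) x $l3) . " ";
}

for my $size (1, 10, 100) {
    my $str  = alternating_runs($size, $size, $size);
    my $runs = () = $str =~ /\s+/g;    # count the whitespace runs
    printf "l1=l2=l3=%d: length %d, %d whitespace runs\n",
        $size, length($str), $runs;
}
```

The length is 2 + ($l1 + $l2) * $l3, so the three parameters let the total size and the number of runs be varied independently.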

Yves

Re: Revisiting trim [ In reply to ]

May 30, 2021, 2:09 PM

Post #37 of 39 (785 views)

Karl Williamson <public@khwilliamson.com> wrote:
:On 5/29/21 1:37 AM, demerphq wrote:
[...]
:> But this question also illustrates the problem here. The regex engine
:> doesn't know how to go backwards. [...]
:
:Maybe you and I should have a chat about what can and should be done to
:improve the matching speed of right-anchored patterns.
:
:I suppose it is theoretically possible to create reverse
:Perl_re_intuit_start() and S_find_byclass() functions, if one could wrap
:one's mind around that, though the libc support is limited. But I could
:be wrong about the feasibility and it would be more work than anyone
:would care to undertake.

FWIW, I think it is probably impossible for the general case of /pat\z/,
but for restricted cases (primarily those without captures) it might not
be so hard.

:But there are things that could be done. It had never occurred to me
:before that the hop_back functions could be called with large numbers.
:Backing up in a UTF-8 string could be improved by a factor of 8 by doing
:per-word operations. (You load a whole word. One can isolate and count
:the continuation bytes in it by some shifting/masking/ etc operations.
:Everything that isn't a continuation byte marks a character.)
:Similarly, functions like S_find_next_masked() could have a
:corresponding reversed version, though slower on UTF-8 than the forward
:because of the forward bias of UTF-8.

Yes, that sounds good.

Hugo

Re: Revisiting trim [ In reply to ]

public at khwilliamson

May 30, 2021, 2:20 PM

Post #38 of 39 (785 views)

On 5/29/21 1:37 AM, demerphq wrote:
> On Fri, 28 May 2021 at 12:02, André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>> $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends
>
> This is as correct a way to do it as you can do in perl regex. I'd
> probably replace the $ with \z to be absolutely clear on my intent. I
> had to go double check the behavior of $ here, where \z is
> unambiguous.
>
> The point here is that people often write this:
>
> $stripped_line =~ s/^\s+|\s+$//g;
>
> which causes the regex engine to perform the scan through the string
> in a really inefficient way. Your splitting it into two calls avoids
> the main mistake that people make.
>
> But this question also illustrates the problem here. The regex engine
> doesn't know how to go backwards. Even for the split form of the regex
> the *second* regex, the one that does the rtrim() functionality, is
> the problem performance-wise. The regex engine will do a scan of the
> whole string; every time it finds a space character it will scan
> forward until it finds either a non-space or the end of the string.
> There is some cleverness in the engine to make this case not be
> quadratic, but it's not far off. The run time will be proportional to
> the length of the string and the number of space/non-space sequences
> it contains.
>
> So the reason to add trimmed() to the language at an optimization
> level is that while it's hard to teach the regex engine to go
> backwards, it's not hard to create a custom DFA or similar logic that
> scans through the string from the right and finds the rightmost
> non-space character in the string. For instance even doing a naïve
> implementation using the utf8-skip-backwards-one-character logic
> would be O(N) where N is the number of whitespace characters at the
> end of the string.
>
> This performance issue with rtrim() I would argue supports your point:
> adding trim() without rtrim() is to a certain extent a missed
> opportunity. Stripping whitespace from the end of the string will
> still be inefficient and difficult to read. E.g., consider that I would
> call myself a regex expert, yet every time someone posts this pattern
> with $ in it I have to double-check the rules. Making people use an
> inefficient and cryptic regex for a common task seems undesirable.
> The cryptic argument also applies to ltrim(), but that at least *is*
> efficient in the regex engine.
>

Maybe you and I should have a chat about what can and should be done to
improve the matching speed of right-anchored patterns.

I suppose it is theoretically possible to create reverse
Perl_re_intuit_start() and S_find_byclass() functions, if one could wrap
one's mind around that, though the libc support is limited. But I could
be wrong about the feasibility and it would be more work than anyone
would care to undertake.

But there are things that could be done. It had never occurred to me
before that the hop_back functions could be called with large numbers.
Backing up in a UTF-8 string could be improved by a factor of 8 by doing
per-word operations. (You load a whole word. One can isolate and count
the continuation bytes in it by some shifting/masking/ etc operations.
Everything that isn't a continuation byte marks a character.)
Similarly, functions like S_find_next_masked() could have a
corresponding reversed version, though slower on UTF-8 than the forward
because of the forward bias of UTF-8.
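
To illustrate the per-word idea, here is a rough sketch in Perl rather than C. It is only a model of the bit trick, not code from the core: the function names are made up, and it assumes a perl built with 64-bit integers (the "Q" pack format). It counts characters in a UTF-8 byte string by flagging continuation bytes (10xxxxxx) eight bytes at a time.

```perl
use strict;
use warnings;

# Mask with the low bit of every byte set (0x0101...01), built with unpack
# so no 64-bit literal is needed. Assumes a perl with 64-bit integers.
my $LOW_BITS = unpack 'Q', "\x01" x 8;

# Baseline: a byte starts a character unless it looks like 10xxxxxx.
sub count_chars_bytewise {
    my ($bytes) = @_;
    my $count = 0;
    for my $byte (unpack 'C*', $bytes) {
        $count++ if ($byte & 0xC0) != 0x80;
    }
    return $count;
}

# Word-at-a-time: flag continuation bytes in 8 bytes at once, count the
# flags, and handle the leftover tail bytes with the baseline.
sub count_chars_wordwise {
    my ($bytes) = @_;
    my $full = length($bytes) - length($bytes) % 8;
    my $continuation = 0;
    for my $word (unpack 'Q*', substr($bytes, 0, $full)) {
        # Per byte: bit 7 set and bit 6 clear marks a continuation byte.
        my $flags = ($word >> 7) & (~$word >> 6) & $LOW_BITS;
        $continuation += unpack '%32b*', pack('Q', $flags);   # count set bits
    }
    return ($full - $continuation)
         + count_chars_bytewise(substr($bytes, $full));
}

my $bytes = "na\x{ef}ve \x{263A}";
utf8::encode($bytes);                  # characters -> UTF-8 bytes
printf "%d bytes, %d characters\n",
    length($bytes), count_chars_wordwise($bytes);   # 10 bytes, 7 characters
```

Hopping back N characters needs the same count run from the end of the buffer; the sketch only shows the counting step.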

Re: Revisiting trim [ In reply to ]

Jun 1, 2021, 8:39 AM

Post #39 of 39 (782 views)

On 29.05.2021 00:52, Joseph Brenner wrote:
> The key point for the perl5-porters though is that there
> is indeed a need for a built-in trim.

But - just speaking for myself - if it always trims both ends unconditionally, then
at least 75% of the times I'd wish to use it, I wouldn't be able to.
Which is ok for me, I can continue to use "s/\s+$//" in those cases.
But maybe not so ok for the trees in the Amazon, the coral reefs, and the Pacific atolls.

For the sake of it, I created a naive rtrim() function in pure perl (well, using chop()),
and this already seems to take roughly half the time of the "s/\s+$//" regex.
Here it goes :

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );

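sub rtrim {
    my $str = shift;
    my $last = '';
    # Chop characters off the end until one is not whitespace
    # (chop returns '' once the string is empty, which ends the loop).
    while (1) {
        $last = chop($str);
        last unless $last =~ /\s/;
    }
    # Put back the non-whitespace character that was chopped last.
    $str .= $last;
    return $str;
}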

my $string1 = " a b " x 10;
#print "string1 [",length($string1),"][",$string1,"]\n";
my $string10 = ($string1 x 10) . (" " x 10);
#print "string10 [",length($string10),"][",$string10,"]\n";
my $string100 = $string10 x 10 . (" " x 100);
#print "string100 [",length($string100),"][",$string100,"]\n";

my $res = '';
my ($T0,$T1);

print "regex :\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
    $res = $string1;
    $res =~ s/\s+$//;
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string1 : ",sprintf("%.4f",$T1),"s\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
    $res = $string10;
    $res =~ s/\s+$//;
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string10 : ",sprintf("%.4f",$T1),"s\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
    $res = $string100;
    $res =~ s/\s+$//;
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string100 : ",sprintf("%.4f",$T1),"s\n";

print "function :\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
    $res = $string1;
    $res = rtrim($res);
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string1 : ",sprintf("%.4f",$T1),"s\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
    $res = $string10;
    $res = rtrim($res);
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string10 : ",sprintf("%.4f",$T1),"s\n";

$T0 = [gettimeofday];
for (my $i=0;$i<1000;$i++) {
    $res = $string100;
    $res = rtrim($res);
}
$T1 = tv_interval($T0, [gettimeofday()]);
print "string100 : ",sprintf("%.4f",$T1),"s\n";
#print " length [",length($res),"][",$res,"]\n"; # just to check

exit;

Prints :

regex :
string1 : 0.0032s
string10 : 0.0104s
string100 : 0.0911s
function :
string1 : 0.0015s
string10 : 0.0056s
string100 : 0.0501s

(Strawberry perl 5, version 28, subversion 2 (v5.28.2) built for MSWin32-x64-multi-thread)