Mailing List Archive

RE: How to escape strings for re.finditer? [ In reply to ]
Roel,

You make some good points. One to consider is that when you ask a regular expression matcher to search for something that uses NO regular expression features, much of the complexity disappears, and what it builds is probably similar to a plain string search, except that the loops are carried out by fast functions, probably written in C.

That is one reason the roll-your-own versions are at a disadvantage, unless you roll your own in a similar way by writing a similar C function.

Nobody has yet shown us the simple but fast plain-text search that ought to exist for this job. It may well be out there, but as you point out, perhaps it is not needed as long as people just use the re version.

Avi

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Roel Schroeven
Sent: Tuesday, February 28, 2023 4:33 AM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

Op 28/02/2023 om 3:44 schreef Thomas Passin:
> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>> And, just for fun, since there is nothing wrong with your code, this
>> minor change is terser:
>>
>> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> >>> for match in re.finditer(re.escape('abc_degree + 1'), example):
>> ...     print(match.start(), match.end())
>> ...
>> 4 18
>> 26 40
>
> Just for more fun :) -
>
> Without knowing how general your expressions will be, I think the
> following version is very readable, certainly more readable than regexes:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> KEY = 'abc_degree + 1'
>
> for i in range(len(example)):
>     if example[i:].startswith(KEY):
>         print(i, i + len(KEY))
> # prints:
> 4 18
> 26 40

I think it's often a good idea to use a standard library function instead of rolling your own. The issue becomes less clear-cut when the standard library doesn't do exactly what you need (as here, where re.finditer() uses regular expressions while the use case only uses simple search strings). Ideally there would be a str.finditer() method we could use, but in the absence of that I think we still need to consider using the almost-but-not-quite fitting re.finditer().

Two reasons:

(1) I think it's clearer: the name tells us what it does (though of course we could solve this in a hand-written version by wrapping it in a suitably named function).

(2) Searching for a string in another string, in a performant way, is not as simple as it first appears. Your version works correctly, but slowly. In some situations it doesn't matter, but in other cases it will. For better performance, string searching algorithms jump ahead either when they found a match or when they know for sure there isn't a match for some time (see e.g. the Boyer–Moore string-search algorithm).
You could write such a more efficient algorithm, but then it becomes more complex and more error-prone. Using a well-tested existing function becomes quite attractive.

To illustrate the difference in performance, I did a simple test (using the paragraph above as test text):

import re
import timeit

def using_re_finditer(key, text):
    matches = []
    for match in re.finditer(re.escape(key), text):
        matches.append((match.start(), match.end()))
    return matches


def using_simple_loop(key, text):
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches


CORPUS = """Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but slowly.
In some situations it doesn't matter, but in other cases it will.
For better
performance, string searching algorithms jump ahead either when they found a
match or when they know for sure there isn't a match for some time (see e.g.
the Boyer–Moore string-search algorithm). You could write such a more
efficient algorithm, but then it becomes more complex and more error-prone.
Using a well-tested existing function becomes quite attractive."""
KEY = 'in'
print('using_simple_loop:',
      timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(),
                    number=1000))
print('using_re_finditer:',
      timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(),
                    number=1000))

This does 5 runs of 1000 repetitions each, and reports the time in seconds for each of those runs.
Result on my machine:

using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]

We find that in this test re.finditer() is more than 30 times faster (despite the overhead of regular expressions).
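
As a quick check on that figure, the best (minimum) time of each set of runs can be compared directly. This is only a sketch of the arithmetic: the variable names are illustrative, and the numbers are copied from the results shown above.

```python
# Timings copied from the results above (seconds per 1000 calls).
simple_loop = [0.13952950000020792, 0.13063130000000456, 0.12803450000001249,
               0.13186180000002423, 0.13084610000032626]
re_finditer = [0.003861400000005233, 0.004061900000124297, 0.003478999999970256,
               0.003413100000216218, 0.0037320000001273]

# Compare the minimum of each list; slower runs mostly reflect
# interference from other activity on the machine.
speedup = min(simple_loop) / min(re_finditer)
print(f"re.finditer() is about {speedup:.1f}x faster in this test")
```

The exact ratio will of course differ from machine to machine and between Python versions.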

While speed isn't everything in programming, with such a large difference in performance and (to me) no real disadvantages of using re.finditer(), I would prefer re.finditer() over writing my own.

--
"The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom."
-- Isaac Asimov

--
https://mail.python.org/mailman/listinfo/python-list

Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 11:48 AM, Jon Ribbens via Python-list wrote:
> On 2023-02-28, Thomas Passin <list1@tompassin.net> wrote:
...
>>
>> It is interesting, though, how pre-processing the search pattern can
>> improve search times if you can afford the pre-processing. Here's a
>> paper on rapidly finding matches when there may be up to one misspelled
>> character. It's easy enough to implement, though in Python you can't
>> take the additional step of tuning it to stay in cache.
>>
>> https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf
>
> You've somehow title-cased that URL. The correct URL is:
>
> https://robert.muth.org/Papers/1996-approx-multi.pdf

Thanks, not sure how that happened ...

--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape strings for re.finditer? [ In reply to ]
> I wrote my previous message before reading this. Thank you for the test you ran -- it answers the question of performance. You show that re.finditer is 30x faster, so that certainly recommends that over a simple loop, which introduces looping overhead.

>>     def using_simple_loop(key, text):
>>         matches = []
>>         for i in range(len(text)):
>>             if text[i:].startswith(key):
>>                 matches.append((i, i + len(key)))
>>         return matches
>>
>>     using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
>>     using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]


With a slight tweak to the simple loop code using .find() it becomes a third faster than the RE version though.


def using_simple_loop2(key, text):
    matches = []
    keyLen = len(key)
    start = 0
    while (foundSpot := text.find(key, start)) > -1:
        start = foundSpot + keyLen
        matches.append((foundSpot, start))
    return matches


using_simple_loop: [0.1732664997689426, 0.1601669997908175, 0.15792609984055161, 0.1573973000049591, 0.15759290009737015]
using_re_finditer: [0.003412699792534113, 0.0032823001965880394, 0.0033694999292492867, 0.003354900050908327, 0.0033336998894810677]
using_simple_loop2: [0.00256159994751215, 0.0025471001863479614, 0.0025424999184906483, 0.0025831996463239193, 0.0025555999018251896]
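
One thing worth keeping in mind when comparing these versions is that they do not always return the same matches. The slicing loop reports overlapping occurrences, while re.finditer() and the .find() loop both resume after the end of the previous match. A small sketch of the difference follows; the functions are the ones posted in this thread, and the 'aa' / 'aaaa' input is made up purely for illustration.

```python
import re

def using_simple_loop(key, text):
    # Checks every start position, so overlapping matches are reported.
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches

def using_simple_loop2(key, text):
    # Resumes after the end of each match: non-overlapping only.
    matches = []
    start = 0
    while (found := text.find(key, start)) > -1:
        start = found + len(key)
        matches.append((found, start))
    return matches

def using_re_finditer(key, text):
    # re.finditer() also yields non-overlapping matches.
    return [m.span() for m in re.finditer(re.escape(key), text)]

overlapping = using_simple_loop('aa', 'aaaa')
with_find = using_simple_loop2('aa', 'aaaa')
with_re = using_re_finditer('aa', 'aaaa')
print(overlapping, with_find, with_re)
```

For a key that cannot overlap itself, such as the 'abc_degree + 1' example, all three agree.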
--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape strings for re.finditer? [ In reply to ]
Jen,



I had no doubt the code you ran was indented properly or it would not work.



I am merely letting you know that somewhere in the process of copying the code or the transition between mailers, my version is messed up. It happens to be easy for me to fix but I sometimes see garbled code I then simply ignore.



At times what may help is to leave blank lines that python ignores but also keeps the line rearrangements minimal.



On to your real question.



In my OPINION, there are many interesting questions that can get in the way of just getting a working solution. Some may be better in some abstract way but except for big projects it often hardly matters.



So regex is one thing or more a cluster of things and a list comp is something completely different. They are both tools you can use and abuse or lose.



The distinction I believe we started with was how to find a fixed string inside another fixed string in as many places as needed and perhaps return offset info. So this can be solved in too many ways using a side of python focused on pure text. As discussed, solutions can include explicit loops such as “for” and “while” and their syntactic sugar cousin of a list comp. Not mentioned yet are other techniques like a recursive function that finds the first and passes on the rest of the string to itself to find the rest, or various functional programming techniques that may do sort of hidden loops. YOU DO NOT NEED ALL OF THEM but it can be interesting to learn.



Regex is a completely different universe that is a bit more of MORE. If I ask you for a ride to the grocery store, I might expect you to show up with a car and not a James Bond vehicle that also is a boat, submarine, airplane, and maybe spaceship. Well, Regex is the latter. And in your case, it is this complexity that meant you had to convert your text so it will not see what it considers commands or hints.
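
To make that last point concrete, here is a minimal sketch using the example string from this thread. Without re.escape(), the '+' in the search string is read as a "one or more" quantifier applied to the preceding space, so the pattern no longer means what the plain text says. The variable names are illustrative only.

```python
import re

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
key = 'abc_degree + 1'

# Unescaped: ' +' means "one or more spaces", so the literal '+' never matches.
unescaped = [m.span() for m in re.finditer(key, example)]

# Escaped: every character of key is matched literally.
escaped = [m.span() for m in re.finditer(re.escape(key), example)]

print(unescaped)
print(escaped)
```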



In normal use, put a bit too simply, it wants a carefully crafted pattern to be spelled out and it weaves an often complex algorithm it then sort of compiles that represents the understanding of what you asked for. The simplest pattern is to match EXACTLY THIS. That is your case.



A more complex pattern may say to match Boston OR Chicago followed by any amount of whitespace then a number of digits between 3 and 5 and then should not be followed by something specific. Oh, and by the way, save selected parts in parentheses to be accessed as \1 or \2 so I can ask you to do things like match a word followed by itself. It goes on and on.
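
A rough sketch of what such patterns might look like in Python's re syntax. The exact patterns, the sample text, and the choice of '-' as the "something specific" are all assumptions made for illustration, not anything taken from the original question.

```python
import re

# Boston OR Chicago, optional whitespace, 3 to 5 digits, and then NOT
# followed by '-' (or by another digit, which keeps the engine from
# backtracking to a shorter run of digits just to satisfy the lookahead).
city_code = re.compile(r'(?:Boston|Chicago)\s*\d{3,5}(?![\d-])')
text = 'Boston 02134, Chicago60601, Boston 12345-6789, Chicago 12'
city_hits = city_code.findall(text)

# A word followed by itself: \1 refers back to whatever group 1 captured.
doubled_word = re.compile(r'\b(\w+)\s+\1\b')
word_hits = [m.group(1) for m in doubled_word.finditer('it was and and so so on')]

print(city_hits)
print(word_hits)
```

Note how much care even this small pattern needs: leaving the digit out of the lookahead would let 'Boston 12345-6789' match as 'Boston 1234'.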



Be warned RE is implemented now all over the place including outside the usual UNIX roots and there are somewhat different versions. For your need, it does not matter.



The compiled monstrosity though can be fairly fast and might be a tad hard for you to write by yourself as a bunch of if statements nested that are weirdly matching various patterns with some look ahead or look behind.



What you are being told is that despite this being way more than you asked for, it not only works but is fairly fast when doing the simple thing you asked for. That may be why a text version you are looking for is hard to find.



I am not clear what exactly the rest of your project is about but my guess is your first priority is completing it decently and not to try umpteen methods and compare them. Not today. Of course if the working version is slow and you profile it and find this part seems to be holding it back, it may be worth examining.





From: Jen Kris <jenkris@tutanota.com>
Sent: Tuesday, February 28, 2023 12:58 PM
To: avi.e.gross@gmail.com
Cc: 'Python List' <python-list@python.org>
Subject: RE: How to escape strings for re.finditer?



The code I sent is correct, and it runs here. Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:



example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())



One question: several people have made suggestions other than regex (not your terser example with regex that you showed below). Is there a reason why regex is not preferred to, for example, a list comp? Performance? Reliability?

Feb 27, 2023, 18:16 by avi.e.gross@gmail.com <mailto:avi.e.gross@gmail.com> :

Jen,



Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.



What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.



This is what you sent:



example = 'X - abc_degree + 1 + qq + abc_degree + 1'

find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):

print(match.start(), match.end())



This is code indented properly:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())



Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....



And, just for fun, since there is nothing wrong with your code, this minor change is terser:

>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> for match in re.finditer(re.escape('abc_degree + 1'), example):
...     print(match.start(), match.end())
...
4 18
26 40



But note once you use regular expressions, and not in your case, you might match multiple things that are far from the same such as matching two repeated words of any kind in any case including "and and" and "so so" or finding words that have multiple doubled letter as in the stereotypical bookkeeper. In those cases, you may want even more than offsets but also show the exact text that matched or even show some characters before and/or after for context.
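
A short sketch of that idea, returning the matched text together with a few characters of surrounding context. The helper name show_matches, the context width, and the sample strings are hypothetical and chosen only for illustration.

```python
import re

def show_matches(pattern, text, context=8):
    """Return (matched_text, surrounding_snippet) for each match."""
    results = []
    for m in re.finditer(pattern, text):
        lo = max(0, m.start() - context)   # do not run off the front
        hi = m.end() + context             # slicing past the end is safe
        results.append((m.group(0), text[lo:hi]))
    return results

# Doubled letters, as in the stereotypical "bookkeeper".
doubled_letters = [m.group(0) for m in re.finditer(r'(\w)\1', 'bookkeeper')]

# A repeated word, shown with a little context on either side.
repeats = show_matches(r'\b(\w+)\s+\1\b', 'I said so so often', context=5)

print(doubled_letters)
print(repeats)
```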





-----Original Message-----

From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org <mailto:python-list-bounces+avi.e.gross=gmail.com@python.org> > On Behalf Of Jen Kris via Python-list

Sent: Monday, February 27, 2023 8:36 PM

To: Cameron Simpson <cs@cskk.id.au <mailto:cs@cskk.id.au> >

Cc: Python List <python-list@python.org <mailto:python-list@python.org> >

Subject: Re: How to escape strings for re.finditer?





I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:



example = 'X - abc_degree + 1 + qq + abc_degree + 1'

find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):

print(match.start(), match.end())



4 18

26 40



I don't insist on terseness for its own sake, but it's cleaner this way.



Jen





Feb 27, 2023, 16:55 by cs@cskk.id.au <mailto:cs@cskk.id.au> :

On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com <mailto:jenkris@tutanota.com> > wrote:

I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).



Sure, but writing a `finditer` for plain `str` is pretty easy (untested):



pos = 0
while True:
    found = s.find(substring, pos)
    if found < 0:
        break
    start = found
    end = found + len(substring)
    # ... do whatever with start and end ...
    pos = end



Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic, error-prone overkill. Just something to keep in mind.



Cheers,

Cameron Simpson <cs@cskk.id.au <mailto:cs@cskk.id.au> >

RE: How to escape strings for re.finditer? [ In reply to ]
This message is more for Thomas than Jen,

You made me think of what happens in fairly large cases. What happens if I ask you to search a thousand pages looking for your name?

One solution might be to break the problem into parts that can be run in independent threads or processes and perhaps across different CPU's or on many machines at once. Think of it as a variant on a merge sort where each chunk returns where it found one or more items and then those are gathered together and merged upstream.

The problem is you cannot just randomly divide the text. Any matches across a divide are lost. So if you know you are searching for "Thomas Passin", adjacent chunks need to overlap by at least one character less than the length of that string, so that a match straddling a divide still lies wholly inside one chunk. The result would not be a pure binary tree, and if the pattern could match text of varying sizes, the overlap would have to be larger and you would get duplicates. So the merging part would obviously have to eventually remove those.
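
A minimal sketch of the overlap idea for a fixed, non-empty search string. It runs the chunks one after another; handing each chunk to a separate worker is left out, and the function names and chunk sizes are assumptions made for illustration.

```python
def find_all(key, text, offset=0):
    """Start offsets of every occurrence of key in text (overlaps included)."""
    starts = []
    pos = text.find(key)
    while pos != -1:
        starts.append(offset + pos)
        pos = text.find(key, pos + 1)
    return starts

def chunked_find_all(key, text, chunk_size):
    """Search text chunk by chunk; assumes key is non-empty and chunk_size >= 1."""
    overlap = len(key) - 1          # enough for a match straddling a divide
    found = set()                   # a set drops any duplicate hits on merge
    for begin in range(0, len(text), chunk_size):
        chunk = text[begin:begin + chunk_size + overlap]
        found.update(find_all(key, chunk, offset=begin))
    return sorted(found)

text = 'Thomas Passin wrote to Thomas Passin about Thomas Passin'
key = 'Thomas Passin'
whole = find_all(key, text)
pieces = chunked_find_all(key, text, chunk_size=7)
print(whole, pieces)
```

With a fixed-length key and an overlap of exactly len(key) - 1, each match falls wholly inside only one chunk, so the set is merely a safety net here; it matters once the overlap is made larger than that.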

I have often wondered how Google and other such services are able to find millions of things in hardly any time and arguably never show most of them as who looks past a few pages/screens?

I think much of that may involve other techniques including quite a bit of pre-indexing. But they also seem to enlist lots of processors that each do the search on a subset of the problem space and combine and prioritize.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Thomas Passin
Sent: Tuesday, February 28, 2023 1:31 PM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

On 2/28/2023 1:07 PM, Jen Kris wrote:
>
> Using str.startswith is a cool idea in this case. But is it better
> than regex for performance or reliability? Regex syntax is not a
> model of simplicity, but in my simple case it's not too difficult.

The trouble is that we don't know what your case really is. If you are talking about a short pattern like your example and a small text to search, and you don't need to do it too often, then my little code example is probably ideal. Reliability wouldn't be an issue, and performance would not be relevant. If your case is going to be much larger, called many times in a loop, or be much more complicated in some other way, then a regex or some other approach is likely to be much faster.


> Feb 27, 2023, 18:52 by list1@tompassin.net:
>
> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>
> And, just for fun, since there is nothing wrong with your code,
> this minor change is terser:
>
> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> >>> for match in re.finditer(re.escape('abc_degree + 1'), example):
> ...     print(match.start(), match.end())
> ...
> 4 18
> 26 40
>
>
> Just for more fun :) -
>
> Without knowing how general your expressions will be, I think the
> following version is very readable, certainly more readable than
> regexes:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> KEY = 'abc_degree + 1'
>
> for i in range(len(example)):
>     if example[i:].startswith(KEY):
>         print(i, i + len(KEY))
> # prints:
> 4 18
> 26 40
>
> If you may have variable numbers of spaces around the symbols, OTOH,
> the whole situation changes and then regexes would almost certainly
> be the best approach. But the regular expression strings would
> become harder to read.
> --
> https://mail.python.org/mailman/listinfo/python-list
>
>
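
On the point quoted above about variable numbers of spaces around the symbols: one possible approach, sketched here under the assumption that the search string can be split into tokens by hand, is to escape each token separately and join them with \s*. The helper name flexible_pattern and the sample text are hypothetical.

```python
import re

def flexible_pattern(tokens):
    """Build a pattern matching the tokens with any whitespace between them."""
    return r'\s*'.join(re.escape(tok) for tok in tokens)

pattern = flexible_pattern(['abc_degree', '+', '1'])
example = 'X - abc_degree+1 + qq + abc_degree  +  1'
found = [m.group(0) for m in re.finditer(pattern, example)]
print(found)
```

This keeps the readable parts (the tokens) in plain text and confines the regex syntax to one small helper.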

--
https://mail.python.org/mailman/listinfo/python-list

RE: How to escape strings for re.finditer? [ In reply to ]
David,

Your results suggest we need to be reminded that lots depends on other
factors. There are multiple versions/implementations of python out there
including some written in C but also other underpinnings. Each can often
have sections of pure python code replaced carefully with libraries of
compiled code, or not. So your results will vary.

Just as an example, assume you derive a type of your own as a subclass of
str and you over-ride the find method by writing it in pure python using
loops and maybe add a few bells and whistles. If you used your improved
algorithm using this variant of str, might it not be quite a bit slower?
Imagine how much slower if your improvement also implemented caching and
logging and the option of ignoring case which are not really needed here.

This type of thing can happen in many other scenarios and some module may be
shared that is slow and a while later is updated but not everyone installs
the update so performance stats can vary wildly.

Some people advocate using some functional programming tactics, in various
languages, partially because the more general loops are SLOW. But that is
largely because some of the functional stuff is a compiled function that
hides the loops inside a faster environment than the interpreter.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of David Raymond
Sent: Tuesday, February 28, 2023 2:40 PM
To: python-list@python.org
Subject: RE: How to escape strings for re.finditer?

> I wrote my previous message before reading this. Thank you for the test
> you ran -- it answers the question of performance. You show that
> re.finditer is 30x faster, so that certainly recommends that over a simple
> loop, which introduces looping overhead.

>>     def using_simple_loop(key, text):
>>         matches = []
>>         for i in range(len(text)):
>>             if text[i:].startswith(key):
>>                 matches.append((i, i + len(key)))
>>         return matches
>>
>>     using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
>>     using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]


With a slight tweak to the simple loop code using .find() it becomes a third
faster than the RE version though.


def using_simple_loop2(key, text):
    matches = []
    keyLen = len(key)
    start = 0
    while (foundSpot := text.find(key, start)) > -1:
        start = foundSpot + keyLen
        matches.append((foundSpot, start))
    return matches


using_simple_loop: [0.1732664997689426, 0.1601669997908175,
0.15792609984055161, 0.1573973000049591, 0.15759290009737015]
using_re_finditer: [0.003412699792534113, 0.0032823001965880394,
0.0033694999292492867, 0.003354900050908327, 0.0033336998894810677]
using_simple_loop2: [0.00256159994751215, 0.0025471001863479614,
0.0025424999184906483, 0.0025831996463239193, 0.0025555999018251896]
--
https://mail.python.org/mailman/listinfo/python-list

Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 2:40 PM, David Raymond wrote:
> With a slight tweak to the simple loop code using .find() it becomes a third faster than the RE version though.
>
>
> def using_simple_loop2(key, text):
>     matches = []
>     keyLen = len(key)
>     start = 0
>     while (foundSpot := text.find(key, start)) > -1:
>         start = foundSpot + keyLen
>         matches.append((foundSpot, start))
>     return matches
>
>
> using_simple_loop: [0.1732664997689426, 0.1601669997908175, 0.15792609984055161, 0.1573973000049591, 0.15759290009737015]
> using_re_finditer: [0.003412699792534113, 0.0032823001965880394, 0.0033694999292492867, 0.003354900050908327, 0.0033336998894810677]
> using_simple_loop2: [0.00256159994751215, 0.0025471001863479614, 0.0025424999184906483, 0.0025831996463239193, 0.0025555999018251896]

On my system the difference is way bigger than that:

KEY = '''it doesn't matter, but in other cases it will.'''

using_simple_loop2: [0.0004955999902449548, 0.0004844000213779509,
0.0004862999776378274, 0.0004800999886356294, 0.0004792999825440347]

using_re_finditer: [0.002840900036972016, 0.0028330000350251794,
0.002701299963518977, 0.0028105000383220613, 0.0029977999511174858]

Shorter keys show the least differential:

KEY = 'in'

using_simple_loop2: [0.001983499969355762, 0.0019614999764598906,
0.0019617999787442386, 0.002027600014116615, 0.0020669000223279]

using_re_finditer: [0.002787900040857494, 0.0027620999608188868,
0.0027723999810405076, 0.002776700013782829, 0.002946800028439611]

Brilliant!

Python 3.10.9
Windows 10 AMD64 (build 10.0.19044) SP0

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 28Feb2023 18:57, Jen Kris <jenkris@tutanota.com> wrote:
>One question:  several people have made suggestions other than regex
>(not your terser example with regex you shown below).  Is there a
>reason why regex is not preferred to, for example, a list comp? 

These are different things; I'm not sure a comparison is meaningful.

>Performance?  Reliability? 

Regexps are:
- cryptic and error prone (you can make them more readable, but the
notation is deliberately both terse and powerful, which means that
small changes can have large effects in behaviour); the "error prone"
part does not mean that a regexp is unreliable, but that writing one
which is _correct_ for your task can be difficult, and also difficult
to debug
- have a compile step, which slows things down
- can be slower to execute as well, as a regexp does a bunch of
housekeeping for you
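
On the compile step: when the same fixed string is searched for repeatedly, the pattern can be compiled once and reused, as in the sketch below. The names KEY_RE and find_spans are illustrative. The re module also keeps a cache of recently compiled patterns, so in practice the cost is mostly paid on first use either way.

```python
import re

# Compile (and escape) the fixed search string once, up front.
KEY_RE = re.compile(re.escape('abc_degree + 1'))

def find_spans(text):
    """Return (start, end) for each match of the precompiled pattern."""
    return [m.span() for m in KEY_RE.finditer(text)]

spans = find_spans('X - abc_degree + 1 + qq + abc_degree + 1')
print(spans)
```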

The more complex the tool the more... indirection between your solution
using that tool and the smallest thing which needs to be done, and often
the slower the solution. This isn't absolute; there are times for the
complex tool.

Common opinion here is often that if you're doing simple fixed-string
things such as your task, which was finding instances of a fixed string,
just use the existing str methods. You'll end up writing what you need
directly and overtly.
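
For the task in this thread, that might look like the following sketch: a small generator built on str.find() that plays the role of a str.finditer(), which does not exist in the standard library. The name str_finditer is hypothetical.

```python
def str_finditer(key, text):
    """Yield (start, end) for each non-overlapping occurrence of key in text."""
    if not key:
        raise ValueError("key must be a non-empty string")
    start = text.find(key)
    while start != -1:
        end = start + len(key)
        yield start, end
        start = text.find(key, end)   # resume after the previous match

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
spans = list(str_finditer('abc_degree + 1', example))
print(spans)
```

No escaping is needed here, because str.find() always treats its argument as literal text.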

I've a personal maxim that one should use the "smallest" tool which
succinctly solves the problem. I usually use it to choose a programming
language (eg sed vs awk vs shell vs python in loose order of problem
difficulty), but it applies also to choosing tools within a language.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:
> Jen,
>
>
>
> I had no doubt the code you ran was indented properly or it would not work.
>
>
>
> I am merely letting you know that somewhere in the process of copying
> the code or the transition between mailers, my version is messed up.

The problem seems to be at your end. Jen's code looks ok here.

The content type is text/plain, no format=flowed or anything which would
affect the interpretation of line endings. However, after
base64-decoding it only contains unix-style LF line endings, not CRLF
line endings. That might throw your mailer off, but I have no idea why
it would join only some lines but not others.

> It happens to be easy for me to fix but I sometimes see garbled code I
> then simply ignore.

Truth to be told, that's one reason why I rarely read your mails to the
end. The long lines and the triple-spaced paragraphs make it just too
uncomfortable.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-03-01 01:01:42 +0100, Peter J. Holzer wrote:
> On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:
> > It happens to be easy for me to fix but I sometimes see garbled code I
> > then simply ignore.
>
> Truth to be told, that's one reason why I rarely read your mails to the
> end. The long lines and the triple-spaced paragraphs make it just too
> uncomfortable.

Hmm, since I was now paying a bit more attention to formatting problems
I saw that only about half of your messages have those long lines
although all seem to be sent with the same mailer. Don't know what's
going on there.

hp


--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Re: How to escape strings for re.finditer? [ In reply to ]
Regex is fine if it works for you. The critiques ("difficult to read") are subjective. Unless the code is in a section that has been profiled to be a bottleneck, I don't sweat performance at this level.

For me, using code that has already been written and vetted is the preferred approach to writing new code I have to test and maintain. I use an online regex tester, https://pythex.org, to get the syntax right before copying and pasting it into my code.

From: Python-list <python-list-bounces+gweatherby=uchc.edu@python.org> on behalf of Jen Kris via Python-list <python-list@python.org>
Date: Tuesday, February 28, 2023 at 1:11 PM
To: Thomas Passin <list1@tompassin.net>
Cc: python-list@python.org <python-list@python.org>
Subject: Re: How to escape strings for re.finditer?

Using str.startswith is a cool idea in this case. But is it better than regex for performance or reliability? Regex syntax is not a model of simplicity, but in my simple case it's not too difficult.


--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape strings for re.finditer? [ In reply to ]
Peter,

Nobody here would appreciate it if I tested it by sending out multiple
copies of each email to see if the same message wraps differently.

I am using a fairly standard mailer in Outlook that interfaces with gmail
and I could try mailing directly from gmail but apparently there are
systemic problems and I experience other complaints when sending directly
from AOL mail too.

So, if some people don't read me, I can live with that. I mean the right
people, LOL!

Or did I get that wrong?

I do appreciate the feedback. Ironically, when I politely shared how someone
else's email was displaying on my screen, it seems I am equally causing
similar issues for others.

An interesting question is whether any of us reading the archived copies see
different things including with various browsers:

https://mail.python.org/pipermail/python-list/

I am not sure which letters from me had the anomalies you mention but
spot-checking a few of them showed a normal display when I use Chrome.

But none of this is really a python issue except insofar as you never know
what functionality in the network was written in python.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of Peter J. Holzer
Sent: Tuesday, February 28, 2023 7:26 PM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

On 2023-03-01 01:01:42 +0100, Peter J. Holzer wrote:
> On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:
> > It happens to be easy for me to fix but I sometimes see garbled code
> > I then simply ignore.
>
> Truth to be told, that's one reason why I rarely read your mails to
> the end. The long lines and the triple-spaced paragraphs make it just
> too uncomfortable.

Hmm, since I was now paying a bit more attention to formatting problems I
saw that only about half of your messages have those long lines although all
seem to be sent with the same mailer. Don't know what's going on there.

hp


--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-02-28, Cameron Simpson <cs@cskk.id.au> wrote:

> Regexps are:
> - cryptic and error prone (you can make them more readable, but the
> notation is deliberately both terse and powerful, which means that
> small changes can have large effects in behaviour); the "error prone"
> part does not mean that a regexp is unreliable, but that writing one
> which is _correct_ for your task can be difficult,

The nasty thing is that writing one that _appears_ to be correct for
your task is often fairly easy. It will work as you expect for the
test cases you throw at it, but then fail in confusing ways when
released into the "real world". If you're lucky, it fails frequently
and obviously enough that you notice it right away. If you're not
lucky, it will fail infrequently and subtly for many years to come.

My rule: never use an RE if you can use the normal string methods
(even if it takes a few lines of code using them to replace a single
RE).
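
For the literal search this thread started with, a string-method version
might look like the sketch below. The helper name find_all is made up for
illustration; the example string and key are the ones used earlier in the
thread. It is only meant to show that str.find() can stand in for
re.finditer(re.escape(...)) when no pattern features are needed.

```python
import re


def find_all(text, key):
    """Yield (start, end) for each non-overlapping occurrence of key in text."""
    if not key:
        # An empty key would match at the same position forever; reject it.
        raise ValueError("key must not be empty")
    start = text.find(key)
    while start != -1:
        end = start + len(key)
        yield start, end
        # Resume after the previous match, as re.finditer does.
        start = text.find(key, end)


example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

spans = list(find_all(example, KEY))
# The regex version from earlier in the thread, for comparison.
regex_spans = [m.span() for m in re.finditer(re.escape(KEY), example)]
```

Both lists hold the same two spans for this input, the ones reported
earlier in the thread (4 18 and 26 40).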

--
Grant
--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 3/1/2023 12:04 PM, Grant Edwards wrote:
> On 2023-02-28, Cameron Simpson <cs@cskk.id.au> wrote:
>
>> Regexps are:
>> - cryptic and error prone (you can make them more readable, but the
>> notation is deliberately both terse and powerful, which means that
>> small changes can have large effects in behaviour); the "error prone"
>> part does not mean that a regexp is unreliable, but that writing one
>> which is _correct_ for your task can be difficult,
>
> The nasty thing is that writing one that _appears_ to be correct for
> your task is often fairly easy. It will work as you expect for the
> test cases you throw at it, but then fail in confusing ways when
> released into the "real world". If you're lucky, it fails frequently
> and obviously enough that you notice it right away. If you're not
> lucky, it will fail infrequently and subtly for many years to come.
>
> My rule: never use an RE if you can use the normal string methods
> (even if it takes a few lines of code using them to replace a single
> RE).

A corollary is that once you get a working regex, don't mess with it if
you do not absolutely have to.

--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape RE [ In reply to ]
Cameron,

The topic is now Regular Expressions and the sin tax. This is not
exclusively a Python issue as everybody and even their grandmother uses it
in various forms.

I remember early versions of RE were fairly simple and readable. It was a
terse minilanguage that allowed fairly complex things to be done but was
readable.

You now encounter versions that make people struggle, as countless extensions
have been sloppily grafted on. As an example, who ordered the many different
jobs that "?" now does? Many implementations have stretched the terse
notation further and made it both more and less legible. Unicode made lots of
older RE features less useful, because the definitions of things like
whitespace, word boundaries and word contents changed so much that new
constructs were added to handle them.

But, if you are operating mainly on ASCII text, the base functionality is
still in there and can be used fairly easily.

Consider it a bit like other mini-languages, such as the print() formatting
variants that kept adding functionality by packing lots of info tersely: you
specify that you want a floating point number with so many digits, and by the
way, right justified in a wider field, and if it is negative, do this.
Great if you can still remember how to read it.
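
As a small illustration of that kind of terse specification, here is a sketch
using Python's format spec inside an f-string. The variable name and the
numbers are made up for the example.

```python
value = -3.14159

# ">10.2f": right-justify in a field 10 wide, fixed point, 2 decimal places.
right_justified = f"{value:>10.2f}"

# "+.2f": always show the sign, even for positive numbers.
signed = f"{3.14159:+.2f}"

print(repr(right_justified))
print(repr(signed))
```

Each character in the spec after the colon carries meaning, which is what
makes it compact to write and easy to forget.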

I was reading a Python book recently which kept using a suffix of !r and I
finally looked it up. It asks an f-string (or str.format()) to format the
value with repr() instead of str(), so you get the representation of the
object. Then I found out this is not strictly needed, as an f-string also
lets you write something like {repr(val)}, so val!r is not the only, and
confusing, way.
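
A quick check of that equivalence, as a minimal sketch (the variable val and
its value are only for illustration):

```python
val = "spam"

with_conversion = f"{val!r}"        # the !r conversion calls repr(val)
with_call = f"{repr(val)}"          # an explicit call inside the braces
with_format = "{!r}".format(val)    # the same conversion in str.format()

print(with_conversion, with_call, with_format)
```

All three produce the same text, quotes included, so which one to use is
mostly a matter of taste.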

These mini-languages each require you to learn their own rules and quirks
and when you do, they can be powerful and intuitive, at least for the
features you memorized and maybe use regularly.

Now RE knowledge is the same and it ports moderately well between languages
except when it doesn't. As has been noted, the people at PERL relied on it a
lot and kept changing and extending it. Some Python functionality lets you
specify if you want PERL style or other styles.

But hiding your head in the sand is not always going to work for long. No,
you do not need to use RE for simple cases. Mind you, that is when it is
easiest to use it reliably. I read some books related to XML where much of
the work had been done in non-UNIX land years ago and they often had other
ways of doing things in their endless series of methods on validating a
schema or declaring it so data is forced to match the declared objectives
such as what type(s) each item can be or whether some fields must exist
inside others or in a particular order, or say you can have only three of
them and seemingly endless other such things. And then, suddenly, someone has
the idea to introduce the ability for you to specify many things using
regular expressions and the oppressiveness (for me) lifts and many things
can now be done trivially or that were not doable before. I had a similar
experience in my SQL reading where adding the ability to do some pattern
matching using a form of RE made life simpler.

The fact is that the idea of complex pattern matching IS complex and any
tool that lets you express it so fluidly will itself be complex. So, as some
have mentioned, find a resource that helps you build a regular expression
perhaps through menus, or one that verifies if one you created makes any
sense or lets you enter test data and have it show you how it is matching or
what to change to make it match differently. The multi-line version of RE
may also be helpful as well as sometimes breaking up a bigger one into
several smaller ones that your program uses in multiple phases.
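
In Python, the multi-line form is the re.VERBOSE flag. Below is a minimal
sketch; the date pattern and the sample text are hypothetical and chosen only
to show how whitespace and comments can be spread through a pattern.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts a
# comment, so each piece of the pattern can be explained on its own line.
DATE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)

m = DATE.search("released on 2023-03-01, patched later")
parts = m.groupdict() if m else {}
print(parts)
```

Named groups do some of the work of splitting one big expression into
phases, since each piece can be pulled out and checked separately.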

Python 3.10 added new functionality called Structural Pattern Matching.
You use a match statement with various cases that match patterns and if
matched, execute some action. Here is one tutorial if needed:

https://peps.python.org/pep-0636/

The point is that although not at all the same as a RE, we again have a bit
of a mini-language that can be used fairly concisely to investigate a
problem domain fairly quickly and efficiently and do things. It is an
overlapping but different form of pattern matching. And, in languages that
have long had similar ideas and constructs, people often cut back on using
other constructs like an IF statement, and just used something like this!

And consider this example as being vaguely like a bit of regular expression:

match command.split():
    case ["go", ("north" | "south" | "east" | "west")]:
        current_room = current_room.neighbor(...)

Like it or not, our future in programming is likely to include more and more
such aids along with headaches.

Avi

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of Grant Edwards
Sent: Wednesday, March 1, 2023 12:04 PM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

On 2023-02-28, Cameron Simpson <cs@cskk.id.au> wrote:

> Regexps are:
> - cryptic and error prone (you can make them more readable, but the
> notation is deliberately both terse and powerful, which means that
> small changes can have large effects in behaviour); the "error prone"
> part does not mean that a regexp is unreliable, but that writing one
> which is _correct_ for your task can be difficult,

The nasty thing is that writing one that _appears_ to be correct for your
task is often fairly easy. It will work as you expect for the test cases you
throw at it, but then fail in confusing ways when released into the "real
world". If you're lucky, it fails frequently and obviously enough that you
notice it right away. If you're not lucky, it will fail infrequently and
subtly for many years to come.

My rule: never use an RE if you can use the normal string methods (even if
it takes a few lines of code using them to replace a single RE).

--
Grant
--
https://mail.python.org/mailman/listinfo/python-list

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-03-01 01:01:42 +0100, Peter J. Holzer wrote:
> On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:
> > I had no doubt the code you ran was indented properly or it would not work.
> >
> > I am merely letting you know that somewhere in the process of copying
> > the code or the transition between mailers, my version is messed up.
>
> The problem seems to be at your end. Jen's code looks ok here.
[...]
> I have no idea why it would join only some lines but not others.

Actually I do have an idea now, since I noticed something similar at
work today: Outlook has an option "remove additional line breaks from
text-only messages" (translated from German) in the "Email / Message
Format" section. You want to make sure this is off if you are reading
mails where line breaks might be important[1].

hp

[1] Personally I'd say you shouldn't use Outlook if you are reading
mails where line breaks (or other formatting) is important, but ...

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
RE: How to escape strings for re.finditer? [ In reply to ]
Thanks, Peter. Excellent advice, even if only for any of us using Microsoft
Outlook as our mailer. I made the changes and we will see but they should
mainly impact what I see. I did tweak another parameter.

The problem for me was finding where they hid the options menu I needed.
Then, I started translating the menus back into German until I realized I
was being silly! Good practice though. LOL!

The truth is I generally can handle receiving mangled code as most of the
time I can re-edit it into shape, or am just reading it and not
copying/pasting.

What concerns me is to be able to send out the pure text content many seem
to need in a way that does not introduce the anomalies people see. Something
like a least-common denominator.

Or I could switch mailers. But my guess is reading/responding from the
native gmail editor may also need options changes and yet still impact some
readers.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of Peter J. Holzer
Sent: Thursday, March 2, 2023 3:09 PM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

On 2023-03-01 01:01:42 +0100, Peter J. Holzer wrote:
> On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:
> > I had no doubt the code you ran was indented properly or it would not
> > work.
> >
> > I am merely letting you know that somewhere in the process of
> > copying the code or the transition between mailers, my version is messed
> > up.
>
> The problem seems to be at your end. Jen's code looks ok here.
[...]
> I have no idea why it would join only some lines but not others.

Actually I do have an idea now, since I noticed something similar at work
today: Outlook has an option "remove additional line breaks from text-only
messages" (translated from German) in the "Email / Message Format"
section. You want to make sure this is off if you are reading mails where
line breaks might be important[1].

hp

[1] Personally I'd say you shouldn't use Outlook if you are reading mails
where line breaks (or other formatting) is important, but ...

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-03-02, Peter J. Holzer <hjp-python@hjp.at> wrote:


> [1] Personally I'd say you shouldn't use Outlook if you are reading
> mails where line breaks (or other formatting) is important, but ...

I'd shorten that to

"You shouldn't use Outlook if mail is important."

--
https://mail.python.org/mailman/listinfo/python-list
