Mailing List Archive

How to escape strings for re.finditer?
When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces. 

This works (no spaces):

import re
example = 'abcdefabcdefabcdefg'
find_string = "abc"
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

That gives me the start and end character positions, which is what I want. 

However, this does not work:

import re
example = re.escape('X - cty_degrees + 1 + qq')
find_string = re.escape('cty_degrees + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

I’ve tried several other attempts based on my research, but still no match. 

I don’t have much experience with regex, so I hoped a reg-expert might help. 

Thanks,

Jen

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-02-27 23:11, Jen Kris via Python-list wrote:
> When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces.
>
> This works (no spaces):
>
> import re
> example = 'abcdefabcdefabcdefg'
> find_string = "abc"
> for match in re.finditer(find_string, example):
>     print(match.start(), match.end())
>
> That gives me the start and end character positions, which is what I want.
>
> However, this does not work:
>
> import re
> example = re.escape('X - cty_degrees + 1 + qq')
> find_string = re.escape('cty_degrees + 1')
> for match in re.finditer(find_string, example):
>     print(match.start(), match.end())
>
> I’ve tried several other attempts based on my research, but still no match.
>
> I don’t have much experience with regex, so I hoped a reg-expert might help.
>
You need to escape only the pattern, not the string you're searching.
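
A minimal sketch of that fix, reusing the names from the original post: only the pattern goes through re.escape, while the text being searched is left exactly as it is. The `spans` list is just an illustrative way of collecting the results.

```python
import re

example = 'X - cty_degrees + 1 + qq'          # text to search: left untouched
find_string = re.escape('cty_degrees + 1')    # pattern: escaped so '+' is literal

spans = [(m.start(), m.end()) for m in re.finditer(find_string, example)]
print(spans)
```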
Re: How to escape strings for re.finditer? [ In reply to ]
On 28Feb2023 00:11, Jen Kris <jenkris@tutanota.com> wrote:
>When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces. 
>
>This works (no spaces):
>
>import re
>example = 'abcdefabcdefabcdefg'
>find_string = "abc"
>for match in re.finditer(find_string, example):
>    print(match.start(), match.end())
>
>That gives me the start and end character positions, which is what I want. 
>
>However, this does not work:
>
>import re
>example = re.escape('X - cty_degrees + 1 + qq')
>find_string = re.escape('cty_degrees + 1')
>for match in re.finditer(find_string, example):
>    print(match.start(), match.end())
>
>I’ve tried several other attempts based on my research, but still no
>match. 

You need to print those strings out. You're escaping the _example_
string, which would make it:

X - cty_degrees \+ 1 \+ qq

because `+` is a special character in regexps and so `re.escape` escapes
it. But you don't want to mangle the string you're searching! After all,
the text above does not contain the string `cty_degrees + 1`.
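
A small sketch of that debugging step, with variable names made up for illustration. The exact escaped text differs somewhat between Python versions (newer ones escape fewer characters), so the point is only to look at what finditer would actually receive.

```python
import re

raw_example = 'X - cty_degrees + 1 + qq'
raw_pattern = 'cty_degrees + 1'

escaped_example = re.escape(raw_example)
escaped_pattern = re.escape(raw_pattern)

# Look at what finditer would actually receive.
print(repr(escaped_example))
print(repr(escaped_pattern))

# The escaped text no longer contains the literal substring...
print(raw_pattern in escaped_example)
# ...so the escaped pattern has nothing to match.
print(re.search(escaped_pattern, escaped_example))
```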

My secondary question is: if you're escaping the thing you're searching
_for_, then you're effectively searching for a _fixed_ string, not a
pattern/regexp. So why on earth are you using regexps to do your
searching?

The `str` type has a `find(substring)` function. Just use that! It'll be
faster and the code simpler!

Cheers,
Cameron Simpson <cs@cskk.id.au>
Re: How to escape strings for re.finditer? [ In reply to ]
Yes, that's it.  I don't know how long it would have taken to find that detail with research through the voluminous re documentation.  Thanks very much. 

RE: How to escape strings for re.finditer? [ In reply to ]
MRAB makes a valid point. Only the pattern you are looking for is compiled as a regular expression, and if it contains anything that might be a command, such as a ^ at the start or [12] in the middle, you want that converted so NONE OF THAT is treated as one. It will be compiled to something that looks for a literal ^, including later in the string, and looks for a real [, then a real 1 and a real 2 and a real ], not for one of the choices of 1 or 2.

Your example was 'cty_degrees + 1', which can introduce a subtle bug. The special character is "+", which means match greedily as many copies of the previous entity as possible. In this case, the previous entity is a single space. So the regular expression will match 'cty_degrees', then match the single space it sees, because the pattern asks for a space followed by a plus; then, since it is not looking for a literal plus, it hits the plus in the text and fails. If your pattern is rewritten in whatever way re.escape uses, it might be 'cty_degrees \+ 1', and then it should work fine.

But escaping the text you are searching just breaks that, as the result will contain a '\+', which is viewed as two unrelated symbols, and the backslash stops the match from going further.
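
To make the point about "+" concrete, here is a short sketch. The second test string with two spaces is an invented example, not something from the original post.

```python
import re

text = 'X - cty_degrees + 1 + qq'

# Unescaped: ' +' means "one or more spaces", so no literal plus is allowed.
unescaped = re.search('cty_degrees + 1', text)
# The same unescaped pattern happily matches two spaces followed by '1'.
two_spaces = re.search('cty_degrees + 1', 'cty_degrees  1')
# Escaped: every character in the pattern is taken literally.
escaped = re.search(re.escape('cty_degrees + 1'), text)

print(unescaped)
print(two_spaces.span())
print(escaped.span())
```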



Re: How to escape strings for re.finditer? [ In reply to ]
I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).  For example: 

a = "X - abc_degree + 1 + qq + abc_degree + 1"
b = "abc_degree + 1"
q = a.find(b)

print(q)
4

So it correctly finds the start of the first instance, but not the second one.  The re code finds both instances.  If I knew that the substring occurred only once then the str.find would be best. 

I changed my re code after MRAB's comment, and it now works. 

Thanks much. 

Jen


Re: How to escape strings for re.finditer? [ In reply to ]
On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com> wrote:
>I went to the re module because the specified string may appear more
>than once in the string (in the code I'm writing).

Sure, but writing a `finditer` for plain `str` is pretty easy
(untested):

pos = 0
while True:
    found = s.find(substring, pos)
    if found < 0:
        break
    start = found
    end = found + len(substring)
    # ... do whatever with start and end ...
    pos = end

Many people go straight to the `re` module whenever they're looking for
strings. It is often cryptic error prone overkill. Just something to
keep in mind.
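
One possible way to package the loop above is as a generator, sketched here. The name `find_all` and the empty-substring guard are additions for illustration and were not part of the original suggestion.

```python
def find_all(s, substring):
    """Yield (start, end) for each non-overlapping occurrence of substring in s."""
    if not substring:
        return  # guard: an empty substring would otherwise loop forever
    pos = 0
    while True:
        found = s.find(substring, pos)
        if found < 0:
            break
        end = found + len(substring)
        yield found, end
        pos = end  # resume after this match


a = "X - abc_degree + 1 + qq + abc_degree + 1"
b = "abc_degree + 1"
spans = list(find_all(a, b))
print(spans)
```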

Cheers,
Cameron Simpson <cs@cskk.id.au>
RE: How to escape strings for re.finditer? [ In reply to ]
Just FYI, Jen, there are times a sledgehammer works but perhaps is not the only way. These days people worry less about efficiency and more about programmer time and education and that can be fine.

But if you looked at the methods available in strings or in some other modules, your situation is quite common. Some may use another RE front end called finditer().

I am NOT suggesting you do what I say next, but imagine writing a loop that takes a substring of what you are searching for of the same length as your search string. Near the end, it stops as there is too little left.

You can now simply test your searched for string against that substring for equality and it tends to return rapidly when they are not equal early on.

Your loop would return whatever data structure or results you want such as that it matched it three times at offsets a, b and c.

But do you allow overlaps? If not, your loop needs to skip len(search_str) after a match.

What you may want to consider is another form of pre-processing. Do you care if "abc_degree + 1" has missing or added spaces at the start or end or anywhere in the middle, as in " abc_degree +1"?

Do you care if stuff is a different case like "Abc_Degree + 1"?

Some such searches can be done if both the pattern and searched string are first converted to a canonical format that maps to the same output. But that complicates things a bit and you may need to display what you match differently.
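
A rough sketch of one such canonical form, assuming that all whitespace and letter case are irrelevant. The helper name `canonical` is hypothetical. Note that any offsets found this way would refer to the canonical text, not the original, which is the display complication mentioned above.

```python
def canonical(s):
    """Drop all whitespace and fold case so trivially different spellings compare equal."""
    return ''.join(s.split()).casefold()


needle = canonical('abc_degree + 1')
print(needle)
print(needle in canonical('X -  Abc_Degree +1 + qq'))
print(needle in canonical('X - abc_degrees + 1'))
```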

And are you also willing to match this: "myabc_degree + 1"?

When using a crafted RE there is a way to ask for a word boundary so abc will only be matched if before that is a space or the start of the string and not "my".
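
A short sketch of the word-boundary idea using `\b` from the re module. Strictly, `\b` matches at any transition between word characters (letters, digits, underscore) and anything else, which is a little broader than "a space or the start of the string". The sample text is invented.

```python
import re

text = 'myabc_degree + 1 and abc_degree + 1'
# \b must be added outside re.escape, otherwise it would be escaped too.
pattern = r'\b' + re.escape('abc_degree + 1')

spans = [(m.start(), m.end()) for m in re.finditer(pattern, text)]
print(spans)
```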

So this may be a case where you can either solve an easy version, with the chance it can be fooled, or overengineer it. If you allow the user to type in what to search for, as many programs, including editors, do, you will often find such false positives unless the user knows RE syntax and applies it, and you do not escape it. I have experienced havoc when doing a careless global replace that matched more than I expected, including making changes in comments or constant strings rather than just the name of a function. Adding a paren is helpful, as is replacing matches one at a time rather than all at once and skipping any that are not wanted.

Good luck.

Re: How to escape strings for re.finditer? [ In reply to ]
On 28Feb2023 00:57, Jen Kris <jenkris@tutanota.com> wrote:
>Yes, that's it.  I don't know how long it would have taken to find that
>detail with research through the voluminous re documentation.  Thanks
>very much. 

You find things like this by printing out the strings you're actually
working with. Not the original strings, but the strings when you're
invoking `finditer` i.e. in your case, escaped strings.

Then you might have seen that what you were searching no longer
contained what you were searching for.

Don't underestimate the value of the debugging print call. It lets you
see what your programme is actually working with, instead of what you
thought it was working with.

Cheers,
Cameron Simpson <cs@cskk.id.au>
Re: How to escape strings for re.finditer? [ In reply to ]
string.count() only tells me there are N instances of the string; it does not say where they begin and end, as does re.finditer. 

Feb 27, 2023, 16:20 by bobmellowood@gmail.com:

> Would string.count() work for you then?
Re: How to escape strings for re.finditer? [ In reply to ]
I haven't tested it either but it looks like it would work.  But for this case I prefer the relative simplicity of:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

4 18
26 40

I don't insist on terseness for its own sake, but it's cleaner this way. 

Jen


RE: How to escape strings for re.finditer? [ In reply to ]
Jen,

What you just described is why that tool is not the right tool for the job, albeit it may help you confirm if whatever method you choose does work correctly and finds the same number of matches.

Sometimes you simply do some searching and roll your own.

Consider this code using a sort of list comprehension feature:

>>> short = "hello world"
>>> longer = "hello world is how many programs start for novices but some use hello world! to show how happy they are to say hello world"

>>> short in longer
True
>>> howLong = len(short)

>>> res = [(offset, offset + howLong) for offset in range(len(longer)) if longer.startswith(short, offset)]
>>> res
[(0, 11), (64, 75), (111, 122)]
>>> len(res)
3

I could do a bit more but it seems to work. Did I get the offsets right? Checking:

>>> print( [ longer[res[index][0]:res[index][1]] for index in range(len(res))])
['hello world', 'hello world', 'hello world']

Seems to work but thrown together quickly so can likely be done much nicer.

But as noted, the above has flaws such as matching overlaps like:

>>> short = "good good"
>>> longer = "A good good good but not douple plus good good good goody"
>>> howLong = len(short)
>>> res = [(offset, offset + howLong) for offset in range(len(longer)) if longer.startswith(short, offset)]
>>> res
[(2, 11), (7, 16), (37, 46), (42, 51), (47, 56)]

It matched five times, as sometimes we had three or four "good" in a row. Some other method might match only three.
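
As a sketch of the difference, the function below (the name `find_spans` and the `allow_overlap` flag are invented for illustration) uses str.find with a start offset and either steps one character past each match or skips the whole match.

```python
def find_spans(text, key, allow_overlap=False):
    """Return (start, end) pairs for key in text, with or without overlapping matches."""
    spans = []
    if not key:
        return spans
    # Overlapping search restarts one character later; otherwise skip the whole match.
    step = 1 if allow_overlap else len(key)
    pos = text.find(key)
    while pos >= 0:
        spans.append((pos, pos + len(key)))
        pos = text.find(key, pos + step)
    return spans


longer = "A good good good but not douple plus good good good goody"
print(find_spans(longer, "good good", allow_overlap=True))
print(find_spans(longer, "good good"))
```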

What some might do can get long and you clearly want one answer and not tutorials. For example, people can make a loop that finds a match and either sabotages the area by replacing or deleting it, or keeps track and searches again on a substring offset from the beginning.

When you do not find a tool, consider making one. You can take (better) code than I show above and make it into a function and now you have a tool. Even better, you can make it return whatever you want.

RE: How to escape strings for re.finditer? [ In reply to ]
Jen,

Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.

What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.

This is what you sent:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
print(match.start(), match.end())

This is code indented properly:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....

And, just for fun, since there is nothing wrong with your code, this minor change is terser:

>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
...     print(match.start(), match.end())
...
...
4 18
26 40

But note that once you use regular expressions as real patterns, which is not your case, you might match multiple things that are far from the same, such as matching two repeated words of any kind in any case, including "and and" and "so so", or finding words that have multiple doubled letters, as in the stereotypical bookkeeper. In those cases, you may want even more than offsets and also show the exact text that matched or even show some characters before and/or after for context.
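
A hedged sketch of the repeated-word case: the sample sentence and the five-character context window are arbitrary choices, and the pattern is one common way to express "a word followed by the same word", not the only one.

```python
import re

text = 'This is is a test and and so so on'
# \b(\w+)\s+\1\b: a word, whitespace, then the same word again.
doubled = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

for m in doubled.finditer(text):
    # Show up to 5 characters of context on each side of the match.
    before = text[max(0, m.start() - 5):m.start()]
    after = text[m.end():m.end() + 5]
    print(m.span(), repr(m.group()), repr(before), repr(after))

matched = [m.group() for m in doubled.finditer(text)]
```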


Re: How to escape strings for re.finditer? [ In reply to ]
On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
> And, just for fun, since there is nothing wrong with your code, this minor change is terser:
>
> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> >>> for match in re.finditer(re.escape('abc_degree + 1') , example):
> ...     print(match.start(), match.end())
> ...
> ...
> 4 18
> 26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the
following version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
# 4 18
# 26 40

If you may have variable numbers of spaces around the symbols, OTOH, the
whole situation changes and then regexes would almost certainly be the
best approach. But the regular expression strings would become harder
to read.
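
A sketch of what such a regex might look like, assuming the only variation is the amount of whitespace between the tokens of the key. The sample text with irregular spacing is invented.

```python
import re

KEY = 'abc_degree + 1'
# Escape each whitespace-separated token, then allow any amount of whitespace between them.
pattern = r'\s*'.join(re.escape(token) for token in KEY.split())

example = 'X - abc_degree+1 + qq + abc_degree  +   1'
spans = [(m.start(), m.end()) for m in re.finditer(pattern, example)]
print(pattern)
print(spans)
```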
RE: How to escape strings for re.finditer? [ In reply to ]
I think by now we have given the OP all that is needed, but Dave's answer
strikes me as something that could be a tad faster as a while loop if you are
searching a larger corpus, such as an entire ebook or all books, as you can do
on books.google.com

I think I mentioned earlier that some assumptions need to apply. The text
needs to be something like an ASCII encoding or seen as code points rather
than bytes. We assume a match should move forward by the length of the
match. And, clearly, there cannot be a match too close to the end.

So a while loop would begin with a variable set to zero to mark the current
location of the search. The condition for repeating the loop is that this
variable is less than or equal to len(searched_text) - len(key)

In the loop, each comparison is done the same way as David uses, or anything
similar enough but the twist is a failure increments the variable by 1 while
success increments by len(key).

Will this make much difference? It might as the simpler algorithm counts
overlapping matches and wastes some time hunting where perhaps it shouldn't.

And, of course, if you made something like this into a search function, you
can easily add features such as asking that you only return the first N
matches or the next N, simply by making it a generator.

So tying this into an earlier discussion, do you want the LAST match info
visible when the While loop has completed? If it was available, it opens up
possibilities for running the loop again but starting from where you left
off.
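
The while loop described above might look something like this sketch. The function name `iter_matches` and its `start` parameter are assumptions made for illustration; itertools.islice is used to take only the first N matches from the generator.

```python
from itertools import islice


def iter_matches(text, key, start=0):
    """Yield (start, end) for non-overlapping matches of key in text, scanning from start."""
    if not key:
        return
    pos = start
    last_start = len(text) - len(key)  # no match can begin after this index
    while pos <= last_start:
        if text.startswith(key, pos):
            yield pos, pos + len(key)
            pos += len(key)  # success: jump past the match
        else:
            pos += 1         # failure: advance one character


example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

print(list(iter_matches(example, KEY)))
first_only = list(islice(iter_matches(example, KEY), 1))
print(first_only)
# Resume from where the first match ended.
print(list(iter_matches(example, KEY, start=first_only[0][1])))
```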



-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of Thomas Passin
Sent: Monday, February 27, 2023 9:44 PM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
> And, just for fun, since there is nothing wrong with your code, this minor
change is terser:
>
>>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
> ... print(match.start(), match.end())
> ...
> ...
> 4 18
> 26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the following
version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
4 18
26 40

If you may have variable numbers of spaces around the symbols, OTOH, the
whole situation changes and then regexes would almost certainly be the best
approach. But the regular expression strings would become harder to read.
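
As a rough illustration of the variable-spacing case mentioned above, one possible sketch is to split the key on whitespace, escape each piece, and allow any amount of whitespace (including none) between the pieces. The helper name flexible_pattern is made up for this example, and it assumes whitespace may only vary where the key itself contains whitespace.

```python
import re


def flexible_pattern(key):
    """Compile a regex matching key's tokens with optional whitespace between them.

    Hypothetical helper for illustration only.
    """
    tokens = key.split()
    # Escape every token so characters such as '+' are matched literally.
    return re.compile(r'\s*'.join(re.escape(tok) for tok in tokens))


pattern = flexible_pattern('abc_degree + 1')
example = 'X - abc_degree+1 + qq + abc_degree   +   1'
for match in pattern.finditer(example):
    print(match.start(), match.end(), repr(match.group()))
```

The compiled pattern here is abc_degree\s*\+\s*1, which is already a little harder to read than the plain key.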
--
https://mail.python.org/mailman/listinfo/python-list

Re: How to escape strings for re.finditer? [ In reply to ]
Op 28/02/2023 om 3:44 schreef Thomas Passin:
> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>> And, just for fun, since there is nothing wrong with your code, this
>> minor change is terser:
>>
>>>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
>> ...     print(match.start(), match.end())
>> ...
>> ...
>> 4 18
>> 26 40
>
> Just for more fun :) -
>
> Without knowing how general your expressions will be, I think the
> following version is very readable, certainly more readable than regexes:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> KEY = 'abc_degree + 1'
>
> for i in range(len(example)):
>     if example[i:].startswith(KEY):
>         print(i, i + len(KEY))
> # prints:
> 4 18
> 26 40
I think it's often a good idea to use a standard library function
instead of rolling your own. The issue becomes less clear-cut when the
standard library doesn't do exactly what you need (as here, where
re.finditer() uses regular expressions while the use case only uses
simple search strings). Ideally there would be a str.finditer() method
we could use, but in the absence of that I think we still need to
consider using the almost-but-not-quite fitting re.finditer().

Two reasons:

(1) I think it's clearer: the name tells us what it does (though of
course we could solve this in a hand-written version by wrapping it in a
suitably named function).

(2) Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but
slowly. In some situations it doesn't matter, but in other cases it
will. For better performance, string searching algorithms jump ahead
either when they found a match or when they know for sure there isn't a
match for some time (see e.g. the Boyer–Moore string-search algorithm).
You could write such a more efficient algorithm, but then it becomes
more complex and more error-prone. Using a well-tested existing function
becomes quite attractive.
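
To give an idea of what such "jumping ahead" can look like, here is a minimal sketch of the Horspool simplification of Boyer–Moore. It is not the full Boyer–Moore algorithm, it makes no claim about what re or str.find do internally, and in pure Python it will most likely be slower than either. The function name horspool_find_all is invented for this illustration.

```python
def horspool_find_all(key, text):
    """Return (start, end) for every occurrence of key in text, overlaps included."""
    m, n = len(key), len(text)
    if m == 0 or m > n:
        return []
    # Pre-processing: for each character of key except the last one, record
    # how far the window may shift when that character sits under the
    # window's last position. Later (rightmost) occurrences overwrite
    # earlier ones, which gives the smaller, safe shift.
    shift = {ch: m - 1 - i for i, ch in enumerate(key[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == key:
            matches.append((pos, pos + m))
        # A character that does not occur in key[:-1] allows a jump of the
        # full key length.
        pos += shift.get(text[pos + m - 1], m)
    return matches


example = 'X - abc_degree + 1 + qq + abc_degree + 1'
print(horspool_find_all('abc_degree + 1', example))
```

Even this simplified version needs a shift table and some care about edge cases, which is the extra complexity being weighed against a well-tested library function.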

To illustrate the difference in performance, I did a simple test (using the
paragraph above as test text):

    import re
    import timeit

    def using_re_finditer(key, text):
        matches = []
        for match in re.finditer(re.escape(key), text):
            matches.append((match.start(), match.end()))
        return matches


    def using_simple_loop(key, text):
        matches = []
        for i in range(len(text)):
            if text[i:].startswith(key):
                matches.append((i, i + len(key)))
        return matches


    CORPUS = """Searching for a string in another string, in a performant way, is
    not as simple as it first appears. Your version works correctly, but slowly.
    In some situations it doesn't matter, but in other cases it will. For better
    performance, string searching algorithms jump ahead either when they found a
    match or when they know for sure there isn't a match for some time (see e.g.
    the Boyer–Moore string-search algorithm). You could write such a more
    efficient algorithm, but then it becomes more complex and more error-prone.
    Using a well-tested existing function becomes quite attractive."""
    KEY = 'in'
    print('using_simple_loop:', timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(), number=1000))
    print('using_re_finditer:', timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(), number=1000))

This does 5 runs of 1000 repetitions each, and reports the time in
seconds for each of those runs.
Result on my machine:

    using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
    using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]

We find that in this test re.finditer() is more than 30 times faster
(despite the overhead of regular expressions).

While speed isn't everything in programming, with such a large
difference in performance and (to me) no real disadvantages of using
re.finditer(), I would prefer re.finditer() over writing my own.

--
"The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom."
-- Isaac Asimov

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 4:33 AM, Roel Schroeven wrote:
> Op 28/02/2023 om 3:44 schreef Thomas Passin:
>> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>>> And, just for fun, since there is nothing wrong with your code, this
>>> minor change is terser:
>>>
>>>>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
>>> ...     print(match.start(), match.end())
>>> ...
>>> ...
>>> 4 18
>>> 26 40
>>
>> Just for more fun :) -
>>
>> Without knowing how general your expressions will be, I think the
>> following version is very readable, certainly more readable than regexes:
>>
>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> KEY = 'abc_degree + 1'
>>
>> for i in range(len(example)):
>>     if example[i:].startswith(KEY):
>>         print(i, i + len(KEY))
>> # prints:
>> 4 18
>> 26 40
> I think it's often a good idea to use a standard library function
> instead of rolling your own. The issue becomes less clear-cut when the
> standard library doesn't do exactly what you need (as here, where
> re.finditer() uses regular expressions while the use case only uses
> simple search strings). Ideally there would be a str.finditer() method
> we could use, but in the absence of that I think we still need to
> consider using the almost-but-not-quite fitting re.finditer().
>
> Two reasons:
>
> (1) I think it's clearer: the name tells us what it does (though of
> course we could solve this in a hand-written version by wrapping it in a
> suitably named function).
>
> (2) Searching for a string in another string, in a performant way, is
> not as simple as it first appears. Your version works correctly, but
> slowly. In some situations it doesn't matter, but in other cases it
> will. For better performance, string searching algorithms jump ahead
> either when they found a match or when they know for sure there isn't a
> match for some time (see e.g. the Boyer–Moore string-search algorithm).
> You could write such a more efficient algorithm, but then it becomes
> more complex and more error-prone. Using a well-tested existing function
> becomes quite attractive.

Sure, it all depends on what the real task will be. That's why I wrote
"Without knowing how general your expressions will be". For the example
string, it's unlikely that speed will be a factor, but who knows what
target strings and keys will turn up in the future?

> To illustrate the difference performance, I did a simple test (using the
> paragraph above is test text):
>
>     import re
>     import timeit
>
>     def using_re_finditer(key, text):
>         matches = []
>         for match in re.finditer(re.escape(key), text):
>             matches.append((match.start(), match.end()))
>         return matches
>
>
>     def using_simple_loop(key, text):
>         matches = []
>         for i in range(len(text)):
>             if text[i:].startswith(key):
>                 matches.append((i, i + len(key)))
>         return matches
>
>
>     CORPUS = """Searching for a string in another string, in a
> performant way, is
>     not as simple as it first appears. Your version works correctly,
> but slowly.
>     In some situations it doesn't matter, but in other cases it will.
> For better
>     performance, string searching algorithms jump ahead either when
> they found a
>     match or when they know for sure there isn't a match for some time
> (see e.g.
>     the Boyer–Moore string-search algorithm). You could write such a more
>     efficient algorithm, but then it becomes more complex and more
> error-prone.
>     Using a well-tested existing function becomes quite attractive."""
>     KEY = 'in'
>     print('using_simple_loop:',
> timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(),
> number=1000))
>     print('using_re_finditer:',
> timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(),
> number=1000))
>
> This does 5 runs of 1000 repetitions each, and reports the time in
> seconds for each of those runs.
> Result on my machine:
>
>     using_simple_loop: [0.13952950000020792, 0.13063130000000456,
> 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
>     using_re_finditer: [0.003861400000005233, 0.004061900000124297,
> 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]
>
> We find that in this test re.finditer() is more than 30 times faster
> (despite the overhead of regular expressions.
>
> While speed isn't everything in programming, with such a large
> difference in performance and (to me) no real disadvantages of using
> re.finditer(), I would prefer re.finditer() over writing my own.
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
Op 28/02/2023 om 14:35 schreef Thomas Passin:
> On 2/28/2023 4:33 AM, Roel Schroeven wrote:
>> [...]
>> (2) Searching for a string in another string, in a performant way, is
>> not as simple as it first appears. Your version works correctly, but
>> slowly. In some situations it doesn't matter, but in other cases it
>> will. For better performance, string searching algorithms jump ahead
>> either when they found a match or when they know for sure there isn't
>> a match for some time (see e.g. the Boyer–Moore string-search
>> algorithm). You could write such a more efficient algorithm, but then
>> it becomes more complex and more error-prone. Using a well-tested
>> existing function becomes quite attractive.
>
> Sure, it all depends on what the real task will be.  That's why I
> wrote "Without knowing how general your expressions will be". For the
> example string, it's unlikely that speed will be a factor, but who
> knows what target strings and keys will turn up in the future?
In hindsight I think it was overthinking things a bit. "It all depends
on what the real task will be" you say, and indeed I think that should
be the main conclusion here.

--
"Man had always assumed that he was more intelligent than dolphins because
he had achieved so much — the wheel, New York, wars and so on — whilst all
the dolphins had ever done was muck about in the water having a good time.
But conversely, the dolphins had always believed that they were far more
intelligent than man — for precisely the same reasons."
-- Douglas Adams

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 10:05 AM, Roel Schroeven wrote:
> Op 28/02/2023 om 14:35 schreef Thomas Passin:
>> On 2/28/2023 4:33 AM, Roel Schroeven wrote:
>>> [...]
>>> (2) Searching for a string in another string, in a performant way, is
>>> not as simple as it first appears. Your version works correctly, but
>>> slowly. In some situations it doesn't matter, but in other cases it
>>> will. For better performance, string searching algorithms jump ahead
>>> either when they found a match or when they know for sure there isn't
>>> a match for some time (see e.g. the Boyer–Moore string-search
>>> algorithm). You could write such a more efficient algorithm, but then
>>> it becomes more complex and more error-prone. Using a well-tested
>>> existing function becomes quite attractive.
>>
>> Sure, it all depends on what the real task will be.  That's why I
>> wrote "Without knowing how general your expressions will be". For the
>> example string, it's unlikely that speed will be a factor, but who
>> knows what target strings and keys will turn up in the future?
> On hindsight I think it was overthinking things a bit. "It all depends
> on what the real task will be" you say, and indeed I think that should
> be the main conclusion here.


It is interesting, though, how pre-processing the search pattern can
improve search times if you can afford the pre-processing. Here's a
paper on rapidly finding matches when there may be up to one misspelled
character. It's easy enough to implement, though in Python you can't
take the additional step of tuning it to stay in cache.

https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf
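
Just to make the problem statement concrete, here is a deliberately naive sketch of matching with at most one wrong character. It is NOT the algorithm from the paper and does no pre-processing at all. It assumes "misspelled" means a single substituted character, so insertions and deletions are not handled. The function name find_with_one_mismatch is made up for this example.

```python
def find_with_one_mismatch(key, text):
    """Yield (start, end) where text differs from key in at most one position.

    Naive illustration only: every window is compared character by character.
    """
    m = len(key)
    if m == 0:
        return
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(key, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > 1:
                    break  # too many differences, give up on this window
        if mismatches <= 1:
            yield start, start + m


print(list(find_with_one_mismatch('degree', 'abc_degree + abc_degrie')))
```

A pre-processing approach aims to avoid exactly this kind of window-by-window comparison.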

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-02-28, Thomas Passin <list1@tompassin.net> wrote:
> On 2/28/2023 10:05 AM, Roel Schroeven wrote:
>> Op 28/02/2023 om 14:35 schreef Thomas Passin:
>>> On 2/28/2023 4:33 AM, Roel Schroeven wrote:
>>>> [...]
>>>> (2) Searching for a string in another string, in a performant way, is
>>>> not as simple as it first appears. Your version works correctly, but
>>>> slowly. In some situations it doesn't matter, but in other cases it
>>>> will. For better performance, string searching algorithms jump ahead
>>>> either when they found a match or when they know for sure there isn't
>>>> a match for some time (see e.g. the Boyer–Moore string-search
>>>> algorithm). You could write such a more efficient algorithm, but then
>>>> it becomes more complex and more error-prone. Using a well-tested
>>>> existing function becomes quite attractive.
>>>
>>> Sure, it all depends on what the real task will be.  That's why I
>>> wrote "Without knowing how general your expressions will be". For the
>>> example string, it's unlikely that speed will be a factor, but who
>>> knows what target strings and keys will turn up in the future?
>> On hindsight I think it was overthinking things a bit. "It all depends
>> on what the real task will be" you say, and indeed I think that should
>> be the main conclusion here.
>
> It is interesting, though, how pre-processing the search pattern can
> improve search times if you can afford the pre-processing. Here's a
> paper on rapidly finding matches when there may be up to one misspelled
> character. It's easy enough to implement, though in Python you can't
> take the additional step of tuning it to stay in cache.
>
> https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf

You've somehow title-cased that URL. The correct URL is:

https://robert.muth.org/Papers/1996-approx-multi.pdf
--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape strings for re.finditer? [ In reply to ]
The code I sent is correct, and it runs here.  Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

One question: several people have made suggestions other than regex (apart from your terser regex example, shown below). Is there a reason why regex is not preferred to, for example, a list comp? Performance? Reliability?

Feb 27, 2023, 18:16 by avi.e.gross@gmail.com:

> Jen,
>
> Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.
>
> What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.
>
> This is what you sent:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
> print(match.start(), match.end())
>
> This is code indented properly:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> find_string = re.escape('abc_degree + 1')
> for match in re.finditer(find_string, example):
>     print(match.start(), match.end())
>
> Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....
>
> And, just for fun, since there is nothing wrong with your code, this minor change is terser:
>
> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> >>> for match in re.finditer(re.escape('abc_degree + 1') , example):
> ...     print(match.start(), match.end())
> ...
> ...
> 4 18
> 26 40
>
> But note once you use regular expressions, and not in your case, you might match multiple things that are far from the same such as matching two repeated words of any kind in any case including "and and" and "so so" or finding words that have multiple doubled letter as in the stereotypical bookkeeper. In those cases, you may want even more than offsets but also show the exact text that matched or even show some characters before and/or after for context.
>
>
> -----Original Message-----
> From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Jen Kris via Python-list
> Sent: Monday, February 27, 2023 8:36 PM
> To: Cameron Simpson <cs@cskk.id.au>
> Cc: Python List <python-list@python.org>
> Subject: Re: How to escape strings for re.finditer?
>
>
> I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
> print(match.start(), match.end())
>
> 4 18
> 26 40
>
> I don't insist on terseness for its own sake, but it's cleaner this way.
>
> Jen
>
>
> Feb 27, 2023, 16:55 by cs@cskk.id.au:
>
>> On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com> wrote:
>>
>>> I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).
>>>
>>
>> Sure, but writing a `finditer` for plain `str` is pretty easy (untested):
>>
>> pos = 0
>> while True:
>>     found = s.find(substring, pos)
>>     if found < 0:
>>         break
>>     start = found
>>     end = found + len(substring)
>>     ... do whatever with start and end ...
>>     pos = end
>>
>> Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic error prone overkill. Just something to keep in mind.
>>
>> Cheers,
>> Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
Using str.startswith is a cool idea in this case.  But is it better than regex for performance or reliability?  Regex syntax is not a model of simplicity, but in my simple case it's not too difficult. 


Feb 27, 2023, 18:52 by list1@tompassin.net:

> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>
>> And, just for fun, since there is nothing wrong with your code, this minor change is terser:
>>
>> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> >>> for match in re.finditer(re.escape('abc_degree + 1') , example):
>> ...     print(match.start(), match.end())
>> ...
>> ...
>> 4 18
>> 26 40
>>
>
> Just for more fun :) -
>
> Without knowing how general your expressions will be, I think the following version is very readable, certainly more readable than regexes:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> KEY = 'abc_degree + 1'
>
> for i in range(len(example)):
>     if example[i:].startswith(KEY):
>         print(i, i + len(KEY))
> # prints:
> 4 18
> 26 40
>
> If you may have variable numbers of spaces around the symbols, OTOH, the whole situation changes and then regexes would almost certainly be the best approach. But the regular expression strings would become harder to read.

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
I wrote my previous message before reading this. Thank you for the test you ran -- it answers the question of performance. It shows re.finditer running about 30x faster in that test, which certainly recommends it over a simple loop with its Python-level looping overhead.


Feb 28, 2023, 05:44 by list1@tompassin.net:

> On 2/28/2023 4:33 AM, Roel Schroeven wrote:
>
>> Op 28/02/2023 om 3:44 schreef Thomas Passin:
>>
>>> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>>>
>>>> And, just for fun, since there is nothing wrong with your code, this minor change is terser:
>>>>
>>>> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>> >>> for match in re.finditer(re.escape('abc_degree + 1') , example):
>>>> ...     print(match.start(), match.end())
>>>> ...
>>>> ...
>>>> 4 18
>>>> 26 40
>>>>
>>>
>>> Just for more fun :) -
>>>
>>> Without knowing how general your expressions will be, I think the following version is very readable, certainly more readable than regexes:
>>>
>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> KEY = 'abc_degree + 1'
>>>
>>> for i in range(len(example)):
>>>     if example[i:].startswith(KEY):
>>>         print(i, i + len(KEY))
>>> # prints:
>>> 4 18
>>> 26 40
>>>
>> I think it's often a good idea to use a standard library function instead of rolling your own. The issue becomes less clear-cut when the standard library doesn't do exactly what you need (as here, where re.finditer() uses regular expressions while the use case only uses simple search strings). Ideally there would be a str.finditer() method we could use, but in the absence of that I think we still need to consider using the almost-but-not-quite fitting re.finditer().
>>
>> Two reasons:
>>
>> (1) I think it's clearer: the name tells us what it does (though of course we could solve this in a hand-written version by wrapping it in a suitably named function).
>>
>> (2) Searching for a string in another string, in a performant way, is not as simple as it first appears. Your version works correctly, but slowly. In some situations it doesn't matter, but in other cases it will. For better performance, string searching algorithms jump ahead either when they found a match or when they know for sure there isn't a match for some time (see e.g. the Boyer–Moore string-search algorithm). You could write such a more efficient algorithm, but then it becomes more complex and more error-prone. Using a well-tested existing function becomes quite attractive.
>>
>
> Sure, it all depends on what the real task will be. That's why I wrote "Without knowing how general your expressions will be". For the example string, it's unlikely that speed will be a factor, but who knows what target strings and keys will turn up in the future?
>
>> To illustrate the difference performance, I did a simple test (using the paragraph above is test text):
>>
>>     import re
>>     import timeit
>>
>>     def using_re_finditer(key, text):
>>         matches = []
>>         for match in re.finditer(re.escape(key), text):
>>             matches.append((match.start(), match.end()))
>>         return matches
>>
>>
>>     def using_simple_loop(key, text):
>>         matches = []
>>         for i in range(len(text)):
>>             if text[i:].startswith(key):
>>                 matches.append((i, i + len(key)))
>>         return matches
>>
>>
>>     CORPUS = """Searching for a string in another string, in a performant way, is
>>     not as simple as it first appears. Your version works correctly, but slowly.
>>     In some situations it doesn't matter, but in other cases it will. For better
>>     performance, string searching algorithms jump ahead either when they found a
>>     match or when they know for sure there isn't a match for some time (see e.g.
>>     the Boyer–Moore string-search algorithm). You could write such a more
>>     efficient algorithm, but then it becomes more complex and more error-prone.
>>     Using a well-tested existing function becomes quite attractive."""
>>     KEY = 'in'
>>     print('using_simple_loop:', timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(), number=1000))
>>     print('using_re_finditer:', timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(), number=1000))
>>
>> This does 5 runs of 1000 repetitions each, and reports the time in seconds for each of those runs.
>> Result on my machine:
>>
>>     using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
>>     using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]
>>
>> We find that in this test re.finditer() is more than 30 times faster (despite the overhead of regular expressions.
>>
>> While speed isn't everything in programming, with such a large difference in performance and (to me) no real disadvantages of using re.finditer(), I would prefer re.finditer() over writing my own.
>>
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 12:57 PM, Jen Kris via Python-list wrote:
> The code I sent is correct, and it runs here.  Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>  find_string = re.escape('abc_degree + 1')
>  for match in re.finditer(find_string, example):
>      print(match.start(), match.end())
>
> One question:  several people have made suggestions other than regex (not your terser example with regex you shown below).  Is there a reason why regex is not preferred to, for example, a list comp?  Performance?  Reliability?

"Some people, when confronted with a problem, think 'I know, I'll use
regular expressions.' Now they have two problems."

-
https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/

Of course, if you actually read the blog post in the link, there's more
to it than that...

>
> Feb 27, 2023, 18:16 by avi.e.gross@gmail.com:
>
>> Jen,
>>
>> Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.
>>
>> What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.
>>
>> This is what you sent:
>>
>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
>> print(match.start(), match.end())
>>
>> This is code indented properly:
>>
>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> find_string = re.escape('abc_degree + 1')
>> for match in re.finditer(find_string, example):
>>     print(match.start(), match.end())
>>
>> Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....
>>
>> And, just for fun, since there is nothing wrong with your code, this minor change is terser:
>>
>> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> >>> for match in re.finditer(re.escape('abc_degree + 1') , example):
>> ...     print(match.start(), match.end())
>> ...
>> ...
>> 4 18
>> 26 40
>>
>> But note once you use regular expressions, and not in your case, you might match multiple things that are far from the same such as matching two repeated words of any kind in any case including "and and" and "so so" or finding words that have multiple doubled letter as in the stereotypical bookkeeper. In those cases, you may want even more than offsets but also show the exact text that matched or even show some characters before and/or after for context.
>>
>>
>> -----Original Message-----
>> From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Jen Kris via Python-list
>> Sent: Monday, February 27, 2023 8:36 PM
>> To: Cameron Simpson <cs@cskk.id.au>
>> Cc: Python List <python-list@python.org>
>> Subject: Re: How to escape strings for re.finditer?
>>
>>
>> I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:
>>
>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
>> print(match.start(), match.end())
>>
>> 4 18
>> 26 40
>>
>> I don't insist on terseness for its own sake, but it's cleaner this way.
>>
>> Jen
>>
>>
>> Feb 27, 2023, 16:55 by cs@cskk.id.au:
>>
>>> On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com> wrote:
>>>
>>>> I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).
>>>>
>>>
>>> Sure, but writing a `finditer` for plain `str` is pretty easy (untested):
>>>
>>> pos = 0
>>> while True:
>>>     found = s.find(substring, pos)
>>>     if found < 0:
>>>         break
>>>     start = found
>>>     end = found + len(substring)
>>>     ... do whatever with start and end ...
>>>     pos = end
>>>
>>> Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic error prone overkill. Just something to keep in mind.
>>>
>>> Cheers,
>>> Cameron Simpson <cs@cskk.id.au>
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 1:07 PM, Jen Kris wrote:
>
> Using str.startswith is a cool idea in this case.  But is it better than
> regex for performance or reliability?  Regex syntax is not a model of
> simplicity, but in my simple case it's not too difficult.

The trouble is that we don't know what your case really is. If you are
talking about a short pattern like your example and a small text to
search, and you don't need to do it too often, then my little code
example is probably ideal. Reliability wouldn't be an issue, and
performance would not be relevant. If your case is going to be much
larger, called many times in a loop, or be much more complicated in some
other way, then a regex or some other approach is likely to be much faster.
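
For the "called many times in a loop" case, one possible sketch is to escape and compile the key once and reuse the compiled pattern. The names make_finder and find_key are made up for this illustration. Note that the re module also keeps its own cache of recently used patterns, so the gain from compiling by hand may be modest.

```python
import re


def make_finder(key):
    """Return a function that lists (start, end) for each literal match of key.

    Hypothetical helper: the key is escaped and compiled only once.
    """
    pattern = re.compile(re.escape(key))

    def find_all(text):
        return [(m.start(), m.end()) for m in pattern.finditer(text)]

    return find_all


find_key = make_finder('abc_degree + 1')
lines = [
    'X - abc_degree + 1 + qq + abc_degree + 1',
    'nothing to see here',
]
for line in lines:
    print(find_key(line))
```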


> Feb 27, 2023, 18:52 by list1@tompassin.net:
>
> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>
> And, just for fun, since there is nothing wrong with your code,
> this minor change is terser:
>
> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> >>> for match in re.finditer(re.escape('abc_degree + 1') , example):
> ...     print(match.start(), match.end())
> ...
> ...
> 4 18
> 26 40
>
>
> Just for more fun :) -
>
> Without knowing how general your expressions will be, I think the
> following version is very readable, certainly more readable than
> regexes:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> KEY = 'abc_degree + 1'
>
> for i in range(len(example)):
>     if example[i:].startswith(KEY):
>         print(i, i + len(KEY))
> # prints:
> 4 18
> 26 40
>
> If you may have variable numbers of spaces around the symbols, OTOH,
> the whole situation changes and then regexes would almost certainly
> be the best approach. But the regular expression strings would
> become harder to read.

--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape strings for re.finditer? [ In reply to ]
Roel,

You make some good points. One to consider is that when you ask a regular expression matcher to search using something that uses NO regular expression features, much of the complexity disappears and what it creates is probably similar enough to what you get with a string search except that loops and all are written as something using fast functions probably written in C.

That is one reason the roll-your-own versions are at a disadvantage, unless you roll your own in a similar way by writing a comparable C function.

Nobody has yet shown us the simple but fast plain-text search function that really should exist to do a similar job. It may still be out there, but as you point out, perhaps it is not needed as long as people just use the re version.

Avi

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Roel Schroeven
Sent: Tuesday, February 28, 2023 4:33 AM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

Op 28/02/2023 om 3:44 schreef Thomas Passin:
> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>> And, just for fun, since there is nothing wrong with your code, this
>> minor change is terser:
>>
>> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> >>> for match in re.finditer(re.escape('abc_degree + 1'), example):
>> ...     print(match.start(), match.end())
>> ...
>> 4 18
>> 26 40
>
> Just for more fun :) -
>
> Without knowing how general your expressions will be, I think the
> following version is very readable, certainly more readable than regexes:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> KEY = 'abc_degree + 1'
>
> for i in range(len(example)):
>     if example[i:].startswith(KEY):
>         print(i, i + len(KEY))
> # prints:
> 4 18
> 26 40

I think it's often a good idea to use a standard library function instead of rolling your own. The issue becomes less clear-cut when the standard library doesn't do exactly what you need (as here, where re.finditer() uses regular expressions while the use case only uses simple search strings). Ideally there would be a str.finditer() method we could use, but in the absence of that I think we still need to consider using the almost-but-not-quite fitting re.finditer().

Two reasons:

(1) I think it's clearer: the name tells us what it does (though of course we could solve this in a hand-written version by wrapping it in a suitably named function).

(2) Searching for a string in another string, in a performant way, is not as simple as it first appears. Your version works correctly, but slowly. In some situations it doesn't matter, but in other cases it will. For better performance, string searching algorithms jump ahead either when they found a match or when they know for sure there isn't a match for some time (see e.g. the Boyer–Moore string-search algorithm).
You could write such a more efficient algorithm, but then it becomes more complex and more error-prone. Using a well-tested existing function becomes quite attractive.
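
As a rough sketch of the "jump ahead" idea (this is not code from the thread, and not a description of what CPython's str.find or re do internally), here is the Horspool simplification of Boyer-Moore. The name horspool_find_all is made up for illustration, and it reports non-overlapping matches to mirror re.finditer():

```python
def horspool_find_all(key, text):
    """Return (start, end) for each non-overlapping occurrence of key in text."""
    m, n = len(key), len(text)
    if m == 0 or m > n:
        return []
    # For each character of key except the last: how far the search window
    # may slide when that character sits under the window's last position.
    # Later (rightmost) occurrences overwrite earlier ones, giving the
    # smallest, and therefore safe, shift.
    shift = {ch: m - 1 - i for i, ch in enumerate(key[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == key:
            matches.append((pos, pos + m))
            pos += m  # continue after the match, as re.finditer does
        else:
            # Characters not in the table allow a jump of the full key length.
            pos += shift.get(text[pos + m - 1], m)
    return matches


print(horspool_find_all('abc_degree + 1',
                        'X - abc_degree + 1 + qq + abc_degree + 1'))
```

Even this short version needs a shift table and careful index arithmetic, which is the kind of added complexity and room for error meant above.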

To illustrate the difference in performance, I did a simple test (using the paragraph above as test text):

import re
import timeit

def using_re_finditer(key, text):
    matches = []
    for match in re.finditer(re.escape(key), text):
        matches.append((match.start(), match.end()))
    return matches


def using_simple_loop(key, text):
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches


CORPUS = """Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but slowly.
In some situations it doesn't matter, but in other cases it will.
For better
performance, string searching algorithms jump ahead either when they found a
match or when they know for sure there isn't a match for some time (see e.g.
the Boyer–Moore string-search algorithm). You could write such a more
efficient algorithm, but then it becomes more complex and more error-prone.
Using a well-tested existing function becomes quite attractive."""
KEY = 'in'
print('using_simple_loop:',
      timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(),
                    number=1000))
print('using_re_finditer:',
      timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(),
                    number=1000))

This does 5 runs of 1000 repetitions each, and reports the time in seconds for each of those runs.
Result on my machine:

using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]

We find that in this test re.finditer() is more than 30 times faster (despite the overhead of regular expressions).

While speed isn't everything in programming, with such a large difference in performance and (to me) no real disadvantages of using re.finditer(), I would prefer re.finditer() over writing my own.

--
"The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom."
-- Isaac Asimov

--
https://mail.python.org/mailman/listinfo/python-list

Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 11:48 AM, Jon Ribbens via Python-list wrote:
> On 2023-02-28, Thomas Passin <list1@tompassin.net> wrote:
...
>>
>> It is interesting, though, how pre-processing the search pattern can
>> improve search times if you can afford the pre-processing. Here's a
>> paper on rapidly finding matches when there may be up to one misspelled
>> character. It's easy enough to implement, though in Python you can't
>> take the additional step of tuning it to stay in cache.
>>
>> https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf
>
> You've somehow title-cased that URL. The correct URL is:
>
> https://robert.muth.org/Papers/1996-approx-multi.pdf

Thanks, not sure how that happened ...

--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape strings for re.finditer? [ In reply to ]
> I wrote my previous message before reading this. Thank you for the test you ran -- it answers the question of performance. You show that re.finditer is 30x faster, so that certainly recommends that over a simple loop, which introduces looping overhead.

>>     def using_simple_loop(key, text):
>>         matches = []
>>         for i in range(len(text)):
>>             if text[i:].startswith(key):
>>                 matches.append((i, i + len(key)))
>>         return matches
>>
>>     using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
>>     using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]


With a slight tweak to the simple loop code using .find() it becomes a third faster than the RE version though.


def using_simple_loop2(key, text):
    matches = []
    keyLen = len(key)
    start = 0
    while (foundSpot := text.find(key, start)) > -1:
        start = foundSpot + keyLen
        matches.append((foundSpot, start))
    return matches


using_simple_loop: [0.1732664997689426, 0.1601669997908175, 0.15792609984055161, 0.1573973000049591, 0.15759290009737015]
using_re_finditer: [0.003412699792534113, 0.0032823001965880394, 0.0033694999292492867, 0.003354900050908327, 0.0033336998894810677]
using_simple_loop2: [0.00256159994751215, 0.0025471001863479614, 0.0025424999184906483, 0.0025831996463239193, 0.0025555999018251896]
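
One caveat when comparing these timings: the three functions do not always return the same thing. A small check, assuming the function definitions quoted above (reproduced here, slightly condensed, so the snippet stands alone), suggests the startswith loop reports overlapping matches, while re.finditer() and the .find() loop skip past each match:

```python
import re


def using_re_finditer(key, text):
    return [(m.start(), m.end()) for m in re.finditer(re.escape(key), text)]


def using_simple_loop(key, text):
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches


def using_simple_loop2(key, text):
    matches = []
    keyLen = len(key)
    start = 0
    while (foundSpot := text.find(key, start)) > -1:
        start = foundSpot + keyLen
        matches.append((foundSpot, start))
    return matches


# A key that can overlap with itself exposes the difference.
print(using_simple_loop('aa', 'aaaa'))   # overlapping matches included
print(using_re_finditer('aa', 'aaaa'))   # non-overlapping only
print(using_simple_loop2('aa', 'aaaa'))  # non-overlapping only
```

For a key such as 'in' or 'abc_degree + 1', which cannot overlap itself, all three agree, so the benchmark comparison still holds.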
--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape strings for re.finditer? [ In reply to ]
Jen,



I had no doubt the code you ran was indented properly or it would not work.



I am merely letting you know that somewhere in the process of copying the code or the transition between mailers, my version is messed up. It happens to be easy for me to fix but I sometimes see garbled code I then simply ignore.



At times what may help is to leave blank lines that python ignores but also keeps the line rearrangements minimal.



On to your real question.



In my OPINION, there are many interesting questions that can get in the way of just getting a working solution. Some may be better in some abstract way but except for big projects it often hardly matters.



So regex is one thing or more a cluster of things and a list comp is something completely different. They are both tools you can use and abuse or lose.



The distinction I believe we started with was how to find a fixed string inside another fixed string in as many places as needed and perhaps return offset info. So this can be solved in too many ways using a side of python focused on pure text. As discussed, solutions can include explicit loops such as “for” and “while” and their syntactic sugar cousin of a list comp. Not mentioned yet are other techniques like a recursive function that finds the first and passes on the rest of the string to itself to find the rest, or various functional programming techniques that may do sort of hidden loops. YOU DO NOT NEED ALL OF THEM but it can be interesting to learn.



Regex is a completely different universe that is a bit more of MORE. If I ask you for a ride to the grocery store, I might expect you to show up with a car and not a James Bond vehicle that also is a boat, submarine, airplane, and maybe spaceship. Well, Regex is the latter. And in your case, it is this complexity that meant you had to convert your text so it will not see what it considers commands or hints.
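
To make the "convert your text" point concrete, here is a small sketch using the strings from this thread. Unescaped, the "+" is read as a quantifier ("one or more of the preceding space"), so the literal text is not found; re.escape() turns the key back into plain text:

```python
import re

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
key = 'abc_degree + 1'

# Unescaped: " +" means "one or more spaces", so the literal "+" never matches.
print(re.search(key, example))  # None

# Escaped: every character of the key is matched literally.
spans = [m.span() for m in re.finditer(re.escape(key), example)]
print(spans)
```

Only the pattern is escaped; the text being searched is left alone.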



In normal use, put a bit too simply, it wants a carefully crafted pattern to be spelled out and it weaves an often complex algorithm it then sort of compiles that represents the understanding of what you asked for. The simplest pattern is to match EXACTLY THIS. That is your case.



A more complex pattern may say to match Boston OR Chicago followed by any amount of whitespace then a number of digits between 3 and 5 and then should not be followed by something specific. Oh, and by the way, save selected parts in parentheses to be accessed as \1 or \2 so I can ask you to do things like match a word followed by itself. It goes on and on.
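
A sketch of what such patterns could look like in Python's re syntax. The sample sentences are invented, and "should not be followed by something specific" is assumed here to mean "not followed by another digit", purely for illustration:

```python
import re

# "Boston OR Chicago", any amount of whitespace, then 3 to 5 digits,
# not followed by a further digit (the assumed "something specific").
city_code = re.compile(r'(Boston|Chicago)\s*(\d{3,5})(?!\d)')

m = city_code.search('Ship it to Chicago   60601 today')
print(m.group(1), m.group(2))

# Too many digits: the negative lookahead rejects every candidate length.
print(city_code.search('Boston 1234567'))

# A parenthesised group referred to later as \1: a word followed by itself.
repeated_word = re.compile(r'\b(\w+)\s+\1\b')
print(repeated_word.findall('this is is a test and and so on'))
```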



Be warned RE is implemented now all over the place including outside the usual UNIX roots and there are somewhat different versions. For your need, it does not matter.



The compiled monstrosity though can be fairly fast and might be a tad hard for you to write by yourself as a bunch of if statements nested that are weirdly matching various patterns with some look ahead or look behind.



What you are being told is that despite this being way more than you asked for, it not only works but is fairly fast when doing the simple thing you asked for. That may be why a text version you are looking for is hard to find.



I am not clear what exactly the rest of your project is about but my guess is your first priority is completing it decently and not to try umpteen methods and compare them. Not today. Of course if the working version is slow and you profile it and find this part seems to be holding it back, it may be worth examining.





From: Jen Kris <jenkris@tutanota.com>
Sent: Tuesday, February 28, 2023 12:58 PM
To: avi.e.gross@gmail.com
Cc: 'Python List' <python-list@python.org>
Subject: RE: How to escape strings for re.finditer?



The code I sent is correct, and it runs here. Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:



example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())



One question: several people have made suggestions other than regex (not your terser example with regex you showed below). Is there a reason why regex is not preferred to, for example, a list comp? Performance? Reliability?













Feb 27, 2023, 18:16 by avi.e.gross@gmail.com <mailto:avi.e.gross@gmail.com> :

Jen,



Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.



What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.



This is what you sent:



example = 'X - abc_degree + 1 + qq + abc_degree + 1'

find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):

print(match.start(), match.end())



This is code indented properly:



example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())



Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....



And, just for fun, since there is nothing wrong with your code, this minor change is terser:

>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> for match in re.finditer(re.escape('abc_degree + 1'), example):
...     print(match.start(), match.end())
...
4 18
26 40



But note once you use regular expressions, and not in your case, you might match multiple things that are far from the same such as matching two repeated words of any kind in any case including "and and" and "so so" or finding words that have multiple doubled letter as in the stereotypical bookkeeper. In those cases, you may want even more than offsets but also show the exact text that matched or even show some characters before and/or after for context.
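
A minimal sketch of reporting the matched text plus a little context, not just the offsets. The pattern for "words with more than one doubled letter", the helper name matches_with_context and the width parameter are all made up for illustration:

```python
import re

# Hypothetical pattern: a word containing at least two doubled letters.
double_double = re.compile(r'\b\w*(\w)\1\w*(\w)\2\w*\b')


def matches_with_context(pattern, text, width=4):
    """Return (start, end, matched_text, before, after) for each match."""
    results = []
    for m in pattern.finditer(text):
        before = text[max(0, m.start() - width):m.start()]
        after = text[m.end():m.end() + width]
        results.append((m.start(), m.end(), m.group(), before, after))
    return results


for hit in matches_with_context(double_double, 'The bookkeeper fell ill.'):
    print(hit)
```

Here m.group() gives the exact text that matched, which matters once the pattern can match many different strings.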





-----Original Message-----

From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org <mailto:python-list-bounces+avi.e.gross=gmail.com@python.org> > On Behalf Of Jen Kris via Python-list

Sent: Monday, February 27, 2023 8:36 PM

To: Cameron Simpson <cs@cskk.id.au <mailto:cs@cskk.id.au> >

Cc: Python List <python-list@python.org <mailto:python-list@python.org> >

Subject: Re: How to escape strings for re.finditer?





I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:



example = 'X - abc_degree + 1 + qq + abc_degree + 1'

find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):

print(match.start(), match.end())



4 18

26 40



I don't insist on terseness for its own sake, but it's cleaner this way.



Jen





Feb 27, 2023, 16:55 by cs@cskk.id.au <mailto:cs@cskk.id.au> :

On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com <mailto:jenkris@tutanota.com> > wrote:

I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).



Sure, but writing a `finditer` for plain `str` is pretty easy (untested):



pos = 0
while True:
    found = s.find(substring, pos)
    if found < 0:
        break
    start = found
    end = found + len(substring)
    ... do whatever with start and end ...
    pos = end



Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic, error-prone overkill. Just something to keep in mind.



Cheers,

Cameron Simpson <cs@cskk.id.au <mailto:cs@cskk.id.au> >




--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape strings for re.finditer? [ In reply to ]
This message is more for Thomas than Jen,

You made me think of what happens in fairly large cases. What happens if I ask you to search a thousand pages looking for your name?

One solution might be to break the problem into parts that can be run in independent threads or processes and perhaps across different CPU's or on many machines at once. Think of it as a variant on a merge sort where each chunk returns where it found one or more items and then those are gathered together and merged upstream.

The problem is you cannot just randomly divide the text. Any matches across a divide are lost. So if you know you are searching for "Thomas Passin" you need an overlap big enough to hold enough of that size. It would not be made as something like a pure binary tree and if the choices made included variant sizes in what might match, you would get duplicates. So the merging part would obviously have to eventually remove those.
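
A rough sketch of the overlap idea for a fixed search string. It runs sequentially; each chunk could in principle be handed to a separate thread, process or machine, which is not shown. The names find_in_chunks and chunk_size are invented for illustration:

```python
def find_in_chunks(key, text, chunk_size):
    """Return the sorted start offsets of key in text, searching chunk by chunk."""
    if not key or chunk_size < 1:
        raise ValueError('need a non-empty key and a positive chunk_size')
    # Each chunk is extended by len(key) - 1 characters so that a match
    # starting near the end of a chunk is still seen in full.
    overlap = len(key) - 1
    found = set()  # a set, so any duplicate from an overlap is merged away
    for chunk_start in range(0, len(text), chunk_size):
        chunk = text[chunk_start:chunk_start + chunk_size + overlap]
        pos = chunk.find(key)
        while pos != -1:
            found.add(chunk_start + pos)  # convert back to an offset in text
            pos = chunk.find(key, pos + 1)
    return sorted(found)


text = 'xxThomas Passinxx Thomas Passin'
print(find_in_chunks('Thomas Passin', text, chunk_size=5))
```

With a fixed key and an overlap of exactly len(key) - 1, no match fits entirely inside an overlap region, so duplicates should not arise; the set is there for the variable-length case mentioned above.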

I have often wondered how Google and other such services are able to find millions of things in hardly any time and arguably never show most of them as who looks past a few pages/screens?

I think much of that may involve other techniques including quite a bit of pre-indexing. But they also seem to enlist lots of processors that each do the search on a subset of the problem space and combine and prioritize.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Thomas Passin
Sent: Tuesday, February 28, 2023 1:31 PM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

On 2/28/2023 1:07 PM, Jen Kris wrote:
>
> Using str.startswith is a cool idea in this case. But is it better
> than regex for performance or reliability? Regex syntax is not a
> model of simplicity, but in my simple case it's not too difficult.

The trouble is that we don't know what your case really is. If you are talking about a short pattern like your example and a small text to search, and you don't need to do it too often, then my little code example is probably ideal. Reliability wouldn't be an issue, and performance would not be relevant. If your case is going to be much larger, called many times in a loop, or be much more complicated in some other way, then a regex or some other approach is likely to be much faster.


> Feb 27, 2023, 18:52 by list1@tompassin.net:
>
> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>
> And, just for fun, since there is nothing wrong with your code,
> this minor change is terser:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> for match in re.finditer(re.escape('abc_degree + 1'), example):
> ...     print(match.start(), match.end())
> ...
> ...
> 4 18
> 26 40
>
>
> Just for more fun :) -
>
> Without knowing how general your expressions will be, I think the
> following version is very readable, certainly more readable than
> regexes:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> KEY = 'abc_degree + 1'
>
> for i in range(len(example)):
>     if example[i:].startswith(KEY):
>         print(i, i + len(KEY))
> # prints:
> 4 18
> 26 40
>
> If you may have variable numbers of spaces around the symbols, OTOH,
> the whole situation changes and then regexes would almost certainly
> be the best approach. But the regular expression strings would
> become harder to read.
> --
> https://mail.python.org/mailman/listinfo/python-list
>
>

--
https://mail.python.org/mailman/listinfo/python-list

RE: How to escape strings for re.finditer? [ In reply to ]
David,

Your results suggest we need to be reminded that lots depends on other
factors. There are multiple versions/implementations of python out there
including some written in C but also other underpinnings. Each can often
have sections of pure python code replaced carefully with libraries of
compiled code, or not. So your results will vary.

Just as an example, assume you derive a type of your own as a subclass of
str and you over-ride the find method by writing it in pure python using
loops and maybe add a few bells and whistles. If you used your improved
algorithm using this variant of str, might it not be quite a bit slower?
Imagine how much slower if your improvement also implemented caching and
logging and the option of ignoring case which are not really needed here.
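
To picture that thought experiment, here is a deliberately naive sketch. SlowStr is a made-up class, its find() is simplified (no negative indices), and nothing here has been timed; the expectation that it runs slower than the built-in is just that, an expectation:

```python
class SlowStr(str):
    """A str subclass whose find() is re-implemented in pure Python."""

    def find(self, sub, start=0, end=None):
        if end is None:
            end = len(self)
        # Every comparison below runs in the interpreter rather than in
        # the compiled search routine behind the built-in str.find.
        for i in range(start, end - len(sub) + 1):
            if self[i:i + len(sub)] == sub:
                return i
        return -1


text = SlowStr('X - abc_degree + 1 + qq + abc_degree + 1')
print(text.find('abc_degree + 1'), text.find('abc_degree + 1', 18))
```

Code that calls text.find(key, start) works unchanged with such an object, so the same loop could end up on a much slower path without any visible change at the call site.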

This type of thing can happen in many other scenarios and some module may be
shared that is slow and a while later is updated but not everyone installs
the update so performance stats can vary wildly.

Some people advocate using some functional programming tactics, in various
languages, partially because the more general loops are SLOW. But that is
largely because some of the functional stuff is a compiled function that
hides the loops inside a faster environment than the interpreter.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of David Raymond
Sent: Tuesday, February 28, 2023 2:40 PM
To: python-list@python.org
Subject: RE: How to escape strings for re.finditer?

> I wrote my previous message before reading this. Thank you for the test
> you ran -- it answers the question of performance. You show that
> re.finditer is 30x faster, so that certainly recommends that over a simple
> loop, which introduces looping overhead.

>>     def using_simple_loop(key, text):
>>         matches = []
>>         for i in range(len(text)):
>>             if text[i:].startswith(key):
>>                 matches.append((i, i + len(key)))
>>         return matches
>>
>>     using_simple_loop: [0.13952950000020792, 0.13063130000000456,
>>     0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
>>     using_re_finditer: [0.003861400000005233, 0.004061900000124297,
>>     0.003478999999970256, 0.003413100000216218, 0.0037320000001273]


With a slight tweak to the simple loop code using .find() it becomes a third
faster than the RE version though.


def using_simple_loop2(key, text):
    matches = []
    keyLen = len(key)
    start = 0
    while (foundSpot := text.find(key, start)) > -1:
        start = foundSpot + keyLen
        matches.append((foundSpot, start))
    return matches


using_simple_loop: [0.1732664997689426, 0.1601669997908175,
0.15792609984055161, 0.1573973000049591, 0.15759290009737015]
using_re_finditer: [0.003412699792534113, 0.0032823001965880394,
0.0033694999292492867, 0.003354900050908327, 0.0033336998894810677]
using_simple_loop2: [0.00256159994751215, 0.0025471001863479614,
0.0025424999184906483, 0.0025831996463239193, 0.0025555999018251896]
--
https://mail.python.org/mailman/listinfo/python-list

Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 2:40 PM, David Raymond wrote:
> With a slight tweak to the simple loop code using .find() it becomes a third faster than the RE version though.
>
>
> def using_simple_loop2(key, text):
>     matches = []
>     keyLen = len(key)
>     start = 0
>     while (foundSpot := text.find(key, start)) > -1:
>         start = foundSpot + keyLen
>         matches.append((foundSpot, start))
>     return matches
>
>
> using_simple_loop: [0.1732664997689426, 0.1601669997908175, 0.15792609984055161, 0.1573973000049591, 0.15759290009737015]
> using_re_finditer: [0.003412699792534113, 0.0032823001965880394, 0.0033694999292492867, 0.003354900050908327, 0.0033336998894810677]
> using_simple_loop2: [0.00256159994751215, 0.0025471001863479614, 0.0025424999184906483, 0.0025831996463239193, 0.0025555999018251896]

On my system the difference is way bigger than that:

KEY = '''it doesn't matter, but in other cases it will.'''

using_simple_loop2: [0.0004955999902449548, 0.0004844000213779509,
0.0004862999776378274, 0.0004800999886356294, 0.0004792999825440347]

using_re_finditer: [0.002840900036972016, 0.0028330000350251794,
0.002701299963518977, 0.0028105000383220613, 0.0029977999511174858]

Shorter keys show the least differential:

KEY = 'in'

using_simple_loop2: [0.001983499969355762, 0.0019614999764598906,
0.0019617999787442386, 0.002027600014116615, 0.0020669000223279]

using_re_finditer: [0.002787900040857494, 0.0027620999608188868,
0.0027723999810405076, 0.002776700013782829, 0.002946800028439611]

Brilliant!

Python 3.10.9
Windows 10 AMD64 (build 10.0.19044) SP0

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 28Feb2023 18:57, Jen Kris <jenkris@tutanota.com> wrote:
>One question:  several people have made suggestions other than regex
>(not your terser example with regex you shown below).  Is there a
>reason why regex is not preferred to, for example, a list comp? 

These are different things; I'm not sure a comparison is meaningful.

>Performance?  Reliability? 

Regexps are:
- cryptic and error prone (you can make them more readable, but the
notation is deliberately both terse and powerful, which means that
small changes can have large effects in behaviour); the "error prone"
part does not mean that a regexp is unreliable, but that writing one
which is _correct_ for your task can be difficult, and also difficult
to debug
- have a compile step, which slows things down
- can be slower to execute as well, as a regexp does a bunch of
housekeeping for you
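
On the compile step: a sketch, assuming the same fixed key as earlier in the thread, of compiling the pattern once and reusing it. (The re module also keeps an internal cache of recently compiled patterns, so the module-level functions do not recompile on every call.) The names KEY_RE and find_spans are made up for illustration:

```python
import re

# Compile once, at import time; reuse for every search.
KEY_RE = re.compile(re.escape('abc_degree + 1'))


def find_spans(text):
    """Return (start, end) for each occurrence of the fixed key in text."""
    return [m.span() for m in KEY_RE.finditer(text)]


print(find_spans('X - abc_degree + 1 + qq + abc_degree + 1'))
```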

The more complex the tool the more... indirection between your solution
using that tool and the smallest thing which needs to be done, and often
the slower the solution. This isn't absolute; there are times for the
complex tool.

Common opinion here is often that if you're doing simple fixed-string
things such as your task, which was finding instances of a fixed string,
just use the existing str methods. You'll end up writing what you need
directly and overtly.

I've a personal maxim that one should use the "smallest" tool which
succinctly solves the problem. I usually use it to choose a programming
language (eg sed vs awk vs shell vs python in loose order of problem
difficulty), but it applies also to choosing tools within a language.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:
> Jen,
>
>
>
> I had no doubt the code you ran was indented properly or it would not work.
>
>
>
> I am merely letting you know that somewhere in the process of copying
> the code or the transition between mailers, my version is messed up.

The problem seems to be at your end. Jen's code looks ok here.

The content type is text/plain, no format=flowed or anything which would
affect the interpretation of line endings. However, after
base64-decoding it only contains unix-style LF line endings, not CRLF
line endings. That might throw your mailer off, but I have no idea why
it would join only some lines but not others.

> It happens to be easy for me to fix but I sometimes see garbled code I
> then simply ignore.

Truth to be told, that's one reason why I rarely read your mails to the
end. The long lines and the triple-spaced paragraphs make it just too
uncomfortable.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-03-01 01:01:42 +0100, Peter J. Holzer wrote:
> On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:
> > It happens to be easy for me to fix but I sometimes see garbled code I
> > then simply ignore.
>
> Truth to be told, that's one reason why I rarely read your mails to the
> end. The long lines and the triple-spaced paragraphs make it just too
> uncomfortable.

Hmm, since I was now paying a bit more attention to formatting problems
I saw that only about half of your messages have those long lines
although all seem to be sent with the same mailer. Don't know what's
going on there.

hp


--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Re: How to escape strings for re.finditer? [ In reply to ]
Regex is fine if it works for you. The critiques - "difficult to read" - are subjective. Unless the code is in a section that has been profiled to be a bottleneck, I don't sweat performance at this level.

For me, using code that has already been written and vetted is the preferred approach to writing new code I have to test and maintain. I use an online regex tester, https://pythex.org, to get the syntax right before copying and pasting it into my code.

From: Python-list <python-list-bounces+gweatherby=uchc.edu@python.org> on behalf of Jen Kris via Python-list <python-list@python.org>
Date: Tuesday, February 28, 2023 at 1:11 PM
To: Thomas Passin <list1@tompassin.net>
Cc: python-list@python.org <python-list@python.org>
Subject: Re: How to escape strings for re.finditer?

Using str.startswith is a cool idea in this case. But is it better than regex for performance or reliability? Regex syntax is not a model of simplicity, but in my simple case it's not too difficult.


--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape strings for re.finditer? [ In reply to ]
Peter,

Nobody here would appreciate it if I tested it by sending out multiple
copies of each email to see if the same message wraps differently.

I am using a fairly standard mailer in Outlook that interfaces with gmail
and I could try mailing directly from gmail but apparently there are
systemic problems and I experience other complaints when sending directly
from AOL mail too.

So, if some people don't read me, I can live with that. I mean the right
people, LOL!

Or did I get that wrong?

I do appreciate the feedback. Ironically, when I politely shared how someone
else's email was displaying on my screen, it seems I am equally causing
similar issues for others.

An interesting question is whether any of us reading the archived copies see
different things including with various browsers:

https://mail.python.org/pipermail/python-list/

I am not sure which letters from me had the anomalies you mention but
spot-checking a few of them showed a normal display when I use Chrome.

But none of this is really a python issue except insofar as you never know
what functionality in the network was written in python.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of Peter J. Holzer
Sent: Tuesday, February 28, 2023 7:26 PM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

On 2023-03-01 01:01:42 +0100, Peter J. Holzer wrote:
> On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:
> > It happens to be easy for me to fix but I sometimes see garbled code
> > I then simply ignore.
>
> Truth to be told, that's one reason why I rarely read your mails to
> the end. The long lines and the triple-spaced paragraphs make it just
> too uncomfortable.

Hmm, since I was now paying a bit more attention to formatting problems I
saw that only about half of your messages have those long lines although all
seem to be sent with the same mailer. Don't know what's going on there.

hp


--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-02-28, Cameron Simpson <cs@cskk.id.au> wrote:

> Regexps are:
> - cryptic and error prone (you can make them more readable, but the
> notation is deliberately both terse and powerful, which means that
> small changes can have large effects in behaviour); the "error prone"
> part does not mean that a regexp is unreliable, but that writing one
> which is _correct_ for your task can be difficult,

The nasty thing is that writing one that _appears_ to be correct for
your task is often fairly easy. It will work as you expect for the
test cases you throw at it, but then fail in confusing ways when
released into the "real world". If you're lucky, it fails frequently
and obviously enough that you notice it right away. If you're not
lucky, it will fail infrequently and subtly for many years to come.

My rule: never use an RE if you can use the normal string methods
(even if it takes a few lines of code using them to replace a single
RE).

--
Grant
--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 3/1/2023 12:04 PM, Grant Edwards wrote:
> On 2023-02-28, Cameron Simpson <cs@cskk.id.au> wrote:
>
>> Regexps are:
>> - cryptic and error prone (you can make them more readable, but the
>> notation is deliberately both terse and powerful, which means that
>> small changes can have large effects in behaviour); the "error prone"
>> part does not mean that a regexp is unreliable, but that writing one
>> which is _correct_ for your task can be difficult,
>
> The nasty thing is that writing one that _appears_ to be correct for
> your task is often fairly easy. It will work as you expect for the
> test cases you throw at it, but then fail in confusing ways when
> released into the "real world". If you're lucky, it fails frequently
> and obviously enough that you notice it right away. If you're not
> lucky, it will fail infrequently and subtly for many years to come.
>
> My rule: never use an RE if you can use the normal string methods
> (even if it takes a few lines of code using them to replace a single
> RE).

A corollary is that once you get a working regex, don't mess with it if
you do not absolutely have to.

--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape RE [ In reply to ]
Cameron,

The topic is now Regular Expressions and the sin tax. This is not
exclusively a Python issue as everybody and even their grandmother uses it
in various forms.

I remember early versions of RE were fairly simple and readable. It was a
terse minilanguage that allowed fairly complex things to be done but was
readable.

You now encounter versions that make people struggle, as countless extensions
have been sloppily grafted on. As an example, who ordered the multiple uses
to which "?" is now put? Many places have expanded on the terseness in ways
that make a pattern more legible in some respects and less in others. Unicode
made lots of older RE features less useful, because the definitions of things
like whitespace, a word boundary, or the contents of a word changed so much
that new constructs were added to handle them.

But, if you are operating mainly on ASCII text, the base functionality is
still in there and can be used fairly easily.
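
A small sketch of the difference in Python's re module: for str
patterns \w follows the Unicode definition of a word character by
default, and the re.ASCII flag restores the older ASCII-only meaning.
The sample text is made up.

```python
import re

text = 'na\u00efve caf\u00e9'  # "naive cafe" with the accented letters

# Default for str patterns: \w follows Unicode word characters.
unicode_words = re.findall(r'\w+', text)

# re.ASCII restricts \w (and \b, \d, \s) to their ASCII-only meanings.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

With the flag set, the accented letters no longer count as part of a
word, so the two words get split or truncated.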

Consider it a bit like other mini-languages, such as the format
specifications in the print() variants, which kept adding functionality by
packing lots of info tersely: you specify that you want a floating point
number with so many digits and so on, and by the way, right justified in a
wider field, and if it is negative, do this. Great if you can still remember
how to read it.

I was reading a Python book recently which kept using a suffix of !r and I
finally looked it up. It asks an f-string (or str.format(), rather than
print itself) to use repr() to get the representation of the object. Then
I found out this is not strictly needed in an f-string, as the context allows
you to write something like {repr(val)}, so val!r is not the only, confusing,
way.
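
For what it is worth, a quick sketch of the spellings side by side.
The conversion belongs to f-strings and str.format(), not to print()
itself; the variable name val is just for illustration.

```python
val = 'spam\n'

with_conversion = f'{val!r}'      # !r applies repr() to the value
with_call = f'{repr(val)}'        # explicit call, same result
with_format = '{!r}'.format(val)  # str.format() supports !r as well

print(with_conversion, with_call, with_format)
```

All three produce the same text, with the quotes and the escaped
newline that repr() adds.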

These mini-languages each require you to learn their own rules and quirks
and when you do, they can be powerful and intuitive, at least for the
features you memorized and maybe use regularly.

Now RE knowledge is the same and it ports moderately well between languages
except when it doesn't. As has been noted, the people at Perl relied on it a
lot and kept changing and extending it. Some tools (grep, for one) let you
specify if you want Perl style or other styles.

But hiding your head in the sand is not always going to work for long. No,
you do not need to use RE for simple cases. Mind you, that is when it is
easiest to use it reliably. I read some books related to XML where much of
the work had been done in non-UNIX land years ago and they often had other
ways of doing things in their endless series of methods on validating a
schema or declaring it so data is forced to match the declared objectives
such as what type(s) each item can be or whether some fields must exist
inside others or in a particular order, or say you can have only three of
them and seemingly endless other such things. And then, suddenly, someone has
the idea to introduce the ability for you to specify many things using
regular expressions and the oppressiveness (for me) lifts and many things
can now be done trivially or that were not doable before. I had a similar
experience in my SQL reading where adding the ability to do some pattern
matching using a form of RE made life simpler.

The fact is that the idea of complex pattern matching IS complex and any
tool that lets you express it so fluidly will itself be complex. So, as some
have mentioned, find a resource that helps you build a regular expression
perhaps through menus, or one that verifies if one you created makes any
sense or lets you enter test data and have it show you how it is matching or
what to change to make it match differently. The multi-line version of RE
may also be helpful as well as sometimes breaking up a bigger one into
several smaller ones that your program uses in multiple phases.
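
As a sketch of the multi-line form in Python: the re.VERBOSE flag lets
a pattern be spread over several lines with comments. Whitespace in
the pattern is then ignored, so a literal space has to be escaped or
put in a character class. The date pattern and the group names are
only an illustration.

```python
import re

date_re = re.compile(r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
""", re.VERBOSE)

m = date_re.search('On 2023-02-28, Cameron Simpson wrote')
if m:
    print(m.group('year'), m.group('month'), m.group('day'))
```

The named groups also help with the "several smaller ones" idea,
since each piece can be checked on its own in a later phase.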

Python recently (in version 3.10) added new functionality called Structural
Pattern Matching. You use a match statement with various cases that match
patterns and if matched, execute some action. Here is one tutorial if needed:

https://peps.python.org/pep-0636/

The point is that although not at all the same as a RE, we again have a bit
of a mini-language that can be used fairly concisely to investigate a
problem domain fairly quickly and efficiently and do things. It is an
overlapping but different form of pattern matching. And, in languages that
have long had similar ideas and constructs, people often cut back on using
other constructs like an IF statement, and just used something like this!

And consider this example as being vaguely like a bit of regular expression:

    match command.split():
        case ["go", ("north" | "south" | "east" | "west")]:
            current_room = current_room.neighbor(...)

Like it or not, our future in programming is likely to include more and more
such aids along with headaches.

Avi

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-03-01 01:01:42 +0100, Peter J. Holzer wrote:
> On 2023-02-28 15:25:05 -0500, avi.e.gross@gmail.com wrote:
> > I had no doubt the code you ran was indented properly or it would not work.
> >
> > I am merely letting you know that somewhere in the process of copying
> > the code or the transition between mailers, my version is messed up.
>
> The problem seems to be at your end. Jen's code looks ok here.
[...]
> I have no idea why it would join only some lines but not others.

Actually I do have an idea now, since I noticed something similar at
work today: Outlook has an option "remove additional line breaks from
text-only messages" (translated from German) in the "Email / Message
Format" section. You want to make sure this is off if you are reading
mails where line breaks might be important[1].

hp

[1] Personally I'd say you shouldn't use Outlook if you are reading
mails where line breaks (or other formatting) is important, but ...

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
RE: How to escape strings for re.finditer? [ In reply to ]
Thanks, Peter. Excellent advice, even if only for any of us using Microsoft
Outlook as our mailer. I made the changes and we will see but they should
mainly impact what I see. I did tweak another parameter.

The problem for me was finding where they hid the options menu I needed.
Then, I started translating the menus back into German until I realized I
was being silly! Good practice though. LOL!

The truth is I generally can handle receiving mangled code as most of the
time I can re-edit it into shape, or am just reading it and not
copying/pasting.

What concerns me is being able to send out the pure text content many seem
to need in a way that does not introduce the anomalies people see. Something
like a lowest common denominator.

Or I could switch mailers. But my guess is reading/responding from the
native gmail editor may also need options changes and yet still impact some
readers.

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-03-02, Peter J. Holzer <hjp-python@hjp.at> wrote:


> [1] Personally I'd say you shouldn't use Outlook if you are reading
> mails where line breaks (or other formatting) is important, but ...

I'd shorten that to

"You shouldn't use Outlook if mail is important."

--
https://mail.python.org/mailman/listinfo/python-list