Mailing List Archive

How to escape strings for re.finditer?
When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces. 

This works (no spaces):

import re
example = 'abcdefabcdefabcdefg'
find_string = "abc"
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

That gives me the start and end character positions, which is what I want. 

However, this does not work:

import re
example = re.escape('X - cty_degrees + 1 + qq')
find_string = re.escape('cty_degrees + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

I’ve tried several other attempts based on my research, but still no match.

I don’t have much experience with regex, so I hoped a reg-expert might help. 

Thanks,

Jen

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-02-27 23:11, Jen Kris via Python-list wrote:
> When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces.
>
> This works (no spaces):
>
> import re
> example = 'abcdefabcdefabcdefg'
> find_string = "abc"
> for match in re.finditer(find_string, example):
>     print(match.start(), match.end())
>
> That gives me the start and end character positions, which is what I want.
>
> However, this does not work:
>
> import re
> example = re.escape('X - cty_degrees + 1 + qq')
> find_string = re.escape('cty_degrees + 1')
> for match in re.finditer(find_string, example):
>     print(match.start(), match.end())
>
> I’ve tried several other attempts based on my reseearch, but still no match.
>
> I don’t have much experience with regex, so I hoped a reg-expert might help.
>
You need to escape only the pattern, not the string you're searching.
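
A minimal sketch of that advice, reusing the strings from the original post: `re.escape` is applied to the pattern only, and the text being searched is left untouched. The name `spans` is only for illustration.

```python
import re

# The text being searched is left exactly as it is.
example = 'X - cty_degrees + 1 + qq'
# Only the pattern is escaped, so '+' is matched literally.
find_string = re.escape('cty_degrees + 1')

spans = [(m.start(), m.end()) for m in re.finditer(find_string, example)]
print(spans)  # [(4, 19)]
```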
Re: How to escape strings for re.finditer? [ In reply to ]
On 28Feb2023 00:11, Jen Kris <jenkris@tutanota.com> wrote:
>When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces. 
>
>This works (no spaces):
>
>import re
>example = 'abcdefabcdefabcdefg'
>find_string = "abc"
>for match in re.finditer(find_string, example):
>    print(match.start(), match.end())
>
>That gives me the start and end character positions, which is what I want. 
>
>However, this does not work:
>
>import re
>example = re.escape('X - cty_degrees + 1 + qq')
>find_string = re.escape('cty_degrees + 1')
>for match in re.finditer(find_string, example):
>    print(match.start(), match.end())
>
>I’ve tried several other attempts based on my reseearch, but still no
>match. 

You need to print those strings out. You're escaping the _example_
string, which would make it:

X - cty_degrees \+ 1 \+ qq

because `+` is a special character in regexps and so `re.escape` escapes
it. But you don't want to mangle the string you're searching! After all,
the text above does not contain the string `cty_degrees + 1`.
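
A small sketch of that debugging step, assuming the strings from the original post (the variable names are illustrative). Recent Python versions also put a backslash before the spaces and the hyphen, so the printed text may show more backslashes than the line quoted above, but the conclusion is the same.

```python
import re

text = 'X - cty_degrees + 1 + qq'
needle = 'cty_degrees + 1'

escaped_text = re.escape(text)      # what the original code searched in
escaped_needle = re.escape(needle)  # what it searched for

# Printing shows the backslashes that re.escape inserted.
print(escaped_text)
print(escaped_needle)

# The escaped pattern still means "the literal needle", but the
# escaped text no longer contains that literal substring.
print(needle in text)                           # True
print(needle in escaped_text)                   # False
print(re.search(escaped_needle, escaped_text))  # None
```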

My secondary question is: if you're escaping the thing you're searching
_for_, then you're effectively searching for a _fixed_ string, not a
pattern/regexp. So why on earth are you using regexps to do your
searching?

The `str` type has a `find(substring)` function. Just use that! It'll be
faster and the code simpler!

Cheers,
Cameron Simpson <cs@cskk.id.au>
Re: How to escape strings for re.finditer? [ In reply to ]
Yes, that's it.  I don't know how long it would have taken to find that detail with research through the voluminous re documentation.  Thanks very much. 

RE: How to escape strings for re.finditer? [ In reply to ]
MRAB makes a valid point. Only the pattern you are looking for is compiled as a regular expression, and if it contains anything that might be a command, such as a ^ at the start or [12] in the middle, you want that escaped so NONE OF THAT is treated as one. It will then be compiled to something that looks for a literal ^, including later in the string, and for a real [, then a real 1 and a real 2 and a real ], not for one of the choices of 1 or 2.

Your example was 'cty_degrees + 1' which can have a subtle bug introduced. The special character is "+" which means match greedily as many copies of the previous entity as possible. In this case, the previous entity was a single space. So the regular expression will match 'cty_degrees', then match the single space it sees (because the pattern has a space followed by a plus), and then, looking for another space rather than a plus, hits the plus and fails. If your example is rewritten in whatever way re.escape uses, it might be 'cty_degrees \+ 1' and then it should work fine.

But escaping the text you are searching in just breaks that, as the result will contain '\+', which is viewed as two unrelated characters, and the backslash stops the match from going further.
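
To make the quantifier point concrete, here is a short sketch using the strings from the thread; the variable names are illustrative only.

```python
import re

text = 'X - cty_degrees + 1 + qq'
raw_pattern = 'cty_degrees + 1'   # ' +' here means "one or more spaces"

# Unescaped: after the space(s) the pattern wants another space, then '1',
# so the literal '+' in the text makes the match fail.
unescaped_hit = re.search(raw_pattern, text)
print(unescaped_hit)  # None

# The same unescaped pattern does match text with extra spaces and no plus.
spaces_hit = re.search(raw_pattern, 'cty_degrees   1')
print(spaces_hit is not None)  # True

# Escaped: the '+' is literal and the original text matches.
escaped_hit = re.search(re.escape(raw_pattern), text)
print(escaped_hit.span())  # (4, 19)
```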



Re: How to escape strings for re.finditer? [ In reply to ]
I went to the re module because the specified string may appear more than once in the string (in the code I'm writing). For example:

a = "X - abc_degree + 1 + qq + abc_degree + 1"
b = "abc_degree + 1"
q = a.find(b)

print(q)
4

So it correctly finds the start of the first instance, but not the second one.  The re code finds both instances.  If I knew that the substring occurred only once then the str.find would be best. 

I changed my re code after MRAB's comment, it now works. 

Thanks much. 

Jen


Re: How to escape strings for re.finditer? [ In reply to ]
On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com> wrote:
>I went to the re module because the specified string may appear more
>than once in the string (in the code I'm writing).

Sure, but writing a `finditer` for plain `str` is pretty easy
(untested):

pos = 0
while True:
    found = s.find(substring, pos)
    if found < 0:
        break
    start = found
    end = found + len(substring)
    # ... do whatever with start and end ...
    pos = end

Many people go straight to the `re` module whenever they're looking for
strings. It is often cryptic, error-prone overkill. Just something to
keep in mind.
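
As a sketch of how that loop could be packaged, here it is as a generator. The name `find_spans` is hypothetical, and the guard for an empty substring is an added assumption, since the plain loop would never advance in that case.

```python
def find_spans(text, substring):
    """Yield (start, end) for each non-overlapping occurrence of substring."""
    if not substring:
        return  # an empty substring would otherwise never advance
    pos = 0
    while True:
        found = text.find(substring, pos)
        if found < 0:
            break
        end = found + len(substring)
        yield found, end
        pos = end  # continue after this match, so matches do not overlap


a = "X - abc_degree + 1 + qq + abc_degree + 1"
print(list(find_spans(a, "abc_degree + 1")))  # [(4, 18), (26, 40)]
```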

Cheers,
Cameron Simpson <cs@cskk.id.au>
RE: How to escape strings for re.finditer? [ In reply to ]
Just FYI, Jen, there are times a sledgehammer works but perhaps is not the only way. These days people worry less about efficiency and more about programmer time and education and that can be fine.

But if you look at the methods available on strings or in some other modules, you will see that your situation is quite common. Some may use another RE front end called finditer().

I am NOT suggesting you do what I say next, but imagine writing a loop that takes a substring of what you are searching for of the same length as your search string. Near the end, it stops as there is too little left.

You can now simply test your searched for string against that substring for equality and it tends to return rapidly when they are not equal early on.

Your loop would return whatever data structure or results you want such as that it matched it three times at offsets a, b and c.

But do you allow overlaps? If not, your loop needs to skip len(search_str) after a match.
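
The loop described above might look roughly like this. It is only a sketch: the function name `scan` and the `allow_overlap` flag are made up for illustration.

```python
def scan(text, key, allow_overlap=False):
    """Compare key against each same-length slice of text."""
    hits = []
    if not key:
        return hits
    i = 0
    last_start = len(text) - len(key)  # too little text is left after this
    while i <= last_start:
        if text[i:i + len(key)] == key:
            hits.append((i, i + len(key)))
            # Skip past the match unless overlapping matches are wanted.
            i += 1 if allow_overlap else len(key)
        else:
            i += 1
    return hits


print(scan("aaaa", "aa"))                      # [(0, 2), (2, 4)]
print(scan("aaaa", "aa", allow_overlap=True))  # [(0, 2), (1, 3), (2, 4)]
```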

What you may want to consider is another form of pre-processing. Do you care if "abc_degree + 1" has missing or added spaces at the start or end or anywhere in the middle as in " abc_degree +1"?

Do you care if stuff is a different case like "Abc_Degree + 1"?

Some such searches can be done if both the pattern and searched string are first converted to a canonical format that maps to the same output. But that complicates things a bit and you may need to display what you match differently.
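
One possible canonical form, purely as an assumption for illustration, is lower case with all whitespace removed. The helper name `canon` is hypothetical, and the offset it gives refers to the canonical text rather than the original, which is the complication mentioned above.

```python
def canon(s):
    """Hypothetical canonical form: lower case with all whitespace removed."""
    return ''.join(s.split()).lower()


text = 'X - Abc_Degree +1 + qq'
key = ' abc_degree + 1'

canon_text = canon(text)
canon_key = canon(key)
offset = canon_text.find(canon_key)

print(canon_text)  # x-abc_degree+1+qq
print(canon_key)   # abc_degree+1
print(offset)      # 2, an offset into canon_text, not into text
```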

And are you also willing to match this: "myabc_degree + 1"?

When using a crafted RE there is a way to ask for a word boundary so abc will only be matched if before that is a space or the start of the string and not "my".
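
A sketch of the word-boundary idea using `\b` from the `re` module. The test string is invented for illustration, and this assumes the key starts and ends with a word character, since `\b` only marks a boundary between word and non-word characters.

```python
import re

text = 'myabc_degree + 1 + abc_degree + 1'
key = 'abc_degree + 1'

plain = [m.span() for m in re.finditer(re.escape(key), text)]
# \b is added outside the escaped key so it keeps its regex meaning.
bounded = [m.span() for m in re.finditer(r'\b' + re.escape(key) + r'\b', text)]

print(plain)    # [(2, 16), (19, 33)]
print(bounded)  # [(19, 33)]
```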

So this may be a case where you can either solve an easy version, with the chance it can be fooled, or overengineer it. If you are allowing the user to type in what to search for, as many programs, including editors, do, you will often find such false positives unless the user knows RE syntax and applies it and you do not escape it. I have experienced havoc when doing a careless global replace that matched more than I expected, including making changes in comments or constant strings rather than just the name of a function. Adding a paren is helpful, as is replacing one at a time rather than all at once and skipping any that are not wanted.

Good luck.

Re: How to escape strings for re.finditer? [ In reply to ]
On 28Feb2023 00:57, Jen Kris <jenkris@tutanota.com> wrote:
>Yes, that's it.  I don't know how long it would have taken to find that
>detail with research through the voluminous re documentation.  Thanks
>very much. 

You find things like this by printing out the strings you're actually
working with. Not the original strings, but the strings when you're
invoking `finditer` i.e. in your case, escaped strings.

Then you might have seen that what you were searching no longer
contained what you were searching for.

Don't underestimate the value of the debugging print call. It lets you
see what your programme is actually working with, instead of what you
thought it was working with.

Cheers,
Cameron Simpson <cs@cskk.id.au>
Re: How to escape strings for re.finditer? [ In reply to ]
string.count() only tells me there are N instances of the string; it does not say where they begin and end, as does re.finditer. 

Re: How to escape strings for re.finditer? [ In reply to ]
I haven't tested it either but it looks like it would work.  But for this case I prefer the relative simplicity of:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

4 18
26 40

I don't insist on terseness for its own sake, but it's cleaner this way. 

Jen


RE: How to escape strings for re.finditer? [ In reply to ]
Jen,

What you just described is why that tool is not the right tool for the job, albeit it may help you confirm if whatever method you choose does work correctly and finds the same number of matches.

Sometimes you simply do some searching and roll your own.

Consider this code using a sort of list comprehension feature:

>>> short = "hello world"
>>> longer = "hello world is how many programs start for novices but some use hello world! to show how happy they are to say hello world"

>>> short in longer
True
>>> howLong = len(short)

>>> res = [(offset, offset + howLong) for offset in range(len(longer)) if longer.startswith(short, offset)]
>>> res
[(0, 11), (64, 75), (111, 122)]
>>> len(res)
3

I could do a bit more but it seems to work. Did I get the offsets right? Checking:

>>> print( [ longer[res[index][0]:res[index][1]] for index in range(len(res))])
['hello world', 'hello world', 'hello world']

Seems to work but thrown together quickly so can likely be done much nicer.

But as noted, the above has flaws such as matching overlaps like:

>>> short = "good good"
>>> longer = "A good good good but not douple plus good good good goody"
>>> howLong = len(short)
>>> res = [(offset, offset + howLong) for offset in range(len(longer)) if longer.startswith(short, offset)]
>>> res
[(2, 11), (7, 16), (37, 46), (42, 51), (47, 56)]

It matched five times as sometimes we had three or four good in a row. Some other method might match only three.

What some might do can get long and you clearly want one answer and not tutorials. For example, people can make a loop that finds a match and either sabotages the area by replacing or deleting it, or keeps track and searches again on a substring offset from the beginning.

When you do not find a tool, consider making one. You can take (better) code than I show above and make it into a function and now you have a tool. Even better, you can make it return whatever you want.

RE: How to escape strings for re.finditer? [ In reply to ]
Jen,

Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.

What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.

This is what you sent:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
print(match.start(), match.end())

This is code indented properly:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....

And, just for fun, since there is nothing wrong with your code, this minor change is terser:

>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> for match in re.finditer(re.escape('abc_degree + 1'), example):
...     print(match.start(), match.end())
...
4 18
26 40

But note that once you use regular expressions as real patterns (not your case), you might match multiple things that are far from the same, such as two repeated words of any kind in any case, including "and and" and "so so", or words that have multiple doubled letters, as in the stereotypical bookkeeper. In those cases, you may want more than offsets: you may also want to show the exact text that matched, or even some characters before and/or after for context.
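
As an illustration of the repeated-word case, here is a sketch using a backreference. The sample sentence and the five-character context window are arbitrary choices, not anything from the thread.

```python
import re

text = 'It was so so good and And fine.'
# (\w+) captures a word; \1 requires the same word again after whitespace.
doubled = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

for m in doubled.finditer(text):
    # Show up to 5 characters of context on each side of the match.
    context = text[max(0, m.start() - 5):m.end() + 5]
    print(m.span(), repr(m.group(0)), repr(context))

matched = [m.group(0) for m in doubled.finditer(text)]
print(matched)  # ['so so', 'and And']
```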


Re: How to escape strings for re.finditer? [ In reply to ]
On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
> And, just for fun, since there is nothing wrong with your code, this minor change is terser:
>
>>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
> ... print(match.start(), match.end())
> ...
> ...
> 4 18
> 26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the
following version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
# 4 18
# 26 40

If you may have variable numbers of spaces around the symbols, OTOH, the
whole situation changes and then regexes would almost certainly be the
best approach. But the regular expression strings would become harder
to read.
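
For the variable-spacing case, one possible approach (an assumption, not something proposed in the thread) is to escape each whitespace-separated piece of the key and join the pieces with `\s*`:

```python
import re

key = 'abc_degree + 1'
# Escape each whitespace-separated piece, then allow any amount of
# whitespace (including none) between the pieces.
pattern = r'\s*'.join(re.escape(piece) for piece in key.split())
print(pattern)

example = 'X - abc_degree+1 + qq + abc_degree  +   1'
spans = [m.span() for m in re.finditer(pattern, example)]
print(spans)  # [(4, 16), (24, 41)]
```

The generated pattern is indeed harder to read than the plain key, which is the trade-off described above.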
RE: How to escape strings for re.finditer? [ In reply to ]
I think by now we have given all that is needed by the OP, but Thomas's answer
strikes me as something that could be a tad faster as a while loop if you are
searching a larger corpus such as an entire ebook, or all books as you can do
on books.google.com

I think I mentioned earlier that some assumptions need to apply. The text
needs to be something like an ASCII encoding or seen as code points rather
than bytes. We assume a match should move forward by the length of the
match. And, clearly, there cannot be a match too close to the end.

So a while loop would begin with a variable set to zero to mark the current
location of the search. The condition for repeating the loop is that this
variable is less than or equal to len(searched_text) - len(key)

In the loop, each comparison is done the same way as Thomas does it, or anything
similar enough, but the twist is that a failure increments the variable by 1 while
success increments it by len(key).

Will this make much difference? It might as the simpler algorithm counts
overlapping matches and wastes some time hunting where perhaps it shouldn't.

And, of course, if you made something like this into a search function, you
can easily add features such as asking that you only return the first N
matches or the next N, simply by making it a generator.
So tying this into an earlier discussion, do you want the LAST match info
visible when the While loop has completed? If it was available, it opens up
possibilities for running the loop again but starting from where you left
off.
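
A rough sketch of the while loop and generator idea described above. The name `iter_matches` and the sample text are hypothetical, and `itertools.islice` is just one way to take the first N matches and later pick up where the search left off.

```python
from itertools import islice


def iter_matches(text, key):
    """Yield (start, end) spans: step by 1 on failure, len(key) on success."""
    pos = 0
    while key and pos <= len(text) - len(key):
        if text.startswith(key, pos):
            yield pos, pos + len(key)
            pos += len(key)
        else:
            pos += 1


text = 'ab ab ab ab ab'
matches = iter_matches(text, 'ab')
print(list(islice(matches, 2)))  # first two: [(0, 2), (3, 5)]
print(list(islice(matches, 2)))  # next two:  [(6, 8), (9, 11)]
print(list(matches))             # the rest:  [(12, 14)]
```

Because the generator keeps its own position, asking for more matches later continues from the last match rather than starting over.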



-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of Thomas Passin
Sent: Monday, February 27, 2023 9:44 PM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
> And, just for fun, since there is nothing wrong with your code, this minor
change is terser:
>
>>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
> ... print(match.start(), match.end())
> ...
> ...
> 4 18
> 26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the following
version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
4 18
26 40

If you may have variable numbers of spaces around the symbols, OTOH, the
whole situation changes and then regexes would almost certainly be the best
approach. But the regular expression strings would become harder to read.
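
For the variable-spacing case, one possible approach is to build the pattern
from the key instead of writing it by hand. The sketch below assumes the
key's tokens are separated by whitespace, as in 'abc_degree + 1'. The helper
name flexible_space_pattern is made up for this example.

```python
import re


def flexible_space_pattern(key):
    """Build a regex matching key with any amount of whitespace
    (including none) between its tokens.

    Hypothetical helper: assumes tokens in key are whitespace-separated.
    """
    tokens = key.split()
    # Escape each token so characters like '+' are matched literally.
    return re.compile(r'\s*'.join(re.escape(tok) for tok in tokens))


pattern = flexible_space_pattern('abc_degree + 1')
example = 'X - abc_degree+1 + qq + abc_degree  +  1'
for match in pattern.finditer(example):
    print(match.start(), match.end(), repr(match.group()))
# prints:
# 4 16 'abc_degree+1'
# 24 40 'abc_degree  +  1'
```

The generated pattern is harder to read than the key, which is one argument
for keeping the construction inside a small named function.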
--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
Op 28/02/2023 om 3:44 schreef Thomas Passin:
> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>> And, just for fun, since there is nothing wrong with your code, this
>> minor change is terser:
>>
>>>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
>> ...     print(match.start(), match.end())
>> ...
>> ...
>> 4 18
>> 26 40
>
> Just for more fun :) -
>
> Without knowing how general your expressions will be, I think the
> following version is very readable, certainly more readable than regexes:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> KEY = 'abc_degree + 1'
>
> for i in range(len(example)):
>     if example[i:].startswith(KEY):
>         print(i, i + len(KEY))
> # prints:
> 4 18
> 26 40
I think it's often a good idea to use a standard library function
instead of rolling your own. The issue becomes less clear-cut when the
standard library doesn't do exactly what you need (as here, where
re.finditer() uses regular expressions while the use case only uses
simple search strings). Ideally there would be a str.finditer() method
we could use, but in the absence of that I think we still need to
consider using the almost-but-not-quite fitting re.finditer().

Two reasons:

(1) I think it's clearer: the name tells us what it does (though of
course we could solve this in a hand-written version by wrapping it in a
suitably named function).

(2) Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but
slowly. In some situations it doesn't matter, but in other cases it
will. For better performance, string searching algorithms jump ahead
either when they found a match or when they know for sure there isn't a
match for some time (see e.g. the Boyer–Moore string-search algorithm).
You could write such a more efficient algorithm, but then it becomes
more complex and more error-prone. Using a well-tested existing function
becomes quite attractive.
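
As a sketch of the "suitably named function" idea, a hand-written version
can also lean on the built-in str.find(), which jumps from one match to the
next instead of testing every position in a Python-level loop. The name
find_spans and the overlapping flag are assumptions for illustration, not
anything from the standard library.

```python
def find_spans(key, text, overlapping=False):
    """Return (start, end) pairs for occurrences of key in text.

    Hypothetical helper built on str.find(); not a standard library
    function.
    """
    if not key:
        raise ValueError('key must not be empty')
    spans = []
    pos = text.find(key)
    while pos != -1:
        spans.append((pos, pos + len(key)))
        # Jump past the match, or by one character to allow overlaps.
        step = 1 if overlapping else len(key)
        pos = text.find(key, pos + step)
    return spans


print(find_spans('abc_degree + 1', 'X - abc_degree + 1 + qq + abc_degree + 1'))
# [(4, 18), (26, 40)]
```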

To illustrate the difference in performance, I did a simple test (using the
paragraph above as test text):

    import re
    import timeit

    def using_re_finditer(key, text):
        matches = []
        for match in re.finditer(re.escape(key), text):
            matches.append((match.start(), match.end()))
        return matches


    def using_simple_loop(key, text):
        matches = []
        for i in range(len(text)):
            if text[i:].startswith(key):
                matches.append((i, i + len(key)))
        return matches


    CORPUS = """Searching for a string in another string, in a performant way, is
    not as simple as it first appears. Your version works correctly, but slowly.
    In some situations it doesn't matter, but in other cases it will. For better
    performance, string searching algorithms jump ahead either when they found a
    match or when they know for sure there isn't a match for some time (see e.g.
    the Boyer–Moore string-search algorithm). You could write such a more
    efficient algorithm, but then it becomes more complex and more error-prone.
    Using a well-tested existing function becomes quite attractive."""
    KEY = 'in'
    print('using_simple_loop:',
          timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)',
                        globals=globals(), number=1000))
    print('using_re_finditer:',
          timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)',
                        globals=globals(), number=1000))

This does 5 runs of 1000 repetitions each, and reports the time in
seconds for each of those runs.
Result on my machine:

    using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
    using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]

We find that in this test re.finditer() is more than 30 times faster
(despite the overhead of regular expressions).

While speed isn't everything in programming, with such a large
difference in performance and (to me) no real disadvantages of using
re.finditer(), I would prefer re.finditer() over writing my own.
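
Part of the gap in this test probably comes from text[i:], which copies the
rest of the string at every position. A variant that passes the offset to
str.startswith() avoids that copy. The sketch below is an assumption-laden
illustration: the name using_startswith_offset is invented, and no timings
are claimed, so measure on your own machine.

```python
import re
import timeit


def using_re_finditer(key, text):
    return [(m.start(), m.end()) for m in re.finditer(re.escape(key), text)]


def using_startswith_offset(key, text):
    # Hypothetical variant: startswith(key, i) tests at offset i
    # without building a new string for every position.
    return [(i, i + len(key))
            for i in range(len(text) - len(key) + 1)
            if text.startswith(key, i)]


TEXT = 'Searching for a string in another string, in a performant way'
KEY = 'in'

# Both approaches agree here because 'in' cannot overlap itself.
assert using_startswith_offset(KEY, TEXT) == using_re_finditer(KEY, TEXT)

for func in (using_startswith_offset, using_re_finditer):
    best = min(timeit.repeat(lambda: func(KEY, TEXT), number=1000))
    print(func.__name__, best)
```

Note that the two functions differ for keys that can overlap themselves
(such as 'aa' in 'aaaa'): the loop reports overlapping matches, while
re.finditer() does not.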

--
"The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom."
-- Isaac Asimov

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 4:33 AM, Roel Schroeven wrote:
> Op 28/02/2023 om 3:44 schreef Thomas Passin:
>> On 2/27/2023 9:16 PM, avi.e.gross@gmail.com wrote:
>>> And, just for fun, since there is nothing wrong with your code, this
>>> minor change is terser:
>>>
>>>>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
>>> ...     print(match.start(), match.end())
>>> ...
>>> ...
>>> 4 18
>>> 26 40
>>
>> Just for more fun :) -
>>
>> Without knowing how general your expressions will be, I think the
>> following version is very readable, certainly more readable than regexes:
>>
>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> KEY = 'abc_degree + 1'
>>
>> for i in range(len(example)):
>>     if example[i:].startswith(KEY):
>>         print(i, i + len(KEY))
>> # prints:
>> 4 18
>> 26 40
> I think it's often a good idea to use a standard library function
> instead of rolling your own. The issue becomes less clear-cut when the
> standard library doesn't do exactly what you need (as here, where
> re.finditer() uses regular expressions while the use case only uses
> simple search strings). Ideally there would be a str.finditer() method
> we could use, but in the absence of that I think we still need to
> consider using the almost-but-not-quite fitting re.finditer().
>
> Two reasons:
>
> (1) I think it's clearer: the name tells us what it does (though of
> course we could solve this in a hand-written version by wrapping it in a
> suitably named function).
>
> (2) Searching for a string in another string, in a performant way, is
> not as simple as it first appears. Your version works correctly, but
> slowly. In some situations it doesn't matter, but in other cases it
> will. For better performance, string searching algorithms jump ahead
> either when they found a match or when they know for sure there isn't a
> match for some time (see e.g. the Boyer–Moore string-search algorithm).
> You could write such a more efficient algorithm, but then it becomes
> more complex and more error-prone. Using a well-tested existing function
> becomes quite attractive.

Sure, it all depends on what the real task will be. That's why I wrote
"Without knowing how general your expressions will be". For the example
string, it's unlikely that speed will be a factor, but who knows what
target strings and keys will turn up in the future?

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
Op 28/02/2023 om 14:35 schreef Thomas Passin:
> On 2/28/2023 4:33 AM, Roel Schroeven wrote:
>> [...]
>> (2) Searching for a string in another string, in a performant way, is
>> not as simple as it first appears. Your version works correctly, but
>> slowly. In some situations it doesn't matter, but in other cases it
>> will. For better performance, string searching algorithms jump ahead
>> either when they found a match or when they know for sure there isn't
>> a match for some time (see e.g. the Boyer–Moore string-search
>> algorithm). You could write such a more efficient algorithm, but then
>> it becomes more complex and more error-prone. Using a well-tested
>> existing function becomes quite attractive.
>
> Sure, it all depends on what the real task will be.  That's why I
> wrote "Without knowing how general your expressions will be". For the
> example string, it's unlikely that speed will be a factor, but who
> knows what target strings and keys will turn up in the future?
In hindsight I think it was overthinking things a bit. "It all depends
on what the real task will be" you say, and indeed I think that should
be the main conclusion here.

--
"Man had always assumed that he was more intelligent than dolphins because
he had achieved so much — the wheel, New York, wars and so on — whilst all
the dolphins had ever done was muck about in the water having a good time.
But conversely, the dolphins had always believed that they were far more
intelligent than man — for precisely the same reasons."
-- Douglas Adams

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 10:05 AM, Roel Schroeven wrote:
> Op 28/02/2023 om 14:35 schreef Thomas Passin:
>> On 2/28/2023 4:33 AM, Roel Schroeven wrote:
>>> [...]
>>> (2) Searching for a string in another string, in a performant way, is
>>> not as simple as it first appears. Your version works correctly, but
>>> slowly. In some situations it doesn't matter, but in other cases it
>>> will. For better performance, string searching algorithms jump ahead
>>> either when they found a match or when they know for sure there isn't
>>> a match for some time (see e.g. the Boyer–Moore string-search
>>> algorithm). You could write such a more efficient algorithm, but then
>>> it becomes more complex and more error-prone. Using a well-tested
>>> existing function becomes quite attractive.
>>
>> Sure, it all depends on what the real task will be.  That's why I
>> wrote "Without knowing how general your expressions will be". For the
>> example string, it's unlikely that speed will be a factor, but who
>> knows what target strings and keys will turn up in the future?
> On hindsight I think it was overthinking things a bit. "It all depends
> on what the real task will be" you say, and indeed I think that should
> be the main conclusion here.


It is interesting, though, how pre-processing the search pattern can
improve search times if you can afford the pre-processing. Here's a
paper on rapidly finding matches when there may be up to one misspelled
character. It's easy enough to implement, though in Python you can't
take the additional step of tuning it to stay in cache.

https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf
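
To make "up to one misspelled character" concrete, here is a deliberately
naive sketch that reports every position where the key matches with at most
one substituted character. It is not the algorithm from the paper and does
no pre-processing at all. The name find_with_one_mismatch is invented for
illustration, and inserted or deleted characters are not handled.

```python
def find_with_one_mismatch(key, text):
    """Return start offsets where key matches text with at most one
    substituted character (Hamming distance <= 1).

    Naive O(len(text) * len(key)) illustration only.
    """
    hits = []
    for start in range(len(text) - len(key) + 1):
        mismatches = 0
        for a, b in zip(key, text[start:start + len(key)]):
            if a != b:
                mismatches += 1
                if mismatches > 1:
                    break  # too many differences at this position
        if mismatches <= 1:
            hits.append(start)
    return hits


print(find_with_one_mismatch('degree', 'abc_degree + abc_degrea + abc_digits'))
# [4, 17]
```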

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2023-02-28, Thomas Passin <list1@tompassin.net> wrote:
> On 2/28/2023 10:05 AM, Roel Schroeven wrote:
>> Op 28/02/2023 om 14:35 schreef Thomas Passin:
>>> On 2/28/2023 4:33 AM, Roel Schroeven wrote:
>>>> [...]
>>>> (2) Searching for a string in another string, in a performant way, is
>>>> not as simple as it first appears. Your version works correctly, but
>>>> slowly. In some situations it doesn't matter, but in other cases it
>>>> will. For better performance, string searching algorithms jump ahead
>>>> either when they found a match or when they know for sure there isn't
>>>> a match for some time (see e.g. the Boyer–Moore string-search
>>>> algorithm). You could write such a more efficient algorithm, but then
>>>> it becomes more complex and more error-prone. Using a well-tested
>>>> existing function becomes quite attractive.
>>>
>>> Sure, it all depends on what the real task will be.  That's why I
>>> wrote "Without knowing how general your expressions will be". For the
>>> example string, it's unlikely that speed will be a factor, but who
>>> knows what target strings and keys will turn up in the future?
>> On hindsight I think it was overthinking things a bit. "It all depends
>> on what the real task will be" you say, and indeed I think that should
>> be the main conclusion here.
>
> It is interesting, though, how pre-processing the search pattern can
> improve search times if you can afford the pre-processing. Here's a
> paper on rapidly finding matches when there may be up to one misspelled
> character. It's easy enough to implement, though in Python you can't
> take the additional step of tuning it to stay in cache.
>
> https://Robert.Muth.Org/Papers/1996-Approx-Multi.Pdf

You've somehow title-cased that URL. The correct URL is:

https://robert.muth.org/Papers/1996-approx-multi.pdf
--
https://mail.python.org/mailman/listinfo/python-list
RE: How to escape strings for re.finditer? [ In reply to ]
The code I sent is correct, and it runs here.  Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

One question:  several people have made suggestions other than regex (not the terser regex example you show below).  Is there a reason why regex is not preferred to, for example, a list comp?  Performance?  Reliability?

Feb 27, 2023, 18:16 by avi.e.gross@gmail.com:

> Jen,
>
> Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.
>
> What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.
>
> This is what you sent:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
> print(match.start(), match.end())
>
> This is code indented properly:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> find_string = re.escape('abc_degree + 1')
> for match in re.finditer(find_string, example):
>     print(match.start(), match.end())
>
> Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....
>
> And, just for fun, since there is nothing wrong with your code, this minor change is terser:
>
> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> >>> for match in re.finditer(re.escape('abc_degree + 1'), example):
> ...     print(match.start(), match.end())
> ...
> 4 18
> 26 40
>
> But note that once you use regular expressions more generally (not in your case), the matches may differ from one another, such as when matching any doubled word in any case, including "and and" and "so so", or finding words that have multiple doubled letters, as in the stereotypical "bookkeeper". In those cases, you may want more than offsets: you may also want to show the exact text that matched, or even some characters before and/or after for context.
>
>
> -----Original Message-----
> From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Jen Kris via Python-list
> Sent: Monday, February 27, 2023 8:36 PM
> To: Cameron Simpson <cs@cskk.id.au>
> Cc: Python List <python-list@python.org>
> Subject: Re: How to escape strings for re.finditer?
>
>
> I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):
> print(match.start(), match.end())
>
> 4 18
> 26 40
>
> I don't insist on terseness for its own sake, but it's cleaner this way.
>
> Jen
>
>
> Feb 27, 2023, 16:55 by cs@cskk.id.au:
>
>> On 28Feb2023 01:13, Jen Kris <jenkris@tutanota.com> wrote:
>>
>>> I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).
>>>
>>
>> Sure, but writing a `finditer` for plain `str` is pretty easy (untested):
>>
>> pos = 0
>> while True:
>>     found = s.find(substring, pos)
>>     if found < 0:
>>         break
>>     start = found
>>     end = found + len(substring)
>>     ... do whatever with start and end ...
>>     pos = end
>>
>> Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic, error-prone overkill. Just something to keep in mind.
>>
>> Cheers,
>> Cameron Simpson <cs@cskk.id.au>

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
Using str.startswith is a cool idea in this case.  But is it better than regex for performance or reliability?  Regex syntax is not a model of simplicity, but in my simple case it's not too difficult. 


--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
I wrote my previous message before reading this.  Thank you for the test you ran -- it answers the question of performance.  You show that re.finditer is 30x faster, which certainly recommends it over a simple loop with its looping overhead.


--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 12:57 PM, Jen Kris via Python-list wrote:
> The code I sent is correct, and it runs here.  Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>  find_string = re.escape('abc_degree + 1')
>  for match in re.finditer(find_string, example):
>      print(match.start(), match.end())
>
> One question:  several people have made suggestions other than regex (not your terser example with regex you shown below).  Is there a reason why regex is not preferred to, for example, a list comp?  Performance?  Reliability?

"Some people, when confronted with a problem, think 'I know, I'll use
regular expressions.' Now they have two problems."

-
https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/

Of course, if you actually read the blog post in the link, there's more
to it than that...

--
https://mail.python.org/mailman/listinfo/python-list
Re: How to escape strings for re.finditer? [ In reply to ]
On 2/28/2023 1:07 PM, Jen Kris wrote:
>
> Using str.startswith is a cool idea in this case.  But is it better than
> regex for performance or reliability?  Regex syntax is not a model of
> simplicity, but in my simple case it's not too difficult.

The trouble is that we don't know what your case really is. If you are
talking about a short pattern like your example and a small text to
search, and you don't need to do it too often, then my little code
example is probably ideal. Reliability wouldn't be an issue, and
performance would not be relevant. If your case is going to be much
larger, called many times in a loop, or be much more complicated in some
other way, then a regex or some other approach is likely to be much faster.
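
For the "called many times in a loop" case, one common arrangement is to
compile the escaped key once and reuse it. The sketch below assumes a list
of lines to search. The names KEY_PATTERN and spans_per_line are
illustrative only, and whether this is measurably faster depends on the
workload, since the re module also caches recently used patterns.

```python
import re

# Compiled once, reused for every line (hypothetical module-level constant).
KEY_PATTERN = re.compile(re.escape('abc_degree + 1'))


def spans_per_line(lines):
    """Map each 1-based line number to the (start, end) spans on that line."""
    return {
        lineno: [m.span() for m in KEY_PATTERN.finditer(line)]
        for lineno, line in enumerate(lines, start=1)
    }


lines = [
    'X - abc_degree + 1 + qq + abc_degree + 1',
    'no match here',
    'abc_degree + 1',
]
print(spans_per_line(lines))
# {1: [(4, 18), (26, 40)], 2: [], 3: [(0, 14)]}
```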


--
https://mail.python.org/mailman/listinfo/python-list
