Mailing List Archive: Code improvement question

Code improvement question

python-list at python

Nov 14, 2023, 3:14 PM

Post #1 of 17 (407 views)

I'd like to improve the code below, which works. It feels clunky to me.

I need to clean up user-uploaded files the size of which I don't know in
advance.

After cleaning they might be as big as 1Mb but that would be super rare.
Perhaps only for testing.

I'm extracting CAS numbers, and the pattern runs from xx-xx-x up to
xxxxxxx-xx-x, e.g. 1012300-77-4

import re

def remove_alpha(txt):
    r""" r'[^0-9\- ]':
    [^...]: Match any character that is not in the specified set.
    0-9: Match any digit.
    \: Escape character.
    -: Match a hyphen.
    Space: Match a space.
    """
    # strip everything except digits, hyphens and spaces
    cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
    bits = cleaned_txt.split()
    pieces = []
    for bit in bits:
        # minimum size of a CAS number is 7 so drop smaller clumps of digits
        pieces.append(bit if len(bit) > 6 else "")
    return " ".join(pieces)

Many thanks for any hints

Cheers

Mike
--
https://mail.python.org/mailman/listinfo/python-list

Re: Code improvement question [ In reply to ]

python-list at python

Nov 14, 2023, 3:25 PM

Post #2 of 17 (407 views)

On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
> I'd like to improve the code below, which works. It feels clunky to me.
>
> I need to clean up user-uploaded files the size of which I don't know in
> advance.
>
> After cleaning they might be as big as 1Mb but that would be super rare.
> Perhaps only for testing.
>
> I'm extracting CAS numbers and here is the pattern xx-xx-x up to
> xxxxxxx-xx-x eg., 1012300-77-4
>
> def remove_alpha(txt):
>
>     """ r'[^0-9\- ]':
>
>     [^...]: Match any character that is not in the specified set.
>
>     0-9: Match any digit.
>
>     \: Escape character.
>
>     -: Match a hyphen.
>
>     Space: Match a space.
>
>     """
>
>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>
>     bits = cleaned_txt.split()
>
>     pieces = []
>
>     for bit in bits:
>
>         # minimum size of a CAS number is 7 so drop smaller clumps of digits
>
>         pieces.append(bit if len(bit) > 6 else "")
>
>     return " ".join(pieces)
>
>
> Many thanks for any hints
>
Why don't you use re.findall?

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
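A minimal sketch of the findall approach, checked against the format given in post #1 (xx-xx-x up to xxxxxxx-xx-x). One assumption to flag: the pattern as posted ends in [0-9]{2}, which requires two trailing digits, whereas the stated format and the example 1012300-77-4 end in a single digit, so the sketch uses [0-9] for the last group. The name extract_cas_numbers and the sample text are illustrative only.

```python
import re

# Assumption: the last group is a single digit, per the xx-xx-x format in post #1.
CAS_RE = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b')


def extract_cas_numbers(txt):
    """Return every CAS-shaped token found in txt, in order of appearance."""
    return CAS_RE.findall(txt)


sample = "Acetone (67-64-1), water 7732-18-5, and 1012300-77-4."
print(extract_cas_numbers(sample))
```

Unlike remove_alpha(), this returns a list of matches rather than a cleaned string; " ".join(...) on the result gives a string if that is what the caller needs.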


Re: Code improvement question [ In reply to ]

python-list at python

Nov 14, 2023, 7:41 PM

Post #3 of 17 (404 views)

On 15/11/2023 10:25 am, MRAB via Python-list wrote:
> On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
>> I'd like to improve the code below, which works. It feels clunky to me.
>>
>> I need to clean up user-uploaded files the size of which I don't know in
>> advance.
>>
>> After cleaning they might be as big as 1Mb but that would be super rare.
>> Perhaps only for testing.
>>
>> I'm extracting CAS numbers and here is the pattern xx-xx-x up to
>> xxxxxxx-xx-x eg., 1012300-77-4
>>
>> def remove_alpha(txt):
>>
>>     """ r'[^0-9\- ]':
>>
>>     [^...]: Match any character that is not in the specified set.
>>
>>     0-9: Match any digit.
>>
>>     \: Escape character.
>>
>>     -: Match a hyphen.
>>
>>     Space: Match a space.
>>
>>     """
>>
>>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>>
>>     bits = cleaned_txt.split()
>>
>>     pieces = []
>>
>>     for bit in bits:
>>
>>         # minimum size of a CAS number is 7 so drop smaller clumps
>> of digits
>>
>>         pieces.append(bit if len(bit) > 6 else "")
>>
>>     return " ".join(pieces)
>>
>>
>> Many thanks for any hints
>>
> Why don't you use re.findall?
>
> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

I think I can see what you did there but it won't make sense to me - or
whoever looks at the code - in future.

That answers your specific question. However, I am in awe of people who
can just "do" regular expressions and I thank you very much for what
would have been a monumental effort had I tried it.

That little re.sub() came from ChatGPT and I can understand it without
too much effort because it came documented.

I suppose ChatGPT is the answer to this thread. Or everything. Or will be.

Thanks

Mike

Re: Code improvement question [ In reply to ]

python-list at python

Nov 14, 2023, 8:08 PM

Post #4 of 17 (404 views)

On 2023-11-15 03:41, Mike Dewhirst via Python-list wrote:
> On 15/11/2023 10:25 am, MRAB via Python-list wrote:
>> On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
>>> I'd like to improve the code below, which works. It feels clunky to me.
>>>
>>> I need to clean up user-uploaded files the size of which I don't know in
>>> advance.
>>>
>>> After cleaning they might be as big as 1Mb but that would be super rare.
>>> Perhaps only for testing.
>>>
>>> I'm extracting CAS numbers and here is the pattern xx-xx-x up to
>>> xxxxxxx-xx-x eg., 1012300-77-4
>>>
>>> def remove_alpha(txt):
>>>
>>>     """ r'[^0-9\- ]':
>>>
>>>     [^...]: Match any character that is not in the specified set.
>>>
>>>     0-9: Match any digit.
>>>
>>>     \: Escape character.
>>>
>>>     -: Match a hyphen.
>>>
>>>     Space: Match a space.
>>>
>>>     """
>>>
>>>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>>>
>>>     bits = cleaned_txt.split()
>>>
>>>     pieces = []
>>>
>>>     for bit in bits:
>>>
>>>         # minimum size of a CAS number is 7 so drop smaller clumps
>>> of digits
>>>
>>>         pieces.append(bit if len(bit) > 6 else "")
>>>
>>>     return " ".join(pieces)
>>>
>>>
>>> Many thanks for any hints
>>>
>> Why don't you use re.findall?
>>
>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>
> I think I can see what you did there but it won't make sense to me - or
> whoever looks at the code - in future.
>
> That answers your specific question. However, I am in awe of people who
> can just "do" regular expressions and I thank you very much for what
> would have been a monumental effort had I tried it.
>
> That little re.sub() came from ChatGPT and I can understand it without
> too much effort because it came documented
>
> I suppose ChatGPT is the answer to this thread. Or everything. Or will be.
>
\b          Word boundary
[0-9]{2,7}  2..7 digits
-           "-"
[0-9]{2}    2 digits
-           "-"
[0-9]{2}    2 digits
\b          Word boundary

The "word boundary" thing is to stop it matching where there are letters
or digits right next to the digits.

For example, if the text contained, say, "123456789-12-1234", you
wouldn't want it to match because there are more than 7 digits at the
start and more than 2 digits at the end.
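To see the effect of the word boundaries, here is a small sketch that runs the pattern as posted, with and without \b, on that same string. The variable names are just for illustration.

```python
import re

text = "123456789-12-1234"

# With \b on both ends, a match cannot start or stop in the middle of a run of digits.
with_boundaries = re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', text)

# Without \b, the engine is free to pick a substring out of the longer runs.
without_boundaries = re.findall(r'[0-9]{2,7}-[0-9]{2}-[0-9]{2}', text)

print(with_boundaries)     # []
print(without_boundaries)  # ['3456789-12-12']
```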


Re: Code improvement question [ In reply to ]

python-list at python

Nov 15, 2023, 2:34 PM

Post #5 of 17 (400 views)

>>>
>> Why don't you use re.findall?
>>
>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>
> I think I can see what you did there but it won't make sense to me - or
> whoever looks at the code - in future.
>
> That answers your specific question. However, I am in awe of people who
> can just "do" regular expressions and I thank you very much for what
> would have been a monumental effort had I tried it.

I feel the same way about regex. If I can find a way to write something
without regex I very much prefer to as regex usually adds complexity and
hurts readability.

You might find https://regex101.com/ to be useful for testing your
regex. You can enter in sample data and see if it matches.

If I understood what your regex was trying to do I might be able to
suggest some python to do the same thing. Is it just removing numbers
from text?

The for loop, "for bit in bits" etc, could be written as a list
comprehension.

pieces = [bit if len(bit) > 6 else "" for bit in bits]

For devs familiar with other languages but new to Python this will look
like gibberish so arguably the original for loop is clearer, depending
on your team.

It's worth making the effort to get into list comprehensions though
because they're awesome.
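For comparison, a short sketch showing the loop from post #1 next to the equivalent comprehension. The sample bits list is made up for illustration. The last line is a variant that drops short clumps entirely instead of replacing them with empty strings, which is an assumption about what may be wanted rather than what the original function does.

```python
bits = ["67-64-1", "12", "7732-18-5", "-", "1012300-77-4"]

# Original style: explicit loop, short clumps become empty strings.
pieces_loop = []
for bit in bits:
    pieces_loop.append(bit if len(bit) > 6 else "")

# Same result as a list comprehension.
pieces_comp = [bit if len(bit) > 6 else "" for bit in bits]

# Variant (assumption): filter short clumps out instead of keeping placeholders.
kept = [bit for bit in bits if len(bit) > 6]

print(pieces_comp)
print(kept)
```

Note that joining pieces_comp with " " leaves runs of spaces where the empty strings were, while joining kept does not.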

>
> That little re.sub() came from ChatGPT and I can understand it without
> too much effort because it came documented
>
> I suppose ChatGPT is the answer to this thread. Or everything. Or will be.

I am doubtful. We'll see!

R


Re: Code improvement question [ In reply to ]

python-list at python

Nov 16, 2023, 5:15 PM

Post #6 of 17 (400 views)

On 15/11/2023 3:08 pm, MRAB via Python-list wrote:
> On 2023-11-15 03:41, Mike Dewhirst via Python-list wrote:
>> On 15/11/2023 10:25 am, MRAB via Python-list wrote:
>>> On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
>>>> I'd like to improve the code below, which works. It feels clunky to
>>>> me.
>>>>
>>>> I need to clean up user-uploaded files the size of which I don't
>>>> know in
>>>> advance.
>>>>
>>>> After cleaning they might be as big as 1Mb but that would be super
>>>> rare.
>>>> Perhaps only for testing.
>>>>
>>>> I'm extracting CAS numbers and here is the pattern xx-xx-x up to
>>>> xxxxxxx-xx-x eg., 1012300-77-4
>>>>
>>>> def remove_alpha(txt):
>>>>
>>>>     """ r'[^0-9\- ]':
>>>>
>>>>     [^...]: Match any character that is not in the specified set.
>>>>
>>>>     0-9: Match any digit.
>>>>
>>>>     \: Escape character.
>>>>
>>>>     -: Match a hyphen.
>>>>
>>>>     Space: Match a space.
>>>>
>>>>     """
>>>>
>>>>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>>>>
>>>>     bits = cleaned_txt.split()
>>>>
>>>>     pieces = []
>>>>
>>>>     for bit in bits:
>>>>
>>>>         # minimum size of a CAS number is 7 so drop smaller
>>>> clumps of digits
>>>>
>>>>         pieces.append(bit if len(bit) > 6 else "")
>>>>
>>>>     return " ".join(pieces)
>>>>
>>>>
>>>> Many thanks for any hints
>>>>
>>> Why don't you use re.findall?
>>>
>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>>
>> I think I can see what you did there but it won't make sense to me - or
>> whoever looks at the code - in future.
>>
>> That answers your specific question. However, I am in awe of people who
>> can just "do" regular expressions and I thank you very much for what
>> would have been a monumental effort had I tried it.
>>
>> That little re.sub() came from ChatGPT and I can understand it without
>> too much effort because it came documented
>>
>> I suppose ChatGPT is the answer to this thread. Or everything. Or
>> will be.
>>
> \b          Word boundary
> [0-9]{2,7} 2..7 digits
> -           "-"
> [0-9]{2}    2 digits
> -           "-"
> [0-9]{2}    2 digits
> \b          Word boundary
>
> The "word boundary" thing is to stop it matching where there are
> letters or digits right next to the digits.
>
> For example, if the text contained, say, "123456789-12-1234", you
> wouldn't want it to match because there are more than 7 digits at the
> start and more than 2 digits at the end.
>
Thanks

I know I should invest some brainspace in re. Many years ago at a Perl
conference I did buy a coffee mug completely covered with a regex cheat
sheet. It currently holds pens and pencils on my desk. And spiders, now
that I look closely!

Then I took up Python and re is different.

Maybe I'll have another look ...

Cheers

Mike

--
Signed email is an absolute defence against phishing. This email has
been signed with my private key. If you import my public key you can
automatically decrypt my signature and be sure it came from me. Your
email software can handle signing.

Re: Code improvement question [ In reply to ]

python-list at python

Nov 16, 2023, 5:22 PM

Post #7 of 17 (400 views)

On 2023-11-17 01:15, Mike Dewhirst via Python-list wrote:
> On 15/11/2023 3:08 pm, MRAB via Python-list wrote:
>> On 2023-11-15 03:41, Mike Dewhirst via Python-list wrote:
>>> On 15/11/2023 10:25 am, MRAB via Python-list wrote:
>>>> On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
>>>>> I'd like to improve the code below, which works. It feels clunky to
>>>>> me.
>>>>>
>>>>> I need to clean up user-uploaded files the size of which I don't
>>>>> know in
>>>>> advance.
>>>>>
>>>>> After cleaning they might be as big as 1Mb but that would be super
>>>>> rare.
>>>>> Perhaps only for testing.
>>>>>
>>>>> I'm extracting CAS numbers and here is the pattern xx-xx-x up to
>>>>> xxxxxxx-xx-x eg., 1012300-77-4
>>>>>
>>>>> def remove_alpha(txt):
>>>>>
>>>>>     """ r'[^0-9\- ]':
>>>>>
>>>>>     [^...]: Match any character that is not in the specified set.
>>>>>
>>>>>     0-9: Match any digit.
>>>>>
>>>>>     \: Escape character.
>>>>>
>>>>>     -: Match a hyphen.
>>>>>
>>>>>     Space: Match a space.
>>>>>
>>>>>     """
>>>>>
>>>>>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>>>>>
>>>>>     bits = cleaned_txt.split()
>>>>>
>>>>>     pieces = []
>>>>>
>>>>>     for bit in bits:
>>>>>
>>>>>         # minimum size of a CAS number is 7 so drop smaller
>>>>> clumps of digits
>>>>>
>>>>>         pieces.append(bit if len(bit) > 6 else "")
>>>>>
>>>>>     return " ".join(pieces)
>>>>>
>>>>>
>>>>> Many thanks for any hints
>>>>>
>>>> Why don't you use re.findall?
>>>>
>>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>>>
>>> I think I can see what you did there but it won't make sense to me - or
>>> whoever looks at the code - in future.
>>>
>>> That answers your specific question. However, I am in awe of people who
>>> can just "do" regular expressions and I thank you very much for what
>>> would have been a monumental effort had I tried it.
>>>
>>> That little re.sub() came from ChatGPT and I can understand it without
>>> too much effort because it came documented
>>>
>>> I suppose ChatGPT is the answer to this thread. Or everything. Or
>>> will be.
>>>
>> \b          Word boundary
>> [0-9]{2,7} 2..7 digits
>> -           "-"
>> [0-9]{2}    2 digits
>> -           "-"
>> [0-9]{2}    2 digits
>> \b          Word boundary
>>
>> The "word boundary" thing is to stop it matching where there are
>> letters or digits right next to the digits.
>>
>> For example, if the text contained, say, "123456789-12-1234", you
>> wouldn't want it to match because there are more than 7 digits at the
>> start and more than 2 digits at the end.
>>
> Thanks
>
> I know I should invest some brainspace in re. Many years ago at a Perl
> conference I did buy a coffee mug completely covered with a regex cheat
> sheet. It currently holds pens and pencils on my desk. And spiders now I
> look closely!
>
> Then I took up Python and re is different.
>
> Maybe I'll have another look ...
>
The patterns themselves aren't that different; Perl's just has more
features than the re module's.

Re: Code improvement question [ In reply to ]

python-list at python

Nov 16, 2023, 8:56 PM

Post #8 of 17 (398 views)

On 16/11/2023 9:34 am, Rimu Atkinson via Python-list wrote:
>
>>>>
>>> Why don't you use re.findall?
>>>
>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>>
>> I think I can see what you did there but it won't make sense to me -
>> or whoever looks at the code - in future.
>>
>> That answers your specific question. However, I am in awe of people
>> who can just "do" regular expressions and I thank you very much for
>> what would have been a monumental effort had I tried it.
>
> I feel the same way about regex. If I can find a way to write
> something without regex I very much prefer to as regex usually adds
> complexity and hurts readability.
>
> You might find https://regex101.com/ to be useful for testing your
> regex. You can enter in sample data and see if it matches.
>
> If I understood what your regex was trying to do I might be able to
> suggest some python to do the same thing. Is it just removing numbers
> from text?
>
> The for loop, "for bit in bits" etc, could be written as a list
> comprehension.
>
> pieces = [bit if len(bit) > 6 else "" for bit in bits]
>
> For devs familiar with other languages but new to Python this will
> look like gibberish so arguably the original for loop is clearer,
> depending on your team.
>
> It's worth making the effort to get into list comprehensions though
> because they're awesome.

I agree qualitatively 100%, but quantitatively perhaps I agree 80%,
where readability is easy.

I think that's what you are saying anyway.

>
>
>
>>
>> That little re.sub() came from ChatGPT and I can understand it
>> without too much effort because it came documented
>>
>> I suppose ChatGPT is the answer to this thread. Or everything. Or
>> will be.
>
> I am doubtful. We'll see!
>
> R
>
>


Re: Code improvement question [ In reply to ]

python-list at python

Nov 17, 2023, 1:38 AM

Post #9 of 17 (393 views)

Mike Dewhirst wrote:
> On 15/11/2023 10:25 am, MRAB via Python-list wrote:
>> On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
>>> I'd like to improve the code below, which works. It feels clunky to me.
>>>
>>> I need to clean up user-uploaded files the size of which I don't know in
>>> advance.
>>>
>>> After cleaning they might be as big as 1Mb but that would be super rare.
>>> Perhaps only for testing.
>>>
>>> I'm extracting CAS numbers and here is the pattern xx-xx-x up to
>>> xxxxxxx-xx-x eg., 1012300-77-4
>>>
>>> def remove_alpha(txt):
>>>
>>>     """ r'[^0-9\- ]':
>>>
>>>     [^...]: Match any character that is not in the specified set.
>>>
>>>     0-9: Match any digit.
>>>
>>>     \: Escape character.
>>>
>>>     -: Match a hyphen.
>>>
>>>     Space: Match a space.
>>>
>>>     """
>>>
>>>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>>>
>>>     bits = cleaned_txt.split()
>>>
>>>     pieces = []
>>>
>>>     for bit in bits:
>>>
>>>         # minimum size of a CAS number is 7 so drop smaller clumps
>>> of digits
>>>
>>>         pieces.append(bit if len(bit) > 6 else "")
>>>
>>>     return " ".join(pieces)
>>>
>>>
>>> Many thanks for any hints
>>>
>> Why don't you use re.findall?
>>
>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>
> I think I can see what you did there but it won't make sense to me - or
> whoever looks at the code - in future.
>
> That answers your specific question. However, I am in awe of people who
> can just "do" regular expressions and I thank you very much for what
> would have been a monumental effort had I tried it.
>
> That little re.sub() came from ChatGPT and I can understand it without
> too much effort because it came documented
>
> I suppose ChatGPT is the answer to this thread. Or everything. Or will be.
>
> Thanks
>
> Mike

I respect your opinion, but from the point of view of many Usenet users,
asking ChatGPT to solve this problem is truly overkill. The computer
world overflows with people who know regex. If you had not already
received an answer using 're', I would have sent you my suggestion,
which as you can see is practically identical. I am quite sure the same
solution came to the mind of many people on this list.

with open(file) as fp:
    try: ret = re.findall(r'\b\d{2,7}\-\d{2}\-\d{1}\b', fp.read())
    except: ret = []

The only difference is '\d' instead of '[0-9]' but they are equivalent.
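A small caveat on that equivalence, as a sketch: for str patterns in Python 3, \d also matches decimal digits from other scripts unless the re.ASCII flag is given, while [0-9] matches only the ten ASCII digits. For plain ASCII input the two behave the same. The sample text (ASCII digits followed by Arabic-Indic digits) is made up for illustration.

```python
import re

# "42" in ASCII digits, then the same number in Arabic-Indic digits (U+0664, U+0662).
text = "42 and \u0664\u0662"

unicode_digits = re.findall(r'\d+', text)            # \d is Unicode-aware for str patterns
ascii_class = re.findall(r'[0-9]+', text)            # explicit ASCII range
ascii_flag = re.findall(r'\d+', text, re.ASCII)      # \d restricted to ASCII digits

print(len(unicode_digits), len(ascii_class), len(ascii_flag))
```

Note also that this pattern ends in \d{1} where the earlier one ends in [0-9]{2}, so the two are not quite identical in what they accept.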


Re: Code improvement question [ In reply to ]

python-list at python

Nov 17, 2023, 3:17 AM

Post #10 of 17 (396 views)

On 2023-11-16 11:34:16 +1300, Rimu Atkinson via Python-list wrote:
> > > Why don't you use re.findall?
> > >
> > > re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
> >
> > I think I can see what you did there but it won't make sense to me - or
> > whoever looks at the code - in future.
> >
> > That answers your specific question. However, I am in awe of people who
> > can just "do" regular expressions and I thank you very much for what
> > would have been a monumental effort had I tried it.
>
> I feel the same way about regex. If I can find a way to write something
> without regex I very much prefer to as regex usually adds complexity and
> hurts readability.

I find "straight" regexps very easy to write. There are only a handful
of constructs which are all very simple and you just string them
together. But then I've used regexps for 30+ years, so of course they
feel natural to me.

(Reading regexps may be a bit harder, exactly because they are too
simple: There is no abstraction, so a complicated pattern results in a
long regexp.)

There are some extensions to regexps which are conceptually harder, like
lookahead and lookbehind or nested contexts in Perl. I may need the
manual for those (especially because they are new(ish) and every
language uses a different syntax for them) or avoid them altogether.

Oh, and Python (just like Perl) allows you to embed whitespace and
comments into Regexps, which helps readability a lot if you have to
write long regexps.

> You might find https://regex101.com/ to be useful for testing your regex.
> You can enter in sample data and see if it matches.
>
> If I understood what your regex was trying to do I might be able to suggest
> some python to do the same thing. Is it just removing numbers from text?

Not "removing" them (as I understood it), but extracting them (i.e. find
and collect them).

> > > re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

\b          - a word boundary.
[0-9]{2,7}  - 2 to 7 digits
-           - a hyphen-minus
[0-9]{2}    - exactly 2 digits
-           - a hyphen-minus
[0-9]{2}    - exactly 2 digits
\b          - a word boundary.

Seems quite straightforward to me. I'll be impressed if you can write
that in Python in a way which is easier to read.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

Re: Code improvement question [ In reply to ]

python-list at python

Nov 17, 2023, 4:48 AM

Post #11 of 17 (395 views)

On 11/17/2023 6:17 AM, Peter J. Holzer via Python-list wrote:
> On 2023-11-16 11:34:16 +1300, Rimu Atkinson via Python-list wrote:
>>>> Why don't you use re.findall?
>>>>
>>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>>>
>>> I think I can see what you did there but it won't make sense to me - or
>>> whoever looks at the code - in future.
>>>
>>> That answers your specific question. However, I am in awe of people who
>>> can just "do" regular expressions and I thank you very much for what
>>> would have been a monumental effort had I tried it.
>>
>> I feel the same way about regex. If I can find a way to write something
>> without regex I very much prefer to as regex usually adds complexity and
>> hurts readability.
>
> I find "straight" regexps very easy to write. There are only a handful
> of constructs which are all very simple and you just string them
> together. But then I've used regexps for 30+ years, so of course they
> feel natural to me.
>
> (Reading regexps may be a bit harder, exactly because they are too
> simple: There is no abstraction, so a complicated pattern results in a
> long regexp.)
>
> There are some extensions to regexps which are conceptually harder, like
> lookahead and lookbehind or nested contexts in Perl. I may need the
> manual for those (especially because they are new(ish) and every
> language uses a different syntax for them) or avoid them altogether.
>
> Oh, and Python (just like Perl) allows you to embed whitespace and
> comments into Regexps, which helps readability a lot if you have to
> write long regexps.
>
>
>> You might find https://regex101.com/ to be useful for testing your regex.
>> You can enter in sample data and see if it matches.
>>
>> If I understood what your regex was trying to do I might be able to suggest
>> some python to do the same thing. Is it just removing numbers from text?
>
> Not "removing" them (as I understood it), but extracting them (i.e. find
> and collect them).
>
>>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>
> \b - a word boundary.
> [0-9]{2,7} - 2 to 7 digits
> - - a hyphen-minus
> [0-9]{2} - exactly 2 digits
> - - a hyphen-minus
> [0-9]{2} - exactly 2 digits
> \b - a word boundary.
>
> Seems quite straightforward to me. I'll be impressed if you can write
> that in Python in a way which is easier to read.

And the re.VERBOSE (also re.X) flag can always be used so the entire
expression can be written line-by-line with comments nearly the same as
the example above
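A minimal sketch of what that could look like with re.VERBOSE. Two assumptions to flag: the name CAS_PATTERN is made up, and the last group is written as a single digit to follow the xx-xx-x format from the first post, which differs from the quoted pattern's final [0-9]{2}.

```python
import re

CAS_PATTERN = re.compile(
    r"""
    \b            # a word boundary
    [0-9]{2,7}    # 2 to 7 digits
    -             # a hyphen-minus
    [0-9]{2}      # exactly 2 digits
    -             # a hyphen-minus
    [0-9]         # exactly 1 digit (assumption: per the xx-xx-x format)
    \b            # a word boundary
    """,
    re.VERBOSE,
)

print(CAS_PATTERN.findall("water 7732-18-5 and 1012300-77-4"))
```

In verbose mode, whitespace in the pattern is ignored and "#" starts a comment, so a literal space or "#" would need escaping or a character class.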


Re: Code improvement question [ In reply to ]

python-list at python

Nov 17, 2023, 6:46 AM

Post #12 of 17 (394 views)

On 2023-11-17 07:48:41 -0500, Thomas Passin via Python-list wrote:
> On 11/17/2023 6:17 AM, Peter J. Holzer via Python-list wrote:
> > Oh, and Python (just like Perl) allows you to embed whitespace and
> > comments into Regexps, which helps readability a lot if you have to
> > write long regexps.
> >
[...]
> > > > > re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
> >
> > \b - a word boundary.
> > [0-9]{2,7} - 2 to 7 digits
> > - - a hyphen-minus
> > [0-9]{2} - exactly 2 digits
> > - - a hyphen-minus
> > [0-9]{2} - exactly 2 digits
> > \b - a word boundary.
> >
> > Seems quite straightforward to me. I'll be impressed if you can write
> > that in Python in a way which is easier to read.
>
> And the re.VERBOSE (also re.X) flag can always be used so the entire
> expression can be written line-by-line with comments nearly the same
> as the example above

Yes. That's what I alluded to above.

hp


Re: Code improvement question [ In reply to ]

python-list at python

Nov 17, 2023, 7:17 AM

Post #13 of 17 (394 views)

On 11/17/2023 9:46 AM, Peter J. Holzer via Python-list wrote:
> On 2023-11-17 07:48:41 -0500, Thomas Passin via Python-list wrote:
>> On 11/17/2023 6:17 AM, Peter J. Holzer via Python-list wrote:
>>> Oh, and Python (just like Perl) allows you to embed whitespace and
>>> comments into Regexps, which helps readability a lot if you have to
>>> write long regexps.
>>>
> [...]
>>>>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>>>
>>> \b - a word boundary.
>>> [0-9]{2,7} - 2 to 7 digits
>>> - - a hyphen-minus
>>> [0-9]{2} - exactly 2 digits
>>> - - a hyphen-minus
>>> [0-9]{2} - exactly 2 digits
>>> \b - a word boundary.
>>>
>>> Seems quite straightforward to me. I'll be impressed if you can write
>>> that in Python in a way which is easier to read.
>>
>> And the re.VERBOSE (also re.X) flag can always be used so the entire
>> expression can be written line-by-line with comments nearly the same
>> as the example above
>
> Yes. That's what I alluded to above.

I know, and I just wanted to make it explicit for people who didn't know
much about Python regexes.


Re: Code improvement question [ In reply to ]

python-list at python

Nov 17, 2023, 10:56 AM

Post #14 of 17 (393 views)

On 2023-11-17 09:38, jak via Python-list wrote:
> Mike Dewhirst wrote:
>> On 15/11/2023 10:25 am, MRAB via Python-list wrote:
>>> On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
>>>> I'd like to improve the code below, which works. It feels clunky to me.
>>>>
>>>> I need to clean up user-uploaded files the size of which I don't know in
>>>> advance.
>>>>
>>>> After cleaning they might be as big as 1Mb but that would be super rare.
>>>> Perhaps only for testing.
>>>>
>>>> I'm extracting CAS numbers and here is the pattern xx-xx-x up to
>>>> xxxxxxx-xx-x eg., 1012300-77-4
>>>>
>>>> def remove_alpha(txt):
>>>>
>>>>     """ r'[^0-9\- ]':
>>>>
>>>>     [^...]: Match any character that is not in the specified set.
>>>>
>>>>     0-9: Match any digit.
>>>>
>>>>     \: Escape character.
>>>>
>>>>     -: Match a hyphen.
>>>>
>>>>     Space: Match a space.
>>>>
>>>>     """
>>>>
>>>>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>>>>
>>>>     bits = cleaned_txt.split()
>>>>
>>>>     pieces = []
>>>>
>>>>     for bit in bits:
>>>>
>>>>         # minimum size of a CAS number is 7 so drop smaller clumps
>>>> of digits
>>>>
>>>>         pieces.append(bit if len(bit) > 6 else "")
>>>>
>>>>     return " ".join(pieces)
>>>>
>>>>
>>>> Many thanks for any hints
>>>>
>>> Why don't you use re.findall?
>>>
>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>>
>> I think I can see what you did there but it won't make sense to me - or
>> whoever looks at the code - in future.
>>
>> That answers your specific question. However, I am in awe of people who
>> can just "do" regular expressions and I thank you very much for what
>> would have been a monumental effort had I tried it.
>>
>> That little re.sub() came from ChatGPT and I can understand it without
>> too much effort because it came documented
>>
>> I suppose ChatGPT is the answer to this thread. Or everything. Or will be.
>>
>> Thanks
>>
>> Mike
>
> I respect your opinion but from the point of view of many usenet users
> asking a question to chatgpt to solve your problem is truly an overkill.
> The computer world overflows with people who know regex. If you had not
> already had the answer with the use of 're' I would have sent you my
> suggestion that as you can see it is practically identical. I am quite
> sure that in this usenet the same solution came to the mind of many
> people.
>
> with open(file) as fp:
> try: ret = re.findall(r'\b\d{2,7}\-\d{2}\-\d{1}\b', fp.read())
> except: ret = []
>
> The only difference is '\d' instead of '[0-9]' but they are equivalent.
>
Bare excepts are a very bad idea.
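A hedged sketch of the same snippet with a specific exception instead of a bare except. The choice of OSError (file missing or unreadable), the utf-8 encoding, and the function name find_cas_numbers are assumptions for illustration; anything else, such as a decoding error, is deliberately left to propagate.

```python
import re

# Same shape as the pattern quoted above, with the redundant escapes dropped.
CAS_RE = re.compile(r'\b\d{2,7}-\d{2}-\d\b')


def find_cas_numbers(path):
    """Return CAS-shaped tokens from the file at path, or [] if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as fp:
            return CAS_RE.findall(fp.read())
    except OSError:
        # Only I/O failures are swallowed; KeyboardInterrupt, typos that raise
        # NameError, and similar problems still surface.
        return []
```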

Re: Code improvement question [ In reply to ]

python-list at python

Nov 17, 2023, 3:53 PM

Post #15 of 17 (392 views)

MRAB wrote:
> Bare excepts are a very bad idea.

I know, you're right, but for testing the CAS numbers were inside a
string (txt), and instead of 'open(file)' there was 'io.StringIO(txt)',
so the risk was almost nil. When I copied it here I didn't think about
it. Sorry.


RE: Code improvement question [ In reply to ]

python-list at python

Nov 17, 2023, 10:55 PM

Post #16 of 17 (391 views)

Many features like regular expressions can be mini languages that are designed to be very powerful while also a tad cryptic to anyone not familiar.

But consider an alternative in some languages that may use some complex set of nested function calls that each have names like match_white_space(2, 5) and even if some are set up to be sort of readable, they can be a pain. Quite a few problems can be solved nicely with a single regular expression or several in a row with each one being fairly simple. Sometimes you can do parts using some of the usual text manipulation functions built-in or in a module for either speed or to simplify things so that the RE part is simpler and easier to follow.

And, as noted, Python allows ways to include comments in an RE or ways to specify extensions such as Perl-style and so on. Adding enough comments above or within the code can help remind people or point to a reference, and just explaining in English (or the language of your choice that hopefully others later can understand) can be helpful. You can spell out in whatever level of detail what you expect your data to look like and what you want to match or extract, and then the RE may be easier to follow.

Of course the endless extensions added due to things like supporting UNICODE have made some RE much harder to create or understand and sometimes the result may not even be what you expected if something strange happens like the symbols ???

The above might match digits and maybe be interpreted at some point as 12 dozen, which may even be appropriate but a bit of a surprise perhaps.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Peter J. Holzer via Python-list
Sent: Friday, November 17, 2023 6:18 AM
To: python-list@python.org
Subject: Re: Code improvement question

On 2023-11-16 11:34:16 +1300, Rimu Atkinson via Python-list wrote:
> > > Why don't you use re.findall?
> > >
> > > re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
> >
> > I think I can see what you did there but it won't make sense to me - or
> > whoever looks at the code - in future.
> >
> > That answers your specific question. However, I am in awe of people who
> > can just "do" regular expressions and I thank you very much for what
> > would have been a monumental effort had I tried it.
>
> I feel the same way about regex. If I can find a way to write something
> without regex I very much prefer to as regex usually adds complexity and
> hurts readability.

I find "straight" regexps very easy to write. There are only a handful
of constructs which are all very simple and you just string them
together. But then I've used regexps for 30+ years, so of course they
feel natural to me.

(Reading regexps may be a bit harder, exactly because they are too
simple: There is no abstraction, so a complicated pattern results in a
long regexp.)

There are some extensions to regexps which are conceptually harder, like
lookahead and lookbehind or nested contexts in Perl. I may need the
manual for those (especially because they are new(ish) and every
language uses a different syntax for them) or avoid them altogether.

Oh, and Python (just like Perl) allows you to embed whitespace and
comments into Regexps, which helps readability a lot if you have to
write long regexps.

> You might find https://regex101.com/ to be useful for testing your regex.
> You can enter in sample data and see if it matches.
>
> If I understood what your regex was trying to do I might be able to suggest
> some python to do the same thing. Is it just removing numbers from text?

Not "removing" them (as I understood it), but extracting them (i.e. find
and collect them).

> > > re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

\b - a word boundary.
[0-9]{2,7} - 2 to 7 digits
- - a hyphen-minus
[0-9]{2} - exactly 2 digits
- - a hyphen-minus
[0-9]{2} - exactly 2 digits
\b - a word boundary.

Seems quite straightforward to me. I'll be impressed if you can write
that in Python in a way which is easier to read.

hp


Re: Code improvement question [ In reply to ]

python-list at python

Nov 20, 2023, 12:48 PM

Post #17 of 17 (357 views)

>
>>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>
> \b - a word boundary.
> [0-9]{2,7} - 2 to 7 digits
> - - a hyphen-minus
> [0-9]{2} - exactly 2 digits
> - - a hyphen-minus
> [0-9]{2} - exactly 2 digits
> \b - a word boundary.
>
> Seems quite straightforward to me. I'll be impressed if you can write
> that in Python in a way which is easier to read.
>

Now that I know what {} does, you're right, that IS straightforward!
Maybe 2023 will be the year I finally get off my arse and learn regex.
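For anyone else meeting {} for the first time, a tiny sketch of the {m,n} quantifier on its own: [0-9]{2,7} accepts a run of two to seven digits and nothing shorter or longer. The candidate strings are made up for illustration.

```python
import re

# fullmatch requires the whole string to fit the pattern, so length limits are visible.
results = {
    candidate: bool(re.fullmatch(r'[0-9]{2,7}', candidate))
    for candidate in ["1", "12", "1234567", "12345678"]
}

for candidate, ok in results.items():
    print(candidate, ok)
```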

Thanks :)
