Mailing List Archive: help with simple regular expression grouping with re

May 7, 1999, 8:58 PM

Post #1 of 13 (2072 views)

Python gods and goddesses,

Being relatively new to Python, I am trying to do something using re and
cannot figure out the right pattern to do what I want.

The input that I am parsing is a typical "mail merge" file, containing
comma separated fields that are surrounded by double quotes. A typical
line is:

"field 1", "field 2","field 3 has is different, it has an embedded
comma","this one doesn't"

I am trying to get a list of fields that are the strings that are
between the quotes, including any embedded commas.

Can one of you be so kind as to nudge me in the direction of a re
pattern that would work?

Python newbie, but loving it,
Bob

help with simple regular expression grouping with re [ In reply to ]

tim_one at email

May 7, 1999, 10:05 PM

Post #2 of 13 (2031 views)

[Bob Horvath]
> Being relatively new to Python, I am trying to do something using re and
> cannot figure out the right pattern to do what I want.

That's OK -- regular expressions are tricky! Be sure to read

http://www.python.org/doc/howto/regex/regex.html

for a gentler intro than the reference manual has time to give.

> The input that I am parsing is a typical "mail merge" file, containing
> comma separated fields that are surrounded by double quotes. A typical
> line is:
>
> "field 1", "field 2","field 3 has is different, it has an embedded
> comma","this one doesn't"
>
> I am trying to get a list of fields that are the strings that are
> between the quotes, including any embedded commas.

Note that regexps are utterly unforgiving -- the first two fields in your
example aren't separated by a comma, but by a comma followed by a blank. I
don't know whether that was a typo or a requirement, so let's write
something that doesn't care <wink>:

import re
pattern = re.compile(r"""
    "        # match an open quote
    (        # start a group so re.findall returns only this part
      [^"]*? # match shortest run of non-quote characters
    )        # close the group
    "        # and match the close quote
""", re.VERBOSE)

answer = re.findall(pattern, your_example)
for field in answer:
    print(field)

That prints:

field 1
field 2
field 3 has is different, it has an embedded comma
this one doesn't

Just study that until your eyes bleed <wink>.
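
For anyone who wants to run this today, here is a minimal self-contained sketch of the same pattern in Python 3. The contents of your_example are an assumption: Bob's sample line, keeping both the blank after the first comma and the line break inside the third field, so the printed third field contains a newline rather than the single line shown above.

```python
import re

# Assumed sample input: Bob's line, with the blank after the first comma
# and the line break inside the third field left in place.
your_example = ('"field 1", "field 2","field 3 has is different, it has an embedded\n'
                'comma","this one doesn\'t"')

pattern = re.compile(r'''
    "          # match an open quote
    (          # start a group so findall returns only this part
      [^"]*?   # shortest run of non-quote characters (newlines included)
    )          # close the group
    "          # and match the close quote
''', re.VERBOSE)

fields = pattern.findall(your_example)
for field in fields:
    print(repr(field))
```

Printing with repr() makes the embedded newline visible, which is easy to miss otherwise.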

defender-of-python-and-corrupter-of-youth-ly y'rs - tim

help with simple regular expression grouping with re [ In reply to ]

May 8, 1999, 9:21 PM

Post #3 of 13 (2034 views)

Tim Peters wrote:

> [Bob Horvath]
> > Being relatively new to Python, I am trying to do something using re and
> > cannot figure out the right pattern to do what I want.
>
> That's OK -- regular expressions are tricky! Be sure to read
>
> http://www.python.org/doc/howto/regex/regex.html
>
> for a gentler intro than the reference manual has time to give.
>

Thanks, it is a little easier.

>
> > The input that I am parsing is a typical "mail merge" file, containing
> > comma separated fields that are surrounded by double quotes. A typical
> > line is:
> >
> > "field 1", "field 2","field 3 has is different, it has an embedded
> > comma","this one doesn't"
> >
> > I am trying to get a list of fields that are the strings that are
> > between the quotes, including any embedded commas.
>
> Note that regexps are utterly unforgiving -- the first two fields in your
> example aren't separated by a comma, but by a comma followed by a blank. I
> don't know whether that was a typo or a requirement, so let's write
> something that doesn't care <wink>:

It was a typo. The commas do not have blanks around them when separating
fields. Nor are there any blanks or other white space outside of the
double quoted fields.

>
>
> import re
> pattern = re.compile(r"""
>     "        # match an open quote
>     (        # start a group so re.findall returns only this part
>       [^"]*? # match shortest run of non-quote characters
>     )        # close the group
>     "        # and match the close quote
> """, re.VERBOSE)
>
> answer = re.findall(pattern, your_example)
> for field in answer:
>     print(field)
>
> That prints:
>
> field 1
> field 2
> field 3 has is different, it has an embedded comma
> this one doesn't
>
> Just study that until your eyes bleed <wink>.
>

Well, I did a lot of searching around before and after my original post, and
while findall seems to be the thing I want, I am using 1.5.1, which apparently
does not have it. I can upgrade my Linux system, but the system where it will
ultimately run might be a different story.

Is there a way to do the equivalent of findall on releases prior to having it?

Downloading-a-new-version-now-to-see-if-there-is-a-re.findall.py,
Bob

help with simple regular expression grouping with re [ In reply to ]

tim_one at email

May 9, 1999, 12:05 AM

Post #4 of 13 (2031 views)

[Bob Horvath]
> ...
> Well, I did a lot of searching around before and after my
> original post, and while findall seems to be the thing I want,
> I am using 1.5.1, which apparently does not have it.

Yes, re.findall is new in 1.5.2.

> I can upgrade my Linux system, but the system where it will
> ultimately run might be a different story.
>
> Is there a way to do the equivalent of findall on releases prior
> to having it?

How far back may you need to go? Go back far enough, and you won't even
find re <wink>.

The source for findall is in re.py -- it's just a loop written in Python.
You can easily (provided you understand all the pieces first!) write the
same thing yourself.

Here's a Pythonic way to write modules that work with old and new releases:

import re
if hasattr(re, "findall"):
    # use the system-supplied version
    from re import findall
else:
    # oops! not there -- implement it ourselves
    def findall(pattern, s):
        ...

Doesn't come up often, but painless to accommodate when it does. Note that
"def" is an executable stmt, not a declaration (and ditto for "from ...
import ...", etc), so that if re.findall exists, the "def findall" isn't
executed.
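
As an illustration of what such a hand-written loop could look like, here is a rough sketch in Python 3 syntax. It is not the actual source from re.py; the name findall_fallback is hypothetical, and the handling of empty matches and of unmatched groups is a simplification that may differ from the real re.findall in edge cases.

```python
import re

def findall_fallback(pattern, s):
    """Hypothetical stand-in for re.findall, built on repeated search() calls."""
    regex = re.compile(pattern)   # also accepts an already compiled pattern
    results = []
    pos = 0
    while pos <= len(s):
        m = regex.search(s, pos)
        if m is None:
            break
        if regex.groups == 0:
            results.append(m.group(0))   # no groups: keep the whole match
        elif regex.groups == 1:
            results.append(m.group(1))   # one group: keep just that group
        else:
            results.append(m.groups())   # several groups: keep a tuple
        # Step past an empty match so the loop always advances.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return results

sample = '"field 1","field 2, with a comma"'
fields = findall_fallback(r'"([^"]*)"', sample)
print(fields)
```

The loop relies only on compile(), search() with a start position, and match objects, so the same idea carries over to releases that lack findall.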

one-of-these-days-it-will-all-look-so-simple-you'll-laugh<wink>-ly y'rs -
tim

help with simple regular expression grouping with re [ In reply to ]

alex at somewhere

May 9, 1999, 12:02 PM

Post #5 of 13 (2033 views)

> The source for findall is in re.py -- it's just a loop written in
> Python. You can easily (provided you understand all the pieces
> first!) write the same thing yourself.

You can also simply get a copy of the new re module and put it somewhere
easy to import. I did that for a while, and it worked pretty well.

Alex.

help with simple regular expression grouping with re [ In reply to ]

May 9, 1999, 1:30 PM

Post #6 of 13 (2031 views)

Alex wrote:

> > The source for findall is in re.py -- it's just a loop written in
> > Python. You can easily (provided you understand all the pieces
> > first!) write the same thing yourself.
>
> You can also simply get a copy of the new re module and put it somewhere
> easy to import. I did that for a while, and it worked pretty well.

Duh, of course. Why didn't I think of that?

Thanks,
Bob

help with simple regular expression grouping with re [ In reply to ]

dfan at harmonixmusic

May 10, 1999, 6:58 AM

Post #7 of 13 (2028 views)

"Tim Peters" <tim_one@email.msn.com> writes:

| import re
| pattern = re.compile(r"""
|     "        # match an open quote
|     (        # start a group so re.findall returns only this part
|       [^"]*? # match shortest run of non-quote characters
|     )        # close the group
|     "        # and match the close quote
| """, re.VERBOSE)
|
| answer = re.findall(pattern, your_example)
| for field in answer:
|     print(field)

This works for a tricky reason, which people should be aware of.

I had just written the following response to your code:

Not that it's important, but technically, what you did was overkill.
Because *? is non-greedy, it won't match any quote characters,
because it will be happy to hand off the quote to the next element
of the regexp, which does match it.

So "(.*?)" and "([^"]*)" both solve the problem; you don't need to
disallow quotes _and_ match non-greedily.

And then I decided to test it, just to make sure (replacing '[^"]'
with '.'), and... it failed. Because '.' doesn't match newlines by
default. When I added re.DOTALL to the options at the end, it worked
fine.

Your example works because the character class [^"] (everything
but a double quote) happens to include newlines too. (Actually, I
think you took the newlines out of the input string before you tested
it, so maybe you were just lucky).

So my new claim is that the following is the 'best' regexp, for my
personal definition of best (internal comments deleted):

pattern = re.compile(r'"(.*?)"', re.VERBOSE | re.DOTALL)
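
The newline behaviour described above can be checked with a short sketch like the following (Python 3). The sample line is an assumption, shortened from Bob's example, with a line break inside the second field.

```python
import re

# Assumed input: two quoted fields, the second one broken across lines.
line = '"field 1","field 3 has an embedded\ncomma"'

# '.' does not match a newline by default, so the second field is lost.
dot_default = re.findall(r'"(.*?)"', line)

# With re.DOTALL, '.' matches a newline as well.
dot_all = re.findall(r'"(.*?)"', line, re.DOTALL)

# The negated character class matches newlines without any flag.
char_class = re.findall(r'"([^"]*)"', line)

print(dot_default)
print(dot_all)
print(char_class)
```

Only the first call drops the field that spans the line break; the other two agree.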

--
Dan Schmidt -> dfan@harmonixmusic.com, dfan@alum.mit.edu
Honest Bob & the http://www2.thecia.net/users/dfan/
Factory-to-Dealer Incentives -> http://www2.thecia.net/users/dfan/hbob/
Gamelan Galak Tika -> http://web.mit.edu/galak-tika/www/

help with simple regular expression grouping with re [ In reply to ]

May 10, 1999, 8:19 AM

Post #8 of 13 (2030 views)

In article <wkbtftdo1r.fsf@turangalila.harmonixmusic.com>,
Dan Schmidt <dfan@harmonixmusic.com> wrote:
>
> So "(.*?)" and "([^"]*)" both solve the problem; you don't need to
> disallow quotes _and_ match non-greedily.

In general, though, character classes are much faster than *any* form of
"." that might involve backtracking. I believe this is still true even
with "non-greedy" specified. People who use regexes heavily tend to
automatically try to phrase things via character classes when possible.
What I don't know is whether using non-greedy with a character class
adds anything.
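
One way to test the speed claim on a particular interpreter is to time both forms. The sketch below (Python 3) is only a rough benchmark on made-up input; the numbers depend on the Python version and the data, so no result is claimed here.

```python
import re
import timeit

# Made-up input: 200 quoted fields, each with an embedded comma.
line = ",".join('"field %d, with a comma"' % i for i in range(200))

char_class = re.compile(r'"([^"]*)"')          # greedy character class
lazy_dot = re.compile(r'"(.*?)"', re.DOTALL)   # non-greedy dot

# Both patterns should extract the same fields from this input.
same_fields = char_class.findall(line) == lazy_dot.findall(line)

t_class = timeit.timeit(lambda: char_class.findall(line), number=200)
t_lazy = timeit.timeit(lambda: lazy_dot.findall(line), number=200)

print("same fields:", same_fields)
print("character class: %.4f s" % t_class)
print("lazy dot:        %.4f s" % t_lazy)
```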
--
--- Aahz (@netcom.com)

Hugs and backrubs -- I break Rule 6 <*> http://www.rahul.net/aahz/
Androgynous poly kinky vanilla queer het

"In the end, outside of spy agencies, people are far too trusting and
willing to help." -- Ira Winkler

help with simple regular expression grouping with re [ In reply to ]

May 10, 1999, 2:20 PM

Post #9 of 13 (2023 views)

Dan Schmidt wrote:
>
> "Tim Peters" <tim_one@email.msn.com> writes:
>
> | import re
> | pattern = re.compile(r"""
> |     "        # match an open quote
> |     (        # start a group so re.findall returns only this part
> |       [^"]*? # match shortest run of non-quote characters
> |     )        # close the group
> |     "        # and match the close quote
> | """, re.VERBOSE)
> |
> | answer = re.findall(pattern, your_example)
> | for field in answer:
> |     print(field)

One thing to be careful of (I think) is that comma separated values
formats tend to quote " characters by doubling them. Consequently, you
might see input like:

"W. Shakespeare","""To be or not to be..."""

This can wreak havoc with re matching...

My caveat is that I've never actually seen a spec for CSV files. That's
my experience, however.
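
To make the havoc concrete, here is a small sketch (Python 3) that runs the simple quote-to-quote pattern over that line. It assumes the doubled-quote convention described above.

```python
import re

line = '"W. Shakespeare","""To be or not to be..."""'

# The simple pattern pairs up quotes blindly, so each doubled quote
# is read as an empty field of its own.
fields = re.findall(r'"([^"]*)"', line)
print(fields)
```

Instead of two fields, the result has four: the author, an empty string, the quotation without its quotes, and another empty string.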

--
Skip Montanaro | Mojam: "Uniting the World of Music"
http://www.mojam.com/
skip@mojam.com | Musi-Cal: http://www.musi-cal.com/
518-372-5583

help with simple regular expression grouping with re [ In reply to ]

tim_one at email

May 10, 1999, 10:26 PM

Post #10 of 13 (2030 views)

[Tim]
> | import re
> | pattern = re.compile(r"""
> |     "        # match an open quote
> |     (        # start a group so re.findall returns only this part
> |       [^"]*? # match shortest run of non-quote characters
> |     )        # close the group
> |     "        # and match the close quote
> | """, re.VERBOSE)
> |
> | answer = re.findall(pattern, your_example)
> | for field in answer:
> |     print(field)

[Dan Schmidt]
> This works for a tricky reason, which people should be aware of.

*All* regexps work for a tricky reason -- or, at least, the ones that
actually do work <wink>.

> I had just written the following response to your code:
>
> Not that it's important, but technically, what you did was overkill.
> Because *? is non-greedy, it won't match any quote characters,
> because it will be happy to hand off the quote to the next element
> of the regexp, which does match it.
>
> So "(.*?)" and "([^"]*)" both solve the problem; you don't need to
> disallow quotes _and_ match non-greedily.
>
> And then I decided to test it, just to make sure (replacing '[^"]'
> with '.'), and... it failed. Because '.' doesn't match newlines by
> default. When I added re.DOTALL to the options at the end, it worked
> fine.
>
> Your example works because the character class [^"] (everything
> but a double quote) happens to include newlines too. (Actually, I
> think you took the newlines out of the input string before you tested
> it, so maybe you were just lucky).

I tested it both ways, reported on one, and have no idea which way is
correct: every time CSV parsing comes up, the questioner is unable to
define what (exactly) the rules are, and the appearance of line breaks in
the original example could simply be an artifact of a transport or mailer
breaking a long line. In the face of the unknown, seemed better to be
permissive.

> So my new claim is that the following is the 'best' regexp, for my
> personal definition of best (internal comments deleted):
>
> pattern = re.compile(r'"(.*?)"', re.VERBOSE | re.DOTALL)

The original was indeed overkill, but for another reason <wink>: it's also
the case that whenever CSV parsing comes up, a later msg in the thread goes
"oh! I forgot -- it can have *embedded* quotes too". Writing it [^"] is
anticipating a step in how the regexp will need to be changed anyway to
accommodate whichever escape convention they think they've
reverse-engineered <0.1 wink>.

Even without that prognostication, though, a greedy "([^"]*)" is (as Aahz
said) likely to run faster than a non-greedy "(.*?)". [^"]* is also more
robust, in that it unconditionally forbids matching a double quote in the
guts; what .*? matches depends on context, and will happily chew up double
quotes too if the context requires it for the *context* to match. In this
particular regexp as a whole that won't happen, but under *modification*
context-sensitive submatches are notoriously prone to surprises.

In any case, I certainly didn't need to do both [^"] and *? in the original!
My "best" would consist of removing the question mark <wink>.
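
A small sketch of the "under modification" hazard, in Python 3. The modification is hypothetical: suppose the pattern is later changed to require a semicolon after the closing quote. The input line is made up for the illustration.

```python
import re

# Made-up input: only the second field is followed by a semicolon.
line = '"a","b";'

# Non-greedy dot: to let the trailing context match, it swallows quotes.
lazy = re.findall(r'"(.*?)";', line)

# Character class: a double quote can never end up inside the group.
strict = re.findall(r'"([^"]*)";', line)

print(lazy)
print(strict)
```

The lazy version returns one "field" that spans both real fields, quotes and comma included, while the character-class version returns only the field that is actually followed by a semicolon.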

otoh-if-embedded-quotes-are-really-illegal-string.split-with-a-little-
post-processing-would-be-best-of-all-ly y'rs - tim

help with simple regular expression grouping with re [ In reply to ]

May 11, 1999, 12:41 AM

Post #11 of 13 (2035 views)

Tim Peters wrote:

> [Tim]
> > | import re
> > | pattern = re.compile(r"""
> > |     "        # match an open quote
> > |     (        # start a group so re.findall returns only this part
> > |       [^"]*? # match shortest run of non-quote characters
> > |     )        # close the group
> > |     "        # and match the close quote
> > | """, re.VERBOSE)
> > |
> > | answer = re.findall(pattern, your_example)
> > | for field in answer:
> > |     print(field)
>
> [Dan Schmidt]
> > This works for a tricky reason, which people should be aware of.
>
> *All* regexps work for a tricky reason -- or, at least, the ones that
> actually do work <wink>.
>
> > I had just written the following response to your code:
> >
> > Not that it's important, but technically, what you did was overkill.
> > Because *? is non-greedy, it won't match any quote characters,
> > because it will be happy to hand off the quote to the next element
> > of the regexp, which does match it.
> >
> > So "(.*?)" and "([^"]*)" both solve the problem; you don't need to
> > disallow quotes _and_ match non-greedily.
> >
> > And then I decided to test it, just to make sure (replacing '[^"]'
> > with '.'), and... it failed. Because '.' doesn't match newlines by
> > default. When I added re.DOTALL to the options at the end, it worked
> > fine.
> >
> > Your example works because the character class [^"] (everything
> > but a double quote) happens to include newlines too. (Actually, I
> > think you took the newlines out of the input string before you tested
> > it, so maybe you were just lucky).
>
> I tested it both ways, reported on one, and have no idea which way is
> correct: every time CSV parsing comes up, the questioner is unable to
> define what (exactly) the rules are, and the appearance of line breaks in
> the original example could simply be an artifact of a transport or mailer
> breaking a long line. In the face of the unknown, seemed better to be
> permissive.

Being the original poster....

My problem has CSV that does not cross line boundaries, and does not contain
quotes within the fields (I had to check), but probably could some day. I'll
have to try it and see what it does.

The line crossing will never happen though.

>
>
> > So my new claim is that the following is the 'best' regexp, for my
> > personal definition of best (internal comments deleted):
> >
> > pattern = re.compile(r'"(.*?)"', re.VERBOSE | re.DOTALL)
>
> The original was indeed overkill, but for another reason <wink>: it's also
> the case that whenever CSV parsing comes up, a later msg in the thread goes
> "oh! I forgot -- it can have *embedded* quotes too". Writing it [^"] is
> anticipating a step in how the regexp will need to be changed anyway to
> accommodate whichever escape convention they think they've
> reverse-engineered <0.1 wink>.
>
> Even without that prognostication, though, a greedy "([^"]*)" is (as Aahz
> said) likely to run faster than a non-greedy "(.*?)". [^"]* is also more
> robust, in that it unconditionally forbids matching a double quote in the
> guts; what .*? matches depends on context, and will happily chew up double
> quotes too if the context requires it for the *context* to match. In this
> particular regexp as a whole that won't happen, but under *modification*
> context-sensitive submatches are notoriously prone to surprises.
>
> In any case, I certainly didn't need to do both [^"] and *? in the original!
> My "best" would consist of removing the question mark <wink>.
>
> otoh-if-embedded-quotes-are-really-illegal-string.split-with-a-little-
> post-processing-would-be-best-of-all-ly y'rs - tim

help with simple regular expression grouping with re [ In reply to ]

tim_one at email

May 11, 1999, 8:50 AM

Post #12 of 13 (2030 views)

[Bob Horvath, elaborates on his flavor of comma-separated values]
> My problem has CSV that does not cross line boundaries, and does
> not contain quotes within the fields

Plus never has whitespace adjacent to the separating commas? So long as
that's all true, and assuming there's not a newline at the end of a string,
it's enough to do

answer = string.split(s[1:-1], '","')

That is, remove the leading and trailing double quotes, then split on

","

If there is a trailing newline, change s[1:-1] to s[1:-2].
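
In current Python the string module no longer provides split as a function, so the same idea would be written with the str.split method. A minimal sketch, assuming the restrictions listed above hold (no embedded quotes, no whitespace around the separating commas); the sample line is made up.

```python
# Assumed input that satisfies the restrictions above.
s = '"field 1","field 2","field 3, with an embedded comma"'

# Drop the leading and trailing double quotes, then split on ","
answer = s[1:-1].split('","')
print(answer)

# With a trailing newline, strip it first instead of adjusting the slice.
s_with_newline = s + "\n"
answer_nl = s_with_newline.rstrip("\n")[1:-1].split('","')
print(answer_nl)
```

Stripping the newline before slicing gives the same result as the s[1:-2] variant, and also works when the newline is absent.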

> (I had to check), but probably could some day. I'll have to try it
> and see what it does.

If it does, and an embedded double quote is represented by two adjacent
double quotes, then we're back to regexps; this will do as the guts of the
findall pattern:

"([^"]*(?:""[^"]*)*)"

Or if it uses backslash escapes,

"([^"\\]*(?:\\.[^"\\]*)*)"

There are more obvious ways to write those, but these run faster; see
Friedl's "Mastering Regular Expressions" for detailed explanation. Note
that with any sort of escape convention, regexps can merely *recognize* the
convention and pass it on as-is; you'll need to write some post-regexp code
to undo the escapes (if, of course, that's what you need).
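
Here is a sketch of both patterns together with one possible post-regexp step, in Python 3. The two patterns are the ones given above; the sample lines and the way the escapes are undone are assumptions made for the illustration.

```python
import re

# Pattern guts taken from the post; sample lines are made up.
doubled = re.compile(r'"([^"]*(?:""[^"]*)*)"')
backslashed = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"')

line_doubled = '"W. Shakespeare","""To be or not to be..."""'
line_backslashed = r'"say \"hi\"","plain"'

# The regexps only recognize the escapes and pass them through as-is.
raw_doubled = doubled.findall(line_doubled)
raw_backslashed = backslashed.findall(line_backslashed)

# Post-regexp step: undo the escape convention in each captured field.
clean_doubled = [f.replace('""', '"') for f in raw_doubled]
clean_backslashed = [re.sub(r'\\(.)', r'\1', f) for f in raw_backslashed]

print(raw_doubled, clean_doubled)
print(raw_backslashed, clean_backslashed)
```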

there-are-even-those-who-say-regexps-are-obscure<wink>-ly y'rs - tim

help with simple regular expression grouping with re [ In reply to ]

May 11, 1999, 11:21 AM

Post #13 of 13 (2033 views)

In message <3733B676.1A539CA9@horvath.com>
Bob Horvath <bob@horvath.com> wrote:

> The input that I am parsing is a typical "mail merge" file, containing
> comma separated fields that are surrounded by double quotes. A typical
> line is:

Although this isn't exactly the answer you were looking for, a CSV library I
coded up last year is available from:

http://eh.org/~laurie/comp/python/csv/

It's at version 0.14.

Of course, as there is no agreed CSV standard that I know of, it's possible
that it doesn't cover every conceivable case (new lines in fields is the one
that springs to mind), but it should cope with most CSV files as is, and
even puts them in a reasonably nice list/dictionary format for you to fiddle
with in Python.

Laurie
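
For readers on a current Python, the standard library has shipped its own csv module since Python 2.3, which postdates this thread. The sketch below is not Laurie's library; it only shows how the standard module treats the cases raised in this thread. The sample lines are assumptions based on the earlier posts.

```python
import csv

# Assumed sample lines: Bob's blank-after-comma case and Skip's
# doubled-quote case.
lines = [
    '"field 1", "field 2","field 3, with an embedded comma"',
    '"W. Shakespeare","""To be or not to be..."""',
]

# skipinitialspace=True ignores the blank that follows a separating comma;
# doubled quotes inside a quoted field are undone by default.
rows = list(csv.reader(lines, skipinitialspace=True))
for row in rows:
    print(row)
```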