Mailing List Archive

re.sub() loops
I am trying to do some lame HTML processing. The following
line tries to remove some unnecessary code from an HTML
file. However, Python hangs in this call:

data = re.sub('<TABLE.*?es.*?da.*?en.*?fi.*?sv.*?TABLE>','',data)

Any idea why? My environment: Python 1.5.2 under Solaris 2.5.1 - 7.0.

Andreas
re.sub() loops [ In reply to ]
In article <Pine.GSO.4.10.9904171556010.28418-100000@moses.sz-sb.de>,
Andreas Jung <ajung@sz-sb.de> wrote:
>
>I am trying to do some lame HTML processing. The following
>line tries to remove some unnecessary code from an HTML
>file. However, Python hangs in this call:
>
>data = re.sub('<TABLE.*?es.*?da.*?en.*?fi.*?sv.*?TABLE>','',data)

Does the <TABLE>...</TABLE> contain *all* the strings "es", "da", "en",
"fi", and "sv"? Or are the strings supposed to be "?es" and so on? In
any event, with six ".*?" patterns in there, the engine has to backtrack
through a huge number of combinations, so processing time can blow up
even if it's not actually hanging.

I think that if you want assistance in constructing the correct regex,
you'll need to give us more info about the data and the goal you're
trying to accomplish. You might find it productive to pick up a copy of
the O'Reilly regex book -- I'd used regexes for years, but I didn't
really learn them until I started using that book.
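
A note on the backtracking cost raised above: one way to sidestep it is to let the regex find each table block with a single lazy wildcard and do the "contains all the strings" check with plain string searches. The following is only a sketch under assumptions not stated in the thread: it is written for a current Python 3 rather than 1.5.2, it assumes tables are not nested, and the names TABLE_RE, MARKERS, contains_in_order and strip_language_tables are made up for illustration.

```python
import re

# One lazy wildcard only: each match is a single <TABLE ...> ... </TABLE> block.
# Assumption: tables are not nested inside each other.
TABLE_RE = re.compile(r"<TABLE\b.*?</TABLE>", re.DOTALL | re.IGNORECASE)

# The strings from the original pattern, in the order they appear there.
MARKERS = ("es", "da", "en", "fi", "sv")


def contains_in_order(text, markers=MARKERS):
    """Return True if every marker occurs in text, in the given order."""
    pos = 0
    for marker in markers:
        pos = text.find(marker, pos)
        if pos < 0:
            return False
        pos += len(marker)
    return True


def strip_language_tables(data):
    """Remove each table block that contains all markers in order."""
    def replace(match):
        block = match.group(0)
        return "" if contains_in_order(block) else block

    return TABLE_RE.sub(replace, data)


sample = (
    "<p>keep</p>"
    "<TABLE><TR><TD>es da en fi sv</TD></TR></TABLE>"
    "<TABLE><TD>other</TD></TABLE>"
)
result = strip_language_tables(sample)
```

Each table is scanned once by the regex and once per marker by str.find, so the work stays roughly proportional to the size of the document instead of multiplying with every extra wildcard.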
--
--- Aahz (@netcom.com)

re.sub() loops [ In reply to ]
On Sat, Apr 17, 1999 at 02:35:48PM +0000, Aahz Maruch wrote:
> In article <Pine.GSO.4.10.9904171556010.28418-100000@moses.sz-sb.de>,
> Andreas Jung <ajung@sz-sb.de> wrote:
> >
> >I am trying to do some lame HTML processing. The following
> >line tries to remove some unnecessary code from an HTML
> >file. However, Python hangs in this call:
> >
> >data = re.sub('<TABLE.*?es.*?da.*?en.*?fi.*?sv.*?TABLE>','',data)
>
> Does the <TABLE>...</TABLE> contain *all* the strings "es", "da", "en",
> "fi", and "sv"? Or are the strings supposed to be "?es" and so on? In
> any event, with six ".*?" patterns in there, the engine has to backtrack
> through a huge number of combinations, so processing time can blow up
> even if it's not actually hanging.

The strings are all contained within the TABLE section. I used
".*?" to get the smallest match because there are several
TABLE sections in the HTML document. You're right - re did not
hang - after about 5 minutes I got a result :) Meanwhile, however,
I have found another working solution for the problem.

Thanks,
Andreas
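
On the point about ".*?" and several TABLE sections: a lazy wildcard gives the shortest match from a given starting position, but it does not stop at the end of the first table. If the first table lacks the strings, the match can run on into a later one. The snippet below is a small illustration using the pattern from the thread; the sample HTML is invented for the example, and the code assumes a current Python 3.

```python
import re

# The pattern from the original post, unchanged.
PATTERN = re.compile(r"<TABLE.*?es.*?da.*?en.*?fi.*?sv.*?TABLE>")

# Invented sample: the first table has none of the strings, the second has all.
html = (
    "<TABLE><TR><TD>navigation</TD></TR></TABLE>"
    "<P>body text</P>"
    "<TABLE><TR><TD>es da en fi sv</TD></TR></TABLE>"
)

match = PATTERN.search(html)

# The match begins at the first "<TABLE" and only ends after the second
# table, so the substitution also removes the paragraph in between.
cleaned = PATTERN.sub("", html)
```

Here the match covers the whole sample string, which is probably more than was intended to be removed.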