Mailing List Archive

An Un-optimization anecdote with text-processing
Phil Mayes wrote:

> A raw email always has a blank line between the header and the body.
> So you can read it in and find the gap by looking for 2 EOLs:
>
> [Code snipped]
> --
> Phil Mayes pmayes AT olivebr DOT com
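(Phil's code was snipped above; presumably it did something along these
lines. A sketch, assuming Unix line endings and a made-up filename:)

import string

# a raw message: everything up to the first blank line is header
msg = open('message.txt', 'r').read()   # 'message.txt' is a stand-in name
gap = string.find(msg, '\n\n')          # two EOLs in a row mark the gap
header, body = msg[:gap], msg[gap+2:]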

Well, I was just starting to write a Perl program for a simple
text-processing task I have to do, when this post steered me back to
my language of choice. As I wrote my script, which is called by the
Unix "find" command, I got to thinking about how anti-optimized it
was... Here is what I wrote:

import string
import sys

# Way stupid algorithm, but I'm running it overnight :)
filename = sys.argv[1]
f = open(filename, 'r')
text = f.read()
while 1:
    # rescan the entire string from the top on every pass
    i = string.find(text, r'/"')
    if i == -1: break
    # splice in "index.html" after the slash, copying the whole string
    text = text[:i+1] + "index.html" + text[i+1:]

f.close()
f = open(filename, 'w')
f.write(text)
f.close()


As you can see, the core loop scans the entire file for a string
match, builds a new string by concatenation, then starts scanning from
the beginning yet again. It almost doesn't seem like it could be made
much worse, unless one deliberately tried.
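(For contrast, the obvious fix is to resume the search just past each
insertion, using the optional start argument of string.find(). A
sketch; note that it still pays for a full copy of the string on every
splice:)

import string
import sys

filename = sys.argv[1]
text = open(filename, 'r').read()
i = string.find(text, r'/"')
while i != -1:
    text = text[:i+1] + "index.html" + text[i+1:]
    # "index.html" contains no slash, so the next match can only
    # begin past the point of this insertion
    i = string.find(text, r'/"', i + 1)
f = open(filename, 'w')
f.write(text)
f.close()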

Which brings me to my new game... Rather than finding a nice hack
that would speed things up and use fewer lines of code, how about
finding the hack that uses the least amount of code to make this as
slow as possible? Adding pointless operations or other obfuscations
doesn't count. Try to make this operation as elegantly short, and as
painfully slow, as possible...

For example:

I had originally coded this with string.index(), using a try/except
pair to break out of the loop (sketched below). That took more lines
of code but is probably faster than my loop (it avoids the 'if' test
on every iteration, at the cost of a single exception). So I changed
it to be shorter and (possibly) slower. Can anyone think of other
elegant unoptimizations?
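(A sketch of what that try/except version might have looked like;
string.index() raises ValueError when the pattern is absent:)

import string

try:
    while 1:
        # string.index() is string.find() except that it raises
        # ValueError instead of returning -1
        i = string.index(text, r'/"')
        text = text[:i+1] + "index.html" + text[i+1:]
except ValueError:
    pass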

Chad Netzer

PS. I'm not suggesting running elaborate benchmarks or anything, just
hoping to hear some thoughts from the real Python masters. It's
Saturday morning, 5:18 AM, and I'm STILL at work, so that might
explain the loopy idea.
An Un-optimization anecdote with text-processing [ In reply to ]
Chad Netzer <chad@vision.arc.nasa.gov> writes:

> [...]
>
> import string
> import sys
>
> # Way stupid algorithm, but I'm running it overnight :)
> filename = sys.argv[1]
> f = open(filename, 'r')
> text = f.read()
> while 1:
>     i = string.find(text, r'/"')
>     if i == -1: break
>     text = text[:i+1] + "index.html" + text[i+1:]
>
> f.close()
> f = open(filename, 'w')
> f.write(text)
> f.close()
>
> Which brings me to my new game... Rather than finding a nice hack
> that would speed things up and use fewer lines of code, how about
> finding the hack that uses the least amount of code to make this as
> slow as possible? Adding pointless operations or other obfuscations
> doesn't count. Try to make this operation as elegantly short, and as
> painfully slow, as possible...
>
> [...] Can anyone think of other elegant unoptimizations?

My first attempt at the shortest implementation of the string
replacement part is this:


Python 1.5.2 (#3, Apr 21 1999, 16:51:54) [GCC 2.7.2.3] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> from string import *
>>> s = ' "www.python.org/" xyzzy/plugh "fee/fie/foe/foo/" '
>>> join(split(s, '/"'), '/index.html"')
' "www.python.org/index.html" xyzzy/plugh "fee/fie/foe/foo/index.html" '
>>>

But I fear it's also one of the fastest...
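(Splitting on one delimiter and joining with another is just a
substitution, so the string module's replace() should give the same
result in a single call, and presumably just as disappointingly fast:)

>>> replace(s, '/"', '/index.html"')
' "www.python.org/index.html" xyzzy/plugh "fee/fie/foe/foo/index.html" '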

Or how about:

>>> reduce(lambda x,y: x[-1] == '/' and y == '"' and x+'index.html"' or x+y, s)
' "www.python.org/index.html" xyzzy/plugh "fee/fie/foe/foo/index.html" '

You don't even need the string module. This is veerrryyyy
sssssllllllooooooowwwwwwww because the result string is copied over and
over again (quadratic in the length of the input) and a Python function
is called for every single character.


To give you an impression of the speed difference:

>>> def test1(s):
...     reduce(lambda x,y: x[-1] == '/' and y == '"' and x+'index.html"' or x+y, s)
...
>>> def test2(s):
...     join(split(s, '/"'), '/index.html"')
...
>>> import time
>>> from string import *
>>> s = open("data.html").read()
>>> t = time.clock(); test1(s); print time.clock() - t
69.49
>>> t = time.clock(); test2(s); print time.clock() - t
0.01
>>>
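
(As a hypothetical follow-up, not timed above: one could keep the
per-character Python-level logic of test1 but collect the pieces in a
list and join once at the end, so the growing result is never
recopied. That would isolate how much of test1's time goes to the
copying:)

>>> def test1b(s):
...     acc = []
...     for c in s:
...         # same test as the lambda: a '"' right after a '/'
...         if acc and acc[-1][-1] == '/' and c == '"':
...             acc.append('index.html"')
...         else:
...             acc.append(c)
...     return join(acc, '')
...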



--
Bernhard Herzog | Sketch, a python based drawing program
herzog@online.de | http://www.online.de/home/sketch/