Mailing List Archive

for in benchmark interested
Some time ago I opened a thread about optimization in Python. I have
now done a simple benchmark and used Python for it too. If someone is
interested, take a look at: http://edt.vol.cz:81/bench

Any comment is very welcome. I got an answer from the ISE Eiffel guys
this morning ;-) You might imagine what they think. If any of you have
similar tests, please let me know and allow me to put a copy onto that
page. I hope we can find some points which speak for one language under
certain conditions. It would be really helpful if you would include
information about
1) implementation time
2) debug time
3) sort of bugs
4) maybe a short estimate of how important speed is for that particular test



I didn't keep track of that in much detail, and I now regret it. I can
only give a rough sketch:
1) Perl (stolen from an implementation by Markus Mottl on the
OCaml mailing list)
2) OCaml (the same as for Perl)
3) Python: first solution in < 3 minutes, I guess. If I remember
correctly, no bugs at all (sheer luck; normally I produce at most one
typo ;-))
4) optimizations (< 1 minute each, except the last) (typos)
5) fastest solution (a point of discussion in this group, so this took
quite some time, though for the discussion rather than the implementation)
6) Eiffel solution: because I didn't know gelex at that time, I needed
some more time. The source was done very fast, but is not very Eiffelish:
no assertions and the like (though all the implementations together took
me more than a week). Some parts had to be written completely from
scratch (split_string, which is of course easy to do in Python; see the
sketch after this list).
7) C: this was not difficult after all the other stuff, but I didn't
have a hash-table implementation at hand. I found one in the code
snippets, but found two bugs in it, so adapting it took me quite a while.
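As an aside, here is roughly what split_string has to do, sketched in
Python for illustration (the Eiffel source is not reproduced here, so
this is only an approximation of its behavior):

import string

def split_string(s):
    # collect maximal runs of non-whitespace characters
    words = []
    current = ''
    for ch in s:
        if ch in string.whitespace:
            if current:
                words.append(current)
                current = ''
        else:
            current = current + ch
    if current:
        words.append(current)
    return words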

I can't tell how much time I needed for debugging. Any estimate would
either be much too low (I would look like a hacker, which I'm not) or
too high (I would look like an idiot, but who cares ;-)). And I use a
debugger just to look into the code, so that I can see whether
everything works as intended even without a concrete bug. I don't know
if that is good style, but I feel comfortable with it.


Regards
Friedrich
for in benchmark interested
The Python version would be faster if you used sys.stdin.read instead
of sys.stdin.readlines. I'm not sure why you need to split the input
into lines before you split it into words; it seems like an
unnecessary step.

The version below is 25% faster on my machine than your fastest Python
version. (And I'm not even an expert Python optimizer :-).

import sys
import string


def run():
    dict = {}
    dict_get = dict.get
    read = sys.stdin.read
    string_split = string.split
    while 1:
        buf = read(500000)
        if buf:
            for key in string_split(buf):
                dict[key] = dict_get(key, 0) + 1
        else:
            return dict


dict = run()
write = sys.stdout.write
for word in dict.keys():
    write("%4d\t%s\n" % (dict[word], word))


Jeremy
for in benchmark interested
But won't this break apart words that happen to span a 500000-byte
block boundary?
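
A quick illustration of the failure mode, with a made-up word
straddling two chunks:

import string

buf1, buf2 = "counting wor", "ds here"
# splitting each chunk separately cuts "words" in two
print string.split(buf1) + string.split(buf2)
# prints ['counting', 'wor', 'ds', 'here']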

Jeremy Hylton writes:
> The Python version would be faster if you used sys.stdin.read instead
> of sys.stdin.readlines. I'm not sure why you need to split the input
> into lines before you split it into words; it seems like an
> unnecessary step.

> while 1:
>     buf = read(500000)
>     if buf:
>         for key in string_split(buf):
>             dict[key] = dict_get(key, 0) + 1
>     else:
>         return dict
for in benchmark interested
Doh!

I guess you could read it all at once, which would be fine for a file
that's only 6MB or so. If you want correctness (how important is that
in a benchmark anyway?) and still want to read fixed-size chunks, then
you need to check whether the buffer you read ends in the middle of a
word or between words. With that checking added, the code is a bit
more complex but still about 20% faster.
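
For comparison, a minimal sketch of the read-it-all-at-once variant
(assuming the whole input fits in memory; the name run_all is mine):

import sys
import string

def run_all():
    # read the entire input in one call, then split on whitespace
    dict = {}
    dict_get = dict.get
    for key in string.split(sys.stdin.read()):
        dict[key] = dict_get(key, 0) + 1
    return dict

The chunked version with the boundary checking: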

#!/usr/local/bin/python
import sys
import string


def run():
    dict = {}
    dict_get = dict.get
    read = sys.stdin.read
    string_split = string.split
    prev = ''
    while 1:
        buf = read(500000)
        if buf:
            parts = string_split(buf)

            # did the buffer start mid-word?  if so, glue the leftover
            # from the previous chunk onto the first part
            if buf[0] == parts[0][0]:
                parts[0] = prev + parts[0]
            elif prev:
                dict[prev] = dict_get(prev, 0) + 1

            for key in parts[:-1]:
                dict[key] = dict_get(key, 0) + 1

            # did the buffer end with whitespace?
            if buf[-1] != parts[-1][-1]:
                key = parts[-1]
                dict[key] = dict_get(key, 0) + 1
                prev = ''
            else:
                prev = parts[-1]
        else:
            # count a word left over when the input
            # doesn't end with whitespace
            if prev:
                dict[prev] = dict_get(prev, 0) + 1
            return dict


dict = run()
write = sys.stdout.write
for word in dict.keys():
    write("%4d\t%s\n" % (dict[word], word))


Jeremy
for in benchmark interested
Friedrich Dominicus <Friedrich.Dominicus@inka.de> said:
> Some time ago I opened a thread about optimization in Python. I
> have now done a simple benchmark and used Python for it too. If
> someone is interested, take a look at: http://edt.vol.cz:81/bench

How about a Unix command-line implementation, like

cat file | tr " \t\r" "\n\n\n" | sort | uniq -c | tail +2

It's a bit slower than the C version, but only by a factor of two or
so, and a lot easier to write. (The tail +2 drops the first line of
output, which is the count of the empty strings that runs of
whitespace produce.)


Another solution might be:
echo "" | spellin > myhlist
spell -d myhlist < file | sort | uniq -c

but that strips punctuation so doesn't give the same results.

A third is:
deroff -w file | sort | uniq -c

which also doesn't give exactly what you want. Still, it is arguable
that both of these are better than the other solutions, given
different definitions of the word "word."


BTW, the C code has some problems:

*** add a (char *) cast:

"count_words.l", line 34: error(3232): a value of type "void *"
cannot be used to initialize an entity of type "char *"
char* str = malloc(yyleng+1); strcpy(str, yytext);
^

There are a few uses of "inline" that my non-gcc C compiler doesn't
like. I'm not sure inline is a C keyword (it isn't in C89).

I also get a core dump when I run it:

bioreason8> ./count_words_c ./count_words.c
Bus error (core dumped)
bioreason8> dbx ./count_words_c core
dbx version 7.1 Dec 3 1996 17:03:19
Core from signal SIGBUS: Bus error
(dbx) where
> 0 hash(string = 0x100170e9 = "line") ["/home/u05/dalke/tmp/hash.c":104, 0x100037c8]
1 search(table = 0x10017010, key = 0x100170e8 = "#line")
["/home/u05/dalke/tmp/hash.c":181, 0x10003b00]
2 run() ["/home/u05/dalke/tmp/count_words.l":35, 0x100033e0]
3 yylex() ["/home/u05/dalke/tmp/count_words.l":27, 0x1000198c]
4 main(argc = 2, argv = 0x7fff2f24) ["/home/u05/dalke/tmp/count_words.l":59,
0x100034fc]
5 __start()
["/vince/6.2-mar09/work/irix/lib/libc/libc_n32_M3/csu/crt1text.s":166,
0x100015e8]
(dbx) list 92
92 /*
93 Hashes a string to produce an unsigned short, which should be
94 sufficient for most purposes.
95 */
96
97 unsigned hash(char *string)
98 {
99 unsigned int ret_val = 0;
100 int i;
101
(dbx) l
102 while (*string)
103 {
* 104 i = *( int *)string;
105 ret_val ^= i;
106 ret_val <<= 1;
107 string ++;
108 }
109 return ret_val;
110 }
111


The conversion on line 104 is really wrong: it reads four bytes
through an (int *) at what may be an unaligned address, which is what
triggers the bus error. It should likely be

i = (int)(*string)
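
For illustration, a Python rendering of what the fixed hash loop
computes (my sketch, not code from the thread; the modulo emulates C's
unsigned int wraparound on a 32-bit machine):

def hash(s):
    # XOR in each character code, shifting the accumulator left each time
    ret_val = 0
    for ch in s:
        ret_val = ((ret_val ^ ord(ch)) << 1) % (2L ** 32)
    return ret_val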

Also, I tried it (with the fix) against a core dump:
bioreason8> ls -l /usr/tmp/core
-rw-r--r-- 1 root sys 5357568 Apr 8 00:45 /usr/tmp/core
bioreason8> ./count_words_c /usr/tmp/core > /dev/null

After about 6 minutes I killed the job. Probably doesn't like the
embedded NULs. This is an example of "fuzz" testing. See
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-95-1268


Finally, I see the lex code is processed with:

count_words.c: count_words.l
	flex -B -ocount_words.c count_words.l

You should add a "-f" to that. That gave me a 15% speedup over just
-B, measured against the code at http://caml.inria.fr/caml-list/0899.html,
on which you originally based this thread.
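
That is, presumably:

flex -B -f -ocount_words.c count_words.l

(flex's -f option generates full, uncompressed tables, which scan
faster at the cost of a larger scanner.)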


Andrew
dalke@acm.org