On Mar 15, 2007, at 9:21 AM, Roger Dooley wrote:
> I've just started working with the devel release and have modified
> my indexer for 0.15 to the new model. The document set is rather
> large (+.5 million) and indexing this took many hours with the 0.15
> release. However, with 0.20, I haven't been able to index the files
> as the indexing seems to be taking days and I end up killing the
> process and looking at the code again.
At least some of the slowdown is a side effect of UTF-8 compatibility
in 0.20. Tokenizer is a major offender, and the bottleneck is Perl's
UTF-8 character class regex implementation.
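If you want to see that effect in isolation, here's a quick sketch (the names and the stand-in token regex are mine, not KinoSearch's) that times the same ASCII text with and without the UTF-8 flag set, so only the regex engine's code path differs:

use strict;
use warnings;
use Benchmark qw( cmpthese );

# Identical ASCII content twice; utf8::upgrade flips on the UTF-8 flag
# without changing the characters, so only the regex code path changes.
my $bytes = join ' ', ('perl kinosearch tokenizer benchmark') x 1000;
my $utf8  = $bytes;
utf8::upgrade($utf8);

my $token_re = qr/\w+/;    # stand-in for Tokenizer's token_re

cmpthese( -3, {
    latin1 => sub { my @tokens = $bytes =~ /$token_re/g },
    utf8   => sub { my @tokens = $utf8  =~ /$token_re/g },
} );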
I'm a little surprised by the scale, though. According to my
benchmarking tests, we'd taken about a 35% hit, going from around 3.1
seconds under 0.15 to around 4.2 seconds for 0.20. We actually lost
a lot more than that with the transition to UTF-8, but I've continued
to make strides optimizing the engine -- if you take Tokenizer out of
the loop, and use a purpose-built C tokenizer instead (the
ASCIIWhiteSpaceTokenizer in devel/benchmarks/BenchMarkingIndexer.pm),
0.20 is actually 30% *faster* than 0.15, at 1.82 secs vs 2.62 secs.
However, my benchmarker script only uses a Tokenizer. If your
analyzer incorporates a Stemmer or a Stopalizer, there may be
additional drags I hadn't been measuring. Stemmer seems like a more
likely culprit, since that's changed to UTF-8 and I don't know how
UTF-8 Snowball performs in comparison to Latin-1 Snowball.
Stopalizer is also a possibility, but I'm not sure that hash lookups
are slower under UTF-8 -- I wouldn't think so. LCNormalizer is
almost certainly slower, but I wouldn't guess it would affect things
too much since it only hits the string once.
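If your Stemmer happens to be backed by Lingua::Stem::Snowball, you could get a rough read on the UTF-8-vs-Latin-1 Snowball question with something like the sketch below -- the word list is made up, and I'm assuming the module's documented "encoding" constructor param does what it says:

use strict;
use warnings;
use Benchmark qw( cmpthese );
use Lingua::Stem::Snowball;

my @words = ( qw( searching indexes tokenizers benchmarked repeatedly ) ) x 2000;

my $latin1 = Lingua::Stem::Snowball->new( lang => 'en', encoding => 'ISO-8859-1' );
my $utf8   = Lingua::Stem::Snowball->new( lang => 'en', encoding => 'UTF-8' );

cmpthese( -3, {
    latin1 => sub { my @stems = $latin1->stem( \@words ) },
    utf8   => sub { my @stems = $utf8->stem( \@words ) },
} );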
Here are some stats originally compiled for a post I made to the Perl
5 Porters list:
<http://www.nntp.perl.org/group/perl.perl5.porters/2007/02/msg121014.html>
==================================================================
  Mean time to index 1000 ASCII news articles
------------------------------------------------------------------
  tokenizer          5.8.6 (thr)   5.8.8 (no thr)   blead (no thr)
------------------------------------------------------------------
  UTF-8 regex         4.18 secs      3.72 secs        3.80 secs
  Latin-1 regex       2.84 secs      2.50 secs        2.60 secs
  Purpose-built C     1.82 secs      1.60 secs        1.64 secs
==================================================================
It turns out that Perl's current UTF-8 char-class implementation is
sub-optimal. Yves Orton (a.k.a. demerphq) and I have had some
preliminary discussions about how to go about improving it. Yves has
actually made the regex engine pluggable in blead; what may happen
eventually is that after 5.10 comes out I'll hack up a slightly
tweaked version of the regex engine which (only) Tokenizer will use.
I'd actually love to go in and hack on Perl's regex engine right now,
and the work to implement char classes in terms of "inversion lists"
probably isn't insane (bwa ha ha). However, I haven't done so
because 1) I'd have to invest some time to come up to speed on the
gory details of the regex engine, and 2) KinoSearch's indexing
performance has been good enough up till now that it's been more
important to work on other features.
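For the curious, the inversion-list idea itself fits in a few lines of Perl: store only the codepoints where class membership flips, then binary-search to test a codepoint. This is a toy sketch with ASCII-only ranges and my own names, not anything from the regex engine:

use strict;
use warnings;

# A \w-ish class covering 0-9, A-Z, _, and a-z (ASCII only).  Membership
# turns "on" at each even index and "off" at each odd index.
my @inv_list = ( 0x30, 0x3A, 0x41, 0x5B, 0x5F, 0x60, 0x61, 0x7B );

sub in_class {
    my ($cp) = @_;
    my ( $lo, $hi ) = ( 0, scalar @inv_list );
    # Binary search: count how many boundaries are at or below $cp.
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $inv_list[$mid] <= $cp ) { $lo = $mid + 1 }
        else                          { $hi = $mid }
    }
    return $lo % 2;    # odd count => inside an "on" run
}

print in_class( ord 'a' ) ? "word\n" : "non-word\n";    # word
print in_class( ord '!' ) ? "word\n" : "non-word\n";    # non-word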
It may be time to take another stab at moving the Tokenizer loop to C.

# record the start and end offsets of each token match
while (/$token_re/g) {
    push @starts, $-[0];
    push @ends,   $+[0];
}
The first time I tried that preceded Yves' exposing and documenting
of the regex engine API:
<http://search.cpan.org/~rgarcia/perl-5.9.4/pod/perlreguts.pod>
With the aid of the new docs, I can probably figure things out for
blead, then backport for 5.8.x.
There are significant inefficiencies in how @- and @+ are retrieved
under UTF-8 -- they recalculate the UTF-8 length every time -- and
that's damned inefficient if you're doing it for every token. (This
happens in the function Perl_magic_regdatum_get() in mg.c.) If I can
run the loop in C, I can get at the original numbers from the regex
engine struct and avoid that.
If you don't want to wait for me to complete this work and you have
Inline C skillz, you might try carving up your own Tokenizer based on
ASCIIWhiteSpaceTokenizer.
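For anyone who goes that route, a bare-bones Inline C whitespace tokenizer might look roughly like this -- this is my own sketch, not the code in BenchMarkingIndexer.pm, and it only handles ASCII, whitespace-delimited text:

use strict;
use warnings;
use Inline C => <<'END_C';
#include <ctype.h>

/* Scan a byte string and return two array refs: token start offsets
 * and token end offsets.  No regex engine involved. */
void ws_tokenize(SV* text_sv) {
    Inline_Stack_Vars;
    STRLEN len;
    char  *buf    = SvPV(text_sv, len);
    AV    *starts = newAV();
    AV    *ends   = newAV();
    STRLEN i      = 0;

    while (i < len) {
        while (i < len && isspace((unsigned char)buf[i])) i++;
        if (i >= len) break;
        av_push(starts, newSVuv(i));
        while (i < len && !isspace((unsigned char)buf[i])) i++;
        av_push(ends, newSVuv(i));
    }

    Inline_Stack_Reset;
    Inline_Stack_Push(sv_2mortal(newRV_noinc((SV*)starts)));
    Inline_Stack_Push(sv_2mortal(newRV_noinc((SV*)ends)));
    Inline_Stack_Done;
}
END_C

my ( $starts, $ends ) = ws_tokenize("two   tokens here");
print "$starts->[$_]-$ends->[$_]\n" for 0 .. $#$starts;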
Otherwise, if you (or anybody else) wants to help me out, I could use
some benchmarking numbers with various configs. Time I spend doing
the benchmarking (which other people can do) is time I don't spend
rooting around in the scariest crags of Perl and KinoSearch C code
(which not many other people are going to be able to do). Different
Analyzers would be very helpful. So would long vs. short source
strings.
Hope this long-winded reply helps you -- composing it helped me.
Cheers,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/