Mailing List Archive

indexing speed of 0.20_02
Hello,

I've just started working with the devel release and have ported my
0.15 indexer to the new model. The document set is rather large
(500,000+ documents), and indexing it took many hours with the 0.15
release. With 0.20, however, I haven't been able to finish indexing:
the process seems to take days, and I end up killing it and going back
to the code. I'm printing out the file path and name of each document
as it's indexed, so I can tell what is currently being processed.

I have successfully indexed a subset of these files, and searching
works against that index. Has anyone run into this? Any ideas about
how I can figure out what's happening?

Thanks,
Roger
indexing speed of 0.20_02 [ In reply to ]
Roger Dooley (3/15/2007 12:21 PM) wrote:
> Hello,
>
> I've just started working with the devel release and have ported my
> 0.15 indexer to the new model. The document set is rather large
> (500,000+ documents), and indexing it took many hours with the 0.15
> release. With 0.20, however, I haven't been able to finish indexing:
> the process seems to take days, and I end up killing it and going back
> to the code. I'm printing out the file path and name of each document
> as it's indexed, so I can tell what is currently being processed.
>
> I have successfully indexed a subset of these files, and searching
> works against that index. Has anyone run into this? Any ideas about
> how I can figure out what's happening?
>
> Thanks,
> Roger

I took Marvin's advice and installed Devel::SawAmpersand, but I was
right to think that it wouldn't help, since I didn't encounter this
problem when indexing with version 0.15. I'm running FC6 with 2 GB of
RAM.
Memory usage for the process never rises above 5%.
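
For anyone else wanting to run the same check: Devel::SawAmpersand
reports whether anything loaded into the process has referenced $&, $`
or $', which forces Perl to copy the match text on every regex match.
A minimal sketch (load your indexer's modules first so their code gets
scanned):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pull in the indexer's modules before checking.
    use KinoSearch::InvIndexer;

    use Devel::SawAmpersand qw( sawampersand );
    print sawampersand()
        ? "something referenced \$&, \$\`, or \$' -- every match pays a copy penalty\n"
        : "clean: \$& and friends never seen\n";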
indexing speed of 0.20_02 [ In reply to ]
On Mar 15, 2007, at 9:21 AM, Roger Dooley wrote:

> I've just started working with the devel release and have ported my
> 0.15 indexer to the new model. The document set is rather large
> (500,000+ documents), and indexing it took many hours with the 0.15
> release. With 0.20, however, I haven't been able to finish indexing:
> the process seems to take days, and I end up killing it and going
> back to the code.

At least some of the slowdown is a side effect of UTF-8 compatibility
in 0.20. Tokenizer is a major offender, and the bottleneck is Perl's
UTF-8 character class regex implementation.
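
For a quick sense of the gap, here's a minimal sketch (not the real
benchmarker) that matches identical ASCII text twice, once as a byte
string and once with the UTF8 flag switched on:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    my $ascii = "the quick brown fox jumps over the lazy dog " x 200;
    my $utf8  = $ascii;
    utf8::upgrade($utf8);    # same characters, but the UTF8 flag is now on

    cmpthese( -2, {
        ascii => sub { my $n = () = $ascii =~ /\w+/g },
        utf8  => sub { my $n = () = $utf8  =~ /\w+/g },
    } );

The \w character class takes the slower UTF-8 code path as soon as the
flag is set, even though the text itself is pure ASCII.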

I'm a little surprised by the scale, though. According to my
benchmarking tests, we'd taken about a 35% hit, going from around 3.1
seconds under 0.15 to around 4.2 seconds for 0.20. We actually lost
a lot more than that with the transition to UTF-8, but I've continued
to make strides optimizing the engine -- if you take Tokenizer out of
the loop, and use a purpose-built C tokenizer instead (the
ASCIIWhiteSpaceTokenizer in devel/benchmarks/BenchMarkingIndexer.pm),
0.20 is actually 30% *faster* than 0.15, at 1.82 secs vs 2.62 secs.

However, my benchmarker script only uses a Tokenizer. If your
analyzer incorporates a Stemmer or a Stopalizer, there may be
additional drags I hadn't been measuring. Stemmer seems like a more
likely culprit, since that's changed to UTF-8 and I don't know how
UTF-8 Snowball performs in comparison to Latin-1 Snowball.
Stopalizer is also a possibility, but I'm not sure that hash lookups
are slower under UTF-8 -- I wouldn't think so. LCNormalizer is
almost certainly slower, but I wouldn't guess it would affect things
too much since it only hits the string once.

Here are some stats originally compiled for a post I made to the Perl
5 Porters list:
<http://www.nntp.perl.org/group/perl.perl5.porters/2007/02/msg121014.html>

==================================================================
 Mean time to index 1000 ASCII news articles
------------------------------------------------------------------
 tokenizer           5.8.6 (thr)   5.8.8 (no thr)   blead (no thr)
------------------------------------------------------------------
 UTF-8 regex         4.18 secs     3.72 secs        3.80 secs
 Latin-1 regex       2.84 secs     2.50 secs        2.60 secs
 Purpose-built C     1.82 secs     1.60 secs        1.64 secs


It turns out that Perl's current UTF-8 char-class implementation is
sub-optimal. Yves Orton (a.k.a. demerphq) and I have had some
preliminary discussions about how to go about improving it. Yves has
actually made the regex engine pluggable in blead; what may happen
eventually is that after 5.10 comes out I'll hack up a slightly
tweaked version of the regex engine which (only) Tokenizer will use.

I'd actually love to go in and hack on Perl's regex engine right now,
and the work to implement char classes in terms of "inversion lists"
probably isn't insane (bwa ha ha). However, I haven't done so
because 1) I'd have to invest some time to come up to speed on the
gory details of the regex engine, and 2) KinoSearch's indexing
performance has been good enough up till now that it's been more
important to work on other features.
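
For the curious: an inversion list stores a character class as a sorted
array of code points at which membership flips, so a lookup becomes a
binary search. A toy sketch, using a made-up class of ASCII digits and
letters (an illustration of the idea, not Perl internals):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # [0-9A-Za-z] as an inversion list: in at 0x30, out at 0x3A,
    # in at 0x41, out at 0x5B, in at 0x61, out at 0x7B.
    my @inv_list = ( 0x30, 0x3A, 0x41, 0x5B, 0x61, 0x7B );

    sub in_class {
        my ($cp) = @_;
        my ( $lo, $hi ) = ( 0, scalar @inv_list );
        while ( $lo < $hi ) {    # count flip points at or below $cp
            my $mid = int( ( $lo + $hi ) / 2 );
            if   ( $inv_list[$mid] <= $cp ) { $lo = $mid + 1 }
            else                            { $hi = $mid }
        }
        return $lo % 2;          # odd flip count means inside the class
    }

    print in_class( ord('a') ) ? "in\n" : "out\n";    # in
    print in_class( ord('-') ) ? "in\n" : "out\n";    # out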

It may be time to make another stab at moving the Tokenizer loop to C.

    while (/$token_re/g) {
        push @starts, $-[0];
        push @ends,   $+[0];
    }

The first time I tried that was before Yves exposed and documented the
regex engine API:
<http://search.cpan.org/~rgarcia/perl-5.9.4/pod/perlreguts.pod>. With
the aid of the new docs, I can probably figure things out for blead,
then backport for 5.8.x.

There are significant inefficiencies in how @- and @+ are retrieved
under UTF-8 -- they recalculate the UTF-8 length every time, which is
damned inefficient if you're doing it for every token. (This happens
in the function Perl_magic_regdatum_get() in mg.c.) If I can run the
loop in C, I can get at the original numbers from the regex engine
struct and avoid that.

If you don't want to wait for me to complete this work and you have
Inline C skillz, you might try carving up your own Tokenizer based on
ASCIIWhiteSpaceTokenizer.

Otherwise, if you (or anybody else) wants to help me out, I could use
some benchmarking numbers with various configs. Time I spend doing
the benchmarking (which other people can do) is time I don't spend
rooting around in the scariest crags of Perl and KinoSearch C code
(which not many other people are going to be able to do). Different
Analyzers would be very helpful. So would long vs. short source
strings.

Hope this long-winded reply helps you -- composing it helped me.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
indexing speed of 0.20_02 [ In reply to ]
On Mar 15, 2007, at 2:27 PM, Roger Dooley wrote:

> Memory usage for the process never rises above 5%.

Just for giggles, I'd be curious to see what happens if you change
the cache size. I'm considering adding a setter for this to
InvIndexer's API, but past experiments haven't yielded significantly
improved performance with larger cache sizes. I think that's because
there's a hell of a lot of malloc/free thrashing going on -- but
maybe that's peculiar to OS X and other people will see something
different.

To change the threshold, try this:

    use KinoSearch::Util::SortExternal;
    sub some_bigger_number { SOME_NUMBER }
    local *KinoSearch::Util::SortExternal::_DEFAULT_MEM_THRESHOLD
        = *some_bigger_number;

    use KinoSearch::InvIndexer;
    ...

The default value is 16777216 (16 MB).

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
indexing speed of 0.20_02 [ In reply to ]
Marvin Humphrey (3/15/2007 8:03 PM) wrote:
>
> On Mar 15, 2007, at 2:27 PM, Roger Dooley wrote:
>
>> Memory usage for the process never rises above 5%.
>
> Just for giggles, I'd be curious to see what happens if you change the
> cache size. I'm considering adding a setter for this to InvIndexer's
> API, but past experiments haven't yielded significantly improved
> performance with larger cache sizes. I think that's because there's a
> hell of a lot of malloc/free thrashing going on -- but maybe that's
> peculiar to OS X and other people will see something different.
>
> To change the threshold, try this:
>
>     use KinoSearch::Util::SortExternal;
>     sub some_bigger_number { SOME_NUMBER }
>     local *KinoSearch::Util::SortExternal::_DEFAULT_MEM_THRESHOLD
>         = *some_bigger_number;
>
>     use KinoSearch::InvIndexer;
>     ...
>
> The default value is 16777216 (16 MB).
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/

Thanks for the suggestion. I tried increasing this to 64 MB and the
indexing started to go faster, but it's now ground down to a snail's pace.

In a previous email, you asked for benchmarking help. I'd be glad to
run any debugging I can. I'll be out of town next week, but I might be
able to run some this weekend if pointed in the right direction.

Thanks
indexing speed of 0.20_02 [ In reply to ]
On Thu, 15 Mar 2007, Marvin Humphrey wrote:
> Otherwise, if you (or anybody else) wants to help me out, I could use
> some benchmarking numbers with various configs. Time I spend doing
> the benchmarking (which other people can do) is time I don't spend
> rooting around in the scariest crags of Perl and KinoSearch C code
> (which not many other people are going to be able to do). Different
> Analyzers would be very helpful. So would long vs. short source
> strings.

I'm working towards having a test indexer running sometime this
weekend (going from 0.15 to 0.20-devel), so I'll also provide some
feedback if I can.
indexing speed of 0.20_02 [ In reply to ]
>
> Thanks for the suggestion. I tried increasing this to 64 MB and the
> indexing started to go faster, but it's now ground down to a snail's pace.
>
> In a previous email, you asked for benchmarking help. I'd be glad to
> run any debugging I can. I'll be out of town next week, but I might be
> able to run some this weekend if pointed in the right direction.
>

I've been watching my indexing for 12+ hours now, and I've found that
indexing will grind to a halt, index a few hundred documents several
hours later, and then halt again. Rinse, repeat. These documents are
single emails being decoded with MIME::Parser and MIME::Head, if that
helps any.

-Roger
indexing speed of 0.20_02 [ In reply to ]
On Mar 15, 2007, at 7:53 PM, Roger Dooley wrote:

> Thanks for the suggestion. I tried increasing this to 64 MB and
> the indexing started to go faster, but it's now ground down to a
> snail's pace.

Snail's pace? Something's weird.

The only major dropoff in indexing I'm aware of is the SawAmpersand
problem.

Besides that, I've seen a drop of maybe a couple percent due to what
I suspect is memory fragmentation. The benchmarker script launches
each iteration as a child process to avoid this problem -- I found it
because the first iteration was consistently faster (once the OS hard
disk cache was warmed by running a couple iters beforehand).

Things shouldn't get much slower once the external sort cache starts
flushing. And that part of the codebase has hardly changed at all
from 0.15 -- mostly just OO facade stuff.

How large are these documents? The utf8_length() problem with @- and
@+ I described earlier gets worse as document size increases. But
even so...

What happens if you cut the cache way down, to 1024? That won't lose
any data; it's just a threshold that triggers flushing once it's
exceeded.
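
Concretely, using the same glob-override trick from my earlier message
(the sub name here is just a placeholder):

    use KinoSearch::Util::SortExternal;
    sub tiny_threshold { 1024 }    # flush the sort cache after ~1 KB
    local *KinoSearch::Util::SortExternal::_DEFAULT_MEM_THRESHOLD
        = *tiny_threshold;

    use KinoSearch::InvIndexer;
    ...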

And what's churning? Are we CPU-bound? I/O-bound?

Can you profile the code? Are you familiar with Devel::DProf? If
it's a UTF-8 analyzer problem, I'd expect dprofpp output to turn
something up.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
indexing speed of 0.20_02 [ In reply to ]
Marvin Humphrey (3/17/2007 12:43 AM) wrote:
>
> On Mar 15, 2007, at 7:53 PM, Roger Dooley wrote:
>
>> Thanks for the suggestion. I tried increasing this to 64 MB and the
>> indexing started to go faster, but it's now ground down to a snail's
>> pace.
>
> Snail's pace? Something's weird.
>
> The only major dropoff in indexing I'm aware of is the SawAmpersand
> problem.
>
> Besides that, I've seen a drop of maybe a couple percent due to what I
> suspect is memory fragmentation. The benchmarker script launches each
> iteration as a child process to avoid this problem -- I found it because
> the first iteration was consistently faster (once the OS hard disk cache
> was warmed by running a couple iters beforehand).
>
> Things shouldn't get much slower once the external sort cache starts
> flushing. And that part of the codebase has hardly changed at all from
> 0.15 -- mostly just OO facade stuff.
>
> How large are these documents? The utf8_length() problem with @- and @+
> I described earlier gets worse as document size increases. But even so...
>

Not large (single pieces of email). Very few with attachments.

> What happens if you cut the cache way down, to 1024? That won't lose
> any data; it's just a threshold that triggers flushing once it's exceeded.
>

I'll let you know.

> And what's churning? Are we CPU-bound? I/O-bound?

CPU is at 99% (dual-processor system: PIII 900 MHz Xeon). 2 GB RAM,
and memory usage for the process never gets above 6%.

>
> Can you profile the code? Are you familiar with Devel::DProf? If it's
> a UTF-8 analyzer problem, I'd expect dprofpp output to turn something up.
>

I haven't used Devel::DProf, but there's a first time for everything.
I wish I could continue with this next week, but I'm going to be out
of town, wrapping up another project, and won't be able to work on
this. If you come up with anything while I'm gone, I can get to it on
the following Monday.

Thanks

> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
indexing speed of 0.20_02 [ In reply to ]
On Sat, 17 Mar 2007, Roger Dooley wrote:
> > And what's churning? Are we CPU-bound? I/O-bound?
>
> CPU is at 99% (dual-processor system: PIII 900 MHz Xeon). 2 GB RAM,
> and memory usage for the process never gets above 6%.
>
> >
> > Can you profile the code? Are you familiar with Devel::DProf? If it's
> > a UTF-8 analyzer problem, I'd expect dprofpp output to turn something up.
> >
>
> I haven't used Devel::DProf, but there's a first time for everything.
> I wish I could continue with this next week, but I'm going to be out
> of town, wrapping up another project, and won't be able to work on
> this. If you come up with anything while I'm gone, I can get to it on
> the following Monday.

Our indexers have been churning away happily for over 24 hours now,
with no slow-down at all.

As Marvin suggests (and as he suggested to me some time ago), you'll
probably expose a few bodies if you profile your code (and boy, did I
have a few). Here's a snippet of his original email to me:

---------
First, you call the script you want to profile like so:

perl -d:DProf script.pl

That creates a file called tmon.out in the current directory.

Next you call dprofpp. Both Devel::DProf and dprofpp are distributed
with core Perl.

dprofpp

dprofpp parses tmon.out and creates the profiling output you saw,
along with some other junk.
----------
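
A complete session might look something like this (script name
hypothetical; -O caps how many subs dprofpp reports):

    perl -d:DProf indexer.pl    # profiles the run, writes tmon.out
    dprofpp -O 15               # top 15 subs by exclusive time
    dprofpp -r -O 15            # the same, ranked by wall-clock time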
indexing speed of 0.20_02 [ In reply to ]
> First, you call the script you want to profile like so:
>
> perl -d:DProf script.pl
>
> That creates a file called tmon.out in the current directory.
>
> Next you call dprofpp. Both Devel::DProf and dprofpp are distributed
> with core Perl.
>
> dprofpp
>
> dprofpp parses tmon.out and creates the profiling output you saw,
> along with some other junk.
> ----------

I indexed a small subset of the documents. Here are the results:

Total Elapsed Time = 18411.82 Seconds
  User+System Time = 18402.14 Seconds
Exclusive Times
%Time ExclSec CumulS     #Calls sec/call Csec/c Name
 98.4   18123 18132.441   70161   0.2583 0.2584 KinoSearch::Analysis::Tokenizer::analyze
 0.22   40.70 54.700     739621   0.0001 0.0001 IO::ScalarArray::getline
 0.10   17.94 17.944      70161   0.0003 0.0003 Lingua::Stem::Snowball::stem_in_place
 0.09   16.79 19.930      70161   0.0002 0.0003 KinoSearch::Analysis::LCNormalizer::analyze
 0.09   15.95 15.951     102839   0.0002 0.0002 KinoSearch::Index::PostingsWriter::add_batch
 0.07   12.60 18458.      16395   0.0008 1.1259 main::dir_index
 0.07   12.38 12.381     102839   0.0001 0.0001 KinoSearch::Analysis::TokenBatch::invert
 0.06   11.85 18276.027   16339   0.0007 1.1186 KinoSearch::Index::SegWriter::add_doc
 0.05   8.322 8.322      871912   0.0000 0.0000 KinoSearch::Util::Obj::DESTROY
 0.04   7.661 7.661      102839   0.0001 0.0001 KinoSearch::Index::TermVectorsWriter::tv_string
 0.04   7.487 7.487      110953   0.0000 0.0000 IO::ScalarArray::_eos
 0.04   7.290 10.397      79469   0.0001 0.0001 Mail::Header::_fmt_line
 0.04   7.254 7.254       70161   0.0001 0.0001 KinoSearch::Analysis::TokenBatch::add_many_tokens
 0.04   7.050 7.050           1   7.0500 7.0500 KinoSearch::Index::PostingsWriter::_write_postings
 0.04   6.931 64.591      16339   0.0004 0.0040 MIME::Body::as_lines
indexing speed of 0.20_02 [ In reply to ]
On Mar 18, 2007, at 6:53 AM, Roger Dooley wrote:


> 98.4 18123 18132.441 70161 0.2583 0.2584 KinoSearch::Analysis::Tokenizer::analyze

What's your analyzer look like? Are you using a stock PolyAnalyzer?

Can you describe the material in the emails in terms of language? I
want to know if there's a lot of stuff above code point 127 (i.e., not
ASCII).


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
indexing speed of 0.20_02 [ In reply to ]
Marvin Humphrey (3/18/2007 10:18 AM) wrote:
>
> On Mar 18, 2007, at 6:53 AM, Roger Dooley wrote:
>
>
>> 98.4 18123 18132.441 70161 0.2583 0.2584 KinoSearch::Analysis::Tokenizer::analyze
>
> What's your analyzer look like? Are you using a stock PolyAnalyzer?
>

stock PolyAnalyzer

> Can you describe the material in the emails in terms of language? I
> want to know if there's a lot of stuff above code point 127 (i.e.,
> not ASCII).
>

ASCII...material indexed was from 1991.

>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
indexing speed of 0.20_02 [ In reply to ]
On Mar 18, 2007, at 8:53 AM, Roger Dooley wrote:

> stock PolyAnalyzer

> ASCII...material indexed was from 1991.

Very strange. I mean, look at Tokenizer->analyze in 0.20_02:

sub analyze {
    my ( $self, $batch ) = @_;
    my $token_re  = $self->{token_re};
    my $new_batch = KinoSearch::Analysis::TokenBatch->new;

    while ( my $token = $batch->next ) {
        # alias input to $_
        for ( $token->get_text ) {
            # accumulate token start_offsets and end_offsets
            my ( @starts, @ends );
            while (/$token_re/g) {
                push @starts, $-[0];
                push @ends,   $+[0];
            }

            # add the new tokens to the batch
            $new_batch->add_many_tokens( $_, \@starts, \@ends );
        }
    }

    return $new_batch;
}

There's not a lot there. TokenBatch->add_many_tokens is a bit
involved, but if that one had been the problem, it would have shown
up in the profile. TokenBatch->new, TokenBatch->next, and
Token->get_text are all nothing.

It has to be the regex.

I'm just about out of ideas. What version of Perl? Maybe
something's really whacked with Unicode regex char classes or the /g
flag or @- and @+ in a particular Perl version, and maybe I can
duplicate the problem by compiling a copy.

When you get back, maybe we can have you run the benchmarkers from
both 0.15 and 0.20_xx. (FYI, the benchmarker bundled with 0.20_02 is
out of date.) The Reuters news articles are also (nearly) pure ASCII.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
indexing speed of 0.20_02 [ In reply to ]
On Mar 18, 2007, at 9:19 AM, Marvin Humphrey wrote:

> It has to be the regex.

Found it.

It's @- and @+ -- specifically, the utf8_length() recalculation in
mg.c identified earlier.

Here's what I get for the app below with a source string that's 135
bytes:

                   Rate utf8_capture ascii_capture utf8_match ascii_match
utf8_capture     5856/s           --          -16%       -74%        -82%
ascii_capture    6982/s          19%            --       -69%        -78%
utf8_match      22330/s         281%          220%         --        -30%
ascii_match     31940/s         445%          357%        43%          --

Here's what I get with a 294k input:

(warning: too few iterations for a reliable count)
                     Rate utf8_capture ascii_capture utf8_match ascii_match
utf8_capture   2.33e-02/s           --         -100%      -100%       -100%
ascii_capture      4.85/s       20701%            --       -70%        -82%
utf8_match         16.0/s       68622%          230%         --        -42%
ascii_match        27.6/s      118248%          469%        72%          --

The remedy will be to move the loop to C and access the regexp struct
members directly. I just have to figure out exactly how to do that
-- the regex engine is not officially part of Perl's C API.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#--------------------------------------------------------

#!/usr/bin/perl
use strict;
use warnings;

use Benchmark qw( cmpthese );

my $token_re = qr/\w+(?:'\w+)*/;

die("File path required as first command-line argument")
    unless @ARGV;
my $filepath = $ARGV[0];
open( my $fh, '<', $filepath )
    or die "Can't open '$filepath': $!";
my $content = do { local $/; <$fh> };
close $fh or die "Can't close '$filepath': $!";

my $utf8_content = $content;
utf8::upgrade($utf8_content);

cmpthese(
    -1,
    {   ascii_match   => sub { match_only($content) },
        utf8_match    => sub { match_only($utf8_content) },
        ascii_capture => sub { capture($content) },
        utf8_capture  => sub { capture($utf8_content) },
    }
);

sub match_only {
    for ( $_[0] ) {
        1 while /$token_re/g;
    }
}

sub capture {
    my ( @starts, @ends );
    for ( $_[0] ) {
        while (/$token_re/g) {
            push @starts, $-[0];
            push @ends,   $+[0];
        }
    }
}
indexing speed of 0.20_02 [ In reply to ]
Roger,

Thanks for hanging in there. I believe the problem has been licked.

Tokenizer->analyze has been moved to XS, and the new version does not
suffer from the utf8_length() problem. Benchmarking tests using the
Reuters corpus appear to indicate that indexing performance is the
best it's ever been.

Please grab Tokenizer.pm from the subversion repository...

http://www.rectangular.com/svn/kinosearch/trunk/perl/lib/KinoSearch/Analysis/Tokenizer.pm

... and copy it into the $DIST_ROOT/lib/KinoSearch/Analysis directory
of a freshly decompressed 0.20_02 tarball.

http://www.rectangular.com/downloads/KinoSearch-0.20_02.tar.gz

(Don't check out subversion trunk; there are some other things going
on right now.)

Then run the Build/install process.

Note that you can't just copy Tokenizer.pm into a system dir, because
XS code in KS is inlined into the .pm files and needs to be extracted
and recompiled.

Let me know how it goes,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

kino_TokenBatch*
analyze(self_hv, batch)
    HV *self_hv;
    kino_TokenBatch *batch;
CODE:
{
    SV *token_re      = extract_sv(self_hv, SNL("token_re"));
    MAGIC *mg         = NULL;
    REGEXP *rx        = NULL;
    kino_Token *token = NULL;
    kino_u32_t num_code_points = 0;
    SV *wrapper       = sv_newmortal();
    RETVAL = kino_TokenBatch_new(NULL);

    /* extract regexp struct from qr// entity */
    if (SvROK(token_re)) {
        SV *sv = SvRV(token_re);
        if (SvMAGICAL(sv))
            mg = mg_find(sv, PERL_MAGIC_qr);
    }
    if (!mg)
        CONFESS("not a qr// entity");
    rx = (REGEXP*)mg->mg_obj;

    /* fake up an SV wrapper to feed to the regex engine */
    sv_upgrade(wrapper, SVt_PV);
    SvREADONLY_on(wrapper);
    SvLEN(wrapper) = 0;
    SvUTF8_on(wrapper);

    while ( (token = Kino_TokenBatch_Next(batch)) != NULL ) {
        char *const string_beg = token->text;
        char *const string_end = string_beg + token->len;
        char *string_arg       = string_beg;

        /* wrap the token's string */
        SvPVX(wrapper) = string_beg;
        SvCUR_set(wrapper, token->len);
        SvPOK_on(wrapper);

        while (
            pregexec(rx, string_arg, string_end, string_arg, 1, wrapper, 1)
        ) {
            char *const start_ptr = string_arg + rx->startp[0];
            char *const end_ptr   = string_arg + rx->endp[0];
            kino_u32_t start, end;
            kino_Token *new_token;

            /* get start and end offsets in Unicode code points */
            for ( ; string_arg < start_ptr; num_code_points++) {
                string_arg += KINO_STRHELP_UTF8_SKIP[(kino_u8_t)*string_arg];
                if (string_arg > string_end)
                    CONFESS("scanned past end of '%s'", string_beg);
            }
            start = num_code_points;
            for ( ; string_arg < end_ptr; num_code_points++) {
                string_arg += KINO_STRHELP_UTF8_SKIP[(kino_u8_t)*string_arg];
                if (string_arg > string_end)
                    CONFESS("scanned past end of '%s'", string_beg);
            }
            end = num_code_points;

            /* add a token to the new_batch */
            new_token = kino_Token_new(
                start_ptr,
                (end_ptr - start_ptr),
                start,
                end,
                1.0f, /* boost always 1 for now */
                1     /* position increment */
            );
            Kino_TokenBatch_Append(RETVAL, new_token);
            REFCOUNT_DEC(new_token);
        }
    }
}
OUTPUT: RETVAL
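
For what it's worth, the Perl-side interface is unchanged, so a stock
PolyAnalyzer (which is what Roger reported using) picks up the XS
Tokenizer without any code changes. A minimal sketch:

    use KinoSearch::Analysis::PolyAnalyzer;

    # Same constructor and analyze($batch) interface as the pure-Perl
    # version; only the guts of Tokenizer moved to XS.
    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );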
indexing speed of 0.20_02 [ In reply to ]
Marvin Humphrey (3/18/2007 6:06 PM) wrote:
> Roger,
>
> Thanks for hanging in there. I believe the problem has been licked.
>
> Tokenizer->analyze has been moved to XS, and the new version does not
> suffer from the utf8_length() problem. Benchmarking tests using the
> Reuters corpus appear to indicate that indexing performance is the
> best it's ever been.
>
> Please grab Tokenizer.pm from the subversion repository...
>
>
> http://www.rectangular.com/svn/kinosearch/trunk/perl/lib/KinoSearch/Analysis/Tokenizer.pm
>
>
>
> ... and copy it into the $DIST_ROOT/lib/KinoSearch/Analysis directory
> of a freshly decompressed 0.20_02 tarball.
>
> http://www.rectangular.com/downloads/KinoSearch-0.20_02.tar.gz
>
> (Don't check out subversion trunk, there are some other things going
> on right now.)
>
> Then run the Build/install process.
>
> Note that you can't just copy Tokenizer.pm into a system dir, because
> XS code in KS is inlined into the .pm files and needs to be extracted
> and recompiled.
>
> Let me know how it goes,
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/

t/217-multi_term_list.........NOK 6/8
# Failed test 'correct term number'
# at t/217-multi_term_list.t line 89.
# got: 'ay'
# expected: 'b'
t/217-multi_term_list.........ok 7/8# Looks like you failed 1 test of 8.
t/217-multi_term_list.........dubious
Test returned status 1 (wstat 256, 0x100)
DIED. FAILED test 6
Failed 1/8 tests, 87.50% okay


I did install anyway. I've only been running the indexer for ten
minutes or so, and I'm already up to 23,000+ documents indexed. If
there are any other problems, I'll send some email. Thanks for
attending to this!

-Roger
indexing speed of 0.20_02 [ In reply to ]
On Mar 18, 2007, at 3:57 PM, Roger Dooley wrote:

> t/217-multi_term_list.........NOK 6/8
> # Failed test 'correct term number'
> # at t/217-multi_term_list.t line 89.
> # got: 'ay'
> # expected: 'b'
> t/217-multi_term_list.........ok 7/8# Looks like you failed 1 test
> of 8.
> t/217-multi_term_list.........dubious
> Test returned status 1 (wstat 256, 0x100)
> DIED. FAILED test 6
> Failed 1/8 tests, 87.50% okay

I believe I've now squashed this annoying little twerp. It had a 1
in 128 chance of occurring.

Pudge, that leaves your segfault bug as the last open bug related to
rangefilter (: for now :). Let me try to clean up the memory errors
tomorrow, then I'll ask you to try to reproduce it using trunk.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
indexing speed of 0.20_02 [ In reply to ]
Marvin Humphrey wrote:
> Tokenizer->analyze has been moved to XS, and the new version does not
> suffer from the utf8_length() problem. Benchmarking tests using the
> Reuters corpus appear to indicate that indexing performance is the best
> it's ever been.

I also have the problem with indexing speed slowing to a snail's pace
on OS X after about 150 documents.

> Please grab Tokenizer.pm from the subversion repository...
>
http://www.rectangular.com/svn/kinosearch/trunk/perl/lib/KinoSearch/Analysis/Tokenizer.pm

Is the current trunk safe to grab? Trying to replace just this file
with the current version causes Build problems.


Thanks,

Tony
indexing speed of 0.20_02 [ In reply to ]
On Mar 31, 2007, at 9:09 AM, Tony Bowden wrote:
>> Please grab Tokenizer.pm from the subversion repository...
>>
> http://www.rectangular.com/svn/kinosearch/trunk/perl/lib/KinoSearch/Analysis/Tokenizer.pm
>
> Is the current trunk safe to grab?

I'm afraid not.

> Trying to replace just this file with the current version causes
> Build problems.

Ah, jeez, that's right. I renamed something.

Here's a tarball that's 0.20_02 augmented with a working version of
the current Tokenizer.

http://www.rectangular.com/downloads/kino4tony.tar.gz

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
indexing speed of 0.20_02 [ In reply to ]
Marvin Humphrey wrote:
> Here's a tarball that's 0.20_02 augmented with a working version of the
> current Tokenizer.
> http://www.rectangular.com/downloads/kino4tony.tar.gz

Excellent. Everything builds and runs fine, at great speed.

Thanks,

Tony