Mailing List Archive

revision 3552 SEGV during indexing
Hi Marvin,

Revision 3552 seems to be SEGV'ing. Here's a backtrace:

Program terminated with signal 11, Segmentation fault.
#0  0x00000086 in ?? ()
(gdb) bt
#0  0x00000086 in ?? ()
#1  0x00360e2d in kino_Inverter_clear (self=0x972eea8)
    at ../c_src/h/KinoSearch/Obj.h:166
#2  0x00379a41 in kino_SegWriter_add_doc (self=0x972bc00, doc=0x981cd60)
    at ../c_src/h/KinoSearch/Index/Inverter.h:297
#3  0x002bb111 in XS_KinoSearch__Index__SegWriter_add_doc (my_perl=0x8a07008,
    cv=0x8bf9474) at ../c_src/h/KinoSearch/Index/SegWriter.h:249
#4  0x009a747d in Perl_pp_entersub ()
    from /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so
#5  0x009a08df in Perl_runops_standard ()
    from /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so
#6  0x00945f13 in perl_run ()
    from /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so
#7  0x080491ee in main ()
(gdb)

As you can see above, the segv is happening on a
$invindexer->add_doc($doc) call (a normal doc, I tried several).

regards
Henry


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: revision 3552 SEGV during indexing
On Jun 30, 2008, at 8:54 PM, Henry wrote:

> Revision 3552 seems to be SEGV'ing.

OK, the recent big leaks were cleaned up as of r3551, so my guess is
that this isn't an out-of-memory error.

Just to verify, the whole trunk is up-to-date, not just trunk/perl,
right?

> Program terminated with signal 11, Segmentation fault.
> #0 0x00000086 in ?? ()
> (gdb) bt
> #0 0x00000086 in ?? ()
> #1 0x00360e2d in kino_Inverter_clear (self=0x972eea8)
> at ../c_src/h/KinoSearch/Obj.h:166

Tracking this down, the crash appears to be in the section of
Inverter_clear() where the Inverter's stored Doc object gets its
refcount decremented.
That's puzzling. I don't see a scenario where an invalid Doc object
could be sitting in the inverter->doc slot.
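
For readers less familiar with the refcounting mentioned here, a minimal Perl sketch of a slot that owns a reference and releases it on clear. My::Doc and My::Inverter are hypothetical stand-ins, not KinoSearch classes, and Perl's built-in refcounting is only an analogy for what the C code does by hand. With a valid object in the slot, releasing it is harmless, which is why a crash at that step points toward the slot holding something that was never a valid object.

```perl
use strict;
use warnings;
use B ();

# Hypothetical stand-ins for illustration; these are NOT KinoSearch classes.
package My::Doc;
our $destroyed = 0;
sub new     { return bless {}, shift }
sub DESTROY { $destroyed++ }

package My::Inverter;
sub new     { return bless { doc => undef }, shift }
sub set_doc { my ( $self, $doc ) = @_; $self->{doc} = $doc; return }
sub clear   { my ($self) = @_; $self->{doc} = undef; return }

package main;

my $inverter = My::Inverter->new;
my $doc      = My::Doc->new;

# Reference count of the Doc before and after the inverter stores it.
my $count_before = B::svref_2object($doc)->REFCNT;
$inverter->set_doc($doc);
my $count_held = B::svref_2object($doc)->REFCNT;

# Drop the caller's reference; the inverter's slot keeps the object alive.
undef $doc;
my $alive_after_caller_release = ( $My::Doc::destroyed == 0 );

# Clearing the slot releases the last reference, so DESTROY runs.
$inverter->clear;
my $freed_after_clear = ( $My::Doc::destroyed == 1 );

print "refcount before store: $count_before, while stored: $count_held\n";
print "freed only after clear: ",
    ( $alive_after_caller_release && $freed_after_clear ? "yes" : "no" ), "\n";
```

In C there is no such safety net: if the slot were never initialized, the decrement would operate on whatever bytes happen to be there.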

> As you can see above, the segv is happening on a
> $invindexer->add_doc($doc) call (a normal doc, I tried several).

Can you tell me a little more? What does this document look like?
How long has the indexing session been running when this happens?

Although throughout most of the KS test suite $invindexer->add_doc()
gets fed a hashref rather than a Doc, there are instances where an
actual Doc gets used (in t/602-boosts.t at the least), so we have a
test already.

BTW, the instability people like you and Edward are experiencing right
now is annoying, but the refactoring is paying off. SVN trunk is now
about 30% faster on the benchmark test than the last dev release, but
the real-world gains are likely to be bigger: on the same system, t/
001-build_invindexes.t completes in 0.8 seconds for trunk vs. 7.6
seconds for the last dev release.

My guess is that improvements to Stemmer, LCNormalizer, and
PolyAnalyzer are contributing the most, but there have also been
improvements to InvIndexer, SegWriter, Inverter, and DocWriter. I'd
be surprised if everyone sees such gains, especially since KS probably
isn't the bottleneck in most indexing apps, but still... :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: revision 3552 SEGV during indexing
On Thu, July 3, 2008 12:36 am, Marvin Humphrey wrote:
> Just to verify, the whole trunk is up-to-date, not just trunk/perl,
> right?

Yes, I checked out a fresh copy and recompiled a few times. I also tried
on other nodes in the cluster and they're doing the same.

> Can you tell me a little more? What does this document look like?
> How long has the indexing session been running when this happens?

The docs are run-of-the-mill HTML files. It happens on the third file in
the run - consistently.

> Although throughout most of the KS test suite $invindexer->add_doc()
> gets fed a hashref rather than a Doc, there are instances where an
> actual Doc gets used (in t/602-boosts.t at the least), so we have a
> test already.

I've got a sneaking suspicion my code hasn't kept pace with ks in svn
(meaning I'm sure I've missed some change in how ks in svn is supposed to
be used - my indexing code hasn't changed significantly in a few months).
I'll whup together a test case and post it here.

> BTW, the instability people like you and Edward are experiencing right
> now is annoying, but the refactoring is paying off. SVN trunk is now
> about 30% faster on the benchmark test than the last dev release, but
> the real-world gains are likely to be bigger: on the same system, t/
> 001-build_invindexes.t completes in 0.8 seconds for trunk vs. 7.6
> seconds for the last dev release.

No worries - we who walk barefoot in the head in svn-land do so with full
knowledge and at our own peril ;-)

> My guess is that improvements to Stemmer, LCNormalizer, and
> PolyAnalyzer are contributing the most, but there have also been
> improvements to InvIndexer, SegWriter, Inverter, and DocWriter. I'd
> be surprised if everyone sees such gains, especially since KS probably
> isn't the bottleneck in most indexing apps, but still... :)

Indexing has become zippy indeed (not that it was slow to begin with).
Your suggestion of using HTML::Parser has been used to good effect, with
XML::LibXML rounding out a few cases where my skills with HTML::Parser are
lacking.

regards
Henry


Re: revision 3552 SEGV during indexing
There probably is an explanation for what's happening here. My indexing
code is wrapped in an eval{} to be fault tolerant (especially when parsing
HTML) and I consistently get a segv after indexing a few files.

However, if I precede the relevant code with 'print Dumper $doc;', then
indexing occurs without error...

use KinoSearch::InvIndexer;
use KinoSearch::Doc;
use Schema;
use Data::Dumper;
...

eval {
    ...
    my $doc = {};
    $doc->{title} = gettext1();
    $doc->{body} = gettext2();
    ...
    print Dumper $doc; # <----- without this, the indexer drops a core
    my $ksdoc = KinoSearch::Doc->new(
        fields => $doc,
        boost => $doc_boost);
    $invindexer->add_doc($ksdoc);
};



Weird. Does anyone have an idea why this would start happening?

regards
Henry
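
A note on the eval{} wrapper in the message above: eval traps Perl exceptions (die), but a segmentation fault is a signal delivered to the whole process, so eval cannot intercept it. The sketch below is an illustration only and does not use KinoSearch; it assumes a Unix-like system with fork, and it sends SIGSEGV to a child process as a stand-in for a crash inside compiled code.

```perl
use strict;
use warnings;
use Config;

# 1. eval traps an ordinary Perl exception.
my $caught = '';
eval { die "bad HTML\n"; 1 } or $caught = $@;
print "caught exception: $caught";

# 2. A fatal signal is different. Run it in a child so this script survives.
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ( $pid == 0 ) {
    eval {
        kill 'SEGV', $$;    # stand-in for a crash inside compiled code
        sleep 5;            # give the signal time to arrive
    };
    exit 0;                 # reached only if the signal did not kill the child
}
waitpid( $pid, 0 );

# The low 7 bits of $? hold the number of the signal that killed the child.
my $signal    = $? & 127;
my @sig_names = split ' ', $Config{sig_name};
print "child was killed by SIG$sig_names[$signal]\n";
```

This is why fault-tolerant indexing code can still drop a core: the eval protects against parse errors, not against memory errors in the C layer.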


Re: revision 3552 SEGV during indexing
On Thu, July 3, 2008 10:06 am, Henry wrote:
> print Dumper $doc; # <----- without this, the indexer drops a core

scratch that gobbledegook - if I add a line in another module or look at
the code askance, then it also crashes. Each time I think I'm seeing a
pattern, it changes.

/sigh


Re: revision 3552 SEGV during indexing
On Thu, July 3, 2008 4:49 pm, Marvin Humphrey wrote:
> Thanks for the sample. Unfortunately, I have yet to duplicate the
> problem. I tried adding a similar loop to t/305-invindexer.t (see
> below), but it ran clean both under normal conditions and under
> valgrind.
>
> Since you run Linux and this problem seems to be happening
> consistently, how about running your app under valgrind? Something
> should turn up right away.

Thanks for the prompt reply. Please find attached the results of running
valgrind with "-v --leak-check=full".

If you need me to rerun with different args, or change something, please
let me know.

Regards
Henry
Re: revision 3552 SEGV during indexing
On Thu, July 3, 2008 7:32 pm, Marvin Humphrey wrote:
> OK, I think I may have squashed the bug. I still haven't reproduced
> the problem, because that requires both an elaborate schema and luck
> in order of hash iteration. However, I found a place that could
> cause an uninitialized value error. Please try r3558.

Thanks! Will give it a try this evening or tomorrow.

Regards
Henry


Re: revision 3552 SEGV during indexing
I did a quick recompile/index on two batches of files and revision 3558
indexes without error, and consistently so.

Thanks Marvin, that was greased lightning!

Regards
Henry


Re: revision 3552 SEGV during indexing
On Jul 3, 2008, at 1:06 AM, Henry wrote:

> eval {
>     ...
>     my $doc = {};
>     $doc->{title} = gettext1();
>     $doc->{body} = gettext2();
>     ...
>     print Dumper $doc; # <----- without this, the indexer drops a core
>     my $ksdoc = KinoSearch::Doc->new(
>         fields => $doc,
>         boost => $doc_boost);
>     $invindexer->add_doc($ksdoc)
> };

Thanks for the sample. Unfortunately, I have yet to duplicate the
problem. I tried adding a similar loop to t/305-invindexer.t (see
below), but it ran clean both under normal conditions and under
valgrind.

Since you run Linux and this problem seems to be happening
consistently, how about running your app under valgrind? Something
should turn up right away.

valgrind perl myapp.pl

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Index: ../perl/t/305-invindexer.t
===================================================================
--- ../perl/t/305-invindexer.t (revision 3557)
+++ ../perl/t/305-invindexer.t (working copy)
@@ -21,6 +21,22 @@

 my $invindexer = KinoSearch::InvIndexer->new( invindex => $invindex, );

+for ( 0 .. 10 ) {
+    eval {
+        my $fields = {
+            content => "blah$_",
+        };
+        my $doc = KinoSearch::Doc->new(
+            fields => $fields,
+            boost => $_,
+        );
+        $invindexer->add_doc($doc);
+    }
+}
+$invindexer->finish;
+
+$invindexer = KinoSearch::InvIndexer->new( invindex => $invindex, );
+
 eval {
     my $lock_factory = KinoSearch::Store::LockFactory->new(
         folder => $folder,

Re: revision 3552 SEGV during indexing
OK, I think I may have squashed the bug. I still haven't reproduced
the problem, because that requires both an elaborate schema and luck
in order of hash iteration. However, I found a place that could
cause an uninitialized value error. Please try r3558.
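
On the "luck in order of hash iteration" point, a minimal sketch of why the same fields can be visited in a different order from one run or one hash to the next. Perl does not guarantee the order returned by keys, so code whose behavior depends on which field comes first can fail only intermittently. The field names here are hypothetical, and this is an illustration of the general effect rather than of the actual code path fixed in r3558.

```perl
use strict;
use warnings;

# Hypothetical field names, for illustration only.
my %fields = ( title => 'T', body => 'B', url => 'U', date => 'D' );

# Order of keys() is unspecified: it can differ between perls, between
# runs, and between hashes that were built differently.
my @as_stored = keys %fields;

# Sorting gives a stable order, useful when trying to reproduce a bug.
my @sorted = sort keys %fields;

print "as stored: @as_stored\n";
print "sorted:    @sorted\n";
```

A test case with only one or two fields is unlikely to exercise the orderings that a larger schema can produce, which may be why the small loop added to t/305-invindexer.t ran clean.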

On Jul 3, 2008, at 3:18 AM, Henry wrote:

> Thanks for the prompt reply. Please find attached the results of running
> valgrind with "-v --leak-check=full".
>
> If you need me to rerun with different args, or change something, please
> let me know.


In this case, we were just looking for errors, so the simple command
would have sufficed. But FYI, full checks look something like this:

$ valgrind --leak-check=full --show-reachable=yes \
> --suppressions=../devel/conf/p510_valgrind.supp \
> /usr/local/debugperl/bin/perl510 -Mblib t/214-spec_field.t

I'm actually surprised that your report didn't show a bunch of known
leaks related to boot up, which ordinarily have to be suppressed.
Also, unless you're running a debugging perl, Perl usually leaks like
crazy on shutdown because it just abandons everything rather than
waste time on pointless one-by-one cleanup.

But whatever... the report showed me what I needed to see.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

