Mailing List Archive

utf8 warnings/error
Hi,

I'm indexing emails, mostly spam, and I'm running into a bunch of
UTF-8 errors followed by an error from PolyAnalyzer. Here are a few of
the warnings:

Malformed UTF-8 character (unexpected non-continuation byte 0xcf,
immediately after start byte 0xfb) in subroutine entry at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77.
Malformed UTF-8 character (unexpected non-continuation byte 0xcf,
immediately after start byte 0xfb) in subroutine entry at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77.
Malformed UTF-8 character (unexpected non-continuation byte 0xea,
immediately after start byte 0xcd) in subroutine entry at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77.

If you want to see all of the warnings, let me know. And then the
error after the warnings looks like this:

[error] Caught exception in
GMail::Controller::User::Mail::Folder->begin "Error in function
XS_KinoSearch__Analysis__Tokenizer__do_analyze at
lib/KinoSearch.xs:4758: scanned past end of '
????:???????????????????(c)????????(????????????)??????????????????,??????????????????%20,

????????,??????????????????:

??????:?????? ????????:(0)13543676298

'
at /usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77
KinoSearch::Analysis::PolyAnalyzer::analyze_field('KinoSearch::Analysis::PolyAnalyzer=HASH(0x8fa0adc)',
'HASH(0x897e0d4)', 'body') called at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Index/SegWriter.pm
line 104
KinoSearch::Index::SegWriter::add_doc('KinoSearch::Index::SegWriter=HASH(0x8983774)',
'HASH(0x897e0d4)', 1) called at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/InvIndexer.pm
line 114
KinoSearch::InvIndexer::add_doc('KinoSearch::InvIndexer=HASH(0x89849a4)',
'HASH(0x897e0d4)') called at
/usr/lib/gmail_maildir/GT/Maildir/KinoSearch/Indexer.pm line 200
GT::Maildir::KinoSearch::Indexer::index('GT::Maildir::KinoSearch::Indexer=HASH(0x8b0e180)',
'/var/home/alex/alex.krohn.org/mail/alex/Maildir/./cur/1182973...')
called at GMail::Model::Maildir::Folder::index line 45
...
The rest of the stack trace is in my code.

Is there something I need to do to the strings I'm passing into add_doc?

Thanks,

Scott
utf8 warnings/error [ In reply to ]
On Aug 19, 2007, at 11:28 AM, Scott Beck wrote:

> I'm indexing emails, mostly spam, and I'm running into a bunch of
> UTF-8 errors followed by an error from PolyAnalyzer.

All of KinoSearch's tools expect to be fed valid UTF-8. It seems
that they aren't getting it.

There is a line in SegWriter that should take any field value which
does not have the SVf_UTF8 flag set and force it into UTF-8 before it
gets sent through the analysis chain.

    if ( !$field_spec->binary ) {
        utf8ify( $doc->{$field_name} );
    }

However, if the SVf_UTF8 flag is already set, utf8ify() does nothing.

What I would like to know is whether the incoming field values are
marked with the SVf_UTF8 flag, but are not truly valid UTF-8.
Strings in that state are bad news.
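
To make those states concrete, here is a minimal sketch using only core Perl and the Encode module. It is an illustration, not code from KinoSearch: the variable names are made up, and Encode::_utf8_on is used only to manufacture the broken state on purpose.

```perl
use strict;
use warnings;
use Encode ();

# 1) Byte string: flag off, one byte per character.
my $latin1      = "\xe9";    # e-acute in Latin-1
my $flag_before = utf8::is_utf8($latin1) ? 1 : 0;

# 2) Upgraded string: flag on, same character, different internal bytes.
utf8::upgrade($latin1);
my $flag_after = utf8::is_utf8($latin1) ? 1 : 0;
my $char_count = length $latin1;

# 3) The bad state: flag forced on over bytes that are not valid UTF-8.
#    0xfb 0xcf is the byte pair named in the warnings quoted earlier.
my $broken = "\xfb\xcf";
Encode::_utf8_on($broken);    # sets the flag without checking the bytes
my $broken_is_valid = utf8::valid($broken) ? 1 : 0;

print "flag before upgrade: $flag_before\n";
print "flag after upgrade:  $flag_after\n";
print "characters:          $char_count\n";
print "broken is valid:     $broken_is_valid\n";
```

Only the third string is in the inconsistent state described above; the first two are both fine to hand to an indexer.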

> Is there something I need to do to the strings I'm passing into
> add_doc?

KS is ready for two possibilities.

1) SVf_UTF8 is set, and the string is truly UTF-8.
2) SVf_UTF8 is not set. The string will be upgraded to
UTF-8, assuming a source encoding of Latin 1.

Check your source strings via this line (see
<http://perldoc.perl.org/utf8.html>):

    utf8::valid($doc->{$field_name}) or die "Bad string!";

If the string passes muster with utf8::valid() but KS still has
problems, then KS has a bug. If not, then there is a bug prior to KS
in your app.
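
If the bug does turn out to be upstream, one common approach is to decode raw bytes explicitly before indexing. The following is a rough sketch under assumptions, not a diagnosis of this thread's problem: to_characters() is a hypothetical helper, the %email hash is made-up sample data, and the Latin-1 fallback is just one possible policy for undecodable input.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn raw bytes into a character string.
sub to_characters {
    my ($bytes) = @_;
    return $bytes if utf8::is_utf8($bytes);    # already decoded

    # Try strict UTF-8 first. decode() may modify its argument when a
    # CHECK value is given, so work on a copy.
    my $copy  = $bytes;
    my $chars = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK ) };
    return $chars if defined $chars;

    # Assumed fallback: Latin-1 accepts every byte sequence.
    return Encode::decode( 'ISO-8859-1', $bytes );
}

# Made-up sample fields: one valid UTF-8, one not.
my %email = ( subject => "caf\xc3\xa9", body => "\xfb\xcf" );
for my $field ( keys %email ) {
    $email{$field} = to_characters( $email{$field} );
    utf8::valid( $email{$field} ) or die "Bad string in $field!";
}
```

Decoding explicitly keeps the choice of source encoding in the application rather than relying on the implicit Latin-1 upgrade.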

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
utf8 warnings/error [ In reply to ]
Hi Marvin,


On 8/19/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Aug 19, 2007, at 11:28 AM, Scott Beck wrote:
>
> > I'm indexing emails, mostly spam, and I'm running into a bunch of
> > UTF-8 errors followed by an error from PolyAnalyzer.
>
> All of KinoSearch's tools expect to be fed valid UTF-8. It seems
> that they aren't getting it.
>
> KS is ready for two possibilities.
<snip>
>
> 1) SVf_UTF8 is set, and the string is truly UTF-8.
> 2) SVf_UTF8 is not set. The string will be upgraded to
> UTF-8, assuming a source encoding of Latin 1.
>
> Check your source strings via this line (see
> <http://perldoc.perl.org/utf8.html>):
>
>     utf8::valid($doc->{$field_name}) or die "Bad string!";
>
> If the string passes muster with utf8::valid() but KS still has
> problems, then KS has a bug. If not, then there is a bug prior to KS
> in your app.
>

I tried this just before the add_doc in my code:

    for (keys %$email) {
        utf8::valid($email->{$_}) or die "Bad string!";
        warn "> $_ is valid utf8";
    }

I see the warn there for every field. Also I tried this just to make
sure the strings are UTF-8:

    for (keys %$email) {
        unless (utf8::is_utf8($email->{$_})) {
            utf8::upgrade($email->{$_});
        }
        utf8::valid($email->{$_}) or die "Bad string!";
        warn "> $_ is valid utf8";
    }

I get the same errors and warnings with either of these inserted just
before the add_doc().

Thanks,

Scott
utf8 warnings/error [ In reply to ]
On Aug 19, 2007, at 12:17 PM, Scott Beck wrote:

> I get the same errors and warnings with either of these inserted just
> before the add_doc().

OK, we'll have to figure out at what point what was valid UTF-8
became invalid UTF-8. Can you please send me some spam?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
utf8 warnings/error [ In reply to ]
On Aug 19, 2007, at 12:30 PM, Marvin Humphrey wrote:

>> I get the same errors and warnings with either of these inserted just
>> before the add_doc().
>
> OK, we'll have to figure out at what point what was valid UTF-8
> became invalid UTF-8. Can you please send me some spam?

Scott,

Thanks for sending me spam off-list. Unfortunately/fortunately it
seems to parse OK on my machine (stock Perl 5.8.6 on Mac OS X
10.4.10). An example script is attached (minus the spam). Can you
please fill in the blanks and try it out on your machine?

What version of Perl are you using?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: analyze_spam.pl
Type: text/x-perl-script
Size: 852 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070823/6248450e/analyze_spam.bin
-------------- next part --------------
utf8 warnings/error [ In reply to ]
Hi Marvin,

> On Aug 19, 2007, at 12:30 PM, Marvin Humphrey wrote:
>
> Scott,
>
> Thanks for sending me spam off-list. Unfortunately/fortunately it
> seems to parse OK on my machine (stock Perl 5.8.6 on Mac OS X
> 10.4.10). An example script is attached (minus the spam). Can you
> please fill in the blanks and try it out on your machine?
>

This script analyzes that email fine. I need to keep digging into the
code. I'm getting some very strange things happening. I think I'm
going to try running things through valgrind and see if there are
some memory issues. Thanks for your help.

Scott


> What version of Perl are you using?

perl 5.8.4 on linux.

Cheers,

Scott
utf8 warnings/error [ In reply to ]
Hi Marvin,

I still can't reproduce these errors on a small test case :(
I did however get some feedback from valgrind although I don't know
how helpful it is. I thought I would post it here as a follow up. I
will continue to debug this and see if I can figure it out.

valgrind errors from my tests:
==1766== Invalid read of size 1
==1766== at 0x814C06F: Perl_swash_fetch (utf8.c:1747)
==1766== by 0x813BF82: S_find_byclass (regexec.c:1248)
==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
==1766== by 0x61284B9:
XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
==1766== by 0x8064024: S_run_body (perl.c:1921)
==1766== by 0x8063AE5: perl_run (perl.c:1840)
==1766== by 0x805F69A: main (perlmain.c:86)
==1766== Address 0x630C48F is 6 bytes after a block of size 17 alloc'd
==1766== at 0x401B507: malloc (vg_replace_malloc.c:149)
==1766== by 0x80BCFEB: Perl_safesysmalloc (util.c:67)
==1766== by 0x80E1817: Perl_sv_grow (sv.c:1637)
==1766== by 0x80E6E24: Perl_sv_setsv_flags (sv.c:4019)
==1766== by 0x80EDD93: Perl_newSVsv (sv.c:7049)
==1766== by 0x814BC09: Perl_swash_fetch (utf8.c:1717)
==1766== by 0x8149F03: Perl_is_utf8_alnum (utf8.c:1191)
==1766== by 0x813BF0C: S_find_byclass (regexec.c:1246)
==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
==1766== by 0x61284B9:
XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
==1766== by 0x8064024: S_run_body (perl.c:1921)
==1766== by 0x8063AE5: perl_run (perl.c:1840)
==1766== by 0x805F69A: main (perlmain.c:86)

==1766== Invalid read of size 1
==1766== at 0x814C06F: Perl_swash_fetch (utf8.c:1747)
==1766== by 0x8145D17: S_regrepeat (regexec.c:4089)
==1766== by 0x814497E: S_regmatch (regexec.c:3732)
==1766== by 0x813EE8D: S_regtry (regexec.c:2185)
==1766== by 0x813BFA8: S_find_byclass (regexec.c:1249)
==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
==1766== by 0x61284B9:
XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
==1766== by 0x8064024: S_run_body (perl.c:1921)
==1766== by 0x8063AE5: perl_run (perl.c:1840)
==1766== by 0x805F69A: main (perlmain.c:86)
==1766== Address 0x630C48F is 6 bytes after a block of size 17 alloc'd
==1766== at 0x401B507: malloc (vg_replace_malloc.c:149)
==1766== by 0x80BCFEB: Perl_safesysmalloc (util.c:67)
==1766== by 0x80E1817: Perl_sv_grow (sv.c:1637)
==1766== by 0x80E6E24: Perl_sv_setsv_flags (sv.c:4019)
==1766== by 0x80EDD93: Perl_newSVsv (sv.c:7049)
==1766== by 0x814BC09: Perl_swash_fetch (utf8.c:1717)
==1766== by 0x8149F03: Perl_is_utf8_alnum (utf8.c:1191)
==1766== by 0x813BF0C: S_find_byclass (regexec.c:1246)
==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
==1766== by 0x61284B9:
XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
==1766== by 0x8064024: S_run_body (perl.c:1921)
==1766== by 0x8063AE5: perl_run (perl.c:1840)
==1766== by 0x805F69A: main (perlmain.c:86)

I don't know if this is related but after I index and then do a
delete/insert, my index is really broken. I wrote a small
command-line tool to test with, like the mysql command line. Here is
what searches return after all this:

kino> flag_deleted:0
flag_deleted path subject
0 ./new/1182974233.2388.1.vmware.nmsrv.com,S=1598 Re:Re:
Free Porn NOW
Hits 1
kino> flag_deleted:1
flag_deleted path subject
1 .Drafts/new/1187976985.1766.4.vmware.nmsrv.com,S=205:2,DT
Hits 1
kino> subject:a
flag_deleted path subject
0 ./cur/1182972472.1867.1.vmware.nmsrv.com,S=2454 Hot sex
with Viagra pills
0 ./cur/1182972878.1983.1.vmware.nmsrv.com,S=1410 Hello!
0 ./cur/1182973219.2096.1.vmware.nmsrv.com,S=1704 We are
here for you to live a healthier and happier life!
Hits 3
kino>

As you can see the search for "a" shows 3 results with flag_deleted=0
but the search for flag_deleted:0 only shows one result. And actually
there should be 57 results in the database which I can see from this
tool before I do the delete/insert from the database.

I will continue to try and reduce the problem to as small a case as
possible. Thanks for all your time and effort.

Scott
utf8 warnings/error [ In reply to ]
Hi,

I have a reproducible test case that seems to corrupt the index.
You can download the test case here:
http://devmagic.org/kinotest.tar.gz

Or you can browse it here:
http://devmagic.org/kinotest

test.pl is the code. The data file is a data dump of a bunch of
emails, test.pl inserts that data.

The kino file is the command line tool I've been using to do simple
test queries with, it has a db path hardcoded at the top, so if you
want to play with that you will need to modify it (it's unrelated to
the test case).

Running test.pl on my system gives these results:
Hits for flag_deleted:"0": 58
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 44
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Hits for flag_deleted:"0": 45
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 44
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Hits for flag_deleted:"0": 45
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 1
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Hits for flag_deleted:"0": 2

I insert the data, then run the query flag_deleted:"0", hits is 58.
Then I remove one item from the index and run that query again, hits
is now 44. I repeat these additions/deletions 3 times. As you can see
the third time we end up with only 2 results.
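
For reference, the counts a correct index should report across those rounds can be stated with a toy model. This is only a sketch: the %index hash is an in-memory stand-in, not how KinoSearch stores documents, and the path names are hypothetical.

```perl
use strict;
use warnings;

# Toy in-memory index keyed by path, with 58 documents.
my %index = map { ( "./cur/msg$_" => { flag_deleted => 0 } ) } 1 .. 58;
my $target = './cur/msg7';    # hypothetical path

# Count documents whose flag_deleted field equals the given value.
sub hits_for_flag {
    my ($value) = @_;
    return scalar grep { $_->{flag_deleted} == $value } values %index;
}

my @counts = ( hits_for_flag(0) );    # count before any change
for my $round ( 1 .. 3 ) {
    delete $index{$target};
    push @counts, hits_for_flag(0);    # one fewer after the delete
    $index{$target} = { flag_deleted => 0 };
    push @counts, hits_for_flag(0);    # full count again after the add
}
print "@counts\n";
```

The counts should alternate between 57 and 58 on every round; anything else means documents other than the target were affected.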

Please let me know if you see a bug in my code/logic.

Thanks,

Scott
utf8 warnings/error [ In reply to ]
On Aug 24, 2007, at 10:54 AM, Scott Beck wrote:

> I still can't reproduce these errors on a small test case :(

I know the feeling, and I appreciate your attempts.

There are a few variables in KS that are handy for dialing down the
scale and exposing large problems with small datasets.

Here's a snippet from buildlib/TestSchema.pm...

    # Expose problems faced by much larger indexes by using absurdly low
    # values for index_interval and skip_interval.
    sub index_interval {5}
    sub skip_interval {3}

... and another from buildlib/KinoTestUtils.pm:

    # set mem_thresh to 1 kiB in order to expose problems with flushing
    $KinoSearch::Index::PostingsWriter::instance_vars{mem_thresh} = 0x400;

That last one affects the threshold that triggers the external
sorter, and is probably the most useful.
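
The general technique can be sketched in isolation. In the example below, TinyBuffer is a hypothetical stand-in class, not part of KinoSearch; it only borrows the idea of a class-level mem_thresh default that a test lowers so that the flush path runs on a small dataset.

```perl
use strict;
use warnings;

package TinyBuffer;    # hypothetical stand-in, not a KinoSearch class

our %instance_vars = ( mem_thresh => 0x1000000 );    # default: 16 MiB

sub new {
    my ($class) = @_;
    my $self = { %instance_vars, items => [], bytes => 0, flushes => 0 };
    return bless $self, $class;
}

# Buffer an item and flush once the byte count reaches the threshold.
sub add {
    my ( $self, $item ) = @_;
    push @{ $self->{items} }, $item;
    $self->{bytes} += length $item;
    $self->flush if $self->{bytes} >= $self->{mem_thresh};
    return;
}

sub flush {
    my ($self) = @_;
    $self->{items} = [];
    $self->{bytes} = 0;
    $self->{flushes}++;
    return;
}

package main;

# With the default threshold, 50 small items never reach flush().
my $default = TinyBuffer->new;
$default->add( 'x' x 100 ) for 1 .. 50;

# Lower the threshold to 1 KiB so the same data flushes repeatedly.
$TinyBuffer::instance_vars{mem_thresh} = 0x400;
my $tiny = TinyBuffer->new;
$tiny->add( 'x' x 100 ) for 1 .. 50;

print "default flushes: $default->{flushes}\n";
print "tiny flushes:    $tiny->{flushes}\n";
```

With the lowered threshold the flush code runs several times on only 5,000 bytes of input, which is what makes small test cases able to reach bugs that normally need large indexes.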

> I did however get some feedback from valgrind although I don't know
> how helpful it is. I thought I would post it here as a follow up. I
> will continue to debug this and see if I can figure it out.

This readout is strange. It implies that the regular expression
matcher is attempting to match on parts of the string that are not
allocated.

While the Tokenizer hacks a bit into Perl's internals, it's not doing
anything outlandish -- just the equivalent of m/$pat/g. It starts at
the top of the string, asks the regular expression engine to find the
first match, and keeps asking until all matches have been collected.
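
In pure Perl, that loop looks roughly like the sketch below. It is not the XS implementation: tokenize() is a hypothetical function, and the token pattern shown is an assumption chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl tokenizer: collect every match of $pat in
# $text, along with its start and end character offsets.
sub tokenize {
    my ( $text, $pat ) = @_;
    my @tokens;
    while ( $text =~ /$pat/g ) {
        push @tokens, {
            text  => substr( $text, $-[0], $+[0] - $-[0] ),
            start => $-[0],
            end   => $+[0],
        };
    }
    return @tokens;
}

# Assumed pattern: runs of word characters, allowing inner apostrophes.
my $pat    = qr/\w+(?:'\w+)*/;
my @tokens = tokenize( "Hello! We are here", $pat );
print join( ' ', map { "$_->{text}\@$_->{start}" } @tokens ), "\n";
```

Each pass through the while loop resumes from where the previous match ended, so the engine should never be asked to look past the end of the string.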

I think by Perl 5.8.4 all the nastiest Unicode bugs should be gone.

> valgrind errors from my tests:
> ==1766== Invalid read of size 1
> ==1766== at 0x814C06F: Perl_swash_fetch (utf8.c:1747)
> ==1766== by 0x813BF82: S_find_byclass (regexec.c:1248)
> ==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766== by 0x61284B9:
> XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766== by 0x8064024: S_run_body (perl.c:1921)
> ==1766== by 0x8063AE5: perl_run (perl.c:1840)
> ==1766== by 0x805F69A: main (perlmain.c:86)
> ==1766== Address 0x630C48F is 6 bytes after a block of size 17
> alloc'd
> ==1766== at 0x401B507: malloc (vg_replace_malloc.c:149)
> ==1766== by 0x80BCFEB: Perl_safesysmalloc (util.c:67)
> ==1766== by 0x80E1817: Perl_sv_grow (sv.c:1637)
> ==1766== by 0x80E6E24: Perl_sv_setsv_flags (sv.c:4019)
> ==1766== by 0x80EDD93: Perl_newSVsv (sv.c:7049)
> ==1766== by 0x814BC09: Perl_swash_fetch (utf8.c:1717)
> ==1766== by 0x8149F03: Perl_is_utf8_alnum (utf8.c:1191)
> ==1766== by 0x813BF0C: S_find_byclass (regexec.c:1246)
> ==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766== by 0x61284B9:
> XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766== by 0x8064024: S_run_body (perl.c:1921)
> ==1766== by 0x8063AE5: perl_run (perl.c:1840)
> ==1766== by 0x805F69A: main (perlmain.c:86)
>
> ==1766== Invalid read of size 1
> ==1766== at 0x814C06F: Perl_swash_fetch (utf8.c:1747)
> ==1766== by 0x8145D17: S_regrepeat (regexec.c:4089)
> ==1766== by 0x814497E: S_regmatch (regexec.c:3732)
> ==1766== by 0x813EE8D: S_regtry (regexec.c:2185)
> ==1766== by 0x813BFA8: S_find_byclass (regexec.c:1249)
> ==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766== by 0x61284B9:
> XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766== by 0x8064024: S_run_body (perl.c:1921)
> ==1766== by 0x8063AE5: perl_run (perl.c:1840)
> ==1766== by 0x805F69A: main (perlmain.c:86)
> ==1766== Address 0x630C48F is 6 bytes after a block of size 17
> alloc'd
> ==1766== at 0x401B507: malloc (vg_replace_malloc.c:149)
> ==1766== by 0x80BCFEB: Perl_safesysmalloc (util.c:67)
> ==1766== by 0x80E1817: Perl_sv_grow (sv.c:1637)
> ==1766== by 0x80E6E24: Perl_sv_setsv_flags (sv.c:4019)
> ==1766== by 0x80EDD93: Perl_newSVsv (sv.c:7049)
> ==1766== by 0x814BC09: Perl_swash_fetch (utf8.c:1717)
> ==1766== by 0x8149F03: Perl_is_utf8_alnum (utf8.c:1191)
> ==1766== by 0x813BF0C: S_find_byclass (regexec.c:1246)
> ==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766== by 0x61284B9:
> XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766== by 0x8064024: S_run_body (perl.c:1921)
> ==1766== by 0x8063AE5: perl_run (perl.c:1840)
> ==1766== by 0x805F69A: main (perlmain.c:86)
>
> I don't know if this is related but after I index and then do a
> delete/insert, my index is really broken.

Yeah. Unfortunately, that one looks like a real KS bug. I can
verify that merging segments with deletions can produce index
corruption. :(

I believe that the code that's to blame is new to KS 0.20_04.
0.20_04 contains a refactoring of KinoSearch's external sorter to
perform some of its own memory management, using some innovations
recently uncovered by Lucene developer Michael McCandless as he
implemented a variant of the KinoSearch merge model in Lucene. This
led to improved speed, but at the expense of increased complexity,
and somewhere hidden in that complexity is a nasty little bug.

> I will continue to try and reduce the problem to as small a case as
> possible. Thanks for all your time and effort.

And thank you for yours. I'm sorry I was not able to be more
responsive during the week, but things are starting to lighten up a bit.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
utf8 warnings/error [ In reply to ]
Hi Marvin,

On 8/25/07, Marvin Humphrey <marvin@rectangular.com> wrote:
<snip>
> I believe that the code that's to blame is new to KS 0.20_04.
> 0.20_04 contains a refactoring of KinoSearch's external sorter to
> perform some of its own memory management, using some innovations
> recently uncovered by Lucene developer Michael McCandless as he
> implemented a variant of the KinoSearch merge model in Lucene. This
> led to improved speed, but at the expense of increased complexity,
> and somewhere hidden in that complexity is a nasty little bug.
>

I removed the changes to the sorting and patched 0.20_03 with all the
other changes and I still see index corruption when deleting. I don't
know if it helps you to know this, but if I turn off optimize on the
finish() calls, I only lose 1 record instead of most of the index.
I'll continue to debug. If you figure it out please let me know.

Thanks,

Scott
Re: utf8 warnings/error [ In reply to ]
On Aug 24, 2007, at 2:34 PM, Scott Beck wrote:

> I have a reproducible test case that seems to corrupt the index.

Thanks for this. I've managed to reduce it down to a very small
failing test case.

http://www.rectangular.com/svn/kinosearch/trunk/perl/t/218-del_merging.t

Next, I'll dive into the C code. Hopefully a fix is not far off.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: utf8 warnings/error [ In reply to ]
On Aug 27, 2007, at 8:57 PM, Scott Beck wrote:

> If you figure it out please let me know.

I believe that the problem may have been solved. Please try
repository revision 2501 and let me know how it goes.

svn co -r 2501 http://www.rectangular.com/svn/kinosearch/trunk ks

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: utf8 warnings/error [ In reply to ]
Hi Marvin,

My test still fails on this. Also the new test you added fails as well:

t/218-del_merging................ok 1/4
# Failed test 'match all docs after deletion'
# at t/218-del_merging.t line 41.
t/218-del_merging................NOK 4/4# got: '1'
# expected: '2'
# Looks like you failed 1 test of 4.
t/218-del_merging................dubious
Test returned status 1 (wstat 256, 0x100)
DIED. FAILED test 4
Failed 1/4 tests, 75.00% okay


vmware perl # perl -V
Summary of my perl5 (revision 5 version 8 subversion 4) configuration:
Platform:
osname=linux, osvers=2.6.11.7-gt, archname=i686-linux
uname='linux vmware.nmsrv.com 2.6.11.7-gt #1 smp mon apr 18
21:39:05 pdt 2005 i686 amd opteron(tm) processor 246 authenticamd
gnulinux '
config_args='-des -Darchname=i686-linux -Dcccdlflags=-fPIC
-Dccdlflags=-rdynamic -Dcc=gcc -Dprefix=/usr -Dvendorprefix=/usr
-Dsiteprefix=/usr -Dlocincpth= -Doptimize=-g -O0 -Duselargefiles
-Dd_dosuid -Dd_semctl_semun -Dscriptdir=/usr/bin
-Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3
-Dinstallman1dir=/usr/share/man/man1
-Dinstallman3dir=/var/portage/tmp/portage/perl-5.8.4-r2/image//usr/share/man/man3
-Dman1ext=1 -Dman3ext=3pm -Dcf_by=Gentoo -Ud_csh -Di_ndbm -Di_gdbm
-Di_db'
hint=recommended, useposix=true, d_sigaction=define
usethreads=undef use5005threads=undef useithreads=undef
usemultiplicity=undef
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='gcc', ccflags ='-DDEBUGGING -fno-strict-aliasing
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-g -O0',
cppflags='-DPERL5 -DDEBUGGING -fno-strict-aliasing'
ccversion='', gccversion='3.3.3 20040412 (Gentoo Linux 3.3.3-r6,
ssp-3.3.2-2, pie-8.7.6)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='gcc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib
libs=-lpthread -lnsl -lndbm -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
libc=/lib/libc-2.3.5.so, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version='2.3.5'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'


Characteristics of this binary (from libperl):
Compile-time options: DEBUGGING USE_LARGE_FILES
Built under linux
Compiled at Aug 26 2007 19:37:08
@INC:
/etc/perl
/usr/lib/perl5/site_perl/5.8.4/i686-linux
/usr/lib/perl5/site_perl/5.8.4
/usr/lib/perl5/site_perl/5.8.3/i686-linux
/usr/lib/perl5/site_perl/5.8.3
/usr/lib/perl5/site_perl
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux
/usr/lib/perl5/vendor_perl/5.8.4
/usr/lib/perl5/vendor_perl/5.8.3/i686-linux
/usr/lib/perl5/vendor_perl/5.8.3
/usr/lib/perl5/vendor_perl
/usr/lib/perl5/5.8.4/i686-linux
/usr/lib/perl5/5.8.4
/usr/local/lib/site_perl
/usr/lib/perl5/site_perl/5.8.3/i686-linux
/usr/lib/perl5/site_perl/5.8.3
.
vmware perl #

Thanks,

Scott

Re: utf8 warnings/error [ In reply to ]
On Sep 3, 2007, at 4:03 AM, Scott Beck wrote:

> My test still fails on this. Also the new test you added fails as
> well:
>
> t/218-del_merging................ok 1/4
> # Failed test 'match all docs after deletion'
> # at t/218-del_merging.t line 41.
> t/218-del_merging................NOK 4/4# got: '1'
> # expected: '2'
> # Looks like you failed 1 test of 4.
> t/218-del_merging................dubious
> Test returned status 1 (wstat 256, 0x100)
> DIED. FAILED test 4
> Failed 1/4 tests, 75.00% okay

Thanks for volunteering that bit of information. It helped me to
locate the problem quickly: I asked you to use revision 2501, but
the relevant revision was 2502. <:-(

Please try 2502 instead.

svn co -r 2502 http://www.rectangular.com/svn/kinosearch/trunk ks

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: utf8 warnings/error [ In reply to ]
Hi Marvin,

The test you made for KinoSearch passed with that revision, but my
test still fails. It now fails on the second insert/delete instead of
the first:

vmware kinotest # perl test.pl
unlink ./testdb/_1s.cf
unlink ./testdb/segments_1t.yaml
rmdir ./testdb
Kino VERSION: 0.20_04
Hits for flag_deleted:"0": 58
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 57
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Hits for flag_deleted:"0": 58
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 45
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Hits for flag_deleted:"0": 46
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 1
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Hits for flag_deleted:"0": 2
vmware kinotest #

I'll see if I can't work on getting a more simplified version of my
test case that fails. If you want to look at the test case I'm
currently using it's here:
http://devmagic.org/kinotest/
you can download it here:
http://devmagic.org/kinotest.tar.gz

Cheers,

Scott

On 9/3/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Sep 3, 2007, at 4:03 AM, Scott Beck wrote:
>
> > My test still fails on this. Also the new test you added fails as
> > well:
> >
> > t/218-del_merging................ok 1/4
> > # Failed test 'match all docs after deletion'
> > # at t/218-del_merging.t line 41.
> > t/218-del_merging................NOK 4/4# got: '1'
> > # expected: '2'
> > # Looks like you failed 1 test of 4.
> > t/218-del_merging................dubious
> > Test returned status 1 (wstat 256, 0x100)
> > DIED. FAILED test 4
> > Failed 1/4 tests, 75.00% okay
>
> Thanks for volunteering that bit of information. It helped me to
> locate the problem quickly: I asked you to use revision 2501, but
> the relevant revision was 2502. <:-(
>
> Please try 2502 instead.
>
> svn co -r 2502 http://www.rectangular.com/svn/kinosearch/trunk ks
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
>

Re: utf8 warnings/error [ In reply to ]
On Sep 3, 2007, at 3:49 PM, Scott Beck wrote:

> The test you made for KinoSearch passed with that revision, but my
> test still fails. It now fails on the second insert/delete instead of
> the first:

Strange. Take a look at the output below my sig. Does it look like
what you were expecting?

> I'll see if I can't work on getting a more simplified version of my
> test case that fails. If you want to look at the test case I'm
> currently using it's here:
> http://devmagic.org/kinotest/
> you can download it here:
> http://devmagic.org/kinotest.tar.gz

Yes, I was working with it. The methodology I used was to keep
subtracting data from your test case until I had something nice and
small... then go back and see if fixing the small problem also solved
the large problem. I'd achieved the result below before asking you
to try things out.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


slothbear:~/projects/ks_variants/ks_pudgefix/perl marvin$ perl -Mblib
test.pl
unlink ./testdb/_1p.cf
unlink ./testdb/segments_1q.yaml
rmdir ./testdb
Hits for flag_deleted:"0": 58
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 57
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 57
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 57
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 57
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 57
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 57
slothbear:~/projects/ks_variants/ks_pudgefix/perl marvin$






Re: utf8 warnings/error [ In reply to ]
Hi Marvin,

On 9/3/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Sep 3, 2007, at 3:49 PM, Scott Beck wrote:
>
> > The test you made for KinoSearch passed with that revision, but my
> > test still fails. It now fails on the second insert/delete instead of
> > the first:
>
> Strange. Take a look at the output below my sig. Does it look like
> what you were expecting?
>

In your output it looks like the record to be added/deleted is deleted
at the start but never added back again.

Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0

should be:
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1

after:
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093

So the deletes after the first one never actually happen, because the
record is never added back. Strange, though: is that output from the
same revision you had me test with? If so, it's very odd that our
outputs would be different.

<snip>
>
>
> slothbear:~/projects/ks_variants/ks_pudgefix/perl marvin$ perl -Mblib test.pl
> unlink ./testdb/_1p.cf
> unlink ./testdb/segments_1q.yaml
> rmdir ./testdb
> Hits for flag_deleted:"0": 58
> Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
> Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
> Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
> Hits for flag_deleted:"0": 57
> Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093

Everything is OK up to here. Why isn't the record there right after
adding it? Very strange.

> Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
> Hits for flag_deleted:"0": 57
> Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
> Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
> Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
> Hits for flag_deleted:"0": 57
> Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
> Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
> Hits for flag_deleted:"0": 57
> Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
> Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
> Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
> Hits for flag_deleted:"0": 57
> Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
> Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
> Hits for flag_deleted:"0": 57
> slothbear:~/projects/ks_variants/ks_pudgefix/perl marvin$
>

Thanks,

Scott

Re: utf8 warnings/error [ In reply to ]
Hi Marvin,

I was able to get your test to fail by repeating the add/delete 3
times. I've attached the test.

Cheers,

Scott
Re: utf8 warnings/error [ In reply to ]
On Sep 3, 2007, at 5:22 PM, Scott Beck wrote:

> I was able to get your test to fail by repeating the add/delete 3
> times. I've attached the test.

Thanks, Scott. I've made this test pass as of revision 2507.
However, I'm still getting the same results for the test.pl that you
sent.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: utf8 warnings/error [ In reply to ]
On Sep 4, 2007, at 11:44 AM, Marvin Humphrey wrote:

> However, I'm still getting the same results for the test.pl that
> you sent.

Figured it out. This line produces undef in my version of perl...

my $data = do "data";

... so I'd renamed the file to "data.pl" and changed the line to this...

my $data = do "data.pl";

... but only in one place. Turns out there's another. Once it's
changed in both places, I get what looks like correct results under
revision 2507.
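
For anyone who runs into the same silent undef: below is a minimal sketch of a wrapper around do FILE that reports why loading failed, following the error-checking pattern described in perlfunc. The helper name load_data, the temporary data.pl file and its contents are assumptions made for illustration; none of them come from the actual test.pl.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of the original test.pl): load a Perl
# data file with do() and die with a reason instead of returning undef.
sub load_data {
    my ($file) = @_;

    # An absolute path keeps do() from searching @INC for the file.
    my $path = File::Spec->rel2abs($file);
    my $data = do $path;

    if ( !defined $data ) {
        # $@ is set when the file was read but failed to compile or run.
        die "couldn't parse $path: $@" if $@;
        # Otherwise the file could not be read (or it evaluated to undef).
        die "couldn't load $path: $!";
    }
    return $data;
}

# Usage: write a small stand-in data file, then load it.
my $dir     = tempdir( CLEANUP => 1 );
my $tmpfile = File::Spec->catfile( $dir, 'data.pl' );

open my $fh, '>', $tmpfile or die "can't write $tmpfile: $!";
print {$fh} "my %d = ( path => './cur/example' );\n\\%d;\n";
close $fh or die "can't close $tmpfile: $!";

my $data = load_data($tmpfile);
print "$data->{path}\n";
```

Routing every load through a single helper like this also means a rename such as "data" to "data.pl" only has to be made in one place.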

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

slothbear:~/projects/ks_variants/ks_pudgefix/perl marvin$ perl -Mblib test.pl
unlink ./testdb/_1p.cf
unlink ./testdb/segments_1q.yaml
rmdir ./testdb
Hits for flag_deleted:"0": 58
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 57
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Hits for flag_deleted:"0": 58
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 57
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Hits for flag_deleted:"0": 58
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Removed ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 0
Hits for flag_deleted:"0": 57
Adding ./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093
Hits for path:"./cur/1182974671.2581.1.vmware.nmsrv.com,S=11093": 1
Hits for flag_deleted:"0": 58
slothbear:~/projects/ks_variants/ks_pudgefix/perl marvin$




Re: utf8 warnings/error [ In reply to ]
Hi Marvin,

This does indeed fix the delete issue I've been having. Thanks a lot
for all your hard work on this bug and for KinoSearch! Back to working
on my own bugs ;)

Cheers,

Scott
