Hi,
I'm indexing emails, mostly spam, and I'm running into a bunch of
UTF-8 warnings followed by an error from PolyAnalyzer. Here are a few of
the warnings:
Malformed UTF-8 character (unexpected non-continuation byte 0xcf,
immediately after start byte 0xfb) in subroutine entry at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77.
Malformed UTF-8 character (unexpected non-continuation byte 0xcf,
immediately after start byte 0xfb) in subroutine entry at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77.
Malformed UTF-8 character (unexpected non-continuation byte 0xea,
immediately after start byte 0xcd) in subroutine entry at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77.
If you want to see all of the warnings, let me know. And then the
error after the warnings looks like this:
[error] Caught exception in
GMail::Controller::User::Mail::Folder->begin "Error in function
XS_KinoSearch__Analysis__Tokenizer__do_analyze at
lib/KinoSearch.xs:4758: scanned past end of '
????:???????????????????(c)????????(????????????)??????????????????,??????????????????%20,
????????,??????????????????:
??????:?????? ????????:(0)13543676298
'
at /usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77
KinoSearch::Analysis::PolyAnalyzer::analyze_field('KinoSearch::Analysis::PolyAnalyzer=HASH(0x8fa0adc)',
'HASH(0x897e0d4)', 'body') called at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Index/SegWriter.pm
line 104
KinoSearch::Index::SegWriter::add_doc('KinoSearch::Index::SegWriter=HASH(0x8983774)',
'HASH(0x897e0d4)', 1) called at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/InvIndexer.pm
line 114
KinoSearch::InvIndexer::add_doc('KinoSearch::InvIndexer=HASH(0x89849a4)',
'HASH(0x897e0d4)') called at
/usr/lib/gmail_maildir/GT/Maildir/KinoSearch/Indexer.pm line 200
GT::Maildir::KinoSearch::Indexer::index('GT::Maildir::KinoSearch::Indexer=HASH(0x8b0e180)',
'/var/home/alex/alex.krohn.org/mail/alex/Maildir/./cur/1182973...')
called at GMail::Model::Maildir::Folder::index line 45
...
The rest of the stack trace is in my code.
Is there something I need to do to the strings I'm passing into add_doc?
Thanks,
Scott