Mailing List Archive

Error: Maximum token length is 65535
Hi All

Hi Marvin, this is really great work and truly appreciated.

I'm using KS 0.162. When using the following code, the error below is produced:

My Definitions
my $stemmer    = KinoSearch::Analysis::Stemmer->new( language => 'en' );
my $stopalizer = KinoSearch::Analysis::Stopalizer->new( language => 'en' );
my $analyzer   = KinoSearch::Analysis::PolyAnalyzer->new(
    analyzers => [ $stemmer, $stopalizer ],
);


The Error
Maximum token length is 65535; got 107462 at /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/Index/SegWriter.pm line 82
KinoSearch::Index::SegWriter::add_doc('KinoSearch::Index::SegWriter=HASH(0x852d47c)', 'KinoSearch::Document::Doc=HASH(0x852cf90)') called at /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/InvIndexer.pm line 224
KinoSearch::InvIndexer::add_doc('KinoSearch::InvIndexer=HASH(0x8546d7c)', 'KinoSearch::Document::Doc=HASH(0x852cf90)')

If I comment out the $stemmer and $stopalizer definitions and use the code below instead, it works perfectly, but that clearly won't allow for stemming or stopword removal. :-(
my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(language => 'en');

Could anyone suggest a possible workaround for this? Your assistance is greatly appreciated.

Regards,
Riyaad
Re: Error: Maximum token length is 65535
On Jul 14, 2008, at 6:18 AM, Riyaad Miller wrote:

> I'm using KS 0.162. When using the following code, the error below
> is produced:
>
> My Definitions
> my $stemmer = KinoSearch::Analysis::Stemmer->new( language => 'en' );
> my $stopalizer = KinoSearch::Analysis::Stopalizer->new( language => 'en' );
> my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
>     analyzers => [ $stemmer, $stopalizer ],
> );
>
> The Error
> Maximum token length is 65535; got 107462

You have a PolyAnalyzer which contains a Stemmer and a Stopalizer, but
not a Tokenizer. Thus, the entire field value, all 107462 characters
of it, is the only token.

Theoretically, if KS had completed indexing successfully rather than
choking on that value, and at search-time someone were to type in the
appropriate 100,000+ character search string, you might get a hit.

Whatever those 107462 characters are, I can guarantee you that nothing
that long exists in the English stop list. Similarly, I doubt the
Stemmer has anything useful to say about the last few characters of
that field.

You really need a Tokenizer. You probably also want an LCNormalizer
in there unless you really want searches to be case-sensitive.

my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new;
my $tokenizer     = KinoSearch::Analysis::Tokenizer->new;
my $stemmer       = KinoSearch::Analysis::Stemmer->new(
    language => 'en',
);
my $stopalizer    = KinoSearch::Analysis::Stopalizer->new(
    language => 'en',
);
my $analyzer      = KinoSearch::Analysis::PolyAnalyzer->new(
    analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer ],
);
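For completeness, here is a sketch of how that analyzer gets wired into an indexer, following the KS 0.1x InvIndexer API visible in the stack trace above. The invindex path and the 'content' field name are placeholders I've made up for illustration; check the method names against the docs for your installed version.

```perl
use KinoSearch::InvIndexer;

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => '/path/to/invindex',   # placeholder path
    create   => 1,
    analyzer => $analyzer,             # the PolyAnalyzer built above
);

# Declare the field(s) to be indexed; 'content' is a hypothetical name.
$invindexer->spec_field( name => 'content' );

# Index one document: the analyzer chain lowercases, tokenizes,
# strips stopwords, and stems the field value before indexing.
my $doc = $invindexer->new_doc;
$doc->set_value( content => $text );
$invindexer->add_doc($doc);

$invindexer->finish;
```

Because the Tokenizer splits the field into word-sized tokens before the Stopalizer and Stemmer see it, no single token can approach the 65535-byte limit.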

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Error: Maximum token length is 65535
Hi Marvin

Thank you for the help. I did as mentioned and it worked brilliantly.
We're not worthy ... :-)

Regards
Riyaad

