Mailing List Archive

Error: Maximum token length is 65535
Hi All

Hi Marvin, this is really great work and truly appreciated.

I'm using KS 0.162. When using the following code, the error below is produced:

My Definitions
my $stemmer    = KinoSearch::Analysis::Stemmer->new( language => 'en' );
my $stopalizer = KinoSearch::Analysis::Stopalizer->new( language => 'en' );
my $analyzer   = KinoSearch::Analysis::PolyAnalyzer->new(
    analyzers => [ $stemmer, $stopalizer ],
);


The Error
Maximum token length is 65535; got 107462 at /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/Index/SegWriter.pm line 82
KinoSearch::Index::SegWriter::add_doc('KinoSearch::Index::SegWriter=HASH(0x852d47c)', 'KinoSearch::Document::Doc=HASH(0x852cf90)') called at /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/InvIndexer.pm line 224
KinoSearch::InvIndexer::add_doc('KinoSearch::InvIndexer=HASH(0x8546d7c)', 'KinoSearch::Document::Doc=HASH(0x852cf90)')

If I comment out the $stemmer and $stopalizer definitions and use the code below instead, it works perfectly, but that clearly won't allow for stemming or stopword removal. :-(
my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(language => 'en');

Could anyone suggest a possible workaround for this? Your assistance is greatly appreciated.

Regards,
Riyaad
Re: Error: Maximum token length is 65535
On Jul 14, 2008, at 6:18 AM, Riyaad Miller wrote:

> I'm using KS 0.162. When using the following code, the error below
> is produced:
>
> My Definitions
> my $stemmer = KinoSearch::Analysis::Stemmer->new( language => 'en' );
> my $stopalizer = KinoSearch::Analysis::Stopalizer->new( language => 'en' );
> my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
>     analyzers => [ $stemmer, $stopalizer ],
> );
>
> The Error
> Maximum token length is 65535; got 107462

You have a PolyAnalyzer which contains a Stemmer and a Stopalizer, but
not a Tokenizer. Thus, the entire field value, all 107462 characters
of it, is the only token.

Theoretically, if KS had completed indexing successfully rather than
choking on that value, and at search-time someone were to type in the
appropriate 100,000+ character search string, you might get a hit.

Whatever those 107462 characters are, I can guarantee you that nothing
that long exists in the English stop list. Similarly, I doubt the
Stemmer has anything useful to say about the last few characters of
that field.

You really need a Tokenizer. You probably also want an LCNormalizer
in there unless you really want searches to be case-sensitive.

my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new;
my $tokenizer     = KinoSearch::Analysis::Tokenizer->new;
my $stemmer       = KinoSearch::Analysis::Stemmer->new(
    language => 'en',
);
my $stopalizer    = KinoSearch::Analysis::Stopalizer->new(
    language => 'en',
);
my $analyzer      = KinoSearch::Analysis::PolyAnalyzer->new(
    analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer ],
);
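For completeness, here is a sketch of how that analyzer gets wired into an indexer, following the KS 0.1x InvIndexer API visible in the stack trace above. The invindex path and the 'content' field name are placeholders I've made up for illustration; check the method names against the docs for your installed version.

```perl
use KinoSearch::InvIndexer;

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => '/path/to/invindex',   # placeholder path
    create   => 1,
    analyzer => $analyzer,             # the PolyAnalyzer built above
);

# Declare the field(s) to be indexed; 'content' is a hypothetical name.
$invindexer->spec_field( name => 'content' );

# Index one document: the analyzer chain lowercases, tokenizes,
# strips stopwords, and stems the field value before indexing.
my $doc = $invindexer->new_doc;
$doc->set_value( content => $text );
$invindexer->add_doc($doc);

$invindexer->finish;
```

Because the Tokenizer splits the field into word-sized tokens before the Stopalizer and Stemmer see it, no single token can approach the 65535-byte limit.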

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Error: Maximum token length is 65535
Hi Marvin

Thank you for the help. I did as mentioned and it worked brilliantly.
We're not worthy ... :-)

Regards
Riyaad

