Mailing List Archive

Buffer size for SegmentingTokenizerBase
Hi,

Could someone explain why the SegmentingTokenizerBase class uses a buffer of only 1024 characters? A comment in the source code mentions that a token may be truncated if no safe stopping index can be found among the characters currently in the buffer.

It doesn't seem reasonable to assume that a sentence is never longer than 1024 characters, or that a safe stopping point, such as a newline, can always be found within the buffered text.
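
To make the concern concrete, here is a minimal sketch of the behavior as I understand it from that comment. It is written in Python for brevity and is not the actual Lucene code: the function name chunk_text, its parameters, and the set of "safe" characters are my own assumptions, chosen only for illustration.

```python
def chunk_text(text, buffer_size=1024, safe_enders="\n.!?"):
    """Split text into chunks of at most buffer_size characters.

    Hypothetical illustration only: each chunk ends just after the last
    "safe" character found in the current window. If the window contains
    no safe character, the chunk is cut at the buffer boundary, which may
    fall in the middle of a token.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + buffer_size, len(text))
        if end < len(text):
            # Scan backwards through the window for the last safe stop.
            safe = -1
            for i in range(end - 1, start - 1, -1):
                if text[i] in safe_enders:
                    safe = i
                    break
            if safe >= 0:
                end = safe + 1  # keep the safe character in this chunk
            # Otherwise: no safe stop, so cut at the buffer boundary.
        chunks.append(text[start:end])
        start = end
    return chunks


# A 2000-character run with no safe stop is split at the 1024 boundary.
long_token = "x" * 2000
print([len(c) for c in chunk_text(long_token)])
```

In this sketch, a run of text longer than the buffer with no safe character is simply cut at the boundary, which is the truncation case I am asking about.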

Thanks,

Guan
