Hi,
Could someone explain why the class SegmentingTokenizerBase uses a buffer of only 1024 characters? A comment in the source code mentions that a token may be truncated if no safe stopping index can be found among the characters currently in the buffer.
It doesn't seem reasonable to assume that a sentence is never longer than 1024 characters, or that a safe stopping point, such as a newline, can always be found within a sentence.
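To make the concern concrete, here is a rough sketch of the behaviour as I understand it from that comment. It is not the actual source: the class name BufferSketch, the method chunks, and the rule that only line breaks count as safe stopping points are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the real SegmentingTokenizerBase.
class BufferSketch {
    // Mirrors the 1024-character buffer size mentioned above.
    static final int BUFFER_MAX = 1024;

    // Assumption: only line breaks are treated as safe stopping points.
    static boolean isSafeEnd(char c) {
        return c == '\n' || c == '\r';
    }

    // Returns the index just past the last safe stopping character in
    // buf[0, length), or -1 if the buffer contains none.
    static int findSafeEnd(char[] buf, int length) {
        for (int i = length - 1; i >= 0; i--) {
            if (isSafeEnd(buf[i])) {
                return i + 1;
            }
        }
        return -1;
    }

    // Splits text into the pieces a fixed-size buffer would hand to the
    // sentence segmenter, one buffer fill at a time.
    static List<String> chunks(String text) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int length = Math.min(BUFFER_MAX, text.length() - pos);
            char[] buf = text.substring(pos, pos + length).toCharArray();
            int end = length;
            // Only look for a safe end when more input follows this buffer.
            if (pos + length < text.length()) {
                int safe = findSafeEnd(buf, length);
                if (safe > 0) {
                    end = safe;
                }
                // Otherwise the cut falls at the buffer boundary, which can
                // land in the middle of a sentence or even a token.
            }
            out.add(new String(buf, 0, end));
            pos += end;
        }
        return out;
    }
}
```

Under these assumptions, a 1500-character sentence with no line break is split into pieces of 1024 and 476 characters, so the cut can land mid-sentence or mid-token. If a line break sits inside the first 1024 characters, the first piece ends right after it instead.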
Thanks,
Guan
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues