Mailing List Archive

Cache (serialized) TokenStream?
We are indexing a lot of similar texts using Lucene analyzers.
From our performance tests, we see that the analysis step (converting the text to a TokenStream object) is taking more time than we want.
Before digging into the analysis code, I was thinking about caching the analysis result, since we have many repeated texts that we index at different times.
The basic idea is to serialize the TokenStream and store it in a DB. When we encounter the same text again, we would load it and initialize an analyzer with the loaded TokenStream.
In this context:
1 - Is it "safe" to serialize the TokenStream?
2 - Is there existing code that already serializes a TokenStream?
3 - How do we initialize an existing analyzer with a TokenStream?

Thanks!

Best,
Omri
Re: Cache (serialized) TokenStream?
Can you try to cache the word segmentation results? This will be easier.
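
A minimal sketch of what caching the segmentation results could look like, using only the Java standard library. Everything named here is an assumption for illustration: `CachedToken`, `TokenCache`, and `whitespaceTokens` are hypothetical, the in-memory map stands in for the DB, and the `Function` passed to the constructor stands in for running the real Lucene analyzer and copying out each token's term, position increment, and offsets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value type: one analyzed token with the attributes worth persisting. */
final class CachedToken {
    final String term;
    final int positionIncrement;
    final int startOffset;
    final int endOffset;

    CachedToken(String term, int positionIncrement, int startOffset, int endOffset) {
        this.term = term;
        this.positionIncrement = positionIncrement;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

/** Hypothetical cache: maps a hash of the input text to its serialized token list. */
final class TokenCache {
    // Stand-in for the DB table (key = SHA-256 of the text, value = serialized tokens).
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();
    // Stand-in for the expensive analysis step (assumed to wrap the real analyzer).
    private final Function<String, List<CachedToken>> analyzer;

    TokenCache(Function<String, List<CachedToken>> analyzer) {
        this.analyzer = analyzer;
    }

    /** Returns the tokens for the text, running the analyzer only on a cache miss. */
    List<CachedToken> analyze(String text) {
        byte[] blob = store.computeIfAbsent(sha256(text), k -> serialize(analyzer.apply(text)));
        return deserialize(blob);
    }

    /** Writes a token count followed by each token's fields. writeUTF caps a term at 65535 bytes. */
    static byte[] serialize(List<CachedToken> tokens) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(tokens.size());
            for (CachedToken t : tokens) {
                out.writeUTF(t.term);
                out.writeInt(t.positionIncrement);
                out.writeInt(t.startOffset);
                out.writeInt(t.endOffset);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back exactly what serialize wrote. */
    static List<CachedToken> deserialize(byte[] blob) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob))) {
            int count = in.readInt();
            List<CachedToken> tokens = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                tokens.add(new CachedToken(in.readUTF(), in.readInt(), in.readInt(), in.readInt()));
            }
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hex-encoded SHA-256 of the text, used as the cache key. */
    static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Toy analyzer for demonstration only: lowercased whitespace-separated tokens. */
    static List<CachedToken> whitespaceTokens(String text) {
        List<CachedToken> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            tokens.add(new CachedToken(m.group().toLowerCase(Locale.ROOT), 1, m.start(), m.end()));
        }
        return tokens;
    }
}
```

Storing plain token data like this avoids serializing the TokenStream object itself. To feed cached tokens back into indexing, Lucene's field classes can take a pre-built TokenStream (for example the `TextField(String, TokenStream)` constructor), so the list would need to be wrapped in a small TokenStream subclass that replays it. That part is not shown here and should be checked against the Lucene version in use. The cache key would also need to include the analyzer configuration if more than one analysis chain is in play.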

At 2021-11-22 16:40:42, "Omri" <omri.suissa@clearmash.com> wrote:
>We are indexing a lot of similar texts using Lucene analyzers.
>From our performance tests, we see that the analysis step (converting the text to a TokenStream object) is taking more time than we want.
>Before digging into the analysis code, I was thinking about caching the analysis result, since we have many repeated texts that we index at different times.
>The basic idea is to serialize the TokenStream and store it in a DB. When we encounter the same text again, we would load it and initialize an analyzer with the loaded TokenStream.
>In this context:
>1 - Is it "safe" to serialize the TokenStream?
>2 - Is there existing code that already serializes a TokenStream?
>3 - How do we initialize an existing analyzer with a TokenStream?
>
>Thanks!
>
>Best,
>Omri