Mailing List Archive

Dec 28, 2023, 6:21 AM

Post #2 of 15 (326 views)

On 12/28/23 12:59, Jimmy wrote:
> Hi,
>
> I'm seeking assistance in incorporating a stopword for Asian languages in Unicode. Although I possess comprehensive word lists, my attempts to generate a regex pattern and test it have been unsuccessful; the pattern fails to match or skips tokens in the newly added stopword list.
>
> I created the regex pattern using the following code:
>
> Regexp::Assemble->new->add(@words)->reduce(0)->as_string
>
> Afterward, I converted it to UTF-8 hex.
>
> I'm wondering if there are any tools available to facilitate the creation of these regex patterns.
>
I have used Regexp::Trie to create Bayes stopwords in the past; the code is similar to:
-----------------------------------------------------------------------------------------------------------
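use strict;
use warnings;

use Encode;
use Regexp::Trie;

# Read one stopword per line (raw UTF-8 bytes) and build a trie-based regexp.
my @input = <STDIN>;
my $rt = Regexp::Trie->new;
for my $w ( @input ) {
    chomp($w);
    $rt->add($w);
}
my $regexp = $rt->regexp;

# Emit the pattern byte by byte: ASCII passes through unchanged, while any
# byte that is not valid UTF-8 on its own is written as a \xNN escape.
for my $c ( split //, $regexp ) {
    my $copy = $c;    # decode() may modify its argument when CHECK is set
    if ( eval { decode( 'utf8', $copy, Encode::FB_CROAK ); 1 } ) {
        print $c;
    } else {
        printf '\\x%02x', ord($c);
    }
}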
-----------------------------------------------------------------------------------------------------------
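
A quick sanity check for the escaped output can be sketched as follows. The helper name escape_bytes and the sample word are hypothetical, chosen only for illustration; the idea is simply to escape one word by hand and confirm that the resulting pattern matches the word's raw UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: write every non-ASCII byte of a byte string as \xNN.
# ASCII bytes pass through unchanged, so this is meant for plain words only.
sub escape_bytes {
    my ($bytes) = @_;
    return join '',
        map { ord($_) > 0x7f ? sprintf( '\\x%02x', ord $_ ) : $_ }
        split //, $bytes;
}

# A three-letter Thai word given as code points, so this file stays ASCII.
my $word    = encode( 'UTF-8', "\x{0E41}\x{0E25}\x{0E30}" );
my $pattern = escape_bytes($word);    # \xe0\xb9\x81\xe0\xb8\xa5\xe0\xb8\xb0
my $re      = qr/^(?:$pattern)$/;

print "pattern: $pattern\n";
print "matches raw UTF-8 bytes: ", ( $word =~ $re ? "yes" : "no" ), "\n";
```

If the pattern matches here but not inside SpamAssassin, the escaping itself is probably not the problem.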

Giovanni

Re: Bayes Stopword [ In reply to ]

Dec 28, 2023, 6:45 AM

Post #3 of 15 (326 views)

The pattern passes the test script, but I still need to check whether Bayes
learning recognizes words that match this pattern and excludes them as
stopwords.

Thank you.


Re: Bayes Stopword [ In reply to ]

Dec 28, 2023, 7:12 AM

Post #4 of 15 (326 views)

"spamassassin -D bayes" will tell you, you should see a line like:
bayes: skipped token 'from' because it's in stopword list for language 'en'
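
With a long debug log it may be easier to collect these lines per language than to read them one by one. The following is only a sketch: the helper name parse_skipped is made up, and the line format is assumed to be exactly the one shown above.

```perl
use strict;
use warnings;

# Extract (token, language) from a "skipped token" debug line.
# Returns an empty list for any other line.
sub parse_skipped {
    my ($line) = @_;
    return $line =~ /bayes: skipped token '(.*)' because it's in stopword list for language '(\w+)'/
        ? ( $1, $2 )
        : ();
}

my %skipped;    # language => list of skipped tokens
for my $line (
    "bayes: skipped token 'from' because it's in stopword list for language 'en'",
    "bayes: some unrelated debug line",
) {
    my ( $token, $lang ) = parse_skipped($line) or next;
    push @{ $skipped{$lang} }, $token;
}
print "$_: @{ $skipped{$_} }\n" for sort keys %skipped;
```

Replacing the two sample lines with <STDIN> would let the same loop filter real debug output piped into it.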

Giovanni


Re: Bayes Stopword [ In reply to ]

Dec 28, 2023, 7:26 AM

Post #5 of 15 (326 views)

Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate
why it is not being skipped. I suspect that if words are not separated by
spaces, longer words may not match those patterns.
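
That suspicion can be probed outside SpamAssassin with a small sketch. It assumes, without this having been confirmed against Plugin/Bayes.pm, that tokens are split on whitespace and that the stopword pattern must match a whole token; is_stopword is a hypothetical helper, not a SpamAssassin function.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption (not verified against Plugin/Bayes.pm): a token is skipped only
# if the stopword pattern matches the entire token.
sub is_stopword {
    my ( $token, $pattern ) = @_;
    return $token =~ /^(?:$pattern)$/ ? 1 : 0;
}

# Stopword pattern for one Thai word, as escaped UTF-8 bytes.
my $pattern = '\xe0\xb9\x81\xe0\xb8\xa5\xe0\xb8\xb0';

# The stopword alone, and the same word followed directly by two more
# Thai letters with no space in between.
my $stop = encode( 'UTF-8', "\x{0E41}\x{0E25}\x{0E30}" );
my $run  = encode( 'UTF-8', "\x{0E41}\x{0E25}\x{0E30}\x{0E01}\x{0E32}" );

for my $token ( split ' ', "$stop $run" ) {
    printf "%d bytes: %s\n", length $token,
        is_stopword( $token, $pattern ) ? 'skipped' : 'kept';
}
```

Under these assumptions the 9-byte token is skipped and the 15-byte run is kept, which would be consistent with stopwords going unnoticed in text written without spaces.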

Jimmy


Re: Bayes Stopword [ In reply to ]

Dec 28, 2023, 7:59 AM

Post #6 of 15 (326 views)

Could you share the config line and a sample message you are using?
Giovanni


Re: Bayes Stopword [ In reply to ]

Dec 28, 2023, 8:06 AM

Post #7 of 15 (326 views)

bayes_stopword_th https://pastebin.pl/view/0838138d
Sample mail https://pastebin.pl/view/e5a2c5b8

Jimmy


Re: Bayes Stopword [ In reply to ]

Dec 28, 2023, 11:59 PM

Post #8 of 15 (324 views)

The config line produces a syntax error for me:
config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th

Could you share the word list in UTF-8?
I tried adding "???" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and it produces a working regexp.
Bayes stopword languages must also be enabled with the "bayes_stopword_languages" config keyword; by default only English is enabled.
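
For reference, a minimal local.cf sketch of the two directives mentioned here might look like the following. The space-separated language list and the single-word Thai pattern (the UTF-8 bytes of U+0E41 U+0E25 U+0E30) are illustrative assumptions, not the configuration from this thread.

```
bayes_stopword_languages en th
bayes_stopword_th (?:\xe0\xb9\x81\xe0\xb8\xa5\xe0\xb8\xb0)
```

Running "spamassassin -D --lint" afterwards should show whether the lines parse.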
Giovanni


Re: Bayes Stopword [ In reply to ]

Dec 29, 2023, 12:22 AM

Post #9 of 15 (324 views)

I use SpamAssassin 4.0.0 (2022-12-14)

$ spamassassin -D --lint 2>&1 | grep bayes:
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
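
The output above announces some languages more than once. Whether that matters is not clear from this thread, but a small filter makes such repeats easy to spot. This is only a sketch: repeated_langs is a hypothetical helper, and the sample lines are rebuilt from the language codes in the order they appear above.

```perl
use strict;
use warnings;

# Return the languages that appear more than once in
# "stopword found lang=XX" debug lines, in order of first repetition.
sub repeated_langs {
    my (@lines) = @_;
    my ( %seen, @repeated );
    for my $line (@lines) {
        next unless $line =~ /bayes: stopword found lang=(\w+)/;
        push @repeated, $1 if ++$seen{$1} == 2;
    }
    return @repeated;
}

# Language codes in the order they appear in the lint output.
my @langs = qw(en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi);
my @lines = map { "dbg: bayes: stopword found lang=$_" } @langs;

print "announced more than once: @{[ repeated_langs(@lines) ]}\n";
```

For this output it reports fr, ru and zh.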

$ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'

You can test with "???", which is listed in the regexp pattern, but I don't
know why Bayes does not report it as a skipped token.

Jimmy


Re: Bayes Stopword [ In reply to ]

Dec 29, 2023, 12:59 AM

Post #10 of 15 (324 views)

To create the stopword regexp I used the script I shared in a previous email, fed with a list of words, one per line.
Could you share the list you are using?

Giovanni


Re: Bayes Stopword [ In reply to ]

Dec 29, 2023, 1:08 AM

Post #11 of 15 (324 views)

You can use this word list:

https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt

Jimmy

On Fri, Dec 29, 2023 at 3:59?PM <giovanni@paclan.it> wrote:

> To create the stopwords regexp I used the script I shared in a previous
> email and a list of words one per line.
> Could you share the list you are using ?
>
> Giovanni
>
> On 12/29/23 09:22, Jimmy wrote:
> > I use SpamAssassin 4.0.0 (2022-12-14)
> >
> > $ spamassassin -D --lint 2>&1 | grep bayes:
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
> > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
> > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
> > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
> > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled:
> en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
> >
> >
> > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
> > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because
> it's in stopword list for language 'en'
> >
> > You can use "???" that was listed in regexp pattern but somehow I don't
> know why it not show skipped token in bayes.
> >
> > Jimmy

Re: Bayes Stopword [ In reply to ]

Dec 29, 2023, 1:47 AM

Post #12 of 15 (324 views)

I do not speak Thai, but I cannot see any word in the sample email that should match that list.
Which word do you think should match the regexp?
Giovanni

On 12/29/23 10:08, Jimmy wrote:
> You can use this word list
>
> https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
>
> Jimmy

Re: Bayes Stopword [ In reply to ]

Dec 29, 2023, 2:04 AM

Post #13 of 15 (324 views)

The sample email and word list should contain at least these words.

???
???
???

Jimmy

On Fri, Dec 29, 2023 at 4:47 PM <giovanni@paclan.it> wrote:

> I do not speak Thai but I cannot see any word in the sample email that should match that list.
> Which word do you think should match the regexp ?
> Giovanni

Re: Bayes Stopword [ In reply to ]

Dec 29, 2023, 2:16 AM

Post #14 of 15 (311 views)

"???" is not considered a word because it's part of the token "????????????????????????".
Words must be separated by spaces; otherwise we would have to skip the word "theme" just because "the" is in the English stopword list.
I have no idea whether this makes sense for Asian languages.
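
To make the difference concrete, here is a minimal sketch of whole-token matching versus substring matching. It is written in Python purely for illustration and is not SpamAssassin's code; the pattern, the function names and the sample strings are all assumptions. In particular, the Thai word used is an illustrative one, not the word that was lost from this thread.

```python
import re

# Hypothetical stopword alternation; real lists are far longer.
# "และ" is an illustrative Thai word ("and"), not the one lost from the thread.
STOPWORD_RE = re.compile(r"the|from|และ", re.IGNORECASE)


def is_stopword(token: str) -> bool:
    """Whole-token match: the token must be a stopword and nothing else."""
    return STOPWORD_RE.fullmatch(token) is not None


def contains_stopword(token: str) -> bool:
    """Substring match: a stopword occurs anywhere inside the token."""
    return STOPWORD_RE.search(token) is not None


def skipped_tokens(text: str) -> list:
    """Split on whitespace and return the tokens a whole-token check would skip."""
    return [tok for tok in text.split() if is_stopword(tok)]


if __name__ == "__main__":
    print(skipped_tokens("the theme from here"))  # ['the', 'from']
    print(contains_stopword("theme"))             # True
    print(skipped_tokens("ฉันและเธอ"))             # []
```

The last call shows the situation described above: text written without spaces arrives as a single token, so a whole-token check never sees the stopword inside it. Handling such scripts would presumably need a word segmenter before the stopword check, which is outside the scope of this sketch.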

Giovanni

On 12/29/23 11:04, Jimmy wrote:
>
> The sample email and word list should contain at least these words.
>
> ???
> ???
> ???
>
> Jimmy

Re: Bayes Stopword [ In reply to ]

Dec 29, 2023, 2:38 AM

Post #15 of 15 (311 views)