Mailing List Archive

Bayes Stopword
Hi,

I'm seeking assistance in adding a Bayes stopword list for Asian languages in
Unicode. Although I have comprehensive word lists, my attempts to generate a
regex pattern and test it have been unsuccessful; the pattern fails to match,
so tokens in the newly added stopword list are not skipped.

I created the regex pattern using the following code:

Regexp::Assemble->new->add(@words)->reduce(0)->as_string

Afterward, I converted it to UTF-8 hex.

I'm wondering if there are any tools available to facilitate the creation
of these regex patterns.

Thank you,
Jimmy
Re: Bayes Stopword
On 12/28/23 12:59, Jimmy wrote:
> I'm wondering if there are any tools available to facilitate the creation of these regex patterns.
>
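I have used Regexp::Trie to create Bayes stopwords in the past; the code is similar to:
-----------------------------------------------------------------------------------------------------------
use strict;
use warnings;

use Encode;
use Regexp::Trie;

# Read the word list from STDIN, one word per line, and build a trie regexp.
my @input = <STDIN>;
my $rt = Regexp::Trie->new;
for my $w ( @input ) {
    chomp($w);
    $rt->add($w);
}
my $regexp = $rt->regexp;

# Print the pattern one byte at a time: bytes that decode as UTF-8 on their
# own (plain ASCII) are printed as-is, any other byte as 'x' plus its hex value.
my @reg = split //, $regexp;
for my $c ( @reg ) {
    my $char = $c;    # keep a copy, since decode() with a CHECK value may modify $c
    my $test;
    eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
    if( $@ ) {
        # Only 'x' is printed here; the backslash before it presumably comes
        # from the escaping already present in the generated pattern.
        print 'x' . sprintf("%x", ord($c));
    } else {
        print $char;
    }
}
-----------------------------------------------------------------------------------------------------------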

Giovanni
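
For anyone checking a generated pattern by hand, here is a minimal sketch (not part of the script above) of why a hex-escaped pattern has to be tested against raw UTF-8 bytes rather than decoded characters. The sample word and the helper name escape_bytes are assumptions made for illustration; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a byte string into a run of \xNN escapes.
sub escape_bytes {
    my ($bytes) = @_;
    return join '', map { sprintf '\\x%02x', ord } split //, $bytes;
}

# Sample Thai word, written with code points so the source stays ASCII.
my $chars = "\x{0E41}\x{0E25}\x{0E30}";
my $bytes = encode('UTF-8', $chars);    # UTF-8 encoded byte string

my $pattern = escape_bytes($bytes);
print "$pattern\n";

# The escaped pattern describes bytes, so it matches the encoded form only.
my $re = qr/^(?:$pattern)$/;
print "bytes match: ", ($bytes =~ $re ? "yes" : "no"), "\n";
print "chars match: ", ($chars =~ $re ? "yes" : "no"), "\n";
```

If a test script decodes its input to characters before matching, a pattern like this will appear to fail even though it is correct for byte strings, so it is worth checking which form the test data is in.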
Re: Bayes Stopword
The pattern now passes the test script, but I still need to check whether
Bayes learning will recognize words matching this pattern and exclude them.

Thank you.


On Thu, Dec 28, 2023 at 9:22 PM <giovanni@paclan.it> wrote:

> I have used Regexp::Trie to create Bayes stopwords in the past [...]
Re: Bayes Stopword
"spamassassin -D bayes" will tell you; you should see a line like:
bayes: skipped token 'from' because it's in stopword list for language 'en'

Giovanni
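
To see at a glance which languages actually had tokens skipped, the debug output can be filtered with a few lines of Perl. This is only a sketch: it assumes the message format is exactly as in the line quoted above, and the sample lines in it are made up for illustration (a real run would read the output of "spamassassin -D bayes" from STDIN instead).

```perl
use strict;
use warnings;

# Illustrative lines only; in practice, read the debug output from STDIN.
my @lines = (
    "dbg: bayes: skipped token 'from' because it's in stopword list for language 'en'",
    "dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'",
    "dbg: bayes: (some unrelated debug line)",
);

# Collect the skipped tokens per language code.
my %skipped;
for my $line (@lines) {
    if ($line =~ /skipped token '(.*)' because it's in stopword list for language '(\w+)'/) {
        push @{ $skipped{$2} }, $1;
    }
}

# One summary line per language that had at least one skipped token.
for my $lang (sort keys %skipped) {
    printf "%s: %d skipped (%s)\n",
        $lang, scalar @{ $skipped{$lang} }, join(', ', @{ $skipped{$lang} });
}
```

A language that is enabled but never appears in this summary had no tokens matching its stopword pattern in the tested message.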

On 12/28/23 15:45, Jimmy wrote:
> The pattern now passes the test script, but I still need to check whether Bayes learning will recognize words matching this pattern and exclude them.
>
> Thank you.
Re: Bayes Stopword
Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate
why it is not being skipped. I suspect that if words are not separated by
spaces, longer words may not match those patterns.

Jimmy
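
One way to picture that suspicion is the sketch below. It assumes the stopword regexp is applied to each whole token, anchored at both ends; that is an assumption, not something confirmed in this thread. Under it, a stopword glued to a following word inside one unsegmented token will not match. The two sample words and all variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two sample Thai words, given as code points to keep the source ASCII.
my $stopword = encode('UTF-8', "\x{0E41}\x{0E25}\x{0E30}");
my $other    = encode('UTF-8', "\x{0E01}\x{0E32}\x{0E23}");

# Assumed behaviour: the stopword pattern must match the entire token.
my $anchored = qr/^(?:\Q$stopword\E)$/;

# A token holding only the stopword, and one where no space separates it
# from the next word, as happens in unsegmented text.
my $single = $stopword;
my $joined = $stopword . $other;

for my $token ($single, $joined) {
    my $hex = unpack 'H*', $token;
    my $hit = $token =~ $anchored ? 'skipped' : 'kept';
    print "$hex: $hit\n";
}
```

If this is what happens, the tokens reaching the stopword check would be longer runs of text than any single entry in the word list, which would explain why nothing is reported as skipped.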

On Thu, Dec 28, 2023 at 10:13 PM <giovanni@paclan.it> wrote:

> "spamassassin -D bayes" will tell you, you should see a line like:
> bayes: skipped token 'from' because it's in stopword list for language 'en'
>
> Giovanni
Re: Bayes Stopword
Could you share a config line and a sample you are using?
Giovanni

On 12/28/23 16:26, Jimmy wrote:
> Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate why it is not being skipped. I suspect that if words are not separated by spaces, longer words may not match those patterns.
>
> Jimmy
Re: Bayes Stopword
bayes_stopword_th https://pastebin.pl/view/0838138d
Sample mail https://pastebin.pl/view/e5a2c5b8

Jimmy


On Thu, Dec 28, 2023 at 10:59 PM <giovanni@paclan.it> wrote:

> Could you share a config line and a sample you are using?
> Giovanni
Re: Bayes Stopword
The config line produces a syntax error for me:
config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th

Could you share the word list in UTF-8?
I tried adding "???" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and it produces a working regexp.
Bayes stopword languages must also be enabled using the "bayes_stopword_languages" config keyword; by default only English is enabled.
Giovanni

On 12/28/23 17:06, Jimmy wrote:
> bayes_stopword_th https://pastebin.pl/view/0838138d
> Sample mail https://pastebin.pl/view/e5a2c5b8
>
> Jimmy
Re: Bayes Stopword
I use SpamAssassin 4.0.0 (2022-12-14)

$ spamassassin -D --lint 2>&1 | grep bayes:
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi


$ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'

You can use "???", which is listed in the regexp pattern, but I don't know why
it is not shown as a skipped token in the Bayes debug output.

Jimmy


On Fri, Dec 29, 2023 at 2:59 PM <giovanni@paclan.it> wrote:

> Config line produces a syntax error for me:
> config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1):
> bayes_stopword_th
>
> Could you share the word list in utf8 ?
> I tried adding "???" to
> https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
> and it produces a working regexp.
> Bayes stopwords languages must also be enabled using
> "bayes_stopword_languages" config keyword, by default only english is
> enabled.
> Giovanni
Re: Bayes Stopword
To create the stopwords regexp I used the script I shared in a previous email and a list of words, one per line.
Could you share the list you are using?

Giovanni

On 12/29/23 09:22, Jimmy wrote:
> You can use "???", which is listed in the regexp pattern, but I don't know why it is not shown as a skipped token in the Bayes debug output.
>
> Jimmy
Re: Bayes Stopword
You can use this word list

https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt

Jimmy

On Fri, Dec 29, 2023 at 3:59 PM <giovanni@paclan.it> wrote:

> To create the stopwords regexp I used the script I shared in a previous
> email and a list of words one per line.
> Could you share the list you are using ?
>
> Giovanni
>
Re: Bayes Stopword [ In reply to ]
I do not speak Thai, but I cannot see any word in the sample email that should match that list.
Which word do you think should match the regexp?
Giovanni

On 12/29/23 10:08, Jimmy wrote:
> You can use this word list
>
> https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
>
> Jimmy
>
Re: Bayes Stopword [ In reply to ]
The sample email and word list should contain at least these words.

???
???
???

Jimmy

On Fri, Dec 29, 2023 at 4:47 PM <giovanni@paclan.it> wrote:

> I do not speak Thai but I cannot see any word in the sample email that
> should match that list.
> Which word do you think should match the regexp ?
> Giovanni
>
Re: Bayes Stopword [ In reply to ]
"???" is not considered a word because it's part of the token "????????????????????????".
Words must be separated by spaces; otherwise we would skip the word "theme" just because "the" is in the English stopword list.
I have no idea whether this makes sense for Asian languages.

Giovanni
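
The whole-token rule described in this reply can be illustrated with a minimal Perl sketch. It is not SpamAssassin's actual code: the three-word pattern $stop_re and the helper is_stopword are hypothetical, and it assumes that tokens are split on whitespace and that a stopword pattern has to match an entire token.

```perl
use strict;
use warnings;

# Hypothetical stopword alternation; a real list is far longer.
my $stop_re = qr/(?:the|from|and)/;

# Assumption: a token is skipped only when the WHOLE token matches the
# stopword pattern, so the match is anchored at both ends.
sub is_stopword {
    my ($token) = @_;
    return $token =~ /\A$stop_re\z/i ? 1 : 0;
}

# Assumption: tokens are produced by splitting the text on whitespace.
for my $token ( split ' ', 'the theme from fromage' ) {
    printf "%-8s %s\n", $token, is_stopword($token) ? 'skipped' : 'kept';
}
```

Under these assumptions only "the" and "from" are skipped, while "theme" and "fromage" are kept. A run of Thai text written without spaces would reach the check as one long token, so a short stopword inside it would not match on its own.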

On 12/29/23 11:04, Jimmy wrote:
>
> The sample email and word list should contain at least these words.
>
> ???
> ???
> ???
>
> Jimmy
>
> On Fri, Dec 29, 2023 at 4:47?PM <giovanni@paclan.it <mailto:giovanni@paclan.it>> wrote:
>
> I do not speak Thai but I cannot see any word in the sample email that should match that list.
> Which word do you think should match the regexp ?
>   Giovanni
>
> On 12/29/23 10:08, Jimmy wrote:
> > You can use this word list
> >
> > https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt <https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt> <https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt <https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt>>
> >
> > Jimmy
> >
> > On Fri, Dec 29, 2023 at 3:59?PM <giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>> wrote:
> >
> >     To create the stopwords regexp I used the script I shared in a previous email and a list of words one per line.
> >     Could you share the list you are using ?
> >
> >         Giovanni
> >
> >     On 12/29/23 09:22, Jimmy wrote:
> >      > I use SpamAssassin 4.0.0 (2022-12-14)
> >      >
> >      > $ spamassassin -D --lint 2>&1 | grep bayes:
> >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
> >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
> >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
> >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
> >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
> >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
> >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
> >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
> >      > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
> >      >
> >      >
> >      > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
> >      > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
> >      >
> >      > You can use "???" that was listed in regexp pattern but somehow I don't know why it not show skipped token in bayes.
> >      >
> >      > Jimmy
> >      >
> >      >
> >      > On Fri, Dec 29, 2023 at 2:59?PM <giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>>> wrote:
> >      >
> >      >     Config line produces a syntax error for me:
> >      >     config: failed to parse line in /etc/mail/spamassassin/local.cf <http://local.cf> <http://local.cf <http://local.cf>> <http://local.cf <http://local.cf> <http://local.cf <http://local.cf>>> (line 1): bayes_stopword_th
> >      >
> >      >     Could you share the word list in utf8 ?
> >      >     I tried adding "???" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt <https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt> <https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt <https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt>> <https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt <https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt> <https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt <https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt>>> and it produces a working regexp.
> >      >     Bayes stopwords languages must also be enabled using "bayes_stopword_languages" config keyword, by default only english is enabled.
> >      >        Giovanni
> >      >
> >      >     On 12/28/23 17:06, Jimmy wrote:
> >      >      > bayes_stopword_th https://pastebin.pl/view/0838138d <https://pastebin.pl/view/0838138d> <https://pastebin.pl/view/0838138d <https://pastebin.pl/view/0838138d>> <https://pastebin.pl/view/0838138d <https://pastebin.pl/view/0838138d> <https://pastebin.pl/view/0838138d <https://pastebin.pl/view/0838138d>>> <https://pastebin.pl/view/0838138d <https://pastebin.pl/view/0838138d> <https://pastebin.pl/view/0838138d <https://pastebin.pl/view/0838138d>> <https://pastebin.pl/view/0838138d <https://pastebin.pl/view/0838138d> <https://pastebin.pl/view/0838138d <https://pastebin.pl/view/0838138d>>>>
> >      >      > Sample mail https://pastebin.pl/view/e5a2c5b8 <https://pastebin.pl/view/e5a2c5b8> <https://pastebin.pl/view/e5a2c5b8 <https://pastebin.pl/view/e5a2c5b8>> <https://pastebin.pl/view/e5a2c5b8 <https://pastebin.pl/view/e5a2c5b8> <https://pastebin.pl/view/e5a2c5b8 <https://pastebin.pl/view/e5a2c5b8>>> <https://pastebin.pl/view/e5a2c5b8 <https://pastebin.pl/view/e5a2c5b8> <https://pastebin.pl/view/e5a2c5b8 <https://pastebin.pl/view/e5a2c5b8>> <https://pastebin.pl/view/e5a2c5b8 <https://pastebin.pl/view/e5a2c5b8> <https://pastebin.pl/view/e5a2c5b8 <https://pastebin.pl/view/e5a2c5b8>>>>
> >      >      >
> >      >      > Jimmy
> >      >      >
> >      >      >
> >      >      > On Thu, Dec 28, 2023 at 10:59?PM <giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>>>> wrote:
> >      >      >
> >      >      >     Could you share a config line and a sample you are using ?
> >      >      >        Giovanni
> >      >      >
> >      >      >     On 12/28/23 16:26, Jimmy wrote:
> >      >      >      > Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate why it is not being skipped. I suspect that if words are not separated by spaces, longer words may not match those patterns.
> >      >      >      >
> >      >      >      > Jimmy
> >      >      >      >
> >      >      >      > On Thu, Dec 28, 2023 at 10:13 PM <giovanni@paclan.it> wrote:
> >      >      >      >
> >      >      >      >     "spamassassin -D bayes" will tell you, you should see a line like:
> >      >      >      >     bayes: skipped token 'from' because it's in stopword list for language 'en'
> >      >      >      >
> >      >      >      >        Giovanni
> >      >      >      >
> >      >      >      >     On 12/28/23 15:45, Jimmy wrote:
> >      >      >      >      > The pattern has successfully passed the test script, but it needs to check whether Bayes learning will identify and possibly exclude the word from matching this pattern.
> >      >      >      >      >
> >      >      >      >      > Thank you.
> >      >      >      >      >
> >      >      >      >      >
> >      >      >      >      > On Thu, Dec 28, 2023 at 9:22 PM <giovanni@paclan.it> wrote:
> >      >      >      >      >
> >      >      >      >      >     On 12/28/23 12:59, Jimmy wrote:
> >      >      >      >      >      > Hi,
> >      >      >      >      >      >
> >      >      >      >      >      > I'm seeking assistance in incorporating a stopword for Asian languages in Unicode. Although I possess comprehensive word lists, my attempts to generate a regex pattern and test it have been unsuccessful; the pattern fails to match or skips tokens in the newly added stopword list.
> >      >      >      >      >      >
> >      >      >      >      >      > I created the regex pattern using the following code:
> >      >      >      >      >      >
> >      >      >      >      >      > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> >      >      >      >      >      >
> >      >      >      >      >      > Afterward, I converted it to UTF-8 hex.
> >      >      >      >      >      >
> >      >      >      >      >      > I'm wondering if there are any tools available to facilitate the creation of these regex patterns.
> >      >      >      >      >      >
> >      >      >      >      >     I have used Regexp::Trie to create Bayes stopwords in the past, code is similar to:
> >      >      >      >      >     -----------------------------------------------------------------------------------------------------------
> >      >      >      >      >     use strict;
> >      >      >      >      >     use warnings;
> >      >      >      >      >
> >      >      >      >      >     use Encode;
> >      >      >      >      >     use Regexp::Trie;
> >      >      >      >      >
> >      >      >      >      >     my @input = <STDIN>;
> >      >      >      >      >     my $rt = Regexp::Trie->new;
> >      >      >      >      >     for my $w ( @input ) {
> >      >      >      >      >         chomp($w);
> >      >      >      >      >         $rt->add($w);
> >      >      >      >      >     }
> >      >      >      >      >     my $regexp = $rt->regexp;
> >      >      >      >      >     my @reg = split //, $regexp;
> >      >      >      >      >     for my $c ( @reg ) {
> >      >      >      >      >         my $char = $c;
> >      >      >      >      >         my $test;
> >      >      >      >      >         eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
> >      >      >      >      >         if( $@ ) {
> >      >      >      >      >           print 'x' . sprintf("%x", ord($c));
> >      >      >      >      >         } else {
> >      >      >      >      >           print $char;
> >      >      >      >      >         }
> >      >      >      >      >     }
> >      >      >      >      >     -----------------------------------------------------------------------------------------------------------
> >      >      >      >      >
> >      >      >      >      >        Giovanni
> >      >      >      >      >
> >      >      >      >
> >      >      >
> >      >
> >
>
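
Note on the configuration discussed in this thread: the replies mention that a custom stopword regexp is only used when its language is also enabled through "bayes_stopword_languages" (English being the only default). A minimal local.cf sketch under that assumption follows; WORD1 and WORD2 are placeholders, not real entries, and the exact pattern syntax should be checked against the Mail::SpamAssassin::Conf documentation for the installed version.

```
# /etc/mail/spamassassin/local.cf (sketch, placeholder values)
# Enable stopword processing for English and Thai.
bayes_stopword_languages en th
# Custom Thai stopword pattern; WORD1/WORD2 stand in for the
# byte-escaped words produced by the generator script.
bayes_stopword_th (?:WORD1|WORD2)
```

After editing, "spamassassin -D --lint" should list the language among the enabled stopword languages, as in the debug output quoted in the thread.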
Re: Bayes Stopword [ In reply to ]
This is what I believe: the words need to be trimmed or separated, and
careful consideration is required to determine the language in order to
perform accurate cutoffs.

Jimmy

On Fri, Dec 29, 2023 at 5:16 PM <giovanni@paclan.it> wrote:

> "???" is not considered a word because it's part of the token
> "????????????????????????".
> Words must be separated by spaces; otherwise we would have to skip the
> word "theme" just because "the" is in the English stopword list.
> No idea if this makes sense for Asian languages.
>
> Giovanni
>
> On 12/29/23 11:04, Jimmy wrote:
> >
> > The sample email and word list should contain at least these words.
> >
> > ???
> > ???
> > ???
> >
> > Jimmy
> >
> > On Fri, Dec 29, 2023 at 4:47 PM <giovanni@paclan.it> wrote:
> >
> > I do not speak Thai but I cannot see any word in the sample email that should match that list.
> > Which word do you think should match the regexp ?
> > Giovanni
> >
> > On 12/29/23 10:08, Jimmy wrote:
> > > You can use this word list
> > >
> > >
> > > https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
> > >
> > > Jimmy
> > >
> > > On Fri, Dec 29, 2023 at 3:59 PM <giovanni@paclan.it> wrote:
> > >
> > > To create the stopwords regexp I used the script I shared in a previous email and a list of words one per line.
> > > Could you share the list you are using ?
> > >
> > > Giovanni
> > >
> > > On 12/29/23 09:22, Jimmy wrote:
> > > > I use SpamAssassin 4.0.0 (2022-12-14)
> > > >
> > > > $ spamassassin -D --lint 2>&1 | grep bayes:
> > > > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
> > > > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
> > > > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
> > > > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
> > > > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
> > > > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
> > > > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
> > > > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
> > > >
> > > >
> > > > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
> > > > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
> > > >
> > > > You can use "???", which is listed in the regexp pattern, but I don't know why it does not show up as a skipped token in Bayes.
> > > >
> > > > Jimmy
> > > >
> > > >
> > > > On Fri, Dec 29, 2023 at 2:59 PM <giovanni@paclan.it> wrote:
> > > >
> > > > Config line produces a syntax error for me:
> > > > config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th
> > > >
> > > > Could you share the word list in utf8 ?
> > > > I tried adding "???" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and it produces a working regexp.
> > > > Bayes stopwords languages must also be enabled using "bayes_stopword_languages" config keyword, by default only english is enabled.
> > > > Giovanni
> > > >
> > > > On 12/28/23 17:06, Jimmy wrote:
> > > > > bayes_stopword_th https://pastebin.pl/view/0838138d
> <https://pastebin.pl/view/0838138d> <https://pastebin.pl/view/0838138d <
> https://pastebin.pl/view/0838138d>> <https://pastebin.pl/view/0838138d <
> https://pastebin.pl/view/0838138d> <https://pastebin.pl/view/0838138d <
> https://pastebin.pl/view/0838138d>>> <https://pastebin.pl/view/0838138d <
> https://pastebin.pl/view/0838138d> <https://pastebin.pl/view/0838138d <
> https://pastebin.pl/view/0838138d>> <https://pastebin.pl/view/0838138d <
> https://pastebin.pl/view/0838138d> <https://pastebin.pl/view/0838138d <
> https://pastebin.pl/view/0838138d>>>>
> > > > > Sample mail https://pastebin.pl/view/e5a2c5b8 <
> https://pastebin.pl/view/e5a2c5b8> <https://pastebin.pl/view/e5a2c5b8 <
> https://pastebin.pl/view/e5a2c5b8>> <https://pastebin.pl/view/e5a2c5b8 <
> https://pastebin.pl/view/e5a2c5b8> <https://pastebin.pl/view/e5a2c5b8 <
> https://pastebin.pl/view/e5a2c5b8>>> <https://pastebin.pl/view/e5a2c5b8 <
> https://pastebin.pl/view/e5a2c5b8> <https://pastebin.pl/view/e5a2c5b8 <
> https://pastebin.pl/view/e5a2c5b8>> <https://pastebin.pl/view/e5a2c5b8 <
> https://pastebin.pl/view/e5a2c5b8> <https://pastebin.pl/view/e5a2c5b8 <
> https://pastebin.pl/view/e5a2c5b8>>>>
> > > > >
> > > > > Jimmy
> > > > >
> > > > >
> > > > > On Thu, Dec 28, 2023 at 10:59?PM <
> giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it
> <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>>
> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:
> giovanni@paclan.it <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it
> <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it>>>>> wrote:
> > > > >
> > > > > Could you share a config line and a sample you
> are using ?
> > > > > Giovanni
> > > > >
> > > > > On 12/28/23 16:26, Jimmy wrote:
> > > > > > Yes, I have done that, and I am also editing
> Plugin/Bayes.pm to investigate why it is not being skipped. I suspect that
> if words are not separated by spaces, longer words may not match those
> patterns.
> > > > > >
> > > > > > Jimmy
> > > > > >
> > > > > > On Thu, Dec 28, 2023 at 10:13?PM <
> giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it
> <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>>
> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:
> giovanni@paclan.it <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it
> <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it>>>> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>
> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:
> giovanni@paclan.it <mailto:giovanni@paclan.it>>> <mailto:
> giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it
> <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>>>>>
> wrote:
> > > > > >
> > > > > > "spamassassin -D bayes" will tell you,
> you should see a line like:
> > > > > > bayes: skipped token 'from' because it's
> in stopword list for language 'en'
> > > > > >
> > > > > > Giovanni
> > > > > >
> > > > > > On 12/28/23 15:45, Jimmy wrote:
> > > > > > > The pattern has successfully passed
> the test script, but it needs to check whether Bayes learning will identify
> and possibly exclude the word from matching this pattern.
> > > > > > >
> > > > > > > Thank you.
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Dec 28, 2023 at 9:22?PM <
> giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it
> <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>>
> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:
> giovanni@paclan.it <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it
> <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it>>>> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>
> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:
> giovanni@paclan.it <mailto:giovanni@paclan.it>>> <mailto:
> giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it
> <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>>>>
> <mailto:giovanni@paclan.it
> > <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it>> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>
> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>> <mailto:
> giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it
> <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>>>
> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:
> giovanni@paclan.it <mailto:giovanni@paclan.it>> <mailto:giovanni@paclan.it
> <mailto:giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it>>> <mailto:giovanni@paclan.it <mailto:
> giovanni@paclan.it> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it>>
> <mailto:giovanni@paclan.it <mailto:giovanni@paclan.it> <mailto:
> giovanni@paclan.it <mailto:giovanni@paclan.it>>>>>>> wrote:
> > > > > > >
> > > > > > > On 12/28/23 12:59, Jimmy wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I'm seeking assistance in
> incorporating a stopword for Asian languages in Unicode. Although I possess
> comprehensive word lists, my attempts to generate a regex pattern and test
> it have been unsuccessful; the pattern fails to match or skips tokens in
> the newly added stopword list.
> > > > > > > >
> > > > > > > > I created the regex pattern
> using the following code:
> > > > > > > >
> > > > > > > >
> Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> > > > > > > >
> > > > > > > > Afterward, I converted it to
> UTF-8 hex.
> > > > > > > >
> > > > > > > > I'm wondering if there are any
> tools available to facilitate the creation of these regex patterns.
> > > > > > > >
> > > > > > > I have used Regexp::Trie to
> create Bayes stopwords in the past, code is similar to:
> > > > > > >
> -----------------------------------------------------------------------------------------------------------
> > > > > > > use strict;
> > > > > > > use warnings;
> > > > > > >
> > > > > > > use Encode;
> > > > > > > use Regexp::Trie;
> > > > > > >
> > > > > > > my @input = <STDIN>;
> > > > > > > my $rt = Regexp::Trie->new;
> > > > > > > for my $w ( @input ) {
> > > > > > > chomp($w);
> > > > > > > $rt->add($w);
> > > > > > > }
> > > > > > > my $regexp = $rt->regexp;
> > > > > > > my @reg = split //, $regexp;
> > > > > > > for my $c ( @reg ) {
> > > > > > > my $char = $c;
> > > > > > > my $test;
> > > > > > > eval "\$test = decode(
> 'utf8', \$c, Encode::FB_CROAK )";
> > > > > > > if( $@ ) {
> > > > > > > print 'x' . sprintf("%x",
> ord($c));
> > > > > > > } else {
> > > > > > > print $char;
> > > > > > > }
> > > > > > > }
> > > > > > >
> -----------------------------------------------------------------------------------------------------------
> > > > > > >
> > > > > > > Giovanni
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
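
The point made above about "theme" and "the" can be illustrated with a small Perl sketch. It assumes, as a simplification, that tokens are produced by splitting on whitespace and that a token is skipped only when the whole token matches the stopword pattern; the pattern and the filter_tokens helper are hypothetical and are not taken from Plugin/Bayes.pm.

```perl
use strict;
use warnings;

# Hypothetical stopword alternation, for illustration only.
my $stopword_re = qr/(?:the|from|and)/;

# Split text on whitespace and drop a token only when the WHOLE token
# matches the stopword pattern (anchored at both ends).
sub filter_tokens {
    my ($text) = @_;
    my @tokens = split ' ', $text;
    return grep { !/\A$stopword_re\z/ } @tokens;
}

# "the" and "from" are dropped; "theme" and "them" are kept.
print join(',', filter_tokens('the theme from them')), "\n";

# A run of words without spaces is one token, so nothing is dropped.
print join(',', filter_tokens('thethemefrom')), "\n";
```

Under these assumptions, text written without spaces between words reaches the filter as one long token, which would explain why a stopword embedded in it is never reported as skipped. Some word segmentation step before the stopword check would be needed for such languages.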