I am using SpamAssassin 4.0.0 (2022-12-14):
$ spamassassin -D --lint 2>&1 | grep bayes:
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
$ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
You can test with "???", which is listed in the regexp pattern, but for some
reason it never shows up as a skipped token in the Bayes debug output, and I
don't know why.
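
One possible explanation is the suspicion quoted further down, that words not separated by spaces may never match. The Python sketch below only illustrates that idea and is not SpamAssassin code. It assumes that a stopword regexp has to match a whole token and that tokens are split on whitespace; neither is confirmed from Bayes.pm here. The two-word stopword list and the sample text are made up for illustration.

```python
import re

# Hypothetical two-word stopword list, for illustration only.
STOPWORDS = ["และ", "ของ"]

# Assumption: the stopword regexp must match the whole token.
PATTERN = re.compile("(?:" + "|".join(map(re.escape, STOPWORDS)) + ")")

def skipped_tokens(text):
    """Split on whitespace and return the tokens that are stopwords."""
    return [tok for tok in text.split() if PATTERN.fullmatch(tok)]

print(skipped_tokens("แมว และ หมา"))  # the stopword is a token of its own
print(skipped_tokens("แมวและหมา"))    # the stopword is buried in one long token
```

If that assumption holds, only the first call reports a skipped token, which would match what I see with text written without spaces.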
Jimmy
On Fri, Dec 29, 2023 at 2:59 PM <giovanni@paclan.it> wrote:
> Config line produces a syntax error for me:
> config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1):
> bayes_stopword_th
>
> Could you share the word list in UTF-8?
> I tried adding "???" to
> https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
> and it produces a working regexp.
> Bayes stopword languages must also be enabled with the
> "bayes_stopword_languages" config keyword; by default only English is
> enabled.
> Giovanni
>
> On 12/28/23 17:06, Jimmy wrote:
> > bayes_stopword_th https://pastebin.pl/view/0838138d
> > Sample mail https://pastebin.pl/view/e5a2c5b8
> >
> > Jimmy
> >
> >
> > On Thu, Dec 28, 2023 at 10:59 PM <giovanni@paclan.it> wrote:
> >
> > Could you share the config line and a sample message you are using?
> > Giovanni
> >
> > On 12/28/23 16:26, Jimmy wrote:
> > > Yes, I have done that, and I am also editing Plugin/Bayes.pm to
> > > investigate why it is not being skipped. I suspect that if words are not
> > > separated by spaces, longer words may not match those patterns.
> > >
> > > Jimmy
> > >
> > > On Thu, Dec 28, 2023 at 10:13 PM <giovanni@paclan.it> wrote:
> > >
> > > "spamassassin -D bayes" will tell you; you should see a line like:
> > > bayes: skipped token 'from' because it's in stopword list for language 'en'
> > >
> > > Giovanni
> > >
> > > On 12/28/23 15:45, Jimmy wrote:
> > > > The pattern passes the test script, but I still need to check
> > > > whether Bayes learning will recognize and skip the words that match
> > > > this pattern.
> > > >
> > > > Thank you.
> > > >
> > > >
> > > > On Thu, Dec 28, 2023 at 9:22 PM <giovanni@paclan.it> wrote:
> > > >
> > > > On 12/28/23 12:59, Jimmy wrote:
> > > > > Hi,
> > > > >
> > > > > > I'm seeking assistance in adding a stopword list for Asian
> > > > > > languages in Unicode. Although I have comprehensive word lists,
> > > > > > my attempts to generate a regex pattern and test it have been
> > > > > > unsuccessful: the pattern fails to match or skip tokens from the
> > > > > > newly added stopword list.
> > > > >
> > > > > > I created the regex pattern using the following code:
> > > > > >
> > > > > > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> > > > >
> > > > > Afterward, I converted it to UTF-8 hex.
> > > > >
> > > > > > I'm wondering if there are any tools available to facilitate
> > > > > > the creation of these regex patterns.
> > > > >
> > > > > I have used Regexp::Trie to create Bayes stopwords in the past;
> > > > > the code is similar to:
> > > >
> > > > > -----------------------------------------------------------------------------------------------------------
> > > > > use strict;
> > > > > use warnings;
> > > > >
> > > > > use Encode;
> > > > > use Regexp::Trie;
> > > > >
> > > > > # Read one stopword per line from STDIN (raw UTF-8 bytes).
> > > > > my @input = <STDIN>;
> > > > > my $rt = Regexp::Trie->new;
> > > > > for my $w ( @input ) {
> > > > >     chomp($w);
> > > > >     $rt->add($w);
> > > > > }
> > > > > my $regexp = $rt->regexp;
> > > > > # Walk the regexp byte by byte; a byte that is not valid UTF-8 on its
> > > > > # own (part of a multi-byte character) is printed as a \xNN escape.
> > > > > my @reg = split //, $regexp;
> > > > > for my $c ( @reg ) {
> > > > >     my $char = $c;    # keep a copy, decode() may modify its argument
> > > > >     my $test;
> > > > >     eval { $test = decode( 'utf8', $c, Encode::FB_CROAK ) };
> > > > >     if( $@ ) {
> > > > >         print '\x' . sprintf("%x", ord($c));
> > > > >     } else {
> > > > >         print $char;
> > > > >     }
> > > > > }
> > > > >
> > > > > -----------------------------------------------------------------------------------------------------------
> > > >
> > > > Giovanni
> > > >
> > >
> >
>
>
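
As a cross-check of the "converted it to UTF-8 hex" step discussed in the quoted messages, here is a minimal Python sketch of the same idea. It is not the Regexp::Trie script from the thread and does no trie optimisation. It joins the words into a plain alternation and writes every non-ASCII byte as a \xNN escape. The function name to_hex_regexp and the sample words are hypothetical.

```python
import re

def to_hex_regexp(words):
    """Join words into one alternation, writing non-ASCII bytes as \\xNN."""
    def escape(word):
        out = []
        for byte in word.encode("utf-8"):
            if byte < 0x80:
                # ASCII: keep the character, escaping regex metacharacters.
                out.append(re.escape(chr(byte)))
            else:
                # Part of a multi-byte UTF-8 sequence: emit a hex escape.
                out.append("\\x%02x" % byte)
        return "".join(out)
    return "(?:" + "|".join(escape(w) for w in words) + ")"

# Hypothetical sample input: one Thai word and one ASCII word.
print(to_hex_regexp(["และ", "of"]))
```

The output is pure ASCII, so it can be pasted into a config file without encoding problems and compared against what the Perl script prints for the same word list.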