Mailing List Archive

Newb on sa-learn - didn't get what I expected as a response...
Hi SA users,

I've FINALLY built up a "corpus" of ham vs spam and also FINALLY had some
time to spend on this and just ran sa-learn on, oh, IDK, some 10k email
messages or so, I'd guess. And along the way, I NEVER ONCE got the kind of
output response back from sa-learn that I expected.

For example, here I run it against a file containing just over 2100 spam:

$ sa-learn -u richard --spam spam
Learned tokens from 0 message(s) (0 message(s) examined)

(I was running it as root - which the docs don't mention but I figure is
what I'm supposed to do!)

In the end, I ran it on about four dozen files of ham and about 6 or so
files of spam emails, carefully curated. In all these files, I NEVER saw
it say it examined more than 1 message and EVERY time it said it examined
a message it also said it "learned" 1 token.

Uh... I try not to be stupid; is this normal or am I "doing it wrong"?

Thanks,
Richard
Re: Newb on sa-learn - didn't get what I expected as a response...
On 7/7/2023 11:04 AM, Richard wrote:
> For example, here I run it against a file containing just over 2100 spam:

> In the end, I ran it on about four dozen files of ham and about 6 or
> so files of spam emails, carefully curated. In all these files, I
> NEVER saw it say it examined more than 1 message and EVERY time it
> said it examined a message it also said it "learned" 1 token.
>

I believe the default format is Maildir. You mention a single file w/
multiple emails, which suggests you might be running mbox format? If so,
try the --mbox command line switch.
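
A quick way to check the mbox guess before re-running: an mbox file starts each message with a line beginning "From " (with the trailing space), so counting those lines should roughly match the number of messages you expect. The sketch below assumes a file named `spam` as in the original post; `count_mbox_messages` is a made-up helper name, not part of SpamAssassin, and the count can be slightly off if message bodies contain unescaped "From " lines.

```shell
#!/bin/sh
# Count mbox message separators ("From " at the start of a line).
# count_mbox_messages is a hypothetical helper, not an sa-learn feature.
count_mbox_messages() {
    grep -c '^From ' "$1"
}

# Example: only runs if a file named "spam" exists and sa-learn is installed.
if [ -f spam ] && command -v sa-learn >/dev/null 2>&1; then
    echo "spam looks like it holds $(count_mbox_messages spam) message(s)"
    # --mbox tells sa-learn to split the file into individual messages.
    sa-learn --mbox --spam spam
fi
```

If the count is in the thousands while sa-learn reports 0 or 1 message examined, the file is most likely being read as a single message.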

-- Jared Hall
Re: Newb on sa-learn - didn't get what I expected as a response...
> On 07.07.23 at 17:04, Richard wrote:
>> I've FINALLY built up a "corpus" of ham vs spam and also FINALLY had some
>> time to spend on this and just ran sa-learn on, oh, IDK, some 10k email
>> messages or so, I'd guess. And along the way, I NEVER ONCE got the kind of
>> output response back from sa-learn that I expected.
>>
>> For example, here I run it against a file containing just over 2100 spam:
>>
>> $ sa-learn -u richard --spam spam
>> Learned tokens from 0 message(s) (0 message(s) examined)
>>
>> (I was running it as root - which the docs don't mention but I figure is
>> what I'm supposed to do!)
>
> why do you suppose that?

...Uh... Because otherwise why the -u flag and comments about running it
for virtual users?

> you NEVER run anything as root which isn't a root task - no matter what
>
> you run it as the same user your spamd is running as

Good to know! ...I'd recommend an update to the doc / web page to point
out it should be run as the user ID of whatever spamd is using!

Now, I'd guess I should, as root:

sa-learn --clear

Since I hadn't run sa-learn before, EVER, that I was aware of!

...And THEN run as I've just learned. And, BTW, this makes me happy I
scripted calling sa-learn, so re-doing this will be easy!
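
For what it's worth, a re-run script along those lines might look like the sketch below. The directory layout, the `learn_dir` helper and the `DRY_RUN` variable are assumptions for illustration only; `--mbox`, `--ham` and `--spam` are the actual sa-learn switches. With `DRY_RUN=1` it only prints the commands it would run, which makes it easy to check the file list first.

```shell
#!/bin/sh
# Feed every mbox file in a directory to sa-learn.
# learn_dir, DRY_RUN and the directory layout are illustrative assumptions.
learn_dir() {
    kind=$1   # "ham" or "spam"
    dir=$2    # directory holding mbox files
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "sa-learn --mbox --$kind $f"
        else
            sa-learn --mbox "--$kind" "$f"
        fi
    done
}

# Usage (hypothetical paths):
#   DRY_RUN=1 learn_dir spam ./spam
#   learn_dir ham ./ham
```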

As an aside, "curating" modern ham from my inboxes is time consuming so a
lot of the ham I used is older, from saved folders... I saw the warning
about old vs new, and the potential effects of that; as my inboxes
typically have around 2k messages in them, and going through and making
sure NONE are spam is time consuming, is it worth tossing in a few at a
time from recent days, such as a day at a time?

...My guess is that nobody can really say what the Bayesian system is
going to pick up on exactly, so YES, it can't hurt?!

Thanks,
Richard
Re: Newb on sa-learn - didn't get what I expected as a response...
On Fri, 7 Jul 2023, Jared Hall wrote:
>
> I believe the default format is Maildir. You mention a single file w/
> multiple emails, which suggests you might be running mbox format? If so, try
> the --mbox command line switch.
>
> -- Jared Hall
>


GREAT CATCH, Jared; you are correct, mine are in mbox format, I think -
somehow I guess I overlooked the --mbox switch!

Thanks,
Richard
Re: Newb on sa-learn - didn't get what I expected as a response...
>>> (I was running it as root - which the docs don't mention but I figure is
>>> what I'm supposed to do!)
>>
>> why do you suppose that?
>
> ...Uh... Because otherwise why the -u flag and comments about running it for
> virtual users?
>
>> you NEVER run anything as root which isn't a root task - no matter what
>>
>> you run it as the same user your spamd is running as
>
> Good to know! ...I'd recommend an update to the doc / web page to point out
> it should be run as the user ID of whatever spamd is using!
>

It appears that it IS running as root?! OR maybe as "sa-milt" ... As root
I got this:

# ps auxwww | grep spamd
root 100805 0.0 0.3 158208 121164 ? Ss 00:37 0:05 /usr/bin/perl -T -w /usr/bin/spamd -c -m5 -H --razor-home-dir=/var/lib/razor/ --razor-log-file=sys-syslog
# grep spam /etc/passwd
sa-milt:x:976:975:SpamAssassin Milter:/var/lib/spamass-milter:/sbin/nologin

So... run it as sa-milt (my guess), or as root?
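
To read the answer off the process table rather than guess: the first column of `ps aux` output is the user the process runs as. A small sketch follows; `spamd_users` is a hypothetical helper name, `SPAMD_USER` and the path are placeholders, and the `sudo -u` line assumes sudo is installed and permitted for that account.

```shell
#!/bin/sh
# Print the distinct users that own spamd processes, given `ps aux`-style
# lines on stdin. spamd_users is a hypothetical helper for illustration.
# Matching on "/spamd" skips the "grep spamd" process itself.
spamd_users() {
    awk '/\/spamd( |$)/ { print $1 }' | sort -u
}

# Usage:
#   ps auxwww | spamd_users
# Then learn as that user (placeholder name and path):
#   sudo -u SPAMD_USER sa-learn --mbox --spam /path/to/spam
```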

Note that this is on a Fedora Server v 38 - the OS is a couple of months
old.

Thanks again,
Richard
Re: Newb on sa-learn - didn't get what I expected as a response...
>> It appears that it IS running as root?! OR maybe as "sa-milt" ... As root
>> I got this:
>>
>> # ps auxwww | grep spamd
>> root 100805 0.0 0.3 158208 121164 ? Ss 00:37 0:05 /usr/bin/perl -T -w /usr/bin/spamd -c -m5 -H --razor-home-dir=/var/lib/razor/ --razor-log-file=sys-syslog
>> # grep spam /etc/passwd
>> sa-milt:x:976:975:SpamAssassin Milter:/var/lib/spamass-milter:/sbin/nologin
>>
>> So... run it as sa-milt (my guess), or as root?
>>
>> Note that this is on a Fedora Server v 38 - the OS is a couple of months
>> old
>
> so your whole setup is more than questionable
>
> give common sense a few seconds: do you REALLY want to process mails
> containing junk and malware with root privileges?

Frankly, you make a good point and I was unaware! Back in January we had a
system failure - never mind the details - and had to reinstall the OS from
scratch, then updated when the new version came out... And I _swear_ I did
_NOT_ change anything regarding SA from the defaults beyond what was required
to just get it running. (We didn't lose /etc, so I just plunked the existing
Postfix config back in place and we were up and running!)

My guess is that this is the default on Fedora Server, however, I have
another system I can confirm that with - but not today, probably.

> that below is Fedora 37, originally from 2014 cloned from our golden-master
> VM dating back to 2008 with Fedora 9
>
> not a single distro-systemd-unit in use - never
>
> [root@mail-gw:~]$ ps auxwww | grep spam
> sa-milt 436 0.0 1.2 69708 65192 ? SNs Jun16 11:09
> /usr/bin/perl -T -w /usr/bin/spamd --max-children=1 --max-conn-per-child=1000
> --local --socketpath=/run/spamd-debug/spamassassin.sock --socketmode=0666
> --siteconfigpath=/etc/mail/spamassassin-debug --syslog=stderr 2>/dev/null

...OK, I get it!... I'm not sure "what went wrong" so we ended up with
this, but I'm also not sure what the short path is to fixing this issue.

There's already an sa-milt in /etc/passwd, but the files are all owned by
root - e.g. the files in /usr/share/spamassassin. Surely these would need to
be changed, one would think, and somewhere the code told to run as
sa-milt, which I presume isn't THAT hard to find, though I've never dealt
with it before.

THANKS for pointing this out!

Richard
Re: Newb on sa-learn - didn't get what I expected as a response...
On Fri, 7 Jul 2023, Reindl Harald wrote:
>
> /usr is package territory and MUST NOT BE owned by anybody other than root and
> read-only for the world
>
> just give common sense another few seconds!
>
> only the files/folders which are supposed to be written by any daemon should
> be writable for the user the daemon is running as
>
> you don't want an exploit happening somewhere in the filter chain to modify your
> binaries/scripts
>

OF COURSE!

For me, THE key questions have to do with the learning aspect (and maybe
logging): What's the directory that, for example, sa-learn has to write
into? ... Again, pointers would be nice - it's not like I was planning to
spend my day doing this; I have a customer visit planned that's coming up
soon! I just don't have much time!

Richard
Re: Newb on sa-learn - didn't get what I expected as a response...
On Fri, 7 Jul 2023, Reindl Harald wrote:
>>
>> OF COURSE!
>>
>> For me, THE key questions have to do with the learning aspect (and maybe
>> logging): What's the directory that, for example, sa-learn has to write
>> into? ... Again, pointers would be nice - it's not like I was planning to
>> spend my day doing this; I have a customer visit planned that's coming up
>> soon! I just don't have much time!
>
> sorry - i can't translate our configs and setup dating 9 years back and
> nothing in common with anything from the distribution - "sa-learn" needs to
> write where the bayes-db lives, nothing else
>
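
As a rough pointer for the "where the bayes-db lives" part: SpamAssassin's `bayes_path` setting controls it, and when that is unset the database defaults to `~/.spamassassin/bayes` for whichever user runs sa-learn (files such as `bayes_toks` and `bayes_seen`). The sketch below only looks for a `bayes_path` line in one config file; the helper name `find_bayes_path` and the `/etc/mail/spamassassin/local.cf` location are assumptions that may not match a given install.

```shell
#!/bin/sh
# Print the configured bayes_path from a SpamAssassin config file, or the
# per-user default if none is set. find_bayes_path is a hypothetical helper.
find_bayes_path() {
    cf=$1
    p=""
    if [ -r "$cf" ]; then
        # First uncommented "bayes_path <value>" line wins.
        p=$(awk '$1 == "bayes_path" { print $2; exit }' "$cf")
    fi
    if [ -n "$p" ]; then
        echo "$p"
    else
        echo "$HOME/.spamassassin/bayes"
    fi
}

# Usage (assumed config location):
#   find_bayes_path /etc/mail/spamassassin/local.cf
#   sa-learn --dump magic    # prints token and message counts from the DB
```

Whatever path comes back, the directory holding it needs to be writable by the user sa-learn (and spamd) runs as; nothing under /usr does.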

No worries, you've been a big help! ... You and Jared Hall!

Richard