Mailing List Archive

Learning from MailDirs
Hi,

I have one big MailDir full of spam and one full of ham, so I thought I'd take
the opportunity to get SA to learn from those messages.

When trying to learn the spam, I tried:

$ sa-learn --spam --showdots Mail/.0-Spam/
..
Learned from 1 message(s) (2 message(s) examined).

...but I have a lot more than 2 messages in there!

So, I had a look at the man pages, thinking the MailDir format might be a
problem, and saw the closest relevant thing was:

--dir Ignored; historical compatibility

Hmmm... so I thought I'd mail here...

What should I do to have SA learn from mail in MailDirs?

cheers,

Chris

--
Simplistix - Content Management, Zope & Python Consulting
- http://www.simplistix.co.uk
Re: Learning from MailDirs [ In reply to ]
Chris

I front end the maildirs with an imap server, so I run a small perl
script to login to the imap server and dig out the messages from there.


--
Martin Hepworth
Snr Systems Administrator
Solid State Logic
Tel: +44 (0)1865 842300



Re: Learning from MailDirs [ In reply to ]
Hi Martin,

Martin Hepworth wrote:

> I front end the maildirs with an imap server, so I run a small perl
> script to login to the imap server and dig out the messages from there.

Any chance I could "borrow" that script?

cheers,

Chris

Re: Learning from MailDirs [ In reply to ]
On Thu, Feb 26, 2004 at 12:10:53PM +0000, Martin Hepworth <martinh@solid-state-logic.com> wrote:
> I front end the maildirs with an imap server, so I run a small perl
> script to login to the imap server and dig out the messages from there.

This is unnecessary.

The maildir format stores messages in ./new and ./cur within the
maildir. So, simply pass in both of those directories to
sa-learn, rather than the top-level directory. This is how I
have been doing it, and it seems to be working fine.
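
A minimal sketch of that approach as a shell function, not taken from anyone's actual setup. The function name learn_maildir and the SA_LEARN variable are made up for illustration (SA_LEARN only exists so the function can be dry-run with echo); --spam and --ham are the sa-learn options already used in this thread.

```shell
#!/bin/sh
# SA_LEARN is a hypothetical override for dry runs; it defaults to sa-learn.
SA_LEARN=${SA_LEARN:-sa-learn}

# learn_maildir CLASS MAILDIR
# Feed MAILDIR/cur and MAILDIR/new to sa-learn, never the top-level directory.
learn_maildir() {
    class=$1      # "spam" or "ham"
    maildir=$2
    for sub in cur new; do
        # Skip a subdirectory that does not exist instead of failing.
        [ -d "$maildir/$sub" ] || continue
        $SA_LEARN "--$class" "$maildir/$sub"
    done
}
```

For example, learn_maildir spam Mail/.0-Spam would run sa-learn once on Mail/.0-Spam/cur and once on Mail/.0-Spam/new, leaving anything stored in the top-level directory alone.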

I would, however, appreciate it if sa-learn could recognize a
maildir and do the same automatically. Or perhaps even
recursively, because I have a maildir hierarchy, and learning
from everything in that requires a bit of shell magic.

--
Matthew Hunter (matthew@infodancer.org)
Public Key: http://matthew.infodancer.org/public_key.txt
Homepage: http://matthew.infodancer.org/index.jsp
Politics: http://www.triggerfinger.org/index.jsp
Re: Learning from MailDirs [ In reply to ]
On Thu, Feb 26, 2004 at 06:26:58AM -0600, Matthew Hunter wrote:
> On Thu, Feb 26, 2004 at 12:10:53PM +0000, Martin Hepworth <martinh@solid-state-logic.com> wrote:
> > I front end the maildirs with an imap server, so I run a small perl
> > script to login to the imap server and dig out the messages from there.
>
> This is unnecessary.
>
> The maildir format stores messages in ./new and ./cur within the
> maildir. So, simply pass in both of those directories to
> sa-learn, rather than the top-level directory. This is how I
> have been doing it, and it seems to be working fine.
>
> I would, however, appreciate it if sa-learn could recognize a
> maildir and do the same automatically. Or perhaps even
> recursively, because I have a maildir hierarchy, and learning
> from everything in that requires a bit of shell magic.

I use this script to collect spamfeed from another computer and feed
it to sa-learn on the server:

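#!/bin/sh

# Stop if the working directory is missing, so the rm -rf at the end
# cannot run anywhere else.
cd /home/spamd || exit 1

# Fetch the spam and ham maildirs from the other machine.
rsync -e ssh -rva --exclude hcache.db js@burger:/home/js/Mail/nuwespam/ /home/spamd/spam/
rsync -e ssh -rva --exclude hcache.db js@burger:/home/js/Mail/ham/ /home/spamd/ham/

# Learn from read (cur) and unread (new) mail. --dir is ignored by
# sa-learn and kept only for historical compatibility.
sa-learn --spam --dir spam/cur/
sa-learn --spam --dir spam/new/
sa-learn --ham --dir ham/cur/
sa-learn --ham --dir ham/new/

# Remove the local copies once they have been learned.
rm -rf spam/
rm -rf ham/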


Regards
Johann

--
Johann Spies Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

"But God commendeth his love toward us, in that, while
we were yet sinners, Christ died for us."
Romans 5:8
Re: Learning from MailDirs [ In reply to ]
Do you only want to do sa-learn, or do you also want to report via
spamassassin -r? sa-learn will accept directories as fodder. spamassassin -r
only accepts single files. The advantage to using spamassassin -r is that you
also feed the spam to all the checksum services that you have configured such
as dcc, pyzor, and razor.

I wrote a script which is a front end for spamassassin -r and takes a
directory as an argument. All files within the directory are then learned as
spam *and* reported to the checksum services. I can send it to you if you
want, or search the archives; I think I posted it a few weeks ago.
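
For illustration only, and not the script mentioned above: a rough shell sketch of the same idea, looping over a directory and handing each file to spamassassin -r on standard input. The names report_dir and SPAMASSASSIN are hypothetical; SPAMASSASSIN defaults to the real command and is there so the loop can be tried against a stub first.

```shell
#!/bin/sh
# SPAMASSASSIN is a hypothetical override for testing; it defaults to spamassassin.
SPAMASSASSIN=${SPAMASSASSIN:-spamassassin}

# report_dir DIR
# Report every regular file in DIR as spam, one message per invocation,
# because spamassassin -r takes a single message on standard input.
report_dir() {
    dir=$1
    count=0
    for file in "$dir"/*; do
        # Skip subdirectories and the unexpanded glob of an empty directory.
        [ -f "$file" ] || continue
        $SPAMASSASSIN -r < "$file" || echo "failed: $file" >&2
        count=$((count + 1))
    done
    echo "processed $count file(s)"
}
```

With a maildir this would be called once per message directory, for example report_dir Mail/.0-Spam/cur.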

Come to think of it, maybe I should put it in one of the spamassassin wikis?
Anyone have any suggestions on where it should go?



--
Kurt Yoder
Sport & Health network administrator
Re: Learning from MailDirs [ In reply to ]
Matthew Hunter wrote:
> The maildir format stores messages in ./new and ./cur within the
> maildir. So, simply pass in both of those directories to
> sa-learn, rather than the top-level directory. This is how I
> have been doing it, and it seems to be working fine.

Hmm, I noticed it "learned" from two messages, even for empty maildirs, so I
suspect there may be a couple of non-message files lying in there :-S

> I would, however, appreciate it if sa-learn could recognize a
> maildir and do the same automatically.

Hear hear! :-)

> Or perhaps even
> recursively, because I have a maildir hierarchy, and learning
> from everything in that requires a bit of shell magic.

Hear hear hear hear :-)

cheers,

Chris

Re: Learning from MailDirs [ In reply to ]
Kurt Yoder wrote:

> Do you only want to do sa-learn, or do you also want to report via
> spamassassin -r?

Dunno what spamassassin -r is, so thanks for pointing it out to me :-)

> I wrote a script which is a front end for spamassassin -r and takes a
> directory as an argument. All files within the directory are then learned as
> spam *and* reported to the checksum services.

Is there something similar that could be done for non-spam?

> I can send it to you if you
> want, or search the archives; I think I posted it a few weeks ago.

Sending it or a link to it in the archives would be great, thanks :-)

> Come to think of it, maybe I should put it in one of the spamassassin wikis?

Probably a good idea too...

cheers,

Chris

Re: Learning from MailDirs [ In reply to ]
On Fri, Feb 27, 2004 at 10:49:03AM +0000, Chris Withers <lists@simplistix.co.uk> wrote:
> Matthew Hunter wrote:
> >The maildir format stores messages in ./new and ./cur within the
> >maildir. So, simply pass in both of those directories to
> >sa-learn, rather than the top-level directory. This is how I
> >have been doing it, and it seems to be working fine.
> Hmm, I noticed it "learned" from two messages, even for empty maildirs, so
> I suspect there may be a couple of non-message files lying in there :-S

The maildir spec allows metafiles to be stored in the top level
directory. Many IMAP servers store message data indexes there,
for example.
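
A quick way to see this on a given maildir, sketched as a shell function. The name maildir_summary is made up here, and -maxdepth assumes GNU or BSD find. It counts the message files under cur and new separately from the regular files sitting in the top-level directory, which are presumably what sa-learn was examining when handed the top-level directory.

```shell
#!/bin/sh
# maildir_summary MAILDIR
# Print how many message files and how many top-level metafiles MAILDIR holds.
maildir_summary() {
    maildir=$1
    # Message files live under cur and new.
    msgs=$(find "$maildir/cur" "$maildir/new" -type f | wc -l | tr -d ' ')
    # Regular files directly in the top-level directory are not messages.
    meta=$(find "$maildir" -maxdepth 1 -type f | wc -l | tr -d ' ')
    echo "messages=$msgs metafiles=$meta"
}
```

If the metafiles count matches the number of messages sa-learn reports for the top-level directory, that would explain the "2 message(s) examined" result.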

> >Or perhaps even
>recursively, because I have a maildir hierarchy, and learning
> >from everything in that requires a bit of shell magic.
> Hear hear hear hear :-)

FWIW, here's how I learn currently (learning from all read
messages, except known spam, and mailing lists dealing with
spam):

find /home/matthew/Maildir/ -type d -name "cur" | grep -v -i spam | xargs --max-args=1 -t sa-learn --no-rebuild --ham
sa-learn --showdots --no-rebuild --spam /home/matthew/Maildir/.Personal/.Spam/cur
sa-learn --rebuild

It would probably be more efficient to use a temporary store and
delete the message after learning, rather than counting on SA to
skip messages it has already seen, but I haven't thought that
through yet.
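
The find pipeline above can also be wrapped in a couple of shell functions, which makes it easier to check which directories would be learned before running anything. This is only a sketch: ham_cur_dirs, learn_ham_tree and the SA_LEARN override are invented names, and the sa-learn flags are the ones used in this thread, so check them against the man page of your version.

```shell
#!/bin/sh
# SA_LEARN is a hypothetical override for dry runs; it defaults to sa-learn.
SA_LEARN=${SA_LEARN:-sa-learn}

# ham_cur_dirs ROOT
# Print every directory named cur under ROOT whose path does not
# contain "spam" in any capitalisation.
ham_cur_dirs() {
    find "$1" -type d -name cur | grep -v -i spam
}

# learn_ham_tree ROOT
# Learn each of those directories as ham, then rebuild the database once.
learn_ham_tree() {
    ham_cur_dirs "$1" | while IFS= read -r dir; do
        $SA_LEARN --no-rebuild --ham "$dir"
    done
    $SA_LEARN --rebuild
}
```

Running ham_cur_dirs /home/matthew/Maildir/ on its own shows the list first. Note that grep matches against the whole path, so a root directory with "spam" in its name would exclude everything. The read loop copes with spaces in folder names, which plain xargs does not.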

--
Matthew Hunter (matthew@infodancer.org)
Re: Learning from MailDirs [ In reply to ]
Chris Withers said:
> Kurt Yoder wrote:
>
>> Do you only want to do sa-learn, or do you also want to report via
>> spamassassin -r?
>
> Dunno what spamassassin -r is, so thanks for pointing it out to me :-)
>
>> I wrote a script which is a front end for spamassassin -r and takes a
>> directory as an argument. All files within the directory are then learned as
>> spam *and* reported to the checksum services.
>
> Is there something similar that could be done for non-spam?

You don't need to report non-spam, so you can just use sa-learn. See
http://wiki.spamassassin.org/w/BayesInSpamAssassin

>> I can send it to you if you
>> want, or search the archives; I think I posted it a few weeks ago.
>
> Sending it or a link to it in the archives would be great, thanks :-)

Posted; see

http://wiki.spamassassin.org/w/report_5fspam_2epl

>> Come to think of it, maybe I should put it in one of the spamassassin
>> wikis?
>
> Probably a good idea too...

Done; see above links.

--
Kurt Yoder
Re: Learning from MailDirs [ In reply to ]
Matthew Hunter wrote:

> The maildir spec allows metafiles to be stored in the top level
> directory. Many IMAP servers store message data indexes there,
> for example.

Aha, okay...

> find /home/matthew/Maildir/ -type d -name "cur" | grep -v -i spam | xargs --max-args=1 -t sa-learn --no-rebuild --ham
> sa-learn --showdots --no-rebuild --spam /home/matthew/Maildir/.Personal/.Spam/cur
> sa-learn --rebuild

Looks cool, thanks :-)

cheers,

Chris
Re: Learning from MailDirs [ In reply to ]
On Friday 27 February 2004 2:57 am, Matthew Hunter wrote:
> FWIW, here's how I learn currently (learning from all read
> messages, except known spam, and mailing lists dealing with
> spam):
>
> find /home/matthew/Maildir/ -type d -name "cur" | grep -v -i spam | xargs --max-args=1 -t sa-learn --no-rebuild --ham

I had to do the first couple of parts by hand to see what is happening, but it
sort of makes sense: you're eliminating any mail directories that happen to
have the word "spam" as part of the NAME of the directory (folder). At
first, I thought grep would scan the FILES in each directory passed to it,
rather than "the list of filenames" itself [really gotta learn piped
precedence :) ]

Unfortunately, this doesn't take into consideration "uncaught spam" -- i.e.,
stuff that really is spam that is in the "wrong" folder. In my case, I have
a bit of "history" built up, and although for the most part I've been pretty
diligent in removing the garbage, I wouldn't be 100% certain to use this
shotgun approach (but that's just me -- this will probably be fine for
others)

--
Yet another Blog: http://osnut.homelinux.net
Re: Learning from MailDirs [ In reply to ]
On Sat, Mar 06, 2004 at 09:11:41AM -0800, Tom Emerson <osnut@pacbell.net> wrote:
> On Friday 27 February 2004 2:57 am, Matthew Hunter wrote:
> > FWIW, here's how I learn currently (learning from all read
> > messages, except known spam, and mailing lists dealing with
> > spam):
> >
> > find /home/matthew/Maildir/ -type d -name "cur" | grep -v -i spam | xargs --max-args=1 -t sa-learn --no-rebuild --ham
>
> I had to do the first couple of parts by hand to see what is happening, but it
> sort of makes sense: you're eliminating any mail directories that happen to
> have the word "spam" as part of the NAME of the directory (folder). At
> first, I thought grep would scan the FILES in each directory passed to it,
> rather than "the list of filenames" itself [really gotta learn piped
> precedence :) ]
>
> Unfortunately, this doesn't take into consideration "uncaught spam" -- i.e.,
> stuff that really is spam that is in the "wrong" folder. In my case, I have
> a bit of "history" built up, and although for the most part I've been pretty
> diligent in removing the garbage, I wouldn't be 100% certain to use this
> shotgun approach (but that's just me -- this will probably be fine for
> others)

Actually, it does exactly that. The "find" command returns only
directories named "cur" within the tree, which (in a maildir
folder tree) means "only directories with read mail". New mail,
before an email client sees it, is stored in a directory called
"new".

To make sure you don't learn missed spam, move any spam into the
spam folder the first time you see it; it will then never be
sitting in the cur directory of a non-spam folder when the
learner runs.

Note that the spam learning works the same way -- you won't learn
from spam until you've checked the folder for false positives and
then left that folder. It won't learn anything you haven't
at least glanced at.
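
Strictly speaking, cur holds messages a mail client has already picked up from new, and whether a message was actually viewed is recorded in its filename. As a hedged sketch: the maildir convention appends :2, followed by flag letters to filenames in cur, with S meaning seen. The function name read_messages is invented for this example.

```shell
#!/bin/sh
# read_messages MAILDIR
# List the files in MAILDIR/cur whose flag section (after ":2,")
# contains S, the maildir flag for a message the user has viewed.
read_messages() {
    find "$1/cur" -type f -name '*:2,*S*'
}
```

Comparing that list with a plain listing of cur shows whether the folder contains messages the client has picked up but that were never opened.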

It's been working very well. I have maybe one or two spams a
month that slip through. Of course, I also use other measures.

--
Matthew Hunter (matthew@infodancer.org)
Re: Learning from MailDirs [ In reply to ]
On 27 Feb 2004, at 03:57, Matthew Hunter wrote:
> FWIW, here's how I learn currently (learning from all read
> messages, except known spam, and mailing lists dealing with
> spam):
>
> find /home/matthew/Maildir/ -type d -name "cur" | grep -v -i spam | xargs --max-args=1 -t sa-learn --no-rebuild --ham
> sa-learn --showdots --no-rebuild --spam /home/matthew/Maildir/.Personal/.Spam/cur
> sa-learn --rebuild

Very nice! I use something similar to deal with my current mboxes, but
I am moving to maildir, so I will file this away.


--
The older you get the more you need the people you knew when you were
young.