Mailing List Archive

Batch processing a mbox?
Is there a way to batch process a mailbox with SpamAssassin (sort it to ham
and spam)?

I had a mailbox of ~10000 messages, ~70% of which were spam.

The best I could think of was to split the mbox with tools/mboxsplit, then
run the messages through spamassassin --exit-code one by one and cat them to
the HAM or SPAM mbox accordingly. Not too elegant...


-- v --

v@iki.fi
Re: Batch processing a mbox?
On Fri, Feb 13, 2004 at 01:53:28PM +0200, you [Ville Herva] wrote:
> Is there a way to batch process a mailbox with SpamAssassin (sort it to ham
> and spam)?
>
> I had this ~10000 mail box ~70% of which was spam.
>
> The best I could think of was to split the mbox with tools/mboxsplit, then
> run them through spamassassin --exit-code one by one and cat them the HAM or
> SPAM mbox accordingly. Not too elegant...

FWIW, here's roughly what I did in case anyone is interested:

mkdir spool; cd spool
perl /usr/share/doc/spamassassin-tools-2.63/tools/mboxsplit -f "%06i" < ../Mailbox
cd ..

# --exit-code makes spamassassin exit non-zero when the message is spam
for i in spool/*; do
    if spamassassin --exit-code < "$i" > temp; then
        cat temp >> HAM
    else
        cat temp >> SPAM
    fi
    rm temp
done



-- v --

v@iki.fi
Re: Batch processing a mbox?
Ville Herva <vherva@viasys.com> wrote:
> Is there a way to batch process a mailbox with SpamAssassin
> (sort it to ham and spam)?
>
> I had this ~10000 mail box ~70% of which was spam.
>
> The best I could think of was to split the mbox with
> tools/mboxsplit, then run them through spamassassin
> --exit-code one by one and cat them the HAM or SPAM mbox
> accordingly. Not too elegant...

Have you implemented spamassassin since those mails were received? Here's what
I do after a significant procmail/spamassassin modification to clean up an
mbox:

1. mv mbox mbox.old (or rename to any other name not used for sorting)
2. formail -s procmail < mbox.old

formail breaks out each message and sends it to procmail (using the default
procmailrc and ~/.procmailrc files) for sorting. I have spamassassin and clamav
run through procmail.

If I wind up with too many custom headers, I've written a small pipe script
that runs spamassassin -d and formail to remove them.

This works quite well for me.

- Bob
Re: Batch processing a mbox?
In a fit of excitement, I wrote:
> [...] Here's what I do after a significant procmail/spamassassin
> modification to clean up an mbox:
>
> 1. mv mbox mbox.old (or rename to any other name not used for sorting)
> 2. formail -s procmail < mbox.old

I should add that I do this AS THE MAILBOX OWNER (su - owner), so their rules
are used appropriately.

- Bob
Re: Batch processing a mbox?
On Fri, Feb 13, 2004 at 08:19:42AM -0500, you [Bob George] wrote:
> Ville Herva <vherva@viasys.com> wrote:
> > Is there a way to batch process a mailbox with SpamAssassin
> > (sort it to ham and spam)?
> >
> > I had this ~10000 mail box ~70% of which was spam.
> >
> > The best I could think of was to split the mbox with
> > tools/mboxsplit, then run them through spamassassin
> > --exit-code one by one and cat them the HAM or SPAM mbox
> > accordingly. Not too elegant...
>
> Have you implemented spamassassin since those mails were received?

Yes, in the case I mentioned I had just set up spamassassin and was cleaning
up my old mailbox. But it turns out there's a need for this in other cases,
for example for users whose mailbox has been unattended for a long time.
Unfortunately, those users have neither a spamassassin configuration nor a
procmailrc, so in that case it would be better to just batch process the
mailbox with the default spamassassin config.

> Here's what I do after a significant procmail/spamassassin modification to
> clean up an mbox:
>
> 1. mv mbox mbox.old (or rename to any other name not used for sorting)
> 2. formail -s procmail < mbox.old
>
> formail breaks out each message and sends it to procmail (using the default
> procmailrc and ~/.procmailrc files) for sorting. I have spamassassin and clamav
> run through procmail.

That's clever. That requires one to create a procmail recipe beforehand,
though.


thanks,

-- v --

v@iki.fi
Re: Batch processing a mbox?
Ville Herva <vherva@viasys.com> wrote:
> [...]
> That's clever. That requires one to create procmail recipe
> beforehand, though.

Hmm.. You originally wrote:
>>> Is there a way to batch process a mailbox with SpamAssassin
>>> (sort it to ham and spam)?

What did you have in mind to do the actual sorting if *not* procmail? SA will
TAG the messages, but not actually sort them.

I think you'll want to create a "sorting" procmailrc if you're not setting up
one site-wide, then just call it via:

formail -s procmail spamsort.cf < mbox.old

This would work for a one-time job.
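
For illustration, a "sorting" rcfile along those lines might look like the
sketch below. The helper name write_spamsort_rc, the output directory and the
HAM/SPAM mailbox names are assumptions made for this example, not something
taken from an actual setup; the recipes rely on spamassassin adding an
"X-Spam-Status: Yes" header to messages it considers spam.

```shell
#!/bin/sh
# write_spamsort_rc RCFILE OUTDIR   (hypothetical helper name)
# Writes a minimal sorting procmailrc to RCFILE that tags every message
# with spamassassin and delivers it to OUTDIR/SPAM or OUTDIR/HAM.
write_spamsort_rc() {
    rcfile=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    cat > "$rcfile" <<EOF
# Deliver relative mailbox names below into this directory.
MAILDIR=$outdir

# Filter every message through spamassassin so it gets tagged.
:0fw
| spamassassin

# Tagged as spam: append to the SPAM mbox (trailing colon = use a lockfile).
:0:
* ^X-Spam-Status: Yes
SPAM

# Everything else goes to the HAM mbox.
:0:
HAM
EOF
}

# Possible usage (absolute paths, so procmail does not have to guess where
# the rcfile lives):
#   write_spamsort_rc "$PWD/spamsort.cf" "$PWD/sorted"
#   formail -s procmail "$PWD/spamsort.cf" < mbox.old
```

Since the last recipe catches everything, nothing should fall through to the
user's normal inbox during the one-time run.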

- Bob
Re: Batch processing a mbox?
On Fri, Feb 13, 2004 at 09:04:38AM -0500, you [Bob George] wrote:
> Ville Herva <vherva@viasys.com> wrote:
> > [...]
> > That's clever. That requires one to create procmail recipe
> > beforehand, though.
>
> Hmm.. You originally wrote:
> >>> Is there a way to batch process a mailbox with SpamAssassin
> >>> (sort it to ham and spam)?
>
> What did you have in mind to do the actual sorting if *not* procmail? SA will
> TAG the messages, but not actually sort them.

I already posted this, but here it is again:

mkdir spool; cd spool
perl /usr/share/doc/spamassassin-tools-2.63/tools/mboxsplit -f "%06i" < ../Mailbox
cd ..

# --exit-code makes spamassassin exit non-zero when the message is spam
for i in spool/*; do
    if spamassassin --exit-code < "$i" > temp; then
        cat temp >> HAM
    else
        cat temp >> SPAM
    fi
    rm temp
done

This is what I used for sorting.

> I think you'll want to create a "sorting" procmailrc if you're not setting up
> one site-wide, then just call it via:
>
> formail -s procmail spamsort.cf < mbox.old
>
> This would work for a one-time job.

Yeah, it works (as does the script above), but I thought maybe there was a
quicker alternative, since there are a lot of unattended mailboxes I thought
I'd clean up.

But fair enough, it can be done via procmail or with spamassassin
--exit-code.



-- v --

v@iki.fi
Re: Batch processing a mbox?
Ville Herva <vherva@viasys.com> writes:

> for i in spool/*; do
> if spamassassin --exit-code < "$i" > temp; then
> cat temp >> HAM
> else
> cat temp >> SPAM
> fi

That gets the right result, but involves starting a new spamassassin
process for each message. With spamc/spamd you could use spamc in the
loop instead, which would be better.
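
As a rough sketch of that variant, the loop can be wrapped in a function that
takes the classifier command as an argument. The function name sort_spool and
its argument order are made up for this example. It assumes a running spamd
and a spamc that supports -E (filter the message, exit 1 if it is spam); any
other command that reads a message on stdin, writes it to stdout and exits
non-zero for spam can be passed instead.

```shell
#!/bin/sh
# sort_spool SPOOLDIR HAMBOX SPAMBOX [CLASSIFIER...]   (hypothetical helper)
# Feeds each file in SPOOLDIR to the classifier and appends the classifier's
# output to HAMBOX if it exits 0, or to SPAMBOX if it exits non-zero.
sort_spool() {
    spool=$1
    ham=$2
    spam=$3
    shift 3
    # Default classifier (assumption): spamc in filter mode with exit codes.
    [ $# -gt 0 ] || set -- spamc -E
    tmp=$(mktemp) || return 1
    for msg in "$spool"/*; do
        [ -f "$msg" ] || continue        # skip non-files and an empty spool
        if "$@" < "$msg" > "$tmp"; then
            cat "$tmp" >> "$ham"
        else
            cat "$tmp" >> "$spam"
        fi
    done
    rm -f "$tmp"
}

# Possible usage:
#   sort_spool spool HAM SPAM                            # spamc -E
#   sort_spool spool HAM SPAM spamassassin --exit-code   # original behaviour
```

Note that any non-zero exit counts as spam here, so a classifier that fails
to start would send everything to the SPAM mbox. This still handles one
message per invocation, it just avoids the start-up cost of spamassassin.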

But is there a true 'batch mode' where a single spamassassin
invocation will read many messages one after the other and process
each one?

For machines with not enough computrons to do real-time mail filtering
it might work to deliver mail to an mbox file and then process it as a
batch when needed or when the machine is idle. Cranking through the
messages one after the other ought to be the fastest way to process
them, if it is possible.

--
Ed Avis <ed@membled.com>