Mailing List Archive: sa-learn muck-up

sa-learn muck-up

Feb 17, 2004, 10:36 AM

Post #1 of 3

Hi All,

I recently installed SpamAssassin. I had a spam mail folder
in mbox format, but I fed it to sa-learn without specifying
the mbox format.

I did the same with a ham folder.

How badly does this muck up the database, and what
do I need to do to rebuild it correctly?

Thanks,

Mike

Re: sa-learn muck-up

jdow at earthlink

Feb 17, 2004, 10:43 AM

Post #2 of 3

From: "Mike McMullen" <mlm@loanprocessing.net>

> Hi All,
>
> I just installed SpamAssassin recently. I had a spam
> mailfolder in mbox format but I fed it to sa-learn with
> out specifying mbox format.
>
> I also did this with a ham folder too.
>
> How badly does this muck up the database and what
> do I need to do to rebuild it correctly.

I "presume" you have individual ~/.spamassassin directories.
You will find at least two and maybe three files with "bayes"
in their names. Delete them and retrain.
{^_^}
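
Before deleting anything, it may help to see which files would be affected. The following is only a sketch in Python, assuming the per-user directory is ~/.spamassassin as presumed above; the function name find_bayes_files is made up for illustration. It lists the files and deletes nothing.

```python
import glob
import os


def find_bayes_files(directory=None):
    """List regular files with "bayes" in their names in the given directory.

    The default directory is an assumption (a per-user ~/.spamassassin);
    pass a different path if your setup keeps the database elsewhere.
    """
    if directory is None:
        directory = os.path.expanduser("~/.spamassassin")
    # Escape the directory so special characters in it are not treated as wildcards.
    pattern = os.path.join(glob.escape(directory), "*bayes*")
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


if __name__ == "__main__":
    for path in find_bayes_files():
        print(path)
```

On a typical install the names are along the lines of bayes_toks, bayes_seen and sometimes bayes_journal, but check the output rather than trusting that.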

Re: sa-learn muck-up

mkettler at evi-inc

Feb 17, 2004, 11:15 AM

Post #3 of 3

At 12:43 PM 2/17/2004, jdow wrote:
>From: "Mike McMullen" <mlm@loanprocessing.net>
>
> > Hi All,
> >
> > I just installed SpamAssassin recently. I had a spam
> > mailfolder in mbox format but I fed it to sa-learn with
> > out specifying mbox format.
> >
> > I also did this with a ham folder too.
> >
> > How badly does this muck up the database and what
> > do I need to do to rebuild it correctly.
>
>
>I "presume" you have individual ~/.spamassassin directories.
>You will find at least two and maybe three files with "bayes"
>in their names. Delete them and retrain.
>{^_^}

Wiping the entire bayes database, while effective, is a bit extreme...

What will likely happen if you forget --mbox is that sa-learn will learn
the entire file as if it were one email, and attribute all the tokens to a
single message. Not tragic, but it is inaccurate.
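
To make the "one email vs. many" difference concrete, here is a rough illustration using Python's standard library parsers. This is not SpamAssassin's own parser, and the two sample messages and all function names are invented for the example. It only shows the general effect: read as an mbox, the file holds two messages; read as a single message, everything after the first header block ends up in one body.

```python
import email
import mailbox
import os
import tempfile

# Two made-up messages in mbox format; each begins with a "From " separator line.
SAMPLE_MBOX = (
    "From alice@example.com Tue Feb 17 10:00:00 2004\n"
    "From: alice@example.com\n"
    "Subject: first\n"
    "\n"
    "Body of the first message.\n"
    "\n"
    "From bob@example.com Tue Feb 17 10:05:00 2004\n"
    "From: bob@example.com\n"
    "Subject: second\n"
    "\n"
    "Body of the second message.\n"
)


def count_as_mbox(path):
    """Count messages when the file is split on its mbox "From " lines."""
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def parse_as_single_message(path):
    """Parse the whole file as if it were one email."""
    with open(path, "r") as f:
        return email.message_from_file(f)


def demo():
    """Return (mbox message count, subject seen as one message, second message leaked into body)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        with open(path, "w", newline="\n") as f:
            f.write(SAMPLE_MBOX)
        single = parse_as_single_message(path)
        leaked = "Subject: second" in single.get_payload()
        return count_as_mbox(path), single["Subject"], leaked


if __name__ == "__main__":
    print(demo())
```

In the single-message reading, the second message's headers are just body text of the first, which is roughly the kind of mis-attribution described above.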

If you still have the mbox files you fed to sa-learn, you can just feed
them back to sa-learn with --forget. As long as you feed them in with the
same input-type parameters (that is, again without --mbox), SA should parse
them the same way and correctly forget all the tokens that it
mis-attributed to a single message.

You can then re-train the mbox files using the --mbox parameter and SA will
correctly learn them as multiple messages.
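
The forget-then-retrain sequence could be scripted roughly like this. It is a sketch only: it assumes the original runs were plain sa-learn invocations with no --mbox, and the file names and the function names build_recovery_commands and run_all are placeholders. By default it just prints the commands so you can check them before running anything.

```python
import subprocess


def build_recovery_commands(spam_mbox, ham_mbox):
    """Return the sa-learn command lines for the forget-then-retrain sequence."""
    return [
        # Step 1: forget, parsed the same way as the mistaken run (no --mbox).
        ["sa-learn", "--forget", spam_mbox],
        ["sa-learn", "--forget", ham_mbox],
        # Step 2: learn again, this time telling sa-learn the input is mbox.
        ["sa-learn", "--spam", "--mbox", spam_mbox],
        ["sa-learn", "--ham", "--mbox", ham_mbox],
    ]


def run_all(commands, dry_run=True):
    """Print each command; actually run it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder file names; substitute the mbox files you originally fed in.
    run_all(build_recovery_commands("spam.mbox", "ham.mbox"))
```

If the original runs used other options, the --forget commands should carry those same options so the files are parsed identically.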