Mailing List Archive

[Spamassassin Wiki] Update of "CorpusCleaning" by JustinMason
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/CorpusCleaning

The comment on the change is:
a writeup of how to clean a corpus quickly

New page:
= Cleaning a corpus of FPs and FNs =

Here's how to clean a corpus of FalsePositives and FalseNegatives.

First, run a mass-check, which produces a 'spam.log' and a 'ham.log' file. Then run these commands to list the 200 lowest-scoring spams, create an mbox file containing just those messages, and open that mbox in the "mutt" mail client:

{{{
sort -n -k 2,2 spam.log | head -n 200 > id.low
./mboxget < id.low > mbox
mutt -f mbox
}}}

(You could use another mail client if you prefer; it's just a standard UNIX-format mbox file.)

Now delete every message that '''really is''' spam, keeping only the false positives, bounces, virus blowback, and other kinds of undesirable messages. Quit and save the mbox; it now contains only the 'bad' messages.

You can then take that mbox file, grep out the original MassCheck message id strings, and remove those lines from the 'spam.log' file:

{{{
grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fps
./remove-ids-from-mclog id.fps < spam.log > spam.log.new
mv spam.log.new spam.log
}}}
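
The internals of 'remove-ids-from-mclog' aren't shown on this page. As a rough sketch of the idea only, and not the real script, the filtering step might look like the following. It assumes each log line has the layout "flag score id rules" and that 'id.fps' holds one id per line; both are assumptions, and the sample data is made up.

```shell
# Sketch only: a stand-in for ./remove-ids-from-mclog, not the real script.
# Assumed log layout (not confirmed by this page): "<flag> <score> <id> <rules>".
workdir=$(mktemp -d)

cat > "$workdir/spam.log" <<'EOF'
Y 12 /corpus/spam/1 RULE_A,RULE_B
. 1 /corpus/spam/2 RULE_C
Y 7 /corpus/spam/3 RULE_A
EOF

# Hypothetical id list: one message id per line.
printf '%s\n' '/corpus/spam/2' > "$workdir/id.fps"

# Load the ids from the first file, then print only those log lines
# whose third field (the assumed id column) is not in the list.
awk 'FILENAME == ARGV[1] { drop[$0] = 1; next } !($3 in drop)' \
    "$workdir/id.fps" "$workdir/spam.log" > "$workdir/spam.log.new"
```

Matching on the whole id field, rather than on a substring of the line, avoids dropping a message whose path merely contains a listed id.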

You can also remove the offending files or messages from the source mailboxes directly. How you do this depends on the format you use to store messages (Maildir, mbox, and so on). Maildirs are easiest, since you can just delete the files named in the 'id.fps' file.
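
For the Maildir case, a minimal sketch of the deletion step could look like this. It assumes every line of 'id.fps' is the full path of one message file, which may not hold for your setup; the directory and file names are hypothetical sample data.

```shell
# Sketch only: assumes each line of id.fps is the path of one Maildir message file.
workdir=$(mktemp -d)
mkdir -p "$workdir/Maildir/cur"
touch "$workdir/Maildir/cur/msg1" "$workdir/Maildir/cur/msg2"
printf '%s\n' "$workdir/Maildir/cur/msg1" > "$workdir/id.fps"

# Read one path per line and delete it if it is a regular file.
# "--" stops option parsing, so a name starting with "-" is still safe.
while IFS= read -r msg; do
    if [ -f "$msg" ]; then
        rm -- "$msg"
    fi
done < "$workdir/id.fps"
```

It is worth running this against a copy of the corpus first, since the deletion cannot be undone.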

The same operation for FalseNegatives is similar, but reverses a few things. Here are the commands:

{{{
sort -rn -k 2,2 ham.log | head -n 200 > id.hi
./mboxget < id.hi > mbox
mutt -f mbox
}}}

Delete the messages that are good, usable ham, leaving only spams, virus blowback, bounces, or whatever other undesirable messages you want to get rid of. Quit and save.

{{{
grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fns
./remove-ids-from-mclog id.fns < ham.log > ham.log.new
mv ham.log.new ham.log
}}}

Repeat, if necessary...
[Spamassassin Wiki] Update of "CorpusCleaning" by JustinMason
The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/CorpusCleaning

The comment on the change is:
also talk about corrupt messages in a corpus

------------------------------------------------------------------------------
- = Cleaning a corpus of FPs and FNs =
+ = Cleaning a Mail Corpus =

- Here's how to clean a corpus of FalsePositives and FalseNegatives.
+ Here are a few methods for dealing with common forms of corpus pollution: messages in a mail corpus that aren't suitable for use in a MassCheck.

+ == False Positives and False Negatives ==
+
- Firstly, do a mass-check. You will wind up with a 'spam.log' and 'ham.log' file. Run these commands to get a list of the 200 lowest-scoring spams, create a mbox file with just those messages, then open that mbox up in the "mutt" mail client:
+ To clean a corpus of FalsePositives and FalseNegatives -- first, do a mass-check. You will wind up with a 'spam.log' and 'ham.log' file. Run these commands to get a list of the 200 lowest-scoring spams, create a mbox file with just those messages, then open that mbox up in the "mutt" mail client:

{{{
cd /path/to/your/spamassassin/masses
@@ -46, +48 @@


Repeat, if necessary...

+ == Corrupt Messages ==
+
+ Occasionally, these will crop up -- some MUAs have a tendency to mess up mail messages or folders, making them unsuitable for use with MassCheck. SpamAssassin includes a few rules that can help identify corrupt messages.
+
+ * MISSING_HEADERS: if a message doesn't have all the normal headers, such as From, To, and Subject, this will fire. Be sure to hand-verify any ham and spam messages that hit this to ensure that they're formatted correctly (in RFC-2822 format).
+ * MISSING_HB_SEP: This is another danger sign, typically indicating that a header line has had a newline inserted incorrectly somehow.
+
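
One way to pull out the messages that hit either of these rules is to scan the mass-check log for them. The sketch below assumes log lines of the form "flag score id comma-separated-rules"; that layout and the sample data are assumptions, not taken from this page.

```shell
# Sketch only: assumes "<flag> <score> <id> <comma-separated rules>" log lines.
workdir=$(mktemp -d)
cat > "$workdir/ham.log" <<'EOF'
. 0 /corpus/ham/1 RULE_A
. 2 /corpus/ham/2 MISSING_HEADERS,RULE_B
. 1 /corpus/ham/3 RULE_C,MISSING_HB_SEP
EOF

# Split the rules field on commas and compare each name exactly, so a
# longer rule name that merely contains one of these is not matched.
awk '{
    n = split($4, rules, ",")
    for (i = 1; i <= n; i++) {
        if (rules[i] == "MISSING_HEADERS" || rules[i] == "MISSING_HB_SEP") {
            print $3
            break
        }
    }
}' "$workdir/ham.log" > "$workdir/id.corrupt"
```

The resulting list is only a set of candidates; each message still needs to be checked by hand.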
[Spamassassin Wiki] Update of "CorpusCleaning" by JustinMason
The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/CorpusCleaning

The comment on the change is:
other good rules to spot FPs with

------------------------------------------------------------------------------

Repeat, if necessary...

+ Rules that are useful for spotting FPs:
+
+ * BAYES_99: once a mass-check completes, it's worth grepping the ham.log for BAYES_99 and checking what mails it hits.
+ * any of the other top-listed rules in the HitFrequencies report, especially network tests such as the SURBL rules
+
== Corrupt Messages ==

Occasionally, these will crop up -- some MUAs have a tendency to mess up mail messages or folders, making them unsuitable for use with MassCheck. SpamAssassin includes a few rules that can help identify corrupt messages.
[Spamassassin Wiki] Update of "CorpusCleaning" by JustinMason
The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/CorpusCleaning

The comment on the change is:
duh! wrong way round!

------------------------------------------------------------------------------

== False Positives and False Negatives ==

- To clean a corpus of FalsePositives and FalseNegatives -- first, do a mass-check. You will wind up with a 'spam.log' and 'ham.log' file. Run these commands to get a list of the 200 lowest-scoring spams, create a mbox file with just those messages, then open that mbox up in the "mutt" mail client:
+ To clean a spam corpus of FalseNegatives -- first, do a mass-check. You will wind up with a 'spam.log' and 'ham.log' file. Run these commands to get a list of the 200 lowest-scoring spams, create a mbox file with just those messages, then open that mbox up in the "mutt" mail client:

{{{
cd /path/to/your/spamassassin/masses
@@ -29, +29 @@


You can also remove the offending files, or messages from the source mailboxes, directly. However, this depends on what format you use to store messages; Maildirs, mboxes, etc. etc. (Maildirs are easiest, since you can just delete the files named in the 'id.fps' file.)

- Doing the same operation for FalseNegatives is similar, but reverses a few things... here's the commands to do that:
+ Doing the same operation to clean the ham corpus of FalsePositives is similar, but reverses a few things... here's the commands to do that:

{{{
cd /path/to/your/spamassassin/masses
[Spamassassin Wiki] Update of "CorpusCleaning" by JustinMason
The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/CorpusCleaning

The comment on the change is:
oops, also the wrong way around

------------------------------------------------------------------------------
You can then take that mbox file, grep out the original MassCheck message id strings, and remove those lines from the 'spam.log' file:

{{{
- grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fps
+ grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fns
- ./remove-ids-from-mclog id.fps < spam.log > spam.log.new
+ ./remove-ids-from-mclog id.fns < spam.log > spam.log.new
mv spam.log.new spam.log
}}}

- You can also remove the offending files, or messages from the source mailboxes, directly. However, this depends on what format you use to store messages; Maildirs, mboxes, etc. etc. (Maildirs are easiest, since you can just delete the files named in the 'id.fps' file.)
+ You can also remove the offending files, or messages from the source mailboxes, directly. However, this depends on what format you use to store messages; Maildirs, mboxes, etc. etc. (Maildirs are easiest, since you can just delete the files named in the 'id.fns' file.)

Doing the same operation to clean the ham corpus of FalsePositives is similar, but reverses a few things... here's the commands to do that:

@@ -41, +41 @@

Delete the messages that are good, usable ham, leaving only spams, virus blowback, bounces, or whatever other undesirable messages you want to get rid of. Quit and save.

{{{
- grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fns
+ grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fps
- ./remove-ids-from-mclog id.fns < ham.log > ham.log.new
+ ./remove-ids-from-mclog id.fps < ham.log > ham.log.new
mv ham.log.new ham.log
}}}
[Spamassassin Wiki] Update of "CorpusCleaning" by JustinMason
The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/CorpusCleaning

The comment on the change is:
duh, it was right the first time. time to stop working I think

------------------------------------------------------------------------------

== False Positives and False Negatives ==

- To clean a spam corpus of FalseNegatives -- first, do a mass-check. You will wind up with a 'spam.log' and 'ham.log' file. Run these commands to get a list of the 200 lowest-scoring spams, create a mbox file with just those messages, then open that mbox up in the "mutt" mail client:
+ To clean a spam corpus of FalsePositives -- first, do a mass-check. You will wind up with a 'spam.log' and 'ham.log' file. Run these commands to get a list of the 200 lowest-scoring spams, create a mbox file with just those messages, then open that mbox up in the "mutt" mail client:

{{{
cd /path/to/your/spamassassin/masses
@@ -22, +22 @@

You can then take that mbox file, grep out the original MassCheck message id strings, and remove those lines from the 'spam.log' file:

{{{
- grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fns
+ grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fps
- ./remove-ids-from-mclog id.fns < spam.log > spam.log.new
+ ./remove-ids-from-mclog id.fps < spam.log > spam.log.new
mv spam.log.new spam.log
}}}

- You can also remove the offending files, or messages from the source mailboxes, directly. However, this depends on what format you use to store messages; Maildirs, mboxes, etc. etc. (Maildirs are easiest, since you can just delete the files named in the 'id.fns' file.)
+ You can also remove the offending files, or messages from the source mailboxes, directly. However, this depends on what format you use to store messages; Maildirs, mboxes, etc. etc. (Maildirs are easiest, since you can just delete the files named in the 'id.fps' file.)

- Doing the same operation to clean the ham corpus of FalsePositives is similar, but reverses a few things... here's the commands to do that:
+ Doing the same operation to clean the ham corpus of FalseNegatives is similar, but reverses a few things... here's the commands to do that:

{{{
cd /path/to/your/spamassassin/masses
@@ -38, +38 @@

mutt -f mbox
}}}

- Delete the messages that are good, usable ham, leaving only spams, virus blowback, bounces, or whatever other undesirable messages you want to get rid of. Quit and save.
+ Delete the messages that are good, usable ham, leaving only spams, hams that include bits of spam, virus blowback, bounces, or whatever other undesirable messages you want to get rid of. Quit and save.

{{{
- grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fps
+ grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fns
- ./remove-ids-from-mclog id.fps < ham.log > ham.log.new
+ ./remove-ids-from-mclog id.fns < ham.log > ham.log.new
mv ham.log.new ham.log
}}}

Repeat, if necessary...

- Rules that are useful for spotting FPs:
+ Rules that are useful for spotting FNs (or spam discussions!) in the ham corpus:

* BAYES_99: once a mass-check completes, it's worth grepping the ham.log for BAYES_99 and checking what mails it hits.
* any of the other top-listed rules in the HitFrequencies report, especially network tests such as the SURBL rules
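
As a small sketch of the BAYES_99 check, the following greps a sample 'ham.log' and lists the hits with the highest scores first. The log layout ("flag score id rules") and the sample lines are assumptions made for illustration.

```shell
# Sketch only: sample data in an assumed "<flag> <score> <id> <rules>" layout.
workdir=$(mktemp -d)
cat > "$workdir/ham.log" <<'EOF'
. 3 /corpus/ham/1 BAYES_99,RULE_A
. 0 /corpus/ham/2 RULE_B
. 9 /corpus/ham/3 BAYES_99
EOF

# Select the ham log lines that hit BAYES_99, then sort them by the
# score in field 2, highest first, so the most suspicious come first.
grep 'BAYES_99' "$workdir/ham.log" | sort -rn -k 2,2 > "$workdir/id.bayes"
```

Because the output keeps the full log lines, it should be usable in the same way as the 'id.hi' file in the steps above, though that has not been verified here.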