I've been building a spam corpus, and I find that 'spamassassin -d' is too
slow for stripping SpamAssassin markup from a large corpus, and that
mass-check does this rather indirectly, and also slowly. With the help of
Ruud H.G. van Tol on the procmail list, I came up with a procmail recipe
that does the job (below).
The sed script works as follows: (1) the leading From_ line is put into the
hold buffer; (2) the attachment part that contains the original message is
appended to the hold buffer, minus the part's initial Content-* headers and
the trailing termination boundary; (3) all other lines are deleted, except
that the last line is appended to the hold buffer if it is non-blank (the
final line in a well-formed message should be blank); (4) at the end, the
hold buffer is swapped back into the pattern buffer and output by sed. The
script assumes that the first attachment part is the SA report and the
second part is the original message. This assumption is safe for SA, but
not in general.
# ---------------- remove_sa_markup.rc -----------------
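# Send the resulting message to stdout (see the formail command below).
DEFAULT=|
SPACE=" "
# TAB must hold a literal tab character, not a space.
TAB="	"
WS="$SPACE$TAB"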
:0 B
* $ H ?? ^Content-Type:[$WS]+multipart/mixed;
* $ ^Content-Type:[$WS]+message/rfc822;[$WS]+x-spam-type=original
* $ ^Content-Description:[$WS]+original message before SpamAssassin
* $ ^Content-Disposition:[$WS]+attachment
{
# SA markup is present, pick up the boundary
:0
* $ ^Content-Type:[$WS]+multipart/mixed;[$WS]+boundary=\"\/[^\"]*
{ BOUNDARY = "--$MATCH" }
# sed script to pull out the original message (yikes)
# Technically, '.'s inside boundary should be escaped,
# but we assume that the rest of the chars. in the boundary
# string are sufficiently unique to ensure a correct
# match.
SED_GET_MSG_PART='
1{h;d}
2,/^'"$BOUNDARY"'$/d
/^'"$BOUNDARY"'$/,/^'"$BOUNDARY"'--$/{
/^'"$BOUNDARY"'$/,/^$/d
/^'"$BOUNDARY"'--$/d
H;d}
$!d
/^$/!H
x'
# Run it through sed to remove the markup.
:0 hbfw
| sed -e "$SED_GET_MSG_PART"
}
This can be run against an mbox as follows:
formail -s procmail $HOME/scripts/remove_sa_markup.rc < spam.mbox \
    > spam_no_markup.mbox
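To try the sed program on its own, outside procmail, something like the
following sketch may help. It assumes a POSIX shell and GNU sed; the
boundary string and the sample message are made up for illustration, and in
the real recipe the boundary comes from the message's Content-Type header.

```shell
#!/bin/sh
# Hypothetical boundary; the procmail recipe extracts the real one
# from the Content-Type header and prefixes it with "--".
BOUNDARY='--XBOUNDARY42'

# Same sed program as in the recipe, built with the same quoting.
SED_GET_MSG_PART='
1{h;d}
2,/^'"$BOUNDARY"'$/d
/^'"$BOUNDARY"'$/,/^'"$BOUNDARY"'--$/{
/^'"$BOUNDARY"'$/,/^$/d
/^'"$BOUNDARY"'--$/d
H;d}
$!d
/^$/!H
x'

# Made-up marked-up message. The blank line after the closing boundary
# matters: sed only prints the hold buffer if the last input line
# reaches the final "x" command.
clean=$(sed -e "$SED_GET_MSG_PART" <<'EOF'
From spammer@example.com Mon Jan  5 10:00:00 2004
Subject: *****SPAM***** Buy now
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XBOUNDARY42"

This is a multi-part message in MIME format.

--XBOUNDARY42
Content-Type: text/plain
Content-Disposition: inline

Spam detection software has identified this mail as spam.

--XBOUNDARY42
Content-Type: message/rfc822; x-spam-type=original
Content-Description: original message before SpamAssassin
Content-Disposition: attachment

From: spammer@example.com
Subject: Buy now

Original body line.

--XBOUNDARY42--

EOF
)

# Print the From_ line followed by the original headers and body.
printf '%s\n' "$clean"
```

With this sample, only the From_ line and the contents of the second
attachment part should survive; the outer headers, the report part and both
boundaries are dropped.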
This recipe could also be embedded in a larger procmail script, if one
wanted to keep a clean copy of the spam as well as a marked-up copy. In
that case, 'DEFAULT=|' must be removed, the recipe must process a copy of
the mail (with the 'c' flag), and the copy without markup must be delivered
somewhere.