Mailing List Archive: procmail script to remove SA markup

Feb 22, 2004, 9:02 AM

I've been working on building a spam corpus, and find that 'spamassassin -d'
is too slow for a large corpus, and that mass-check removes the markup only
indirectly, and also slowly. With the help of Ruud H.G. van Tol on the
procmail list, I came up with a procmail recipe that does the job (below).

The sed script works as follows: (1) it puts the leading From_ line into the
hold buffer; (2) it appends the part of the attachment that holds the
original message to the hold buffer, dropping the attachment's initial
Content-* headers and the terminating boundary; (3) it deletes all other
lines, except that the last line is appended to the hold buffer if it is
non-blank (the final line in a well-formed message should be blank); (4) at
the end, it swaps the hold buffer back into the pattern buffer, which sed
then outputs. The script assumes that the first attachment part is the SA
report and the second part is the original message. This assumption is safe
for SA, but not in general.
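
The four steps above can be tried outside procmail with a small shell
sketch. The boundary string, addresses and message text here are made up
for illustration, and a ';' has been added before each '}' on the
assumption that this is more portable across sed implementations;
otherwise the sed commands are the ones from the recipe below.

```shell
#!/bin/sh
# Hypothetical boundary; in the recipe it comes from the Content-Type header.
BOUNDARY='--XYZ123'

# Same sed commands as the recipe (with ';' before '}' added).
SED_GET_MSG_PART='
1{h;d;}
2,/^'"$BOUNDARY"'$/d
/^'"$BOUNDARY"'$/,/^'"$BOUNDARY"'--$/{
/^'"$BOUNDARY"'$/,/^$/d
/^'"$BOUNDARY"'--$/d
H;d;}
$!d
/^$/!H
x'

# A made-up SA-wrapped message: part 1 is the report, part 2 the original.
# Note the blank line after the closing boundary; the script relies on
# there being a final line after it.
sample_message() {
cat <<'EOF'
From spammer@example.com Sun Feb 22 08:00:00 2004
Subject: [SPAM] buy now
Content-Type: multipart/mixed; boundary="XYZ123"

This is a multi-part message in MIME format.

--XYZ123
Content-Type: text/plain

Spam detection software has flagged this message.

--XYZ123
Content-Type: message/rfc822; x-spam-type=original
Content-Description: original message before SpamAssassin
Content-Disposition: attachment

Subject: buy now
From: spammer@example.com

Original body text.

--XYZ123--

EOF
}

result=$(sample_message | sed -e "$SED_GET_MSG_PART")
printf '%s\n' "$result"
```

What should come out is the From_ line followed by the original headers
and body, with the SA report and the MIME wrapper gone.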

# ---------------- remove_sa_markup.rc -----------------
DEFAULT=|
SPACE=" "
# TAB must contain a literal tab character (it can look like a space)
TAB="	"
WS="$SPACE$TAB"

:0 B
* $ H ?? ^Content-Type:[$WS]+multipart/mixed;
* $ ^Content-Type:[$WS]+message/rfc822;[$WS]+x-spam-type=original
* $ ^Content-Description:[$WS]+original message before SpamAssassin
* $ ^Content-Disposition:[$WS]+attachment
{
# SA markup is present, pick up the boundary
:0
* $ ^Content-Type:[$WS]+multipart/mixed;[$WS]+boundary=\"\/[^\"]*
{ BOUNDARY = "--$MATCH" }

# sed script to pull out the original message (yikes)
# Technically, '.'s inside boundary should be escaped,
# but we assume that the rest of the chars. in the boundary
# string are sufficiently unique to ensure a correct
# match.
SED_GET_MSG_PART='
1{h;d}
2,/^'"$BOUNDARY"'$/d
/^'"$BOUNDARY"'$/,/^'"$BOUNDARY"'--$/{
/^'"$BOUNDARY"'$/,/^$/d
/^'"$BOUNDARY"'--$/d
H;d}
$!d
/^$/!H
x'

# Run it through sed to remove the markup.
:0 hbfw
| sed -e "$SED_GET_MSG_PART"
}
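
On the comment about '.'s inside the boundary: a minimal shell sketch of
one way the boundary could be escaped before it is spliced into the sed
patterns. The boundary value and variable names are hypothetical, and this
is not part of the recipe above; it only illustrates the escaping step.

```shell
#!/bin/sh
# Hypothetical boundary value containing a regex metacharacter ('.').
BOUNDARY='--=_4038.A1B2+C3'

# Put a backslash in front of each basic-regex metacharacter, and in front
# of '/', so the string matches only itself inside /^...$/.
ESCAPED=$(printf '%s\n' "$BOUNDARY" | sed -e 's|[][\.*^$/]|\\&|g')

# A line that differs from the boundary only where the '.' is.
LOOKALIKE='--=_4038XA1B2+C3'

# Unescaped, the '.' matches any character, so the lookalike line matches.
unescaped_hits=$(printf '%s\n' "$LOOKALIKE" | grep -c -e "^$BOUNDARY\$")
# Escaped, only the real boundary matches.
escaped_hits=$(printf '%s\n' "$LOOKALIKE" | grep -c -e "^$ESCAPED\$")
real_hits=$(printf '%s\n' "$BOUNDARY" | grep -c -e "^$ESCAPED\$")

printf 'escaped boundary: %s\n' "$ESCAPED"
printf 'lookalike hits: unescaped=%s escaped=%s\n' "$unescaped_hits" "$escaped_hits"
```

In practice a collision like this is unlikely, which is why the recipe
leaves the boundary unescaped.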

This can be run against an mbox as follows:
formail -s procmail $HOME/scripts/remove_sa_markup.rc < spam.mbox \
    > spam_no_markup.mbox

This script could also be buried inside a larger procmail script, if one
wanted to keep a clean copy of the spam as well as a marked-up copy. In that
case, 'DEFAULT=|' would have to be removed from the script above, a copy of
the mail would have to be processed (with the 'c' flag), and the copy
without markup would have to be delivered somewhere.
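
A rough sketch of that arrangement, untested and under stated assumptions:
the rc file has had its 'DEFAULT=|' line removed, and 'spam_clean' is a
placeholder folder name. It relies on procmail's documented behaviour that
the 'c' flag on a nesting block clones the process, so that the clone
works on its own copy of the mail.

```procmail
# Hypothetical fragment of a larger procmailrc.
:0 c
{
    # The clone created by 'c' strips the markup from its own copy.
    INCLUDERC=$HOME/scripts/remove_sa_markup.rc

    # Deliver the stripped copy; the trailing ':' asks for a local lockfile.
    :0:
    spam_clean
}

# The parent continues here with the marked-up message untouched.
```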

RE: procmail script to remove SA markup

gary at intrepid

Feb 22, 2004, 9:15 AM

> From: Gary Funck
> Sent: Sunday, February 22, 2004 8:02 AM
[...]
>
> This can be run against an mbox as follows:
> formail -s procmail $HOME/scripts/remove_sa_markup.rc < spam.mbox \
> > spam_no_markup.mbox

FYI: the script above processes about 100 messages per second on a 2.4 GHz
P4 (RH 9 Linux).

RE: procmail script to remove SA markup

gary at intrepid

Feb 22, 2004, 10:02 AM

In subsequent discussions on the procmail list, it was noted that the SA
headers should still be removed even if the SA markup attachment isn't
present. Further, I suppose the script should check for Subject: [SPAM]
tags, but that is a bit problematic because the nature of those tags is
site dependent and might even vary over time. Here's the updated version.

DEFAULT=|
SHELL=/bin/sh
#VERBOSE=yes
NL="
"

SPACE=" "
# TAB must contain a literal tab character (it can look like a space)
TAB="	"
WS="$SPACE$TAB"

:0 B
* $ H ?? ^Content-Type:[$WS]+multipart/mixed;
* $ ^Content-Type:[$WS]+message/rfc822;[$WS]+x-spam-type=original
* $ ^Content-Description:[$WS]+original message before SpamAssassin
* $ ^Content-Disposition:[$WS]+attachment
{
# SA markup is present, pick up the boundary
:0
* $ ^Content-Type:[$WS]+multipart/mixed;[$WS]+boundary=\"\/[^\"]*
{ BOUNDARY = "--$MATCH" }

# sed script to pull out the original message (yikes)
# technically, '.'s inside boundary should be escaped,
# but we hope the rest of the chars. in the boundary
# string are sufficiently unique to ensure a correct
# match.
SED_GET_MSG_PART='
1{h;d}
2,/^'"$BOUNDARY"'$/d
/^'"$BOUNDARY"'$/,/^'"$BOUNDARY"'--$/{
/^'"$BOUNDARY"'$/,/^$/d
/^'"$BOUNDARY"'--$/d
H;d}
$!d
/^$/!H
x'

# Run it through sed to remove the markup.
:0 hbfw
| sed -e "$SED_GET_MSG_PART"
}

# if no mark-up attachment found, just remove the SA headers
# Maybe we need to remove the subject tag here as well?
:0 E fw
| formail -IX-Spam