Mailing List Archive

possibly a dumb comment, apologies if I'm being a n00b
So, I've been looking over the types of messages that SA is missing on my
mail stream... It seems like most of them still contain trigger words
that would cause high scores; they're just slightly masked. Has anyone
talked about applying some kind of fuzzy-matching technique? Taking the
trigger words and generating a whole large set of patterns that match,
based on rules such as:

'a' => /a|A|@/
'x' => /x|X|></

You might even be able to use a large corpus of spam to automatically
derive these rules. (A corpus of parsed-out and "translated" tokens would
work better, obviously.)
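
Just to make the idea concrete, here's a rough sketch -- purely
illustrative, the substitution table is made up and none of this is
existing SA code -- of expanding a trigger word through such rules:

use strict;
use warnings;

# per-character substitution rules; each entry is a regex fragment
my %subs = (
    'a' => '[a@4]',
    'i' => '[i1l!|]',
    'o' => '(?:o|0|\(\))',
    'x' => '(?:x|><)',
);

# build one obfuscation-tolerant pattern out of a trigger word
sub fuzzy_pattern {
    my ($word) = @_;
    return join '',
        map { exists $subs{$_} ? $subs{$_} : quotemeta($_) }
        split //, lc $word;
}

my $re = fuzzy_pattern('viagra');
print "hit\n" if 'v!@gra' =~ /\b$re\b/i;   # prints "hit"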

You could also introduce some Hamming Distance effects to the match, so
that, say hamming_match( /hello/ , 1 ) would match "helldo", "helo", and
"hillo". And then there's the possibility of doing phonetic matching,
like many spellcheckers.
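
For what it's worth, a distance-1 check is only a handful of lines of
Perl. (Strictly speaking, examples like "helo" and "helldo" are edit
distance rather than Hamming distance, since the lengths differ.) A
throwaway sketch, not tied to SA in any way:

use strict;
use warnings;

# returns true if $cand is within one insertion, deletion, or
# substitution of $target -- so "helo", "helldo", and "hillo" all
# count as matches for "hello"
sub within_one_edit {
    my ($target, $cand) = @_;
    return 1 if $target eq $cand;
    my ($len_t, $len_c) = (length $target, length $cand);
    return 0 if abs($len_t - $len_c) > 1;
    my ($i, $j, $edits) = (0, 0, 0);
    while ($i < $len_t && $j < $len_c) {
        if (substr($target, $i, 1) eq substr($cand, $j, 1)) {
            $i++; $j++; next;
        }
        return 0 if ++$edits > 1;
        if    ($len_t > $len_c) { $i++ }          # char dropped from candidate
        elsif ($len_t < $len_c) { $j++ }          # char inserted in candidate
        else                    { $i++; $j++ }    # substitution
    }
    return 1;
}

for my $w (qw(helldo helo hillo yellow)) {
    print "$w: ", (within_one_edit('hello', $w) ? "match" : "no match"), "\n";
}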

Using any/all of this stuff would be pretty processor intensive --
probably much more practical for ISPs than for users -- but it seems like
it'd kill off almost all of the new crop of SA-evading spam. Maybe
somebody could lure Larry Wall into building this kind of fuzzy-match
technology directly into the next major version of Perl? *g*

Just thought I'd throw that out there. Aside from that, I'll probably
lurk for a while; if I end up feeling out of my depth (which is possible
-- my actual day job is as a linguist, and most of my coding skills, such
as they are, are aimed at that) I'll unsub.

Thanks,
Auros

------------------------------------------------------------------------
R Michael Harman / Auros Symtheos
rmharman@auros.org ............ http://www.auros.org/

Linguist and Eclectic Engineer, Lexicus, Motorola
rmharman@motorola.com ......... http://www.lexicus.mot.com/

Senior Reviews Editor, Strange Horizons Speculative Fiction Weekly
reviews@strangehorizons.com ... http://www.strangehorizons.com/
Re: possibly a dumb comment, apologies if I'm being a n00b [ In reply to ]
At 12:28 PM 3/10/2004, R Michael Harman wrote:
>Taking the trigger words, and generating a whole large set of patterns
>that match,
>based on rules such as:
>
>'a' => /a|A|@)/
>'x' => /x|X|></

Well, there's no reason to include a and A, since it's more efficient to
declare the rule case-insensitive instead, but that aside, these kinds of
obfuscation matches work well. I've been working with this stuff quite a
bit in antidrug.cf.

There's also a wide variety of "gapping" techniques used by spammers.

If for example, you look at my antidrug ruleset, you'll see my current
regex for the v-word is:

body __DRUGS_MALEDYSFUNCTION1 /(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[i1|l][_\W]{0,3}[a4@][_\W]{0,3}g[_\W]{0,3}r[_\W]{0,3}[a4@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)/i

This catches most of the common substitutions for A, V, and I. It allows for
a number of gapping characters, consisting of non-word and _ characters,
between letters (as few as 0 and as many as 3). It also allows an optional
x that some spammers tack onto the end of the word.
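
If you want to poke at that pattern outside of SA, here's a quick
scratch test (this snippet is mine and isn't part of antidrug.cf):

my $re = qr/(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[i1|l][_\W]{0,3}[a4@][_\W]{0,3}g[_\W]{0,3}r[_\W]{0,3}[a4@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)/i;

# a legit spelling, two obfuscated/gapped forms, and a non-match
for my $s ('viagra', 'v.1.a.g.r.a', 'V _ i @ g r @ x', 'niagara') {
    print "$s => ", ($s =~ $re ? "match" : "no match"), "\n";
}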

However, there's a little bit of caution needed before comprehensively
applying this technique everywhere. The large number of match combinations
can lend itself to FPs. I've been slowly propagating these features
throughout the antidrug ruleset.

>You could also introduce some Hamming Distance effects to the match, so
>that, say hamming_match( /hello/ , 1 ) would match "helldo", "helo", and
>"hillo". And then there's the possibility of doing phonetic matching,
>like many spellcheckers.

There's been a fair amount of talk about doing things like this in the
past. I've never seen anyone propose hamming distances or phonetic methods
specifically, but many similar ideas have come by.

One thing that needs to be kept in mind is that SA is already quite
effective against mis-spelling tactics... mis-spellings are VERY easy
targets for any bayes system; you just have to be up-to-speed on your training.

Some similar proposals from the past:

Pure spell checking has been shot down after testing (not the same as what
you proposed, but it is related):
http://bugzilla.spamassassin.org/show_bug.cgi?id=2868

Soundex matching has been proposed, but it's too unreliable, too slow, and
offers little advantage that bayes doesn't already provide:
http://bugzilla.spamassassin.org/show_bug.cgi?id=1407
Re: possibly a dumb comment, apologies if I'm being a n00b [ In reply to ]
On Wed, 10 Mar 2004, Matt Kettler wrote:
> One thing that needs to be kept in mind is that SA is already quite
> effective against mis-spelling tactics... mis-spellings are VERY easy
> targets for any bayes system, you just have to be up-to-speed on your training.

Yeah, that sorta makes sense... But fuzzy matching would eliminate the
need for keeping up on the Bayesian training, which apparently would make
a difference at some ISPs.

I've been trying to persuade my ISP to start using the Bayes stuff at the
site level, but they've been resistant to doing the extra work; or
possibly they're just totally clueless about the existence of the Bayesian
stuff -- when I emailed asking whether it was possible for individual
users to get access to sa-train, they suggested I add "score TEST_NAME"
lines to my user_prefs file. SA runs on the mailserver, which doesn't
have shell access; to get mail onto the shell server, users run fetchmail.
But SA is not provided on the shell server, so I don't think I'd be able
to do Bayesian stuff for just my account. Even if I could, having
individual users on the shell server all running their own Bayesian-tuned
SAs might be a processor issue...

I suggested that the admins send a note to all users asking for
contributions of spam and ham, or providing standardized folder names
where people could keep stuff for collation and use in training SA
sitewide, but they didn't seem to think it was worth it.

Maybe I just need a better ISP. (Though finding one that'll provide a
*NIX shell environment _at all_ is pretty tough these days. *sigh*)

Thanks for the thoughtful reply,
Auros

------------------------------------------------------------------------
R Michael Harman / Auros Symtheos
rmharman@auros.org ............ http://www.auros.org/

Linguist and Eclectic Engineer, Lexicus, Motorola
rmharman@motorola.com ......... http://www.lexicus.mot.com/

Senior Reviews Editor, Strange Horizons Speculative Fiction Weekly
reviews@strangehorizons.com ... http://www.strangehorizons.com/
RE: possibly a dumb comment, apologies if I'm being a n00b [ In reply to ]
I've worked on one and it's working great for me. The only problem is
that I had to modify the EvalTests.pm file that came in SA, so I'll have
to add it again when I upgrade. The rule only applies to the subject.
I haven't officially tested it, but it's running in my production
environment due to its success. I'd appreciate any comments on this,
besides me not officially testing it first 8+), and would REALLY
appreciate it if someone could test it.

This rule is very hard to explain, so please bear with me... once you
get it working (and understand it), it works REALLY well!

The update to EvalTests.pm allows me to catch all cr@p l|k3 th1s in the
subject. I would run it on the body, but I'm afraid it would eat up too
many system resources. It knows some easily translated characters that
are often used to "hide" words (!->I, 3->e, l->I, |->I or l, @->a, etc.).

My change requires a control file (a list of words that are commonly
altered, like "x@n@x") with 2 columns, <tab> delimited. It
includes the usual "spammy" stuff. There's one caveat to the control
file that's important to remember when adding new words: when the word
includes an I or an L, it has to have a | (pipe) in its place in the 1st
column, so the word "like" would be "||ke" in my file. This is because
l (lower L) is often substituted for I, and | can be used for
either I or L - I like to catch these too.

The control file (example entries; the two columns are <tab> separated):

v|agra
free free
||m|ted
d|p|oma
ema|| email
d|et|ng dieting
debt debt

The first column is required - it's what the word will look like
after translation (L & I -> | (pipe)). The 2nd column is an optional
field that tells the system that if the word really did look like this (case
insensitive), then let it through (so d3bt won't go, but debt will). If
the column isn't there, it doesn't matter how the word was spelled, it's
getting blocked (like V!@gr@). I've attached my "badwords" file (put it
in /etc/mail/spamassassin and remove the .txt extension from the name).

First, I had to make a change to my local.cf file:
header EASY_TRANS eval:check_for_easy_trans()
describe EASY_TRANS Character translations made a known bad word
score EASY_TRANS 20.0


Here's the code change:

Find the EvalTests.pm file in the perl libraries and make the following
changes (my file is in
/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/EvalTests.pm).

First, add %BADS to the "use vars qw{ ... }" block near the top (around
line 22). Don't paste any comments inside the braces -- # isn't treated
as a comment inside qw{}:

use vars qw{
    $IP_ADDRESS $IPV4_ADDRESS
    $CCTLDS_WITH_LOTS_OF_OPEN_RELAYS
    $ROUND_THE_WORLD_RELAYERS
    $WORD_OBFUSCATION_CHARS
    $CHARSETS_LIKELY_TO_FP_AS_CAPS
    %BADS
};

# add from here down (this loads the badwords control file into %BADS
# once, when spamd starts)...

if (open(BADWORDS, "< /etc/mail/spamassassin/badwords")) {
    while (<BADWORDS>) {
        chomp(my $wordline = lc($_));
        my ($word, $proper) = split(/\t/, $wordline);
        # one-column entries: any spelling of the word is bad
        $proper = 1 if (!defined $proper);
        $BADS{$word} = $proper;
    }
    close(BADWORDS);
}

sub check_for_easy_trans {
    my ($self) = @_;
    my $subject = lc $self->get('Subject');
    chomp($subject);

    foreach my $word (split(/\s+/, $subject)) {
        my $origword = $word;
        # translate the common substitution characters back to letters
        $word =~ s/5/s/g;
        $word =~ s/3/e/g;
        $word =~ s/0/o/g;
        $word =~ s/9/g/g;
        $word =~ s/\@/a/g;
        $word =~ s/\(\)/o/g;
        $word =~ s/\+/t/g;
        $word =~ s/\$/s/g;
        $word =~ s/6/g/g;
        # i, l, !, and 1 all collapse to a pipe
        $word =~ s/[il\!1]/\|/g;

        next unless exists $BADS{$word};
        # hit, unless the subject used the allowed spelling from column 2
        return 1 if ($origword ne $BADS{$word});
    }
    return 0;
}
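
If you want to see what the translation step does before touching
EvalTests.pm, here's a tiny stand-alone snippet (not SA code, just the
same substitutions run by hand on a few sample words):

for my $w ('V!@gr@', 'd3bt', 'Ema1l') {
    my $t = lc $w;
    $t =~ s/5/s/g;  $t =~ s/3/e/g;  $t =~ s/0/o/g;
    $t =~ s/9/g/g;  $t =~ s/\@/a/g; $t =~ s/\(\)/o/g;
    $t =~ s/\+/t/g; $t =~ s/\$/s/g; $t =~ s/6/g/g;
    $t =~ s/[il!1]/|/g;
    print "$w -> $t\n";
}

# prints:
#   V!@gr@ -> v|agra   (blocked: matches the v|agra entry)
#   d3bt   -> debt     (blocked: original "d3bt" isn't the allowed "debt")
#   Ema1l  -> ema||    (blocked: matches ema||, original isn't "email")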


Thanks,
Keith Hackworth




RE: possibly a dumb comment, apologies if I'm being a n00b [ In reply to ]
> -----Original Message-----
> From: Hackworth, Keith A [mailto:Keith.Hackworth@BellSouth.com]
> Sent: Wednesday, March 10, 2004 1:12 PM
> To: R Michael Harman
> Cc: spamassassin-users@incubator.apache.org
> Subject: RE: possibly a dumb comment, apologies if I'm being a n00b
>
>
> I've worked on one and it's working great for me. The only problem is
> that I had to modify the EvalTests.pm file that came in SA,
> so I'll have
> to add it again when I upgrade. The rule only applies to the subject.
> I haven't officially tested it, but it's running in my production
> environment due to its success. I'd appreciate any comments on this,
> besides me not officially testing it first 8+), and would REALLY
> appreciate it if someone could test it.


Yeah, I have a comment......FINALLY!!! :)

I've looked at this for a while and it seems really cool. But I don't
publish anyone's work like this unless they give me permission. This gem has
been burning a hole in my hard drive for quite some time! :) My question is for
JM on this. Is there a penalty for reading this info from a file? Does it get
reread every time the rule is run? And finally, do changes to the control
file get picked up right away, or do you still need to restart spamd?

Also on the subject of OBFU, check out this link:

http://www.exit0.us/index.php/ChrissMediocreObfuScript

hth,

Chris
RE: possibly a dumb comment, apologies if I'm being a n00b [ In reply to ]
Sorry for taking so long to get it out here - I had a problem with our
email servers not liking us getting spammy messages (you may have
noticed the bounced Bellsouth.com emails from my co-worker).

I have to restart my spamd when I modify my badwords file. The data in
the file is read into memory as soon as spamd (or amavis, in my case) is
started, so there is no IO impact on it. Plus, it loads into a hash, so
it takes up very little memory. My concern about performance is the
split it does on every word - it's fine for the subject (usually
subjects only have a few words), but I don't think I'd run it against
the body, although it would be GREAT if I could.

You guys are welcome to use the code. It's blocking ~200 messages/day
on my servers, which receive about 15,000 per day (after my blackhole
list blocks).


I have one other rule that's pretty cool too - it blocks about 320
messages/day. It also requires a change to EvalTests.pm. I got tired
of seeing spam come in like this: vi<notatag>agra. Most HTML email
clients ignore unknown tags, so it just shows "viagra" to the user, but
since it doesn't say "Viagra" (or anything close to it), it doesn't get
picked up by SA. This bit of code counts the number of unknown HTML
tags and, depending on the thresholds, either adds a couple of points
to the spam score or, if there are too many, blocks the message altogether.

This rule requires a control file. The file is a list of known html
tags (at least known to me). I've attached it here. Put it in
/etc/mail/spamassassin.

Here are the rules I added to local.cf. The 1st number is "at least
this number of bad tags" and the 2nd number is (optional) "no more than
this number of bad tags":

rawbody BAD_TAGS2 eval:check_for_bad_tags('2','7')
describe BAD_TAGS2 Between 2-7 bad HTML tags in message
score BAD_TAGS2 4.0
rawbody BAD_TAGS8 eval:check_for_bad_tags('8')
describe BAD_TAGS8 More than 8 bad HTML tags in message
score BAD_TAGS8 8.0

This one, too, is a low memory usage one - it loads all html tags into a
hash and it's done. To add new tags, you have to restart spamd.

It ignores XML documents (they have lots of unknown tags) and messages
with attachments (many word documents or zip files have < and > in
them). The code also ignores all non-html emails - this rule only
applies to HTML emails. It's also smart enough to ignore URLs. Some
email clients change urls to <url.com>. It also ignores <mailto:...>
tags.

Here's the code change (it looks a lot like the other rule). Edit the
EvalTests.pm library that comes with SA (mine is in
/opt/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/EvalTests.pm).

First, add %TAGS to the "use vars qw{ ... }" block (again, don't paste
comments inside the braces). The %BADS entry is from the other rule in
my previous posting and isn't required for this one:

use vars qw{
    $IP_ADDRESS $IPV4_ADDRESS
    $CCTLDS_WITH_LOTS_OF_OPEN_RELAYS
    $ROUND_THE_WORLD_RELAYERS
    $WORD_OBFUSCATION_CHARS
    $CHARSETS_LIKELY_TO_FP_AS_CAPS
    %TAGS
    %BADS
};

# load the knowntags control file into %TAGS once at startup
if (open(TAGS, "< /etc/mail/spamassassin/knowntags")) {
    while (<TAGS>) {
        chomp(my $tag = lc($_));
        $TAGS{$tag} = 1;
    }
    close(TAGS);
}

sub check_for_bad_tags {
    my ($self, $body, $count, $max) = @_;
    $max = "" if (!defined $max);

    # glue the raw body back together into one lowercased string
    my $message = "";
    my $invalidcount = 0;
    foreach my $line (@$body) {
        chomp($message .= lc($line));
    }

    # bail out on XML documents, anything containing the SA report text,
    # and multipart (attachment) messages
    return 0 if ($message =~ /\<xml/ || $message =~ /\<\?xml/);
    return 0 if ($message =~ /spamassassin spam filter/);
    return 0 if ($message =~ /boundary=/ && $message =~ /multipart/);

    # only bother with HTML mail
    if ($message =~ /\<body/ || $message =~ /\<html/) {
        my @tags = split(/\</, $message);
        my $first = 1;
        foreach my $tag (@tags) {
            # the first chunk is the text before the first '<', not a tag
            if (!$first) {
                my $thistag = "";
                if ($tag =~ /^\/?([^\s\>\=]*)[\s\>\=]/) {
                    $thistag = $1;
                }
                # skip things that are really addresses or URLs, not tags
                if ($thistag !~ /\S{1,}\@\S{1,}\.[\S]{1,}/ &&
                    $thistag !~ /www\.\S{1,}\.\S{1,}/ && $thistag !~ /^http:/ &&
                    $thistag !~ /^https:/ && $thistag !~ /^\.\./ &&
                    $thistag !~ /^mailto/) {
                    # anything not in the known-tags list (and not a comment) counts
                    if (!$TAGS{$thistag} && $thistag !~ /^\!\-\-/) {
                        $invalidcount++;
                    }
                }
            }
            $first = 0;
        }
    }

    if ($max ne "") {
        return 1 if ($invalidcount > $count && $invalidcount <= $max);
    }
    else {
        return 1 if ($invalidcount > $count);
    }
    return 0;
}
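
And if you just want to see the tag-counting idea in isolation, here's a
little throwaway script (again, my own toy code -- the tag list below is
made up and much shorter than the real knowntags file):

use strict;
use warnings;

# a stand-in for %TAGS; the real list comes from the knowntags file
my %TAGS = map { $_ => 1 } qw(html body p br b i a font table tr td img);

my $message = lc '<html><body>vi<notatag>agra is ch<xyz>eap</body></html>';
my $invalid = 0;
my @chunks  = split /</, $message;
shift @chunks;                       # text before the first '<' isn't a tag
for my $chunk (@chunks) {
    if ($chunk =~ /^\/?([^\s>=]*)[\s>=]/) {
        my $name = $1;
        $invalid++ unless $TAGS{$name} || $name =~ /^!--/;
    }
}
print "unknown tags: $invalid\n";    # prints: unknown tags: 2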
