On Fri, 2004-02-06 at 08:58, Loren Wilton wrote:
> Probably should also replace the obvious numeric and special characters like zer0, thr33, f|ve, $even, etc. while you are at it.
<snip>
> On Thu, Feb 05, 2004 at 11:23:02PM -0800, Robert Menschel wrote:
> >
> > I use the following (we get foreign email, but since we only understand
> > English, we expect all subject headings to be in English):
> >
> > header RM_sl_ForeignChar Subject =~ /\w[äëöü]\w/
> > ...
<snip>
> all "äëöü" to "aeou" and *then* apply SA rules to them.
If you're interested in doing these transformations, you might want to
have a look-see at CMOScript. I've been attacking this sort of problem
from the other side; not "translating" characters in advance, but
matching the untranslated word against a Regexp with the translations
inside it. The idea is similar though, and I have a list of
translations you might want to use as a starting point (eg: 'b' => ['b',
'8', '\\xDF']). The list is by no means authoritative, or complete, but
it should be a good place to start. Grab the obfu.pl from
http://sandgnat.com/cmos/.

Also, I haven't done much to update CMOScript lately, but my plan has
been to move towards a pre-translator methodology once the SA 2.70/3.0
plugins interface is released. Pre-transforming should help reduce
processing time (CMOScript regexps are HUGE) and should allow for more
re-use.
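
For anyone who hasn't seen the regexp-side approach I described above, here is a rough sketch of the idea. It's in Python rather than Perl and is not the actual obfu.pl code. Only the 'b' entry comes from my list; the other table entries and the name obfu_regex are made up for illustration.

```python
import re

# Hypothetical, abbreviated translation table. Only the 'b' entry mirrors
# the example in this message; the other entries are illustrative guesses.
TRANSLATIONS = {
    "a": ["a", "@", "4"],
    "b": ["b", "8", "\xdf"],
    "i": ["i", "1", "|", "!"],
    "l": ["l", "1", "|"],
    "o": ["o", "0"],
    "m": ["m", "rn"],
    "n": ["n", "|\\|"],
}

def obfu_regex(phrase):
    """Return a compiled regex matching `phrase` and its obfuscated variants."""
    parts = []
    for ch in phrase.lower():
        variants = TRANSLATIONS.get(ch, [ch])
        # Put longer variants first, then escape each one so characters
        # such as '|' are matched literally instead of acting as alternation.
        ordered = sorted(variants, key=len, reverse=True)
        alts = [re.escape(v) for v in ordered]
        parts.append("(?:" + "|".join(alts) + ")")
    return re.compile("".join(parts), re.IGNORECASE)
```

With this table, obfu_regex("mail") matches both "mail" and "rnai|". Every letter of the phrase expands into an alternation group, so you can see why the generated regexps get so large.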

There are disadvantages to the pre-translate method, however. One
example is the character "|", which could be either an obfu "I" or an
obfu "L". How would you choose to translate that character? The same
goes for "*", "I", "l". Another possible disadvantage is that it's not
as easy to translate obfu character sequences such as: "m" => "rn" or
"N" => "|\|". I haven't yet come up with a good way to do
pre-transformation and still match these obfu types in a clean manner.
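
To make the ambiguity concrete, here is one possible (and not very clean) workaround sketched in Python: instead of picking a single translation, keep every candidate reading of the text. This is only an assumption about how a pre-translator could cope, not something CMOScript does. The CANDIDATES table and the function names are hypothetical.

```python
from itertools import product

# Hypothetical reverse table: obfuscated token -> plain letters it may stand for.
CANDIDATES = {
    "|\\|": ["n"],
    "rn": ["m", "rn"],
    "|": ["i", "l"],
    "1": ["i", "l"],
    "0": ["o"],
    "@": ["a"],
}
# Try longer tokens first so "|\|" is consumed before a bare "|".
TOKENS = sorted(CANDIDATES, key=len, reverse=True)

def tokenize(text):
    """Split text into a list of candidate lists, one list per token."""
    out, i = [], 0
    while i < len(text):
        for tok in TOKENS:
            if text.startswith(tok, i):
                out.append(CANDIDATES[tok])
                i += len(tok)
                break
        else:
            # No obfuscation token starts here: keep the character as-is.
            out.append([text[i]])
            i += 1
    return out

def pretranslate(text):
    """Return every plain-text reading of `text` as a set of strings."""
    return {"".join(choice) for choice in product(*tokenize(text.lower()))}
```

Here pretranslate("b|ll") yields both "bill" and "blll", and the rules would then have to run against each reading. Every ambiguous token multiplies the number of readings, which eats into the processing-time savings that pre-translation was supposed to buy.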

OK, this was probably way off-topic and more discussion than you were
looking for. Oh well.
--
Chris Thielen
Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/