Mailing List Archive

dealing with subjects forged with accented letters
Hi,

Is there a simple way to deal with SPAM subjects written using accented letters?
The simplest would be piping the Subject line through a simple "tr"-like filter
before applying SA checks, but can it be done?

regards, Michal.
Re: dealing with subjects forged with accented letters
Michal Szymanski wrote:

> Hi,
>
> Is there a simple way to deal with SPAM subjects written using accented
> letters? The simplest would be piping the Subject line through a simple
> "tr"-like filter before applying SA checks, but can it be done?

http://sandgnat.com/cmos/cmos.jsp


--
Jens Benecke (jens at spamfreemail.de)
http://www.hitchhikers.de - Europe-wide free ride-sharing service
http://www.spamfreemail.de - 100% clean mailboxes - guaranteed!
http://www.rb-hosting.de - PHP from 9€ - SSH from 19€ - low-cost traffic
Re: dealing with subjects forged with accented letters
Hello Michal,

Thursday, February 5, 2004, 12:37:08 AM, you wrote:

MS> Is there a simple way to deal with SPAM subjects written using accented letters?
MS> The simplest would be piping the Subject line through a simple "tr"-like filter
MS> before applying SA checks, but can it be done?

I use the following (we get foreign email, but since we only understand
English, we expect all subject headings to be in English):

header RM_sl_ForeignChar Subject =~ /\w[äëöü]\w/
describe RM_sl_ForeignChar Subject contains foreign character apparently embedded within a word
score RM_sl_ForeignChar 3.000 # 413s/0h of 97268 corpus (79437s/17831h) 01/24/04
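
To see which subjects a pattern like this would hit, here is a minimal Python sketch of the same regular expression. It is an analogue for experimenting, not how SpamAssassin evaluates the rule: the helper name has_embedded_umlaut is hypothetical, and re.ASCII is an assumption meant to mimic \w matching only plain ASCII word characters.

```python
import re

# Python analogue of the rule's pattern (an illustration, not SpamAssassin itself).
# re.ASCII restricts \w to [A-Za-z0-9_], so the umlaut must sit between
# two plain ASCII word characters to count as "embedded within a word".
FOREIGN_CHAR = re.compile(r"\w[äëöü]\w", re.ASCII)

def has_embedded_umlaut(subject: str) -> bool:
    """Return True if an umlaut appears between two ASCII word characters."""
    return FOREIGN_CHAR.search(subject) is not None

print(has_embedded_umlaut("Cheap mëds online"))  # True
print(has_embedded_umlaut("Meeting at noon"))    # False
print(has_embedded_umlaut("Müller GmbH"))        # True
```

The last case shows the limitation: an ordinary German name matches as well, so the rule only suits sites that expect English-only subjects.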

Bob Menschel
Re: dealing with subjects forged with accented letters
On Thu, Feb 05, 2004 at 11:23:02PM -0800, Robert Menschel wrote:
>
> I use the following (we get foreign email, but since we only understand
> English, we expect all subject headings to be in English):
>
> header RM_sl_ForeignChar Subject =~ /\w[äëöü]\w/
> ...

Hi Robert,

Unfortunately, a solution that simple will not work for me. We get emails in
Polish and occasionally also in Spanish or German (not to mention
English, of course, but these are no problem :) so we cannot just
spam-them-all. What we need is to filter Subject lines (changing
all "äëöü" to "aeou") and *then* apply SA rules to them.
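
The "tr"-like step described here could look roughly like the following Python sketch. It is a standalone illustration under stated assumptions, not a SpamAssassin feature: strip_accents is a hypothetical helper name, and Unicode decomposition is one possible stand-in for a hand-written character table.

```python
import unicodedata

def strip_accents(subject: str) -> str:
    """Replace accented letters with their base letters, tr-style."""
    # NFKD splits a letter such as "ä" into "a" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", subject)
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("äëöü"))    # aeou
print(strip_accents("Vìagrä"))  # Viagra
print(strip_accents("Łódź"))    # Łodz
```

Letters that have no Unicode decomposition, such as the Polish "Ł", pass through unchanged, so a real filter would still need an explicit table for those.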

Michal.

--
Michal Szymanski (msz@astrouw.edu.pl)
Warsaw University Observatory, Warszawa, POLAND
Re: dealing with subjects forged with accented letters
Probably should also replace the obvious numeric and special characters like zer0, thr33, f|ve, $even, etc. while you are at it.

Loren

I have to wonder if it is worth the processor time, though. It might be faster to simply build a thesaurus of creative misspellings and analyze the sentence that results from the substitutions. I expect that is essentially what the Bayes stuff does.
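
A minimal Python sketch of the digit and symbol replacement, assuming only the four substitutions visible in the examples above (0, 3, | and $). The table LEET and the function deobfuscate are hypothetical names for illustration.

```python
# Substitution table built from the four examples in the message:
# zer0, thr33, f|ve, $even. Anything beyond these would be a guess.
LEET = str.maketrans({"0": "o", "3": "e", "|": "i", "$": "s"})

def deobfuscate(text: str) -> str:
    """Lower-case the text and undo the simple one-for-one substitutions."""
    return text.lower().translate(LEET)

print(deobfuscate("zer0 thr33 f|ve $even"))  # zero three five seven
print(deobfuscate("Save 30%"))               # save eo%
```

The second call shows the cost of a blind table: genuine numbers are mangled too, so the output is only useful for matching, not for display.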


Re: dealing with subjects forged with accented letters
On Fri, 2004-02-06 at 08:58, Loren Wilton wrote:
> Probably should also replace the obvious numeric and special characters like zer0, thr33, f|ve, $even, etc. while you are at it.
<snip>
> On Thu, Feb 05, 2004 at 11:23:02PM -0800, Robert Menschel wrote:
> >
> > I use the following (we get foreign email, but since we only understand
> > English, we expect all subject headings to be in English):
> >
> > header RM_sl_ForeignChar Subject =~ /\w[äëöü]\w/
> > ...
<snip>
> all "äëöü" to "aeou" and *then* apply SA rules to them.

If you're interested in doing these transformations, you might want to
have a look-see at CMOScript. I've been attacking this sort of problem
from the other side; not "translating" characters in advance, but
matching the untranslated word against a Regexp with the translations
inside it. The idea is similar though, and I have a list of
translations you might want to use as a starting point (eg: 'b' => ['b',
'8', '\\xDF']). The list is by no means authoritative, or complete, but
it should be a good place to start. Grab the obfu.pl from
http://sandgnat.com/cmos/.
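
The "translations inside the regexp" idea can be sketched in Python as follows. This is not CMOScript's code or the contents of obfu.pl: only the 'b' entry comes from the message above, while the other table entries and the function name obfuscation_pattern are assumptions made for illustration.

```python
import re

# Per-letter translation table. Only the "b" entry is from the message;
# the "o" and "e" entries are assumed here to make the example usable.
TRANSLATIONS = {
    "b": ["b", "8", "\xdf"],
    "o": ["o", "0", "\xf6"],  # assumption
    "e": ["e", "3", "\xeb"],  # assumption
}

def obfuscation_pattern(word: str):
    """Compile a regex that matches the word or any listed obfuscation of it."""
    parts = []
    for ch in word.lower():
        variants = TRANSLATIONS.get(ch, [ch])
        # One non-capturing alternation per letter of the target word.
        parts.append("(?:" + "|".join(re.escape(v) for v in variants) + ")")
    return re.compile("".join(parts), re.IGNORECASE)

pattern = obfuscation_pattern("bob")
print(pattern.search("8\xf6\xdf") is not None)  # True
print(pattern.search("bab") is not None)        # False
```

Every letter becomes an alternation, which is why patterns built this way grow quickly with word length and table size.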

Also, I haven't done much to update CMOScript lately, but my plan has
been to move towards a pre-translator methodology once the SA 2.70/3.0
plugins interface is released. Pre-transforming should help reduce
processing time (CMOScript regexps are HUGE) and should allow for more
re-use.

There are disadvantages to the pre-translate method, however. One such
example is the character "|" which could be either an obfu "I" or an
obfu "L". How would you choose to translate that character? The same
goes for "*", "I", "l". Another possible disadvantage is that it's not
as easy to translate obfu character sequences such as: "m" => "rn" or
"N" => "|\|". I haven't yet come up with a good way to do
pre-transformation and still match these obfu types in a clean manner.
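
One conceivable way around the ambiguity, offered only as a sketch, is to let the pre-translator return every candidate reading instead of a single one. The table CANDIDATES and the function candidate_readings are hypothetical; the entries are taken from the examples above ("|" as I or L, "rn" for m, "|\|" for N).

```python
from itertools import product

# Each obfuscated token maps to every reading it might stand for.
# "rn" keeps its literal reading as well, since it occurs in normal words.
CANDIDATES = {
    "|\\|": ["n"],
    "rn": ["rn", "m"],
    "|": ["i", "l"],
}
# Try longer tokens first so "|\|" is not consumed as a lone "|".
TOKENS = sorted(CANDIDATES, key=len, reverse=True)

def candidate_readings(text: str) -> set:
    """Return every reading of the text under the candidate table."""
    text = text.lower()
    slots = []
    i = 0
    while i < len(text):
        for token in TOKENS:
            if text.startswith(token, i):
                slots.append(CANDIDATES[token])
                i += len(token)
                break
        else:
            # No token matched here: keep the character as it is.
            slots.append([text[i]])
            i += 1
    return {"".join(choice) for choice in product(*slots)}

print(candidate_readings("f|ve") == {"five", "flve"})         # True
print(candidate_readings("|\\|ow") == {"now"})                # True
print(candidate_readings("modern") == {"modern", "modem"})    # True
```

The number of readings multiplies with each ambiguous token, so this trades the huge regexps for a potentially large candidate set per subject.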

OK this was probably way off-topic and more discussion than you were
looking for. Oh well.

--
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/