So, I've been looking over the types of messages that SA is missing on my
mail stream... It seems like most of them still contain trigger words
that would cause high scores; they're just slightly masked. Has anyone
talked about applying some kind of fuzzy-matching technique? That is, taking
the trigger words and generating a large set of patterns that match,
based on rules such as:
'a' => /a|A|@/
'x' => /x|X|></
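
To make the idea concrete, here is a minimal sketch of expanding a trigger
word into one such pattern. It's in Python rather than Perl, and it is not
anything SA actually does. The SUBSTITUTIONS table and the fuzzy_pattern()
name are made up for illustration; only the 'a' and 'x' entries come from
the rules above, and the 'i' and 'o' entries are guesses at other common
disguises.

```python
import re

# Hypothetical substitution table. Only 'a' and 'x' come from the rules
# above; 'i' and 'o' are assumed examples of other common disguises.
SUBSTITUTIONS = {
    "a": ["a", "@"],
    "x": ["x", "><"],
    "i": ["i", "1", "!"],
    "o": ["o", "0"],
}

def fuzzy_pattern(word):
    """Build a case-insensitive regex matching masked spellings of word."""
    parts = []
    for ch in word.lower():
        # Letters without an entry in the table only match themselves.
        alternatives = SUBSTITUTIONS.get(ch, [ch])
        escaped = "|".join(re.escape(alt) for alt in alternatives)
        parts.append("(?:" + escaped + ")")
    return re.compile("".join(parts), re.IGNORECASE)

if __name__ == "__main__":
    pattern = fuzzy_pattern("viagra")
    print(bool(pattern.search("Buy V1@GRA now")))
```

Case is handled with the IGNORECASE flag instead of listing 'a|A' by hand,
which keeps the table shorter.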
You might even be able to use a large corpus of spam to automatically
derive these rules. (A corpus of parsed-out and "translated" tokens would
work better, obviously.)
You could also introduce some edit-distance effects to the match (strict
Hamming distance only covers substitutions, so Levenshtein distance is
probably the better fit), so that, say, hamming_match( /hello/ , 1 ) would
match "helldo", "helo", and "hillo". And then there's the possibility of
doing phonetic matching, like many spellcheckers.
Using any/all of this stuff would be pretty processor intensive --
probably much more practical for ISPs than for users -- but it seems like
it'd kill off almost all of the new crop of SA-evading spam. Maybe
somebody could lure Larry Wall into building this kind of fuzzy-match
technology directly into the next major version of Perl? *g*
Just thought I'd throw that out there. Aside from that, I'll probably
lurk for a while; if I end up feeling out of my depth (which is possible
-- my actual day job is as a linguist, and most of my coding skills, such
as they are, are aimed at that) I'll unsub.
Thanks,
Auros
------------------------------------------------------------------------
R Michael Harman / Auros Symtheos
rmharman@auros.org ............ http://www.auros.org/
Linguist and Eclectic Engineer, Lexicus, Motorola
rmharman@motorola.com ......... http://www.lexicus.mot.com/
Senior Reviews Editor, Strange Horizons Speculative Fiction Weekly
reviews@strangehorizons.com ... http://www.strangehorizons.com/