Mailing List Archive

Detect Emoticons in Subject
Hi,

I've been using SA a long time. Lately, I'm getting more and more spam
with emoticons in the subject line. I'd say about 90% of my emails with
emoticons in the subject are spam. I'd like to create a local rule which
scores email with emoticons in the subject. I saw a previous discussion on
this in the archive, but it was focused on whether such emails were *always*
spam. I think an emoticon rule, in combination with other rules, will
help my installation. I've tried to match as follows, but it won't lint.
I'm not really a perl programmer. I've written several other more
conventional local rules, but here I'm a bit out of my depth. I'd
appreciate some guidance.

# Local Rule for Emoticons in subject
subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
score EMOTICON_IN_SUBJECT 3.0
describe EMOTICON_IN_SUBJECT Subject Line Has Emoticons

-CJ
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 11:42:59 -0400
Clive Jacques wrote:

> Hi,
>
> I've been using SA a long time. Lately, I'm getting more and more
> spam with emoticons in the subject line. I'd say about 90% of my
> emails with emoticons in the subject are spam. I'd like to create a
> local rule which scores email with emoticons in the subject.

> # Local Rule for Emoticons in subject
> subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/

The rule should start with "header", that's what's causing the lint
failure.

However, AFAIK, the rule still won't work because \p{Emoticons}
isn't supported in spamassassin, which works on byte sequences. You
need to rewrite it to match UTF-8 bytes.
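
To make "match UTF-8 bytes" concrete, here is a minimal sketch in Python (used only for illustration; SpamAssassin rules themselves are Perl regexes). The helper name utf8_hex is made up for this example. It prints the byte escapes that a single character encodes to, which is the form a byte-level rule has to match:

```python
# Illustration only: show the UTF-8 byte sequence behind a single character.
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as \\xNN escapes, as used in a byte-level regex."""
    return "".join(f"\\x{b:02X}" for b in ch.encode("utf-8"))

# U+1F600 GRINNING FACE is the first code point of the Unicode Emoticons block.
print(utf8_hex("\U0001F600"))  # \xF0\x9F\x98\x80
```

So one emoji in the subject is four bytes as far as a byte-oriented regex is concerned.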
Re: Detect Emoticons in Subject
We've started getting lots of spam with emoji in the subject too over the
past few weeks, so I've looked into this as well. As mentioned by RW,
you would need to create some kind of UTF8 regex header Subject rule. As
I'm not too excited about writing such a regex, it's way at the bottom
of my todo list to contemplate whether an SA plugin could be written for
that and to then reach out to the SA developers to see whether that
would be something upstream would accept. But honestly, I won't be able
to any time soon (I don't have the time). Still, thought I'd mention it,
since it might be relevant to your question. If you do end up figuring
out a regex that works out and isn't an extreme length, I think plenty
of people on this list would love to know!

Bert

On 20/05/2021 18:19, RW wrote:
> On Thu, 20 May 2021 11:42:59 -0400
> Clive Jacques wrote:
>
>> Hi,
>>
>> I've been using SA a long time. Lately, I'm getting more and more
>> spam with emoticons in the subject line. I'd say about 90% of my
>> emails with emoticons in the subject are spam. I'd like to create a
>> local rule which scores email with emoticons in the subject.
>> # Local Rule for Emoticons in subject
>> subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
> The rule should start with "header", that's what's causing the lint
> failure.
>
> However, AFAIK, the rule still won't work because \p{Emoticons}
> isn't supported in spamassassin, which works on byte sequences. You
> need to rewrite it to match UTF-8 bytes.
Re: Detect Emoticons in Subject
On Thu, 2021-05-20 at 18:34 +0200, Bert Van de Poel wrote:
> We've started getting lots of spam with emoji in the subject too the
> past few weeks, so I've looked into this as well. As mentioned by RW,
> you would need to create some kind of UTF8 regex header Subject rule. As
> I'm not too excited about writing such a regex, it's way at the bottom
> of my todo list 
>
Should be easy enough - IsASCII is just a name for [\x00-\x7f] and
IsXDigit is [0-9a-fA-F], so the same logic can be applied to define a
regex that triggers on any character within the three Unicode emoji
ranges. See Wikipedia for more detail:

https://en.wikipedia.org/wiki/Emoticon#Unicode

I haven't yet seen any emojis in Subject lines, regardless of whether
the message was spam or not, or I'd probably have already written such a
rule and given it a minimal score so it can be used in a more spam-
specific meta rule.
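
As a rough sketch of how the byte ranges for such a regex could be worked out, the Python snippet below (illustration only) encodes the first and last code point of each range. The range boundaries are an assumption taken from the Unicode block charts, and the helper name utf8_bounds is made up for this example:

```python
# Illustration only: UTF-8 encodings of the first and last code point of each range.
# Assumption: these boundaries come from the Unicode block charts.
BLOCKS = {
    "Emoticons": (0x1F600, 0x1F64F),
    "Supplemental Symbols and Pictographs": (0x1F900, 0x1F9FF),
    "Faces in Miscellaneous Symbols": (0x2639, 0x263B),
}

def utf8_bounds(first, last):
    """Return the UTF-8 encoding of both ends of a code point range as hex strings."""
    return (chr(first).encode("utf-8").hex(" "), chr(last).encode("utf-8").hex(" "))

for name, (first, last) in BLOCKS.items():
    lo, hi = utf8_bounds(first, last)
    print(f"{name}: {lo} .. {hi}")
```

Where both ends share their leading bytes, only the trailing bytes need a character class in the regex.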

Martin
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 18:34:54 +0200
Bert Van de Poel wrote:

> We've started getting lots of spam with emoji in the subject too the
> past few weeks, so I've looked into this as well. As mentioned by RW,
> you would need to create some kind of UTF8 regex header Subject rule.
> As I'm not too excited about writing such a regex, it's way at the
> bottom of my todo list to contemplate whether an SA plugin could be
> written for that and to then reach out to the SA developers to see
> whether that would be something upstream would accept. But honestly,
> I won't be able to any time soon (I don't have the time). Still,
> thought I'd mention it, since it might be relevant to your question.
> If you do end up figuring out a regex that works out and isn't an
> extreme length, I think plenty of people on this list would love to
> know!

Try this:


header EMOTICON_IN_SUBJECT Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-x8F])/
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 18:30:03 +0100
RW wrote:


> Try this:
>
>
> header EMOTICON_IN_SUBJECT Subject =~
> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-x8F])/
>

Actually that's only the original block, but it probably works most of
the time
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 18:44:43 +0100
RW wrote:

> On Thu, 20 May 2021 18:30:03 +0100
> RW wrote:
>
>
> > Try this:
> >
> >
> > header EMOTICON_IN_SUBJECT Subject =~
> > /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-x8F])/
> >
>
> Actually that's only the original block, but it probably works most of
> the time

This extends it to Supplemental Symbols and Pictographs and
adds the three original faces from Miscellaneous Symbols


/\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xFF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/

It also fixes a minor problem with the continuation bytes in the original.
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 19:26:30 +0100
RW wrote:

> On Thu, 20 May 2021 18:44:43 +0100
> RW wrote:
>
> > On Thu, 20 May 2021 18:30:03 +0100
> > RW wrote:
> >
> >
> > > Try this:
> > >
> > >
> > > header EMOTICON_IN_SUBJECT Subject =~
> > > /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-x8F])/
> > >
> >
> > Actually that's only the original block, but it probably works most
> > of the time
>
> This extends it to Supplemental Symbols and Pictographs and
> adds the three original faces from Miscellaneous Symbols
>
>
> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xFF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
>
> it also fixes a minor problem with a continuation bytes in the
> original.
>
I still didn't get the continuation bytes right; I forgot that bit 6 is always
0 - it's a long time since I've done this.

/\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
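
A quick way to sanity-check a pattern like this outside SpamAssassin is to run it as a bytes regex. The Python sketch below is illustration only: the names EMOJI_BYTES and has_emoji are made up, and every byte is written as an \xNN escape.

```python
import re

# Illustration only: the byte-level pattern as a Python bytes regex.
EMOJI_BYTES = re.compile(
    rb"\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])"           # Emoticons, U+1F600-U+1F64F
    rb"|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])"   # Supplemental Symbols and Pictographs
    rb"|\xE2\x98[\xB9-\xBB]"                                  # three faces in Miscellaneous Symbols
)

def has_emoji(subject: str) -> bool:
    """True if the UTF-8 encoding of subject contains a sequence the pattern covers."""
    return EMOJI_BYTES.search(subject.encode("utf-8")) is not None

# UTF-8 continuation bytes have the form 10xxxxxx, so they always lie in 0x80-0xBF.
assert all(0x80 <= b <= 0xBF for b in "\U0001F64F".encode("utf-8")[1:])

print(has_emoji("Big sale \U0001F600"))  # True
print(has_emoji("Meeting at 10:00"))     # False
```

Characters outside the three listed ranges are not covered by this pattern.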
Re: Detect Emoticons in Subject
On 2021-05-20 at 13:44:43 UTC-0400 (Thu, 20 May 2021 18:44:43 +0100)
RW <rwmaillists@googlemail.com>
is rumored to have said:

> On Thu, 20 May 2021 18:30:03 +0100
> RW wrote:
>
>
>> Try this:
>>
>>
>> header EMOTICON_IN_SUBJECT Subject =~
>> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-x8F])/
>>
>
> Actually that's only the original block, but it probably works most of
> the time

Not so sure about that...

I regularly get mail from Patreon with emoji in the encoded header which
don't match that pattern:


# grep '^Subject: ' /tmp/ham |cut -d? -f4 |decode-base64 |hexdump -C
00000000  f0 9f 8e 89 20 50 61 74  72 69 63 6b 20 57 61 72  |.... Patrick War|
00000010  64 6c 65 20 6a 75 73 74  20 73 68 61 72 65 64 20  |dle just shared |
00000020  22 f0 9f 93 9d 20 4e                              |".... N|
00000027

People send wanted mail with all sorts of weirdness.

Looking at the full set
(https://www.unicode.org/emoji/charts/full-emoji-list.html) I can
understand why \p{Emoticons} would be so much better than trying to
define them all in a regex of hex bytes in UTF-8 form.
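
For anyone wanting to see which characters those bytes are, a small Python sketch (illustration only; the helper name non_ascii_chars is made up) decodes exactly the 39 bytes from the hexdump above and lists the non-ASCII code points:

```python
import unicodedata

# Illustration only: the 39 bytes shown in the hexdump, nothing added.
raw = bytes.fromhex(
    "f09f8e89205061747269636b20576172"
    "646c65206a7573742073686172656420"
    "22f09f939d204e"
)

def non_ascii_chars(data: bytes):
    """Yield (code point, Unicode name) for each non-ASCII character in UTF-8 data."""
    for ch in data.decode("utf-8"):
        if ord(ch) > 0x7F:
            yield f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")

for cp, name in non_ascii_chars(raw):
    print(cp, name)
```

Both characters come out as code points in the U+1F300 to U+1F5FF range, which lies outside the Emoticons block (U+1F600 to U+1F64F).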

--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
Re: Detect Emoticons in Subject
That's fine - I'm not saying all email containing emojis in the subject (or
elsewhere) *is* spam - just that it's uncommon and right now, about 90% of
the time it is *for me*. I just want to score it as part of the greater
constellation of factors (just like DKIM, SPF etc.).

On Thu, May 20, 2021 at 2:48 PM Bill Cole <
sausers-20150205@billmail.scconsult.com> wrote:

>
> People send wanted mail with all sorts of weirdness.
>
>
Re: Detect Emoticons in Subject
On 20-05-2021 18:19, RW wrote:
> On Thu, 20 May 2021 11:42:59 -0400
> Clive Jacques wrote:
>
>> Hi,
>>
>> I've been using SA a long time. Lately, I'm getting more and more
>> spam with emoticons in the subject line. I'd say about 90% of my
>> emails with emoticons in the subject are spam. I'd like to create a
>> local rule which scores email with emoticons in the subject.
>
>> # Local Rule for Emoticons in subject
>> subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
>
> The rule should start with "header", that's what's causing the lint
> failure.
>
> However, AFAIK, the rule still won't work because \p{Emoticons}
> isn't supported in spamassassin, which works on byte sequences. You
> need to rewrite it to match UTF-8 bytes.
>

I'm not a real fan of very complex regular expressions, as they tend to
get hard to read/understand very quickly. This thread is a perfect
example: the syntax that the OP proposed (/\p{Emoticons}/) seems
perfectly readable, and all the actually working alternatives are, with
all respect to the authors, a nightmare to decipher. Especially for
users not really proficient in regular expressions, the OP's syntax is
perfectly understandable and all the alternatives aren't.

I'm not really into the regex engine of perl/SA, so please correct me if
I'm wrong. The /\p{Emoticons}/ syntax seems to me a builtin feature of
the regex spec/perl (as opposed to pseudo-code, displaying something
that actually doesn't exist).

Can someone explain why SA cannot support this type of syntax, or what
would be needed to get it supported? IMHO it makes it a lot easier for
end-users to understand a rule, and for rule developers to write or even
contribute new UTF-8-related rules, so it might be worth the effort to
get it supported?

Thanks in advance,
Tom
Re: Detect Emoticons in Subject
On Fri, May 21, 2021 at 09:53:36AM +0200, Tom Hendrikx wrote:
>
> Can someone explain why SA cannot support this type of syntax, or what would
> be needed to get it supported? IMHO it makes it a lot easier for end-users
> to understand a rule, and for rule developers to write or even contribute
> new UTF-8-related rules, so it might be worth the effort to get it
> supported?

Perl strings internally would have to be UTF8. Mandatory prerequisite would
be normalize_charset 1 in SA. Could be some cases where SA can't decode
mails properly to UTF8, so it's a question mark what happens then.
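
As an analogy only (Python, not Perl or SpamAssassin internals), the sketch below shows the decode-first approach: once the subject is a character string, a range can be written in code points instead of bytes. Python's re module has no \p{...} support, so an explicit range stands in for it. The names are made up, and treating undecodable input as "no match" is an assumption, which is exactly the open question mentioned above.

```python
import re

# Analogy only: match on decoded characters rather than on raw UTF-8 bytes.
EMOTICONS = re.compile("[\U0001F600-\U0001F64F]")  # Emoticons block as code points

def has_emoticon(raw_subject: bytes) -> bool:
    """Decode UTF-8 first, then match on characters."""
    try:
        text = raw_subject.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: what to do with undecodable input is a policy choice.
        return False
    return EMOTICONS.search(text) is not None

print(has_emoticon("Hello \U0001F600".encode("utf-8")))  # True
print(has_emoticon(b"\xf0\x9f"))                          # False (truncated sequence)
```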

Some changes are coming already in 4.0, for example normalize_charset 1 will
be default. But more complex internal/rule changes require a lot of thought
on how to maintain backwards compatibility. I'm sure some people will still
run 3.4 for years to come.

Sorry to say but there are too few developers right now. It's up to the
community to pick up the pace.
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 19:39:06 +0100
RW wrote:

>
> /\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/


This includes the block mentioned by Bill Cole and is simplified a
bit


/\xF0\x9F[\x98-\x99\xA4-\xA7\x8C-\x97][\x80-\xBF]|\xE2\x98[\xB9-\xBB]/


However, if you don't expect to get any legitimate mail with Asian
languages in the subject, you can probably get away with including all
4-byte UTF-8. Those code points are dominated by CJK, symbols, emojis
and dead languages.


/[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]/
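
To get a feel for what this broad pattern catches, here is a small Python sketch (illustration only; the names ANY_4BYTE and matches and the sample characters are chosen for this example) that runs it as a bytes regex:

```python
import re

# Illustration only: the broad "any 4-byte UTF-8" pattern as a Python bytes regex.
ANY_4BYTE = re.compile(rb"[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]")

def matches(text: str) -> bool:
    """True if the UTF-8 encoding of text contains a sequence the pattern covers."""
    return ANY_4BYTE.search(text.encode("utf-8")) is not None

samples = {
    "emoticon U+1F600": "\U0001F600",
    "party popper U+1F389": "\U0001F389",
    "CJK Extension B U+20000": "\U00020000",
    "common CJK U+4E2D (3 bytes)": "\u4E2D",
    "Latin-1 U+00E9 (2 bytes)": "\u00E9",
}
for label, text in samples.items():
    print(label, matches(text))
```

The first three samples match and the last two do not: everyday CJK text in the Basic Multilingual Plane encodes to three bytes, so only the rarer supplementary-plane ideographs are swept up along with the emoji.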