Mailing List Archive

[Help] bodyre in hashbl
Hello,

I'm trying to use the HashBL plugin with the bodyre function.

With that function I would like to match utf8 patterns, such as

'([\p{L}\p{M}\d\S]+[\ \t]+[\p{L}\p{M}\d\S]+)'

I'm particularly interested in accented characters, such as /[àèìòù]/.

With Perl, if I try:

```
use utf8;
use open ':std', ':encoding(UTF-8)';

$txt = ' musica ? ciao ciao.';
$re = '([\p{L}\p{M}\d\S]+[\ \t]+[\p{L}\p{M}\d\S]+)';

if ($txt =~ /$re/gs) {
print "Match: $1";
}
```

then $txt matches as well.


With Spamassassin I built my own dnsbl of hashes and the Spamassassin rule:

body HASHBL_MY_SPAM1
eval:check_hashbl_bodyre('spamhash.example.com', 'sha1/max=10/shuffle',
'([\p{L}\p{M}\d\S]+[\ \t]+[\p{L}\p{M}\d\S]+)', '^127\.0\.0\.2')

This doesn't match the above $txt in the body of the mail.

If I want to match as expected the string ' musica ? ciao ciao.' in the
body of the mail, then I must change the above regex in the following way:

body HASHBL_MY_SPAM1
eval:check_hashbl_bodyre('spamhash.example.com', 'sha1/max=10/shuffle',
'([\p{L}\p{M}\d\Sàèìòù]+[\ \t]+[\p{L}\p{M}\d\Sàèìòù]+)', '^127\.0\.0\.2')


So I have to add the accented characters literally.
I can't understand why. Are there any limitations in the HashBL plugin with UTF-8?
Maybe I have misunderstood something.

Thank you very much for every hint.

Kind Regards
Marco
Re: [Help] bodyre in hashbl
On Mon, May 17, 2021 at 03:02:57PM +0200, Marco wrote:
>
> So I have to add the accented character literally.
> I can't understand why. Are there any limitations in the HashBL plugin with UTF-8?
> Maybe I have misunderstood something.

SA doesn't support UTF8 regex. It's just matching plain byte strings.

The result depends on the normalize_charset setting too. For best compatibility you should
match both the Latin-1 and UTF-8 raw byte variants: ü -> (?:\xfc|\xc3\xbc)

https://cwiki.apache.org/confluence/display/SPAMASSASSIN/WritingRulesAdvanced

Or check the replace_tags in 25_replace.cf, there's ready templates for
characters (but they match some commonly obfuscated variants too).
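
To make the "plain byte strings" point concrete, here is a minimal sketch. It uses Python's bytes-mode `re` purely as a stand-in for byte-level matching; it is not SpamAssassin code, and the sample word "für" is an assumption chosen for illustration.

```python
import re

# The same character "ü" as raw bytes in two encodings.
latin1_body = "für".encode("latin-1")   # b'f\xfcr'
utf8_body = "für".encode("utf-8")       # b'f\xc3\xbcr'

# Byte-oriented pattern covering both variants, as suggested above.
u_umlaut = re.compile(rb"(?:\xfc|\xc3\xbc)")

matches_latin1 = u_umlaut.search(latin1_body) is not None
matches_utf8 = u_umlaut.search(utf8_body) is not None

# A pattern that only knows the Latin-1 byte misses the UTF-8 body.
latin1_only = re.compile(rb"\xfc")
misses_utf8 = latin1_only.search(utf8_body) is None

print(matches_latin1, matches_utf8, misses_utf8)
```

The alternation is what makes the rule independent of the encoding in which the body reaches the matcher.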
Re: [Help] bodyre in hashbl
On Mon, May 17, 2021 at 07:12:47PM +0300, Henrik K wrote:
>
> Or check the replace_tags in 25_replace.cf, there's ready templates for
> characters (but they match some commonly obfuscated variants too).

And yeah sorry, these won't work with HashBL, it's just for basic rules..
Re: [Help] bodyre in hashbl
On 17/05/2021 18:12, Henrik K wrote:
> On Mon, May 17, 2021 at 03:02:57PM +0200, Marco wrote:
>>
>> So I have to add the accented character literally.
>> I can't understand why. Are there any limitations in the HashBL plugin with UTF-8?
>> Maybe I have misunderstood something.
>
> SA doesn't support UTF8 regex. It's just matching plain byte strings.
>
> Depends on normalize_charset setting too, for best compatibility you should
> match both latin and utf-8 raw byte variants: ü -> (?:\xfc|\xc3\xbc)
>
> https://cwiki.apache.org/confluence/display/SPAMASSASSIN/WritingRulesAdvanced

Hello Henrik,

thank you for the hints. I didn't realize that SA doesn't support
UTF-8 regex. As you suggest, I would like to write encoding-independent
rules in order to avoid surprises. I tried, but it doesn't work...

I have normalize_charset 1.
My text body is "Ciao, è proprio eccoci là si fa\nciao"

With
([\d\S\x{00E0}\x{c3a0}\x{00E8}\x{c3a8}\x{00EC}\x{c3ac}\x{00F2}\x{c3b2}\x{00F9}\x{c3b9}\x{00C0}\x{c380}\x{00C8}\x{c388}\x{00CC}\x{c38c}\x{00D2}\x{c392}\x{00D9}\x{c399}]+)
I see:
dbg: HashBL: __HASHBL_III_SPAM3: matches found: 'ciao,', 'è', 'proprio',
'eccoci', 'l?', 'si', 'fa', 'ciao'

'là' seems to have been badly encoded as 'l?', so the hash doesn't match.


If I write the characters literally:
([\d\Sàèìòù]+)
I see:
dbg: HashBL: __HASHBL_III_SPAM3: matches found: 'ciao,', 'è', 'proprio',
'eccoci', 'là', 'si', 'fa', 'ciao'

Now 'là' is encoded correctly and the hash matches.


Thank you very much
Kind Regards
Marco
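
Why a single mis-encoded character breaks the lookup can be sketched as follows. This is only an illustration in Python, assuming the hash is taken over the matched bytes with SHA-1 as in the `sha1` rule option above; it is not a description of how HashBL prepares the string internally.

```python
import hashlib

word = "là"
utf8_bytes = word.encode("utf-8")      # b'l\xc3\xa0'
latin1_bytes = word.encode("latin-1")  # b'l\xe0'
mangled_bytes = b"l?"                  # the form shown in the debug output

# SHA-1 works on bytes, so each byte sequence gets its own digest.
digests = {
    "utf-8": hashlib.sha1(utf8_bytes).hexdigest(),
    "latin-1": hashlib.sha1(latin1_bytes).hexdigest(),
    "mangled": hashlib.sha1(mangled_bytes).hexdigest(),
}

for label, digest in digests.items():
    print(label, digest)
```

All three digests differ, so the entries in the DNSBL have to be built from exactly the byte form that the rule extracts.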
Re: [Help] bodyre in hashbl
On Tue, May 18, 2021 at 03:04:12PM +0200, Marco wrote:
>
> Hello Henrik,
>
> thank you for the hints. I didn't realize that SA doesn't support UTF-8
> regex. As you suggest, I would like to write encoding-independent rules
> in order to avoid surprises. I tried, but it doesn't work...
>
> I have normalize_charset 1.
> My text body is "Ciao, è proprio eccoci là si fa\nciao"
>
> With
> ([\d\S\x{00E0}\x{c3a0}\x{00E8}\x{c3a8}\x{00EC}\x{c3ac}\x{00F2}\x{c3b2}\x{00F9}\x{c3b9}\x{00C0}\x{c380}\x{00C8}\x{c388}\x{00CC}\x{c38c}\x{00D2}\x{c392}\x{00D9}\x{c399}]+)

This is still UTF8/Unicode format: \x{xxxx}

https://www.fileformat.info/info/unicode/char/00e0/index.htm

Instead of \x{00E0}, you need to use \xC3\xA0 as you are matching _separate_
raw bytes. (untested, but assuming so from the url, too busy to test)
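
The difference between the code point escape and the raw bytes can be checked quickly. The sketch below is Python rather than SpamAssassin configuration, and `byte_escapes` is a hypothetical helper name introduced only for this illustration.

```python
def byte_escapes(char: str) -> str:
    """Return the UTF-8 bytes of char written as \\xNN regex escapes."""
    return "".join(f"\\x{b:02X}" for b in char.encode("utf-8"))

a_grave = chr(0x00E0)                  # code point U+00E0, i.e. 'à'
a_grave_utf8 = a_grave.encode("utf-8") # two separate bytes: C3 A0
a_grave_regex = byte_escapes(a_grave)  # the text \xC3\xA0

# 0xC3A0 read as ONE code point is a different character altogether,
# and its UTF-8 form is three bytes, not C3 A0.
other_char = chr(0xC3A0)
other_utf8 = other_char.encode("utf-8")

print(a_grave_utf8, a_grave_regex, other_utf8)
```

So `\x{c3a0}` does not mean "byte C3 followed by byte A0"; for the two-byte UTF-8 sequence each byte needs its own `\xNN` escape.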
Re: [Help] bodyre in hashbl
On 18/05/2021 15:27, Henrik K wrote:
> Instead of \x{00E0}, you need to use \xC3\xA0 as you are matching _separate_
> raw bytes. (untested, but assuming so from the url, too busy to test)

Yes, it works. I was confused; the SpamAssassin documentation is right.
I really do have to use non-capturing groups in order to match the UTF-8
characters, which makes for a very long regexp!

/([àèìòù])/ -->
/([(?:\xE0|\xC3\xA0)(?:\xE8|\xC3\xA8)(?:\xEC|\xC3\xAC)(?:\xF2|\xC3\xB2)(?:\xF9|\xC3\xB9)(?:\xC0|\xC3\x80)(?:\xC8|\xC3\x88)(?:\xCC|\xC3\x8C)(?:\xD2|\xC3\x92)(?:\xD9|\xC3\x99)])/

Thank you very much

Kind Regards
Marco
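
One caveat about the bracketed form above: inside `[...]` the characters `(`, `?`, `:`, `|` and `)` are ordinary class members, so the groups do not act as alternations there. The class still matches the accented text because it contains each of the individual bytes, but it also accepts more than intended. A minimal sketch, using Python's bytes-mode `re` as a stand-in and only one character for brevity:

```python
import re

# Bracketed form, reduced to a single character: a class of single bytes.
as_class = re.compile(rb"^[(?:\xE0|\xC3\xA0)]$")
# Alternation form: one Latin-1 byte OR the two-byte UTF-8 sequence.
as_group = re.compile(rb"^(?:\xE0|\xC3\xA0)$")

samples = [b"\xe0", b"\xc3\xa0", b"\xc3", b"|"]

class_results = [as_class.match(s) is not None for s in samples]
group_results = [as_group.match(s) is not None for s in samples]

print(class_results)  # [True, False, True, True]
print(group_results)  # [True, True, False, False]
```

Under that reading, a stricter rule would place the alternations outside the brackets, for example `(?:[\xE0\xE8\xEC\xF2\xF9]|\xC3[\xA0\xA8\xAC\xB2\xB9])` for the lowercase set; this variant is untested against SpamAssassin itself.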