Mailing List Archive: Problem with matching regex against long body

lwilton at earthlink

Nov 3, 2020, 10:32 AM

Post #1 of 8

I'm getting lots of spams that are about 100+K long. The spam body contains
two blocks of random news text copied from fox news or msnbc or the like,
enclosed in a zero-point font block. I'm trying to match this simple pattern
to give some extra points, but I can't seem to get it to work. I'm wondering
if there is some buffer limit in SA that is preventing the match from
working.

If I try

rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]*<'s

I don't get a match, even though I know there is a </font> about 50K into
the message.

But if I try

rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]*'s

I do get a match. Note all I've done is remove the final "<" from the match
text.

If I try

rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]{990,}'s

I get a match.

but if I try

rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]{997,}'s

I don't get a match, but I know there is over 100K of text after that font
tag.

Can anyone see something I'm doing wrong, or know of some limitation in SA
that will prevent these long matches from working?

Thanks,

Loren

Re: Problem with matching regex against long body

lwilton at earthlink

Nov 3, 2020, 10:51 AM

Post #2 of 8

> basics of escaping at least *anything* won't do any harm
>
> php > echo preg_quote('<font style="font-size:0px">[^<]*<');
> \<font style\="font\-size\:0px"\>\[\^\<\]\*\<

Well, escaping the [^<]* part certainly will do harm, since it will turn it
from a character class into literal characters that don't exist in the text
to be matched.

But I've tried escaping the standalone characters like <, =, :, etc., and
that doesn't help. I have many regex patterns without these escaped, so I'm
pretty sure they work as expected normally, and should here too.

Re: Problem with matching regex against long body

rwmaillists at googlemail

Nov 3, 2020, 12:09 PM

Post #3 of 8

On Tue, 3 Nov 2020 10:32:41 -0800
Loren Wilton wrote:

> I'm getting lots of spams that are about 100+K long. The spam body
> contains two blocks of random news text copied from fox news or msnbc
> or the like, enclosed in a zero-point font block. I'm trying to match
> this simple pattern to give some extra points, but I can't seem to
> get it to work. I'm wondering if there is some buffer limit in SA
> that is preventing the match from working.
>

See rawbody_part_scan in the docs.

Re: Problem with matching regex against long body

rwmaillists at googlemail

Nov 3, 2020, 1:28 PM

Post #4 of 8

On Tue, 3 Nov 2020 20:09:46 +0000
RW wrote:

> On Tue, 3 Nov 2020 10:32:41 -0800
> Loren Wilton wrote:
>
> > I'm getting lots of spams that are about 100+K long. The spam body
> > contains two blocks of random news text copied from fox news or
> > msnbc or the like, enclosed in a zero-point font block. I'm trying
> > to match this simple pattern to give some extra points, but I can't
> > seem to get it to work. I'm wondering if there is some buffer limit
> > in SA that is preventing the match from working.
> >
>
> See rawbody_part_scan in the docs.

Also, the chunking of the rawbody into 2-4 kB blocks may make a
difference.
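
To make the chunking effect concrete, here is a minimal Python sketch. It is
not SpamAssassin's code: the fixed 2048-character chunk size, the helper
names (chunks, matches_any_chunk) and the sample message are all assumptions
chosen for illustration, and Python's re module stands in for Perl's regex
engine. It only models the idea that a rawbody rule is tried against one
chunk at a time, so a match can never span two chunks.

```python
import re

CHUNK_SIZE = 2048  # assumed for illustration; real chunk sizes vary


def chunks(text, size=CHUNK_SIZE):
    """Split text into fixed-size pieces (a simplification of real chunking)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def matches_any_chunk(pattern, text, size=CHUNK_SIZE):
    """Return True if the pattern matches entirely inside a single chunk."""
    rx = re.compile(pattern, re.S)  # re.S plays the role of the /s modifier
    return any(rx.search(chunk) for chunk in chunks(text, size))


TAG = '<font style="font-size:0px">'  # 28 characters

# Hypothetical message: the tag starts at offset 1000 and is followed by
# 50,000 filler characters before the closing tag.
message = "x" * 1000 + TAG + "a" * 50000 + "</font>"

print(matches_any_chunk(TAG + r'[^<]*<', message))       # closing "<" is in a later chunk
print(matches_any_chunk(TAG + r'[^<]*', message))        # open-ended, fits in one chunk
print(matches_any_chunk(TAG + r'[^<]{1020,}', message))  # exactly fills the first chunk
print(matches_any_chunk(TAG + r'[^<]{1021,}', message))  # one character too many
```

In this made-up layout the tag starts 1000 characters into a 2048-character
chunk, so only 1020 characters follow it in that chunk. That is one way a
limit well under the chunk size could show up, though it is only a guess at
what happens with any particular message.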

Re: Problem with matching regex against long body

lwilton at earthlink

Nov 3, 2020, 1:39 PM

Post #5 of 8

>> See rawbody_part_scan in the docs.
>
> Also, the chunking of the rawbody into 2-4 kB blocks may make a
> difference.

I wasn't able to find rawbody_part_scan in any of the docs that I managed to
find, but after digging into the source I found the chunking logic and dug
out the 2K limit. I'm not sure why I was hitting a limit at just under 1K; I
can only guess that the header was included in the first rawbody chunk,
which seems a little unlikely.

I was able to get the rule to work using a full rule, but I sure hated to do
that, since I lose the base64 decoding of the body, and full rules are ugly
and potentially dangerously inefficient. But at least it worked. Fortunately
these spams are plain text encoded.

Re: Problem with matching regex against long body

jhardin at impsec

Nov 3, 2020, 3:49 PM

Post #6 of 8

On Tue, 3 Nov 2020, Loren Wilton wrote:

> I'm getting lots of spams that are about 100+K long. The spam body contains
> two blocks of random news text copied from fox news or msnbc or the like,
> enclosed in a zero-point font block. I'm trying to match this simple pattern
> to give some extra points, but I can't seem to get it to work. I'm wondering
> if there is some buffer limit in SA that is preventing the match from
> working.

There is.

> If I try
>
> rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]*<'s
>
> I don't get a match, even though I know there is a </font> about 50K into the
> message.

The closing tag is past the end of the cutoff.

> But if I try
>
> rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]*'s
>
> I do get a match. Note all I've done is remove the final "<" from the match
> text.
>
> If I try
>
> rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]{990,}'s
>
> I get a match.

That's what you should do. Don't try to cut it too close, though, as all
the spammer would need to do to bypass that is move the garbage block a
little further back in the message. I'd suggest {900} or even {500} - 500
characters of zero-point text in a message body is not plausibly
legitimate.

You don't need the "," - it doesn't matter what is there beyond your
cutoff, don't waste time matching it. Basic version:

rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]{500}'s

You may also want to stick optional whitespace in there to avoid trivial
bypass:

rawbody LONG_HIDDEN m'<font\s+style\s*=\s*"font-size:0px"\s*>[^<]{500}'s

There's also the possibility of adding a typeface or other options to the
<font> tag, which would bypass your simple rule. And HTML is not
case-sensitive. And avoid * on complex stuff when matching arbitrarily
long texts, which can lead to runaway backtracking and scan timeouts.

rawbody LONG_HIDDEN m'<font\s[^>]{0,99}style\s*=\s*"font-size:0px"[^>]{0,99}>[^<]{500}'si

(Caveat: not tested, just off-the-cuff. There's room for improvement in
the style spec as well.)
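
A quick way to sanity-check patterns like these before putting them in a
rule file is to try them on a few hand-made variants. The sketch below uses
Python's re module in place of Perl, with re.S and re.I standing in for the
/s and /i modifiers; the constructs used here should behave the same in
both engines, but that is an assumption. The sample strings are invented
for the test, not taken from real spam.

```python
import re

# The basic rule and the most hardened rule from above, as Python regexes.
SIMPLE = re.compile(r'<font style="font-size:0px">[^<]{500}', re.S)
HARDENED = re.compile(
    r'<font\s[^>]{0,99}style\s*=\s*"font-size:0px"[^>]{0,99}>[^<]{500}',
    re.S | re.I,
)

FILLER = "a" * 600  # invented stand-in for the hidden news text

samples = {
    "original":    '<font style="font-size:0px">' + FILLER,
    "upper case":  '<FONT STYLE="font-size:0px">' + FILLER,
    "extra attrs": '<font face="Arial" style = "font-size:0px" color="#fff">' + FILLER,
    "short block": '<font style="font-size:0px">' + "a" * 100 + "</font>",
    "spaced css":  '<font style="font-size: 0px">' + FILLER,
}

# For each sample, record (simple rule hit?, hardened rule hit?).
results = {name: (bool(SIMPLE.search(text)), bool(HARDENED.search(text)))
           for name, text in samples.items()}

for name, (simple_hit, hardened_hit) in results.items():
    print(f"{name:12} simple={simple_hit!s:5} hardened={hardened_hit!s:5}")
```

The last sample is the kind of gap the caveat about the style spec refers
to: a space after the colon is enough to slip past both patterns.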

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
USMC Rules of Gunfighting #7: In ten years nobody will remember
the details of caliber, stance, or tactics. They will only remember
who lived.
-----------------------------------------------------------------------
Today: the Presidential Election

Re: Problem with matching regex against long body

lwilton at earthlink

Nov 3, 2020, 4:18 PM

Post #7 of 8

> You may also want to stick optional whitespace in there to avoid trivial
> bypass:
> There's also the possibility of adding a typeface or other options to the
> <font> tag, which would bypass your simple rule. And HTML is not
> case-sensitive. And avoid * on complex stuff when matching arbitrarily
> long texts, which can lead to runaway backtracking and scan timeouts.

Thanks. This spammer is prolific, but seems to be very stupid and pattern
based, hardly ever varying what he puts in some parts of the message. I've
been seeing this pattern without change for about 3 months now. I almost
never have to tweak a rule for his stuff to account for a possible
variation.

It would be interesting (at least to me) to run a set of test rules against
the SA corpus to try to determine the optimal cutoff point for a good S/O
as regards length of 0-point text. I personally have absolutely no idea what
a "reasonable" size is for 0-point text in an email. Personally I'd be
inclined to say that any 0-point text isn't reasonable, but mass marketers
seem to believe otherwise.

Re: Problem with matching regex against long body

rwmaillists at googlemail

Nov 3, 2020, 4:30 PM

Post #8 of 8

On Tue, 3 Nov 2020 13:39:47 -0800
Loren Wilton wrote:

> >> See rawbody_part_scan in the docs.
> >
> > Also, the chunking of the rawbody into 2-4 kB blocks may make a
> > difference.
>
> I wasn't able to find rawbody_part_scan in any of the docs that I
> managed to find, but after digging into the source I found the
> chunking logic and dug out the 2K limit. I'm not sure why I was
> hitting a limit at just under 1K, I can only guess that the header
> was included in the first rawbody chunk, which seems a little
> unlikely.

The problem is that, in general, the "<font style=" match could occur
anywhere within a chunk, so no cutoff is guaranteed to work; a shorter
one just improves the probability of success.
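
A rough back-of-the-envelope version of this, as a Python sketch. The
assumptions are deliberately crude and are not taken from SpamAssassin:
fixed chunks of 2048 characters, a 28-character opening tag (the length of
the literal tag in the original rule), and a tag start that is equally
likely at any offset in its chunk. The function name fit_fraction is made
up for this example. Real chunk sizes vary, so treat the numbers as an
illustration of the trend only.

```python
def fit_fraction(n, chunk_size=2048, tag_len=28):
    """Fraction of tag start offsets within a chunk that leave room for
    the whole tag plus n further characters before the chunk ends."""
    # Offsets o with o + tag_len + n <= chunk_size,
    # i.e. o = 0 .. chunk_size - tag_len - n.
    fitting_offsets = chunk_size - tag_len - n + 1
    return max(0, fitting_offsets) / chunk_size


for n in (500, 900, 990, 2000):
    print(f"{{{n}}}: {fit_fraction(n):.0%} of offsets leave enough room")
```

Under these assumptions a cutoff of 500 fits at about three quarters of the
possible tag positions and 990 at about half. A message with two hidden
blocks gets two chances, which should help further.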