Mailing List Archive: Trying to detect bogus end tags

Feb 27, 2004, 8:46 PM

I'm trying to come up with a way to detect bogus end tags, and so far I'm
not having much luck.

What I'm specifically trying to catch are things like

</table>
</belch></huntsville></delusion></wilma></boswell></attune>
</vasectomy></centum></surf></yeasty></molt></autocollimate>
</acrobat></harvest></gage></flagrant></fumble></nowadays>
</BODY>
</HTML>

Now, it looks like there is an html_tag_balance eval that would catch the
fact that there is no "<belch" to match the "</belch>" in the above hunk of
spam, if only there were some way that I could feed "belch" into the eval.
I can detect end tags easily enough with a regexp, but I can't find any way
that works to pull the found tag out and feed it to the eval routine within
an SA rule definition.

Alternately, is there a way to write a regexp that will let me look backward
for <belch once I have found </belch>? I can't seem to figure this one out
either.

Thanks,
Loren

Re: Trying to detect bogus end tags

quinlan at pathname

Feb 28, 2004, 12:22 AM

"Loren Wilton" <lwilton@earthlink.net> writes:

> I'm trying to come up with a way to detect bogus end tags, and so far I'm
> not having much luck.

3.0 will have a test for this, although it just looks for tags in
general. Someone could try enhancing the test to also notice when an
end element is used for a start-only tag.

These two tests overlap quite a bit; I'm leaving it to the score
optimizer to figure out the final scores:

OVERALL%     SPAM%     HAM%    S/O  RANK  SCORE  NAME
  387188    306777    80411  0.792  0.00   0.00  (all messages)
 100.000   79.2321  20.7679  0.792  0.00   0.00  (all messages as %)
   2.305    2.7818   0.4863  0.851  0.58   1.00  HTML_BADTAG_00_10
   4.236    5.3309   0.0597  0.989  0.92   1.00  HTML_BADTAG_10_20
   1.165    1.4678   0.0087  0.994  0.93   1.00  HTML_BADTAG_20_30
   1.917    2.4171   0.0075  0.997  0.93   1.00  HTML_BADTAG_30_40
  15.944   20.1234   0.0000  1.000  0.97   1.00  HTML_BADTAG_40_50
   0.454    0.5731   0.0000  1.000  0.94   1.00  HTML_BADTAG_50_60
   1.023    1.2915   0.0000  1.000  0.94   1.00  HTML_BADTAG_60_70
   0.367    0.4635   0.0000  1.000  0.94   1.00  HTML_BADTAG_70_80
   0.127    0.1604   0.0000  1.000  0.94   1.00  HTML_BADTAG_80_90
   0.015    0.0186   0.0000  1.000  0.94   1.00  HTML_BADTAG_90_100
   1.312    1.5777   0.2985  0.841  0.56   1.00  HTML_NONELEMENT_00_10
   0.600    0.7168   0.1530  0.824  0.53   1.00  HTML_NONELEMENT_10_20
   0.598    0.7409   0.0510  0.936  0.77   1.00  HTML_NONELEMENT_20_30
   2.936    3.6994   0.0236  0.994  0.93   1.00  HTML_NONELEMENT_30_40
   0.966    1.2159   0.0112  0.991  0.92   1.00  HTML_NONELEMENT_40_50
  15.548   19.6214   0.0075  1.000  0.97   1.00  HTML_NONELEMENT_50_60
   1.477    1.8626   0.0037  0.998  0.94   1.00  HTML_NONELEMENT_60_70
   1.409    1.7749   0.0112  0.994  0.92   1.00  HTML_NONELEMENT_70_80
   1.556    1.9627   0.0025  0.999  0.94   1.00  HTML_NONELEMENT_80_90
   1.153    1.4558   0.0000  1.000  0.94   1.00  HTML_NONELEMENT_90_100

For HTML_MESSAGE messages only:

OVERALL%     SPAM%     HAM%    S/O  RANK  SCORE  NAME
  265707    260392     5315  0.980  0.00   0.00  (all messages)
 100.000   97.9997   2.0003  0.980  0.00   0.00  (all messages as %)
   3.359    3.2774   7.3565  0.308  0.03   1.00  HTML_BADTAG_00_10
   6.173    6.2805   0.9031  0.874  0.65   1.00  HTML_BADTAG_10_20
   1.697    1.7293   0.1317  0.929  0.77   1.00  HTML_BADTAG_20_30
   2.793    2.8476   0.1129  0.962  0.86   1.00  HTML_BADTAG_30_40
  23.234   23.7081   0.0000  1.000  0.99   1.00  HTML_BADTAG_40_50
   0.662    0.6751   0.0000  1.000  0.96   1.00  HTML_BADTAG_50_60
   1.491    1.5216   0.0000  1.000  0.96   1.00  HTML_BADTAG_60_70
   0.535    0.5461   0.0000  1.000  0.96   1.00  HTML_BADTAG_70_80
   0.185    0.1889   0.0000  1.000  0.96   1.00  HTML_BADTAG_80_90
   0.021    0.0219   0.0000  1.000  0.96   1.00  HTML_BADTAG_90_100
   1.912    1.8587   4.5155  0.292  0.03   1.00  HTML_NONELEMENT_00_10
   0.874    0.8445   2.3142  0.267  0.02   1.00  HTML_NONELEMENT_10_20
   0.871    0.8729   0.7714  0.531  0.14   1.00  HTML_NONELEMENT_20_30
   4.278    4.3584   0.3575  0.924  0.76   1.00  HTML_NONELEMENT_30_40
   1.407    1.4325   0.1693  0.894  0.69   1.00  HTML_NONELEMENT_40_50
  22.657   23.1167   0.1129  0.995  0.98   1.00  HTML_NONELEMENT_50_60
   2.152    2.1944   0.0564  0.975  0.89   1.00  HTML_NONELEMENT_60_70
   2.053    2.0911   0.1693  0.925  0.76   1.00  HTML_NONELEMENT_70_80
   2.267    2.3123   0.0376  0.984  0.91   1.00  HTML_NONELEMENT_80_90
   1.681    1.7151   0.0000  1.000  0.96   1.00  HTML_NONELEMENT_90_100

Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting

Re: Trying to detect bogus end tags

johnh at aproposretail

Feb 29, 2004, 12:08 PM

On Fri, 2004-02-27 at 19:46, Loren Wilton wrote:
> I'm trying to come up with a way to detect bogus end tags, and so far I'm
> not having much luck.
>
> What I'm specifically trying to catch are things like
>
> </table>
> </belch></huntsville></delusion></wilma></boswell></attune>
> </vasectomy></centum></surf></yeasty></molt></autocollimate>
> </acrobat></harvest></gage></flagrant></fumble></nowadays>
> </BODY>
> </HTML>

The list of valid HTML tags is finite. You could try something like:

rawbody BOGUS_HTML_TAG /<\/(?<!(list|of|valid|tags|...)>)[a-z]+>/i

Here is where additive scoring would be of benefit. I wouldn't want to
score high on a handful of bogus tags, but a dozen or more should score
fairly high.

N.B.: I'm still having trouble wrapping my brain around zero-length
assertions - does the above look right?
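
On the zero-length assertion question: a lookbehind, (?<!...), tests the
text before the current position, while a lookahead, (?!...), tests the
text that follows it, which seems to be what is wanted immediately after
"</". The sketch below illustrates that reading in Python rather than as a
tested SpamAssassin rule. The short KNOWN_TAGS list and the function name
count_bogus_end_tags are placeholders invented for the example; they stand
in for the full list of valid tags, which is left elided here. Counting the
matches, rather than stopping at the first, is one way to approximate the
additive scoring idea.

```python
import re

# Hypothetical, deliberately incomplete whitelist; a real rule would need
# the complete list of valid HTML tag names.
KNOWN_TAGS = ["table", "body", "html", "font", "td", "tr", "b", "p"]

# Negative lookahead: right after "</", refuse to match if a known tag name
# followed by ">" comes next; otherwise accept any run of letters.
BOGUS_END_TAG = re.compile(
    r"</(?!(?:" + "|".join(KNOWN_TAGS) + r")>)[a-z]+>",
    re.IGNORECASE,
)


def count_bogus_end_tags(text):
    """Return how many end tags in text are not on the whitelist."""
    return len(BOGUS_END_TAG.findall(text))
```

A caller could then scale a score by the count, so that a handful of
unknown tags scores low and a dozen or more scores high.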

--
John Hardin KA7OHZ
Internal Systems Administrator/Guru voice: (425) 672-1304
Apropos Retail Management Systems, Inc. fax: (425) 672-0192
-----------------------------------------------------------------------
Failure to plan ahead on someone else's part does not constitute an
emergency on my part.
- David W. Barts in a.s.r

Re: Trying to detect bogus end tags

lwilton at earthlink

Feb 29, 2004, 5:35 PM

> The list of valid HTML tags is finite. You could try something like:
>
> rawbody BOGUS_HTML_TAG /<\/(?<!(list|of|valid|tags|...)>)[a-z]+>/i
>
> N.B.: I'm still having trouble wrapping my brain around zero-length
> assertions - does the above look right?

I think the above would do it for HTML tags. However, someone else pointed
out that there are a whole lot of other SGML-formatted things that can
appear in mail, and the list of tags for such things is essentially
infinite. Thus, the above test might not be manageable, which is why I was
trying to match an end tag to a missing begin tag. In my rather limited
understanding of things SGML-like, I don't think it is valid to have an end
tag without a corresponding (properly nested) begin tag. Assuming that is
correct, a check for any begin tag to match a given end tag (without
bothering with nesting and true balance checks) would be a big
step forward.
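
As an illustration of that simpler check, here is a rough Python sketch
rather than anything that can be dropped into an SA rule. It collects every
begin-tag name and every end-tag name with two regexps and reports the end
tags that never had a begin tag anywhere in the text. The function name
unmatched_end_tags and both patterns are assumptions made for the example,
and, as described, nesting and order are ignored entirely.

```python
import re

# A begin tag: "<name" followed by whitespace, ">" or "/".
START_TAG = re.compile(r"<([a-z][a-z0-9]*)[\s>/]", re.IGNORECASE)
# An end tag: "</name>", allowing whitespace before the ">".
END_TAG = re.compile(r"</([a-z][a-z0-9]*)\s*>", re.IGNORECASE)


def unmatched_end_tags(markup):
    """Return the set of end-tag names with no begin tag anywhere in markup."""
    started = {name.lower() for name in START_TAG.findall(markup)}
    ended = {name.lower() for name in END_TAG.findall(markup)}
    return ended - started
```

For a message body containing <table> ... </table></belch></wilma>, this
returns the names belch and wilma and leaves table alone.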

Loren

Re: Trying to detect bogus end tags

bds at jhb

Mar 1, 2004, 2:54 AM

On Saturday 28 February 2004 05:46, Loren Wilton wrote:
> I'm trying to come up with a way to detect bogus end tags, and so far
> I'm not having much luck.
>
> What I'm specifically trying to catch are things like
>
> </table>
> </belch></huntsville></delusion></wilma></boswell></attune>
> </vasectomy></centum></surf></yeasty></molt></autocollimate>
> </acrobat></harvest></gage></flagrant></fumble></nowadays>
> </BODY>
> </HTML>
>
> Now, it looks like there is an html_tag_balance eval that would catch
> the fact that there is no "<belch" to match the "</belch>" in the
> above hunk of spam, if only there were some way that I could feed
> "belch" into the eval. I can detect end tags easily enough with a
> regexp, but I can't find any way that works to pull the found tag out
> and feed it to the eval routine within an SA rule definition.

I see this too. I don't know the proper answer, but you could just look
for 20 or more tags in a row (the slash is optional below, so opening
tags count too):

rawbody TOO_MANY_CLOSE_TAGS /(\<[\/]?[^\>]{1,}\>){20,}/i

I don't think this is too effective in the long term, but it'll catch
some spam.

> Alternately, is there a way to write a regexp that will let me look
> backward for <belch once I have found </belch>? I can't seem to
> figure this one out either.
>
> Thanks,
> Loren

--
Berend De Schouwer

Re: Trying to detect bogus end tags

lwilton at earthlink

Mar 1, 2004, 12:56 PM

I've finally got a strange regexp that *almost* works. I can see it getting
the right results internally, but it never seems to return true or whatever
a regexp returns on success. At this point I have no idea if I'm dealing
with bugs, features, or a strong misunderstanding of the language.

If anyone wants to try to figure out what is wrong with it (or can see it at
a glance) here is the ugly little thing:

full BAD__HTML m|<BODY>(.*)</BODY>(??{ (scalar reverse $1) =~
/>([a-z]+)\/<(?!.*[\s>]\1<)/is; })|s

Loren

RE: Trying to detect bogus end tags

JunkEmail at rapidigm

Mar 2, 2004, 5:49 AM

Try these. They're not the prettiest, but they at least work most of the
time :)

# This takes care of tags that don't exist, such as <zebra>.
# The last / is in there so it doesn't freak out about closing tags.
# <KBD> is a valid tag, but I don't believe we'll see it all that much in email.
# The [^@] was added so it doesn't pick up <spammer@msn.com>.
rawbody   __MK_BAD_HTML_08   /<[^abcdefhilmopstuv\/!]+[^@]{1,80}>/i

# This takes care of closing tags that don't exist, such as </zebra>.
rawbody   __MK_BAD_HTML_09   /<\/[^abcdefhilmopstuv]/i

# Added in ? due to MS <?xml:blahblah> tag
rawbody   __MK_GOOD_HTML_01  /<\??xml/i
# Another MS Office casualty
rawbody   __MK_GOOD_HTML_02  /<\/xml>/i

meta      MK_BAD_HTML_11     HTML_MESSAGE && __MK_BAD_HTML_08 && !__MK_GOOD_HTML_01
describe  MK_BAD_HTML_11     Bad HTML form. HTML beginning tag that does not exist used.
score     MK_BAD_HTML_11     0.8

meta      MK_BAD_HTML_12     HTML_MESSAGE && __MK_BAD_HTML_09 && !__MK_GOOD_HTML_02
describe  MK_BAD_HTML_12     Bad HTML form. HTML closing tag that does not exist used.
score     MK_BAD_HTML_12     0.8

Mike

Re: Trying to detect bogus end tags

bds at jhb

Mar 2, 2004, 6:54 AM

On Monday 01 March 2004 11:54, Berend De Schouwer wrote:
> On Saturday 28 February 2004 05:46, Loren Wilton wrote:
> > I'm trying to come up with a way to detect bogus end tags, and so
> > far I'm not having much luck.
> >
> > What I'm specifically trying to catch are things like
> >
> > </table>
> > </belch></huntsville></delusion></wilma></boswell></attune>
> > </vasectomy></centum></surf></yeasty></molt></autocollimate>
> > </acrobat></harvest></gage></flagrant></fumble></nowadays>
> > </BODY>
> > </HTML>

This does catch them:

full __HTML_BAYES_POISON /(\<\/[a-zA-Z]{3,}\>[\s]*){20,}/
meta BDS_HTML_BAYES_POISON (HTML_MESSAGE && __HTML_BAYES_POISON)
describe BDS_HTML_BAYES_POISON Multiple garbage HTML close tag
score BDS_HTML_BAYES_POISON 2.0

You might get false positives if you regularly get HTML e-mail with real
runs like (/b)(/font)(/tr)(/td)(/table), etc., but I think that won't be
common, at least not 20 tags in a row. To counter false positives, I've
made it count only tag names of three or more letters, so (/b) and (/tr)
won't count, and I've used 'full' instead of 'rawbody' so I don't have
a problem with \r\n.
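
To see how the threshold plays out on the sample from the start of the
thread, here is a Python transcription of the pattern. It is only an
illustration and assumes Python's re module behaves like Perl for this
simple expression; the names CLOSE_TAG_RUN, looks_like_poison and SAMPLE
are made up for the example. The three garbage lines hold 18 end tags, and
the surrounding </table>, </BODY> and </HTML> bring the run to 21, which
is just enough to clear the minimum of 20.

```python
import re

# 20 or more consecutive end tags whose names have at least three letters,
# separated by nothing but optional whitespace (including line breaks).
CLOSE_TAG_RUN = re.compile(r"(?:</[a-zA-Z]{3,}>\s*){20,}")

SAMPLE = """</table>
</belch></huntsville></delusion></wilma></boswell></attune>
</vasectomy></centum></surf></yeasty></molt></autocollimate>
</acrobat></harvest></gage></flagrant></fumble></nowadays>
</BODY>
</HTML>
"""


def looks_like_poison(text):
    """True if text contains a long enough run of end tags."""
    return CLOSE_TAG_RUN.search(text) is not None
```

Short names such as </b> or </tr> break a run, so an ordinary closing
sequence like </b></font></tr></td></table> does not trigger it.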

The ideal solution is for SpamAssassin, or a plugin, to search for
unmatched HTML tags in HTML MIME parts, except for (p) and (br), and
score 0.1 for each unmatched tag. In the absence of perfection, try
the rule above.
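
For what it's worth, the per-tag scoring idea could look roughly like the
Python sketch below. This is not SpamAssassin code or an existing plugin;
it leans on the standard library's html.parser, and the class name
UnmatchedTagCounter, the function unmatched_tag_score and the IGNORED set
are all assumptions made for the example. It counts end tags with no
earlier begin tag plus begin tags that are never closed, skips p and br,
and charges 0.1 for each.

```python
from html.parser import HTMLParser

IGNORED = {"p", "br"}   # tags exempted from the check, as suggested above
SCORE_PER_TAG = 0.1     # per-tag score taken from the suggestion above


class UnmatchedTagCounter(HTMLParser):
    """Tracks begin/end tags by name, ignoring nesting order."""

    def __init__(self):
        super().__init__()
        self.open_counts = {}   # tag name -> begin tags not yet closed
        self.unmatched = 0      # end tags seen with no begin tag pending

    def handle_starttag(self, tag, attrs):
        if tag not in IGNORED:
            self.open_counts[tag] = self.open_counts.get(tag, 0) + 1

    def handle_endtag(self, tag):
        if tag in IGNORED:
            return
        if self.open_counts.get(tag, 0) > 0:
            self.open_counts[tag] -= 1
        else:
            self.unmatched += 1


def unmatched_tag_score(html_text):
    """Return 0.1 for every unmatched begin or end tag in html_text."""
    parser = UnmatchedTagCounter()
    parser.feed(html_text)
    parser.close()
    # Begin tags that were never closed count as unmatched too.
    unmatched = parser.unmatched + sum(parser.open_counts.values())
    return round(unmatched * SCORE_PER_TAG, 1)
```

In practice the ignore set would probably also need other tags that are
normally left unclosed, such as img or hr, or ordinary mail would pick up
points.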

--
Berend De Schouwer