Mailing List Archive

Mal formed urls
I was just working on some rules to catch the current crop of malformed
URLs used to escape detection by solutions that extract URLs from emails and
compare them to known bad URLs, and I am wondering whether SpamAssassin's
extraction patterns take this into account.

For instance:

https:www.google.com/mail
https:\/www.google.com/mail
https:\\www.google.com/mail

All of these will get you to Gmail, because the technical spec doesn't
actually require \\ after the colon.
Will SpamAssassin still extract and normalize the URLs above? I was hoping
to avoid digging through the source to find out.

Rick
Re: Mal formed urls
On Thu, 25 Feb 2021, Rick Cooper wrote:

> I was just working on some rules to catch the current crop of mal formed
> urls used to escape detection by solutions that extract urls from emails and
> compare them to known bad urls and I am wondering if spamassassin's patterns
> for extraction take this into account?
>
> For instance:
>
> https:www.google.com/mail
> https:\/www.google.com/mail
> https:\\www.google.com/mail
>
> Will all work at getting you to gmail because the technical spec doesn't
> actually require \\ after the colon.
> Will spamassassin still extract and normalize the urls above? I was hoping
> to avoid digging through the source to find out.

Yes, all of those do get detected and normalized.

http:fnord01.com/blah
http:\/fnord02.com/blah
http:/\fnord03.com/blah
http:\\fnord04.com/blah

Feb 25 13:24:03.445 [13854] dbg: rules: ran uri rule __ALL_URI ======> got hit: "http://fnord03.com/blah"
Feb 25 13:24:03.446 [13854] dbg: rules: ran uri rule __ALL_URI ======> got hit: "http://fnord02.com/blah"
Feb 25 13:24:03.447 [13854] dbg: rules: ran uri rule __ALL_URI ======> got hit: "http://fnord01.com/blah"
Feb 25 13:24:03.447 [13854] dbg: rules: ran uri rule __ALL_URI ======> got hit: "http://fnord04.com/blah"
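
For readers who want to reproduce this kind of normalization outside SpamAssassin, here is a minimal Python sketch. It is not SpamAssassin's actual code (SpamAssassin is written in Perl); the function name and the regex are assumptions made for illustration. It only shows that the four sample forms above collapse to the same conventional "//" form.

```python
import re

# Hypothetical pattern: an http/https scheme, a colon, then any run
# (possibly empty) of forward slashes and backslashes before the host.
_SEPARATOR = re.compile(r'^(https?):[/\\]*', re.IGNORECASE)

def normalize_scheme_separator(url: str) -> str:
    """Collapse any mix of slashes and backslashes after the scheme into '//'."""
    return _SEPARATOR.sub(lambda m: m.group(1).lower() + '://', url, count=1)

# The four sample forms from the message above, as raw strings so the
# backslashes are taken literally.
samples = [
    r'http:fnord01.com/blah',
    r'http:\/fnord02.com/blah',
    r'http:/\fnord03.com/blah',
    r'http:\\fnord04.com/blah',
]
for s in samples:
    print(normalize_scheme_separator(s))
```

Each sample prints in the form http://fnordNN.com/blah. The sketch is deliberately lossy: it would also rewrite "http:///host" to "http://host", which is acceptable for matching against a blocklist but not for faithful parsing.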


--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
Re: Mal formed urls
On 25 Feb 2021, at 13:37, Rick Cooper wrote:

> I was just working on some rules to catch the current crop of mal
> formed
> urls used to escape detection by solutions that extract urls from
> emails and
> compare them to known bad urls and I am wondering if spamassassin's
> patterns
> for extraction take this into account?
>
> For instance:
>
> https:www.google.com/mail
> https:\/www.google.com/mail
> https:\\www.google.com/mail
>
> Will all work at getting you to gmail because the technical spec
> doesn't
> actually require \\ after the colon.

Of course not: A http: URI must NOT contain '\\' after the colon, it
MUST contain '//' after the colon. See
https://tools.ietf.org/html/rfc7230#section-2.7.1 which is the technical
spec for the formal syntax of a http URI. OTOH, there are URI schemes
which do not include '//' (e.g. mailto:) so any tool that is doing broad
URI detection can't be too picky.
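
To illustrate why a strict parser misses these, here is a small Python sketch using the standard library's urllib.parse.urlsplit, which only recognizes an authority when the scheme is followed by "//". The helper name strict_host and the address user@example.com are assumptions for illustration; none of this is how SpamAssassin works internally.

```python
from urllib.parse import urlsplit

def strict_host(url: str) -> str:
    """Return the authority found by a strict generic-syntax split, or '' if none."""
    return urlsplit(url).netloc

for url in (
    'https://www.google.com/mail',   # conventional form
    'https:www.google.com/mail',     # no separator after the colon
    r'https:\/www.google.com/mail',  # backslash plus slash
    'mailto:user@example.com',       # a scheme that never uses '//'
):
    parts = urlsplit(url)
    print(f'{url!r}: scheme={parts.scheme!r} netloc={parts.netloc!r} path={parts.path!r}')
```

Only the first URL yields a non-empty netloc; for the malformed ones the hostname ends up in the path, so a filter that looks only at the authority would never compare it against a blocklist.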

What flavors of garbage almost-URIs will work in a browser very much
depends on the whims of browser developers, and whether those are
'clickable' in your preferred MUA is dependent on the gullibility of
your MUA author.

SpamAssassin traditionally has assumed that there will always be some
MUA and browser authors who lack any sense of caution or prudence, so SA
is VERY loose with what it will consider as maybe being a hostname in
something that could be a URI in some obscure or novel scheme.

> Will spamassassin still extract and normalize the urls above?

Yes, it will see all 3 as the same canonicalized URI.

> I was hoping
> to avoid digging through the source to find out.

No need to dig through the source: you can see which URIs SpamAssassin
detects in a message (trimmed of the parts after the hostname) by
manually testing it with 'spamassassin -D uri'. Note that SA will only
show one instance of otherwise identical URIs after trimming and
canonicalization.
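
If you want to collect the detected URIs from saved debug output rather than read it by eye, a rough Python sketch follows. It assumes lines in the "got hit" format John posted earlier in the thread (which come from the rules debug channel); the exact lines printed by 'spamassassin -D uri' may differ, and the function name is hypothetical.

```python
import re
from typing import List

# Matches the rule name and the quoted URI in a "got hit" debug line,
# in the format shown earlier in this thread (an assumption about the input).
_HIT = re.compile(r'dbg: rules: ran uri rule (\S+) =+> got hit: "([^"]*)"')

def uris_from_debug(text: str) -> List[str]:
    """Return the distinct URIs reported in 'got hit' lines, in first-seen order."""
    seen = {}
    for line in text.splitlines():
        m = _HIT.search(line)
        if m:
            seen.setdefault(m.group(2), None)  # dict keeps insertion order
    return list(seen)

sample = (
    'Feb 25 13:24:03.445 [13854] dbg: rules: ran uri rule __ALL_URI '
    '======> got hit: "http://fnord03.com/blah"\n'
    'Feb 25 13:24:03.446 [13854] dbg: rules: ran uri rule __ALL_URI '
    '======> got hit: "http://fnord02.com/blah"\n'
)
print(uris_from_debug(sample))
```

The function takes text rather than running the command itself, so it can be fed whatever debug output you have captured to a file.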

--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
RE: Mal formed urls
Bill Cole wrote:
> On 25 Feb 2021, at 13:37, Rick Cooper wrote:
>
>> I was just working on some rules to catch the current crop of mal
>> formed urls used to escape detection by solutions that extract urls
>> from emails and compare them to known bad urls and I am wondering if
>> spamassassin's patterns for extraction take this into account?
>>
>> For instance:
>>
>> https:www.google.com/mail
>> https:\/www.google.com/mail
>> https:\\www.google.com/mail
>>
>> Will all work at getting you to gmail because the technical spec
>> doesn't actually require \\ after the colon.
>
> Of course not: A http: URI must NOT contain '\\' after the colon, it
> MUST contain '//' after the colon. See

Sorry, the \\ is a typo, since that would be the beginning of a UNC path on
a Windows box.

As far as I can tell, the authority/path-abempty portion of a URI is optional
and must begin with // but can be empty.
Hence https:www.google.com or https:\/www.google.com/. I have noticed that
every browser I tested normalizes it back to the conventional //. But my
question was, given that this is apparently an issue with some solutions'
parsing of URIs, does SA extract them? As both you and John pointed out, it
does, so I am happy.


Re: Mal formed urls
On 25 Feb 2021, at 17:14, Rick Cooper wrote:

> As far as I can tell the authority/path-abempty portion of a uri is
> optional
> and must begin with // but can be empty

No, https://tools.ietf.org/html/rfc7230#section-2.7.1 gives the
definition in ABNF, a strictly-defined syntax for strictly defining
other syntaxes. The "//" part denotes a mandatory literal string, in
the same way that the "http:" part is a mandatory literal string. The
'authority' and 'path-abempty' parts are distinct mandatory named
components defined in RFC 3986. Its text states that an authority is
*preceded by* '//' (as it is in the spec of the http: URI), while its
ABNF definition of authority (which is usually just a 'host' component)
does not include '//' at all; i.e., an authority component itself does
not include the preceding '//'.

Yeah, I know: pedantry. RFCs are intrinsically pedantic.
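
The ABNF point can be made concrete with a small Python sketch. The rule in RFC 7230 section 2.7.1 has the shape "http:" "//" authority path-abempty [ "?" query ] [ "#" fragment ]; the regex below is a simplified stand-in for that shape, not the full RFC 3986 grammar, and the function name is hypothetical. It accepts https as well, on the assumption that the same shape applies.

```python
import re

# Simplified stand-in for the http-URI rule: the scheme and "//" are
# mandatory literals, the authority must be non-empty, and the path is
# either empty or begins with "/". The authority pattern is deliberately
# loose: any run of characters other than slash, backslash, "?", "#"
# and whitespace.
_HTTP_URI = re.compile(
    r'^https?://'        # mandatory literal scheme and "//"
    r'[^/?#\\\s]+'       # non-empty authority (simplified)
    r'(?:/[^?#\s]*)?'    # path-abempty: empty, or starting with "/"
    r'(?:\?[^#\s]*)?'    # optional query
    r'(?:#\S*)?$',       # optional fragment
    re.IGNORECASE,
)

def is_strict_http_uri(uri: str) -> bool:
    """True if the string has the literal 'http://' or 'https://' plus an authority."""
    return _HTTP_URI.match(uri) is not None

for candidate in (
    'https://www.google.com/mail',
    'https:www.google.com/mail',
    r'https:\/www.google.com/mail',
    r'https:\\www.google.com/mail',
):
    print(candidate, is_strict_http_uri(candidate))
```

Only the first candidate passes. The other three are exactly the forms that browsers may repair but that the formal syntax does not allow.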

Incidentally, earlier this week there was a blog post by a security firm
decrying such obfuscation of URIs in phishing email as if it were a
cutting edge new tactic for bypassing filters. It is neither new nor
does it fool any decent filters.


--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire