Mailing List Archive

Request URI (path) normalisation
Hi,

this email basically arose from a discussion on #catalyst/irc.perl.org
where my (more or less) original question was about the format of the
string that Regex actions match against.

As nobody really seemed to know the answer, it turned into a discussion
of basic URI semantics and finally arrived, kind of, at the conclusion
that the current implementation of Regex (at least) is probably broken.
Part of that conclusion isn't from first-hand experience on my part,
but rather from Sebastian Riedel's examination of the source of the
current version; AFAICT, the Debian backport package (5.7006) I am
using behaves differently. So please forgive me should this invalidate
parts of the following.

So, to finally get to the meat of it: according to sri's examination,
Catalyst simply extracts the path component from the URI but doesn't
normalise it. This would mean that a request for http://bar/foo has a
different string matched against the regexes than a request for
http://bar/f%6fo . As those two URIs are mandated to be equivalent (to
refer to the same resource) by the URI RFC (3986, section 2.3), this
behaviour makes it pretty difficult to write standards-compliant
software: you'd have to match against
^(?:f|%66)(?:o|%6[fF]){2}$ for the example given above to meet the
requirements.
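
To illustrate the point, here is a minimal sketch in plain Perl (core
only). normalise_unreserved is a hypothetical helper, not something
Catalyst provides; it decodes only the escapes of RFC 3986 unreserved
characters, so that equivalent paths collapse to one string and a simple
regex is enough. The sample strings have no leading slash, on the
assumption that this is the form Regex actions get to see.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Catalyst): decode only percent-escapes
# that encode RFC 3986 "unreserved" characters (ALPHA, DIGIT, "-", ".",
# "_", "~") and upper-case the hex digits of every escape left in place.
sub normalise_unreserved {
    my ($path) = @_;
    $path =~ s{%([0-9A-Fa-f]{2})}{
        my $hex  = $1;
        my $char = chr(hex($hex));
        $char =~ /\A[A-Za-z0-9._~-]\z/ ? $char : '%' . uc($hex);
    }ge;
    return $path;
}

# The workaround regex needed when matching the raw, unnormalised path.
my $workaround = qr/^(?:f|%66)(?:o|%6[fF]){2}$/;

for my $raw ('foo', 'f%6fo', '%66%6F%6f') {
    my $normalised = normalise_unreserved($raw);
    printf "%-10s raw matches workaround: %-3s normalised matches ^foo\$: %s\n",
        $raw,
        ($raw        =~ $workaround ? 'yes' : 'no'),
        ($normalised =~ /^foo$/     ? 'yes' : 'no');
}
```

All three spellings should report a match in both columns; the difference
is that after normalisation the plain ^foo$ is sufficient.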

I've got no clue whether other action types may be affected by
this, too.

The behaviour I would consider sensible would be the normalisation
of the path in such a way that any two URI paths that are mandated
by the RFC to be equivalent will result in the exact same string,
and any two URI paths that are not mandated by the RFC to be
equivalent will result in different strings.

IMO, in addition, as many characters as possible should be in
unescaped form after normalisation. For the path alone, that
would mean that only slashes in path components would really have
to be escaped. I assume that also escaping the ASCII control range
might be a good idea for security reasons with regard to use
on syscall/shell interfaces. If it's supposed to be safe for direct
injection into a URI, any other URI reserved characters probably
should be escaped, too. But above all, I think the important
thing is consistent, documented normalisation, independent of the
engine.
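
A minimal sketch of the normalisation described above, again in plain
Perl. normalise_path is a hypothetical name, not a Catalyst API. It
assumes that "%" itself must stay escaped as well (otherwise the result
would be ambiguous), and it leaves octets above 0x7F as raw bytes without
making any character-encoding decision. Note that it also collapses an
escaped and a literal reserved character, such as %28 and (, which goes
further than the equivalences RFC 3986 mandates.

```perl
use strict;
use warnings;

# Hypothetical helper, not a Catalyst API: return one canonical string
# for a raw request path.
sub normalise_path {
    my ($raw_path) = @_;

    # Split on literal slashes first, so that an escaped slash (%2F)
    # stays inside its segment. The -1 limit keeps trailing empty segments.
    my @segments = split m{/}, $raw_path, -1;

    for my $segment (@segments) {
        # Decode every percent-escape in the segment to its octet.
        $segment =~ s{%([0-9A-Fa-f]{2})}{chr(hex($1))}ge;

        # Re-escape only what would be ambiguous or risky when decoded:
        # "/" (segment separator), "%" (escape introducer, an assumption
        # of this sketch) and the ASCII control range including DEL.
        $segment =~ s{([/%\x00-\x1F\x7F])}{sprintf('%%%02X', ord($1))}ge;
    }

    return join '/', @segments;
}

print normalise_path('/f%6fo/a%2fb/c%0ad'), "\n";   # prints /foo/a%2Fb/c%0Ad
```

Escaping further reserved characters for safe re-injection into a URI
would only mean widening the character class in the second substitution.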

Well, I guess that this email is somewhat open-ended so far.
But I don't really know what the next step should be - so, I'll
leave it at that. Please don't flame me for it ;-)

Florian

_______________________________________________
Catalyst-dev mailing list
Catalyst-dev@lists.scsys.co.uk
http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst-dev
Re: Request URI (path) normalisation
On Sun, Sep 14, 2008 at 8:04 PM, Florian Zumbiehl <florz@florz.de> wrote:
> Hi,
>
> this email basically arose from a discussion on #catalyst/irc.perl.org
> where my (more or less) original question was about the format of the
> string that Regex actions match against.
>
> As nobody really seemed to know the answer, it turned into a discussion
> of basic URI semantics and finally arrived, kind of, at the conclusion
> that the current implementation of Regex (at least) is probably broken.
> Part of that conclusion isn't from first-hand experience on my part,
> but rather from Sebastian Riedel's examination of the source of the
> current version; AFAICT, the Debian backport package (5.7006) I am
> using behaves differently. So please forgive me should this invalidate
> parts of the following.
>
> So, to finally get to the meat of it: according to sri's examination,
> Catalyst simply extracts the path component from the URI but doesn't
> normalise it. This would mean that a request for http://bar/foo has a
> different string matched against the regexes than a request for
> http://bar/f%6fo . As those two URIs are mandated to be equivalent (to
> refer to the same resource) by the URI RFC (3986, section 2.3), this
> behaviour makes it pretty difficult to write standards-compliant
> software: you'd have to match against
> ^(?:f|%66)(?:o|%6[fF]){2}$ for the example given above to meet the
> requirements.
>
> I've got no clue whether other action types may be affected by
> this, too.
>
> The behaviour I would consider sensible would be the normalisation
> of the path in such a way that any two URI paths that are mandated
> by the RFC to be equivalent will result in the exact same string,
> and any two URI paths that are not mandated by the RFC to be
> equivalent will result in different strings.
>
> IMO, in addition, as many characters as possible should be in
> unescaped form after normalisation. For the path alone, that
> would mean that only slashes in path components would really have
> to be escaped. I assume that also escaping the ASCII control range
> might be a good idea for security reasons with regard to use
> on syscall/shell interfaces. If it's supposed to be safe for direct
> injection into a URI, any other URI reserved characters probably
> should be escaped, too. But above all, I think the important
> thing is consistent, documented normalisation, independent of the
> engine.
>
> Well, I guess that this email is somewhat open-ended so far.
> But I don't really know what the next step should be - so, I'll
> leave it at that. Please don't flame me for it ;-)
>
> Florian

Not sure if I'm on the right track or not, but I think normalising
the URL would be a very good thing. I'm guessing the Regex problem is
related to
|
| sub foo :Local { my ($self, $c, @args) = @_; }
|
...where @args might contain "f%6fo" instead of the intended "foo".

I haven't dug into the source myself, but would there be any issues with
making the path "sane" before it's actually handled in any way?


--
Best regards,
Jan Henning Thorsen

Re: Request URI (path) normalisation
On Sun, Sep 14, 2008 at 08:04:31PM +0200, Florian Zumbiehl wrote:
> Hi,
>
> this email basically arose from a discussion on #catalyst/irc.perl.org
> where my (more or less) original question was about the format of the
> string that Regex actions match against.
>
> As nobody really seemed to know the answer, it turned into a discussion
> of basic URI semantics and finally arrived, kind of, at the conclusion
> that the current implementation of Regex (at least) is probably broken.
> Part of that conclusion isn't from first-hand experience on my part,
> but rather from Sebastian Riedel's examination of the source of the
> current version; AFAICT, the Debian backport package (5.7006) I am
> using behaves differently. So please forgive me should this invalidate
> parts of the following.

Yeah, sri implemented this broken in the first place.

> The behaviour I would consider sensible would be the normalisation
> of the path in such a way that any two URI paths that are mandated
> by the RFC to be equivalent will result in the exact same string,
> and any two URI paths that are not mandated by the RFC to be
> equivalent will result in different strings.

Backwards compatible patches to make it sane very welcome; I don't really
ever use Regex actions so I've no itch to scratch.

> IMO, in addition, as many characters as possible should be in
> unescaped form after normalisation. For the path alone, that
> would mean that only slashes in path components would really have
> to be escaped.

You also miss that things like () are marked by the URI standard as
being allowed to be used for internal subhierarchies so these have to be
kept intact as well.

I think I had a play with this a while back and determined I didn't have
time to do it right.

--
      Matt S Trout       Need help with your Catalyst or DBIx::Class project?
   Technical Director                    http://www.shadowcat.co.uk/catalyst/
 Shadowcat Systems Ltd.  Want a managed development or deployment platform?
http://chainsawblues.vox.com/            http://www.shadowcat.co.uk/servers/

Re: Request URI (path) normalisation
On 27.09.2008 at 19:30, Matt S Trout wrote:

> On Sun, Sep 14, 2008 at 08:04:31PM +0200, Florian Zumbiehl wrote:
>> Hi,
>>
>> this email basically arose from a discussion on #catalyst/irc.perl.org
>> where my (more or less) original question was about the format of the
>> string that Regex actions match against.
>>
>> As nobody really seemed to know the answer, it turned into a discussion
>> of basic URI semantics and finally arrived, kind of, at the conclusion
>> that the current implementation of Regex (at least) is probably broken.
>> Part of that conclusion isn't from first-hand experience on my part,
>> but rather from Sebastian Riedel's examination of the source of the
>> current version; AFAICT, the Debian backport package (5.7006) I am
>> using behaves differently. So please forgive me should this invalidate
>> parts of the following.
>
> Yeah, sri implemented this broken in the first place.
>
>> The behaviour I would consider sensible would be the normalisation
>> of the path in such a way that any two URI paths that are mandated
>> by the RFC to be equivalent will result in the exact same string,
>> and any two URI paths that are not mandated by the RFC to be
>> equivalent will result in different strings.
>
> Backwards compatible patches to make it sane very welcome; I don't really
> ever use Regex actions so I've no itch to scratch.
>
>> IMO, in addition, as many characters as possible should be in
>> unescaped form after normalisation. For the path alone, that
>> would mean that only slashes in path components would really have
>> to be escaped.
>
> You also miss that things like () are marked by the URI standard as
> being allowed to be used for internal subhierarchies so these have to be
> kept intact as well.
>
> I think I had a play with this a while back and determined I didn't have
> time to do it right.

Mojo::ByteStream::url_sanitize() should fix this.

--
sebastian
