Mailing List Archive

# in file names...
> > [# in a directory index]

Ok, how about:

% touch "foo^V^Mbar" ie foo, ctrl-M, bar

This produces a file that:

a) looks like sh*t on the screen
b) can't be found when you click on the link
c) is a valid UNIX file

There are probably others when you get into doing odd things with
control characters.

Ay.
Re: # in file names...
Seems like ';' is a weird character too. Roy F's rfc1808 mentions
that ';' is a valid component of a URL, though personally I've not
come across one of these. Check out:

http://www.w3.org/pub/WWW/Addressing/rfc1808.txt

for some more explanation and some examples of valid URL syntax.

Ay.
Re: # in file names...
>
> > > [# in a directory index]
>
> Ok, how about:
>
> % touch "foo^V^Mbar" ie foo, ctrl-M, bar
>
> This produces a file that:
>
> a) looks like sh*t on the screen

Well, some kind of escaping needs to be done for the text, too. That could
take a little more discussion than fixing the URI.

> b) can't be found when you click on the link
> c) is a valid UNIX file
>
> There are probably others when you get into doing odd things with
> control characters.
>
> Ay.

Hmmm ... good point. OK, here's a new improved patch (if someone with The Power
could put it in the patches dir, please):

From: ben@algroup.co.uk (Ben Laurie)
Subject: Escape URIs more correctly
Affects: util.c
Changelog: Apache failed to convert # to %23 in directory listings, and various
other dodgy characters.

*** ../../apache_0.8.14/src/util.c Tue Sep 19 17:05:00 1995
--- util.c Sun Oct 1 18:01:37 1995
***************
*** 463,476 ****
}
}

! #define c2x(what,where) sprintf(where,"%%%2x",what)

char *escape_uri(pool *p, char *uri) {
register int x,y;
char *copy = palloc (p, 3 * strlen (uri) + 1);

for(x=0,y=0; uri[x]; x++,y++) {
! if (ind (":% ?+&",(copy[y] = uri[x])) != -1) {
c2x(uri[x],&copy[y]);
y+=2;
}
--- 463,476 ----
}
}

! #define c2x(what,where) sprintf(where,"%%%02x",what)

char *escape_uri(pool *p, char *uri) {
register int x,y;
char *copy = palloc (p, 3 * strlen (uri) + 1);

for(x=0,y=0; uri[x]; x++,y++) {
! if (ind (":% ?+&#",(copy[y] = uri[x])) != -1 || uri[x] < 0x20 || uri[x] > 0x7e) {
c2x(uri[x],&copy[y]);
y+=2;
}
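
The c2x change in the patch is easy to miss, so here is a minimal C sketch of
what the two format strings produce. The function names hex_escape and
hex_escape_old are invented for this illustration and are not part of util.c.
The cast to unsigned is also an assumption of the sketch rather than part of
the patch: where char is signed, a byte above 0x7f would otherwise be
sign-extended and print more than two hex digits.

```c
#include <stdio.h>

/* Patched format: always two hex digits, zero padded.
 * out must have room for 4 bytes ('%', two digits, NUL). */
void hex_escape(unsigned char c, char *out)
{
    sprintf(out, "%%%02x", (unsigned int)c);
}

/* Old format, kept only for comparison: values below 0x10 are
 * padded with a space instead of a zero. */
void hex_escape_old(unsigned char c, char *out)
{
    sprintf(out, "%%%2x", (unsigned int)c);
}
```

With the old format a ctrl-M (0x0d) comes out as "% d", which is not a valid
escape; the patched format gives "%0d".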

--
Ben Laurie Phone: +44 (181) 994 6435
Freelance Consultant Fax: +44 (181) 994 6472
and Technical Director Email: ben@algroup.co.uk
A.L. Digital Ltd,
London, England.
Re: # in file names...
>
> Ben wrote:
> > > Ok, how about:
> > >
> > > % touch "foo^V^Mbar" ie foo, ctrl-M, bar
> > >
> > > This produces a file that:
> > >
> > > a) looks like sh*t on the screen
> >
> > Well, some kind of escaping needs to be done for the text, too. That could
> > take a little more discussion than fixing the URI.
>
> Nearly there. Note that escape_uri is a misnomer; it should really be called
> escape_http_path, and it is currently trying to do two things.
>
> 1. Escape a path to make a valid URL path.
> 2. Escape a URL path so that it can be used in an HTML document.
>
> For 1, it needs to % escape _all_ characters except for
> a-z A-Z 0-9 $ - _ . + ! * ' ( ) , : @ & =

This is not my reading of RFC 1808. There the "unreserved" characters are
defined to be "alpha | digit | safe | extra". Alpha and digit are as we expect,
safe is "$-_.+" and extra is "!*'(),". It may be that there are additional
characters which can safely be used in the context of an FTP URL, but there
is no harm in escaping them. Section 5.3 specifically recommends against the
unescaped use of ":", and ":@&=" are all reserved in a generic-RL.

>
> For 2, only the & needs to be escaped, assuming the HREF is enclosed in
> double quotes ("), so all characters except for
> a-z A-Z 0-9 $ - _ . + ! * ' ( ) , : @ =
> should be escaped.

When does Apache need to do this?

>
> The current routine escapes : and + unnecessarily. If it were being used
> for escaping other parts of a URL (the query string perhaps), then it could
> legitimately escape ':'. However, the only significant use of escape_uri
> is by mod_dir.c; all other calls to it are immediately followed by a call
> to unescape_uri to undo the escaping.
>
> So, change the patch to escape all the characters except those I mentioned;
> I would recommend changing the name of the routine.
>
> Of course, that leaves the problem of converting the filename directly to
> HTML when used as the text of the anchor. A simple solution would be
> to ignore non-printing characters, and assume ISO-8859-1 for the rest.
>
> David.
>
> References:
> Fielding, R., `Relative Uniform Resource Locators', RFC 1808, UC Irvine,
> June 1995.

--
Ben Laurie
Re: # in file names...
Ben wrote:
> > Ok, how about:
> >
> > % touch "foo^V^Mbar" ie foo, ctrl-M, bar
> >
> > This produces a file that:
> >
> > a) looks like sh*t on the screen
>
> Well, some kind of escaping needs to be done for the text, too. That could
> take a little more discussion than fixing the URI.

Nearly there. Note that escape_uri is a misnomer; it should really be called
escape_http_path, and it is currently trying to do two things.

1. Escape a path to make a valid URL path.
2. Escape a URL path so that it can be used in an HTML document.

For 1, it needs to % escape _all_ characters except for
a-z A-Z 0-9 $ - _ . + ! * ' ( ) , : @ & =

For 2, only the & needs to be escaped, assuming the HREF is enclosed in
double quotes ("), so all characters except for
a-z A-Z 0-9 $ - _ . + ! * ' ( ) , : @ =
should be escaped.

The current routine escapes : and + unnecessarily. If it were being used
for escaping other parts of a URL (the query string perhaps), then it could
legitimately escape ':'. However, the only significant use of escape_uri
is by mod_dir.c; all other calls to it are immediately followed by a call
to unescape_uri to undo the escaping.

So, change the patch to escape all the characters except those I mentioned;
I would recommend changing the name of the routine.
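
As a rough illustration of the two character sets described in this message,
here is a minimal C sketch. The function names is_path_safe and is_href_safe
are made up for the example and do not exist in Apache. The lists are copied
from the message as written (so '/' is not in them), and ASCII is assumed.

```c
#include <string.h>

/* Case 1: characters that may appear unescaped in a URL path,
 * taken from the first list in the message. */
int is_path_safe(int ch)
{
    unsigned char c = (unsigned char)ch;

    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9'))
        return 1;
    /* strchr would match the terminating NUL, so rule it out first. */
    return c != '\0' && strchr("$-_.+!*'(),:@&=", c) != NULL;
}

/* Case 2: the same set minus '&', for a path placed in a
 * double-quoted HREF attribute. */
int is_href_safe(int ch)
{
    return ch != '&' && is_path_safe(ch);
}
```

Under these lists '#', space and control characters fail both tests, while
'&' passes the first test only.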

Of course, that leaves the problem of converting the filename directly to
HTML when used as the text of the anchor. A simple solution would be
to ignore non-printing characters, and assume ISO-8859-1 for the rest.
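
The "simple solution" for the anchor text might look something like the
following C sketch. The name anchor_text is hypothetical, malloc stands in for
whatever allocator the server would really use, and replacing &, < and > with
entities is an assumption added here (the message only mentions dropping
non-printing characters). Bytes are treated as ISO-8859-1, so 0x00-0x1f and
0x7f-0x9f count as non-printing.

```c
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of name that is usable as HTML text:
 * non-printing ISO-8859-1 bytes are dropped and &, <, > become
 * entities. The caller frees the result. Returns NULL if out of memory. */
char *anchor_text(const char *name)
{
    /* Worst case: every byte turns into "&amp;" (5 bytes). */
    char *out = malloc(5 * strlen(name) + 1);
    char *w = out;
    const unsigned char *r;

    if (out == NULL)
        return NULL;
    for (r = (const unsigned char *)name; *r; r++) {
        if (*r < 0x20 || (*r >= 0x7f && *r <= 0x9f))
            continue;                       /* ignore non-printing bytes */
        switch (*r) {
        case '&': memcpy(w, "&amp;", 5); w += 5; break;
        case '<': memcpy(w, "&lt;", 4);  w += 4; break;
        case '>': memcpy(w, "&gt;", 4);  w += 4; break;
        default:  *w++ = (char)*r;       break;
        }
    }
    *w = '\0';
    return out;
}
```

For the foo, ctrl-M, bar file from the start of the thread this would display
"foobar", while the link itself would still need the %0d escape.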

David.

References:
Fielding, R., `Relative Uniform Resource Locators', RFC 1808, UC Irvine,
June 1995.
Re: # in file names...
> > Ben wrote:
> > > > Ok, how about:
> > > >
> > > > % touch "foo^V^Mbar" ie foo, ctrl-M, bar
> > > >
> > > > This produces a file that:
> > > >
> > > > a) looks like sh*t on the screen
> > >
> > > Well, some kind of escaping needs to be done for the text, too. That
> > > could take a little more discussion than fixing the URI.
> >
> > Nearly there. Note that escape_uri is a misnomer; it should really be
> > called escape_http_path, and it is currently trying to do two things.
> >
> > 1. Escape a path to make a valid URL path.
> > 2. Escape a URL path so that it can be used in an HTML document.
> >
> > For 1, it needs to % escape _all_ characters except for
> > a-z A-Z 0-9 $ - _ . + ! * ' ( ) , : @ & =
>
> This is not my reading of RFC 1808. There the "unreserved" characters are
> defined to be "alpha | digit | safe | extra".

> Alpha and digit are as we expect, safe is "$-_.+" and extra is "!*'(),". It
> may be that there are additional characters which can safely be used in the
> context of an FTP URL,

I think you mean an HTTP URL, and the extra characters allowed are : @ & =

> but there is no harm in escaping them. Section 5.3 specifically recommends
> against the unescaped use of ":",

Correct, it is harmless. In fact 5.3 recommends prefixing relative
URLs with ./ to avoid problems with ':'; however, it would be simpler for
escape_uri to escape ':'.

> and ":@&=" are all reserved in a generic-RL.

Yes, but you are allowed to use reserved characters! reserved != forbidden.
Reserved means that they _may_ be defined to have special semantics,
whereas unreserved characters cannot be defined to have special semantics,
I think.

> > For 2, only the & needs to be escaped, assuming the HREF is enclosed in
> > double quotes ("), so all characters except for
> > a-z A-Z 0-9 $ - _ . + ! * ' ( ) , : @ =
> > should be escaped.
>
> When does Apache need to do this?

When it outputs a directory listing (as in the original bug report); this is the
raison d'etre of unescape_uri. (Properly called unescape_httppath.)

So our list of acceptable characters in a path is now
a-z A-Z 0-9 $ - _ . + ! * ' ( ) , @ =

David.
Re: # in file names...
>
> > > Ben wrote:
> > > > > Ok, how about:
> > > > >
> > > > > % touch "foo^V^Mbar" ie foo, ctrl-M, bar
> > > > >
> > > > > This produces a file that:
> > > > >
> > > > > a) looks like sh*t on the screen
> > > >
> > > > Well, some kind of escaping needs to be done for the text, too. That
> > > > could take a little more discussion than fixing the URI.
> > >
> > > Nearly there. Note that escape_uri is a misnomer; it should really be
> > > called escape_http_path, and it is currently trying to do two things.
> > >
> > > 1. Escape a path to make a valid URL path.
> > > 2. Escape a URL path so that it can be used in an HTML document.
> > >
> > > For 1, it needs to % escape _all_ characters except for
> > > a-z A-Z 0-9 $ - _ . + ! * ' ( ) , : @ & =
> >
> > This is not my reading of RFC 1808. There the "unreserved" characters are
> > defined to be "alpha | digit | safe | extra".
>
> > Alpha and digit are as we expect, safe is "$-_.+" and extra is "!*'(),". It
> > may be that there are additional characters which can safely be used in the
> > context of an FTP URL,
>
> I think you mean an HTTP URL, and the extra characters allowed are : @ & =

You're right. Directory listings make me think FTP. Oops.

>
> > but there is no harm in escaping them. Section 5.3 specifically recommends
> > against the unescaped use of ":",
>
> Correct, it is harmless. In fact 5.3 recommends prefixing relative
> URLs with ./ to avoid problems with ':';

I know, I was reinterpreting on the fly.

> however, it would be simpler for
> escape_uri to escape ':'.
>
> > and ":@&=" are all reserved in a generic-RL.
>
> Yes, but you are allowed to use reserved characters! reserved != forbidden.
> Reserved means that they _may_ be defined to have special semantics,
> whereas unreserved characters cannot be defined to have special semantics,
> I think.

Maybe so, however, I see no reason to leave these characters unescaped. It
only improves the system ever so slightly, and may break later when Apache
supports new semantics.

>
> > > For 2, only the & needs to be escaped, assuming the HREF is enclosed in
> > > double quotes ("), so all characters except for
> > > a-z A-Z 0-9 $ - _ . + ! * ' ( ) , : @ =
> > > should be escaped.
> >
> > When does Apache need to do this?
>
> When it outputs a directory listing (as in the original bug report); this is the
> raison d'etre of unescape_uri. (Properly called unescape_httppath.)
>
> So our list of acceptable characters in a path is now
> a-z A-Z 0-9 $ - _ . + ! * ' ( ) , @ =

See above.

>
> David.

--
Ben Laurie
Re: # in file names...
> Maybe so, however, I see no reason to leave these characters unescaped. It
> only improves the system ever so slightly, and may break later when Apache
> supports new semantics.

Which is exactly why I think that unescape_uri should be quite specific
to http paths. Any other behaviour is pretty much guaranteed to be wrong
for other semantics; it's hopeless to try and second guess future
developments.

I missed off / from the list of acceptable characters, BTW.
Remove & from the list if you like; but don't document this routine as
somehow being a 'general URI encoding routine'.

David.
Re: # in file names...
>
> > Maybe so, however, I see no reason to leave these characters unescaped. It
> > only improves the system ever so slightly, and may break later when Apache
> > supports new semantics.
>
> Which is exactly why I think that unescape_uri should be quite specific
> to http paths. Any other behaviour is pretty much guaranteed to be wrong
> for other semantics; it's hopeless to try and second guess future
> developments.
>
> I missed off / from the list of acceptable characters, BTW.
> Remove & from the list if you like; but don't document this routine as
> somehow being a 'general URI encoding routine'.

OK, I've read the RFC a bit more carefully (blush). I see where you got your
list from, and I agree with the list, or at least this version of it:

A-Za-z0-9$-_.+!*'(),:@&=

However, we should either escape : or put ./ on the front (one less character).
I would suggest that the routine should be called escape_rfc1808_segment, which
makes it pretty clear what it is doing (at least to anyone with RFC1808 to
hand). If everyone (who cares) is happy with this, I'll redo the patch, again.
Note that the routine _will_ escape /; to escape a path with directories
in it, each segment will have to be escaped individually. I suggest this method
for better OS independence: / cannot appear in a filename under Unix, but it
certainly can on a Mac, and probably also on Win95. This will not be a problem
with the current use of the routine.
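
For concreteness, a minimal C sketch of what such a routine could look like.
This is not the patch itself, only an illustration under some assumptions: it
takes the "escape :" option rather than prefixing ./, it uses malloc instead
of Apache's pool allocator, and it assumes ASCII.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Escape one path segment. Everything outside A-Za-z0-9 and
 * $-_.+!*'(),@&= becomes %XX, which includes '/' and ':'.
 * The caller frees the result. Returns NULL if out of memory. */
char *escape_rfc1808_segment(const char *segment)
{
    static const char unescaped[] = "$-_.+!*'(),@&=";
    /* Worst case: every byte becomes three characters. */
    char *copy = malloc(3 * strlen(segment) + 1);
    char *w = copy;
    const unsigned char *r;

    if (copy == NULL)
        return NULL;
    for (r = (const unsigned char *)segment; *r; r++) {
        if ((*r >= 'A' && *r <= 'Z') || (*r >= 'a' && *r <= 'z') ||
            (*r >= '0' && *r <= '9') || strchr(unescaped, *r) != NULL) {
            *w++ = (char)*r;
        } else {
            /* unsigned char avoids sign extension of bytes above 0x7f */
            sprintf(w, "%%%02x", (unsigned int)*r);
            w += 3;
        }
    }
    *w = '\0';
    return copy;
}
```

With this sketch "foo#bar" becomes "foo%23bar" and a Mac-style name such as
"a/b:c" becomes "a%2fb%3ac".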

> David.

Cheers,

Ben.

Re: # in file names...
>
> > > I missed off / from the list of acceptable characters, BTW.
> > > Remove & from the list if you like; but don't document this routine as
> > > somehow being a 'general URI encoding routine'.
> >
> > OK, I've read the RFC a bit more carefully (blush). I see where you got your
> > list from, and I agree with the list, or at least this version of it:
> >
> > A-Za-z0-9$-_.+!*'(),:@&=
> >
> > However, we should either escape : or put ./ on the front (one less
> > character).
>
> Escaping : may be more browser-safe, and it's a change local to a segment
> instead of global to the path.

Yep.

>
> > I would suggest that the routine should be called escape_rfc1808_segment,
> > which makes it pretty clear what it is doing (at least to anyone with RFC1808
> > to hand). If everyone (who cares) is happy with this, I'll redo the patch,
> > again. Note that the routine _will_ escape /, to escape a path with
> > directories in, each segment will have to be individually escaped (I suggest
> > this method for better OS independence. / cannot appear in a filename under
> > Unix, but it certainly can on a Mac, and probably also on Win95). This will
> > not be a problem with the current use of the routine.
>
> Oerr, Apache on the Mac!

I didn't say I was going to do it, but it would be nice to know it was
possible.

>
> If you write escape_rfc1808_segment (or escape_path_segment as I would call it,
> with a reference to rfc1808 in the comments) then you will also need an
> escape_rfc1808_path (or escape_path) which splits up a path into
> segments, escapes each segment, and concatenates the segments again.
> For Apache at present this is, of course, equivalent to escape_rfc1808_segment,
> but allowing '/' unescaped.
>
> The call to escape_uri in mod_dir.c will become a call to
> escape_rfc1808_segment, and the other calls will become calls to
> escape_rfc1808_path. Of course, under UNIX mod_dir.c could equally well
> call escape_rfc1808_path.
>
> Note that it will take a significant amount of change to apache to work with
> an OS that allows '/' in filenames, as it already uses / for the segment
> separator. And does MacOS allow NUL characters in filenames?

I don't know if it allows NULs, but I doubt it; even Apple use C! Of course,
if my memory serves me, it uses ':' as a path separator. Perhaps these two
routines should be part of the (as yet nonexistent) OS support module, and
be called, respectively: os_convert_os_path_segment_to_url, and
os_convert_os_path_to_url, taking as input a raw OS segment/path, and giving a
converted and escaped URL segment/path (so on a Mac : would have been converted
to /, PC would have \ converted to / [though often the compiler libraries do
this anyway]). The inverse functions would also be needed.

> Also, the current CGI spec does not support / in path segments;
> for the URL http://host/cgi-bin/script/extra%2fdata
> the extra path segment cannot be passed to the script, as it would be
> decoded beforehand.

OK, so who fixes the CGI spec? :-)

> David.

Cheers,

Ben.

Re: # in file names...
> > I missed off / from the list of acceptable characters, BTW.
> > Remove & from the list if you like; but don't document this routine as
> > somehow being a 'general URI encoding routine'.
>
> OK, I've read the RFC a bit more carefully (blush). I see where you got your
> list from, and I agree with the list, or at least this version of it:
>
> A-Za-z0-9$-_.+!*'(),:@&=
>
> However, we should either escape : or put ./ on the front (one less
> character).

Escaping : may be more browser-safe, and it's a change local to a segment
instead of global to the path.

> I would suggest that the routine should be called escape_rfc1808_segment,
> which makes it pretty clear what it is doing (at least to anyone with RFC1808
> to hand). If everyone (who cares) is happy with this, I'll redo the patch,
> again. Note that the routine _will_ escape /, to escape a path with
> directories in, each segment will have to be individually escaped (I suggest
> this method for better OS independence. / cannot appear in a filename under
> Unix, but it certainly can on a Mac, and probably also on Win95). This will
> not be a problem with the current use of the routine.

Oerr, Apache on the Mac!

If you write escape_rfc1808_segment (or escape_path_segment as I would call it,
with a reference to rfc1808 in the comments) then you will also need an
escape_rfc1808_path (or escape_path) which splits up a path into
segments, escapes each segment, and concatenates the segments again.
For Apache at present this is, of course, equivalent to escape_rfc1808_segment,
but allowing '/' unescaped.
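
The split, escape and concatenate steps could be sketched in C roughly as
follows. The helper append_escaped_segment is a name invented for this
example, malloc stands in for the pool allocator, and the unescaped set (with
':' escaped) is one reading of the list discussed in this thread, not a
settled decision.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write the escaped form of seg[0..len) at w and return the new
 * write position. Hypothetical helper for this sketch. */
static char *append_escaped_segment(char *w, const char *seg, size_t len)
{
    static const char unescaped[] = "$-_.+!*'(),@&=";
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)seg[i];
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            (c != '\0' && strchr(unescaped, c) != NULL)) {
            *w++ = (char)c;
        } else {
            sprintf(w, "%%%02x", (unsigned int)c);
            w += 3;
        }
    }
    return w;
}

/* Split path on '/', escape each segment, and join the segments
 * with '/' again. The caller frees the result. */
char *escape_path(const char *path)
{
    char *copy = malloc(3 * strlen(path) + 1);
    char *w = copy;
    const char *seg = path;

    if (copy == NULL)
        return NULL;
    for (;;) {
        const char *slash = strchr(seg, '/');
        size_t len = slash ? (size_t)(slash - seg) : strlen(seg);

        w = append_escaped_segment(w, seg, len);
        if (slash == NULL)
            break;
        *w++ = '/';          /* the separator itself stays unescaped */
        seg = slash + 1;
    }
    *w = '\0';
    return copy;
}
```

Because the input is split on '/' first, the separator never reaches the
escaping helper, which is why the result matches segment escaping with '/'
allowed.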

The call to escape_uri in mod_dir.c will become a call to
escape_rfc1808_segment, and the other calls will become calls to
escape_rfc1808_path. Of course, under UNIX mod_dir.c could equally well
call escape_rfc1808_path.

Note that it will take a significant amount of change to Apache to work with
an OS that allows '/' in filenames, as it already uses / for the segment
separator. And does MacOS allow NUL characters in filenames?

Also, the current CGI spec does not support / in path segments;
for the URL http://host/cgi-bin/script/extra%2fdata
the extra path segment cannot be passed to the script, as it would be
decoded beforehand.
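
A small C sketch shows why the information is lost. The decoder here is a
stand-in written for this example (decode_path_in_place and hex_value are
hypothetical names), not Apache's own unescaping routine.

```c
#include <ctype.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(unsigned char c)
{
    if (isdigit(c))
        return c - '0';
    c = (unsigned char)tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* Decode %XX escapes in place. Malformed escapes are copied
 * through unchanged. */
void decode_path_in_place(char *s)
{
    unsigned char *r = (unsigned char *)s;
    char *w = s;

    while (*r) {
        int hi, lo;
        if (*r == '%' && (hi = hex_value(r[1])) >= 0 &&
            (lo = hex_value(r[2])) >= 0) {
            *w++ = (char)(hi * 16 + lo);
            r += 3;
        } else {
            *w++ = (char)*r++;
        }
    }
    *w = '\0';
}
```

After decoding, "/extra%2fdata" and "/extra/data" are the same string, so a
script that only sees the decoded form cannot tell one segment containing a
slash from two separate segments.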

David.