Generally, it's been my experience that nothing clarifies the issues
involved in doing something like actually writing code to do it. I
decided to take this approach with regard to content-type negotiation;
as a result, I now have two new patches, both in
ftp://ftp.ai.mit.edu/pub/users/rst/httpd-patches:
patch.addtype-bug --- Largely a code cleanup, but also fixes the
long-standing and oft-reported bug that the usual AddType directives
for *.cgi and *.shtml are ineffective in .htaccess files, and
leaves a few hooks for...
patch.content-arb --- Content-type negotiation. (Yes, really ---
but see caveats below --- a user's-manual style writeup on what
this does is towards the bottom of this note, after "HERE'S THE DOCS").
This patch is 650-odd lines, of which about 500 are the contents
of the new file http_mime_db.c, which does most of the work.
This code does a *basic* job of content-type negotiation --- it parses
everything (at least according to my reading of the relevant standards
documents, and the CERN code), but it doesn't actually use the
Accept-content-encoding and Accept-language info yet; however, it does
handle qualities. (It also doesn't yet return the correct HTTP error
codes in all cases --- in particular, it returns 404 Not Found, rather
than 406 None Acceptable, when content-type negotiation does find
alternate views, of which none are acceptable).
With this code in mind, some reflections on the discussions we've had
on this subject over the past week or two.
First off, with regard to the question of whether to support
CERN-style auto-arbitration based on filename extensions, or to have
explicit map files, I'd like to suggest that we can afford to support
both. Directories take very little extra code to support beyond what
you need to handle the contents of the map files themselves --- from
the server's point of view, they're just another form of map file,
which happens to come pre-parsed. However, the gain in convenience
for people who can use the feature is substantial.
(Of course, some people *don't* want to use the feature, so it needs
to be a configuration option. The way I've done it in my code above,
you need "Options MultiViews" enabled in a directory in order for "GET
/.../foo" to be resolved to "/.../foo.gif" or "/.../foo.jpeg". In
directories where MultiViews is off, the server behaves exactly as it
would if the directory-scanning code were never there, so I don't
*think* there's a back-compatibility issue ;-).
A second interesting thing which comes up is what to do with clients
like certain, ahem, colorful browser betas which ship completely bogus
Accept: headers (or HTTP/0.9 browsers, which don't ship any). My code
basically pretends the browser did "Accept: text/html" and "Accept:
text/plain" whether it actually did or not, to ease this difficulty;
is that the right thing?
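
To make the question concrete, here is a minimal sketch in Python (not the C code in the patch) of the defaulting behaviour described above. The function name `parse_accept` is hypothetical. One choice in it is an assumption and not something the patch is known to do: an explicit client entry for text/html or text/plain is kept, and the implied entry is added only when the client sent none.

```python
def parse_accept(header_values):
    """Parse a list of Accept: header values into {media_type: quality}."""
    accepted = {}
    for value in header_values:
        for item in value.split(","):
            item = item.strip()
            if not item:
                continue
            parts = [p.strip() for p in item.split(";")]
            media_type = parts[0].lower()
            quality = 1.0  # no q parameter means full quality
            for param in parts[1:]:
                name, _, val = param.partition("=")
                if name.strip().lower() == "q":
                    try:
                        quality = float(val.strip())
                    except ValueError:
                        # Assumption: an unparsable q is treated as "not acceptable".
                        quality = 0.0
            accepted[media_type] = quality
    # Pretend the client also accepts text/html and text/plain, so that
    # HTTP/0.9 clients and clients with bogus Accept: lines still get text.
    for implied in ("text/html", "text/plain"):
        accepted.setdefault(implied, 1.0)
    return accepted
```

A client that sends no Accept: lines at all ends up accepting exactly text/html and text/plain under this sketch.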
Then there's security. The issue here is that if some Malevolent
Entity (say, a cracker exploiting a leaky ftp server) can create
type-map files, you don't want the server believing one which names
/etc/passwd as the text/plain view of the composite entity described
by /innocuous/directory/pretty-bunny.map. My code takes the
thoroughly draconian approach of making all pathnames in map files
relative to the map file itself, and *disallowing* relative paths
containing '/', so a type-map file can *only* name things in the same
directory (although those can be symlinks *if* FollowSymLinks is
enabled). Is that the wrong thing? If so, what's the right thing?
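
The same-directory rule can be sketched as follows, again in Python and not the patch's C. The name `resolve_map_entry` is hypothetical, and rejecting "." and ".." is an extra assumption beyond the "no '/'" rule stated above. The sketch does not look at symlinks; per the text, that is left to FollowSymLinks.

```python
import posixpath


def resolve_map_entry(map_path, name):
    """Return the path a type-map entry refers to, or raise ValueError.

    Entries are resolved relative to the map file itself, and any name
    containing '/' is refused, so an entry can only name a file that sits
    in the same directory as the map.
    """
    if not name or "/" in name or name in (".", ".."):
        raise ValueError("type-map entry outside the map's directory: %r" % name)
    return posixpath.join(posixpath.dirname(map_path), name)
```

With this check, an entry such as /etc/passwd or ../../etc/passwd is refused before the server ever touches the file system.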
As a final point, writing the code raises the question of what the
map files should look like. What I've done probably isn't the right
thing, but a discussion of what's wrong with it might prove
instructive. The map files implemented by my code just look like:
    foo.au: audio/basic
    foo.gif: image/gif
    foo.html: text/html
    foo.txt: text/plain
(this being the contents of foo.map). Qualities can be associated
with these ---
    foo.gif: image/gif; q = 0.6
    foo.jpeg: image/jpeg; q = 1.0
    foo.xbm: image/x-xbitmap; q = 0.00001
but that's about it. My reasons for choosing this syntax were crass
and pragmatic --- I got to reuse the Accept:-line parsing code. (That
means that
    foo.txt: text/plain, text/setext
works, if both those types apply to the document --- in fact, for
compatibility with the broken past,
    foo.txt: text/plain text/setext
works, but Lord knows we don't want to advertise *that*).
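
For discussion purposes, a rough Python sketch of a parser for the simple one-line-per-file syntax above. It is not the Accept:-line parser the patch reuses. The regular expressions and the name `parse_type_map` are hypothetical, and parameters other than q are simply ignored.

```python
import re

# A media type, followed by zero or more ";name=value" parameters.
ENTRY_RE = re.compile(
    r"([\w.+-]+/[\w.+*-]+)"
    r"((?:\s*;\s*[\w-]+\s*=\s*[^\s;,]+)*)"
)
PARAM_RE = re.compile(r";\s*([\w-]+)\s*=\s*([^\s;,]+)")


def parse_type_map(text):
    """Parse type-map text into a list of (filename, media_type, quality)."""
    entries = []
    for line in text.splitlines():
        filename, sep, spec = line.partition(":")
        filename = filename.strip()
        if not sep or not filename:
            continue  # blank line, or no "filename:" prefix
        # Scanning for media types (rather than splitting on commas) accepts
        # both the comma-separated and the old space-separated forms.
        for match in ENTRY_RE.finditer(spec):
            media_type = match.group(1).lower()
            quality = 1.0
            for name, value in PARAM_RE.findall(match.group(2)):
                if name.lower() == "q":
                    quality = float(value)  # raises ValueError if malformed
            entries.append((filename, media_type, quality))
    return entries
```

A file listed with two types simply yields two entries, one per type, each at quality 1.0 unless a q is given.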
My question for the group is, what else do we want? Extra MIME header
lines? Some way of discriminating on USER_AGENT? (FWIW, I was
thinking vaguely of a syntax along the lines of:
    foo.aiff {
        Content-type: audio/aiff; q=1.0
        Pass-along-this-mime-hdr: Kilroy was here
    }
    foo.au {
        Content-type: audio/basic; q=0.2
        Pass-along-this-mime-hdr: Klrooy wuzzz heerere
    }
I have a truly marvelous implementation of this in mind, but my
weekend was too small to contain it ;-).
Finally, wrt the disposition of the code --- I think patch.addtype-bug
is a reasonable candidate for Cliff's alpha release; it fixes a
long-standing problem, and cleans up some of the script code a little (one
routine that searches for PATH_INFO instead of three; also, about 90
lines of duplicate code from http_{post,put,delete}.c merged into one
routine. These changes are a bit more than I would have liked at this
point, but there were several distinct pieces of code which all had
the bug, and you really can't fix it without bashing them all).
The content-arb code is a little more tenuous --- largely because it's
not quite complete (content-encodings aren't handled correctly yet),
and because we really don't know what the map files ought to look
like.
Oh yes, HERE'S THE DOCS:
This code adds two new features to httpd: special treatment for the
pseudo-mime-type application/x-type-map, and the MultiViews
per-directory Option (which can be set in srm.conf, or in .htaccess
files, as usual). These features are alternate user interfaces to
what amounts to the same piece of code (in the file http_mime_db.c)
which implements (uh, ...most of) the optional content negotiation
portion of the HTTP protocol.
Each of these features allows one of several files to satisfy a
request, based on what the client says it's willing to accept; the
differences are in the way the files are identified:
*) A type map names the files explicitly
*) In a MultiViews directory, the server does an implicit glob
and chooses from among the results
TYPE MAPS:
A type map is a document which is typed by the server (using its
normal suffix-based mechanisms) as application/x-type-map. The syntax
of these files is simple:
    filename: mime/type; parm parm parm
    filename: mime/type; parm parm parm
so, for instance, you can have
    foo.gif: image/gif; q = 0.6
    foo.jpeg: image/jpeg; q = 1.0
    foo.xbm: image/x-xbitmap; q = 0.00001
The 'q', for 'quality' parameter, specifies preferences among these
images if the client doesn't much care --- in this case, the jpeg is
somewhat preferred to the gif, and the xbm is only shipped if the
client won't take *anything* else.
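
How the choice among variants might work can be sketched in Python as below. The scoring rule (map-file quality multiplied by the client's quality for that type, with wildcard fallback to type/* and then */*) is an assumption made for illustration; the writeup does not spell out the exact rule used by http_mime_db.c. The names `client_quality` and `choose_variant` are hypothetical.

```python
def client_quality(media_type, accepted):
    """Quality the client assigns to media_type, honouring wildcards."""
    major = media_type.split("/", 1)[0]
    for pattern in (media_type, major + "/*", "*/*"):
        if pattern in accepted:
            return accepted[pattern]
    return 0.0  # not mentioned at all: not acceptable


def choose_variant(variants, accepted):
    """Pick the best filename from (filename, media_type, quality) tuples.

    `accepted` maps media types (or wildcards) to client qualities.
    Returns None when nothing is acceptable, which is the case that
    ought to produce a 406 response.
    """
    best, best_score = None, 0.0
    for filename, media_type, source_quality in variants:
        score = source_quality * client_quality(media_type, accepted)
        if score > best_score:
            best, best_score = filename, score
    return best
```

With the three-image map above, a client accepting image/* gets foo.jpeg, and one that accepts only image/x-xbitmap still gets foo.xbm, since any positive score beats none.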
Note that the files referenced *must* be in the same directory as the
map file, for security reasons (we wouldn't want someone coming in
through an ftp server to be able to fake up a map file listing
/etc/passwd, and have the server respect it). You get a Server Error
message if they aren't.
Note also that to use this, you've got to have an AddType someplace
which defines a file suffix as application/x-type-map; the easiest
thing may be to put the line
    AddType application/x-type-map map
in srm.conf.
MULTIVIEWS:
This is a per-directory option, meaning it can be set with an Options
directive within a <Directory> section in access.conf, or (if
AllowOverride is properly set) in .htaccess files. Note that Options
All does not set MultiViews; you have to ask for it by name. (This is
a one-line change to httpd.h).
The effect of MultiViews is as follows: if the server receives a
request for /some/dir/foo, /some/dir has MultiViews enabled, and
/some/dir/foo does *not* exist, then the server reads the directory
looking for files named foo.*, and effectively fakes up a type map
which names all those files, assigning them the same MIME types it
would have if the client had asked for one of them by name. It then
chooses the best match to the client's Accept: headers, and forwards
it along.
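
The directory scan can be pictured with the following Python sketch. It is an illustration only, not the patch's C code: the suffix table, the fallback type for unknown suffixes, and the name `synthesize_type_map` are all hypothetical stand-ins for the server's real AddType and default-type configuration.

```python
import os

# Hypothetical stand-in for the server's suffix-to-type configuration.
SUFFIX_TYPES = {
    "gif": "image/gif",
    "jpeg": "image/jpeg",
    "html": "text/html",
    "txt": "text/plain",
}


def synthesize_type_map(directory, basename, suffix_types=SUFFIX_TYPES,
                        default_type="text/plain"):
    """Fake up a type map for a request that names directory/basename.

    Returns None if the file exists under exactly that name (it is then
    served as usual); otherwise returns (filename, media_type, quality)
    tuples for every file called basename.*, all at quality 1.0.
    """
    if os.path.exists(os.path.join(directory, basename)):
        return None
    prefix = basename + "."
    variants = []
    for name in sorted(os.listdir(directory)):
        if not name.startswith(prefix):
            continue
        suffix = name.rsplit(".", 1)[1]
        media_type = suffix_types.get(suffix, default_type)
        variants.append((name, media_type, 1.0))
    return variants
```

The resulting list has the same shape as a parsed type map, which is why the two features can share the selection code.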
If one of the files found by the globbing is a CGI script, it's not
obvious what should happen. My code gives that case special
treatment --- if the request was a POST, or a GET with QUERY_ARGS or
PATH_INFO, the script is given an extremely high quality rating, and
generally invoked; otherwise it is given an extremely low quality
rating, which generally causes one of the other views (if any) to be
retrieved. This is the only jiggering of quality ratings done by the
MultiViews code; aside from that, all Qualities in the synthesized
type maps are 1.0.
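
The CGI special case amounts to something like the Python sketch below. The two constants are hypothetical placeholders for "extremely high" and "extremely low"; the actual values in the patch are not given here, and the name `multiviews_quality` is likewise made up for illustration.

```python
# Hypothetical values; the writeup only says "extremely high" and "extremely low".
CGI_HIGH_QUALITY = 1000.0
CGI_LOW_QUALITY = 0.001


def multiviews_quality(is_cgi, method, query_string="", path_info=""):
    """Quality assigned to one file found by the MultiViews scan."""
    if not is_cgi:
        return 1.0  # every ordinary file in a synthesized map gets 1.0
    has_args = bool(query_string or path_info)
    if method == "POST" or (method == "GET" and has_args):
        return CGI_HIGH_QUALITY  # the request looks meant for the script
    return CGI_LOW_QUALITY  # prefer any static view, if one exists
```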
Note that this machinery only comes into play if the file which the
user attempted to retrieve does *not* exist by that name; if it does,
it is simply retrieved as usual. (So, someone who actually asks for
'foo.jpeg', as opposed to 'foo', never gets foo.gif).
That's it.
rst