Mailing List Archive: YAS

YAS

Apr 10, 1995, 10:37 AM

Post #1 of 47

another suggestion....

Have httpd parse ALIWEB index files, and return formatted output.

I've got a perl script which does this, but it might be useful to
build it into Apache.

Someone would hit an ALIWEB MIME type. If there were arguments, it
would look for those as ALIWEB keywords; if there were no args, it
would return a form where the keywords could be typed in.

-=-=-=-=

Aliweb looks for "/site.idx" which contains things like..

Template-Type: ORGANIZATION
Organization-Name: Department Of Computing Mathematics, UWCC.
URI: /Places/comma.html
Description: The department for computer science in the University of Wales College of Cardiff. Wales. UK
Keywords: COMMA, UWCC, Cardiff, Wales, computer science, computing mathematics

Template-Type: SERVICE
Name: The rec.arts.movies database
URI: /Movies/index.html
Description: An interface to the rec.arts.movies database of movie facts
Keywords: movies, film, cinema, reviews

Template-Type: DOCUMENT
Name: Centre for High Performance Computing - Cardiff
URI: /Hpc/index.html
Description: Promotion and support of High Performance Computing
Keywords: high performance computing, JISC
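
[Editor's sketch] The template records above are simple enough to read with a few lines of code. The following is only an illustration, assuming one "Field: value" pair per line and a blank line between records; the function name parse_templates is made up here and is not taken from the Perl script mentioned in the post.

```python
def parse_templates(text):
    """Split an ALIWEB-style index into a list of {field: value} records."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record (assumed convention).
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so values may contain colons.
        name, sep, value = line.partition(":")
        if sep:
            current[name.strip()] = value.strip()
    if current:
        records.append(current)
    return records


SAMPLE = """\
Template-Type: SERVICE
Name: The rec.arts.movies database
URI: /Movies/index.html
Keywords: movies, film, cinema, reviews

Template-Type: DOCUMENT
Name: Centre for High Performance Computing - Cardiff
URI: /Hpc/index.html
Keywords: high performance computing, JISC
"""

records = parse_templates(SAMPLE)
```

Each record comes back as a plain dictionary, so a keyword search reduces to looking at record["Keywords"].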

Re: YAS

Apr 10, 1995, 7:07 PM

Post #2 of 47

From: Rob Hartill <hartill@ooo.lanl.gov>
Date: Mon, 10 Apr 95 10:37:48 MDT

Have httpd parse ALIWEB index files, and return formatted output.

... um... why not just parse the thing once, when it's built, and
serve the output as an ordinary file?

I've got a perl script which does this, but it might be useful to
build it into Apache.

There's a tradeoff here --- putting something like this into the
server itself makes it run faster, but at the same time it also
complicates the server code itself. For, say, imagemap, this may be
worth the tradeoff, but I'm not sure that prettyprinting ALIWEB files
comes up often enough that it can't run as effectively as it needs to
as a script.

(One thing we might want to do about these sorts of "YAS" things that
seem to be coming up is to define a binary interface, along the lines
of Simon's BGI or the NetSite internal APIs, and write these things as
examples of that. Brian's suggestion for trying a case-insensitive
file match as recovery from a 404 could be done, for instance, by
setting ErrorDocument to the effective URL of such an internally loaded
module --- perhaps even to a script, if it doesn't come up much).

rst

Re: indexing suggestion

Apr 11, 1995, 5:41 PM

Post #3 of 47

> From: Rob Hartill <hartill@ooo.lanl.gov>
> Date: Mon, 10 Apr 95 10:37:48 MDT
>
> Have httpd parse ALIWEB index files, and return formatted output.
>
> ... um... why not just parse the thing once, when it's built, and
> serve the output as an ordinary file?

I'm suggesting we have the server search the index file when it
is requested with arguments. Without arguments it prompts for some.

So I could hit some site with a link to their index, and be able to
type in a keyword. It would then give me a formatted list of pointers
to what I probably wanted.

Now it may be that ALIWEB doesn't have the ideal syntax for this, but
maybe we can define some kind of local index file that better suits this
idea.

The index files, being of a special MIME type, could be placed in
lots of directories, so that the index will be specific to that
region of the server. A top level index could point directly to
resources or to lower level indices.

A simple approach could be to have a format such as

#comment
URL
keywords
description

e.g.
#let's index my new game
/Robs_junk/new/game.html
game,entertainment,hangman,fun
A www version of the classic hangman game
# i stole the hangman code from Fred's site
http://fred.com/cgi-bin/hangman
game,entertainment,hangman,fun,fred
The original version which I based <A HREF="/Robs_junk/new/game.html">my hangman game on</A>

HTML doesn't give a damn about \r\n so, the syntax could just be 1
field per line, with an unlimited line length.
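
[Editor's sketch] One way the format above could be read, assuming that lines starting with "#" are comments and that the remaining lines come in groups of three (URL, keywords, description). The name parse_simple_index and the handling of an incomplete trailing group are assumptions; the post does not specify either.

```python
def parse_simple_index(text):
    """Parse entries of three lines each: URL, keywords, description."""
    # Drop blank lines and '#' comment lines before grouping.
    fields = [ln.strip() for ln in text.splitlines()
              if ln.strip() and not ln.lstrip().startswith("#")]
    entries = []
    # Walk the remaining lines three at a time; an incomplete trailing
    # group is ignored (an assumption, the post does not say).
    for i in range(0, len(fields) - 2, 3):
        url, keywords, description = fields[i:i + 3]
        entries.append({
            "url": url,
            "keywords": [k.strip().lower() for k in keywords.split(",")],
            "description": description,
        })
    return entries


SAMPLE = """\
#let's index my new game
/Robs_junk/new/game.html
game,entertainment,hangman,fun
A www version of the classic hangman game
# the code this was based on
http://fred.com/cgi-bin/hangman
game,entertainment,hangman,fun,fred
The original version of the hangman game
"""

entries = parse_simple_index(SAMPLE)
```

Keywords are lower-cased at parse time so that later comparisons can be case insensitive without extra work.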

Robots could be encouraged to request the source of the index, so that
they do a proper job of indexing the web.

thoughts ?

Re: indexing suggestion

Apr 12, 1995, 7:47 AM

Post #4 of 47

From: Rob Hartill <hartill@ooo.lanl.gov>
Date: Tue, 11 Apr 95 17:41:49 MDT

I'm suggesting we have the server search the index file when it
is requested with arguments. Without arguments it prompts for some.

So I could hit some site with a link to their index, and be able to
type in a keyword. It would then give me a formatted list of pointers
to what I probably wanted.

Now it may be that ALIWEB doesn't have the ideal syntax for this, but
maybe we can define some kind of local index file that better suits this
idea.

Last time I checked, Martijn was actually giving out code for a search
engine that worked on ALIWEB template files... it probably would be a good
addition to the cgi-bin of our distribution, but I'm still not sure I
see the point of integrating the code into the server itself.

rst

Re: indexing suggestion

Apr 12, 1995, 10:25 AM

Post #5 of 47

> Last time I checked, Martijn was actually giving out code for a search
> engine that worked on ALIWEB template files... it probably would be a good
> addition to the cgi-bin of our distribution, but I'm still not sure I
> see the point of integrating the code into the server itself.

Okay, forget ALIWEB, but think along those lines.

For the cost of a few string compares, we can allow people to set
up index files in any directory - not a plain list of pointers, we're
talking about a database of URLs, keywords and descriptions which are
searched by httpd.

By hitting the index file URL for a directory, I could

1) ask for a form (no arguments given with the URL)
2) query the index (any arguments given)
3) view the index source (special argument given)

Andy suggested WAIS and glimpse. This is something different -
the resource owners decide what goes into the index, and how it
is described (the ALIWEB approach). The index files will typically
be small.

A simple format such as the one I gave earlier will be easy to
parse. Searching will be performed on the keywords only - just
do simple case insensitive string comparisons on comma separated
keywords.
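
[Editor's sketch] The three request modes and the keyword comparison described above might look roughly like this. It assumes ISINDEX-style queries where words are joined with "+", and it invents the name "source" for the special argument; neither detail is stated in the post, and this is not the code of Patch E68.

```python
from urllib.parse import unquote

SOURCE_ARG = "source"  # hypothetical name for the "special argument"


def choose_action(query_string):
    """Decide between the form, the raw index source, and a search."""
    if not query_string:
        return ("form", [])
    if query_string == SOURCE_ARG:
        return ("source", [])
    # Split on '+' and decode %XX escapes in each word.
    words = [unquote(w) for w in query_string.split("+") if w]
    return ("query", words)


def matches(keyword_line, words):
    """Case-insensitive comparison against comma-separated keywords."""
    keywords = {k.strip().lower() for k in keyword_line.split(",")}
    return any(w.lower() in keywords for w in words)
```

A request with no arguments yields ("form", []), and matches("game,entertainment,hangman,fun", ["GAME"]) is true because both sides are lower-cased before comparison.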

Now, because this is a special MIME type (maybe call it httpd/index)
it's really inexpensive to check for it. If you see it, you jump to
some new code.

Integrating this simple idea into the server will mean that everyone
will be able to correctly index their stuff without any CGI privileges.

The indices will be hierarchical,

e.g.

/foo.indx could index lots of info about the site, as well
as point to other index files deeper in the URL file system.
A keyword of "games" could point you to the dedicated games index.
The webmaster wouldn't have to worry about indexing his users
resources.

If all of this was expensive, I too would have my doubts, and would
suggest it be CGI'ed. But it's so easy to bolt on to the existing code,
and by being based on simple format and searching principles, it shouldn't
have an impact on server performance.

w.r.t robots, they could ask for the raw index file and use that to
build an ALIWEB style of index - one which is far superior to existing
"grab everything and guess" robot indexing techniques.

Someone told me the other day that it's pointless just saying "it's easy".
Implement it and show people how easy it was. I will have a crack at it
today.

robh

Re: indexing suggestion

Apr 12, 1995, 1:24 PM

Post #6 of 47

From: Rob Hartill <hartill@ooo.lanl.gov>
Date: Wed, 12 Apr 95 10:25:54 MDT

> Last time I checked, Martijn was actually giving out code for a search
> engine that worked on ALIWEB template files... it probably would be a good
> addition to the cgi-bin of our distribution, but I'm still not sure I
> see the point of integrating the code into the server itself.

Okay, forget ALIWEB, but think along those lines.

For the cost of a few string compares, we can allow people to set
up index files in any directory - not a plain list of pointers, we're
talking about a database of URLs, keywords and descriptions which are
searched by httpd.

This can still be done perfectly well using a script sitting in
/cgi-bin, which finds the location of the per-directory index file
from PATH_TRANSLATED --- in fact, that's what that CGI variable
is there for in the first place. It works fine for imagemap, and it
would work fine for a search script as well.
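
[Editor's sketch] To make the mechanism concrete, here is roughly how a server derives the extra-path variables for a URL such as /cgi-bin/idx/site.idx. This is a simplification: it assumes the extra path maps straight onto a single document root, whereas a real server applies its full URL-to-file translation. The function name and the /www/docs root are hypothetical.

```python
def cgi_variables(url_path, script_name, document_root):
    """Derive PATH_INFO and PATH_TRANSLATED for a script URL."""
    if not url_path.startswith(script_name):
        raise ValueError("URL does not belong to this script")
    # Whatever follows the script name is the extra path information.
    path_info = url_path[len(script_name):]
    # Simplified translation: prepend the document root (assumption).
    translated = document_root.rstrip("/") + path_info if path_info else ""
    return {
        "SCRIPT_NAME": script_name,
        "PATH_INFO": path_info,
        "PATH_TRANSLATED": translated,
    }


env = cgi_variables("/cgi-bin/idx/site.idx", "/cgi-bin/idx", "/www/docs")
```

The script then only has to open the file named by env["PATH_TRANSLATED"], which is how one script can serve index files in any directory.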

As I think I've said before, my belief is that stuff which can be
reasonably implemented as CGI scripts should be, just to keep creeping
featuritis out of the server itself. I'm afraid you still haven't
made a convincing case for an exception here --- the server would be
doing nothing which a CGI script couldn't do just as well, and I'm not
convinced the efficiency gain is worth the complication.

If you've tried it with a script-based approach, and found something
which simply cannot be done, or cannot be done effectively, without
some kind of hooks into the server, I might feel differently. But as
it is, I'm not at all convinced.

rst

Re: indexing suggestion

Apr 12, 1995, 3:56 PM

Post #7 of 47

> Someone told me the other day that it's pointless just saying "it's easy".
> Implement it and show people how easy it was. I will have a crack at it
> today.

This is now Patch E68

It's in the incoming directory. Will someone move it into position
for me please.

incoming/E68_simple_indexer.txt.v2

It works, it's unintrusive, quick and it could be popular.

robh

Re: indexing suggestion

Apr 12, 1995, 4:06 PM

Post #8 of 47

To see it working...
http://ooo.lanl.gov/try.indx

Re: indexing suggestion

Apr 12, 1995, 4:59 PM

Post #9 of 47

> incoming/E68_simple_indexer.txt.v2

v3 fixes a typo in the HTML that was being output, and correctly
logs the number of bytes sent.

Re: indexing suggestion

Apr 12, 1995, 5:35 PM

Post #10 of 47

> Andy suggested WAIS and glimpse. This is something different -
> the resource owners decide what goes into the index, and how it
> is described (the ALIWEB approach). The index files will typically
> be small

WAIS (definitely - not sure about glimpse) can index anything. The ideal
solution (if you used WAIS) would be for authors to decide what they wanted
to appear in the robsownformat.idx files in each of their directories. WAIS
would then index the robsownformat.idx files - *not* the entire *.html space.
Authors can add their own keywords etc, etc, etc.

Authors would still get the final say about what was searchable - it would
*NOT* be an indexing of ALL the files that the server (or server admin)
knew about.

> If all of this was expensive, I too would have my doubts, and would
> suggest it be CGI'ed. But it's so easy to bolt on to the existing code,
> and by being based on simple format and searching principles, it shouldn't
> have an impact on server performance.
>
>
> w.r.t robots, they could ask for the raw index file and use that to
> build an ALIWEB style of index - one which is far superior to existing
> "grab everything and guess" robot indexing techniques.
>
>
> Someone told me the other day that it's pointless just saying "it's easy".
> Implement it and show people how easy it was. I will have a crack at it
> today.

It's easy. The server's already doing most of the hard work - directory
hopping, looking for files etc. The point is do you want Apache to
hardcode a preference for any given .idx format, when a sexy PERL script
and a decent Makefile (yeah with WAIS or whatever) can do the same thing?

> robh

[I've done the WAIS thing already Rob, but go for it anyhow]

Ay.

Re: indexing suggestion

Apr 12, 1995, 10:40 PM

Post #11 of 47

Last time, Rob Hartill uttered the following other thing:
>
>
>
> To see it working...
> http://ooo.lanl.gov/try.indx

I noticed that you have :
PATH=.:/bin:/usr/local/bin:/usr/bin:/users/hartill/bin:/usr/bin/X11:/etc

You really shouldn't have the . path first in the list. Besides being
bad practice from a security standpoint, we found on hoohoo that our
uptime script did a `which uptime` first, and found itself, called itself
in a loop, and did a fair job of crashing the machine (not quite, but
damn hard to find, esp. when we thought it was a bug in the server).
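
[Editor's sketch] A quick check for the problem described above. It flags any PATH entry that refers to the current directory; on POSIX systems an empty entry means the same thing as ".". The function name is made up for this illustration.

```python
def risky_path_entries(path):
    """Return (position, entry) pairs that refer to the current directory."""
    risky = []
    for pos, entry in enumerate(path.split(":")):
        # An empty entry also means "current directory" on POSIX systems.
        if entry in ("", "."):
            risky.append((pos, entry))
    return risky


PATH = ".:/bin:/usr/local/bin:/usr/bin:/users/hartill/bin:/usr/bin/X11:/etc"
found = risky_path_entries(PATH)
```

A position of 0 is the worst case, since a program in the current directory then shadows every system command of the same name, which is how a script called uptime can end up invoking itself.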

Brandon

--
Brandon Long (N9WUC) "I think, therefore, I am confused." -- RAW
Computer Engineering Run Linux 1.1.xxx It's that Easy.
University of Illinois blong@uiuc.edu http://www.uiuc.edu/ph/www/blong
Don't worry, these aren't even my views.

Re: indexing suggestion

Apr 13, 1995, 8:52 AM

Post #12 of 47

Grumble, grumble. I see that indexing in the server isn't likely to
get past the voting stage, even though other things like imagemaps
and content-negotiation are equally valid candidates for cgi, it is
considered favorable to have them inside rather than out.

On the assumption that my proposal is heading for a veto, I'd
at least like to see changes to the counter proposal.

> http://www.ai.mit.edu/cgi-bin/idx/site.idx

I'd like to see Apache act on the MIME type mapping for .idx
instead of having "/cgi-bin/idx/" prefixes to URLs.

In the long term, this would make it more flexible, in that
one could immediately change the characteristics of all .idx/indx
URLs without changing or redirecting the well established URLs.

As it stands, all Rob T is proposing is an idea which can already
be implemented in 1.3 (I've had such a system running at Cardiff
for well over a year). That's not to say that old is bad, but
anyone can use this method on top of an Apache-built-in system if
they wanted to anyway.

The advantages of the original proposal haven't gone away, it'll
be faster (no fork ultimately) and it's based on a much simpler
(more restrictive you might argue) syntax.

On the busy Cardiff server we find that under heavy traffic, the
first services to melt are the ones based on perl cgi. And they
have a habit of dragging the rest of the system down with them.

> If it's popular we can think about moving it into the server proper,
> when that is clearly appropriate. Right now, I don't think it is.

I think it'd become more popular if it were in the server from the
outset. The cgi approach has been there for people to use for
over a year, it simply didn't catch on. We can always extend the
syntax to meet changing needs at a later date.

It's open to a vote.

robh

Re: indexing suggestion

Apr 13, 1995, 9:16 AM

Post #13 of 47

> I noticed that you have :
> PATH=.:/bin:/usr/local/bin:/usr/bin:/users/hartill/bin:/usr/bin/X11:/etc
>
> You really shouldn't have the . path first in the list. Besides being
> bad practice from a security standpoint, we found on hoohoo that our
> uptime script did a `which uptime` first, and found itself, called itself
> in a loop, and did a fair job of crashing the machine (not quite, but
> damn hard to find, esp. when we thought it was a bug in the server).

I noticed that with some NCSA bundled scripts.

It shouldn't be picking up my path for this anyway - will have to
fix that.

I'll move my "." elsewhere.

cheers,
rob

Re: indexing suggestion

Apr 13, 1995, 9:54 AM

Post #14 of 47

From: Rob Hartill <hartill@ooo.lanl.gov>
Date: Wed, 12 Apr 95 15:56:36 MDT
> Someone told me the other day that it's pointless just saying "it's easy".
> Implement it and show people how easy it was. I will have a crack at it
> today.

As an argument for my counterproposal of doing it as a CGI script, I
have implemented that --- it took me all of 30 seconds to replace the

$indexfilename = '/index/path/here';

with

$indexfilename = $ENV{'PATH_TRANSLATED'};

at the top of the file. The resulting script is in
E68_simple_indexer.pl in the patches/for_Apache_0.5.1 directory on
hyperreal; you can try it out with

http://www.ai.mit.edu/cgi-bin/idx/site.idx

This searches the index which can be retrieved directly at

http://www.ai.mit.edu/site.idx

This is not the world's greatest site index (it's maintained by a
script which looks for <META> fields in documents, which I regretfully
consider an experiment that failed), but it does demonstrate that the
basic functionality works, assuming that someone has a *.idx file
which has more useful information. Anyone who's capable of managing
the new imagemap script can easily manage this as well.

Rob says of his thing:

It works, it's unintrusive, quick and it could be popular.

Mine also works, it's a great deal less intrusive (wild pointers or
memory leaks in the code can't compromise the integrity of even a
non-forking server), it's quick, it can be easily modified to suit
local conditions by webmasters who don't want to mess with the server
code itself, it doesn't commit the server code to any particular index
format, and it can be replaced at will with any of the more capable
search engines that are widely available.

If it's popular we can think about moving it into the server proper,
when that is clearly appropriate. Right now, I don't think it is.

rst

Re: indexing suggestion

Apr 13, 1995, 10:09 AM

Post #15 of 47

From: rst@ai.mit.edu (Robert S. Thau)
Date: Thu, 13 Apr 95 09:54:51 EDT

Mine also works, it's a great deal less intrusive (wild pointers or
memory leaks in the code can't compromise the integrity of even a
non-forking server), it's quick, it can be easily modified to suit
local conditions by webmasters who don't want to mess with the server
code itself, it doesn't commit the server code to any particular index
format, and it can be replaced at will with any of the more capable
search engines that are widely available.

...but wait! There's more! It parses the full Aliweb IANA template
syntax, which means that people who already have Aliweb site indexes
can use them directly! And the format has other advantages... in
addition to trivia like having more mnemonic field names, it supports
multi-line descriptions and keyword lists (using the usual RFC822
continuation syntax)!
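
[Editor's sketch] The RFC822 continuation rule mentioned above says that a line starting with a space or tab continues the previous field. A minimal unfolding step, written here only as an illustration and not taken from the script under discussion, could be:

```python
def unfold(lines):
    """Join RFC822-style continuation lines onto the preceding field."""
    unfolded = []
    for line in lines:
        if line[:1] in (" ", "\t") and unfolded:
            # Leading whitespace marks a continuation of the last field.
            unfolded[-1] += " " + line.strip()
        else:
            unfolded.append(line.rstrip())
    return unfolded


joined = unfold([
    "Keywords: movies, film,",
    "  cinema, reviews",
    "URI: /Movies/index.html",
])
```

After unfolding, each field is back on one line and can be split on its first colon as usual.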

It slices! It dices! It chops! And if you order now, you get a free
turnip twaddler!

rst

Re: indexing suggestion

Apr 13, 1995, 11:49 AM

Post #16 of 47

Date: Thu, 13 Apr 95 16:07 BST
From: drtr@ast.cam.ac.uk (David Robinson)
Precedence: bulk
Reply-To: new-httpd@hyperreal.com

>I'd like to see Apache act on the MIME type mapping for .idx
>instead of having "/cgi-bin/idx/" prefixes to URLs.

So how about a mime type for running a CGI script, e.g.
AddType application/x-script-parsed .idx /cgi-bin/idx

The specified script would be called with the document URL as PATH_INFO,
and the document file as PATH_TRANSLATED.

I *really* like this idea... here are a few possible improvements.
Instead of overloading PATH_INFO, give the script the document URL as
DOCUMENT_URI, as is currently done for server-side includes. This is
a bit more consistent with the includes functionality, and it also
lets the script get at the real PATH_INFO, if any was supplied.

Also, instead of having a three-argument AddType directive, it might
be better to have a separate AddHandler directive --- this would allow
users to easily declare handlers for a MIME type with multiple
suffixes, or for their DefaultType, without having to repeat the name
of the handler several times, e.g.

AddHandler text/plain /cgi-bin/format_setext_stuff

Finally, substitute directory indexing routines could be declared as
handlers for an appropriately chosen MIME type, say

AddHandler application/x-unix-directory /cgi-bin/read-4dos-indexes

It's a little late to get this in for this week's vote, but I'll
probably implement it over the weekend, as specified above, if no one
has any strong objections.
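
[Editor's sketch] The point of keying handlers on MIME type rather than on suffix can be shown with two lookup tables. The suffix table and the resolve() function below are hypothetical stand-ins for the server's type mapping; only the text/plain handler path comes from the example above, and none of this reflects how the directive was eventually implemented.

```python
import os

# Suffix-to-type mapping, as AddType entries might define it (hypothetical).
TYPES = {".txt": "text/plain", ".text": "text/plain", ".html": "text/html"}
DEFAULT_TYPE = "text/plain"

# One handler entry per MIME type, as in the AddHandler proposal.
HANDLERS = {"text/plain": "/cgi-bin/format_setext_stuff"}


def resolve(filename):
    """Return (mime_type, handler_or_None) for a file name."""
    suffix = os.path.splitext(filename)[1]
    mime = TYPES.get(suffix, DEFAULT_TYPE)
    return mime, HANDLERS.get(mime)
```

Here notes.txt, notes.text and a suffix-less README all reach the same handler through a single HANDLERS entry, while index.html has no handler and would be served normally.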

rst

Re: indexing suggestion

Apr 13, 1995, 12:11 PM

Post #17 of 47

On Thu, 13 Apr 1995, Robert S. Thau wrote:
> It slices! It dices! It chops! And if you order now, you get a free
> turnip twaddler!

I'm not so sure I want my turnips twaddled, thanks.

Brian

--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@hotwired.com brian@hyperreal.com http://www.hotwired.com/Staff/brian/

Re: indexing suggestion

Apr 13, 1995, 4:07 PM

Post #18 of 47

>I'd like to see Apache act on the MIME type mapping for .idx
>instead of having "/cgi-bin/idx/" prefixes to URLs.

So how about a mime type for running a CGI script, e.g.
AddType application/x-script-parsed .idx /cgi-bin/idx

The specified script would be called with the document URL as PATH_INFO,
and the document file as PATH_TRANSLATED.

This could be an alternative way of having a script generate an index for
a directory, e.g.
AddType application/x-script-parsed .4dos /cgi-bin/index
DirectoryIndex index.html index.4dos

which would cause the cgi script to be run if there were a 4dos-style
description file in the directory.

David.

Re: indexing suggestion

Apr 13, 1995, 6:09 PM

Post #19 of 47

On Thu, 13 Apr 1995, Robert S. Thau wrote:
> Date: Thu, 13 Apr 95 16:07 BST
> From: drtr@ast.cam.ac.uk (David Robinson)
> Precedence: bulk
> Reply-To: new-httpd@hyperreal.com
>
> >I'd like to see Apache act on the MIME type mapping for .idx
> >instead of having "/cgi-bin/idx/" prefixes to URLs.
>
> So how about a mime type for running a CGI script, e.g.
> AddType application/x-script-parsed .idx /cgi-bin/idx

Actually, this is sorta the same as, for example, putting

#!/www/cgi-bin/idx

at the top of an .idx file and making .idx a recognized CGI script type,
yes? I use this kind of mechanism for imagemaps, where I put a

#!/usr/local/bin/imagemap-new

at the top of map files - imagemap-new is a slightly modified version
which understands being called this way.

However, comparing this to something like

AddType application/x-script-parsed .imap /cgi-bin/imagemap

I think the latter is better (certainly more general).

> Also, instead of having a three-argument AddType directive, it might
> be better to have a separate AddHandler directive --- this would allow
> users to easily declare handlers for a MIME type with multiple
> suffixes, or for their DefaultType, without having to repeat the name
> of the handler several times, e.g.
>
> AddHandler text/plain /cgi-bin/format_setext_stuff
>
> Finally, substitute directory indexing routines could be declared as
> handlers for an appropriately chosen MIME type, say
>
> AddHandler application/x-unix-directory /cgi-bin/read-4dos-indexes

The first is fine - the WN server allows one to specify a particular
"filter" to be applied to URL objects, so this is similar. This also
means we don't have to create a new bogus MIME type, which is the reason
why I don't like the second AddHandler example. Rob McCool, is there a
public specification of your NetSite server API anywhere? It seems
like there must be a more general way of modularizing server capabilities
than defining new bogus MIME types.

Brian

Re: indexing suggestion

Apr 16, 1995, 12:04 PM

Post #20 of 47

>
> Re PATH_INFO; if /dir/file.ext is a regular (unix) file, then accessing
> /dir/file.ext/path_info will fail.
>
> Not currently --- the PATH_INFO is simply ignored in this case. I
> personally see no compelling reason to change this, although as we all
> will recall, Rob H. vehemently disagrees. However, I do think that
> PATH_INFO should clearly be allowed anywhere that a CGI script might
> get into the mix.

I can't follow this discussion, perhaps because I just got out
of bed, so I'm lost as to what I vehemently disagreed to here.
Please remind me.

robh

Re: indexing suggestion

Apr 16, 1995, 12:07 PM

Post #21 of 47

Date: Sun, 16 Apr 95 16:48 BST
From: drtr@ast.cam.ac.uk (David Robinson)

Yes, probably better, although DOCUMENT_URI isn't part of the CGI spec.
Currently you can only have PATH_INFO for server-side includes or CGI scripts.
(See below.)

Actually, when a script is invoked via  from a
server-side-includes document, it gets both PATH_INFO *and*
DOCUMENT_URI set --- see

http://www.ai.mit.edu/xperimental/foo.shtml/path/info?query+string

and look at the results of /cgi-bin/printenv which are included at the
bottom. Note also that at least in this case, you actually do need to
set PATH_INFO to something *different* from the DOCUMENT_URI, or lose
useful information about the actual request.

I think this would be amazingly useful. For example, the patchlog database
runs each patch file through an html converter for sending to the user.
So you read a patch with a URL like
http://host/dir/bugread.cgi?id=00001
Currently, this could be slightly better if the id was passed in the path
info as http://host/dir/bugread.cgi/00001

Whereas with your suggestion, I would be able to present the bug files as
http://host/dir/bugs/00001
and httpd would automatically run the cgi script to format the file.

That way, the index produced by http://host/dir/bugs/ would have links which
would return the formatted documents, rather than the plain files.

You'd need per-directory DefaultTypes (another potentially useful
extension, though some care would be needed in implementation in the
non-forking case), or an extension on the names of the buglog files
themselves, but that is the general idea.

Re PATH_INFO; if /dir/file.ext is a regular (unix) file, then accessing
/dir/file.ext/path_info will fail.

Not currently --- the PATH_INFO is simply ignored in this case. I
personally see no compelling reason to change this, although as we all
will recall, Rob H. vehemently disagrees. However, I do think that
PATH_INFO should clearly be allowed anywhere that a CGI script might
get into the mix.

And how about using this mechanism for content-type based translation?
Suppose the request specifies acceptance of image/gif files, but not
image/jpeg, and a gif version does not exist. Then we might have
AddTranslator image/jpeg image/gif /cgi-bin/jpeg2gif
Obviously, this could be done with an AddHandler CGI script which
checked the Accept headers.

It could --- I suspect that if anyone started making serious use of
such a feature, they'd want it inside the server for performance
reasons, but AddHandler would at least allow it to be prototyped
more easily.
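
[Editor's sketch] A rough illustration of the translation idea, assuming the stored file is a JPEG and the client's Accept header lists only GIF. The table layout and function names are invented for this example. Quality values and partial wildcards such as image/* are ignored, so this is much cruder than real content negotiation.

```python
# (stored type, wanted type) -> script, as AddTranslator might declare it.
TRANSLATORS = {("image/jpeg", "image/gif"): "/cgi-bin/jpeg2gif"}


def accepted_types(accept_header):
    """Extract bare media types from an Accept header, dropping parameters."""
    return {item.split(";")[0].strip()
            for item in accept_header.split(",") if item.strip()}


def pick_translator(stored_type, accept_header):
    """Decide whether to serve the file, translate it, or give up."""
    accepted = accepted_types(accept_header)
    if stored_type in accepted or "*/*" in accepted:
        return ("serve", None)
    for (src, dst), script in TRANSLATORS.items():
        if src == stored_type and dst in accepted:
            return ("translate", script)
    return ("unacceptable", None)
```

With the single table entry above, a JPEG requested by a GIF-only client is routed through /cgi-bin/jpeg2gif, and a client that accepts JPEG gets the file untouched.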

>Finally, substitute directory indexing routines could be declared as
>handlers for an appropriately chosen MIME type, say
>
> AddHandler application/x-unix-directory /cgi-bin/read-4dos-indexes

How would this interact with the multiple DirectoryIndex config?

The Handler would only come into play if *none* of the files named in
the DirectoryIndex directive are found in that directory (including
MultiViews searches, if MultiViews is on).
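
[Editor's sketch] The precedence described above, with the directory handler consulted only when no DirectoryIndex file is present, can be written as a short lookup. MultiViews matching is left out, and the function name and return values are assumptions made for illustration.

```python
def resolve_directory(listing, directory_index, handler=None):
    """Pick what to serve for a directory request."""
    # DirectoryIndex names are tried in the order they were configured.
    for name in directory_index:
        if name in listing:
            return ("file", name)
    # Only when none of them exists does the handler come into play.
    if handler:
        return ("handler", handler)
    return ("listing", None)


INDEX_NAMES = ["index.html", "index.4dos"]
HANDLER = "/cgi-bin/read-4dos-indexes"
```

A directory containing index.html is served from that file even when a handler is configured; a directory with neither index file falls through to the handler script.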

NB I'm starting to code this now; it doesn't look hard...

rst

Re: indexing suggestion

Apr 16, 1995, 4:48 PM

Post #22 of 47

Rst wrote:
> From: drtr@ast.cam.ac.uk (David Robinson)
>
> >I'd like to see Apache act on the MIME type mapping for .idx
> >instead of having "/cgi-bin/idx/" prefixes to URLs.
>
> So how about a mime type for running a CGI script, e.g.
> AddType application/x-script-parsed .idx /cgi-bin/idx
>
> The specified script would be called with the document URL as PATH_INFO,
> and the document file as PATH_TRANSLATED.

>I *really* like this idea... here are a few possible improvements.
>Instead of overloading PATH_INFO, give the script the document URL as
>DOCUMENT_URI, as is currently done for server-side includes. This is
>a bit more consistent with the includes functionality, and it also
>lets the script get at the real PATH_INFO, if any was supplied.

Yes, probably better, although DOCUMENT_URI isn't part of the CGI spec.
Currently you can only have PATH_INFO for server-side includes or CGI scripts.
(See below.)

>Also, instead of having a three-argument AddType directive, it might
>be better to have a separate AddHandler directive --- this would allow
>users to easily declare handlers for a MIME type with multiple
>suffixes, or for their DefaultType, without having to repeat the name
>of the handler several times, e.g.
>
> AddHandler text/plain /cgi-bin/format_setext_stuff

I think this would be amazingly useful. For example, the patchlog database
runs each patch file through an html converter for sending to the user.
So you read a patch with a URL like
http://host/dir/bugread.cgi?id=00001
Currently, this could be slightly better if the id was passed in the path
info as http://host/dir/bugread.cgi/00001

Whereas with your suggestion, I would be able to present the bug files as
http://host/dir/bugs/00001
and httpd would automatically run the cgi script to format the file.

That way, the index produced by http://host/dir/bugs/ would have links which
would return the formatted documents, rather than the plain files.

Re PATH_INFO; if /dir/file.ext is a regular (unix) file, then accessing
/dir/file.ext/path_info will fail. Should this still apply if file.ext
is subject to a CGI handler? If the use of handlers is as 'output filters',
then httpd should probably reject such a request. However, I imagine there
might be cases where the files don't represent objects to be filtered, so
extra PATH_INFO might be useful.

And how about using this mechanism for content-type based translation?
Suppose the request specifies acceptance of image/gif files, but not
image/jpeg, and a gif version does not exist. Then we might have
AddTranslator image/jpeg image/gif /cgi-bin/jpeg2gif
Obviously, this could be done with an AddHandler CGI script which
checked the Accept headers.

>Finally, substitute directory indexing routines could be declared as
>handlers for an appropriately chosen MIME type, say
>
> AddHandler application/x-unix-directory /cgi-bin/read-4dos-indexes

How would this interact with the multiple DirectoryIndex config?

David.

Re: indexing suggestion

Apr 16, 1995, 6:34 PM

Post #23 of 47

From: Rob Hartill <hartill@ooo.lanl.gov>
Date: Sun, 16 Apr 95 12:04:32 MDT

>
> Re PATH_INFO; if /dir/file.ext is a regular (unix) file, then accessing
> /dir/file.ext/path_info will fail.
>
> Not currently --- the PATH_INFO is simply ignored in this case. I
> personally see no compelling reason to change this, although as we all
> will recall, Rob H. vehemently disagrees. However, I do think that
> PATH_INFO should clearly be allowed anywhere that a CGI script might
> get into the mix.

I can't follow this discussion, perhaps because I just got out
of bed, so I'm lost as to what I vehemently disagreed to here.
Please remind me.

If I remember right, you were fairly insistent earlier on that
/dir/file.ext/ was "incorrect" and should be bounced with a 404;
current behavior (as I indicated above) is simply to retrieve the file
in this case and ignore the '/' (which is a simple case of the general
/dir/file.ext/path/info).
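
[Editor's sketch] The "general /dir/file.ext/path/info" case amounts to finding the prefix of the URL path that names a regular file and treating the remainder as PATH_INFO. The version below takes an is_file callback instead of touching the filesystem, so it stays self-contained; it is an illustration only and not the server's actual lookup code.

```python
def split_path_info(url_path, is_file):
    """Split a URL path into (file part, PATH_INFO remainder)."""
    parts = url_path.split("/")
    for i in range(1, len(parts) + 1):
        candidate = "/".join(parts[:i])
        # The first prefix that names a regular file wins; nothing can
        # live underneath a regular file, so the rest is extra path.
        if candidate and is_file(candidate):
            return candidate, url_path[len(candidate):]
    return None, ""


FILES = {"/dir/file.ext"}
result = split_path_info("/dir/file.ext/path/info", FILES.__contains__)
```

A trailing slash alone gives a PATH_INFO of "/", which is the simple case discussed above; whether the server should then ignore it or return 404 is the policy question under debate.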

rst

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...)

Apr 17, 1995, 8:29 AM

Post #24 of 47

Last time, Robert S. Thau uttered the following other thing:
>
> Date: Mon, 17 Apr 95 14:30 BST
> From: drtr@ast.cam.ac.uk (David Robinson)
>
> Rst wrote:
> >...You'd need per-directory DefaultTypes (another potentially useful
> >extension, though some care would be needed in implementation in the
> >non-forking case)...
>
> This is already available; it is a feature of NCSA httpd 1.3.
>
> Hmmm... it's still there in 1.4, and still implemented as a simple
> strcpy() into the default_type variable --- however, I can't see
> anyplace where it saves the srm.conf DefaultType value before
> overwriting it (and hence, it can't restore DefaultType to that value
> before the next transaction).
>
> (That's the subtlety I was alluding to... Rob H., if you've taken over
> the non-forking stuff, I guess this is in your bailiwick).

argh. You know, I'm beginning to understand why Netsite doesn't have
local directory config files, just central ones that define everything.
It makes it a lot easier.

fixed.

> would *clearly* be improper with XBITHACK on --- if the XBIT is set on
> /file.html/, then this is a reference to a server-side-includes file
> with PATH_INFO, and the correct thing is very definitely to process
> the file, and pass along the PATH_INFO to any scripts it happens to
> invoke.
>
> At any rate, this is such a minor issue that I can't see fussing with
> it at all until after beta 1.

1.4 returns the file as html, PATH_INFO set /. I'd feel this is the proper
course of action.

odd, 1.4 returns 403 on /index.html/a

Brandon

--
Brandon Long (N9WUC) "I think, therefore, I am confused." -- RAW
Computer Engineering Run Linux 1.1.xxx It's that Easy.
University of Illinois blong@uiuc.edu http://www.uiuc.edu/ph/www/blong
Don't worry, these aren't even my views.

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 9:10 AM

Post #25 of 47 (6558 views)

> (That's the subtlety I was alluding to... Rob H., if you've taken over
> the non-forking stuff, I guess this is in your bailiwick).

parse_access_dir() in non-forking code is an unmunge_name()
waiting to happen.

Re: Trailing slash stuff..

> if the XBIT is set on
> /file.html/, then this is a reference to a server-side-includes file
> with PATH_INFO, and the correct thing is very definitely to process
^^^^^^^
> the file, and pass along the PATH_INFO to any scripts it happens to
> invoke.

I'll agree with that, if and only if you can point me to the CGI
documentation which defines this behaviour. If that isn't documented,
I'd agree that index.html/ and index.html/a should return a 404 Not
Found.

Until we get HTTP/1.1, and the ability to add a BASE to the header, I
think it is too dangerous (w.r.t broken relative URLs) to service these
requests. If 1.3 behaves as David described, we won't be breaking
anything by 404'ing them.
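
The relative-URL worry can be made concrete with a minimal sketch. It resolves one relative link against three URLs that all return the same file under the Apache 0.5 behaviour described in this thread. The host name and file names are made up for illustration; `urljoin` is from the Python standard library and resolves links the way a browser would.

```python
from urllib.parse import urljoin


def resolve(page_url, relative_link):
    """Resolve a relative link against the URL the page was fetched from."""
    return urljoin(page_url, relative_link)


# Hypothetical URLs: the same index.html reached with and without extra path.
plain = resolve("http://host.example/docs/index.html", "logo.gif")
with_slash = resolve("http://host.example/docs/index.html/", "logo.gif")
with_info = resolve("http://host.example/docs/index.html/a", "logo.gif")

print(plain)       # http://host.example/docs/logo.gif
print(with_slash)  # http://host.example/docs/index.html/logo.gif
print(with_info)   # http://host.example/docs/index.html/logo.gif
```

Only the first form points at the image's real location; the other two make the client ask for a file "inside" index.html.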

rob

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 10:03 AM

Post #26 of 47 (5043 views)

> Note that it doesn't say "...followed by extra information... The
> extra information is sent as PATH_INFO, unless it happens to consist
> entirely of '/' characters, in which case we do some other thing with
> it".

I guess this is one level too far down, we need to see what the SSI
docs say about passing PATH_INFO from the current document (regular
html files don't have PATH_INFO), to the included cgi. I can't
reach the docs at the moment, but I'd be surprised if they explicitly
say that cgi includes assume the PATH_INFO of the parent document.

That's not to say they can't though.

> Is there a security hole here?

no. I shouldn't say that, I should say "I don't think so".

> If not, why is it "dangerous"? (I can
> see how it could be a little confusing, but only to people who brought
> it on themselves...).

That's the danger... It's going to confuse a hell of a lot of people,
because it's a half-baked system, with too many pitfalls. It only
makes sense to have the PATH_INFO, if the document has SSI, and the
SSI makes use of CGI - that narrows it down a bit.
Do we test for the other cases, and "404 Not Found" them ?

Nobody can be using this yet, 'cos NCSA 1.3/1.4 won't allow it. Is
it so desirable to have this, that we can live with the relative URL
problems that come with it ? I think not. 404 everything until HTTP/1.1
makes it fool proof.

robh

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 10:07 AM

Post #27 of 47 (5047 views)

Date: Mon, 17 Apr 95 14:30 BST
From: drtr@ast.cam.ac.uk (David Robinson)

Rst wrote:
>...You'd need per-directory DefaultTypes (another potentially useful
>extension, though some care would be needed in implementation in the
>non-forking case)...

This is already available; it is a feature of NCSA httpd 1.3.

Hmmm... it's still there in 1.4, and still implemented as a simple
strcpy() into the default_type variable --- however, I can't see
anyplace where it saves the srm.conf DefaultType value before
overwriting it (and hence, it can't restore DefaultType to that value
before the next transaction).

(That's the subtlety I was alluding to... Rob H., if you've taken over
the non-forking stuff, I guess this is in your bailiwick).
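
A minimal sketch of the subtlety, assuming a long-lived (non-forking) server process that keeps DefaultType in a single global. The names and the "text/html" default are illustrative stand-ins, not the NCSA source: the buggy variant overwrites the global and leaks the per-directory value into the next transaction, while the fixed variant saves and restores it.

```python
# Hypothetical stand-ins for the server's globals.
SERVER_DEFAULT_TYPE = "text/html"   # assumed value read from srm.conf
default_type = SERVER_DEFAULT_TYPE  # global consulted while serving a request


def serve_buggy(dir_default_type):
    """Apply a per-directory DefaultType by overwriting the global for good."""
    global default_type
    if dir_default_type is not None:
        default_type = dir_default_type
    return default_type


def serve_fixed(dir_default_type):
    """Apply a per-directory DefaultType, restoring the old value afterwards."""
    global default_type
    saved = default_type
    try:
        if dir_default_type is not None:
            default_type = dir_default_type
        return default_type
    finally:
        default_type = saved


# The second request has no per-directory override in either case.
print(serve_buggy("text/plain"), serve_buggy(None))  # text/plain text/plain
default_type = SERVER_DEFAULT_TYPE                   # reset for the next demo
print(serve_fixed("text/plain"), serve_fixed(None))  # text/plain text/html
```

In a forking server the problem is hidden because each child exits after one transaction and takes the overwritten global with it.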

> Re PATH_INFO; if /dir/file.ext is a regular (unix) file, then accessing
> /dir/file.ext/path_info will fail.
>
>Not currently --- the PATH_INFO is simply ignored in this case. I
>personally see no compelling reason to change this, although as we all
>will recall, Rob H. vehemently disagrees. However, I do think that
>PATH_INFO should clearly be allowed anywhere that a CGI script might
>get into the mix.

Urgle.
I think you did change this with B23.

Okay, I don't see any compelling reason to change it *back* ;-).

I based my comments on NCSA httpd 1.3, which has the following behaviour:
    GET /index.html/a HTTP/1.0
        returns 403 Forbidden
    GET /index.html/ HTTP/1.0
        returns index.html, but with Content-type: text/plain

Whereas for apache 0.5
    GET /index.html/a HTTP/1.0
        returns index.html
    GET /index.html/ HTTP/1.0
        returns index.html

The current apache behaviour is wrong.

Really? It is not at all obvious to me what the server should or
should not be doing in this case. In particular, the behavior you
propose below:

For /index.html/a it should give 404 Not Found.
For /index.html/ it should give 404 Not Found, or perhaps a redirect to
/index.html, as in general void path segments are not considered significant.

would *clearly* be improper with XBITHACK on --- if the XBIT is set on
/file.html/, then this is a reference to a server-side-includes file
with PATH_INFO, and the correct thing is very definitely to process
the file, and pass along the PATH_INFO to any scripts it happens to
invoke.

At any rate, this is such a minor issue that I can't see fussing with
it at all until after beta 1.

rst

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 10:20 AM

Post #28 of 47 (5046 views)

> Nobody can be using this yet, 'cos NCSA 1.3/1.4 won't allow it. Is
> it so desirable to have this, that we can live with the relative URL
> problems that come with it ? I think not. 404 everything until HTTP/1.1
> makes it fool proof.

Incidentally, I dropped my proposal for a configurable setting to
"404 Not Found" directories that were missing a "/", after Roy
pointed out that HTTP/1.1 will offer a permanent solution.

I could live with the current Apache implementation (or whatever is
decided about how the variables are set), if it weren't documented.
Keep it quiet until 1.1 makes it bullet proof.

robh

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 10:27 AM

Post #29 of 47 (5035 views)

>
> From: Rob Hartill <hartill@ooo.lanl.gov>
> Date: Mon, 17 Apr 95 10:03:53 MDT
>
> I guess this is one level too far down, we need to see what the SSI
> docs say about passing PATH_INFO from the current document (regular
> html files don't have PATH_INFO), to the included cgi. I can't
> reach the docs at the moment, but I'd be surprised if they explicitly
> say that cgi includes assume the PATH_INFO of the parent document.
>
> That's not to say they can't though.
>
> As I said, people are using it.

How can people be using it? 1.3 doesn't let you add path info
to say /index.html

I just tried it and got a "403 Forbidden" out of 1.3

Hmm, now that's a dumb response.

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 10:44 AM

Post #30 of 47 (5041 views)

David wrote
> argumentative comments. Maybe it would do some good if I tried to write an
> actual specification.

That'd be useful to add to the Apache docs. If there's no alternative
out there, it might become the official spec - and you get to play
god :-)

robh

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 11:33 AM

Post #31 of 47 (5047 views)

Date: Mon, 17 Apr 95 16:22 BST
From: drtr@ast.cam.ac.uk (David Robinson)

In fact, I don't understand why allowing PATH_INFO for server-side include
files is needed. I suspect it may be something to do with invoked CGI
script...

Precisely. There are people with fairly elaborate setups which use
PATH_INFO to pass information around between scripts which are invoked
from different SSI pages.

Anyway, if it is, then the ssi documentation should emphasise the point
that you cannot use relative links without a BASE tag.

When we get around to documenting it ourselves, rather than borrowing
the NCSA docs, sure.

And earlier, Rst wrote:
>(Note that the server requires PATH_INFO of '/' when actually invoking
>httpd/unix-directory handlers --- sending out a redirect as usual
>otherwise --- so the handler can generate HTML which looks like:
> <a href="file_in_dir.foo">some file in this directory</a>
>without worry).

Actually, I think this is the wrong way of thinking about this case.
I would rather the DOCUMENT_URI ended in a '/', and the PATH_INFO was NULL.
A null PATH_INFO makes more sense.

Again, not to me. If you have a request for /script.cgi/foo/bar/zot,
then the PATH_INFO winds up as /foo/bar/zot. I am unable to see why
PATH_INFO of a single '/' should be treated any differently. In any
case, the current behavior is by far the simplest thing for anyone
actually *writing* an httpd/unix-directory handler (and yes, I've
written one to test the feature); I think that justifies it.

rst

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 11:37 AM

Post #32 of 47 (5037 views)

Date: Mon, 17 Apr 95 16:30 BST
From: drtr@ast.cam.ac.uk (David Robinson)

Further thoughts, suppose I have a CGI script /cgi-bin/list and
an ssi /doc.shtml, what should happen in the event that I access
/doc.shtml/foo
which does
#exec cgi="/cgi-bin/list/bar"

should PATH_INFO for the CGI script be /foo or /bar ? Why would I ever want it
to be /foo ?

As explained earlier, you would want it to be /foo if your SSI
documents are using PATH_INFO to pass information around among
themselves (typically session-state information).

rst

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 11:52 AM

Post #33 of 47 (5040 views)

From: Rob Hartill <hartill@ooo.lanl.gov>
Date: Mon, 17 Apr 95 9:10:30 MDT

> if the XBIT is set on
> /file.html/, then this is a reference to a server-side-includes file
> with PATH_INFO, and the correct thing is very definitely to process
^^^^^^^
> the file, and pass along the PATH_INFO to any scripts it happens to
> invoke.

I'll agree with that, if and only if you can point me to the CGI
documentation which defines this behaviour. If that isn't documented,
I'd agree that index.html/ and index.html/a should return a 404 Not
Found.

Here is what the CGI docs have to say about PATH_INFO:

<LI> <CODE>PATH_INFO</CODE> <P>
The extra path information, as given by the client. In other
words, scripts can be accessed by their virtual pathname, followed
by extra information at the end of this path. The extra
information is sent as PATH_INFO. This information should be
decoded by the server if it comes from a URL before it is passed
to the CGI script.<P>

Note that it doesn't say "...followed by extra information... The
extra information is sent as PATH_INFO, unless it happens to consist
entirely of '/' characters, in which case we do some other thing with
it".

Until we get HTTP/1.1, and the ability to add a BASE to the header, I
think it is too dangerous (w.r.t broken relative URLs) to service these
requests. If 1.3 behaves as David described, we won't be breaking
anything by 404'ing them.

Is there a security hole here? If not, why is it "dangerous"? (I can
see how it could be a little confusing, but only to people who brought
it on themselves...).

rst

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 12:06 PM

Post #34 of 47 (5045 views)

Date: Mon, 17 Apr 95 16:39 BST
From: drtr@ast.cam.ac.uk (David Robinson)

I didn't mean in general, only for accessing a directory index.
I do not see that PATH_INFO would be relevant to an httpd/unix-directory
handler. But it's not the value of PATH_INFO that I mind; rather that
if PATH_INFO has the '/', then DOCUMENT_URI does not, which is surely wrong.

*Surely* wrong? The semantics of httpd/unix-directory handlers aren't
even defined yet! If you have an argument to make as to why these
handlers should see a '/' on DOCUMENT_URI (as they presently don't),
then please make it, but ex cathedra statements like this aren't
terribly compelling.

I've got two arguments in favor of the status quo:

One is simple consistency with other handlers --- the implementation
would have to be special-cased to add the '/' for unix-directory
handlers, and I'd prefer to avoid that.

The other, which is somewhat more important, is that the major use for
DOCUMENT_URI in actually *writing* an httpd/unix-directory handler is
in generating headers --- and for headers you'd probably want to strip
the '/' off, if it were present, for the sake of neatness.

FYI, a simple httpd/unix-directory handler is included below (it's
actually a cut-down version of my test code); I'd prefer to keep the
minimal unix-directory handler at least this simple unless there's a
fairly compelling reason to do otherwise. (Yes, it does work, and the
relative URLs it generates are resolved correctly).

(If we were actually shipping $ENV{'DOCUMENT_URI'} values back to the
client in the http headers, I'd see more of an argument for setting
them "correctly", by your version of correctness --- we shouldn't be
referring clients to URLs which we will only redirect instantly. But
we aren't sending these values back to clients in the headers, so that
issue is moot. By the looks of things here, *very* moot).

rst

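#! /usr/local/bin/perl

# The server hands the handler the directory's URI and its
# translated filesystem path in the environment.
$dir_uri = $ENV{'DOCUMENT_URI'};
$dirname = $ENV{'DOCUMENT_TRANSLATED'};

# Read the directory entries; give up if the directory can't be opened.
opendir (DIR, $dirname) || die "Can't open $dirname: $!";
local (@files) = readdir(DIR);
closedir(DIR);

# CGI header (terminated by a blank line), then the top of the page.
print <<EOF;
Content-type: text/html

<title>Index of $dir_uri</title>
<h1>Index of $dir_uri</h1>

<ul>
EOF

# Plain relative links are safe here, because the handler is only
# ever invoked with PATH_INFO of '/'.
foreach $entry (@files)
{
    print "<li> <a href=\"$entry\">$entry</a>\n";
}

print "</ul>\n";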

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 12:20 PM

Post #35 of 47 (5034 views)

From: Rob Hartill <hartill@ooo.lanl.gov>
Date: Mon, 17 Apr 95 10:03:53 MDT

I guess this is one level too far down, we need to see what the SSI
docs say about passing PATH_INFO from the current document (regular
html files don't have PATH_INFO), to the included cgi. I can't
reach the docs at the moment, but I'd be surprised if they explicitly
say that cgi includes assume the PATH_INFO of the parent document.

That's not to say they can't though.

As I said, people are using it.

> If not, why is it "dangerous"? (I can
> see how it could be a little confusing, but only to people who brought
> it on themselves...).

That's the danger... It's going to confuse a hell of a lot of people,
because it's a half-baked system, with too many pitfalls. It only
makes sense to have the PATH_INFO, if the document has SSI, and the
SSI makes use of CGI - that narrows it down a bit.
Do we test for the other cases, and "404 Not Found" them ?

Oh, what the heck. It turns out that there's an easy way to fix this
for the cases that people are worried about --- in send_file, add

if (path_args[0] != '\0') { ... die(...) }

*after* all the special-case checks for magic MIME types, but before
the fopen(). This wouldn't bother me too much --- what I'm really
worried about here is a B57-style "fix" for a very minor problem which
causes far more, and far nastier, problems than it solves due to
unforeseen side effects, but that's not likely here.
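
The ordering described above can be sketched as follows. This is a rough model, not the server's code: the two magic MIME types listed, the return values, and the choice of 404 as the error are assumptions for illustration (the fragment above elides what die() is called with).

```python
class HttpError(Exception):
    """Stand-in for the server's die(): aborts the request with a status."""

    def __init__(self, status):
        super().__init__(status)
        self.status = status


# Assumed set of "magic" MIME types whose handlers may use PATH_INFO.
MAGIC_TYPES = {"text/x-server-parsed-html", "application/x-httpd-cgi"}


def send_file(content_type, path_args):
    """Decide how to serve a file, given the leftover path after its name."""
    # 1. Special-case checks come first, so SSI and CGI keep their PATH_INFO.
    if content_type in MAGIC_TYPES:
        return "handler"
    # 2. Only then reject extra path on ordinary files, before any open().
    if path_args != "":
        raise HttpError(404)
    # 3. An ordinary file with no extra path is opened and sent as usual.
    return "plain file"


print(send_file("text/html", ""))                      # plain file
print(send_file("text/x-server-parsed-html", "/foo"))  # handler
```

Placing the check after the special cases is what keeps /doc.shtml/foo working while /index.html/a is refused.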

rst

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 12:25 PM

Post #36 of 47 (5039 views)

From: Rob Hartill <hartill@ooo.lanl.gov>
Date: Mon, 17 Apr 95 10:20:04 MDT

I could live with the current Apache implementation (or whatever is
decided about how the variables are set), if it weren't documented.
Keep it quiet until 1.1 makes it bullet proof.

Considering how shoddy we've been about documenting stuff which our
users might actually care about, I really don't think you have to
worry about anyone documenting *this* ;-).

rst

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 12:33 PM

Post #37 of 47 (5039 views)

How can people be using it? 1.3 doesn't let you add path info
to say /index.html

Hmm... it *does* let you add PATH_INFO to *.shtml documents. It may
not allow you to add them to *.html (even with the x-bit set and XBITHACK),
but in that case, I think it probably should just for consistency.

rst

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 12:57 PM

Post #38 of 47 (5038 views)

Date: Mon, 17 Apr 95 17:34 BST
From: drtr@ast.cam.ac.uk (David Robinson)
Precedence: bulk
Reply-To: new-httpd@hyperreal.com

I know...
My main problem is that I think the current ssi behaviour is completely
crap. The only spec is the source code itself, which is more broken than
a very broken thing, and upon which users probably depend. Hence my rather
argumentative comments. Maybe it would do some good if I tried to write an
actual specification.

OK, but given that users currently depend on the current behavior, I'm
not at all sure it's proper to change it incompatibly just because it
was never written down (even if you do think it's "completely crap").

Anyway, the only useful definitions of DOCUMENT_URI that I can think of are:
1. The path part of the URL for the requested resource.
2. The shortest URL formed by repeatedly removing path segments from the
path specified in 1 that would cause httpd to execute or parse the same
file (as for the full path).

It shouldn't be a surprise that I prefer 1.

Currently, for server-side includes, httpd implements some form of 2.
But I do think that for cgi scripts and handlers, knowing the actual URL
the user specified is important if you want to be able to return a document
with relative links. I would suggest that this is better to be completely
specified in the DOCUMENT_URI variable, than in the concatenation of
PATH_INFO and DOCUMENT_URI.

You can prefer that if you like --- but the current behavior parallels
the CGI case (user's full URL available only as the concatenation of
SCRIPT_NAME and PATH_INFO), and the CGI case really oughtn't change.

(Besides, in the handler context, which is what started this off, I
really think the current behavior is just more useful. A handler for
a discussion-forum type system will probably want to do things like:

  /some/forum/doc00001                --- the document

  /some/forum/doc00001/followup_form  --- a form which posts a
                                          follow-up note

  /some/forum/doc00001/post_followup  --- URL which actually posts
                                          the follow-up; this is why
                                          POSTs to documents with
                                          handlers work.

etc. The way DOCUMENT_URI is now defined for handlers, this works out
very conveniently for the author of the script --- the "command", if
any, is in PATH_INFO, and the relevant file is always named in
DOCUMENT_TRANSLATED, without any additional adornment that you'd have
to strip off. I've been thinking of actually implementing something
along these lines just so people could see how it fits together).
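
A minimal sketch of what such a handler's dispatch might look like, assuming the environment variables named in this thread (PATH_INFO, DOCUMENT_TRANSLATED) plus the standard CGI REQUEST_METHOD. The action names and the filesystem path are hypothetical; a real handler would emit HTML rather than return tuples.

```python
def forum_handler(env):
    """Pick an action from PATH_INFO; the file is always DOCUMENT_TRANSLATED."""
    command = env.get("PATH_INFO", "")
    document = env["DOCUMENT_TRANSLATED"]  # never needs stripping

    if command == "":
        return ("show_document", document)
    if command == "/followup_form":
        return ("show_followup_form", document)
    if command == "/post_followup" and env.get("REQUEST_METHOD") == "POST":
        return ("store_followup", document)
    return ("not_found", document)


# Hypothetical request for /some/forum/doc00001/followup_form
env = {
    "PATH_INFO": "/followup_form",
    "DOCUMENT_TRANSLATED": "/www/some/forum/doc00001",
    "REQUEST_METHOD": "GET",
}
print(forum_handler(env))  # ('show_followup_form', '/www/some/forum/doc00001')
```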

My original point was that, conceptually, I think about the URL for
a directory as having a trailing '/' as part of the resource name.
After all, I've been trained to think that way by the standard redirect
behaviour if the '/' is missing.

In fact, would you do a redirect for http://host/dir if it were handled
by a httpd/unix-directory handler? The CGI script could determine that the
user did not append a '/', and modify its relative links accordingly.

The current code does send these redirects even if a unix-directory
handler is defined --- that's why the relative URLs generated by the
sample handler I included in previous mail work correctly. (And yes,
that means that unix-directory handlers are *always* invoked with
PATH_INFO of '/'). I could be persuaded to change this, if I thought
it would actually make someone's job easier, but right now, it seems
to me that it goes more the other way...
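
The redirect-then-invoke behaviour described above can be modelled roughly as below. This is a sketch under assumptions: the function name and return shape are made up, and 302 is assumed as the redirect status since the thread does not say which code is sent.

```python
def dispatch_directory(request_path, directory_uri):
    """Model a request for a directory that has a unix-directory handler.

    request_path is what the client asked for (/dir or /dir/);
    directory_uri is the directory's URI without the trailing slash.
    """
    if request_path == directory_uri:
        # No trailing slash: bounce the client to the slash form first.
        return {"status": 302, "Location": directory_uri + "/"}
    if request_path == directory_uri + "/":
        # Slash form: the handler runs, and always sees PATH_INFO of '/'.
        return {"status": 200, "DOCUMENT_URI": directory_uri, "PATH_INFO": "/"}
    raise ValueError("not a request for this directory")


print(dispatch_directory("/dir", "/dir"))
print(dispatch_directory("/dir/", "/dir"))
```

Because the handler only ever runs for the slash form, a link like href="file_in_dir.foo" resolves inside the directory.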

rst

Re: indexing suggestion [ In reply to ]

Apr 17, 1995, 2:30 PM

Post #39 of 47 (5044 views)

Rst wrote:
>...You'd need per-directory DefaultTypes (another potentially useful
>extension, though some care would be needed in implementation in the
>non-forking case)...

This is already available; it is a feature of NCSA httpd 1.3.

> Re PATH_INFO; if /dir/file.ext is a regular (unix) file, then accessing
> /dir/file.ext/path_info will fail.
>
>Not currently --- the PATH_INFO is simply ignored in this case. I
>personally see no compelling reason to change this, although as we all
>will recall, Rob H. vehemently disagrees. However, I do think that
>PATH_INFO should clearly be allowed anywhere that a CGI script might
>get into the mix.

Urgle.
I think you did change this with B23.

I based my comments on NCSA httpd 1.3, which has the following behaviour:
    GET /index.html/a HTTP/1.0
        returns 403 Forbidden
    GET /index.html/ HTTP/1.0
        returns index.html, but with Content-type: text/plain

Whereas for apache 0.5
    GET /index.html/a HTTP/1.0
        returns index.html
    GET /index.html/ HTTP/1.0
        returns index.html

The current apache behaviour is wrong.
For /index.html/a it should give 404 Not Found.
For /index.html/ it should give 404 Not Found, or perhaps a redirect to
/index.html, as in general void path segments are not considered significant.

David.

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 4:22 PM

Post #40 of 47 (5041 views)

Rst wrote:
> /index.html, as in general void path segments are not considered significant.
>would *clearly* be improper with XBITHACK on --- if the XBIT is set on
>/file.html/, then this is a reference to a server-side-includes file
>with PATH_INFO, and the correct thing is very definitely to process
>the file, and pass along the PATH_INFO to any scripts it happens to
>invoke.

In fact, I don't understand why allowing PATH_INFO for server-side include
files is needed. I suspect it may be something to do with invoked CGI
script...
Anyway, if it is, then the ssi documentation should emphasise the point
that you cannot use relative links without a BASE tag.

And earlier, Rst wrote:
>(Note that the server requires PATH_INFO of '/' when actually invoking
>httpd/unix-directory handlers --- sending out a redirect as usual
>otherwise --- so the handler can generate HTML which looks like:
> <a href="file_in_dir.foo">some file in this directory</a>
>without worry).

Actually, I think this is the wrong way of thinking about this case.
I would rather the DOCUMENT_URI ended in a '/', and the PATH_INFO was NULL.
A null PATH_INFO makes more sense.

David.

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 4:24 PM

Post #41 of 47 (5038 views)

RobH wrote:
>> if the XBIT is set on
>> /file.html/, then this is a reference to a server-side-includes file
>> with PATH_INFO, and the correct thing is very definitely to process
> ^^^^^^^
>> the file, and pass along the PATH_INFO to any scripts it happens to
>> invoke.
>
>I'll agree with that, if and only if you can point me to the CGI
>documentation which defines this behaviour. If that isn't documented,
>I'd agree that index.html/ and index.html/a should return a 404 Not
>Found.

Until we get HTTP/1.1, and the ability to add a BASE to the header, I
think it is too dangerous (w.r.t broken relative URLs) to service these
requests. If 1.3 behaves as David described, we won't be breaking
anything by 404'ing them.

rob

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 4:30 PM

Post #42 of 47 (5052 views)

This probably sounds as though I didn't read what rst wrote, but anyway...
I wrote:
>Rst wrote:
>> /index.html, as in general void path segments are not considered significant.
>>would *clearly* be improper with XBITHACK on --- if the XBIT is set on
>>/file.html/, then this is a reference to a server-side-includes file
>>with PATH_INFO, and the correct thing is very definitely to process
>>the file, and pass along the PATH_INFO to any scripts it happens to
>>invoke.
>
>In fact, I don't understand why allowing PATH_INFO for server-side include
>files is needed. I suspect it may be something to do with invoked CGI
>script...

Further thoughts, suppose I have a CGI script /cgi-bin/list and
an ssi /doc.shtml, what should happen in the event that I access
/doc.shtml/foo
which does
#exec cgi="/cgi-bin/list/bar"

should PATH_INFO for the CGI script be /foo or /bar ? Why would I ever want it
to be /foo ?

David.

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 4:39 PM

Post #43 of 47 (5044 views)

>Again, not to me. If you have a request for /script.cgi/foo/bar/zot,
>then the PATH_INFO winds up as /foo/bar/zot. I am unable to see why
>PATH_INFO of a single '/' should be treated any differently. In any
>case, the current behavior is by far the simplest thing for anyone
>actually *writing* an httpd/unix-directory handler (and yes, I've
>written one to test the feature); I think that justifies it.

I didn't mean in general, only for accessing a directory index.
I do not see that PATH_INFO would be relevant to an httpd/unix-directory
handler. But it's not the value of PATH_INFO that I mind; rather that
if PATH_INFO has the '/', then DOCUMENT_URI does not, which is surely wrong.
(But maybe you don't have it working that way. Unfortunately, ssi files do.)

David.

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 4:42 PM

Post #44 of 47 (5031 views)

>Precisely. There are people with fairly elaborate setups which use
>PATH_INFO to pass information around between scripts which are invoked
>from different SSI pages.
But they shouldn't do! SSI should have a setenv for this purpose.
Of course, saying that doesn't do any good as we need to be compatible...

David.

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 5:02 PM

Post #45 of 47 (5051 views)

Rst wrote:
>Here is what the CGI docs have to say about PATH_INFO:
>
><LI> <CODE>PATH_INFO</CODE> <P>
> The extra path information, as given by the client. In other
> words, scripts can be accessed by their virtual pathname, followed
> by extra information at the end of this path. The extra
> information is sent as PATH_INFO. This information should be
> decoded by the server if it comes from a URL before it is passed
> to the CGI script.<P>
>
I thought RobH was asking about CGI scripts in the context of server-side
includes, which this does not mention.

>Note that it doesn't say "...followed by extra information... The
>extra information is sent as PATH_INFO, unless it happens to consist
>entirely of '/' characters, in which case we do some other thing with
>it".

And it doesn't say we couldn't....

In fact, treating null path segments specially would be quite legitimate.
httpd already imposes some semantics on the URL path, part of which
is sent to the CGI script as PATH_INFO. Consider: we parse '/../' in both the
document URI part and the PATH_INFO, removing it and the previous path
segment. Similarly, httpd ignores '//' in the document URI part, so maybe it
should also in the PATH_INFO ? 8-)
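
The two rules mentioned here ('/../' removes itself and the previous segment, '//' is ignored) amount to a small normalisation pass. The sketch below is an illustration of those rules only, not httpd's actual routine; keeping a single trailing '/' is an assumption of the sketch.

```python
def normalize_path(path):
    """Drop empty segments and resolve '..' against the previous segment."""
    kept = []
    for segment in path.split("/"):
        if segment == "":
            continue  # leading slash, trailing slash, or a '//' run
        if segment == "..":
            if kept:
                kept.pop()  # remove the previous path segment as well
            continue
        kept.append(segment)
    result = "/" + "/".join(kept)
    # Assumption: one trailing slash survives if the input ended with one.
    if path.endswith("/") and result != "/":
        result += "/"
    return result


print(normalize_path("/a/b/../c//d"))  # /a/c/d
print(normalize_path("/dir//"))        # /dir/
```

Whether the same pass should also be applied to the PATH_INFO part is exactly the open question raised above.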

David.

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

Apr 17, 1995, 5:34 PM

Post #46 of 47 (5034 views)

>*Surely* wrong? The semantics of httpd/unix-directory handlers aren't
>even defined yet! If you have an argument to make as to why these
>handlers should see a '/' on DOCUMENT_URI (as they presently don't),
>then please make it, but ex cathedra statements like this aren't
>terribly compelling.

I know...
My main problem is that I think the current ssi behaviour is completely
crap. The only spec is the source code itself, which is more broken than
a very broken thing, and upon which users probably depend. Hence my rather
argumentative comments. Maybe it would do some good if I tried to write an
actual specification.

So what is DOCUMENT_URI, anyway? The 'URI' for the document? To my mind,
that means the string passed in the request.

What does the SSI documentation say:
`DOCUMENT_URI: The virtual path to this document (such as /~robm/foo.shtml).'
* What is a virtual path? The documentation takes you to a description of the
path part of a URL.
* What does it mean _The_ virtual path to this document?
A document can have many URLs. Which one will httpd choose?
* What does it mean by 'this document'? Is the document not the resource that
the client receives? Or does it really mean 'this file'?

Anyway, the only useful definitions of DOCUMENT_URI that I can think of are:
1. The path part of the URL for the requested resource.
2. The shortest URL formed by repeatedly removing path segments from the
path specified in 1 that would cause httpd to execute or parse the same
file (as for the full path).

It shouldn't be a surprise that I prefer 1.

Currently, for server-side includes, httpd implements some form of 2.
But I do think that for cgi scripts and handlers, knowing the actual URL
the user specified is important if you want to be able to return a document
with relative links. I would suggest that this is better to be completely
specified in the DOCUMENT_URI variable, than in the concatenation of
PATH_INFO and DOCUMENT_URI.
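
The two candidate definitions can be compared with a small sketch. Everything here is hypothetical scaffolding: FILES stands in for the document tree, and resolve() is a toy version of the server's URL-to-file mapping that picks the longest leading run of path segments naming a known file.

```python
# Hypothetical document tree: URL paths that name real files or scripts.
FILES = {"/doc.shtml", "/cgi-bin/list"}


def resolve(path):
    """Return the known file that a request path would be served from."""
    segments = path.split("/")
    for end in range(len(segments), 1, -1):
        candidate = "/".join(segments[:end])
        if candidate in FILES:
            return candidate
    return None


def document_uri_full(request_path):
    """Definition 1: the path part of the URL exactly as requested."""
    return request_path


def document_uri_shortest(request_path):
    """Definition 2: drop trailing segments while the same file is selected."""
    target = resolve(request_path)
    if target is None:
        return request_path
    current = request_path
    while True:
        shorter = current.rsplit("/", 1)[0]
        if shorter == "" or resolve(shorter) != target:
            return current
        current = shorter


print(document_uri_full("/doc.shtml/foo/bar"))      # /doc.shtml/foo/bar
print(document_uri_shortest("/doc.shtml/foo/bar"))  # /doc.shtml
```

Under definition 2 the stripped remainder (/foo/bar here) is what ends up in PATH_INFO, which is why the full request is only recoverable by concatenating the two.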

My original point was that, conceptually, I think about the URL for
a directory as having a trailing '/' as part of the resource name.
After all, I've been trained to think that way by the standard redirect
behaviour if the '/' is missing.

In fact, would you do a redirect for http://host/dir if it were handled
by a httpd/unix-directory handler? The CGI script could determine that the
user did not append a '/', and modify its relative links accordingly.

David.

Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...) [ In reply to ]

robm at netscape

Apr 18, 1995, 1:10 PM

Post #47 of 47 (5039 views)

/*
* "Re: indexing suggestion (ATTN NCSA: possible 1.4 bug...)" by Brandon Long
* written Mon, 17 Apr 1995 10:29:23 -0500 (CDT)
*
* argh. You know, I'm beginning to understand why Netsite doesn't
* have local directory config files, just central ones that define
* everything. It makes it a lot easier.
*
*/

Actually, market demand made us add them to 1.1. So they're there,
improved, and with more sensible location methods than "stat the
whole damn tree". I think the real problem is that the httpd 1.3 code
makes so many assumptions about its own operation (no per-request
variable structure, globals up the wazoo, die()).

--Rob