Mailing List Archive

Re: Possible clue to the cause of random page error
lcrocker@nupedia.com wrote:
>>The following URL goes successfully to the desired article:
>>* http://www.wikipedia.org/w/wiki.phtml?title=Insomnia&redirect=no
>>
>>The "random page" error occurs for me only when I follow a link by
>>clicking on it:
>>* http://www.wikipedia.org/wiki/Insomnia
>>
>>Maybe this behavior difference of the 2 forms of the same URL will
>>provide a hint of the bug's cause.
>
> Thanks, and yes, that's a clue. I just rebooted the server
> since Jimbo said that fixed it last time, but I'll look into
> it more closely now--maybe mod_redirect is dying for some reason?

As I've mentioned before, I'm pretty sure it's the encoding hack I set
up to keep ampersands in titles _in_ the titles instead of as raw
ampersands that indicate the beginning of the next variable in the query
string:

RewriteEngine On
RewriteMap urlencode prg:/usr/local/bin/urlencode
RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]

If the hackish little external program should die or get out of sync, we
end up with the wrong URLs. But this ugliness *shouldn't* be needed. We
*should* be able to use the internal function that Apache provides for this:

RewriteEngine On
RewriteMap urlencode int:encode
RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]

but that function doesn't encode ampersands (%26), so it misses the
entire point of the exercise.
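
For illustration only, here is a minimal sketch of what an external `prg:` map program of this kind might look like. It is not the actual /usr/local/bin/urlencode, whose source does not appear in this thread, and the names `encode_key` and `serve` are hypothetical. Apache starts such a program once, writes one lookup key per line to its stdin, and reads one newline-terminated reply per key from its stdout. That is why the reply has to be flushed every time, and why a dead or stalled helper could plausibly produce the "out of sync" symptoms described above.

```python
import io
import sys
from urllib.parse import quote


def encode_key(key: str) -> str:
    """Percent-encode a title so it is safe as a query-string value."""
    # safe="" is an assumption of this sketch: it escapes '&', '/', '?'
    # and '=' as well, so none of them can be read as query syntax.
    return quote(key, safe="")


def serve(instream=sys.stdin, outstream=sys.stdout) -> None:
    """Answer one lookup per input line, flushing after every reply."""
    for line in instream:
        outstream.write(encode_key(line.rstrip("\n")) + "\n")
        outstream.flush()  # an unflushed reply would leave the lookup waiting


# Demonstration with in-memory streams instead of Apache's pipes.
demo_in = io.StringIO("AT&T\nInsomnia\n")
demo_out = io.StringIO()
serve(demo_in, demo_out)
print(demo_out.getvalue())
```

Run as a real map program, the script would call `serve()` with its defaults so that it reads from stdin and writes to stdout.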

If anyone knows a cleaner way to do this, do speak up. I've gotten eerie
silence on alt.apache.configuration.

-- brion vibber (brion @ pobox.com)
Re: Re: Possible clue to the cause of random page error
> As I've mentioned before, I'm pretty sure it's the encoding hack
> I set up to keep ampersands in titles _in_ the titles instead of
> as raw ampersands that indicate the beginning of the next variable
> in the query string:
>
> RewriteEngine On
> RewriteMap urlencode prg:/usr/local/bin/urlencode
> RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]
>
> If the hackish little external program should die or get out of
> sync, we end up with the wrong URLs. But this ugliness *shouldn't*
> be needed. We *should* be able to use the internal function that
> Apache provides for this...

You are mistaken that Apache is doing the wrong thing: ampersands
are /not/ supposed to be urlencoded--they are valid and meaningful
characters needed for URLs. But ampersands do need to be messed with
for Wikipedia-specific reasons: since article titles must appear as
values in the query string (which is separated by ampersands), they
must be escaped somehow for that function. Also, the
non-escaped ampersands in the URL must be HTML-escaped when they
appear as attribute values, such as HREFs. These are both entirely
separate issues, and the code formerly dealt with them correctly,
although in a way that you didn't like. We may have to compromise;
accept the double-encoding for ampersands that you removed for other
characters. Either that, or come up with some other escaping
mechanism for ampersands in titles.
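
The two separate escaping steps described above can be sketched roughly as follows. This is an illustration, not the Wikipedia code under discussion: the helper name `article_href` is hypothetical, and the `action=edit` parameter is only there to give the URL a second query variable.

```python
from html import escape
from urllib.parse import quote


def article_href(title: str) -> str:
    """Build an href value for a link to a title (illustrative only)."""
    # Step 1: escape the title so its '&' cannot end the title= value.
    value = quote(title, safe="")
    url = "/w/wiki.phtml?title=" + value + "&action=edit"
    # Step 2: HTML-escape the finished URL for use inside an attribute,
    # which turns the remaining separator '&' into '&amp;'.
    return escape(url, quote=True)


print('<a href="%s">edit</a>' % article_href("AT&T"))
# <a href="/w/wiki.phtml?title=AT%26T&amp;action=edit">edit</a>
```

The ampersand inside the title becomes %26 in the first step, and the ampersand separating the query variables becomes &amp; in the second, so the two never interfere.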
Re: Re: Possible clue to the cause of random page error
lcrocker@nupedia.com wrote:
>>As I've mentioned before, I'm pretty sure it's the encoding hack
>>I set up to keep ampersands in titles _in_ the titles instead of
>>as raw ampersands that indicate the beginning of the next variable
>>in the query string:
>>
>> RewriteEngine On
>> RewriteMap urlencode prg:/usr/local/bin/urlencode
>> RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]
>>
>>If the hackish little external program should die or get out of
>>sync, we end up with the wrong URLs. But this ugliness *shouldn't*
>>be needed. We *should* be able to use the internal function that
>>Apache provides for this...
>
> You are mistaken that Apache is doing the wrong thing: ampersands
> are /not/ supposed to be urlencoded--they are valid and meaningful
> characters needed for URLs.

It's not the wrong thing in _all_ cases, but it's definitely the wrong
thing for the case of "take an arbitrary string and put it as a value in
a key=value pair in a URL-encoded query string", which is the main
reason I would use such a function in URL-rewriting.

> But ampersands do need to be messed with
> for Wikipedia-specific reasons: since article titles must appear as
> values in the query string (which is separated by ampersands), they
> must be escaped somehow for that function. Also, the
> non-escaped ampersands in the URL must be HTML-escaped when they
> appear as attribute values, such as HREFs. These are both entirely
> separate issues, and the code formerly dealt with them correctly,
> although in a way that you didn't like. We may have to compromise;
> accept the double-encoding for ampersands that you removed for other
> characters. Either that, or come up with some other escaping
> mechanism for ampersands in titles.

Aside from my general distaste of the double-encoding, it doesn't handle
the case of manual input: someone who types
http://www.wikipedia.com/wiki/AT&T into their URL bar shouldn't end up
at [[AT]].

See attached patch for the Apache source which adds a rewrite map
function which encodes ampersands only. It works nicely on my test
server, but I don't want to mess with installing it on the main server;
I'm not sure exactly how the compile configuration was set up, and I've
done enough damage lately. :)

Once installed, the rewrite map can look like this:

RewriteEngine On
RewriteMap urlencode int:ampencode
RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]
...

If it looks reasonable, please go ahead and set it up.
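
The patch itself is not reproduced in this archive. As a rough sketch of the behaviour it is described as having (escape ampersands and nothing else), and assuming that really is all the map function does, the effect on the rewritten URL would be something like the following. The names `amp_escape` and `rewrite` are hypothetical, and this is Python for illustration rather than the C that goes into Apache.

```python
from urllib.parse import parse_qs


def amp_escape(title: str) -> str:
    """Escape only '&', leaving every other character as it is."""
    return title.replace("&", "%26")


def rewrite(path: str, escape=amp_escape) -> str:
    """Mimic the RewriteRule: /wiki/<title> -> /w/wiki.phtml?title=<title>."""
    prefix = "/wiki/"
    if not path.startswith(prefix):
        return path
    return "/w/wiki.phtml?title=" + escape(path[len(prefix):])


escaped = rewrite("/wiki/AT&T")
unescaped = rewrite("/wiki/AT&T", escape=lambda t: t)  # no escaping at all

print(escaped)                               # /w/wiki.phtml?title=AT%26T
print(parse_qs(escaped.split("?", 1)[1]))    # {'title': ['AT&T']}
print(parse_qs(unescaped.split("?", 1)[1]))  # {'title': ['AT']}
```

The last line shows the failure being avoided: without the escape, the title is cut off at the ampersand and the request lands on [[AT]].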

-- brion vibber (brion @ pobox.com)
Re: Re: Possible clue to the cause of random page error
I wrote:
> Once installed, the rewrite map can look like this:
>
> RewriteEngine On
> RewriteMap urlencode int:ampencode
...
> +static char *rewrite_mapfunc_ampescape(request_rec *r, char *key);

Err, make that "RewriteMap urlencode int:ampescape".

BTW, the code for rewrite_mapfunc_ampescape() is mainly ripped out of
the URI-encoding function ap_os_escape_path() in main/util.c

-- brion vibber (brion @ pobox.com)
Re: Re: Re: Possible clue to the cause of random page error
> Aside from my general distaste of the double-encoding, it doesn't
> handle the case of manual input: someone who types
> http://www.wikipedia.com/wiki/AT&T into their URL bar shouldn't
> end up at [[AT]].

If you ask me, they should. People who don't care about the
details of HTML/HTTP/etc. will either follow links or type "AT&T"
into the search box, and all is well. If they want to type in
URLs, then they have already committed themselves to knowing
about the quirks of URLs. One of those quirks is that the URL
http://...AT&T is badly formed, and the browser is allowed to
interpret it in more than one way. It might choose to interpret it
as "fetch the page http://...AT" and pass it the query string "T".
URLs are code--URLs are not user interface elements. If people who
choose to use URLs have to contend with their nuances as code,
that's OK.

Now, all that said, I understand that the web rose so fast that
we never had a chance to replace URLs with something usable, so
many users are forced to use them when they never should have been.
So I'm not opposed to making them easier sometimes. Let's make
them as simple as possible /but no simpler/. If we have to muck
with ampersands a bit just as we convert spaces to underscores,
then that's what we'll have to do.
Re: Re: Re: Possible clue to the cause of random page error
> Err, make that "RewriteMap urlencode int:ampescape".
> BTW, the code for rewrite_mapfunc_ampescape() is mainly ripped
> out of the URI-encoding function ap_os_escape_path() in main/util.c

BTW, would you mind putting source like that on the server
in /usr/local/src? That's where I've got all the post-install
stuff including Apache, MySQL, etc.
Re: Re: Possible clue to the cause of random page error
lcrocker@nupedia.com wrote:
>>Aside from my general distaste of the double-encoding, it doesn't
>>handle the case of manual input: someone who types
>>http://www.wikipedia.com/wiki/AT&T into their URL bar shouldn't
>>end up at [[AT]].
>
> If you ask me, they should.

I disagree...

> People who don't care about the
> details of HTML/HTTP/etc. will either follow links or type "AT&T"
> into the search box, and all is well. If they want to type in
> URLs, then they have already committed themselves to knowing
> about the quirks of URLs.

I agree, which is why I disagree with the previous statement.

> One of those quirks is that the URL
> http://...AT&T is badly formed, and the browser is allowed to
> interpret it in more than one way. It might choose to interpret it
> as "fetch the page http://...AT" and pass it the query string "T".

Untrue. The ampersand is not reserved in URL path sections (though it is
in the query string) and is allowed as a legitimate path character; see
RFC 2396, section 3.3. When we use /wiki/Title URLs, we're putting the
title in a _path_, not a query string, and we must treat it as such.

Further, the query string can never be delimited by an ampersand, only
by a question mark.
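
A quick way to check the path/query point is to run the URL through a standard URL parser. The sketch below uses Python's `urllib.parse.urlsplit`, which follows the later RFC 3986 rather than RFC 2396, but as far as I know the two agree that only a question mark starts the query. The `?action=edit` suffix in the second URL is just an example query added for contrast.

```python
from urllib.parse import urlsplit

# The hand-typed URL from the discussion: '&' stays inside the path.
typed = urlsplit("http://www.wikipedia.com/wiki/AT&T")
print(typed.path)    # /wiki/AT&T
print(typed.query)   # prints an empty line: there is no query string

# Only '?' begins the query; the '&' before it is still path data.
with_query = urlsplit("http://www.wikipedia.com/wiki/AT&T?action=edit")
print(with_query.path)    # /wiki/AT&T
print(with_query.query)   # action=edit
```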

> URLs are code--URLs are not user interface elements. If people who
> choose to use URLs have to contend with their nuances as code,
> that's OK.

I agree, which is why we should treat them according to the standards!

> Now, all that said, I understand that the web rose so fast that
> we never had a chance to replace URLs with something usable, so
> many users are forced to use them when they never should have been.
> So I'm not opposed to making them easier sometimes. Let's make
> them as simple as possible /but no simpler/. If we have to muck
> with ampersands a bit just as we convert spaces to underscores,
> then that's what we'll have to do.

...

> BTW, would you mind putting source like that on the server
> in /usr/local/src? That's where I've got all the post-install
> stuff including Apache, MySQL, etc.

Done, see /usr/local/src/ampersand.diff

-- brion vibber (brion @ pobox.com)