Mailing List Archive

robots.txt
>A robots.txt could easily be set up to disallow
>/wiki/special%3ARecentChanges (and various case variations). That only
>stops _nice_ spiders, of course.

>History links would need to be changed to be sufficiently
>distinguishable, for instance using
>/wiki.phtml?title=Foo&action=history
>etc; then ban /wiki.phtml.

I think we should do that ASAP. Let's close the whole special:
namespace, &action=edit, &action=history, &diff=yes and &oldID stuff
to spiders. None of this is of any value to the spiders anyway.
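
Something along these lines should do it, assuming the edit/history/diff
links are first moved under /wiki.phtml as suggested above (robots.txt
path matching is case-sensitive prefix matching, hence the extra lines
for case and encoding variants):

  User-agent: *
  Disallow: /wiki/special:
  Disallow: /wiki/Special:
  Disallow: /wiki/special%3A
  Disallow: /wiki/Special%3A
  Disallow: /wiki.phtml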

Axel
Re: robots.txt
On 5/17/02 5:37 PM, "Axel Boldt" <axel@uni-paderborn.de> wrote:

>> A robots.txt could easily be set up to disallow
>> /wiki/special%3ARecentChanges (and various case variations). That only
>> stops _nice_ spiders, of course.
>
>> History links would need to be changed to be sufficiently
>> distinguishable, for instance using
>> /wiki.phtml?title=Foo&action=history
>> etc; then ban /wiki.phtml.
>
> I think we should do that ASAP. Let's close the whole special:
> namespace, &action=edit, &action=history, &diff=yes and &oldID stuff
> to spiders. None of this is of any value to the spiders anyway.
>
I think we should not do that any time soon. For one, this is a wikipedia-l
level discussion. Until there is direct evidence that spiders are causing
any serious problem for Wikipedia (and no one has presented any), we
shouldn't even be discussing this.

Just because you can't see why it would be of value doesn't mean that it
isn't.

Again, if we surmise that spiders are causing slowdowns, we should be able
to find evidence for that BEFORE we block parts of the site from them. And
even then we should see if the fault lies in the site's code.

Spiders simulate high traffic well, and that's something that Wikipedia
should be able to handle.

tc
Re: robots.txt
The Cunctator wrote:
> Again, if we surmise that spiders are causing slowdowns, we should be able
> to find evidence for that BEFORE we block parts of the site from them. And
> even then we should see if the fault lies in the site's code.

I think this is right, although blocking them from 'edit' doesn't seem harmful.
Certainly, it's good for spiders to hit 'Recent Changes', and often.

> Spiders simulate high traffic well, and that's something that Wikipedia
> should be able to handle.

Right.

I'll do some research to determine whether spiders are causing any
problems, but in my judgment, based on experience running high-traffic
sites, it is pretty unlikely.

--Jimbo
Re: robots.txt
On 5/18/02 3:08 PM, "Jimmy Wales" <jwales@bomis.com> wrote:

> The Cunctator wrote:
>> Again, if we surmise that spiders are causing slowdowns, we should be able
>> to find evidence for that BEFORE we block parts of the site from them. And
>> even then we should see if the fault lies in the site's code.
>
> I think this is right, although blocking them from 'edit' doesn't seem
> harmful.
> Certainly, it's good for spiders to hit 'Recent Changes', and often.
>
Both points make perfect sense to me.
robots.txt
> Certainly, it's good for spiders to hit 'Recent Changes', and often.

Why? The spider doesn't know that the pages listed on RecentChanges have
had recent changes. It's just a list of links, like special:allpages or
special:ShortPages.

Maybe one could add
/wiki/special:RecentChanges&
to the robots.txt file; that way, the spider would fetch only one copy
of RecentChanges, not 14.
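
That is, a rule like this (just a sketch; since Disallow rules match by
URL prefix, one entry covers every parameterized variant, though the case
variations mentioned earlier would again need lines of their own):

  User-agent: *
  Disallow: /wiki/special:RecentChanges&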

Axel