Mailing List Archive

Wikipedia in Google
Dear Sirs,

Wikipedia is a large, collaborative project to produce a free
encyclopedia. There are currently nearly 100,000 articles, and about
200,000 page impressions per day.

As of last weekend, all Wikipedia articles seem to have disappeared from
the Google index. Wikipedia article URLs look like this:

http://www.wikipedia.org/wiki/<article-name>

We are not aware of an outage that might have caused the Google spider
to miss the pages. I would much appreciate it if you could shed some
light on the issue.

Sincerely,

Erik Moeller
--
FOKUS - Fraunhofer Institute for Open Communication Systems
Project BerliOS - http://www.berlios.de
Re: Wikipedia in Google
On mar, 2003-01-07 at 05:25, Erik Moeller wrote:
> As of last weekend, all Wikipedia articles seem to have disappeared from
> the Google index. Wikipedia article URLs look like this:
>
> http://www.wikipedia.org/wiki/<article-name>
>
> We are not aware of an outage that might have caused the Google spider
> to miss the pages. I would much appreciate it if you could shed some
> light on the issue.

Upon noticing that the main pages and mailing list archives _are_
indexed, I have my suspicions about our robots.txt file; the line:

Disallow: /w

perhaps should be:

Disallow: /w/

The former may be accidentally blocking /wiki/<article-name> paths
-- which of course form the bulk of our content! -- in addition to
direct access to the scripted pages in the /w subdirectory, which is
what it's intended to block.
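[For illustration, not part of the original thread: Python's standard urllib.robotparser implements the simple prefix comparison the robots exclusion standard describes, so the suspected behaviour can be sketched as below -- assuming the googlebot matched rules the same way; the article name used is hypothetical.]

```python
import urllib.robotparser

def allowed(disallow_rule, path):
    """Parse a minimal robots.txt containing one Disallow rule and
    report whether the given path may be fetched."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: " + disallow_rule,
    ])
    return rp.can_fetch("*", "http://www.wikipedia.org" + path)

# "Disallow: /w" is a prefix rule: it blocks /wiki/... articles too.
print(allowed("/w", "/wiki/Sandbox"))    # False -- articles blocked!
print(allowed("/w", "/w/wiki.phtml"))    # False -- scripts blocked

# "Disallow: /w/" only blocks the /w/ subdirectory itself.
print(allowed("/w/", "/wiki/Sandbox"))   # True  -- articles crawlable
print(allowed("/w/", "/w/wiki.phtml"))   # False -- scripts still blocked
```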

I have updated the robots.txt file; if indeed this is how the googlebot
was interpreting the line, I hope we can be respidered soon...

-- brion vibber (brion@pobox.com / brion@wikipedia.org)
Re: Wikipedia in Google
Brion Vibber wrote:

>On mar, 2003-01-07 at 05:25, Erik Moeller wrote:
>Upon noticing that the main pages and mailing list archives _are_
>indexed, I have my suspicions about our robots.txt file; the line:
>
> Disallow: /w
>
>perhaps should be:
>
> Disallow: /w/
>
>The former may be accidentally blocking /wiki/<article-name> paths
>-- which of course form the bulk of our content! -- in addition to
>direct access to the scripted pages in the /w subdirectory, which is
>what it's intended to block.
>
>I have updated the robots.txt file; if indeed this is how the googlebot
>was interpreting the line, I hope we can be respidered soon...
>
>-- brion vibber (brion@pobox.com / brion@wikipedia.org)
Alas, I think you're right. The robots exclusion standard suggests a
simple substring comparison be used in implementations, and all their
examples of directory exclusion use the slash.


-- Neil
