Mailing List Archive

Anyone have contacts at the Amazon or OpenAI web spiders?
One day I set up the world's lamest content farm. You can see it here:

https://www.web.sp.am/

While humans tend not to find its six billion pages very interesting,
some web spiders are entranced. In the past week or so, Amazon's
amazonbot has visited it 6 million times, and OpenAI's gptbot 2.6
million. (If you were wondering what they use to train ChatGPT, now
you know.) I don't care that googlebot comes by every 5 or 10 minutes,
but gptbot is every few seconds and amazon as fast as the server will
respond.

They both come from predictable IPs so I can set packet filters but
they're still hammering pretty hard. Each has a URL in the user agent
string, Amazon's page has an address to write to but OpenAI's doesn't.
I wrote to the Amazon address, no response.
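
As an illustration of the "predictable IPs" point, here is a minimal sketch of matching a client address against known crawler ranges before deciding to filter it. The ranges below are placeholders from the RFC 5737 documentation blocks, not the crawlers' real addresses (the thread does not list them), and `crawler_for` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks). The real crawler
# ranges are not given in this thread and would have to be looked up.
CRAWLER_NETS = {
    "amazonbot": [ipaddress.ip_network("192.0.2.0/24")],
    "gptbot": [ipaddress.ip_network("198.51.100.0/24")],
}


def crawler_for(ip):
    """Return the crawler name whose range contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, nets in CRAWLER_NETS.items():
        if any(addr in net for net in nets):
            return name
    return None
```

A server-side script could call `crawler_for(client_ip)` on each request and drop or delay the ones that match, though a packet filter does the same job earlier and more cheaply.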

If anyone has contacts at either I would appreciate it. A few years
ago the bingbot got trapped but fortunately I knew someone at
Microsoft who could pass the word. He reported back that while he
could not go into detail, there was a great deal of animated
conversation at the other end of the hall, and shortly after that it
stopped.

R's,
John
Re: Anyone have contacts at the Amazon or OpenAI web spiders?
Both robots respect robots.txt, of course they’re not going to answer.

On Feb 13, 2024, at 8:35 PM, John Levine <johnl@iecc.com> wrote:
>
> One day I set up the world's lamest content farm. You can see it here:
>
> https://www.web.sp.am/
>
> While humans tend not to find its six billion pages very interesting,
> some web spiders are entranced. In the past week or so, Amazon's
> amazonbot has visited it 6 million times, and OpenAI's gptbot 2.6
> million. (If you were wondering what they use to train ChatGPT, now
> you know.) I don't care that googlebot comes by every 5 or 10 minutes,
> but gptbot is every few seconds and amazon as fast as the server will
> respond.
>
> They both come from predictable IPs so I can set packet filters but
> they're still hammering pretty hard. Each has a URL in the user agent
> string, Amazon's page has an address to write to but OpenAI's doesn't.
> I wrote to the Amazon address, no response.
>
> If anyone has contacts at either I would appreciate it. A few years
> ago the bingbot got trapped but fortunately I knew someone at
> Microsoft who could pass the word. He reported back that while he
> could not go into detail, there was a great deal of animated
> conversation at the other end of the hall, and shortly after that it
> stopped.
>
> R's,
> John
Re: Anyone have contacts at the Amazon or OpenAI web spiders?
On Wed, Feb 14, 2024 at 1:36 PM John Levine <johnl@iecc.com> wrote:

> If anyone has contacts at either I would appreciate it.


https://developer.amazon.com/support/amazonbot
probably returned as a result of searching "amazonbot" on your favourite
search engine.
Re: Anyone have contacts at the Amazon or OpenAI web spiders?
>> If anyone has contacts at either I would appreciate it.
>
>
> https://developer.amazon.com/support/amazonbot

Um, that is the site I mentioned in the line above the one you quoted.
As I said, I wrote to the contact address, no reply.


> probably returned as a result of searching "amazonbot" on your favourite
> search engine.
>

Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Re: Anyone have contacts at the Amazon or OpenAI web spiders?
It appears that Patrick Clochesy <patrick@mach.net> said:
>Both robots respect robots.txt, of course they’re not going to answer.

The content farm is not one site with six billion pages, it's six billion
sites each with one page. They check the robots.txt for each site they
visit but by then it's too late.

Most spiders can take the hint that they're all on the same IP. But not
these two.
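
To make "take the hint" concrete, here is a minimal sketch of a crawler-side throttle keyed on the resolved IP address rather than the hostname, so that any number of one-page sites on one server share a single request budget. This is an assumption about how a polite spider might do it, not a description of how any of the crawlers named here work; `PerIPThrottle`, `wait_time`, and the 10-second default are all hypothetical.

```python
import socket
import time


class PerIPThrottle:
    """Enforce a minimum delay between requests to the same IP address,
    no matter how many hostnames point at it."""

    def __init__(self, min_delay=10.0, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        self.min_delay = min_delay
        self.resolve = resolve      # hostname -> IP string
        self.clock = clock          # returns the current time in seconds
        self.next_ok = {}           # IP -> earliest time of the next request

    def wait_time(self, hostname):
        """Reserve the next slot for hostname's IP and return how many
        seconds the caller should sleep before fetching."""
        ip = self.resolve(hostname)
        now = self.clock()
        start = max(now, self.next_ok.get(ip, now))
        self.next_ok[ip] = start + self.min_delay
        return start - now
```

A fetch loop would call `time.sleep(throttle.wait_time(host))` before each request. Because the key is the IP, checking robots.txt per hostname and pacing per server become separate concerns.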

R's,
John
