Mailing List Archive

Identifying Known Spiders?
I'd like to know the success of my efforts to submit a new site to all
the search engines; some spiders won't visit a site until it's been
online for a while, and some will only visit the home page.

I can see some of the spiders in the BROWSERREP and BROWSERSUM, but
the list is incomplete: it's definitely missing Googlebot and Yahoo
Slurp.

Also, the BROWSERREP shows all the browsers used by my human visitors;
it will get harder to spot spiders once my traffic picks up.

Is there a report specifically for known spiders?

Thanks! -- Mike
--
Michael David Crawford
mdcrawford at gmail dot com

Enjoy my art, photography, music and writing at
http://www.geometricvisions.com/
--- Free Music Downloads ---
+------------------------------------------------------------------------
| TO UNSUBSCRIBE from this list:
| http://lists.meer.net/mailman/listinfo/analog-help
|
| Analog Documentation: http://analog.cx/docs/Readme.html
| List archives: http://www.analog.cx/docs/mailing.html#listarchives
| Usenet version: news://news.gmane.org/gmane.comp.web.analog.general
+------------------------------------------------------------------------
Re: Identifying Known Spiders?
On 7/3/2008 3:48 AM, Michael Crawford wrote:
> I'd like to know the success of my efforts to submit a new site to all
> the search engines; some spiders won't visit a site until it's been
> online for a while, and some will only visit the home page.
>
> I can see some of the spiders in the BROWSERREP and BROWSERSUM, but
> the list is incomplete: it's definitely missing Googlebot and Yahoo
> Slurp.
>
> Also, the BROWSERREP shows all the browsers used by my human visitors;
> it will get harder to spot spiders once my traffic picks up.
>
> Is there a report specifically for known spiders?

No, the only special treatment for spiders in Analog is the ROBOTINCLUDE
command which tells Analog to count the requests with the specified
User-Agents as Search Engines in the OS Report.
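
For reference, a minimal sketch of what that looks like in an Analog
configuration file. The User-Agent patterns below are illustrative, not a
complete list:

```
# Count requests from these User-Agents as search engines / robots
ROBOTINCLUDE Googlebot*
ROBOTINCLUDE *Slurp*
ROBOTINCLUDE msnbot*
```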

There used to be a list of Spider User-Agents at
http://www.wadsack.com/robot-list.html but it seems to be empty at the
moment. There's a list from May 2007 at
http://www2.owen.vanderbilt.edu/mike.shor/diversions/analog/RobotInclude.txt

You might want to do a report with FILEINCLUDE /robots.txt, which should
give you a good indication of which search engines are hitting your site.
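
A minimal configuration along those lines might look like this, assuming
the usual report ON/OFF toggles:

```
# Only analyse requests for /robots.txt
FILEINCLUDE /robots.txt
# Then check the Host and Browser Reports to see who asked for it
HOST ON
BROWSERREP ON
```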

Aengus

RE: Identifying Known Spiders?
The robots list from which that page was built no longer exists. The group that was maintaining it decided that it didn't make sense to maintain a database of "known robots" any more as anyone can make a robot. However, a quick review suggests there's a new user-submitted list at http://www.robotstxt.org/db.html and a "wild caught" list at http://www.botsvsbrowsers.com/category/1/index.html. If I have time this weekend, maybe I'll update the scripts to pull from one of those sources.

I have to say, though, that the more I think about it, the more I'm of the mindset that anything that is *not a known web browser* is most likely a bot, and that maybe inverting the logic would make sense at this point.
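
In Analog terms, that inversion might be sketched with BROWEXCLUDE rules,
something like the following. The browser patterns are illustrative and
would need ongoing maintenance:

```
# Exclude User-Agents that match well-known browsers; whatever remains
# in the Browser Report is then presumed to be a bot
BROWEXCLUDE *MSIE*
BROWEXCLUDE *Firefox*
BROWEXCLUDE *Safari*
BROWEXCLUDE *Opera*
BROWSERREP ON
```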

--

Jeremy Wadsack
Seven Simple Machines
Main: (206) 545-4850
Direct: (206) 812-6829

-----Original Message-----
From: analog-help-bounces@lists.meer.net [mailto:analog-help-bounces@lists.meer.net] On Behalf Of Aengus
Sent: Thursday, July 03, 2008 4:30 AM
To: Support for analog web log analyzer
Subject: Re: [analog-help] Identifying Known Spiders?


Re: Identifying Known Spiders?
On Thu, Jul 3, 2008 at 8:37 AM, Jeremy Wadsack
<jeremy@7simplemachines.com> wrote:
> The robots list from which that page was built no longer exists. The group that was maintaining it decided that it didn't make sense to maintain a database of "known robots" any more as anyone can make a robot.

In my case, it's not so much that I want to watch all the
bots; rather, I want to monitor my progress at getting a new site
indexed by the search engines.

While Google, Yahoo and MSN together provide the vast majority of
search engine referrals, there are still a few small, independent
players such as JGDO.

There are lots of reasons for running a bot, some good, some bad. I'd
be happy if I could get a report of visits by the bots belonging to,
say, the top half-dozen search engines.

Note that it often happens, with new sites, that a search engine
spider may not visit at all for months, and even then will only fetch
the home page. By creating config files for each of my pages, I hope
to monitor spider visits throughout my site.
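
One hedged sketch of such a per-page config, restricting the analysis to a
single page and to a handful of spiders of interest (the page path and
User-Agent patterns below are illustrative):

```
# Watch one page only...
FILEINCLUDE /about.html
# ...and only the spiders I care about
BROWINCLUDE *Googlebot*
BROWINCLUDE *Slurp*
BROWINCLUDE *msnbot*
BROWSERREP ON
```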

If this isn't yet possible with Analog, I don't think it would be hard
to implement. It would be a popular feature that could win Analog a lot
more users, and maybe some consulting fees for Analog experts.
--
Michael David Crawford
mdcrawford at gmail dot com

Enjoy my art, photography, music and writing at
http://www.geometricvisions.com/
--- Free Music Downloads ---
Re: Identifying Known Spiders?
On 7/4/2008 12:30 AM, Michael Crawford wrote:
> On Thu, Jul 3, 2008 at 8:37 AM, Jeremy Wadsack
> <jeremy@7simplemachines.com> wrote:
>> The robots list from which that page was built no longer exists. The group that was maintaining it decided that it didn't make sense to maintain a database of "known robots" any more as anyone can make a robot.
>
> In my case, it's not so much that I want to watch all the
> bots; rather, I want to monitor my progress at getting a new site
> indexed by the search engines.
>
> While Google, Yahoo and MSN together provide the vast majority of
> search engine referrals, there are still a few small, independent
> players such as JGDO.
>
> There are lots of reasons for running a bot, some good, some bad. I'd
> be happy if I could get a report of visits by the bots belonging to,
> say, the top half-dozen search engines.
>
> Note that it often happens, with new sites, that a search engine
> spider may not visit at all for months, and even then will only fetch
> the home page. By creating config files for each of my pages, I hope
> to monitor spider visits throughout my site.
>
> If this isn't yet possible with Analog, I don't think it would be hard
> to implement. It would be a popular feature that could win Analog a lot
> more users, and maybe some consulting fees for Analog experts.

It all comes down to the same simple question - how do you decide that
any given request is from a spider/bot rather than a real person? If you
rely on the User-Agent string, you then have to decide how to identify
the relevant strings - assume that everything that isn't a "well known
browser" is a spider, or assume that everything that asks for
/robots.txt is a spider.

Unfortunately, there's nothing to stop a bot using a "well known
browser" User-Agent (see recent controversy about the AVG LinkScanner,
for example), and there's nothing to stop an ordinary user from
requesting /robots.txt. That means that there's no simple way to
automate the identification of spiders - it requires some judgement, and
Analog doesn't do judgement :-).

Once you come up with a set of rules that work for you (or for the set
of log files that you're working with at the moment), then it's not
difficult to use Analog to delve deeper into the robot traffic. You can
use FILEINCLUDE /robots.txt to get a list of IP addresses or Browser
strings that have requested /robots.txt. You can then use this
information with HOSTINCLUDE or with BROWINCLUDE to get a view of the
rest of the traffic from either one specific spider, or from all of the
spiders as a whole. Bear in mind that the job of spidering your site
might be spread between a number of different machines, so you might
need to HOSTINCLUDE a range of machines if you use that technique.
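
The two-pass approach described above might be sketched as a pair of
config fragments (the hostname and User-Agent patterns are illustrative):

```
# Pass 1: find out who requests /robots.txt
FILEINCLUDE /robots.txt
HOST ON
BROWSERREP ON

# Pass 2 (a separate run): examine everything one of those spiders did,
# using the hosts or User-Agents found in pass 1
# HOSTINCLUDE crawl-*.googlebot.com
# BROWINCLUDE *Googlebot*
```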

So you can certainly use Analog to watch this type of traffic - indeed
Analog's configurability makes it an ideal tool for the job. But because
there are no black and white rules for deciding what is or is not a
robot/spider, this functionality can't be built into Analog. The
decisions that you might make today to do this analysis on your site
might be different for someone else, and might be different in a few
months' time, as the list of search engines changes.

Aengus
