Mailing List Archive

Excluding Robots
Hello,
how can I exclude hosts that request the file robots.txt?
Other log file analysis programs do this automatically, because they
count every host that makes such a request as a robot.
How do Analog users exclude robots, given that there is no current list at
http://www.wadsack.com/robot-list.html?
Best Regards,
Sabine
Best Regards,
Sabine

--
Sabine Henneberger

Humboldt-Universität Berlin
Computer- und Medienservice
Arbeitsgruppe Elektronisches Publizieren
Tel. 030 2093 7075

Humboldt University Berlin, Germany
Computer and Media Service
Electronic Publishing Group
phone: +49+30+2093-7075

+------------------------------------------------------------------------
| TO UNSUBSCRIBE from this list:
| http://lists.meer.net/mailman/listinfo/analog-help
|
| Analog Documentation: http://analog.cx/docs/Readme.html
| List archives: http://www.analog.cx/docs/mailing.html#listarchives
| Usenet version: news://news.gmane.org/gmane.comp.web.analog.general
+------------------------------------------------------------------------
RE: Excluding Robots [ In reply to ]
The list of robots came from the robots database, which is no longer active. If you want an archived version of that list, there is probably someone here who has a copy.

Blocking hosts that request robots.txt may not be accurate. Larger search engines certainly use dedicated IP addresses for their crawlers, but there are thousands of home-spun crawlers, robots, etc. (which is the stated reason for the demise of the robots database), and many of these may run on networks where the same IP address also carries real user traffic. For example, AOL (where a single user may cycle through several IPs in one session) or traffic coming through a corporate firewall.

You can use Analog to produce a list of hosts that requested robots.txt. Just use a configuration file like this:

ALL OFF
HOST ON
OUTPUT ASCII
FILEINCLUDE /robots.txt

From that list you can create a series of HOSTEXCLUDE commands to include in your subsequent run. It just takes two passes through the log files.
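To save hand-editing, the first-pass report could be turned into HOSTEXCLUDE lines with a short script. This is only a sketch: the helper function and the assumed report layout (a request count followed by the hostname on each line) are hypothetical, so check the parsing against your actual Analog output before relying on it.

```python
# Hypothetical helper (not part of Analog itself): turn the host report
# from the first Analog pass into HOSTEXCLUDE commands for the second
# pass. It assumes each report line ends with the hostname or IP in the
# last column; adjust the parsing to match your Analog version's output.
import re

def hosts_to_excludes(report_lines):
    """Extract hostnames/IPs and emit one HOSTEXCLUDE command for each."""
    commands = []
    seen = set()
    for line in report_lines:
        fields = line.split()
        if not fields:
            continue
        candidate = fields[-1]
        # Keep only tokens that look like a hostname or IPv4 address.
        if "." in candidate and re.fullmatch(r"[A-Za-z0-9.\-]+", candidate):
            if candidate not in seen:
                seen.add(candidate)
                commands.append("HOSTEXCLUDE " + candidate)
    return commands

# Example with made-up report lines (request count, then host):
sample = ["1234 crawler.example.com", "567 66.249.66.1"]
print("\n".join(hosts_to_excludes(sample)))
```

The deduplication via `seen` matters because the same host may appear in several report sections; the resulting lines can be pasted directly into the second-pass configuration file.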

--

Jeremy Wadsack
Seven Simple Machines
(206) 545-4850

-----Original Message-----
From: analog-help-bounces@lists.meer.net [mailto:analog-help-bounces@lists.meer.net] On Behalf Of Sabine Henneberger
Sent: Thursday, January 17, 2008 5:30 AM
To: analog-help@lists.meer.net
Subject: [analog-help] Excluding Robots

