Mailing List Archive

New bot user agent
This one I haven't seen before and was filling up my tmp directory:

"Googlebot-Image/1.0"

DB
_______________________________________________
interchange-users mailing list
interchange-users@interchangecommerce.org
https://www.interchangecommerce.org/mailman/listinfo/interchange-users
Re: New bot user agent
On Thu, 27 Feb 2020, DB wrote:

> This one I haven't seen before and was filling up my tmp directory:
>
> "Googlebot-Image/1.0"

Interesting. Sounds like that would be a good addition to the robots list.
Would you like to make a pull request against the GitHub repo?

Jon


--
Jon Jensen
End Point Corporation
https://www.endpoint.com/
Re: New bot user agent
> On Feb 27, 2020, at 8:19 AM, Jon Jensen <jon@endpoint.com> wrote:
>
> On Thu, 27 Feb 2020, DB wrote:
>
>> This one I haven't seen before and was filling up my tmp directory:
>>
>> "Googlebot-Image/1.0"
>
> Interesting. Sounds like that would be a good addition to the robots list. Would you like to make a pull request against the GitHub repo?
>
> Jon

Wouldn’t the existing “GoogleBot” entry catch this, or are we (for some reason) being case-sensitive/word boundary aware here?

David
--
David Christensen
Senior Software and Database Engineer
End Point Corporation
david@endpoint.com
785-727-1171
Re: New bot user agent
On Thu, 27 Feb 2020, David Christensen wrote:

>>> This one I haven't seen before and was filling up my tmp directory:
>>>
>>> "Googlebot-Image/1.0"
>>
>> Interesting. Sounds like that would be a good addition to the robots
>> list. Would you like to make a pull request against the GitHub repo?
>
> Wouldn’t the existing “GoogleBot” entry catch this, or are we (for some
> reason) being case-sensitive/word boundary aware here?

Good point.

In the Interchange global structure file I see (trimmed for readability):

'RobotUA' => qr/gonzo|Google-Sitemaps|GoogleBot|grab/i,

so I would expect that to match.
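
A quick way to sanity-check that expectation from a shell, as a minimal sketch: it uses `grep -E` as a stand-in for Perl's regex engine (an assumption that holds for a plain alternation like this one), and the helper names `is_robot_ua` and `is_robot_ua_case_sensitive` are hypothetical, not part of Interchange.

```shell
#!/bin/sh
# Trimmed RobotUA alternation as quoted in this thread.
# grep -E stands in for Perl's qr/.../ here (assumption for illustration).
robot_ua_pattern='gonzo|Google-Sitemaps|GoogleBot|grab'

# Hypothetical helper: case-insensitive match, like the /i flag.
is_robot_ua() {
    printf '%s\n' "$1" | grep -Eiq -- "$robot_ua_pattern"
}

# Hypothetical helper: same pattern without case folding, for comparison.
is_robot_ua_case_sensitive() {
    printf '%s\n' "$1" | grep -Eq -- "$robot_ua_pattern"
}

ua='Googlebot-Image/1.0'

if is_robot_ua "$ua"; then
    echo "case-insensitive: match"
else
    echo "case-insensitive: no match"
fi

if is_robot_ua_case_sensitive "$ua"; then
    echo "case-sensitive: match"
else
    echo "case-sensitive: no match"
fi
```

With the /i-style match, "Googlebot" is caught by the "GoogleBot" alternative. Without case folding it is not, because of the lowercase "b".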

DB, what does your interchange.structure file show for the RobotUA regex?
You'll need to have DumpStructure Yes in your interchange.cfg for that
file to be written at daemon startup.

Jon


--
Jon Jensen
End Point Corporation
https://www.endpoint.com/
Re: New bot user agent
> On Thu, 27 Feb 2020, David Christensen wrote:
>
>>>> This one I haven't seen before and was filling up my tmp directory:
>>>>
>>>> "Googlebot-Image/1.0"
>>>
>>> Interesting. Sounds like that would be a good addition to the robots
>>> list. Would you like to make a pull request against the GitHub repo?
>>
>> Wouldn’t the existing “GoogleBot” entry catch this, or are we (for some
>> reason) being case-sensitive/word boundary aware here?
>
> Good point.
>
> In the Interchange global structure file I see (trimmed for readability):
>
> 'RobotUA' => qr/gonzo|Google-Sitemaps|GoogleBot|grab/i,
>
> so I would expect that to match.
>
> DB, what does your interchange.structure file show for the RobotUA regex?
> You'll need to have DumpStructure Yes in your interchange.cfg for that
> file to be written at daemon startup.
>
> Jon

Hmm, don't see that. Maybe my catalog started life too long ago for that?

[root@www]# grep -i robot catalog.structure
'BounceRobotSessionURL' => 0,
'RobotLimit' => '500',
'BounceReferralsRobot' => 0,
'robot_expire' => '0.005',

Re: New bot user agent
On Mon, 2 Mar 2020, DB wrote:

>> DB, what does your interchange.structure file show for the RobotUA
>> regex?
>
> Hmm, don't see that. Maybe my catalog started life too long ago for
> that?
>
> [root@www]# grep -i robot catalog.structure

That's global config, so you need to look in interchange.structure, not
the catalog structure file.

The RobotUA feature was introduced in 2002 so I'm guessing you probably
have it unless you intentionally removed it.
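
For the lookup in interchange.structure, something along these lines may help. It is only a sketch: the helper name `show_robot_ua` is hypothetical, the file location depends on your installation, and `-A 1` (print one line after each match) is a GNU/BSD grep option rather than strict POSIX. It is included because the regex may be dumped on the line after the 'RobotUA' key.

```shell
#!/bin/sh
# Hypothetical helper: print the RobotUA entry from a structure dump,
# plus the line after it in case the qr/.../ value is on its own line.
show_robot_ua() {
    grep -i -A 1 'RobotUA' "$1"
}

# Assumes the dump is in the current directory; adjust the path as needed.
if [ -f interchange.structure ]; then
    show_robot_ua interchange.structure
fi
```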

Jon


--
Jon Jensen
End Point Corporation
https://www.endpoint.com/
Re: New bot user agent
> That's global config, so you need to look in interchange.structure, not
> the catalog structure file.
>
> The RobotUA feature was introduced in 2002 so I'm guessing you probably
> have it unless you intentionally removed it.
>
> Jon
>

Oh - maybe I should learn to read :)

Here's what I have after adding Googlebot-Image, so maybe that addition is redundant:

'RobotUA' =>
qr/(?^i:adressendeutschland|AdsBot-Google|agent|AltaVista|Apache\s+\(internal\s+dummy\s+connection\)|appie|AppleSyndication|Arachnoidea|Aranha|Architext|archive|Argus|Ask|asterias|ATN_Worldwide|Atomz|AurNet|Awasu|BackRub|bender|bingbot|Bookdog|BookmarkSync|bot|Builder|CCBot|ccubee|cfetch|CFNetwork|check_http|CMC|collector|complex_network_group|Contact|crawl|Creep|Digital.*Integrity|Directory|dogpile|DotBot|Excite|EZResult|FavOrg|FeedDemon|FeedFetcher-Google|Feedreader|FeedValidator|Ferret|fido|find|Fireball|gazz|GetRight|gonzo|Google-Sitemaps|GoogleBot|Googlebot-Image|Googlebot-Image\/1\.0|grab|griffon|Gromit|Gulliver|H.m.h.kki|heritrix|HTTrack|Harvest|holmes|HTMLDOC|Hubater|IncyWincy|index|INGRID|Jack|JPluck|KIT.*Fireball|Kototoi|larbin|Leech|legs|libwww-perl|locator|LWP|Lycos|marvin|Mediapartners|MegaSheep|MEGAUPLOAD|Mercator|MFC_Tear_Sample|Microsoft\s+Data\s+Access|Microsoft\s+Office|Microsoft\s+URL\s+Control|Microsoft-WebDAV|MimeLive|mirago|Miva|moget|MSFrontPage|Nazilla|NetMechanic|NetScoop|newscan|Nutch|Ocelli|ozelot|ozzie|pagebull|panscient\.com|ParaSite|pavuk|POE-Component|Pokey|Pompos|Refiner|retrieve|RoboDude|Rover|Rssbandit|RSSOwl|Rutgers|Scooter|search|seek|shelob|ShopWiki|silk|Slurp|sna|Snappy|Snoopy|speedy|spider|Spyder|suke|Susie|swish|T-H-U-N-D-E-R-S-T-O-N-E|tarantula|topiclink|Toutatis|TurnitinBot|Tv.*Merc|Twiceler|urllib|VB\s+Project|Valkyrie|Voyager|W3C_Validator|Walker|wget|WhizBang|whowhere|Wiki|WinInet|winona|Wire|Wombat|WordPress|worm|wwwster|WWW-Mechanize|xtreme|Yahoo|Yandex|Zeus|ZyBorg)/,
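
On the redundancy question, a rough sketch of how to see which alternatives already catch the user agent. It checks a few entries from the list above one at a time, with `grep -Ei` standing in for Perl's case-insensitive matching (an assumption that holds for these literal alternatives). The helper name `matching_alternatives` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print each alternative that matches the user agent
# on its own, case-insensitively (like the (?^i:...) group in the dump).
# Usage: matching_alternatives USER_AGENT ALTERNATIVE...
matching_alternatives() {
    ua=$1
    shift
    for alt in "$@"; do
        if printf '%s\n' "$ua" | grep -Eiq -- "$alt"; then
            printf '%s\n' "$alt"
        fi
    done
}

# A handful of entries taken from the RobotUA list above.
matching_alternatives 'Googlebot-Image/1.0' \
    'bot' 'GoogleBot' 'Googlebot-Image' 'Yandex'
```

Here "bot", "GoogleBot", and "Googlebot-Image" are all printed, which suggests the new entries add nothing that the existing "bot" and "GoogleBot" alternatives do not already cover.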


DB
Re: New bot user agent
On Tue, 3 Mar 2020, DB wrote:

> Here's what I have after adding Googlebot-Image, so maybe that's
> redundant
>
> 'RobotUA' =>
[snip]

DB, can you tell what's in the files in your tmp directory? Maybe they're
things that get made regardless of there being a session. I don't think
RobotUA necessarily will stop all tmp files such as more-lists from being
made.

RobotUA sets $Vend::Robot, which in turn sets mv_tmp_session and
$Session->{spider}, and those affect a few things.

So you may need to look at what makes those tmp files and have those pages
act differently when [if session spider] is true, or something like that.

Jon


--
Jon Jensen
End Point Corporation
https://www.endpoint.com/