[squeak-dev] SqueakSource indexability (aka should we just ask crawlers to desist?)

Wed Apr 28 20:31:01 UTC 2010

On 28.04.2010, at 22:08, Ken Causey wrote:
> 
>> -------- Original Message --------
>> Subject: Re: [squeak-dev] SqueakSource indexability (aka should we just
>> ask crawlers to desist?)
>> From: Bert Freudenberg <bert at freudenbergs.de>
>> Date: Wed, April 28, 2010 2:59 pm
>> To: The general-purpose Squeak developers list
>> <squeak-dev at lists.squeakfoundation.org>
>> 
>> 
>> On 28.04.2010, at 21:07, Ken Causey wrote:
>>> 
>>> At times access to source.squeak.org becomes slower, as has been the
>>> case today.  I can see in the logs that various web-crawlers are the
>>> likely culprit.  Having the information there accessible via search
>>> engines is a wonderful thing but I have to suspect that the Seaside
>>> session IDs eliminate this option.  (Of course when URLs like
>>> http://source.squeak.org/trunk.html are found on other sites they then
>>> become indexed.)
>> 
>> Which URLs are the bots accessing?
> 
> Well, without detailed analysis it seems to be everything.  Feel free to
> look at ~squeaksource/apachelogs/.
> 
>> 
>>> Unless I'm mistaken about this, and I would appreciate any guidance, it
>>> seems like we need to add a robots.txt to the site which guides or
>>> simply asks crawlers to stay away.  Thoughts?  I'm no SEO expert.
>> 
>> We do have a robots.txt:
>> http://source.squeak.org/robots.txt
> 
> Aha.  Well, I know little about this subject.  But if this means what I
> think it means it seems that the crawlers are ignoring it.

I just read up on it. Glob patterns are *not* allowed; the single asterisk in the User-agent line is a special character, not a pattern match. We used:

User-agent: *
Disallow: /@*

But it should be:

User-agent: *
Disallow: /@

I'm going to fix that, let's see how it works out.
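
To illustrate the difference, here is a minimal Python sketch of plain prefix matching, which is how I understand the basic robots.txt convention: the Disallow value is compared literally against the start of the URL path. The function name is_disallowed and the sample session path are made up for illustration; they are not taken from SqueakSource or from any real crawler. Some crawlers support wildcard extensions, so this models only the basic rule.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow prefix.

    Plain prefix matching: '*' in a rule is treated as a literal
    character, not as a glob.
    """
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


# Hypothetical session-style path, for illustration only.
session_path = "/@AbCdEf/project"

# Old rule: matches only paths that literally begin with "/@*".
print(is_disallowed(session_path, ["/@*"]))

# Fixed rule: matches every path that begins with "/@".
print(is_disallowed(session_path, ["/@"]))

# Paths without the "/@" prefix stay crawlable under the fixed rule.
print(is_disallowed("/trunk.html", ["/@"]))
```

Under this model the old rule lets the session path through, which would explain why the bots appear to ignore our robots.txt.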

- Bert -