[squeak-dev] SqueakSource indexability (aka should we just ask crawlers to desist?)

Wed Apr 28 20:18:04 UTC 2010

On Wed, 28 Apr 2010, Ken Causey wrote:

> At times access to source.squeak.org becomes slower, as has been the
> case today.  I can see in the logs that various web-crawlers are the
> likely culprit.  Having the information there accessible via search
> engines is a wonderful thing but I have to suspect that the Seaside
> session IDs eliminate this option.  (Of course when URLs like
> http://source.squeak.org/trunk.html are found on other sites they then
> become indexed.)

See http://code.google.com/p/seaside/issues/detail?id=262 . I had two 
solutions for this problem in Seaside 2.8. The first used a linked hashtable 
to manage the sessions, giving O(1) session creation/access time, 
but it broke an almost never used feature: every session can have 
a distinct timeout value.
To fix that, I replaced the linked hashtable with a heap, which 
gave O(log(n)) creation/access time, but this time I was told to implement 
it in Seaside 2.9 using the new plugin system. Neither solution can 
be implemented as a plugin, so we got nowhere.
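
For illustration only, here is a rough Python sketch of the two approaches 
described above. It is not the Seaside code from the linked issue; the class 
and method names are made up for this example, and "now" is assumed to be a 
non-decreasing timestamp supplied by the caller. The first store keeps 
sessions in access order (a linked hashtable), which only works when all 
sessions share one timeout. The second keeps expiry times in a heap, so each 
session can carry its own timeout at the cost of O(log(n)) updates.

```python
import heapq
from collections import OrderedDict


class UniformTimeoutStore:
    """Linked-hashtable variant (hypothetical): one timeout for all sessions."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.sessions = OrderedDict()  # key -> (last_access, session), oldest first

    def put(self, key, session, now):
        self.sessions[key] = (now, session)
        self.sessions.move_to_end(key)  # O(1): most recently used goes last

    def get(self, key, now):
        self.expire(now)
        if key not in self.sessions:
            return None
        _, session = self.sessions[key]
        self.put(key, session, now)  # refresh the last-access time
        return session

    def expire(self, now):
        # Entries are ordered by last access, so only the front can be stale.
        while self.sessions:
            key, (last_access, _) = next(iter(self.sessions.items()))
            if now - last_access < self.timeout:
                break
            del self.sessions[key]


class PerSessionTimeoutStore:
    """Heap variant (hypothetical): each session has its own timeout."""

    def __init__(self):
        self.sessions = {}  # key -> (expiry, timeout, session)
        self.heap = []      # (expiry, key); may contain superseded entries

    def put(self, key, session, timeout, now):
        expiry = now + timeout
        self.sessions[key] = (expiry, timeout, session)
        heapq.heappush(self.heap, (expiry, key))  # O(log n)

    def get(self, key, now):
        self.expire(now)
        if key not in self.sessions:
            return None
        _, timeout, session = self.sessions[key]
        self.put(key, session, timeout, now)  # push a fresh expiry
        return session

    def expire(self, now):
        while self.heap and self.heap[0][0] <= now:
            expiry, key = heapq.heappop(self.heap)
            entry = self.sessions.get(key)
            # Skip heap entries that a later access has superseded.
            if entry is not None and entry[0] == expiry:
                del self.sessions[key]
```

The heap variant here uses lazy deletion: a refreshed session leaves its old 
heap entry behind until it is popped, which keeps the code short but lets the 
heap grow between expirations. A real implementation might prefer an 
indexed heap that updates entries in place.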

>
> Unless I'm mistaken about this, and I would appreciate any guidance, it
> seems like we need to add a robots.txt to the site which guides or
> simply asks crawlers to stay away.  Thoughts?  I'm no SEO expert.

This should do it:
User-agent: *
Disallow: /
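
As a quick sanity check, the rules above can be fed to the robots.txt parser 
in Python's standard library to see how a well-behaved crawler would 
interpret them. This is only a sketch: "ExampleBot" is a made-up user agent, 
and the helper name is an assumption for this example. Crawlers that ignore 
robots.txt are not affected by the file at all.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from this message.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a hypothetical crawler name.
    print(allowed("ExampleBot", "http://source.squeak.org/trunk.html"))
```

With "Disallow: /" under "User-agent: *", every path on the site is off 
limits to every compliant crawler, including pages such as trunk.html that 
are linked from other sites.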

Levente

>
> Ken