Re: [squeak-dev] 4.1 - hashed collections still a problem

24 Mar 2010


      On Wed, 24 Mar 2010, Igor Stasenko wrote:
...
On a board's meeting, Chris mentioned my implementation, and he says
that it outperforms the LargeIdentitySet.
I'm not sure how Chris compared it, because i implemented this scheme
only for dictionaries , not sets.
It's really hard to believe that a dictionary can be faster than 
LargeIdentitySet (I even doubt that a set could be faster), so I'm waiting 
for the numbers.
As you pointed out it's unfair to compare sets to dictionaries, so I 
wrote LargeIdentityDictionary 
(http://leves.web.elte.hu/LargeIdentityDictionary/LargeIdentityDictionary.st 
) which works just like LargeIdentitySet, but it also stores the values 
besides the keys. I modified Chris' benchmark just like you suggested: 
store nil as the value 
(http://leves.web.elte.hu/LargeIdentityDictionary/LargeIdentityDictionaryBenc... 
). The benchmark compares LargeIdentityDictionary to IdentityDictionary. 
It uses two diffent lookup methods (#at: and #includesKey:). I ran the 
benchmarks and copied the results to a single file 
(http://leves.web.elte.hu/LargeIdentityDictionary/LargeIdentityDictionaryBenc... 
), but I'm sure the nice charts are more interesting, so here they are:
http://leves.web.elte.hu/LargeIdentityDictionary/LargeIdentityDictionary.png
http://leves.web.elte.hu/LargeIdentityDictionary/LargeIdentityDictionary2.pn...
The first chart shows that IdentityDictionary has a big spike at 970000, 
probably because of clustering (we need even better primes). So the second 
chart is the same as the first, but it only shows the 0-5000ms range. What 
we can see is #at: and #includesKey: have similar performance for 
IdentityDictionary (both use #scanFor:). LargeIdentityDictionary >> #at: 
is ~3.5-5x faster than IdentityDictionary >> #at:. And what's more 
interresing is that #includesKey: is so fast that it seems to be totally 
flat (though it's not). That's because #includesKey: can use the 
#pointsTo: primitive, but #at: can't, because it has to return the value 
for the key. So if a dictionary is faster than LargeIdentitySet, than it's 
graph has to be even flatter than #includesKey:'s graph (it's really hard 
to believe that such dictionary exists).
I wonder how LargeIdentityDictionary compares to your dictionaries'.
Levente
P.S. LargeIdentityDictionary is hardly tested and doesn't implement every 
"public" method that IdentityDictionary does.
...
If you interested in details, you can check it in Magma.
Brief description:
i using a linked list for storing associations which keys having same hash value
so, in addition to key/value pair there is next ivar.
And dictionary's array entries look like this:
...
...
...
e1 -> e2 -> e3
...
...
where e1,e2,e3 is associations , which having same hash value, but
different keys.
So, when it doing lookup for a key, say e3, initially by doing hash
lookup it finds e1, and then since #= for e1 answers false, it looks
for the next linked association, until e3 is found.
If i remember, the last time i benched this implementation, it almost
same in speed as an original implementation for small sizes, and
outperforms on a large sizes, especially, when you got more than 4096
elements.
So, Chris, can you give us details how you compared the speed between
LargeIdentitySet and my dictionaries?
I don't think its correct to compare them, because dictionaries need
to handle more data than sets, and we should expect some slowdown
because of it.
But sure, you can imitate the Set behavior with dicts, by simply
ignoring the value field of association (set it to nil), and use only
keys, i.e.
set add: object
is like:
dict at: object put: nil.
-- 
Best regards,
Igor Stasenko AKA sig.