Re: [squeak-dev] 4.1 - hashed collections still a problem

28 Mar 2010


      On Sat, 27 Mar 2010, Chris Muller wrote:
...
...
...
Dictionaries don't have to use associations (for example MethodDictionary
doesn't use them), that's why #pointsTo: works (MethodDictionary also uses
it).
But that means a linear scan of the whole collection, even if done primitively,
this is not scalable.
Not necessarily.  VisualAge had a LookupTable class that was identical
to Dictionary except it used two parallel Array's (keys / values)
rather than Association objects instaniated by its Dictionary.
LookupTable was faster..
Linear search is not scalable in the way MethodDictionary uses it, but 
that doesn't have to scale at all. MethodDictionaries rarely grow large 
and #pointsTo: is faster if the capacity of the dictinary is less than 
1024 (that covers all but one MethodDictionary in my image).
LookupTable is totally unrelated to this issue. It's just an 
association-free dictionary implementation, but it still uses open 
addressing so it doesn't solve the issue with Squeak's identityHash.
It's easier to fight the issue of few keys and lots of objects with 
separate-chaining. Igor's sets/dictionaries use a linked list of 
associations for that, while my implementations uses arrays and no 
associations.
Using no associations and (separate key-value) arrays has several 
advantages over the list of associations:
- smaller memory footprint because of less objects (this means that it's 
more cache-friendly)
- better cache locality, because pointers to keys with the same hash 
use adjacent memory slots (this means that it's much more cache-friendly)
- #pointsTo: can be used with IdentityDictionaries which is much 
faster than a linear search implemented with a fully optimized loop.
This all means that LargeIdentityDictionary is 3.7x faster than 
MaIdentityDictionary for #at:ifAbsent: and 40x faster for #includesKey: 
when reaching 1000000 elements in the benchmark below, while 
MaIdentityDictionary is only 30-40% faster than IdentityDictionary (if you 
don't run into a really bad clustering. With better primes the chance of 
that may be even lower, and with an extra check for clustering that may be 
totally avoided).
Of course the above numbers are only true if the #notNil send is replaced 
with == nil in MaIdentityDictionary >> #at:ifAbsent: otherwise it's 
a bit worse (MaIdentityDictionary is only 20% faster than IdentityDictionary).
The benchmark (requires the latest trunk image):
objects := Array new: 1000000 streamContents: [ :stream | 1 to: 1000000 do: [ :i | stream nextPut: Object new ] ].
objects shuffleBy: (Random seed: 36rSQUEAK). "Could be anything, since the hashes are random."
results := {
    LargeIdentityDictionary.
    MaIdentityDictionary.
    IdentityDictionary } collect: [ :dictionaryClass |
    	| objectSource |
    	objectSource := objects readStream.
    	dictionaryClass -> (Array streamContents: [ :results |
    		| dictionary newObjects firstRun oldObjects |
    		dictionary := dictionaryClass new.
    		newObjects := Array new: 10000.
    		oldObjects := nil.
    		Smalltalk garbageCollect.
    		[ dictionary size >= 1000000 ] whileFalse: [
    			1 to: 10000 do: [ :index | dictionary at: (newObjects at: index put: objectSource next) put: nil ].
    			oldObjects ifNil: [ oldObjects := newObjects copy ].
    			results nextPut: {
    				dictionary size.
    				[
    					1 to: 5 do: [ :run |
    						1 to: 10000 do: [ :index | dictionary includesKey: (oldObjects at: index) ] ].
    					1 to: 5 do: [ :run |
    						1 to: 10000 do: [ :index | dictionary includesKey: (newObjects at: index) ] ] ] timeToRun.
    				[
    					1 to: 5 do: [ :run |
    						1 to: 10000 do: [ :index | dictionary at: (oldObjects at: index) ifAbsent: nil ] ].
    					1 to: 5 do: [ :run |
    						1 to: 10000 do: [ :index | dictionary at: (newObjects at: index) ifAbsent: nil ] ] ] timeToRun } ] ]) ].
While reviewing MaDictionary and subclasses I found that #maxBuckets is 
wrong. It's okay for MaIdentityDictionary, but it won't allow an 
MaDictionary to have more than 4096 buckets.
Levente