On Sat, 27 Mar 2010, Chris Muller wrote:
Dictionaries don't have to use associations (for example MethodDictionary doesn't use them), that's why #pointsTo: works (MethodDictionary also uses it).
But that means a linear scan of the whole collection, even if done primitively, this is not scalable.
Not necessarily. VisualAge had a LookupTable class that was identical to Dictionary except it used two parallel Array's (keys / values) rather than Association objects instaniated by its Dictionary. LookupTable was faster..
Linear search is not scalable in the way MethodDictionary uses it, but that doesn't have to scale at all. MethodDictionaries rarely grow large and #pointsTo: is faster if the capacity of the dictinary is less than 1024 (that covers all but one MethodDictionary in my image).
LookupTable is totally unrelated to this issue. It's just an association-free dictionary implementation, but it still uses open addressing so it doesn't solve the issue with Squeak's identityHash.
It's easier to fight the issue of few keys and lots of objects with separate-chaining. Igor's sets/dictionaries use a linked list of associations for that, while my implementations uses arrays and no associations.
Using no associations and (separate key-value) arrays has several advantages over the list of associations: - smaller memory footprint because of less objects (this means that it's more cache-friendly) - better cache locality, because pointers to keys with the same hash use adjacent memory slots (this means that it's much more cache-friendly) - #pointsTo: can be used with IdentityDictionaries which is much faster than a linear search implemented with a fully optimized loop.
This all means that LargeIdentityDictionary is 3.7x faster than MaIdentityDictionary for #at:ifAbsent: and 40x faster for #includesKey: when reaching 1000000 elements in the benchmark below, while MaIdentityDictionary is only 30-40% faster than IdentityDictionary (if you don't run into a really bad clustering. With better primes the chance of that may be even lower, and with an extra check for clustering that may be totally avoided).
Of course the above numbers are only true if the #notNil send is replaced with == nil in MaIdentityDictionary >> #at:ifAbsent: otherwise it's a bit worse (MaIdentityDictionary is only 20% faster than IdentityDictionary).
The benchmark (requires the latest trunk image):
objects := Array new: 1000000 streamContents: [ :stream | 1 to: 1000000 do: [ :i | stream nextPut: Object new ] ]. objects shuffleBy: (Random seed: 36rSQUEAK). "Could be anything, since the hashes are random." results := { LargeIdentityDictionary. MaIdentityDictionary. IdentityDictionary } collect: [ :dictionaryClass | | objectSource | objectSource := objects readStream. dictionaryClass -> (Array streamContents: [ :results | | dictionary newObjects firstRun oldObjects | dictionary := dictionaryClass new. newObjects := Array new: 10000. oldObjects := nil. Smalltalk garbageCollect. [ dictionary size >= 1000000 ] whileFalse: [ 1 to: 10000 do: [ :index | dictionary at: (newObjects at: index put: objectSource next) put: nil ]. oldObjects ifNil: [ oldObjects := newObjects copy ]. results nextPut: { dictionary size. [ 1 to: 5 do: [ :run | 1 to: 10000 do: [ :index | dictionary includesKey: (oldObjects at: index) ] ]. 1 to: 5 do: [ :run | 1 to: 10000 do: [ :index | dictionary includesKey: (newObjects at: index) ] ] ] timeToRun. [ 1 to: 5 do: [ :run | 1 to: 10000 do: [ :index | dictionary at: (oldObjects at: index) ifAbsent: nil ] ]. 1 to: 5 do: [ :run | 1 to: 10000 do: [ :index | dictionary at: (newObjects at: index) ifAbsent: nil ] ] ] timeToRun } ] ]) ].
While reviewing MaDictionary and subclasses I found that #maxBuckets is wrong. It's okay for MaIdentityDictionary, but it won't allow an MaDictionary to have more than 4096 buckets.
Levente