2009/12/1 Andreas Raab andreas.raab@gmx.de:
Thanks. Now I see what's happening. The problem is that adding objects without a scaled hash to a "regular" Dictionary creates hugely disproportionate collision chains as soon as we get north of 4096 objects. At that point all the possible slots that any new object's default hash can map to are already taken, and since we use a linear collision chain they combine into one gigantonormeous (hope I spelled that correctly :-) collision chain.
That explains the difference: for larger numbers of elements we end up (with your changes) with an average collision chain length of (dict size // 4096) for both adding and accessing, whereas without the scale we end up with a collision chain of length (dict size) for adding and (dict size // 2) for accessing, once all the possible initial slots are filled up and spill over.
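To see the effect in a workspace, here is a minimal sketch (not the actual Dictionary code; it assumes the VM's identityHash is limited to 12 bits, i.e. 0..4095, and uses 4096 as a stand-in for the scale factor):

| tableSize unscaled scaled |
tableSize := 10000.
unscaled := (1 to: 1000) collect: [ :i | Object new identityHash \\ tableSize + 1 ].
scaled := (1 to: 1000) collect: [ :i | Object new identityHash * 4096 \\ tableSize + 1 ].
unscaled max. "at most 4096: every chain starts in the first 4096 slots of the array, so the chains merge"
scaled max. "close to tableSize: the start slots are spread over the whole array, so the chains stay short"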
(and we should probably add a comment explaining the issue)
Cheers, - Andreas
Yes, that's what I tried to explain previously... The change does not change the RATE of collisions, only the COST of collisions. But I must just be bad at English ;)
Cheers
Nicolas
Levente Uzonyi wrote:
On Mon, 30 Nov 2009, Andreas Raab wrote:
| test array |
Transcript open; cr.
test := Dictionary new.
"There used to be an old issue with growth of Sets that were initially sized with goodSizeFor: 3. Try working around it by starting at 5." test := Dictionary new: 5.
This shouldn't be an issue in this case. The capacities of new HashedCollections are already modified to be carefully selected primes. AFAIK only hashMultiply doesn't like 5 as a capacity, but it's not involved here. Anyway, I added this to the new test.
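For reference, a quick workspace check (assuming HashedCollection>>capacity answers the size of the internal array; the actual primes depend on the image version):

(Dictionary new: 5) capacity. "expected to be one of the carefully selected primes mentioned above, not 5"
Dictionary new capacity. "the default capacity"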
"Having sequences of objects with strictly consecutive hashes is not representative for real systems. Instead, create 4096 objects and choose 100 atRandom."
objSet := Array new: 4096 streamContents: [ :stream |
    4096 timesRepeat: [ stream nextPut: Object new ] ].
array := (1 to: 100) collect: [ :i | objSet atRandom ].
Good point, but the hashes are not consecutive (at least with my VM). Since we need 10000 different Objects, I added this idea to the test with some modifications: 10000 Objects are created, then shuffled and used as needed.
The modified test:
| test array objects |
Transcript open; cr.
test := Dictionary new: 5.
objects := ((1 to: 10000) collect: [ :each | Object new ]) shuffled readStream.
[ test size >= 10000 ] whileFalse: [
    Smalltalk garbageCollect.
    "Take the next 100 Objects from the shuffled stream"
    array := Array new: 100 streamContents: [ :stream |
        100 timesRepeat: [ stream nextPut: objects next ] ].
    Transcript
        print: [ "Add the 100 Objects to the Dictionary as keys with nil values"
            1 to: array size do: [ :index | test at: (array at: index) put: nil ] ] timeToRun;
        tab;
        print: test size;
        tab;
        print: [ "Access the added Objects 1000 times"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index | test at: (array at: index) ] ] ] timeToRun;
        tab;
        print: [ "Array access baseline; this should be subtracted from the previous value"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index | array at: index ] ] ] timeToRun;
        cr;
        flush ]
The results (columns: #at:put: time, dictionary size, #at: time, Array #at: baseline; times in milliseconds):

Updated image:
0	100	93	4
0	200	144	4
0	300	125	4
0	400	105	3
0	500	111	4
0	600	150	4
0	700	115	3
-- snip --
0	9600	297	3
0	9700	296	4
0	9800	299	5
1	9900	312	4
0	10000	309	4
Values in the third column are increasing somewhat, but not significantly; there are no spikes.
Current image:
0	100	76	3
1	200	119	4
0	300	105	4
0	400	80	4
0	500	112	4
0	600	111	3
1	700	98	4
-- snip --
1	3700	1279	4
2	3800	1998	4
3	3900	3196	4
5	4000	4473	4
11	4100	11301	4
19	4200	18823	4
36	4300	36391	3
93	4400	45294	4
47	4500	46674	3
...
Around 4000 elements things get really slow. There are also smaller spikes near 2500. I didn't wait for 4600. As you can see from the first column, #at:put: is consuming more time with every iteration.
Levente