2009/12/1 Andreas Raab andreas.raab@gmx.de:
Thanks. Now I see what's happening. The problem is that adding objects without a scaled hash to a "regular" Dictionary creates hugely disproportionate collision chains as soon as we get north of 4096 objects. At that point all the possible slots that any new object's default hash can map to are already taken, and since we use a linear collision chain they combine into one gigantonormeous (hope I spelled that correctly :-) collision chain.
That explains the difference: for larger numbers of elements we end up (with your changes) with an average collision chain length of (dict size // 4096) for both adding and accessing, whereas without the scale we end up with a collision chain of length (dict size) for adding and (dict size // 2) for accessing, once all the possible initial slots are filled up and spill over.
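To see the effect in a workspace, here is a minimal sketch (not the actual Dictionary code; it assumes the VM's identityHash is limited to 12 bits, i.e. 0..4095, and uses 4096 as a stand-in for the scale factor):

| tableSize unscaled scaled |
tableSize := 10000.
unscaled := (1 to: 1000) collect: [ :i | Object new identityHash \\ tableSize + 1 ].
scaled := (1 to: 1000) collect: [ :i | Object new identityHash * 4096 \\ tableSize + 1 ].
unscaled max. "at most 4096: every chain starts in the first 4096 slots of the array, so the chains merge"
scaled max. "close to tableSize: the start slots are spread over the whole array, so the chains stay short"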
(and we should probably add a comment explaining the issue)
Cheers, - Andreas
Yes, that's what I tried to explain previously... The change does not change the RATE of collisions, only the COST of collisions. But I must just be bad at English ;)
Cheers
Nicolas
Levente Uzonyi wrote:
On Mon, 30 Nov 2009, Andreas Raab wrote:
| test array |
Transcript open; cr.
test := Dictionary new.
"There used to be an old issue with growth of Sets that were initially sized with goodSizeFor: 3. Try working around it by starting at 5." test := Dictionary new: 5.
This shouldn't be an issue in this case. The capacities of new HashedCollections are already modified to be carefully selected primes. AFAIK only hashMultiply doesn't like 5 as a capacity, but it's not involved here. Anyway, I added this to the new test.
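For reference, a quick workspace check (assuming HashedCollection>>capacity answers the size of the internal array; the actual primes depend on the image version):

(Dictionary new: 5) capacity. "expected to be one of the carefully selected primes mentioned above, not 5"
Dictionary new capacity. "the default capacity"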
"Having sequences of objects with strictly consecutive hashes is not representative for real systems. Instead, create 4096 objects and choose 100 atRandom."
objSet := Array new: 4096 streamContents: [ :stream |
    4096 timesRepeat: [ stream nextPut: Object new ] ].
array := (1 to: 100) collect: [ :i | objSet atRandom ].
Good point, but the hashes are not consecutive (at least with my VM). Since we need 10000 different Objects, I added this idea to the test with some modifications: 10000 Objects are created, then shuffled and used as needed.
The modified test:
| test array objects |
Transcript open; cr.
test := Dictionary new: 5.
objects := ((1 to: 10000) collect: [ :each | Object new ]) shuffled readStream.
[ test size >= 10000 ] whileFalse: [
    Smalltalk garbageCollect.
    "Take the next 100 Objects from the shuffled stream"
    array := Array new: 100 streamContents: [ :stream |
        100 timesRepeat: [ stream nextPut: objects next ] ].
    Transcript
        print: [ "Add the 100 Objects to the Dictionary as keys with nil values"
            1 to: array size do: [ :index | test at: (array at: index) put: nil ] ] timeToRun;
        tab;
        print: test size;
        tab;
        print: [ "Access the added Objects 1000 times"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index | test at: (array at: index) ] ] ] timeToRun;
        tab;
        print: [ "Array access baseline; this should be subtracted from the previous value"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index | array at: index ] ] ] timeToRun;
        cr;
        flush ]
The results (columns: #at:put: time, dictionary size, #at: time, Array #at: baseline; times in milliseconds):

Updated image:
0	100	93	4
0	200	144	4
0	300	125	4
0	400	105	3
0	500	111	4
0	600	150	4
0	700	115	3
-- snip --
0	9600	297	3
0	9700	296	4
0	9800	299	5
1	9900	312	4
0	10000	309	4
Values in the third column are increasing somewhat, but not significantly; there are no spikes.
Current image:
0	100	76	3
1	200	119	4
0	300	105	4
0	400	80	4
0	500	112	4
0	600	111	3
1	700	98	4
-- snip --
1	3700	1279	4
2	3800	1998	4
3	3900	3196	4
5	4000	4473	4
11	4100	11301	4
19	4200	18823	4
36	4300	36391	3
93	4400	45294	4
47	4500	46674	3
...
Around 4000 elements things get really slow. There are also smaller spikes near 2500. I didn't wait for 4600. As you can see from the first column, #at:put: is consuming more time with every iteration.
Levente