Hi,
I was messing around with SmallDictionary when I suddenly realised that I can't find a single reason to use it over a normal Dictionary. While its name and class comment imply that it is somehow an optimised Dictionary, I don't see any measurement where that actually holds up. The following was run in a Pharo 7 image on a recent VM (see below):
| d |
d := SmallDictionary new.
d sizeInMemory. "24"
[ 100000 timesRepeat: [ 1 to: 100 do: [ :i | d at: i put: i ] ] ] timeToRun. "0:00:00:05.226"
[ 100000 timesRepeat: [ d at: 48 ifAbsent: [] ] ] timeToRun. "0:00:00:00.041"
| d |
d := Dictionary new.
d sizeInMemory. "16"
[ 100000 timesRepeat: [ 1 to: 100 do: [ :i | d at: i put: i ] ] ] timeToRun. "0:00:00:00.385"
[ 100000 timesRepeat: [ d at: 48 ifAbsent: [] ] ] timeToRun. "0:00:00:00.006"
As you can see, SmallDictionary is 8 bytes larger per instance and significantly faster while reading and writing (I know that this isn't a good benchmark but it suffices to make my point).
Is anyone aware of a reason for hanging on to SmallDictionary? I'm also curious to know how SmallDictionary came to be. There must have been some advantage over Dictionary at some point in the past.
Cheers, Max
Image version: Pharo 7.0 Build information: Pharo-7.0+alpha.build.961.sha.a69e72a97136bc3f93831584b6efa2b1703deb84 (32 Bit)
VM version: CoInterpreter VMMaker.oscog-nice.2281 uuid: 4beeaee7-567e-1a4b-b0fb-bd95ce302516 Nov 27 2017 StackToRegisterMappingCogit VMMaker.oscog-nice.2283 uuid: 2d20324d-a2ab-48d6-b0f6-9fc3d66899da Nov 27 2017 VM: 201711262336 https://github.com/OpenSmalltalk/opensmalltalk-vm.git $ Date: Mon Nov 27 00:36:29 2017 +0100 $ Plugins: 201711262336 https://github.com/OpenSmalltalk/opensmalltalk-vm.git $
OS: macOS 10.13.5 Machine: MacBook Pro (13-inch, 2016, Four Thunderbolt 3 Ports)
Hi,
It was supposed to be optimised for small sizes, so I guessed it was about your choice of 100000 elements, but a quick benchmark with small numbers didn't offer better results.
So the question is appropriate. Why do we have it? Or better: why is SmallDictionary not faster? Even better: is this test any good? :)
cheers, Esteban
Well, I can answer my own question: SmallDictionary is a lot more space efficient.
I think the class comment should clarify the use case for SmallDictionary and mention the performance trade-off.
Max
On 8 Jun 2018, at 08:46, Max Leske maxleske@gmail.com wrote:
Hi,
I was messing around with SmallDictionary when I suddenly realised that I can't find a single reason to use it over a normal Dictionary. While its name and class comment imply that it is somehow an optimised Dictionary, I don't see any measurement where that actually holds up. The following was run in a Pharo 7 image on a recent VM (see below):
| d | d := SmallDictionary new. d sizeInMemory. "24" [100000 timesRepeat: [ 1 to: 100 do: [ :i | d at:i put: i] ] ] timeToRun. "0:00:00:05.226"
[100000 timesRepeat: [ d at: 48 ifAbsent: [] ] ] timeToRun. "0:00:00:00.041"
| d | d := Dictionary new. d sizeInMemory. "16" [100000 timesRepeat: [ 1 to: 100 do: [ :i | d at:i put: i] ] ] timeToRun. "0:00:00:00.385" [100000 timesRepeat: [ d at: 48 ifAbsent: [] ] ] timeToRun. "0:00:00:00.006"
can you try with <10 elements?
Is anyone aware of a reason for hanging on to SmallDictionary? I'm also curious to know how SmallDictionary came to be. There must have been some advantage over Dictionary at some point in the past.
It came from RB… the idea was that the Refactoring engine creates a lot of very small dictionaries with <10 elements, and that for such dictionaries the overhead of hashing is higher than a simple linear search.
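The linear-search idea can be sketched outside Smalltalk. Here is a minimal Python illustration (the class name `SmallDict` and its method names are hypothetical stand-ins for SmallDictionary's `at:put:` / `at:ifAbsent:`), assuming the parallel keys/values layout described in this thread:

```python
# Hypothetical sketch of the SmallDictionary idea: keys and values live in two
# parallel lists, and lookup is a plain linear scan -- no hash is ever computed.
class SmallDict:
    def __init__(self):
        self._keys = []
        self._values = []

    def at_put(self, key, value):
        # Linear scan for an existing key; update in place if found.
        for i, k in enumerate(self._keys):
            if k == key:
                self._values[i] = value
                return value
        self._keys.append(key)
        self._values.append(value)
        return value

    def at_if_absent(self, key, default):
        # Compare keys in order; for a handful of entries this beats hashing.
        for i, k in enumerate(self._keys):
            if k == key:
                return self._values[i]
        return default
```

For a few entries the scan touches only a short contiguous list, which is exactly the regime the Refactoring engine was targeting.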
I am sure I did a benchmark back then, but of course this could have changed in the meantime.
It would be nice to redo the benchmark.
For size: does #sizeInMemory take the internal data into account? E.g. for Dictionary there is an Array that contains Associations; you need to count those, too.
In the end this is a hack anyway: if needed, Dictionary could do things like this transparently, without the user having to decide. And the question is whether this is not just a bad side effect of the particular implementation.
Marcus
(Sorry, hit send accidentally).
On 8 Jun 2018, at 9:02, Marcus Denker wrote:
can you try with <10 elements?
The results are in! Note that I had to increase the repetitions by a factor of 100.
| d |
d := SmallDictionary new.
[ 10000000 timesRepeat: [ 1 to: 5 do: [ :i | d at: i put: i ] ] ] timeToRun. "0:00:00:02.532"
[ 10000000 timesRepeat: [ d at: 4 ifAbsent: [] ] ] timeToRun. "0:00:00:00.686"

| d |
d := Dictionary new.
[ 10000000 timesRepeat: [ 1 to: 5 do: [ :i | d at: i put: i ] ] ] timeToRun. "0:00:00:02.102"
[ 10000000 timesRepeat: [ d at: 4 ifAbsent: [] ] ] timeToRun. "0:00:00:00.567"
So SmallDictionary is indeed very fast for only a few elements. Once it has to grow, its performance decreases rapidly, or at least the grow operation takes a long time. Dictionary still outperforms SmallDictionary though, so I guess the only reason left is space efficiency.
Is anyone aware of a reason for hanging on to SmallDictionary? I'm also curious to know how SmallDictionary came to be. There must have been some advantage over Dictionary at some point in the past.
It came from RB… the idea was that there are (in the Refactoring engine) a lot of very small dictionaries with <10 elements. The idea is that for such dictionaries, the overhead of hashing was higher than just linear search.
I am sure I did a benchmark back then, but of course this could have changed in the meantime.
It would be nice to redo the benchmark.
For size: does #sizeInMemory take the internal data into account? e.g for Dictionary there is an Array that contains Associations, you need to count those, too.
No, I didn't check for that initially. When I did, it became apparent that the larger the dictionary the more space you save with SmallDictionary.
In the end this is a hack anyway: if needed Dictionary could do things like this transparently… without the user having to decide. And the question is if this is not just a bad side effect of the particular implementation.
Good point. Sounds like what Stefan Marr wrote in the paper where he performed a field study of collection libraries ;)
Max
Marcus
On 8 Jun 2018, at 9:30, Max Leske wrote:
It would be nice to redo the benchmark.
That would be very nice, as I just realised that the type of object used as a key has an effect on the hashing time... There are a lot of variables here, unfortunately.
Max
FWIW it seems there is no SmallDictionary in Squeak.
Stef
Hi Max,
Theoretically, for a small number of elements, usually somewhere between 3 and 30 depending on implementations, a linear search is faster than a hash search, especially in the Pharo dictionary hash search implementation.
Efficient dictionary implementations are usually bucket-based. The dictionary holds a certain number of buckets and, based on the key's hash, determines the bucket where the key-value pair lives. Small buckets are linear (arrays or linked lists); large buckets are typically balanced binary trees (usually red-black trees). Under a certain number of elements there is a single bucket, which means a linear search is performed, as in SmallDictionary. When it grows, the dictionary search becomes a combination of a hash search and a linear or tree search.
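A toy version of the bucket-based scheme described above might look like this (Python for illustration; the names are hypothetical, and real implementations resize the bucket array and may upgrade large buckets to trees, which this sketch omits):

```python
# Minimal separate-chaining dictionary: a fixed array of buckets, each bucket a
# linear list of (key, value) pairs. The hash only selects the bucket; within
# a bucket the search is linear, just like in SmallDictionary.
class BucketDict:
    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # replace existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):  # linear search within one bucket
            if k == key:
                return v
        return default
```

With one bucket this degenerates to pure linear search; with many buckets and a good hash, each linear scan stays short.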
Pharo's dictionary search is first hash-based, but then all the buckets are laid out next to each other in the same array and a linear search is performed there. This leads to many collisions and slower search times (especially when the element is not found); sometimes the code searches across multiple buckets because the dictionary is too full or there are too many near-collisions. The memory consumption is competitive with the advanced implementations, though (precise measurements would need to be made).
Method dictionaries are represented differently to optimize the look-up logic.
If you want to improve things and have one dictionary implementation instead of two, implement or find a bucket-based dictionary and put it in the base image in place of Dictionary. This is quite some work, since there are many APIs to port. You can look at the Pinocchio implementation; it's quite good, but they did not implement large buckets.
On Fri, 8 Jun 2018, Clément Bera wrote:
Hi Max, Theoretically, for a small number of elements, usually somewhere between 3 and 30 depending on implementations, a linear search is faster than a hash search, especially in the Pharo dictionary hash search implementation.
Efficient dictionary implementations are usually bucket-based. The dictionary holds a certain number of buckets, and based on the key hash, the bucket where the key value is present is determined. Small buckets are linear (arrays or linked list). Large buckets are typically balanced binary trees (red-black trees typically). Under a certain number of elements there is a single bucket, which means a linear search is performed, as for the SmallDictionary. When it grows the dictionary search becomes a combination between a hash search and a linear or tree search.
The bucket-based scheme you described is called separate chaining collision resolution, and it won't have any better performance in Squeak than the currently used open addressing. However, it is preferable when you can't trust your hash function. When you do "crazy" things, like using a tree as a bucket, you just exchange the possible O(n) worst-case time for O(log(n)).
Pharo dictionary search is first hash-based, then all the buckets are represented next to each other in the same arrays and a linear search is performed there, leading to many collisions and slower search time (especially when the element is not found), sometimes the code searches over multiple buckets because the dictionary is too full or there are too many near-collisions. The memory consumption is competitive with the advanced implementations though (precise measurements would need to be made).
This is called open addressing, and if your hash function is good enough, the lists are short, so those "many collisions and slower search time" just don't happen. And no, separate chaining needs a lot more memory than open addressing, unless you manage to get rid of Associations, but that's a lot harder than it sounds like.
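For contrast with the chaining sketch, here is a minimal open-addressing table with linear probing (Python for illustration; `OpenAddressingDict` is a hypothetical name, and there is no resizing, so it assumes the table never fills up):

```python
# Open addressing with linear probing: a single flat array of slots. On a
# collision, scan forward (wrapping around) until the key or a free slot is
# found. A good hash keeps these probe sequences short.
class OpenAddressingDict:
    def __init__(self, capacity=16):
        self._slots = [None] * capacity  # each slot: (key, value) or None

    def put(self, key, value):
        i = hash(key) % len(self._slots)
        while self._slots[i] is not None and self._slots[i][0] != key:
            i = (i + 1) % len(self._slots)  # probe the next slot
        self._slots[i] = (key, value)

    def get(self, key, default=None):
        i = hash(key) % len(self._slots)
        while self._slots[i] is not None:  # a free slot ends the probe chain
            if self._slots[i][0] == key:
                return self._slots[i][1]
            i = (i + 1) % len(self._slots)
        return default
```

Note how both put and get touch consecutive array slots, which is the cache-friendly behaviour mentioned later in this thread.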
Method dictionaries are represented differently to optimize the look-up logic.
MethodDictionary is optimized to let the VM do lookups more easily. Memory consumption is worse when the dictionary is sparse, but gets better as it gets filled up compared to the current Dictionary implementation (assuming Pharo still uses the same representation as Squeak does).
If you want to improve things and have one dictionary implementation instead of two, implement or look for a bucket based dictionary and put it in the base image instead of Dictionary. This is quite some work since there are many APIs to port. You can look at the Pinnochio implementation, it's quite good but they've not implemented large buckets.
Well, as an experiment, I'm sure it'll be fun. But don't expect better performance or better memory usage.
Levente
In addition, open addressing with linear probing has superior cache line read behavior (no indirection / random traversal, and if the first probe misses the second one was likely cached by the first one).
On 8 Jun 2018, at 09:48, Stéphane Rollandin wrote:
FWIW it seems there is no SmallDictionary in Squeak.
Oh... Thanks Stéf, I wasn't aware of that.
On 8 June 2018, at 15:13, John Brant wrote:
Is anyone aware of a reason for hanging on to SmallDictionary? I'm also curious to know how SmallDictionary came to be. There must have been some advantage over Dictionary at some point in the past.
It came from RB… the idea was that there are (in the Refactoring engine) a lot of very small dictionaries with <10 elements. The idea is that for such dictionaries, the overhead of hashing was higher than just linear search.
I created its ancestor in VW some 20+ years ago (as RBSmallDictionary). It was used when doing pattern matching. When it performs a pattern match against an AST, it puts the potential value of the pattern variable in the dictionary. If the value is used later in the pattern, then we can get the previous value and make sure that we have an equivalent AST. This allows you to write patterns like:
`@a = `@a
to find where someone has the same expression on both sides of the #= message. Since most patterns have very few pattern variables, these dictionaries won't hold many entries. Furthermore, we are likely to abort the match when we have 0 entries.
The original RBSmallDictionary had an #empty method that "emptied" the dictionary without actually removing any elements -- it just set the size to 0. In a general dictionary this would lead to memory leaks since the previous values are still held by the dictionary. However, these dictionaries were only used during the matching process and went away after the process completed.
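The #empty trick can be illustrated with a hypothetical Python sketch: emptying just resets a logical size, so old entries stay referenced by the storage but become invisible to lookups, making the reset O(1):

```python
# Sketch of the RBSmallDictionary #empty trick: parallel key/value storage with
# a logical size. empty() sets the size to 0 without clearing the storage, so
# it is constant-time -- but stale entries are still held, which is only safe
# for short-lived dictionaries, as described in the thread.
class ResettableSmallDict:
    def __init__(self):
        self._keys = []
        self._values = []
        self._size = 0  # logical number of live entries

    def at_put(self, key, value):
        for i in range(self._size):  # linear search among live entries only
            if self._keys[i] == key:
                self._values[i] = value
                return
        if self._size < len(self._keys):  # reuse a stale slot left by empty()
            self._keys[self._size] = key
            self._values[self._size] = value
        else:
            self._keys.append(key)
            self._values.append(value)
        self._size += 1

    def at_if_absent(self, key, default):
        for i in range(self._size):
            if self._keys[i] == key:
                return self._values[i]
        return default

    def empty(self):
        self._size = 0  # old entries stay in memory, just ignored
```

In a general-purpose dictionary this would be a memory leak, as noted above; here the dictionaries die right after the match attempt, so it never matters.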
Anyway, at the time when we converted our pattern matching code from the VW parser with our pattern matching extensions to the new RB parser with pattern matching, the time to run SmallLint on the image was cut in half, even though our parser was quite a bit slower than the VW parser. I don't remember everything that was done, but I think most of the speedup came from having special pattern AST nodes and the small dictionary.
John Brant
Very interesting! Thanks John!
As Marcus has mentioned before in this thread, it would make a lot of sense to run the benchmarks again. Actually, I think it would be nice to have a benchmark suite for these cases that would let us monitor performance and ensure that changes to the codebase don't have a detrimental effect. I'm not saying that it would be easy to make this happen; writing proper benchmarks is hard (for me especially, as it seems, given my utter failure to think of the edge cases before starting this thread). Such a suite might also prevent these sorts of questions on the mailing list in the future, or at least make them easier to answer.
On 8 June 2018, at 13:01, Andres Valloud wrote:
In addition, open addressing with linear probing has superior cache line read behavior (no indirection / random traversal, and if the first probe misses the second one was likely cached by the first one).
Ah, nice catch! Although that would require frequent access to the dictionary / repeated access to the same items to have an effect, wouldn't it?
On 8 Jun 2018, at 10:01, Clément Bera wrote:
Hi Max,
Theoretically, for a small number of elements, usually somewhere between 3 and 30 depending on implementations, a linear search is faster than a hash search, especially in the Pharo dictionary hash search implementation.
Efficient dictionary implementations are usually bucket-based. The dictionary holds a certain number of buckets, and the key's hash determines the bucket in which the key-value pair is stored. Small buckets are linear (arrays or linked lists); large buckets are typically balanced binary trees (usually red-black trees). Below a certain number of elements there is a single bucket, which means a linear search is performed, as in SmallDictionary. As the dictionary grows, the search becomes a combination of a hash search and a linear or tree search.
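A minimal version of the bucket scheme Clément describes might look like this (illustrative Python with names of my own choosing; real implementations also grow the table and, as noted, may convert large buckets to trees):

```python
class ChainedDict:
    """Bucket-based (chaining) hash table: each slot holds a small list
    of (key, value) pairs that is searched linearly. Illustrative sketch;
    no growth or tree buckets."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The key's hash selects the bucket; the bucket is then scanned.
        return self.buckets[hash(key) % len(self.buckets)]

    def at_put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing entry
                return
        bucket.append((key, value))

    def at(self, key, absent=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return absent

d = ChainedDict(n_buckets=4)
for i in range(20):                       # more items than buckets is fine
    d.at_put(i, i * 2)
assert d.at(13) == 26
assert d.at(99, absent=-1) == -1
```

Note that with one bucket this degenerates into exactly the linear search that SmallDictionary performs.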
Pharo dictionary search is first hash-based, but all the buckets are then laid out next to each other in the same array, where a linear search is performed. This leads to many collisions and slower search times (especially when the element is not found); sometimes the code even searches across multiple buckets because the dictionary is too full or there are too many near-collisions. Memory consumption is competitive with the more advanced implementations, though (precise measurements would need to be made).
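For contrast, here is a bare-bones open-addressing table with linear probing, the general flavour of scheme used by Pharo's hashed collections (an illustrative Python sketch only; Pharo's actual HashedCollection code differs in many details, and this version never grows, so it must stay well below capacity):

```python
class OpenAddressingDict:
    """Open-addressing hash table with linear probing. Illustrative
    sketch: fixed capacity, no removal, no growth."""

    def __init__(self, capacity=16):
        self.slots = [None] * capacity    # None marks an empty slot

    def _index(self, key):
        # Start at the hash slot, walk forward until the key or a hole.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def at_put(self, key, value):
        self.slots[self._index(key)] = (key, value)

    def at(self, key, absent=None):
        slot = self.slots[self._index(key)]
        return slot[1] if slot is not None else absent

d = OpenAddressingDict()
for i in range(10):
    d.at_put(i, str(i))
assert d.at(7) == '7'
assert d.at(100) is None                  # unsuccessful search hits a hole
```

Collisions simply spill into the neighbouring slots of the same array, which is why an over-full table or a poor hash makes the forward scan, and especially the not-found case, expensive.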
Method dictionaries are represented differently to optimize the look-up logic.
If you want to improve things and have one dictionary implementation instead of two, implement or find a bucket-based dictionary and put it in the base image in place of Dictionary. This is quite some work, since there are many APIs to port. You can look at the Pinocchio implementation; it's quite good, but they haven't implemented large buckets.
Thanks for the detailed explanations Clément and Levente. I'll probably not add a new dictionary implementation ;)
Hum, about dictionary implementations, I've just found this comparison of chaining vs. open addressing [1]:

- Collision resolution: chaining uses an external data structure; open addressing uses the hash table itself.
- Memory waste: chaining pays a pointer-sized overhead per entry (storing list heads in the table); open addressing has none.
- Performance dependence on the table's load factor: directly proportional for chaining; proportional to loadFactor / (1 - loadFactor) for open addressing.
- Storing more items than the table size: possible with chaining; not possible with open addressing, where it is moreover recommended to keep the load factor below 0.7.
- Hash function requirements: both need a uniform distribution; open addressing should additionally avoid clustering.
- Removals: fine with chaining; with open addressing they clog the table with "DELETED" entries.
- Implementation: chaining is simple; a correct open-addressing hash table is quite tricky.

Hash tables with chaining can work efficiently even with a load factor above 1, while tables based on open addressing require the load factor not to exceed 0.7 to be efficient. Hence about 30% of slots remain empty, which leads to obvious memory waste.
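The load-factor sensitivity quoted above for open addressing can be checked empirically with a small simulation (illustrative Python; it measures unsuccessful-search probe counts under linear probing, and only demonstrates the qualitative blow-up, not the exact formula):

```python
import random

def mean_probes(load, capacity=1 << 12, searches=2000, seed=1):
    """Fill an open-addressing table (linear probing) to the given load
    factor, then return the mean probe count of unsuccessful searches.
    Illustrative simulation only."""
    rng = random.Random(seed)
    table = [None] * capacity
    for _ in range(int(load * capacity)):      # insert random keys
        i = rng.randrange(capacity)
        while table[i] is not None:            # probe to the next free slot
            i = (i + 1) % capacity
        table[i] = True
    total = 0
    for _ in range(searches):                  # unsuccessful search: probe
        i = rng.randrange(capacity)            # until an empty slot is hit
        probes = 1
        while table[i] is not None:
            i = (i + 1) % capacity
            probes += 1
        total += probes
    return total / searches

# Cost climbs steeply as the load factor approaches 1:
assert mean_probes(0.3) < mean_probes(0.7) < mean_probes(0.9)
```

This is consistent with the recommendation to keep open-addressing tables below a 0.7 load factor.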
Do you guys agree with this?
Their open addressing implementation is slightly different from the Smalltalk one (they leave deleted entries behind upon removal, while we fix collisions, at least in Pharo 6). Their chaining implementation does not use a balanced binary tree when a bucket becomes large, as Java's implementation does.
I don't really understand why an open addressing implementation would be trickier. In a high-level language such as Smalltalk I can see that chaining might be easier to implement, but in low-level code such as the VM, open addressing seems easier to me, since there is only one large chunk of memory to manage. Maybe it's because of hash collisions: having a good hash function and good hash table sizes (see HashTableSizes) is not that easy.
I'm curious because I do more and more bare-metal programming and end up having to implement this kind of thing myself; until now I've always implemented naive open addressing without really understanding the details.
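For what it's worth, the "DELETED entries" row of the comparison refers to tombstone deletion, which can be sketched as follows (illustrative Python with hypothetical names; Pharo instead repairs the probe chain on removal, as noted above):

```python
EMPTY, DELETED = object(), object()      # sentinel markers for slots

class TombstoneDict:
    """Open addressing where removal leaves a DELETED marker (a
    'tombstone') so later keys in the probe chain stay reachable.
    Illustrative sketch: fixed capacity, no growth."""

    def __init__(self, capacity=16):
        self.slots = [EMPTY] * capacity

    def _find(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not EMPTY:     # DELETED does NOT stop the scan
            s = self.slots[i]
            if s is not DELETED and s[0] == key:
                return i
            i = (i + 1) % len(self.slots)
        return None

    def at_put(self, key, value):
        i = self._find(key)
        if i is not None:                     # overwrite existing entry
            self.slots[i] = (key, value)
            return
        i = hash(key) % len(self.slots)       # else reuse first free slot,
        while self.slots[i] is not EMPTY and self.slots[i] is not DELETED:
            i = (i + 1) % len(self.slots)     # tombstones included
        self.slots[i] = (key, value)

    def remove(self, key):
        i = self._find(key)
        if i is not None:
            self.slots[i] = DELETED           # tombstone keeps chains intact

    def at(self, key, absent=None):
        i = self._find(key)
        return self.slots[i][1] if i is not None else absent

d = TombstoneDict()
d.at_put(1, 'a'); d.at_put(17, 'b')           # 1 and 17 collide (mod 16)
d.remove(1)
assert d.at(17) == 'b'                        # still reachable past the tombstone
assert d.at(1) is None
```

Marking the slot EMPTY instead of DELETED would make key 17 unreachable, which is exactly why removals "clog" such tables until a rehash.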
[1] http://www.algolist.net/Data_structures/Hash_table/Open_addressing
In an environment with garbage collection, it's worth considering the overhead of one's representation on the memory allocator and the garbage collector. Chaining adds objects (the exact number varies) and hence adds GC overhead, paying extra cycles for the life of the dictionary even if it never changes. Each object also carries some bytes of header overhead, so chaining becomes less efficient as collisions occur, due to the extra object allocations.
Also, modern processors and memory systems are very good at reading forward in memory, as the memory bus is often wide (and sometimes other optimisations apply). This means it's often quick to read the next entry, as it has already been pulled in due to adjacency. By contrast, chaining scatters collisions all over memory, reducing adjacency and increasing the required memory bandwidth.
I'm unconvinced that chaining dictionaries are sensible in modern garbage-collected languages on modern architectures, where memory bandwidth and cache space are both scarce resources. The days of memory being truly random-access are long gone, and too few programmers realise the impact that can have.
Cheers,
- Peter
This discussion has limited its context to comparing *in-memory* Dictionary implementations, where open addressing has an advantage. Using Magma reveals another context where chained Dictionaries can have some advantages: when they're large, persistent, and shared in a multi-user system, e.g. when rehashing on growth is not an option.
Not only does chaining make that scenario possible, it also improves concurrency: different users can add, change, or remove entries as long as they touch different keys. With regular Dictionaries, concurrent modification guarantees a commit conflict.
- Chris
On Fri, Jun 29, 2018 at 2:56 PM, Peter Crowther peter@ozzard.org wrote:
In an environment with garbage collection, it's worth considering the overhead of one's representation on the memory allocator and the garbage collector. Open addressing adds objects - the exact number of these varies
- and hence adds GC overhead (= paying extra cycles for the life of the
dictionary even if it never changes). Each object also has some bytes of overhead for headers, so chaining becomes less efficient as collisions occur due to extra object allocations.
Also, modern processors and memory systems are very good at reading forward in memory, as there's often a wide memory bus (and sometimes other optimisations). This means it's often quick to read the next entry as it's already been pulled in due to adjacency. By contrast, chaining puts collisions all over memory, reducing adjacency and increasing the bandwidth required to memory.
I'm unconvinced that chaining dictionaries are sensible in modern languages with garbage collection on modern architectures where bandwidth to memory, and cache space, are both scarce resources. The days of memory being truly random-access are long gone, and too few programmers realise the impact that can have.
Cheers,
- Peter
On 29 June 2018 at 08:27, Clément Bera bera.clement@gmail.com wrote:
Hum, about dictionary implementations, I've just found this [1]: Open addressing vs. chaining *Chaining* *Open addressing* *Collision resolution* Using external data structure Using hash table itself *Memory waste* Pointer size overhead per entry (storing list heads in the table) No overhead 1 *Performance dependence on table's load factor* Directly proportional Proportional to (loadFactor) / (1 - loadFactor) *Allow to store more items, than hash table size* Yes No. Moreover, it's recommended to keep table's load factor below 0.7 *Hash function requirements* Uniform distribution Uniform distribution, should avoid clustering *Handle removals* Removals are ok Removals clog the hash table with "DELETED" entries *Implementation* Simple Correct implementation of open addressing based hash table is quite tricky1 Hash tables with chaining can work efficiently even with load factor more than 1. At the same time, tables based on open addressing scheme require load factor not to exceed 0.7 to be efficient. Hence, 30% of slots remain empty, which leads to obvious memory waste.
Do you guys agree with this?
Their open addressing implementation is slightly different from the Smalltalk one (they have deleted entries upon removal while we fix collisions, at least in Pharo 6). Their chaining implementation does not use balanced binary tree when a bucket becomes large like Java's implementation.
I don't really understand why an open addressing implementation is trickier to implement, in high level languages such as Smalltalk I may agree that it might be easier to implement chaining, but in low level language such as in the VM, IMO open addressing is easier since you've got only one large chunk of memory to manage. Maybe it's because of hash collisions, having good hash and good hash collection size (See HashTableSizes) is not that easy.
I'm curious because I do more and more bare metal programming and I end up having to implement this kind of things myself, I've always implemented naive open addressing up until now without really understanding details.
[1] http://www.algolist.net/Data_structures/Hash_table/Open_addressing
On Sat, Jun 16, 2018 at 8:49 PM, Max Leske maxleske@gmail.com wrote:
On 8 Jun 2018, at 09:48, Stéphane Rollandin wrote:
FWIW it seems there is no SmallDictionary in Squeak.
Oh... Thanks Stéf, I wasn't aware of that.
On 8 June 2018, at 15:13, John Brant wrote:
Is anyone aware of a reason for hanging on to SmallDictionary? I'm also
curious to know how SmallDictionary came to be. There must have been some advantage over Dictionary at some point in the past.
It came from RB… the idea was that there are (in the Refactoring engine) a lot of very small dictionaries with <10 elements. The idea is that for such dictionaries, the overhead of hashing was higher than just linear search.
I created its ancestor in VW some 20+ years ago (as RBSmallDictionary). It was used when doing pattern matching. When it performs a pattern match against an AST, it puts the potential value of the pattern variable in the dictionary. If the value is used later in the pattern, then we can get the previous value and make sure that we have an equivalent AST. This allows you to write patterns like:
`@a = `@a
to find where someone has the same expression on both sides of the #= message. Since most patterns have very few pattern variables, these dictionaries won't hold many entries. Furthermore, we are likely to abort the match when we have 0 entries.
The original RBSmallDictionary had an #empty method that "emptied" the dictionary without actually removing any elements -- it just set the size to 0. In a general dictionary this would lead to memory leaks since the previous values are still held by the dictionary. However, these dictionaries were only used during the matching process and went away after the process completed.
Anyway, at the time when we converted our pattern matching code from using the VW parser with our pattern matching extensions to use the new RB parser with pattern matching, the time to run Smalllint on the image was cut in half even though our parser was quite a bit slower than the VW parser. I don't remember everything that was done, but I think that most of the speedup came from having special pattern AST nodes and the small dictionary.
John Brant
Very interesting! Thanks John!
As Marcus has mentioned before in this thread, it would make a lot of sense to run benchmarks again. Actually, I think it would be nice to have a benchmark suite for these cases, that would let us monitor the performance and ensure that changes to the codebase don't have a deteriorative effect. I'm not saying that it would be easy to make this happen, writing proper benchmarks is hard (for me especially, as it seems, given my utter failure to think of the edge cases before starting this thread). Such a suite might also prevent these sorts of questions on the mailing list in the future, or at least might make it easier to answer them.
On 8 June 2018, at 13:01, Andres Valloud wrote:
In addition, open addressing with linear probing has superior cache line
read behavior (no indirection / random traversal, and if the first probe misses the second one was likely cached by the first one).
Ah, nice catch! Although that would require frequent access to the dictionary / repeated access to the same items to have an effect, wouldn't it?
On 8 Jun 2018, at 10:01, Clément Bera wrote:
Hi Max,
Theoretically, for a small number of elements, usually somewhere between 3 and 30 depending on implementations, a linear search is faster than a hash search, especially in the Pharo dictionary hash search implementation.
Efficient dictionary implementations are usually bucket-based. The dictionary holds a certain number of buckets, and the key's hash determines the bucket where the key-value pair lives. Small buckets are linear (arrays or linked lists); large buckets are typically balanced binary trees (usually red-black trees). Below a certain number of elements there is a single bucket, so a linear search is performed, as in SmallDictionary. When the dictionary grows, the search becomes a combination of a hash search and a linear or tree search.
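As a rough illustration of the bucket-based scheme (the hash picks a bucket, then a linear search runs inside it), here is a minimal Python sketch; the names are invented, and the large-bucket tree fallback mentioned above is deliberately omitted:

```python
class BucketDict:
    """Sketch of a bucket-based dictionary: the key's hash selects a
    bucket, then a linear search runs inside that bucket only.
    Tree buckets (as in Java's HashMap) are omitted for brevity."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def at_put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite existing entry
                return
        bucket.append((key, value))

    def at_if_absent(self, key, default):
        for k, v in self._bucket(key):     # linear search within one bucket
            if k == key:
                return v
        return default
```

Note that with n_buckets=1 this degenerates into exactly the SmallDictionary-style single linear search, which is the "single bucket under a certain number of elements" case described above.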
Pharo's dictionary search is first hash-based, but all the buckets are laid out next to each other in the same array and a linear search is performed there. This leads to many collisions and slower search times (especially when the element is not found); sometimes the code searches across multiple buckets because the dictionary is too full or there are too many near-collisions. Memory consumption is competitive with the more advanced implementations, though (precise measurements would need to be made).
Method dictionaries are represented differently to optimize the look-up logic.
If you want to improve things and have one dictionary implementation instead of two, implement or look for a bucket-based dictionary and put it in the base image instead of Dictionary. This is quite some work since there are many APIs to port. You can look at the Pinocchio implementation; it's quite good, but they've not implemented large buckets.
Thanks for the detailed explanations Clément and Levente. I'll probably not add a new dictionary implementation ;)
On Sat, 30 Jun 2018, Chris Muller wrote:
This discussion has limited its context to comparing *in-memory* Dictionary implementations, where open-addressing has an advantage. Using Magma reveals another context where chained Dictionary's can have some advantages: When they're large, persistent, and shared in a multi-user system -- e.g., when grow rehashing is not an option.
B-trees were designed for that use case. If growing is not an option, then your dictionary's performance will degrade as its size increases, won't it?
Not only do chained Dictionary's make that scenario possible, they improve the concurrency -- different users can add/change/remove from the Dictionary as long as they occur at different keys. With regular Dictionary's, concurrency guarantees a commit conflict.
If you created a new, larger dictionary on grow and lazily copied the entries from the smaller to the larger upon access, then I think you could use open addressing hash tables. Also, I don't see the concurrency issue with open addressing, because as long as different users modify different chains, the case of concurrency is the same as what you described, isn't it?
Levente
- Chris
On Fri, Jun 29, 2018 at 2:56 PM, Peter Crowther peter@ozzard.org wrote: In an environment with garbage collection, it's worth considering the overhead of one's representation on the memory allocator and the garbage collector. Chaining adds objects - the exact number varies - and hence adds GC overhead (= paying extra cycles for the life of the dictionary even if it never changes). Each object also carries some bytes of header overhead, so chaining becomes less efficient as collisions occur, due to the extra object allocations.
Also, modern processors and memory systems are very good at reading forward in memory, as there's often a wide memory bus (and sometimes other optimisations). This means it's often quick to read the next entry as it's already been pulled in due to adjacency. By contrast, chaining puts collisions all over memory, reducing adjacency and increasing the bandwidth required to memory.
I'm unconvinced that chained dictionaries are sensible in modern languages with garbage collection on modern architectures, where bandwidth to memory and cache space are both scarce resources. The days of memory being truly random-access are long gone, and too few programmers realise the impact that can have.
Cheers,
- Peter
On 29 June 2018 at 08:27, Clément Bera bera.clement@gmail.com wrote: Hum, about dictionary implementations, I've just found this [1]:
OPEN ADDRESSING VS. CHAINING
                               Chaining                            Open addressing
  Collision resolution         uses an external data structure     uses the hash table itself
  Memory waste                 pointer-size overhead per entry     no overhead (1)
                               (list heads stored in the table)
  Dependence on load factor    directly proportional               proportional to loadFactor / (1 - loadFactor)
  More items than table size   yes                                 no; keep the load factor below 0.7
  Hash function requirements   uniform distribution                uniform distribution, should avoid clustering
  Handle removals              removals are ok                     removals clog the table with "DELETED" entries
  Implementation               simple                              correct implementation is quite tricky

(1) Hash tables with chaining can work efficiently even with a load factor above 1. Tables based on open addressing require the load factor not to exceed 0.7 to be efficient; hence about 30% of slots remain empty, which leads to obvious memory waste.
Do you guys agree with this?
Their open addressing implementation is slightly different from the Smalltalk one (they have deleted entries upon removal while we fix collisions, at least in Pharo 6). Their chaining implementation does not use balanced binary tree when a bucket becomes large like Java's implementation.
I don't really understand why an open addressing implementation is trickier to implement. In a high-level language such as Smalltalk I can see that chaining might be easier, but in a low-level language, such as in the VM, IMO open addressing is easier since you've got only one large chunk of memory to manage. Maybe it's because of hash collisions; having a good hash and a good hash table size (see HashTableSizes) is not that easy.
I'm curious because I do more and more bare-metal programming and I end up having to implement these kinds of things myself. I've always implemented naive open addressing up until now, without really understanding the details.
[1] http://www.algolist.net/Data_structures/Hash_table/Open_addressing
On Sat, Jun 16, 2018 at 8:49 PM, Max Leske maxleske@gmail.com wrote:
On 8 Jun 2018, at 09:48, Stéphane Rollandin wrote:
FWIW it seems there is no SmallDictionary in Squeak.
Oh... Thanks Stéf, I wasn't aware of that.
On 8 June 2018, at 15:13, John Brant wrote:
Is anyone aware of a reason for hanging on to SmallDictionary? I'm also curious to know how SmallDictionary came to be. There must have been some advantage over Dictionary at some point in the past.
It came from RB… the idea was that there are (in the Refactoring engine) a lot of very small dictionaries with <10 elements. The idea is that for such dictionaries, the overhead of hashing was higher than just linear search.
I created its ancestor in VW some 20+ years ago (as RBSmallDictionary). It was used when doing pattern matching. When it performs a pattern match against an AST, it puts the potential value of the pattern variable in the dictionary. If the value is used later in the pattern, then we can get the previous value and make sure that we have an equivalent AST. This allows you to write patterns like:
`@a = `@a
-- Clément Béra https://clementbera.github.io/https://clementbera.wordpress.com/
This discussion has limited its context to comparing *in-memory* Dictionary implementations, where open-addressing has an advantage. Using Magma reveals another context where chained Dictionary's can have some advantages: When they're large, persistent, and shared in a multi-user system -- e.g., when grow rehashing is not an option.
B-trees were designed for that use case. If growing is not an option, then your dictionary's performance will degrade as its size increases, won't it?
Yes, but only logarithmically. But I was not making a performance-based comparison, only a possibility-based one.
Not only do chained Dictionary's make that scenario possible, they improve the concurrency -- different users can add/change/remove from the Dictionary as long as they occur at different keys. With regular Dictionary's, concurrency guarantees a commit conflict.
If you created a new, larger dictionary on grow and lazily copied the entries from the smaller to the larger upon access, then I think you could use open addressing hash tables.
That's an interesting idea, but what if it grows beyond "the larger" again while many entries are still in the smaller one? Add a third, doubled in size again? That sounds like it starts to intrude on the efficiency of open addressing (both computation and space), but the idea still sparks some interesting design thoughts in my head..
Also, I don't see the concurrency issue with open addressing, because as long as different users modify different chains, the case of concurrency is the same as what you described, isn't it?
Magma has a class called "MagmaPreallocatedDictionary" which does exactly this. It's like a regular Dictionary, except for its internal 'array', it utilizes a MagmaArray, which is just like an Array except large and persistent and can grow in place (so able to maintain its identity).
However, this alone was not enough to implement your proposal because of the 'tally'. Clients would each be trying to propose their specific value for that variable = conflict. Thankfully, Magma has yet another special class called MagmaCounter which fits that role beautifully. Designed for multi-user systems, a MagmaCounter provides a concurrent counter. Instances start at a #value of 0, after which any number of sessions may simultaneously increment or decrement (even by more than 1), but no one can actually hard set its value.
Best, Chris
On Sat, 30 Jun 2018, Chris Muller wrote:
This discussion has limited its context to comparing *in-memory* Dictionary implementations, where open-addressing has an advantage. Using Magma reveals another context where chained Dictionary's can have some advantages: When they're large, persistent, and shared in a multi-user system -- e.g., when grow rehashing is not an option.
B-trees were designed for that use case. If growing is not an option, then your dictionary's performance will degrade as its size increases, won't it?
Yes, but only logarithmically. But I was not making a performance-based comparison, only a possibility-based one.
Why is it logarithmic? If I have a hash table with an array of size 100, and I insert 100000 elements into this table, then the best case scenario is that all the lists are of length 1000. So, the operation cost is not logarithmic but linear with a small constant: 0.01.
Not only do chained Dictionary's make that scenario possible, they improve the concurrency -- different users can add/change/remove from the Dictionary as long as they occur at different keys. With regular Dictionary's, concurrency guarantees a commit conflict.
If you created a new, larger dictionary on grow and lazily copied the entries from the smaller to the larger upon access, then I think you could use open addressing hash tables.
That's an interesting idea, but what if it grows beyond "the larger" again while still many entries in the smaller? Add a third which is doubled in size again? That sounds like it starts to intrude into the efficiency of open-addressing (both computation and space), but the idea still sparks some interesting design thoughts in my head..
Yes, add as many hash tables as necessary to the dictionary. Let's say you have an empty dictionary and you want to insert n elements one by one. Let's say each hash table can hold twice as many elements as the previous one upon grow, and the initial hash table can hold 1 element. Once all elements are stored, the largest hash table will have at most n slots, the second largest will have n/2, the third largest will have n/4 and so on down to 1. So, at most log2(n) hash tables will be necessary. Since each hash table operation costs O(1), the overall operation cost will be O(log(n)), and that's the worst possible case. With some housekeeping you should be able to get rid of all the smaller hash tables. E.g. upon lookup, move the object from a smaller hash table to the largest. Once all the elements are removed from a hash table, it can be removed from the dictionary.
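The scheme just described can be sketched briefly. This is a Python illustration (the names are invented, and plain Python dicts stand in for the individual fixed-size open-addressing tables; in a real system, lazy migration into the largest table could itself trigger a further grow):

```python
class LazyGrowDict:
    """Sketch of the grow-without-rehash scheme: instead of rehashing,
    append a new, larger table and migrate entries lazily on lookup.
    Python dicts stand in for the fixed-size hash tables."""

    def __init__(self):
        self._tables = [{}]      # smallest .. largest
        self._capacity = 1       # capacity of the largest table

    def at_put(self, key, value):
        largest = self._tables[-1]
        if key not in largest and len(largest) >= self._capacity:
            self._capacity *= 2              # "grow": append a table, no rehash
            self._tables.append({})
            largest = self._tables[-1]
        for t in self._tables[:-1]:          # drop any stale copy elsewhere
            t.pop(key, None)
        largest[key] = value

    def at_if_absent(self, key, default):
        for t in reversed(self._tables):     # at most log2(n) tables to probe
            if key in t:
                value = t[key]
                if t is not self._tables[-1]:
                    del t[key]               # lazy migration to the largest
                    self._tables[-1][key] = value
                return value
        return default

    def housekeep(self):
        # remove emptied smaller tables, as described above
        self._tables = [t for t in self._tables[:-1] if t] + [self._tables[-1]]
```

Since each lookup probes at most log2(n) tables and each probe is O(1), the overall cost stays O(log n) in the worst case, matching the analysis above.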
Levente
Also, I don't see the concurrency issue with open addressing, because as long as different users modify different chains, the case of concurrency is the same as what you described, isn't it?
Magma has a class called "MagmaPreallocatedDictionary" which does exactly this. It's like a regular Dictionary, except for its internal 'array', it utilizes a MagmaArray, which is just like an Array except large and persistent and can grow in place (so able to maintain its identity).
However, this alone was not enough to implement your proposal because of the 'tally'. Clients would each be trying to propose their specific value for that variable = conflict. Thankfully, Magma has yet another special class called MagmaCounter which fits that role beautifully. Designed for multi-user systems, a MagmaCounter provides a concurrent counter. Instances start at a #value of 0, after which any number of sessions may simultaneously increment or decrement (even by more than 1), but no one can actually hard set its value.
Best, Chris
This discussion has limited its context to comparing *in-memory* Dictionary implementations, where open-addressing has an advantage. Using Magma reveals another context where chained Dictionary's can have some advantages: When they're large, persistent, and shared in a multi-user system -- e.g., when grow rehashing is not an option.
B-trees were designed for that use case. If growing is not an option, then your dictionary's performance will degrade as its size increases, won't it?
Yes, but only logarithmically. But I was not making a performance-based comparison, only a possibility-based one.
Why is it logarithmic? If I have a hash table with an array of size 100, and I insert 100000 elements into this table, then the best case scenario is that all the lists are of length 1000. So, the operation cost is not logarithmic but linear with a small constant: 0.01.
Ah, you misunderstood. By "growing is not an option," I meant one would therefore use a _different_ Dictionary implementation, such as one based on a B-Tree, for example. A dictionary whose implementation organizes or chains its associations into linked lists...
- Chris
On Fri, 29 Jun 2018, Clément Bera wrote:
Do you guys agree with this?
Mostly. I don't think open addressing's performance is proportional to loadFactor / (1 - loadFactor). That formula would be correct if you had an array filled up to loadFactor randomly, but linear probing creates longer chains due to clustering, so it's worse than that in practice. Here's a small table based on my measurements:
lf     formula   measurement
0.5    1.0       ~1.5
0.55   1.222     ~1.96
0.6    1.5       ~2.62
0.65   1.857     ~3.56
0.7    2.333     ~5.06
0.75   3.0       ~7.5
0.8    4.0       ~12.0
In Squeak load factor normally varies between 0.5 and 0.75, but it can be anything between 0.0 and 0.8 (only compaction can push load factor above 0.75), so performance can differ quite a bit. Based on these numbers 0.7 sounds like a reasonable upper limit - something we might want to consider. You might think that linear probing's clustering effect costs too much, but because of cache locality this is probably still faster than linked lists or double hashing.
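Numbers like these are easy to re-derive. Below is a small Python experiment (an illustration, not the actual benchmark behind the table above; it measures successful lookups only, so the absolute values need not match): fill an open-addressing table with linear probing up to a given load factor, then count the average probes per lookup.

```python
import random

def avg_probes(load_factor, size=10007, seed=42):
    """Fill an open-addressing table (linear probing, toy modulo hash)
    up to load_factor, then return the average number of probes for a
    successful lookup of every stored key."""
    rng = random.Random(seed)
    table = [None] * size
    keys = rng.sample(range(size * 10), int(size * load_factor))
    for key in keys:
        i = key % size
        while table[i] is not None:
            i = (i + 1) % size        # linear probing with wrap-around
        table[i] = key
    probes = 0
    for key in keys:
        i = key % size
        p = 1
        while table[i] != key:
            i = (i + 1) % size
            p += 1
        probes += p
    return probes / len(keys)
```

Comparing avg_probes(0.5) with avg_probes(0.8) shows the steep, worse-than-linear growth near full tables that the clustering argument above predicts.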
It is also irrelevant whether you can store more elements than the size of the hash table, because that breaks the precondition of hash table use, and therefore the contract too. So the promised O(1) runtime of operations is gone in that case as well.
Finally, removals do not clog the hash tables using open addressing unless you use lazy deletion[1], which is something you should avoid.
Their open addressing implementation is slightly different from the Smalltalk one (they have deleted entries upon removal while we fix collisions, at least in Pharo 6).
Proactively removing the garbage - aka #fixCollisionsFrom: - takes O(chain length) time, which is amortized O(1) if your hash function is good enough. Lazy deletion degrades lookup performance over time, because lookups have to treat deleted entries as existing entries. It makes deletion somewhat quicker, because you can stop the lookup procedure as soon as you find your element, but what you save is amortized O(1).
Their chaining implementation does not use balanced binary tree when a bucket becomes large like Java's implementation.
Because hash tables in general don't do such a thing. During my time at university, it was one of the tricky questions: why do hash tables use linked lists instead of balanced binary trees for collision resolution? The answer is that you don't need them, because the contract of hash tables is that you will (practically) not have long chains if your hash function is good enough. Using binary trees for collision resolution mitigates situations where your hash function is not good enough or where you don't have full control over your hash values. But those are fairly rare in practice, and in such cases you lose the O(1) operation runtime anyway. Binary trees also require your elements to have a total order defined on them.
I don't really understand why an open addressing implementation is trickier to implement, in high level languages such as Smalltalk I may agree that it might be easier to implement chaining, but in low level language such as in the VM, IMO open addressing is easier since you've got only one large chunk of memory to manage. Maybe it's because of hash collisions, having good hash and good hash collection size (See HashTableSizes) is not that easy.
It's easy to make indexing mistakes, especially when wrap-around is involved. Also, deletion is tricky and is often left as an exercise for the reader. Even the wikipedia article[2] has quite complex pseudocode for it. The version you find in Squeak has a few hidden tricks too, but those are there for performance and the code would still work without them.
Levente
[1] https://en.wikipedia.org/wiki/Lazy_deletion [2] https://en.wikipedia.org/wiki/Open_addressing
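For the record, the deletion algorithm under discussion can be written compactly. The sketch below is a Python transcription of the backward-shift deletion from the Wikipedia article cited above, with a toy modulo hash; it is not the Squeak code itself, but it plays the role of #fixCollisionsFrom: by re-placing chain elements instead of leaving "DELETED" markers:

```python
def remove(table, key):
    """Open-addressing removal without DELETED markers: clear the slot,
    then shift back any element in the collision chain whose home slot
    would otherwise make it unreachable (backward-shift deletion)."""
    size = len(table)
    i = key % size                    # toy hash: identity modulo table size
    while table[i] != key:
        if table[i] is None:
            raise KeyError(key)
        i = (i + 1) % size
    table[i] = None                   # open the hole
    j = i
    while True:
        j = (j + 1) % size
        if table[j] is None:
            return
        k = table[j] % size           # home slot of the element at j
        # move table[j] into the hole unless its home slot lies
        # (cyclically) strictly between the hole i and j
        if (j > i and (k <= i or k > j)) or (j < i and (k <= i and k > j)):
            table[i] = table[j]
            table[j] = None
            i = j
```

The wrap-around index arithmetic and the cyclic "does the home slot lie between hole and probe" test are exactly the parts that are easy to get wrong, which is the trickiness being pointed out here.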
Thanks for the detailed answer. I now understand better why Smalltalk dictionaries are this way.
Best,
On Sat, Jun 30, 2018 at 2:05 AM, Levente Uzonyi leves@caesar.elte.hu wrote:
On Fri, 29 Jun 2018, Clément Bera wrote:
Do you guys agree with this?
Mostly. I don't think open addressing's performance is proportional to loadFactor / (1 - loadFactor). That formula would be correct if you had an array filled up to loadFactor randomly, but linear probing creates longer chains due to clustering, so it's worse than that in practice. Here's a small table based on my measurements:
lf    formula  measurement
0.5   1.0      ~1.5
0.55  1.222    ~1.96
0.6   1.5      ~2.62
0.65  1.857    ~3.56
0.7   2.333    ~5.06
0.75  3.0      ~7.5
0.8   4.0      ~12.0
In Squeak load factor normally varies between 0.5 and 0.75, but it can be anything between 0.0 and 0.8 (only compaction can push load factor above 0.75), so performance can differ quite a bit. Based on these numbers 0.7 sounds like a reasonable upper limit - something we might want to consider. You might think that linear probing's clustering effect costs too much, but because of cache locality this is probably still faster than linked lists or double hashing.
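Levente's clustering effect is easy to reproduce with a small simulation. The sketch below is hypothetical Python, not the Squeak code (function and parameter names are my own): it fills a linear-probing table to a given load factor and counts the slots inspected per successful lookup. The absolute numbers won't match the table above exactly (which presumably also reflects unsuccessful lookups), but the super-linear growth with load factor shows the same clustering.

```python
import random

def avg_probes(table_size, load_factor, seed=42):
    """Fill an open-addressing table with linear probing up to
    load_factor, then return the mean number of slots inspected
    per successful lookup. Illustrative sketch only."""
    rng = random.Random(seed)
    table = [None] * table_size
    keys = rng.sample(range(10 * table_size), int(table_size * load_factor))
    for k in keys:
        i = k % table_size
        while table[i] is not None:      # linear probing on collision
            i = (i + 1) % table_size
        table[i] = k
    total = 0
    for k in keys:
        i = k % table_size
        probes = 1
        while table[i] != k:             # walk the collision chain
            i = (i + 1) % table_size
            probes += 1
        total += probes
    return total / len(keys)

print(avg_probes(10007, 0.5))   # close to the theoretical minimum
print(avg_probes(10007, 0.8))   # noticeably worse due to clustering
```

Running this for several load factors makes it clear why an upper limit around 0.7-0.75 is a reasonable engineering compromise.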
It is also irrelevant whether you can store more elements than the size of the hash table: doing so breaks the precondition of hash tables, and therefore the contract too, so the promised O(1) runtime of operations is gone in that case as well.
Finally, removals do not clog the hash tables using open addressing unless you use lazy deletion[1], which is something you should avoid.
Their open addressing implementation is slightly different from the Smalltalk one (they have deleted entries upon removal while we fix collisions, at least in Pharo 6).
Proactively removing the garbage - aka #fixCollisionsFrom: - takes O(chain length) time, which is amortized O(1) if your hash function is good enough. Lazy deletion degrades lookup performance over time, because lookups have to treat deleted entries as existing entries. It makes deletion somewhat quicker, because you can stop the lookup procedure as soon as you find your element, but what you save is amortized O(1).
Their chaining implementation does not use a balanced binary tree when a bucket becomes large, as Java's implementation does.
Because hash tables in general don't do any such thing. During my time at the university, one of the tricky questions was why hash tables use linked lists instead of balanced binary trees for collision resolution. The answer is that you don't need them, because the contract of hash tables is that you'll (practically) not have long chains if your hash function is good enough. Using binary trees for collision resolution mitigates situations where your hash function is not good enough or where you don't have full control over your hash values. But those are fairly rare in practice, and you still lose the O(1) operation runtime. Binary trees also require your elements to have a total order defined on them.
I don't really understand why an open addressing implementation is trickier to implement. In a high-level language such as Smalltalk I may agree that chaining is easier to implement, but in a low-level language such as the VM, IMO open addressing is easier, since you've got only one large chunk of memory to manage. Maybe it's because of hash collisions: having a good hash and good hash table sizes (see HashTableSizes) is not that easy.
It's easy to make indexing mistakes, especially when wrap-around is involved. Also, deletion is tricky and is often left as an exercise for the reader. Even the Wikipedia article[2] has quite complex pseudocode for it. The version you find in Squeak has a few hidden tricks too, but those are there for performance and the code would still work without them.
Levente
[1] https://en.wikipedia.org/wiki/Lazy_deletion [2] https://en.wikipedia.org/wiki/Open_addressing
I'm curious because I do more and more bare metal programming and I end up having to implement this kind of things myself, I've always implemented naive open addressing up until now without really understanding details.
[1] http://www.algolist.net/Data_structures/Hash_table/Open_addressing
On Sat, Jun 16, 2018 at 8:49 PM, Max Leske maxleske@gmail.com wrote:
On 8 Jun 2018, at 09:48, Stéphane Rollandin wrote: FWIW it seems there is no SmallDictionary in Squeak.
Oh... Thanks Stéf, I wasn't aware of that.
On 8 June 2018, at 15:13, John Brant wrote: Is anyone aware of a reason for hanging on to SmallDictionary? I'm also curious to know how SmallDictionary came to be. There must have been some advantage over Dictionary at some point in the past.
It came from RB… the idea was that there are (in the Refactoring engine) a lot of very small dictionaries with <10 elements. The idea is that for such dictionaries, the overhead of hashing is higher than that of a simple linear search.
I created its ancestor in VW some 20+ years ago (as RBSmallDictionary). It was used when doing pattern matching. When it performs a pattern match against an AST, it puts the potential value of the pattern variable in the dictionary. If the value is used later in the pattern, then we can get the previous value and make sure that we have an equivalent AST. This allows you to write patterns like:
`@a = `@a
to find where someone has the same expression on both sides of the #= message. Since most patterns have very few pattern variables, these dictionaries won't hold many entries. Furthermore, we are likely to abort the match when we have 0 entries.
The original RBSmallDictionary had an #empty method that "emptied" the dictionary without actually removing any elements -- it just set the size to 0. In a general dictionary this would lead to memory leaks since the previous values are still held by the dictionary. However, these dictionaries were only used during the matching process and went away after the process completed.
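The structure John describes can be sketched roughly in Python: parallel key/value arrays, linear search instead of hashing, and an "empty" that merely resets the size. Class and method names below are my own, not the actual Smalltalk code:

```python
class SmallDict:
    """Sketch of the RBSmallDictionary idea: linear search over
    parallel arrays, intended for a handful of entries."""

    def __init__(self):
        self._keys = []
        self._values = []
        self._size = 0

    def at_put(self, key, value):
        for i in range(self._size):          # linear scan, fine for <10 entries
            if self._keys[i] == key:
                self._values[i] = value
                return
        if self._size < len(self._keys):     # reuse a slot left behind by empty()
            self._keys[self._size] = key
            self._values[self._size] = value
        else:
            self._keys.append(key)
            self._values.append(value)
        self._size += 1

    def at_if_absent(self, key, default=None):
        for i in range(self._size):
            if self._keys[i] == key:
                return self._values[i]
        return default

    def empty(self):
        # O(1) reset; old entries stay referenced by the arrays,
        # which is the potential leak John mentions.
        self._size = 0
```

Because empty() never touches the arrays, repeated match/reset cycles do no allocation at all, which is exactly the pattern-matching use case described above.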
Anyway, at the time when we converted our pattern matching code from using the VW parser with our pattern matching extensions to use the new RB parser with pattern matching, the time to run Smalllint on the image was cut in half even though our parser was quite a bit slower than the VW parser. I don't remember everything that was done, but I think that most of the speedup came from having special pattern AST nodes and the small dictionary.
John Brant
Very interesting! Thanks John!
As Marcus has mentioned before in this thread, it would make a lot of sense to run benchmarks again. Actually, I think it would be nice to have a benchmark suite for these cases, one that would let us monitor performance and ensure that changes to the codebase don't have a detrimental effect. I'm not saying that it would be easy to make this happen; writing proper benchmarks is hard (for me especially, as it seems, given my utter failure to think of the edge cases before starting this thread). Such a suite might also prevent these sorts of questions on the mailing list in the future, or at least make it easier to answer them.
On 8 June 2018, at 13:01, Andres Valloud wrote:
In addition, open addressing with linear probing has superior cache line read behavior (no indirection / random traversal, and if the first probe misses, the second one was likely cached by the first one).
Ah, nice catch! Although that would require frequent access to the dictionary / repeated access to the same items to have an effect, wouldn't it?
On 8 Jun 2018, at 10:01, Clément Bera wrote:
Hi Max,
Theoretically, for a small number of elements, usually somewhere between 3 and 30 depending on implementations, a linear search is faster than a hash search, especially in the Pharo dictionary hash search implementation.
Efficient dictionary implementations are usually bucket-based. The dictionary holds a certain number of buckets, and the key's hash determines the bucket in which the key-value pair resides. Small buckets are linear (arrays or linked lists). Large buckets are typically balanced binary trees (red-black trees, typically). Below a certain number of elements there is a single bucket, which means a linear search is performed, as in SmallDictionary. When the dictionary grows, the search becomes a combination of a hash search and a linear or tree search.
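A minimal sketch of this bucket scheme, in Python (class and method names are hypothetical; a production version would also grow the table as the load increases and could replace large buckets with balanced trees, as described above):

```python
class BucketDict:
    """Chaining dictionary: the key hash picks a bucket,
    then a linear scan runs inside the (small) bucket."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket_for(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def at_put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key exists: replace its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: extend this bucket only

    def at_if_absent(self, key, default=None):
        for k, v in self._bucket_for(key):
            if k == key:
                return v
        return default
```

With a good hash function each bucket stays short, so the scan inside a bucket behaves much like the linear search of a SmallDictionary, which is the combination Clément describes.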
Pharo dictionary search is first hash-based, but then all the buckets are laid out next to each other in the same array and a linear search is performed there, leading to many collisions and slower search times (especially when the element is not found); sometimes the code searches across multiple buckets because the dictionary is too full or there are too many near-collisions. The memory consumption is competitive with the more advanced implementations though (precise measurements would need to be made).
Method dictionaries are represented differently to optimize the look-up logic.
If you want to improve things and have one dictionary implementation instead of two, implement or look for a bucket-based dictionary and put it in the base image instead of Dictionary. This is quite some work, since there are many APIs to port. You can look at the Pinnochio implementation; it's quite good, but they've not implemented large buckets.
On Fri, 8 Jun 2018, Max Leske wrote:
Is anyone aware of a reason for hanging on to SmallDictionary? I'm also curious to know how SmallDictionary came to be. There must have been some advantage over Dictionary at some point in the past.
IIRC it came from Seaside, where it was used to minimize the memory footprint of sessions.
Levente