On Thu, Dec 03, 2009 at 03:03:48PM -0500, David T. Lewis wrote:
I would guess that the difference I am seeing now is due to compiler version. Ian's VM was compiled with gcc 4.3.3 and I am using an older gcc 4.1.2 compiler.
The results I got were:
For a 64-bit VM that I compiled locally, installed in /usr/local:
  CPU: AMD Turion(tm) 64 Mobile Technology ML-34, 1600 MHz
  OS: Linux version 2.6.18.2-34-default
  Compiler for VM: gcc 4.1.2
  Results:
    0 tinyBenchmarks ==> 154031287 bytecodes/sec; 5145368 sends/sec
    0 tinyBenchmarks ==> 153201675 bytecodes/sec; 5183202 sends/sec
    0 tinyBenchmarks ==> 151658767 bytecodes/sec; 5268426 sends/sec
For a 32-bit VM from Ian's site, running the same image from a local directory:
  CPU: AMD Turion(tm) 64 Mobile Technology ML-34, 1600 MHz
  OS: Linux version 2.6.18.2-34-default
  Compiler for VM: gcc 4.3.3
  Results:
    0 tinyBenchmarks ==> 62135922 bytecodes/sec; 3330746 sends/sec
    0 tinyBenchmarks ==> 62256809 bytecodes/sec; 3425013 sends/sec
    0 tinyBenchmarks ==> 62317429 bytecodes/sec; 3346096 sends/sec
After installing a prodigious number of 32-bit libraries on my 64-bit Linux, I can now build a 32-bit VM for comparison. Here are the results of a 64-bit versus 32-bit VM using the same compiler, operating system, and hardware:
Compiled in 64-bit mode:
  0 tinyBenchmarks '155339805 bytecodes/sec; 5304104 sends/sec'
  0 tinyBenchmarks '155812538 bytecodes/sec; 5393385 sends/sec'
  0 tinyBenchmarks '155151515 bytecodes/sec; 5272367 sends/sec'
Compiled in 32-bit mode:
  0 tinyBenchmarks '136679124 bytecodes/sec; 4652907 sends/sec'
  0 tinyBenchmarks '135521439 bytecodes/sec; 4659058 sends/sec'
  0 tinyBenchmarks '135665076 bytecodes/sec; 4690056 sends/sec'
So overall I see about a 14% speed advantage for the 64-bit VM versus the 32-bit VM on this platform. Again, this is with the older gcc 4.1.2 compiler.
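As a rough check on the 14% figure, the three runs in each mode can be averaged and compared. This is only a sketch: the helper name `speedup_percent` is made up for illustration, and the only data used are the tinyBenchmarks numbers quoted in this message.

```python
from statistics import mean

# (bytecodes/sec, sends/sec) for the three runs of each build, gcc 4.1.2, same hardware
vm64 = [(155339805, 5304104), (155812538, 5393385), (155151515, 5272367)]
vm32 = [(136679124, 4652907), (135521439, 4659058), (135665076, 4690056)]

def speedup_percent(fast, slow):
    """Percentage by which the mean of `fast` exceeds the mean of `slow`."""
    return (mean(fast) / mean(slow) - 1.0) * 100.0

# Compare the two columns separately
bytecodes = speedup_percent([b for b, _ in vm64], [b for b, _ in vm32])
sends = speedup_percent([s for _, s in vm64], [s for _, s in vm32])

print(f"64-bit advantage, bytecodes/sec: {bytecodes:.1f}%")
print(f"64-bit advantage, sends/sec:     {sends:.1f}%")
```

Both columns come out a little over 14%, which matches the estimate in the text. With only three runs per build, the result should be read as indicative rather than precise.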
Dave
On Thu, Dec 03, 2009 at 05:33:09PM -0500, David T. Lewis wrote:
With apologies, I have to retract that last set of numbers. I may have inadvertently let my CPU fall into power save mode (I'm not sure). But in any case, I wanted to repeat the experiment, so I built two VMs from scratch, and the results I get this time are:
VM compiled in 64-bit mode:
  0 tinyBenchmarks '158024691 bytecodes/sec; 4354017 sends/sec'
  0 tinyBenchmarks '156670746 bytecodes/sec; 5187016 sends/sec'
  0 tinyBenchmarks '155623100 bytecodes/sec; 5198491 sends/sec'
  0 tinyBenchmarks '157635467 bytecodes/sec; 5179393 sends/sec'
  0 tinyBenchmarks '157732593 bytecodes/sec; 5104384 sends/sec'
  0 tinyBenchmarks '157927205 bytecodes/sec; 5221596 sends/sec'
VM compiled in 32-bit mode:
  0 tinyBenchmarks '160300563 bytecodes/sec; 5179393 sends/sec'
  0 tinyBenchmarks '160200250 bytecodes/sec; 5256640 sends/sec'
  0 tinyBenchmarks '160703075 bytecodes/sec; 5126658 sends/sec'
  0 tinyBenchmarks '160905091 bytecodes/sec; 5085970 sends/sec'
  0 tinyBenchmarks '157635467 bytecodes/sec; 5202328 sends/sec'
  0 tinyBenchmarks '158907510 bytecodes/sec; 5233225 sends/sec'
So the 32-bit VM may be two or three percent faster than the 64-bit version on this platform.
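The "two or three percent" estimate can be checked the same way, by averaging the six runs per build from the repeated experiment. This is a sketch under the assumption that a plain mean is a fair summary; `percent_faster` is a hypothetical helper name, not part of any Squeak tooling.

```python
from statistics import mean

# (bytecodes/sec, sends/sec) for the six repeated runs of each build
vm64 = [(158024691, 4354017), (156670746, 5187016), (155623100, 5198491),
        (157635467, 5179393), (157732593, 5104384), (157927205, 5221596)]
vm32 = [(160300563, 5179393), (160200250, 5256640), (160703075, 5126658),
        (160905091, 5085970), (157635467, 5202328), (158907510, 5233225)]

def percent_faster(a, b):
    """Percentage by which the mean of `a` exceeds the mean of `b`."""
    return (mean(a) / mean(b) - 1.0) * 100.0

# 32-bit build relative to the 64-bit build, one column at a time
bytecodes_gain = percent_faster([b for b, _ in vm32], [b for b, _ in vm64])
sends_gain = percent_faster([s for _, s in vm32], [s for _, s in vm64])

print(f"32-bit advantage, bytecodes/sec: {bytecodes_gain:.1f}%")
print(f"32-bit advantage, sends/sec:     {sends_gain:.1f}%")
```

On these numbers the 32-bit build is ahead by roughly 1.6% in bytecodes/sec and roughly 2.8% in sends/sec. The sends figure leans heavily on the low first 64-bit run (4354017 sends/sec), so the real gap may be smaller.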
Summary: I do not see evidence of a big performance hit from use of 64-bit pointers, but there may well be a large performance hit from use of the latest gcc compiler.
Dave
Ok, is this a 64bit VM built with #define SQ_VI_BYTES_PER_WORD 4 ? or #define SQ_VI_BYTES_PER_WORD 8 ?
what that is set to does change things.
btw on a MacBook Pro with a 2.33 GHz Intel Core 2 Duo, a 4.x Squeak Macintosh Carbon VM does
  '533611255 bytecodes/sec; 11577747 sends/sec'
  '535005224 bytecodes/sec; 11599578 sends/sec'
  '533889468 bytecodes/sec; 11280518 sends/sec'
  '535284892 bytecodes/sec; 11837670 sends/sec'
  '535005224 bytecodes/sec; 11405773 sends/sec'
  '533611255 bytecodes/sec; 9517756 sends/sec'
  '533333333 bytecodes/sec; 11769725 sends/sec'
  '533611255 bytecodes/sec; 11420129 sends/sec'
  '533611255 bytecodes/sec; 11563238 sends/sec'
  '535284892 bytecodes/sec; 11304036 sends/sec'
which is 8.5x faster in bytecodes/sec and 3.3x faster in sends/sec than the 32-bit VM example you give. Maybe someone can run this on an equivalent Intel Core 2 Duo to understand why your 1.6 GHz machine numbers are so dreadful.
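For reference, the ratios can be reproduced from the posted runs. The sketch below assumes that "the 32bit vm example you give" means the 32-bit VM from Ian's site (about 62 million bytecodes/sec); the thread does not say this outright, though it is the only baseline that gives ratios of this size. The variable names are illustrative only.

```python
from statistics import mean

# MacBook Pro, 4.x Carbon VM: (bytecodes/sec, sends/sec) for the ten runs
mac = [(533611255, 11577747), (535005224, 11599578), (533889468, 11280518),
       (535284892, 11837670), (535005224, 11405773), (533611255, 9517756),
       (533333333, 11769725), (533611255, 11420129), (533611255, 11563238),
       (535284892, 11304036)]

# Assumed baseline: 32-bit VM from Ian's site on the 1.6 GHz Turion laptop
turion = [(62135922, 3330746), (62256809, 3425013), (62317429, 3346096)]

# Ratio of mean throughput, one column at a time
bytecode_ratio = mean(b for b, _ in mac) / mean(b for b, _ in turion)
send_ratio = mean(s for _, s in mac) / mean(s for _, s in turion)

print(f"bytecodes/sec ratio: {bytecode_ratio:.2f}x")
print(f"sends/sec ratio:     {send_ratio:.2f}x")
```

The means give ratios slightly above the quoted 8.5x and 3.3x, so those figures look like rounded-down estimates rather than exact values.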
5.x is with #define SQ_VI_BYTES_PER_WORD 4 in 32bit mode CHEATING below, but I've not tuned it yet.
  '504930966 bytecodes/sec; 12435935 sends/sec'
  '506429277 bytecodes/sec; 12229906 sends/sec'
  '506429277 bytecodes/sec; 12623070 sends/sec'
  '505429417 bytecodes/sec; 12649026 sends/sec'
  '505928853 bytecodes/sec; 12588628 sends/sec'
  '504433497 bytecodes/sec; 11868121 sends/sec'
  '505429417 bytecodes/sec; 12649026 sends/sec'
  '505429417 bytecodes/sec; 12614442 sends/sec'
  '506429277 bytecodes/sec; 11769725 sends/sec'
  '505928853 bytecodes/sec; 11505566 sends/sec'
5.x with SQ_VI_BYTES_PER_WORD 8 in 64bit mode GCC 4.2
  '482563619 bytecodes/sec; 14152552 sends/sec'
  '483474976 bytecodes/sec; 14174292 sends/sec'
  '482563619 bytecodes/sec; 14076986 sends/sec'
  '481655691 bytecodes/sec; 14217973 sends/sec'
  '482563619 bytecodes/sec; 14141708 sends/sec'
  '483018867 bytecodes/sec; 13907256 sends/sec'
  '483018867 bytecodes/sec; 14120068 sends/sec'
  '483474976 bytecodes/sec; 14012854 sends/sec'
  '483474976 bytecodes/sec; 14076986 sends/sec'
  '483474976 bytecodes/sec; 13896783 sends/sec'
5.x with SQ_VI_BYTES_PER_WORD 4 in 64bit mode CHEATING below, but I've not tuned it yet.
  '433530906 bytecodes/sec; 13051576 sends/sec'
  '432432432 bytecodes/sec; 13042352 sends/sec'
  '433898305 bytecodes/sec; 13051576 sends/sec'
  '433530906 bytecodes/sec; 13042352 sends/sec'
  '433898305 bytecodes/sec; 12941745 sends/sec'
  '433898305 bytecodes/sec; 13023944 sends/sec'
  '414910858 bytecodes/sec; 12815922 sends/sec'
  '433898305 bytecodes/sec; 13005587 sends/sec'
  '433164128 bytecodes/sec; 13005587 sends/sec'
  '434266327 bytecodes/sec; 12444321 sends/sec'
--
===========================================================================
John M. McIntosh johnmci@smalltalkconsulting.com Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================
In all cases I was using a 32-bit image, hence SQ_VI_BYTES_PER_WORD 4.
My computer is a small laptop, several years old, with an AMD processor that was designed for low power rather than performance. I would not be surprised if a newer mac is quite a bit faster.
I think you can expect the 64 bit image (SQ_VI_BYTES_PER_WORD 8) to be relatively slow. Entirely aside from address calculation issues, the image is going to be full of 64-bit integer arithmetic in places where 32-bit arithmetic would normally be happening, and that is bound to take a toll. There is plenty of room for optimization.
Dave
On Thu, Dec 03, 2009 at 04:49:38PM -0800, John M McIntosh wrote:
Ok, is this a 64bit VM built with #define SQ_VI_BYTES_PER_WORD 4 ? or #define SQ_VI_BYTES_PER_WORD 8 ?
what that is set to does change things.
Hi Dave,
I think you can expect the 64 bit image (SQ_VI_BYTES_PER_WORD 8) to be relatively slow. Entirely aside from address calculation issues, the image is going to be full of 64-bit integer arithmetic in places where 32-bit arithmetic would normally be happening, and that is bound to take a toll.
My experience with em64t has been entirely positive. Programs consistently run faster in 64-bit mode than in 32-bit mode. Much of the improvement is probably due to the 8 additional registers and the passing of the first few arguments in registers rather than on the stack. The Squeak VM likely does not benefit the way most other programs do because of the aggressive inlining of methods in the C code generator, eliminating entirely the impact of a better argument passing convention.
I have to say I am disgusted at gcc 4.3 though. Maybe replacing all the -O/-f options with '-Os -fno-cse-follow-jumps -fomit-frame-pointer' would help? It has worked wonders for me on (non-Squeak) bytecode interpreters; the Core2 in particular seems hypersensitive to locality and alignment of loops/jumps at the start of cache lines. (Attempts to manually allocate machine registers to VM registers always reduce performance with gcc-4.3, but I haven't experimented with explicit register assignments in the Squeak VM on 64-bit hardware.)
Cheers, Ian
vm-dev@lists.squeakfoundation.org