Hans, Tagging/untagging could be very fast! See my other post
1) Untagging a double = no op
2) Tagging a double = an isnan test (so as to have a representable NaN in Smalltalk)
3) This trick does not add any extra cost to tagging/untagging of other oops
What about the cost of allocating doubles? Of course, you won't reach the speed of optimized compiled code using the FPU extensively. BUT you remove the main cost of Smalltalk number crunching: pressure on the ObjectMemory garbage collector!
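A minimal C sketch of the scheme described above, assuming the 64-bit "everything is a float" representation in which non-NaN bit patterns are immediate doubles and pointers/SmallIntegers live inside the NaN space. The names and the canonical NaN pattern chosen here are illustrative, not taken from any actual VM:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Hypothetical 64-bit oop: any bit pattern that is NOT a NaN is an
   immediate double; NaN patterns are reserved for other oops. */
typedef uint64_t oop;

/* The single NaN bit pattern Smalltalk keeps representable as a Float
   (illustrative choice: the usual quiet-NaN pattern). */
#define CANONICAL_NAN UINT64_C(0x7FF8000000000000)

/* Untagging a double is a no-op: just reinterpret the bits. */
static double untag_double(oop o) {
    double d;
    memcpy(&d, &o, sizeof d);
    return d;
}

/* Tagging needs only an isnan test: a non-NaN double already is a
   valid oop; any real NaN must be collapsed to the one canonical,
   representable NaN so it cannot collide with pointer patterns. */
static oop tag_double(double d) {
    oop o;
    if (isnan(d))
        return CANONICAL_NAN;
    memcpy(&o, &d, sizeof o);
    return o;
}
```

Note that the isnan test on the tagging side is exactly the "representable NaN" provision from point 2, and untagging really is free, as claimed.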
Sent by nicolas cellier via Google Reader: Re: floats, via gmane.comp.lang.smalltalk.squeak.general, from Hans-Martin Mosner, 15/03/09:
Jecel Assumpcao Jr schrieb: That does not work since xxx10 is used as a sentinel value in the garbage collector. I think a better approach is to handle floats specially in a JIT, and keep them unboxed for typical sequences of arithmetic manipulation within methods. Of course, using immediate floats does avoid the object creation and destruction overhead, but you still have some overhead for tagging and untagging, which on modern architectures is still much higher than the actual floating point operation costs. Cheers, Hans-Martin
nicolas cellier schrieb:
Hans, Tagging/untagging could be very fast! See my other post
- Untagging a double = no op
- Tagging a double = an isnan test (so as to have a representable NaN in Smalltalk)
- This trick does not add any extra cost to tagging/untagging of other oops
That's true for a 64-bit processor, and on such hardware I see the advantages of this scheme. For 32-bit hardware, it won't work. Hopefully we'll all have suitable hardware in the near future... But for example, I'm running 32-bit Linux here on my 64-bit AMD processor just because the WLAN card I'm using only has a 32-bit Windows driver, and ndiswrapper on 64-bit Linux would require a 64-bit driver to work correctly (which is somewhat stupid IMHO but I'm not going to hack ndiswrapper). In the real world, there are tons of silly constraints like this which still prevent people from fully using 64-bit hardware.
Cheers, Hans-Martin
Hans-Martin Mosner wrote:
nicolas cellier schrieb:
*snip*
In the real world, there are tons of silly constraints like this which still prevent people from fully using 64-bit hardware.
Silly questions from a lurker:
Roughly how much current hardware is still 32-bit?
More importantly, how much influence does the OS have on this? Can you make use of 64-bit features running on a 32-bit OS running on 64-bit hardware architectures?
Sorry for the stupid questions ...
Claus
Thanks to everyone who is contributing to this thread! I should have been more explicit about my interest in this area: a good floating point unit is about the same size as a reasonably compact integer core. So for the same cost I can have twice as many processors if I am willing to have slow floating point. The worst case would be to have both half as many processors (with a FPU each) *and* slow floating point anyway due to Squeak's limitations.
Squeak does have a scheme for good floating point performance: the FloatArray. In a previous discussion about this with Bryce, he felt that between this and being able to compile away boxing/unboxing operations within a single method (also mentioned by Hans-Martin in this thread) we could have essentially the same performance as immediate floats (and Hans-Martin pointed out that the bit pattern I suggested is already in use anyway).
Nicolas evaluated the advantages of the "64 bit everything is a float" scheme, whose inventor I unfortunately can't remember. One trick that some old mainframes used was to represent integers as denormalized floating point numbers, so you would need no checks nor conversions. The IEEE 754 standard doesn't seem to support this, however.
As Bert pointed out, lack of floating point hardware was the reason given for not choosing the ARM for the first OLPC machine. Ivan mentioned fixed point as an alternative, and this is actually what I have used in my projects (especially the Forth-based ones) for most of the past ten years. But for Squeak I would rather just give people what they are used to (not counting Fractions, LargeIntegers and such, of course). Juan gave a list of application domains where floats are considered fundamental.
Hans-Martin and Claus asked about the availability of 64-bit hardware for the scheme I mentioned. That is indeed a problem (of the 14 or so computers I have around here, only my old Sparc machine would be able to run a 64-bit Squeak, for example) but it could be solved by doing some conversions when saving/loading images. We need to do transformations when moving between 32- and 64-bit images anyway, and unboxing floats would be one of the simplest.
-- Jecel
2009/3/15 Hans-Martin Mosner hmm@heeg.de
*snip*
Of course, most of the nice properties come from the 64-bit addressing... Hey, wait, I don't even have a 64-bit processor in my house! For fun, I imagine we could emulate it by spanning each oop over two int32s: typedef struct {int32 high,low;} oop; I would expect a slower VM by roughly a factor of 2 - except for double arithmetic...
Nicolas
On Sun, Mar 15, 2009 at 1:57 PM, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
Of course, most of the nice properties come from the 64-bit addressing... Hey, wait, I don't even have a 64-bit processor in my house! For fun, I imagine we could emulate it by spanning each oop over two int32s: typedef struct {int32 high,low;} oop; I would expect a slower VM by roughly a factor of 2 - except for double arithmetic...
In theory, but only for memory-limited symbolic applications. If you have an application that fits entirely in cache then I would expect parity. The argument for symbolic applications is that a 64-bit symbolic app has to move twice the data as a 32-bit symbolic app because each symbolic object is twice the size.
Many Smalltalk applications are large and hence more in the memory-limited range, but many Smalltalk objects are byte data, and so a) they are not moving twice the data all the time and b) images do not double in size. There are also opportunities for optimization in a 64-bit implementation. In particular, in 64-bit VW/HPS I was able to store the number of fixed fields in an object in its header instead of only in the class format word. Hence 64-bit HPS has much faster at:put: than 32-bit.
So the experience with my 64-bit VW implementation was that:
- typical large symbolic benchmarks (e.g. all senders) were 15% to 20% slower in 64 bits than in 32 bits
- immediate double arithmetic is about 3 times faster, at about half the speed of immediate integer operations
- images are about 55% larger when converted from 32-bit to 64-bit
Eliot
Hi Eliot, AFAIK, VW does not use the NaN trick, so it has to perform extra conversions on SmallDouble, doesn't it?
Nicolas
On Mon, Mar 16, 2009 at 11:41 AM, Nicolas Cellier < nicolas.cellier.aka.nice@gmail.com> wrote:
Hi Eliot, AFAIK, VW does not use the nan trick, so it has to perform extra conversions on SmallDouble, doesn't it?
That's right. The VW 64-bit immediate double representation is:

msb                                              lsb
| 8-bit exponent | 52-bit mantissa | sign | 3-bit tag |
i.e. immediate doubles occupy the middle range of the doubles, corresponding to single-precision floats (roughly 10^-38 to 10^38).
Putting the sign bit down low means that +/- 0 are the only immediate double values whose bit patterns are <= 15.
Converting an immediate double to an IEEE double then involves:
- logical shift right 3 bits (sign is now lsb)
- compare against 1 to distinguish +/-0 from others
- if > 1 (not +/-0): add exponent offset (maps 8-bit exponent to 11-bit exponent)
- rotate right 1 (move sign to sign bit)
- move integer reg to float reg
Going in the other direction:
- move fp reg to integer reg
- rotate left 1 bit (sign is now lsb)
- compare against 1 to distinguish +/-0 from others
- if > 1 (not +/-0): subtract exponent offset (maps 11-bit exponent to 8-bit exponent); fail if overflow (e.g. jump to code that boxes the float)
- shift left 3
- add tags
So more complicated than immediate integers but of a similar complexity to the fp unit's internal operations on floats (extracting exponent, shifting mantissa by exponent).
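The two conversion sequences above can be sketched in C as follows. This assumes the layout Eliot describes (3-bit tag in the low bits, sign next, then 52-bit mantissa, then 8-bit exponent); the actual tag value used by VW is not given in the thread, so FLOAT_TAG below is a placeholder:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define TAG_BITS   3
#define FLOAT_TAG  UINT64_C(6)              /* placeholder tag value */
/* (1023 - 127), positioned at the exponent field (above 52 mantissa
   bits and 1 sign bit): maps the 8-bit bias to the 11-bit bias. */
#define EXP_OFFSET (UINT64_C(896) << 53)

static uint64_t rotl1(uint64_t x) { return (x << 1) | (x >> 63); }
static uint64_t rotr1(uint64_t x) { return (x >> 1) | (x << 63); }

/* immediate double -> IEEE double */
static double immediate_to_double(uint64_t imm) {
    uint64_t v = imm >> TAG_BITS;   /* drop the tag; sign is now lsb */
    if (v > 1)                      /* not +/-0 */
        v += EXP_OFFSET;            /* 8-bit exponent -> 11-bit */
    v = rotr1(v);                   /* move sign up to the sign bit */
    double d;
    memcpy(&d, &v, sizeof d);
    return d;
}

/* IEEE double -> immediate; returns false if the value needs boxing */
static bool double_to_immediate(double d, uint64_t *imm) {
    uint64_t v;
    memcpy(&v, &d, sizeof v);
    v = rotl1(v);                   /* sign is now lsb */
    if (v > 1) {                    /* not +/-0 */
        uint64_t exp11 = v >> 53;   /* 11-bit exponent field */
        if (exp11 < 896 || exp11 > 896 + 255)
            return false;           /* out of 8-bit range: box it */
        v -= EXP_OFFSET;            /* 11-bit exponent -> 8-bit */
    }
    *imm = (v << TAG_BITS) | FLOAT_TAG;
    return true;
}
```

The rotate-left-then-compare-against-1 step neatly handles +/-0: after the rotate, +0 becomes 0 and -0 becomes 1, so both bypass the exponent adjustment, just as the bullet lists describe.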
2009/3/16 Eliot Miranda eliot.miranda@gmail.com:
In theory, but only for memory-limited symbolic applications. If you have an application that fits entirely in cache then I would expect parity. The argument for symbolic applications is that a 64-bit symbolic app has to move twice the data as a 32-bit symbolic app because each symbolic object is twice the size.
Couldn't you compress the oops? AFAIK HotSpot was the last remaining JVM to get this.
Cheers Philippe
On Mon, Mar 16, 2009 at 2:15 PM, Philippe Marschall < philippe.marschall@gmail.com> wrote:
Couldn't you compress the oops? AFAIK HotSpot was the last remaining JVM that got this.
I don't see the point. Memory is cheap, getting cheaper. 64-bits means extremely cheap address space. Why slow down the critical path to save space?
2009/3/16 Eliot Miranda eliot.miranda@gmail.com:
Couldn't you compress the oops? AFAIK HotSpot was the last remaining JVM that got this.
I don't see the point. Memory is cheap, getting cheaper.
But memory access isn't.
64-bits means extremely cheap address space. Why slow down the critical path to save space?
Because it's faster (because you have to move around less data) and gets you closer to 32-bit speed.
http://wikis.sun.com/display/HotSpotInternals/CompressedOops http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/#c... http://www.lowtek.ca/roo/2008/java-performance-in-64bit-land/ http://www.devwebsphere.com/devwebsphere/2008/10/websphere-nd-70.html http://webspherecommunity.blogspot.com/2008/10/64-bit-performance-thoughputm...
Cheers Philippe
Hi Philippe,
On Mon, Mar 16, 2009 at 10:52 PM, Philippe Marschall < philippe.marschall@gmail.com> wrote:
Because it's faster (because you have to move around less data) and gets you closer to 32-bit speed.
http://wikis.sun.com/display/HotSpotInternals/CompressedOops
http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/#c... http://www.lowtek.ca/roo/2008/java-performance-in-64bit-land/ http://www.devwebsphere.com/devwebsphere/2008/10/websphere-nd-70.html
http://webspherecommunity.blogspot.com/2008/10/64-bit-performance-thoughputm...
OK, and this is a reasonable stop-gap until machines catch up with the potential of the 64-bit address space. It reminds me of segmented approaches to 16-bit limits on PDP-11s, 8086s et al. Basically these guys are scaling 32-bit oops by 8, allowing a maximum heap size of 32 GB and 4G small objects. There are other approaches, like using an indirection table for intra-segment object references and using 32-bit oops within a segment, which would fit well with a Train algorithm.
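The scale-by-8 scheme can be sketched in a few lines of C: a 32-bit "narrow" oop is decoded by shifting left by the object alignment and adding a heap base, which is what gives the 32 GB (2^32 * 8 bytes) limit. Names here are illustrative, not HotSpot's actual code:

```c
#include <stdint.h>

/* Compressed-oops sketch: objects are 8-byte aligned, so the low 3
   bits of an address are always zero and a 32-bit narrow oop can
   cover 2^32 * 8 = 32 GB of heap above some base address. */
#define OOP_SHIFT 3

static uintptr_t heap_base;   /* assumed to be set at VM startup */

/* narrow oop -> full address (the extra work on the critical path) */
static uintptr_t decode_oop(uint32_t narrow) {
    return heap_base + ((uintptr_t)narrow << OOP_SHIFT);
}

/* full address -> narrow oop (address must lie within the heap) */
static uint32_t encode_oop(uintptr_t addr) {
    return (uint32_t)((addr - heap_base) >> OOP_SHIFT);
}
```

The shift-and-add in decode_oop is exactly the cost Eliot is weighing against the halved pointer traffic Philippe's links advertise.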
My gut feeling is that these stop-gaps are a temporary thing. After all, if speed was so compelling we'd see lots of small 16-bit apps in places like Windows, where there used to be good support for 16-bit code until quite recently. But in fact 16-bit apps have died the death and we favour the regularity of 32-bit code. Somewhat analogously, Smalltalk trades performance for regularity. So I don't find these approaches particularly compelling. In any case they require engineering teams that can afford to support multiple memory models in the VM, something I'm not going to assume in Cog :)
Thanks for the links.
Best Eliot