I ran the macroBenchmarks (after fixing them up; see my fixes posted). The results are:
3.4 image with closure compiler loaded and active but no methods recompiled: #(67308 246197 116342 31254 0 16290 14740)
3.4 image with closure compiler active and all methods recompiled under the closure compiler: #(85788 348667 124072 46906 0 18674 23292)
Since macroBenchmarks mostly tests the compiler/decompiler and MessageTally (hence contexts), most of these tests are not executing exactly the same code from the first image to the second. The only test that doesn't change is number 6, the one that opens and closes 10 browsers. The number 6 results are 18674 vs. 16290, a 15% slowdown.
Let me clarify why I think we should wait for a Jitter before changing the image format: the Jitter itself will likely require an image format change, and we don't want two format changes in a row, especially when the second will supersede the first. Only the compiled-method format change would remain from the first change, and that could wait for the second because it is only an aesthetic change; it adds no new functionality.
I know past Jitters did not require a format change. But since Ian is on Eliot's Bytecode-to-Bytecode Adaptive Optimization for Smalltalk project (along with myself, Dan, Marcus, and John Sarkela) I suspect the next Jitter will be a Smalltalk Jitter producing optimized bytecodes. Some of these bytecodes will be new special low-level bytecodes, which will require a new image format. Even if Ian produces a C Jitter before that with no format change, it will speed things up enough to not need VI4. Only the compiled method format change will be missing, which again is not important.
So I believe we can live with the 15% slowdown for a year or so. If not, we can try to speed up the interpreter a little by including just the stack enhancements but not the bytecode enhancements, so we don't change the image format. I think any bytecode changes should be made in conjunction with Eliot's project or some Jitter project. And to reiterate, I think the compiled-method format change should only be included on the back of some other, more significant image format change like new bytecodes.
Cheers, Anthony
Anthony Hannan writes:
I know past Jitters did not require a format change. But since Ian is on Eliot's Bytecode-to-Bytecode Adaptive Optimization for Smalltalk project (along with myself, Dan, Marcus, and John Sarkela) I suspect the next Jitter will be a Smalltalk Jitter producing optimized bytecodes. Some of these bytecodes will be new special low-level bytecodes, which will require a new image format. Even if Ian produces a C Jitter before that with no format change, it will speed things up enough to not need VI4. Only the compiled method format change will be missing, which again is not important.
I'm not sure whether a Jitter using Eliot's work will require a new image format. It does need special low-level bytecodes, but those could be provided by another bytecode set; it doesn't require changing the original bytecodes, just the generated low-level ones. Think of the low-level bytecodes as an intermediate language describing the boundary between Smalltalk (the compiler and the optimizer) and the VM (the code generation).
There are a few things that would be nice to have purely for performance in an image change. Having a tag bit of 0 rather than 1 for integers would shave 3 instructions off simple arithmetic, taking it down to 5 instructions on an x86. Whether this is worth the bother would require a little analysis and probably playing with some basic optimizations, like those Ian Piumarta describes in "J3 for Squeak." Simple optimization across bytecodes should be able to remove a lot of needless tagging and untagging.
http://www-sor.inria.fr/~piumarta/squeak/unix/zip/j3-2.6.0/doc/j3/
The cost of optimizing across bytecodes is losing synchronization points: it is no longer simple to re-enter the method at any bytecode. The debugger could have single-stepped into an "intermediate" position, so theoretically any point is a re-entry point for a method. This does inhibit optimization; there are various solutions, but I'm delaying thinking more about it until I have more experience.
So I believe we can live with the 15% slowdown for a year or so. If not, we can try to speed up the interpreter a little by including just the stack enhancements but not the bytecode enhancements, so we don't change the image format. I think any bytecode changes should be made in conjunction with Eliot's project or some Jitter project. And to reiterate, I think the compiled-method format change should only be included on the back of some other, more significant image format change like new bytecodes.
Is it possible to speed proper block closures up using less drastic measures? The commercial Smalltalks have had this problem for some years now. I remember reading some papers on the subject but can't remember the details. Most blocks are probably simple and can be handled as a special case quickly; a compiler could easily spot whether the block returned or accessed the method's variables.
Personally, a 15% slowdown seems reasonable for proper block closures, especially if there are several different ways to regain the speed and some of the speed-ups are relatively simple to implement.
Bryce
Bryce Kampjes bryce@kampjes.demon.co.uk wrote:
There are a few things that would be nice purely for performance in an image change. Having a tag bit of 0 rather than 1 for integers would shave 3 instructions off simple arithmetic taking it down to 5 instructions on an x86.
.. and it would involve having the tag bit added to every OOP, meaning that one would have to mask out that bit for any indirection through that OOP.
Is it possible to speed proper block closures up using less drastic measures?
Of course. Not that it's particularly tied to blocks or closures, but a relatively simple context cache (Eliot wrote one design up for OOPSLA '87) can make a big difference. It costs some memory of course, but that is something that can be configured at VM compile time or start time, depending on the design. The key point is not to break the rules; one of the objections to Anthony's first BC system was that it did just that, by making too much visible and different.
tim
Tim Rowledge writes:
Bryce Kampjes bryce@kampjes.demon.co.uk wrote:
There are a few things that would be nice purely for performance in an image change. Having a tag bit of 0 rather than 1 for integers would shave 3 instructions off simple arithmetic taking it down to 5 instructions on an x86.
.. and it would involve having the tag bit added to every OOP, meaning that one would have to mask out that bit for any indirection through that OOP.
But the OOP mask can be done for free by using addressing modes. Even RISCs include loads and stores with a small constant offset. Just add the tag to the constant offset. Chances are that an offset is already used to point to the instance variable being fetched.
A Self paper on the VM data structure formats has a nice design for this; it is probably the cleanest VM tagging design I've read about. In the paper they claimed it was fast because it removed checks from the garbage collector's inner loops. Sorry, I don't have the reference; I'm relying on memory.
Bryce
Bryce Kampjes bryce@kampjes.demon.co.uk wrote:
But the OOP mask can be done for free by using addressing modes. Even RISCs include loads and stores with a small constant offset. Just add the tag to the constant offset. Chances are that an offset is already used to point to the instance variable being fetched.
Mm. I guess that could be made to work OK, particularly via a translator that can keep track of accumulating offsets. In the old days an ARM could handle it because it simply ignored the bottom two bits of an address unless fetching a byte :-)
Another successfully used tag scheme is to use the top bit to mark SmallInts. Arithmetic is quite simple and overflow checking stays easy. It does limit your effective address space to half, though.
tim
With homepage/download at: http://minnow.cc.gatech.edu/squeak/3170
Considered alpha until I hear that it works/doesn't work from someone.
Eddie
Already incorporated
< I'm a bug-fixing machine! >
This post brought to you by the BugFixArchiveViewer, a handy tool that makes it easy to comment on proposed fixes and enhancements for Squeak. For more information, check out the Web page for the BugFixArchiveViewer project: http://minnow.cc.gatech.edu/squeak/3214
On Tuesday 15 April 2003 14:13, Tim Rowledge wrote:
Bryce Kampjes bryce@kampjes.demon.co.uk wrote:
There are a few things that would be nice purely for performance in an image change. Having a tag bit of 0 rather than 1 for integers would shave 3 instructions off simple arithmetic taking it down to 5 instructions on an x86.
.. and it would involve having the tag bit added to every OOP, meaning that one would have to mask out that bit for any indirection through that OOP.
Actually, you could just change the implementation of the "push instance variable 3" bytecode from
push 12[bx]
or whatever nastiness the x86 does, to
push 11[bx]
since you can think of the OOP as being address+1 and the offset as 4*3.
-- Jecel