Speeding up Squeak
I have started working on speeding up Squeak, and I believe others are also working on this. I would like to share my thoughts, and perhaps start a discussion on how we can speed up Squeak and possibly what some good goals would be.
In order to create a plan, we need to know where we are, where we want to go, and how to get there.
----------------------- Where are we? -----------------------
The first question is where is Squeak speedwise? I have tested Squeak against various other Smalltalks and C (and some tests against Python) using the Stanford benchmarks distributed with Self. A summary of the results follows, where the speed of C is 1, and the numbers are how many times slower each is.
System          Slowdown  Implementation
Python             250    Unoptimized bytecode interpreter
Squeak 2.3          83
Dolphin             53    Optimized bytecode interpreter (more or less)
Smalltalk V         50    Optimized bytecode interpreter, 16-bit code
Smalltalk MT        24    Native compiler, but slow memory management
Visual Works        13    Dynamic (just-in-time) compiler
Self                 4    Optimizing dynamic compiler with advanced type inference *
Unoptimized C        2    Native compiler with debug compile
Optimized C          1    Optimizing native compiler
* Speed relative to C from papers rather than measured
The details are in Appendix 1.
-------------------- Where do we want to go? --------------------
We could choose to equal the speed of any of the faster Smalltalk (or Self) implementations. I would propose doing things in stages.
1. Optimize the virtual machine, but keeping a bytecode interpreter. This should get us to the performance level of Dolphin or Smalltalk/V or perhaps slightly better. (We ought to be able to do at least as well as Smalltalk-V, which runs in a 16-bit code environment and so has to do slow segment loads all over the place.)
2. Compile to native code. If the VM was well optimized before this, we might be able to reach performance near Visual Works. (It is interesting to note the almost factor of two speed difference between VisualWorks and Smalltalk-MT, indicating what an optimized VM can do.)
3. Possibly add Self-style type inference for a further speed boost. This would be a very long-term project.
It is possible to do things in another order: i.e. do native code generation first, then optimize the VM afterwards. However, it seems that this approach would likely be more difficult. If you optimize any of the fundamental data structures in the VM you have to then change the compiler to match. It seems more worthwhile to get the VM right, then do native compilation.
At the same time we are speeding up Squeak, it would be nice to clean up its semantics. For example, people have been asking about true closures for a long time. If we need to restructure Smalltalk to speed it up, then adding closures, block temporaries, any necessary exception support, and possibly synchronization support at the same time would be useful.
Finally, most things come at a cost. Almost any change made to Smalltalk can cause one or more of:
  (1) increased memory usage
  (2) increased complexity
  (3) reduction in portability
  (4) incompatibility with current primitives and other C code
  (5) people who understood how the interpreter used to work will no longer know it, or will have to relearn it
  (6) speed tradeoffs - speeding up one thing slows down something else
Obviously, it is a good idea to get the maximum benefit for the least cost. Some costs, particularly any reduction in portability for the base system, are not likely to be acceptable for almost any speed benefit.
-------------------- How do we get there? --------------------
There are probably two general approaches we can use together to optimize Squeak. One is top-down, the other is bottom-up. The top-down approach involves profiling Squeak, looking at what eats the most time, and optimizing that. My idea for a bottom-up approach is to consider the fastest possible way to do basic operations, and then get Squeak to conform to those ways. This may involve changing Squeak's methods or data structures. Some of the basic operations to consider are method dispatch, integer arithmetic, and array access. I have put some of my ideas in the "details" section.
-------------------------- Details --------------------------
Some of the things I have ideas on speeding up include:
  (1) method dispatch
  (2) the garbage collector
  (3) bytecode interpreter
  (4) array access (including OrderedCollection)
Method Dispatch:
As an interesting example, the paper "Message dispatch on modern computer architectures" by Karel Driesen, Urs Holzle, and Jan Vitek presents an implementation of method dispatch based on "inline caches". Although the paper was looking at method dispatch for C++, the "inline cache" has been used by Smalltalk and Self.
The "inline cache" is based on the fact that the class of a variable is usually constant at a given call site. I.E. if you have the code "a add: b" at a particular place, and a is an OrderedCollection the first time this is called, a will probably be an OrderedCollection the next time. So, the first time the VM looks up the class and the method to call based on that class, and stores both with the code. All subsequent calls just check to make sure the class is the same as the last time, and if it is, call the same method as last time. The check can actually be placed in the header of the called method. The code they described looks like this.
      load  [object + #classOffset] -> r1   ;load actual class
      setlo #class -> r2                    ;load predicted class
      call  method                          ;call method from last time
    ..in method..
      cmp   r1, r2                          ;if the classes don't match
      bne   inlineCacheMiss                 ;do a longer dispatch
This is probably close to the most optimal method dispatch you can get. So, one "bottom-up" approach might be to get Squeak to use a method dispatch as close to this one as possible.
One addition is required for Smalltalk. Smalltalk needs to test to see if the variable is a SmallInteger before fetching its class. A different dispatch is needed for SmallIntegers.
One major change might be worthwhile. With the above code, the VM must modify the code whenever the classes don't match. This relatively frequent code modification can be costly on some processors, possibly requiring an I-cache flush. If, instead, you stored the predicted class and call location in a "method" object which also contained a pointer to the code, you wouldn't have to modify the code itself. The resulting code ends up as:
      and   oop, 1                          ;If small integer, load IntegerClass
      jz    notSmallInt
      set   #IntegerClass -> r1
      jump  afterIntTest
    notSmallInt:
      load  [oop + #classOffset] -> r1      ;Load actual class
    afterIntTest:
      load  [method + #someOffset] -> r2    ;Load predicted class
      load  [method + #someOffset+4] -> r3  ;Load stored method to call from "inline" cache
      call  r3 + #codeOffset                ;Call actual method
    ..in start of method...
      cmp   r1, r2                          ;If the predicted and actual classes are
      jne   inlineCacheMiss                 ;different, then redispatch using method cache
    ..continue with method code...
This code should be reorganized so that the loads appear well before their data is used.
This is my "minimal method dispatch code". My idea is to structure Squeak's internals so that this, or similar, code could be used for a method dispatch once native compilers are available. Of course, on most processors, additional code is required to pass the arguments and save locals.
In order to achieve this, Squeak's internal structures and ways of doing things will need to be changed. These changes might include:
  (1) Simplifying the header format so that the object header always contains a class pointer. Currently, determining an object's class is more complex. [In my tests so far this adds 10% to the image size.]
  (2) Possibly going to a primarily stack-based context. This would save copying over arguments. Variables of a home context modified by a block context could be allocated in a separate object for true closures.
Garbage Collection:
The garbage collector constantly needs to look at the size of objects. Finding the size of an object currently requires checking the header format, then looking for the size in one of two places. This not only adds code but can also add branch mispredictions. Putting the size in one place should speed up garbage collection at least slightly.
Both my fast method dispatch idea and garbage collection speedup require changes to the header format. My idea is to change the header format to something like:
Word 0
  3-bit   reserved for GC (same as now) *
  13-bit  hash (up from 12)
  10-bit  allocated size - indexes a table of possible sizes
  4-bit   object format
  2-bit   header format (sometimes used by GC)
Word 1
  Class pointer
(Word 2 - only used on variable classes)
  Size used
This would use more space than the current headers. Instead of 1-3 words, object headers would be 2-3 words. Going to a minimum of two words adds only 10% to the size of a small image.
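As a rough C sketch of the proposed two-word header, including the "size in one place" lookup mentioned under Garbage Collection above. The field widths come from the proposal; the names, the bitfield packing, and the size table are illustrative assumptions:

    #include <stdint.h>

    typedef struct ObjectHeader {
        /* word 0 */
        uint32_t gcBits        : 3;    /* reserved for GC (same as now)         */
        uint32_t hash          : 13;   /* identity hash (up from 12 bits)       */
        uint32_t allocatedSize : 10;   /* index into a table of possible sizes  */
        uint32_t format        : 4;    /* object format                         */
        uint32_t headerType    : 2;    /* header format, sometimes used by GC   */
        /* word 1 */
        uint32_t classPointer;         /* always present, so fetching the class
                                          never needs to decode the header type */
        /* word 2 ("size used") follows only for instances of variable classes */
    } ObjectHeader;

    extern uint32_t allocationSizeTable[1024];   /* 2^10 legal allocation sizes */

    /* The GC finds an object's size with one load and one table index, instead of
       decoding the header format and looking in one of two places. */
    static inline uint32_t objectSizeInBytes(const ObjectHeader *h) {
        return allocationSizeTable[h->allocatedSize];
    }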
Bytecode interpreter:
It may be a while before all systems have native compilers, so speeding up the bytecode interpreter is worthwhile.
One thing costly on modern microprocessors is branch mispredictions. Since a bytecode interpreter calls or jumps to different code when dispatching each bytecode, it is likely to get a misprediction per bytecode. Therefore, increasing the amount of work done per bytecode, and thereby reducing the number of bytecodes executed, is likely to be worthwhile.
I would combine the most commonly paired simple bytecodes. I expect this will mean pairing loads of local variables with method dispatch and/or common operations such as + and at:.
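As a sketch of the pairing idea, the C fragment below shows a switch-based interpreter with an invented fused opcode that does "push temporary N" followed by "send +" in a single dispatch, so only one potentially mispredicted indirect branch is paid instead of two. The opcode numbers, stack layout and helpers are hypothetical; overflow and non-SmallInteger receivers fall back to a full send:

    #include <stdint.h>

    typedef intptr_t oop;

    #define PUSH_TEMP            0x10   /* hypothetical opcodes */
    #define SEND_PLUS            0xB0
    #define PUSH_TEMP_SEND_PLUS  0x80   /* fused form */

    extern oop  *stackPointer, *temporaries;
    extern void sendPlus(void);                       /* slow path: a real #+ send      */
    static int  isSmallInt(oop x)       { return x & 1; }
    static oop  addTagged(oop a, oop b) { return a + b - 1; }   /* ignores overflow */

    void interpretOne(const uint8_t **ipp) {
        const uint8_t *ip = *ipp;
        switch (*ip++) {
        case PUSH_TEMP:
            *++stackPointer = temporaries[*ip++];
            break;
        case SEND_PLUS:
            if (isSmallInt(stackPointer[-1]) && isSmallInt(stackPointer[0])) {
                stackPointer[-1] = addTagged(stackPointer[-1], stackPointer[0]);
                stackPointer--;
            } else
                sendPlus();
            break;
        case PUSH_TEMP_SEND_PLUS: {                   /* fused: one dispatch, same effect */
            oop arg = temporaries[*ip++];
            if (isSmallInt(stackPointer[0]) && isSmallInt(arg))
                stackPointer[0] = addTagged(stackPointer[0], arg);
            else {
                *++stackPointer = arg;
                sendPlus();
            }
            break;
        }
        }
        *ipp = ip;
    }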
Array and OrderedCollection access:
The header changes I proposed above can be used to speed up OrderedCollection. Also, when executing bytecodes, self-modifying bytecodes can be used. One idea is to have individual bytecodes for executing at: and at:put: for Dictionaries and Array/OrderedCollections. They would check the class of the receiver and, if it matched, perform the correct operation. If it didn't, they would change the bytecode and call the other bytecode's operation. Since matches would occur in the majority of instances, less code (and especially fewer branch mispredictions) would result.
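A possible shape for such a self-rewriting at: bytecode, sketched in C: one opcode is specialized for Array receivers and a sibling for OrderedCollection, and a wrong guess rewrites the bytecode in place so the next execution takes the fast path again. Opcode numbers and helper functions are invented, and a real version would handle at:put: and other receivers the same way:

    #include <stdint.h>

    typedef intptr_t oop;

    #define BC_AT_ARRAY    0x90    /* hypothetical opcode numbers */
    #define BC_AT_ORDERED  0x91

    extern oop ClassArray, ClassOrderedCollection;
    extern oop classOf(oop receiver);
    extern oop arrayAt(oop array, oop index);             /* fast fixed-size indexing      */
    extern oop orderedCollectionAt(oop coll, oop index);  /* adjusts by firstIndex, etc.   */
    extern oop fullSendAt(oop receiver, oop index);       /* ordinary #at: send, slow path */

    oop interpretAt(uint8_t *ip, oop receiver, oop index) {
        oop class = classOf(receiver);
        switch (*ip) {
        case BC_AT_ARRAY:
            if (class == ClassArray)
                return arrayAt(receiver, index);          /* predicted case: one check      */
            if (class == ClassOrderedCollection) {
                *ip = BC_AT_ORDERED;                      /* wrong guess: flip the bytecode */
                return orderedCollectionAt(receiver, index);
            }
            break;
        case BC_AT_ORDERED:
            if (class == ClassOrderedCollection)
                return orderedCollectionAt(receiver, index);
            if (class == ClassArray) {
                *ip = BC_AT_ARRAY;
                return arrayAt(receiver, index);
            }
            break;
        }
        return fullSendAt(receiver, index);               /* anything else: real send */
    }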
-------------------------- Summary --------------------------
I hope we can get a good discussion about speeding up Squeak going, as well as adding closures and any other basic VM level features. Various people are working on these issues, and it would be good to talk to each other and to hash out what can be done to create the "lean, mean, squeaking machine" that many of us would like Squeak to be.
We have a basic idea where we are. We need to figure out where we are going, and how to get there.
As for me, I would like to see Squeak eventually have performance close to VisualWorks. I have some ideas on how to move in that direction, but I am sure I am lacking many ideas, and even more, the time to implement all of them. (I have started on the header simplifications though.)
Sincerely,
Greg Gritton
---------------- Appendix: Benchmark Results ----------------
All runs on Cyrix 6x86 PR120 (100MHz)
Benchmark        V  Dolphin    MT  Sq2.0  Sq-JIT  Sq2.3  VW3.0    C  Python
(Time in ms)
BubbleSort     880      902   259   1349    1048   1086    245   21    3479
BubbleSort2    880      805   560   2083    2083   1107    206   21    3479
IntMM         1310      428   135    745     564    934    124   17
IntMM2        1160      650   197   1655    1634   1458    153   17
MM             990      741  1590   1497    2341    934    357   22
MM2           1260      660  2090   1721    2765   1289    367   22
Perm           930      701  136*    961     728    815    133   21    2630
Perm2          720      675   239   1273    1207    801    120   21
Queens         380      375    80    632     458    522     71   11
Queens2        280      285    59    468     327    555     67   11
Quicksort      550      384   105    599     485    481     95   11    1555
Quicksort2     490      365  199*    815     853    502     99   11    1555
Towers         550      930   215   1266     996   1010    171   22    3157
Towers2        330      538   118    756     623    557     79   22    3157
Puzzle        4720     7695  1490  15591   13527  12789   1679   77
Total  15340 14950 16134 7472 31411 29639 25393 3966 305
Fixed  1242 1194 1183 2176 2000 1727 277 28
Float  1571 1515 1388 2646 2745 2138 383 54
Built-in sort 483
Versions:
  Smalltalk-V (from Smalltalk Express - version ?)
  Squeak 2.0, 2.3
  Dolphin 2.1, patch level 2
  Smalltalk MT 1.5
  Visual Works 3.0
Note: without VM optimization, Jitter didn't buy much.
Note2: These benchmarks are C style benchmarks (lots of array access, computations) ported from C. They don't necessarily represent overall application performance.
Greg & Cindy Gritton wrote:
Speeding up Squeak
[...] inline caches:
The resulting code ends up as:
      and   oop, 1                          ;If small integer, load IntegerClass
      jz    notSmallInt
      set   #IntegerClass -> r1
      jump  afterIntTest
    notSmallInt:
      load  [oop + #classOffset] -> r1      ;Load actual class
    afterIntTest:
      load  [method + #someOffset] -> r2    ;Load predicted class
      load  [method + #someOffset+4] -> r3  ;Load stored method to call from "inline" cache
      call  r3 + #codeOffset                ;Call actual method
      ...
By doing so, you undo the major advantage of ICs. It is not the cost of the cache lookup that inline caches try to minimize, but the avoidance of indirect jumps. Driesen et al argue that modern processors are bad at indirect jump prediction, with enormous costs when they fail. That's why a direct jump should appear in the instruction code.
jump o->class->vtable[methodIndex]
is no better than
jump currentMethod->predictedMethods[indexForThisCallSite]
in this respect, right ?
Matthias
At 04:29 PM 2/22/99 +0100, you wrote:
[snip]
jump o->class->vtable[methodIndex]
is no better than
jump currentMethod->predictedMethods[indexForThisCallSite]
in this respect, right ?
Matthias
Right.
It is a tradeoff, trading self-modifying code on an inline cache miss for an indirect branch on an inline cache hit. Which is better depends on the specific processor implementation.
The paper by Driesen et al assumed that a processor could not predict indirect branches, making indirect branches expensive. Modern processors generally contain a Branch Target Cache (P6, PowerPC 750) or a next-instruction cache line pointer (K7). These can be used to predict indirect branches as well as direct branches.
On the other hand, self-modifying code can be expensive on real processors. Some RISC processors used to have to flush the I-cache when any code was modified; I am not sure whether this is still true. Even if an I-cache flush is not required, flushing the pipeline likely is. As pipelines get longer this can start to hurt.
The best bet would be to try out both code sequences in a variety of processors and see which runs faster on average.
Greg Gritton gritton@ibm.net
I can't seem to find the message, but someone mentioned that following the "Optimizing Squeak" thread is like a course in computer architecture.
With this thought, I found a couple of good links today in comp.arch that give a good overview of the architectures of modern processors. They are:
http://www.cse.msu.edu/~crs/cps920/ http://www.geocities.com/SiliconValley/9498//cpuwar.html
Greg Gritton
I'm just a nosy bystander at implementing interpreters so forgive me if these are stupid comments:
As you know, Smalltalk should spend most of its time in primitives (if I remember correctly we targeted it at over 70% at ParcPlace). And, of course, primitives can run at the speed of optimized C. If this is true, then the need to speed up the other ~20% is lessened.
I notice the need for method execution to be a lot faster in math-intensive tasks. If we implemented math primitives that operated on entire vectors (arrays) of SmallIntegers and Floats for each operation, and implemented appropriate matrix and vector classes (and primitive math operations on them), much of the need for fast math could be met by these vector math classes (at primitive speed, i.e. optimized C).
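A minimal sketch of such a vector primitive in C, assuming the arguments have already been unwrapped into raw double pointers; the function name and the 0/1 success convention are illustrative assumptions, not an existing Squeak primitive:

    #include <stddef.h>

    /* Adds two Float arrays element-wise: the per-element work runs at
       optimized-C speed, with no message sends. */
    int primitiveFloatArrayAdd(const double *a, const double *b,
                               double *result, size_t count) {
        if (a == NULL || b == NULL || result == NULL)
            return 0;                    /* primitive fails; fall back to Smalltalk code */
        for (size_t i = 0; i < count; i++)
            result[i] = a[i] + b[i];
        return 1;                        /* primitive succeeded */
    }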
Not that I want to discourage anyone from working on making method execution faster! I don't. It's just that with the appropriate operations implemented primitively, the size/complexity cost I'm willing to pay for faster method execution is decreased.
Before coming to ParcPlace, Eliot Miranda implemented a Smalltalk VM (BrouHaHa) that used a very efficient threaded byte-code interpreter. He has even released the code into the public domain, I believe. Without the size/complexity cost of dynamic translation to machine code, it still managed to get 85% (if I remember correctly) of the speed of ParcPlace's dynamic translation VM!
If Squeak's VM achieved that:
  - 85% of the speed of ParcPlace's dynamic translation VM at method execution, and
  - none of the complexity of dynamic translation, and
  - none of the space requirements for native code caches, and
  - had vector/matrix math operations with primitive implementations
....I'd be one happy puppy! I can't imagine wanting any more speed.
Bytecode interpreter: One thing costly on modern microprocessors is branch mispredictions. Since a bytecode interpreter calls or jumps to different code when dispatching each bytecode, it is likely to get a misprediction per bytecode.
Not if the bytecode is used to index into a 256 element jump-table. In this case there is no need for an array-bounds check and there are no conditional branches so no branch mispredictions can occur.
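Schematically, such a dispatch might look like the C below (handler names are invented, and operand fetching and instruction-pointer bookkeeping inside the handlers is omitted). The dispatch itself needs no bounds check and no conditional branch, just one indirect call; whether that indirect call is predicted well is exactly the question taken up in the following messages:

    #include <stdint.h>

    typedef void (*BytecodeHandler)(void);

    extern void pushReceiver(void), pushTemp0(void), sendPlus(void), returnTop(void);
    extern void illegalBytecode(void);

    static BytecodeHandler dispatchTable[256];

    static void initDispatchTable(void) {
        for (int i = 0; i < 256; i++)        /* every slot filled, so no NULL check later */
            dispatchTable[i] = illegalBytecode;
        dispatchTable[0x70] = pushReceiver;
        dispatchTable[0x10] = pushTemp0;
        dispatchTable[0xB0] = sendPlus;
        dispatchTable[0x7C] = returnTop;
    }

    void interpret(const uint8_t *ip, const uint8_t *limit) {
        while (ip < limit)
            dispatchTable[*ip++]();          /* the per-bytecode indirect branch */
    }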
Carl Watts http://AppliedThought.com/carl/
Carl Watts wrote:
As you know, Smalltalk should spend most of its time in primitives (if I remember correctly we targeted it at over 70% at ParcPlace). And, of course, primitives can run at the speed of optimized C. If this is true, then the need to speed up the other ~20% is lessened.
I have some reservations about this 70% figure -- I could only see this as a valid figure in reference to a particular benchmark suite. Much would depend on how your code is written and what you are doing. If your code is structured with abstractions layered on top of abstractions etc. then I would expect the method send/execution time to become a much more significant portion of the execution time, and therefore a much more attractive optimization target. Basically, I would want to minimize the "penalty" for using extensive abstraction and higher level modeling.
(BTW, what happened to the other 10%? <g>)
-- Dwight
Bytecode interpreter: One thing costly on modern microprocessors is branch mispredictions. Since a bytecode interpreter calls or jumps to different code when dispatching each bytecode, it is likely to get a misprediction per bytecode.
Not if the bytecode is used to index into a 256 element jump-table. In
this case there is no need for an array-bounds check and there are no conditional branches so no branch mispredictions can occur.
The issue is not array bounds checking. The issue is the processor pipeline uses the address of the branch instruction (usually an unconditional jump to a table indexed address) to "predict" what address a branch will go to. As the branch will go to 256 different places, depending on the bytecode value, the prediction will be mostly wrong. Really, instead of just using the branch instruction's address as a hash into some destination table, actually doing the branch effective-address calculation as early as possible helps. It was my understanding that the PPC processors were much better about this than the Intel processors.
On a Pentium II, a mispredicted branch means the processor has to throw away all the instructions it so cleverly fetched and executed, thinking it knew where to go. As the processor's execution stream can be out of order by up to about 40 instructions, it may have to throw away all the work done for those 40 instructions and restart the execution pipeline at the correct branch address.
The driving-on-a-curved-road analogy works pretty well here. The processor can ONLY do 200 MPH. So you turn into the first curve (bytecode dispatch). The processor thinks it knows where you're going, so it zooms way out past the curve. Eventually, all that driving past the curve has to be thrown away, as you make your way back onto the road and turn for the new curve (bytecode dispatch), only to have the processor once again shoot way past the curve (as much as 40 instructions past). Over time, you find you're only spending 30% of the time on the road, and 70% driving around on the dirt, shooting off or getting back onto the road.
If it weren't for their much higher clock speeds, current Intel processors would execute bytecodes slower than older, simpler Intel processors like the 486. The branch prediction really does help run the C code that makes up the primitives. You get nice things like not having execution stalled because of a memory cache miss (which can take something like 50 processor cycles if not hidden by out-of-order execution).
I very much agree that even a slow interpreter can run programs fast if you have lots of appropriate primitives implemented in C. This is exactly what languages like VisualBasic do. My guess is many Smalltalkers would be upset if things like Collections were implemented as primitives. Even though you never change that code, it's somehow better to have the Smalltalk code there to look at.
Maybe a good performance jump could be attained by making many of the standard classes compatible with the Smalltalk-to-C translator, and then just compiling them into the VM. Maybe as part of an application deployment process.
- Jan
___________________________________________________________________
Paradigm Matrix Inc., San Ramon California
"video products and development services for Win32 platforms"
Internet: Jan Bottorff janb@pmatrix.com   WWW http://www.pmatrix.com
Phone: voice (925) 803-9318  fax (925) 803-9397
PGP: public key http://www-swiss.ai.mit.edu/~bal/pks-toplev.html
     fingerprint 52 CB FF 60 91 25 F9 44 6F 87 23 C9 AB 5D 05 F6
___________________________________________________________________
The best alternative I came up with was what I'll call a message send bytecode set.
[snip]
Very interesting!
The issue is the processor pipeline uses the address of the branch instruction (usually an unconditional jump to a table indexed address) to "predict" what address a branch will go to.
Oh, I didn't know that. Interesting... It does this even for a "branch to address given in a register" instruction (which, I guess, is what it would be on a RISC processor)?
I had been under the mistaken impression that "branch prediction" meant "predicting whether or not a conditional branch will be taken" not "predicting which is the next instruction to execute after ANY branch so we can keep the pipeline filled".
Thanks for explaining that, Jan.
Then I read Tim Olson's further details:
There are at least 2 kinds of Branch Target Caches:

Branch Target Address Cache [BTAC] (P6, PowerPC 604e)
  this cache is indexed by the address of the branch instruction, and
  returns the address of the predicted target

Branch Target Instruction Cache [BTIC] (AMD 29K, PowerPC 750)
  this cache is indexed by the address of the branch target, and
  returns the first few instructions at that target
So (let me summarize my understanding here), on current Intel chips (P6) (which uses BTAC) this branch misprediction (for a byte-code dispatch loop that used the bytecode to index into a 256 element table of addresses) WOULD have the bad branch misprediction behavior Jan talks about.
But current PPC's (which use BTIC) wouldn't have this problem in this kind of byte-code dispatch loop. In fact since the first few instructions to handle each bytecode would be cached, they should have really good pipelining behavior for this kind of byte-code dispatch loop.
FYI: The BrouHaHa interpreter which Eliot Miranda wrote used (if I remember correctly) what the referenced document (http://www.complang.tuwien.ac.at/forth/threaded-code.html) would have us call "direct token threaded code". On current PPC chips this kind of byte-code interpreter should also perform well because of the PPC's BTIC.
Thanks for the additional details Tim!
Carl
P.S. Gee, following this thread is like taking a course in computer engineering!
At 08:42 AM 2/23/99 -0800, you wrote:
[snip]
So (let me summarize my understanding here), on current Intel chips (P6)
(which uses BTAC) this branch misprediction (for a byte-code dispatch loop that used the bytecode to index into a 256 element table of addresses) WOULD have the bad branch misprediction behavior Jan talks about.
But current PPC's (which use BTIC) wouldn't have this problem in this kind
of byte-code dispatch loop. In fact since the first few instructions to handle each bytecode would be cached, they should have really good pipelining behavior for this kind of byte-code dispatch loop.
FYI: The BrouHaHa interpreter which Eliot Miranda wrote used (if I
remember correctly) what the referenced document (http://www.complang.tuwien.ac.at/forth/threaded-code.html) would have us call "direct token threaded code". On current PPC chips this kind of byte-code interpreter should also perform well because of the PPC's BTIC.
Carl
Not quite. A BTAC (as used in the P6) can be used to predict indirect branches if they are consistent. The bytecode dispatch branch is so inconsistent that it would almost always be mispredicted.
The BTIC used in the PPC 750 has a different purpose. By caching the branch target instructions, these instructions can be decoded on the cycle following the decode of a branch instruction rather than two cycles afterwards. However, it doesn't help predict addresses of indirect branches, so you always get a branch misprediction on any indirect branch. However, the PPC pipeline is so short that branch mispredictions are much less expensive than on the P6.
A processor may have a Branch History Table (BHT) in addition to a Branch Target Address Cache (BTAC) or Branch Target Instruction Cache (BTIC). The PPC 750 has both a BHT and BTIC, while the PPC 604e has a BHT and a BTAC. The BHT generally stores only a few bits of information about a branch, so it can cover more branches than a BTAC or BTIC. It can only predict the direction of conditional branches, not their address or destination instructions. It usually takes an extra cycle to fetch the branch target with a BHT hit than a BTAC or BTIC hit, but it is still faster than a full mispredict.
I believe BHTs came before BTACs or BTICs, and are probably better known. This might explain why you, and Driesen et al in their paper, assumed that branch prediction can only predict the direction of conditional branches.
Greg Gritton
At 07:09 AM 2/22/99 -0800, Greg & Cindy Gritton wrote:
are available. Of course, on most processors, additional code is required to pass the arguments and save locals.
I'd like to suggest that most of the work done by bytecodes is essentially getting ready to do a send (parameter pushing from local, ivar, global, constant, outer), doing the send, and then following up on the send (possibly storing a result in a local, ivar, global). A few years ago I did a bunch of research on alternative Smalltalk bytecode sets and concluded the "classic" bytecode set could not be much worse from an interpreter overhead viewpoint.
The best alternative I came up with was what I'll call a message send bytecode set. Basically each bytecode fully represented the preparation, message send, and followup in a single bytecode. About 200 bytecodes were used for the different combinations of parameter sources and number (as I remember, 5 possible parameter sources and 0 to 2 parameters), with the rest of the bytecodes used to fill in the holes. I had a little prototype interpreter running that could get about 75% of the performance achieved by the VisualWorks native JIT compiler on an Intel processor. It was written in a macro language (M4) that expanded to assembler. This performance was measured using some hand "assembled" bytecode methods (leap year calculation in Time, for example). No fully running system or compiler was ever finished, so mileage may vary in real life.
Some of the other tricks/features were:
1) Double-indirect inline caches allowed native thread/multiprocessor safety. The same trick might be used to implement synchronization locks. The key was to be able to update the inline cache with a single 32-bit memory write (using two writes, e.g. for the cached class value and the cached method, would have required a very expensive spinlock to maintain thread safety). A rough sketch of this arrangement appears after this list.
2) Methods that were "simple" were specially marked and never involked a new context. Simple meant things like returned self, returned ivar of self, returned constant. When the method lookup was first done (cache miss), a target piece of native code to do the right thing was substituted for the actual method in the cache entry.
3) There was no difference between bytecode dispatching and primitive dispatching. A primitive was typically a piece of native code that worked on the passed parameters and the last thing it did was execute the tail bytecode dispatching code. There were essentially bytecode send fragments and "helper" code fragments. The bytecode fragments got you to specific helper fragments. An example helper fragment would be SmallInteger +, or method entry context setup.
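As a sketch of trick (1) above, the double-indirect cache might be arranged like this in C: the class/method pair lives in a small never-modified entry object, and the send site holds a single pointer to the current entry, so a cache update is one aligned pointer-sized (32-bit, at the time) store that other native threads see either entirely or not at all. All of these names are guesses at the idea, not Jan's actual code, and a weakly ordered processor would additionally need a write barrier before publishing:

    #include <stdint.h>

    typedef uintptr_t oop;

    typedef struct CacheEntry {
        oop   cachedClass;               /* written once, before the entry is published */
        void (*cachedCode)(void);        /* written once, before the entry is published */
    } CacheEntry;

    typedef struct CallSite {
        CacheEntry * volatile entry;     /* the only word that is ever rewritten */
    } CallSite;

    extern CacheEntry *allocateEntry(oop cachedClass, void (*cachedCode)(void));

    /* On an inline cache miss: build a fresh, fully initialized entry, then
       publish it with a single word-sized write - no spinlock needed. */
    void updateInlineCache(CallSite *site, oop newClass, void (*newCode)(void)) {
        CacheEntry *fresh = allocateEntry(newClass, newCode);
        site->entry = fresh;
    }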
I don't offhand remember if the method cache ended up being a hashed index or if I had a dedicated "slot" index as part of the bytecode. I remember tinkering with both. As I remember, having the bytecode index a dedicated slot was faster (and allowed the cache to be preloaded <essentially at compile time>), but burned a lot more memory.
To try to give a flavor of this bytecode set, you might have a bytecode that did a receiver+one parameter send:
send local, ivar
The binary encoding would be like:
<send local, ivar bytecode>:8,
<index into current context for local parameter>:4,
<index into object for ivar parameter>:4,
<index into current context for return result>:4,
<index into method constant table for selector>:4
Classic bytecode would have been:
push temp
push ivar
send #someSelector
pop temp
As you would get the work of 4 bytecodes collapsed into 1, the interpretation overhead was dramatically reduced. Also notice the classic bytecode format took 4 bytes, and the alternative format was only 3 bytes, so it had improved code density. I actually thought maybe Huffman coding with variable-length bit fields (based on analysis of a full system) would give the densest code, but it was a bit slower to decode. And hand-assembling Huffman code was also a serious pain for tinkering. I suspect for an interpreter in C, like Squeak, the interpreter overhead reduction may be even more dramatic. This also would only get 2 branch prediction misses, once in the send to the helper code, and once at the end of the helper to dispatch the next send bytecode. The classic bytecode set is at least a branch prediction miss per bytecode, 4 in this example.
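To make the encoding concrete, here is a C sketch of decoding the "send local, ivar" bytecode as laid out above: one opcode byte followed by four 4-bit fields (local index, ivar index, result slot, selector index), three bytes in all. The Context layout and the helper functions are placeholders, not Jan's actual bytecode set:

    #include <stdint.h>

    typedef intptr_t oop;
    typedef struct Context { oop temps[16]; } Context;

    extern oop instVarOf(oop receiver, unsigned index);
    extern oop selectorAt(unsigned literalIndex);
    extern oop performSend(oop receiver, oop selector, oop argument);  /* via the send cache */

    /* Interpret one "send local, ivar" bytecode starting at ip; answer the new ip. */
    const uint8_t *interpretSendLocalIvar(const uint8_t *ip, Context *ctx, oop self) {
        uint8_t  b1 = ip[1], b2 = ip[2];          /* the two operand bytes after the opcode  */
        unsigned localIndex    = b1 >> 4;         /* receiver comes from a temporary         */
        unsigned ivarIndex     = b1 & 0x0F;       /* argument comes from an ivar of self     */
        unsigned resultIndex   = b2 >> 4;         /* where the answer is stored              */
        unsigned selectorIndex = b2 & 0x0F;       /* selector from the method constant table */

        oop receiver = ctx->temps[localIndex];
        oop argument = instVarOf(self, ivarIndex);
        ctx->temps[resultIndex] = performSend(receiver, selectorAt(selectorIndex), argument);
        return ip + 3;                            /* 3 bytes, vs. 4 classic bytecodes */
    }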
In case this isn't clear as mud, imagine the following 155 bytecodes:
send0 {temp, ivar, global, constant, outer} "5 variations for unary messages"
send1 {temp, ivar, global, constant, outer}, {temp, ivar, global, constant, outer} "25 variations for binary messages"
send2 {temp, ivar, global, constant, outer}, {temp, ivar, global, constant, outer}, {temp, ivar, global, constant, outer} "125 variations for receiver+2 parameter messages like 'anArray at:aLocal put:aConstant'"
You also needed some to fill in the less common cases:
move temp, {temp, ivar, global, constant, outer} move {temp, ivar, global, constant, outer}, temp
And also for sends with more than 2 parameters:
push {temp, ivar, global, constant, outer} push2 {temp, ivar, global, constant, outer}, {temp, ivar, global, constant, outer}
There were no special bytecodes for invoking primitives or flow control; you just did a pseudo-send which invoked a piece of native code (via send cache magic), like this:
send2 ivar,temp,constant "send #compareEqual message with constant as the bytecode branch offset"
As I remember, you did need a method return bytecode. This basically stored one of the current context's locals back into the calling context at the index specified in the invoking bytecode (which you either peek at or else store at original method activation).
I eventually decided making a new Smalltalk was not exactly a profitable idea (circa '94-'95), and I needed to go on with my life, so stopped working on the new interpreter architecture. I must still have the code around someplace. It would have been nice to make a paper all about this, but I got a little busy making a living.
Using these ideas in Squeak would be very significant work, as everything has to "mesh" for things to work. It's not clear this effort wouldn't be better spent on a native code JIT, which ultimately I think would have the best performance (processor branch prediction would actually work for sends). If small memory footprint is still a major goal, then maybe these are useful ideas.
- Jan
Greg -
It's good to hear more interest in speeding up Squeak. I thought I would comment on a few of your items from the perspective of the Jitter (II) project. Ian may have more to say about this as well. It will be easier to be specific when JII is released, but I thought you and everyone would want to consider this as part of the perspective when plotting the direction for future work.
-------------------- Where do we want to go? --------------------
We could choose to equal the speed of any of the faster Smalltalk (or Self) implementations. I would propose doing things in stages.
- Optimize the virtual machine, but keeping a bytecode interpreter.
This should get us to the performance level of Dolphin or Smalltalk/V or perhaps slightly better. (We ought to be able to do at least as well as Smalltalk-V, which runs in a 16-bit code environment and so has to do slow segment loads all over the place.)
The Squeak interpreter is already pretty heavily optimized. At this point there appear to be three main improvements that could be made:
Change from bytecodes to word-length threaded code
In-line caching of method lookups
Streamlining of activation (context management)
Jitter II is a JIT (just-in-time compiler), but the first version to be released will actually translate to threaded code (we've been calling this J2T or JIIT). This has the benefit of retaining complete portability while providing somewhat faster instruction dispatch, inline method lookup and streamlined activation.
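For readers unfamiliar with the technique, below is a toy example of threaded-code dispatch using GCC's computed-goto extension: a "translated method" is an array of handler addresses with operands inlined, and every handler ends by jumping straight to the next one. This is a generic illustration of how word-length threaded code dispatches, not Jitter II itself:

    #include <stdint.h>
    #include <stdio.h>

    #define NEXT  goto *(void *)(*tp++)    /* one indirect jump per threaded instruction */

    intptr_t runThreaded(void) {
        intptr_t stack[8], *sp = stack;

        /* Hand-written "translation" of the expression 3 + 4; a real translator
           would build this array once from a method's bytecodes. */
        intptr_t code[] = { (intptr_t)&&pushConst, 3,
                            (intptr_t)&&pushConst, 4,
                            (intptr_t)&&add,
                            (intptr_t)&&halt };
        intptr_t *tp = code;
        NEXT;

    pushConst:  *++sp = *tp++;            NEXT;
    add:        sp[-1] += sp[0]; sp--;    NEXT;
    halt:       return sp[0];             /* answers 7 */
    }

    int main(void) { printf("%ld\n", (long)runThreaded()); return 0; }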
- Compile to native code. If the VM was well optimized before this,
we might be able to reach performance near Visual Works. (It is interesting to note the almost factor of two speed difference between VisualWorks and Smalltalk-MT, indicating what an optimized VM can do.)
The threaded phase of Ian's implementation is intended to be easier to debug, and much easier to port. However, the architecture is designed for native code from the ground up. The plan is to reduce to an absolute minimum the complexity introduced by the change to native code. In particular, except for code generation, there should be almost no change to the translation and management of translated methods, to the handling of contexts, or to the interpreter interface.
Jitter II is an elegant and relatively simple design. Whether it will reach the performance of VW is not known, but it is unlikely that the first versions will do that well. If Jitter achieves between 2x and 4x while maintaining Squeak's relative simplicity and rampant portability, we will consider it to be a triumph.
[big snip]
Some of the things I have ideas on speeding up include (1) method dispatch
We agree, and Jitter IIT will incorporate both word-threading and inline caching.
(2) the garbage collector
Regarding changes to object headers, we may well want to make such changes. However we're inclined to wait until we have analyzed a running Jitter for a while. This is because Ian has done some very clever things with method preambles that reduce the overhead of variant headers.
(3) bytecode interpreter
It is our hope that Jitter II will be easily portable in both its threaded and native forms. With an overhead of less than a megabyte, all but the tiniest Squeak implementations will want to run Jitter. If so, then hopefully the pressures on the bytecode interpreter will reduce to simplicity and compactness.
(4) array access (including OrderedCollection)
Here again, we don't feel that the current treatment is necessarily ideal, but we do feel the importance of seeing the Jitter profiles before planning the next step.
-------------------------- Summary --------------------------
I hope we can get a good discussion about speeding up Squeak going, as well as adding closures and any other basic VM level features. Various people are working on these issues, and it would be good to talk to each other and to hash out what can be done to create the "lean, mean, squeaking machine" that many of us would like Squeak to be.
[snip]
Hopefully it will not be long before we can include Jitter II in our benchmarks, and lay out some priorities for further optimization from that new perspective.
The points you raise in your message are all serious candidates, and it is great to see a general move toward making Squeak a serious contender in Smalltalk execution speed.
- Dan