Speeding up Squeak
I have started working on speeding up Squeak, and I believe others are also working on this. I would like to share my thoughts, and perhaps start a discussion on how we can speed up Squeak and possibly what some good goals would be.
In order to create a plan, we need to know where we are, where we want to go, and how to get there.
----------------------- Where are we? -----------------------
The first question is where is Squeak speedwise? I have tested Squeak against various other Smalltalks and C (and some tests against Python) using the Stanford benchmarks distributed with Self. A summary of the results follows, where the speed of C is 1, and the numbers are how many times slower each is.
System          Slowdown  Implementation
Python             250    Unoptimized bytecode interpreter
Squeak 2.3          83
Dolphin             53    Optimized bytecode interpreter (more or less)
Smalltalk V         50    Optimized bytecode interpreter, 16-bit code
Smalltalk MT        24    Native compiler, but slow memory management
Visual Works        13    Dynamic (just-in-time) compiler
Self                 4    Optimizing dynamic compiler with advanced type inference *
Unoptimized C        2    Native compiler with debug compile
Optimized C          1    Optimizing native compiler
* Speed relative to C from papers rather than measured
The details are in Appendix 1.
-------------------- Where do we want to go? --------------------
We could choose to equal the speed of any of the faster Smalltalk (or Self) implementations. I would propose doing things in stages.
1. Optimize the virtual machine, but keeping a bytecode interpreter. This should get us to the performance level of Dolphin or Smalltalk/V or perhaps slightly better. (We ought to be able to do at least as well as Smalltalk-V, which runs in a 16-bit code environment and so has to do slow segment loads all over the place.)
2. Compile to native code. If the VM was well optimized before this, we might be able to reach performance near Visual Works. (It is interesting to note the almost factor of two speed difference between VisualWorks and Smalltalk-MT, indicating what an optimized VM can do.)
3. Possibly add Self-style type inference for a further speed boost. This would be a very long-term project.
It is possible to do things in another order: i.e. do native code generation first, then optimize the VM afterwards. However, it seems that this approach would likely be more difficult. If you optimize any of the fundamental data structures in the VM you have to then change the compiler to match. It seems more worthwhile to get the VM right, then do native compilation.
At the same time we are speeding up Squeak, it would be nice to clean up its semantics. For example, people have been asking about true closures for a long time. If we need to restructure Smalltalk to speed it up, then adding closures, block temporaries, any necessary exception support, and possibly synchronization support at the same time would be useful.
Finally, most things come at a cost. Almost any change made to Smalltalk can cause one or more of:
  (1) increased memory usage
  (2) increased complexity
  (3) reduction in portability
  (4) incompatibility with current primitives and other C code
  (5) people who understood how the interpreter used to work will no longer know it, or will have to relearn it
  (6) speed tradeoffs - speeding up one thing slows down something else
Obviously, it is a good idea to get the maximum benefit for the least cost. Some costs, particularly any reduction in portability for the base system, are not likely to be acceptable for almost any speed benefit.
-------------------- How do we get there? --------------------
There are probably two general approaches we can use together to optimize Squeak. One is top-down, the other is bottom-up. The top-down approach involves profiling Squeak, looking at what eats the most time, and optimizing that. My idea for a bottom-up approach is to consider the fastest possible way to do basic operations, and then get Squeak to conform to those ways. This may involve changing Squeak's methods or data structures. Some of the basic operations to consider are method dispatch, integer arithmetic, and array access. I have put some of my ideas in the "details" section.
-------------------------- Details --------------------------
Some of the things I have ideas on speeding up include:
  (1) method dispatch
  (2) the garbage collector
  (3) bytecode interpreter
  (4) array access (including OrderedCollection)
Method Dispatch:
As an interesting example, the paper "Message dispatch on modern computer architectures" by Karel Driesen, Urs Holzle, and Jan Vitek presents an implementation of method dispatch based on "inline caches". Although the paper was looking at method dispatch for C++, the "inline cache" has been used by Smalltalk and Self.
The "inline cache" is based on the fact that the class of a variable is usually constant at a given call site. I.E. if you have the code "a add: b" at a particular place, and a is an OrderedCollection the first time this is called, a will probably be an OrderedCollection the next time. So, the first time the VM looks up the class and the method to call based on that class, and stores both with the code. All subsequent calls just check to make sure the class is the same as the last time, and if it is, call the same method as last time. The check can actually be placed in the header of the called method. The code they described looks like this.
      load  [object + #classOffset] -> r1   ;load actual class
      setlo #class -> r2                    ;load predicted class
      call  method                          ;call method from last time
    ..in method..
      cmp   r1, r2                          ;if the classes don't match
      bne   inlineCacheMiss                 ;do a longer dispatch
This is probably close to the most optimal method dispatch you can get. So, one "bottom-up" approach might be to get Squeak to use a method dispatch as close to this one as possible.
One addition is required for Smalltalk. Smalltalk needs to test to see if the variable is a SmallInteger before fetching its class. A different dispatch is needed for SmallIntegers.
One major change might be worthwhile. With the above code, the VM must modify the code whenever the classes don't match. This relatively frequent code modification can be costly on some processors, possibly requiring an I-cache flush. If, instead, you stored the predicted class and call location in a "method" object which also contained a pointer to the code, you wouldn't have to modify the code itself. The resulting code ends up as:
      and   oop, 1                          ;If small integer, load IntegerClass
      jz    notSmallInt
      set   #IntegerClass -> r1
      jump  afterIntTest
    notSmallInt:
      load  [oop + #classOffset] -> r1      ;Load actual class
    afterIntTest:
      load  [method + #someOffset] -> r2    ;Load predicted class
      load  [method + #someOffset+4] -> r3  ;Load stored method to call from "inline" cache
      call  r3 + #codeOffset                ;Call actual method
    ..in start of method...
      cmp   r1, r2                          ;If the predicted and actual classes are
      jne   inlineCacheMiss                 ;different, then redispatch using method cache
    ..continue with method code...
This code should be reorganized so that the loads appear well before their data is used.
This is my "minimal method dispatch code". My idea is to structure Squeak's internals so that this, or similar, code could be used for a method dispatch once native compilers are available. Of course, on most processors, additional code is required to pass the arguments and save locals.
In order to achieve this, Squeak's internal structures and ways of doing things will need to be changed. These changes might include:
  (1) Simplifying the header format so that the object header always contains a class pointer. Currently, determining an object's class is more complex. [In my tests so far this adds 10% to the image size.]
  (2) Possibly going to a primarily stack-based context. This would save copying over arguments. Variables of a home context modified by a block context could be allocated in a separate object for true closures.
Garbage Collection:
The garbage collector constantly needs to look at the size of objects. Finding the size of an object currently requires checking the header format, then looking for the size in one of two places. This not only adds code but can also add branch mispredictions. Putting the size in one place should speed up garbage collection at least slightly.
Both my fast method dispatch idea and garbage collection speedup require changes to the header format. My idea is to change the header format to something like:
Word 0
  3-bit   reserved for GC (same as now) *
  13-bit  hash (up from 12)
  10-bit  allocated size - indexes a table of possible sizes
  4-bit   object format
  2-bit   header format (sometimes used by GC)
Word 1
  Class pointer
(Word 2 - only used on variable classes)
  Size used
This would use more space than the current headers. Instead of 1-3 words, object headers would be 2-3 words. Going to a minimum of two words adds only 10% to the size of a small image.
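As a rough C sketch of the proposed two-word header, including the "size in one place" lookup mentioned under Garbage Collection above. The field widths come from the proposal; the names, the bitfield packing, and the size table are illustrative assumptions:

    #include <stdint.h>

    typedef struct ObjectHeader {
        /* word 0 */
        uint32_t gcBits        : 3;    /* reserved for GC (same as now)         */
        uint32_t hash          : 13;   /* identity hash (up from 12 bits)       */
        uint32_t allocatedSize : 10;   /* index into a table of possible sizes  */
        uint32_t format        : 4;    /* object format                         */
        uint32_t headerType    : 2;    /* header format, sometimes used by GC   */
        /* word 1 */
        uint32_t classPointer;         /* always present, so fetching the class
                                          never needs to decode the header type */
        /* word 2 ("size used") follows only for instances of variable classes */
    } ObjectHeader;

    extern uint32_t allocationSizeTable[1024];   /* 2^10 legal allocation sizes */

    /* The GC finds an object's size with one load and one table index, instead of
       decoding the header format and looking in one of two places. */
    static inline uint32_t objectSizeInBytes(const ObjectHeader *h) {
        return allocationSizeTable[h->allocatedSize];
    }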
Bytecode interpreter:
It may be a while before all systems have native compilers, so speeding up the bytecode interpreter is worthwhile.
One thing costly on modern microprocessors is branch mispredictions. Since a bytecode interpreter calls or jumps to different code when dispatching each bytecode, it is likely to get a misprediction per bytecode. Therefore, increasing the amount of work done per bytecode, and thereby reducing the number of bytecodes executed, is likely to be worthwhile.
I would combine the most commonly paired simple bytecodes. I expect this will mean pairing loads of local variables with method dispatch and/or common operations such as + and at:.
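As a sketch of the pairing idea, the C fragment below shows a switch-based interpreter with an invented fused opcode that does "push temporary N" followed by "send +" in a single dispatch, so only one potentially mispredicted indirect branch is paid instead of two. The opcode numbers, stack layout and helpers are hypothetical; overflow and non-SmallInteger receivers fall back to a full send:

    #include <stdint.h>

    typedef intptr_t oop;

    #define PUSH_TEMP            0x10   /* hypothetical opcodes */
    #define SEND_PLUS            0xB0
    #define PUSH_TEMP_SEND_PLUS  0x80   /* fused form */

    extern oop  *stackPointer, *temporaries;
    extern void sendPlus(void);                       /* slow path: a real #+ send      */
    static int  isSmallInt(oop x)       { return x & 1; }
    static oop  addTagged(oop a, oop b) { return a + b - 1; }   /* ignores overflow */

    void interpretOne(const uint8_t **ipp) {
        const uint8_t *ip = *ipp;
        switch (*ip++) {
        case PUSH_TEMP:
            *++stackPointer = temporaries[*ip++];
            break;
        case SEND_PLUS:
            if (isSmallInt(stackPointer[-1]) && isSmallInt(stackPointer[0])) {
                stackPointer[-1] = addTagged(stackPointer[-1], stackPointer[0]);
                stackPointer--;
            } else
                sendPlus();
            break;
        case PUSH_TEMP_SEND_PLUS: {                   /* fused: one dispatch, same effect */
            oop arg = temporaries[*ip++];
            if (isSmallInt(stackPointer[0]) && isSmallInt(arg))
                stackPointer[0] = addTagged(stackPointer[0], arg);
            else {
                *++stackPointer = arg;
                sendPlus();
            }
            break;
        }
        }
        *ipp = ip;
    }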
Array and OrderedCollection access:
The header changes I proposed above can be used to speed up OrderedCollection. Also, when executing bytecodes, self-modifying bytecodes can be used. One idea is to have individual bytecodes for executing at: and at:put: for Dictionaries and Array/OrderedCollections. They would check the class of the receiver and, if it matched, perform the correct operation. If it didn't, they would change the bytecode and call the other bytecode's operation. Since matches would occur in the majority of instances, less code (and especially fewer branch mispredictions) would result.
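A possible shape for such a self-rewriting at: bytecode, sketched in C: one opcode is specialized for Array receivers and a sibling for OrderedCollection, and a wrong guess rewrites the bytecode in place so the next execution takes the fast path again. Opcode numbers and helper functions are invented, and a real version would handle at:put: and other receivers the same way:

    #include <stdint.h>

    typedef intptr_t oop;

    #define BC_AT_ARRAY    0x90    /* hypothetical opcode numbers */
    #define BC_AT_ORDERED  0x91

    extern oop ClassArray, ClassOrderedCollection;
    extern oop classOf(oop receiver);
    extern oop arrayAt(oop array, oop index);             /* fast fixed-size indexing      */
    extern oop orderedCollectionAt(oop coll, oop index);  /* adjusts by firstIndex, etc.   */
    extern oop fullSendAt(oop receiver, oop index);       /* ordinary #at: send, slow path */

    oop interpretAt(uint8_t *ip, oop receiver, oop index) {
        oop class = classOf(receiver);
        switch (*ip) {
        case BC_AT_ARRAY:
            if (class == ClassArray)
                return arrayAt(receiver, index);          /* predicted case: one check      */
            if (class == ClassOrderedCollection) {
                *ip = BC_AT_ORDERED;                      /* wrong guess: flip the bytecode */
                return orderedCollectionAt(receiver, index);
            }
            break;
        case BC_AT_ORDERED:
            if (class == ClassOrderedCollection)
                return orderedCollectionAt(receiver, index);
            if (class == ClassArray) {
                *ip = BC_AT_ARRAY;
                return arrayAt(receiver, index);
            }
            break;
        }
        return fullSendAt(receiver, index);               /* anything else: real send */
    }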
-------------------------- Summary --------------------------
I hope we can get a good discussion about speeding up Squeak going, as well as adding closures and any other basic VM level features. Various people are working on these issues, and it would be good to talk to each other and to hash out what can be done to create the "lean, mean, squeaking machine" that many of us would like Squeak to be.
We have a basic idea where we are. We need to figure out where we are going, and how to get there.
As for me, I would like to see Squeak eventually have performance close to VisualWorks. I have some ideas on how to move in that direction, but I am sure I am lacking many ideas, and even more, the time to implement all of them. (I have started on the header simplifications though.)
Sincerely,
Greg Gritton
---------------- Appendix: Benchmark Results ----------------
All runs on Cyrix 6x86 PR120 (100MHz)
Benchmark        V  Dolphin    MT  Sq2.0  Sq-JIT  Sq2.3  VW3.0    C  Python
(Time in ms)
BubbleSort     880      902   259   1349    1048   1086    245   21    3479
BubbleSort2    880      805   560   2083    2083   1107    206   21    3479
IntMM         1310      428   135    745     564    934    124   17
IntMM2        1160      650   197   1655    1634   1458    153   17
MM             990      741  1590   1497    2341    934    357   22
MM2           1260      660  2090   1721    2765   1289    367   22
Perm           930      701  136*    961     728    815    133   21    2630
Perm2          720      675   239   1273    1207    801    120   21
Queens         380      375    80    632     458    522     71   11
Queens2        280      285    59    468     327    555     67   11
Quicksort      550      384   105    599     485    481     95   11    1555
Quicksort2     490      365  199*    815     853    502     99   11    1555
Towers         550      930   215   1266     996   1010    171   22    3157
Towers2        330      538   118    756     623    557     79   22    3157
Puzzle        4720     7695  1490  15591   13527  12789   1679   77
Total  15340 14950 16134 7472 31411 29639 25393 3966 305
Fixed  1242 1194 1183 2176 2000 1727 277 28
Float  1571 1515 1388 2646 2745 2138 383 54
Built-in sort 483
Versions:
  Smalltalk-V (from Smalltalk Express - version ?)
  Squeak 2.0, 2.3
  Dolphin 2.1, patch level 2
  Smalltalk MT 1.5
  Visual Works 3.0
Note: without VM optimization, Jitter didn't buy much.
Note2: These benchmarks are C style benchmarks (lots of array access, computations) ported from C. They don't necessarily represent overall application performance.
Greg & Cindy Gritton wrote:
Speeding up Squeak
[...] inline caches:
The resulting code ends up as:
      and   oop, 1                          ;If small integer, load IntegerClass
      jz    notSmallInt
      set   #IntegerClass -> r1
      jump  afterIntTest
    notSmallInt:
      load  [oop + #classOffset] -> r1      ;Load actual class
    afterIntTest:
      load  [method + #someOffset] -> r2    ;Load predicted class
      load  [method + #someOffset+4] -> r3  ;Load stored method to call from "inline" cache
      call  r3 + #codeOffset                ;Call actual method
      ...
By doing so, you undo the major advantage of ICs. It is not the cost of the cache lookup that inline caches try to minimize, but the avoidance of indirect jumps. Driesen et al argue that modern processors are bad at indirect jump prediction, with enormous costs when they fail. That's why a direct jump should appear in the instruction code.
jump o->class->vtable[methodIndex]
is no better than
jump currentMethod->predictedMethods[indexForThisCallSite]
in this respect, right ?
Matthias
At 04:29 PM 2/22/99 +0100, you wrote:
[snip]
jump o->class->vtable[methodIndex]
is no better than
jump currentMethod->predictedMethods[indexForThisCallSite]
in this respect, right ?
Matthias
Right.
It is a tradeoff, trading self-modifying code on an inline cache miss for an indirect branch on an inline cache hit. Which is better depends on the specific processor implementation.
The paper by Driesen et al assumed that a processor could not predict indirect branches, making indirect branches expensive. Modern processors generally contain a Branch Target Cache (P6, PowerPC 750) or a next-instruction cache line pointer (K7). These can be used to predict indirect branches as well as direct branches.
On the other hand, self-modifying code can be expensive on real processors. Some RISC processors used to have to flush the I-cache when any code was modified; I am not sure whether this is still true. Even if an I-cache flush is not required, flushing the pipeline likely is. As pipelines get longer this can start to hurt.
The best bet would be to try out both code sequences in a variety of processors and see which runs faster on average.
Greg Gritton gritton@ibm.net
I can't seem to find the message, but someone mentioned that following the "Optimizing Squeak" thread is like a course in computer architecture.
With this thought, I found a couple of good links today in comp.arch that give a good overview of the architectures of modern processors. They are:
http://www.cse.msu.edu/~crs/cps920/ http://www.geocities.com/SiliconValley/9498//cpuwar.html
Greg Gritton
I'm just a nosy bystander at implementing interpreters so forgive me if these are stupid comments:
As you know, Smalltalk should spend most of its time in primitives (if I remember correctly we targeted it at over 70% at ParcPlace). And, of course, primitives can run at the speed of optimized C. If this is true, then the need to speed up the other ~20% is lessened.
I notice the need for method execution to be a lot faster in math-intensive tasks. If we implemented math primitives that operated on entire vectors (arrays) of SmallIntegers and Floats for each operation, and implemented appropriate matrix and vector classes (and primitive math operations on them), much of the need for fast math could be met by these vector math classes (at primitive speed, i.e. optimized C).
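A minimal sketch of such a vector primitive in C, assuming the arguments have already been unwrapped into raw double pointers; the function name and the 0/1 success convention are illustrative assumptions, not an existing Squeak primitive:

    #include <stddef.h>

    /* Adds two Float arrays element-wise: the per-element work runs at
       optimized-C speed, with no message sends. */
    int primitiveFloatArrayAdd(const double *a, const double *b,
                               double *result, size_t count) {
        if (a == NULL || b == NULL || result == NULL)
            return 0;                    /* primitive fails; fall back to Smalltalk code */
        for (size_t i = 0; i < count; i++)
            result[i] = a[i] + b[i];
        return 1;                        /* primitive succeeded */
    }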
Not that I want to discourage anyone from working on making method execution faster! I don't. It's just that with the appropriate operations implemented primitively, the size/complexity cost I'm willing to pay for faster method execution is decreased.
Before coming to ParcPlace, Eliot Miranda implemented a Smalltalk VM (BrouHaHa) that used a very efficient threaded byte-code interpreter. He has even released the code into the public domain, I believe. Without the size/complexity cost of dynamic translation to machine code, it still managed to get 85% (if I remember correctly) of the speed of ParcPlace's dynamic translation VM!
If Squeak's VM achieved that:
  - 85% of the speed of ParcPlace's dynamic translation VM at method execution, and
  - none of the complexity of dynamic translation, and
  - none of the space requirements for native code caches, and
  - had vector/matrix math operations with primitive implementations
....I'd be one happy puppy! I can't imagine wanting any more speed.
Bytecode interpreter: One thing costly on modern microprocessors is branch mispredictions. Since a bytecode interpreter calls or jumps to different code when dispatching each bytecode, it is likely to get a misprediction per bytecode.
Not if the bytecode is used to index into a 256 element jump-table. In this case there is no need for an array-bounds check and there are no conditional branches so no branch mispredictions can occur.
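Schematically, such a dispatch might look like the C below (handler names are invented, and operand fetching and instruction-pointer bookkeeping inside the handlers is omitted). The dispatch itself needs no bounds check and no conditional branch, just one indirect call; whether that indirect call is predicted well is exactly the question taken up in the following messages:

    #include <stdint.h>

    typedef void (*BytecodeHandler)(void);

    extern void pushReceiver(void), pushTemp0(void), sendPlus(void), returnTop(void);
    extern void illegalBytecode(void);

    static BytecodeHandler dispatchTable[256];

    static void initDispatchTable(void) {
        for (int i = 0; i < 256; i++)        /* every slot filled, so no NULL check later */
            dispatchTable[i] = illegalBytecode;
        dispatchTable[0x70] = pushReceiver;
        dispatchTable[0x10] = pushTemp0;
        dispatchTable[0xB0] = sendPlus;
        dispatchTable[0x7C] = returnTop;
    }

    void interpret(const uint8_t *ip, const uint8_t *limit) {
        while (ip < limit)
            dispatchTable[*ip++]();          /* the per-bytecode indirect branch */
    }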
Carl Watts http://AppliedThought.com/carl/
Carl Watts wrote:
As you know, Smalltalk should spend most of its time in primitives (if I remember correctly we targeted it at over 70% at ParcPlace). And, of course, primitives can run at the speed of optimized C. If this is true, then the need to speed up the other ~20% is lessened.
I have some reservations about this 70% figure -- I could only see this as a valid figure in reference to a particular benchmark suite. Much would depend on how your code is written and what you are doing. If your code is structured with abstractions layered on top of abstractions etc. then I would expect the method send/execution time to become a much more significant portion of the execution time, and therefore a much more attractive optimization target. Basically, I would want to minimize the "penalty" for using extensive abstraction and higher level modeling.
(BTW, what happened to the other 10%? <g>)
-- Dwight
Bytecode interpreter: One thing costly on modern microprocessors is branch mispredictions. Since a bytecode interpreter calls or jumps to different code when dispatching each bytecode, it is likely to get a misprediction per bytecode.
Not if the bytecode is used to index into a 256 element jump-table. In
this case there is no need for an array-bounds check and there are no conditional branches so no branch mispredictions can occur.
The issue is not array bounds checking. The issue is the processor pipeline uses the address of the branch instruction (usually an unconditional jump to a table indexed address) to "predict" what address a branch will go to. As the branch will go to 256 different places, depending on the bytecode value, the prediction will be mostly wrong. Really, instead of just using the branch instruction's address as a hash into some destination table, actually doing the branch effective-address calculation as early as possible helps. It was my understanding that the PPC processors were much better about this than the Intel processors.
On a Pentium II, a mispredicted branch means the processor has to throw away all the instructions it so cleverly fetched and executed, thinking it knew where to go. As the processor's execution stream can be out of order by up to about 40 instructions, it may have to throw away all the work done for those 40 instructions and restart the execution pipeline at the correct branch address.
The driving-on-a-curved-road analogy works pretty well here. The processor can ONLY do 200 MPH. So you turn into the first curve (bytecode dispatch). The processor thinks it knows where you're going, so it zooms way out past the curve. Eventually, all that driving past the curve has to be thrown away, as you make your way back onto the road and turn for the new curve (bytecode dispatch), only to have the processor once again shoot way past the curve (as much as 40 instructions past). Over time, you find you're only spending 30% of the time on the road, and 70% driving around on the dirt, shooting off or getting back onto the road.
If it weren't for their much higher clock speeds, current Intel processors would execute bytecodes slower than older, simpler Intel processors like the 486. The branch prediction really does help run the C code that makes up the primitives. You get nice things like not having execution stalled because of a memory cache miss (which can take something like 50 processor cycles if not hidden by out-of-order execution).
I very much agree that even a slow interpreter can run programs fast if you have lots of appropriate primitives implemented in C. This is exactly what languages like VisualBasic do. My guess is many Smalltalkers would be upset if things like Collections were implemented as primitives. Even though you never change that code, it's somehow better to have the Smalltalk code there to look at.
Maybe a good performance jump could be attained by making many of the standard classes compatible with the Smalltalk-to-C translator, and then just compiling them into the VM. Maybe as part of an application deployment process.
- Jan
___________________________________________________________________
Paradigm Matrix Inc., San Ramon California
"video products and development services for Win32 platforms"
Internet: Jan Bottorff janb@pmatrix.com   WWW http://www.pmatrix.com
Phone: voice (925) 803-9318  fax (925) 803-9397
PGP: public key http://www-swiss.ai.mit.edu/~bal/pks-toplev.html
     fingerprint 52 CB FF 60 91 25 F9 44 6F 87 23 C9 AB 5D 05 F6
___________________________________________________________________
The best alternative I came up with was what I'll call a message send bytecode set.
[snip]
Very interesting!
The issue is the processor pipeline uses the address of the branch instruction (usually an unconditional jump to a table indexed address) to "predict" what address a branch will go to.
Oh, I didn't know that. Interesting... It does this even for a "branch to address given in a register" instruction (which, I guess, is what it would be on a RISC processor)?
I had been under the mistaken impression that "branch prediction" meant "predicting whether or not a conditional branch will be taken" not "predicting which is the next instruction to execute after ANY branch so we can keep the pipeline filled".
Thanks for explaining that, Jan.
Then I read Tim Olson's further details:
There are at least 2 kinds of Branch Target Caches:

Branch Target Address Cache [BTAC] (P6, PowerPC 604e)
  this cache is indexed by the address of the branch instruction, and
  returns the address of the predicted target

Branch Target Instruction Cache [BTIC] (AMD 29K, PowerPC 750)
  this cache is indexed by the address of the branch target, and
  returns the first few instructions at that target
So (let me summarize my understanding here), on current Intel chips (P6) (which uses BTAC) this branch misprediction (for a byte-code dispatch loop that used the bytecode to index into a 256 element table of addresses) WOULD have the bad branch misprediction behavior Jan talks about.
But current PPC's (which use BTIC) wouldn't have this problem in this kind of byte-code dispatch loop. In fact since the first few instructions to handle each bytecode would be cached, they should have really good pipelining behavior for this kind of byte-code dispatch loop.
FYI: The BrouHaHa interpreter which Eliot Miranda wrote used (if I remember correctly) what the referenced document (http://www.complang.tuwien.ac.at/forth/threaded-code.html) would have us call "direct token threaded code". On current PPC chips this kind of byte-code interpreter should also perform well because of the PPC's BTIC.
Thanks for the additional details Tim!
Carl
P.S. Gee, following this thread is like taking a course in computer engineering!
At 08:42 AM 2/23/99 -0800, you wrote:
[snip]
So (let me summarize my understanding here), on current Intel chips (P6)
(which uses BTAC) this branch misprediction (for a byte-code dispatch loop that used the bytecode to index into a 256 element table of addresses) WOULD have the bad branch misprediction behavior Jan talks about.
But current PPC's (which use BTIC) wouldn't have this problem in this kind
of byte-code dispatch loop. In fact since the first few instructions to handle each bytecode would be cached, they should have really good pipelining behavior for this kind of byte-code dispatch loop.
FYI: The BrouHaHa interpreter which Eliot Miranda wrote used (if I
remember correctly) what the referenced document (http://www.complang.tuwien.ac.at/forth/threaded-code.html) would have us call "direct token threaded code". On current PPC chips this kind of byte-code interpreter should also perform well because of the PPC's BTIC.
Carl
Not quite. A BTAC (as used in the P6) can be used to predict indirect branches if they are consistent. The bytecode dispatch branch is so inconsistent that it would almost always be mispredicted.
The BTIC used in the PPC 750 has a different purpose. By caching the branch target instructions, these instructions can be decoded on the cycle following the decode of a branch instruction rather than two cycles afterwards. However, it doesn't help predict addresses of indirect branches, so you always get a branch misprediction on any indirect branch. However, the PPC pipeline is so short that branch mispredictions are much less expensive than on the P6.
A processor may have a Branch History Table (BHT) in addition to a Branch Target Address Cache (BTAC) or Branch Target Instruction Cache (BTIC). The PPC 750 has both a BHT and BTIC, while the PPC 604e has a BHT and a BTAC. The BHT generally stores only a few bits of information about a branch, so it can cover more branches than a BTAC or BTIC. It can only predict the direction of conditional branches, not their address or destination instructions. It usually takes an extra cycle to fetch the branch target with a BHT hit than a BTAC or BTIC hit, but it is still faster than a full mispredict.
I believe BHTs came before BTACs or BTICs, and are probably better known. This might explain why you, and Driesen et al in their paper, assumed that branch prediction can only predict the direction of conditional branches.
Greg Gritton
At 07:09 AM 2/22/99 -0800, Greg & Cindy Gritton wrote:
are available. Of course, on most processors, additional code is required to pass the arguments and save locals.
I'd like to suggest that most of the work done by bytecodes is essentially getting ready to do a send (parameter pushing from local, ivar, global, constant, outer), doing the send, and then following up on the send (possibly storing a result in a local, ivar, global). A few years ago I did a bunch of research on alternative Smalltalk bytecode sets and concluded the "classic" bytecode set could not be much worse from an interpreter overhead viewpoint.
The best alternative I came up with was what I'll call a message send bytecode set. Basically each bytecode fully represented the preparation, message send, and followup in a single bytecode. About 200 bytecodes were used for the different combinations of parameter sources and number (as I remember, 5 possible parameter sources and 0 to 2 parameters), with the rest of the bytecodes used to fill in the holes. I had a little prototype interpreter running that could get about 75% of the performance achieved by the VisualWorks native JIT compiler on an Intel processor. It was written in a macro language (M4) that expanded to assembler. This performance was measured using some hand "assembled" bytecode methods (leap year calculation in Time, for example). No fully running system or compiler was ever finished, so mileage may vary in real life.
Some of the other tricks/features were:
1) Double-indirect inline caches allowed native thread/multiprocessor safety. The same trick might be used to implement synchronization locks. The key was to be able to update the inline cache with a single 32-bit memory write (using two writes, e.g. for the cached class value and the cached method, would have required a very expensive spinlock to maintain thread safety). A rough sketch of this arrangement appears after this list.
2) Methods that were "simple" were specially marked and never involked a new context. Simple meant things like returned self, returned ivar of self, returned constant. When the method lookup was first done (cache miss), a target piece of native code to do the right thing was substituted for the actual method in the cache entry.
3) There was no difference between bytecode dispatching and primitive dispatching. A primitive was typically a piece of native code that worked on the passed parameters and the last thing it did was execute the tail bytecode dispatching code. There were essentially bytecode send fragments and "helper" code fragments. The bytecode fragments got you to specific helper fragments. An example helper fragment would be SmallInteger +, or method entry context setup.
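As a sketch of trick (1) above, the double-indirect cache might be arranged like this in C: the class/method pair lives in a small never-modified entry object, and the send site holds a single pointer to the current entry, so a cache update is one aligned pointer-sized (32-bit, at the time) store that other native threads see either entirely or not at all. All of these names are guesses at the idea, not Jan's actual code, and a weakly ordered processor would additionally need a write barrier before publishing:

    #include <stdint.h>

    typedef uintptr_t oop;

    typedef struct CacheEntry {
        oop   cachedClass;               /* written once, before the entry is published */
        void (*cachedCode)(void);        /* written once, before the entry is published */
    } CacheEntry;

    typedef struct CallSite {
        CacheEntry * volatile entry;     /* the only word that is ever rewritten */
    } CallSite;

    extern CacheEntry *allocateEntry(oop cachedClass, void (*cachedCode)(void));

    /* On an inline cache miss: build a fresh, fully initialized entry, then
       publish it with a single word-sized write - no spinlock needed. */
    void updateInlineCache(CallSite *site, oop newClass, void (*newCode)(void)) {
        CacheEntry *fresh = allocateEntry(newClass, newCode);
        site->entry = fresh;
    }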
I don't offhand remember if the method cache ended up being a hashed index or if I had a dedicated "slot" index as part of the bytecode. I remember tinkering with both. As I remember, having the bytecode index a dedicated slot was faster (and allowed the cache to be preloaded <essentially at compile time>), but burned a lot more memory.
To try to give a flavor of this bytecode set, you might have a bytecode that did a receiver+one parameter send:
send local, ivar
The binary encoding would be like:
<send local, ivar bytecode>:8,
<index into current context for local parameter>:4,
<index into object for ivar parameter>:4,
<index into current context for return result>:4,
<index into method constant table for selector>:4
Classic bytecode would have been:
push temp
push ivar
send #someSelector
pop temp
As you would get the work of 4 bytecodes collapsed into 1, the interpretation overhead was dramatically reduced. Also notice the classic bytecode format took 4 bytes, and the alternative format was only 3 bytes, so it had improved code density. I actually thought maybe Huffman coding with variable-length bit fields (based on analysis of a full system) would give the densest code, but it was a bit slower to decode. And hand-assembling Huffman code was also a serious pain for tinkering. I suspect for an interpreter in C, like Squeak, the interpreter overhead reduction may be even more dramatic. This also would only get 2 branch prediction misses, once in the send to the helper code, and once at the end of the helper to dispatch the next send bytecode. The classic bytecode set is at least a branch prediction miss per bytecode, 4 in this example.
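To make the encoding concrete, here is a C sketch of decoding the "send local, ivar" bytecode as laid out above: one opcode byte followed by four 4-bit fields (local index, ivar index, result slot, selector index), three bytes in all. The Context layout and the helper functions are placeholders, not Jan's actual bytecode set:

    #include <stdint.h>

    typedef intptr_t oop;
    typedef struct Context { oop temps[16]; } Context;

    extern oop instVarOf(oop receiver, unsigned index);
    extern oop selectorAt(unsigned literalIndex);
    extern oop performSend(oop receiver, oop selector, oop argument);  /* via the send cache */

    /* Interpret one "send local, ivar" bytecode starting at ip; answer the new ip. */
    const uint8_t *interpretSendLocalIvar(const uint8_t *ip, Context *ctx, oop self) {
        uint8_t  b1 = ip[1], b2 = ip[2];          /* the two operand bytes after the opcode  */
        unsigned localIndex    = b1 >> 4;         /* receiver comes from a temporary         */
        unsigned ivarIndex     = b1 & 0x0F;       /* argument comes from an ivar of self     */
        unsigned resultIndex   = b2 >> 4;         /* where the answer is stored              */
        unsigned selectorIndex = b2 & 0x0F;       /* selector from the method constant table */

        oop receiver = ctx->temps[localIndex];
        oop argument = instVarOf(self, ivarIndex);
        ctx->temps[resultIndex] = performSend(receiver, selectorAt(selectorIndex), argument);
        return ip + 3;                            /* 3 bytes, vs. 4 classic bytecodes */
    }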
In case this isn't clear as mud, imagine the following 155 bytecodes:
send0 {temp, ivar, global, constant, outer} "5 variations for unary messages"
send1 {temp, ivar, global, constant, outer}, {temp, ivar, global, constant, outer} "25 variations for binary messages"
send2 {temp, ivar, global, constant, outer}, {temp, ivar, global, constant, outer}, {temp, ivar, global, constant, outer} "125 variations for receiver+2 parameter messages like 'anArray at:aLocal put:aConstant'"
You also needed some to fill in the less common cases:
move temp, {temp, ivar, global, constant, outer} move {temp, ivar, global, constant, outer}, temp
And also for sends with more than 2 parameters:
push {temp, ivar, global, constant, outer} push2 {temp, ivar, global, constant, outer}, {temp, ivar, global, constant, outer}
There were no special bytecodes for invoking primitives or flow control; you just did a pseudo-send which invoked a piece of native code (via send cache magic), like this:
send2 ivar,temp,constant "send #compareEqual message with constant as the bytecode branch offset"
As I remember, you did need a method return bytecode. This basically stored one of the current context's locals back into the calling context at the index specified in the invoking bytecode (which you either peek at or else store at original method activation).
I eventually decided making a new Smalltalk was not exactly a profitable idea (circa '94-'95), and I needed to go on with my life, so stopped working on the new interpreter architecture. I must still have the code around someplace. It would have been nice to make a paper all about this, but I got a little busy making a living.
Using these ideas in Squeak would be very significant work, as everything has to "mesh" for things to work. It's not clear this effort wouldn't be better spent on a native code JIT, which ultimately I think would have the best performance (processor branch prediction would actually work for sends). If small memory footprint is still a major goal, then maybe these are useful ideas.
- Jan
Greg -
It's good to hear more interest in speeding up Squeak. I thought I would comment on a few of your items from the perspective of the Jitter (II) project. Ian may have more to say about this as well. It will be easier to be specific when JII is released, but I thought you and everyone would want to consider this as part of the perspective when plotting the direction for future work.
-------------------- Where do we want to go? --------------------
We could choose to equal the speed of any of the faster Smalltalk (or Self) implementations. I would propose doing things in stages.
- Optimize the virtual machine, but keeping a bytecode interpreter.
This should get us to the performance level of Dolphin or Smalltalk/V or perhaps slightly better. (We ought to be able to do at least as well as Smalltalk-V, which runs in a 16-bit code environment and so has to do slow segment loads all over the place.)
The Squeak interpreter is already pretty heavily optimized. At this point there appear to be three main improvements that could be made:
Change from bytecodes to word-length threaded code
In-line caching of method lookups
Streamlining of activation (context management)
Jitter II is a JIT (just-in-time compiler), but the first version to be released will actually translate to threaded code (we've been calling this J2T or JIIT). This has the benefit of retaining complete portability while providing somewhat faster instruction dispatch, inline method lookup and streamlined activation.
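For readers unfamiliar with the technique, below is a toy example of threaded-code dispatch using GCC's computed-goto extension: a "translated method" is an array of handler addresses with operands inlined, and every handler ends by jumping straight to the next one. This is a generic illustration of how word-length threaded code dispatches, not Jitter II itself:

    #include <stdint.h>
    #include <stdio.h>

    #define NEXT  goto *(void *)(*tp++)    /* one indirect jump per threaded instruction */

    intptr_t runThreaded(void) {
        intptr_t stack[8], *sp = stack;

        /* Hand-written "translation" of the expression 3 + 4; a real translator
           would build this array once from a method's bytecodes. */
        intptr_t code[] = { (intptr_t)&&pushConst, 3,
                            (intptr_t)&&pushConst, 4,
                            (intptr_t)&&add,
                            (intptr_t)&&halt };
        intptr_t *tp = code;
        NEXT;

    pushConst:  *++sp = *tp++;            NEXT;
    add:        sp[-1] += sp[0]; sp--;    NEXT;
    halt:       return sp[0];             /* answers 7 */
    }

    int main(void) { printf("%ld\n", (long)runThreaded()); return 0; }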
- Compile to native code. If the VM was well optimized before this,
we might be able to reach performance near Visual Works. (It is interesting to note the almost factor of two speed difference between VisualWorks and Smalltalk-MT, indicating what an optimized VM can do.)
The threaded phase of Ian's implementation is intended to be easier to debug, and much easier to port. However, the architecture is designed for native code from the ground up. The plan is to reduce to an absolute minimum the complexity introduced by the change to native code. In particular, except for code generation, there should be almost no change to the translation and management of translated methods, to the handling of contexts, or to the interpreter interface.
Jitter II is an elegant and relatively simple design. Whether it will reach the performance of VW is not known, but it is unlikely that the first versions will do that well. If Jitter achieves between 2x and 4x while maintaining Squeak's relative simplicity and rampant portability, we will consider it to be a triumph.
[big snip]
Some of the things I have ideas on speeding up include (1) method dispatch
We agree, and Jitter IIT will incorporate both word-threading and inline caching.
(2) the garbage collector
Regarding changes to object headers, we may well want to make such changes. However we're inclined to wait until we have analyzed a running Jitter for a while. This is because Ian has done some very clever things with method preambles that reduce the overhead of variant headers.
(3) bytecode interpreter
It is our hope that Jitter II will be easily portable in both its threaded and native forms. With an overhead of less than a megabyte, all but the tiniest Squeak implementations will want to run Jitter. If so, then hopefully the pressures on the bytecode interpreter will reduce to simplicity and compactness.
(4) array access (including OrderedCollection)
Here again, we don't feel that the current treatment is necessarily ideal, but we do feel the importance of seeing the Jitter profiles before planning the next step.
-------------------------- Summary --------------------------
I hope we can get a good discussion about speeding up Squeak going, as well as adding closures and any other basic VM level features. Various people are working on these issues, and it would be good to talk to each other and to hash out what can be done to create the "lean, mean, squeaking machine" that many of us would like Squeak to be.
[snip]
Hopefully it will not be long before we can include Jitter II in our benchmarks, and lay out some priorities for further optimization from that new perspective.
The points you raise in your message are all serious candidates, and it is great to see a general move toward making Squeak a serious contender in Smalltalk execution speed.
- Dan