Fundamentally, in threaded code, the execution tokens are the addresses of the interpretive routines [...] So the core of the interpreter dispatch loop becomes trivial -- and very fast.
The crux of threaded code is that the interpreter loop goes away entirely. Each "opcode" ends by dispatching to the next opcode, so there is no return to a central loop -- e.g:
#define DISPATCH() { \
  register void *nextOp= (void *)fetchWord(); \
  goto *nextOp; \
}
pushReceiver: internalPush(receiver); DISPATCH();
pushTrue: internalPush(trueObj); DISPATCH();
pushFalse: internalPush(falseObj); DISPATCH();
and so on. (The addresses of the labels are the "opcodes" used to represent a program.)
The word "threaded" comes from an analogy with sewing: execution flows through the opcode implementations, like a needle pulling thread. There is a truly excellent short article that I highly recommend to the archaeologists amongst us:
James R. Bell, Threaded Code, Communications of the ACM, 16(6):370-372, June 1973.
Ian
Ian Piumarta Ian.Piumarta@inria.fr wrote...
The crux of threaded code is that the interpreter loop goes away entirely. Each "opcode" ends by dispatching to the next opcode, so there is no return to a central loop -- e.g:
#define DISPATCH() { \
  register void *nextOp= (void *)fetchWord(); \
  goto *nextOp; \
}
pushReceiver: internalPush(receiver); DISPATCH();
pushTrue: internalPush(trueObj); DISPATCH();
pushFalse: internalPush(falseObj); DISPATCH();
and so on. (The addresses of the labels are the "opcodes" used to represent a program.)
For those who are following this, I'd like to make one fine point:
One can structure a bytecode interpreter in exactly the above form, so in that sense you can eliminate the interpreter loop with bytecodes as well. The real difference between bytecodes and threaded code is that a bytecode interpreter has to look the current bytecode up in a table to find the opcode address, whereas a threaded interpreter has the address in hand from the instruction fetch.
While there is no practical way around the table lookup (*), there ARE a couple of further tweaks that we were going to try in the absence of Jitter. One is to reduce redundant fetches of multiple bytecodes from the same 32-bit word in a method. There are several ways to do this, but the idea is to keep the full 32 bits from the first fetch in a register, and then subsequent bytecodes can eschew the memory reference in favor of a simple shift and mask. The challenge is to keep the shift count management overhead less than the memory reference that is being saved. We have done this in microcode with good results.
Another form of optimization unique to bytecode interpreters is the use of multiple dispatch tables. So, for instance, at the end of an arithmetic bytecode, one might look the next bytecode up in a table of routines that assume the top of stack to be a small integer. We did this for small integers, true and false in one of the Apple ST 68k interpreters.
- Dan
(*) Well, actually there are ways: If the most frequent opcodes can be coded in, say, 8 words, then you can just shift the bytecode to get an address offset into the block of opcode service routines. Any routines that don't fit can include a jump out of line where there is more space.
Dan Ingalls wrote:
For those who are following this, I'd like to make one fine point:
One can structure a bytecode interpreter in exactly the above form, so in that sense you can eliminate the interpreter loop with bytecodes as well. The real difference between bytecodes and threaded code is that a bytecode interpreter has to look the current bytecode up in a table to find the opcode address, whereas a threaded interpreter has the address in hand from the instruction fetch.
While there is no practical way around the table lookup (*),
(*) Well, actually there are ways: If the most frequent opcodes can be coded in, say, 8 words, then you can just shift the bytecode to get an address offset into the block of opcode service routines. Any routines that don't fit can include a jump out of line where there is more space.
Fundamentally, where do the speed-ups come from?
a) the dispatch of direct threaded code being approximately twice as fast as bytecode dispatch because a table lookup is eliminated (this has been well discussed). As Dan pointed out in his footnote, one can do something analogous with bytecode dispatch by aligning bytecode routines on a boundary. However, one still pays for two arithmetic operations (a shift and an add), and will probably end up dedicating a register to the base address of the bytecode routines. This register is extremely expensive on register-poor architectures like the x86.
but b) the real advantage comes from threaded code having 4 or 8 times the information content of a bytecode. i.e. in the threaded code designs I vaguely know (Peter Deutsch did a threaded code prototype before doing HPS, the current VisualWorks engine that generates native code; BrouHaHa-TC, my own direct threaded code engine; Ian Piumarta's Jitter) each bytecode is expanded to two 32-bit words. Hence one has 64 bits per operation instead of 8. These extra bits are very profitably used to cache the previous method lookup for each send bytecode. A bytecode send specifies an argument count and a selector index. Hence to do a bytecoded send one must:
  index the stack with the argument count to derive the receiver
  index the method with the selector index to derive the selector
  derive the class of the receiver (which involves tag tests, etc)
  derive the base address of the method lookup cache
  compute a hash of the class and selector (e.g. add their raw addresses together with the cache base address)
  fetch and compare both the class and the selector from the cache against the class of the receiver and the send's selector
  if they match, fetch the method from the cache and activate it (this last bit is also expensive, since the method header must be decoded)
Using e.g. threaded code a send bytecode is typically translated into two words where the first word points to a send routine that knows it has N arguments, and the second word is the selector. Read-ahead means the selector is really cheap to fetch from the word following the threaded code. Note that both the threaded code address and the selector are fetched using data access. Once the send has been looked-up and translated to threaded-code (the method->threaded-code method conversion is cached so one doesn't do this every time except in Jitter I, one of the reasons for its poor performance) one replaces the second word (the selector) with a reference to the threaded-code method and the first word to a routine that simply compares the class of the receiver against that stored in the header of the threaded-code method. i.e.
  index the stack with a constant offset to derive the receiver
  derive its class (same unavoidable tag test)
  fetch the new method
  index this address with a constant offset to derive the class
  compare the class against that of the receiver
  if they match, set the threaded-code instruction pointer to point at the prolog routine of the method and continue
(hence one can have a range of prolog instructions, say 8, which implicitly decode various kinds of method prologs, such as has arguments and temps, has no args but temps, etc)
and c) using native code also provides more information, allowing analogs of the b) approach, but gets the processor to do all the decode and dispatch, and can use its branch-prediction logic to pre-fetch the prolog. With e.g. HPS a send is translated into two instructions, one that loads a register with the selector and another that calls a lookup routine that knows the number of arguments. When looked-up these are replaced by an instruction which loads a register with the class of the receiver when the send was looked-up, and a direct call of the native method for that lookup. The native method's prolog then derives the receiver's class (and the receiver is passed in a register, not on the stack) and compares it with the class/selector register and if they match continues. The native method's prolog is specific to the particular method, so there's no method decode overhead at all.
So IMO the fundamental reasons why threaded-code and native code approaches are "better" are that
1. threaded-code and native code approaches use more bits per send/per prolog which allows them to implement sends and prologs as memo functions that elide costly operations in the common case. Since typical programs exhibit huge locality implementing the conversion as a cached translation to a subset of all the methods in the system reclaims a lot of the space trade-off that one has made.
2. native code uses the processor's decode logic, i.e. parallelism. But one can imagine that super-scalar processors with large indirect branch prediction tables might elide the threaded-code dispatch.
In both Peter Deutsch's and my experience a tuned threaded-code implementation will run at about 2/3 the speed of a tuned native code implementation for macro benchmarks, but at around 1/4 for micro benchmarks. Tuned bytecode approaches are a further factor of 1/2 behind threaded code.
But whereas a direct threaded-code implementation can be written to be portable in a matter of hours, a native code implementation is portable in a matter of months. And on small machines the gap between threaded-code and native code narrows because there's less instruction cache for the native code to take advantage of.
One last point; native code requires instruction cache flushing when relinking sends on many processors. It used to be the case that one could write extremely efficient cache flushing logic in the form of a jump table. One would pre-generate a jump table big enough to shadow the icache and then jump into it at the point that matched the send instructions being flushed. Hence to flush the cache one would simply execute a few nops and a pair of jumps (e.g. this works for 68020 & 68030 processors). icache designs are now much more complex (e.g. the 68040 has a 4 way set associative icache with a random replacement policy) meaning that one has to use the official route.
[snip]
Another form of optimization unique to bytecode interpreters is the use of multiple dispatch tables. So, for instance, at the end of an arithmetic bytecode, one might look the next bytecode up in a table of routines that assume the top of stack to be a small integer. We did this for small integers, true and false in one of the Apple ST 68k interpreters.
Interesting. This is an idea I've not heard before. What kind of speed-ups did it achieve?
_______________,,,^..^,,,_______________ Eliot Miranda, ParcPlace
On Wed 24 Feb, Eliot & Linda Miranda wrote: [snip lots of good stuff]
One last point; native code requires instruction cache flushing when relinking sends on many processors. It used to be the case that one could write extremely efficient cache flushing logic in the form of a jump table. One would pre-generate a jump table big enough to shadow the icache and then jump into it at the point that matched the send instructions being flushed. Hence to flush the cache one would simply execute a few nops and a pair of jumps (e.g. this works for 68020 & 68030 processors). icache designs are now much more complex (e.g. the 68040 has a 4 way set associative icache with a random replacement policy) meaning that one has to use the official route.
....and sometimes the official route is slow, difficult or even wrong(!) - for example:
  flushing the cache on a StrongARM can cost ~2k cycles
  between two versions of the IBM RS6000, flushing changed from a fast instruction to a faulted library call
  between two revs of Sparc, the flush instruction changed from specifying (IIRC) a closed interval to a semi-open interval, and so we (PPS - I was there at the time) were 'missing' flushing one word and thereby executing (sometimes) a wrong instruction.
You can have a lot of fun with such changes....
One potential benefit of staying with threading is that you are typically generating addresses as data rather than instructions as data which doesn't cause quite so much confusion in the caching. As Eliot mentioned it's also a much simpler porting process.
And of course there is always the topic of Ian's PhD thesis, generating native directly from the Smalltalk compiler. Now that our machines need 128Mb ram just to boot the OS, the space costs of this approach seem somehow less worrying!
tim
Tim Rowledge wrote:
And of course there is always the topic of Ian's PhD thesis, generating native directly from the Smalltalk compiler. Now that our machines need 128Mb ram just to boot the OS, the space costs of this approach seem somehow less worrying!
Exactly. It should have been obvious years ago that the day would eventually come when the code size expansion of Smalltalk compiled directly to native code would asymptotically approach irrelevancy. That day will be here soon, if it has not already arrived.
Native code is necessary not simply for performance, but for **two way** interoperability between Smalltalk and other languages (e.g, so C can call Smalltalk code in a DLL). This does not mean that all Smalltalk code should always be native, just that having the option at least for selected methods, classes or modules (packages) is mandatory for widespread use of Smalltalk.
--Alan
[Hi Tim!]
Alan Lovejoy wrote:
Exactly. It should have been obvious years ago that the day would eventually come when the code size expansion of Smalltalk compiled directly to native code would asymptotically approach irrelevancy. That day will be here soon, if it has not already arrived.
I have to disagree vehemently. One of the really nice points about Squeak is that it fits in so little memory with very reasonable performance. Until recently, my Mac Performa had 16M, which was barely enough for VisualWorks but plenty for Squeak. It now has 64M, but my PowerBook still has 32M, so an all-native-code Smalltalk would probably swap considerably, making it unusable for me on that machine. And dare I mention the various Squeak-on-a-PDA approaches? Try to tell those guys about 128M of main memory just to start up! So please, let the Squeak VM come in at least 2 (compatible) flavors:
1. pure bytecode interpreter with very small footprint and acceptable performance
2. threaded or native code JIT for better performance, requiring reasonable additional memory
and probably
3. pure bytecode interpreter with reduced primitive set and less inlining for the really memory-challenged PDAs
Hans-Martin
squeak-dev@lists.squeakfoundation.org