On Thu, May 14, 2009 at 4:36 AM, Jecel Assumpcao Jr <jecel@merlintec.com> wrote:
Thanks Eliot and Igor for your comments!
Eliot is right that what Igor wrote is a good way to do a JIT but the problem is that my hardware is essentially an interpreter and must deal at runtime with expressions that the JIT would optimize away.
My design is described in:
http://www.squeakphone.com:8203/seaside/pier/siliconsqueak/details
What is missing from that description is what the control registers (x0 to x31) do, and that depends on the stack organization. Certainly it is possible to have three registers as Eliot suggested. A separate control stack is not only used on Forth processors but was also popular in Lisp Machine designs.
Registers t0 to t29 are really implemented through a small amount of hardware as a range of words in the stack cache. Having to split this set differently for each method would make the hardware larger and slower. With a separate control stack the mapping is really simple.
Regarding the JIT having to use a call instruction that pushes the PC: that is not true on RISC processors. But I suppose the focus for Cog is making Squeak fast on the x86.
Right, but also for sanity's sake the run-time support code needs to be ISA-agnostic. So the frame organization will be unchanged between different ISAs, and that means even on RISCs the return pc will be pushed; perhaps not in frameless leaf methods, but definitely in normal frames. Hence it might as well be pushed in the same place, and incremented by the size of the call instruction, instead of having the return instruction add the offset at return time.
There's a tension between implementing what the current compiler produces and implementing what the instruction set defines. For example, should one assume arguments are never written to? I lean on the side of implementing the instruction set.
That is an especially good idea if the same VM ends up being used for other languages, like Newspeak. Certainly the Java VM runs many languages that the original designers never expected.
Good point.
Yes. In the JIT an interpreted frame needs an extra field to hold the saved bytecode instruction pointer when an interpreted frame calls a machine code frame, because the return address is the "return to interpreter trampoline" pc. There is no flag word in a machine code frame. So machine code frames save one word w.r.t. the Stack VM and interpreted frames gain a word. But most frames are machine code ones, so most of the time one is saving space.
Ok, so the JIT VM will still have an interpreter. Self originally didn't have one, but most of the effort that David Ungar put into the project when it was restarted went into making it more interpreter-friendly. The bytecodes became very similar to the ones in Little Smalltalk, for example.
Have you got a URL for that bytecode set? If not, can you mail me some code or documentation that describes it?
Will images be compatible between the JIT VM and the Stack VM? Or do you expect the latter to not be used anymore once the JIT is available? I had originally understood that the Stack VM would be compatible with older images (since you divorce all frames on save and remarry them on reload) but I had missed the detail of the different bytecodes for variable instance access in the case of Context objects.
Most of this I've answered in a different message. Yes, accessing Context inst vars needs special treatment. One could add the test to all relevant inst var accesses (only inst vars 0 through 2 are affected on read and 0 through 5 on write), but that would be very slow. Hence the recompile to use the long-form bytecodes is necessary. This whole treatment of Context inst var access is complex, but IMO it is better to hide it in the VM behind the primitive and long-form IV bytecode interface than to expose the internals to the image, because that makes the bootstrap process easier and makes evolving the VM easier too.
I guess that in hardware you can create an instruction that will load a descriptor register as part of the return sequence, in parallel with restoring the frame pointer and method, so one would never indirect through the frame pointer to fetch the flags word; instead it would be part of the register state. But that's an extremely uneducated guess :)
Well, I am trying to avoid having a flags word, even though in hardware it is so easy to have fields of any size you might want. I can check if context == nil very efficiently. For methods, t0 is the same value as the "self" register (x4, for example), while for blocks it is different. And with three pointers (fp, sp and control pointer) I shouldn't need to keep track of the number of arguments.
Jecel can also design the machine to avoid taking interrupts on the operand stack and provide a separate interrupt stack.
Hmmm... it has been a while since I designed hardware with interrupts; I have normally used Alto-style coroutines instead.
Of course; a much better architecture!
The stack cache is divided up into 32-word blocks and can hold parts of stacks from several threads at once. Checking for overflow/underflow only needs to happen when the stack pointer moves from one block to another (and even then, only in certain cases which aren't too common). An interesting feature of this scheme is that only 5-bit adders are needed, which are much faster than 16- or 32-bit adders; a wide adder could reduce the clock speed or force the operation to take an extra clock. Another detail is that having operand or control frames split across two stack pages is no problem at all.
Cool :)
address of tN in the stack cache:
raddr := fp[4:0] + N.
scaddr := (raddr[5] ? tHigh : tLow) , raddr[4:0].
When fp[5] changes value, then tLow := tHigh and tHigh := head of free block list (if fp was going up). If there are no free blocks, then some have to be flushed to their stack pages in main memory. When going down, tLow is loaded from a linked list, which might have to be extended by loading blocks from stack pages. With a 4KB stack cache, for example, there are 32 blocks with 32 words each and so block 0 can dedicate a word for each of the other 31 blocks. The bottom 7 bits of that word (only 5 actually needed, but it is nice to have a little room to grow) can form the "previous" linked list (tHigh and tLow would also be 5 bits wide) while the remaining bits can hold the block's address in main memory.
This might seem far more complicated than a split arg/temp frame and it certainly would be if implemented in software. In hardware, it is mostly wires, a multiplexer and a small adder.
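As a sanity check, the address computation above can be modeled in a few lines of C. This is only my software model of the description (the struct and function names are invented); the real thing is of course wires, a multiplexer and a 5-bit adder, not code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the tN address computation. tLow and tHigh are
   5-bit block numbers in a 4 KB stack cache (32 blocks of 32 words). */
typedef struct {
    uint8_t tLow;   /* block mapped below the carry boundary */
    uint8_t tHigh;  /* block mapped at or above it */
} StackCacheMap;

/* Word address of tN in the stack cache: a 5-bit add of the low bits of
   fp and N; the carry-out (raddr bit 5) selects tHigh vs tLow. */
static unsigned tn_cache_address(const StackCacheMap *m, unsigned fp, unsigned n)
{
    unsigned raddr = (fp & 0x1F) + n;           /* 6-bit result, bit 5 = carry */
    unsigned block = (raddr & 0x20) ? m->tHigh : m->tLow;
    return (block << 5) | (raddr & 0x1F);       /* block * 32 + word offset */
}
```

Note that only the low 5 bits of fp participate, and the carry out of the small add is exactly the raddr[5] bit that selects between tHigh and tLow.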
A different world. Cool!
-- Jecel
Eliot,
Have you got a URL for that bytecode set? If not, can you mail me some code or documentation that describes it?
I have a very messy page with several different instruction set designs and including the two used for Self at:
http://www.merlintec.com:8080/software/3
The Self stuff is near the bottom and easy to miss, but it is simple enough that I can describe it here so you don't have to bother with that page. The original (Self 4.1 and earlier) bytecodes had a 3-bit op field followed by a 5-bit data field. The 8 operations were: extend, pushSelf, pushLiteral, nonLocalReturn, setDirectee, send, selfSend and resend. The extend is a prefix that adds its 5 bits of data to the following bytecode. Note that the pushSelf and nonLocalReturn bytecodes ignored their data field. The resend was like super in Squeak, and since we have multiple inheritance it could be preceded by setDirectee to limit lookup to a specific parent. One important detail is that the pushLiteral bytecode checked for blocks and actually created a closure and pushed that instead. All variable access was done through the selfSend bytecode. Note that selfSend #self does the same thing as pushSelf because all Contexts have a slot named "self", so the latter bytecode is not really needed (but it does save an entry in the literal array).
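For concreteness, the 3-bit op / 5-bit data format with extend prefixes can be sketched as a small decoder in C. The enum and struct names are mine, and I am assuming that each extend prefix contributes five higher-order bits to the following bytecode's data field:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcodes in the order listed above; values are assumed, not confirmed. */
enum { EXTEND, PUSH_SELF, PUSH_LITERAL, NONLOCAL_RETURN,
       SET_DIRECTEE, SEND, SELF_SEND, RESEND };

typedef struct { uint8_t op; unsigned data; } Instr;

/* Decode one logical instruction starting at code[*pc], folding any
   EXTEND prefixes into the accumulated data field (5 bits each). */
static Instr decode(const uint8_t *code, size_t *pc)
{
    unsigned acc = 0;
    uint8_t b;
    while ((b = code[(*pc)++]) >> 5 == EXTEND)
        acc = (acc << 5) | (b & 0x1F);
    return (Instr){ b >> 5, (acc << 5) | (b & 0x1F) };
}
```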
For Self 4.1.2 the bytecode set was changed to make it more interpreter-friendly. It now has a 4-bit op field followed by a 4-bit data field. The 16 operations are: extend, pushLiteral, send, selfSend, extra, readLocal, writeLocal, lexicalLevel, branchAlways, branchIfTrue, branchIfFalse, branchIndexed, delegatee, undefined, undefined and undefined. Only the first four extra operations are defined: pushSelf, pop, nonLocalReturn, undirectedResend. The lexicalLevel bytecode changes the meaning of the following readLocal or writeLocal. Resending is a bit different, with the delegatee bytecode not only setting the parent for lookup but also changing the following send into a resend.
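The lexicalLevel prefix behavior can be shown with a toy interpreter fragment in C. The opcode values follow the listed order, and the assumption that the prefix applies to exactly one following bytecode (then resets to level 0) is mine:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values taken from the listed order; this is a model, not the
   actual Self 4.1.2 VM encoding. */
enum { OP_READ_LOCAL = 5, OP_WRITE_LOCAL = 6, OP_LEXICAL_LEVEL = 7 };

static int locals[4][16];      /* [lexical level][slot], a stand-in */
static int stack[16], sp;

static void run(const uint8_t *code, size_t len)
{
    unsigned level = 0;        /* 0 = innermost scope */
    for (size_t pc = 0; pc < len; pc++) {
        unsigned op = code[pc] >> 4, data = code[pc] & 0x0F;
        switch (op) {
        case OP_LEXICAL_LEVEL: level = data; continue;  /* arm the prefix */
        case OP_READ_LOCAL:  stack[sp++] = locals[level][data]; break;
        case OP_WRITE_LOCAL: locals[level][data] = stack[--sp]; break;
        }
        level = 0;             /* prefix modifies only one bytecode */
    }
}
```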
The original bytecodes are described in various papers and theses, but the new ones are not well documented except in the code itself. I mentioned that these are similar to the current Little Smalltalk bytecodes, which are a bit different from the original LST instruction set described in the book. The new LST bytecodes are only described in the source code for the VM and also have a 4-bit op field and a 4-bit data field, with the ops: extended, pushInstance, pushArgument, pushTemporary, pushLiteral, pushConstant, assignInstance, assignTemporary, markArguments, sendMessage, sendUnary, sendBinary, pushBlock, doPrimitive and doSpecial. The first 12 special instructions are defined: selfReturn, stackReturn, blockReturn, duplicate, popTop, branch, branchIfTrue, branchIfFalse, sendToSuper and breakPoint. The branch bytecodes don't have a data field but seem to use the following byte as their data.
This is far more understandable when I am not trying to be so compact, of course. I always like to see instruction sets in a table form, for example.
-- Jecel
vm-dev@lists.squeakfoundation.org