On Thu, May 14, 2009 at 4:36 AM, Jecel Assumpcao Jr <jecel@merlintec.com> wrote:

Thanks Eliot and Igor for your comments!

Eliot is right that what Igor wrote is a good way to do a JIT but the
problem is that my hardware is essentially an interpreter and must deal
at runtime with expressions that the JIT would optimize away.

My design is described in:


What is missing from that description is what the control registers (x0
to x31) do, and that depends on the stack organization. Certainly it is
possible to have three registers as Eliot suggested. A separate control
stack is not only used on Forth processors but was also popular in Lisp
Machine designs.

Registers t0 to t29 are really implemented through a small amount of
hardware as a range of words in the stack cache. Having to split this
set differently for each method would make the hardware larger and
slower. With a separate control stack the mapping is really simple.

About the JIT having to use call which pushes the PC, that is not true
on RISC processors. But I suppose the focus for Cog is making Squeak
fast on the x86.

Right, but also for sanity the run-time support code needs to be ISA-agnostic.  So the frame organization will be unchanged between different ISAs and that means even on RISCs the return pc will be pushed; perhaps not in frameless leaf methods. but definitely in normal frames and hence it might as well be pushed in the same place, and incremented by the size of the call instruction, instead of having the return instruction add the offset at return time.

> There's a tension between implementing what the current compiler
> produces and implementing what the instruction set defines.  For
> example should one assume arguments are never written to?  I lean
> on the side of implementing the instruction set.

That is specially a good idea if the same VM ends up being used for
other languages like Newspeak. Certainly the Java VM runs many languages
that the original designers never expected.

Good point.
> Yes.  In the JIT an interpreted frame needs an extra field to hold
> the saved bytecode instruction pointer when an interpreted frame
> calls a machine code frame because the return address is the "return
> to interpreter trampoline" pc.  There is no flag word in a machine
> code frame.  So machine code frames save one word w.r.t. the
> stack vm and interpreted frames gain a word.  But most frames
> are machine code ones so most of the time one is saving space.

Ok, so the JIT VM will still have an interpreter. Self originally didn't
have one but most of the effort that David Ungar put into the project
when it was restarted was making it more interpreter friendly. The
bytecodes became very similar to the ones in Little Smalltalk, for

Have you got an URL for that bytecode set?  If not, can you mail me some code or documentation that describes it?

Will images be compatible between the JIT VM and the Stack VM? Or do you
expect the latter to not be used anymore once the JIT is available? I
had originally understood that the Stack VM would be compatible with
older images (since you divorce all frames on save and remarry them on
reload) but I had missed the detail of the different bytecodes for
variable instance access in the case of Context objects.

Most of this I've answered in a different message.  Yes, accessing Context inst vars needs special treatment.  One can add the test to all relevant inst var accesses (only inst vars 0 through 2 are affected on read and 0 through 5 on write) but that would be very slow.  Hence the recompile to use the long-form bytecodes is necessary.  This whole treatment of Context inst var access is complex but IMO it is better to hide it in the VM using the primitive and long-form IV bytecode interface than exposing the internals to the image because that makes the bootstrap process easier and makes evolving the VM easier too.

> I guess that in hardware you can create an instruction that will
> load a descriptor register as part of the return sequence in parallel
> with restoring the frame pointer and method so one would never
> indirect through the frame pointer to fetch the flags word; instead
> it would be part of the register state.  But that's an extremely
> uneducated guess :)

Well, I am trying to avoid having a flags word even though in hardware
it is so easy to have any size fields that you might want. I can check
if context == nil very efficiently. For methods, t0 is the same value as
the "self" register (x4, for example) while for blocks it is different.
And with three pointers (fp, sp and control pointer) I shouldn't need to
keep track of the number of arguments.

> Jecel can also design the machine to avoid taking interrupts on
> the operand stack and provide a separate interrupt stack.

Hmmm... it has been a while since I designed hardware with interrupts,
but have normally used Alto style coroutines instead.

Of course; a much better architecture!
The stack cache is
divided up into 32 word blocks and can hold parts of stacks from several
threads at once. Checking for overflow/underflow only needs to happen
when the stack pointer moves from one block to a different one (and even
then, only in certain cases which aren't too common). An interesting
feature of this scheme is that only 5 bit adders are needed (which are
much faster than 16 or 32 bit adders, for example. Wide adders could
reduce the clock speed or make the operation take an extra clock).
Another detail is that having operand or control frames split among two
stack pages is no problem at all.

Cool :)

address of tN in the stack cache:

 raddr := fp[4:0] + N.
 scaddr := (raddr[5] ? tHigh : tLow) , raddr[4:0].

When fp[5] changes value, then tLow := tHigh and tHigh := head of free
block list (if fp was going up). If there are no free blocks, then some
have to be flushed to their stack pages in main memory. When going down,
tLow is loaded from a linked list, which might have to be extended by
loading blocks from stack pages. With a 4KB stack cache, for example,
there are 32 blocks with 32 words each and so block 0 can dedicate a
word for each of the other 31 blocks. The bottom 7 bits of that word
(only 5 actually needed, but it is nice to have a little room to grow)
can form the "previous" linked list (tHigh and tLow would also be 5 bits
wide) while the remaining bits can hold the block's address in main

This might seem far more complicated than a split arg/temp frame and it
certainly would be if implemented in software. In hardware, it is mostly
wires, a multiplexer and a small adder.

A different world.  Cool!

-- Jecel