On Wed, May 13, 2009 at 5:05 PM, Jecel Assumpcao Jr jecel@merlintec.com wrote:
Eliot,
[JIT code uses call which pushes PC first]
Ok, so this can't be helped.
So while one could, I don't see that it's worthwhile. Even if one did keep the arguments and temporaries together, one would still have the stack contents separate from the arguments and temporaries, and temporary access bytecodes can still access those, so arguably one would still have to check the index against the temp count.
Really? I wouldn't expect the compiler to ever generate such bytecodes and so wasn't too worried if the VM did the wrong thing in this situation.
There's a tension between implementing what the current compiler produces and implementing what the instruction set defines. For example should one assume arguments are never written to? I lean on the side of implementing the instruction set.
In the JIT the flag is a bit in the method reference's LSBs and is set for free
on frame build.
That sounds like a neat trick. Are the stack formats for the interpreted stack VM and the JIT a little different?
Yes. In the JIT an interpreted frame needs an extra field to hold the saved bytecode instruction pointer when an interpreted frame calls a machine code frame because the return address is the "return to interpreter trampoline" pc. There is no flag word in a machine code frame. So machine code frames save one word w.r.t. the stack vm and interpreted frames gain a word. But most frames are machine code ones so most of the time one is saving space.
Thanks for the explanations. I haven't figured out how to do this in
hardware in a reasonable way and so might have to go with a different design.
I guess that in hardware you can create an instruction that will load a descriptor register as part of the return sequence in parallel with restoring the frame pointer and method so one would never indirect through the frame pointer to fetch the flags word; instead it would be part of the register state. But that's an extremely uneducated guess :)
-- Jecel
2009/5/14 Eliot Miranda eliot.miranda@gmail.com:
[...]
The one thing you mentioned is the separation of data stack & code stack (using a separate space to push args & temps, and another space to push VM/JIT-specific state, such as context pointers & return addresses).
In this sense there is no big issue with the separation, because to access the state in both variants you still have to track two pointers, namely the stack pointer and the frame pointer. But such a separation could be very helpful for handling uninitialized temps. To call a method you usually push the args (sp decreasing, if the stack grows downward), then push the return address and save sp & other required context state (fp decreasing), and then you are ready to activate the new method. Now there are 2 variants: when entering the new method you can simply reserve the stack space for its temps, OR leave sp unchanged but organize the code in such a way that each time a temp gets initialized you decrement sp. Then, at any point of execution, if the debugger needs to determine whether some of the method's temps are uninitialized, it can simply check that the context's saved sp is <= the temp's pointer. This of course makes things more complex, because you have to reorder temps by order of initialization, i.e. for
| a b | b := 5. a := 6.
the pointer to a must be < the pointer to b. Also, you should not alter sp when "pushing" arguments (otherwise the debugger will assume that you have already initialized temps).
But it allows you to conserve stack space between method calls:
someMethod
    | a b |
    "1" b := 5.  "2" self foo.  "3" a := 6.
At 1) you decrement sp by pushing the initialized value of b. At 2) you save sp, then move sp past the arguments for the method (#foo) and activate the new method; on return you restore sp. At 3) you push the newly initialized value of a, which decrements sp again.
And, as I said, if you suspend the process in the middle of the "self foo" call and inspect the context of someMethod, the debugger can easily see that its saved sp is too small to hold all temps; and if the user wants to see the current, uninitialized value of 'a', the answer is nil, because the offset of 'a' is greater than the currently saved sp.
There's another cost, of course: if a method has 10 temps, the generated code needs 10 pushes, one per initialization, in contrast to a single instruction that reserves the space at the method's activation by simply decrementing sp by a known constant.
But I think there is not much difference: in both variants you have to write to some memory location, and there is just a small difference in how you calculate the address at the initial store, i.e. doing push reg versus mov [fp + offset] <- reg.
So the major pros are:
- conserving stack space
- it is easy for the debugger to determine uninitialized temps
and the cons:
- it could complicate code generation
2009/5/14 Igor Stasenko siguctua@gmail.com:
[...]
Oh, there's another pro: you don't have to worry that some value stored at a location > sp is left uninitialized. This simplifies the GC code for visiting the stack to a simple loop from the beginning of the stack to the current stack pointer. (Visiting the code stack, tracked by fp, is done separately.)
-- Best regards, Igor Stasenko AKA sig.
Hi Igor, On Wed, May 13, 2009 at 6:18 PM, Igor Stasenko siguctua@gmail.com wrote:
2009/5/14 Eliot Miranda eliot.miranda@gmail.com:
[...]
The one thing you mentioned is separation of data stack & code stack (use separate space to push args & temps, and another space to push VM/JIT-specific state, such as context pointers & return addresses).
I think the "official" names used in Forth implementations are "control stack" and "operand stack".
In this way, there are no big issue with this separation, because to access the state in both variants you still have to track two pointers
- namely stack pointer and frame pointer.
I think you end up with three pointers. There's a frame pointer into the control stack, a stack pointer to the receiver/0th temp for that frame on the operand stack and a stack pointer proper to the top of stack on the operand stack. An operand stack page needs to be large enough to hold the maximum amount of stack space used by as many frames as can fit on the control stack and this is an issue because one ends up wasting a lot of space, something the single stack organization avoids.
But such a separation could be very helpful to handle the unitialized temps.
Right. I took this approach in my BrouHaHa VMs, with the simplification that there was only one stack page. After becoming familiar with HPS's organization I prefer it; it fits better with conventional hardware. But I think Jecel could use the split stack approach profitably on custom hardware, where having three registers (frame, stack base and top of stack) is no problem and where memory is still cheap. It's fine to throw transistors at the operand stack as that memory will be made frequent use of, even if one is wasting 25% of it most of the time.
Jecel can also design the machine to avoid taking interrupts on the operand stack and provide a separate interrupt stack.
2009/5/14 Eliot Miranda eliot.miranda@gmail.com:
[...]
I think you end up with three pointers. There's a frame pointer into the control stack, a stack pointer to the receiver/0th temp for that frame on the operand stack and a stack pointer proper to the top of stack on the operand stack.
Maybe you're right, but I think this can be avoided, because at each call site you know exactly the number of arguments and the order in which they will be pushed on the stack. So to make the send
self x: 1 y: 2
you can do, on a 32-bit machine:
[sp - 4]  <- self
[sp - 8]  <- 1
[sp - 12] <- 2
In case you have nested sends:
self x: (12 + 5) y: 2
" at the moment of preparing to send 12+5" [sp - 4] <- self [sp - 8] <- 12 [sp - 12] <- 5 " at the moment of preparing to send #x:y:" [sp - 4] <- self [sp - 8] <- 17 [sp - 12] <- 5
And it is even more complex if you have a temp initialization in the middle:
| temp | self x: 2 y: (temp := 5)
Here you have to reserve the space for 'temp' first (push nil) before storing 'self' at [sp - 4].
Are there other cases where you need to know the 'real' top of the operand stack for a given context, other than via the code flow? I mean, the debugger will see the initialized temps in the caller's context and the arguments in the callee's context, so there is no problem with losing introspection abilities.
An operand stack page needs to be large enough to hold the maximum amount of stack space used by as many frames as can fit on the control stack and this is an issue because one ends up wasting a lot of space, something the single stack organization avoids.
I can't quite follow this statement. Do you mean that you have to maintain 2 stack spaces instead of 1 (and consequently double the overhead of reserving extra space to prevent occasional overflow), and check 2 stacks before proceeding with a call instead of just one? But on the other hand, control stack frames always have a constant size, so it's very easy to predict how deep the call chain can go before you need to allocate another page for the control stack. This means that you can check the control stack 1/4, 1/8, 1/16 times as often as the depth of the call chain changes, to minimize this overhead.
-- Best regards, Igor Stasenko AKA sig.
On Wed, May 13, 2009 at 7:08 PM, Igor Stasenko siguctua@gmail.com wrote:
[...]
Maybe you're right, but I think this can be avoided, because at each call site you know exactly the number of arguments and the order in which they will be pushed on the stack. So to make the send
self x:1 y: 2
you can do, on a 32-bit machine:
[sp - 4]  <- self
[sp - 8]  <- 1
[sp - 12] <- 2
Hang on; I think you're changing the context. Jecel and I (I think) have been discussing how to avoid checking the number of arguments on each temp var access on a machine that has push/Store/StorePopTempVarN bytecodes. You're discussing a more general instruction set now. Yes, if you use a conventional instruction set where you've computed the offsets of all variables relative to the stack pointer at all points in the computation, then you can do this; this is what gcc does when one says -fomit-frame-pointer. But that instruction set is not a 1 token = 1 instruction set like a bytecode set, and so is difficult to compile for, difficult to decompile, etc.
An operand stack page needs to be large enough to hold the maximum
amount of stack space used by as many frames as can fit on the control stack and this is an issue because one ends up wasting a lot of space, something the single stack organization avoids.
I can't quite follow this statement. Do you mean that you have to maintain 2 stack spaces instead of 1 (and consequently double the overhead of reserving extra space to prevent occasional overflow), and check 2 stacks before proceeding with a call instead of just one?
Yes, but you don't have to check 2 stacks instead of one if the operand stack is large enough; i.e. if a control stack page holds N frames and the maximum stack space per frame is M, then if an operand stack page is M * N it can't overflow before the control stack page does. But if the average stack space per frame is much less than the maximum (which it is), then one ends up wasting a lot of space. This is not a big issue if memory is cheap. I expect that in Jecel's process technology memory isn't that expensive, and would also expect that since an operand stack page would be heavily used it is a good use for the memory.
But on the other hand, control stack frames always have a constant size - so it's very easy to predict how deep the call chain can grow before you need to allocate another page for the control stack. This means that you can check the control stack only 1/4, 1/8 or 1/16 as often as the call-chain depth changes, to minimize this overhead.
And in hardware the check would be done in parallel with other instructions so it would effectively be free. One only pays for moving the overflow state from one page to the next.
But such a separation could be very helpful to handle the unitialized
temps.
Right. I took this approach in my BrouHaHa VMs, with the simplification that there was only one stack page. After becoming familiar with HPS's organization I prefer it; it fits better with conventional hardware. But I think Jecel could use the split-stack approach profitably on custom hardware where having 3 registers (frame, stack base and top of stack) is no problem, and where memory is still cheap. It's fine to throw transistors at the operand stack, as that memory will be made frequent use of, even if one is wasting 25% of it most of the time.
Jecel can also design the machine to avoid taking interrupts on the
operand stack and provide a separate interrupt stack.
To call a method you usually push the args (sp decreasing - if the stack grows downward), then push the return address and save sp & other required context state (fp decreasing), and then you are ready to activate the new method. Now there are 2 variants: when entering the new method you can simply reserve the stack space for its temps, OR leave sp unchanged but organize the code in such a way that each time a temp gets initialized, you decrease sp. Then, at any point of execution, if the debugger needs to determine whether some of the method's temps are uninitialized, it can simply check that the context's saved sp is <= the temp's pointer. This of course makes things more complex, because you have to reorder the temps in order of their initialization, i.e. for | a b | b := 5. a := 6.
the pointer to a must be < the pointer to b. Also, you should not alter sp when "pushing" arguments (otherwise the debugger will assume that you have already initialized the temps).
But it allows you to conserve stack space between method calls:
someMethod | a b |
"1" b := 5. "2" self foo. "3" a := 6.
at 1) you decrement sp by pushing the initialized temp value of b
at 2) you save sp, then adjust sp by the number of arguments for the method (foo) and activate the new method; on return you restore sp
at 3) you push another newly initialized value, which leads to another decrement of sp
and, as I said, if you suspend the process in the middle of the "self foo" call and inspect the context of someMethod, the debugger can easily see that its saved sp is too small to hold all the temps, and if the user wants to see the current uninitialized value of 'a', then the answer is nil, because the offset of 'a' is beyond the currently saved sp.
There's another cost, of course: if a method has 10 temps, then the generated code needs 10 pushes, one per initialization, in contrast to the single instruction that reserves the space at method activation by simply decreasing sp by a known constant.
But I think there is not much difference: in both variants you have to write to some memory location, and there is just a small difference in how you calculate the address of the initial store, i.e. doing push reg or mov [fp + offset] <- reg.
So the major pros are:
- conserving stack space
- easy determination of uninitialized temps for the debugger
and the cons are:
- it could complicate code generation
-- Best regards, Igor Stasenko AKA sig.
cheers Eliot
Thanks Eliot and Igor for your comments!
Eliot is right that what Igor wrote is a good way to do a JIT, but the problem is that my hardware is essentially an interpreter and must deal at runtime with expressions that a JIT would optimize away.
My design is described in:
http://www.squeakphone.com:8203/seaside/pier/siliconsqueak/details
What is missing from that description is what the control registers (x0 to x31) do, and that depends on the stack organization. Certainly it is possible to have three registers as Eliot suggested. A separate control stack is not only used on Forth processors but was also popular in Lisp Machine designs.
Registers t0 to t29 are really implemented through a small amount of hardware as a range of words in the stack cache. Having to split this set differently for each method would make the hardware larger and slower. With a separate control stack the mapping is really simple.
About the JIT having to use a call instruction which pushes the PC: that is not true on RISC processors. But I suppose the focus for Cog is making Squeak fast on the x86.
There's a tension between implementing what the current compiler produces and implementing what the instruction set defines. For example should one assume arguments are never written to? I lean on the side of implementing the instruction set.
That is specially a good idea if the same VM ends up being used for other languages like Newspeak. Certainly the Java VM runs many languages that the original designers never expected.
Yes. In the JIT an interpreted frame needs an extra field to hold the saved bytecode instruction pointer when an interpreted frame calls a machine code frame because the return address is the "return to interpreter trampoline" pc. There is no flag word in a machine code frame. So machine code frames save one word w.r.t. the stack vm and interpreted frames gain a word. But most frames are machine code ones so most of the time one is saving space.
Ok, so the JIT VM will still have an interpreter. Self originally didn't have one but most of the effort that David Ungar put into the project when it was restarted was making it more interpreter friendly. The bytecodes became very similar to the ones in Little Smalltalk, for example.
Will images be compatible between the JIT VM and the Stack VM? Or do you expect the latter to not be used anymore once the JIT is available? I had originally understood that the Stack VM would be compatible with older images (since you divorce all frames on save and remarry them on reload) but I had missed the detail of the different bytecodes for variable instance access in the case of Context objects.
I guess that in hardware you can create an instruction that will load a descriptor register as part of the return sequence in parallel with restoring the frame pointer and method so one would never indirect through the frame pointer to fetch the flags word; instead it would be part of the register state. But that's an extremely uneducated guess :)
Well, I am trying to avoid having a flags word even though in hardware it is so easy to have any size fields that you might want. I can check if context == nil very efficiently. For methods, t0 is the same value as the "self" register (x4, for example) while for blocks it is different. And with three pointers (fp, sp and control pointer) I shouldn't need to keep track of the number of arguments.
Jecel can also design the machine to avoid taking interrupts on the operand stack and provide a separate interrupt stack.
Hmmm... it has been a while since I designed hardware with interrupts; I have normally used Alto-style coroutines instead. The stack cache is divided into 32-word blocks and can hold parts of stacks from several threads at once. Checking for overflow/underflow only needs to happen when the stack pointer moves from one block to another (and even then, only in certain cases which aren't too common). An interesting feature of this scheme is that only 5-bit adders are needed, which are much faster than 16 or 32 bit adders (wide adders could reduce the clock speed or make the operation take an extra clock). Another detail is that having operand or control frames split across two stack pages is no problem at all.
address of tN in the stack cache:

raddr := fp[4:0] + N.
scaddr := (raddr[5] ? tHigh : tLow) , raddr[4:0].
When fp[5] changes value, then tLow := tHigh and tHigh := head of free block list (if fp was going up). If there are no free blocks, then some have to be flushed to their stack pages in main memory. When going down, tLow is loaded from a linked list, which might have to be extended by loading blocks from stack pages. With a 4KB stack cache, for example, there are 32 blocks with 32 words each and so block 0 can dedicate a word for each of the other 31 blocks. The bottom 7 bits of that word (only 5 actually needed, but it is nice to have a little room to grow) can form the "previous" linked list (tHigh and tLow would also be 5 bits wide) while the remaining bits can hold the block's address in main memory.
This might seem far more complicated than a split arg/temp frame and it certainly would be if implemented in software. In hardware, it is mostly wires, a multiplexer and a small adder.
-- Jecel
vm-dev@lists.squeakfoundation.org