Hello guys,
I was recently looking at Eliot's BlockClosure model in Pharo/Squeak and at the BlockClosure model of VisualWorks, and I have a few questions.
- Why does Pharo/Squeak not have a CompiledBlock as in VW, keeping the block bytecode in the enclosing method instead? Is it to save memory? Would implementing CompiledBlock be worth it in terms of speed and memory consumption?
- Why do Pharo/Squeak contexts have this closureOrNil variable instead of holding the closure in the receiver field as in VW? Is it an optimization because blocks in Pharo/Squeak access self and instance variables a lot? Because, if I'm correct, it costs one more slot per stack frame.
- Lastly, does VW have the tempVector optimization for escaping written temporaries in its BlockClosure? It seems it does not (I see no reference to it in VW 7). Did Pharo/Squeak blocks gain much speed or memory from this optimization?
Thank you for any answer.
On 29-07-2013, at 1:54 AM, Clément Bera bera.clement@gmail.com wrote:
- Lastly, does VW have the tempVector optimization for escaping written temporaries in its BlockClosure? It seems it does not (I see no reference to it in VW 7). Did Pharo/Squeak blocks gain much speed or memory from this optimization?
Assuming you're talking about what I think you are talking about, VW had it around 12 years ago; I remember Eliot explaining it to me at least three times before I quite got it. I happened to be visiting with him for lunch on the day it first worked well, IIRC.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful Latin Phrases:- Re vera, potas bene = Say, you sure are drinking a lot.
2013/7/29 tim Rowledge tim@rowledge.org
Thanks for your answer. So they do have it. Actually, on second thought, it may be that blocks need the temp vector to work with the stack mapping: if a block outlives its homeContext, the homeContext variables may be overwritten on the stack.
It is just that I see no reference to it in the VW class comments. Perhaps these are VM details hidden from the user.
On 29-07-2013, at 1:28 PM, Clément Bera bera.clement@gmail.com wrote:
Actually, on second thought, it may be that blocks need the temp vector to work with the stack mapping: if a block outlives its homeContext, the homeContext variables may be overwritten on the stack.
No, VW used stack mapping of various sorts long before that. It was just a fair bit more complicated to handle returns, IIRC. I think the issue was that making sends a tiny bit slower (to handle the temp array) made a large number of expensive returns much faster, and thus improved overall performance quite a bit.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Security announcement - as of next week, passwords will be entered in Morse code.
Hi Clément,
On Mon, Jul 29, 2013 at 1:54 AM, Clément Bera bera.clement@gmail.com wrote:
- Why does Pharo/Squeak not have a CompiledBlock as in VW, keeping the block bytecode in the enclosing method instead? Is it to save memory? Would implementing CompiledBlock be worth it in terms of speed and memory consumption?
Squeak derives directly from the "blue book" Smalltalk-80 implementation, in which CompiledMethod is a hybrid object, half pointers (method header and literals) and half bytes (bytecode and source pointer). This format was chosen to save space in the original 16-bit Smalltalk implementations on the Xerox D machines (Alto & Dorado). VisualWorks has a few extra steps in between: in ObjectWorks 2.4 and 2.5, Peter Deutsch both introduced closures and eliminated the hybrid CompiledMethod format, introducing CompiledBlock.
IMO adding CompiledBlock, while simplifying the VM a little, would not improve performance, especially in the interpreter, essentially because activating and returning from methods would then require an extra level of indirection to get from the CompiledMethod object to the bytecodes in its bytecode object.
However, adding CompiledBlock (or rather eliminating the hybrid CompiledMethod format) would definitely *not* save space. The hybrid format is more compact (one less object per method). One can try to improve on this, as in VisualWorks, by encoding the bytecodes of certain methods as SmallIntegers in the literal frame, but this is only feasible in a pure JIT VM. Squeak still has an interpreter, and Cog is a hybrid JIT and interpreter. In an interpreter it is costly in performance to be able to interpret this additional form of bytecode.
So IMO, while the hybrid CompiledMethod isn't ideal, it is acceptable, having important advantages to go along with its disadvantages.
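To make the hybrid format concrete: in Squeak/Pharo you can see from the image side that a method's literals and bytecodes share one object. The selectors below are standard CompiledMethod reflection; the particular numbers of course depend on the method inspected:

```smalltalk
"Inspect the hybrid CompiledMethod format: literal slots come first,
 then the bytecodes, all inside the same object."
| m |
m := Object >> #yourself.
m numLiterals.   "how many literal slots precede the bytecodes"
m initialPC.     "byte index of the first bytecode, just past the literals"
m size.          "indexable byte size: literals + bytecodes + source-pointer trailer"
```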
- Why do Pharo/Squeak contexts have this closureOrNil variable instead of holding the closure in the receiver field as in VW? Is it an optimization because blocks in Pharo/Squeak access self and instance variables a lot? Because, if I'm correct, it costs one more slot per stack frame.
I did this because I think it's simpler and more direct. I don't like VW's access to the receiver and inst vars having to use different bytecodes within a block than within a method. There are lots of complexities resulting from this (e.g. in scanning code for inst var refs, in the decompiler, etc.).
But in fact there isn't really an additional stack slot, because the frame format in the VM does not use the stacked receiver (the 0th argument): accessing the receiver in that position requires knowing the method's argument count. So in both methods and blocks the receiver is pushed on the stack immediately before allocating space for, and nilling, any temporaries. This puts the receiver in a known place relative to the frame pointer, making it accessible to the bytecodes without having to know the method's argument count. So the receiver always occurs twice on the stack in a method anyway. In a block, the block is on the stack in the 0th argument position. The actual receiver is pushed after the temps.
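A rough picture of the frame layout just described, written as a Smalltalk comment. This is a sketch inferred from the description above; the actual field order in the Stack/Cog VM may differ in detail:

```smalltalk
"Hypothetical stack frame sketch, stack growing downwards:
     arg 0 (stacked receiver, or the closure for a block activation)
     arg 1 ... arg N
     caller's saved instruction pointer
 FP: saved frame pointer
     method
     context-or-flags
     receiver        <- the explicit copy, at a fixed offset from FP
     temp 1 ... temp M   (nilled during frame building)
 SP: working stack"
```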
- Lastly, does VW have the tempVector optimization for escaping written temporaries in its BlockClosure? It seems it does not (I see no reference to it in VW 7). Did Pharo/Squeak blocks gain much speed or memory from this optimization?
Yes, VW has this same organization. I implemented it in VisualWorks 5i in ~2000. It resulted in a significant increase in performance (for example, factors of two improvement in block-intensive code such as exception handling). This is because of details in the context-to-stack mapping machinery which mean that if an activation of a closure can update the temporaries of its outer contexts, then keeping contexts and stack frames in sync is much more complex and costly. The 5i/Cog organization (which in fact derives from some Lisp implementations) results in much simpler context-to-stack mapping, such that no tests need be done when returning from a method to keep frames and contexts in sync.
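For the record, the transformation under discussion: when a temporary is written after being captured by a block that may outlive its method, the compiler moves that temporary into a heap-allocated Array (the "temp vector") shared by the method and the closure. A hand-written equivalent, not actual compiler output:

```smalltalk
"Source such as:
     counter
         | n |
         n := 0.
         ^ [ n := n + 1 ]
 is compiled roughly as if the programmer had written:"
counter
	| vec |
	vec := Array new: 1.   "one slot per escaping written temp"
	vec at: 1 put: 0.
	^ [ vec at: 1 put: (vec at: 1) + 1 ]
```

The method frame can now be discarded freely: the only mutable state the block needs lives in the Array, so no write-back between frames and contexts is ever required.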
Thank you for any answer.
You're most welcome. Have you read my blog posts on the design? The main one is "Under Cover Contexts and the Big Frame-Up" (http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/), with additional information in "Closures Part I" (http://www.mirandabanda.org/cogblog/2008/06/07/closures-part-i/) and "Closures Part II – the Bytecodes" (http://www.mirandabanda.org/cogblog/2008/07/22/closures-part-ii-the-bytecodes/).
Thanks for the answer, it was very helpful. I got it now.
I had a look at the first posts of your blog (Closures I & II) when I was working on the Opal compiler. Today I was looking at "Under Cover Contexts and the Big Frame-Up" (http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/) and I think I should read your whole blog.
It is really nice that you wrote this blog; it is the main documentation on an efficient Smalltalk VM. I learnt mostly by looking at Cog's source. The VW VM source is closed, so... I will have a look at the Strongtalk implementation instead; it seems it is open source.
Why are VW's clean blocks much faster? Are they activated like methods? I didn't find it in your blog (probably because it is not in Cog). Is it possible to implement clean blocks in Pharo/Squeak? (I think 53% of the blocks not optimized by the compiler are clean in Pharo 3.) Would it be worth it?
2013/7/30 Eliot Miranda eliot.miranda@gmail.com
http://www.mirandabanda.org/cogblog/2008/06/07/closures-part-i/Hi Clément,
On Mon, Jul 29, 2013 at 1:54 AM, Clément Bera bera.clement@gmail.comwrote:
Hello guys,
I was looking recently at the blockClosure model of Eliot in Pharo/Squeak and the blockClosure model of VisualWorks and I have a few questions.
- Why Pharo/Squeak does not have compiled block as in VW and has the
block byte code in the enclosing method ? Is it to save memory ? Would it worth it to implement CompiledBlock in term of speed and memory consumption ?
Squeak derives directly from the "blue book" Smalltalk-80 implementation in which CompiledMethod is a hybrid object, half pointers (method header and literals) and half bytes (bytecode and source pointer). This format was chosen to save space in the original 16-bit Smalltalk implementations on the Xerox D machines (Alto & Dorado). VisualWorks has a few extra steps in between, In ObjectWorks 2.4 and ObjectWorks 2.5 Peter Deutsch both introduced closures and eliminated the hybrid CompiledMethod format, introducing CompiledBlock.
IMO adding CompiledBlock, while simplifying the VM a little would not improve performance, especially in the interpreter, essentially because activating and retuning form methods now requires an ecxtra level of indirection to get from the CompiledMethod object to its bytecodes in its bytecode object.
However, adding CompiledBlock (or rather eliminating the hybrid CompiledMethod format) would definitely *not* save space. The hybrid format is more compact (one less object per method). One can try and improve this as in VisualWorks by encoding the bytecodes of certain methods as SmallIntegers in the literal frame, but this is only feasible in a pure JIT VM. Squeak still has an interpreter, and Cog is a hybrid JIT and Interpreter. In an interpreter it is costly in performance to be able to interpret this additional form of bytecodes.
So IMO while the hybrid CompiledMethod isn't ideal it is acceptable, having important advantages to go along with its disadvantages.
- Why Pharo/Squeak context have this variable closureOrNil instead of
having the closure in the receiver field as in VW ? Is it an optimization because there are a lot of access to self and instance variables in the blocks in Pharo/Squeak ? Because if I'm correct it uses 1 more slot per stack frame to have this.
I did this because I think its simpler and more direct. I don't like VW's access to the receiver and inst vars having to use different bytecodes within a block to within a method. There are lots of complexities resulting from this (e.g. in scanning code for inst var refs, the decompiler, etc).
But in fact there isn't really an additional stack slot because the frame format in the VM does not use the stacked receiver (the 0'th argument) as accessing the receiver in this position requires knowing the method's argument count. So in both methods and blocks the receiver is pushed on the stack immediately before allocating space for, and nilling, any temporaries. This puts the receiver in a known place relative to the frame pointer, making it accessible to the bytecodes without having to know the method's argument count. So the receiver always occurs twice on the stack in a method anyway. In a block, the block is on the stack in the 0'th argument position. The actual receiver is pushed after the temps.
- Lastly, does VW have the tempVector optimization for escaping write
temporaries in their blockClosure ? It seems they have not (I don't see any reference to it in VW 7). Did Pharo/Squeak blocks earns a lot of speed or memory with this optimization ?
Yes, VW has this same organization. I implemented it in VisualWorks 5i in ~ 2000. It resulted in a significant increase in performance (for example, factors of two improvement in block-intensive code such as exception handling). This is because of details in the context-to-stack mapping machinery which mean that if an activation of a closure can update the temporaries of its outer contexts then keeping contexts and stack frames in sync is much more complex and costly. The 5i/Cog organization (which in fact derives from some Lisp implementations) results in much simpler context-to0stack mapping such that no tests need be done when returning from a method to keep frames and contexts in sync.
Thank you for any answer.
You're most welcome. Have you read my blog post on the design? It is "Under Cover Contexts and the Big Frame-Uphttp://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/", with additional information in "Closures Part I" & "Closures Part II – the Bytecodeshttp://www.mirandabanda.org/cogblog/2008/07/22/closures-part-ii-the-bytecodes/ ". -- best, Eliot
On Tue, Jul 30, 2013 at 1:56 PM, Clément Bera bera.clement@gmail.com wrote:
Clean blocks are faster because they don't access their outer environment, and hence their outer context does not have to be created. So there is no allocation associated with a clean block: it exists already as a literal, and its outer context does not have to be reified. Normal closures are created when the point at which they are defined is reached in method execution (the pushClosure bytecode), and if the current context does not yet exist it must be instantiated too, so creating a closure usually takes two allocations.
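For readers following along, "clean" means the block closes over nothing at all from its defining environment. A few illustrative cases (`t` stands for a temporary of the enclosing method):

```smalltalk
[ :a :b | a + b ]       "clean: touches only its own arguments and temps"
[ :x | x + self size ]  "not clean: references self"
[ :x | x + t ]          "not clean: reads the enclosing method's temporary t"
[ ^ nil ]               "not clean: contains a non-local (^) return"
```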
Clean blocks are activated like blocks. Block and method activation differ in the first phase (the send side) but are quite similar in the second phase (frame building). In VW, for example, finding the machine-code method associated with a block involves a cache lookup, which can be slow. In Cog it involves following a pointer in the method header (inside the VM, the header of a method is replaced with a pointer to its machine code) and then jumping to a hard-coded binary search which jumps to the correct block's entry point depending on the closure's startpc. If a method contains a single block this is a direct jump. As a result, block dispatch in Cog is typically faster than in VW.
Yes, it is possible to implement clean blocks. It is only an issue to do with the representation of closures. Ideally they need a method inst var, making the outerContext inst var optional (or at least nil in a clean block). But that would require a change to BlockClosure's class definition and a VM change. To avoid having to change the class definition of BlockClosure and the VM, the compiler could create an empty context to hold onto the method, and that would work fine. So to implement clean blocks the compiler would instantiate a BlockClosure literal for each clean block, plus a MethodContext, whose receiver is nil, shared between all the clean blocks in a method. There are tricky issues, such as setting breakpoints in methods (toggle break on entry) or copying methods, which would require scanning the literals for clean blocks and duplicating them and their outerContext too. But that's just detail. Some time I must try this for Squeak. Let me know if you try it for Opal (and of course I'm very happy to help with advice).
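A sketch of that scheme; the constructor selectors here are hypothetical stand-ins for whatever the compiler would actually use, not the real Squeak/Opal API:

```smalltalk
"Per method: one receiver-less context shared by all of that method's
 clean blocks. #contextOnMethod: and #outerContext:startpc:numArgs:
 are illustrative names only."
| home cleanBlock |
home := MethodContext contextOnMethod: aCompiledMethod.   "receiver is nil"
cleanBlock := BlockClosure
	outerContext: home
	startpc: blockStartpc
	numArgs: 0.
aCompiledMethod literalAt: blockLiteralIndex put: cleanBlock.
```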
I expect that in certain cases the speedup would be noticeable, but it is a micro-optimization. You'd of course only notice the difference in tight loops that used clean blocks.
Hello Eliot,
So I implemented clean blocks with Opal in Pharo 3. I didn't know where to put the bytecode of the clean block, so I put it at the end of the method.
ex:

	exampleCleanBlock
		^ [ 1 + 2 ]

	17 <20> pushConstant: [...]
	18 <7C> returnTop
	19 <76> pushConstant: 1
	20 <77> pushConstant: 2
	21 <B0> send: +
	22 <7D> blockReturn

with the literal array containing: [ 1 + 2 ] #exampleCleanBlock OCOpalExamples
The startpc of the block is 19. Its outerContext is a context with nil as receiver and the method OCOpalExamples>>#exampleCleanBlock. Its numArgs is 0 and it has no copiedValues.
But it does not work with the JIT. If I run:

	OCOpalExamples new exampleCleanBlock value

I get 3 every time; that's fine. Now:

	1 to: 5 do: [ :i | OCOpalExamples new exampleCleanBlock value ]

works on the Stack VM but crashes the Cog VM. I don't know why (not enough knowledge of the Cog JIT).
Do you have any clue ?
On Thu, Aug 1, 2013 at 1:21 AM, Clément Bera bera.clement@gmail.com wrote:
Hello Eliot,
So I implemented clean blocks with Opal in Pharo 3. I didn't know where to put the byte code of the clean block, so I put it at the end of the method.
ex: exampleCleanBlock ^ [ 1 + 2 ]
17 <20> pushConstant: [...] 18 <7C> returnTop 19 <76> pushConstant: 1 20 <77> pushConstant: 2 21 <B0> send: + 22 <7D> blockReturn
having in the literal Array: [ 1 + 2 ] #exampleCleanBlock OCOpalExamples
The startpc of the block is 19. Its outerContext is a context with nil as receiver and the method OCOpalExamples>>#exampleCleanBlock. Its numArgs is 0 and it has no copiedValues.
But it does not work with the JIT. If I run:

OCOpalExamples new exampleCleanBlock value

I get 3 every time, which is fine. Now:

1 to: 5 do: [ :i | OCOpalExamples new exampleCleanBlock value ]

works on the Stack VM, but crashes the Cog VM. I don't know why (not enough knowledge about the Cog JIT).
Do you have any clue ?
No. Send me an image?
2013/7/31 Eliot Miranda eliot.miranda@gmail.com
On Tue, Jul 30, 2013 at 1:56 PM, Clément Bera bera.clement@gmail.comwrote:
Thanks for the answer, it was very helpful. I got it now.
I had a look at the first posts of your blog (Closures I & II) when I was working on the Opal compiler. Today I was looking at "Under Cover Contexts and the Big Frame-Up" (http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/) and I think I should read all of your blog.
It is really nice that you wrote this blog; it is the main documentation about an efficient Smalltalk VM. I learnt mostly by looking at Cog's source. The VW VM source is closed so... I will have a look at the Strongtalk implementation instead; it seems it is open source.
Why are the clean blocks of VW much faster? Are they activated like methods? I didn't find it in your blog (probably because it is not in Cog). Is it possible to implement clean blocks in Pharo/Squeak? (I think that 53% of the blocks not optimized by the compiler are clean in Pharo 3.) Would it be worth it?
Clean blocks are faster because they don't access their outer environment, and hence their outer context does not have to be created. So there is no allocation associated with a clean block: it exists already as a literal, and its outer context does not have to be reified. Normal closures are created when the point at which they are defined in method execution is reached (the pushClosure bytecode), and if the current context does not yet exist, that must be instantiated too; so creating a closure usually takes two allocations.
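For illustration, the "clean" property can be expressed as a predicate over what a block touches. This is a hypothetical Python sketch; the field names are invented, and the exact criteria Opal uses may differ:

```python
# A block is "clean" if it needs nothing from its enclosing activation:
# no outer temps, no self/instance variables, no non-local (^) return,
# no thisContext. Such a block can be built once, as a method literal.
def is_clean_block(block):
    return not (block["reads_or_writes_outer_temps"]
                or block["accesses_self_or_inst_vars"]
                or block["has_non_local_return"]
                or block["uses_this_context"])

one_plus_two = {                  # models [ 1 + 2 ]
    "reads_or_writes_outer_temps": False,
    "accesses_self_or_inst_vars": False,
    "has_non_local_return": False,
    "uses_this_context": False,
}
print(is_clean_block(one_plus_two))  # -> True
```

Non-local return and thisContext are in the list because both need a reified outerContext, even if the block touches no outer variables.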
Clean blocks are activated like blocks. Block and method activation is different in the first phase (the send side) but quite similar in the second phase (frame building). In VW, for example, finding the machine code method associated with a block involves a cache lookup, which can be slow. In Cog, it involves following a pointer in the method header (internally, the VM replaces the header of a method with a pointer to its machine code) and then jumping to a hard-coded binary search which jumps to the correct block's entry-point depending on the closure's startpc. If a method contains a single block then this is a direct jump. As a result, block dispatch in Cog is typically faster than in VW.
Yes, it is possible to implement clean blocks. It is only an issue to do with the representation of closures. Ideally they need a method inst var, making the outerContext inst var optional (or at least nil in a clean block). But that would require a change to BlockClosure's class definition and a VM change. To avoid having to change the class definition of BlockClosure and the VM, the compiler could create an empty context to hold onto the method, and that would work fine. So to implement clean blocks the compiler would instantiate a BlockClosure literal for each clean block and a MethodContext whose receiver was nil, shared between all the clean blocks in a method. There are tricky issues such as setting breakpoints in methods (toggle break on entry), or copying methods, which would require scanning the literals for clean blocks and duplicating them and their outerContext too. But that's just detail. Some time I must try this for Squeak. Let me know if you try it for Opal (and of course I'm very happy to help with advice).
I expect that in certain cases the speedup would be noticeable, but it is a micro-optimization. You'd of course only notice the difference in tight loops that used clean blocks.
2013/7/30 Eliot Miranda eliot.miranda@gmail.com
http://www.mirandabanda.org/cogblog/2008/06/07/closures-part-i/

Hi Clément,
On Mon, Jul 29, 2013 at 1:54 AM, Clément Bera bera.clement@gmail.comwrote:
Hello guys,
I was looking recently at the blockClosure model of Eliot in Pharo/Squeak and the blockClosure model of VisualWorks and I have a few questions.
- Why does Pharo/Squeak not have compiled blocks as in VW, keeping the block bytecode in the enclosing method? Is it to save memory? Would it be worth it to implement CompiledBlock in terms of speed and memory consumption?
Squeak derives directly from the "blue book" Smalltalk-80 implementation, in which CompiledMethod is a hybrid object, half pointers (method header and literals) and half bytes (bytecode and source pointer). This format was chosen to save space in the original 16-bit Smalltalk implementations on the Xerox D machines (Alto & Dorado). VisualWorks has a few extra steps in between. In ObjectWorks 2.4 and 2.5, Peter Deutsch both introduced closures and eliminated the hybrid CompiledMethod format, introducing CompiledBlock.
IMO adding CompiledBlock, while simplifying the VM a little, would not improve performance, especially in the interpreter, essentially because activating and returning from methods would then require an extra level of indirection to get from the CompiledMethod object to its bytecodes in its bytecode object.
However, adding CompiledBlock (or rather eliminating the hybrid CompiledMethod format) would definitely *not* save space. The hybrid format is more compact (one less object per method). One can try to improve this, as in VisualWorks, by encoding the bytecodes of certain methods as SmallIntegers in the literal frame, but this is only feasible in a pure JIT VM. Squeak still has an interpreter, and Cog is a hybrid JIT and interpreter. In an interpreter it is costly in performance to be able to interpret this additional form of bytecodes.
So IMO, while the hybrid CompiledMethod isn't ideal, it is acceptable, having important advantages to go along with its disadvantages.
- Why do Pharo/Squeak contexts have this variable closureOrNil instead of having the closure in the receiver field as in VW? Is it an optimization because there are a lot of accesses to self and instance variables in the blocks in Pharo/Squeak? Because, if I'm correct, it uses 1 more slot per stack frame.
I did this because I think it's simpler and more direct. I don't like VW's access to the receiver and inst vars having to use different bytecodes within a block than within a method. There are lots of complexities resulting from this (e.g. in scanning code for inst var refs, the decompiler, etc.).
But in fact there isn't really an additional stack slot because the frame format in the VM does not use the stacked receiver (the 0'th argument) as accessing the receiver in this position requires knowing the method's argument count. So in both methods and blocks the receiver is pushed on the stack immediately before allocating space for, and nilling, any temporaries. This puts the receiver in a known place relative to the frame pointer, making it accessible to the bytecodes without having to know the method's argument count. So the receiver always occurs twice on the stack in a method anyway. In a block, the block is on the stack in the 0'th argument position. The actual receiver is pushed after the temps.
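The layout being described can be sketched with a toy frame builder (heavily simplified Python model; real Cog frames carry additional fields such as the method and context flags, so slot positions here are illustrative only). The point is that the receiver sits at a fixed offset from the frame pointer, independent of the argument count:

```python
# Toy frame builder (the stack grows upward; indexes are list positions).
# The stacked 0'th "argument" slot holds the receiver for a method
# activation (or the closure for a block activation); either way the
# actual receiver is re-pushed just below the temps, at a fixed
# distance from the frame pointer.
def build_frame(receiver, args, num_temps):
    frame = [receiver] + list(args)    # stacked receiver/closure + args
    frame.append("saved-fp")           # frame linkage (simplified)
    fp = len(frame) - 1                # frame pointer
    frame.append(receiver)             # receiver at fp + 1, always
    frame += [None] * num_temps        # nilled temporaries
    return frame, fp

f1, fp1 = build_frame("anObject", ["a", "b", "c"], num_temps=2)
f2, fp2 = build_frame("anObject", [], num_temps=0)
print(f1[fp1 + 1], f2[fp2 + 1])  # -> anObject anObject, regardless of argc
```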
On Thu, Aug 1, 2013 at 10:15 AM, Eliot Miranda eliot.miranda@gmail.comwrote:
On Thu, Aug 1, 2013 at 1:21 AM, Clément Bera bera.clement@gmail.comwrote:
Hello Eliot,
So I implemented clean blocks with Opal in Pharo 3. I didn't know where to put the byte code of the clean block, so I put it at the end of the method.
ex: exampleCleanBlock ^ [ 1 + 2 ]
17 <20> pushConstant: [...] 18 <7C> returnTop 19 <76> pushConstant: 1 20 <77> pushConstant: 2 21 <B0> send: + 22 <7D> blockReturn
having in the literal Array: [ 1 + 2 ] #exampleCleanBlock OCOpalExamples
The startpc of the block is 19. Its outerContext is a context with nil as receiver and the method OCOpalExamples>>#exampleCleanBlock. Its numArgs is 0 and it has no copiedValues.
But it does not work with the JIT.
Thinking about it I'm pretty sure the problem is that the JIT scans for and counts pushClosure: bytecodes to know how many blocks a method contains, but clean blocks don't need pushClosure: bytecodes. So the JIT needs to look for clean blocks, e.g. either by scanning a method's literals or by looking at the arguments of pushLiteral: bytecodes. In any case the image will allow me to develop a fix.
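The fix being sketched amounts to a second scan. As a rough Python illustration (the opcode value 0x8F matches the closureNumCopied:numArgs: bytecode shown later in this thread, but the scanning logic below is invented, not actual Cogit code):

```python
# Counting a method's blocks: pushClosure: bytecodes cover ordinary
# blocks, but clean blocks exist only as literals, so the literal
# frame has to be scanned as well. (Simplified: a real scanner must
# decode instruction lengths so operand bytes aren't misread as
# opcodes.)
PUSH_CLOSURE = 0x8F

class BlockClosureLiteral:          # stand-in for a clean-block literal
    pass

def count_blocks(bytecodes, literals):
    ordinary = sum(1 for b in bytecodes if b == PUSH_CLOSURE)
    clean = sum(1 for lit in literals
                if isinstance(lit, BlockClosureLiteral))
    return ordinary + clean

# A method with one clean block and no pushClosure: bytecode:
print(count_blocks([0x20, 0x7C], [BlockClosureLiteral(), "+"]))  # -> 1
```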
If I run:
OCOpalExamples new exampleCleanBlock value I got 3 all the time, it's fine. Now 1 to: 5 do: [ :i | OCOpalExamples new exampleCleanBlock value ] Works on Stack VM, but crashes Cog VM. I don't know why (not enough knowledge about the Cog JIT).
Do you have any clue ?
no. send me an image?
-- best, Eliot
Hi Eliot.
So I changed the implementation according to what you've just said and it works with Cog. I added a jump and a pushClosure bytecode which is never executed but which makes it JIT-compatible.
exampleCleanBlock
	^ [ 1 + 2 ]

17 <20> pushConstant: [...]
18 <72> pushConstant: false
19 <9F> jumpFalse: 28
20 <8F 00 00 04> closureNumCopied: 0 numArgs: 0 bytes 24 to 27
24 <76> pushConstant: 1
25 <77> pushConstant: 2
26 <B0> send: +
27 <7D> blockReturn
28 <7C> returnTop
Here the BlockClosure in the literals has a startpc of 24; therefore the pushClosure bytecode is never executed.
I will try to replace the jumpFalse: with an unconditional jump; I didn't do it yet because Opal then detects the block bytecode as unreachable and removes it. I will then check whether it still works with the JIT (I don't know if the JIT has this unreachable-bytecode removal feature). I may gain some speed by not having to push false.
Already the clean block is definitely faster, at first look:

OCOpalExamples>>#exampleCleanBlock
	^ [ 1 + 2 ]

foo := OCOpalExamples new.
[ foo exampleCleanBlock ] bench. "5x faster"
[ foo exampleCleanBlock value ] bench. "3.5x faster"
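A rough cross-language analogue of that measurement (Python, purely illustrative): a pre-built closure reused on every call versus a closure allocated afresh per call. The absolute numbers mean nothing here; the point is only that the per-call allocation is the avoidable cost.

```python
import timeit

clean = lambda: 1 + 2             # built once, like a clean-block literal

def with_allocation():
    return (lambda: 1 + 2)()      # allocate a fresh closure, then call it

assert clean() == with_allocation() == 3
print("reused:   %.3fs" % timeit.timeit(clean, number=200_000))
print("per-call: %.3fs" % timeit.timeit(with_allocation, number=200_000))
```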
I can prepare you an image so you can have a look, but:
- Pharo 3 requires the NativeBoost plugin to find environment variables, so it may not work on your Cog builds
- Pharo 3 is in alpha state, which currently implies that the debugger is not stable
- I need to clean it up before...
Anyway I'm happy to have it working.
I tried with a simple jumpTo: over the block closure bytecode (no push false, jumpFalse:) and it works fine with Cog. It does not result in an important speed-up, but it saves 1 bytecode. It is true that this implementation wastes 5 bytes (4 for the pushClosure, 1 for the jump) compared to the previous one I did, but it works without VM modification. There are 6250 clean blocks, so it wastes 12kb at worst. I guess it is fine then.
I will do a cleaner implementation and integrate it in Pharo 3 in the next few weeks (I will try before August 15th).
2013/8/2 Clément Bera bera.clement@gmail.com
Hi Eliot.
So I changed the implementation according to what you've just said and it works with Cog. I added a jump and a pushClosure byte code which is never called but permits to be JIT-compatible.
exampleCleanBlock ^ [ 1 + 2 ]
17 <20> pushConstant: [...] 18 <72> pushConstant: false 19 <9F> jumpFalse: 28 20 <8F 00 00 04> closureNumCopied: 0 numArgs: 0 bytes 24 to 27 24 <76> pushConstant: 1 25 <77> pushConstant: 2 26 <B0> send: + 27 <7D> blockReturn 28 <7C> returnTop
Here the BlockClosure in the literals has a startpc of 24, therefore the pushClosure bytecode cannot be called.
I will try to replace the jumpFalse by a jump, I didn't do it because Opal then detects the block byte code as not reachable and removes it. I will then check if it still works with the JIT (I don't know if the JIT has these unreachable bytecode removal feature). I may earn some speed by not having to push false.
Already now the clean block is definitely faster, at first look : OCOpalExamples >>#exampleCleanBlock ^ [ 1 + 2 ] foo := OCOpalExamples new. [ foo exampleCleanBlock ] bench (5x faster) [ foo exampleCleanBlock value ] bench (3.5 times faster)
I can prepare you an image so you can have a look, but
- Pharo 3 requires NativeBoost plugin to find environment variables so it
may not work on your Cog builds
- Pharo 3 is in alpha state which currently implies that the debugger is
not stable
- I need to clean it up before
...
Anyway I'm happy to have it working.
2013/8/1 Eliot Miranda eliot.miranda@gmail.com
On Thu, Aug 1, 2013 at 10:15 AM, Eliot Miranda eliot.miranda@gmail.comwrote:
On Thu, Aug 1, 2013 at 1:21 AM, Clément Bera bera.clement@gmail.comwrote:
Hello Eliot,
So I implemented clean blocks with Opal in Pharo 3. I didn't know where to put the byte code of the clean block, so I put it at the end of the method.
ex:

exampleCleanBlock
	^ [ 1 + 2 ]

17 <20> pushConstant: [...]
18 <7C> returnTop
19 <76> pushConstant: 1
20 <77> pushConstant: 2
21 <B0> send: +
22 <7D> blockReturn
having in the literal Array: [ 1 + 2 ] #exampleCleanBlock OCOpalExamples
The startpc of the block is 19. Its outerContext is a context with nil as receiver and the method OCOpalExamples>>#exampleCleanBlock. Its numArgs is 0 and it has no copiedValues.
But it does not work with the JIT.
Thinking about it I'm pretty sure the problem is that the JIT scans for and counts pushClosure: bytecodes to know how many blocks a method contains, but clean blocks don't need pushClosure: bytecodes. So the JIT needs to look for clean blocks, e.g. either by scanning a method's literals or by looking at the arguments of pushLiteral: bytecodes. In any case the image will allow me to develop a fix.
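For illustration only, the kind of scan Eliot describes might look like the sketch below; the opcode values and operand widths are invented for this example, not Cog's actual bytecode set:

```python
# Toy scan over a method's bytecodes to count blocks, as a JIT might do.
# Opcode values and operand widths are hypothetical; Cog's real encoding differs.
PUSH_CLOSURE = 0x8F                                  # assumed: 3 operand bytes follow
SIMPLE_OPS = {0x20: 0, 0x76: 0, 0x77: 0, 0xB0: 0, 0x7C: 0, 0x7D: 0}

def count_closures(bytecodes):
    count, pc = 0, 0
    while pc < len(bytecodes):
        op = bytecodes[pc]
        if op == PUSH_CLOSURE:
            count += 1
            pc += 4                                  # opcode + 3 operand bytes
        else:
            pc += 1 + SIMPLE_OPS.get(op, 0)
    return count
```

A clean block never emits a pushClosure bytecode, so a scan like this undercounts the method's blocks, which is exactly the failure mode suspected here; scanning the literal frame for BlockClosure literals as well would repair the count.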
If I run:

OCOpalExamples new exampleCleanBlock value

I get 3 all the time; it's fine. Now

1 to: 5 do: [ :i | OCOpalExamples new exampleCleanBlock value ]

works on the Stack VM, but crashes the Cog VM. I don't know why (not enough knowledge about the Cog JIT).
Do you have any clue ?
no. send me an image?
2013/7/31 Eliot Miranda eliot.miranda@gmail.com
On Tue, Jul 30, 2013 at 1:56 PM, Clément Bera bera.clement@gmail.comwrote:
Thanks for the answer, it was very helpful. I get it now.
I had a look at the first posts of your blog (Closures I & II) when I was working on the Opal compiler. Today I was looking at "Under Cover Contexts and the Big Frame-Up" (http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/) and I think I should read all of your blog.
It is really nice that you wrote this blog; it is the main documentation about an efficient Smalltalk VM. I learnt mostly by looking at Cog's source. The VW VM source is closed so... I will have a look at the Strongtalk implementation instead; it seems it is open source.
Why are the clean blocks of VW much faster? Are they activated like methods? I didn't find it in your blog (probably because it is not in Cog). Is it possible to implement clean blocks in Pharo/Squeak? (I think that 53% of the blocks not optimized by the compiler are clean in Pharo 3.) Would it be worth it?
Clean blocks are faster because they don't access their outer environment and hence their outer context does not have to be created. So there is no allocation associated with a clean block. It exists already as a literal and its outer context does not have to be reified. Normal closures are created when the point at which they are defined in method execution is reached (the pushClosure bytecode) and if the current context does not yet exist that must be instantiated too, so creating a closure usually takes two allocations.
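As an illustrative sketch of the allocation argument above (plain Python with invented names, not Cog's actual machinery): a normal block costs up to two allocations at its creation point, while a clean block is a preexisting literal and costs none.

```python
# Toy model of closure-creation cost. All names here are illustrative.
class Context:                     # a reified activation record
    def __init__(self, receiver):
        self.receiver = receiver

class Closure:
    def __init__(self, outer_context, startpc, copied):
        self.outer_context = outer_context
        self.startpc = startpc
        self.copied = copied

# Normal block: reaching its pushClosure point allocates the closure, and may
# first have to reify the enclosing context -- up to two allocations.
def push_closure(current_context, receiver, startpc, copied):
    allocations = 0
    if current_context is None:
        current_context = Context(receiver)
        allocations += 1
    closure = Closure(current_context, startpc, copied)
    allocations += 1
    return closure, allocations

# Clean block: built once (here, at module load) and reused as a literal,
# so "creating" it at run time allocates nothing.
CLEAN_BLOCK = Closure(Context(None), startpc=24, copied=())

def push_clean_block():
    return CLEAN_BLOCK, 0
```

The same closure object being returned every time is what makes the clean block free: there is nothing to allocate and nothing to reify.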
Clean blocks are activated like blocks. Block and method activation differ in the first phase (the send side) but are quite similar in the second phase (frame building). In VW, for example, finding the machine-code method associated with a block involves a cache lookup, which can be slow. In Cog, it involves following a pointer in the method header (internally, the VM replaces the header of a method with a pointer to its machine code) and then jumping to a hard-coded binary search which jumps to the correct block's entry point depending on the closure's startpc. If a method contains a single block then this is a direct jump. As a result, block dispatch in Cog is typically faster than in VW.
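A toy model of that startpc-based dispatch, using Python's bisect for the binary search (the table and addresses are made up for illustration):

```python
import bisect

# Toy dispatch: map a block's startpc to its machine-code entry point by
# binary search, as described for Cog. Addresses are invented.
def make_block_dispatch(entries):
    """entries: list of (startpc, machine_code_address), sorted by startpc."""
    startpcs = [pc for pc, _ in entries]
    def dispatch(startpc):
        i = bisect.bisect_left(startpcs, startpc)
        assert i < len(entries) and startpcs[i] == startpc, "unknown block"
        return entries[i][1]
    return dispatch

# A method with three blocks; with a single entry this degenerates to a
# direct lookup, matching the "direct jump" case in the text.
dispatch = make_block_dispatch([(24, 0x1000), (40, 0x1080), (57, 0x1100)])
```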
Yes, it is possible to implement clean blocks. It is only an issue to do with the representation of closures. Ideally they need a method inst var, making the outerContext inst var optional (or at least nil in a clean block). But that would require a change to BlockClosure's class definition and a VM change. To avoid having to change the class definition of BlockClosure and the VM, the compiler could create an empty context to hold onto the method, and that would work fine. So to implement clean blocks the compiler would instantiate a BlockClosure literal for each clean block and a MethodContext whose receiver was nil shared between all the clean blocks in a method. There are tricky issues such as setting breakpoints in methods (toggle break on entry), or copying methods, which would require scanning the literals for clean blocks and duplicating them and their outerContext too. But that's just detail. Some time I must try this for Squeak. Let me know if you try it for Opal. (and of course I'm very happy to help with advice).
I expect that in certain cases the speedup would be noticeable, but it is a micro-optimization. You'd of course only notice the difference in tight loops that used clean blocks.
-- best, Eliot
Hi Eliot and Clément (and others),
I'm not sure that this is something we can gain from, but I read (and totally forgot :)) the paper "Optimizing Closures in O(0) Time":
http://users-cs.au.dk/danvy/sfp12/papers/keep-hearn-dybvig-paper-sfp12.pdf
Stef
2013/7/30 Eliot Miranda eliot.miranda@gmail.com
http://www.mirandabanda.org/cogblog/2008/06/07/closures-part-i/

Hi Clément,
On Mon, Jul 29, 2013 at 1:54 AM, Clément Bera bera.clement@gmail.com wrote:
Hello guys,
I was looking recently at the blockClosure model of Eliot in Pharo/Squeak and the blockClosure model of VisualWorks and I have a few questions.
- Why does Pharo/Squeak not have compiled blocks as in VW, instead keeping the block bytecode in the enclosing method? Is it to save memory? Would it be worth implementing CompiledBlock in terms of speed and memory consumption?
Squeak derives directly from the "blue book" Smalltalk-80 implementation, in which CompiledMethod is a hybrid object: half pointers (method header and literals) and half bytes (bytecode and source pointer). This format was chosen to save space in the original 16-bit Smalltalk implementations on the Xerox D machines (Alto & Dorado). VisualWorks has a few extra steps in between: in ObjectWorks 2.4 and ObjectWorks 2.5 Peter Deutsch both introduced closures and eliminated the hybrid CompiledMethod format, introducing CompiledBlock.
IMO adding CompiledBlock, while simplifying the VM a little, would not improve performance, especially in the interpreter, essentially because activating and returning from methods would then require an extra level of indirection to get from the CompiledMethod object to the bytecodes in its bytecode object.
However, adding CompiledBlock (or rather eliminating the hybrid CompiledMethod format) would definitely *not* save space. The hybrid format is more compact (one less object per method). One can try and improve this as in VisualWorks by encoding the bytecodes of certain methods as SmallIntegers in the literal frame, but this is only feasible in a pure JIT VM. Squeak still has an interpreter, and Cog is a hybrid JIT and Interpreter. In an interpreter it is costly in performance to be able to interpret this additional form of bytecodes.
So IMO while the hybrid CompiledMethod isn't ideal it is acceptable, having important advantages to go along with its disadvantages.
- Why do Pharo/Squeak contexts have this variable closureOrNil instead of having the closure in the receiver field as in VW? Is it an optimization because there are a lot of accesses to self and instance variables in blocks in Pharo/Squeak? Because, if I'm correct, it uses 1 more slot per stack frame.
I did this because I think it's simpler and more direct. I don't like VW's access to the receiver and inst vars having to use different bytecodes within a block than within a method. There are lots of complexities resulting from this (e.g. in scanning code for inst var refs, the decompiler, etc.).
But in fact there isn't really an additional stack slot because the frame format in the VM does not use the stacked receiver (the 0'th argument) as accessing the receiver in this position requires knowing the method's argument count. So in both methods and blocks the receiver is pushed on the stack immediately before allocating space for, and nilling, any temporaries. This puts the receiver in a known place relative to the frame pointer, making it accessible to the bytecodes without having to know the method's argument count. So the receiver always occurs twice on the stack in a method anyway. In a block, the block is on the stack in the 0'th argument position. The actual receiver is pushed after the temps.
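A rough model of that frame-building order (offsets and names are illustrative and simplified, not the Stack VM's real layout):

```python
# Toy frame layout following the description above: the receiver is pushed
# again after the arguments, so bytecodes can reach it at a fixed offset from
# the frame pointer regardless of the method's argument count. In a block,
# the 0th-argument slot holds the block itself; the real receiver still lands
# at the same fixed offset. Purely illustrative.
def build_frame(stack, receiver_or_block, args, receiver, num_temps):
    stack.append(receiver_or_block)      # 0th "argument" slot: receiver, or the block
    stack.extend(args)                   # the real arguments
    fp = len(stack)                      # frame pointer set here (simplified)
    stack.append(receiver)               # receiver at the known offset fp
    stack.extend([None] * num_temps)     # nil the temporaries
    return fp

def receiver_at(stack, fp):
    return stack[fp]                     # no argument count needed
```

The point of the duplicate push is visible here: receiver_at never consults the argument count, for methods and blocks alike.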
- Lastly, does VW have the tempVector optimization for escaping written temporaries in their blockClosure? It seems they do not (I don't see any reference to it in VW 7). Did Pharo/Squeak blocks gain a lot of speed or memory with this optimization?
Yes, VW has this same organization. I implemented it in VisualWorks 5i in ~2000. It resulted in a significant increase in performance (for example, factors of two improvement in block-intensive code such as exception handling). This is because of details in the context-to-stack mapping machinery which mean that if an activation of a closure can update the temporaries of its outer contexts then keeping contexts and stack frames in sync is much more complex and costly. The 5i/Cog organization (which in fact derives from some Lisp implementations) results in much simpler context-to-stack mapping such that no tests need be done when returning from a method to keep frames and contexts in sync.
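The temp-vector idea can be illustrated with Python, whose closures happen to use an analogous heap indirection (cell objects); here a one-element list stands in for Smalltalk's temp vector:

```python
# Sketch of the "temp vector" organization: a written temporary that escapes
# into a closure is moved into a heap-allocated vector shared by the method
# and the block, so the block never writes into the method's stack frame and
# context-to-stack mapping never has to propagate such writes.
def make_counter():
    temp_vector = [0]              # the escaping, written temp lives on the heap
    def increment():               # the "block" captures the vector, not the frame
        temp_vector[0] += 1
        return temp_vector[0]
    return increment               # make_counter's frame can die; the vector survives

counter = make_counter()
```

Both the method and the block see the same heap cell, so the method's frame can be discarded or mapped freely without any write-back tests on return.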
Thank you for any answer.
You're most welcome. Have you read my blog post on the design? It is "Under Cover Contexts and the Big Frame-Up", with additional information in "Closures Part I" & "Closures Part II – the Bytecodes".
-- best, Eliot
On 2 August 2013 11:12, stephane ducasse stephane.ducasse@gmail.com wrote:
Hi eliot and clement (and others)
I'm not sure that this is something we can gain from, but I read (and totally forgot :)) the paper "Optimizing Closures in O(0) Time":
http://users-cs.au.dk/danvy/sfp12/papers/keep-hearn-dybvig-paper-sfp12.pdf
Skimmed through it... in Smalltalk there are no 'well-known' closures (or at least you cannot determine that at compile time), so of the optimizations described there (in section 2.1) we can only apply cases 2a and 2b, but nothing from case 1.
The only case of 'well-known' closures in Smalltalk is things like ifTrue:/ifFalse:/to:do:, which are handled directly by the compiler by inlining them.
We could try to optimize this:
foo
	| x |
	x := 3.
	^ [ :y | y + x ]

17 <20> pushConstant: 3
18 <68> popIntoTemp: 0
19 <10> pushTemp: 0
20 <8F 11 00 04> closureNumCopied: 1 numArgs: 1 bytes 24 to 27
24 <10> pushTemp: 0
25 <11> pushTemp: 1
26 <B0> send: +
27 <7D> blockReturn
28 <7C> returnTop
(here "25 <11> pushTemp: 1" could 'potentially' be replaced by "pushConstant: 3")
but that could be very tricky, given the need to support everything reflectively (imagine you get access to the method's context and say tempAt: 1 put: 5... do we then expect the closure to start adding 5 instead of 3?).
of course, if we lower our expectations, then it can be optimized more aggressively.
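Igor's caveat can be made concrete with a small sketch (Python stands in for Smalltalk here; the list models a reflectively accessible temp, and the names are invented):

```python
# Why folding "pushTemp: 1" into "pushConstant: 3" is unsafe while temps stay
# reflectively writable: a debugger-style tempAt:put: must remain visible to
# the already-created closure.
def make_method():
    temps = [3]                        # x := 3, kept reflectively accessible
    block = lambda y: y + temps[0]     # [:y | y + x]
    return temps, block

temps, block = make_method()
before = block(1)                      # reads x = 3
temps[0] = 5                           # reflective "tempAt: 1 put: 5"
after = block(1)                       # must now read x = 5
```

Had the compiler folded the constant 3 into the block, the reflective write would silently stop affecting it, which is the behavior change Igor warns about.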
Hi! I made a non-Opal version that works in 2.0, located at https://dl.dropboxusercontent.com/u/6751081/BlockTest.zip together with a startup script that loads the package, then crashes the image. Note: all three clean block versions in the .zip can be #value'd from a Workspace; it's only when you try to run those made by installClean/installCleanMinimal in a loop that the image crashes.
Cheers, Henry
Here are some sample benchmarks; the extra jump required to make it work currently seems to have an appreciable impact on performance:
{[BlockTest normal] bench . [BlockTest normal value] bench}
 #('7,340,000 per second.' '7,090,000 per second.')
 #('7,050,000 per second.' '7,130,000 per second.')

BlockTest installCleanWithClosureCreation.
{[BlockTest clean] bench . [BlockTest clean value] bench}
 #('38,900,000 per second.' '32,600,000 per second.')
 #('39,000,000 per second.' '33,100,000 per second.')

BlockTest installCleanMinimal.
{[BlockTest clean] bench . 'Cannot bench without crash'}
 #('48,700,000 per second.' 'Cannot bench without crash')
 #('49,700,000 per second.' 'Cannot bench without crash')
Cheers, Henry
vm-dev@lists.squeakfoundation.org