Thanks for this interesting list of your relevant work. I look forward to any other thoughts that you could add to this thread, in particular to provide a reality-check where your real-world experience disagrees with my theoretical understanding :-)
Best, Josh
On Oct 30, 2007, at 4:24 PM, Jecel Assumpcao Jr wrote:
I would like to mention some of my previous work in this area:
- tinySelf 1 (1996)
http://www.lsi.usp.br/~jecel/tiny.html#rel1
This was a Self interpreter written in Self which implemented the one thread per object model. All messages were future messages, but since sending a message to an unresolved future would block, you would get a deadlock on any recursion (direct or indirect). This problem was solved by detecting the cycles and preempting the blocked message with the one it depends on. This results in interleaved execution, but since the semantics are exactly the same as in a sequential execution of the recursive code, any bugs that appear won't be due to concurrency.
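A much-simplified sketch of that cycle-handling idea, in Python for illustration (the names and the "busy flag" shortcut are mine; tinySelf 1 used real futures and per-object threads):

```python
# Sketch of the tinySelf 1 deadlock-avoidance idea: each object
# handles one message at a time. A recursive send back to an object
# that is already busy would normally block forever on the future,
# so the wait-for cycle is detected and the pending message is run
# inline (interleaved), which preserves sequential semantics.

class Obj:
    def __init__(self, name):
        self.name = name
        self.busy = False          # currently processing a message?

    def send(self, method, *args):
        if self.busy:
            # Cycle detected: the sender is (directly or indirectly)
            # a message this object is still working on. Execute the
            # new message inline, exactly as sequential code would.
            return method(self, *args)
        self.busy = True
        try:
            return method(self, *args)
        finally:
            self.busy = False

def factorial(obj, n):
    # recursive send back to the same (busy) object
    return 1 if n == 0 else n * obj.send(factorial, n - 1)

counter = Obj("counter")
print(counter.send(factorial, 5))  # → 120
```

The point is only that interleaving the blocked message with the one it depends on gives the same answer as plain recursion, so no concurrency bugs are introduced.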
I was able to test simple expressions and was very happy with how much parallelism I could extract from seemingly sequential code, but I made the mistake of introducing a significant optimization (tail send elimination) that made debugging so much harder that I was unable to finish in the two weeks I could dedicate to the project.
- 64 node Smalltalk machine (1992)
http://www.lsi.usp.br/~jecel/ms8702.html
The most interesting result in this project was the notion that most objects in the system are immutable at any given time and that a security system might be used to detect this. For example, just because you can edit some font today doesn't mean that you will do it. And if you and everyone currently logged on the local system only have read permission for that font then it is effectively immutable. Only when the font's owner logs in is this assumption invalid.
The advantage of knowing that an object is immutable is that you can replicate it and you can allow multiple threads to access it at the same time.
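A tiny sketch of that check, in Python for illustration (the names `font_perms` and `effectively_immutable` are mine, not from the original system):

```python
# Permission-based "effective immutability": an object is effectively
# immutable if no currently logged-in user holds write permission on
# it. Such an object can be freely replicated and read by multiple
# threads until a writer logs in.

def effectively_immutable(obj_perms, logged_in_users):
    """True if no active user can write the object."""
    return not any('w' in obj_perms.get(user, '')
                   for user in logged_in_users)

# the font's owner (alice) has read/write; everyone else is read-only
font_perms = {'alice': 'rw', 'bob': 'r', 'carol': 'r'}

print(effectively_immutable(font_perms, {'bob', 'carol'}))  # True: replicate freely
print(effectively_immutable(font_perms, {'alice', 'bob'}))  # False: owner is active
```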
The only paper in English from this project describes how adaptive compilation could be used to trim away excessive concurrency by transforming future message passing into sequential message passing (the semantics allow this) and then inlining the sends away. So if a machine has 64 processors and the application initially starts out with 10 thousand threads, the compiler will eventually transform this into code with 200 or so threads (some are blocked at any given instant, so going all the way down to 64 threads would not be good). http://www.lsi.usp.br/~jecel/jabs1.html
- operating system in an Objective-C like language (1988)
http://www.lsi.usp.br/~jecel/atos.html (this page has download links but the text still hasn't been written)
This operating system for 286 machines used the virtual memory of that hardware to isolate groups of objects, with one thread per group. This would be similar to the vat/island model. All messages were sent in exactly the same way: if the receiver was a local object, the send was just a fancy subroutine call, but for remote objects you got a "segment not present" fault and the message was packed up and sent to the other task (possibly over the network). All messages were synchronous, since I was not aware of futures at that time.
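The uniform-send idea can be sketched in Python (the trap itself was a hardware fault on the 286; here the `island_of` lookup stands in for it, and all names are illustrative):

```python
# Uniform message sending across isolated object groups: the sender
# uses one send form everywhere. Locally it is an ordinary call; for
# a remote receiver the same send is "trapped", serialized, and
# forwarded to the island that owns the object.
import pickle

class Island:
    """One thread per island; each object lives in exactly one island."""
    def __init__(self):
        self.objects = {}

class Counter:
    def __init__(self):
        self.n = 0
    def bump(self, k):
        self.n += k
        return self.n

home, away = Island(), Island()
away.objects['c1'] = Counter()
island_of = {'c1': away}          # stands in for the hardware's segment tables

def send(here, oid, selector, *args):
    island = island_of[oid]
    if island is here:
        # local receiver: just a fancy subroutine call
        return getattr(here.objects[oid], selector)(*args)
    # remote receiver: analogous to the "segment not present" fault --
    # pack the message up and deliver it to the owning island
    packet = pickle.dumps((oid, selector, args))
    oid, selector, args = pickle.loads(packet)
    return getattr(island.objects[oid], selector)(*args)

print(send(home, 'c1', 'bump', 5))   # → 5
print(send(home, 'c1', 'bump', 2))   # → 7
```

The serialization step is what lets the same send work over a network, which a plain procedure call cannot do.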
-- current model --
I moved back to the one thread per object group model since I feel that makes it easier for programmers to control things without having to worry too much about details most of the time. Since my target is children, this is particularly important. An alternative that I experimented with was having a separation between active and passive objects. A passive object could be known only to a single active one, but it is just too hard to program without ever accidentally letting references to passive objects "leak". With the group/vat/island model there is just one kind of object and things are simpler for the programmer (but more complicated for the implementor). I have a limitation that you can only create new objects in your own group or in an entirely new group - I think forcing some other random group to create an object for you is rude, though of course you can always ask for an object there to please do it.
Some of the loaded groups are read/write but many are read-only. The latter don't actually have their own threads but instead their code executes in the thread of the calling group. I have hardware support for this.
Speaking of hardware, I would like to stress how fantastically slow (relatively speaking) main memory is these days. If I have a good network connecting processor cores in a single chip then I can probably send a message from one to another, get a reply, send a second message and get another reply in the time that it takes to read a byte from external RAM. So we should start thinking of DDR SDRAM as a really fast disk to swap objects to/from and not as a shared memory. We should start to take message passing seriously.
-- Jecel
Joshua Gargus wrote:
Thanks for this interesting list of your relevant work. I look forward to any other thoughts that you could add to this thread, in particular to provide a reality-check where your real-world experience disagrees with my theoretical understanding :-)
Sadly, my practical experience is far more limited than it should be, given all the years I have spent on this, since often a project would be abandoned after only preliminary results in favor of a "much better" one.
One thing that I haven't tried yet is making messages to futures return new futures instead of blocking (which I understood from this discussion to be the E way). I had thought about it but imagined it might lead to ever growing graphs of pending messages with no actual work being done. I see now that in practice the overhead might be comparable to what my deadlock detector had. This would probably also make my "tail send" optimization mostly useless, which is a good thing.
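A rough Python sketch of that non-blocking alternative (my own minimal model, not E's actual implementation): a message sent to an unresolved future is queued and immediately answers a fresh future, so the pending-message graph grows until resolution propagates values through it.

```python
# Futures that answer futures: sends never block. Messages to an
# unresolved future are queued with a fresh answer-future; resolving
# the future drains the queue and resolves the answers in turn.

class Future:
    def __init__(self):
        self.value = None
        self.resolved = False
        self.pending = []          # (selector, args, answer_future)

    def send(self, selector, *args):
        answer = Future()
        if self.resolved:
            answer.resolve(getattr(self.value, selector)(*args))
        else:
            self.pending.append((selector, args, answer))
        return answer

    def resolve(self, value):
        self.value, self.resolved = value, True
        for selector, args, answer in self.pending:
            answer.resolve(getattr(value, selector)(*args))
        self.pending.clear()

f = Future()
g = f.send('upper')        # queued; answers another future
h = g.send('strip')        # pipelined on a still-unresolved future
f.resolve('  hello  ')
print(h.value)             # → 'HELLO'
```

Note how `h` exists before any work is done: this is the growing graph of pending messages, and it is paid off all at once when `f` resolves.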
In all my projects I took a very conservative path regarding blocks: I simply defined "." as a kind of barrier where all previous instructions must finish before any of the following instructions can be started. Since I was getting a lot of parallelism even with this, I didn't worry too much about it, and it allowed code like the following to work just fine:
| a |
a := 1.
1 to: 20 do: [ :i | a := a + i ].
a := a - 1.
1 to: 20 do: [ :x | a := a - x ].
^ a
Having "." as a barrier is unnecessary in the code below, but at least the results will be correct even if at a much reduced performance:
| a b |
a := (1 to: 20) collect: [ :i | i * i ].
b := (1 to: 20) collect: [ :x | x + 7 ].
^ a + b
It isn't very hard for the compiler to know which is the case for each example.
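In Python terms (an illustrative sketch, not the actual compiler analysis), the distinction between the two examples looks like this:

```python
# First example: every statement reads and writes `a`, so the "."
# barrier ordering is genuinely required.
from concurrent.futures import ThreadPoolExecutor

a = 1
for i in range(1, 21):
    a += i
a -= 1
for x in range(1, 21):
    a -= x
print(a)   # → 0

# Second example: the two loops touch disjoint data, so a compiler
# may legally overlap them even though "." nominally orders them.
with ThreadPoolExecutor() as pool:
    fa = pool.submit(lambda: [i * i for i in range(1, 21)])
    fb = pool.submit(lambda: [x + 7 for x in range(1, 21)])
result = [p + q for p, q in zip(fa.result(), fb.result())]
```

The compiler only has to check whether consecutive statements share mutable state: the first example does, the second does not.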
-- Jecel
squeak-dev@lists.squeakfoundation.org