Hi Stephane, Hi All,
let me talk a little about the ParcPlace experience, which led to David Leibs' parcels, whose architecture Fuel uses.
In the late 80s/early 90s Peter Deutsch wrote BOSS (Binary Object Storage System), a traditional interpretive pickling system defined by a little bytecoded language. Think of a bytecode as something like "What follows is an object definition, which is its id followed by size info followed by the definitions or ids of its sub-parts, including its class", or "What follows is the id of an already defined object". The loading interpreter looks at the next byte in the stream and that tells it what to do. So the storage is a recursive definition of a graph, much like a recursive grammar for a programming language.
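To make the interpretive style concrete, here is a minimal Python sketch of that kind of loader. It is not BOSS's actual format: the opcode names and byte layout are invented for illustration, and the toy only handles nested lists of integers (so the class reference is left out).

```python
import io
import struct

# Hypothetical opcodes, invented for this sketch.
OP_DEFINE = 0  # "what follows is an object definition"
OP_REF = 1     # "what follows is the id of an already defined object"
OP_INT = 2     # leaf value, only here to keep the toy small

def dump(obj, out, ids):
    """Write obj (nested lists of ints) as a recursive opcode stream."""
    if isinstance(obj, int):
        out.write(struct.pack(">Bq", OP_INT, obj))
    elif id(obj) in ids:
        out.write(struct.pack(">BI", OP_REF, ids[id(obj)]))
    else:
        ids[id(obj)] = len(ids)
        out.write(struct.pack(">BII", OP_DEFINE, ids[id(obj)], len(obj)))
        for part in obj:
            dump(part, out, ids)

def load(inp, table):
    """Look at the next byte, let it decide what to do, recurse for sub-parts."""
    (op,) = struct.unpack(">B", inp.read(1))
    if op == OP_INT:
        return struct.unpack(">q", inp.read(8))[0]
    if op == OP_REF:
        return table[struct.unpack(">I", inp.read(4))[0]]
    obj_id, size = struct.unpack(">II", inp.read(8))
    obj = []
    table[obj_id] = obj  # visible to later references while still half-built
    for _ in range(size):
        obj.append(load(inp, table))
    return obj

shared = [1, 2]
graph = [shared, shared, [3]]
graph.append(graph)  # a cycle back to the root

buf = io.BytesIO()
dump(graph, buf, {})
buf.seek(0)
copy = load(buf, {})
```

Note that the loader has to register each object in its table before the sub-parts are read, so anything that follows a reference during loading can see an object that is only partly filled in.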
This approach is slow (it's a bytecode interpreter) and fragile (structures in the process of being built aren't valid yet; imagine trying to take the hash of a Set that is only half-way through being materialized). But this architecture was very common at the time (I wrote something very similar). The advantage BOSS had was a clumsy hack for versioning. One could specify blocks that were supplied with the version and state of older objects, and these blocks could effect shape change etc. to bring loaded instances up-to-date.
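The versioning idea can be sketched roughly as follows, in Python rather than Smalltalk. The registry, the decorator and the Point shapes are all hypothetical, chosen only to show upgrade functions that receive the stored version and state of an older object.

```python
# Hypothetical registry: (class name, stored version) -> upgrade function.
UPGRADERS = {}
CURRENT_VERSION = {"Point": 3}

def upgrader(class_name, version):
    """Register a function that upgrades state stored at the given version."""
    def register(fn):
        UPGRADERS[(class_name, version)] = fn
        return fn
    return register

@upgrader("Point", 1)
def point_v1_to_v2(state):
    # Version 2 added a z coordinate; old instances default to 0.
    return {**state, "z": 0}

@upgrader("Point", 2)
def point_v2_to_v3(state):
    # Version 3 replaced three fields with a single tuple.
    return {"coords": (state["x"], state["y"], state["z"])}

def load_state(class_name, version, state):
    """Apply upgrade functions in order until the state has the current shape."""
    while version < CURRENT_VERSION[class_name]:
        state = UPGRADERS[(class_name, version)](state)
        version += 1
    return state

upgraded = load_state("Point", 1, {"x": 1, "y": 2})
```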
David Leibs had an epiphany when, in the early 90s, ParcPlace was trying to decompose the VW image (chainsaw was the code name of the VW 2.5 release). If one groups instances by class, one can instantiate in bulk, creating all the instances of a particular class in one go, followed by all the instances of a different class, etc. Then the arc information (the pointers to objects to be stored in the loaded objects' inst vars) can follow the instance information. So now the file looks like: header, names of classes that are referenced (not defined), definitions of classes, definitions of instances (essentially class id, count pairs), arc information. And materializing means finding the classes in the image, creating the classes in the file, creating the instances, stitching the graph together, and then performing any post-load actions (rehashing instances, etc).
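A rough Python sketch of that layout, under heavy simplification: the dictionary keys, the "ref"/"lit" tags and the two example classes are my own assumptions, class definitions in the file and post-load actions are omitted, and the real parcel format is binary.

```python
from collections import defaultdict

class Point:
    def __init__(self, x=None, y=None):
        self.x, self.y = x, y

class Line:
    def __init__(self, start=None, end=None):
        self.start, self.end = start, end

def serialize(objects):
    """Group instances by class, then record the arcs separately from the nodes."""
    by_class = defaultdict(list)
    for obj in objects:
        by_class[type(obj).__name__].append(obj)
    ordered = [o for group in by_class.values() for o in group]
    index = {id(o): i for i, o in enumerate(ordered)}
    clusters = [(name, len(group)) for name, group in by_class.items()]
    arcs = []
    for o in ordered:
        for field, value in vars(o).items():
            if id(value) in index:
                arcs.append((index[id(o)], field, ("ref", index[id(value)])))
            else:
                arcs.append((index[id(o)], field, ("lit", value)))
    return {"clusters": clusters, "arcs": arcs}

def materialize(data, classes):
    """Phase 1: create all instances class by class. Phase 2: stitch the arcs."""
    instances = []
    for name, count in data["clusters"]:
        cls = classes[name]  # stands in for "finding the classes in the image"
        instances.extend(cls.__new__(cls) for _ in range(count))
    for owner, field, (kind, payload) in data["arcs"]:
        value = instances[payload] if kind == "ref" else payload
        setattr(instances[owner], field, value)
    return instances

a, b = Point(0, 0), Point(3, 4)
line = Line(a, b)
data = serialize([a, line, b])
restored = materialize(data, {"Point": Point, "Line": Line})
```

In this sketch no object is touched by the stitching phase until every instance already exists, which is the property that avoids the half-built structures of the interpretive approach.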
Within months we merged with Digitalk (to form DarcPlace-Dodgytalk) and were introduced to TeamV's loading model, which was very much like ImageSegments, being based on the VM's object format. Because an ImageSegment also has imports (references to classes and globals taken from the host system, not defined in the file), performance doesn't just depend on loading the segment into memory. It also depends on how long it takes to search the system to find imports, etc. In practice we found that a) Parcels were 4 times faster than BOSS, and b) they were no slower than Digitalk's image segments. But being independent of the VM's heap format, Parcels had BOSS's flexibility and could support shape change on load, something ImageSegments *cannot do*. I went on to extend parcels with support for shape change, plus support for partial loading of code, but I won't describe that here. Too detailed, even though it's very important.
Mariano spent time talking with me and Fuel's basic architecture is that of parcels, but reimplemented to be nicer, more flexible etc. But essentially Parcels and Fuel are at their core David Leibs' invention. He came up with the ideas of a) grouping objects by class and b) separating the arcs from the nodes.
Now, where ImageSegments are faster than Parcels is *not* loading. Our experience with VW vs TeamV showed us that. But they are faster in collecting the graph of objects to be included. ImageSegments are dead simple. So IMO the right architecture is to use Parcels' segregation and Parcels' "abstract" format (independent of the heap object format) with ImageSegments' computation of the object graph. Igor Stasenko has suggested providing the tracing part of ImageSegments (Dan Ingalls' cool invention: mark the segment root objects, then mark the heap, leaving the objects to be stored unmarked in the shadow of the marked segment roots) as a separate primitive. The result can then be quickly partitioned by class and written by Smalltalk code.
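The marking trick can be sketched on a toy heap in Python. This is only a model of the idea as described above, not the VM primitive: the heap is a dict from object name to the names it references, and all object names are made up.

```python
def trace_segment(heap, system_roots, segment_roots):
    """Return the objects that would go into the segment, roots first."""
    marked = set(segment_roots)  # step 1: mark the segment roots
    stack = [r for r in system_roots if r not in marked]
    marked.update(stack)
    while stack:  # step 2: mark the heap; marking stops at marked objects
        for ref in heap[stack.pop()]:
            if ref not in marked:
                marked.add(ref)
                stack.append(ref)
    segment = list(segment_roots)  # step 3: collect the unmarked "shadow"
    seen = set(segment_roots)
    stack = list(segment_roots)
    while stack:
        for ref in heap[stack.pop()]:
            if ref not in marked and ref not in seen:
                seen.add(ref)
                segment.append(ref)
                stack.append(ref)
    return segment

heap = {
    "globals": ["project", "shared"],
    "project": ["page1", "page2"],
    "page1": ["shared", "text"],
    "page2": ["text"],
    "text": [],
    "shared": [],
}
segment = trace_segment(heap, ["globals"], ["project"])
```

Here "shared" is reachable from outside the segment, so it gets marked and stays in the host system, while "page1", "page2" and "text" are only reachable through the segment root and end up in the segment.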
The loader can then materialize objects using Smalltalk code, can deal with shape change, and not be significantly slower than image segments. Crucially this means that one has a portable, long-lived object storage format, freeing the VM to evolve its object format without breaking image segments with every change to the object format.
I'd be happy to help people working on Fuel by providing that primitive for anyone who wants to try and reimplement the ImageSegment functionality (project saving, class faulting, etc) above Fuel.
On Wed, Oct 22, 2014 at 11:56 AM, Stéphane Ducasse <stephane.ducasse@inria.fr> wrote:
What I can tell you is that the instability caused by having just one single pointer, not among the root objects, pointing to an element in the segment, and the implication of this pointer for reloaded segments (yes, I do not want to have two objects in memory after loading), makes sure that we will not use the IS primitive in Pharo in any future. For us this is a non-feature.
IS was a nice trick, but since having a pointer to an object is so cheap and is the basis of our computational model, this leads to unpredictable side effects. We saw that when Mariano worked on this during the first year of his PhD (which is a kind of LOOM revisit).
Stef
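The "two objects in memory after loading" hazard is not specific to ImageSegments and can be illustrated with any store-and-reload round trip. The sketch below uses Python's standard pickle module purely as a stand-in; it does not model what the IS primitive actually does with such pointers.

```python
import pickle

leaf = {"name": "leaf"}
root = {"name": "root", "child": leaf}  # the graph we store, rooted at `root`
outside = [leaf]  # one pointer into the graph that does not go through the root

# Store the graph and load it back in.
restored = pickle.loads(pickle.dumps(root))

same_object = restored["child"] is outside[0]
equal_state = restored["child"] == outside[0]
```

After the round trip the outside pointer still refers to the original leaf, while the reloaded graph has its own copy: equal state, different identity.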
Mariano spent time talking with me and Fuel's basic architecture is that of parcels, but reimplemented to be nicer, more flexible etc. But essentially Parcels and Fuel are at their core David Leibs' invention. He came up with the ideas of a) grouping objects by class and b) separating the arcs from the nodes.
Indeed it was never our intention to say that it was our idea. I still remember the first time I loaded RB in VW30: 2 s, while normally loading code took the time to cook pasta. I remember that I was still waiting but the code was already loaded. It was a cool feeling. So I always wanted to experiment with that, and one day Mariano came and needed a fast loader and Martin was working on ... a pickle format... What a coincidence :)
Now, where ImageSegments are faster than Parcels is *not* loading. Our experience with VW vs TeamV showed us that. But they are faster in collecting the graph of objects to be included. ImageSegments are dead simple. So IMO the right architecture is to use Parcels' segregation, and Parcels' "abstract" format (independent of the heap object format) with ImageSegment's computation of the object graph. Igor Stasenko has suggested providing the tracing part of ImageSegments (Dan Ingalls' cool invention of mark the segment root objects, then mark the heap, leaving the objects to be stored unmarked in the shadow of the marked segment roots) as a separate primitive. Then this can be quickly partitioned by class and then written by Smalltalk code.
Maybe. For me, if the use of IS is structured (i.e. you control the fact that there will be no pointer to the graph from elements that are not in the roots) then you may have a stable system on reload; otherwise you will have to decide what to do on reload, and this can be a real pain.
The loader can then materialize objects using Smalltalk code, can deal with shape change, and not be significantly slower than image segments. Crucially this means that one has a portable, long-lived object storage format, freeing the VM to evolve its object format without breaking image segments with every change to the object format.
Oh yes! This was what was also worrying to me.
I'd be happy to help people working on Fuel by providing that primitive for anyone who wants to try and reimplement the ImageSegment functionality (project saving, class faulting, etc) above Fuel.
We do not have the resources for that now and will probably have fewer in the future because the cost of students doubled for internships :(
Stef
I wonder how the GemStone guys deal with this.
Eliot,
Thanks for this background, it is very helpful and interesting.
I would also like to put in a good word for Fuel. It is well designed, well documented, and well supported on Squeak and Pharo. Very high quality work.
I use Fuel in RemoteTask (in package CommandShell) for inter-image communication. ReferenceStream also works, and both are supported in RemoteTask. But if you want to have a serializer that you can read and understand, I'd say that Fuel is hard to beat.
I am not advocating anything with respect to image segments, project saving, and so forth, I'm just saying that Fuel is a very good thing. It works well in Squeak, and I suspect that many folks may not be aware of this.
Dave
Hi David,
On Oct 22, 2014, at 5:52 PM, "David T. Lewis" lewis@mail.msen.com wrote:
I am not advocating anything with respect to image segments, project saving, and so forth, I'm just saying that Fuel is a very good thing. It works well in Squeak, and I suspect that many folks may not be aware of this.
Oh I agree. If only ImageSegments weren't used... :-). We use an early version of Fuel at Cadence which is essential to our system. We haven't upgraded as it "just works".
I'd just like to remind everyone, there is another stand-alone serializer available for Squeak called "Ma-Object-Serializer". It was developed from the ground up for _Squeak_ -- meaning, it already supports all the same Squeak-specific preserialization and postmaterialization pickling/unpickling behaviors, like for Project, etc., which are used by ReferenceStream.
There is nothing I would *love* more than for interest from my fellow Squeakers to lead to significant improvements in this serializer as you try to incorporate it into your applications. I think there is some low-hanging fruit (like the nascent #addNewElement:!) to be had simply from everyone's different development views and experience. Such improvements would be directly inherited by Magma!
I looked at trying to incorporate Fuel as the serializer for Magma, to take advantage of its purported speed. But one of the very first things I found was the benchmarks for "the Magma serializer" in the Fuel paper were totally bogus. I had asked Mariano to separate out initialization from serialization and materialization, but since he didn't, the numbers reported are a tiny fraction of their actual speed.
I came to realize that Fuel is really targeted at just two primary use-cases: 1) saving a complete-graph and 2) loading a complete-graph. But Ma-Object-Serializer has the ability to serialize/materialize *partial* graphs by letting the user specify a TraversalStrategy, which is essential for Magma. Unfortunately, Fuel cannot do this.
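To make the idea of a partial graph concrete, here is a minimal sketch in Python (chosen only so it is easy to run). It is not the Ma-Object-Serializer API: `Node`, `serialize_partial`, and the `should_follow` callback are hypothetical stand-ins for the role a TraversalStrategy plays. References the strategy declines to follow are still recorded by id, but the objects behind them are left out of the buffer.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """A toy object: a name plus named references to other objects."""
    name: str
    refs: dict = field(default_factory=dict)


def serialize_partial(root, should_follow):
    """Serialize only the part of the graph the strategy chooses to follow.

    should_follow(target) is a hypothetical stand-in for a traversal strategy.
    Returns (records, boundary): records maps an object id to its slots, and
    boundary holds the ids that are referenced but were not serialized.
    """
    ids = {}          # Python identity -> small integer id
    records = {}      # id -> {"name": ..., "slots": {...}}
    boundary = set()  # ids referenced from the buffer but left outside it

    def id_of(obj):
        return ids.setdefault(id(obj), len(ids))

    pending = [root]  # the root is always serialized
    while pending:
        obj = pending.pop()
        oid = id_of(obj)
        if oid in records:
            continue
        slots = {}
        records[oid] = {"name": obj.name, "slots": slots}
        for slot, target in obj.refs.items():
            tid = id_of(target)
            if should_follow(target):
                slots[slot] = ("in", tid)    # target is part of this buffer
                pending.append(target)
            else:
                slots[slot] = ("out", tid)   # keep the reference, skip the object
                boundary.add(tid)
    return records, boundary


# Example: serialize an order and its customer, but not the customer's history.
history = Node("history")
customer = Node("customer", {"history": history})
order = Node("order", {"customer": customer})
records, boundary = serialize_partial(order, lambda target: target.name != "history")
```

In this toy model the boundary ids are what a later request would use to fetch the missing part of the graph.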
The other innovation of Ma-Object-Serializer is its first-class access to the object graph **in its serialized state**, in the same ways (partial or complete) as when the objects were Smalltalk objects.
If one groups instances by class, one can instantiate in bulk, creating all the instances of a particular class in one go
What does "instantiate in bulk" mean? Doesn't that mean one still must send #new (or #basicNew) to the class for each instance? Why would that be faster?
One aspect of Etoys projects is that they cannot extend the system. It works nicely if you just use Etoys tile scripting. But if you introduce a new class in a project, loading the project in a system that does not have that class will fail. So the use of projects as a distribution system for applications will be limited to a certain version of images.
Karl
On Wed, Oct 22, 2014 at 9:53 PM, Eliot Miranda eliot.miranda@gmail.com wrote:
Within months we merged with Digitalk (to form DarcPlace-Dodgytalk) and were introduced to TeamV's loading model, which was very much like ImageSegments, being based on the VM's object format. Because an ImageSegment also has imports (references to classes and globals taken from the host system, not defined in the file), performance doesn't just depend on loading the segment into memory. It also depends on how long it takes to search the system to find imports, etc. In practice we found that a) Parcels were 4 times faster than BOSS, and b) they were no slower than Digitalk's image segments. But being independent of the VM's heap format, Parcels had BOSS's flexibility and could support shape change on load, something ImageSegments *cannot do*. I went on to extend parcels with support for shape change, plus support for partial loading of code, but I won't describe that here. Too detailed, even though it's very important.
Mariano spent time talking with me and Fuel's basic architecture is that of parcels, but reimplemented to be nicer, more flexible etc. But essentially Parcels and Fuel are at their core David Leibs' invention. He came up with the ideas of a) grouping objects by class and b) separating the arcs from the nodes.
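The two ideas, grouping objects by class and separating the arcs from the nodes, can be sketched in a few lines. This is a toy illustration in Python (used only so it is easy to run), not the Parcels or Fuel format: the classes `Point` and `Segment`, the dictionary layout, and the function names are all assumptions, and post-load actions such as rehashing are omitted.

```python
from collections import defaultdict


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


class Segment:
    def __init__(self, start, end):
        self.start, self.end = start, end


# Classes the loader expects to find in the host system (hypothetical registry).
KNOWN_CLASSES = {"Point": Point, "Segment": Segment}


def is_node(value):
    """True for objects that become nodes; everything else is stored as a literal."""
    return type(value) in KNOWN_CLASSES.values()


def serialize(root):
    # Trace the graph, visiting each object once by identity.
    found, seen, stack = [], set(), [root]
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        found.append(obj)
        stack.extend(v for v in vars(obj).values() if is_node(v))

    # Group instances by class, then number them cluster by cluster.
    clusters = defaultdict(list)
    for obj in found:
        clusters[type(obj).__name__].append(obj)
    ordered = [obj for members in clusters.values() for obj in members]
    index = {id(obj): i for i, obj in enumerate(ordered)}

    # Node section: (class name, count) pairs. Arc section: slot contents.
    nodes = [(name, len(members)) for name, members in clusters.items()]
    arcs = [
        [(slot, ("ref", index[id(v)]) if is_node(v) else ("lit", v))
         for slot, v in vars(obj).items()]
        for obj in ordered
    ]
    return {"nodes": nodes, "arcs": arcs, "root": index[id(root)]}


def materialize(data, classes=KNOWN_CLASSES):
    # Create every instance first, one class at a time, with empty slots.
    instances = []
    for class_name, count in data["nodes"]:
        cls = classes[class_name]
        instances.extend(cls.__new__(cls) for _ in range(count))
    # Then stitch the graph together from the arc section.
    for obj, slots in zip(instances, data["arcs"]):
        for slot, (kind, value) in slots:
            setattr(obj, slot, instances[value] if kind == "ref" else value)
    return instances[data["root"]]


original = Segment(Point(0, 0), Point(3, 4))
data = serialize(original)
copy = materialize(data)
```

Because every instance exists before any arc is filled in, shared and cyclic references need no special handling in the loader, and the loader never has to decode a per-object tag to decide what to create next.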
Now, where ImageSegments are faster than Parcels is *not* loading. Our experience with VW vs TeamV showed us that. But they are faster in collecting the graph of objects to be included. ImageSegments are dead simple. So IMO the right architecture is to use Parcels' segregation, and Parcels' "abstract" format (independent of the heap object format), with ImageSegment's computation of the object graph. Igor Stasenko has suggested providing the tracing part of ImageSegments (Dan Ingalls' cool invention: mark the segment root objects, then mark the heap, leaving the objects to be stored unmarked in the shadow of the marked segment roots) as a separate primitive. The result can then be quickly partitioned by class and written by Smalltalk code.
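The tracing trick can be modelled on a toy heap. The following Python sketch is only an illustration of the marking order described above, not the VM primitive: the heap is a plain dictionary of object names, and `trace_segment` and the example names are hypothetical.

```python
def trace_segment(heap, system_roots, segment_roots):
    """Return (segment, out_pointers) for a toy heap {name: [referenced names]}."""
    # Step 1: pre-mark the segment roots so the marker will not trace through them.
    marked = set(segment_roots)

    # Step 2: an ordinary mark phase from the system roots.
    stack = [r for r in system_roots if r not in marked]
    marked.update(stack)
    while stack:
        obj = stack.pop()
        for ref in heap[obj]:
            if ref not in marked:
                marked.add(ref)
                stack.append(ref)

    # Step 3: whatever is reachable from the segment roots and still unmarked
    # lies in their "shadow" and belongs to the segment. Marked objects that
    # the segment refers to stay outside and become out-pointers (imports).
    segment, out_pointers = set(segment_roots), set()
    stack = list(segment_roots)
    while stack:
        obj = stack.pop()
        for ref in heap[obj]:
            if ref in marked:
                if ref not in segment:
                    out_pointers.add(ref)
            elif ref not in segment:
                segment.add(ref)
                stack.append(ref)
    return sorted(segment), sorted(out_pointers)


heap = {
    "Smalltalk": ["project", "String"],  # system root
    "project": ["world", "String"],      # segment root
    "world": ["morph1", "morph2"],
    "morph1": ["String"],
    "morph2": [],
    "String": [],
}
segment, imports = trace_segment(heap, ["Smalltalk"], ["project"])

# One extra reference from outside the segment to an interior object
# changes what ends up in the segment.
heap["Smalltalk"].append("morph2")
segment2, imports2 = trace_segment(heap, ["Smalltalk"], ["project"])
```

In the second run `morph2` is reachable from outside, so in this toy model it is marked, drops out of the segment, and turns into an import. That is one way to picture the sensitivity to a single stray pointer raised elsewhere in this thread.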
The loader can then materialize objects using Smalltalk code, can deal with shape change, and need not be significantly slower than image segments. Crucially this means that one has a portable, long-lived object storage format, freeing the VM to evolve its object format without breaking image segments with every change to the object format.
I'd be happy to help people working on Fuel by providing that primitive for anyone who wants to try and reimplement the ImageSegment functionality (project saving, class faulting, etc) above Fuel.
That is not a limitation of ImageSegments per se, but just how they are used in Etoys.
- Bert -
On 26.10.2014, at 05:30, karl ramberg karlramberg@gmail.com wrote:
One aspect of Etoys projects is that they cannot extend the system. It works nicely if you just use Etoys tile scripting. But if you introduce a new class in a project, loading the project in a system that does not have that class will fail. So the use of projects as a distribution system for applications will be limited to a certain version of images.
Karl
On Sun, Oct 26, 2014 at 12:35 PM, Bert Freudenberg bert@freudenbergs.de wrote:
That is not a limitation of ImageSegments per se, but just how they are used in Etoys.
I agree up to a point. But what if the serializer were able to serialize classes as well? Fuel is able to serialize classes, traits, closures, compiled methods, etc. Of course there are scenarios where this becomes complicated, but for the average case it works.
On Oct 26, 2014, at 8:41 AM, Mariano Martinez Peck marianopeck@gmail.com wrote:
On Sun, Oct 26, 2014 at 12:35 PM, Bert Freudenberg bert@freudenbergs.de wrote:
That is not a limitation of ImageSegments per se, but just how they are used in Etoys.
I agree up to a point. But what if the serializer were able to serialize classes as well? Fuel is able to serialize classes, traits, closures, compiled methods, etc. Of course there are scenarios where this becomes complicated, but for the average case it works.
There's no restriction on the kinds of object ImageSegments can store either. Including classes, contexts, etc. presents no problem (*). The only real difference is that ImageSegments are tied to the VM's object representation whereas systems like Fuel are portable. That's a key advantage.
(*) I realize that Cog's implementations are currently broken w.r.t. Contexts and should be fixed. Because of context-to-stack mapping the segment writer should be careful to store a married context as an unmarried context. Luckily this is just a small matter of programming ;-)
vm-dev@lists.squeakfoundation.org