Hi Colin,


I think you have mentioned a number of parts of the problem already: Schema Management, Address Space Management, and Version Management.


Schema management gives you a protocol for defining the source and destination schemas, along with any information that might be useful for transferring the data to a new environment.
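
To make that concrete, here's a toy Python sketch of a schema pair plus a transfer mapping (the field names and the mapping format are made up for illustration, not any particular protocol):

```python
# Hypothetical source and destination schema descriptions.
source_schema = {"Person": {"fields": ["name", "birth_year"]}}
dest_schema = {"Person": {"fields": ["full_name", "age"]}}

# The transfer protocol: how each destination field is derived from the source.
# (The "as of 2015" age calculation is an arbitrary example rule.)
mapping = {
    "Person": {
        "full_name": lambda rec: rec["name"],
        "age": lambda rec: 2015 - rec["birth_year"],
    }
}

def transfer(kind, record):
    """Apply the mapping to move one record into the destination schema."""
    return {field: fn(record) for field, fn in mapping[kind].items()}

print(transfer("Person", {"name": "Alice", "birth_year": 1980}))
# {'full_name': 'Alice', 'age': 35}
```

The mapping is the interesting part: it carries exactly that extra "information useful for transferring the data" alongside the two schemas.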


Address Space Management allows you to move data from one location to another while maintaining links and data integrity.  What I think of here is that as long as the address space is well defined (move all data from UID to Header+UID), the links can maintain themselves.  It's also possible to let nodes move and manage data themselves instead of having one big process that does it for you.  For example, say you have an address space that assigns nodes general regions instead of hard addresses.  That would allow you to duplicate data across nodes, with each node grabbing the data closest to its own address.  Finding data in that structure means looking for nodes whose addresses are nearest the data's key, using a stored index of some kind.  See Distributed Hash Tables for more info.  What's nice about this method is that the actual structure is not important: nodes can be deleted without losing data, and nodes can maintain their own address space and storage and heal themselves.
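
A toy Python sketch of the "nodes grab the closest data" idea (a real DHT like Chord or Kademlia does much more; the hash, the 1024-slot ring, and the distance metric here are simplifying assumptions):

```python
import hashlib

def addr(name):
    """Map any string into a small circular address space (1024 slots)."""
    return int(hashlib.sha256(name.encode()).hexdigest(), 16) % 1024

def closest_node(nodes, key):
    """The node whose address is nearest the key's address stores the data."""
    k = addr(key)
    return min(nodes, key=lambda n: min((addr(n) - k) % 1024, (k - addr(n)) % 1024))

nodes = ["node-a", "node-b", "node-c"]
placement = {key: closest_node(nodes, key) for key in ["tree-1", "tree-2", "tree-3"]}

# Deleting a node doesn't lose data (assuming it was duplicated):
# its keys simply re-home to the next closest node.
nodes.remove("node-b")
rehomed = {key: closest_node(nodes, key) for key in placement}
```

Nothing here depends on the overall structure; each node only needs its own address and an index of the keys near it.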


Version Management is good for adding links to data that point to versions of objects and allows you to find the latest version.  You could have maps of versions that give you a point-in-time view of the data, much like Monticello Configurations.
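
A tiny sketch of what those version maps might look like (the names and structure are invented, just to echo the Monticello Configurations idea):

```python
# Each object maps to an ordered list of its versions.
versions = {
    "Kernel": ["Kernel-1", "Kernel-2", "Kernel-3"],
    "Collections": ["Collections-1", "Collections-2"],
}

def latest(name):
    """Follow the version links to the newest version of an object."""
    return versions[name][-1]

# A configuration pins a point-in-time snapshot across objects.
config_2015 = {"Kernel": "Kernel-2", "Collections": "Collections-1"}

print(latest("Kernel"))       # Kernel-3
print(config_2015["Kernel"])  # Kernel-2
```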


Combining them all is an interesting idea.  A query of data might first find the right version in time, locate the right data, and then apply the right transformation to give you an answer.  Copying may select just a single version, apply a transformation, and then copy the data to a new address space.
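
As a sketch, that combined query might look like a three-stage pipeline (the stage names are mine, and each stage is stubbed out in Python):

```python
def query(name, config, locate, transform):
    version = config[name]   # 1. find the right version in time
    data = locate(version)   # 2. locate the right data
    return transform(data)   # 3. apply the right transformation

# Stub storage standing in for the address-space lookup.
storage = {"Kernel-2": {"classes": 120}}

result = query(
    "Kernel",
    config={"Kernel": "Kernel-2"},     # version management
    locate=storage.__getitem__,        # address space management
    transform=lambda d: d["classes"],  # schema management
)
print(result)  # 120
```

A copy would swap the final step: select one version, transform it, then write it into the new address space.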


That’s my take at least.


All the best,


Ron Teitelbaum

Head Of Engineering

3D Immersive Collaboration Consulting

ron@3Dicc.com

Follow Me On Twitter: @RonTeitelbaum

www.3Dicc.com

https://www.google.com/+3Dicc


From: squeak-dev-bounces@lists.squeakfoundation.org [mailto:squeak-dev-bounces@lists.squeakfoundation.org] On Behalf Of Colin Putney
Sent: Tuesday, December 29, 2015 3:47 PM
To: The general-purpose Squeak developers list
Subject: [squeak-dev] [OT] What is this called?


Hi folks,


Sorry for the off-topic post. I'm posting it here because I know there are lots of high-powered comp-sci folks lurking, and I hope to benefit from their wisdom.


I have in mind a simple problem. Imagine we have a data structure in the memory of some program. The exact structure it has doesn't matter, but it does have structure; it's not just a buffer full of bytes. For argument's sake, we'll assume it's a tree.


Our task is to copy this tree. We want to make another tree in memory that is logically equivalent to the first. This is pretty straightforward. We need to walk the tree, allocating new nodes and copying over any internal values they contain, then recursing into the children. So far so good. But what if we generalize the problem?
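
That walk, as a hedged Python sketch (assuming a node is simply a value plus a list of children):

```python
class Node:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def copy_tree(node):
    """Allocate a new node, copy its value, then recurse into the children."""
    return Node(node.value, [copy_tree(c) for c in node.children])

root = Node(1, [Node(2), Node(3, [Node(4)])])
clone = copy_tree(root)
assert clone is not root and clone.children[1].children[0].value == 4
```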


One way to do that would be to make the copy more distant from the original. We could copy into a different address space, in another process or on another machine. We could make the copy more distant in time. Perhaps we need to reclaim the memory that our tree uses, and then reconstruct it later.


This means we'll have to do IO of some kind. The simplest adaptation of our tree-walking algorithm might be to allocate space for the nodes on disk, rather than in memory. This is really simple if we have access to raw disk storage, but gets a little more complicated (and slooow) if we're using files in a file system. We can gain back some simplicity by putting all the nodes in one file, and giving the file some kind of internal structure that lets us recover the nodes and links between them. We could also allocate storage space by communicating with another process. That might be over a network, or using some kind of IPC mechanism supplied by the operating system. Once again we'll need to transmit the data within the nodes, along with some metadata describing the connections between them.
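
One possible internal structure for that single file: a pre-order walk where each record stores a node's value and its child count, which is enough metadata to recover the links (the record format here is invented for illustration, sketched in Python):

```python
class Node:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def dump(node, out):
    """Pre-order: append (value, child count), then each child's records."""
    out.append((node.value, len(node.children)))
    for c in node.children:
        dump(c, out)
    return out

def load(records, i=0):
    """Rebuild a node (and its subtree) from the flat record list."""
    value, count = records[i]
    node, i = Node(value), i + 1
    for _ in range(count):
        child, i = load(records, i)
        node.children.append(child)
    return node, i

records = dump(Node(1, [Node(2), Node(3, [Node(4)])]), [])
rebuilt, _ = load(records)  # the records could round-trip through a file or a socket
```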


We can also loosen our definition of "logically equivalent". We might want to construct an equivalent tree in a process running a different version of our program. Or another program entirely, potentially written in another programming language. This forces us to raise the level of abstraction. We can't just copy each node as a blob of raw memory, and assume that the "receiving" program will know how to interpret it. We need some semantic definitions agreed upon by the two programs, and some means of representing those semantics as byte sequences that can be copied between them.
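
A common concrete form of that agreement is to encode the tree as JSON, where the shared semantics live entirely in the field names both programs agree on (the "value" and "children" names are assumptions for illustration):

```python
import json

class Node:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

# The agreed semantics: a node is an object with "value" and "children" fields.
def encode(node):
    return {"value": node.value, "children": [encode(c) for c in node.children]}

wire = json.dumps(encode(Node(1, [Node(2)])))
# The "receiving" program, possibly in another language, interprets the same fields.
received = json.loads(wire)
assert received["children"][0]["value"] == 2
```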


Now, this is not an intractable problem. We do it all the time. In fact, I'd say a large fraction of the code I've written in my career has been a solution to some version of this problem. I'm sure many of you can relate. :-)


So what is this problem called? What theory describes the possible solutions? Are there classes of solutions that have similar trade-offs? Where can I learn more?


Colin