On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse <stephane.ducasse@inria.fr> wrote:
I would like to be sure that we can have
- a bit for immutable objects
- bits for experimenting.
There will be quite a few. And one will be able to steal bits from the class field if one needs fewer classes. I'm not absolutely sure of the layout yet. But for example
8: slot size (255 => extra header word with large size)
3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for pointer, 7 => # fixed fields is in the class)
4: format (pointers, indexable, bytes/shorts/longs/doubles indexable, compiled method, ephemerons, weak, etc)
1: immutability
3: GC (2 mark bits, 1 forwarded bit)
20: identity hash
20: class index
still leaves 5 bits unused. And stealing 4 bits each from class index still leaves 64k classes. So this format is simple and provides lots of unused bits. The format field is a great idea as it combines a number of orthogonal properties in very few bits. I don't want to include odd bytes in format because I think a separate field that holds odd bytes and fixed fields is better use of space. But we can gather statistics before deciding.
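To make the arithmetic concrete, here is a C sketch of one way such a 64-bit header could be packed. The bit positions are my own illustrative assumption; the mail is explicit that the layout is not final.

```c
#include <stdint.h>

/* One illustrative packing of the tentative 64-bit header (positions assumed):
   8 slot size + 3 odd bytes/fixed fields + 4 format + 1 immutable
   + 3 GC + 20 identity hash + 20 class index = 59 bits, 5 unused. */
enum {
    ClassIndexShift = 0,  ClassIndexBits = 20,
    IdentHashShift  = 20, IdentHashBits  = 20,
    GCBitsShift     = 40, GCBitsBits     = 3,
    ImmutableShift  = 43, ImmutableBits  = 1,
    FormatShift     = 44, FormatBits     = 4,
    OddBytesShift   = 48, OddBytesBits   = 3,
    SlotSizeShift   = 51, SlotSizeBits   = 8   /* 255 => overflow size word */
};

/* read a header field */
static uint64_t field(uint64_t header, unsigned shift, unsigned bits)
{
    return (header >> shift) & ((1ULL << bits) - 1);
}

/* answer a copy of header with one field replaced */
static uint64_t withField(uint64_t header, unsigned shift, unsigned bits, uint64_t v)
{
    uint64_t mask = ((1ULL << bits) - 1) << shift;
    return (header & ~mask) | ((v << shift) & mask);
}
```

Stealing bits from the class index, as discussed, is then just a matter of shrinking ClassIndexBits and reusing the freed positions.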
Stef
On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote:
Hi guys
Here is an important topic I would like to see discussed so that we see
how we can improve and join forces.
Maybe a mail discussion and then a wiki for the summary would be good.
stef
Begin forwarded message:
From: Eliot Miranda <eliot.miranda@gmail.com>
Subject: Re: Plan/discussion/communication around new object format
Date: May 27, 2012 10:49:54 PM GMT+02:00
To: Stéphane Ducasse <stephane.ducasse@inria.fr>
On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse <
stephane.ducasse@inria.fr> wrote:
Hi eliot
do you have a description of the new object format you want to
introduce?
The design is in the class comment of CogMemoryManager in the Cog
VMMaker packages.
Then what is your schedule?
This is difficult. I have made a small start and should be able to
spend time on it starting soon. I want to have it finished by early next year. But it depends on work schedules etc.
I would like to see if we can allocate igor/esteban time before we run
out of money
to help on that important topic. Right now the solution is unclear and I did not see any document where we can evaluate and plan how we can help. So do you want help on that topic? And if so, how can people contribute?
The first thing to do is to read the design document, to see if the
Pharo community thinks it is the right direction, and to review it, spot deficiencies etc. So please get those interested to read the class comment of CogMemoryManager in the latest VMMaker.oscog.
Here's the current version of it:
CogMemoryManager is currently a place-holder for the design of the new
Cog VM's object representation and garbage collector. The goals for the GC are
- efficient object representation a la Eliot Miranda's VisualWorks
64-bit object representation that uses a 64-bit header, eliminating direct class references so that all objects refer to their classes indirectly. Instead the header contains a constant class index, in a field smaller than a full pointer. These class indices are used in inline and first-level method caches, hence they do not have to be updated on GC (although they do have to be traced to be able to GC classes). Classes are held in a sparse weak table. The class table needs to be indexed by an instance's class index only in class hierarchy search, in the class primitive, and in tracing live objects in the heap.

The additional header space is allocated to a much expanded identity hash field, reducing hash efficiency problems in identity collections due to the extremely small (11 bit) hash field in the old Squeak GC. The identity hash field is also a key element of the class index scheme. A class's identity hash is its index into the class table, so to create an instance of a class one merely copies its identity hash into the class index field of the new instance. This implies that when classes gain their identity hash they are entered into the class table, and that their identity hash is that of a previously unused index in the table. It also implies that there is a maximum number of classes in the table; at least for a few years 64k classes should be enough. A class is entered into the class table in the following operations:
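A tiny C sketch of the instantiation side of this scheme (all names here are mine, for illustration only): a class's identity hash is its index in the class table, so instantiation just copies it into the instance's class index field.

```c
#include <stdint.h>
#include <stddef.h>

#define CLASS_TABLE_SIZE (1u << 16)   /* "64k classes should be enough" */

typedef struct object { uint32_t classIndex, identityHash; } object;

static object *classTable[CLASS_TABLE_SIZE];  /* weak in the real VM */

/* entering a class: its identity hash becomes a previously unused index.
   A linear scan is a gross simplification; indices 0-7 are reserved. */
static uint32_t enterClass(object *aClass)
{
    for (uint32_t i = 8; i < CLASS_TABLE_SIZE; i++)
        if (classTable[i] == NULL) {
            classTable[i] = aClass;
            aClass->identityHash = i;
            return i;
        }
    return 0;  /* table full */
}

/* instantiation: copy the class's identity hash into the class index field */
static void setClassOf(object *instance, object *aClass)
{
    instance->classIndex = aClass->identityHash;
}

static object *classOf(object *instance)
{
    return classTable[instance->classIndex];
}
```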
	behaviorHash
	adoptInstance
	instantiate
	become (i.e. if an old class becomes a new class):
		if the target's class field = the original's identity hash
		and the replacement's identity hash is zero,
		enter the replacement in the class table
behaviorHash is a special version of identityHash that must be
implemented in the image by any object that can function as a class (i.e. Behavior).
- more immediate classes. An immediate Character class would speed up
String accessing, especially for WideString, since no instantiation needs to be done on at:put: and no dereference need be done on at:. In a 32-bit system tag checking is complex since it is thought important to retain 31-bit SmallIntegers. Hence, as in current Squeak, the least significant bit set implies a SmallInteger, but Characters would likely have a tag pattern of xxx10. Hence masking with 11 results in two values for SmallInteger, xxx01 and xxx11. 30-bit characters are more than adequate for Unicode. In a 64-bit system we can use the full three bits and usefully implement an immediate Float. As in VisualWorks a functional representation takes three bits away from the exponent. Rotating to put the sign bit in the least significant non-tag bit makes expanding and contracting the 8-bit exponent to the 11-bit IEEE double exponent easy and makes comparing negative and positive zero easier (an immediate Float is zero if its unsigned 64 bits are < 16). So the representation looks like
| 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
For details see "60-bit immediate Floats" below.
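The 32-bit tag checks described above can be sketched in C; the constructor/accessor names are hypothetical, but the tag patterns (low bit set for SmallInteger, xx10 for Character) follow the text.

```c
#include <stdint.h>

typedef uint32_t oop;

/* SmallInteger: any oop with the least significant bit set (xx01 and xx11) */
static int isSmallInteger(oop o) { return (o & 1) != 0; }

/* Character: tag pattern xx10, i.e. masking with binary 11 answers 2 */
static int isCharacter(oop o) { return (o & 3) == 2; }

/* untagged object pointers have both low bits clear */
static int isPointer(oop o) { return (o & 3) == 0; }

static oop     smallIntegerFor(int32_t v) { return ((oop)v << 1) | 1; }
static int32_t smallIntegerValue(oop o)   { return (int32_t)o >> 1; }  /* 31-bit value; assumes arithmetic shift */

static oop      characterFor(uint32_t codePoint) { return (codePoint << 2) | 2; }
static uint32_t characterValue(oop o)            { return o >> 2; }    /* 30-bit value */
```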
- efficient scavenging. The current Squeak GC uses a slow
pointer-reversal collector that writes every field in live objects three times in each collection, twice in the pointer-reversing heap traversal to mark live objects and once to update the pointer to its new location. A scavenger writes every field of live data twice in each collection, once as it does a block copy of the object when copying to to space, once as it traverses the live pointers in the to space objects. Of course the block copy is a relatively cheap write.
- lazy become. The JIT's use of inline caching provides a cheap way
of avoiding scanning the heap as part of a become (which is the simple approach to implementing become in a system with direct pointers). A becomeForward: on a (set of) non-zero-sized object(s) turns the object into a "corpse" or "forwarding object" whose first (non-header) word/slot is replaced by a pointer to the target of the becomeForward:. The corpse's class index is set to one that identifies corpses and, because it is a hidden class index, will always fail an inline cache test. The inline cache failure code is then responsible for following the forwarding pointer chain (these are Iliffe vectors :) ) and resolving to the actual target. We have yet to determine exactly how this is done (e.g. change the receiver register and/or stack contents and retry the send, perhaps scanning the current activation). See below on how we deal with becomes on objects with named inst vars. Note that we probably don't have to worry about zero-sized objects. These are unlikely to be passed through the FFI (there is nothing to pass :) ) and so will rarely be becommed. If they are, they can become slowly. Alternatively we can insist that objects are at least 16 bytes in size (see 8-byte alignment below) so that there will always be space for a forwarding pointer. Since none of the immediate classes can have non-immediate instances, and since we allocate the immediate class indices corresponding to their tag patterns (SmallInteger = 1, Character = 3, SmallFloat = 4?), we can use all the class indices from 0 to 7 for special uses, e.g. 0 = forward and 1 = header-sized filler.
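A minimal C sketch of corpse following, as the inline cache miss path must do after a becomeForward: (the struct layout is a stand-in for illustration, not the VM's):

```c
#include <stdint.h>
#include <stddef.h>

enum { ForwardedClassIndex = 0 };  /* "0 = forward" among the reserved indices */

typedef struct object {
    uint32_t classIndex;
    struct object *firstSlot;  /* in a corpse, the forwarding pointer */
} object;

/* follow a chain of corpses to the live target */
static object *followForwarded(object *o)
{
    while (o->classIndex == ForwardedClassIndex)
        o = o->firstSlot;
    return o;
}
```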
- pinning. To support a robust and easy-to-use FFI the memory manager
must support temporary pinning where individual objects can be prevented from being moved by the GC for as long as required, either by being one of an in-progress FFI call's arguments, or by having pinning asserted by a primitive (allowing objects to be passed to external code that retains a reference to the object after returning). Pinning probably implies a per-object "is-pinned" bit in the object header. Pinning will be done via lazy become; i.e. an object in new space will be becommed into a pinned object in old space. We will only support pinning in old space.
- efficient old space collection. An incremental collector (a la
Dijkstra's three colour algorithm) collects old space, e.g. via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum. (see free space/free list below)
- 8-byte alignment. It is advantageous for the FFI, for floating-point
access, for object movement and for 32/64-bit compatibility to keep object sizes in units of 8 bytes. For the FFI, 8-byte alignment means passing objects to code that expects that requirement (such as modern x86 numeric processing instructions). This implies that
	- the starts of all spaces are aligned on 8-byte boundaries
	- object allocation rounds up the requested size to a multiple of 8 bytes
	- the overflow size field is also 8 bytes
We shall probably keep the minimum object size at 16 bytes so that there is always room for a forwarding pointer. But this implies that we will need an 8-byte filler to fill the holes left after objects > 16 bytes whose length mod 16 bytes is 8 bytes, and the holes following pinned objects. We can do this using a special class index, e.g. 1, so that the methods that answer the size and start of an object look like, e.g.
	chunkSizeOf: oop
		<var: #oop type: #'object *'>
		^object classIndex = 1
			ifTrue: [BaseHeaderSize]
			ifFalse: [BaseHeaderSize
					+ (object slotSize = OverflowSlotSize
						ifTrue: [OverflowSizeBytes]
						ifFalse: [0])
					+ (object slotSize * BytesPerSlot)]

	chunkStartOf: oop
		<var: #oop type: #'object *'>
		^(self cCoerceSimple: oop to: #'char *')
			- ((object classIndex = 1
				or: [object slotSize ~= OverflowSlotSize])
					ifTrue: [0]
					ifFalse: [OverflowSizeBytes])
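As a cross-check of the alignment rules, here is a small C helper (my naming, not the VM's) that rounds a requested byte size to the 8-byte unit with the 16-byte minimum that leaves room for a forwarding pointer:

```c
#include <stdint.h>

/* round a requested byte size up to the 8-byte allocation unit,
   keeping the 16-byte minimum object size */
static uint64_t allocationBytes(uint64_t nbytes)
{
    uint64_t rounded = (nbytes + 7) & ~7ULL;  /* 8-byte alignment */
    return rounded < 16 ? 16 : rounded;       /* room for a forwarding pointer */
}
```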
For the moment we do not tackle the issue of heap growth and shrinkage
with the ability to allocate and deallocate heap segments via memory-mapping. This technique allows space to be released back to the OS by unmapping empty segments. We may revisit this but it is not a key requirement for the first implementation.
The basic approach is to use a fixed size new space and a growable old
space. The new space is a classic three-space nursery a la Ungar's Generation Scavenging, a large eden for new objects and two smaller survivor spaces that exchange roles on each collection, one being the to space to which surviving objects are copied, the other being the from space of the survivors of the previous collection, i.e. the previous to space.
To provide apparent pinning in new space we rely on lazy become. Since
most pinned objects will be byte data and these do not require stack zone activation scanning, the overhead is simply an old space allocation and corpsing.
To provide pinning in old space, large objects are implicitly pinned (because it is expensive to move large objects and, because they are both large and relatively rare, they contribute little to overall fragmentation - as in aggregates, small objects can be used to fill in the spaces between large objects). Hence, objects above a particular size are automatically allocated in old space, rather than new space, and the pin bit will be set automatically when allocating a large object. Small objects are pinned as per objects in new space, by asserting the pin bit. As a last resort, or by programmer control (the fullGC primitive), old space is collected via mark-sweep (mark-compact) and so the mark phase must build the list of pinned objects around which the sweep/compact phase must carefully step.
Free space in old space is organized by a free list/free tree as in
Eliot's VisualWorks 5i old space allocator. There are 64 free lists, indices 1 through 63 holding blocks of space of that size, index 0 holding a semi-balanced ordered tree of free blocks, each node being the head of the list of free blocks of that size. At the start of the mark phase the free list is thrown away and the sweep phase coalesces free space and steps over pinned objects as it proceeds. We can reuse the forwarding pointer compaction scheme used in the old collector. Incremental collections merely move unmarked objects to the free lists (as well as nilling weak pointers in weak arrays and scheduling them for finalization). The occupancy of the free lists is represented by a bitmap in a 64-bit integer so that an allocation of size 63 or less can know whether there exists a free chunk of that size, but more importantly can know whether a free chunk larger than it exists in the fixed-size free lists without having to search all larger free list heads.
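The occupancy bitmap trick can be sketched in C. __builtin_ctzll is a GCC/Clang builtin (an assumption about the toolchain); the real VM would use whatever count-trailing-zeros idiom its compilers provide.

```c
#include <stdint.h>

/* bit i of mask set => the free list for chunks of size i (1..63) is non-empty */
static unsigned smallestFreeListAtLeast(uint64_t mask, unsigned n)  /* 1 <= n <= 63 */
{
    uint64_t candidates = mask & ~((1ULL << n) - 1);  /* drop lists smaller than n */
    if (candidates == 0)
        return 0;  /* nothing adequate: fall back to the tree at index 0 */
    return (unsigned)__builtin_ctzll(candidates);     /* smallest adequate list */
}
```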
The incremental collector (a la Dijkstra's three colour algorithm)
collects old space via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum. [N.B. Not sure how to do this yet. The incremental collector needs to complete a pass often enough to reclaim objects, but infrequently enough not to waste time. So some form of feedback should work. In VisualWorks tracing is broken into quanta of work where image-level code determines the size of a quantum based on how fast the machine is, and how big the heap is. This code could easily live in the VM, controllable through vmParameterAt:put:. An alternative would be to use the heartbeat to bound quanta by time. But in any case some amount of incremental collection would be done on old space allocation and scavenging, the amount being chosen to keep pause times acceptably short, and at a rate to reclaim old space before a full GC is required, i.e. at a rate proportional to the growth in old space]. The incremental collector is a state machine, being either marking, nilling weak pointers, or freeing. If nilling weak pointers is not done atomically then there must be a read barrier in weak array at: so that reading from an old space weak array that is holding stale un-nilled references to unmarked objects answers nil. Tricks such as including the weak bit in bounds calculations can make this cheap for non-weak arrays. Alternatively nilling weak pointers can be made an atomic part of incremental collection, which can be made cheaper by maintaining the set of weak arrays (e.g. on a list).
The incremental collector implies a more complex write barrier.
Objects are of three colours: black, having been scanned; grey, being scanned; and white, unreached. A mark stack holds the grey objects. If the incremental collector is marking and an unmarked white object is stored into a black object then the stored object must become grey, being added to the mark stack. So the write barrier is essentially
	target isYoung ifFalse:
		[newValue isYoung
			ifTrue:
				[target isInRememberedSet ifFalse:
					[target addToRememberedSet
					 "target now refers to a young object; it is a root for scavenges"]]
			ifFalse:
				[(target isBlack and: [igc marking and: [newValue isWhite]]) ifTrue:
					[newValue beGrey
					 "add newValue to the IGC's markStack for subsequent scanning"]]]
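The same barrier in C, with a deliberately toy object model (the booleans, colour enum, and fixed-size mark stack are stand-ins for the real header bits and remembered set):

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { White, Grey, Black } Colour;

typedef struct object {
    bool isYoung, inRememberedSet;
    Colour colour;
    struct object *slots[4];
} object;

static bool igcMarking;           /* is the incremental collector marking? */
static object *markStack[256];    /* grossly simplified IGC mark stack */
static int markStackTop;

static void storePointer(object *target, size_t index, object *newValue)
{
    target->slots[index] = newValue;
    if (target->isYoung)
        return;                                  /* no barrier work for new space stores */
    if (newValue->isYoung) {
        if (!target->inRememberedSet)
            target->inRememberedSet = true;      /* target is now a root for scavenges */
    } else if (target->colour == Black && igcMarking && newValue->colour == White) {
        newValue->colour = Grey;                 /* re-queue for scanning */
        markStack[markStackTop++] = newValue;
    }
}
```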
The incremental collector does not detect already marked objects all of
whose references have been overwritten by other stores (e.g. in the above if newValue overwrites the sole remaining reference to a marked object). So the incremental collector only guarantees to collect all garbage created in cycle N at the end of cycle N + 1. The cost is hence slightly worse memory density but the benefit, provided the IGC works hard enough, is the elimination of long pauses due to full garbage collections, which become actions of last resort or programmer desire.
Lazy become.
As described earlier the basic idea behind lazy become is to use
corpses (forwarding objects) that are followed lazily during GC and inline cache miss. However, a lazy scheme cannot be used on objects with named inst vars without adding checking to all inst var accesses, which we judge too expensive. Instead, when becomming objects with named inst vars, we scan all activations in the stack zone, eagerly becomming these references, and we check for corpses when faulting in a context into the stack zone. Essentially, the invariant is that there are no references to corpses from the receiver slots of stack activations. A detail is whether we allow or forbid pinning of closure indirection vectors, or scan the entire stack of each activation. Using a special class index pun for indirection vectors is a cheap way of preventing their becomming/pinning etc. Although "don't do that" (don't attempt to pin/become indirection vectors) is also an acceptable response.
60-bit immediate Floats

Representation for immediate doubles, only used in the 64-bit implementation. Immediate doubles have the same 52-bit mantissa as IEEE double-precision floating-point, but only have 8 bits of exponent. So they occupy just less than the middle 1/8th of the double range. They overlap the normal single-precision floats, which also have 8-bit exponents, but exclude the single-precision denormals (exponent -127) and the single-precision NaNs (exponent +127). +/- zero is just a pair of values with both exponent and mantissa 0.
So the non-zero immediate doubles range from
	+/- 0x3800,0000,0000,0001 / 5.8774717541114d-39
to
	+/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38
The encoded tagged form has the sign bit moved to the least significant bit, which allows for faster encode/decode because offsetting the exponent can't overflow into the sign bit and because testing for +/- 0 is an unsigned compare for <= 0xf:
	msb                                                          lsb
	[8 exponent subset bits][52 mantissa bits][1 sign bit][3 tag bits]
So assuming the tag is 5, the tagged non-zero bit patterns are
	0x0000,0000,0000,001[d/5] to 0xffff,ffff,ffff,fff[d/5]
and +/- 0d is
	0x0000,0000,0000,000[d/5]
Encode/decode of non-zero values in machine code looks like:

	msb                                                       lsb
	Decode:
	                         [8expsubset][52mantissa][1s][3tags]
	shift away tags:     [   000   ][8expsubset][52mantissa][1s]
	add exponent offset: [      11 exponent     ][52mantissa][1s]
	rot sign:            [1s][      11 exponent     ][52mantissa]

	Encode:
	                     [1s][      11 exponent     ][52mantissa]
	rot sign:            [      11 exponent     ][52mantissa][1s]
	sub exponent offset: [   000   ][8expsubset][52mantissa][1s]
	shift:               [8expsubset][52mantissa][1s][   000   ]
	or/add tags:             [8expsubset][52mantissa][1s][3tags]

but this is slower in C because a) there is no rotate, and b) raw conversion between double and quadword must (at least in the source) move bits through memory (quadword = *(q64 *)&doubleVariable).
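The encode/decode above, transliterated to C under the stated assumptions (tag = 5, exponent subset 0x380..0x47f): note the rotate has to be composed from shifts, and the double/quadword moves go through memcpy, which is exactly why the text calls the C version slower.

```c
#include <stdint.h>
#include <string.h>

#define TAG_BITS   3
#define FLOAT_TAG  5ULL                      /* assumed SmallFloat tag pattern */
#define EXP_OFFSET (0x380ULL << 53)          /* smallest immediate exponent, pre-shifted */

/* C has no rotate operator, so compose one from shifts */
static uint64_t rotLeft1 (uint64_t x) { return (x << 1) | (x >> 63); }
static uint64_t rotRight1(uint64_t x) { return (x >> 1) | (x << 63); }

/* does d fall in the representable middle 1/8th of the double range (or +/- 0)? */
static int isImmediateFloat(double d)
{
    uint64_t raw, exp, mant;
    memcpy(&raw, &d, sizeof raw);            /* bits move through memory */
    exp  = (raw >> 52) & 0x7ff;
    mant = raw & ((1ULL << 52) - 1);
    return (raw & ~(1ULL << 63)) == 0        /* +/- 0 */
        || (exp == 0x380 ? mant != 0         /* range starts at 0x3800...0001 */
                         : exp > 0x380 && exp <= 0x47f);
}

static uint64_t encodeImmediateFloat(double d)  /* caller checks isImmediateFloat */
{
    uint64_t raw, rot;
    memcpy(&raw, &d, sizeof raw);
    rot = rotLeft1(raw);                     /* [11 exponent][52 mantissa][1 sign] */
    if (rot > 1)                             /* 0 and 1 are +/- 0; keep their zero exponent */
        rot -= EXP_OFFSET;
    return (rot << TAG_BITS) | FLOAT_TAG;    /* [8expsubset][52mantissa][1s][3tags] */
}

static double decodeImmediateFloat(uint64_t tagged)
{
    uint64_t rot = tagged >> TAG_BITS;       /* shift away tags */
    double d;
    if (rot > 1)
        rot += EXP_OFFSET;                   /* re-bias the exponent */
    rot = rotRight1(rot);                    /* sign bit back to the msb */
    memcpy(&d, &rot, sizeof d);
    return d;
}
```

Per the text, +0.0 and -0.0 encode to 0x5 and 0xd respectively, so the zero test is an unsigned compare against 0xf.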
Issues: How do we avoid the Size4Bit for 64 bits? The format word encodes the number of odd bytes, but currently has only 4 bits and hence only supports odd bytes of 0 - 3. We need odd bytes of 0 - 7. But I don't like the separate Size4Bit. Best to change the VI code and have a 5-bit format? We lose one bit but save two bits (isEphemeron and isWeak, or three, if isPointers) for a net gain of one (or two).
Further, keep Squeak's format idea or go for separate bits? For 64 bits we need a 5-bit format field. This contrasts with isPointers, isWeak, isEphemeron, and 3 odd size bits (or a byte size); the format field is quite economical.
Are class indices in inline caches strong references to classes or weak
references?
If strong then they must be scanned during GC and the methodZone must
be flushed on fullGC to reclaim all classes (this looks to be a bug in the V3 Cogit).
If weak then when the class table loses references, PICs containing freed classes must be freed, and then sends to freed PICs, or to PICs containing freed classes, must be unlinked.
The second approach is faster; the common case is scanning the class
table, the uncommon case is freeing classes. The second approach is better; in-line caches do not prevent reclamation of classes.
Stef
-- best, Eliot
On Wed, May 30, 2012 at 10:22 PM, Eliot Miranda <eliot.miranda@gmail.com> wrote:

On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse <stephane.ducasse@inria.fr> wrote:
I would like to be sure that we can have
- a bit for immutable objects
- bits for experimenting.
There will be quite a few. And one will be able to steal bits from the class field if one needs fewer classes. I'm not absolutely sure of the layout yet. But for example
8: slot size (255 => extra header word with large size)
3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for pointer, 7 => # fixed fields is in the class)
4: format (pointers, indexable, bytes/shorts/longs/doubles indexable, compiled method, ephemerons, weak, etc)
1: immutability
3: GC (2 mark bits, 1 forwarded bit)
20: identity hash
and we can make it lazy, that is, we compute it not at instantiation time but rather the first time the primitive is called.
20: class index
This would probably work for a while. I think that it would be good to leave an "open door" so that in the future we can just add one more word for a class pointer.
still leaves 5 bits unused. And stealing 4 bits each from class index still leaves 64k classes. So this format is simple and provides lots of unused bits. The format field is a great idea as it combines a number of orthogonal properties in very few bits. I don't want to include odd bytes in format because I think a separate field that holds odd bytes and fixed fields is better use of space. But we can gather statistics before deciding.
Stef
On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote:
Hi guys
Here is an important topic I would like to see discussed so that we see
how we can improve and join forces.
May a mail discussion then a wiki for the summary would be good.
stef
Begin forwarded message:
From: Eliot Miranda eliot.miranda@gmail.com Subject: Re: Plan/discussion/communication around new object format Date: May 27, 2012 10:49:54 PM GMT+02:00 To: Stéphane Ducasse stephane.ducasse@inria.fr
On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse <
stephane.ducasse@inria.fr> wrote:
Hi eliot
do you have a description of the new object format you want to
introduce?
The design is in the class comment of CogMemoryManager in the Cog
VMMaker packages.
Then what is your schedule?
This is difficult. I have made a small start and should be able to
spend time on it starting soon. I want to have it finished by early next year. But it depends on work schedules etc.
I would like to see if we can allocate igor/esteban time before we run
out of money
to help on that important topic. Now the solution is unclear and I did not see any document where we
can evaluate
and plan how we can help. So do you want help on that topic? Then how
can people
contribute if any?
The first thing to do is to read the design document, to see if the
Pharo community thinks it is the right direction, and to review it, spot deficiencies etc. So please get those interested to read the class comment of CogMemoryManager in the latest VMMaker.oscog.
Here's the current version of it:
CogMemoryManager is currently a place-holder for the design of the new
Cog VM's object representation and garbage collector. The goals for the GC are
- efficient object representation a la Eliot Miranda's VisualWorks
64-bit object representation that uses a 64-bit header, eliminating direct class references so that all objects refer to their classes indirectly. Instead the header contains a constant class index, in a field smaller than a full pointer, These class indices are used in inline and first-level method caches, hence they do not have to be updated on GC (although they do have to be traced to be able to GC classes). Classes are held in a sparse weak table. The class table needs only to be indexed by an instance's class index in class hierarchy search, in the class primitive, and in tracing live objects in the heap. The additional header space is allocated to a much expanded identity hash field, reducing hash efficiency problems in identity collections due to the extremely small (11 bit) hash field in the old Squeak GC. The identity hash field is also a key element of the class index scheme. A class's identity hash is its index into the class table, so to create an instance of a class one merely copies its identity hash into the class index field of the new instance. This implies that when classes gain their identity hash they are entered into the class table and their identity hash is that of a previously unused index in the table. It also implies that there is a maximum number of classes in the table. At least for a few years 64k classes should be enough. A class is entered into the class table in the following operations:
behaviorHash adoptInstance instantiate become (i.e. if an old class becomes a new class) if target class field's = to original's id hash and replacement's id hash is zero enter replacement in class table
behaviorHash is a special version of identityHash that must be
implemented in the image by any object that can function as a class (i.e. Behavior).
- more immediate classes. An immediate Character class would speed up
String accessing, especially for WideString, since no instatiation needs to be done on at:put: and no dereference need be done on at:. In a 32-bit system tag checking is complex since it is thought important to retain 31-bit SmallIntegers. Hence, as in current Squeak, the least significant bit set implies a SmallInteger, but Characters would likely have a tag pattern of xxx10. Hence masking with 11 results in two values for SmallInteger, xxx01 and xxx11. 30-bit characters are more than adequate for Unicode. In a 64-bit system we can use the full three bits and usefully implement an immediate Float. As in VisualWorks a functional representation takes three bits away from the exponent. Rotating to put the sign bit in the least significant non-tag bit makes expanding and contracting the 8-bit exponent to the 11-bit IEEE double exponent easy ad makes comparing negative and positive zero easier (an immediate Float is zero if its unsigned 64-bits are < 16). So the representation looks like
| 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
For details see "60-bit immediate Floats" below.
- efficient scavenging. The current Squeak GC uses a slow
pointer-reversal collector that writes every field in live objects three times in each collection, twice in the pointer-reversing heap traversal to mark live objects and once to update the pointer to its new location. A scavenger writes every field of live data twice in each collection, once as it does a block copy of the object when copying to to space, once as it traverses the live pointers in the to space objects. Of course the block copy is a relatively cheap write.
- lazy become. The JIT's use of inline cacheing provides a cheap way
of avoiding scanning the heap as part of a become (which is the simple approach to implementing become in a system with direct pointers). A becomeForward: on a (set of) non-zero-sized object(s) turns the object into a "corpse" or "forwarding object" whose first (non-header) word/slot is replaced by a pointer to the target of the becomeForward:. The corpse's class index is set to one that identifies corpses and, because it is a hidden class index, will always fail an inline cache test. The inline cache failure code is then responsible for following the forwarding pointer chain (these are Iliffe vectors :) ) and resolving to the actual target. We have yet to determine exactly how this is done (e.g. change the receiver register and/or stack contents and retry the send, perhaps scanning the current activation). See below on how we deal with becomes on objects with named inst vars. Note that we probably don't have to worry about zero-sized objects. These are unlikely to be passed through the FFI (there is nothing to pass :) ) and so will rarely be becommed. If they do, they can become slowly. Alternatively we can insist that objects are at least 16 bytes in size (see a8-byte alignment below) so that there will always be space for a forwarding pointer. Since none of the immediate classes can have non-immediate instances and since we allocate the immediate classes indices corresponding to their tag pattern (SmallInteger = 1, Character = 3, SmallFloat = 4?) we can use all the class indices from 0 to 7 for special uses, 0 = forward, and e.g. 1 = header-sized filler.
- pinning. To support a robust and easy-to-use FFI the memory manager
must support temporary pinning where individual objects can be prevented from being moved by the GC for as long as required, either by being one of an in-progress FFI call's arguments, or by having pinning asserted by a primitive (allowing objects to be passed to external code that retains a reference to the object after returning). Pinning probably implies a per-object "is-pinned" bit in the object header. Pinning will be done via lazy become; i..e an object in new space will be becommed into a pinned object in old space. We will only support pinning in old space.
- efficient old space collection. An incremental collector (a la
Dijkstra's three colour algorithm) collects old space, e.g. via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum. (see free space/free list below)
- 8-byte alignment. It is advantageous for the FFI, for
floating-point access, for object movement and for 32/64-bit compatibility to keep object sizes in units of 8 bytes. For the FFI, 8-byte alignment means passing objects to code that expects that requirement (such as modern x86 numeric processing instructions). This implies that
- the starts of all spaces are aligned on 8-byte boundaries - object allocation rounds up the requested size to a multiple of
8 bytes
- the overflow size field is also 8 bytes
We shall probably keep the minimum object size at 16 bytes so that
there is always room for a forwarding pointer. But this implies that we will need to implement an 8-byte filler to fill holes between objects > 16 bytes whose length mod 16 bytes is 8 bytes and following pinned objects. We can do this using a special class index, e.g. 1, so that the method that answers the size of an object looks like, e.g.
chunkSizeOf: oop <var: #oop type: #'object *'> ^object classIndex = 1 ifTrue: [BaseHeaderSize] ifFalse: [BaseHeaderSize + (object slotSize = OverflowSlotSize ifTrue:
[OverflowSizeBytes]
ifFalse: [0]) + (object slotSize * BytesPerSlot)] chunkStartOf: oop <var: #oop type: #'object *'> ^(self cCoerceSimple: oop to: #'char *') - ((object classIndex = 1 or: [object slotSize ~= OverflowSlotSize]) ifTrue: [0] ifFalse: [OverflowSizeBytes])
For the moment we do not tackle the issue of heap growth and shrinkage
with the ability to allocate and deallocate heap segments via memory-mapping. This technique allows space to be released back to the OS by unmapping empty segments. We may revisit this but it is not a key requirement for the first implementation.
The basic approach is to use a fixed size new space and a growable old
space. The new space is a classic three-space nursery a la Ungar's Generation Scavenging, a large eden for new objects and two smaller survivor spaces that exchange roles on each collection, one being the to space to which surviving objects are copied, the other being the from space of the survivors of the previous collection, i.e. the previous to space.
To provide apparent pinning in new space we rely on lazy become.
Since most pinned objects will be byte data and these do not require stack zone activation scanning, the overhead is simply an old space allocation and corpsing.
To provide pinning in old space, large objects are implicitly pinned (because it is expensive to move large objects and, because they are both large and relatively rare, they contribute little to overall fragmentation - as in aggregates, small objects can be used to fill in the spaces between large objects). Hence, objects above a particular size are automatically allocated in old space, rather than new space, and the pin bit is set automatically when allocating a large object. Small objects are pinned by asserting the pin bit, as for objects in new space. As a last resort, or under programmer control (the fullGC primitive), old space is collected via mark-sweep (mark-compact), and so the mark phase must build the list of pinned objects around which the sweep/compact phase must carefully step.
Free space in old space is organized by a free list/free tree as in Eliot's VisualWorks 5i old space allocator. There are 64 free lists, indices 1 through 63 holding blocks of space of that size, index 0 holding a semi-balanced ordered tree of free blocks, each node being the head of the list of free blocks of that size. At the start of the mark phase the free list is thrown away and the sweep phase coalesces free space and steps over pinned objects as it proceeds. We can reuse the forwarding pointer compaction scheme used in the old collector. Incremental collections merely move unmarked objects to the free lists (as well as nilling weak pointers in weak arrays and scheduling them for finalization). The occupancy of the free lists is represented by a bitmap in a 64-bit integer so that an allocation of size 63 or less can know whether there exists a free chunk of that size, but more importantly can know whether a free chunk larger than it exists in the fixed-size free lists without having to search all larger free list heads.
The incremental collector (a la Dijkstra's three colour algorithm) collects old space via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum. [N.B. Not sure how to do this yet. The incremental collector needs to complete a pass often enough to reclaim objects, but infrequently enough not to waste time. So some form of feedback should work. In VisualWorks tracing is broken into quanta of work where image-level code determines the size of a quantum based on how fast the machine is, and how big the heap is. This code could easily live in the VM, controllable through vmParameterAt:put:. An alternative would be to use the heartbeat to bound quanta by time. But in any case some amount of incremental collection would be done on old space allocation and scavenging, the amount being chosen to keep pause times acceptably short, and at a rate to reclaim old space before a full GC is required, i.e. at a rate proportional to the growth in old space]. The incremental collector is a state machine, being either marking, nilling weak pointers, or freeing. If nilling weak pointers is not done atomically then there must be a read barrier in weak array at: so that reading from an old space weak array that is holding stale un-nilled references to unmarked objects answers nil rather than the stale references. Tricks such as including the weak bit in bounds calculations can make this cheap for non-weak arrays. Alternatively nilling weak pointers can be made an atomic part of incremental collection, which can be made cheaper by maintaining the set of weak arrays (e.g. on a list).
The incremental collector implies a more complex write barrier. Objects are of three colours: black, having been scanned; grey, being scanned; and white, unreached. A mark stack holds the grey objects. If the incremental collector is marking and an unmarked white object is stored into a black object then the stored object must become grey, being added to the mark stack. So the write barrier is essentially
target isYoung
    ifFalse:
        [newValue isYoung
            ifTrue:
                [target isInRememberedSet ifFalse:
                    [target addToRememberedSet
                     "target now refers to a young object; it is a root for scavenges"]]
            ifFalse:
                [(target isBlack and: [igc marking and: [newValue isWhite]]) ifTrue:
                    [newValue beGrey
                     "add newValue to the IGC's markStack for subsequent scanning"]]]
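An executable rendering of the same barrier (illustrative names only; the VM's real barrier is inlined machine code):

```python
class Obj:
    def __init__(self, young=False, colour='white'):
        self.young = young      # generation: new space vs old space
        self.colour = colour    # tri-colour state: 'white', 'grey' or 'black'
        self.slots = []

remembered_set = set()  # old objects referring to young ones: scavenge roots
mark_stack = []         # grey objects awaiting scanning by the IGC
igc_marking = True      # whether the incremental collector is in its mark phase

def store_pointer(target, new_value):
    target.slots.append(new_value)
    if not target.young:
        if new_value.young:
            # target now refers to a young object; it is a root for scavenges
            remembered_set.add(target)
        elif target.colour == 'black' and igc_marking and new_value.colour == 'white':
            # restore the tri-colour invariant: no black-to-white references
            new_value.colour = 'grey'
            mark_stack.append(new_value)
```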
The incremental collector does not detect already marked objects all
of whose references have been overwritten by other stores (e.g. in the above if newValue overwrites the sole remaining reference to a marked object). So the incremental collector only guarantees to collect all garbage created in cycle N at the end of cycle N + 1. The cost is hence slightly worse memory density but the benefit, provided the IGC works hard enough, is the elimination of long pauses due to full garbage collections, which become actions of last resort or programmer desire.
Lazy become.
As described earlier the basic idea behind lazy become is to use
corpses (forwarding objects) that are followed lazily during GC and inline cache miss. However, a lazy scheme cannot be used on objects with named inst vars without adding checking to all inst var accesses, which we judge too expensive. Instead, when becoming objects with named inst vars, we scan all activations in the stack zone, eagerly becoming these references, and we check for corpses when faulting a context into the stack zone. Essentially, the invariant is that there are no references to corpses from the receiver slots of stack activations. A detail is whether we allow or forbid pinning of closure indirection vectors, or scan the entire stack of each activation. Using a special class index pun for indirection vectors is a cheap way of preventing their becoming/pinning etc., although "don't do that" (don't attempt to pin/become indirection vectors) is also an acceptable response.
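A sketch of corpse resolution as the inline-cache miss path must perform it; the distinguished class index for forwarders (0 here) is an assumption for illustration:

```python
FORWARDED_CLASS_INDEX = 0   # assumed hidden class index identifying corpses

class Header:
    def __init__(self, class_index, target=None):
        self.class_index = class_index
        self.target = target    # first slot: the forwarding pointer of a corpse

def follow(obj):
    """Follow a possibly chained forwarding pointer to the real object.
    A corpse always fails the inline cache test, landing us here."""
    while obj.class_index == FORWARDED_CLASS_INDEX:
        obj = obj.target
    return obj
```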
60-bit immediate Floats

Representation for immediate doubles, only used in the 64-bit implementation. Immediate doubles have the same 52-bit mantissa as IEEE double-precision floating-point, but only have 8 bits of exponent. So they occupy just less than the middle 1/8th of the double range. They overlap the normal single-precision floats, which also have 8-bit exponents, but exclude the single-precision denormals (exponent -127) and the single-precision NaNs (exponent +127). +/- zero is just a pair of values with both exponent and mantissa 0.
So the non-zero immediate doubles range from
    +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39
to
    +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38
The encoded tagged form has the sign bit moved to the least significant bit, which allows for faster encode/decode because offsetting the exponent can't overflow into the sign bit and because testing for +/- 0 is an unsigned compare for <= 0xf:

    msb                                                          lsb
    [8 exponent subset bits][52 mantissa bits][1 sign bit][3 tag bits]

So assuming the tag is 5, the tagged non-zero bit patterns are
    0x0000,0000,0000,001[d/5] to 0xffff,ffff,ffff,fff[d/5]
and +/- 0d is
    0x0000,0000,0000,000[d/5]

Encode/decode of non-zero values in machine code looks like:

Decode:
                             [8expsubset][52mantissa][1s][3tags]
    shift away tags:         [ 000 ][8expsubset][52mantissa][1s]
    add exponent offset:     [ 11 exponent ][52mantissa][1s]
    rot sign:                [1s][ 11 exponent ][52mantissa]

Encode:
                             [1s][ 11 exponent ][52mantissa]
    rot sign:                [ 11 exponent ][52mantissa][1s]
    sub exponent offset:     [ 000 ][8expsubset][52mantissa][1s]
    shift:                   [8expsubset][52mantissa][1s][ 000 ]
    or/add tags:             [8expsubset][52mantissa][1s][3tags]

but this is slower in C because a) there is no rotate, and b) raw conversion between double and quadword must (at least in the source) move bits through memory (quadword = *(q64 *)&doubleVariable).
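The rotate-based transformation can be checked with a small Python model, a sketch under the stated assumptions (tag pattern 5, non-zero values whose exponent fits the 8-bit subset):

```python
import struct

TAG  = 5                           # assumed immediate-Float tag pattern
MASK = (1 << 64) - 1
EXP_OFFSET = (1023 - 127) << 53    # 11-bit bias minus 8-bit subset bias,
                                   # positioned above mantissa and sign

def encode(d):
    """Tag a non-zero double whose exponent fits the 8-bit subset."""
    bits = struct.unpack('<Q', struct.pack('<d', d))[0]
    q = ((bits << 1) | (bits >> 63)) & MASK   # rot sign to least significant bit
    q -= EXP_OFFSET                           # sub exponent offset
    return ((q << 3) & MASK) | TAG            # shift and or in the tags

def decode(t):
    q = t >> 3                                # shift away tags
    q += EXP_OFFSET                           # add exponent offset
    bits = ((q >> 1) | ((q & 1) << 63)) & MASK  # rot sign back to msb
    return struct.unpack('<d', struct.pack('<Q', bits))[0]
```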
Issues:

How do we avoid the Size4Bit for 64-bits? The format word encodes the number of odd bytes, but currently has only 4 bits and hence only supports odd bytes of 0 - 3. We need odd bytes of 0 - 7. But I don't like the separate Size4Bit. Best to change the VI code and have a 5-bit format? We lose one bit but save two bits (isEphemeron and isWeak, or three if isPointers) for a net gain of one (or two).
Further, keep Squeak's format idea or go for separate bits? For 64-bits we need a 5-bit format field. This contrasts with isPointers, isWeak, isEphemeron, and 3 odd size bits (or a byte size); the format field is quite economical.
Are class indices in inline caches strong references to classes or
weak references?
If strong then they must be scanned during GC and the methodZone must
be flushed on fullGC to reclaim all classes (this looks to be a bug in the V3 Cogit).
If weak, then when the class table loses references, PICs containing freed classes must be freed, and sends to freed PICs, or to PICs containing freed classes, must be unlinked.
The second approach is faster; the common case is scanning the class
table, the uncommon case is freeing classes. The second approach is better; in-line caches do not prevent reclamation of classes.
Stef
-- best, Eliot
On Wed, May 30, 2012 at 1:53 PM, Mariano Martinez Peck < marianopeck@gmail.com> wrote:
On Wed, May 30, 2012 at 10:22 PM, Eliot Miranda eliot.miranda@gmail.comwrote:
On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse < stephane.ducasse@inria.fr> wrote:
I would like to be sure that we can have - bit for immutable objects - bits for experimenting.
There will be quite a few. And one will be able to steal bits from the class field if one needs fewer classes. I'm not absolutely sure of the layout yet. But for example
8: slot size (255 => extra header word with large size) 3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for pointer, 7 => # fixed fields is in the class) 4 bits: format (pointers, indexable, bytes/shorts/longs/doubles indexable, compiled method, ephemerons, weak, etc) 1: immutability 3: GC 2 mark bits. 1 forwarded bit 20: identity hash
and we can make it lazy, that is, we compute it not at instantiation time but rather the first time the primitive is called.
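This lazy-hash idea could look roughly like this (pure illustration; field and variable names invented):

```python
UNHASHED = 0             # reserved value meaning "no identity hash assigned yet"
next_identity_hash = 1

def identity_hash(obj):
    """Assign the identity hash on first request, not at instantiation."""
    global next_identity_hash
    if obj.get('hash', UNHASHED) == UNHASHED:
        obj['hash'] = next_identity_hash
        next_identity_hash += 1
    return obj['hash']
```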
20: class index
This would probably work for a while. I think that it would be good to leave an "open door" so that in the future we can just add one more word for a class pointer.
Turns out that's not such a simple change. Class indices have two advantages. One is that they're more compact (2^20 classes is still a lot of classes). The other is that they're constant, which has two main benefits. First, in method caches and in-line caches the class field holds an index and hence doesn't need to be updated by the GC; the GC no longer has to visit send sites. Second, because they're constants, both checking for well-known classes and instantiating well-known classes can be done without going to the specialObjectsArray. One just uses the constant. Now undoing these optimizations to open a back-door is not trivial. So best accept the benefits and exploit them to a maximum.
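To make the benefit concrete, here is a toy rendering (invented names, not the VM's code) of an inline-cache test keyed on a constant class index, which the GC never needs to visit or remap:

```python
class InlineCache:
    """Toy inline cache at a send site, keyed on a constant class index."""
    def __init__(self, cached_class_index, target_method):
        self.cached_class_index = cached_class_index   # a constant: no GC remap
        self.target_method = target_method

    def dispatch(self, receiver_class_index, miss_handler):
        if receiver_class_index == self.cached_class_index:
            return self.target_method()
        # miss: e.g. a corpse's hidden index, or a relink after a class is freed
        return miss_handler()
```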
-- Mariano http://marianopeck.wordpress.com
Hi Eliot:
From my experience with the RoarVM, it seems to be a rather simple exercise to enable the VM to support a custom 'pre-header' for objects. That is, a constant offset in the memory that comes from the allocator, which is normally ignored by the GC.
That allows me to do all kinds of things. Of course, at the cost of a word per object, and at the cost of recompiling the VM. But that should be a reasonable price to pay for someone doing research on these kinds of things.
Sometimes a few bits are just not enough, and such a pre-header gives much, much more flexibility. For the people interested in that, I could dig out the details (I think I did that already once on this list).
Best regards Stefan
On 30 May 2012, at 22:22, Eliot Miranda wrote:
On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse stephane.ducasse@inria.fr wrote: I would like to be sure that we can have - bit for immutable objects - bits for experimenting.
There will be quite a few. And one will be able to steal bits from the class field if one needs fewer classes. I'm not absolutely sure of the layout yet. But for example
8: slot size (255 => extra header word with large size) 3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for pointer, 7 => # fixed fields is in the class) 4 bits: format (pointers, indexable, bytes/shorts/longs/doubles indexable, compiled method, ephemerons, weak, etc) 1: immutability 3: GC 2 mark bits. 1 forwarded bit 20: identity hash 20: class index
still leaves 5 bits unused. And stealing 4 bits each from class index still leaves 64k classes. So this format is simple and provides lots of unused bits. The format field is a great idea as it combines a number of orthogonal properties in very few bits. I don't want to include odd bytes in format because I think a separate field that holds odd bytes and fixed fields is better use of space. But we can gather statistics before deciding.
Stef
On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote:
Hi guys
Here is an important topic I would like to see discussed so that we see how we can improve and join forces. Maybe a mail discussion and then a wiki for the summary would be good.
stef
Begin forwarded message:
From: Eliot Miranda eliot.miranda@gmail.com
Subject: Re: Plan/discussion/communication around new object format
Date: May 27, 2012 10:49:54 PM GMT+02:00
To: Stéphane Ducasse stephane.ducasse@inria.fr
On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse <stephane.ducasse@inria.fr> wrote:
Hi Eliot
do you have a description of the new object format you want to introduce?
The design is in the class comment of CogMemoryManager in the Cog VMMaker packages.
Then what is your schedule?
This is difficult. I have made a small start and should be able to spend time on it starting soon. I want to have it finished by early next year. But it depends on work schedules etc.
I would like to see if we can allocate Igor/Esteban time before we run out of money to help on that important topic. Right now the solution is unclear, and I did not see any document where we can evaluate and plan how we can help. So do you want help on that topic? And if so, how can people contribute?
The first thing to do is to read the design document, to see if the Pharo community thinks it is the right direction, and to review it, spot deficiencies etc. So please get those interested to read the class comment of CogMemoryManager in the latest VMMaker.oscog.
Here's the current version of it:
CogMemoryManager is currently a place-holder for the design of the new Cog VM's object representation and garbage collector. The goals for the GC are
- efficient object representation a la Eliot Miranda's VisualWorks 64-bit object representation, which uses a 64-bit header, eliminating direct class references so that all objects refer to their classes indirectly. Instead the header contains a constant class index, in a field smaller than a full pointer. These class indices are used in inline and first-level method caches, hence they do not have to be updated on GC (although they do have to be traced to be able to GC classes). Classes are held in a sparse weak table. The class table needs only to be indexed by an instance's class index in class hierarchy search, in the class primitive, and in tracing live objects in the heap. The additional header space is allocated to a much expanded identity hash field, reducing hash efficiency problems in identity collections due to the extremely small (11 bit) hash field in the old Squeak GC. The identity hash field is also a key element of the class index scheme. A class's identity hash is its index into the class table, so to create an instance of a class one merely copies its identity hash into the class index field of the new instance. This implies that when classes gain their identity hash they are entered into the class table, and their identity hash is that of a previously unused index in the table. It also implies that there is a maximum number of classes in the table; at least for a few years 64k classes should be enough. A class is entered into the class table in the following operations:
	behaviorHash
	adoptInstance
	instantiate
	become (i.e. if an old class becomes a new class):
		if the target's class field is equal to the original's identity hash and the replacement's identity hash is zero, enter the replacement in the class table
behaviorHash is a special version of identityHash that must be implemented in the image by any object that can function as a class (i.e. Behavior).
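The instantiation path can be modelled in a few lines. This is a toy C sketch of the scheme as described; names such as classTable, classOf and instantiate are mine, chosen to mirror the prose, not the VM's actual code:

```c
/* Toy model of the class-index scheme: a class's identity hash IS its
   index in the class table, so instantiation just copies that hash into
   the new instance's class-index field. All names are illustrative. */
#define CLASS_TABLE_SIZE (1 << 16)   /* "64k classes should be enough" */

typedef struct Obj { unsigned classIndex; unsigned identityHash; } Obj;

static Obj *classTable[CLASS_TABLE_SIZE];   /* weak in the real design */

static Obj *classOf(Obj *o) { return classTable[o->classIndex]; }

static void enterIntoClassTable(Obj *aClass)
{
    classTable[aClass->identityHash] = aClass;
}

static Obj *instantiate(Obj *aClass, Obj *newInstance)
{
    enterIntoClassTable(aClass);             /* idempotent once hashed */
    newInstance->classIndex = aClass->identityHash;
    return newInstance;
}
```

The point of the design is visible here: no pointer to the class is stored in the instance, so the GC never needs to update instance headers when classes move.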
- more immediate classes. An immediate Character class would speed up String accessing, especially for WideString, since no instantiation needs to be done on at:put: and no dereference need be done on at:. In a 32-bit system tag checking is complex since it is thought important to retain 31-bit SmallIntegers. Hence, as in current Squeak, the least significant bit set implies a SmallInteger, but Characters would likely have a tag pattern of xxx10. Hence masking with binary 11 yields two values that imply SmallInteger, xxx01 and xxx11. 30-bit characters are more than adequate for Unicode. In a 64-bit system we can use the full three bits and usefully implement an immediate Float. As in VisualWorks a functional representation takes three bits away from the exponent. Rotating to put the sign bit in the least significant non-tag bit makes expanding and contracting the 8-bit exponent to the 11-bit IEEE double exponent easy, and makes comparing negative and positive zero easier (an immediate Float is zero if its unsigned 64 bits are < 16). So the representation looks like

	| 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
For details see "60-bit immediate Floats" below.
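For the 32-bit case, the tag tests described above reduce to mask-and-compare operations. A hedged C sketch (the helper names are mine; the tag patterns are the ones given in the text):

```c
#include <stdint.h>

/* 32-bit immediate tag checks as described: any oop with the least
   significant bit set is a 31-bit SmallInteger (patterns xx01 and xx11),
   while Characters use the two-bit pattern xx10. */
typedef uint32_t oop32;

static int isSmallInteger(oop32 o) { return (o & 1u) != 0; }   /* xxx1 */
static int isCharacter(oop32 o)    { return (o & 3u) == 2; }   /* xx10 */
static int isImmediate(oop32 o)    { return (o & 3u) != 0; }

static int32_t  smallIntegerValue(oop32 o) { return (int32_t)o >> 1; }
static uint32_t characterValue(oop32 o)    { return o >> 2; }  /* 30-bit code point */

static oop32 smallIntegerOop(int32_t v) { return ((oop32)v << 1) | 1u; }
static oop32 characterOop(uint32_t cp)  { return (cp << 2) | 2u; }
```

This shows why keeping 31-bit SmallIntegers forces the asymmetric scheme: SmallInteger claims both odd two-bit patterns, leaving only xx10 for Characters.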
- efficient scavenging. The current Squeak GC uses a slow pointer-reversal collector that writes every field in live objects three times in each collection, twice in the pointer-reversing heap traversal to mark live objects and once to update the pointer to its new location. A scavenger writes every field of live data twice in each collection, once as it does a block copy of the object when copying to to space, once as it traverses the live pointers in the to space objects. Of course the block copy is a relatively cheap write.
- lazy become. The JIT's use of inline caching provides a cheap way of avoiding scanning the heap as part of a become (which is the simple approach to implementing become in a system with direct pointers). A becomeForward: on a (set of) non-zero-sized object(s) turns the object into a "corpse" or "forwarding object" whose first (non-header) word/slot is replaced by a pointer to the target of the becomeForward:. The corpse's class index is set to one that identifies corpses and, because it is a hidden class index, will always fail an inline cache test. The inline cache failure code is then responsible for following the forwarding pointer chain (these are Iliffe vectors :) ) and resolving to the actual target. We have yet to determine exactly how this is done (e.g. change the receiver register and/or stack contents and retry the send, perhaps scanning the current activation). See below on how we deal with becomes on objects with named inst vars. Note that we probably don't have to worry about zero-sized objects. These are unlikely to be passed through the FFI (there is nothing to pass :) ) and so will rarely be becommed. If they do, they can become slowly. Alternatively we can insist that objects are at least 16 bytes in size (see 8-byte alignment below) so that there will always be space for a forwarding pointer. Since none of the immediate classes can have non-immediate instances, and since we allocate the immediate classes' indices corresponding to their tag pattern (SmallInteger = 1, Character = 3, SmallFloat = 4?), we can use all the class indices from 0 to 7 for special uses, 0 = forward, and e.g. 1 = header-sized filler.
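Following a forwarding chain after an inline-cache miss might look like the sketch below. This is a minimal C model under the assumptions the text proposes: class index 0 marks a corpse, and the forwarding pointer sits in the first slot. The struct layout and names are illustrative, not the VM's:

```c
/* Sketch of corpse-following: becomeForward: turns an object into a
   corpse whose hidden class index (0) always fails inline cache tests;
   the miss handler then chases first-slot pointers until it reaches a
   non-corpse. */
enum { ForwardedClassIndex = 0 };

typedef struct FObj {
    unsigned classIndex;
    struct FObj *firstSlot;   /* forwarding pointer when this is a corpse */
} FObj;

static FObj *followForwarding(FObj *o)
{
    while (o->classIndex == ForwardedClassIndex)
        o = o->firstSlot;     /* chains can be several corpses long */
    return o;
}
```

The loop terminates because every chain ends at a live object with a real class index; the cost is paid only on cache misses, never on the heap scan a naive become would require.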
- pinning. To support a robust and easy-to-use FFI the memory manager must support temporary pinning, where individual objects can be prevented from being moved by the GC for as long as required, either by being one of an in-progress FFI call's arguments, or by having pinning asserted by a primitive (allowing objects to be passed to external code that retains a reference to the object after returning). Pinning probably implies a per-object "is-pinned" bit in the object header. Pinning will be done via lazy become; i.e. an object in new space will be becommed into a pinned object in old space. We will only support pinning in old space.
- efficient old space collection. An incremental collector (a la Dijkstra's three colour algorithm) collects old space, e.g. via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum. (see free space/free list below)
- 8-byte alignment. It is advantageous for the FFI, for floating-point access, for object movement and for 32/64-bit compatibility to keep object sizes in units of 8 bytes. For the FFI, 8-byte alignment means objects can be passed directly to code that expects that alignment (such as modern x86 numeric processing instructions). This implies that
- the starts of all spaces are aligned on 8-byte boundaries
- object allocation rounds up the requested size to a multiple of 8 bytes
- the overflow size field is also 8 bytes
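The rounding rule above is a couple of instructions. A small C sketch (the function name is mine; the 16-byte minimum anticipates the forwarding-pointer requirement discussed next):

```c
#include <stddef.h>

/* Round a requested byte size up to a multiple of 8, with a 16-byte
   minimum so a forwarding pointer always fits in any object. */
static size_t alignedAllocSize(size_t nBytes)
{
    size_t rounded = (nBytes + 7) & ~(size_t)7;
    return rounded < 16 ? 16 : rounded;
}
```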
We shall probably keep the minimum object size at 16 bytes so that there is always room for a forwarding pointer. But this implies that we will need to implement an 8-byte filler to fill holes between objects > 16 bytes whose length mod 16 bytes is 8 bytes, and following pinned objects. We can do this using a special class index, e.g. 1, so that the method that answers the size of an object looks like, e.g.

	chunkSizeOf: oop
		<var: #oop type: #'object *'>
		^object classIndex = 1
			ifTrue: [BaseHeaderSize]
			ifFalse: [BaseHeaderSize
					+ (object slotSize = OverflowSlotSize
						ifTrue: [OverflowSizeBytes]
						ifFalse: [0])
					+ (object slotSize * BytesPerSlot)]

	chunkStartOf: oop
		<var: #oop type: #'object *'>
		^(self cCoerceSimple: oop to: #'char *')
			- ((object classIndex = 1
					or: [object slotSize ~= OverflowSlotSize])
				ifTrue: [0]
				ifFalse: [OverflowSizeBytes])
For the moment we do not tackle the issue of heap growth and shrinkage with the ability to allocate and deallocate heap segments via memory-mapping. This technique allows space to be released back to the OS by unmapping empty segments. We may revisit this but it is not a key requirement for the first implementation.
The basic approach is to use a fixed size new space and a growable old space. The new space is a classic three-space nursery a la Ungar's Generation Scavenging, a large eden for new objects and two smaller survivor spaces that exchange roles on each collection, one being the to space to which surviving objects are copied, the other being the from space of the survivors of the previous collection, i.e. the previous to space.
To provide apparent pinning in new space we rely on lazy become. Since most pinned objects will be byte data and these do not require stack zone activation scanning, the overhead is simply an old space allocation and corpsing.
To provide pinning in old space, large objects are implicitly pinned (because it is expensive to move large objects and, because they are both large and relatively rare, they contribute little to overall fragmentation - as in aggregates, small objects can be used to fill in the spaces between large objects). Hence, objects above a particular size are automatically allocated in old space, rather than new space. Small objects are pinned as per objects in new space, by asserting the pin bit, which will be set automatically when allocating a large object. As a last resort, or by programmer control (the fullGC primitive), old space is collected via mark-sweep (mark-compact), and so the mark phase must build the list of pinned objects around which the sweep/compact phase must carefully step.
Free space in old space is organized by a free list/free tree as in Eliot's VisualWorks 5i old space allocator. There are 64 free lists, indices 1 through 63 holding blocks of space of that size, index 0 holding a semi-balanced ordered tree of free blocks, each node being the head of the list of free blocks of that size. At the start of the mark phase the free list is thrown away and the sweep phase coalesces free space and steps over pinned objects as it proceeds. We can reuse the forwarding pointer compaction scheme used in the old collector. Incremental collections merely move unmarked objects to the free lists (as well as nilling weak pointers in weak arrays and scheduling them for finalization). The occupancy of the free lists is represented by a bitmap in a 64-bit integer, so that an allocation of size 63 or less can know whether there exists a free chunk of that size, but more importantly can know whether a larger free chunk exists in the fixed size free lists without having to search all larger free list heads.
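The occupancy-bitmap trick can be sketched as follows; hypothetical C, using a compiler count-trailing-zeros intrinsic (GCC/Clang's __builtin_ctzll) for the scan:

```c
#include <stdint.h>

/* Bit i of the bitmap is set when free list i is non-empty. To satisfy a
   request of n slots (1 <= n <= 63), shift the bitmap down by n and count
   trailing zeros: that finds the smallest non-empty list >= n without
   walking every larger list head. Returns -1 when no fixed-size list can
   help and the allocator must fall back to the tree at index 0. */
static int smallestFreeListAtLeast(uint64_t occupancyBitmap, unsigned n)
{
    uint64_t candidates = occupancyBitmap >> n;
    if (candidates == 0)
        return -1;
    return (int)(n + (unsigned)__builtin_ctzll(candidates));
}
```

One shift and one bit-scan instruction replace a loop over up to 63 list heads, which is what makes the 64-bit bitmap worth its single word of space.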
The incremental collector (a la Dijkstra's three colour algorithm) collects old space via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum. [N.B. Not sure how to do this yet. The incremental collector needs to complete a pass often enough to reclaim objects, but infrequently enough not to waste time. So some form of feedback should work. In VisualWorks tracing is broken into quanta of work where image-level code determines the size of a quantum based on how fast the machine is and how big the heap is. This code could easily live in the VM, controllable through vmParameterAt:put:. An alternative would be to use the heartbeat to bound quanta by time. But in any case some amount of incremental collection would be done on old space allocation and scavenging, the amount being chosen to keep pause times acceptably short, and at a rate to reclaim old space before a full GC is required, i.e. at a rate proportional to the growth in old space.] The incremental collector is a state machine, being either marking, nilling weak pointers, or freeing. If nilling weak pointers is not done atomically then there must be a read barrier in weak array at: so that reads from an old space weak array holding stale un-nilled references to unmarked objects do not answer those stale references. Tricks such as including the weak bit in bounds calculations can make this cheap for non-weak arrays. Alternatively nilling weak pointers can be made an atomic part of incremental collection, which can be made cheaper by maintaining the set of weak arrays (e.g. on a list).
The incremental collector implies a more complex write barrier. Objects are of three colours: black, having been scanned; grey, being scanned; and white, unreached. A mark stack holds the grey objects. If the incremental collector is marking and an unmarked white object is stored into a black object then the stored object must become grey, being added to the mark stack. So the write barrier is essentially

	target isYoung ifFalse:
		[newValue isYoung
			ifTrue: [target isInRememberedSet ifFalse:
						[target addToRememberedSet]]	"target now refers to a young object; it is a root for scavenges"
			ifFalse: [(target isBlack and: [igc marking and: [newValue isWhite]]) ifTrue:
						[newValue beGrey]]]	"add newValue to the IGC's markStack for subsequent scanning"
The incremental collector does not detect already marked objects all of whose references have been overwritten by other stores (e.g. in the above if newValue overwrites the sole remaining reference to a marked object). So the incremental collector only guarantees to collect all garbage created in cycle N at the end of cycle N + 1. The cost is hence slightly worse memory density but the benefit, provided the IGC works hard enough, is the elimination of long pauses due to full garbage collections, which become actions of last resort or programmer desire.
Lazy become.
As described earlier the basic idea behind lazy become is to use corpses (forwarding objects) that are followed lazily during GC and inline cache miss. However, a lazy scheme cannot be used on objects with named inst vars without adding checking to all inst var accesses, which we judge too expensive. Instead, when becomming objects with named inst vars, we scan all activations in the stack zone, eagerly becomming these references, and we check for corpses when faulting in a context into the stack zone. Essentially, the invariant is that there are no references to corpses from the receiver slots of stack activations. A detail is whether we allow or forbid pinning of closure indirection vectors, or scan the entire stack of each activation. Using a special class index pun for indirection vectors is a cheap way of preventing their becomming/pinning etc. Although "don't do that" (don't attempt to pin/become indirection vectors) is also an acceptable response.
60-bit immediate Floats

Representation for immediate doubles, only used in the 64-bit implementation. Immediate doubles have the same 52 bit mantissa as IEEE double-precision floating-point, but only have 8 bits of exponent. So they occupy just less than the middle 1/8th of the double range. They overlap the normal single-precision floats, which also have 8 bit exponents, but exclude the single-precision denormals (exponent -127) and the single-precision NaNs (exponent +127). +/- zero is just a pair of values with both exponent and mantissa 0.

So the non-zero immediate doubles range from
	+/- 0x3800,0000,0000,0001 / 5.8774717541114d-39
to
	+/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38

The encoded tagged form has the sign bit moved to the least significant bit, which allows for faster encode/decode because offsetting the exponent can't overflow into the sign bit, and because testing for +/- 0 is an unsigned compare for <= 0xf:

	msb                                                          lsb
	[8 exponent subset bits][52 mantissa bits][1 sign bit][3 tag bits]

So, assuming the tag is 5, the tagged non-zero bit patterns are
	0x0000,0000,0000,001[d/5] to 0xffff,ffff,ffff,fff[d/5]
and +/- 0d is
	0x0000,0000,0000,000[d/5]

Encode/decode of non-zero values in machine code looks like:

	Decode:              [8expsubset][52mantissa][1s][3tags]
	shift away tags:     [   000    ][8expsubset][52mantissa][1s]
	add exponent offset: [      11 exponent     ][52mantissa][1s]
	rot sign:            [1s][      11 exponent     ][52mantissa]
	Encode:              [1s][      11 exponent     ][52mantissa]
	rot sign:            [      11 exponent     ][52mantissa][1s]
	sub exponent offset: [   000    ][8expsubset][52mantissa][1s]
	shift:               [8expsubset][52mantissa][1s][   000    ]
	or/add tags:         [8expsubset][52mantissa][1s][3tags]

but this is slower in C because a) there is no rotate, and b) raw conversion between double and quadword must (at least in the source) move bits through memory (quadword = *(q64 *)&doubleVariable).
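The encode/decode dance above round-trips in plain C as sketched below. This is my hedged reconstruction, not VM source: the offset 896 = 1023 - 127 is inferred from the stated range (IEEE exponent field 0x380 through 0x47f), the tag is assumed to be 5 as in the text, the rotates are spelled with shifts since C has none, and the sketch is valid only for +/-0 and doubles whose exponent fits the 8-bit subset:

```c
#include <stdint.h>
#include <string.h>

#define SMALL_FLOAT_TAG 5u
#define EXPONENT_OFFSET ((uint64_t)896 << 53)   /* 896 = 1023 - 127 */

static uint64_t smallFloatEncode(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);           /* bits "through memory", as noted */
    u = (u << 1) | (u >> 63);           /* rotate sign down to bit 0 */
    if (u > 1)                          /* +/-0 takes no exponent offset */
        u -= EXPONENT_OFFSET;           /* 11-bit -> 8-bit exponent subset */
    return (u << 3) | SMALL_FLOAT_TAG;  /* shift in, then add, the tags */
}

static double smallFloatDecode(uint64_t taggedOop)
{
    uint64_t u = taggedOop >> 3;        /* shift away tags */
    if (u > 1)
        u += EXPONENT_OFFSET;           /* 8-bit -> 11-bit exponent */
    u = (u >> 1) | (u << 63);           /* rotate sign back up to bit 63 */
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}
```

Note the zero test matches the text: a tagged immediate Float is +/-0 exactly when its unsigned 64 bits are <= 0xf (here 0x5 for +0d and 0xd for -0d).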
Issues: How do we avoid the Size4Bit for 64 bits? The format word encodes the number of odd bytes, but currently has only 4 bits and hence only supports odd bytes of 0 - 3. We need odd bytes of 0 - 7. But I don't like the separate Size4Bit. Best to change the VI code and have a 5 bit format? We lose one bit but save two bits (isEphemeron and isWeak (or three, if isPointers)) for a net gain of one (or two).
Further, should we keep Squeak's format idea or go for separate bits? For 64 bits we need a 5 bit format field. This contrasts with isPointers, isWeak, isEphemeron, and 3 odd size bits (or a byte size); the format field is quite economical.
Are class indices in inline caches strong references to classes or weak references? If strong, then they must be scanned during GC and the methodZone must be flushed on fullGC to reclaim all classes (this looks to be a bug in the V3 Cogit). If weak, then when the class table loses references, PICs containing freed classes must be freed, and then sends to freed PICs or to caches containing freed classes must be unlinked. The second approach is faster; the common case is scanning the class table, the uncommon case is freeing classes. The second approach is better; in-line caches do not prevent reclamation of classes.
Stef
-- best, Eliot
On Wed, May 30, 2012 at 1:57 PM, Stefan Marr <smalltalk@stefan-marr.de> wrote:
Hi Eliot:
From my experience with the RoarVM, it seems to be a rather simple exercise to enable the VM to support a custom 'pre-header' for objects. That is, a constant offset in the memory that comes from the allocator, and is normally ignored by the GC.
That's a great idea. So for experimental use one simply throws a whole word at the problem and forgets about the issue. Thanks, Stefan. That leaves me free to focus on something fast and compact that is still flexible. Great!
That allows me to do all kind of stuff. Of course at a cost of a word per object, and at the cost of recompiling the VM. But, that should be a reasonable price to pay for someone doing research on these kind of things.
Sometimes a few bits are just not enough, and such a pre-header gives much much more flexibility. For the people interested in that, I could dig out the details (I think, I did that already once on this list).
Best regards Stefan
On 30 May 2012, at 22:22, Eliot Miranda wrote:
On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse <
stephane.ducasse@inria.fr> wrote:
I would like to be sure that we can have - bit for immutable objects - bits for experimenting.
There will be quite a few. And one will be able to steal bits from the
class field if one needs fewer classes. I'm not absolutely sure of the layout yet. But for example
8: slot size (255 => extra header word with large size) 3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for
pointer, 7 => # fixed fields is in the class)
4 bits: format (pointers, indexable, bytes/shorts/longs/doubles
indexable, compiled method, ephemerons, weak, etc)
1: immutability 3: GC 2 mark bits. 1 forwarded bit 20: identity hash 20: class index
still leaves 5 bits unused. And stealing 4 bits each from class index
still leaves 64k classes. So this format is simple and provides lots of unused bits. The format field is a great idea as it combines a number of orthogonal properties in very few bits. I don't want to include odd bytes in format because I think a separate field that holds odd bytes and fixed fields is better use of space. But we can gather statistics before deciding.
Stef
On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote:
Hi guys
Here is an important topic I would like to see discussed so that we
see how we can improve and join forces.
May a mail discussion then a wiki for the summary would be good.
stef
Begin forwarded message:
From: Eliot Miranda eliot.miranda@gmail.com Subject: Re: Plan/discussion/communication around new object format Date: May 27, 2012 10:49:54 PM GMT+02:00 To: Stéphane Ducasse stephane.ducasse@inria.fr
On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse <
stephane.ducasse@inria.fr> wrote:
Hi eliot
do you have a description of the new object format you want to
introduce?
The design is in the class comment of CogMemoryManager in the Cog
VMMaker packages.
Then what is your schedule?
This is difficult. I have made a small start and should be able to
spend time on it starting soon. I want to have it finished by early next year. But it depends on work schedules etc.
I would like to see if we can allocate igor/esteban time before we
run out of money
to help on that important topic. Now the solution is unclear and I did not see any document where we
can evaluate
and plan how we can help. So do you want help on that topic? Then how
can people
contribute if any?
The first thing to do is to read the design document, to see if the
Pharo community thinks it is the right direction, and to review it, spot deficiencies etc. So please get those interested to read the class comment of CogMemoryManager in the latest VMMaker.oscog.
Here's the current version of it:
CogMemoryManager is currently a place-holder for the design of the
new Cog VM's object representation and garbage collector. The goals for the GC are
- efficient object representation a la Eliot Miranda's VisualWorks
64-bit object representation that uses a 64-bit header, eliminating direct class references so that all objects refer to their classes indirectly. Instead the header contains a constant class index, in a field smaller than a full pointer, These class indices are used in inline and first-level method caches, hence they do not have to be updated on GC (although they do have to be traced to be able to GC classes). Classes are held in a sparse weak table. The class table needs only to be indexed by an instance's class index in class hierarchy search, in the class primitive, and in tracing live objects in the heap. The additional header space is allocated to a much expanded identity hash field, reducing hash efficiency problems in identity collections due to the extremely small (11 bit) hash field in the old Squeak GC. The identity hash field is also a key element of the class index scheme. A class's identity hash is its index into the class table, so to create an instance of a class one merely copies its identity hash into the class index field of the new instance. This implies that when classes gain their identity hash they are entered into the class table and their identity hash is that of a previously unused index in the table. It also implies that there is a maximum number of classes in the table. At least for a few years 64k classes should be enough. A class is entered into the class table in the following operations:
behaviorHash adoptInstance instantiate become (i.e. if an old class becomes a new class) if target class field's = to original's id hash and replacement's id hash is zero enter replacement in class table
behaviorHash is a special version of identityHash that must be
implemented in the image by any object that can function as a class (i.e. Behavior).
- more immediate classes. An immediate Character class would speed
up String accessing, especially for WideString, since no instatiation needs to be done on at:put: and no dereference need be done on at:. In a 32-bit system tag checking is complex since it is thought important to retain 31-bit SmallIntegers. Hence, as in current Squeak, the least significant bit set implies a SmallInteger, but Characters would likely have a tag pattern of xxx10. Hence masking with 11 results in two values for SmallInteger, xxx01 and xxx11. 30-bit characters are more than adequate for Unicode. In a 64-bit system we can use the full three bits and usefully implement an immediate Float. As in VisualWorks a functional representation takes three bits away from the exponent. Rotating to put the sign bit in the least significant non-tag bit makes expanding and contracting the 8-bit exponent to the 11-bit IEEE double exponent easy ad makes comparing negative and positive zero easier (an immediate Float is zero if its unsigned 64-bits are < 16). So the representation looks like
| 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
For details see "60-bit immediate Floats" below.
- efficient scavenging. The current Squeak GC uses a slow
pointer-reversal collector that writes every field in live objects three times in each collection, twice in the pointer-reversing heap traversal to mark live objects and once to update the pointer to its new location. A scavenger writes every field of live data twice in each collection, once as it does a block copy of the object when copying to to space, once as it traverses the live pointers in the to space objects. Of course the block copy is a relatively cheap write.
- lazy become. The JIT's use of inline cacheing provides a cheap way
of avoiding scanning the heap as part of a become (which is the simple approach to implementing become in a system with direct pointers). A becomeForward: on a (set of) non-zero-sized object(s) turns the object into a "corpse" or "forwarding object" whose first (non-header) word/slot is replaced by a pointer to the target of the becomeForward:. The corpse's class index is set to one that identifies corpses and, because it is a hidden class index, will always fail an inline cache test. The inline cache failure code is then responsible for following the forwarding pointer chain (these are Iliffe vectors :) ) and resolving to the actual target. We have yet to determine exactly how this is done (e.g. change the receiver register and/or stack contents and retry the send, perhaps scanning the current activation). See below on how we deal with becomes on objects with named inst vars. Note that we probably don't have to worry about zero-sized objects. These are unlikely to be passed through the FFI (there is nothing to pass :) ) and so will rarely be becommed. If they do, they can become slowly. Alternatively we can insist that objects are at least 16 bytes in size (see a8-byte alignment below) so that there will always be space for a forwarding pointer. Since none of the immediate classes can have non-immediate instances and since we allocate the immediate classes indices corresponding to their tag pattern (SmallInteger = 1, Character = 3, SmallFloat = 4?) we can use all the class indices from 0 to 7 for special uses, 0 = forward, and e.g. 1 = header-sized filler.
- pinning. To support a robust and easy-to-use FFI the memory
manager must support temporary pinning where individual objects can be prevented from being moved by the GC for as long as required, either by being one of an in-progress FFI call's arguments, or by having pinning asserted by a primitive (allowing objects to be passed to external code that retains a reference to the object after returning). Pinning probably implies a per-object "is-pinned" bit in the object header. Pinning will be done via lazy become; i..e an object in new space will be becommed into a pinned object in old space. We will only support pinning in old space.
- efficient old space collection. An incremental collector (a la
Dijkstra's three colour algorithm) collects old space, e.g. via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum. (see free space/free list below)
- 8-byte alignment. It is advantageous for the FFI, for
floating-point access, for object movement and for 32/64-bit compatibility to keep object sizes in units of 8 bytes. For the FFI, 8-byte alignment means passing objects to code that expects that requirement (such as modern x86 numeric processing instructions). This implies that
- the starts of all spaces are aligned on 8-byte boundaries - object allocation rounds up the requested size to a multiple
of 8 bytes
- the overflow size field is also 8 bytes
We shall probably keep the minimum object size at 16 bytes so that
there is always room for a forwarding pointer. But this implies that we will need to implement an 8-byte filler to fill holes between objects > 16 bytes whose length mod 16 bytes is 8 bytes and following pinned objects. We can do this using a special class index, e.g. 1, so that the method that answers the size of an object looks like, e.g.
chunkSizeOf: oop <var: #oop type: #'object *'> ^object classIndex = 1 ifTrue: [BaseHeaderSize] ifFalse: [BaseHeaderSize + (object slotSize = OverflowSlotSize ifTrue:
[OverflowSizeBytes]
ifFalse: [0]) + (object slotSize * BytesPerSlot)] chunkStartOf: oop <var: #oop type: #'object *'> ^(self cCoerceSimple: oop to: #'char *') - ((object classIndex = 1 or: [object slotSize ~= OverflowSlotSize]) ifTrue: [0] ifFalse: [OverflowSizeBytes])
For the moment we do not tackle the issue of heap growth and
shrinkage with the ability to allocate and deallocate heap segments via memory-mapping. This technique allows space to be released back to the OS by unmapping empty segments. We may revisit this but it is not a key requirement for the first implementation.
The basic approach is to use a fixed size new space and a growable
old space. The new space is a classic three-space nursery a la Ungar's Generation Scavenging, a large eden for new objects and two smaller survivor spaces that exchange roles on each collection, one being the to space to which surviving objects are copied, the other being the from space of the survivors of the previous collection, i.e. the previous to space.
To provide apparent pinning in new space we rely on lazy become.
Since most pinned objects will be byte data and these do not require stack zone activation scanning, the overhead is simply an old space allocation and corpsing.
To provide pinning in old space, large objects are implicitly pinned
(because it is expensive to move large objects and, because they are both large and relatively rare, they contribute little to overall fragmentation
- as in aggregates, small objects can be used to fill-in the spaces between
karge objects). Hence, objects above a particular size are automatically allocated in old space, rather than new space. Small objects are pinned as per objects in new space, by asserting the pin bit, which will be set automaticaly when allocating a large object. As a last resort, or by programmer control (the fullGC primitive) old space is collected via mark-sweep (mark-compact) and so the mark phase must build the list of pinned objects around which the sweep/compact phase must carefully step.
Free space in old space is organized by a free list/free tree as in
Eliot's VisualWorks 5i old space allocator. There are 64 free lists, indices 1 through 63 holding blocks of space of that size, index 0 holding a semi-balanced ordered tree of free blocks, each node being the head of the list of free blocks of that size. At the start of the mark phase the free list is thrown away and the sweep phase coallesces free space and steps over pinned objects as it proceeds. We can reuse the forwarding pointer compaction scheme used in the old collector. Incremental collections merely move unmarked objects to the free lists (as well as nilling weak pointers in weak arrays and scheduling them for finalization). The occupancy of the free lists is represented by a bitmap in a 64-bit integer so that an allocation of size 63 or less can know whether there exists a free chunk of that size, but more importantly can know whether a free chunk larger than it exists in the fixed size free lists without having to search all larger free list heads.
The incremental collector (a la Dijkstra's three colour algorithm)
collects old space via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum. [N.B. Not sure how to do this yet. The incremental collector needs to complete a pass often enough to reclaim objects, but infrequently enough not to waste time. So some form of feedback should work. In VisualWorks tracing is broken into quanta of work where image-level code determines the size of a quantum based on how fast the machine is, and how big the heap is. This code could easily live in the VM, controllable through vmParameterAt:put:. An alternative would be to use the heartbeat to bound quanta by time. But in any case some amount of incremental collection would be done on old space allocation and scavenging, the amount being chosen to keep pause times acceptably short, and at a rate to reclaim old space before a full GC is required, i.e. at a rate proportional to the growth in old space]. The incremental collector is a state machine, being either marking, nilling weak pointers, or freeing. If nilling weak pointers is not done atomically then there must be a read barrier in weak array at: so that reading from an old space weak array does not answer stale un-nilled references to unmarked objects. Tricks such as including the weak bit in bounds calculations can make this cheap for non-weak arrays. Alternatively nilling weak pointers can be made an atomic part of incremental collection, which can be made cheaper by maintaining the set of weak arrays (e.g. on a list).
The incremental collector implies a more complex write barrier.
Objects are of three colours: black, having been scanned; grey, being scanned; and white, unreached. A mark stack holds the grey objects. If the incremental collector is marking and an unmarked white object is stored into a black object then the stored object must become grey, being added to the mark stack. So the write barrier is essentially
	target isYoung ifFalse:
		[newValue isYoung
			ifTrue:
				[target isInRememberedSet ifFalse:
					[target addToRememberedSet
					 "target now refers to a young object; it is a root for scavenges"]]
			ifFalse:
				[(target isBlack and: [igc marking and: [newValue isWhite]]) ifTrue:
					[newValue beGrey
					 "add newValue to IGC's markStack for subsequent scanning"]]]
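The same barrier as a self-contained C sketch, with all structures, fields and names invented for illustration:

```c
/* Toy model of the combined remembered-set / tri-colour write barrier above.
   All structures, fields and names are invented for illustration. */

enum Colour { WHITE, GREY, BLACK };

typedef struct {
    int isYoung;            /* lives in new space */
    int inRememberedSet;
    enum Colour colour;
} Obj;

static int igcMarking = 1;  /* the incremental collector is in its mark phase */

/* Barrier applied on every store of newValue into a pointer slot of target. */
void storeCheck(Obj *target, Obj *newValue)
{
    if (target->isYoung)
        return;                 /* stores into young objects need no barrier */
    if (newValue->isYoung) {
        /* target now refers to a young object; it is a root for scavenges */
        if (!target->inRememberedSet)
            target->inRememberedSet = 1;
    } else if (target->colour == BLACK && igcMarking && newValue->colour == WHITE) {
        /* add newValue to the IGC's mark stack for subsequent scanning */
        newValue->colour = GREY;
    }
}
```

Note how the generational check and the tri-colour check share the initial `isYoung` tests, which is what keeps the combined barrier cheap on the common path.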
The incremental collector does not detect already marked objects all
of whose references have been overwritten by other stores (e.g. in the above if newValue overwrites the sole remaining reference to a marked object). So the incremental collector only guarantees to collect all garbage created in cycle N at the end of cycle N + 1. The cost is hence slightly worse memory density but the benefit, provided the IGC works hard enough, is the elimination of long pauses due to full garbage collections, which become actions of last resort or programmer desire.
Lazy become.
As described earlier the basic idea behind lazy become is to use
corpses (forwarding objects) that are followed lazily during GC and inline cache miss. However, a lazy scheme cannot be used on objects with named inst vars without adding checking to all inst var accesses, which we judge too expensive. Instead, when becomming objects with named inst vars, we scan all activations in the stack zone, eagerly becomming these references, and we check for corpses when faulting in a context into the stack zone. Essentially, the invariant is that there are no references to corpses from the receiver slots of stack activations. A detail is whether we allow or forbid pinning of closure indirection vectors, or scan the entire stack of each activation. Using a special class index pun for indirection vectors is a cheap way of preventing their becomming/pinning etc. Although "don't do that" (don't attempt to pin/become indirection vectors) is also an acceptable response.
60-bit immediate Floats. Representation for immediate doubles, only used in the 64-bit
implementation. Immediate doubles have the same 52 bit mantissa as IEEE double-precision floating-point, but only have 8 bits of exponent. So they occupy just less than the middle 1/8th of the double range. They overlap the normal single-precision floats, which also have 8 bit exponents, but exclude the single-precision denormals (exponent -127) and the single-precision NaNs (exponent +127). +/- zero is just a pair of values with both exponent and mantissa 0.
So the non-zero immediate doubles range from +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39 to +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38. The encoded tagged form has the sign bit moved to the least significant bit, which allows for faster encode/decode because offsetting the exponent can't overflow into the sign bit and because testing for +/- 0 is an unsigned compare for <= 0xf:

	msb                                                          lsb
	[8 exponent subset bits][52 mantissa bits][1 sign bit][3 tag bits]
So assuming the tag is 5, the tagged non-zero bit patterns are 0x0000,0000,0000,001[d/5] to 0xffff,ffff,ffff,fff[d/5] and +/- 0d is 0x0000,0000,0000,000[d/5]. Encode/decode of non-zero values in machine code looks like:

	msb                                                  lsb
	Decode:
	                          [8expsubset][52mantissa][1s][3tags]
	shift away tags:     [ 000 ][8expsubset][52mantissa][1s]
	add exponent offset: [   11 exponent   ][52mantissa][1s]
	rot sign:            [1s][   11 exponent   ][52mantissa]

	Encode:
	                     [1s][   11 exponent   ][52mantissa]
	rot sign:            [   11 exponent   ][52mantissa][1s]
	sub exponent offset: [ 000 ][8expsubset][52mantissa][1s]
	shift:               [8expsubset][52mantissa][1s][ 000 ]
	or/add tags:              [8expsubset][52mantissa][1s][3tags]

but it is slower in C because a) there is no rotate, and b) raw conversion between double and quadword must (at least in the source) move bits through memory (quadword = *(q64 *)&doubleVariable).
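A hedged C sketch of the encode/decode above; the tag value (5), tag width (3) and exponent offset follow the text, while the function and type names, and the `> 1` zero test, are assumptions:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the immediate-double encode/decode described above. The tag
   value (5), tag width (3) and exponent offset are taken from the text;
   the names are invented. +/-0 (whose untagged form is just the sign bit,
   i.e. 0 or 1) skip the exponent offset. */

#define TAG_BITS   3
#define FLOAT_TAG  5ULL
#define EXP_OFFSET (0x380ULL << 53)  /* raises the 8-bit exponent subset to 11 bits */

typedef uint64_t oop;

static uint64_t rotLeft1(uint64_t x)  { return (x << 1) | (x >> 63); }
static uint64_t rotRight1(uint64_t x) { return (x >> 1) | (x << 63); }

/* Encode a double whose exponent lies in the immediate range (or +/-0). */
oop encodeSmallFloat(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* the move-through-memory the text mentions */
    bits = rotLeft1(bits);           /* [11 exp][52 mantissa][1 sign] -> sign to lsb */
    if (bits > 1)                    /* not +/- 0 */
        bits -= EXP_OFFSET;          /* [000][8 exp subset][52 mantissa][1 sign] */
    return (bits << TAG_BITS) | FLOAT_TAG;
}

double decodeSmallFloat(oop o)
{
    uint64_t bits = o >> TAG_BITS;   /* shift away the tags */
    if (bits > 1)                    /* not +/- 0 */
        bits += EXP_OFFSET;          /* restore the full 11-bit exponent */
    bits = rotRight1(bits);          /* sign back to the msb */
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

The rotate-by-one is a single instruction on most machines; in C the compiler has to recognize the shift-or idiom, and the `memcpy` is the portable way to move raw bits between double and quadword without violating aliasing rules.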
Issues: How do we avoid the Size4Bit for 64-bits? The format word encodes
the number of odd bytes, but currently has only 4 bits and hence only supports odd bytes of 0 - 3. We need odd bytes of 0 - 7. But I don't like the separate Size4Bit. Best to change the VI code and have a 5 bit format? We lose one bit but save two bits (isEphemeron and isWeak, or three, if isPointers) for a net gain of one (or two).
Further, keep Squeak's format idea or go for separate bits? For 64-bits we need a 5 bit format field. This contrasts with isPointers, isWeak, isEphemeron, and 3 odd size bits (or a byte size); the format field is quite economical.
Are class indices in inline caches strong references to classes or
weak references?
If strong then they must be scanned during GC and the methodZone must
be flushed on fullGC to reclaim all classes (this looks to be a bug in the V3 Cogit).
If weak, then when the class table loses references, PICs containing freed classes must be freed, and then sends to freed PICs or to methods containing freed classes must be unlinked.
The second approach is faster; the common case is scanning the class
table, the uncommon case is freeing classes. The second approach is better; in-line caches do not prevent reclamation of classes.
Stef
-- best, Eliot
-- Stefan Marr Software Languages Lab Vrije Universiteit Brussel Pleinlaan 2 / B-1050 Brussels / Belgium http://soft.vub.ac.be/~smarr Phone: +32 2 629 2974 Fax: +32 2 629 3525
Here are couple (2) of mine, highly valuable cents :)
2^20 for classes? Might be fine (or even overkill) for Smalltalk systems, but could be quite limiting for one who would like to experiment with implementing prototype-based frameworks, where every object is a "class" by itself.
---
8: slot size (255 => extra header word with large size)
3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for pointer, 7 => # fixed fields is in the class)
4 bits: format (pointers, indexable, bytes/shorts/longs/doubles indexable, compiled method, ephemerons, weak, etc)
1: immutability
3: GC (2 mark bits, 1 forwarded bit)
20: identity hash
20: class index
---
What takes most of the space in the object header? Right - hash! Now, since we will have lazy become, I am back to my idea of having extra & arbitrary properties per object.
In a nutshell, the idea is to not store the hash in the object header, but instead use just a single 'hash present' bit.
When the identity hash of an object is requested (via the corresponding primitive), the implementation checks whether 'hash present' is set; if it is not, we do a 'lazy become' of the existing object to a copy of the same object in another place, but with the hash bit set and with an extra 64-bit field where the hash value can be stored.
So, when you request an identity hash for an object which doesn't have one, the object goes from [header][...data..] to a copy in a new memory region with the new layout [header][hash bits][...data..],
and the old object is, of course, 'corpsed' into a forwarding pointer to the new location.
The next step is going from holding just the hash to having an arbitrary & dynamic number of extra fields per object. In the same way, we use 1 extra bit, indicating that the object has extra properties. And when an object doesn't have it, we lazy-become it from [header][...data..] or [header][hash bits][...data..] to: [header][hash bits][oop][...data..]
where 'oop' can be anything - an instance of Array/Dictionary (depending on how the language side decides to store the extra properties of an object).
This, for instance, would allow us to store extra properties for special object formats like variable bytes or compiled methods, which don't have instance variables.
No need to mention how useful it is to be able to attach extra properties per object without changing the object's class. And, of course, the freed 18 bits (20 - 2) in the header can be allocated for other purposes. (Stef, how many bits do you need for experiments? ;)
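A minimal C sketch of this hash-on-demand idea. Note the simplification: the toy reserves the hash field up front rather than growing the object on copy, and all names and layouts are invented:

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy model of "hash on demand" via lazy become. The real scheme would add
   a 64-bit hash field when first asked; this toy reserves the field up front
   and only models the forwarding (corpse) mechanics. */

enum { HASH_PRESENT = 1u << 0, FORWARDED = 1u << 1 };

typedef struct Obj {
    uint32_t flags;
    struct Obj *forward;   /* valid when FORWARDED is set */
    uint64_t hash;         /* valid when HASH_PRESENT is set */
    uint64_t data;
} Obj;

static uint64_t nextHash = 1;

/* Follow forwarding pointers until we reach a live object. */
Obj *follow(Obj *o)
{
    while (o->flags & FORWARDED)
        o = o->forward;
    return o;
}

/* Answer the identity hash, lazily becoming the object into a copy that
   carries the hash; the reference we came through is updated in place. */
uint64_t identityHash(Obj **slot)
{
    Obj *o = follow(*slot);
    if (!(o->flags & HASH_PRESENT)) {
        Obj *copy = malloc(sizeof *copy);   /* copy with room for the hash */
        copy->flags = HASH_PRESENT;
        copy->forward = NULL;
        copy->hash = nextHash++;
        copy->data = o->data;
        o->flags |= FORWARDED;              /* corpse the old object */
        o->forward = copy;
        o = copy;
    }
    *slot = o;
    return o->hash;
}
```

Two references that reach the same object through different paths answer the same hash, because both end up following the corpse to the single hashed copy.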
------------
About immediates zoo.
Keep in mind, that the more immediates we have, the more complex implementation tends to be.
I would just keep 2 data types:
- integers
- floats
and a third, special 'arbitrary' immediate, which is seen by the VM as a 60-bit value. The interpretation of this value depends on a lookup in a range table, where the developer specifies the correspondence between a value interval and a class: [min .. max] -> class.
The intervals, of course, cannot overlap. Determining the class of such an immediate might be slower - O(log2(n)) at best (where n is the size of the range table), but on the other hand, how many different kinds of immediates can you fit into a 60-bit value? Right, it is 2^60. Much more than the proposed 8, isn't it? :)
And this extra cost can be mitigated completely by the inline cache:
- in the case of a regular reference, you must fetch the object's class and then compare it with the one stored in the cache;
- in the case of an immediate reference, you compare the immediate value with the min and max stored in the cache fields, and if the value is in range, you got a cache hit and are free to proceed. So, it's just one extra comparison compared to the 'classic' inline cache.
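The slow-path range-table lookup might be sketched like this in C; the table contents, bounds and class indices are purely hypothetical:

```c
#include <stdint.h>

/* Sketch of the proposed range table mapping 60-bit immediate values to
   classes: a sorted array of non-overlapping [min, max] -> classIndex
   entries, searched by binary search. The table contents and class indices
   here are purely hypothetical. */

typedef struct {
    uint64_t min, max;     /* inclusive bounds of the immediate range */
    int classIndex;        /* language-side class for values in the range */
} RangeEntry;

static const RangeEntry rangeTable[] = {
    { 0x0000000000000000ULL, 0x07FFFFFFFFFFFFFFULL, 1 },  /* e.g. SmallInteger */
    { 0x0800000000000000ULL, 0x08000000001FFFFFULL, 2 },  /* e.g. Character */
    { 0x0900000000000000ULL, 0x0FFFFFFFFFFFFFFFULL, 3 },  /* e.g. SmallFloat */
};
enum { rangeTableSize = sizeof rangeTable / sizeof *rangeTable };

/* O(log2 n) class lookup for an (untagged) 60-bit immediate value;
   answers -1 if the value falls in no declared range. */
int classOfImmediate(uint64_t value)
{
    int lo = 0, hi = rangeTableSize - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (value < rangeTable[mid].min)      hi = mid - 1;
        else if (value > rangeTable[mid].max) lo = mid + 1;
        else return rangeTable[mid].classIndex;
    }
    return -1;
}
```

An inline cache would store the hit entry's min/max, so the binary search only runs on cache misses.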
And, after thinking about how the inline cache is organized, you can now scratch my first paragraph related to immediates! We really don't need to discriminate between small integers/floats/the rest!! They could also be nothing more than just one of the ranges defined in our zoo of 'special' immediates!
So, at the end we will have just two kinds of references:
- zero bit == 0 -- an object pointer
- zero bit == 1 -- an immediate
Voila!.
We can have a real zoo of immediates, and a simple implementation to support them. Not to mention that the range table is provided by the language side, so we're free to rearrange it at any moment.
And of course, it doesn't mean that the VM could not reserve some of the ranges for its own 'contracted' immediates, like Characters, and even class references for example. Think about it :)
On Wed, 30 May 2012, Igor Stasenko wrote:
Here are couple (2) of mine, highly valuable cents :)
2^20 for classes? Might be fine (or even overkill) for Smalltalk systems, but could be quite limiting for one who would like to experiment with implementing prototype-based frameworks, where every object is a "class" by itself.
I think it's more important to have a fast Smalltalk VM, than one that is slower, but might fit for a concrete experiment which might happen sometime and would get some performance boost from the implementation.
8: slot size (255 => extra header word with large size)
3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for pointer, 7 => # fixed fields is in the class)
4 bits: format (pointers, indexable, bytes/shorts/longs/doubles indexable, compiled method, ephemerons, weak, etc)
1: immutability
3: GC (2 mark bits, 1 forwarded bit)
20: identity hash
20: class index
What takes most of the space in the object header? Right - hash! Now, since we will have lazy become, I am back to my idea of having extra & arbitrary properties per object.
In a nutshell, the idea is to not store the hash in the object header, but instead use just a single 'hash present' bit.
When the identity hash of an object is requested (via the corresponding primitive), the implementation checks whether 'hash present' is set; if it is not, we do a 'lazy become' of the existing object to a copy of the same object in another place, but with the hash bit set and with an extra 64-bit field where the hash value can be stored.
So, when you request an identity hash for an object which doesn't have one, the object goes from [header][...data..] to a copy in a new memory region with the new layout [header][hash bits][...data..],
and the old object is, of course, 'corpsed' into a forwarding pointer to the new location.
The weak point of this idea is that you might run out of memory during the allocation of the new object if you ask for the identity hash of a larger object, or of many smaller objects at once.
The next step is going from holding just the hash to having an arbitrary & dynamic number of extra fields per object. In the same way, we use 1 extra bit, indicating that the object has extra properties. And when an object doesn't have it, we lazy-become it from [header][...data..] or [header][hash bits][...data..] to: [header][hash bits][oop][...data..]
where 'oop' can be anything - an instance of Array/Dictionary (depending on how the language side decides to store the extra properties of an object).
This, for instance, would allow us to store extra properties for special object formats like variable bytes or compiled methods, which don't have instance variables.
No need to mention how useful it is to be able to attach extra properties per object without changing the object's class. And, of course, the freed 18 bits (20 - 2) in the header can be allocated for other purposes. (Stef, how many bits do you need for experiments? ;)
About immediates zoo.
Keep in mind, that the more immediates we have, the more complex implementation tends to be.
I would just keep 2 data types:
- integers
- floats
and third, special 'arbitrary' immediate , which seen by VM as a 60-bit value. The interpretation of this value depends on lookup in range-table, where developer specifying the correspondence between the value interval and class: [min .. max] -> class
intervals, of course, cannot overlap. Determining a class of such immediate might be slower - O(log2(n)) at best (where n is size of range table), but from other side, how many different kinds of immediates you can fit into 60-bit value? Right, it is 2^60. Much more than proposed 8 isn't? :)
And this extra cost can be mitigated completely by inline cache.
- in case of regular reference, you must fetch the object's class and
then compare it with one, stored in cache.
- in case of immediate reference, you compare immediate value with min
and max stored in cache fields. And if value is in range, you got a cache hit, and free to proceed. So, its just 1 extra comparison comparing to 'classic' inline cache.
And, after thinking how inline cache is organized, now you can scratch the first my paragraph related to immediates! We really don't need to discriminate between small integers/floats/rest!! They could also be nothing more than just one of a range(s) defined in our zoo of 'special' immediates!
So, at the end we will have just two kinds of references:
- zero bit == 0 -- an object pointer
- zero bit == 1 -- an immediate
Voila!.
We can have real zoo of immediates, and simple implementation to support them. And not saying that range-table is provided by language-side, so we're free to rearrange them at any moment.
And of course, it doesn't means that VM could not reserve some of the ranges for own 'contracted' immediates, like Characters, and even class reference for example. Think about it :)
I like the idea, but I'm not sure how useful it will be in practice. I'd also add characters as a third data type. String/Character operations should be as fast as possible.
Levente
Some extra ideas.
1. Avoiding the extra header for big-sized objects. I'm not sure about this, but still...
according to Eliot's design: 8: slot size (255 => extra header word with large size)
What if we extend the size field to 16 bits (so in total it will be 65536 slots) and have a single flag indicating how to calculate the object size:

	flag(0): object size = (size field) * 8
	flag(1): object size = 2 ^ (size field)

which means that past 2^16 (or however many bits we dedicate to the size field in the header) all object sizes will be powers of two. Since most objects will fit under 2^16, we don't lose much. For big arrays, we could have a special collection/array which stores the exact size in its inst var (and we don't even need to care in the cases of Sets/Dicts/OrderedCollections). Also, we can actually make it transparent:
	Array class>>new: size
		size > (max exact size) ifTrue:
			[^ArrayWithBigSizeWhatever new: size]

Of course, care must be taken for those variable classes which can potentially hold large amounts of bytes (like Bitmap). But I think code can be quickly adapted to this feature of the VM, which will simply fail a #new: primitive if the size is not a power of two for sizes greater than the max "exact size" which can fit into the size field of the header. ----
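A small C sketch of this two-mode size encoding, with invented names; the #new: failure for large non-power-of-two sizes is modelled by returning 0:

```c
#include <stdint.h>

/* Sketch of the proposed two-mode size encoding: with the flag clear, the
   16-bit size field holds the slot count directly (byte size = slots * 8);
   with the flag set, it holds log2 of the slot count, so large objects are
   rounded to powers of two. Names are invented; __builtin_ctzll is a
   GCC/Clang builtin. */

typedef struct {
    unsigned sizeField : 16;
    unsigned powerOfTwoFlag : 1;
} SizeHeader;

uint64_t slotCount(SizeHeader h)
{
    return h.powerOfTwoFlag ? (1ULL << h.sizeField) : h.sizeField;
}

/* Encode a requested slot count; answers 0 (primitive failure) for large
   sizes that are not powers of two, as the text suggests #new: should. */
int encodeSlotCount(uint64_t n, SizeHeader *h)
{
    if (n < (1ULL << 16)) {          /* exact sizes fit the field directly */
        h->powerOfTwoFlag = 0;
        h->sizeField = (unsigned)n;
        return 1;
    }
    if ((n & (n - 1)) != 0)          /* not a power of two */
        return 0;
    h->powerOfTwoFlag = 1;
    h->sizeField = (unsigned)__builtin_ctzll(n);
    return 1;
}
```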
2. Slot for arbitrary properties. If you read carefully, Eliot said that to make lazy become work it is necessary to always have some extra space per object, even if the object doesn't have any fields:
<<We shall probably keep the minimum object size at 16 bytes so that there is always room for a forwarding pointer.>>
So, this fits quite well with the idea of having a slot for dynamic properties per object. What if, instead of "extending" an object when it requires an extra properties slot, we just reserve the slot for properties at the very beginning:

	[header][properties slot] ...rest of data..

so any object will have that slot. And in the case of lazy become, we can use that slot for holding the forwarding pointer. Voila.
3. From 2 we go straight back to hash. The VM doesn't need to know such a thing as an object's hash; it has no semantic load inside the VM, which just answers those bits via a single primitive.
So why is it a kind of enforced inherent property of all objects in the system? And if we have that one, why could we not have more than one, or as many as we want? This is my central question around the idea of having per-object properties. Once the VM guarantees that any object can have at least one slot for storing an object reference (the property slot), then it is no longer necessary for the VM to care about identity hash.
It can be implemented completely at the language side. But best of all, we are NO longer limited in how big or small hash values are, which directly converts into bonuses: fewer hash collisions -> more performance. Want a 64-bit hash? 128-bit? Whatever you desire:
	Object>>identityHash
		^self propertiesAt: #hash ifAbsentPut: [HashGenerator newHashValue]
And once we have per-object properties and lazy become, things like Magma will get a HUGE benefit straight out of the box. Because look, lazy become and immutability - those two address many problems related to OODB implementation (I barely see other use cases where immutability would be as useful as in the case of an OODB).. so for me it is logical to have this last step: by adding arbitrary properties, an OODB can now store its ID there.
On 2012-06-11, at 1:36 AM, Igor Stasenko wrote:
Because look, lazy become, immutability - those two addressing many problems related to OODB implementation (i barely see other use cases, where immutability would be as useful as in cases of OODB).. so for me it is logical to have this last step: by adding arbitrary properties, OODB now can store the ID there.
Well, it goes a little further than that. I think immutability is generally useful for any system that persists objects outside the image. OODBs are one example, but the same applies for ORM, LOOM-style virtual memory, or even syncing of state across the network. I've even wished for immutability when working on web applications. It's a join point for any aspects related to state.
Arbitrary properties are actually used quite a bit already, they just don't have VM support. Morphic and Tweak use them extensively, as does the dependency system. I suspect we'd find that a lot of hacks and kludges could be subsumed by VM-supported arbitrary properties. (e.g., ephemerons).
So yeah, +1 to arbitrary properties.
Colin
If you remember, I said before that I don't like immutability enforced through a VM contract.
Imagine two frameworks using the immutability flag for their own purposes and contending for ownership of the same object(s)..
IMO there are other, better solutions to that, but I'm not going to go into details...
On 11 June 2012 18:05, Colin Putney colin@wiresong.com wrote:
On 2012-06-11, at 1:36 AM, Igor Stasenko wrote:
Because look, lazy become, immutability - those two addressing many problems related to OODB implementation (i barely see other use cases, where immutability would be as useful as in cases of OODB).. so for me it is logical to have this last step: by adding arbitrary properties, OODB now can store the ID there.
Well, it goes a little further than that. I think immutability is generally useful for any system that persists objects outside the image. OODBs are one example, but the same applies for ORM, LOOM-style virtual memory, or even syncing of state across the network. I've even wished for immutability when working on web applications. It's a join point for any aspects related to state.
Arbitrary properties are actually used quite a bit already, they just don't have VM support. Morphic and Tweak use them extensively, as does the dependency system. I suspect we'd find that a lot of hacks and kludges could be subsumed by VM-supported arbitrary properties. (e.g., ephemerons).
So yeah, +1 to arbitrary properties.
Colin
if you remember i said before, that i don't like immutability enforced through VM contract.
imagine two frameworks, using immutability flag for their own purposes, and contending for ownership of same object(s)..
IMO, there are other , better, solutions to that but i'm not going to go in details...
I totally agree with you here, too. An immutability-bit is like static-typing, or RDBMS RI, or "truly private" methods (to which I'm also opposed). Each of those enforces some past notion about that object/method/whatever on the present, even until it should no longer be the case -- at which point you simply go into the "meta settings" and turn it off because it got in your way.
Probably better to just avoid all that in the first place.
On Mon, Jun 11, 2012 at 2:42 PM, Chris Muller asqueaker@gmail.com wrote:
if you remember i said before, that i don't like immutability enforced through VM contract.
imagine two frameworks, using immutability flag for their own purposes, and contending for ownership of same object(s)..
IMO, there are other , better, solutions to that but i'm not going to go in details...
I totally agree with you here, too. An immutability-bit is like static-typing, or RDBMS RI, or "truly private" methods (to which I'm also opposed). Each of those enforces some past notion about that object/method/whatever on the present, even until it should no longer be the case -- at which point you simply go into the "meta settings" and turn it off because it got in your way.
Probably better to just avoid all that in the first place.
Ok, what if we call it a "fast write-barrier provided by the VM". Would that change your view? Igor?
Colin
On 11 June 2012 23:46, Colin Putney colin@wiresong.com wrote:
On Mon, Jun 11, 2012 at 2:42 PM, Chris Muller asqueaker@gmail.com wrote:
if you remember i said before, that i don't like immutability enforced through VM contract.
imagine two frameworks, using immutability flag for their own purposes, and contending for ownership of same object(s)..
IMO, there are other , better, solutions to that but i'm not going to go in details...
I totally agree with you here, too. An immutability-bit is like static-typing, or RDBMS RI, or "truly private" methods (to which I'm also opposed). Each of those enforces some past notion about that object/method/whatever on the present, even until it should no longer be the case -- at which point you simply go into the "meta settings" and turn it off because it got in your way.
Probably better to just avoid all that in the first place.
Ok, what if we call it a "fast write-barrier provided by the VM". Would that change your view? Igor?
The problem is that it is not nearly as fast: at every memory write you have to check this flag; even if it is not used anywhere by anything at the language side, you will still pay the price.
I would understand if we wanted a dedicated VM where this flag serves the single purpose of supporting specific software (like an OODB) which is its only user.
Another (open) question is how to deal with immutability in the presence of become, i.e.:

	mutableCopy := immutableObject shallowCopy.
	immutableObject becomeForward: mutableCopy.

This kind of thing makes immutability useless for protecting critical parts of object memory, like preventing modification of compiled method literals: yes, you cannot change the immutable object, but you can replace all pointers to it with its own mutable copy, which is equivalent to making it mutable again.
On 2012-06-13, at 05:27, Igor Stasenko wrote:
Another (open) question, is how to deal with immutability in presence of become, i.e.:
mutableCopy := immutableObject shallowCopy. immutableObject becomeForward: mutableCopy.
such kind of things making immutability useless for protecting some critical parts of object memory, like preventing modification of compiled method literals: yes you cannot change the immutable object, but you can replace all pointers to it with it's own mutable copy, which is equivalent to making it mutable again.
Why should the VM allow become of an immutable object?
- Bert -
On 13 June 2012 09:41, Bert Freudenberg bert@freudenbergs.de wrote:
On 2012-06-13, at 05:27, Igor Stasenko wrote:
Another (open) question, is how to deal with immutability in presence of become, i.e.:
mutableCopy := immutableObject shallowCopy. immutableObject becomeForward: mutableCopy.
such kind of things making immutability useless for protecting some critical parts of object memory, like preventing modification of compiled method literals: yes you cannot change the immutable object, but you can replace all pointers to it with it's own mutable copy, which is equivalent to making it mutable again.
Why should the VM allow become of an immutable object?
You can disallow this. But you can only make it harder: I can do it manually - take all objects pointing to the immutable one and replace the pointer to it with its mutable copy. And it is completely legal, except that it will be a bit harder, since it is not done by a primitive.
Disallowing #become on immutables raises many additional questions:
What is your action when you need to migrate instances of a class due to its reshaping, while some of them are immutable? (I bet there are many other examples where this will break existing traditional schemes, like working with proxies etc.)
I don't wanna spread FUD.. just want to make sure that we are ready to answer every such question.
- Bert -
On 13 June 2012 09:41, Bert Freudenberg bert@freudenbergs.de wrote:
On 2012-06-13, at 05:27, Igor Stasenko wrote:
Another (open) question, is how to deal with immutability in presence of become, i.e.:
mutableCopy := immutableObject shallowCopy. immutableObject becomeForward: mutableCopy.
such kind of things making immutability useless for protecting some critical parts of object memory, like preventing modification of compiled method literals: yes you cannot change the immutable object, but you can replace all pointers to it with it's own mutable copy, which is equivalent to making it mutable again.
Why should the VM allow become of an immutable object?
You can disallow this. But you can only make it harder: i can do it manually - take all objects pointing to immutable one and replace the pointer to it with it's mutable copy. And it is completely legal, except that it will be a bit harder, since done not by primitive.
You are confusing something here. Become is just a "fancier form" of assignment that you could in fact write without a primitive by enumerating all the references to an object. If you keep that in mind it is obvious that since assigning an immutable *to* a variable is never a problem using become *with* an immutable as argument is neither since in both cases the immutable remains immutable.
The real question that arises is whether become should be allowed to change the *contents* of an immutable object. Personally, I think it should not, but this has some side effects that require fixing such as class migration which really should have a separate primitive to update its instances after reshape - the rules for this *should* include changing immutables unless you want to change the system to deal with multiple concurrent class versions (much pain down that path).
Disallowing #become on immutables raising many additional questions:
what is your action when you need to migrate instances of a class due to it's reshaping, while some of them are immutable?
Class migration should really have its own primitive. If it had, much pain could be avoided in migrating classes (see comments in ClassBuilder>>update:to:). And then one could decide on the proper policy to use for immutables.
Cheers, - Andreas
(I bet there is be many other examples, when this will break existing traditional schemes, like working with proxies etc).
I don't wanna spread FUD.. just want to make sure that we ready to answer for every such question.
- Bert -
-- Best regards, Igor Stasenko.
On 13 June 2012 15:16, Andreas Raab Andreas.Raab@gmx.de wrote:
On 13 June 2012 09:41, Bert Freudenberg bert@freudenbergs.de wrote:
On 2012-06-13, at 05:27, Igor Stasenko wrote:
Another (open) question, is how to deal with immutability in presence of become, i.e.:
mutableCopy := immutableObject shallowCopy. immutableObject becomeForward: mutableCopy.
such kind of things making immutability useless for protecting some critical parts of object memory, like preventing modification of compiled method literals: yes you cannot change the immutable object, but you can replace all pointers to it with it's own mutable copy, which is equivalent to making it mutable again.
Why should the VM allow become of an immutable object?
You can disallow this. But you can only make it harder: I can do it manually - take all objects pointing to the immutable one and replace each pointer to it with its mutable copy. And it is completely legal, except that it will be a bit harder, since it is done not by a primitive.
You are confusing something here. Become is just a "fancier form" of assignment that you could in fact write without a primitive by enumerating all the references to an object. If you keep that in mind, it is obvious that, since assigning an immutable *to* a variable is never a problem, using become *with* an immutable as the argument is not a problem either, since in both cases the immutable remains immutable.
But you just make it harder.. what prevents me from replacing all pointers to immutable (a), which holds a pointer to target immutable (t), with a mutable object (b) whose references point to a mutable (t')?
The real question that arises is whether become should be allowed to change the *contents* of an immutable object. Personally, I think it should not, but this has some side effects that require fixing such as class migration which really should have a separate primitive to update its instances after reshape - the rules for this *should* include changing immutables unless you want to change the system to deal with multiple concurrent class versions (much pain down that path).
This is where I usually stop. IMO this already shows too high a cost (to my taste) of introducing such restrictions and maintaining them. I would ask first: is this worth the effort, if it is so costly in terms of implementation? In theory, of course, we can enforce anything, but in practice that means a lot of complex code with many checks on the VM side.. This is not what I like to see in the first place, especially knowing that Squeak has lived well so far without any immutability and it does not feel like we miss it badly.
For me, if we introduce some feature or want to enforce certain behavior, then there should be a very strong reason for doing so, with benefits which clearly outweigh the costs and make some things easier to do - like with the arbitrary properties slot: there's very little additional complexity on the VM side, and it is very simple to use on the language side, replacing many crippling schemes, like properties in Morphic, Dependents in Object etc. In contrast, immutability adds another dimension of complexity to an already complex soup on both the VM and language sides, and the benefits are not so clear to me.
Disallowing #become on immutables raises many additional questions:
What is your action when you need to migrate instances of a class due to its reshaping, while some of them are immutable?
Class migration should really have its own primitive. If it had, much pain could be avoided in migrating classes (see comments in ClassBuilder>>update:to:). And then one could decide on the proper policy to use for immutables.
Cheers, - Andreas
(I bet there will be many other examples where this will break existing traditional schemes, like working with proxies etc.)
I don't want to spread FUD.. I just want to make sure that we are ready to answer every such question.
- Bert -
-- Best regards, Igor Stasenko.
Igor wrote:
On 13 June 2012 15:16, Andreas Raab Andreas.Raab@gmx.de wrote:
On 13 June 2012 09:41, Bert Freudenberg bert@freudenbergs.de wrote:
On 2012-06-13, at 05:27, Igor Stasenko wrote:
Another (open) question, is how to deal with immutability in presence of become, i.e.:
mutableCopy := immutableObject shallowCopy. immutableObject becomeForward: mutableCopy.
Such things make immutability useless for protecting critical parts of object memory, like preventing modification of compiled method literals: yes, you cannot change the immutable object, but you can replace all pointers to it with its own mutable copy, which is equivalent to making it mutable again.
Why should the VM allow become of an immutable object?
You can disallow this. But you can only make it harder: I can do it manually - take all objects pointing to the immutable one and replace each pointer to it with its mutable copy. And it is completely legal, except that it will be a bit harder, since it is done not by a primitive.
You are confusing something here. Become is just a "fancier form" of assignment that you could in fact write without a primitive by enumerating all the references to an object. If you keep that in mind, it is obvious that, since assigning an immutable *to* a variable is never a problem, using become *with* an immutable as the argument is not a problem either, since in both cases the immutable remains immutable.
But you just make it harder.. what prevents me from replacing all pointers to immutable (a), which holds a pointer to target immutable (t), with a mutable object (b) whose references point to a mutable (t')?
Absolutely nothing. Why should it? Assuming that all your "pointers to immutable a" are mutable themselves you can absolutely change them. Go right ahead. You *can* replace an immutable by a mutable object, it's trivial:
value := Object new beImmutable. value := value asMutableCopy.
See, done. So where is the problem? Whether you write it like that or whether you write it like here:
value := Object new beImmutable. value := value become: value asMutableCopy.
is entirely irrelevant. It doesn't change the fact that the original object is still immutable after the operation and that's the only thing that counts for the VM operation. Of course the following would be somewhat different:
immutable := (#abc -> Object new) asImmutable. value := immutable value beImmutable. "making the value immutable" value := value become: value asMutableCopy.
In this case, the following assertions should hold:
self deny: value == immutable value. "assoc was frozen, no changing its value" self assert: value isMutable. self deny: immutable value isImmutable.
So this all works out just fine. BTW, I think that discussing the use of become for hacking into the system is beyond the scope of this discussion. Become is used in very few situations and its semantics are necessarily such that one can do bad things with it. Duly noted, let's move on.
The real question that arises is whether become should be allowed to change the *contents* of an immutable object. Personally, I think it should not, but this has some side effects that require fixing such as class migration which really should have a separate primitive to update its instances after reshape - the rules for this *should* include changing immutables unless you want to change the system to deal with multiple concurrent class versions (much pain down that path).
This is where I usually stop. IMO this already shows too high a cost (to my taste) of introducing such restrictions and maintaining them. I would ask first: is this worth the effort, if it is so costly in terms of implementation? In theory, of course, we can enforce anything, but in practice that means a lot of complex code with many checks on the VM side.. This is not what I like to see in the first place, especially knowing that Squeak has lived well so far without any immutability and it does not feel like we miss it badly.
I absolutely do. There were several situations (for example in Croquet and at Teleplace) where we changed our designs to the worse merely due to the lack of immutability support. The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find. But it also gives rise to many other interesting techniques (read-only transactions etc).
Cheers, - Andreas
For me, if we introduce some feature or want to enforce certain behavior, then there should be a very strong reason for doing so, with benefits which clearly outweigh the costs and make some things easier to do - like with the arbitrary properties slot: there's very little additional complexity on the VM side, and it is very simple to use on the language side, replacing many crippling schemes, like properties in Morphic, Dependents in Object etc. In contrast, immutability adds another dimension of complexity to an already complex soup on both the VM and language sides, and the benefits are not so clear to me.
Disallowing #become on immutables raises many additional questions:
What is your action when you need to migrate instances of a class due to its reshaping, while some of them are immutable?
Class migration should really have its own primitive. If it had, much pain could be avoided in migrating classes (see comments in ClassBuilder>>update:to:). And then one could decide on the proper policy to use for immutables.
Cheers, - Andreas
(I bet there will be many other examples where this will break existing traditional schemes, like working with proxies etc.)
I don't want to spread FUD.. I just want to make sure that we are ready to answer every such question.
- Bert -
-- Best regards, Igor Stasenko.
-- Best regards, Igor Stasenko.
On 13 June 2012 16:32, Andreas Raab Andreas.Raab@gmx.de wrote:
Igor wrote:
On 13 June 2012 15:16, Andreas Raab Andreas.Raab@gmx.de wrote:
On 13 June 2012 09:41, Bert Freudenberg bert@freudenbergs.de wrote:
On 2012-06-13, at 05:27, Igor Stasenko wrote:
Another (open) question, is how to deal with immutability in presence of become, i.e.:
mutableCopy := immutableObject shallowCopy. immutableObject becomeForward: mutableCopy.
Such things make immutability useless for protecting critical parts of object memory, like preventing modification of compiled method literals: yes, you cannot change the immutable object, but you can replace all pointers to it with its own mutable copy, which is equivalent to making it mutable again.
Why should the VM allow become of an immutable object?
You can disallow this. But you can only make it harder: I can do it manually - take all objects pointing to the immutable one and replace each pointer to it with its mutable copy. And it is completely legal, except that it will be a bit harder, since it is done not by a primitive.
You are confusing something here. Become is just a "fancier form" of assignment that you could in fact write without a primitive by enumerating all the references to an object. If you keep that in mind, it is obvious that, since assigning an immutable *to* a variable is never a problem, using become *with* an immutable as the argument is not a problem either, since in both cases the immutable remains immutable.
But you just make it harder.. what prevents me from replacing all pointers to immutable (a), which holds a pointer to target immutable (t), with a mutable object (b) whose references point to a mutable (t')?
Absolutely nothing. Why should it? Assuming that all your "pointers to immutable a" are mutable themselves you can absolutely change them. Go right ahead. You *can* replace an immutable by a mutable object, it's trivial:
value := Object new beImmutable. value := value asMutableCopy.
See, done. So where is the problem? Whether you write it like that or whether you write it like here:
value := Object new beImmutable. value := value become: value asMutableCopy.
is entirely irrelevant. It doesn't change the fact that the original object is still immutable after the operation and that's the only thing that counts for the VM operation. Of course the following would be somewhat different:
immutable := (#abc -> Object new) asImmutable. value := immutable value beImmutable. "making the value immutable" value := value become: value asMutableCopy.
In this case, the following assertions should hold:
self deny: value == immutable value. "assoc was frozen, no changing its value" self assert: value isMutable. self deny: immutable value isImmutable.
So this all works out just fine. BTW, I think that discussing the use of become for hacking into the system is beyond the scope of this discussion. Become is used in very few situations and its semantics are necessarily such that one can do bad things with it. Duly noted, let's move on.
The real question that arises is whether become should be allowed to change the *contents* of an immutable object. Personally, I think it should not, but this has some side effects that require fixing such as class migration which really should have a separate primitive to update its instances after reshape - the rules for this *should* include changing immutables unless you want to change the system to deal with multiple concurrent class versions (much pain down that path).
This is where I usually stop. IMO this already shows too high a cost (to my taste) of introducing such restrictions and maintaining them. I would ask first: is this worth the effort, if it is so costly in terms of implementation? In theory, of course, we can enforce anything, but in practice that means a lot of complex code with many checks on the VM side.. This is not what I like to see in the first place, especially knowing that Squeak has lived well so far without any immutability and it does not feel like we miss it badly.
I absolutely do. There were several situations (for example in Croquet and at Teleplace) where we changed our designs to the worse merely due to the lack of immutability support. The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find.
I find this a very weak argument. You are talking about developing a project and fixing bugs.. Yes, I agree immutability can be useful for debugging. But for deployed & well-tested applications? Once you have found all the bugs and deployed your application, do you really think it is worth paying the price of checking all write operations, when you have already made sure that your app will behave correctly?
This is the same scenario as when I build a debug version of the VM which has many assertions and additional checks enabled, so I can find out what is going on. But when you deploy, you disable them for an obvious reason - performance - and because they are not needed once you prove that all preconditions hold, no matter what your code does.
As for *extremely* hard to find: I think the first thing you should address in such cases is the complexity of your application, so that you are always able to reason about its behavior and make sure it behaves "deterministically". Because if software grows so large that you need such crutches to figure out what's happening, immutability alone doesn't solve your problems; it can only indicate that you have problems with your design.
But it also gives rise to many other interesting techniques (read-only transactions etc).
Look, we have a completely open system in our hands. Do you think it is impossible to implement certain things w/o immutability?
Cheers, - Andreas
For me, if we introduce some feature or want to enforce certain behavior, then there should be a very strong reason for doing so, with benefits which clearly outweigh the costs and make some things easier to do - like with the arbitrary properties slot: there's very little additional complexity on the VM side, and it is very simple to use on the language side, replacing many crippling schemes, like properties in Morphic, Dependents in Object etc. In contrast, immutability adds another dimension of complexity to an already complex soup on both the VM and language sides, and the benefits are not so clear to me.
Disallowing #become on immutables raises many additional questions:
What is your action when you need to migrate instances of a class due to its reshaping, while some of them are immutable?
Class migration should really have its own primitive. If it had, much pain could be avoided in migrating classes (see comments in ClassBuilder>>update:to:). And then one could decide on the proper policy to use for immutables.
Cheers, - Andreas
(I bet there will be many other examples where this will break existing traditional schemes, like working with proxies etc.)
I don't want to spread FUD.. I just want to make sure that we are ready to answer every such question.
- Bert -
-- Best regards, Igor Stasenko.
-- Best regards, Igor Stasenko.
Re immutability, I'd like to point out that VisualWorks has had immutability for nearly seven years now, VisualAge has had immutability for even longer, and it's proved straightforward:
- literals are finally immutable, no accidentally changing literals and finding out that the application behaves inexplicably because the source no longer matches the method
- GemStone makes efficient use of the write barrier for its object paging facilities
- this is done with a framework that allows different applications to register managers for different objects so that the VM-level immutability support can be shared amongst different clients (GemStone, debug traps, etc)
So adding immutability isn't adding some radical poorly understood feature, but instead adding something with proven utility that adds safety.
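The shared VM-level support described above amounts to a check on every store. Here is a hedged VMMaker-style sketch; the selector and error names are assumptions for illustration, not VisualWorks' or GemStone's actual API:

```smalltalk
"Sketch of a guarded pointer store. isImmutable: would test the per-object
 header bit; on failure the attempted store is reported to the image, where
 a registered per-object manager (GemStone paging, a debug trap, ...) can
 decide whether to veto, log, or perform the write."
storePointer: fieldIndex ofObject: oop withValue: valueOop
	(self isImmutable: oop) ifTrue:
		[^self signalNoModificationOf: oop field: fieldIndex value: valueOop].
	^self unguardedStorePointer: fieldIndex ofObject: oop withValue: valueOop
```

The design point is that the VM only detects the attempted write; policy lives in the image, which is what lets different clients share one mechanism.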
In the new object format my design philosophy is similar to that of Cog. Be conservative; make sure the existing system is supported well. No experiments; this is a production VM (but that doesn't preclude supporting experiments, e.g. using Stefan's excellent idea of supporting an optional extra header word). Extend for performance; adding 64-bit support, immediate floats and immediate characters all improve performance, as does lazy become and the space and time savings due to the new object representation itself (class index instead of class reference).
None of this prevents folks from doing more radical things with it. But I'm not interested in implementing an experiment, I'm interested in engineering a significantly faster and more scalable VM. Hence for me, forcing a become to add an identityHash is too experimental and risky. I know that the elements I've expressed in the design work; they're all ideas that other systems have used, even if lazy become hasn't been used in Smalltalk VMs before (AFAIA).
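As an illustration of the "class index instead of class reference" saving mentioned above: class access becomes a table lookup on a small header field rather than following a per-object class pointer. The names and field widths below are assumptions based on the layout sketched earlier in the thread, not the final design:

```smalltalk
"Sketch: a 20-bit class index in the header selects an entry in the VM's
 class table, so no object carries a full class pointer and each class
 keeps one stable index (handy for inline caches). classIndexOf: extracts
 the header field; tagOf: extracts an immediate's tag bits."
fetchClassOf: oop
	(self isImmediate: oop) ifTrue:
		[^classTable at: (self tagOf: oop)]. "immediates map their tag to a class"
	^classTable at: (self classIndexOf: oop)
```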
On Wed, Jun 13, 2012 at 9:01 AM, Igor Stasenko siguctua@gmail.com wrote:
On 13 June 2012 16:32, Andreas Raab Andreas.Raab@gmx.de wrote:
Igor wrote:
On 13 June 2012 15:16, Andreas Raab Andreas.Raab@gmx.de wrote:
On 13 June 2012 09:41, Bert Freudenberg bert@freudenbergs.de wrote:
On 2012-06-13, at 05:27, Igor Stasenko wrote:
Another (open) question, is how to deal with immutability in presence of become, i.e.:
mutableCopy := immutableObject shallowCopy. immutableObject becomeForward: mutableCopy.
Such things make immutability useless for protecting critical parts of object memory, like preventing modification of compiled method literals: yes, you cannot change the immutable object, but you can replace all pointers to it with its own mutable copy, which is equivalent to making it mutable again.
Why should the VM allow become of an immutable object?
You can disallow this. But you can only make it harder: I can do it manually - take all objects pointing to the immutable one and replace each pointer to it with its mutable copy. And it is completely legal, except that it will be a bit harder, since it is done not by a primitive.
You are confusing something here. Become is just a "fancier form" of assignment that you could in fact write without a primitive by enumerating all the references to an object. If you keep that in mind, it is obvious that, since assigning an immutable *to* a variable is never a problem, using become *with* an immutable as the argument is not a problem either, since in both cases the immutable remains immutable.
But you just make it harder.. what prevents me from replacing all pointers to immutable (a), which holds a pointer to target immutable (t), with a mutable object (b) whose references point to a mutable (t')?
Absolutely nothing. Why should it? Assuming that all your "pointers to immutable a" are mutable themselves you can absolutely change them. Go right ahead. You *can* replace an immutable by a mutable object, it's trivial:
value := Object new beImmutable. value := value asMutableCopy.
See, done. So where is the problem? Whether you write it like that or whether you write it like here:
value := Object new beImmutable. value := value become: value asMutableCopy.
is entirely irrelevant. It doesn't change the fact that the original object is still immutable after the operation and that's the only thing that counts for the VM operation. Of course the following would be somewhat different:
immutable := (#abc -> Object new) asImmutable. value := immutable value beImmutable. "making the value immutable" value := value become: value asMutableCopy.
In this case, the following assertions should hold:
self deny: value == immutable value. "assoc was frozen, no changing its value" self assert: value isMutable. self deny: immutable value isImmutable.
So this all works out just fine. BTW, I think that discussing the use of become for hacking into the system is beyond the scope of this discussion. Become is used in very few situations and its semantics are necessarily such that one can do bad things with it. Duly noted, let's move on.
The real question that arises is whether become should be allowed to change the *contents* of an immutable object. Personally, I think it should not, but this has some side effects that require fixing such as class migration which really should have a separate primitive to update its instances after reshape - the rules for this *should* include changing immutables unless you want to change the system to deal with multiple concurrent class versions (much pain down that path).
This is where I usually stop. IMO this already shows too high a cost (to my taste) of introducing such restrictions and maintaining them. I would ask first: is this worth the effort, if it is so costly in terms of implementation? In theory, of course, we can enforce anything, but in practice that means a lot of complex code with many checks on the VM side.. This is not what I like to see in the first place, especially knowing that Squeak has lived well so far without any immutability and it does not feel like we miss it badly.
I absolutely do. There were several situations (for example in Croquet and at Teleplace) where we changed our designs to the worse merely due to the lack of immutability support. The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find.
I find this a very weak argument. You are talking about developing a project and fixing bugs.. Yes, I agree immutability can be useful for debugging. But for deployed & well-tested applications? Once you have found all the bugs and deployed your application, do you really think it is worth paying the price of checking all write operations, when you have already made sure that your app will behave correctly?
This is the same scenario as when I build a debug version of the VM which has many assertions and additional checks enabled, so I can find out what is going on. But when you deploy, you disable them for an obvious reason - performance - and because they are not needed once you prove that all preconditions hold, no matter what your code does.
As for *extremely* hard to find: I think the first thing you should address in such cases is the complexity of your application, so that you are always able to reason about its behavior and make sure it behaves "deterministically". Because if software grows so large that you need such crutches to figure out what's happening, immutability alone doesn't solve your problems; it can only indicate that you have problems with your design.
But it also gives rise to many other interesting techniques (read-only transactions etc).
Look, we have a completely open system in our hands. Do you think it is impossible to implement certain things w/o immutability?
Cheers, - Andreas
For me, if we introduce some feature or want to enforce certain behavior, then there should be a very strong reason for doing so, with benefits which clearly outweigh the costs and make some things easier to do - like with the arbitrary properties slot: there's very little additional complexity on the VM side, and it is very simple to use on the language side, replacing many crippling schemes, like properties in Morphic, Dependents in Object etc. In contrast, immutability adds another dimension of complexity to an already complex soup on both the VM and language sides, and the benefits are not so clear to me.
Disallowing #become on immutables raises many additional questions:
What is your action when you need to migrate instances of a class due to its reshaping, while some of them are immutable?
Class migration should really have its own primitive. If it had, much pain could be avoided in migrating classes (see comments in ClassBuilder>>update:to:). And then one could decide on the proper policy to use for immutables.
Cheers, - Andreas
(I bet there will be many other examples where this will break existing traditional schemes, like working with proxies etc.)
I don't want to spread FUD.. I just want to make sure that we are ready to answer every such question.
- Bert -
-- Best regards, Igor Stasenko.
-- Best regards, Igor Stasenko.
-- Best regards, Igor Stasenko.
On 13 June 2012 20:46, Eliot Miranda eliot.miranda@gmail.com wrote:
None of this prevents folks from doing more radical things with it. But I'm not interested in implementing an experiment, I'm interested in engineering a significantly faster and more scalable VM. Hence for me, forcing a become to add an identityHash is too experimental and risky. I know that the elements I've expressed in the design work; they're all ideas that other systems have used, even if lazy become hasn't been used in Smalltalk VMs before (AFAIA).
I refined my proposal; see my other mail where I propose to always have an extra slot right after the header. (You commented on the first part of it, but did not comment on the second.)
Igor wrote:
On 13 June 2012 16:32, Andreas Raab Andreas.Raab@gmx.de wrote:
I absolutely do. There were several situations (for example in Croquet and at Teleplace) where we changed our designs to the worse merely due to the lack of immutability support. The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find.
I find this a very weak argument. You are talking about developing a project and fixing bugs.. Yes, I agree immutability can be useful for debugging. But for deployed & well-tested applications? Once you have found all the bugs and deployed your application, do you really think it is worth paying the price of checking all write operations, when you have already made sure that your app will behave correctly?
First of all, the same can be said for array bounds checking. There are very good reasons to leave it on, and the very same reasons hold for immutability. But more importantly, testing is *never* done until "all bugs are found". Testing is done until boredom overcomes fear. And nothing more. You're expressing a rather academic perspective here.
As for *extremely* hard to find: I think the first thing you should address in such cases is the complexity of your application, so that you are always able to reason about its behavior and make sure it behaves "deterministically". Because if software grows so large that you need such crutches to figure out what's happening, immutability alone doesn't solve your problems; it can only indicate that you have problems with your design.
So what you're saying is that we should've just made our application simpler? Gee, I guess we never thought of that! Now I finally understand why Stéphane is paying you the big bucks to work on Pharo :-) Seriously, Igor, academia is not becoming to you, you need to get back into the real world. Preferably into a startup with real pressure to ship something. Because right now you're mostly just talking out of your rear end. In particular considering that one thing that would have made our application considerably simpler would have been immutability.
But it also gives rise to many other interesting techniques (read-only transactions etc).
Look, we have a completely open system in our hands. Do you think it is impossible to implement certain things w/o immutability?
Well, by the end of the day everything is possible. Even implementing immutability without VM support, for example by copying all the classes and recompiling them to raise errors when stored into them. But the problems from that solution are manifold, from class hierarchy issues, to special VM objects etc. Once you go through the exercise (which I did once) you start thinking that it would be so easy and simple to implement immutability in the VM. And it's a fact that in our system having immutability would have made it simpler and faster. But then again, I have no skin in this game any longer, so take my comments with whatever amount of salt you'd like.
Cheers, - Andreas
On 2012-06-14, at 12:53, Andreas Raab wrote:
Igor wrote:
On 13 June 2012 16:32, Andreas Raab Andreas.Raab@gmx.de wrote:
I absolutely do. There were several situations (for example in Croquet and at Teleplace) where we changed our designs to the worse merely due to the lack of immutability support. The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find.
I find this a very weak argument. You are talking about developing a project and fixing bugs.. Yes, I agree immutability can be useful for debugging. But for deployed & well-tested applications? Once you have found all the bugs and deployed your application, do you really think it is worth paying the price of checking all write operations, when you have already made sure that your app will behave correctly?
First of all, the same can be said for array bounds checking. There are very good reasons to leave it on, and the very same reasons hold for immutability. But more importantly, testing is *never* done until "all bugs are found". Testing is done until boredom overcomes fear. And nothing more. You're expressing a rather academic perspective here.
As for *extremely* hard to find: I think the first thing you should address in such cases is the complexity of your application, so that you are always able to reason about its behavior and make sure it behaves "deterministically". Because if software grows so large that you need such crutches to figure out what's happening, immutability alone doesn't solve your problems; it can only indicate that you have problems with your design.
So what you're saying is that we should've just made our application simpler? Gee, I guess we never thought of that! Now I finally understand why Stéphane is paying you the big bucks to work on Pharo :-) Seriously, Igor, academia is not becoming to you, you need to get back into the real world. Preferably into a startup with real pressure to ship something. Because right now you're mostly just talking out of your rear end. In particular considering that one thing that would have made our application considerably simpler would have been immutability.
But it also gives rise to many other interesting techniques (read-only transactions etc).
Look, we have a completely open system in our hands. Do you think it is impossible to implement certain things without immutability?
Well, at the end of the day everything is possible, even implementing immutability without VM support, for example by copying all the classes and recompiling them to raise errors on stores. But the problems with that solution are manifold, from class hierarchy issues to special VM objects etc. Once you go through the exercise (which I did once) you realize that it would be so easy and simple to implement immutability in the VM. And it's a fact that in our system having immutability would have made it simpler and faster. But then again, I have no skin in this game any longer, so take my comments with whatever amount of salt you'd like.
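The "easy and simple in the VM" point is that the whole check comes down to one bit test in the store primitive. A minimal C sketch of that idea, with a hypothetical header layout (the bit position and every name here are illustrative, not the actual Cog format):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative object header with an immutability bit, loosely in the
 * spirit of the header sketch earlier in the thread. Hypothetical. */
#define IMMUTABLE_BIT (1u << 23)

typedef struct {
    uint32_t header;    /* format bits, hash, class index, ... */
    intptr_t slots[4];  /* instance variables */
} Obj;

/* The write barrier the VM would run on every inst-var store: a single
 * bit test before the store. Returns 0 on success, -1 if the object is
 * immutable (where a real VM would signal a language-level error). */
int store_slot(Obj *o, int index, intptr_t value) {
    if (o->header & IMMUTABLE_BIT)
        return -1;
    o->slots[index] = value;
    return 0;
}

void make_immutable(Obj *o) { o->header |= IMMUTABLE_BIT; }
```

The cheapness of this single, well-predicted test is consistent with the small overheads claimed later in the thread for VisualWorks.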
Cheers,
- Andreas
It should be possible to disagree and still keep the discussion civilized. Please?
I trust Eliot will take all this with a lump of his preferred mineral and come up with something good ;)
- Bert -
It should be possible to disagree and still keep the discussion civilized. Please?
My apologies. It wasn't meant to be uncivilized - just a bit of a gut reaction to "you know, you should just make your application a little simpler, then your need for immutability would go away" (*rolling my eyes*) I know comments like that from some of our so-called Engineering VPs in my last life and I might react a bit allergic to them. Apologies again.
Cheers, - Andreas
I trust Eliot will take all this with a lump of his preferred mineral and come up with something good ;)
- Bert -
On 14 June 2012 14:58, Andreas Raab Andreas.Raab@gmx.de wrote:
It should be possible to disagree and still keep the discussion civilized. Please?
My apologies. It wasn't meant to be uncivilized - just a bit of a gut reaction to "you know, you should just make your application a little simpler, then your need for immutability would go away" (*rolling my eyes*) I know comments like that from some of our so-called Engineering VPs in my last life and I might react a bit allergic to them. Apologies again.
I was trying to discuss better solutions which may not require immutability. I did not want to teach you about programming or whatever, but to point out that there is no silver bullet: problems in your design won't magically disappear once you have immutability.
But your reaction can be expressed as: (rolling eyes) what are you talking about?!?! there's nothing better, period (end rolling eyes). So I stand off; I cannot continue discussing with such a stubborn attitude.
Cheers, - Andreas
I trust Eliot will take all this with a lump of his preferred mineral and come up with something good ;)
- Bert -
-- Empfehlen Sie GMX DSL Ihren Freunden und Bekannten und wir belohnen Sie mit bis zu 50,- Euro! https://freundschaftswerbung.gmx.de
(final comment)
On 14 June 2012 14:58, Andreas Raab Andreas.Raab@gmx.de wrote:
It should be possible to disagree and still keep the discussion civilized. Please?
My apologies. It wasn't meant to be uncivilized - just a bit of a gut reaction to "you know, you should just make your application a little simpler, then your need for immutability would go away" (*rolling my eyes*) I know comments like that from some of our so-called Engineering VPs in my last life and I might react a bit allergic to them. Apologies again.
I was trying to discuss a better solutions which may not require immutability. I did not wanted to teach you about programming whatever, but to point out that there is no silver bullet: a problems in your design won't magically disappear once you will have immutability.
But your reaction can be expressed as: (rolling eyes) what are you talking about?!?!
What my rolling eyes mean is that the argument is a non sequitur. The need for immutability is independent of the premise about how complex the system is, therefore the argument is a fallacy. Anyway, I'm done here. Over and out.
Cheers, - Andreas
there's nothing better. period (end rolling eyes). so i stand off.. i cannot continue discussing with such stubborn attitude.
Cheers, - Andreas
I trust Eliot will take all this with a lump of his preferred mineral and come up with something good ;)
- Bert -
-- Best regards, Igor Stasenko.
On 14 June 2012 18:59, Andreas Raab Andreas.Raab@gmx.de wrote:
(final comment)
On 14 June 2012 14:58, Andreas Raab Andreas.Raab@gmx.de wrote:
It should be possible to disagree and still keep the discussion civilized. Please?
My apologies. It wasn't meant to be uncivilized - just a bit of a gut reaction to "you know, you should just make your application a little simpler, then your need for immutability would go away" (*rolling my eyes*) I know comments like that from some of our so-called Engineering VPs in my last life and I might react a bit allergic to them. Apologies again.
I was trying to discuss a better solutions which may not require immutability. I did not wanted to teach you about programming whatever, but to point out that there is no silver bullet: a problems in your design won't magically disappear once you will have immutability.
But your reaction can be expressed as: (rolling eyes) what are you talking about?!?!
What my rolling eyes mean is that the argument is non sequitor.
I don't know what is sequitor, and i cannot find it in vocabulary/translator so i cannot understand.
The need for immutability is independent of the premise about how complex the system is, therefore the argument is a fallacy. Anyway, I'm done here. Over and out.
Agreed, but this is exactly what I tried to show you when you stated: --------- The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find. --------- If something is extremely hard to find, I understand this as a complex system which is hard to manage and reason about (otherwise why would it be extremely hard?)
So, as I understood it, you are advocating the need for immutability by demonstrating how it can help to find design flaws in *complex* systems. And I agree that it helps, but I found this argument very weak, because immutability offers almost nothing in terms of having less complex systems. Especially when "The need for immutability is independent of the premise about how complex the system is".
so why we need it, again? :)
Cheers, - Andreas
there's nothing better. period (end rolling eyes). so i stand off.. i cannot continue discussing with such stubborn attitude.
Cheers, - Andreas
I trust Eliot will take all this with a lump of his preferred mineral and come up with something good ;)
- Bert -
What I really don't understand is why my opponents readily want to sacrifice performance in order to deal with the consequences of having complex systems which are hard to reason about, while at the same time completely opposing proposals for features which would help reduce complexity in the first place, like adding a slot for arbitrary properties.
Sounds like you prefer to deal with the consequences rather than the cause..
On Thu, Jun 14, 2012 at 11:17 AM, Igor Stasenko siguctua@gmail.com wrote:
What i really don't understand is why my opponents readily want to sacrifice the performance in order to deal with consequences of having complex systems, when its hard to reason about it, and at same time completely opposed to proposal of adding features which will help to reduce complexity in a first place, like adding slot for having arbitrary properties.
You're putting up a straw man. You *think* the performance cost of immutability is an issue, but my experience tells me it isn't. I've implemented it before. So please stop raising an invalid objection.
Sounds, like you prefer to deal with consequences rather than with cause..
Bah, humbug.
-- Best regards, Igor Stasenko.
On 14 June 2012 23:47, Eliot Miranda eliot.miranda@gmail.com wrote:
On Thu, Jun 14, 2012 at 11:17 AM, Igor Stasenko siguctua@gmail.com wrote:
What i really don't understand is why my opponents readily want to sacrifice the performance in order to deal with consequences of having complex systems, when its hard to reason about it, and at same time completely opposed to proposal of adding features which will help to reduce complexity in a first place, like adding slot for having arbitrary properties.
You're putting up a straw man. You *think* performance of immutability is an issue, but my experience tells me it isn't. I've implemented it before. So please stop raising an invalid objection.
Remember what I was told when I implemented language-side scheduling, removing the need for the VM to even know what a Semaphore is? I was told *it is slow*. And this was the *only* argument against it, why it was found unacceptable. So why is my objection invalid while yours are valid? I am playing the same game here, using the same argument.
Of course I can just shut up.. but then tell me to shut up; don't say that I am raising invalid objections.
Hello
2012/6/15 Igor Stasenko siguctua@gmail.com
On 14 June 2012 23:47, Eliot Miranda eliot.miranda@gmail.com wrote:
On Thu, Jun 14, 2012 at 11:17 AM, Igor Stasenko siguctua@gmail.com wrote:
What i really don't understand is why my opponents readily want to sacrifice the performance in order to deal with consequences of having complex systems, when its hard to reason about it, and at same time completely opposed to proposal of adding features which will help to reduce complexity in a first place, like adding slot for having arbitrary properties.
You're putting up a straw man. You *think* performance of immutability is an issue, but my experience tells me it isn't. I've implemented it before. So please stop raising an invalid objection.
Remember, what i have been told when i implemented a language-side scheduling, removing the need of VM to even know that is Semaphore? I been told *it is slow*. And this was the *only* argument against it, why it is found unacceptable.
Did you try your scheduling implementation on Cog? It would be very interesting to see the difference in performance.
On 15 June 2012 09:11, Denis Kudriashov dionisiydk@gmail.com wrote:
Hello
2012/6/15 Igor Stasenko siguctua@gmail.com
On 14 June 2012 23:47, Eliot Miranda eliot.miranda@gmail.com wrote:
On Thu, Jun 14, 2012 at 11:17 AM, Igor Stasenko siguctua@gmail.com wrote:
What i really don't understand is why my opponents readily want to sacrifice the performance in order to deal with consequences of having complex systems, when its hard to reason about it, and at same time completely opposed to proposal of adding features which will help to reduce complexity in a first place, like adding slot for having arbitrary properties.
You're putting up a straw man. You *think* performance of immutability is an issue, but my experience tells me it isn't. I've implemented it before. So please stop raising an invalid objection.
Remember, what i have been told when i implemented a language-side scheduling, removing the need of VM to even know that is Semaphore? I been told *it is slow*. And this was the *only* argument against it, why it is found unacceptable.
Did you try your scheduling implementation on Cog? It would be very interesting to see the difference in performance.
I'm not sure, but if I remember correctly, Cog bakes in even more assumptions (read 'contracts') about Processes and scheduling. It could still be possible, but it is hard to find the motivation to do it when others don't see its real value and instead throw it out based on an "unproven" argument.
It is the same thing again: I am trying to swim against the flow and reduce the responsibilities of the VM, while most people are doing the complete opposite, where adding ever more contracts between the VM and the language is considered a good thing (tm).
The more early-bound semantics you put into the language, the more static it becomes, and if we don't stop going that way, one day we will find ourselves programming in C, just with an "exotic" Smalltalk syntax. So why not just give up and switch to C? I know it is sometimes very hard to provide some functionality without reaching for the early-binding hammer, and the reasons can be, of course, performance and complexity. Let me remind you that "Hardware is really just software crystallized early." And what is a virtual machine to a language, if not hardware (the word Machine speaks for itself)? That's why we should think twice before crystallizing something, because once it is carved in stone it is much harder to change and evolve.
That's why I am raising questions about immutability in particular. My view of the VM's role is that of a servant which can optimize certain things, not a dictator standing in your way, telling you what is possible and what is not.
Remember, what i have been told when i implemented a language-side scheduling, removing the need of VM to even know that is Semaphore? I been told *it is slow*. And this was the *only* argument against it, why it is found unacceptable.
I think you might be misremembering. IIRC, the real argument was the risk (albeit expressed as performance concerns) of replacing VM-level scheduling by image-level scheduling without further ado. I don't recall that anyone had an objection to image-level scheduling as an option in addition to VM-level scheduling. In which case one can experiment with the implications and learn from the change in the environment without necessarily committing the production systems to an unproven feature.
Cheers, - Andreas
On 15 June 2012 10:48, Andreas Raab Andreas.Raab@gmx.de wrote:
Remember, what i have been told when i implemented a language-side scheduling, removing the need of VM to even know that is Semaphore? I been told *it is slow*. And this was the *only* argument against it, why it is found unacceptable.
I think you might be misremembering. IIRC, the real argument was the risk (albeit expressed as performance concerns) of replacing VM-level scheduling by image-level scheduling without further ado. I don't recall that anyone had an objection to image-level scheduling as an option in addition to VM-level scheduling. In which case one can experiment with the implications and learn from the change in the environment without necessarily committing the production systems to an unproven feature.
But that's what I did: I made a VM which remains compatible with the existing scheduling policy but adds a way to have language-side scheduling. You simply switch the scheduler object and voila, you have image-side scheduling. So if you are not sure about the unproven part, you can keep running the "proven" one..
I am sure you know that you cannot make an omelet without breaking eggs. Needless to say (and I'm not going to repeat it here) what benefits language-side scheduling provides compared to hardcoded semantics you are forced to rely on.
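The "simply switch the scheduler object" design can be sketched as a pluggable policy hook: the VM dispatches through a pointer whose default is the built-in ("proven") policy, and the image installs its own when it wants language-side scheduling. All names here are illustrative, not the actual implementation:

```c
#include <assert.h>

/* Hypothetical runnable processes and a pluggable scheduling policy. */
typedef struct { int priority; int id; } Proc;

typedef int (*PickNext)(const Proc *runnable, int n);  /* returns an index */

/* Built-in highest-priority-first policy: the "proven" default. */
static int pick_highest_priority(const Proc *r, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (r[i].priority > r[best].priority)
            best = i;
    return best;
}

/* A naive round-robin policy an image-side scheduler might install. */
static int rr_cursor = 0;
static int pick_round_robin(const Proc *r, int n) {
    (void)r;
    return rr_cursor++ % n;
}

static PickNext vm_policy = pick_highest_priority;

/* The VM always schedules through the current policy... */
int vm_schedule(const Proc *runnable, int n) { return vm_policy(runnable, n); }

/* ...and "switching the scheduler object" is just replacing it. */
void vm_install_policy(PickNext p) { vm_policy = p; }
```

The point of the design is exactly what Igor argues: the contract the VM hardcodes shrinks to "call the policy", and the policy itself becomes late-bound.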
Remember, what i have been told when i implemented a language-side scheduling, removing the need of VM to even know that is Semaphore? I been told *it is slow*. And this was the *only* argument against it, why it is found unacceptable.
I think you might be misremembering. IIRC, the real argument was the risk (albeit expressed as performance concerns) of replacing VM-level scheduling by image-level scheduling without further ado. I don't recall that anyone had an objection to image-level scheduling as an option in addition to VM-level scheduling. In which case one can experiment with the implications and learn from the change in the environment without necessarily committing the production systems to an unproven feature.
But that's what i did, i made VM which remain compatible to existing scheduling policy, but adds a way to have a language-side scheduling, you simply switching the scheduler object, and voila, you got an image-side scheduling. So, if you not sure about unproven part, you can keep running using a "proven" one..
Right, I misremembered the implementation. But when I re-read the discussion just now, there was not a single voice raised against your proposal. Not one. There was only applause, encouragement and excitement. People did comment on performance, but in no way rejected the implementation for it. Here is what I searched for:
http://www.google.com/#q=%22new+scheduler%22+site:lists.squeakfoundation.org
Is it perhaps possible that you simply forgot to push it all the way through to VMMaker? My search does not find any results on vm-dev, which to me indicates that the code was probably never "on the table" for inclusion.
Cheers, - Andreas
On 15 June 2012 17:26, Andreas Raab Andreas.Raab@gmx.de wrote:
Remember, what i have been told when i implemented a language-side scheduling, removing the need of VM to even know that is Semaphore? I been told *it is slow*. And this was the *only* argument against it, why it is found unacceptable.
I think you might be misremembering. IIRC, the real argument was the risk (albeit expressed as performance concerns) of replacing VM-level scheduling by image-level scheduling without further ado. I don't recall that anyone had an objection to image-level scheduling as an option in addition to VM-level scheduling. In which case one can experiment with the implications and learn from the change in the environment without necessarily committing the production systems to an unproven feature.
But that's what i did, i made VM which remain compatible to existing scheduling policy, but adds a way to have a language-side scheduling, you simply switching the scheduler object, and voila, you got an image-side scheduling. So, if you not sure about unproven part, you can keep running using a "proven" one..
Right. I misremembered the implementation. But when I was re-reading the discussion just now, there was not a single voice being raised against your proposal. Not one. There was only applause and encouragement and excitement. People did comment on performance but in no way rejecting the implementation for it. Here is what searched for:
http://www.google.com/#q=%22new+scheduler%22+site:lists.squeakfoundation.org
Is it perhaps possible that you simply forgot to push it all the way through to VMMaker? My search does not find any results on vm-dev, which to me indicates that the code was probably never "on the table" for inclusion.
Perhaps I misinterpreted the feedback on my work (I will take time to re-read the mail exchange). But I stopped mainly because it was done and ready for integration. The VM was working and everything was there.
Why did I not push it to VMMaker? Because I expected a _political_ decision on whether we do it or not, and that decision was never made. And for sure, I never assumed I was in a position to force certain changes into the VM without asking the community. And since my own choice was obvious (otherwise why would I have spent time implementing all of it?), the final decision apparently could not be mine.
So, IIRC, I didn't see any response like "Yes, we should integrate that"; what I saw was "yeah, it _could_ be an option". I am not very good at English, but I can certainly tell the difference between _could_ and _must_.
And of course, at any moment I can revive this work and synchronize it with the Cog VM, if there is a strong opinion from the majority that this is something we want to have, and not just "well, maybe, perhaps, one day.. when you walk over my dead body" :)
Cheers, - Andreas
On 6/14/12, Igor Stasenko siguctua@gmail.com wrote:
On 14 June 2012 18:59, Andreas Raab Andreas.Raab@gmx.de wrote:
(final comment)
On 14 June 2012 14:58, Andreas Raab Andreas.Raab@gmx.de wrote:
It should be possible to disagree and still keep the discussion civilized. Please?
My apologies. It wasn't meant to be uncivilized - just a bit of a gut reaction to "you know, you should just make your application a little simpler, then your need for immutability would go away" (*rolling my eyes*) I know comments like that from some of our so-called Engineering VPs in my last life and I might react a bit allergic to them. Apologies again.
I was trying to discuss a better solutions which may not require immutability. I did not wanted to teach you about programming whatever, but to point out that there is no silver bullet: a problems in your design won't magically disappear once you will have immutability.
But your reaction can be expressed as: (rolling eyes) what are you talking about?!?!
What my rolling eyes mean is that the argument is non sequitor.
I don't know what is sequitor, and i cannot find it in vocabulary/translator so i cannot understand.
Maybe you can use Google. For me it is the third entry :-)
http://en.wikipedia.org/wiki/Non_sequitur_%28logic%29
HJH
The need for immutability is independent of the premise about how complex the system is, therefore the argument is a fallacy. Anyway, I'm done here. Over and out.
agree, but this is exactly what i tried to show you when you stated:
The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find.
If something extremely hard to find -> i understand this as a complex system which hard to manage and reason about (otherwise why it would be extremely hard?)
So, as i understood, you advocating the need for immutability by demonstrating how it can help to find flaws in design in *complex* systems.. And i agree that it helps, but found this advocacy argument very weak, because immutability offers almost nothing in terms of having less complex systems. And especially, when "The need for immutability is independent of the premise about how complex the system is"
so why we need it, again? :)
Cheers,
- Andreas
there's nothing better. period (end rolling eyes). so i stand off.. i cannot continue discussing with such stubborn attitude.
Cheers,
- Andreas
I trust Eliot will take all this with a lump of his preferred mineral and come up with something good ;)
- Bert -
-- Best regards, Igor Stasenko.
On 14 June 2012 12:53, Andreas Raab Andreas.Raab@gmx.de wrote:
Igor wrote:
On 13 June 2012 16:32, Andreas Raab Andreas.Raab@gmx.de wrote:
I absolutely do. There were several situations (for example in Croquet and at Teleplace) where we changed our designs to the worse merely due to the lack of immutability support. The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find.
I finding this very weak argument. You are talking about developing project and bug fixing.. Yes i agree immutability can be useful for debugging. But for deployed & well tested applications? Once you found all bugs, and deploy your application, do you really think it is worth paying the price for checking all write operations, when you already made sure that your app will behave correctly?
First of all, the same can be said for array bounds checking. There are very good reasons to leave it on, and the very same reasons hold for immutability. But more importantly, testing is *never* done until "all bugs are found". Testing is done until boredom overcomes fear. And nothing more. You're expressing a rather academic perspective here.
Sure, because we're talking about object format design. I am just not sure that the benefits of immutability are so great that they outweigh the performance overhead. And I think we should not stop looking for better solutions.
As for *extremely* hard to find, i think first thing which you should address in such cases is the complexity of your application, to always be able to reason about it's behavior and make sure it behaves "deterministically", because if software grows too large up to the point that you need such kind of crutches to figure out what's happening, immutability alone doesn't solve your problems, it can only indicate that you have a problems with your design.
So what you're saying is that we should've just made our application simpler? Gee, I guess we never thought of that! Now I finally understand why Stefane is paying you the big bucks to work on Pharo :-)
How nice of you, Andreas. Did you sleep badly? I just said that using immutability to _detect_ problems in your software is not the same as using immutability to address certain problems.
Seriously, Igor, academia is not becoming to you, you need to get back into the real world. Preferredly into a startup with real pressure to ship something.
Yeah, a good example of such work would be JavaScript, which was made in two weeks. Nice suggestion, but thanks, no.
Trust me, I worked long enough in industry and know what you are talking about. But I don't see any parallels to what we are discussing here; it is completely orthogonal. There is an enormous amount of crap in software produced because of such an approach, and Squeak and its VM are no exception. If you think that immutability can help you produce less crappy software, then I am afraid I have to disappoint you: no, the amount of crap will remain the same, and even increase. Because the more concepts you put into a domain, the more complex it becomes, and in a non-linear progression, for sure.
If we want to improve that, then it is better to take time and think twice before cutting (and yeah, read some of those hated academic papers, if you would like to know whether other or better solutions to your problems exist, because (what a surprise!!) many people run into the same problems as you do).
Because, right now you're mostly just talking out of your rear end. In particular considering that one thing that would have made our application considerably simpler would have been immutability.
Care to share a practical example? Because to me your attitude to this flag sounds like a "solve all my problems = 1" bit in the object header.
But it also gives rise to many other interesting techniques (read-only transactions etc).
Look, we have completely open system in our hands. Do you think it is impossible to implement certain things w/o immutability?
Well, by the end of the day everything is possible. Even implementing immutability without VM support, for example by copying all the classes and recompiling them to raise errors when stored into them. But the problems from that solution are manifold, from class hierarchy issues, to special VM objects etc. Once you go through the exercise (which I did once) you start thinking that it would be so easy and simple to implement immutability in the VM. And it's a fact that in our system having immutability would have made it simpler and faster.
That's what I think too. But why stop there and not think further: what if there is a solution which can make things even simpler and doesn't require immutability? That is where I am trying to drive the discussion. Now, if you keep insisting that immutability solves all your problems, and you don't care to think about better alternatives (because you have a boss with a whip who will execute you if you don't deliver on time), then there is nothing to discuss.
But then again, I have no skin in this game any longer, so take my comments with whatever amount of salt you'd like.
Cheers, - Andreas
On 14 Jun 2012, at 18:03, Igor Stasenko wrote:
Sure. Because we're talking about object format design. I just not sure that benefits of immutability so great that they outweigh the performance overhead. And i thinking we should not stop looking for better solutions.
If performance is your main concern, then do not implement it with a header bit, but use what your hardware provides you with: memory protection.
That will make your GC more complex, but if you move all protected objects onto the same page, you can make it read and/or write protected. In case it is only write protected, reads will be as fast as for every other object.
If you write to it, you'll have to catch a signal, and can propagate it to the language level. To write such an object, just map the same page again at another address, with a nice offset and redirect the writes there.
I don't think such a solution will make the VM maintainers' lives a lot easier, but at least you won't have to suffer your performance hit. And you will not actually need a header bit either...
More inspiration here:
Ribbons: a Partially Shared Memory Programming Model http://www.cs.purdue.edu/homes/peugster/Ribbons/RJ.pdf
Best regards Stefan
On Thu, Jun 14, 2012 at 9:03 AM, Igor Stasenko siguctua@gmail.com wrote:
On 14 June 2012 12:53, Andreas Raab Andreas.Raab@gmx.de wrote:
Igor wrote:
On 13 June 2012 16:32, Andreas Raab Andreas.Raab@gmx.de wrote:
I absolutely do. There were several situations (for example in Croquet and at Teleplace) where we changed our designs to the worse merely due to the lack of immutability support. The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find.
I finding this very weak argument. You are talking about developing project and bug fixing.. Yes i agree immutability can be useful for debugging. But for deployed & well tested applications? Once you found all bugs, and deploy your application, do you really think it is worth paying the price for checking all write operations, when you already made sure that your app will behave correctly?
First of all, the same can be said for array bounds checking. There are very good reasons to leave it on, and the very same reasons hold for immutability. But more importantly, testing is *never* done until "all bugs are found". Testing is done until boredom overcomes fear. And nothing more. You're expressing a rather academic perspective here.
Sure. Because we're talking about object format design. I just not sure that benefits of immutability so great that they outweigh the performance overhead.
But do you know what the performance overhead is? In VisualWorks it was < 5% over a wide range of benchmarks (e.g. real benchmarks such as recompiling the system).
And i thinking we should not stop looking for better solutions.
Of course. But for now immutability is in because I, and many others with real experience in the Smalltalk world, think it more than pays its way.
As for *extremely* hard to find, i think first thing which you should address in such cases is the complexity of your application, to always be able to reason about it's behavior and make sure it behaves "deterministically", because if software grows too large up to the point that you need such kind of crutches to figure out what's happening, immutability alone doesn't solve your problems, it can only indicate that you have a problems with your design.
So what you're saying is that we should've just made our application simpler? Gee, I guess we never thought of that! Now I finally understand why Stefane is paying you the big bucks to work on Pharo :-)
How nice of you, Andreas. You had bad sleep? I just said that using immutability to _detect_ problems in your software is not the same as using immutablity to address certain problems.
Seriously, Igor, academia is not becoming to you, you need to get back into the real world. Preferredly into a startup with real pressure to ship something.
Yeah, a good example of such work will be JavaScript, which made in two weeks. Nice suggestion, but thanks. No.
Trust me, I worked long enough in industry to know what you are talking about. But I don't see any parallels to what we are discussing here; it is completely orthogonal. There is an enormous amount of crap produced in software because of that approach, and Squeak and its VM are not an exception. If you think immutability can help you produce less crappy software, then I'm afraid I have to disappoint you: no, the amount of crap will remain the same, or even increase. Because the more concepts you put into the domain, the more complex it becomes, and in a non-linear progression, for sure.
Not so. Just avoiding updates to supposedly immutable literals is a win. When immutability was added to VisualWorks, many errors were found in the base classes, the most typical being related to `^'' writeStream`.
If we want to improve that, then it is better to take the time and think twice before cutting (and yeah, read some of those hated academic papers, if you'd like to know whether other, better solutions to your problems exist, because, what a surprise, many people have met the same problems as you).
Because, right now you're mostly just talking out of your rear end. In particular considering that one thing that would have made our application considerably simpler would have been immutability.
Care to share a practical example? Because to me your attitude to this flag sounds like a "solve-all-my-problems = 1" bit in the object header.
But it also gives rise to many other interesting techniques (read-only transactions etc).
Look, we have a completely open system in our hands. Do you think it is impossible to implement certain things without immutability?
Well, at the end of the day everything is possible. Even implementing immutability without VM support, for example by copying all the classes and recompiling them to raise errors on stores. But the problems with that solution are manifold, from class hierarchy issues to special VM objects etc. Once you go through the exercise (which I did once) you start thinking that it would be so much easier and simpler to implement immutability in the VM. And it's a fact that in our system having immutability would have made things simpler and faster.
That's what I think too. But why stop there and not think further: what if there is a solution which makes things even simpler and doesn't require immutability? That is where I am trying to drive the discussion. Now, if you keep insisting that immutability solves all your problems, and you don't care to think about better alternatives (because you have a boss with a whip who will execute you if you don't deliver on time), then there's nothing to discuss.
But then again, I have no skin in this game any longer, so take my comments with whatever amount of salt you'd like.
Cheers,
- Andreas
-- Best regards, Igor Stasenko.
On 14 June 2012 23:44, Eliot Miranda eliot.miranda@gmail.com wrote:
On Thu, Jun 14, 2012 at 9:03 AM, Igor Stasenko siguctua@gmail.com wrote:
On 14 June 2012 12:53, Andreas Raab Andreas.Raab@gmx.de wrote:
Igor wrote:
On 13 June 2012 16:32, Andreas Raab Andreas.Raab@gmx.de wrote:
I absolutely do. There were several situations (for example in Croquet and at Teleplace) where we changed our designs to the worse merely due to the lack of immutability support. The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find.
[snip]
But do you know what the performance overhead is? In VisualWorks it was < 5% over a wide range of benchmarks (e.g. real benchmarks such as recompiling the system).
Yes, I remember that number.
But let's pose the *right* question: guys, we have an extra 5% of performance to spend; which feature would you like to spend it on?
Are you sure that immutability would be the absolute winner?
[snip]
Not so. Just avoiding updating supposedly immutable literals is a win. When immutability was added to VisualWorks many errors in the base classes were found. The most typical being related to ^'' writeStream.
Yes, this is what Andreas says. And still, I don't buy it (not because it's from Andreas, for sure ;).
Let me repeat the same silly, bullying, idiotic question again: suppose you found all those errors and fixed them. What next? Your system runs well, does everything correctly, and... you still keep paying the price of an extra check per write?
I see two major areas where immutability is useful:
A) to detect & fix bugs, as you and Andreas described
B) to implement certain features
A. Yes, it's useful, no doubt. But one can certainly detect these kinds of errors (as well as many others) by running the software in a "supervision" mode (like VMMaker simulating the VM). Writing proper tests helps in that regard too. The only difference in making immutability an inherent property of the system is that you keep testing, testing, testing every time, even software that is already tested and proven to work. Is it only me who thinks that sounds like a waste? You don't run the tests in the deployed application, do you? You test and fix during development, then you ship it, and it works, with no overhead to pay. If immutability were the *only* possible solution to these problems, I would vote for it with both hands. You may answer: yes, of course we could do it another way, but we are too lazy to test the software properly in the first place, so let it be like that. And that would be the right answer :)
B. Most software doesn't need immutability to implement its features. Fact. Working Squeak-family systems back this up just fine. So when and where do you need it? In some specialized applications like OODBs. And it is not merely because it isn't available *yet*; if it were that essential, immutability would have been there much, much earlier. Marking literals as read-only is not a *feature*; it is just protection from fools.
So, like it or not, I don't consider the impact of immutability so significant that it is worth paying 5% of performance.
And besides, why a flag? If we can use the hardware to protect against memory writes (as Stefan mentioned), there will be no overhead, and then I won't say a single word against it. If you remember, I proposed before to develop a segmented object memory model, where you can split memory into segments serving many different purposes (again, not solely to support immutability): you could have read-only segments, segments for pinned objects, etc., and apply different strategies for working with the objects residing in each. And I bet some work has already been done in that direction by others, so it is hardly an *experimental*, *exotic* or *academic* area (the *-* words are sorted in order of scariness ;)).
That is a feature (or set of features) which I would like to have and am ready to pay the price for. But immutability alone? Nope.
btw, about memory segments: look, we have them already:
- old space
- new space
- JIT space
- stack
So all we need is to crystallize this into a generic concept, and then we can add:
- read-only space
- pinned space
- whatever space
To me this is the logical step everyone should take once they see a common pattern: generalize it. And contrary to Stefan's fears, I think this will lead us to better levels of quality in VM design.
On Fri, Jun 15, 2012 at 01:15:15AM +0200, Igor Stasenko wrote:
On 14 June 2012 23:44, Eliot Miranda eliot.miranda@gmail.com wrote:
But do you know what the performance overhead is? In VisualWorks it was < 5% over a wide range of benchmarks (e.g. real benchmarks such as recompiling the system).
Yes, I remember that number.
But let's pose the *right* question: we have an extra 5% of performance to spend; which feature would you like to spend it on?
It seems to me that some of the topics being discussed here might benefit from empirical measurements. For example, a particular object header format might have an effect on the speed of walking through the object memory heap. In such a case, it would be relatively easy to take measurements to estimate the actual performance that might be expected without actually doing a complete implementation. The process might look something like this:
- Trace an image to the desired new format. Do not worry about making it functional, just make it good enough that objects in the traced image can be reached with #objectAfter:
- Hack the interpreter code just enough to make #objectAfter: work with the new object format. Do not worry about making a functional VM.
- Generate a new "VM" that gets just far enough to load the image into an object memory. Turn off inlining when generating the "VM" so that a profiler can see where the time is being spent. Arrange for the "VM" to go immediately into a loop that walks through the object memory from firstAccessibleObject to end of memory (see StackInterpreter>>allAccessibleObjectsOkay for example). Loop for a few thousand iterations through the heap, then exit.
- Profile it (gprof). See where the time goes and how long it takes to run to completion.
- Compare the results obtained with various object header strategies.
Experiments like this might produce some very useful insights. It would be a lot of work for one person to do all the research on this, but it seems to me that we now have quite a few interested and competent VM hackers who might be willing to pitch in to obtain empirical data.
Any interest?
Dave
Hi Dave
I asked Camillo to do that a week ago on a similar topic. So we will see.
Stef
On Jun 15, 2012, at 2:49 AM, David T. Lewis wrote:
[snip]
Igor wrote:
In theory, of course, we can enforce anything, but in practice that means a lot of complex code with many checks on the VM side. This is not what I'd like to see in the first place, especially knowing that Squeak has lived well so far without any immutability, and it does not feel like we miss it badly.
Andreas responded:
I absolutely do. There were several situations (for example in Croquet and at Teleplace) where we changed our designs to the worse merely due to the lack of immutability support. The main thing that immutability fixes is to prevent accidental modifications of objects thought to be immutable (method literals for example), which when they happen are *extremely* hard to find. But it also gives rise to many other interesting techniques (read-only transactions etc).
I have to agree with Andreas here. I've used immutability to good effect in VisualWorks, and run into several situations where I wanted it in Squeak and had to make do with a work around. Avi's WriteBarrier package is one of those workarounds.
To be honest, I'm a little puzzled by the resistance to immutability. It's not like this is a new idea with unclear semantics or unknown utility. The VisualWorks implementation is simple and useful. VisualWorks is still the fastest Smalltalk around, so the performance impact can't be *too* high. VisualWorks is fairly similar to Squeak, so it's not unreasonable to use it as a model for considering immutability in Squeak.
Now, maybe there really are problems with the VW, VA or Gemstone implementations, and if so, let's hear about them. "It's not necessary" isn't a good argument, because nothing is really necessary—assembly works fine. What are the costs, benefits and trade-offs of VM-supported immutability?
Colin
On 14 June 2012 20:31, Colin Putney colin@wiresong.com wrote:
[snip]
To be honest, I'm a little puzzled by the resistance to immutability.
Well, my main argument against it is that Smalltalk has lived quite well without it so far. If it were so crucial, it would have been there long ago (even back in ST-80), wouldn't it?
From a purist's perspective, it also doesn't look like a good move. Because originally we started from something quite simple and universal:
- everything is an object
then we added:
- all objects are instances of some class
but now we are adding:
- all objects *also* have an identity hash
- all objects *also* have an immutability bit
Today we have a few of those *also*s; tomorrow we will find the need for 50 more, because there's virtually no limit to the additional properties we may want to have per object.
And so, the only logical step, as I see it, is to do it for real. What I proposed, adding a slot for arbitrary properties per object, is simply a unification of the above, so you can state: all objects *also* have at least one property slot, which the language can use to store arbitrary properties per object, *without* the object in question needing direct knowledge of the nature of those properties (in contrast to instance variables, immutability, hash, etc.).
That also means the VM doesn't need to know anything about those properties either, which directly translates into less code and less complexity.
It's not like this is a new idea with unclear semantics or unknown utility. The VisualWorks implementation is simple and useful. VisualWorks is still the fastest Smalltalk around, so the performance impact can't be *too* high. VisualWorks is fairly similar to Squeak, so it's not unreasonable to use it as a model for considering immutability in Squeak.
Now, maybe there really are problems with the VW, VA or Gemstone implementations, and if so, let's hear about them. "It's not necessary" isn't a good argument, because nothing is really necessary—assembly works fine. What are the costs, benefits and trade-offs of VM-supported immutability?
For me the main cost is, of course, performance. If we take our existing images (which have no notion of immutability), they will perform slower on the same hardware. We can turn this con into a pro only once we start using immutability in various places... and here is where the fun begins: you cannot manage immutability state across more than ONE framework. So you will have to invent more complex schemes, like the "immutability managers" mentioned earlier.
Anyway, the fact is that most software still won't use it, but you will keep paying the price of having it. It's like introducing a new tax "to fund activities to keep homeowners' lawns green and clean" and collecting it from everyone, even from those who don't have a lawn and don't even own a house :)
Colin
On Thu, Jun 14, 2012 at 12:14 PM, Igor Stasenko siguctua@gmail.com wrote:
And so, an only logical step, as to me was to do it for real: I proposed to add a slot for arbitrary properties per object, is simply a unification of the above, so you can state:
- all objects *also* have at least one property slot, which can be
used by language to store arbitrary properties per object (or at least one), *without* a need from object in question to have a direct knowledge about the nature of these properties (in contrast to instance variables, immutability, hash etc).
Sure. Sounds good. I'd like to see that as well.
For me the main cost is, of course, performance. If we take our existing images (which have no notion about immutability) they will perform slower on same hardware.
Not true, actually. We're talking about adding an immutability bit to the new object format. Existing images won't run at all on a VM that uses the new format. Once those images are converted to the new format they'll run faster, with or without the immutability bit. That's not to say there's no performance impact, but come on, nobody is going to see performance worsen.
We can turn this cons into pros, only when we will start using immutability around the places.. and here is where fun stuff begins: you cannot manage immutability state over more than ONE framework. So you will have to invent more complex schemes like mentioned "immutability managers" etc..
That's theoretically a problem, yes. I've never seen it actually come up, though, and I have a hard time imagining a situation where you'd want to manage the state of a single object with more than one framework at a time.
Colin
On Thu, Jun 14, 2012 at 12:14 PM, Igor Stasenko siguctua@gmail.com wrote:
On 14 June 2012 20:31, Colin Putney colin@wiresong.com wrote:
[snip]
Well, my main argument against it is that Smalltalk has lived quite well without it so far.
But it *hasn't*, as evinced by the bugs in the base libraries surfaced by adding immutablity to VisualWorks (and no doubt to VisualAge as well).
If it were so crucial, it would have been there long ago (even back in ST-80), wouldn't it?
On a 16-bit machine with 32k objects there were many things one lived without. Would you have us do without ensure: and exceptions and tuples and closures just because Smalltalk-80 didn't have them? What a silly argument.
[snip]
-- Best regards, Igor Stasenko.
To be honest, I'm a little puzzled by the resistance to immutability.
and also because I am a strong proponent of reducing the number of contracts between the VM and the language: things like moving scheduling logic to the image side, moving identity-hash logic to the image side, etc., which make the VM less complex by having fewer contracts. This is part of my holy crusade against complexity :)
On Thu, Jun 14, 2012 at 12:28 PM, Igor Stasenko siguctua@gmail.com wrote:
To be honest, I'm a little puzzled by the resistance to immutability.
and also, because i am strong proponent of reducing a number of contracts between VM and language. Things, like moving scheduling logic to image side, moving identity hash logic to image side etc etc. which makes VM less complex, by having less contracts. This is a part of my holy crusade against complexity :)
Now we're getting somewhere. Simplicity is a good thing, even if holy wars aren't.
My main concern with immutability is that it's a specific hack to provide a join point for AOP-style management of state. It feels like there's a more general mechanism that we're missing. Maybe it would be useful to have an instance-variable equivalent of ObjectsAsMethods, whatever that might mean. A MOP, maybe? Macros?
OTOH, a more general solution starts to become a PL research project pretty quickly, and would probably never happen.
Colin
Hi:
On 14 Jun 2012, at 21:52, Colin Putney wrote:
On Thu, Jun 14, 2012 at 12:28 PM, Igor Stasenko siguctua@gmail.com wrote:
and also, because i am strong proponent of reducing a number of contracts between VM and language. Things, like moving scheduling logic to image side, moving identity hash logic to image side etc etc. which makes VM less complex, by having less contracts. This is a part of my holy crusade against complexity :)
My main concern with immutability is that it's a specific hack to provide a join point for AOP-style management of state. It feels like there's a more general mechanism that we're missing. Maybe it would be useful to have an instance-variable equivalent of ObjectsAsMethods, whatever that might mean. A MOP, maybe? Macros?
OTOH, a more general solution starts to become a PL research project pretty quickly, and would probably never happen.
My current hammer is a metaobject-based MOP. And immutability is a rather small nail ;)
The context is a little different, but indeed, I think Igor needs a MOP. And one of the possible variations is briefly discussed here:
http://soft.vub.ac.be/~smarr/2012/03/identifying-a-unifying-mechanism-for-th...
http://soft.vub.ac.be/~smarr/downloads/tools12-smarr-dhondt-identifying-a-un...
Or you chose one from here, Sec. 2.3, I think: http://swp.dcc.uchile.cl/TR/2009/TR_DCC-20091123-013.pdf
Best regards Stefan
On Thu, 14 Jun 2012, Igor Stasenko wrote:
To be honest, I'm a little puzzled by the resistance to immutability.
and also, because i am strong proponent of reducing a number of contracts between VM and language. Things, like moving scheduling logic to image side, moving identity hash logic to image side etc etc. which makes VM less complex, by having less contracts. This is a part of my holy crusade against complexity :)
IIUC you want to push stuff from VM side to the image. And you want to add a properties slot for all objects in the new VM. Why don't you do that in the image? It's the same as adding an instance variable to Object/ProtoObject, isn't it? :)
Levente
-- Best regards, Igor Stasenko.
On 16 June 2012 03:42, Levente Uzonyi leves@elte.hu wrote:
On Thu, 14 Jun 2012, Igor Stasenko wrote:
To be honest, I'm a little puzzled by the resistance to immutability.
and also, because i am strong proponent of reducing a number of contracts between VM and language. Things, like moving scheduling logic to image side, moving identity hash logic to image side etc etc. which makes VM less complex, by having less contracts. This is a part of my holy crusade against complexity :)
IIUC you want to push stuff from VM side to the image. And you want to add a properties slot for all objects in the new VM. Why don't you do that in the image? It's the same as adding an instance variable to Object/ProtoObject, isn't it? :)
In theory yes, but in practice the existing object format(s) do not allow you to do that: try adding an instance variable to any variable class with bytes/words format.
So, one way or another, you will need to change the object format to allow it. And if the VM would let you do things the way you describe, that would be even better: no extra responsibility for the VM, just add an instance var. But your smiley at the end suggests irony, as if I were proposing something very silly, right?
So allow me to be a bit ironic/moronic too...
Just take a look at the nice and beautiful implementation of the dependents protocol in Object. Every Object in the system can have an additional property named "dependents". Sounds close to what I am proposing... But of course it would be silly to implement it differently from the current implementation. For sure, you should use a huge weak identity dictionary, where you store this property.
But let's move on to Morphs... ah yes, morphs came to us from Self, where any object can have as many properties as it wants. But since we cannot afford that, what do we do? Yeah, we build a pile of mess, call it MorphExtension, and everyone is happy, and keeps *rolling eyes* at silly idiots like me.
And the last thing: you could ask where compiled methods store their properties, like pragmas etc. I will answer: at the very obvious and right place where you can find it, it just needs some elaboration: in the second-to-last literal of the method's literal array, unless that is reserved by the method's selector, in which case it is held by an AdditionalMethodState, very similar to what both Object and MorphExtension do. No doubt, if you do it any other way, you are just an idiot!
And sure thing, all of these examples showing very beautiful and elegant way how we should organize an object's data structure, when you cannot predict in advance, how many properties it may need.
On 2012-06-15, at 7:28 PM, Igor Stasenko wrote:
In theory yes, but in practice the existing object format(s) do not allow you to do that: try adding an instance variable to any variable class with bytes/words format.
It's worse than that - if you add an instance variable to Object, it changes the layout of Behavior instances. That instantly crashes the VM, because superclass pointers and method dictionaries aren't where the VM expects them to be.
Levente has a point, though. Management of the properties dictionary doesn't have to be implemented at the VM level; the VM just has to support a named instance variable in Object.
Colin
On 16 June 2012 07:45, Colin Putney colin@wiresong.com wrote:
On 2012-06-15, at 7:28 PM, Igor Stasenko wrote:
In theory yes, but in practice the existing object format(s) do not allow you to do that: try adding an instance variable to any variable class with bytes/words format.
It's worse than that - if you add an instance variable to Object, it changes the layout of Behavior instances. That instantly crashes the VM, because superclass pointers and method dictionaries aren't where the VM expects them to be.
Yes, that's why I stressed that an arbitrary property is not a slot that should be 'known' by the object itself, because most of the time the object in question doesn't need direct knowledge of the nature of the properties stored there.
This comes from the problem of adding properties, which can come from completely different, often anonymous, layers of the system or frameworks: you can never know what idea might come into the mind of another framework developer and what additional state he may need to store per object.
Let's just consider a single example with hash: imagine that we removed Sets and Dictionaries from our language. In such a system, objects would no longer need to care about their "hash" property, since there would be no users of it. Apparently that means the hash property is external to all objects in the system, and means something only to Sets and Dictionaries, which know how to reason about it.
Levente has a point, though. Management of the properties dictionary doesn't have to be implemented at the VM level; the VM just has to support a named instance variable in Object.
Err, where did I say anything about a dictionary? The VM just knows that all objects on the heap have at least one reference slot, apart from the other references (ivars/variable vars etc.). So you need only two primitives on the VM side to access it: read the slot value and write to that slot. The rest is completely up to the language side to decide what to do with it. Shall we use a dictionary to store properties there? Most probably yes.
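[Editor's sketch] Igor's split of responsibilities - the VM guarantees one hidden reference slot per object and exposes exactly two primitives, while all policy (such as lazily installing a property dictionary) lives on the language side - can be modeled in a few lines. This is a Python toy model, not VM code; all names (`prim_read_slot`, `properties_at_if_absent_put`, etc.) are illustrative assumptions.

```python
# Toy model: the VM's entire contract is one hidden slot + two primitives.

class Obj:
    """Every heap object carries one hidden, VM-guaranteed reference slot."""
    def __init__(self):
        self._prop_slot = None

# The only two "primitives" the VM must provide:
def prim_read_slot(obj):
    return obj._prop_slot

def prim_write_slot(obj, value):
    obj._prop_slot = value

# Language-side policy: lazily install a dictionary in the slot.
def properties_at_if_absent_put(obj, key, factory):
    props = prim_read_slot(obj)
    if props is None:
        props = {}
        prim_write_slot(obj, props)   # first property access creates the dict
    if key not in props:
        props[key] = factory()
    return props[key]

o = Obj()
oid = properties_at_if_absent_put(o, 'oid', lambda: 42)
```

Note how the VM never interprets the slot's contents: swapping the dictionary for any other structure is purely an image-side decision.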
Colin
On Sat, 16 Jun 2012, Igor Stasenko wrote:
On 16 June 2012 03:42, Levente Uzonyi leves@elte.hu wrote:
On Thu, 14 Jun 2012, Igor Stasenko wrote:
To be honest, I'm a little puzzled by the resistance to immutability.
And also because I am a strong proponent of reducing the number of contracts between the VM and the language: things like moving scheduling logic to the image side, moving identity-hash logic to the image side, etc., which make the VM less complex by giving it fewer contracts. This is part of my holy crusade against complexity :)
IIUC you want to push stuff from VM side to the image. And you want to add a properties slot for all objects in the new VM. Why don't you do that in the image? It's the same as adding an instance variable to Object/ProtoObject, isn't it? :)
In theory yes, but in practice the existing object format(s) do not allow you to do that: try adding an instance variable to any variable class with bytes/words format.
Right, Colin also mentioned another contract that would make it impossible with the current VM, but we're talking about a new VM, aren't we?
So, one way or another, you will need to change the object format to allow that. And if the VM allowed you to do things like you describe, that would be even better: no extra responsibility for the VM, just add an instance var. But your smiley at the end means irony, as if I were proposing something very silly, right?
Not at all, it's about the simplicity of this idea.
Btw, I'm not sure it's worth adding an extra slot to every object. Maybe it's worth using a bit in the object header to mark objects which have the properties slot and keeping the slot management on the VM side. When the slot is about to be accessed, check the bit and use lazy become if there's no slot yet. Of course, someone should do the hard work and make some measurements before deciding about this n+1th idea.
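[Editor's sketch] Levente's variant - create objects without the extra slot, mark its presence with a header bit, and grow the slot on first access via a (lazy) become - can be simulated like this. Pure Python model; the header bit value and the grow-in-place trick stand in for a real become and are assumptions, not the actual VM mechanism.

```python
# Header bit marks whether an object already has the properties slot;
# the slot is grown on demand, simulating "lazy become" into a larger object.

HAS_PROPS_BIT = 0x1

class HeapObject:
    def __init__(self):
        self.header = 0      # properties bit clear: no extra slot yet
        self.slots = []      # ordinary instance-variable slots only

def ensure_properties_slot(obj):
    """On first access, 'become' the object into one with the extra slot."""
    if not (obj.header & HAS_PROPS_BIT):
        obj.slots = [None] + obj.slots   # new slot at a fixed offset
        obj.header |= HAS_PROPS_BIT      # record its presence in the header
    return obj

def read_properties_slot(obj):
    ensure_properties_slot(obj)
    return obj.slots[0]
```

The tradeoff Levente mentions is visible here: every access pays a bit-test, in exchange for objects that never use properties staying one word smaller.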
So allow me to be a bit ironic/moronic as well...
Just take a look at the nice and beautiful implementation of the dependents protocol of Object. Every Object in the system can have an additional property named "dependents". Sounds close to what I am proposing... But of course it would be silly to implement it differently from the current implementation. For sure, you should use a huge weak identity dictionary to store this property.
But let's move on to Morphs... ah yes, morphs came to us from Self, where any object could have as many properties as it wants. But since we cannot afford that, what do we do? We build a pile of mess, call it MorphExtension, everyone is happy, and everyone keeps *rolling eyes* at silly idiots like me.
And the last thing... you could ask: where do CompiledMethods store their properties, like pragmas etc.? I will answer: in a very obvious and right place where you can find them; it just needs some elaboration: in the second-to-last literal of the method's literal array, unless it is reserved by the method's selector, in which case they are held by an AdditionalMethodState, very similar to what both Object and MorphExtension do. No doubt, if you do it any other way, you are just an idiot!
And sure enough, all of these examples show a very beautiful and elegant way to organize an object's data structure when you cannot predict in advance how many properties it may need.
These could be simplified/eliminated with your proposal and you could also use the properties slot to add immutability as you suggested, but I still don't see how these (and other external tools' and frameworks') different uses of the properties slot won't conflict with each other.
Levente
-- Best regards, Igor Stasenko.
On 16 June 2012 18:01, Levente Uzonyi leves@elte.hu wrote:
On Sat, 16 Jun 2012, Igor Stasenko wrote:
On 16 June 2012 03:42, Levente Uzonyi leves@elte.hu wrote:
On Thu, 14 Jun 2012, Igor Stasenko wrote:
To be honest, I'm a little puzzled by the resistance to immutability.
And also because I am a strong proponent of reducing the number of contracts between the VM and the language: things like moving scheduling logic to the image side, moving identity-hash logic to the image side, etc., which make the VM less complex by giving it fewer contracts. This is part of my holy crusade against complexity :)
IIUC you want to push stuff from VM side to the image. And you want to add a properties slot for all objects in the new VM. Why don't you do that in the image? It's the same as adding an instance variable to Object/ProtoObject, isn't it? :)
In theory yes, but in practice the existing object format(s) do not allow you to do that: try adding an instance variable to any variable class with bytes/words format.
Right, Colin also mentioned another contract that would make it impossible with the current VM, but we're talking about a new VM, aren't we?
Yes, then I agree. We could have a contract that all objects have at least one instance var (call it a reference), and the VM doesn't need any knowledge of it and should not attach any additional semantics to that ivar, apart from providing a way to access it. Then, as before, the VM is free to attach extra semantic value to the following slots, as in the case of Behavior and the indexes of the superclass, method dictionary, and format ivars. So, in practice, it means we would have to increase the offsets of all currently 'contracted' ivars in the VM (and plugins) by 1.
So, one way or another, you will need to change the object format to allow that. And if the VM allowed you to do things like you describe, that would be even better: no extra responsibility for the VM, just add an instance var. But your smiley at the end means irony, as if I were proposing something very silly, right?
Not at all, it's about the simplicity of this idea.
Btw, I'm not sure it's worth adding an extra slot to every object. Maybe it's worth using a bit in the object header to mark objects which have the properties slot and keeping the slot management on the VM side. When the slot is about to be accessed, check the bit and use lazy become if there's no slot yet. Of course, someone should do the hard work and make some measurements before deciding about this n+1th idea.
Well, we can have it either mandatory or optional (if the bit is set). It is an implementation detail, and we should actually test which one is more efficient in terms of the speed/space tradeoff :)
But if it is mandatory, then it combines quite well with lazy become. Lazy become requires at least one extra word per object to store the forwarding pointer there, because you cannot replace an object's header with a forwarding pointer: the header holds the information about how much memory is reserved for the object, and if you wipe that information you will have big problems walking the heap and compacting unused space.
So allow me to be a bit ironic/moronic as well...
Just take a look at the nice and beautiful implementation of the dependents protocol of Object. Every Object in the system can have an additional property named "dependents". Sounds close to what I am proposing... But of course it would be silly to implement it differently from the current implementation. For sure, you should use a huge weak identity dictionary to store this property.
But let's move on to Morphs... ah yes, morphs came to us from Self, where any object could have as many properties as it wants. But since we cannot afford that, what do we do? We build a pile of mess, call it MorphExtension, everyone is happy, and everyone keeps *rolling eyes* at silly idiots like me.
And the last thing... you could ask: where do CompiledMethods store their properties, like pragmas etc.? I will answer: in a very obvious and right place where you can find them; it just needs some elaboration: in the second-to-last literal of the method's literal array, unless it is reserved by the method's selector, in which case they are held by an AdditionalMethodState, very similar to what both Object and MorphExtension do. No doubt, if you do it any other way, you are just an idiot!
And sure enough, all of these examples show a very beautiful and elegant way to organize an object's data structure when you cannot predict in advance how many properties it may need.
These could be simplified/eliminated with your proposal and you could also use the properties slot to add immutability as you suggested, but I still don't see how these (and other external tools' and frameworks') different uses of the properties slot won't conflict with each other.
I do not see how the VM could help solve conflicts between frameworks. Any feature can be abused. As with the global namespace, this is something we should (and should be able to) deal with on the language side. But apart from the dark sides, look at the bright sides as well: different frameworks can actually share the same property if they need to.
And one last thing: consider JavaScript and Self (and many other dynamic languages), which allow you to have as many properties per object as you like, and they seem to live quite well with it. Except that, since it is their "default" setup, it is of course much less efficient than having a fixed set of "known" instance variables per instance. But look at what the V8 guys did: they actually came to the same idea, but from the other side.
So there is nothing new in what I am proposing: we can have the best of both worlds.
P.S. Oh yes, I did not suggest that the immutability bit has to be stored in the additional properties slot, because checking it would be even less efficient than using a bit in the header. I just wanted to point out that immutability is actually a property and, like hash, has the same "external" nature for all objects: these properties are meaningful only to certain frameworks/layers in a system, not globally to every possible class.
Levente
-- Best regards, Igor Stasenko.
On Sat, 16 Jun 2012, Igor Stasenko wrote:
On 16 June 2012 18:01, Levente Uzonyi leves@elte.hu wrote:
On Sat, 16 Jun 2012, Igor Stasenko wrote:
On 16 June 2012 03:42, Levente Uzonyi leves@elte.hu wrote:
On Thu, 14 Jun 2012, Igor Stasenko wrote:
To be honest, I'm a little puzzled by the resistance to immutability.
And also because I am a strong proponent of reducing the number of contracts between the VM and the language: things like moving scheduling logic to the image side, moving identity-hash logic to the image side, etc., which make the VM less complex by giving it fewer contracts. This is part of my holy crusade against complexity :)
IIUC you want to push stuff from VM side to the image. And you want to add a properties slot for all objects in the new VM. Why don't you do that in the image? It's the same as adding an instance variable to Object/ProtoObject, isn't it? :)
In theory yes, but in practice the existing object format(s) do not allow you to do that: try adding an instance variable to any variable class with bytes/words format.
Right, Colin also mentioned another contract that would make it impossible with the current VM, but we're talking about a new VM, aren't we?
Yes, then I agree. We could have a contract that all objects have at least one instance var (call it a reference), and the VM doesn't need any knowledge of it and should not attach any additional semantics to that ivar, apart from providing a way to access it. Then, as before, the VM is free to attach extra semantic value to the following slots, as in the case of Behavior and the indexes of the superclass, method dictionary, and format ivars. So, in practice, it means we would have to increase the offsets of all currently 'contracted' ivars in the VM (and plugins) by 1.
So, one way or another, you will need to change the object format to allow that. And if the VM allowed you to do things like you describe, that would be even better: no extra responsibility for the VM, just add an instance var. But your smiley at the end means irony, as if I were proposing something very silly, right?
Not at all, it's about the simplicity of this idea.
Btw, I'm not sure it's worth adding an extra slot to every object. Maybe it's worth using a bit in the object header to mark objects which have the properties slot and keeping the slot management on the VM side. When the slot is about to be accessed, check the bit and use lazy become if there's no slot yet. Of course, someone should do the hard work and make some measurements before deciding about this n+1th idea.
Well, we can have it either mandatory or optional (if the bit is set). It is an implementation detail, and we should actually test which one is more efficient in terms of the speed/space tradeoff :)
But if it is mandatory, then it combines quite well with lazy become. Lazy become requires at least one extra word per object to store the forwarding pointer there, because you cannot replace an object's header with a forwarding pointer: the header holds the information about how much memory is reserved for the object, and if you wipe that information you will have big problems walking the heap and compacting unused space.
If objects are aligned to 4 (or 8) bytes, then there are 2 (or 3) free bits per pointer. If you use two of these bits to store 3 states - valid object, forwarded object with no slots, forwarded object with slots (the object size is stored in the next slot) - and the object header also uses these 2 bits the same way, then you don't need the extra word and you can still find out the size of forwarded objects.
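[Editor's sketch] Levente's tagging trick can be made concrete: with 8-byte alignment the low bits of every pointer are always zero, so two of them can carry a forwarding state without any extra word. A Python model using integers as addresses; the state encodings are illustrative assumptions.

```python
# Model of 2-bit pointer tagging: 8-byte alignment frees the low 3 bits,
# two of which encode one of three states.

VALID          = 0b00  # ordinary object / pointer
FWD_NO_SLOTS   = 0b01  # forwarded; the object had no slots
FWD_WITH_SLOTS = 0b10  # forwarded; original size is stored in the next slot

def tag(addr, state):
    """Pack a state into the alignment bits of an 8-byte-aligned address."""
    assert addr % 8 == 0, "pointers must be 8-byte aligned"
    return addr | state

def state_of(tagged):
    return tagged & 0b11

def addr_of(tagged):
    return tagged & ~0b11   # mask the tag off to recover the real address
```

The fourth state (0b11) remains free, which is the slack Levente later points to when immediates come up.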
So allow me to be a bit ironic/moronic as well...
Just take a look at the nice and beautiful implementation of the dependents protocol of Object. Every Object in the system can have an additional property named "dependents". Sounds close to what I am proposing... But of course it would be silly to implement it differently from the current implementation. For sure, you should use a huge weak identity dictionary to store this property.
But let's move on to Morphs... ah yes, morphs came to us from Self, where any object could have as many properties as it wants. But since we cannot afford that, what do we do? We build a pile of mess, call it MorphExtension, everyone is happy, and everyone keeps *rolling eyes* at silly idiots like me.
And the last thing... you could ask: where do CompiledMethods store their properties, like pragmas etc.? I will answer: in a very obvious and right place where you can find them; it just needs some elaboration: in the second-to-last literal of the method's literal array, unless it is reserved by the method's selector, in which case they are held by an AdditionalMethodState, very similar to what both Object and MorphExtension do. No doubt, if you do it any other way, you are just an idiot!
And sure enough, all of these examples show a very beautiful and elegant way to organize an object's data structure when you cannot predict in advance how many properties it may need.
These could be simplified/eliminated with your proposal and you could also use the properties slot to add immutability as you suggested, but I still don't see how these (and other external tools' and frameworks') different uses of the properties slot won't conflict with each other.
I do not see how the VM could help solve conflicts between frameworks. Any feature can be abused. As with the global namespace, this is something we should (and should be able to) deal with on the language side. But apart from the dark sides, look at the bright sides as well: different frameworks can actually share the same property if they need to.
And one last thing: consider JavaScript and Self (and many other dynamic languages), which allow you to have as many properties per object as you like, and they seem to live quite well with it.
But those are different languages.
Except that, since it is their "default" setup, it is of course much less efficient than having a fixed set of "known" instance variables per instance. But look at what the V8 guys did: they actually came to the same idea, but from the other side.
What do you mean by "same idea"? IIRC they created classes for js objects behind the scenes.
Levente
So there is nothing new in what I am proposing: we can have the best of both worlds.
P.S. Oh yes, I did not suggest that the immutability bit has to be stored in the additional properties slot, because checking it would be even less efficient than using a bit in the header. I just wanted to point out that immutability is actually a property and, like hash, has the same "external" nature for all objects: these properties are meaningful only to certain frameworks/layers in a system, not globally to every possible class.
Levente
-- Best regards, Igor Stasenko.
-- Best regards, Igor Stasenko.
On 17.06.2012 07:09, Levente Uzonyi wrote:
On Sat, 16 Jun 2012, Igor Stasenko wrote:
Well, we can have it either mandatory or optional (if the bit is set). It is an implementation detail, and we should actually test which one is more efficient in terms of the speed/space tradeoff :)
But if it is mandatory, then it combines quite well with lazy become. Lazy become requires at least one extra word per object to store the forwarding pointer there, because you cannot replace an object's header with a forwarding pointer: the header holds the information about how much memory is reserved for the object, and if you wipe that information you will have big problems walking the heap and compacting unused space.
If objects are aligned to 4 (or 8) bytes, then there are 2 (or 3) free bits per pointer. If you use two of these bits to store 3 states - valid object, forwarded object with no slots, forwarded object with slots (the object size is stored in the next slot) - and the object header also uses these 2 bits the same way, then you don't need the extra word and you can still find out the size of forwarded objects.
Those bits are already used for immediate objects, no?
Cheers, Henry
On Mon, 18 Jun 2012, Henrik Sperre Johansen wrote:
On 17.06.2012 07:09, Levente Uzonyi wrote:
On Sat, 16 Jun 2012, Igor Stasenko wrote:
Well, we can have it either mandatory or optional (if the bit is set). It is an implementation detail, and we should actually test which one is more efficient in terms of the speed/space tradeoff :)
But if it is mandatory, then it combines quite well with lazy become. Lazy become requires at least one extra word per object to store the forwarding pointer there, because you cannot replace an object's header with a forwarding pointer: the header holds the information about how much memory is reserved for the object, and if you wipe that information you will have big problems walking the heap and compacting unused space.
If objects are aligned to 4 (or 8) bytes, then there are 2 (or 3) free bits per pointer. If you use two of these bits to store 3 states - valid object, forwarded object with no slots, forwarded object with slots (the object size is stored in the next slot) - and the object header also uses these 2 bits the same way, then you don't need the extra word and you can still find out the size of forwarded objects.
Those bits are already used for immediate objects, no?
It's a new object format, everything is possible. Since my suggestion only uses 3 states, there's at least one more which can be used for immediates even if only 2 bits are available.
Levente
Cheers, Henry
On 18.06.2012 14:48, Levente Uzonyi wrote:
On Mon, 18 Jun 2012, Henrik Sperre Johansen wrote:
On 17.06.2012 07:09, Levente Uzonyi wrote:
On Sat, 16 Jun 2012, Igor Stasenko wrote:
Well, we can have it either mandatory or optional (if the bit is set). It is an implementation detail, and we should actually test which one is more efficient in terms of the speed/space tradeoff :)
But if it is mandatory, then it combines quite well with lazy become. Lazy become requires at least one extra word per object to store the forwarding pointer there, because you cannot replace an object's header with a forwarding pointer: the header holds the information about how much memory is reserved for the object, and if you wipe that information you will have big problems walking the heap and compacting unused space.
If objects are aligned to 4 (or 8) bytes, then there are 2 (or 3) free bits per pointer. If you use two of these bits to store 3 states - valid object, forwarded object with no slots, forwarded object with slots (the object size is stored in the next slot) - and the object header also uses these 2 bits the same way, then you don't need the extra word and you can still find out the size of forwarded objects.
Those bits are already used for immediate objects, no?
It's a new object format, everything is possible.
Sure, everything is possible. But is it realistic/useful?
Since my suggestion only uses 3 states, there's at least one more which can be used for immediates even if only 2 bits are available.
Yes, and you're down to at most 61/60-bit immediate ints/doubles, or 59 bits if the originally proposed lookup-table format were used.
Considering that only instances of classes with *no* slots (if you don't have state, why bother making instances in the first place?) would end up using less space under this scheme, I don't see how it warrants using twice the header bits and a smaller range for immediates, over just ensuring that the minimum object size, including the header, is 16 bytes.
Cheers, Henry
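[Editor's note] Henrik's arithmetic is easy to check: on a 64-bit word, every bit spent on tagging comes straight out of the immediate value's payload. The exact tag assignments below (2 forwarding-state bits plus 1 immediate bit, or a 5-bit lookup-table tag) are my reading of the exchange, not a statement of the final design.

```python
# Payload width of an immediate value after reserving tag bits on a 64-bit word.

WORD_BITS = 64

def immediate_value_bits(tag_bits):
    """Bits left for the immediate value itself."""
    return WORD_BITS - tag_bits

# 2 forwarding-state bits + 1 immediate-marker bit -> 61-bit SmallIntegers
sixty_one = immediate_value_bits(3)
# a 5-bit lookup-table tag would leave only 59 bits
fifty_nine = immediate_value_bits(5)
```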
On 2012-06-13, at 14:10, Igor Stasenko wrote:
On 13 June 2012 09:41, Bert Freudenberg bert@freudenbergs.de wrote:
On 2012-06-13, at 05:27, Igor Stasenko wrote:
Another (open) question is how to deal with immutability in the presence of become, i.e.:
mutableCopy := immutableObject shallowCopy.
immutableObject becomeForward: mutableCopy.
Such things make immutability useless for protecting critical parts of object memory, like preventing the modification of compiled method literals: yes, you cannot change the immutable object, but you can replace all pointers to it with its own mutable copy, which is equivalent to making it mutable again.
Why should the VM allow become of an immutable object?
You can disallow this. But that only makes it harder: I can do it manually - take all objects pointing to the immutable one and replace their pointers to it with its mutable copy. And that is completely legal, except that it will be a bit harder, since it is not done by a primitive.
Okay, I guess you're right. (although the solution would be "don't do that" rather than "immutability is useless").
But become would not replace a reference in an immutable object. Which is a major point of immutable objects: all objects "inside" an immutable object are immutable, too. (although I think even that was up for discussion last time we had this conversation)
Disallowing #become on immutables raises many additional questions:
what do you do when you need to migrate instances of a class due to reshaping it, while some of them are immutable?
Interesting case. One solution would be to simply fail the class reshape if there are immutable instances. One would have to do a mutable copy + become.
(I bet there are many other examples where this will break existing traditional schemes, like working with proxies etc.)
It would have to be used sparingly and with care, sure.
I don't want to spread FUD... I just want to make sure that we are ready to answer every such question.
Yep. That's why we're discussing :)
- Bert -
And once we have per-object properties and lazy become, things like Magma will get HUGE benefits straight out of the box.
It's true -- one of the busiest occupations for Magma is to maintain its two-way [object<--> oid] dictionary mappings. Dropping that in favor of a fast per-object #oid attribute would be like dropping an anchor.
It seems like per-object attributes could provide some amazing leverage in other domains too. +1.
On Mon, Jun 11, 2012 at 1:36 AM, Igor Stasenko siguctua@gmail.com wrote:
Some extra ideas.
- Avoiding extra header for big sized objects.
I'm not sure about this, but still...
according to Eliot's design: 8: slot size (255 => extra header word with large size)
What if we extend the size field to 16 bits (so in total it would be 65536 slots)?
This simply doesn't make sense within the overall context of the header (i.e. a relatively large identityHash the same size as the class index). A large size field increases the size of the header for all objects. An extra size word for large objects increases the size (probably by 8 bytes) only for large objects. But that's a very small percentage overhead of at most 8 / (256 * 4), or 0.8%. Few objects are large; the bulk of objects are smaller than 256 slots. It's a no-brainer: have a small size field and overflow it only for large objects.
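[Editor's note] Eliot's worst-case figure checks out: the extra 8-byte size word is only paid by objects with at least 256 slots, so at 4 bytes per slot the relative overhead is bounded as follows.

```python
# Worst-case relative overhead of an 8-byte overflow size word:
# it only applies to objects of >= 256 slots, i.e. >= 256 * 4 bytes of body.

overhead = 8 / (256 * 4)
print(f"{overhead:.2%}")   # prints 0.78%, i.e. "about 0.8%" as Eliot says
```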
and we have a single flag indicating how to calculate the object size:
flag(0): object size = (size field) * 8
flag(1): object size = 2 ^ (size field)
which means that past 2^16 (or however many bits we dedicate to the size field in the header) all object sizes would be powers of two. Since most objects fit under 2^16, we don't lose much. For big arrays, we could have a special collection/array which stores the exact size in an inst var (and we don't even need to care in the cases of Sets/Dicts/OrderedCollections). We can also make it transparent:
Array class>>new: size
	size > MaxExactSize ifTrue: [ ^ ArrayWithBigSizeWhatever new: size ]
Of course, care must be taken for those variable classes which can potentially hold large amounts of bytes (like Bitmap). But I think code can quickly be adapted to this VM feature, which would simply fail the #new: primitive if the size is greater than the maximum "exact size" that fits in the header's size field and is not a power of two.
- Slot for arbitrary properties.
If you read carefully, Eliot said that to enable lazy become it is necessary to always have some extra space per object, even if the object doesn't have any fields:
<<We shall probably keep the minimum object size at 16 bytes so that there is always room for a forwarding pointer.>>
So this fits quite well with the idea of having a slot for dynamic properties per object. What if, instead of "extending" an object when it requires an extra properties slot, we just reserve the properties slot at the very beginning:
[ header ] [ properties slot] ... rest of data ..
so that any object has that slot. And in the case of lazy become, we can use that slot to hold the forwarding pointer. Voila.
- From 2 we go straight back to hash: the VM doesn't need to know such a thing as an object's hash; it has no semantic load inside the VM, which just answers those bits via a single primitive.
So why is it a kind of enforced, inherent property of all objects in the system? And why does nobody ask: if we have that one, why couldn't we have more than one, or as many as we want? This is my central question around the idea of per-object properties. Once the VM guarantees that any object has at least one slot for storing an object reference (the property slot), the VM no longer needs to care about identity hash.
Because it can be implemented completely on the language side. But most of all, we are NO longer limited in how big or small hash values can be, which directly converts into bonuses: fewer hash collisions -> more performance. Want a 64-bit hash? 128-bit? Whatever you desire:
Object>>identityHash
	^ self propertiesAt: #hash ifAbsentPut: [ HashGenerator newHashValue ]
And once we have per-object properties and lazy become, things like Magma will get HUGE benefits straight out of the box. Because look: lazy become and immutability address many problems related to OODB implementation (I barely see other use cases where immutability would be as useful as for an OODB). So for me it is logical to take this last step: with arbitrary properties, an OODB can store its ID there.
-- Best regards, Igor Stasenko.
On 13 June 2012 20:50, Eliot Miranda eliot.miranda@gmail.com wrote:
On Mon, Jun 11, 2012 at 1:36 AM, Igor Stasenko siguctua@gmail.com wrote:
Some extra ideas.
- Avoiding extra header for big sized objects.
I'm not sure about this, but still...
according to Eliot's design: 8: slot size (255 => extra header word with large size)
What if we extend the size field to 16 bits (so in total it would be 65536 slots)?
This simply doesn't make sense within the overall context of the header (i.e. a relatively large identityHash the same size as the class index). A large size field increases the size of the header for all objects. An extra size word for large objects increases the size (probably by 8 bytes) only for large objects. But that's a very small percentage overhead of at most 8 / (256 * 4), or 0.8%. Few objects are large; the bulk of objects are smaller than 256 slots. It's a no-brainer: have a small size field and overflow it only for large objects.
I was not measuring it in terms of space, but in terms of not having an additional word to deal with. Can you please fill in the gaps in your design and explain how you perform heap walking? The current design reserves the two least significant bits to indicate whether an object header is 1, 2 or 3 words. But in your proposed format the least significant bits are occupied by the slot-size field, which can be an arbitrary value. So how do you implement heap walking and determine whether the first word of the next object is its header or its size field?
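[Editor's sketch] One possible answer to Igor's heap-walking question: a walker always proceeds from the heap base, header by header, so it consumes overflow size words in a known order and never has to guess what a word is. This Python model is only my reading of Eliot's "255 => extra header word" design, not its actual implementation; the word layout is an assumption.

```python
# Model: a slot count of 255 in the header signals that the real size is in
# the word allocated just before the header. Sequential walking resolves
# the header-vs-size-word ambiguity by construction.

OVERFLOW = 255   # slot-count value meaning "size is in the preceding word"

def make_object(nslots):
    """Return the words an allocator would emit for one object."""
    if nslots < OVERFLOW:
        return [("header", nslots)] + [("slot", None)] * nslots
    # large object: an extra size word precedes the ordinary header
    return [("size", nslots), ("header", OVERFLOW)] + [("slot", None)] * nslots

def walk(heap):
    """Yield each object's slot count, in allocation order."""
    i = 0
    while i < len(heap):
        kind, value = heap[i]
        if kind == "size":        # overflow word: real size, header follows
            nslots = value
            i += 2                # skip the size word and the header
        else:
            assert kind == "header"
            nslots = value
            i += 1
        yield nslots
        i += nslots               # skip the body slots

heap = make_object(3) + make_object(300) + make_object(0)
```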
and we have a single flag indicating how to calculate the object size:
flag(0): object size = (size field) * 8
flag(1): object size = 2 ^ (size field)
which means that past 2^16 (or how many bits we dedicate to size field in header) all object sizes will be power of two. Since most of the objects will fit under 2^16, we don't lose much. For big arrays, we could have a special collection/array, which will store exact size in it's inst var (and we even don't need to care in cases of Sets/Dicts/OrderedCollections). Also we can actually make it transparent:
Array class>>new: size
	size > (max exact size) ifTrue: [ ^ ArrayWithBigSizeWhatever new: size ]
Of course, care must be taken for those variable classes which can potentially hold large amounts of bytes (like Bitmap). But I think code can be quickly adapted to this feature of the VM, which will simply fail the #new: primitive if the size is not a power of two and is greater than the maximum "exact size" that can fit into the size field of the header.
- Slot for arbitrary properties.
If you read carefully, Eliot said that to make lazy become work it is necessary to always have some extra space per object, even if the object doesn't have any fields:
<<We shall probably keep the minimum object size at 16 bytes so that there is always room for a forwarding pointer. >>
So this fits quite well with the idea of having a slot for dynamic properties per object. What if, instead of "extending" an object when it requires an extra properties slot, we just reserve a slot for properties at the very beginning:
[ header ] [ properties slot] ... rest of data ..
so any object will have that slot. And in the case of lazy become, we can use that slot to hold the forwarding pointer. Voila.
- From 2 we go straight back to hash. The VM doesn't need to know such a thing as an object's hash; it has no semantic load inside the VM, which just answers those bits via a single primitive.
So why is it an enforced, inherent property of all objects in the system? And why does nobody ask: if we can have that one, why can't we have more than one, or as many as we want? This is my central question around the idea of having per-object properties. Once the VM guarantees that any object can have at least one slot for storing an object reference (the property slot), the VM no longer needs to care about identity hash.
It can be implemented completely on the language side. But most of all, we are no longer limited in how big or small hash values can be, which directly converts into bonuses: fewer hash collisions -> more performance. Want a 64-bit hash? 128-bit? Whatever you desire:
Object>>identityHash
	^ self propertiesAt: #hash ifAbsentPut: [ HashGenerator newHashValue ]
And once we have per-object properties and lazy become, things like Magma will get HUGE benefits straight out of the box. Because look: lazy become and immutability address many problems related to OODB implementations (I barely see other use cases where immutability would be as useful as in OODBs). So for me it is logical to take this last step: by adding arbitrary properties, an OODB can now store its ID there.
-- Best regards, Igor Stasenko.
-- best, Eliot
On Wed, Jun 13, 2012 at 12:46 PM, Igor Stasenko siguctua@gmail.com wrote:
On 13 June 2012 20:50, Eliot Miranda eliot.miranda@gmail.com wrote:
On Mon, Jun 11, 2012 at 1:36 AM, Igor Stasenko siguctua@gmail.com
wrote:
Some extra ideas.
- Avoiding extra header for big sized objects.
I'm not sure about this, but still...
according to Eliot's design: 8: slot size (255 => extra header word with large size)
What if we extend size to 16 bits (so in total it will be 65536 slots)
This simply doesn't make sense within the overall context of the header (i.e. a relatively large identityHash, the same size as the class index). A large size field increases the size of the header for all objects. An extra size word increases the size (probably by 8 bytes) only for large objects. But that's a very small percentage overhead, at most 8 / (256 * 4), or about 0.8%. Few objects are large; the bulk of objects are smaller than 256 slots. It's a no-brainer: have a small size field and overflow only for large objects.
I didn't measure it in terms of space, but in terms of not having an additional word to deal with. Can you please fill the gaps in your design and explain how you perform heap walking?
Hmm, that's a detail :) Could you list the gaps you see in the design?
The current design reserves the two least significant bits to indicate whether the object header is 1, 2 or 3 words. But in your proposed format the least significant bits are reserved for the slot size field, which can be an arbitrary value. So how do you implement heap walking and determine whether the first word of the next object is its header or its size field?
OK, here's one way to implement heap walking:
In heap walking the memory manager needs to be able to detect the start of the next object. This is complicated by the short and long header formats, short being for objects with 254 slots or less, long being for objects with 255 slots or more. The class index field can be used to mark special objects. In particular the tagged class indices 1 through 7, which correspond to objects with tag bits 1 through 7 (SmallInteger = 1, 3, 5, 7; Character = e.g. 2; SmallFloat = e.g. 4), never occur in the class index fields of normal objects. So if the size doubleword uses all bits other than the class field (44 bits gives an adequate maximum size of 2^46 bytes, ~10^14 bytes), then size doublewords can be marked by using one of the tagged class indices in their class field. To identify the next object the VM fetches the doubleword immediately following the current object (object bodies being rounded up to 8 bytes in the 32-bit VM). If that doubleword's class index field is the size-doubleword class index pun, e.g. 1, then it is a size field, the object header is the doubleword following it, and the object's slots start after that. If not, the object header is that doubleword and the object's slots follow it.
-- best, Eliot
-- Best regards, Igor Stasenko.
On 14 June 2012 01:08, Eliot Miranda eliot.miranda@gmail.com wrote:
On Wed, Jun 13, 2012 at 12:46 PM, Igor Stasenko siguctua@gmail.com wrote:
On 13 June 2012 20:50, Eliot Miranda eliot.miranda@gmail.com wrote:
I didn't measure it in terms of space, but in terms of not having an additional word to deal with. Can you please fill the gaps in your design and explain how you perform heap walking?
Hmm, that's a detail :) Could you list the gaps you see in the design?
Hehe, if you like. It is not clear how many points in the VM should be aware of forwarded objects, and where you can just pass them around. Obviously, the places where you need to access an object's data will require that check. But I fear this could be too costly. If this check is placed at the slot read operation, then every slot read will mean two memory reads (read the reference value, then read what it points to and check whether it refers to a forwarded oop). If we cannot avoid that, then "lazy" become will mean a "crawling" runtime :)
Actually, on the other hand, if something reads an oop, then in most cases it will need to access its contents at some point (the only exceptions being copy/assignment operations). So doing this check will just bring the object's contents into the CPU cache, but I think the slowdown will still be too significant to just ignore.
So I'd like to know your thoughts about minimizing this bad impact. I am interested in how many places we can have with a strong guarantee that no forwarded oops will ever appear there. Or can we not have such places, and are we doomed to always keep checking oops on every read?
The current design reserves the two least significant bits to indicate whether the object header is 1, 2 or 3 words. But in your proposed format the least significant bits are reserved for the slot size field, which can be an arbitrary value. So how do you implement heap walking and determine whether the first word of the next object is its header or its size field?
OK, here's one way to implement heap walking:
In heap walking the memory manager needs to be able to detect the start of the next object. This is complicated by the short and long header formats, short being for objects with 254 slots or less, long being for objects with 255 slots or more. The class index field can be used to mark special objects. In particular the tagged class indices 1 through 7, which correspond to objects with tag bits 1 through 7 (SmallInteger = 1, 3, 5, 7; Character = e.g. 2; SmallFloat = e.g. 4), never occur in the class index fields of normal objects. So if the size doubleword uses all bits other than the class field (44 bits gives an adequate maximum size of 2^46 bytes, ~10^14 bytes), then size doublewords can be marked by using one of the tagged class indices in their class field. To identify the next object the VM fetches the doubleword immediately following the current object (object bodies being rounded up to 8 bytes in the 32-bit VM). If that doubleword's class index field is the size-doubleword class index pun, e.g. 1, then it is a size field, the object header is the doubleword following it, and the object's slots start after that. If not, the object header is that doubleword and the object's slots follow it.
Okay. Nice trick. :)
So, can we have a slot for arbitrary properties? Then you can free the bits in the header reserved for the hash, and expand both the size and class index fields.
vm-dev@lists.squeakfoundation.org