Re: [squeak-dev] ByteArray accessors for 64-bit manipulation


Eliot Miranda

31 Aug 2015

7:25 p.m.

Hi Chrises,

my vote would be to write these as 12 numbered primitives, (2,4 & 8 bytes) * (at: & at:put:) * (big & little endian) because they can be performance critical and implementing them like this means the maximum efficiency in both 32-bit and 64-bit Spur, plus the possibility of the JIT implementing the primitives.

On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham cunningham.cb@gmail.com wrote:

...

Hi Chris,

I'm all for having the fastest version that works in the image. If you could make your version handle endianness, then I'm all for including it (at least in the 3 variants that are faster). My first use for this (an interface for KAFKA) apparently requires big-endian access, so I really want that supported.

It might be best to keep my naming, though - it follows the name pattern that is already in the class. Or will yours also support 128?

-cbc

On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller asqueaker@gmail.com wrote:

...
Hi Chris, I think these methods belong in the image with the fastest implementation we can do.

I implemented 64-bit unsigned access for Ma Serializer back in 2005. I modeled my implementation after Andreas' original approach which tries to avoid LI arithmetic. I was curious whether your implementations would be faster, because if they are then it could benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a trunk image, I used the following script to take comparison measurements:

| smallN largeN maBa cbBa |
smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
maBa := ByteArray new: 8.
cbBa := ByteArray new: 8.
maBa maUint: 64 at: 0 put: largeN.
cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
{
    'cbc smallN write' -> [cbBa unsignedLong64At: 1 put: smallN bigEndian: false] bench.
    'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN] bench.
    'cbc smallN access' -> [cbBa unsignedLong64At: 1 bigEndian: false] bench.
    'ma smallN access' -> [cbBa maUnsigned64At: 1] bench.
    'cbc largeN write' -> [cbBa unsignedLong64At: 1 put: largeN bigEndian: false] bench.
    'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN] bench.
    'cbc largeN access' -> [cbBa unsignedLong64At: 1 bigEndian: false] bench.
    'ma largeN access' -> [cbBa maUnsigned64At: 1] bench.
}

Here are the results:

'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' .
'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' .
'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' .
'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' .
'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' .
'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' .
'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' .
'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.'

It looks like your 64-bit access is 86% faster for accessing the high end of the 64-bit range, but slower in the other 3 metrics. Notably, it was only 14% as fast for writing the high end of the 64-bit range, and only about 26% as fast for small-number access.
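The percentages above can be rechecked directly from the reported throughput figures. The following workspace sketch is only an illustration: the dictionary keys are made-up labels, and the numbers are copied from the results listed above.

```smalltalk
"Runs per second, copied from the benchmark results above.
 The keys are illustrative labels, not selectors from either package."
| cbc ma |
cbc := Dictionary new.
cbc
    at: #smallWrite put: 3110000;
    at: #smallAccess put: 4300000;
    at: #largeWrite put: 907000;
    at: #largeAccess put: 1900000.
ma := Dictionary new.
ma
    at: #smallWrite put: 4770000;
    at: #smallAccess put: 16400000;
    at: #largeWrite put: 6620000;
    at: #largeAccess put: 1020000.
"cbc throughput as a percentage of ma throughput, rounded to an integer."
#(#smallWrite #smallAccess #largeWrite #largeAccess) collect: [:key |
    key -> ((cbc at: key) * 100 / (ma at: key)) rounded]
```

Evaluating the last expression should give 65, 26, 14 and 186 for the four cases, i.e. "86% faster" for large-number access and "14% as fast" for large-number writes.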

On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham cunningham.cb@gmail.com wrote:

...
Hi.

I've committed a change to the inbox to allow getting/putting 64-bit values to ByteArrays (similar to the 32- and 16-bit accessors). Could this be added to trunk?

Also, first time I used the selective commit function - very nice! The changes I didn't want committed didn't, in fact, get committed. Just the desirable bits!

-cbc

-- _,,,^..^,,,_ best, Eliot


Chris Muller

31 Aug

8:35 p.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

Sometimes the number of bytes is only known in a variable, so would it be possible to do 4 primitives which accept the number of bits (or bytes) as an argument? (uint:at: uint:at:put:) * (big endian, little endian)


Eliot Miranda

9:21 p.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

On Mon, Aug 31, 2015 at 11:35 AM, Chris Muller asqueaker@gmail.com wrote:

...

Sometimes the number of bytes is only known in a variable, so would it be possible to do 4 primitives which accept the number of bits (or bytes) as an argument? (uint:at: uint:at:put:) * (big endian, little endian)

Of course it's possible, but such an architecture can hardly be quick. If one needs the flexible primitives then use them, but don't hobble the system by only providing them. Having a real 64-bit VM means that the use of two 32-bit accesses is unnecessarily slow.

Which would you rather, and which would you think would be faster (I don't know, but I have my suspicions):

Expand the existing flexible integerAt: prims to integerAt:put:bytes:signed:bigEndian: (yuck), or implement this in terms of a wrapper something like

ByteArray>>integerAt: index bytes: numBytes signed: signed bigEndian: bigEndian
    | value sign |
    ^numBytes >= 4
        ifTrue:
            [numBytes = 8 ifTrue:
                [value := self unsignedLong64At: index.
                 bigEndian ifTrue: [value := self byteReverseEightBytes: value].
                 "if the VM is intelligent about left shift of zero then the ~= 0 test is unnecessary..."
                 (signed and: [(sign := value bitShift: -63) ~= 0]) ifTrue:
                    [value := value - ((sign bitAnd: 1) bitShift: 64)].
                 ^value].
             numBytes = 4 ifTrue:
                [value := self unsignedLong32At: index.
                 bigEndian ifTrue: [value := self byteReverseFourBytes: value].
                 "if the VM is intelligent about left shift of zero then the ~= 0 test is unnecessary..."
                 (signed and: [(sign := value bitShift: -31) ~= 0]) ifTrue:
                    [value := value - ((sign bitAnd: 1) bitShift: 32)].
                 ^value].
             ^self error: 'numBytes must be a power of two from 1 to 8']
        ifFalse: ...
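The two value transformations in the wrapper (byte reversal for big-endian data, then reinterpreting the unsigned result as two's complement) can be tried in a workspace without any new primitives. This is a minimal sketch: the block names reverse8 and toSigned64 are hypothetical stand-ins, not existing selectors, and the byte-at-a-time loop is only for illustration, not a suggestion for how byteReverseEightBytes: should be implemented.

```smalltalk
"Workspace sketch; reverse8 and toSigned64 are hypothetical helper blocks."
| reverse8 toSigned64 |
"Reverse the eight bytes of an unsigned 64-bit value, one byte at a time."
reverse8 := [:v | | r |
    r := 0.
    0 to: 7 do: [:i |
        r := (r bitShift: 8) + ((v bitShift: i * -8) bitAnd: 16rFF)].
    r].
"Reinterpret an unsigned 64-bit value as a two's complement signed value."
toSigned64 := [:v |
    (v bitShift: -63) = 0
        ifTrue: [v]
        ifFalse: [v - (1 bitShift: 64)]].
"Each element of this array should be true."
{ (reverse8 value: 16r1122334455667788) = 16r8877665544332211.
  (toSigned64 value: 16rFFFFFFFFFFFFFFFF) = -1.
  (toSigned64 value: 16r7FFFFFFFFFFFFFFF) = ((2 raisedTo: 63) - 1).
  (toSigned64 value: (reverse8 value: 16rFEFFFFFFFFFFFFFF)) = -2 }
```

The last element mirrors the order used in the wrapper: reverse first, then apply the sign correction.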


-- _,,,^..^,,,_ best, Eliot

Chris Cunningham

11:49 p.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

Hi Chris,

So, I've finally installed Ma Serializer (which I should have done first thing), and understand better what is going on.

Your uint:at: is already delegating to various other methods depending on the byte size, so introducing unsignedLong64At:bigEndian: would be in line with that - as long as we made it as efficient as your version. Perfect. Your uint:at:put:, however, writes directly to the ByteArray (and assumes little-endian). If I/we can make unsignedLong64At:put:bigEndian: work as efficiently as uint:at:put:, would that work for you?

-cbc


Chris Cunningham

1 Sep

12:59 a.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

Ok. After tweaking unsignedLong64At:bigEndian: to avoid as many LargeInteger shifts as I can, I get:

smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
cbBa := ByteArray new: 8.
cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
self assert: (cbBa maUnsigned64At: 1) = (cbBa unsignedLong64At: 1 bigEndian: false).
{
    'smallN write' -> (cbBa unsignedLong64At: 1 put: smallN bigEndian: false).
    'cbc smallN access' -> [cbBa unsignedLong64At: 1 bigEndian: false] bench.
    'ma smallN access' -> [cbBa maUnsigned64At: 1] bench.
    'smallN write' -> (cbBa unsignedLong64At: 1 put: largeN bigEndian: false).
    'cbc largeN access' -> [cbBa unsignedLong64At: 1 bigEndian: false] bench.
    'ma largeN access' -> [cbBa maUnsigned64At: 1] bench.
}

{
    'smallN write'->15464 .
    'cbc smallN access'->'18,500,000 per second. 54.1 nanoseconds per run.' .
    'ma smallN access'->'18,800,000 per second. 53.2 nanoseconds per run.' .
    'smallN write'->17835413562943208876 .
    'cbc largeN access'->'337,000 per second. 2.97 microseconds per run.' .
    'ma largeN access'->'137,000 per second. 7.29 microseconds per run.'
}

So, about 1 nanosecond slower than yours on my machine for small numbers, and significantly faster than yours on large numbers (and faster than what I had before, taking about 3/4 of the previous time).
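For readers who want to see what "avoiding LargeInteger shifts" can look like, here is a rough workspace sketch of one common approach: build the low and high 32-bit halves separately and only combine them when the high half is non-zero. It is an assumption about the general technique, not the code in the inbox commit, and readLE64 is a hypothetical block name. On a 32-bit image the halves themselves may still overflow SmallInteger, so the saving there is smaller.

```smalltalk
"Hypothetical sketch of a little-endian 64-bit read; not the committed code."
| readLE64 bytes |
readLE64 := [:ba :index | | low high |
    "Assemble each 32-bit half from its four bytes, least significant first."
    low := (ba at: index)
        + ((ba at: index + 1) bitShift: 8)
        + ((ba at: index + 2) bitShift: 16)
        + ((ba at: index + 3) bitShift: 24).
    high := (ba at: index + 4)
        + ((ba at: index + 5) bitShift: 8)
        + ((ba at: index + 6) bitShift: 16)
        + ((ba at: index + 7) bitShift: 24).
    "Small values never touch the 64-bit shift at all."
    high = 0
        ifTrue: [low]
        ifFalse: [(high bitShift: 32) + low]].
"Bytes 16r88 16r77 16r66 16r55 16r44 16r33 16r22 16r11, written in decimal."
bytes := #[136 119 102 85 68 51 34 17].
(readLE64 value: bytes value: 1) = 16r1122334455667788
```

The final comparison should answer true. The early exit when high = 0 is what keeps the small-number case cheap.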

This is in inBox as Collections-cbc.651.mcz.

I'll work on the writing speed as well.

-cbc


David T. Lewis

2:21 a.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

I would ask that someone please measure the real-world performance benefit of adding these (or any other) numbered primitives. Maybe it's a lot, maybe it's not, but when in doubt leave it out.

Dave


Tobias Pape

7:51 a.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

On 01.09.2015, at 02:21, David T. Lewis lewis@mail.msen.com wrote:

...

I would ask that someone please measure the real-world performance benefit of adding these (or any other) numbered primitives. Maybe it's a lot, maybe it's not, but when in doubt leave it out.

...

Dave

On Mon, Aug 31, 2015 at 10:25:59AM -0700, Eliot Miranda wrote:

...
Hi Chrises,

my vote would be to write these as 12 numbered primitives, (2,4 & 8 bytes) * (at: & at:put:) * (big & little endian) because they can be performance critical and implementing them like this means the maximum efficiency in both 32-bit and 64-bit Spur, plus the possibility of the JIT implementing the primitives.

On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham cunningham.cb@gmail.com wrote:

...
Hi Chris,

I'm all for having the fastest that in the image that works. If you could make your version handle endianess, then I'm all for including it (at least in the 3 variants that are faster). My first use for this (interface for KAFKA) apparently requires bigEndianess, so I really want that supported.

It might be best to keep my naming, though - it follows the name pattern that is already in the class. Or will yours also support 128?

-cbc

On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller asqueaker@gmail.com wrote:

...
Hi Chris, I think these methods belong in the image with the fastest implementation we can do.

I implemented 64-bit unsigned access for Ma Serializer back in 2005. I modeled my implementation after Andreas' original approach which tries to avoid LI arithmetic. I was curious whether your implementations would be faster, because if they are then it could benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a trunk image, I used the following script to take comparison measurements:

| smallN largeN maBa cbBa | smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom. largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom. maBa := ByteArray new: 8. cbBa := ByteArray new: 8. maBa maUint: 64 at: 0 put: largeN. cbBa unsignedLong64At: 1 put: largeN bigEndian: false. self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false). { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false] bench. 'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench. 'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false. ] bench. 'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench. 'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false] bench. 'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench. 'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench. 'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench. }

Here are the results:

    'cbc smallN write'->'3,110,000 per second. 322 nanoseconds per run.' .
    'ma smallN write'->'4,770,000 per second. 210 nanoseconds per run.' .
    'cbc smallN access'->'4,300,000 per second. 233 nanoseconds per run.' .
    'ma smallN access'->'16,400,000 per second. 60.9 nanoseconds per run.' .
    'cbc largeN write'->'907,000 per second. 1.1 microseconds per run.' .
    'ma largeN write'->'6,620,000 per second. 151 nanoseconds per run.' .
    'cbc largeN access'->'1,900,000 per second. 527 nanoseconds per run.' .
    'ma largeN access'->'1,020,000 per second. 982 nanoseconds per run.'

It looks like your 64-bit access is 86% faster for accessing the high end of the 64-bit range (1,900,000 vs. 1,020,000 per second), but slower in the other 3 metrics. Notably, it was only 14% as fast for writing the high end of the 64-bit range (907,000 vs. 6,620,000 per second), and only about 26% as fast for small-number access (4,300,000 vs. 16,400,000 per second).

On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham cunningham.cb@gmail.com wrote:

...
Hi.

I've committed a change to the inbox with changes to allow getting/putting 64bit values to ByteArrays (similar to 32 and 16 bit accessors). Could this be added to trunk?

Also, this was the first time I used the selective commit function - very nice! The changes I didn't want committed didn't, in fact, get committed. Just the desirable bits!

-cbc

-- _,,,^..^,,,_ best, Eliot

Andres Valloud

2:52 a.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

FWIW... IMO it's better to enable access to the relevant compiler intrinsic with platform specific macros, rather than implementing instructions such as Intel's BSWAP or MOVBE by hand. In HPS, isolating endianness concerns from the large integer arithmetic primitives with such macros enabled 25-40% faster performance on big endian platforms. Just as importantly, the intrinsic approach takes significantly less code to implement.
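
A minimal C sketch of the macro approach described above, assuming GCC/Clang (`__builtin_bswap64`) or MSVC (`_byteswap_uint64`). The names `VM_BSWAP64`, `vm_bswap64_fallback`, `host_is_little_endian` and `load_be64` are hypothetical and are not taken from HPS or the Squeak VM sources; the run-time endianness probe is likewise an assumption made to keep the sketch portable.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical macro: route a 64-bit byte swap to the compiler intrinsic. */
#if defined(__GNUC__) || defined(__clang__)
#  define VM_BSWAP64(x) __builtin_bswap64(x)
#elif defined(_MSC_VER)
#  include <stdlib.h>
#  define VM_BSWAP64(x) _byteswap_uint64(x)
#else
/* Portable fallback when no intrinsic is known for this compiler. */
static uint64_t vm_bswap64_fallback(uint64_t x) {
    x = ((x & 0x00000000FFFFFFFFULL) << 32) | (x >> 32);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    return x;
}
#  define VM_BSWAP64(x) vm_bswap64_fallback(x)
#endif

/* Assumption: detect host byte order at run time instead of per-platform #ifdefs. */
static int host_is_little_endian(void) {
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Read 8 bytes stored big-endian at p (any alignment) as an unsigned 64-bit value. */
static uint64_t load_be64(const uint8_t *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);   /* unaligned-safe load of the raw bytes */
    return host_is_little_endian() ? VM_BSWAP64(v) : v;
}
```

With the swap isolated behind one macro, the accessor itself stays the same on every platform, and an optimizing compiler can typically reduce the load plus swap to one or two instructions.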


Eliot Miranda

4:12 a.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

Hi Andres,

...

On Aug 31, 2015, at 5:52 PM, Andres Valloud avalloud@smalltalk.comcastbiz.net wrote:

FWIW... IMO it's better to enable access to the relevant compiler intrinsic with platform specific macros, rather than implementing instructions such as Intel's BSWAP or MOVBE by hand. In HPS, isolating endianness concerns from the large integer arithmetic primitives with such macros enabled 25-40% faster performance on big endian platforms. Just as importantly, the intrinsic approach takes significantly less code to implement.

Makes sense, and the performance increases are impressive. The only issue I have is that the Cog JIT (which would have the easiest time generating those intrinsics) currently runs only on little-endian platforms, and I seriously doubt it will run on a big-endian platform in the next five years. PowerPC is the only possibility I see. Yes, ARM is bi-endian, but all the popular applications I know of are little-endian.

VW's a different beast; significant big endian legacy.

But what you say about isolating makes perfect sense. Thanks

-- _,,,^..^,,,_ best, Eliot

Andres Valloud

4:39 a.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

Interesting about the fading relevancy of big endian platforms. Just in case the point was lost, I meant the macro-style approach in contrast with this (from Squeak-dev):

=============== Diff against Collections-cbc.650 ===============

Item was changed:
  ----- Method: ByteArray>>unsignedLong64At:bigEndian: (in category 'platform independent access') -----
  unsignedLong64At: index bigEndian: aBool
+     "Avoid as much largeInteger as we can"
+     | b0 b2 b3 b5 b6 w n2 n3 |
+
+     aBool ifFalse: [
+         w := self at: index.
+         b6 := self at: index+1.
+         b5 := self at: index+2.
+         n2 := self at: index+3.
+         b3 := self at: index+4.
+         b2 := self at: index+5.
+         n3 := self at: index+6.
+         b0 := self at: index+7.
+     ] ifTrue: [
+         b0 := self at: index.
+         n3 := self at: index+1.
+         b2 := self at: index+2.
+         b3 := self at: index+3.
+         n2 := self at: index+4.
+         b5 := self at: index+5.
+         b6 := self at: index+6.
+         w := self at: index+7.
+     ].
+
+     "Minimize LargeInteger arithmetic"
+     b6 = 0 ifFalse:[w := (b6 bitShift: 8) + w].
+     b5 = 0 ifFalse:[w := (b5 bitShift: 16) + w].
+
+     b3 = 0 ifFalse:[n2 := (b3 bitShift: 8) + n2].
+     b2 = 0 ifFalse:[n2 := (b2 bitShift: 16) + n2].
+     n2 == 0 ifFalse: [w := (n2 bitShift: 24) + w].
+
+     b0 = 0 ifFalse:[n3 := (b0 bitShift: 8) + n3].
+     n3 == 0 ifFalse: [w := (n3 bitShift: 48) + w].
+
+     ^w!
-     | n1 n2 |
-     aBool
-         ifTrue: [
-             n2 := self unsignedLongAt: index bigEndian: true.
-             n1 := self unsignedLongAt: index+4 bigEndian: true.
-         ]
-         ifFalse: [
-             n1 := self unsignedLongAt: index bigEndian: false.
-             n2 := self unsignedLongAt: index+4 bigEndian: false.
-         ].
-     ^(n2 bitShift: 32) + n1!

I'd rather have that pushed down enough so that the compiler intrinsic becomes visible. And at that point, all that code is reduced to a single instruction.

Andres.


Chris Cunningham

5:21 a.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

Hi Andres,

ByteArray currently doesn't have a primitive that handles any part of getting bytes from the ByteArray and forming them into an integer. If it did have one, I would be happy to alter the code around that.

The long, drawn-out method is 4x faster for small (SmallInteger) results, and 25% faster for LargeInteger results (those that exercise all 8 bytes). This is because it does at most 2 LargeInteger bitShifts, and as few as none. The 'macro' version does a minimum of 1 LargeInteger bitShift, and up to 3 of them.
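
The grouping behind that count can be sketched in C for the little-endian case. This is an illustration only: in C every step is plain machine-word arithmetic, so the sketch shows the structure, not the cost. It assumes a 32-bit Squeak image where SmallInteger holds values up to 2^30 - 1, so the 24-bit and 16-bit partial values would stay SmallIntegers and only the two final shifts could produce LargeIntegers. The function names are hypothetical.

```c
#include <stdint.h>

/* Straightforward little-endian read: one shift-and-or per byte. */
static uint64_t read_u64_le_simple(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Grouped read mirroring the long method: three small partial values,
   then at most two wide shifts to combine them. */
static uint64_t read_u64_le_grouped(const uint8_t *p) {
    /* Each partial value is at most 24 bits wide. */
    uint32_t w  = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    uint32_t n2 = (uint32_t)p[3] | ((uint32_t)p[4] << 8) | ((uint32_t)p[5] << 16);
    uint32_t n3 = (uint32_t)p[6] | ((uint32_t)p[7] << 8);
    uint64_t result = w;
    /* Skip the wide shifts entirely when the upper groups are zero. */
    if (n2 != 0) result += (uint64_t)n2 << 24;
    if (n3 != 0) result += (uint64_t)n3 << 48;
    return result;
}
```

For a small value such as 16r1234 both upper groups are zero, so neither wide shift runs, which corresponds to the case with no LargeInteger bitShifts.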

For BigEndian platforms, speed may be important; in any case, it is nice.

You are probably aware, but the current Squeak does not have #unsignedLong64At:bigEndian: in the image at all - that diff was from my first attempt.

-cbc


Chris Cunningham

5:42 a.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

Ok. Committed Collections-cbc.652.mcz to the inbox. It has a faster unsignedLong64At:put:bigEndian:. It essentially uses the ma code, but adds an endianness check so that it works for both byte orders.
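
As a rough illustration of what an endianness-checked 64-bit store does, here is a minimal C sketch. It is not the committed Smalltalk code or the Ma Serializer code; `write_u64` is a hypothetical name, and the byte-at-a-time loop is chosen for clarity, not speed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Store the 8 bytes of value at dst in the requested byte order. */
static void write_u64(uint8_t *dst, uint64_t value, bool big_endian) {
    for (int i = 0; i < 8; i++) {
        /* Byte i counted from the least significant end of value. */
        uint8_t byte = (uint8_t)(value >> (8 * i));
        /* Little-endian puts byte 0 first; big-endian puts it last. */
        dst[big_endian ? 7 - i : i] = byte;
    }
}
```

The same loop serves both byte orders; only the destination index changes, which is the kind of check the description above refers to.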

Speed test:

    smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
    largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
    maBa := ByteArray new: 8.
    cbBa := ByteArray new: 8.
    maBa maUint: 64 at: 0 put: largeN.
    cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
    self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
    self assert: (cbBa maUnsigned64At: 1) = (cbBa unsignedLong64At: 1 bigEndian: false).

    {
    'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false] bench.
    'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench.
    'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false] bench.
    'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench.
    }

    {
    'cbc smallN write'->'3,770,000 per second. 266 nanoseconds per run.' .
    'ma smallN write'->'3,700,000 per second. 270 nanoseconds per run.' .
    'cbc largeN write'->'4,190,000 per second. 238 nanoseconds per run.' .
    'ma largeN write'->'4,120,000 per second. 243 nanoseconds per run.'
    }

I would like to have this pushed to Trunk so that we have shared 64-bit access to ByteArrays. I know of at least 3 places where this has been coded: my code, VMMaker, and Ma Serializer. Several other places probably exist.

The put code (benchmarked above) shouldn't be affected if it is run in a 64 bit image - if anything just a bit faster.

The get code (also in the committed code) should not get any slower in a 64-bit image. If SmallInteger has more bits there than in a 32-bit image, the get code could be optimized to run faster in 64-bit images at the cost of running slower in 32-bit images, so that is probably not worth doing in this code. If you* do make a primitive for these accesses, then those would obviously be significantly faster, and this could be the fallback code.
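
To make the trade-off concrete, here is a hedged C sketch. It assumes 30 value bits for a non-negative SmallInteger in a 32-bit image and 60 in a 64-bit Spur image; those widths and all names below are assumptions of the sketch, not something stated in this thread. Under them, 3 whole bytes fit in a SmallInteger on 32 bits but 7 fit on 64 bits, so a 64-bit-tuned read could gather 7 bytes in SmallInteger arithmetic and need at most one wide shift.

```c
#include <stdint.h>

/* Assumed SmallInteger value bits (sign excluded) per image word size. */
enum { SMALLINT_VALUE_BITS_32 = 30, SMALLINT_VALUE_BITS_64 = 60 };

/* Whole bytes that fit in a non-negative SmallInteger of that width. */
static int bytes_fitting_small_integer(int value_bits) {
    return value_bits / 8;
}

/* Little-endian read grouped for the assumed 64-bit image: the low
   7 bytes (56 bits) form one partial value, the top byte is added last. */
static uint64_t read_u64_le_wide_groups(const uint8_t *p) {
    uint64_t low = 0;
    for (int i = 6; i >= 0; i--)
        low = (low << 8) | p[i];          /* stays below 2^56 */
    uint64_t high = p[7];
    /* Only a non-zero top byte pushes the result past 56 bits. */
    return high == 0 ? low : low + (high << 56);
}
```

In a 32-bit image the same grouping would overflow SmallInteger after the fourth byte, which is why tuning for one word size costs the other.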

(* having never made a primitive nor compiled the image - yet - I probably won't be adding these. And I don't need that speed myself at this point, either.)

-cbc

On Mon, Aug 31, 2015 at 8:21 PM, Chris Cunningham cunningham.cb@gmail.com wrote:

...

Hi Andres,

ByteArray currently doesn't have a primitive that handles any part of getting bytes from the ByteArray and forming them into an integer. If it did have one, I would be happy to alter the code around that.

The long drawn out method is 4x faster for small (SmallInteger) results, and 25% faster for LargeInteger results (those that excercise all 8 bytes). This because it does at most 2 LargeInteger bitShifts, and as little as no LargeInteger bitShifts. The 'macro' version does a minimum of 1 LargeInteger bitShifts, and up to 3 of them.

For BigEndian platforms, speed may be important; in any case, it is nice.

You are probably aware, but the current Squeak has does not have #unsignedLong64At:bigEndian: in the image at all - that diff was from my first attempt.

-cbc

On Mon, Aug 31, 2015 at 7:39 PM, Andres Valloud < avalloud@smalltalk.comcastbiz.net> wrote:

...
Interesting about the fading relevancy of big endian platforms. Just in case the point was lost, I meant the macro-style approach in contrast with this (from Squeak-dev):

=============== Diff against Collections-cbc.650 ===============

Item was changed: ----- Method: ByteArray>>unsignedLong64At:bigEndian: (in category 'platform independent access') ----- unsignedLong64At: index bigEndian: aBool
  "Avoid as much largeInteger as we can"
  | b0 b2 b3 b5 b6 w n2 n3 |
  aBool ifFalse: [
          w := self at: index.
          b6 := self at: index+1.
          b5 := self at: index+2.
          n2 := self at: index+3.
          b3 := self at: index+4.
          b2 := self at: index+5.
          n3 := self at: index+6.
          b0 := self at: index+7.
  ] ifTrue: [
          b0 := self at: index.
          n3 := self at: index+1.
          b2 := self at: index+2.
          b3 := self at: index+3.
          n2 := self at: index+4.
          b5 := self at: index+5.
          b6 := self at: index+6.
          w := self at: index+7.
          ].
  "Minimize LargeInteger arithmetic"
  b6 = 0 ifFalse:[w := (b6 bitShift: 8) + w].
  b5 = 0 ifFalse:[w := (b5 bitShift: 16) + w].
  b3 = 0 ifFalse:[n2 := (b3 bitShift: 8) + n2].
  b2 = 0 ifFalse:[n2 := (b2 bitShift: 16) + n2].
  n2 == 0 ifFalse: [w := (n2 bitShift: 24) + w].
  b0 = 0 ifFalse:[n3 := (b0 bitShift: 8) + n3].
  n3 == 0 ifFalse: [w := (n3 bitShift: 48) + w].
  ^w!
  | n1 n2 |
  aBool
          ifTrue: [
                  n2 := self unsignedLongAt: index  bigEndian: true.
                  n1 := self unsignedLongAt: index+4  bigEndian:
true.
                  ]
          ifFalse: [
                  n1 := self unsignedLongAt: index bigEndian: false.
                  n2 := self unsignedLongAt: index+4 bigEndian:
false.
                  ].
  ^(n2 bitShift: 32) + n1!
I'd rather have that pushed down enough so that the compiler intrinsic becomes visible. And at that point, all that code is reduced to a single instruction.

Andres.

On 8/31/15 19:12 , Eliot Miranda wrote:

...
Hi Andres,

On Aug 31, 2015, at 5:52 PM, Andres Valloud <

...
avalloud@smalltalk.comcastbiz.net> wrote:

FWIW... IMO it's better to enable access to the relevant compiler intrinsic with platform specific macros, rather than implementing instructions such as Intel's BSWAP or MOVBE by hand. In HPS, isolating endianness concerns from the large integer arithmetic primitives with such macros enabled 25-40% faster performance on big endian platforms. Just as importantly, the intrinsic approach takes significantly less code to implement.

Makes sense, and the performance increases are impressive. The only issue I have is that the Cog JIT (which would have the easiest time generating those intrinsics) currently runs only in little-endianness platforms and I seriously doubt it will run in a big endianness platform in the next five years. PowerPC is the only possibility I see. Yes, ARM is biendian but all the popular applications I know of are little endian.

VW's a different beast; significant big endian legacy.

But what you say about isolating makes perfect sense. Thanks

...
On 8/31/15 10:25 , Eliot Miranda wrote:

...
Hi Chrises,
my vote would be to write these as 12 numbered primitives, (2,4 & 8 bytes) * (at: & at:put:) * (big & little endian) because they can be performance critical and implementing them like this means the maximum efficiency in both 32-bit and 64-bit Spur, plus the possibility of the JIT implementing the primitives.

On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham <cunningham.cb@gmail.com> wrote:

Hi Chris,

I'm all for having the fastest that in the image that works. If you could make your version handle endianess, then I'm all for including it (at least in the 3 variants that are faster). My first use for this (interface for KAFKA) apparently requires bigEndianess, so I really want that supported.

It might be best to keep my naming, though - it follows the name pattern that is already in the class. Or will yours also support 128?

-cbc

On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller <asqueaker@gmail.com> wrote:

    Hi Chris, I think these methods belong in the image with the fastest implementation we can do.

    I implemented 64-bit unsigned access for Ma Serializer back in 2005. I modeled my implementation after Andreas' original approach which tries to avoid LI arithmetic. I was curious whether your implementations would be faster, because if they are then it could benefit Magma. After loading "Ma Serializer" 1.5 (or head) into a trunk image, I used the following script to take comparison measurements:

    | smallN largeN maBa cbBa |
    smallN := ((2 raisedTo: 13) to: (2 raisedTo: 14)) atRandom.
    largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
    maBa := ByteArray new: 8.
    cbBa := ByteArray new: 8.
    maBa maUint: 64 at: 0 put: largeN.
    cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
    self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1 bigEndian: false).
    { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN bigEndian: false] bench.
    'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench.
    'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false. ] bench.
    'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench.
    'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN bigEndian: false] bench.
    'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench.
    'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian: false ] bench.
    'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench.
    }

    Here are the results:

    'cbc smallN write'->'3,110,000 per second.  322 nanoseconds per run.' .
    'ma smallN write'->'4,770,000 per second.  210 nanoseconds per run.' .
    'cbc smallN access'->'4,300,000 per second.  233 nanoseconds per run.' .
    'ma smallN access'->'16,400,000 per second.  60.9 nanoseconds per run.' .
    'cbc largeN write'->'907,000 per second.  1.1 microseconds per run.' .
    'ma largeN write'->'6,620,000 per second.  151 nanoseconds per run.' .
    'cbc largeN access'->'1,900,000 per second.  527 nanoseconds per run.' .
    'ma largeN access'->'1,020,000 per second.  982 nanoseconds per run.'

    It looks like your 64-bit access is 86% faster for accessing the high-end of the 64-bit range, but slower in the other 3 metrics. Noticeably, it was only 14% as fast for writing the high-end of the 64-bit range, and similarly as much slower for small-number access.

    On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham <cunningham.cb@gmail.com> wrote:
    > Hi.
    >
    > I've committed a change to the inbox with changes to allow getting/putting
    > 64bit values to ByteArrays (similar to 32 and 16 bit accessors). Could this
    > be added to trunk?
    >
    > Also, first time I used the selective commit function - very nice! the
    > changes I didn't want committed didn't, in fact, get commited. Just the
    > desirable bits!
    >
    > -cbc
    >

-- _,,,^..^,,,_ best, Eliot

.

Andres Valloud

5:46 a.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

Chris,

All I'm trying to say is Learn From My Fail. After dealing with such code for a while (in C --- yuck), I realized it was much better to use the compiler intrinsics. Once I had the new code running, I deleted numerous implementations of the "let's swap bytes around" business. IIRC it was a net lines-of-code loss, and less code is great.

I'm not disputing the performance gains. But consider how much faster and simpler still the compiler intrinsic approach could be.

The Smalltalk code does 8 at:, which are bound checked. There are also 7 additions, which have to be checked against overflow. Then there are some comparisons, some ifFalse:, more bitShift: (overflow check), more additions (overflow check), and so on. Creating large integers is going to be costly. Even with optimized bounds / overflow checks, surely that's going to expand to tens of assembly instructions, if not a couple hundred.
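
To make that cost concrete, here is a minimal C sketch of the same shape of work: eight individually bounds-checked byte reads, each shifted into place. It is only an analogy for the Smalltalk method, not VM code; `checked_at` and `checked_load_u64` are hypothetical names, and indices are 1-based to match `at:`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a bounds-checked at: (1-based index).
   Returns 1 on success, 0 if the index is out of range. */
static int checked_at(const uint8_t *bytes, size_t size, size_t index, uint8_t *out)
{
    if (index < 1 || index > size) return 0;
    *out = bytes[index - 1];
    return 1;
}

/* Reads 8 bytes starting at 1-based index, one checked access at a time.
   Returns 1 and stores the value on success, 0 on a bounds failure. */
static int checked_load_u64(const uint8_t *bytes, size_t size, size_t index,
                            int big_endian, uint64_t *out)
{
    uint64_t w = 0;
    for (size_t i = 0; i < 8; i++) {
        uint8_t b;
        if (!checked_at(bytes, size, index + i, &b)) return 0;
        /* In big-endian order the first byte is the most significant. */
        unsigned shift = (unsigned)(big_endian ? 8 * (7 - i) : 8 * i);
        w |= (uint64_t)b << shift;
    }
    *out = w;
    return 1;
}
```

Even in C this is eight checks, eight loads and eight shift-and-or steps before any large integer is created.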

I'm not saying this is an inefficient Smalltalk way of doing things, however I'd point out it's reimplementing what's available in hardware.

In contrast, if that code was made into a primitive, a relevant compiler intrinsic would be available to help. The VM would grab a 64 bit integer with a MOVBE instruction, perform one test for overflow, and if all is well then return tagging the integer with a single LEA instruction. If the result must be a large integer, the VM might as well create it. Surely a decent compiler can express all that in very little code, because it's effectively sweeping all the complexity under MOVBE.

For the sake of illustration, and assuming only one tag bit for simplicity, the assembly would be something like:

    ; calculate the pointer to dereference in rax, then...
        movbe rax, [rax]
        test rax, rax
        js overflowToLargeInteger
        lea rax, [rax+rax+1]
        ret

    ; ok, it didn't fit, so...
    overflowToLargeInteger:
        call largePositiveIntegerFromRAX ; returning in RAX
        ret

I'd imagine the integer arithmetic cannot possibly fit in that space.
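
The overflow test and tagging step can also be sketched in C. This assumes, as above, a 64-bit word with a single low tag bit; real VMs may use a different tagging scheme, and `try_tag_unsigned` is a hypothetical name. Unlike the illustrative assembly, which tests only the sign bit, the sketch rejects values that use either of the top two bits, so that the doubled value cannot become negative.

```c
#include <stdint.h>

/* Assumption: one low tag bit marks a small integer. */
#define SMALL_INT_TAG 1u

/* Returns 1 and writes the tagged word if value fits in the
   non-negative small integer range (below 2^62 under this scheme).
   Returns 0 when the caller would have to allocate a large integer. */
static int try_tag_unsigned(uint64_t value, uint64_t *tagged)
{
    if ((value >> 62) != 0) return 0;          /* needs bit 62 or 63 */
    *tagged = (value << 1) | SMALL_INT_TAG;    /* same effect as lea rax, [rax+rax+1] */
    return 1;
}
```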

Andres.

On 8/31/15 20:21 , Chris Cunningham wrote:

...

Hi Andres,

For BigEndian platforms, speed may be important; in any case, it is nice.

You are probably aware, but the current Squeak does not have #unsignedLong64At:bigEndian: in the image at all - that diff was from my first attempt.

-cbc

On Mon, Aug 31, 2015 at 7:39 PM, Andres Valloud <avalloud@smalltalk.comcastbiz.net> wrote:

Interesting about the fading relevancy of big endian platforms.
Just in case the point was lost, I meant the macro-style approach in
contrast with this (from Squeak-dev):

=============== Diff against Collections-cbc.650 ===============

Item was changed:
   ----- Method: ByteArray>>unsignedLong64At:bigEndian: (in category 'platform independent access') -----
   unsignedLong64At: index bigEndian: aBool
+       "Avoid as much largeInteger as we can"
+       | b0 b2 b3 b5 b6 w n2 n3 |
+
+       aBool ifFalse: [
+               w := self at: index.
+               b6 := self at: index+1.
+               b5 := self at: index+2.
+               n2 := self at: index+3.
+               b3 := self at: index+4.
+               b2 := self at: index+5.
+               n3 := self at: index+6.
+               b0 := self at: index+7.
+       ] ifTrue: [
+               b0 := self at: index.
+               n3 := self at: index+1.
+               b2 := self at: index+2.
+               b3 := self at: index+3.
+               n2 := self at: index+4.
+               b5 := self at: index+5.
+               b6 := self at: index+6.
+               w := self at: index+7.
+               ].
+
+       "Minimize LargeInteger arithmetic"
+       b6 = 0 ifFalse:[w := (b6 bitShift: 8) + w].
+       b5 = 0 ifFalse:[w := (b5 bitShift: 16) + w].
+
+       b3 = 0 ifFalse:[n2 := (b3 bitShift: 8) + n2].
+       b2 = 0 ifFalse:[n2 := (b2 bitShift: 16) + n2].
+       n2 == 0 ifFalse: [w := (n2 bitShift: 24) + w].
+
+       b0 = 0 ifFalse:[n3 := (b0 bitShift: 8) + n3].
+       n3 == 0 ifFalse: [w := (n3 bitShift: 48) + w].
+
+       ^w!
-       | n1 n2 |
-       aBool
-               ifTrue: [
-                       n2 := self unsignedLongAt: index bigEndian: true.
-                       n1 := self unsignedLongAt: index+4 bigEndian: true.
-                       ]
-               ifFalse: [
-                       n1 := self unsignedLongAt: index bigEndian: false.
-                       n2 := self unsignedLongAt: index+4 bigEndian: false.
-                       ].
-       ^(n2 bitShift: 32) + n1!


I'd rather have that pushed down enough so that the compiler
intrinsic becomes visible.  And at that point, all that code is
reduced to a single instruction.

Andres.



On 8/31/15 19:12 , Eliot Miranda wrote:

    Hi Andres,

        On Aug 31, 2015, at 5:52 PM, Andres Valloud
        <avalloud@smalltalk.comcastbiz.net
        <mailto:avalloud@smalltalk.comcastbiz.net>> wrote:

        FWIW... IMO it's better to enable access to the relevant
        compiler intrinsic with platform specific macros, rather
        than implementing instructions such as Intel's BSWAP or
        MOVBE by hand.  In HPS, isolating endianness concerns from
        the large integer arithmetic primitives with such macros
        enabled 25-40% faster performance on big endian platforms.
        Just as importantly, the intrinsic approach takes
        significantly less code to implement.


    Makes sense, and the performance increases are impressive.  The
    only issue I have is that the Cog JIT (which would have the
    easiest time generating those intrinsics) currently runs only in
    little-endianness platforms and I seriously doubt it will run in
    a big endianness platform in the next five years.  PowerPC is
    the only possibility I see.  Yes, ARM is biendian but all the
    popular applications I know of are little endian.

    VW's a different beast; significant big endian legacy.

    But what you say about isolating makes perfect sense.  Thanks


            On 8/31/15 10:25 , Eliot Miranda wrote:
            Hi Chrises,

                  my vote would be to write these as 12 numbered
            primitives, (2,4 & 8
            bytes) * (at: & at:put:) * (big & little endian) because
            they can be
            performance critical and implementing them like this
            means the maximum
            efficiency in both 32-bit and 64-bit Spur, plus the
            possibility of the
            JIT implementing the primitives.

            On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham
            <cunningham.cb@gmail.com
            <mailto:cunningham.cb@gmail.com>
            <mailto:cunningham.cb@gmail.com
            <mailto:cunningham.cb@gmail.com>>> wrote:

                 Hi Chris,

                 I'm all for having the fastest that in the image
            that works.  If you
                 could make your version handle endianess, then I'm
            all for including
                 it (at least in the 3 variants that are faster).
            My first use for
                 this (interface for KAFKA) apparently requires
            bigEndianess, so I
                 really want that supported.

                 It might be best to keep my naming, though - it
            follows the name
                 pattern that is already in the class.  Or will
            yours also support 128?

                 -cbc

                 On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller
            <asqueaker@gmail.com <mailto:asqueaker@gmail.com>
                 <mailto:asqueaker@gmail.com
            <mailto:asqueaker@gmail.com>>> wrote:

                     Hi Chris, I think these methods belong in the
            image with the fastest
                     implementation we can do.

                     I implemented 64-bit unsigned access for Ma
            Serializer back in 2005.
                     I modeled my implementation after Andreas'
            original approach which
                     tries to avoid LI arithmetic.  I was curious
            whether your
                     implementations would be faster, because if
            they are then it could
                     benefit Magma.  After loading "Ma Serializer"
            1.5 (or head) into a
                     trunk image, I used the following script to
            take comparison
                     measurements:

                     | smallN largeN maBa cbBa |  smallN := ((2
            raisedTo: 13) to: (2
                     raisedTo: 14)) atRandom.
                     largeN := ((2 raisedTo: 63) to: (2 raisedTo:
            64)) atRandom.
                     maBa := ByteArray new: 8.
                     cbBa := ByteArray new: 8.
                     maBa maUint: 64 at: 0 put: largeN.
                     cbBa unsignedLong64At: 1 put: largeN bigEndian:
            false.
                     self assert: (cbBa maUnsigned64At: 1) = (maBa
            unsignedLong64At: 1
                     bigEndian: false).
                     { 'cbc smallN write' -> [ cbBa
            unsignedLong64At: 1 put: smallN
                     bigEndian: false] bench.
                     'ma smallN write' -> [cbBa maUint: 64 at: 0
            put: smallN ] bench.
                     'cbc smallN access' -> [ cbBa unsignedLong64At:
            1 bigEndian:
                     false. ] bench.
                     'ma smallN access' -> [ cbBa maUnsigned64At: 1]
            bench.
                     'cbc largeN write' -> [ cbBa unsignedLong64At:
            1 put: largeN
                     bigEndian: false] bench.
                     'ma largeN write' -> [cbBa maUint: 64 at: 0
            put: largeN ] bench.
                     'cbc largeN access' -> [ cbBa unsignedLong64At:
            1 bigEndian:
                     false ] bench.
                     'ma largeN access' -> [ cbBa maUnsigned64At: 1]
            bench.
                       }

                     Here are the results:

                     'cbc smallN write'->'3,110,000 per second.  322
            nanoseconds per
                     run.' .
                     'ma smallN write'->'4,770,000 per second.  210
            nanoseconds per
                     run.' .
                     'cbc smallN access'->'4,300,000 per second.
            233 nanoseconds per
                     run.' .
                     'ma smallN access'->'16,400,000 per second.
            60.9 nanoseconds
                     per run.' .
                     'cbc largeN write'->'907,000 per second.  1.1
            microseconds per
                     run.' .
                     'ma largeN write'->'6,620,000 per second.  151
            nanoseconds per
                     run.' .
                     'cbc largeN access'->'1,900,000 per second.
            527 nanoseconds per
                     run.' .
                     'ma largeN access'->'1,020,000 per second.  982
            nanoseconds per
                     run.'

                     It looks like your 64-bit access is 86% faster
            for accessing the
                     high-end of the 64-bit range, but slower in the
            other 3 metrics.
                     Noticeably, it was only 14% as fast for writing
            the high-end of the
                     64-bit range, and similarly as much slower for
            small-number access..


                     On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham
                     <cunningham.cb@gmail.com
            <mailto:cunningham.cb@gmail.com>
            <mailto:cunningham.cb@gmail.com
            <mailto:cunningham.cb@gmail.com>>> wrote:
                      > Hi.
                      >
                      > I've committed a change to the inbox with
            changes to allow
                     getting/putting
                      > 64bit values to ByteArrays (similar to 32
            and 16 bit
                     accessors).  Could this
                      > be added to trunk?
                      >
                      > Also, first time I used the selective commit
            function - very
                     nice!  the
                      > changes I didn't want committed didn't, in
            fact, get
                     commited.  Just the
                      > desirable bits!
                      >
                      > -cbc
                      >
                      >
                      >








            --
            _,,,^..^,,,_
            best, Eliot



    .

Nicolas Cellier

9:22 a.m.

New subject: [squeak-dev] ByteArray accessors for 64-bit manipulation

Just a side note: there are not only big-endian platforms, but big-endian protocols also for data exchange...
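
As a small illustration of the protocol case, here is a minimal C sketch that writes a 64-bit value in big-endian (network) byte order regardless of the host's own byte order. `wire_put_u64_be` is a hypothetical helper name, not part of any library mentioned in this thread.

```c
#include <stdint.h>

/* Writes value to out[0..7] most significant byte first.
   Uses only shifts, so the result does not depend on host endianness. */
static void wire_put_u64_be(uint8_t *out, uint64_t value)
{
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(value >> (8 * (7 - i)));
    }
}
```

A little-endian host talking to a big-endian protocol needs exactly this conversion on every read and write, which is why the accessors matter even if big-endian hardware fades.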

2015-09-01 4:12 GMT+02:00 Eliot Miranda eliot.miranda@gmail.com:

...

Hi Andres,

...
On Aug 31, 2015, at 5:52 PM, Andres Valloud <avalloud@smalltalk.comcastbiz.net> wrote:

FWIW... IMO it's better to enable access to the relevant compiler intrinsic with platform specific macros, rather than implementing instructions such as Intel's BSWAP or MOVBE by hand. In HPS, isolating endianness concerns from the large integer arithmetic primitives with such macros enabled 25-40% faster performance on big endian platforms. Just as importantly, the intrinsic approach takes significantly less code to implement.

Makes sense, and the performance increases are impressive. The only issue I have is that the Cog JIT (which would have the easiest time generating those intrinsics) currently runs only in little-endianness platforms and I seriously doubt it will run in a big endianness platform in the next five years. PowerPC is the only possibility I see. Yes, ARM is biendian but all the popular applications I know of are little endian.

VW's a different beast; significant big endian legacy.

But what you say about isolating makes perfect sense. Thanks

On 8/31/15 10:25 , Eliot Miranda wrote:

Hi Chrises,

my vote would be to write these as 12 numbered primitives, (2,4 & 8 bytes) * (at: & at:put:) * (big & little endian) because they can be performance critical and implementing them like this means the maximum efficiency in both 32-bit and 64-bit Spur, plus the possibility of the JIT implementing the primitives.

On Sun, Aug 30, 2015 at 10:01 PM, Chris Cunningham <cunningham.cb@gmail.com> wrote:

Hi Chris,

I'm all for having the fastest that in the image that works. If you could make your version handle endianess, then I'm all for including it (at least in the 3 variants that are faster). My first use for this (interface for KAFKA) apparently requires bigEndianess, so I really want that supported.

It might be best to keep my naming, though - it follows the name pattern that is already in the class. Or will yours also support 128?

-cbc

On Sun, Aug 30, 2015 at 2:38 PM, Chris Muller <asqueaker@gmail.com> wrote:

   Hi Chris, I think these methods belong in the image with the fastest
   implementation we can do.

   I implemented 64-bit unsigned access for Ma Serializer back in 2005.
   I modeled my implementation after Andreas' original approach which
   tries to avoid LI arithmetic.  I was curious whether your
   implementations would be faster, because if they are then it could
   benefit Magma.  After loading "Ma Serializer" 1.5 (or head) into a
   trunk image, I used the following script to take comparison
   measurements:

   | smallN largeN maBa cbBa |  smallN := ((2 raisedTo: 13) to: (2
   raisedTo: 14)) atRandom.
   largeN := ((2 raisedTo: 63) to: (2 raisedTo: 64)) atRandom.
   maBa := ByteArray new: 8.
   cbBa := ByteArray new: 8.
   maBa maUint: 64 at: 0 put: largeN.
   cbBa unsignedLong64At: 1 put: largeN bigEndian: false.
   self assert: (cbBa maUnsigned64At: 1) = (maBa unsignedLong64At: 1
   bigEndian: false).
   { 'cbc smallN write' -> [ cbBa unsignedLong64At: 1 put: smallN
   bigEndian: false] bench.
   'ma smallN write' -> [cbBa maUint: 64 at: 0 put: smallN ] bench.
   'cbc smallN access' -> [ cbBa unsignedLong64At: 1 bigEndian:
   false. ] bench.
   'ma smallN access' -> [ cbBa maUnsigned64At: 1] bench.
   'cbc largeN write' -> [ cbBa unsignedLong64At: 1 put: largeN
   bigEndian: false] bench.
   'ma largeN write' -> [cbBa maUint: 64 at: 0 put: largeN ] bench.
   'cbc largeN access' -> [ cbBa unsignedLong64At: 1 bigEndian:
   false ] bench.
   'ma largeN access' -> [ cbBa maUnsigned64At: 1] bench.
     }

   Here are the results:

   'cbc smallN write'->'3,110,000 per second.  322 nanoseconds per
   run.' .
   'ma smallN write'->'4,770,000 per second.  210 nanoseconds per
   run.' .
   'cbc smallN access'->'4,300,000 per second.  233 nanoseconds per
   run.' .
   'ma smallN access'->'16,400,000 per second.  60.9 nanoseconds
   per run.' .
   'cbc largeN write'->'907,000 per second.  1.1 microseconds per
   run.' .
   'ma largeN write'->'6,620,000 per second.  151 nanoseconds per
   run.' .
   'cbc largeN access'->'1,900,000 per second.  527 nanoseconds per
   run.' .
   'ma largeN access'->'1,020,000 per second.  982 nanoseconds per
   run.'

   It looks like your 64-bit access is 86% faster for accessing the
   high-end of the 64-bit range, but slower in the other 3 metrics.
   Noticeably, it was only 14% as fast for writing the high-end of the
   64-bit range, and similarly as much slower for small-number access.

   On Fri, Aug 28, 2015 at 6:01 PM, Chris Cunningham
   <cunningham.cb@gmail.com> wrote:

    > Hi.
    >
    > I've committed a change to the inbox with changes to allow getting/putting
    > 64bit values to ByteArrays (similar to 32 and 16 bit accessors).  Could this
    > be added to trunk?
    >
    > Also, first time I used the selective commit function - very nice!  the
    > changes I didn't want committed didn't, in fact, get commited.  Just the
    > desirable bits!
    >
    > -cbc
    >
-- _,,,^..^,,,_ best, Eliot


vm-dev@lists.squeakfoundation.org

13 comments

7 participants

participants (7)

Andres Valloud
Chris Cunningham
Chris Muller
David T. Lewis
Eliot Miranda
Nicolas Cellier
Tobias Pape