2013/12/23 Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com>

2013/12/23 Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com>
Currently we use a very clear but naive algorithm

    alpha := sourceWord >> 24.  "High 8 bits of source pixel"
    alpha = 0 ifTrue: [ ^ destinationWord ].
    alpha = 255 ifTrue: [ ^ sourceWord ].
    unAlpha := 255 - alpha.
    colorMask := 16rFF.
    result := 0.

    "red"
    shift := 0.
    blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
                ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
                 + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.
    "green"
    shift := 8.
    blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
                ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
                 + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.
    "blue"
    shift := 16.
    blend := ((sourceWord >> shift bitAnd: colorMask) * alpha) +
                ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
                 + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.
    "alpha (pre-multiplied)"
    shift := 24.
    blend := (alpha * 255) +
                ((destinationWord>>shift bitAnd: colorMask) * unAlpha)
                 + 254 // 255 bitAnd: colorMask.
    result := result bitOr: blend << shift.
    ^ result


Of course, the best we could do to improve it is using a native OS library when it exists on the whole bitmap. I let this path apart, it can be handled at platform specific source like tim did for Pi.
But still, with our own crafted bits, we could do better than current implementation.
See http://stackoverflow.com/questions/1102692/how-to-do-alpha-blend-fast

Using specific hardware instructions by ourselves is not really an option for a portable VM, it's better to call a native library if we cant to have specific optimizations, so i let SSE instructions apart.

But there are two simple ideas we can recycle from above SO reference:

1) multiplex Red+Blue and Alpha+Green computations
2) avoid division by 255

Here it is:

    "red and blue"
    blend := ((sourceWord bitAnd: 16rFF00FF) * alpha) +
                ((destinationWord bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.
    "divide by 255"
    blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.
I forgot to protect  bitAnd: 16rFF00FF but you get the idea...
 
    result := blend.

    "alpha and green"
    blend := (((sourceWord>> 8 bitOr: 16rFF0000) bitAnd: 16rFF00FF) * alpha) +
                ((destinationWord>>8 bitAnd: 16rFF00FF) * unAlpha) + 16rFE00FE.
    "divide by 255"
    blend := blend + 16r10001 + (blend >> 8 bitAnd: 16rFF00FF) >> 8.

bitAnd: 16rFF00FF too of course...
 
    result := result bitOr: blend<<8.
    ^ result

For bytes B1 and B2 in (0..255), alpha*B1+unAlpha*B2 is in (0..16rFE01)
alpha*B1+unAlpha*B2+254 is in (0..16rFEFF)
So when we multiplex non adjacent components, we're safe from overflow.

Now for division by 255 we are also safe: when adding 1 -> (1..16rFF00)
And when adding blend>>8 bitAnd 16rFF -> (1..16rFFFF)
We are still free of overflow and can extend the //255 division trick to 32bit word (the formula given on SO is for 16bit only).

I expect roughly a x2 factor in throughput, but it's hard to measure.
What do you think? Is this interesting?

Find corresponding code attached