## Friday, May 1, 2015

### 2D transformation matrices baking (2) : Benchmarks

This is the next part of my article about 2D transformation matrices baking. After the theory, let's move on to practice and run some benchmarks. I will use Starling 1.6 on the Flash platform.

#### Starling optimization

If you are a Flash platform developer and you use Starling for hardware-accelerated 2D graphics, open the source code of version 1.6 and look at the starling.display.DisplayObject class. The interesting part for us is in the transformationMatrix getter function, at line 634:

    mTransformationMatrix.identity();
    mTransformationMatrix.scale(mScaleX, mScaleY);
    MatrixUtil.skew(mTransformationMatrix, mSkewX, mSkewY);
    mTransformationMatrix.rotate(mRotation);
    mTransformationMatrix.translate(mX, mY);

    if (mPivotX != 0.0 || mPivotY != 0.0)
    {
        // prepend pivot transformation
        mTransformationMatrix.tx = mX - mTransformationMatrix.a * mPivotX - mTransformationMatrix.c * mPivotY;
        mTransformationMatrix.ty = mY - mTransformationMatrix.b * mPivotX - mTransformationMatrix.d * mPivotY;
    }

This is a really interesting part of the code: it builds a complex 2D transformation matrix as a sequence of basic 2D transformations, with the pivot translation already folded in by hand.

The sequence can be summarized as follows:
1. begin from the identity matrix
2. apply translation for pivot
3. apply scale
4. apply skew
5. apply rotation
6. apply translation
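
The steps above can be sketched outside Flash. The following Python snippet is only a minimal stand-in, not Starling's code: `Matrix2D` and `transformation_matrix` are hypothetical names, and the skew step assumes a skew matrix built from `cos(skewY)`, `-sin(skewX)`, `sin(skewY)` and `cos(skewX)`, which is my reading of `MatrixUtil.skew`.

```python
import math


class Matrix2D:
    """Hypothetical stand-in for flash.geom.Matrix.

    Maps a point as: x' = a*x + c*y + tx, y' = b*x + d*y + ty.
    """

    def __init__(self):
        self.a, self.b, self.c, self.d, self.tx, self.ty = 1.0, 0.0, 0.0, 1.0, 0.0, 0.0

    def scale(self, sx, sy):
        # The x row is scaled by sx, the y row by sy.
        self.a *= sx; self.c *= sx; self.tx *= sx
        self.b *= sy; self.d *= sy; self.ty *= sy

    def skew(self, skew_x, skew_y):
        # Assumed skew convention (see the paragraph above).
        sin_x, cos_x = math.sin(skew_x), math.cos(skew_x)
        sin_y, cos_y = math.sin(skew_y), math.cos(skew_y)
        a, b, c, d, tx, ty = self.a, self.b, self.c, self.d, self.tx, self.ty
        self.a, self.b = a * cos_y - b * sin_x, a * sin_y + b * cos_x
        self.c, self.d = c * cos_y - d * sin_x, c * sin_y + d * cos_x
        self.tx, self.ty = tx * cos_y - ty * sin_x, tx * sin_y + ty * cos_x

    def rotate(self, angle):
        sin_r, cos_r = math.sin(angle), math.cos(angle)
        a, b, c, d, tx, ty = self.a, self.b, self.c, self.d, self.tx, self.ty
        self.a, self.b = a * cos_r - b * sin_r, a * sin_r + b * cos_r
        self.c, self.d = c * cos_r - d * sin_r, c * sin_r + d * cos_r
        self.tx, self.ty = tx * cos_r - ty * sin_r, tx * sin_r + ty * cos_r

    def translate(self, dx, dy):
        self.tx += dx
        self.ty += dy

    def transform_point(self, x, y):
        return (self.a * x + self.c * y + self.tx,
                self.b * x + self.d * y + self.ty)


def transformation_matrix(x, y, pivot_x, pivot_y, scale_x, scale_y,
                          skew_x, skew_y, rotation):
    """Step-by-step version: scale, skew, rotate, translate, then pivot."""
    m = Matrix2D()  # starts as the identity
    m.scale(scale_x, scale_y)
    m.skew(skew_x, skew_y)
    m.rotate(rotation)
    m.translate(x, y)
    if pivot_x != 0.0 or pivot_y != 0.0:
        # Prepending the pivot translation only changes tx and ty.
        m.tx = x - m.a * pivot_x - m.c * pivot_y
        m.ty = y - m.b * pivot_x - m.d * pivot_y
    return m
```

A quick sanity check of the sketch: whatever the other parameters are, the pivot point should land exactly on (x, y) once the matrix is applied, which `transform_point` lets you verify.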

So it is quite a complex sequence of transformations. Can it be baked, and can we hope for a significant performance improvement? The answer is yes!

#### Matrices baking

I wrote the whole sequence as a concatenation of matrices and gave it to Wolfram.

From:
translation . rotation . skewing . scaling . pivot

I obtained:
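
Written out from the terms of the code sample that follows, the baked matrix has this form. This is a reconstruction under assumptions, not the raw Wolfram output: the symbols are my own notation, with $s_x, s_y$ for the scale, $k_x, k_y$ for the skew angles, $\theta$ for the rotation, $(p_x, p_y)$ for the pivot and $(x, y)$ for the position.

$$
M = \begin{pmatrix} a & c & t_x \\ b & d & t_y \\ 0 & 0 & 1 \end{pmatrix},
\qquad
\begin{aligned}
a &= s_x (\cos k_y \cos\theta - \sin k_y \sin\theta) \\
b &= s_x (\cos k_y \sin\theta + \sin k_y \cos\theta) \\
c &= -s_y (\sin k_x \cos\theta + \cos k_x \sin\theta) \\
d &= s_y (\cos k_x \cos\theta - \sin k_x \sin\theta) \\
t_x &= x - p_x\, a - p_y\, c \\
t_y &= y - p_x\, b - p_y\, d
\end{aligned}
$$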

I derived a code sample from that baked matrix. It can be copy-pasted into Starling in place of the original code shown above:

    __e = Math.cos(mSkewY);
    __f = -Math.sin(mSkewX);
    __g = Math.sin(mSkewY);
    __h = Math.cos(mSkewX);
    __i = Math.cos(mRotation);
    __j = Math.sin(mRotation);
    __ceigj = mScaleX * (__e * __i - __g * __j);
    __dfihj = mScaleY * (__f * __i - __h * __j);
    __cekgl = mScaleX * (__e * __j + __g * __i);
    __dfkhl = mScaleY * (__f * __j + __h * __i);
    mTransformationMatrix.a = __ceigj;
    mTransformationMatrix.c = __dfihj;
    mTransformationMatrix.tx = - mPivotX * __ceigj - mPivotY * __dfihj + mX;
    mTransformationMatrix.b = __cekgl;
    mTransformationMatrix.d = __dfkhl;
    mTransformationMatrix.ty = - mPivotX * __cekgl - mPivotY * __dfkhl + mY;

That optimized code requires adding the following properties to the starling.display.DisplayObject class:

    private var __e:Number;
    private var __f:Number;
    private var __g:Number;
    private var __h:Number;
    private var __i:Number;
    private var __j:Number;
    private var __ceigj:Number;
    private var __dfihj:Number;
    private var __cekgl:Number;
    private var __dfkhl:Number;
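
Before pasting a baked formula into an engine, it is worth checking it numerically against the plain concatenation. Here is a small Python sketch of such a check. The function names `composed` and `baked` are hypothetical, the matrices use the column-vector convention, and the skew matrix is again an assumption about Starling's convention.

```python
import math


def matmul(p, q):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(p[r][k] * q[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def composed(x, y, pivot_x, pivot_y, scale_x, scale_y, skew_x, skew_y, rotation):
    """translation . rotation . skewing . scaling . pivot, as explicit products."""
    t = [[1.0, 0.0, x], [0.0, 1.0, y], [0.0, 0.0, 1.0]]
    r = [[math.cos(rotation), -math.sin(rotation), 0.0],
         [math.sin(rotation), math.cos(rotation), 0.0],
         [0.0, 0.0, 1.0]]
    # Assumed skew convention: cosines on the diagonal.
    k = [[math.cos(skew_y), -math.sin(skew_x), 0.0],
         [math.sin(skew_y), math.cos(skew_x), 0.0],
         [0.0, 0.0, 1.0]]
    s = [[scale_x, 0.0, 0.0], [0.0, scale_y, 0.0], [0.0, 0.0, 1.0]]
    p = [[1.0, 0.0, -pivot_x], [0.0, 1.0, -pivot_y], [0.0, 0.0, 1.0]]
    m = matmul(t, matmul(r, matmul(k, matmul(s, p))))
    # Return the six values in flash.geom.Matrix order: a, b, c, d, tx, ty.
    return (m[0][0], m[1][0], m[0][1], m[1][1], m[0][2], m[1][2])


def baked(x, y, pivot_x, pivot_y, scale_x, scale_y, skew_x, skew_y, rotation):
    """Same matrix computed with the baked formula from the code sample above."""
    e, f = math.cos(skew_y), -math.sin(skew_x)
    g, h = math.sin(skew_y), math.cos(skew_x)
    i, j = math.cos(rotation), math.sin(rotation)
    a = scale_x * (e * i - g * j)
    c = scale_y * (f * i - h * j)
    b = scale_x * (e * j + g * i)
    d = scale_y * (f * j + h * i)
    tx = x - pivot_x * a - pivot_y * c
    ty = y - pivot_x * b - pivot_y * d
    return (a, b, c, d, tx, ty)


def max_difference(args):
    """Largest absolute gap between the two versions for one parameter set."""
    return max(abs(u - v) for u, v in zip(composed(*args), baked(*args)))
```

Calling `max_difference` with a few parameter sets, for example the values used in the benchmark below, should return something at the level of floating-point noise if the baked formula is right.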

#### Performance

To measure the performance, I chose to instantiate 1000 Sprite objects (Sprite inherits from DisplayObject), change their properties and call the transformationMatrix getter function on each of them during every Enter Frame event. I read the time just before and just after the transformationMatrix calls in order to get the delta. In simplified code:

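    import flash.geom.Matrix;
    import flash.utils.getTimer;
    import starling.display.Sprite;
    import starling.events.Event;

    var vs:Vector.<Sprite> = new Vector.<Sprite>(1000);
    var time:int;

    // fill the vector once, before the first frame
    for (var n:int = 0; n < 1000; n++)
        vs[n] = new Sprite();

    function onEnterFrame(e:Event):void
    {
        var i:int;
        var s:Sprite;
        var m:Matrix;
        var w:Number = 17;

        // change every property so that each matrix has to be recomputed
        for (i = 0; i < 1000; i++)
        {
            s = vs[i];
            s.pivotX = w;
            s.pivotY = w;
            s.rotation = w;
            s.scaleX = w;
            s.scaleY = w;
            s.skewX = w;
            s.skewY = w;
            s.x = w;
            s.y = w;

            w += 0.1;
            if (w >= 100)
                w = 17;
        }

        // time only the calls to the transformationMatrix getter
        time = getTimer();
        for (i = 0; i < 1000; i++)
        {
            s = vs[i];
            m = s.transformationMatrix;
        }
        time = getTimer() - time;
    }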

This is quite a realistic situation: some games can easily have 1000 moving sprites on screen. Also, during Starling rendering, the transformationMatrix getter function really is called every frame for every moving Sprite on screen.

I ran the code on a laptop computer and on an Android phone, switching from the original Starling code to my optimized code on both devices. I chose to compute and focus on the average time over the last 60 frames throughout the benchmark run, because the per-frame values can vary by about 10%. You can look at my full code for more details.
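
The averaging over the last 60 frames can be sketched as a small rolling window. This Python snippet is only an illustration of the idea, not the code used in the benchmark, and the class name `FrameTimeAverage` is hypothetical.

```python
from collections import deque


class FrameTimeAverage:
    """Rolling average of the most recent per-frame timings."""

    def __init__(self, window=60):
        # A deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def add(self, elapsed_ms):
        """Record one frame's timing and return the current average."""
        self.samples.append(elapsed_ms)
        return sum(self.samples) / len(self.samples)
```

Each frame, the measured delta is passed to `add`, and the returned value is the figure to watch. It smooths out the frame-to-frame variation without hiding a lasting change.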

The results:

Laptop device:
- original code: 3 ms
- optimized code: 2 ms

Mobile device:
- original code: 6 ms
- optimized code: 5 ms

In both situations, we gained 1 ms per frame on average. 1 ms may look negligible at first glance, but it is really significant when you consider that, to reach 60 FPS, you have a very tight budget of only about 16 ms per frame for your whole code execution. In that context, 1 ms is a great saving.
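
To put that 1 ms in perspective, here is the arithmetic as a short Python sketch. The helper names are hypothetical; the inputs are simply the frame rate and the timings reported above.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def budget_share(saved_ms, fps):
    """Fraction of the frame budget that a saving of saved_ms represents."""
    return saved_ms / frame_budget_ms(fps)


def relative_gain(before_ms, after_ms):
    """Fraction of the original time removed by the optimization."""
    return (before_ms - after_ms) / before_ms
```

At 60 FPS the budget is 1000 / 60, about 16.7 ms, so 1 ms saved is about 6% of the whole frame. Relative to the measured timings, going from 3 ms to 2 ms removes about a third of the original time, and going from 6 ms to 5 ms about a sixth.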