# Useless Snippet #1: Transform Vec3f by Matrix4x4f

General and Gameplay Programming

This is a rewrite of an old blog post I wrote a couple of years back. I finally found some time to redo the performance tests, based on an observation Sean Barrett made on the original post. The code below remains the same; the difference lies in how performance is measured. Here I use RDTSC instead of Intel's Performance Counter Monitor library, which turned out to add high overhead compared to the actual time taken by the measured functions. As an added bonus, I'm uploading a small VS project for anyone interested in trying it out.

## The problem

**Goal:** Multiply a batch of Vector3f's with the same 4x4 matrix.

**Restrictions:**

• The 'src' and 'dst' arrays shouldn't point to the same memory location
• All pointers should be 16-byte aligned (see below for details on array sizes)
• Treat Vector3f's as positions (w = 1.0)
• The matrix is column-major
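Restating the last two points as math: each vertex is treated as a homogeneous position and multiplied by the full matrix. With column-major storage, element m[4c + r] holds row r of column c, so the transform is:

$$
\mathbf{d} = M \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}, \qquad
d_x = m_0 x + m_4 y + m_8 z + m_{12}, \quad
d_y = m_1 x + m_5 y + m_9 z + m_{13}, \quad \ldots
$$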

## Structs and helper functions

```cpp
struct _Vector3f { float x, y, z; };
struct _Vector4f { float x, y, z, w; };

_Vector3f* AllocateSrcArray(unsigned int numVertices)
{
    // We are loading 3 xmm regs per loop. So we need numLoops * 12 floats, where
    // numLoops is (numVertices / 4) + 1 for the SSE version.
    unsigned int memSize = sizeof(float) * ((numVertices >> 2) + 1) * 12;
    return (_Vector3f*)_aligned_malloc(memSize, 16);
}

_Vector4f* AllocateDstArray(unsigned int numVertices)
{
    unsigned int memSize = sizeof(_Vector4f) * (((numVertices >> 2) + 1) << 2);
    return (_Vector4f*)_aligned_malloc(memSize, 16);
}
```

As you can see, we are wasting a bit of memory in order to avoid dealing with special cases in the SSE version. The C version isn't affected by the extra padding, so there is no overhead. The worst-case scenario (numVertices = 4 * n + 1) is to allocate an extra 48 bytes for the 'dst' array and an extra 4 bytes for the 'src' array (total = 52 bytes). Nothing extraordinary when dealing with large batches of vertices. Code-side, the worst case is to perform 3 extra stores to the 'dst' array and 3 extra loads from the 'src' array. Also note that we don't impose any restrictions on the matrix values, so the result of each transformation should be a 4-element vector.

## C code

```cpp
void TransformVec3Pos_C(_Vector3f* __restrict src, _Vector4f* __restrict dst,
                        unsigned int numVertices, float* __restrict matrix4x4)
{
    for (unsigned int iVertex = 0; iVertex < numVertices; ++iVertex) {
        _Vector3f* s = &src[iVertex];
        _Vector4f* d = &dst[iVertex];
        d->x = matrix4x4[0] * s->x + matrix4x4[4] * s->y + matrix4x4[ 8] * s->z + matrix4x4[12];
        d->y = matrix4x4[1] * s->x + matrix4x4[5] * s->y + matrix4x4[ 9] * s->z + matrix4x4[13];
        d->z = matrix4x4[2] * s->x + matrix4x4[6] * s->y + matrix4x4[10] * s->z + matrix4x4[14];
        d->w = matrix4x4[3] * s->x + matrix4x4[7] * s->y + matrix4x4[11] * s->z + matrix4x4[15];
    }
}
```

Performance (8k vertices): ~19 cycles/vertex (s = 1.4)

There is nothing special about this code: a straight Vec3/Matrix4x4 multiply, plain old FPU code generated by the compiler.

Note: Turning on the /arch:SSE compiler option doesn't seem to produce any SSE code for the above function; the compiler insisted on using the FPU for all the calculations. Using the /arch:SSE2 compiler option ended up producing a lot of SSE2 double-to-float and float-to-double conversions, which in turn made things worse performance-wise.

## Comparison

In order to compare the functions above, we execute 100,000 iterations for each batch size and calculate the clock cycles taken for each one of them. The results are then sorted, and the middle 50,000 iterations are used to calculate the average and standard deviation. All values are in cycles/vertex.

| Batch size (vertices) | C | SSE | Speedup |
|---|---|---|---|
| 128 | 24.3 (s = 0.98) | 13.8 (s = 0.48) | 1.76x |
| 256 | 21.5 (s = 1.86) | 12.8 (s = 0.36) | 1.67x |
| 512 | 19.5 (s = 1.71) | 8.8 (s = 0.2) | 2.21x |
| 1024 | 18.6 (s = 1.23) | 8.3 (s = 0.26) | 2.24x |
| 4096 | 18.9 (s = 1.29) | 7.8 (s = 0.13) | 2.42x |
| 8192 | 18.8 (s = 1.43) | 7.1 (s = 0.15) | 2.64x |
| 65536 | 20.4 (s = 1.53) | 8.2 (s = 0.34) | 2.48x |
Table: Comparison between the two methods and the speedup (values are averages over several independent runs).

The intermediate stages of optimization aren't presented because I don't have the code anymore; next time, I'll keep it around. One thing I observed is that even the most naive SSE implementation (loading individual vertex components with movss and processing only one vertex per loop) gives significant speedups compared to the FPU (C) implementation.

All timings have been measured using RDTSC. All tests have been executed on a Core i7 740QM using the Microsoft Visual C++ 2008 compiler. The process's and thread's affinity was set to 0x01 (the thread runs on the 1st core only) and the thread's priority was set to highest.

If you happen to test the code above, please share your findings. Corrections are always welcome. Thanks for reading.


## User Feedback

Interesting. Would you happen to know if the compiler utilizes the SSE instructions for D3DXMATRIX and D3DXVECTOR3 operations on the CPU?

I'm interested in learning how I should implement skeletal animation for 1000+ characters, where each bone is affected by 1-4 animation tracks and then each bone is transformed relative to its parent bone. There could be 120 bones in each skeleton.


> Interesting. Would you happen to know if the compiler utilizes the SSE instructions for D3DXMATRIX and D3DXVECTOR3 operations on the CPU?

As far as I know, those functions do not use SSE; they are implemented in some static (?) library using FPU instructions. If you want to use an MS-supplied library, D3DX is deprecated in favor of DirectXMath, a header-only library that uses SSE2 compiler intrinsics.

Relating to the article: I think it's a nice introduction to SSE math. As far as I know, MS compilers cannot auto-vectorize code (maybe the newest one, which ships with VS2012, can do something, I don't know for sure), which is why setting it to use SSE2 didn't do anything useful. Inline assembly is disabled in 64-bit builds, meaning you have to work around it by putting the inline assembly in .asm files and setting a build rule to assemble them with MASM (which ships with VS).


I have personally often had reasonably good results using SIMD / SSE2 intrinsics, though I haven't used them all that often, partly because they can make the code pretty ugly, as well as requiring the appropriate #ifdefs and similar.
