Performance testing with bit operations, SSE2 and asm

posted in SIC Games' Journal

Published May 01, 2015

Today I figured I'd sit down and see how SSE2 is beneficial to a game. I don't come from a strong background in ASM. However, I found it fascinating, to say the least! I can understand the headaches that go into ASM, but if you can get around knowing some ASM then you're good to go on SSE2 instructions.

As I already learned from previous issues with DirectXMath, the vectors have to be 16-byte aligned or access violation exceptions will be thrown. I recently had issues like this when I changed the camera class.
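A minimal sketch of the alignment requirement, assuming plain SSE intrinsics rather than DirectXMath itself. `Vec4`, `is_aligned16`, `add_aligned` and `add_unaligned` are hypothetical names made up for this illustration. The aligned load/store intrinsics fault on addresses that are not multiples of 16, while the `u` variants accept any address.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_loadu_ps, ...
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical stand-in for a vector member of a camera class.
// alignas(16) asks the compiler to place every instance on a 16-byte boundary.
struct alignas(16) Vec4 {
    float v[4];
};

// Aligned path: _mm_load_ps / _mm_store_ps require 16-byte-aligned addresses.
inline Vec4 add_aligned(const Vec4& a, const Vec4& b) {
    Vec4 out;
    _mm_store_ps(out.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
    return out;
}

// Unaligned path: _mm_loadu_ps / _mm_storeu_ps work on any float address.
inline void add_unaligned(const float* a, const float* b, float* out) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

One caveat worth keeping in mind: before C++17, a plain `new` was not required to honor over-alignment, so a heap-allocated class holding such a member is one common way to end up with a misaligned vector.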

For my performance test I tried out a simple function that returns the minimum of two floats.

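inline float _b_min(float a, float b) {
    return (a < b ? a : b);
}

inline float _a_min(float a, float b) {
    float result = 0.0f;
    // MSVC x86 inline assembly, kept as it was benchmarked.
    // Note: cmp only sets flags and nothing branches on them, so this
    // always stores a; it also compares the float bits as integers.
    __asm {
        mov eax, a
        mov ebx, b
        cmp eax, ebx
        mov [result], eax
    }
    return result;
}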

The function _a_min was a bit faster than _b_min: _a_min took around 900 microseconds, while _b_min took over 1 millisecond according to the high-resolution timer.

The SSE2 min function gave me around 0.003 milliseconds.
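The timing loop itself is not shown in this post, so the following is only a sketch of how such a measurement could be set up, assuming `std::chrono::steady_clock` in place of whatever high-resolution timer was actually used. `time_us` and `scalar_min` are hypothetical names. Putting the unit into the duration type (`std::micro`) avoids mixing up microseconds and milliseconds when reading the output.

```cpp
#include <chrono>
#include <cstddef>

// Plain scalar minimum, same shape as _b_min above.
inline float scalar_min(float a, float b) { return a < b ? a : b; }

// Hypothetical harness: calls f(a, b) `iterations` times and returns the
// elapsed wall-clock time in microseconds.
template <typename F>
double time_us(F f, std::size_t iterations, float& sink) {
    // volatile inputs force a fresh read each iteration, so the compiler
    // cannot fold the whole loop into a constant.
    volatile float a = 1.5f, b = 2.5f;
    float acc = 0.0f;

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        acc += f(a, b);
    }
    auto stop = std::chrono::steady_clock::now();

    sink = acc;  // hand the result back so the work is not dead code
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

A call such as `time_us(scalar_min, 1000000, sink)` times a million calls. For functions this small, the loop and call overhead can easily dominate the measurement, so differences between variants should be read with caution.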

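inline float *_sse2_min(float a, float b) {
    __m128 _a = _mm_set_ps1(a);    //-- broadcast a into all four lanes of _a
    __m128 _b = _mm_set_ps1(b);    //-- broadcast b into all four lanes of _b
    __m128 c = _mm_min_ps(_a, _b); //-- c holds the lane-wise minimum of _a and _b
    //-- _mm_store_ps writes four floats to a 16-byte-aligned address,
    //-- so allocate 16 bytes with 16-byte alignment
    float *result = (float*)_aligned_malloc(4 * sizeof(float), 16);
    _mm_store_ps(result, c);       //-- store all four lanes; result[0] is the minimum
    return result;                 //-- caller must release this with _aligned_free
}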

My question for whoever is reading: why are SSE2 and bit operator manipulation a bit slower than ASM, or does it not matter? Possibly SSE2 is better with bigger data than just comparing the minimum of two floats?
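To make the "bigger data" case concrete, here is a minimal sketch that takes the element-wise minimum of two whole arrays, four floats per instruction. `min_arrays` is a hypothetical helper written for this illustration, and it uses unaligned loads and stores so that it makes no assumption about how the caller allocated the arrays.

```cpp
#include <xmmintrin.h>  // _mm_loadu_ps, _mm_min_ps, _mm_storeu_ps
#include <cstddef>

// Hypothetical helper: out[i] = min(a[i], b[i]) for i in [0, n).
inline void min_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;

    // Main loop: one _mm_min_ps handles four floats at a time.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_min_ps(va, vb));
    }

    // Tail loop: the last n % 4 elements are handled one by one.
    for (; i < n; ++i) {
        out[i] = a[i] < b[i] ? a[i] : b[i];
    }
}
```

Here the cost of loading and storing is spread over four results, and there is no per-call allocation or broadcast, which is where a speedup over scalar code would be expected to come from.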


Comments

MarkS_

I'm guessing that was a typo. 0.003 microseconds *IS* faster than 1 microsecond.

May 01, 2015 09:48 PM

Paul C Skertich

Yes, you're right lol, 0.0009 microseconds is faster than 1 millisecond.

May 01, 2015 11:46 PM

Paul C Skertich

So the SSE2 timing gave me 3 milliseconds and the ASM timing gave me 900 microseconds. I got confused with the whole conversion between microseconds and milliseconds. The output of the time elapsed is in milliseconds.

May 01, 2015 11:57 PM

Aressera

You realize that aligned malloc is usually just a wrapper around regular malloc that enforces the alignment? And that malloc is a kernel call that requires a context switch, acquiring a lock, and various data structure manipulation? You're doing like 1000x more work in that one call than in the rest of the function. You should never return memory like that from a function; it's bad form and requires the caller to free it (and know to use aligned free).

Slightly better code below, but the whole idea of this function is flawed in the first place.


union Float4
{
    __m128 v;
    float x[4];
};

inline float _sse2_min(float a, float b)
{
    __m128 _a = _mm_set_ps1(a);
    __m128 _b = _mm_set_ps1(b);
    Float4 c;
    c.v = _mm_min_ps(_a,_b);
    return c.x[0];
}

Generally, SIMD code is only an advantage when you lay out your data in a vector-friendly format (structures of arrays); then you can directly load/store_ps, operate on multiple values at once, and get close to a 4x speedup.

May 02, 2015 06:28 AM

Juliean

Yeah, SSE is not made for operating on single floats. You either need to process 4 values at the same time and/or perform a series of SSE operations on the same values, so that less time is spent shoving values between registers, for there to be a performance improvement.

May 02, 2015 02:15 PM

Paul C Skertich

I see what you guys are getting at. Yeah, I didn't expect to run small floating-point operations using SSE, and I had the idea that SSE data had to be 16-byte aligned or else exceptions would be thrown. I started working with DirectXMath, not XNA like I was using before. An exception kept being thrown each time the camera class wasn't aligned properly. Again, you're absolutely right, Juliean, about the vectors. I wanted to dive into SSE2 and gain some knowledge of ASM. Again, Aressera, your code fits the bill because I was wasting memory and it was a potential memory leak hazard. The code wasn't meant to be used in the game engine; I was just seeing if I could learn more about SSE2. I didn't know that aligned_malloc was a wrapper around malloc.

May 02, 2015 02:31 PM

Bacterius

Aressera wrote: "You realize that aligned malloc is usually just a wrapper around regular malloc that enforces the alignment? And that malloc is a kernel call that requires a context switch, acquiring a lock, and various data structure manipulation? You're doing like 1000x more work in that one call than in the rest of the function. You should never return memory like that from a function; it's bad form and requires the caller to free it (and know to use aligned free)."

That's not quite accurate: malloc is not a kernel call (thankfully), except perhaps when it needs to grow the heap. I agree returning memory to the caller like that is pretty nasty, though.

May 02, 2015 03:34 PM
