
# Performance testing with bit operations, SSE2 and asm

Today I figured I'd sit down and see how SSE2 can benefit a game. I don't come from a strong background in ASM, but I found it fascinating, to say the least! I can understand the headaches that go into ASM, but once you know some ASM you're good to go with SSE2 instructions.

As I already learned from previous issues with DirectXMath, vectors have to be 16-byte aligned or access violation exceptions get thrown. I recently ran into this when I changed the camera class.
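To make the alignment requirement concrete, here is a minimal sketch in portable C++ with SSE intrinsics. The names `is_aligned16`, `CameraData`, `load4` and `sum4` are made up for illustration and are not part of DirectXMath. It assumes an x86/x64 compiler that provides `<xmmintrin.h>`.

```cpp
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics (x86/x64 only)

// Returns true when p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical camera-like type. alignas(16) asks the compiler to place
// every instance with static or automatic storage on a 16-byte boundary.
struct alignas(16) CameraData {
    float position[4];
};

// Loads four floats. _mm_load_ps requires a 16-byte-aligned address and
// may fault otherwise; _mm_loadu_ps accepts any address.
inline __m128 load4(const float* p) {
    return is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}

// Adds up the four floats starting at p, going through an SSE register.
inline float sum4(const float* p) {
    alignas(16) float out[4];
    _mm_store_ps(out, load4(p));  // out is aligned, so the aligned store is safe
    return out[0] + out[1] + out[2] + out[3];
}
```

Heap allocations are the usual trap here: before C++17, plain `new` is not guaranteed to honor an over-aligned type, which may be the kind of situation that bites a camera class holding vector members.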

For my performance test I tried a simple function that returns the minimum of two floats.

```cpp
inline float _b_min(float a, float b) {
    return (a < b ? a : b);
}

inline float _a_min(float a, float b) {
    float result = 0.0f;
    __asm {
        mov eax, a        // loads the raw bits of a into an integer register
        mov ebx, b
        cmp eax, ebx      // integer compare; the flags are never used afterwards
        mov [result], eax // always stores a, whatever the comparison says
    }
    return result;
}
```

The function _a_min was a bit faster than _b_min: _a_min took around 900 microseconds, while _b_min took over 1 millisecond according to the high-resolution timer.

The SSE2 min function gave me around 0.003 milliseconds.

```cpp
inline float *_sse2_min(float a, float b) {
    __m128 _a = _mm_set_ps1(a);      // set all four lanes of _a to a
    __m128 _b = _mm_set_ps1(b);      // set all four lanes of _b to b
    __m128 c = _mm_min_ps(_a, _b);   // c holds the lane-wise minimum of _a and _b
    // _mm_store_ps writes four floats to a 16-byte-aligned address,
    // so the buffer must be sized and aligned for a full __m128.
    float *result = (float*)_aligned_malloc(4 * sizeof(float), 16);
    _mm_store_ps(result, c);         // store the result in result
    return result;                   // caller must release it with _aligned_free
}
```

My question for whoever is reading: why are SSE2 and bit-operator manipulation a bit slower than ASM, or does it not matter? Possibly SSE2 is better suited to bigger data than just taking the minimum of two floats?

I'm guessing that was a typo. 0.003 microseconds *IS* faster than 1 microsecond.

Yes, you're right, lol. 0.0009 microseconds is faster than 1 millisecond.

So the SSE2 timing gave me 3 milliseconds and the ASM timing gave me 900 microseconds. I got confused converting between seconds, microseconds and milliseconds. The elapsed time is output in milliseconds.
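For anyone else tripping over the units, here is a small sketch that lets `std::chrono` do the conversions instead of doing them by hand. The helper names `to_milliseconds`, `to_microseconds` and `time_ms` are made up for this example, and `std::chrono::steady_clock` is only a stand-in for whatever high-resolution timer the test actually used.

```cpp
#include <chrono>

// Converts a duration given in microseconds to fractional milliseconds.
inline double to_milliseconds(std::chrono::duration<double, std::micro> us) {
    return std::chrono::duration<double, std::milli>(us).count();
}

// Converts a duration given in milliseconds to microseconds.
inline double to_microseconds(std::chrono::duration<double, std::milli> ms) {
    return std::chrono::duration<double, std::micro>(ms).count();
}

// Runs f once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

With these, the two numbers above line up on one scale: 900 microseconds is 0.9 milliseconds, and 3 milliseconds is 3000 microseconds.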

Yeah, SSE is not made for operating on regular floats. For there to be a performance improvement, you either need to process 4 values at the same time, and/or perform a series of SSE operations on the same values to minimize the time spent shoving values between registers.
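As a rough illustration of the "4 values at the same time" point, the sketch below takes the element-wise minimum of two float arrays, four lanes per iteration. `min_arrays` is a hypothetical helper, not code from this thread, and it uses unaligned loads and stores so it works on any float pointers.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86/x64 only)

// Writes min(a[i], b[i]) into out[i] for every i in [0, n).
inline void min_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    // Main loop: one _mm_min_ps handles four floats at once.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_min_ps(va, vb));
    }
    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) {
        out[i] = (a[i] < b[i]) ? a[i] : b[i];
    }
}
```

The values stay in SSE registers between the load and the store, which is where the gain over a one-float-at-a-time call would be expected to come from.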

I see what you guys are getting at. Yeah, I didn't expect much from running small floating-point operations through SSE, and I had the idea that SSE data had to be 16-byte aligned or else exceptions would be thrown. I started working with DirectXMath, not XNA like I was using before, and an exception kept being thrown each time the camera class wasn't aligned properly. Again, you're absolutely right, Juliean, about the vectors. I wanted to dive into SSE2 and gain some knowledge of ASM. Again, Aressera, your code fits the bill, because I was wasting memory and potentially risking a memory leak. The code wasn't meant to be used in a game engine; I was just seeing if I could learn more about SSE2. I didn't know that aligned_malloc was a wrapper around malloc.

You realize that aligned malloc is usually just a wrapper around regular malloc that enforces the alignment? And that malloc is a kernel call that requires a context switch, acquiring a lock, and various data structure manipulation? You're doing like 1000x the work in that one call than in the rest of the function. You should never return memory like that from a function; it's bad form and requires the caller to free it (and to know to use aligned free).

That's not quite accurate, malloc is not a kernel call (thankfully) except perhaps when it needs to grow the heap. I agree returning memory to the caller like that is pretty nasty though.
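For reference, a scalar SSE min does not need any allocation at all. This is a minimal sketch, with `sse_min` as a made-up name, that keeps the value in a register and returns it by value:

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86/x64 only)

// Returns the smaller of a and b using the scalar SSE min instruction.
inline float sse_min(float a, float b) {
    // _mm_set_ss puts the float in the lowest lane, _mm_min_ss takes the
    // minimum of the lowest lanes, and _mm_cvtss_f32 extracts it again.
    return _mm_cvtss_f32(_mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
}
```

Since nothing is allocated, there is nothing for the caller to free, and an optimizing compiler will typically reduce the body to a single `minss`.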