akinak

[VC++] Simple SSE Experience


Hi, I have just started some experiments with SSE in VC++ 2005, but the results are not what I expected. Below is a simple Vector4 add function written three ways: inline assembly, plain FPU code, and C++ SIMD intrinsics. Built with VC++ 2005 in Release mode, Maximize Speed (/O2).
#include <stdio.h>
#include <windows.h>
#include <mmintrin.h>
#include <xmmintrin.h>

struct __declspec(align(16)) vector {
	float x,y,z,w;
};
inline vector addVector_sse( vector &v1 , vector &v2 ) {
	vector o;
	__asm {
		mov EAX, v1
		mov EBX, v2
		movaps xmm0,[EAX]
		movaps xmm1,[EBX]
		addps xmm0,xmm1
		movups [o],xmm0
	}
	return o;
}
inline vector addVector_fpu( vector &v1 , vector &v2 ) {
	vector o;
	o.x = v1.x + v2.x;
	o.y = v1.y + v2.y;
	o.z = v1.z + v2.z;
	o.w = v1.w + v2.w;
	return o;
}

int main() {
	LARGE_INTEGER start, end, freq;
	QueryPerformanceFrequency( &freq );
	int max = 100000, i;
	vector v1 = { 1 , 2 , 3 , 4 };
	vector v2 = { 2 , 3 , 4 , 5 };
	__m128 v_1 = { 1.0f , 2.0f , 3.0f , 4.0f };
	__m128 v_2 = { 2.0f , 3.0f , 4.0f , 5.0f };

	float test = 0;
	QueryPerformanceCounter( &start );
	for (i = 0; i < max; i++) {
		vector v = addVector_fpu( v1 , v2 );
		test = v.x;
	}
	QueryPerformanceCounter( &end );
	long cycle_fpu = (end.QuadPart - start.QuadPart);
	printf( "Cycles for FPU : %d\n" , cycle_fpu );

	QueryPerformanceCounter( &start );
	for (i = 0; i < max; i++) {
		vector v =  addVector_sse( v1 , v2 );
		test = v.x;
	}
	QueryPerformanceCounter( &end );
	long cycle_sse = (end.QuadPart - start.QuadPart);
	printf( "Cycles for SSE : %d\n" , cycle_sse );
	
	QueryPerformanceCounter( &start );
	for (i = 0; i < max; i++) {
		__m128 v = _mm_add_ps( v_1 , v_2 );
		test = v.m128_f32[0];
	}
	QueryPerformanceCounter( &end );
	long cycle_simd = (end.QuadPart - start.QuadPart);
	printf( "Cycles for C++ SIMD : %d\n" , cycle_simd );

}
Results on my Core Duo (2.0 GHz) laptop:
Cycles for FPU : 7
Cycles for SSE : 1444
Cycles for C++ SIMD : 7
What's wrong? Shouldn't the SSE version be the fastest, or is QueryPerformanceCounter not the right way to measure this?
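A side note on the numbers above: QueryPerformanceCounter reports ticks of the performance counter, not CPU cycles, so the "Cycles" labels are misleading. To get a time, divide the tick delta by the value from QueryPerformanceFrequency. A minimal sketch of that conversion follows; the helper name `ticks_to_microseconds` is made up for illustration, and plain 64-bit integers stand in for `LARGE_INTEGER::QuadPart`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the original post).
// ticks            : end.QuadPart - start.QuadPart
// ticks_per_second : freq.QuadPart from QueryPerformanceFrequency
inline double ticks_to_microseconds(std::int64_t ticks, std::int64_t ticks_per_second) {
    assert(ticks_per_second > 0);  // a zero frequency would mean no usable counter
    return static_cast<double>(ticks) * 1e6 / static_cast<double>(ticks_per_second);
}
```

In the program above this would be called as `ticks_to_microseconds(end.QuadPart - start.QuadPart, freq.QuadPart)`. Keeping the delta in a 64-bit type also avoids truncating it into a `long`.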

Have a look at the generated disassembly; I bet the compiler is optimising the loop out completely in two of those cases (since you never use the final result), so only the SSE one is getting code generated for it.
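One common way to keep such a loop alive is to make every iteration depend on the previous one and to use the final value after the timer stops. The sketch below assumes a plain scalar `Vec4` as a stand-in for the thread's `vector` type; `add` and `run_adds` are illustrative names. An optimizer may still simplify code like this, so checking the disassembly remains worthwhile.

```cpp
#include <cassert>
#include <cstddef>

struct Vec4 { float x, y, z, w; };  // illustrative stand-in for the thread's `vector`

inline Vec4 add(const Vec4& a, const Vec4& b) {
    return Vec4{a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}

// Each iteration consumes the previous result, and the final value is
// returned so the caller can print it after the timed region.
inline Vec4 run_adds(Vec4 start, const Vec4& step, std::size_t iterations) {
    assert(iterations > 0);
    Vec4 acc = start;
    for (std::size_t i = 0; i < iterations; ++i) {
        acc = add(acc, step);
    }
    return acc;
}
```

Printing `acc.x` (or all four components) once the second timer read is done gives the compiler an observable use of the result.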

I turned off optimization and got a stupid "Unhandled exception" at the line "return o;" of the "addVector_sse" function!
Even the debugger shows correct values in "vector o"...

But the results have changed!
Cycles for FPU : 129535
Cycles for SSE : "Unhandled exception"
Cycles for C++ SIMD : 26822

Now, these are just simple experiments, and the real framework won't have 100000-iteration loops for the compiler to optimize away (this code is going to be used in a game engine :) ). So the problem now is the "Unhandled exception": why is it thrown when optimization is off?

Quote:
Original post by akinak
I turned off optimization and got a stupid "Unhandled exception" at the line "return o;" of the "addVector_sse" function!
Even the debugger shows correct values in "vector o"...

But the results have changed!
Cycles for FPU : 129535
Cycles for SSE : "Unhandled exception"
Cycles for C++ SIMD : 26822
Profiling unoptimised code isn't a good idea; the results can be pretty meaningless. It's better to make sure the compiler doesn't optimise the loop out, for example by printing the result vector after the loop. Check the disassembly to make sure it's not doing anything odd, though.

Quote:
Original post by akinak
Now, these are just simple experiments, and the real framework won't have 100000-iteration loops for the compiler to optimize away (this code is going to be used in a game engine :) ). So the problem now is the "Unhandled exception": why is it thrown when optimization is off?
What are the exact details of the exception? You should get something in the debug output saying what the exception is.
My guess (and it's just a guess) is that your vectors aren't aligned in debug mode for some reason. You could try printf()ing their addresses to check that.
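The address check suggested here can be sketched as follows. `movaps` requires a 16-byte aligned memory operand, so the low four bits of the address must be zero. The helper names are hypothetical, and `alignas(16)` is the standard C++11 spelling of what `__declspec(align(16))` does in the original code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Standard spelling of the alignment request used in the thread's struct.
struct alignas(16) AlignedVec4 { float x, y, z, w; };

// Print an address together with its alignment status.
inline void report_alignment(const char* name, const void* p) {
    assert(name != nullptr);
    std::printf("%s at %p: %s\n", name, p,
                is_aligned_16(p) ? "16-byte aligned" : "NOT 16-byte aligned");
}
```

Calling `report_alignment("v1", &v1)` for each operand, and for the local result inside the function, would show whether any of them is misaligned in the debug build.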

I changed some parts of the code:

vector v;

long cycle_fpu = 0;
for (i = 0; i < max; i++) {
QueryPerformanceCounter( &start );
v = crossVector_fpu( v1 , v2 );
QueryPerformanceCounter( &end );
cycle_fpu += (end.QuadPart - start.QuadPart);
}
printf( "Cycles for FPU : %d\n" , cycle_fpu );

long cycle_sse = 0;
for (i = 0; i < max; i++) {
QueryPerformanceCounter( &start );
v = crossVector_fpu( v1 , v2 ); /* note: same call as in the FPU loop above, so this loop does not time an SSE path */
QueryPerformanceCounter( &end );
cycle_sse += (end.QuadPart - start.QuadPart);
}
printf( "Cycles for SSE : %d\n" , cycle_sse );


__m128 _v;
long cycle_simd = 0;
for (i = 0; i < max; i++) {
QueryPerformanceCounter( &start );
_v = _mm_add_ps( v_1 , v_2 );
QueryPerformanceCounter( &end );
cycle_simd += (end.QuadPart - start.QuadPart);
}
printf( "Cycles for C++ SIMD : %d\n" , cycle_simd );

Result ranges (optimization on):
Cycles for FPU : 686159 ~ 702983
Cycles for SSE : 677565 ~ 684866
Cycles for C++ SIMD : 676939 ~ 688227

(I don't know how to open the disassembly window in VC directly, so I just insert "__asm int 3" to force a breakpoint and then open the disassembly window in release mode < this won't work in debug mode :) >. Anyway, I can't understand the disassembly ;( )

Quote:

is that your vectors aren't aligned in debug mode for some reason

I used "movups" instead of "movaps" and got the exception again...
I will check the addresses later.

And thanks for your replies, "Evil Steve" :-*

You should take your Query calls out of the for loop; just time the whole loop for x iterations. Another problem with your code is that _v is assigned the same value every single time, so the compiler can just remove the entire loop, which makes your release timings meaningless.
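The advice to read the timer once before and once after the loop can be sketched like this. The sketch uses `std::chrono::steady_clock` as a portable stand-in for QueryPerformanceCounter (an assumption, since the thread targets VC++ 2005, which predates `<chrono>`), and `time_loop_ns` is an illustrative name.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Run `body(i)` for i in [0, iterations) with a single timer read on each side.
// Returns the elapsed time of the whole loop in nanoseconds.
template <typename Body>
long long time_loop_ns(std::size_t iterations, Body body) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        body(i);
    }
    const auto end = std::chrono::steady_clock::now();
    assert(end >= start);  // steady_clock never runs backwards
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
```

Dividing the returned value by the iteration count gives an average per call. The timer overhead is then paid once per run instead of once per iteration, which matters when the timed work is a single vector add.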

Keep in mind also that there's some CPU and OS overhead involved in switching to SSE calculation, and that incurring this overhead for just a simple add instruction will likely overshadow the add itself.

I changed the code a little again.
Now I'm using rand() inside the loops:

long cycle_mmn = 0;
QueryPerformanceCounter( &start );
for ( i = 0; i < max; i++ ) {
Vector4f v4_1( rand() % 10 , 2 , 3 , 4);
Vector4f v4_2( rand() % 10 , 3 , 4 , 5);
math->addVector( v4_1 , v4_2 );
}
QueryPerformanceCounter( &end );
cycle_mmn += (end.QuadPart - start.QuadPart);
printf( "Math Cycles :\t %d\n" , cycle_mmn );

long cycle_d3d = 0;
QueryPerformanceCounter( &start );
for ( i = 0; i < max; i++ ) {
D3DXVECTOR4 d4_1( rand() % 10 , 2 , 3 , 4);
D3DXVECTOR4 d4_2( rand() % 10 , 3 , 4 , 5);
D3DXVec4Add( &d4_r , &d4_1 , &d4_2 );
}
QueryPerformanceCounter( &end );
cycle_d3d += (end.QuadPart - start.QuadPart);
printf( "D3DX Cycles :\t %d\n" , cycle_d3d );


math->addVector uses SIMD intrinsics:

__m128 v4_1 = *(__m128*)&v1;
__m128 v4_2 = *(__m128*)&v2;
__m128 v_ret = _mm_add_ps( v4_1 , v4_2 );
return Vector4f(v_ret.m128_f32[0],v_ret.m128_f32[1],v_ret.m128_f32[2],v_ret.m128_f32[3]);

addVector is a pure virtual function, and the Math class uses a singleton that checks for SSE support and creates either a Math_SSE or a Math_FPU instance.

Results: ( SSE Vector 4 )
Math Cycles : 73370
D3DX Cycles : 41684

Results: ( FPU Vector 4 )
Math Cycles : 69024
D3DX Cycles : 40918

So D3DX is roughly 1.7x faster than my Math lib.
Are these results meaningful? If so, I can change the Math class and use something other than a singleton and pure virtual functions; if not, I need to change the main loops again...

The rand() calls inside the loops are watering down the difference. Preload arrays with the random coefficients outside the loop and outside the timer queries.
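A rough sketch of that suggestion: generate all inputs before the timer starts, then time a loop that only reads them. `Vec4`, `make_inputs`, and `sum_pairs` are illustrative names, and the scalar add stands in for whichever implementation is being measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Vec4 { float x, y, z, w; };  // illustrative stand-in for Vector4f / D3DXVECTOR4

// Build the inputs once, before any timer is started.
inline std::vector<Vec4> make_inputs(std::size_t count, unsigned seed) {
    std::srand(seed);
    std::vector<Vec4> inputs(count);
    for (Vec4& v : inputs) {
        v = Vec4{static_cast<float>(std::rand() % 10), 2.0f, 3.0f, 4.0f};
    }
    return inputs;
}

// The loop to be timed: only loads and adds, no rand() calls.
// Accumulating into `total` keeps every result in use.
inline Vec4 sum_pairs(const std::vector<Vec4>& a, const std::vector<Vec4>& b) {
    assert(a.size() == b.size());
    Vec4 total{0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < a.size(); ++i) {
        total.x += a[i].x + b[i].x;
        total.y += a[i].y + b[i].y;
        total.z += a[i].z + b[i].z;
        total.w += a[i].w + b[i].w;
    }
    return total;
}
```

Only the call to `sum_pairs` would go between the two timer reads. With large arrays the measurement also starts to include memory traffic, which may or may not be what the benchmark is meant to capture.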

Changed again.
Now I'm using the output vector by accumulating its x component in a variable (inside the loop) and passing that value to a function (outside the loop).


/* useless function, just to make sure compiler not removing loops */
void anotherFunc( long e ) {
if ( e == 1 ) /* never happens */
printf( "OOOPS WRONG!!" );
else
return;
}

int main() {
LARGE_INTEGER start, end;
int max = 500000, i, Repeat = 500;
/* false: create SSE version, true: create FPU version of math */
Math *math = Math::getSingleTone( false );

Vector4f v4_r;
D3DXVECTOR4 d4_r;

Vector4f v4_1( 0 , 2 , 3 , 4);
Vector4f v4_2( 0 , 3 , 4 , 5);
D3DXVECTOR4 d4_1( 0 , 2 , 3 , 4);
D3DXVECTOR4 d4_2( 0 , 3 , 4 , 5);

float result = 0, tester_long = 0, max_saved = 0, max_lost = 0;
for ( int j = 0; j < Repeat; j++ ) {
long cycle_mmn = 0;
QueryPerformanceCounter( &start );
for ( i = 0; i < max; i++ ) {
v4_r = math->addVector( v4_1 , v4_2 );
tester_long = tester_long + v4_r.x;
}
QueryPerformanceCounter( &end );
cycle_mmn = (end.QuadPart - start.QuadPart);

long cycle_d3d = 0;
QueryPerformanceCounter( &start );
for ( i = 0; i < max; i++ ) {
D3DXVec4Add( &d4_r , &d4_1 , &d4_2 );
tester_long = tester_long + d4_r.x;
}
QueryPerformanceCounter( &end );
cycle_d3d = (end.QuadPart - start.QuadPart);

/* Send tester_long to a function just to make sure loop is not removed by compiler */
anotherFunc( tester_long );

float percent = 0;
if ( cycle_d3d > cycle_mmn ) {
percent = (1 - float(cycle_mmn)/float(cycle_d3d)) * 100;
printf( " ++ Speed %f %% saved ( [ %d ][ %d ].\n" , percent , cycle_d3d , cycle_mmn );
max_saved = __max( max_saved , percent );
result = result + percent;
} else {
percent = float(cycle_mmn)/float(cycle_d3d) * 100 - 100;
printf( " -- Speed %f %% lost ( [ %d ][ %d ].\n" , percent , cycle_d3d , cycle_mmn );
max_lost = __max( max_lost , percent );
result = result - percent;
}
}

float perc = result / float(Repeat);
printf( "\n\tTarget is %f %% %s than D3DX \n" , abs(perc) , (perc>0)?"faster":"slower");
printf( "\tMax saved time:\t%f %%\n\tMax lost time:\t%f %%\n\n" , max_saved , max_lost );
}


I think this should be a good way to end this!
Now the results should be correct.

Quote:

Preload arrays

Arrays of floats?? Arrays of Vector4f and D3DXVECTOR4 won't change anything...

About the "Unhandled exception" in debug mode:
changing the line "mov EBX, v2" to "mov ECX, v2" removes the problem.
Anyway, I think it's better to let the compiler handle these things and just write all the SSE code with C++ SIMD intrinsics...

Some results:
With "math->addVector" as a pure virtual function (SIMD intrinsics):
Target is 288.732117 % slower
Max saved time: 0.000000 %
Max lost time: 391.315460 %
---
With an inline function (SIMD intrinsics):
Target is 0.892567 % slower than D3DX
Max saved time: 16.738482 %
Max lost time: 15.788765 %
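The gap between these two results is consistent with the virtual call preventing inlining, although that is an interpretation and was not measured separately in this thread. If runtime selection between an SSE and an FPU implementation is still wanted, one option is to make the virtual call once per batch rather than once per vector. The sketch below is hypothetical: `BatchMath`, `BatchMathScalar`, and `addVectors` are not part of the poster's Math class.

```cpp
#include <cassert>
#include <cstddef>

struct Vec4 { float x, y, z, w; };  // illustrative stand-in for Vector4f

// Hypothetical interface: one virtual call processes a whole array,
// so the dispatch cost is spread over n additions.
class BatchMath {
public:
    virtual ~BatchMath() {}
    virtual void addVectors(const Vec4* a, const Vec4* b, Vec4* out, std::size_t n) const = 0;
};

// Scalar implementation; an SSE implementation would override the same
// function and run its own tight loop with intrinsics.
class BatchMathScalar : public BatchMath {
public:
    void addVectors(const Vec4* a, const Vec4* b, Vec4* out, std::size_t n) const override {
        assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
        for (std::size_t i = 0; i < n; ++i) {
            out[i].x = a[i].x + b[i].x;
            out[i].y = a[i].y + b[i].y;
            out[i].z = a[i].z + b[i].z;
            out[i].w = a[i].w + b[i].w;
        }
    }
};
```

Inside each implementation the loop body is ordinary inlinable code, so the per-vector cost should be closer to the inline-function figure than to the per-call virtual one.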
