staticVoid2

SIMD slower than C.


I came across a function for finding the length of a vector, written with Intel's SSE, that is claimed to be faster than the C version, i.e.:
inline float getLength(float *vec)
{
   return sqrt((double)(vec[0]*vec[0] + vec[1]*vec[1] + vec[2]*vec[2]));
}

SSE:
inline float getLengthSIMD(float *vec)
{
	float ret;
	float *r = &ret;

	//static __declspec(align(16)) int mask[] = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000 };
	__asm
	{
		mov ecx, r
		mov esi, vec
		movups xmm0, [esi]
	//	andps xmm0, mask
		mulps xmm0, xmm0
		movaps xmm1, xmm0
		shufps xmm1, xmm1, 4Eh
		addps xmm0, xmm1
		movaps xmm1, xmm0
		shufps xmm1, xmm1, 11h
		addps xmm0, xmm1
		sqrtss xmm0, xmm0
		movss [ecx], xmm0
	};
	return ret;
}

This was from "3D Game Engine Programming" by Stefan Zerbst, so I figure it's probably a correct statement, but when I ran a profiler on the two functions the C version was much faster. Any reason(s) why this is?

Can't really tell you much without knowing what compiler you used, what options, what CPU you used, what your test program looked like, what the assembly output looked like...

--Edit

Or in what context the claims in the book were made.

SSE code is normally quite a bit quicker, but moving values back and forth between plain floats and an SSE register has a real cost.

So calculating only a vector's length may in fact be slower, but if you're going to get its length and then perform more calculations before converting back to a float, using SSE is probably a win.

Swizzling the vector to set up the SSE is using up most of the time. SSE can be a lot faster if you have many floating-point operations that can be done at once.
It also helps if there is a lot of data, especially if it is stored in the "structure of arrays" layout, which is usually more efficient. Also, your compiler could be using SSE for the C++ code anyway (most newer ones will).
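
To illustrate the "structure of arrays" layout mentioned above, here is a minimal sketch. The names `Vec3SoA` and `lengths` are made up for this example and do not come from the book or the thread. It is plain C++ with no intrinsics: the point is only that each loop iteration is independent and reads contiguous floats, which is the access pattern SSE (or a vectorizing compiler) can process four at a time.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays storage: all x values together,
// all y values together, all z values together.
struct Vec3SoA {
    std::vector<float> x, y, z;
};

// Computes the length of every vector in one pass.
// Assumes x, y and z all have the same number of elements.
inline std::vector<float> lengths(const Vec3SoA& v)
{
    std::vector<float> out(v.x.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        // No shuffling is needed: element i of each array belongs to vector i.
        out[i] = std::sqrt(v.x[i] * v.x[i] + v.y[i] * v.y[i] + v.z[i] * v.z[i]);
    }
    return out;
}
```

Whether a given compiler actually vectorizes this loop depends on the compiler and its options, so it is worth checking the generated assembly.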

Ick, assembly! Where in the dark ages did you find that code? How about something human-readable like OpenMP?

Okay, so a single-precision float is 32 bits. A 128-bit XMM register (the SSE registers; MMX registers are only 64 bits) can hold 4 floats. If your data is already 16-byte aligned and in AoS (array of structures) layout, you should be able to avoid the conversion penalties zedz mentioned:

1) copy your first vector to the first register
2) copy your first vector to the second register
3) Multiply the two registers
4) Add the resulting register
5) Do a squareroot
6) Copy the final register into memory

Questions:
Why are you doing shufps?
Why would anyone suggest SoA?
"mulps xmm0, xmm0", an operation on a single register?

Closing arguments:
You'd probably see a larger performance increase if this were collision detection, where the function would loop many times, changing only one of the registers.

- Valles

Intel's "IA-32 Architecture Reference Manual" has the exact same pseudocode on page 5-8 that you wrote; they're using it to get the length of two DIFFERENT vectors. They show a better solution on page 5-9, dropping your 7 operations down to 5 by operating on 4 vector comparisons in a single loop. Best of all, the book is free to download.

- Valles

That's weird, my profiler now says the SIMD version is faster:

avg% : max% : min% : calls : Name
------------------------------------------------------------
 3.1 :  3.1 :  3.1 :     1 : MAIN
43.3 : 43.3 : 43.3 :     1 : SIMD
53.6 : 53.6 : 53.6 :     1 : C

I loop the two functions 1000 times; although it says 1 call, that's 1 call to a for loop. Would the compiler recognise the constants and evaluate this value before runtime? I also don't store the return value anywhere, so there's really no reason to call these functions apart from profiling.

I'm using Visual C++ Express.

Quote:

4) Add the resulting register


This is where I'm using shufps: I swap the last two elements of the vector with the first two and then add the result to the original vector, to produce the scalar sum of all elements of the vector. Is there a better way to do this, or simply an instruction?

Yes, if your test function looks something like

void foo() {
    begin_time();
    for( int i = 0; i < 1000000; ++i )
        calc_sqrt( (float)i );
    end_time();
    std::cout << "That was a long time" << std::endl;
}


the compiler will recognize that you in fact never use the values calculated in the loop, and will just throw the calculation away. You need to store, use, or print the values somehow to get a true measurement, which is why such small benchmarks are never as good as actually profiling a complete application with non-trivial usage patterns.
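
One way to keep the calculation alive is sketched below, under some assumptions: a C++11 compiler (for `<chrono>`), and a made-up helper `timeGetLength` that is not from the thread. The input changes every iteration so the result is not a compile-time constant, and the running total is written to a `volatile` variable so the optimizer cannot drop the loop.

```cpp
#include <chrono>
#include <cmath>

// Function under test; mirrors the C version from the first post.
inline float getLength(const float* vec)
{
    return std::sqrt(vec[0] * vec[0] + vec[1] * vec[1] + vec[2] * vec[2]);
}

// Hypothetical timing helper: runs `iterations` calls and returns elapsed seconds.
// The accumulated result is stored in `sink` so the calls cannot be optimized away.
inline double timeGetLength(int iterations, volatile float& sink)
{
    float vec[3] = {1.0f, 2.0f, 2.0f};
    float total = 0.0f;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        vec[0] = static_cast<float>(i & 7);  // vary the input on every iteration
        total += getLength(vec);
    }
    sink = total;  // consume the result before the clock is stopped
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

Even with this in place, a loop this small mostly measures call overhead and timer resolution, so the caveat about profiling a real application still applies.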
