# SIMD slower than C.

staticVoid2    381
I came across a function for finding the length of a vector, written with Intel's SSE, that claims to be faster than the C version, i.e.:

    inline float getLength(float *vec)
    {
        return sqrt((double)(vec[0]*vec[0] + vec[1]*vec[1] + vec[2]*vec[2]));
    }


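SSE:

    inline float getLengthSIMD(float *vec)
    {
        float ret;
        float *r = &ret;

        // With the mask disabled, the fourth lane is summed too, so vec must
        // point to four readable floats and the fourth must be 0.
        //static __declspec(align(16)) int mask[] = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000 };
        __asm
        {
            mov ecx, r
            mov esi, vec
            movups xmm0, [esi]        // load four floats (unaligned)
            //	andps xmm0, mask
            mulps xmm0, xmm0          // square each lane
            movaps xmm1, xmm0
            shufps xmm1, xmm1, 4Eh    // swap the high and low pairs
            addps xmm0, xmm1          // pairwise partial sums
            movaps xmm1, xmm0
            shufps xmm1, xmm1, 11h    // swap neighbours within each pair
            addps xmm0, xmm1          // every lane now holds the full sum
            sqrtss xmm0, xmm0         // square root of the low lane
            movss [ecx], xmm0
        };
        return ret;
    }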


This was from "3D Game Engine Programming" by Stefan Zerbst, so I figure it's probably a correct statement, but when I ran a profiler on the two functions the C version was much faster. Any reason(s) why this is?

Antheus    2409
How much faster? Is it possible that the C version was evaluated at compile time?

outRider    852
Can't really tell you much without knowing what compiler you used, what options, what CPU you used, what your test program looked like, what the assembly output looked like...

--Edit

Or in what context the claims in the book were made.

zedz    291
SSE stuff is normally quite a bit quicker, but it costs quite a lot to move data back and forth between plain floats and an SSE-friendly data type.

So calculating only a vector's length may in fact be slower, but if you're going to get its length and then perform some more calculations before converting back to a float, using SSE is probably a win.

Roboticus    122
Swizzling the vector to set up the SSE is using up most of the time. SSE can be a lot faster if you have many floating-point operations that can be done at once.
It also helps if there is a lot of data, especially if it is stored in the "structure of arrays" layout, which is usually more efficient. Also, your compiler could be using SSE for the C++ code anyway (most newer ones will).
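
To make the "structure of arrays" point concrete, here is a minimal sketch using SSE intrinsics rather than inline assembly. It is not from the thread: the `Vec3SoA` struct and `lengthsSoA` function are hypothetical names, and the code assumes the x, y and z components live in three separate float arrays. With that layout, four lengths come out of each iteration with no shuffling at all.

```cpp
#include <cmath>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Hypothetical SoA view: all x components together, then all y, then all z.
struct Vec3SoA {
    const float* x;
    const float* y;
    const float* z;
    std::size_t count;
};

// Writes v.count lengths to out. Four vectors are processed per iteration
// when SSE is available; the scalar loop handles the tail (and non-SSE targets).
void lengthsSoA(const Vec3SoA& v, float* out) {
    std::size_t i = 0;
#if HAVE_SSE
    for (; i + 4 <= v.count; i += 4) {
        __m128 x = _mm_loadu_ps(v.x + i);  // x0 x1 x2 x3
        __m128 y = _mm_loadu_ps(v.y + i);  // y0 y1 y2 y3
        __m128 z = _mm_loadu_ps(v.z + i);  // z0 z1 z2 z3
        __m128 sum = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                                _mm_mul_ps(z, z));
        _mm_storeu_ps(out + i, _mm_sqrt_ps(sum));  // four lengths at once
    }
#endif
    for (; i < v.count; ++i) {
        out[i] = std::sqrt(v.x[i] * v.x[i] + v.y[i] * v.y[i] + v.z[i] * v.z[i]);
    }
}
```

Whether this beats what the compiler generates from the plain loop depends on the compiler and its flags, so it is worth measuring both.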

valles    173
Ick, assembly! Where in the dark ages did you find that code? How about something human-readable like OpenMP?

Okay, so a single-precision float is 32 bits. A 128-bit SSE (XMM) register can hold 4 floats. If your data is already 16-byte aligned and in AoS (array of structures) layout, you should be able to avoid the conversion penalties zedz mentioned:

1) copy your first vector to the first register
2) copy your first vector to the second register
3) Multiply the two registers
4) Add the resulting register
5) Do a squareroot
6) Copy the final register into memory
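
As a rough sketch of those six steps in intrinsics instead of inline assembly (the function name `lengthSSE` is hypothetical, and the unused fourth lane is assumed to be zero so it does not affect the sum):

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Length of a 3-component vector; vec must point to at least three floats.
float lengthSSE(const float* vec) {
#if HAVE_SSE
    // Steps 1-2: put x, y, z into a register, with the fourth lane set to 0.
    __m128 v = _mm_set_ps(0.0f, vec[2], vec[1], vec[0]);
    // Step 3: multiply the register by itself to square each lane.
    __m128 sq = _mm_mul_ps(v, v);
    // Step 4: add the lanes together with two shuffle + add pairs.
    __m128 swapped = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(1, 0, 3, 2));    // imm 0x4E
    __m128 sums = _mm_add_ps(sq, swapped);
    __m128 rotated = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(0, 1, 0, 1)); // imm 0x11
    sums = _mm_add_ps(sums, rotated);
    // Steps 5-6: square root of the low lane, then move it out as a float.
    return _mm_cvtss_f32(_mm_sqrt_ss(sums));
#else
    // Scalar fallback for targets without SSE.
    return std::sqrt(vec[0] * vec[0] + vec[1] * vec[1] + vec[2] * vec[2]);
#endif
}
```

For example, `lengthSSE` applied to the vector (3, 4, 12) gives 13.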

Questions:
Why are you doing shufps?
Why anyone would suggest SOA?
"mulps xmm0, xmm0", an operation on a single register?

Closing arguments:
You'd probably see a larger performance increase if this were part of collision detection, where the function would loop many times, changing only one of the registers.

- Valles

valles    173
Intel's "IA-32 Architecture Reference Manual" has the same exact pseudo code on page 5-8 that you wrote; they're using it to get the length of two DIFFERENT vectors. They show a better solution on page 5-9, dropping your 7 operations down to 5 by operating on 4 vector comparisons in a single loop. Best of all, the book is free to download.

- Valles

staticVoid2    381
That's weird - my profiler now says the SIMD version is faster:

    avg%   :     max%   :     min%   :     calls   :   Name
    ------------------------------------------------------------
     3.1   :       3.1  :       3.1  :         1   :   MAIN
    43.3   :      43.3  :      43.3  :         1   :   SIMD
    53.6   :      53.6  :      53.6  :         1   :   C

I loop the two functions 1000 times; although it says 1 call, that is 1 call to a for loop. Would the compiler recognise the constants and evaluate this value before runtime? Also, I don't store the return value anywhere, so there's really no reason to call these functions apart from profiling.

I'm using Visual C++ Express.

Quote:
 4) Add the resulting register

This is where I'm using shufps: so that I can swap the last two elements of the vector with the first two and then add this to the original vector to produce the scalar sum of all elements of the vector. Is there a better way to do this, or simply a single instruction?
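
To see what the two shuffle immediates do lane by lane, here is a small scalar model of `shufps` and `addps`. It is only an illustration in portable C++: the `Vec4` alias and the function names are made up for this sketch and no real SSE instructions are executed.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;

// Scalar model of `shufps dst, src, imm8`: the two low result lanes are picked
// from dst and the two high lanes from src, each by a 2-bit field of imm8.
Vec4 shufps(const Vec4& dst, const Vec4& src, unsigned imm8) {
    return { dst[imm8 & 3], dst[(imm8 >> 2) & 3],
             src[(imm8 >> 4) & 3], src[(imm8 >> 6) & 3] };
}

// Scalar model of `addps`: lane-wise addition.
Vec4 addps(const Vec4& a, const Vec4& b) {
    return { a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3] };
}

// Horizontal sum of {a, b, c, d} using the same two immediates as the asm.
float horizontalSum(Vec4 v) {
    v = addps(v, shufps(v, v, 0x4E));  // {a+c, b+d, a+c, b+d}
    v = addps(v, shufps(v, v, 0x11));  // every lane is a+b+c+d
    return v[0];
}
```

With the input {1, 2, 3, 4}, the first shuffle yields {3, 4, 1, 2} and `horizontalSum` returns 10. Each shuffle needs an add after it for the sum to reach the low lane. SSE3 later added a horizontal add instruction, `haddps`, but it is not available on SSE-only hardware.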

Promethium    580
Yes, if your test function looks something like

    void foo() {
        begin_time();
        for( int i = 0; i < 1000000; ++i )
            calc_sqrt( (float)i );
        end_time();
        std::cout << "That was a long time" << std::endl;
    }

the compiler will recognize that you in fact never use the values calculated in the loop, and just throw the calculation away. You need to store/use/print the values somehow to get a true measurement, which is why such small benchmarks are never as good as actually profiling a complete application with non-trivial usage patterns.
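
A minimal sketch of a timing loop that keeps the results alive might look like the following. The helper name `timeGetLength` is hypothetical, and `std::chrono` stands in for whatever profiler is actually in use. The input depends on a `volatile` value so it cannot be folded at compile time, and the results are accumulated into a checksum that the caller is expected to use.

```cpp
#include <chrono>
#include <cmath>

// Function under test (same computation as the C version in the first post).
inline float getLength(const float* vec) {
    return std::sqrt(vec[0] * vec[0] + vec[1] * vec[1] + vec[2] * vec[2]);
}

// Calls getLength `iterations` times on changing input and returns the elapsed
// time in milliseconds. The sum of all results is written to *checksumOut so
// the compiler cannot treat the calls as dead code.
double timeGetLength(int iterations, float* checksumOut) {
    volatile float seed = 1.0f;  // volatile read: value unknown at compile time
    float checksum = 0.0f;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        float v[3] = { seed + i, 2.0f, 3.0f };
        checksum += getLength(v);
    }
    auto stop = std::chrono::steady_clock::now();

    *checksumOut = checksum;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Printing both the elapsed time and the checksum after the call is enough to make the result count as used. Even then, a loop this small mostly measures call and loop overhead, so the numbers should be read with care.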
