SSE optimizations not so optimized

Started by
4 comments, last by japro 12 years, 7 months ago
Hi guys, I'm having trouble with a few operations I implemented using SSE intrinsics (I'm using Visual Studio 2008/2010, by the way). The problem is that the performance gain I get depends too much on the operation. I'm using single-precision floating point, so one SSE instruction performs 4 float operations. An SSE division (_mm_div_ps) takes only 20% of the time the FPU takes. With the square root I get even better results, around 15% of the time the CRT sqrt takes without SSE. But those are the only good results. With addition, subtraction and multiplication I get almost the same times with SSE as with the FPU. I don't understand how division and square root can perform so well while the rest of the operations are a joke. I don't know whether these results are to be expected or whether I'm doing something really wrong.

I ran a few tests. Each test consists of 5000000 operations of one kind (for example, only divisions), and each result is the analysis of 20 tests (60 for addition, because it had a higher standard deviation, God knows why). Every test is run in release with speed optimizations, and with Enable Enhanced Instruction Set set to Not Set (the only difference it makes is that my FPU code runs slower). The blue bar is the average time (in seconds) of a test, and the red bar is the standard deviation. Here are the results I got:
(NOTE: "Con SIMD" means the test was run with SIMD and "Sin SIMD" means it was run without SIMD, i.e. on the FPU)

division.png

squareroot.png

addition.png

multiplication.png


For example, the code for an addition is:



---------- Vector4.h --------------

#ifdef SIMD_EXTENSION



class Vector4
{
public:

__declspec(align(16)) union
{
__m128 m_xyzw;
struct
{
float m_x, m_y, m_z, m_w;
};
};
//This union is probably not a good idea... What do you think?
//Anyway, I never access m_x, m_y, m_z, m_w in this test

.....


inline Vector4 operator+(const Vector4 &B) const
{
return Vector4( _mm_add_ps(m_xyzw, B.m_xyzw) );
}

....

#else


class Vector4
{
public:

float m_x, m_y, m_z, m_w;
......


inline Vector4 operator+(const Vector4 &B) const
{
return Vector4( m_x+B.m_x, m_y+B.m_y, m_z+B.m_z, m_w+B.m_w );
}

.....

#endif


A simplified example of a call to the addition operation inside the test would be:


Vector4* data = (Vector4*)_aligned_malloc(iterations*sizeof(Vector4), __alignof(Vector4));//if not properly aligned, everything goes to hell
srand(static_cast<unsigned int>(time(NULL)));

for(unsigned int i=0; i < iterations ;++i)
data[i] = Vector4(static_cast<float>(rand()%100), static_cast<float>(rand()%100), static_cast<float>(rand()%100), static_cast<float>(rand()%100));

...... later inside a test

data[i] = data[i] +data[i+1];

.... other operations of addition



Here is my code. There are projects for Visual Studio 2008 and 2010.
Code

The examples I have seen on the web that show timings always use division or square root... I don't know whether that is intentional. For example:

http://software.inte...u-acceleration/
http://supercomputin...se-programming/
http://www.codeproje...s/sseintro.aspx
It's possible that the "badly performing" SIMD operations are simply memory bound in both versions, i.e. in both the FPU and SIMD versions the bottleneck is reading data from RAM and writing data to RAM. If computation is not the bottleneck, then optimising the computation via SIMD is not going to help that much.

One way to reduce the impact of RAM on your test would be to increase the locality of your test data by replacing the index 'i' with '0', e.g.:

data[0] = data[0] +data[0+1];
data[0+1] = data[0] -data[0+1];
data[0+2] = data[0+2]*data[0+3];
data[0+3] = data[0+2]/data[0+3];
data[0+4] = data[0+3].SquareLength();
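
To make the locality point measurable, here is a minimal sketch, not the original poster's benchmark. The helpers run_adds and time_adds are hypothetical names, plain float storage with unaligned loads stands in for the Vector4 class, and std::chrono assumes a C++11 compiler. The same number of SSE additions is run over a window of configurable size, so a tiny window can be compared against one that spans the whole buffer.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: performs `ops` SSE additions of the form
// v[i] = v[i] + v[i+1], cycling over the first `window` 4-float vectors.
// A small window stays in L1 cache; a window covering the whole buffer
// streams through RAM.
inline float run_adds(std::vector<float>& data, std::size_t window, std::size_t ops)
{
    assert(window >= 2 && window * 4 <= data.size());
    float* p = data.data();
    std::size_t i = 0;
    for (std::size_t n = 0; n < ops; ++n) {
        const __m128 a = _mm_loadu_ps(p + 4 * i);
        const __m128 b = _mm_loadu_ps(p + 4 * (i + 1));
        _mm_storeu_ps(p + 4 * i, _mm_add_ps(a, b));
        i = (i + 2 < window) ? i + 1 : 0;  // wrap before i+1 leaves the window
    }
    return p[0];  // return a value so the work cannot be discarded
}

// Hypothetical helper: times one call of run_adds and returns elapsed seconds.
inline double time_adds(std::vector<float>& data, std::size_t window, std::size_t ops)
{
    const auto start = std::chrono::steady_clock::now();
    volatile float sink = run_adds(data, window, ops);  // keep the result alive
    (void)sink;
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_adds(data, data.size() / 4, ops) walks the whole buffer, while time_adds(data, 4, ops) touches only 64 bytes. If the second is much faster for the same ops, the original test was probably limited by memory rather than by arithmetic.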

Another possibility is that your constructors aren't being inlined or your overloaded operators are inhibiting the optimiser. Try writing a SIMD test using a core C-style API to see if it performs differently than your C++-style API.
Check these out:
http://altdevblogada...simd-interface/
http://altdevblogada...nterface-redux/
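
As a rough idea of what a C-style test could look like, here is a sketch. The names vec4, vec4_set, vec4_add, vec4_mul, vec4_x and add_neighbours are made up for illustration and are not taken from the linked articles. The vector is just a typedef for __m128 and every operation is a free inline function taking its arguments by value.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstddef>

typedef __m128 vec4;  // hypothetical alias: the raw SSE register type, no class wrapper

// _mm_set_ps takes its arguments from the highest lane to the lowest, so x ends up in lane 0.
inline vec4 vec4_set(float x, float y, float z, float w) { return _mm_set_ps(w, z, y, x); }
inline vec4 vec4_add(vec4 a, vec4 b) { return _mm_add_ps(a, b); }
inline vec4 vec4_mul(vec4 a, vec4 b) { return _mm_mul_ps(a, b); }
inline float vec4_x(vec4 v) { return _mm_cvtss_f32(v); }  // extracts lane 0

// Same access pattern as the thread's data[i] = data[i] + data[i+1];
// the last element is left untouched.
inline void add_neighbours(vec4* data, std::size_t count)
{
    assert(count >= 1);
    for (std::size_t i = 0; i + 1 < count; ++i)
        data[i] = vec4_add(data[i], data[i + 1]);
}
```

Timing add_neighbours against the same loop written with the Vector4 class would show whether the constructors and overloaded operators cost anything after optimisation.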
Not entirely related, but your timing code also times 'n' iterations of a 7-case switch statement. Invert the implementation so that there is one switch containing one loop per case to get cleaner results.
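
A sketch of that inversion, using scalar floats and only four operation kinds to keep it short. The Op enum and run_test are hypothetical names, not the poster's code. The switch is evaluated once and each case owns its own loop, so the dispatch no longer sits inside the timed iterations.

```cpp
#include <cassert>
#include <cstddef>

enum Op { OpAdd, OpSub, OpMul, OpDiv };  // hypothetical subset of the test kinds

// One switch containing one loop per case: the branch on `op` is taken
// once per test instead of once per iteration.
inline void run_test(Op op, float* data, std::size_t count)
{
    assert(count >= 2);
    const std::size_t last = count - 1;  // data[i + 1] must stay in range
    switch (op) {
    case OpAdd:
        for (std::size_t i = 0; i < last; ++i) data[i] = data[i] + data[i + 1];
        break;
    case OpSub:
        for (std::size_t i = 0; i < last; ++i) data[i] = data[i] - data[i + 1];
        break;
    case OpMul:
        for (std::size_t i = 0; i < last; ++i) data[i] = data[i] * data[i + 1];
        break;
    case OpDiv:
        for (std::size_t i = 0; i < last; ++i) data[i] = data[i] / data[i + 1];
        break;
    }
}
```

The timer would then be started and stopped around the run_test call.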

Jans.
To really see the full advantage of SSE you need:
1. working set that fits into L1 cache or even better into the 8 SSE registers.
2. minimum overhead around SSE instructions. So unroll your benchmark loops (for a project of mine I had to unroll the SSE instructions 16-fold to get to peak performance, so every iteration would process 128+ float operands), don't have lots of branches mixed with SSE instructions.

The amortized cost of SSE instructions like add, mul etc. is usually in the range of single clock cycles. So anything that introduces overhead at the instruction level will instantly distort the results.
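
To illustrate the unrolling advice, here is a minimal sketch of a 4-fold unrolled SSE sum with independent accumulators. It uses 4-fold rather than 16-fold only to keep the example short. The function name sum_unrolled is hypothetical, and the requirement that the count be a multiple of 16 is a simplification of this sketch.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstddef>

// Sums `count` floats. Each loop iteration issues four independent SSE
// additions (16 floats), so the loop counter and branch are paid once per
// 16 operands and the additions do not wait on each other.
inline float sum_unrolled(const float* p, std::size_t count)
{
    assert(count % 16 == 0);  // sketch only: no handling of a leftover tail
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    for (std::size_t i = 0; i < count; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(p + i));
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(p + i + 4));
        acc2 = _mm_add_ps(acc2, _mm_loadu_ps(p + i + 8));
        acc3 = _mm_add_ps(acc3, _mm_loadu_ps(p + i + 12));
    }
    // Combine the four accumulators, then the four lanes.
    const __m128 total = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float lanes[4];
    _mm_storeu_ps(lanes, total);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

With a buffer small enough to stay in L1, a loop like this should get much closer to the cost of the add instructions themselves than a loop doing one addition per iteration.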
Thanks for the help, I was able to fix the problem. First I tried what Hodgman suggested and ran the operations on a small working set so the data had locality. This made the SSE operations as fast as expected. Then I put more operations in each loop iteration, as japro suggested, and with that I got great results. The only strange thing is that the compiler didn't unroll my loop even though the bound is known at compile time (iterations is a define):


for(unsigned int i=0; i < (iterations-1) ;++i)//not unrolling so i put a lot of operations in each loop
{
//15 SSE operations
}

I thought this kind of thing was always optimized....

Another thing, more a curiosity than anything else: operations on inf values with the FPU are EXTREMELY SLOW (I had a case where the data overflowed; I have already fixed that). I don't know whether it is because some flags get set because of the inf or something else... under the same conditions the SSE operations are exactly as fast as before.

multiplication_fixed_operations.png

addition_fixed_operations.png


PS: thanks for the links Hodgman, they are really useful.



"The only strange thing is that the compiler didn't unroll my loop even though the bound is known at compile time (iterations is a define):"

That depends on the compiler. If you are using gcc, try "-funroll-loops". Also, it seems the Intel compiler won't touch loops that contain intrinsics.

"Another thing, more a curiosity than anything else: operations on inf values with the FPU are EXTREMELY SLOW (I had a case where the data overflowed; I have already fixed that). I don't know whether it is because some flags get set because of the inf or something else... under the same conditions the SSE operations are exactly as fast as before."
I think the FPU with "wrong" flags will trigger an interrupt whenever such an operation is encountered while the SSE unit just doesn't care about the undefined result and goes on.
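
The SSE half of that can at least be checked in a small sketch. It does not measure speed and says nothing about the x87 side. It only shows that with the default MXCSR settings, where the overflow exception is masked, an overflowing multiply quietly yields inf and later operations keep running on it. The function name overflow_then_add is hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics and the _MM_* control/status macros
#include <cassert>
#include <limits>

// Returns lane 0 of (max * max) + 1 computed with SSE instructions.
inline float overflow_then_add()
{
    // Assumption of this sketch: the SSE overflow exception is masked,
    // which is the default state of the MXCSR register.
    assert((_MM_GET_EXCEPTION_MASK() & _MM_MASK_OVERFLOW) != 0);

    const __m128 big = _mm_set1_ps(std::numeric_limits<float>::max());
    const __m128 inf = _mm_mul_ps(big, big);                 // overflows to +inf in every lane
    const __m128 sum = _mm_add_ps(inf, _mm_set1_ps(1.0f));   // inf + 1 is still +inf
    return _mm_cvtss_f32(sum);
}
```

Timing a loop of such operations against the same loop on finite values would be one way to confirm that SSE does not slow down on inf on a given CPU.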

This topic is closed to new replies.
