jamesleighe

SSE and performance (+compiler intrinsics)


First of all, I'm pretty new to SSE etc. so that's something to keep in mind :)
Also, I'm using intrinsics for Microsoft Visual Studio.

Ok, so my question is why can't I get a speed boost from SSE here? (just a simple multiplication test) I tried a few things and I get the same results.
Does anyone know how I can use SSE to improve the speed of simple things like this and Vector maths? (this example is not vector math however)


On a positive note, I also included code (below the following code) that uses SSE to compute a square root far faster than using 'sqrtf'...


#include <iostream>
#include <cstdlib> // std::system
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <intrin.h>

__declspec (align(16))
class Vector
{
public:
    Vector () { }
    Vector (float _x, float _y, float _z)
        : x(_x), y(_y), z(_z)
    { }

    float x, y, z;
    float dummy[1]; // padding so the class fills one 16-byte SSE register
};

// non-geometric multiply
void Multiply (Vector& a, Vector& b, Vector& result)
{
    // Example One (~5k cycles)
    /*
    __m128* pA = (__m128*)&a;
    __m128* pB = (__m128*)&b;
    __m128* pResult = (__m128*)&result;

    *pResult = _mm_mul_ps (*pA, *pB);
    */

    // Example Two (~5k cycles)
    __m128 alA = _mm_load_ps ((float*)&a);
    __m128 alB = _mm_load_ps ((float*)&b);

    __m128 alResult = _mm_mul_ps (alA, alB);

    _mm_store_ps ((float*)&result, alResult);
}

// non-geometric multiply (maybe slow)
void MultiplySlow (Vector& a, Vector& b, Vector& result)
{
    result.x = a.x * b.x;
    result.y = a.y * b.y;
    result.z = a.z * b.z;
}

int main ()
{
    __int64 sampleA;
    __int64 sampleB;

    Vector a(1.0f, 2.0f, 3.0f);
    Vector b(1.0f, 2.0f, 3.0f);
    Vector c;

    // just in case setting up the vecs takes time
    Sleep (100);

    sampleA = __rdtsc ();
    {
        // ~5k cycles
        Multiply (a, b, c);

        // also ~5k cycles
        //MultiplySlow (a, b, c);
    }
    sampleB = __rdtsc ();

    std::cout << "Ticks: " << sampleB - sampleA << "\n\n";
    std::system ("pause");
    return 0;
}



#include <iostream>
#include <cmath>   // sqrtf
#include <cstdlib> // std::system
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <intrin.h>

int main ()
{
    __int64 sampleA;
    __int64 sampleB;

    // just in case setup takes time
    Sleep (100);

    sampleA = __rdtsc ();
    {
        // ~56k ticks
        /*
        float test = 1100;
        test = sqrtf (test);
        */

        // ~4k ticks
        float test = 1100;
        float result;
        __m128 alTest;
        __m128 alResult;

        alTest = _mm_load_ss (&test);
        alResult = _mm_sqrt_ss (alTest);
        _mm_store_ss (&result, alResult);
    }
    sampleB = __rdtsc ();

    std::cout << "Ticks: " << sampleB - sampleA << "\n\n";
    std::system ("pause");
    return 0;
}

"Ok, so my question is why can't I get a speed boost from SSE here?"
Because you're doing the same workload and adding the overhead of loading into and storing from SSE registers. SSE operations are not inherently faster: a multiply is still a multiply, and it suffers from the same memory access penalties and load/store stalls, which dominate the run time by far in both versions.
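One way to check whether fixed overhead is drowning out the actual work is to time many calls and keep the best result, rather than timing a single call. The following is only a rough sketch: `MinCycles` is a hypothetical helper that does not appear in the thread, and it assumes an x86 compiler that provides `__rdtsc`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#ifdef _MSC_VER
#include <intrin.h>    // __rdtsc on MSVC
#else
#include <x86intrin.h> // __rdtsc on GCC/Clang (x86 only)
#endif

// Hypothetical helper: calls `work` `runs` times and returns the smallest
// cycle count observed for a single call. Taking the minimum filters out
// one-off costs such as cold caches and interrupts.
template <typename Work>
std::uint64_t MinCycles (Work work, int runs)
{
    assert (runs > 0);

    std::uint64_t best = std::numeric_limits<std::uint64_t>::max ();
    for (int i = 0; i < runs; ++i) {
        const std::uint64_t start = __rdtsc ();
        work ();
        const std::uint64_t stop = __rdtsc ();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}
```

With the code from the first post, this could be called as `MinCycles ([&] { Multiply (a, b, c); }, 100000)` and again with `MultiplySlow`. An optimising compiler may still remove work whose result is never used, so the output should be read or printed afterwards.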


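SIMD is Single Instruction Multiple Data. So the first step is to have a lot of data:

__declspec (align(16)) float a[1024*3];      // 16-byte aligned, as _mm_load_ps requires
__declspec (align(16)) float b[1024*3];
__declspec (align(16)) float result[1024*3];

Now we have 1024 vectors. SSE can work on 4 floats at a time, so we need 1024*3/4 = 768 operations.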

Each array contains linearly encoded vectors, and one operation processes 4 consecutive floats:

| [x0][y0][z0][x1] | [y1][z1][x2]...

Do the same as above:

int n = 1024*3/4;
float * pa = &a[0]; // must be 16-byte aligned for _mm_load_ps
float * pb = &b[0];
float * pr = &result[0];

for (int i = 0; i < n; i++) {
    __m128 alA = _mm_load_ps (pa);
    __m128 alB = _mm_load_ps (pb);

    __m128 alResult = _mm_mul_ps (alA, alB);

    _mm_store_ps (pr, alResult);
    pa += 4;
    pb += 4;
    pr += 4;
}
This is still highly suboptimal, since it spends too much time in load/store relative to actual work and does no prefetching. For better results one would prefetch a whole cache line (4*4 floats), load it into registers as four 4-float groups, then process them in bulk. It is a simple matter of unrolling the loop.

But at least the above does 4 operations at the same time.
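The unrolling and prefetching idea could look roughly like the sketch below. `MultiplyArrays` is a hypothetical name, and the code assumes that `n` is a multiple of 16 (true for 1024*3), that all three arrays are 16-byte aligned, and that a cache line is 64 bytes. Whether the explicit prefetch helps at all depends on the CPU, so treat it as something to measure rather than a guaranteed win.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics (MSVC, GCC and Clang on x86)

// Element-wise r[i] = a[i] * b[i], handling 16 floats per iteration.
// Assumptions (not from the thread): n is a multiple of 16, all pointers
// are 16-byte aligned, and a cache line holds 16 floats (64 bytes).
void MultiplyArrays (const float* a, const float* b, float* r, std::size_t n)
{
    assert (n % 16 == 0);

    for (std::size_t i = 0; i < n; i += 16) {
        // Hint the next block of each input; skipped on the last block.
        if (i + 16 < n) {
            _mm_prefetch (reinterpret_cast<const char*> (a + i + 16), _MM_HINT_T0);
            _mm_prefetch (reinterpret_cast<const char*> (b + i + 16), _MM_HINT_T0);
        }

        // Unrolled body: four independent multiplies of 4 floats each.
        const __m128 r0 = _mm_mul_ps (_mm_load_ps (a + i),      _mm_load_ps (b + i));
        const __m128 r1 = _mm_mul_ps (_mm_load_ps (a + i + 4),  _mm_load_ps (b + i + 4));
        const __m128 r2 = _mm_mul_ps (_mm_load_ps (a + i + 8),  _mm_load_ps (b + i + 8));
        const __m128 r3 = _mm_mul_ps (_mm_load_ps (a + i + 12), _mm_load_ps (b + i + 12));

        _mm_store_ps (r + i,      r0);
        _mm_store_ps (r + i + 4,  r1);
        _mm_store_ps (r + i + 8,  r2);
        _mm_store_ps (r + i + 12, r3);
    }
}
```

For the arrays in this post the call would be `MultiplyArrays (a, b, result, 1024*3)`.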

The key, however, is to structure data in such a way that:
- you have lots of data of the same kind
- the same operation is applied to all of it
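One common way to get data into that shape is a structure-of-arrays layout, where each component is stored contiguously instead of interleaving x, y and z. That term and the names `Vec3x4` and `Dot4` are not from the thread; this is only an illustrative sketch of four dot products computed with the same instructions a single one would need.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h> // SSE intrinsics (MSVC, GCC and Clang on x86)

// Hypothetical layout: four 3D vectors, one register-sized group per component.
struct alignas(16) Vec3x4
{
    float x[4];
    float y[4];
    float z[4];
};

// Computes four dot products at once: out[i] = a[i] . b[i].
// Assumes `out` points to 4 floats and is 16-byte aligned.
void Dot4 (const Vec3x4& a, const Vec3x4& b, float* out)
{
    assert (reinterpret_cast<std::uintptr_t> (out) % 16 == 0);

    const __m128 xx = _mm_mul_ps (_mm_load_ps (a.x), _mm_load_ps (b.x));
    const __m128 yy = _mm_mul_ps (_mm_load_ps (a.y), _mm_load_ps (b.y));
    const __m128 zz = _mm_mul_ps (_mm_load_ps (a.z), _mm_load_ps (b.z));

    _mm_store_ps (out, _mm_add_ps (_mm_add_ps (xx, yy), zz));
}
```

Every lane does useful work here, whereas the x, y, z, dummy layout from the first post wastes a quarter of each register on padding.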

Thank you for your candor. However, why are cosine, sine and sqrt so much faster? Just wondering... It's too bad, really, that it's not just amazingly faster for everything.

Your example (load, mul, store) is likely bound by the speed of the RAM, not by the CPU speed.

If memory access is the bottleneck, then making the compute part 4 times faster (via SIMD) isn't going to help much.

Try testing the performance of something more math heavy, like a chain of matrix multiplications, etc... where the code is more likely to be compute-bound rather than memory-bound.
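As a starting point for such a test, a chain of 4x4 matrix multiplications could be sketched as follows. `Mat4`, `MultiplyMat4` and `MultiplyChain` are hypothetical names, and the row-major layout is an assumption made for this example, not something stated in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics (MSVC, GCC and Clang on x86)

// Hypothetical row-major 4x4 matrix; each row fills exactly one SSE register.
struct alignas(16) Mat4
{
    float m[4][4];
};

// result = a * b. Each result row is a weighted sum of the rows of b,
// with the weights taken from the matching row of a.
void MultiplyMat4 (const Mat4& a, const Mat4& b, Mat4& result)
{
    const __m128 b0 = _mm_load_ps (b.m[0]);
    const __m128 b1 = _mm_load_ps (b.m[1]);
    const __m128 b2 = _mm_load_ps (b.m[2]);
    const __m128 b3 = _mm_load_ps (b.m[3]);

    for (int i = 0; i < 4; ++i) {
        __m128 row = _mm_mul_ps (_mm_set1_ps (a.m[i][0]), b0);
        row = _mm_add_ps (row, _mm_mul_ps (_mm_set1_ps (a.m[i][1]), b1));
        row = _mm_add_ps (row, _mm_mul_ps (_mm_set1_ps (a.m[i][2]), b2));
        row = _mm_add_ps (row, _mm_mul_ps (_mm_set1_ps (a.m[i][3]), b3));
        _mm_store_ps (result.m[i], row);
    }
}

// result = mats[0] * mats[1] * ... * mats[count - 1].
void MultiplyChain (const Mat4* mats, std::size_t count, Mat4& result)
{
    assert (count > 0);

    Mat4 acc = mats[0];
    for (std::size_t i = 1; i < count; ++i) {
        Mat4 next;
        MultiplyMat4 (acc, mats[i], next);
        acc = next;
    }
    result = acc;
}
```

Each multiply performs 16 multiplies and 12 adds on data that stays in registers, so the ratio of arithmetic to memory traffic is much higher than in the single load, mul, store test.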

"It's too bad really that it's not just amazingly faster for everything."

That's like complaining that an assembly-line factory is too slow when used to only produce a single item (instead of using it for mass-production). You can't just turn SIMD on and expect it to speed up code that wasn't written with the "mass production" style in mind.

Also, as I said above, you've taken something where the bottleneck is problem A, and then you've optimised problem B instead.

When optimising code these days, memory access patterns are often more important than cutting down on CPU cycles. RAM is incredibly, incredibly slow compared to the CPU.

See the "Pitfalls of Object Oriented Programming" presentation for some great examples of how slow RAM is, and the kinds of impacts this has on performance.

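[edit] As for SSE's sqrt: the SSE estimate instructions, such as _mm_rsqrt_ss and _mm_rcp_ss, are allowed to do things differently than the strict IEEE definitions and give you a slightly more approximate result. _mm_sqrt_ss itself is still correctly rounded, so a gap as large as the one measured above more likely comes from how the sqrtf library call is compiled and timed than from reduced precision.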
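To make the exact versus approximate distinction concrete, here is a small sketch comparing the two kinds of SSE instruction. `SqrtExact` and `SqrtApprox` are hypothetical names. As far as I know, `_mm_sqrt_ss` returns the correctly rounded result, while `_mm_rsqrt_ss` only returns an estimate of 1/sqrt(x) with a relative error of at most about 1.5 * 2^-12.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics (MSVC, GCC and Clang on x86)

// Correctly rounded square root using the scalar SSE sqrt instruction.
float SqrtExact (float x)
{
    return _mm_cvtss_f32 (_mm_sqrt_ss (_mm_set_ss (x)));
}

// Approximate square root computed as x * (1 / sqrt(x)), using the fast
// reciprocal-square-root estimate. Assumes x > 0.
float SqrtApprox (float x)
{
    assert (x > 0.0f);

    const __m128 v = _mm_set_ss (x);
    return _mm_cvtss_f32 (_mm_mul_ss (v, _mm_rsqrt_ss (v)));
}
```

The approximate version trades a few digits of precision for lower latency, which is only worth it where that precision is not needed.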

To give regular 'float' math the same freedom from strictness you can tell your compiler that you don't care about the strict IEEE math definitions. If you're using MSVC, go into [project properties -> C/C++ -> Code Generation] and try changing the [Floating Point Model] to "Fast" (the /fp:fast compiler option).
