subi

sse2 replacements for some maths functions


subi    157
Hi everyone. In my current engine I make a lot of calls to the standard math function sqrt. I was wondering whether it is worth using SSE2 instructions and a bit of assembly to do this instead. I have come up with this, and it appears to work.
float SSE2sqrt(float src)
{
	float t;
	_asm
	{
		movupd xmm0, src
		movupd xmm1,t
		sqrtss xmm1,xmm0
		movupd t,xmm1
	}
	return t;
}

Do you think a few of these kinds of optimizations are worth trying, or are they just a waste of time?

Spoonbender    1258
Depends.

- Is this actually a bottleneck in your app? Or is it premature optimization/a waste of time? Do you actually need to optimize anything at this point? If so, *what* do you need to optimize? Have you profiled your app?
- Is your SSE version actually faster? By how much? Enough to make a difference?
- What about CPUs that don't have SSE2?

Oh, and why not use the compiler intrinsics to call the SSE instructions? In the worst case, it'll yield the same code. In the best case, it'll allow the compiler to optimize more efficiently.
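The intrinsics route can be sketched like this (a minimal sketch using the standard SSE intrinsics from `<xmmintrin.h>`; the function name `sse_sqrt` is mine):

```cpp
#include <xmmintrin.h> // SSE intrinsics (MSVC and GCC/Clang on x86)

// Scalar sqrt via the SQRTSS intrinsic. Unlike an __asm block, the
// compiler picks the registers itself and can inline/schedule freely.
inline float sse_sqrt(float x) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}
```

No manual register moves, no risk of trashing the stack, and the compiler can fold this into surrounding code.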

John Schultz    811
[MOVUPD moves two double-precision FP values into the destination register; why move stack garbage into xmm1, then overwrite it with the sqrt of float src (plus garbage)? It also appears that the last MOVUPD will trash the stack past t (unless the compiler fixes it by observing that t is a single float).]

It can make a difference, sometimes a huge difference, but you have to benchmark the custom asm in the context that you use it (and perhaps examine the compiler asm output to see how well the asm was integrated into your C++ code).

VS 2005 can target SSE2 with a compiler option.

I find that inverse sqrt is the best optimization, as you can save a divide along with getting a faster sqrt. Examples to benchmark are below.


// Intel's sqrtInverse:
inline __declspec(naked) float __fastcall _sqrtInverse(float a) {
__asm {
mov eax,0be6eb508h
mov DWORD PTR [esp-12],03fc00000h // 1.5 on the stack
sub eax, DWORD PTR [esp+4] // a
sub DWORD PTR [esp+4],800000h // a/2 a=Y0
shr eax,1 // first approx. in eax=R0
mov DWORD PTR [esp-8],eax

fld DWORD PTR [esp-8] // r
fmul st,st // r*r
fld DWORD PTR [esp-8] // r
fxch st(1)
fmul DWORD PTR [esp+4] // a ; r*r*y0
fld DWORD PTR [esp-12] // load 1.5
fld st(0)
fsub st,st(2) // r1 = 1.5 - y1
// x1 = st(3)
// y1 = st(2)
// 1.5 = st(1)
// r1 = st(0)

fld st(1)
fxch st(1)
fmul st(3),st // y2=y1*r1*...
fmul st(3),st // y2=y1*r1*r1
fmulp st(4),st // x2=x1*r1
fsub st,st(2) // r2=1.5-y2
// x2=st(3)
// y2=st(2)
// 1.5=st(1)
// r2 = st(0)

fmul st(2),st // y3=y2*r2*...
fmul st(3),st // x3=x2*r2
fmulp st(2),st // y3=y2*r2*r2
fxch st(1)
fsubp st(1),st // r3= 1.5 - y3
// x3 = st(1)
// r3 = st(0)
fmulp st(1),st
ret 4
}
}

// From http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf
typedef float flt; // 'flt' used throughout the snippets below; it's just float
inline flt sqrtInverse2(flt n) { // Appears slower than the asm above.
flt halfn = .5f*n;
int i = *(int *)&n;
i = 0x5f375a86 - (i>>1); // 0x5f3759df: original number.
n = *(flt *)&i;
n = n*(1.5f-halfn*n*n);
return n;
} // sqrtInverse2

inline flt sqrtInverse(flt n) { if (n > RSQR(1.f/6.5192024e+018)) return _sqrtInverse(n); else return 0.f; }
inline flt rSqrt(flt n) { return n*sqrtInverse(n); }

// SSE

inline flt sqrtInverse(flt n) {
__asm {
movss xmm0,n
rsqrtss xmm0,xmm0
movss n, xmm0
}
return n;
} // sqrtInverse

inline bool lessThanOrEqualZero(flt f) {
return (long &)f <= 0;
} // lessThanOrEqualZero

inline float rSqrt(float n) {
// if (n <= 0.f) return 0.f; // 859
if (lessThanOrEqualZero(n)) return 0.f; // 844
__asm {
rsqrtss xmm0,n
mulss xmm0,n
movss n,xmm0
}
return n;
}

inline flt rSqrt(flt n) { // 875
__asm {
movss xmm0,n
// rsqrtss xmm0,xmm0
// rcpss xmm0,xmm0 // The rcpss version might be slightly faster. Each version of SSE sqrt has about the same precision.
sqrtss xmm0,xmm0
movss n, xmm0
}
return n;
} // rSqrt






Keep in mind that if SSEx is used, the code won't run on older computers without SSEx support (you'll need alternate code paths, self-modifying code (which may be obsolete soon), or separate DLLs).
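The alternate-code-paths idea can be sketched with a function pointer selected once at startup (the feature probe here is a deliberate placeholder; a real check would use CPUID or a compiler builtin):

```cpp
#include <cmath>

// Two implementations of the same operation: a generic fallback and
// a stand-in for the SSE-accelerated path.
static float sqrt_generic(float x) { return std::sqrt(x); }
static float sqrt_fast(float x)    { return std::sqrt(x); } // would be the SSE version

// Placeholder feature probe -- replace with a real CPUID/feature check.
static bool cpuHasSSE() { return true; }

// Selected once; callers go through the pointer and never branch again.
static float (*sqrt_dispatch)(float) = cpuHasSSE() ? sqrt_fast : sqrt_generic;
```

Callers just use `sqrt_dispatch(x)`; the dispatch cost is one indirect call, paid instead of a per-call feature test.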

Rockoon1    104

Suppose you need to calculate square roots (N) times per second, and that it takes (DT) time to actually calculate one of your square roots.

Then (N*DT) is the amount of time consumed by square roots during one second.

If (N*DT) is very small, then there is absolutely no reason to consider optimising it.

Of course, there are complications involved in measuring (DT)... but you don't really need to know (DT) to make some inferences.

Suppose you decide that you will optimise your square roots if they are consuming at least 1% of your CPU time. Now, if N = 1000, then (DT) would have to be at least (.01 / 1000), which is 1/100,000th of a second! That's about 20,000 CPU cycles on a 2 GHz machine!

A square root, while expensive compared to most other basic math operations, is still a fairly cheap operation. It's not like it will blow off 20,000 cycles; I can't imagine it taking more than 100 cycles even in a poorly designed Newton iterator.

The SSE method is of course extremely fast, but the value of that speed simply isn't there if (N) is small.
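The budget arithmetic above can be made explicit in a one-liner (names are mine, just to keep the units straight):

```cpp
// Fraction of one second spent on sqrt: N calls/sec, each costing
// `cycles` cycles, on a CPU running at `hz` cycles/sec.
double sqrtLoadFraction(double n, double cycles, double hz) {
    return n * cycles / hz;
}
```

Plugging in the post's numbers: 1000 calls at even a pessimistic 100 cycles on a 2 GHz machine is 1000*100/2e9 = 0.005% of the CPU, nowhere near the 1% threshold.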

Mastaba    761
Just a minor correction: SQRTSS is a member of the SSE instruction set, not SSE2 as the thread title suggests. Also, SSE really shines when you use the SIMD instructions, and SQRTSS is not a SIMD instruction. See if you can rearrange your logic and batch some square roots together so that you can use SQRTPS instead.
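A minimal sketch of the batching idea, using the SQRTPS intrinsic `_mm_sqrt_ps` to take four square roots per instruction (the function name is mine; unaligned loads/stores are used so the sketch works with any float arrays):

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Four square roots at once via SQRTPS.
inline void sqrt4(const float in[4], float out[4]) {
    _mm_storeu_ps(out, _mm_sqrt_ps(_mm_loadu_ps(in)));
}
```

With 16-byte-aligned data, `_mm_load_ps`/`_mm_store_ps` would be the faster choice.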

johnb    351
I would look at your code to see why you're calling sqrt so often. As others have said, it's not that slow an operation in itself, so if you're spending many percent of your CPU time in it, there may be other issues with your code making you call it too much.

E.g. square root is used to calculate distance, but in many if not most such cases you can avoid it. Suppose you want to find the nearest of a set of objects to a point, so you work out the distance to each object and sort them to find the nearest. But you don't need to take the square roots: instead work out the squared distances (dx^2 + dy^2 + dz^2) and sort and compare those. Only take the square root of the nearest distance if you need it for some other calculation.
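The squared-distance trick can be sketched like so (the `Pt` struct and helper names are mine; note that no sqrt is taken anywhere, because sqrt is monotonic and so preserves the ordering):

```cpp
#include <cstddef>

struct Pt { float x, y, z; };

// Squared distance: same ordering as true distance, no sqrt needed.
inline float distSq(const Pt& a, const Pt& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx*dx + dy*dy + dz*dz;
}

// Index of the point nearest to p. Only the caller, if it actually
// needs the distance itself, ever takes a square root.
size_t nearest(const Pt* pts, size_t n, const Pt& p) {
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (distSq(pts[i], p) < distSq(pts[best], p))
            best = i;
    return best;
}
```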

meisawesome    186
I think that rewriting the normal math functions is a waste of time unless you want a function with less precision.
Also, you don't need SSE2 for square roots.


inline float ASMsqrt(float src)
{
	float t;
	_asm
	{
		fld src
		fsqrt
		fstp t
	}
	return t;
}

John Schultz    811
Quote:
Original post by meisawesome
I think that rewriting the normal math functions is a waste of time unless you want a function with less precision.
Also, you don't need SSE2 for square roots.

*** Source Snippet Removed ***


Believe it or not, the Intel sqrt inverse approximation followed by a multiply benchmarks faster than the fsqrt instruction. The SSEx versions are even faster. Benchmarking is important as sometimes "optimizations" run slower than the original code.

Another issue when considering SSEx instructions when writing game code: the newer, faster CPUs will run faster with SSEx, but the performance improvements really need to be made available to the older processors running at slower clock speeds (which may not have the SSEx instructions). Thus, it's usually better to change the methods/algorithms or reduce the number of items/objects requiring expensive ops.

meisawesome    186
Quote:
Original post by John Schultz
Quote:
Original post by meisawesome
I think that rewriting the normal math functions is a waste of time unless you want a function with less precision.
Also, you don't need SSE2 for square roots.

*** Source Snippet Removed ***


Believe it or not, the Intel sqrt inverse approximation followed by a multiply benchmarks faster than the fsqrt instruction. The SSEx versions are even faster. Benchmarking is important as sometimes "optimizations" run slower than the original code.

Another issue when considering SSEx instructions when writing game code: the newer, faster CPUs will run faster with SSEx, but the performance improvements really need to be made available to the older processors running at slower clock speeds (which may not have the SSEx instructions). Thus, it's usually better to change the methods/algorithms or reduce the number of items/objects requiring expensive ops.


I decided to benchmark to see if your statement was true.
While your functions might be better for inverse square roots, when I un-invert them the benchmark says mine are faster:
NOTE: I removed 2 functions because they crashed.
NOTE: Windows 2000/XP+ only


#include <math.h>
#include <string>
#include <iostream>
#include <iomanip>
#include <windows.h>

static const int Count = 50000000;

typedef float (*tSqrtFunction) (float);

// From http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf
inline float sqrtInverse2(float n)
{
float halfn = 0.5f * n;
int i = *(int*) &n;
i = 0x5f375a86 - (i >> 1); // 0x5f3759df: original number.
n = *(float*) &i;
n = n * (1.5f - halfn * n * n);
return 1.0f / n;
}

//SSE1
float invsqrt(float n)
{
_asm
{
movss xmm0, n;
rsqrtss xmm0, xmm0;
movss n, xmm0;
}

return 1.0f / n;
}

//SSE2
float SSE2sqrt(float src)
{
float t;

_asm
{
movupd xmm0, src;
movupd xmm1, t;
sqrtss xmm1, xmm0;
movupd t, xmm1;
}
return t;
}

//ASM
float ASMsqrt(float src)
{
float t;

_asm
{
fld src;
fsqrt;
fstp t;
}
return t;
}

inline void TestFunction(const std::wstring& Name, tSqrtFunction SqrtFunction, int Count)
{
float f;
DWORD StartTime, EndTime, TotalTime;

StartTime = GetTickCount();
for (int i = 0;i < Count;i++)
{
f = SqrtFunction(10.f);
}
EndTime = GetTickCount();
TotalTime = EndTime - StartTime;

std::wcout << L"Function " << Name << L" f*f=" << f * f << L" took: " << TotalTime << std::endl;
}

int main(int argc, char* argv[])
{
std::wcout << std::setprecision(10) << std::endl;

TestFunction(L"sqrtf", sqrtf, Count);
TestFunction(L"ASMsqrt", ASMsqrt, Count);
TestFunction(L"invsqrt", invsqrt, Count);
//TestFunction(L"SSE2sqrt", SSE2sqrt, Count);
TestFunction(L"sqrtInverse2", sqrtInverse2, Count);

return 0;
}



The results i got were:
Function sqrtf f*f=10 took: 1203
Function ASMsqrt f*f=10 took: 625
Function invsqrt f*f=10.0003 took: 641
Function sqrtInverse2 f*f=10.0344 took: 1125

John Schultz    811
Quote:
Original post by meisawesome
I decided to benchmark to see if your statement was true.


Cool. Though remember that you need to benchmark code in a real app to see if the "optimization" really is one. Additionally, it's important to examine the asm output of the compiler (to make sure it's not optimizing away important code, rendering the benchmark worthless).

Quote:
Original post by meisawesome
While your functions might be better for inverse square roots, when I un-invert them the benchmark says mine are faster:


Change your code and benchmark again with:

float invsqrt(float n)
{
float rsn;
_asm
{
movss xmm0, n;
rsqrtss xmm0, xmm0;
movss rsn, xmm0;
}

return n*rsn;
}



Also try this version (from my timing numbers in comments, this version was the fastest):

inline float rSqrt(float n) {
__asm {
rsqrtss xmm0,n
mulss xmm0,n
movss n,xmm0
}
return n;
}



When normalizing vectors, etc., where 1/sqrt(n) is needed, these versions are significantly faster than 1/fsqrt(n).
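The identity being exploited in these versions is sqrt(n) = n * (1/sqrt(n)), which turns a sqrt-then-divide into a reciprocal-sqrt-then-multiply. A plain-C++ sketch of the substitution (using 1/std::sqrt as a stand-in for rsqrtss, so the identity itself can be checked in isolation):

```cpp
#include <cmath>

// sqrt(n) recovered as n * (1/sqrt(n)): one multiply replaces a
// divide whenever a reciprocal-sqrt instruction or approximation
// (rsqrtss, the Intel/Lomont tricks above) is available.
inline float sqrtFromRsqrt(float n) {
    float r = 1.0f / std::sqrt(n); // stand-in for the fast rsqrt
    return n * r;
}
```

With the real rsqrtss approximation the result carries its relative error (roughly 12 bits), which is the precision trade the thread keeps coming back to.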

The following should be very fast for normalizing a vector:

void normalize(Vec3 * v) {
__asm {
mov eax,v
movss xmm0,[eax]Vec3.x
movss xmm1,[eax]Vec3.y
movss xmm2,[eax]Vec3.z

mulss xmm0,xmm0
mulss xmm1,xmm1
mulss xmm2,xmm2

addss xmm0,xmm1
addss xmm2,xmm0

movss xmm4,[eax]Vec3.x
movss xmm5,[eax]Vec3.y
movss xmm6,[eax]Vec3.z

rsqrtss xmm2,xmm2

mulss xmm4,xmm2
mulss xmm5,xmm2
mulss xmm6,xmm2

movss [eax]Vec3.x,xmm4
movss [eax]Vec3.y,xmm5
movss [eax]Vec3.z,xmm6
} // asm
} // normalize



Note that the above code will normalize vectors to (very) near unit length (not quite as accurate as the slower fsqrt, etc.). Data alignment will also affect performance...

Quote:
Original post by meisawesome
NOTE: I removed 2 functions because they crashed.


See the documentation for MOVUPD to see why that version crashes (or see my comments above).
The Intel version should compile and work OK with VC... In application benchmarks with heavy sqrt/invSqrt use, the Intel version was shown to be 5-10% faster (in frame rate) than the intrinsic (compiler-generated direct calls to sqrtf).

John Schultz    811
Here's another useful asm optimization that really should be part of Standard C++:

inline void fastSinCos(float radians,float & s,float & c) {
__asm {
fld radians
fsincos
mov ecx,[c]
mov edx,[s]
fstp DWORD PTR[ecx]
fstp DWORD PTR[edx]
} // __asm
} // fastSinCos




Get sine and cosine for about the same cost as just sine or cosine. In "olden times" we used lookup tables (sometimes with linear interpolation between entries to improve precision). Today, memory performance is more critical and lookup tables aren't necessarily faster anymore. Short Taylor series expansions can also be reasonably fast (and allow tailored precision (pun not intended)).
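As an illustration of the short-Taylor-series option, here is a minimal truncated Maclaurin sine (my own sketch, not from the thread; usable for |x| up to roughly pi/2, with precision tailored by how many terms you keep):

```cpp
#include <cmath>

// Maclaurin sine through the x^7 term, in Horner form:
//   x - x^3/6 + x^5/120 - x^7/5040
// Good to ~1e-8 relative error near zero; degrades past |x| ~ pi/2,
// so reduce the argument into that range first.
inline float taylorSin(float x) {
    float x2 = x * x;
    return x * (1.0f - x2/6.0f * (1.0f - x2/20.0f * (1.0f - x2/42.0f)));
}
```

Dropping or adding a nesting level trades precision for one multiply/subtract per term, which is exactly the tailoring knob the post mentions.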

