SSE2 replacements for some maths functions

10 comments, last by John Schultz 17 years, 12 months ago
Hi everyone. In my current engine I make a lot of calls to the standard math function sqrt. I was wondering if it is worth using SSE2 instructions and a bit of assembly to do this instead. I have come up with the following, and it appears to work.

float SSE2sqrt(float src)
{
	float t;
	_asm
	{
		movupd xmm0, src
		movupd xmm1,t
		sqrtss xmm1,xmm0
		movupd t,xmm1
	}
	return t;
}

Do you think optimizations of this type are worth trying, or are they just a waste of time?
Depends.

- Is this actually a bottleneck in your app? Or is it premature optimization/a waste of time? Do you actually need to optimize anything at this point? If so, *what* do you need to optimize? Have you profiled your app?
- Is your SSE version actually faster? By how much? Enough to make a difference?
- What about CPUs that don't have SSE2?

Oh, and why not use the compiler intrinsics to call the SSE instructions? In the worst case, it'll yield the same code. In the best case, it'll allow the compiler to optimize more efficiently.
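For reference, a minimal sketch of what the intrinsic route might look like (`_mm_sqrt_ss` and friends are declared in `<xmmintrin.h>`; the function name here is just illustrative):

```cpp
#include <xmmintrin.h> // SSE intrinsics (available in MSVC and GCC/Clang)

// Scalar single-precision square root via the SQRTSS instruction.
// Because the compiler handles register allocation itself, it can
// inline and schedule this far better than an _asm block.
inline float sse_sqrt(float x)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}
```

SQRTSS is correctly rounded, so this matches sqrtf's result while leaving the optimizer free to work around it.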
[MOVUPD moves two double-precision FP values into the destination register; why move stack garbage into xmm1, then overwrite it with the sqrt of float src (+garbage)? It also appears that the last MOVUPD is going to trash the stack after t (unless the compiler fixes it by observing that t is a (single) float).]

It can make a difference, sometimes a huge difference, but you have to benchmark the custom asm in the context that you use it (and perhaps examine the compiler asm output to see how well the asm was integrated into your C++ code).

VS 2005 can target SSE2 with a compiler option.

I find that inverse sqrt is the best optimization, as you can save a divide along with getting a faster sqrt. Examples to benchmark below.

// Intel's sqrtInverse:
inline __declspec(naked) float __fastcall _sqrtInverse(float a) {
  __asm {
    mov   eax,0be6eb508h
    mov   DWORD PTR [esp-12],03fc00000h // 1.5 on the stack
    sub   eax, DWORD PTR [esp+4]        // a
    sub   DWORD PTR [esp+4],800000h     // a/2 a=Y0
    shr   eax,1                         // first approx. in eax=R0
    mov   DWORD PTR [esp-8],eax
    fld   DWORD PTR [esp-8]             // r
    fmul  st,st                         // r*r
    fld   DWORD PTR [esp-8]             // r
    fxch  st(1)
    fmul  DWORD PTR [esp+4]             // a ; r*r*y0
    fld   DWORD PTR [esp-12]            // load 1.5
    fld   st(0)
    fsub  st,st(2)                      // r1 = 1.5 - y1
                                        // x1 = st(3)
                                        // y1 = st(2)
                                        // 1.5 = st(1)
                                        // r1 = st(0)
    fld   st(1)
    fxch  st(1)
    fmul  st(3),st                      // y2=y1*r1*...
    fmul  st(3),st                      // y2=y1*r1*r1
    fmulp st(4),st                      // x2=x1*r1
    fsub  st,st(2)                      // r2=1.5-y2
                                        // x2=st(3)
                                        // y2=st(2)
                                        // 1.5=st(1)
                                        // r2 = st(0)
    fmul  st(2),st                      // y3=y2*r2*...
    fmul  st(3),st                      // x3=x2*r2
    fmulp st(2),st                      // y3=y2*r2*r2
    fxch  st(1)
    fsubp st(1),st                      // r3= 1.5 - y3
                                        // x3 = st(1)
                                        // r3 = st(0)
    fmulp st(1),st
    ret 4
  }
}

// From http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf
inline flt sqrtInverse2(flt n) { // Appears slower than ASM above.
  flt halfn = .5f*n;
  int i = *(int *)&n;
  i = 0x5f375a86 - (i>>1); // 0x5f3759df: original number.
  n = *(flt *)&i;
  n = n*(1.5f-halfn*n*n);
  return n;
} // sqrtInverse2

inline flt sqrtInverse(flt n) {
  if (n > RSQR(1.f/6.5192024e+018)) return _sqrtInverse(n);
  else return 0.f;
}

inline flt rSqrt(flt n) { return n*sqrtInverse(n); }

// SSE
inline flt sqrtInverse(flt n) {
  __asm {
    movss   xmm0,n
    rsqrtss xmm0,xmm0
    movss   n,   xmm0
  }
  return n;
} // sqrtInverse

inline bool lessThanOrEqualZero(flt f) {
  return (long &)f <= 0;
} // lessThanOrEqualZero

inline float rSqrt(float n) {
  //  if (n <= 0.f) return 0.f; // 859
  if (lessThanOrEqualZero(n)) return 0.f; // 844
  __asm {
    rsqrtss xmm0,n
    mulss   xmm0,n
    movss   n,xmm0
  }
  return n;
}

inline flt rSqrt(flt n) { // 875
  __asm {
    movss   xmm0,n
//    rsqrtss xmm0,xmm0
//    rcpss   xmm0,xmm0 // The rcpss version might be slightly faster. Each version of SSE sqrt has about the same precision.
    sqrtss  xmm0,xmm0
    movss   n,   xmm0
  }
  return n;
} // rSqrt


Keep in mind that if SSEx is used, the code won't run on older computers without SSEx support (you'll need either alternate code paths, self-modifying code (may be obsolete soon), or separate .dlls).
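Before taking an SSE code path at runtime, you need a feature check. A sketch using the CPUID instruction (here via GCC/Clang's `__get_cpuid` helper; MSVC offers `__cpuid` in `<intrin.h>` instead):

```cpp
#include <cpuid.h> // GCC/Clang CPUID helper; MSVC uses __cpuid from <intrin.h>

// CPUID leaf 1 reports feature flags in EDX: bit 25 = SSE, bit 26 = SSE2.
bool has_sse2()
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false; // leaf 1 not supported on this (very old) CPU
    return (edx & (1u << 26)) != 0;
}
```

A common pattern is to test once at startup and set a function pointer to either the SSE routine or the plain-C fallback.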

Suppose you need to calculate square roots (N) times per second, and that it takes (DT) time to actually calculate one of your square roots.

Then (N*DT) is the amount of time consumed by square roots during 1 second.

If (N*DT) is very small, then there is absolutely no reason to consider optimising it.

Of course, there are complications involved in measuring (DT)... but you don't really need to know (DT) to make some inferences.

Suppose you decide that if your square roots are consuming at least 1% of your CPU time, you will optimise them. Now, if N = 1000 then (DT) would have to be at least (.01 / 1000), which is 1/100,000th of a second! That's about 20,000 CPU cycles on a 2 GHz machine!

A square root, while expensive compared to most other basic math operations, is still a fairly cheap operation. It's not as if it will blow off 20,000 cycles; I can't imagine it taking more than 100 cycles even in a poorly designed Newton iterator.

The SSE method is of course extremely fast, but the value of that speed simply isn't there if (N) is small.
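The back-of-the-envelope budget above can be spelled out in a few lines (the function name is just for illustration):

```cpp
// Cycles each sqrt would have to cost before it consumed
// `budget_fraction` of a CPU running at `hz` cycles/second,
// given `n` sqrt calls per second.
double cycle_budget(double budget_fraction, double n, double hz)
{
    return budget_fraction / n * hz;
}
// e.g. cycle_budget(0.01, 1000.0, 2e9) -- 1% budget, 1000 calls/s, 2 GHz
```

With the thread's numbers this works out to roughly 20,000 cycles per call before sqrt even registers at the 1% level.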

Just a minor correction, SQRTSS is a member of the SSE instruction set, not SSE2 as the thread title suggests. Also, SSE really shines best when you use the SIMD instructions, and SQRTSS is not a SIMD instruction. See if you can rearrange your logic and batch some square roots together so that you may use SQRTPS instead.
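Batching with SQRTPS might look something like this (a sketch using the `_mm_sqrt_ps` intrinsic; it assumes the element count is a multiple of four):

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Four square roots per SQRTPS instruction. Unaligned load/store
// (_mm_loadu_ps/_mm_storeu_ps) means the caller need not align the
// arrays; the aligned variants are faster when 16-byte alignment
// can be guaranteed.
void sqrt4(const float* in, float* out, int count) // count: multiple of 4
{
    for (int i = 0; i < count; i += 4)
        _mm_storeu_ps(out + i, _mm_sqrt_ps(_mm_loadu_ps(in + i)));
}
```

One instruction now does the work of four scalar sqrts, which is where SSE's SIMD throughput actually pays off.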
I would look at your code to see why you're calling sqrt so often. As others have said it's not that slow an operation in itself, so if you're spending many % of your CPU time in it there may be some other issues with your code making you call it too much.

E.g. square root is used to calculate distance, but in many if not most such cases you can avoid it. Suppose you want to find the nearest of a set of objects to a point: you could work out the distance to each object and sort them to find the nearest. But you don't need to work out the square roots; instead work out the squared distances (dx^2 + dy^2 + dz^2), then sort and compare those. Only take the square root to get the distance of the nearest if you need it for some other calculation.
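The nearest-object test above, sketched out in plain C++ (the minimal `Vec3` struct and function names here are hypothetical):

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };

// Squared distance: enough to rank candidates, no sqrt required.
inline float distSq(const Vec3& a, const Vec3& b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx*dx + dy*dy + dz*dz;
}

// Index of the point in `pts` nearest to `p`. Squared distances
// order the same way as real distances, so no sqrt is ever taken;
// only call sqrtf on the winner if the actual distance is needed.
std::size_t nearest(const Vec3* pts, std::size_t n, const Vec3& p)
{
    std::size_t best = 0;
    float bestSq = distSq(pts[0], p);
    for (std::size_t i = 1; i < n; ++i) {
        float dSq = distSq(pts[i], p);
        if (dSq < bestSq) { bestSq = dSq; best = i; }
    }
    return best;
}
```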
John Blackburne, Programmer, The Pitbull Syndicate
I think that rewriting the normal math functions is a waste of time unless you want a function with less precision.
Also, you don't need SSE2 for square roots.

inline float ASMsqrt(float src)
{
	float t;
	_asm
	{
		fld src
		fsqrt
		fstp t
	}
	return t;
}
Quote:Original post by meisawesome
I think that rewriting the normal math functions is a waste of time unless you want a function with less precision.
Also, you don't need SSE2 for square roots.

*** Source Snippet Removed ***


Believe it or not, the Intel sqrt inverse approximation followed by a multiply benchmarks faster than the fsqrt instruction. The SSEx versions are even faster. Benchmarking is important as sometimes "optimizations" run slower than the original code.

Another issue when considering SSEx instructions when writing game code: the newer, faster CPUs will run faster with SSEx, but the performance improvements really need to be made available to the older, slower-clocked processors (which may not have the SSEx instructions). Thus, it's usually better to change the methods/algorithms or reduce the number of items/objects requiring expensive ops.
Quote:Original post by John Schultz
Quote:Original post by meisawesome
I think that rewriting the normal math functions is a waste of time unless you want a function with less precision.
Also, you don't need SSE2 for square roots.

*** Source Snippet Removed ***


Believe it or not, the Intel sqrt inverse approximation followed by a multiply benchmarks faster than the fsqrt instruction. The SSEx versions are even faster. Benchmarking is important as sometimes "optimizations" run slower than the original code.

Another issue when considering SSEx instructions when writing game code: the newer, faster CPUs will run faster with SSEx, but the performance improvements really need to be made available to the older, slower-clocked processors (which may not have the SSEx instructions). Thus, it's usually better to change the methods/algorithms or reduce the number of items/objects requiring expensive ops.


I decided to benchmark to see if your statement was true:
While your functions might be better for inverse square roots, when I un-invert them the benchmark says mine are faster:
NOTE: I removed 2 functions because they crashed.
NOTE: Windows 2000/XP+ only

#include <math.h>
#include <string>
#include <iostream>
#include <iomanip>
#include <windows.h>

static const int Count = 50000000;

typedef float (*tSqrtFunction) (float);

// From http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf
inline float sqrtInverse2(float n)
{
	float halfn = 0.5f * n;
	int i = *(int*) &n;
	i = 0x5f375a86 - (i >> 1); // 0x5f3759df: original number.
	n = *(float*) &i;
	n = n * (1.5f - halfn * n * n);
	return 1.0f / n;
}

//SSE1
float invsqrt(float n)
{
	_asm
	{
		movss   xmm0, n;
		rsqrtss xmm0, xmm0;
		movss   n, xmm0;
	}
	return 1.0f / n;
}

//SSE2
float SSE2sqrt(float src)
{
	float t;
	_asm
	{
		movupd xmm0, src;
		movupd xmm1, t;
		sqrtss xmm1, xmm0;
		movupd t, xmm1;
	}
	return t;
}

//ASM
float ASMsqrt(float src)
{
	float t;
	_asm
	{
		fld src;
		fsqrt;
		fstp t;
	}
	return t;
}

inline void TestFunction(const std::wstring& Name, tSqrtFunction SqrtFunction, int Count)
{
	float f;
	DWORD StartTime, EndTime, TotalTime;
	StartTime = GetTickCount();
	for (int i = 0; i < Count; i++)
	{
		f = SqrtFunction(10.f);
	}
	EndTime = GetTickCount();
	TotalTime = EndTime - StartTime;
	std::wcout << L"Function " << Name << L" f*f=" << f * f << L" took: " << TotalTime << std::endl;
}

int main(int argc, char* argv[])
{
	std::wcout << std::setprecision(10) << std::endl;
	TestFunction(L"sqrtf", sqrtf, Count);
	TestFunction(L"ASMsqrt", ASMsqrt, Count);
	TestFunction(L"invsqrt", invsqrt, Count);
	//TestFunction(L"SSE2sqrt", SSE2sqrt, Count);
	TestFunction(L"sqrtInverse2", sqrtInverse2, Count);
	return 0;
}


The results I got were:
Function sqrtf f*f=10 took: 1203
Function ASMsqrt f*f=10 took: 625
Function invsqrt f*f=10.0003 took: 641
Function sqrtInverse2 f*f=10.0344 took: 1125
Quote:Original post by meisawesome
I decided to benchmark to see if your statement was true:


Cool. Though remember that you need to benchmark code in a real app to see if the "optimization" really is one. Additionally, it's important to examine the asm output of the compiler (to make sure it's not optimizing away important code, rendering the benchmark worthless).

Quote:Original post by meisawesome
While your functions might be better for inverse square roots, when I un-invert them the benchmark says mine are faster:


Change your code and benchmark again with:
float invsqrt(float n)
{
	float rsn;
	_asm
	{
		movss   xmm0, n;
		rsqrtss xmm0, xmm0;
		movss   rsn, xmm0;
	}
	return n * rsn;
}


Also try this version (from my timing numbers in comments, this version was the fastest):
inline float rSqrt(float n) {
  __asm {
    rsqrtss xmm0,n
    mulss   xmm0,n
    movss   n,xmm0
  }
  return n;
}


When normalizing vectors, etc., where 1/sqrt(n) is needed, these versions are significantly faster than 1/fsqrt(n).

The following should be very fast for normalizing a vector:
void normalize(Vec3 * v) {
  __asm {
    mov   eax,v
    movss xmm0,[eax]Vec3.x
    movss xmm1,[eax]Vec3.y
    movss xmm2,[eax]Vec3.z
    mulss xmm0,xmm0
    mulss xmm1,xmm1
    mulss xmm2,xmm2
    addss xmm0,xmm1
    addss xmm2,xmm0
    movss xmm4,[eax]Vec3.x
    movss xmm5,[eax]Vec3.y
    movss xmm6,[eax]Vec3.z
    rsqrtss xmm2,xmm2
    mulss xmm4,xmm2
    mulss xmm5,xmm2
    mulss xmm6,xmm2
    movss [eax]Vec3.x,xmm4
    movss [eax]Vec3.y,xmm5
    movss [eax]Vec3.z,xmm6
  } // asm
} // normalize


Note that the above code will normalize vectors to (very) near unit length (not quite as accurate as the slower fsqrt, etc.). Data alignment will also affect performance...
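For comparison, the same idea with intrinsics, plus one Newton-Raphson refinement step to recover most of the precision RSQRTSS throws away (a sketch; the `Vec3` struct here is an assumed three-float layout):

```cpp
#include <xmmintrin.h> // SSE intrinsics

struct Vec3 { float x, y, z; };

// Normalize via the RSQRTSS approximation (~12 bits), refined by one
// Newton-Raphson step r' = r*(1.5 - 0.5*n*r*r), which roughly squares
// the relative error; still cheaper than a full sqrt + divide.
inline void normalize_nr(Vec3* v)
{
    float n = v->x*v->x + v->y*v->y + v->z*v->z;
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(n)));
    r = r * (1.5f - 0.5f * n * r * r); // Newton-Raphson refinement
    v->x *= r; v->y *= r; v->z *= r;
}
```

Skipping the refinement line gives you the raw RSQRTSS accuracy, matching the asm version above.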

Quote:Original post by meisawesome
NOTE: I removed 2 functions because they crashed.


See the documentation for MOVUPD to see why that version crashes (or see my comments above).
The Intel version should compile and work OK with VC... In application benchmarks with heavy sqrt/invSqrt use, the Intel version was shown to be 5-10% faster (in frame rate) than the intrinsic (compiler-generated direct calls to sqrtf).

