sse2 replacements for some maths functions

Hi Everyone,

In my current engine I make a lot of calls to the standard math function sqrt. I was wondering whether it is worth using SSE2 instructions and a bit of assembly to do this instead. I have come up with this, and it appears to work:

float SSE2sqrt(float src)
{
    float t;
    _asm
    {
        movupd xmm0, src
        movupd xmm1, t
        sqrtss xmm1, xmm0
        movupd t, xmm1
    }
    return t;
}

Do you think trying a few of these kinds of optimizations is worth it, or just a waste of time?
Depends.
- Is this actually a bottleneck in your app? Or is it premature optimization and a waste of time? Do you actually need to optimize anything at this point? If so, *what* do you need to optimize? Have you profiled your app?
- Is your SSE version actually faster? By how much? Enough to make a difference?
- What about CPUs that don't have SSE2?
Oh, and why not use the compiler intrinsics to call the SSE instructions? In the worst case, it'll yield the same code. In the best case, it'll allow the compiler to optimize more efficiently.
[MOVUPD moves two double-precision FP values into the destination register; why move stack garbage into xmm1, then overwrite it with the sqrt of float src (plus garbage)? It also appears that the last MOVUPD will trash the stack after t (unless the compiler fixes it by observing that t is a single float).]
It can make a difference, sometimes a huge difference, but you have to benchmark the custom asm in the context in which you use it (and perhaps examine the compiler's asm output to see how well the asm was integrated into your C++ code).
VS 2005 can target SSE2 with a compiler option.
I find that inverse sqrt is the best optimization, as you can save a divide along with a faster sqrt. Examples to benchmark are below.
// Intel's sqrtInverse:
inline __declspec(naked) float __fastcall _sqrtInverse(float a) {
    __asm {
        mov eax,0be6eb508h
        mov DWORD PTR [esp-12],03fc00000h // 1.5 on the stack
        sub eax, DWORD PTR [esp+4]        // a
        sub DWORD PTR [esp+4],800000h     // a/2 a=Y0
        shr eax,1                         // first approx. in eax=R0
        mov DWORD PTR [esp-8],eax
        fld DWORD PTR [esp-8]             // r
        fmul st,st                        // r*r
        fld DWORD PTR [esp-8]             // r
        fxch st(1)
        fmul DWORD PTR [esp+4]            // a ; r*r*y0
        fld DWORD PTR [esp-12]            // load 1.5
        fld st(0)
        fsub st,st(2)                     // r1 = 1.5 - y1
        // x1 = st(3)
        // y1 = st(2)
        // 1.5 = st(1)
        // r1 = st(0)
        fld st(1)
        fxch st(1)
        fmul st(3),st                     // y2=y1*r1*...
        fmul st(3),st                     // y2=y1*r1*r1
        fmulp st(4),st                    // x2=x1*r1
        fsub st,st(2)                     // r2=1.5-y2
        // x2=st(3)
        // y2=st(2)
        // 1.5=st(1)
        // r2 = st(0)
        fmul st(2),st                     // y3=y2*r2*...
        fmul st(3),st                     // x3=x2*r2
        fmulp st(2),st                    // y3=y2*r2*r2
        fxch st(1)
        fsubp st(1),st                    // r3= 1.5 - y3
        // x3 = st(1)
        // r3 = st(0)
        fmulp st(1),st
        ret 4
    }
}

// From http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf
inline flt sqrtInverse2(flt n) { // Appears slower than ASM above.
    flt halfn = .5f*n;
    int i = *(int *)&n;
    i = 0x5f375a86 - (i>>1); // 0x5f3759df: original number.
    n = *(flt *)&i;
    n = n*(1.5f-halfn*n*n);
    return n;
} // sqrtInverse2

inline flt sqrtInverse(flt n) {
    if (n > RSQR(1.f/6.5192024e+018)) return _sqrtInverse(n);
    else return 0.f;
}

inline flt rSqrt(flt n) { return n*sqrtInverse(n); }

// SSE
inline flt sqrtInverse(flt n) {
    __asm {
        movss xmm0,n
        rsqrtss xmm0,xmm0
        movss n, xmm0
    }
    return n;
} // sqrtInverse

inline bool lessThanOrEqualZero(flt f) {
    return (long &)f <= 0;
} // lessThanOrEqualZero

inline float rSqrt(float n) {
//  if (n <= 0.f) return 0.f;            // 859
    if (lessThanOrEqualZero(n)) return 0.f; // 844
    __asm {
        rsqrtss xmm0,n
        mulss xmm0,n
        movss n,xmm0
    }
    return n;
}

inline flt rSqrt(flt n) { // 875
    __asm {
        movss xmm0,n
//      rsqrtss xmm0,xmm0
//      rcpss xmm0,xmm0 // The rcpss version might be slightly faster. Each version of SSE sqrt has about the same precision.
        sqrtss xmm0,xmm0
        movss n, xmm0
    }
    return n;
} // rSqrt
Keep in mind that if SSEx is used, the code won't run on older computers without SSEx support (requiring either alternate code paths, self-modifying code (which may be obsolete soon), or separate .dll's).
Suppose you need to calculate square roots (N) times per second, and that it takes (DT) time to actually calculate one of your square roots.
Then (N*DT) is the amount of time consumed by square roots during one second.
If (N*DT) is very small, then there is absolutely no reason to consider optimising it.
Of course, there are complications involved in measuring (DT), but you don't really need to know (DT) to make some inferences.
Suppose you decide that if your square roots are consuming at least 1% of your CPU time, you will optimise them. Now, if N = 1000, then (DT) would have to be at least (.01 / 1000), which is 1/100,000th of a second. That's about 20,000 CPU cycles on a 2 GHz machine!
A square root, while expensive compared to most other basic math operations, is still a fairly cheap operation. It's not as if it will blow 20,000 cycles; I can't imagine it taking more than 100 cycles, even in a poorly designed Newton iterator.
The SSE method is of course extremely fast, but the value of that speed simply isn't there if (N) is small.
Just a minor correction, SQRTSS is a member of the SSE instruction set, not SSE2 as the thread title suggests. Also, SSE really shines best when you use the SIMD instructions, and SQRTSS is not a SIMD instruction. See if you can rearrange your logic and batch some square roots together so that you may use SQRTPS instead.
I would look at your code to see why you're calling sqrt so often. As others have said, it's not that slow an operation in itself, so if you're spending a significant percentage of your CPU time in it, there may be some other issue with your code making you call it too much.
E.g. square root is used to calculate distance, but in many if not most such cases you can avoid it. Suppose you want to find the nearest of a set of objects to a point: you work out the distance to each object and sort them to find the nearest. But you don't need to work out the square roots; instead, work out the squared distances (dx^2 + dy^2 + dz^2), then sort and compare them. Only take the square root to get the distance of the nearest if you need it for some other calculation.
I think that rewriting the normal math functions is a waste of time unless you want a function with less precision.
Also, you don't need SSE2 for square roots.
inline float ASMsqrt(float src)
{
    float t;
    _asm
    {
        fld src
        fsqrt
        fstp t
    }
    return t;
}
Quote: Original post by meisawesome
I think that rewriting the normal math functions is a waste of time unless you want a function with less precision.
Also, you don't need SSE2 for square roots.
*** Source Snippet Removed ***
Believe it or not, the Intel inverse-sqrt approximation followed by a multiply benchmarks faster than the fsqrt instruction. The SSEx versions are even faster. Benchmarking is important, as sometimes "optimizations" run slower than the original code.
Another issue when considering SSEx instructions when writing game code: the newer, faster CPUs will run faster with SSEx, but the performance improvements really need to be made available to the older processors running at slower clock speeds (which may not have the SSEx instructions). Thus, it's usually better to change the methods/algorithms or reduce the number of items/objects requiring expensive ops.
Quote: Original post by John Schultz
Quote: Original post by meisawesome
I think that rewriting the normal math functions is a waste of time unless you want a function with less precision.
Also, you don't need SSE2 for square roots.
*** Source Snippet Removed ***
Believe it or not, the Intel inverse-sqrt approximation followed by a multiply benchmarks faster than the fsqrt instruction. The SSEx versions are even faster. Benchmarking is important, as sometimes "optimizations" run slower than the original code.
Another issue when considering SSEx instructions when writing game code: the newer, faster CPUs will run faster with SSEx, but the performance improvements really need to be made available to the older processors running at slower clock speeds (which may not have the SSEx instructions). Thus, it's usually better to change the methods/algorithms or reduce the number of items/objects requiring expensive ops.
I decided to benchmark to see if your statement was true:
While your functions might be better for inverse square roots, when I un-invert them the benchmark says mine are faster:
NOTE: I removed 2 functions because they crashed.
NOTE: Windows 2000/XP+ only
#include <math.h>
#include <string>
#include <iostream>
#include <iomanip>
#include <windows.h>

static const int Count = 50000000;

typedef float (*tSqrtFunction) (float);

// From http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf
inline float sqrtInverse2(float n)
{
    float halfn = 0.5f * n;
    int i = *(int*) &n;
    i = 0x5f375a86 - (i >> 1); // 0x5f3759df: original number.
    n = *(float*) &i;
    n = n * (1.5f - halfn * n * n);
    return 1.0f / n;
}

// SSE1
float invsqrt(float n)
{
    _asm
    {
        movss xmm0, n;
        rsqrtss xmm0, xmm0;
        movss n, xmm0;
    }
    return 1.0f / n;
}

// SSE2
float SSE2sqrt(float src)
{
    float t;
    _asm
    {
        movupd xmm0, src;
        movupd xmm1, t;
        sqrtss xmm1, xmm0;
        movupd t, xmm1;
    }
    return t;
}

// ASM
float ASMsqrt(float src)
{
    float t;
    _asm
    {
        fld src;
        fsqrt;
        fstp t;
    }
    return t;
}

inline void TestFunction(const std::wstring& Name, tSqrtFunction SqrtFunction, int Count)
{
    float f;
    DWORD StartTime, EndTime, TotalTime;
    StartTime = GetTickCount();
    for (int i = 0; i < Count; i++)
    {
        f = SqrtFunction(10.f);
    }
    EndTime = GetTickCount();
    TotalTime = EndTime - StartTime;
    std::wcout << L"Function " << Name << L" f*f=" << f * f << L" took: " << TotalTime << std::endl;
}

int main(int argc, char* argv[])
{
    std::wcout << std::setprecision(10) << std::endl;
    TestFunction(L"sqrtf", sqrtf, Count);
    TestFunction(L"ASMsqrt", ASMsqrt, Count);
    TestFunction(L"invsqrt", invsqrt, Count);
    //TestFunction(L"SSE2sqrt", SSE2sqrt, Count);
    TestFunction(L"sqrtInverse2", sqrtInverse2, Count);
    return 0;
}
The results I got were:
Function sqrtf f*f=10 took: 1203
Function ASMsqrt f*f=10 took: 625
Function invsqrt f*f=10.0003 took: 641
Function sqrtInverse2 f*f=10.0344 took: 1125
Quote: Original post by meisawesome
I decided to benchmark to see if your statement was true:
Cool. Though remember that you need to benchmark code in a real app to see if the "optimization" really is one. Additionally, it's important to examine the asm output of the compiler (to make sure it's not optimizing away important code, rendering the benchmark worthless).
Quote: Original post by meisawesome
While your functions might be better for inverse square roots, when I un-invert them the benchmark says mine are faster:
Change your code and benchmark again with:
float invsqrt(float n)
{
    float rsn;
    _asm
    {
        movss xmm0, n;
        rsqrtss xmm0, xmm0;
        movss rsn, xmm0;
    }
    return n * rsn;
}
Also try this version (from my timing numbers in comments, this version was the fastest):
inline float rSqrt(float n)
{
    __asm
    {
        rsqrtss xmm0, n
        mulss xmm0, n
        movss n, xmm0
    }
    return n;
}
When normalizing vectors, etc., where 1/sqrt(n) is needed, these versions are significantly faster than 1/fsqrt(n).
The following should be very fast for normalizing a vector:
void normalize(Vec3 *v)
{
    __asm
    {
        mov eax, v
        movss xmm0, [eax]Vec3.x
        movss xmm1, [eax]Vec3.y
        movss xmm2, [eax]Vec3.z
        mulss xmm0, xmm0
        mulss xmm1, xmm1
        mulss xmm2, xmm2
        addss xmm0, xmm1
        addss xmm2, xmm0
        movss xmm4, [eax]Vec3.x
        movss xmm5, [eax]Vec3.y
        movss xmm6, [eax]Vec3.z
        rsqrtss xmm2, xmm2
        mulss xmm4, xmm2
        mulss xmm5, xmm2
        mulss xmm6, xmm2
        movss [eax]Vec3.x, xmm4
        movss [eax]Vec3.y, xmm5
        movss [eax]Vec3.z, xmm6
    } // asm
} // normalize
Note that the above code will normalize vectors to (very) near unit length (not quite as accurate as the slower fsqrt, etc.). Data alignment will also affect performance...
Quote: Original post by meisawesome
NOTE: I removed 2 functions because they crashed.
See the documentation for MOVUPD to see why that version crashes (or see my comments above).
The Intel version should compile and work OK with VC... In application benchmarks with heavy sqrt/invSqrt use, the Intel version was shown to be 5-10% faster (frame rate) than the intrinsic (compiler-generated direct calls to sqrtf).