Fast approximation for Inverse Fourth Root? (1/(x^(1/4)))
I need a fast approximation of f(x) = 1/(x^(1/4)), where x and f(x) are floats (I don't need the precision of a double).
There's plenty of info out there on fast implementation of Sqrt() and InvSqrt(), so I can just do a combined implementation of Sqrt(InvSqrt(x)) or InvSqrt(Sqrt(x)). This yields the proper result, but I'm wondering if there's a faster way to approximately calculate 1/(x^(1/4)) directly.
Any ideas, pointers, or flat-out brilliant answers appreciated.
Is this going to be a general routine or do you know the probable range of numbers that you're going to use?
-Josh
-Josh
Quote:Original post by jjd
Is this going to be a general routine or do you know the probable range of numbers that you're going to use?
-Josh
I don't know the probable range just yet (still working out some details), but it will definitely be in the "0 to some reasonable upper limit" range. For now, I'd say maybe 0 to 100,000. Actually, since it's an exponential function, I could use any range, and just scale the input and output by appropriate constants.
I would avoid the scaling issue if possible. One approach is to use Newton's method to find the root of f(x) = x^(-4) - a, where a is the number you want to find the quartic root of. Use this one if a > 1. Finding the quartic root of 1/a is faster (no divisions).
If x_n is your current estimate of the quartic root, the next estimate will be
b_n = x_n * x_n;
x_(n+1) = 0.25 * (5.0 * x_n - a * x_n * b_n * b_n);
Using b reduces the number of multiplications by one. For this method to be fast, you need a good initial guess. If it is "good" then this method will converge very quickly.
If a is less than one (or, perhaps, a lot smaller than one) you'll probably want to use f(x) = x^4 - a. The resulting equations are
b_n = x_n * x_n;
x_(n+1) = x_n - 0.25 * (b * b - a) / (b_n * x_n);
I don't think you can avoid the division here, which is unfortunte.
I'm sure there are better ways of approaching this problem, although this is not too bad.
-Josh
If x_n is your current estimate of the quartic root, the next estimate will be
b_n = x_n * x_n;
x_(n+1) = 0.25 * (5.0 * x_n - a * x_n * b_n * b_n);
Using b reduces the number of multiplications by one. For this method to be fast, you need a good initial guess. If it is "good" then this method will converge very quickly.
If a is less than one (or, perhaps, a lot smaller than one) you'll probably want to use f(x) = x^4 - a. The resulting equations are
b_n = x_n * x_n;
x_(n+1) = x_n - 0.25 * (b * b - a) / (b_n * x_n);
I don't think you can avoid the division here, which is unfortunte.
I'm sure there are better ways of approaching this problem, although this is not too bad.
-Josh
inline void inv4thRoot(float val, float* result){ __asm { rsqrtss xmm0, val rsqrtss xmm0, xmm0 rcpss xmm0, xmm0 movss [result], xmm0 };}
Of course, you see there is a very bad dependency chain here re-using xmm0 every operation like this, but still it's the fastest way. I would guess it at around 14 cycles. If you could show what the code surrounding this call would be, I'd be able to help you come up with a better SSE solution.
@jjd:
Thanks! I thought there might be a convergence function in there somewhere.
I should have a good initial guess most of the time, because I'll have spatial-temporal cohesion in my favor. I'll be querying for groups of nearby values in succession, so I can usually use the result from the previous query as a starting point for my next.
I should be able to scale everything such that a is almost always > 1. There's no cost to scaling, because my input data could run at whatever scale it needs to (doesn't matter to the game), and the output will be going to a renderer, where it gets scaled by a transformation matrix anyway.
Thanks! I thought there might be a convergence function in there somewhere.
I should have a good initial guess most of the time, because I'll have spatial-temporal cohesion in my favor. I'll be querying for groups of nearby values in succession, so I can usually use the result from the previous query as a starting point for my next.
I should be able to scale everything such that a is almost always > 1. There's no cost to scaling, because my input data could run at whatever scale it needs to (doesn't matter to the game), and the output will be going to a renderer, where it gets scaled by a transformation matrix anyway.
@ajas95:
Thanks! I used to code a lot of assembly on the 6502 (back in the day ;o), then later on the 68000. I've never taken the time to get up to speed on x86 assembly.
Here's the context in which I'll be using the function (and if you come up with a nice chunk of fast inline asm to do the whole thing, that'd be fantastic!)
Basicly, I need to take a 2D point (x0,y0), get it's distance (d0) from the center of the screen, and transform the point to (x1,y1) such that it's new distance (d1) from the screen center is sqrt(d0). The screen is a moving window, so the screen center could be anywhere in 2D space. The transformation gives a warped view of the 2D space so that the screen is magnified in the middle, and things become smaller and more compressed the further they are toward the screen edges.
So I do something like this:
(Of course, I'd write it a little more compact and optimized. I'm just trying to make it clear what I'm doing.)
So if I take a point p0 that's a distance d0 from the screen's center, and call transformPoint(), I should get a point p1 whose distance from the screen's center is sqrt(d0), which is what I want.
Thanks! I used to code a lot of assembly on the 6502 (back in the day ;o), then later on the 68000. I've never taken the time to get up to speed on x86 assembly.
Here's the context in which I'll be using the function (and if you come up with a nice chunk of fast inline asm to do the whole thing, that'd be fantastic!)
Basicly, I need to take a 2D point (x0,y0), get it's distance (d0) from the center of the screen, and transform the point to (x1,y1) such that it's new distance (d1) from the screen center is sqrt(d0). The screen is a moving window, so the screen center could be anywhere in 2D space. The transformation gives a warped view of the 2D space so that the screen is magnified in the middle, and things become smaller and more compressed the further they are toward the screen edges.
So I do something like this:
float inv4thRoot(float x){ //some implementation here}void transformPoint(Vector2d &p0, Vector2d &p1){ // get distance squared from p0 to screen center float p0DisSquared = Vector2d(p0 - screenCenter).LengthSquared(); // now get the inverse sqrt of distance from p0 to screen center float invDisSqrt = inv4thRoot(p0DisSquared); // scale p0 by invDisSqrt to get p1 p1 = p0 * invDisSqrt;}
(Of course, I'd write it a little more compact and optimized. I'm just trying to make it clear what I'm doing.)
So if I take a point p0 that's a distance d0 from the screen's center, and call transformPoint(), I should get a point p1 whose distance from the screen's center is sqrt(d0), which is what I want.
Here is the fastest version you will ever find. In a loop, it runs in 12 cycles per 4 estimate calculations (it does 4 at a time).
Here's it goes (this time tested :)
and the results:
Here's it goes (this time tested :)
inline void inv4thRoot(float a, float b, float c, float d, float* result){ __asm { mov eax, result rsqrtss xmm4, a rsqrtss xmm5, b rsqrtss xmm6, c rsqrtss xmm7, d rsqrtss xmm0, xmm4 rsqrtss xmm1, xmm5 rsqrtss xmm2, xmm6 rsqrtss xmm3, xmm7 rcpss xmm0, xmm0 rcpss xmm1, xmm1 rcpss xmm2, xmm2 rcpss xmm3, xmm3 movss [eax], xmm0 movss [eax+4], xmm1 movss [eax+8], xmm2 movss [eax+12], xmm3 };}// run with this input: float inv4th[4]; float a = 1.0f; float b = 2.0f; float c = 10000.0f; float d = 0.001f; inv4thRoot(a, b, c, d, inv4th);
and the results:
x approximate actual1.0 0.999756 1.02.0 0.84082 0.84089610000 0.0999756 0.10.001 5.62305 5.62341
Quote:Original post by ajas95
Here is the fastest version you will ever find. In a loop, it runs in 12 cycles per 4 estimate calculations (it does 4 at a time).
Here's it goes (this time tested :)
*** Source Snippet Removed ***
and the results:x approximate actual1.0 0.999756 1.02.0 0.84082 0.84089610000 0.0999756 0.10.001 5.62305 5.62341
Why to do this like that when you have the rsqrtps and rcpps instructions??
okay, I wrote this. I felt like some of your math might be wrong though. Particularly, I wrote it so that:
would become:
Is that what you'd meant?
called like:
vNistelrooy:
about the *ps instructions. It would've taken more work to get all the components into the right places in the registers and would've been slower. Also, there's still that dependency when you do rsqrtps/rsqrtps/rcpps all right in a row, so even though it's fewer instructions, it's no faster.
However! If he wanted to make this code really really fly, he'd set up his components to be 4 x components in a row and 4 y components in a row and store them in 16-byte aligned arrays, and we'd process 4 of those at a time (using the *ps instructions), so 16 points at a time. And then we can eliminate those dependencies by stacking the rsqrtps instructions like I stack my rsrtss instructions, and you could probably cook 16 in around 80 cycles.
The code I posted above does 4 in 55 cycles. So, it's possible to make the *ps instructions run faster but it would require a major re-organization of the data. What I've posted will work with much more lenient data structure requirements, as long as Vector2 is 8 bytes.
p1 = p0 * invDisSqrt;
would become:
p1 = screenCenter + invDisSqrt * (p0 - screenCenter);
Is that what you'd meant?
struct Vector2{ float x,y; friend std::ostream& operator << (std::ostream& out, const Vector2& vec);};std::ostream& operator << (std::ostream& out, const Vector2& vec){ out << "[" << vec.x << ",\t" << vec.y << "]" << std::endl; return out;}inline void transform4Points(Vector2* out, const Vector2* in, const Vector2* screenCenter){ Vector2 t0; Vector2 t1; Vector2 t2; Vector2 t3; __asm { mov eax, in mov ecx, out mov edx, screenCenter movss xmm0, [eax + 0]Vector2.x movss xmm1, [eax + 8]Vector2.x movss xmm2, [eax + 10h]Vector2.x movss xmm3, [eax + 18h]Vector2.x subss xmm0, [edx]Vector2.x subss xmm1, [edx]Vector2.x subss xmm2, [edx]Vector2.x subss xmm3, [edx]Vector2.x movss xmm4, [eax + 0]Vector2.y movss xmm5, [eax + 8]Vector2.y movss xmm6, [eax + 10h]Vector2.y movss xmm7, [eax + 18h]Vector2.y subss xmm4, [edx]Vector2.y subss xmm5, [edx]Vector2.y subss xmm6, [edx]Vector2.y subss xmm7, [edx]Vector2.y movss t0.x, xmm0 movss t0.y, xmm4 movss t1.x, xmm1 movss t1.y, xmm5 movss t2.x, xmm2 movss t2.y, xmm6 movss t3.x, xmm3 movss t3.y, xmm7 mulss xmm0, xmm0 mulss xmm1, xmm1 mulss xmm2, xmm2 mulss xmm3, xmm3 mulss xmm4, xmm4 mulss xmm5, xmm5 mulss xmm6, xmm6 mulss xmm7, xmm7 addss xmm0, xmm4 addss xmm1, xmm5 addss xmm2, xmm6 addss xmm3, xmm7 rsqrtss xmm0, xmm0 rsqrtss xmm1, xmm1 rsqrtss xmm2, xmm2 rsqrtss xmm3, xmm3 rsqrtss xmm0, xmm0 rsqrtss xmm1, xmm1 rsqrtss xmm2, xmm2 rsqrtss xmm3, xmm3 movss xmm4, t0.y movss xmm5, t1.y movss xmm6, t2.y movss xmm7, t3.y rcpss xmm0, xmm0 rcpss xmm1, xmm1 rcpss xmm2, xmm2 rcpss xmm3, xmm3 mulss xmm4, xmm0 mulss xmm5, xmm1 mulss xmm6, xmm2 mulss xmm7, xmm3 mulss xmm0, t0.x mulss xmm1, t1.x mulss xmm2, t2.x mulss xmm3, t3.x addss xmm0, [edx]Vector2.x addss xmm1, [edx]Vector2.x addss xmm2, [edx]Vector2.x addss xmm3, [edx]Vector2.x addss xmm4, [edx]Vector2.y addss xmm5, [edx]Vector2.y addss xmm6, [edx]Vector2.y addss xmm7, [edx]Vector2.y movss [ecx+00h], xmm0 movss [ecx+04h], xmm4 movss [ecx+08h], xmm1 movss [ecx+0Ch], xmm5 movss [ecx+10h], xmm2 movss [ecx+14h], xmm6 movss [ecx+18h], xmm3 movss [ecx+1Ch], xmm7 };}
called like:
Vector2 orig[4] = { {0.0f, 0.0f}, {1.0f, 1.0f}, {100.0f, 100.0f}, {20.0f, 80.0f} }; Vector2 scr = { 50.0f, 1.0f }; Vector2 res[4]; transform4Points(res, orig, &scr);
vNistelrooy:
about the *ps instructions. It would've taken more work to get all the components into the right places in the registers and would've been slower. Also, there's still that dependency when you do rsqrtps/rsqrtps/rcpps all right in a row, so even though it's fewer instructions, it's no faster.
However! If he wanted to make this code really really fly, he'd set up his components to be 4 x components in a row and 4 y components in a row and store them in 16-byte aligned arrays, and we'd process 4 of those at a time (using the *ps instructions), so 16 points at a time. And then we can eliminate those dependencies by stacking the rsqrtps instructions like I stack my rsrtss instructions, and you could probably cook 16 in around 80 cycles.
The code I posted above does 4 in 55 cycles. So, it's possible to make the *ps instructions run faster but it would require a major re-organization of the data. What I've posted will work with much more lenient data structure requirements, as long as Vector2 is 8 bytes.
This topic is closed to new replies.
Advertisement
Popular Topics
Advertisement