Jump to content
  • Advertisement
Sign in to follow this  
Martoon

Fast approximation for Inverse Fourth Root? (1/(x^(1/4)))

This topic is 4835 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I need a fast approximation of f(x) = 1/(x^(1/4)), where x and f(x) are floats (I don't need the precision of a double). There's plenty of info out there on fast implementation of Sqrt() and InvSqrt(), so I can just do a combined implementation of Sqrt(InvSqrt(x)) or InvSqrt(Sqrt(x)). This yields the proper result, but I'm wondering if there's a faster way to approximately calculate 1/(x^(1/4)) directly. Any ideas, pointers, or flat-out brilliant answers appreciated.

Share this post


Link to post
Share on other sites
Advertisement
Is this going to be a general routine or do you know the probable range of numbers that you're going to use?


-Josh

Share this post


Link to post
Share on other sites
Quote:
Original post by jjd
Is this going to be a general routine or do you know the probable range of numbers that you're going to use?


-Josh


I don't know the probable range just yet (still working out some details), but it will definitely be in the "0 to some reasonable upper limit" range. For now, I'd say maybe 0 to 100,000. Actually, since it's an exponential function, I could use any range, and just scale the input and output by appropriate constants.

Share this post


Link to post
Share on other sites
I would avoid the scaling issue if possible. One approach is to use Newton's method to find the root of f(x) = x^(-4) - a, where a is the number you want to find the quartic root of. Use this one if a > 1. Finding the quartic root of 1/a is faster (no divisions).

If x_n is your current estimate of the quartic root, the next estimate will be

b_n = x_n * x_n;
x_(n+1) = 0.25 * (5.0 * x_n - a * x_n * b_n * b_n);

Using b reduces the number of multiplications by one. For this method to be fast, you need a good initial guess. If it is "good" then this method will converge very quickly.

If a is less than one (or, perhaps, a lot smaller than one) you'll probably want to use f(x) = x^4 - a. The resulting equations are

b_n = x_n * x_n;
x_(n+1) = x_n - 0.25 * (b * b - a) / (b_n * x_n);

I don't think you can avoid the division here, which is unfortunte.

I'm sure there are better ways of approaching this problem, although this is not too bad.


-Josh

Share this post


Link to post
Share on other sites


inline void inv4thRoot(float val, float* result)
{
__asm {

rsqrtss xmm0, val
rsqrtss xmm0, xmm0
rcpss xmm0, xmm0
movss [result], xmm0
};
}



Of course, you see there is a very bad dependency chain here re-using xmm0 every operation like this, but still it's the fastest way. I would guess it at around 14 cycles. If you could show what the code surrounding this call would be, I'd be able to help you come up with a better SSE solution.

Share this post


Link to post
Share on other sites
@jjd:
Thanks! I thought there might be a convergence function in there somewhere.

I should have a good initial guess most of the time, because I'll have spatial-temporal cohesion in my favor. I'll be querying for groups of nearby values in succession, so I can usually use the result from the previous query as a starting point for my next.

I should be able to scale everything such that a is almost always > 1. There's no cost to scaling, because my input data could run at whatever scale it needs to (doesn't matter to the game), and the output will be going to a renderer, where it gets scaled by a transformation matrix anyway.

Share this post


Link to post
Share on other sites
@ajas95:
Thanks! I used to code a lot of assembly on the 6502 (back in the day ;o), then later on the 68000. I've never taken the time to get up to speed on x86 assembly.

Here's the context in which I'll be using the function (and if you come up with a nice chunk of fast inline asm to do the whole thing, that'd be fantastic!)

Basicly, I need to take a 2D point (x0,y0), get it's distance (d0) from the center of the screen, and transform the point to (x1,y1) such that it's new distance (d1) from the screen center is sqrt(d0). The screen is a moving window, so the screen center could be anywhere in 2D space. The transformation gives a warped view of the 2D space so that the screen is magnified in the middle, and things become smaller and more compressed the further they are toward the screen edges.

So I do something like this:

float inv4thRoot(float x)
{
//some implementation here
}

void transformPoint(Vector2d &p0, Vector2d &p1)
{
// get distance squared from p0 to screen center
float p0DisSquared = Vector2d(p0 - screenCenter).LengthSquared();

// now get the inverse sqrt of distance from p0 to screen center
float invDisSqrt = inv4thRoot(p0DisSquared);

// scale p0 by invDisSqrt to get p1
p1 = p0 * invDisSqrt;
}


(Of course, I'd write it a little more compact and optimized. I'm just trying to make it clear what I'm doing.)

So if I take a point p0 that's a distance d0 from the screen's center, and call transformPoint(), I should get a point p1 whose distance from the screen's center is sqrt(d0), which is what I want.

Share this post


Link to post
Share on other sites
Here is the fastest version you will ever find. In a loop, it runs in 12 cycles per 4 estimate calculations (it does 4 at a time).

Here's it goes (this time tested :)

inline void inv4thRoot(float a, float b, float c, float d, float* result)
{
__asm {
mov eax, result

rsqrtss xmm4, a
rsqrtss xmm5, b
rsqrtss xmm6, c
rsqrtss xmm7, d

rsqrtss xmm0, xmm4
rsqrtss xmm1, xmm5
rsqrtss xmm2, xmm6
rsqrtss xmm3, xmm7

rcpss xmm0, xmm0
rcpss xmm1, xmm1
rcpss xmm2, xmm2
rcpss xmm3, xmm3

movss [eax], xmm0
movss [eax+4], xmm1
movss [eax+8], xmm2
movss [eax+12], xmm3
};
}

// run with this input:

float inv4th[4];

float a = 1.0f;
float b = 2.0f;
float c = 10000.0f;
float d = 0.001f;

inv4thRoot(a, b, c, d, inv4th);


and the results:

x approximate actual
1.0 0.999756 1.0
2.0 0.84082 0.840896
10000 0.0999756 0.1
0.001 5.62305 5.62341

Share this post


Link to post
Share on other sites
Quote:
Original post by ajas95
Here is the fastest version you will ever find. In a loop, it runs in 12 cycles per 4 estimate calculations (it does 4 at a time).

Here's it goes (this time tested :)
*** Source Snippet Removed ***
and the results:

x approximate actual
1.0 0.999756 1.0
2.0 0.84082 0.840896
10000 0.0999756 0.1
0.001 5.62305 5.62341


Why to do this like that when you have the rsqrtps and rcpps instructions??

Share this post


Link to post
Share on other sites
okay, I wrote this. I felt like some of your math might be wrong though. Particularly, I wrote it so that:
p1 = p0 * invDisSqrt;

would become:
p1 = screenCenter + invDisSqrt * (p0 - screenCenter);


Is that what you'd meant?


struct Vector2
{
float x,y;

friend std::ostream& operator << (std::ostream& out, const Vector2& vec);
};

std::ostream& operator << (std::ostream& out, const Vector2& vec)
{
out << "[" << vec.x << ",\t" << vec.y << "]" << std::endl;

return out;
}



inline void transform4Points(Vector2* out, const Vector2* in, const Vector2* screenCenter)
{
Vector2 t0;
Vector2 t1;
Vector2 t2;
Vector2 t3;

__asm {
mov eax, in
mov ecx, out
mov edx, screenCenter

movss xmm0, [eax + 0]Vector2.x
movss xmm1, [eax + 8]Vector2.x
movss xmm2, [eax + 10h]Vector2.x
movss xmm3, [eax + 18h]Vector2.x

subss xmm0, [edx]Vector2.x
subss xmm1, [edx]Vector2.x
subss xmm2, [edx]Vector2.x
subss xmm3, [edx]Vector2.x

movss xmm4, [eax + 0]Vector2.y
movss xmm5, [eax + 8]Vector2.y
movss xmm6, [eax + 10h]Vector2.y
movss xmm7, [eax + 18h]Vector2.y

subss xmm4, [edx]Vector2.y
subss xmm5, [edx]Vector2.y
subss xmm6, [edx]Vector2.y
subss xmm7, [edx]Vector2.y

movss t0.x, xmm0
movss t0.y, xmm4
movss t1.x, xmm1
movss t1.y, xmm5
movss t2.x, xmm2
movss t2.y, xmm6
movss t3.x, xmm3
movss t3.y, xmm7

mulss xmm0, xmm0
mulss xmm1, xmm1
mulss xmm2, xmm2
mulss xmm3, xmm3

mulss xmm4, xmm4
mulss xmm5, xmm5
mulss xmm6, xmm6
mulss xmm7, xmm7

addss xmm0, xmm4
addss xmm1, xmm5
addss xmm2, xmm6
addss xmm3, xmm7

rsqrtss xmm0, xmm0
rsqrtss xmm1, xmm1
rsqrtss xmm2, xmm2
rsqrtss xmm3, xmm3

rsqrtss xmm0, xmm0
rsqrtss xmm1, xmm1
rsqrtss xmm2, xmm2
rsqrtss xmm3, xmm3

movss xmm4, t0.y
movss xmm5, t1.y
movss xmm6, t2.y
movss xmm7, t3.y

rcpss xmm0, xmm0
rcpss xmm1, xmm1
rcpss xmm2, xmm2
rcpss xmm3, xmm3

mulss xmm4, xmm0
mulss xmm5, xmm1
mulss xmm6, xmm2
mulss xmm7, xmm3

mulss xmm0, t0.x
mulss xmm1, t1.x
mulss xmm2, t2.x
mulss xmm3, t3.x

addss xmm0, [edx]Vector2.x
addss xmm1, [edx]Vector2.x
addss xmm2, [edx]Vector2.x
addss xmm3, [edx]Vector2.x

addss xmm4, [edx]Vector2.y
addss xmm5, [edx]Vector2.y
addss xmm6, [edx]Vector2.y
addss xmm7, [edx]Vector2.y

movss [ecx+00h], xmm0
movss [ecx+04h], xmm4
movss [ecx+08h], xmm1
movss [ecx+0Ch], xmm5

movss [ecx+10h], xmm2
movss [ecx+14h], xmm6
movss [ecx+18h], xmm3
movss [ecx+1Ch], xmm7
};
}



called like:

Vector2 orig[4] = { {0.0f, 0.0f}, {1.0f, 1.0f}, {100.0f, 100.0f}, {20.0f, 80.0f} };
Vector2 scr = { 50.0f, 1.0f };
Vector2 res[4];

transform4Points(res, orig, &scr);


vNistelrooy:
about the *ps instructions. It would've taken more work to get all the components into the right places in the registers and would've been slower. Also, there's still that dependency when you do rsqrtps/rsqrtps/rcpps all right in a row, so even though it's fewer instructions, it's no faster.

However! If he wanted to make this code really really fly, he'd set up his components to be 4 x components in a row and 4 y components in a row and store them in 16-byte aligned arrays, and we'd process 4 of those at a time (using the *ps instructions), so 16 points at a time. And then we can eliminate those dependencies by stacking the rsqrtps instructions like I stack my rsrtss instructions, and you could probably cook 16 in around 80 cycles.

The code I posted above does 4 in 55 cycles. So, it's possible to make the *ps instructions run faster but it would require a major re-organization of the data. What I've posted will work with much more lenient data structure requirements, as long as Vector2 is 8 bytes.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!