Fast approximation for Inverse Fourth Root? (1/(x^(1/4)))

Started by
15 comments, last by ajas95 18 years, 8 months ago
I need a fast approximation of f(x) = 1/(x^(1/4)), where x and f(x) are floats (I don't need the precision of a double). There's plenty of info out there on fast implementation of Sqrt() and InvSqrt(), so I can just do a combined implementation of Sqrt(InvSqrt(x)) or InvSqrt(Sqrt(x)). This yields the proper result, but I'm wondering if there's a faster way to approximately calculate 1/(x^(1/4)) directly. Any ideas, pointers, or flat-out brilliant answers appreciated.
Advertisement
Is this going to be a general routine or do you know the probable range of numbers that you're going to use?


-Josh

--www.physicaluncertainty.com
--linkedin
--irc.freenode.net#gdnet

Quote:Original post by jjd
Is this going to be a general routine or do you know the probable range of numbers that you're going to use?


-Josh


I don't know the probable range just yet (still working out some details), but it will definitely be in the "0 to some reasonable upper limit" range. For now, I'd say maybe 0 to 100,000. Actually, since it's an exponential function, I could use any range, and just scale the input and output by appropriate constants.
I would avoid the scaling issue if possible. One approach is to use Newton's method to find the root of f(x) = x^(-4) - a, where a is the number you want to find the quartic root of. Use this one if a > 1. Finding the quartic root of 1/a is faster (no divisions).

If x_n is your current estimate of the quartic root, the next estimate will be

b_n = x_n * x_n;
x_(n+1) = 0.25 * (5.0 * x_n - a * x_n * b_n * b_n);

Using b reduces the number of multiplications by one. For this method to be fast, you need a good initial guess. If it is "good" then this method will converge very quickly.

If a is less than one (or, perhaps, a lot smaller than one) you'll probably want to use f(x) = x^4 - a. The resulting equations are

b_n = x_n * x_n;
x_(n+1) = x_n - 0.25 * (b * b - a) / (b_n * x_n);

I don't think you can avoid the division here, which is unfortunte.

I'm sure there are better ways of approaching this problem, although this is not too bad.


-Josh

--www.physicaluncertainty.com
--linkedin
--irc.freenode.net#gdnet

inline void inv4thRoot(float val, float* result){    __asm {        rsqrtss xmm0,     val        rsqrtss xmm0,     xmm0        rcpss   xmm0,     xmm0        movss   [result], xmm0    };}


Of course, you see there is a very bad dependency chain here re-using xmm0 every operation like this, but still it's the fastest way. I would guess it at around 14 cycles. If you could show what the code surrounding this call would be, I'd be able to help you come up with a better SSE solution.
@jjd:
Thanks! I thought there might be a convergence function in there somewhere.

I should have a good initial guess most of the time, because I'll have spatial-temporal cohesion in my favor. I'll be querying for groups of nearby values in succession, so I can usually use the result from the previous query as a starting point for my next.

I should be able to scale everything such that a is almost always > 1. There's no cost to scaling, because my input data could run at whatever scale it needs to (doesn't matter to the game), and the output will be going to a renderer, where it gets scaled by a transformation matrix anyway.
@ajas95:
Thanks! I used to code a lot of assembly on the 6502 (back in the day ;o), then later on the 68000. I've never taken the time to get up to speed on x86 assembly.

Here's the context in which I'll be using the function (and if you come up with a nice chunk of fast inline asm to do the whole thing, that'd be fantastic!)

Basicly, I need to take a 2D point (x0,y0), get it's distance (d0) from the center of the screen, and transform the point to (x1,y1) such that it's new distance (d1) from the screen center is sqrt(d0). The screen is a moving window, so the screen center could be anywhere in 2D space. The transformation gives a warped view of the 2D space so that the screen is magnified in the middle, and things become smaller and more compressed the further they are toward the screen edges.

So I do something like this:
float inv4thRoot(float x){    //some implementation here}void transformPoint(Vector2d &p0, Vector2d &p1){    // get distance squared from p0 to screen center    float p0DisSquared = Vector2d(p0 - screenCenter).LengthSquared();    // now get the inverse sqrt of distance from p0 to screen center    float invDisSqrt = inv4thRoot(p0DisSquared);    // scale p0 by invDisSqrt to get p1    p1 = p0 * invDisSqrt;}

(Of course, I'd write it a little more compact and optimized. I'm just trying to make it clear what I'm doing.)

So if I take a point p0 that's a distance d0 from the screen's center, and call transformPoint(), I should get a point p1 whose distance from the screen's center is sqrt(d0), which is what I want.
Here is the fastest version you will ever find. In a loop, it runs in 12 cycles per 4 estimate calculations (it does 4 at a time).

Here's it goes (this time tested :)
inline void inv4thRoot(float a, float b, float c, float d, float* result){	__asm  {		mov		eax,	result		rsqrtss xmm4,	a		rsqrtss xmm5,	b		rsqrtss xmm6,	c		rsqrtss xmm7,	d		rsqrtss xmm0,	xmm4		rsqrtss xmm1,	xmm5		rsqrtss xmm2,	xmm6		rsqrtss xmm3,	xmm7		rcpss	xmm0,	xmm0		rcpss	xmm1,	xmm1		rcpss	xmm2,	xmm2		rcpss	xmm3,	xmm3		movss [eax],	xmm0		movss [eax+4],	xmm1		movss [eax+8],	xmm2		movss [eax+12],	xmm3	};}// run with this input:	float inv4th[4];	float a = 1.0f;	float b = 2.0f;	float c = 10000.0f;	float d = 0.001f;	inv4thRoot(a, b, c, d, inv4th);

and the results:
x	approximate	actual1.0 	0.999756 	1.02.0 	0.84082 	0.84089610000 	0.0999756 	0.10.001 	5.62305 	5.62341
Quote:Original post by ajas95
Here is the fastest version you will ever find. In a loop, it runs in 12 cycles per 4 estimate calculations (it does 4 at a time).

Here's it goes (this time tested :)
*** Source Snippet Removed ***
and the results:
x	approximate	actual1.0 	0.999756 	1.02.0 	0.84082 	0.84089610000 	0.0999756 	0.10.001 	5.62305 	5.62341


Why to do this like that when you have the rsqrtps and rcpps instructions??
"C lets you shoot yourself in the foot rather easily. C++ allows you to reuse the bullet!"
okay, I wrote this. I felt like some of your math might be wrong though. Particularly, I wrote it so that:
p1 = p0 * invDisSqrt;

would become:
p1 = screenCenter + invDisSqrt * (p0 - screenCenter);


Is that what you'd meant?

struct Vector2{	float x,y;	friend std::ostream& operator << (std::ostream& out, const Vector2& vec);};std::ostream& operator << (std::ostream& out, const Vector2& vec){	out << "[" << vec.x << ",\t" << vec.y << "]" << std::endl;	return out;}inline void transform4Points(Vector2* out, const Vector2* in, const Vector2* screenCenter){	Vector2 t0;	Vector2 t1;	Vector2 t2;	Vector2 t3;	__asm {		mov		eax,	in		mov		ecx,	out		mov		edx,	screenCenter		movss	xmm0,	[eax + 0]Vector2.x		movss	xmm1,	[eax + 8]Vector2.x		movss	xmm2,	[eax + 10h]Vector2.x		movss	xmm3,	[eax + 18h]Vector2.x		subss	xmm0,	[edx]Vector2.x		subss	xmm1,	[edx]Vector2.x		subss	xmm2,	[edx]Vector2.x		subss	xmm3,	[edx]Vector2.x		movss	xmm4,	[eax + 0]Vector2.y		movss	xmm5,	[eax + 8]Vector2.y		movss	xmm6,	[eax + 10h]Vector2.y		movss	xmm7,	[eax + 18h]Vector2.y		subss	xmm4,	[edx]Vector2.y		subss	xmm5,	[edx]Vector2.y		subss	xmm6,	[edx]Vector2.y		subss	xmm7,	[edx]Vector2.y		movss	t0.x,	xmm0		movss	t0.y,	xmm4		movss	t1.x,	xmm1		movss	t1.y,	xmm5		movss	t2.x,	xmm2		movss	t2.y,	xmm6		movss	t3.x,	xmm3		movss	t3.y,	xmm7		mulss	xmm0,	xmm0		mulss	xmm1,	xmm1		mulss	xmm2,	xmm2		mulss	xmm3,	xmm3		mulss	xmm4,	xmm4		mulss	xmm5,	xmm5		mulss	xmm6,	xmm6		mulss	xmm7,	xmm7		addss	xmm0,	xmm4		addss	xmm1,	xmm5		addss	xmm2,	xmm6		addss	xmm3,	xmm7		rsqrtss	xmm0,	xmm0		rsqrtss	xmm1,	xmm1		rsqrtss	xmm2,	xmm2		rsqrtss	xmm3,	xmm3		rsqrtss	xmm0,	xmm0		rsqrtss	xmm1,	xmm1		rsqrtss	xmm2,	xmm2		rsqrtss	xmm3,	xmm3		movss	xmm4,	t0.y		movss	xmm5,	t1.y		movss	xmm6,	t2.y		movss	xmm7,	t3.y		rcpss	xmm0,	xmm0		rcpss	xmm1,	xmm1		rcpss	xmm2,	xmm2		rcpss	xmm3,	xmm3		mulss	xmm4,	xmm0		mulss	xmm5,	xmm1		mulss	xmm6,	xmm2		mulss	xmm7,	xmm3		mulss	xmm0,	t0.x		mulss	xmm1,	t1.x		mulss	xmm2,	t2.x		mulss	xmm3,	t3.x		addss	xmm0,	[edx]Vector2.x		addss	xmm1,	[edx]Vector2.x		addss	xmm2,	[edx]Vector2.x		addss	xmm3,	[edx]Vector2.x		addss	xmm4,	[edx]Vector2.y		addss	xmm5,	[edx]Vector2.y		addss	xmm6,	[edx]Vector2.y		addss	xmm7,	[edx]Vector2.y		movss	[ecx+00h], xmm0		movss	[ecx+04h], xmm4		movss	[ecx+08h], xmm1		movss	[ecx+0Ch], xmm5		movss	[ecx+10h], xmm2		movss	[ecx+14h], xmm6		movss	[ecx+18h], xmm3		movss	[ecx+1Ch], xmm7	};}

called like:
	Vector2 orig[4] = { {0.0f, 0.0f}, {1.0f, 1.0f}, {100.0f, 100.0f}, {20.0f, 80.0f} };	Vector2 scr = { 50.0f, 1.0f };	Vector2 res[4];	transform4Points(res, orig, &scr);


vNistelrooy:
about the *ps instructions. It would've taken more work to get all the components into the right places in the registers and would've been slower. Also, there's still that dependency when you do rsqrtps/rsqrtps/rcpps all right in a row, so even though it's fewer instructions, it's no faster.

However! If he wanted to make this code really really fly, he'd set up his components to be 4 x components in a row and 4 y components in a row and store them in 16-byte aligned arrays, and we'd process 4 of those at a time (using the *ps instructions), so 16 points at a time. And then we can eliminate those dependencies by stacking the rsqrtps instructions like I stack my rsrtss instructions, and you could probably cook 16 in around 80 cycles.

The code I posted above does 4 in 55 cycles. So, it's possible to make the *ps instructions run faster but it would require a major re-organization of the data. What I've posted will work with much more lenient data structure requirements, as long as Vector2 is 8 bytes.

This topic is closed to new replies.

Advertisement