Fast approximation for Inverse Fourth Root? (1/(x^(1/4)))

15 comments, last by ajas95 18 years, 8 months ago
Quote:Original post by ajas95
vNistelrooy:
about the *ps instructions. It would've taken more work to get all the components into the right places in the registers and would've been slower. Also, there's still that dependency when you do rsqrtps/rsqrtps/rcpps all right in a row, so even though it's fewer instructions, it's no faster.

However! If he wanted to make this code really really fly, he'd set up his components as 4 x components in a row and 4 y components in a row, store them in 16-byte aligned arrays, and we'd process 4 of those at a time (using the *ps instructions), so 16 points at a time. And then we can eliminate those dependencies by stacking the rsqrtps instructions like I stack my rsqrtss instructions, and you could probably cook 16 in around 80 cycles.

The code I posted above does 4 in 55 cycles. So, it's possible to make the *ps instructions run faster but it would require a major re-organization of the data. What I've posted will work with much more lenient data structure requirements, as long as Vector2 is 8 bytes.


Yeah, I thought about the dependencies, but the last time I benchmarked this kind of stuff, I remember the *ps instructions being faster than doing 4x *ss.
Anyway, doing it 4x4 as you suggested seems best.
"C lets you shoot yourself in the foot rather easily. C++ allows you to reuse the bullet!"
Quote:Original post by ajas95
okay, I wrote this. I felt like some of your math might be wrong though. Particularly, I wrote it so that:
p1 = p0 * invDisSqrt;

would become:
p1 = screenCenter + invDisSqrt * (p0 - screenCenter);


Is that what you'd meant?

Doh! Yes, that's definitely what I meant. Stupid oversight on my part.

Thanks a million for the asm code. What class of CPU does that require?
Quote:
However! If he wanted to make this code really really fly, he'd set up his components as 4 x components in a row and 4 y components in a row, store them in 16-byte aligned arrays, and we'd process 4 of those at a time (using the *ps instructions), so 16 points at a time. And then we can eliminate those dependencies by stacking the rsqrtps instructions like I stack my rsqrtss instructions, and you could probably cook 16 in around 80 cycles.

Interleaving the vector components like that actually wouldn't be much of a problem. The vectors I need to transform won't generally be contiguous in an array, so even with the code above I'd need to pack separate vectors into an array before transforming anyway. It really wouldn't be any more work to pack them interleaved.

In any case, I'd probably make specialized versions of the function to handle 1, 2, 3, or 4 vectors. I know it's much more efficient to always do 4 at a time, but there are plenty of times I wouldn't have 4 that need it. I'm going to be using this to do dynamic, adaptive tessellation of 2D meshes. A lot of the time I'll need to transform one vector and use the result to determine the value of the next vector to transform. I suppose I could tessellate 4 meshes in parallel, but it seems like the complex logic overhead would outweigh the benefits of transforming 4 vectors at once.
Quote:Original post by Martoon
Thanks a million for the asm code. What class of CPU does that require?


One that supports SSE. Pentium3+ and AthlonXP+.
"C lets you shoot yourself in the foot rather easily. C++ allows you to reuse the bullet!"
Quote:Original post by vNistelrooy
Quote:Original post by Martoon
Thanks a million for the asm code. What class of CPU does that require?


One that supports SSE. Pentium3+ and AthlonXP+.

Ahh, okay. Thanks.
Okay, I'm really a sucker for this kind of stuff, so I decided to write the 4x4 swizzled version. So, good news and bad news... it ends up processing 16 points in 120 cycles (not the 80 I'd hoped for). That's still roughly a 100% speed-up (which is very hard to achieve over already highly-optimized SSE!), and the shortfall lies mostly in the fact that I changed the interface.

Before, I'd assumed that the Vector2s processed were a contiguous chunk of memory, but Martoon clarified that they weren't... so in this version, I opted to pass in pointers to these 4x2 structures. There's a lot of very technical detail behind why this adds a 20-cycle stall, but it does :)

But also note, at the cost of 20 cycles we get some really cool flexibility. You can build the entire array of SwizVec2's to represent your screen (and outputs), and only need to do that once. Then you can pass in just the screen center and an array of 4 pointers to these swizzled chunks depending on where you want to calculate, and they're all independent.

Also, note that we're down to 8 cycles per point now, and that includes calculating an inverse 4th root for each point! Any performance gains beyond this will come from determining which area you're going to compute and prefetching it into the L1 cache before issuing these calls. There are more dangerous demons than stall cycles, and they live on "The BUS" :)

code:
#include <ostream>  // for std::ostream / std::endl

__declspec(align(16)) struct SwizVec2
{
	float x0, x1, x2, x3;
	float y0, y1, y2, y3;
	friend std::ostream& operator << (std::ostream& out, const SwizVec2& vec);
};

std::ostream& operator << (std::ostream& out, const SwizVec2& vec)
{
	out << "[ ( " << vec.x0 << ",\t" << vec.y0 << " ) \t"
		<< "  ( " << vec.x1 << ",\t" << vec.y1 << " ) \t"
		<< "  ( " << vec.x2 << ",\t" << vec.y2 << " ) \t"
		<< "  ( " << vec.x3 << ",\t" << vec.y3 << " ) ]" << std::endl;
	return out;
}

inline void transform16PointsSwizzled(SwizVec2** out, const SwizVec2** in, const SwizVec2* screenCenter)
{
	SwizVec2 t0;
	SwizVec2 t1;
	SwizVec2 t2;
	SwizVec2 t3;
	__asm {
		mov		eax,	in
		mov		ecx,	[eax + 04h]
		mov		edi,	[eax + 08h]
		mov		esi,	[eax + 0ch]
		mov		eax,	[eax + 00h]
		mov		edx,	screenCenter

		// Good Idea #1: Don't load screenCenter into a register.  As long as it is aligned on a
		// 16-byte boundary, we can use it directly as the 2nd argument to most *ps instructions.
		// This is good news!  Register economy is key.

		// x's first
		movaps	xmm0,	[eax]
		movaps	xmm1,	[ecx]
		movaps	xmm2,	[edi]
		movaps	xmm3,	[esi]
		subps	xmm0,	[edx]	// subtract screen center.  The same x,y values are replicated
								// into all 4 screen center coordinates.

		// start moving in y's also.
		movaps	xmm4,	[eax + 10h]
		subps	xmm1,	[edx]
		movaps	xmm5,	[ecx + 10h]
		subps	xmm2,	[edx]
		movaps	xmm6,	[edi + 10h]
		subps	xmm3,	[edx]
		movaps	xmm7,	[esi + 10h]

		movaps	[t0],	xmm0
		movaps	[t1],	xmm1
		movaps	[t2],	xmm2
		movaps	[t3],	xmm3

		mulps	xmm0,	xmm0
		subps	xmm4,	[edx + 10h]
		mulps	xmm1,	xmm1
		subps	xmm5,	[edx + 10h]
		mulps	xmm2,	xmm2
		subps	xmm6,	[edx + 10h]
		mulps	xmm3,	xmm3
		subps	xmm7,	[edx + 10h]

		mulps	xmm4,	xmm4
		mulps	xmm5,	xmm5
		mulps	xmm6,	xmm6
		mulps	xmm7,	xmm7

		// now we have x^2 in xmm0-3 and y^2 in xmm4-7, so add.
		addps	xmm0,	xmm4
		addps	xmm1,	xmm5
		addps	xmm2,	xmm6
		addps	xmm3,	xmm7

		// start reloading y's.  In the previous implementation, we saved off
		// the entire (p0 - screenCenter) calculation.  That requires moving
		// a lot of memory around just to save a simple subtraction!  Here,
		// it turns out to be a good idea to save just the x-components and
		// recalculate the y components.  The 'why' is somewhat complicated,
		// but involves filling instruction latency and the fact that we start
		// off using all 8 xmm registers, then shrink down to using only 4
		// during the rsqrt/rcp, then expand back to 8 to calculate p1 finally.
		// There's *always* a good way to use 4 vacant registers...
		movaps	xmm4,	[eax + 10h]
		movaps	xmm5,	[ecx + 10h]
		movaps	xmm6,	[edi + 10h]
		movaps	xmm7,	[esi + 10h]

		// start inverse 4th root.  (x = x^(-1/2))
		rsqrtps	xmm0,	xmm0
		rsqrtps	xmm1,	xmm1
		rsqrtps	xmm2,	xmm2
		rsqrtps	xmm3,	xmm3

		// recomputing (p0 - screenCenter).y here, also filling pipeline stalls.
		subps	xmm4,	[edx + 10h]
		subps	xmm5,	[edx + 10h]
		subps	xmm6,	[edx + 10h]
		subps	xmm7,	[edx + 10h]

		// AGAIN!   (x = x^(1/4))
		rsqrtps	xmm0,	xmm0
		rsqrtps	xmm1,	xmm1
		rsqrtps	xmm2,	xmm2
		rsqrtps	xmm3,	xmm3

		// and reciprocal to negate the exponent: (x = x^(-1/4))
		rcpps	xmm0,	xmm0
		rcpps	xmm1,	xmm1
		rcpps	xmm2,	xmm2
		rcpps	xmm3,	xmm3

		// stalls are building up, good time to reassign our addressing regs.
		// note: this doesn't help much, these all dispatch and execute
		// in just 2 cycles!
		mov		eax,	out
		mov		ecx,	[eax + 04h]
		mov		edi,	[eax + 08h]
		mov		esi,	[eax + 0ch]
		mov		eax,	[eax + 00h]

		mulps	xmm4,	xmm0
		mulps	xmm5,	xmm1
		mulps	xmm6,	xmm2
		mulps	xmm7,	xmm3

		mulps	xmm0,	[t0]
		mulps	xmm1,	[t1]
		mulps	xmm2,	[t2]
		mulps	xmm3,	[t3]

		addps	xmm0,	[edx]
		addps	xmm1,	[edx]
		addps	xmm2,	[edx]
		addps	xmm3,	[edx]
		addps	xmm4,	[edx + 10h]
		addps	xmm5,	[edx + 10h]
		addps	xmm6,	[edx + 10h]
		addps	xmm7,	[edx + 10h]

		movaps	[eax],	xmm0
		movaps	[ecx],	xmm1
		movaps	[edi],	xmm2
		movaps	[esi],	xmm3
		movaps	[eax + 10h], xmm4
		movaps	[ecx + 10h], xmm5
		movaps	[edi + 10h], xmm6
		movaps	[esi + 10h], xmm7
	};
}

called like:
	SwizVec2 someVecs[8] =
	{
		{  1,  2,  3,  4,  5,  6,  7,  8 },
		{ 11, 12, 13, 14, 15, 16, 17, 18 },
		{ 21, 22, 23, 24, 25, 26, 27, 28 },
		{ 31, 32, 33, 34, 35, 36, 37, 38 },
		{ 41, 42, 43, 44, 45, 46, 47, 48 },
		{ 51, 52, 53, 54, 55, 56, 57, 58 },
		{ 61, 62, 63, 64, 65, 66, 67, 68 },
		{ 71, 72, 73, 74, 75, 76, 77, 78 },
	};
	SwizVec2 someOuts[8];
	SwizVec2 screen =
	{
		10, 10, 10, 10, 40, 40, 40, 40
	};
	const	SwizVec2* inputs[4] = { &someVecs[0], &someVecs[3], &someVecs[1], &someVecs[7] };
			SwizVec2* outputs[4]= { &someOuts[0], &someOuts[3], &someOuts[1], &someOuts[7] };

	transform16PointsSwizzled(outputs, inputs, &screen);


So the SSE turns out very lovely. I mean, Picasso couldn't paint a pipeline this beautiful (well, he could but it would have some naked fat chicks lavishing about...)

I decided to crop off the nasty pointer-to-pointer stalls in order to make this image fit on the screen. If you refer to the assembly, you should be able to see where it picks up and leaves off...

[image: pipeline scan of transform16PointsSwizzled, hosted at www.ImageShack.us]

Martoon, this will only work on processors equipped with SSE. This includes all Pentium 3+ and Athlon brands (basically 95% of processors in homes today). I never use SSE2+ because it eliminates many processors that folks have, and there are almost no benefits. Additionally, while the underlying assembly will work with x86 GCC, it would have to be reformatted to adhere to the GCC asm syntax, which is both frustrating and powerful (in this case, mostly frustrating). If you have any questions, ask away.

[Edited by - ajas95 on August 21, 2005 10:56:47 AM]
Wow. I'm speechless.

Okay, now I'm speechful... er, speechy... uh, respeeched? Anyway, I can talk again.

When I asked for flat-out brilliant answers in my original post, it was just florid rhetoric. I wasn't really expecting one.

If I keep feeding you bits of source code as I write it, will you re-code the entire game in optimized assembly? :o) If only my compiler were this good.

That really is an impressive pipeline. The crappy little bathroom in our local movie theater has more stalls than that.

So if I follow this correctly, each element in the array of 4 inputs can point to a SwizVec2 struct anywhere in memory, and likewise for the 4 outputs. You just created arrays of 8 and picked 4 at random for illustration purposes, correct? So I could also say, for example:
	const	SwizVec2* inputs[4] = { &someVecs[2], &someVecs[1], &someVecs[0], &someVecs[6] };
			SwizVec2* outputs[4]= { &someOuts[7], &someOuts[6], &someOuts[3], pointerToSomeOtherSwizVec2 };

or whatever. Correct?

Thanks again for the code. The sad thing is, with the adaptive tessellation, I don't know that I'll have 16 points I can batch together. I still have to write the tessellation code. I may rethink the algorithm to take advantage of the batching efficiency.
Quote:Original post by Martoon
If I keep feeding you bits of source code as I write it, will you re-code the entire game in optimized assembly? :o) If only my compiler were this good.


Don't write a game in assembly! I help people understand SSE because there is very little (correct) documentation on how to solve problems with it. Sure, you can download the instruction manual, but it's pretty meaningless unless you get to see it "in action". When I learned it there was a lot of searching in the dark; maybe for other people the search can be a little less blind.

Quote:Thanks again for the code. The sad thing is, with the adaptive tessellation, I don't know that I'll have 16 points I can batch together... I may rethink the algorithm to take advantage of the batching efficiency.


I don't mean to imply that you should use the code! I just did it for fun, maybe a little instructional purposes. It appeared that this was one of those 10% parts in the 90/10 equation. And when you approach a problem that you know beforehand is very expensive you need to understand the optimization goal before you design the data structures. I don't know your algorithm, you only say "I need this fast", so I say "What's the fastest it could possibly be?"

Anyway, here are some more pipeline scans. First is the 4-Vector2 algorithm I presented many posts ago; it uses nothing but *ss instructions. It's funny here, because the Athlon can dispatch the *ss instructions 3 per cycle, but only execute 1. So the out-of-order unit is hanging on by its fingernails, trying to find an execution unit for this flood... and it works! There are only 10 dispatch stalls against all the hundreds of pipeline stalls... amazing technology to do this at 3.x GHz.

[image: pipeline scan of the 4-Vector2 *ss version, hosted at www.ImageShack.us]

Here is a pipeline scan for the swizzled version, but altered to execute only a single 4x1 set of swizzled Vec2s instead of 4 chunks. This demonstrates that you need to process a bunch of things at a time with the *ps instructions to get the full benefit. This runs in 61 cycles: half the cost of the 4x4 version, but doing just a quarter of the work, so roughly 100% slower per point. Still, look at the out-of-order stuff going on; it's AMAZING that this terrible code can actually perform this well!

Pipeline:

[image: pipeline scan of the single-chunk transform4PointsSwizzled, hosted at www.ImageShack.us]

Source:
inline void transform4PointsSwizzled(SwizVec2** out, const SwizVec2** in, const SwizVec2* screenCenter)
{
	SwizVec2 t0;
	__asm {
		mov		eax,	in
		mov		eax,	[eax]
		mov		edx,	screenCenter

		// x's first
		movaps	xmm0,	[eax]
		subps	xmm0,	[edx]	// subtract screen center.  The same x,y values are replicated
								// into all 4 screen center coordinates.

		// start moving in y's also.
		movaps	xmm4,	[eax + 10h]
		movaps	[t0],	xmm0
		mulps	xmm0,	xmm0
		subps	xmm4,	[edx + 10h]
		mulps	xmm4,	xmm4

		// now we have x^2 in xmm0 and y^2 in xmm4, so add.
		addps	xmm0,	xmm4
		movaps	xmm4,	[eax + 10h]

		// start inverse 4th root.  (x = x^(-1/2))
		rsqrtps	xmm0,	xmm0

		// recomputing (p0 - screenCenter).y here, also filling pipeline stalls.
		subps	xmm4,	[edx + 10h]

		// AGAIN!   (x = x^(1/4))
		rsqrtps	xmm0,	xmm0

		// and reciprocal to negate the exponent: (x = x^(-1/4))
		rcpps	xmm0,	xmm0

		// stalls are building up, good time to reassign our addressing regs.
		mov		eax,	out
		mov		eax,	[eax]

		mulps	xmm4,	xmm0
		mulps	xmm0,	[t0]
		addps	xmm0,	[edx]
		addps	xmm4,	[edx + 10h]

		movaps	[eax],	xmm0
		movaps	[eax + 10h], xmm4
	};
}


[edit] removed some spurious comments cut'n'pasted in.

